Based on Encoder Models

Author

차상진

Published

March 30, 2025

1. Classification model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "klue/bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)

- Check that the classes are set up correctly.

model.config.id2label

{0: 'LABEL_0', 1: 'LABEL_1'}

1-2. Dataset

from datasets import load_dataset

dataset = load_dataset('klue','sts')
dataset['train']

Dataset({
    features: ['guid', 'source', 'sentence1', 'sentence2', 'labels'],
    num_rows: 11668
})

def process_data(batch):
    result = tokenizer(batch['sentence1'], text_pair=batch['sentence2'])
    # You need to know that the dataset consists of sentence1 and sentence2 to be able to write this code.
    # text_pair passes the second sentence so that the relationship between the two sentences can be analyzed.
    result['labels'] = [l['binary-label'] for l in batch['labels']]
    return result

dataset = dataset.map(
    process_data,
    batched = True,
    remove_columns = dataset['train'].column_names)

from transformers import DataCollatorWithPadding

collator = DataCollatorWithPadding(tokenizer)
batch = collator([dataset['train'][l] for l in range(10)])

1-3. Prediction

with torch.no_grad():
    logits = model(**batch).logits

logits

tensor([[-0.6647,  0.5699],
        [ 0.2459, -0.5885],
        [-0.3337,  0.2817],
        [ 0.1649,  0.0235],
        [-0.5286,  0.4380],
        [ 0.7408,  0.0513],
        [-0.3665,  0.6204],
        [ 0.6414, -0.6416],
        [ 0.1279, -0.2553],
        [-0.3801,  0.3907]])

pred_labels = logits.argmax(dim=1).cpu().numpy()
true_labels = batch['labels'].numpy()
print(pred_labels)
print(true_labels)

[1 0 1 0 1 0 1 0 0 1]
[1 0 0 0 1 0 1 0 0 1]

import evaluate

f1 = evaluate.load('f1')
f1.compute(predictions = pred_labels, references = true_labels, average='micro')

{'f1': 0.9}
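With average='micro', F1 is computed from true positives, false positives, and false negatives pooled over all classes, which for single-label predictions comes out equal to plain accuracy. A minimal pure-Python sketch (the helper name micro_f1 is made up for illustration and is not part of the evaluate library) reproduces the score from the two printed arrays:

```python
def micro_f1(predictions, references):
    """Micro-averaged F1: pool TP/FP/FN over every class, then compute F1 once."""
    classes = set(predictions) | set(references)
    tp = fp = fn = 0
    for c in classes:
        for p, r in zip(predictions, references):
            if p == c and r == c:
                tp += 1
            elif p == c and r != c:
                fp += 1
            elif p != c and r == c:
                fn += 1
    if tp == 0:  # nothing was predicted correctly
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# The two arrays printed above.
pred_labels = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
true_labels = [1, 0, 0, 0, 1, 0, 1, 0, 0, 1]

score = micro_f1(pred_labels, true_labels)
accuracy = sum(p == r for p, r in zip(pred_labels, true_labels)) / len(true_labels)
print(round(score, 4), accuracy)
```

Only one of the ten predictions is wrong, so both numbers come out at 0.9.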

2. Regression model

- A classification model returns logits, not probabilities.

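In general, a classification model does not apply softmax in its last layer. Instead it returns logits, that is, unscaled scores. "Scaling" here is a different concept from standardization in statistics: it means converting the scores into probabilities between 0 and 1 that sum to 1, which is what softmax does. Standardization in statistics transforms values so that they have mean 0 and variance 1. In scikit-learn terms, StandardScaler performs this standardization, while MinMaxScaler linearly rescales each feature to the [0, 1] range. MinMaxScaler also produces values between 0 and 1, but its outputs do not sum to 1, so it is not a softmax.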

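So why return only the logits instead of applying softmax?
- It is numerically more stable and makes the loss easier to compute.
- If the logits are very large (e.g. 1000, 2000), the exponential exp(x) inside softmax produces a huge number that can exceed the range the computer can represent.
- If probabilities computed by softmax, rather than logits, are fed into the loss function (e.g. Cross Entropy Loss), the values can become so small that underflow occurs.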
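The overflow problem can be seen with nothing but the standard library. The sketch below is an illustration only (the function names are made up): a naive softmax fails on large logits, while a cross entropy computed directly from logits with the log-sum-exp trick stays finite.

```python
import math

def naive_softmax(logits):
    """Textbook softmax: exponentiate, then normalize. Overflows for large logits."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    """Stable log-softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy_from_logits(logits, target):
    """Cross entropy for one sample, computed without forming probabilities."""
    return -log_softmax(logits)[target]

big_logits = [1000.0, 2000.0]

try:
    naive_softmax(big_logits)
    naive_overflowed = False
except OverflowError:          # math.exp(1000.0) is out of float range
    naive_overflowed = True

stable_loss = cross_entropy_from_logits(big_logits, target=1)
print(naive_overflowed, stable_loss)
```

This is the reason a loss that accepts logits directly is preferable to applying softmax first and taking the logarithm afterwards.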

- Setting num_labels = 1 in Sequence Classification lets the model be used for regression problems, which predict a continuous real value.

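The model outputs a logit of dimension 1, and this value can serve as the prediction for a regression problem.
With num_labels=1, the task is automatically treated as regression, and MSE is used as the loss instead of cross entropy.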
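As a rough sketch of the rule described above (a simplified paraphrase, not the library's actual source code): the loss is chosen from num_labels, and with num_labels = 1 the single logit per sample is compared with a real-valued target using mean squared error. The function names and the target values below are assumptions made for illustration.

```python
def pick_loss_name(num_labels):
    """Simplified paraphrase of the rule in the text: one label -> regression (MSE)."""
    return "mse" if num_labels == 1 else "cross_entropy"

def mse_loss(predictions, targets):
    """Mean squared error between predicted and target real values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Logits of shape (batch, 1), as a regression head returns them.
logits = [[-0.0012], [-0.2672], [0.1028]]
# Hypothetical real-valued targets (for example, similarity scores).
targets = [0.0, 0.5, 1.0]

predictions = [row[0] for row in logits]  # drop the trailing dimension of size 1
loss = mse_loss(predictions, targets)
print(pick_loss_name(1), round(loss, 4))
```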

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = BertForSequenceClassification.from_pretrained("klue/bert-base", num_labels=1)
print(model)

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=1, bias=True)
)

2-2. Prediction

with torch.no_grad():
    logits = model(**batch).logits

logits

tensor([[-0.0012],
        [-0.2672],
        [ 0.1028],
        [-0.3430],
        [-0.0204],
        [-0.2547],
        [ 0.0901],
        [-0.6130],
        [-0.4369],
        [ 0.0066]])

3. Multiple Choice model

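- A multiple-choice problem: given several inputs, pick the correct sentence among them.

e.g.) A typical multiple-choice question

Q: What is Newton's first law of motion?

(A) Force is the product of mass and acceleration.
(B) Every object stays at rest or in uniform motion unless an external force acts on it.
(C) For every action there is an equal and opposite reaction.
(D) Energy is neither created nor destroyed, only transformed.

Answer: (B)

Transformer models are well suited to Multiple Choice.

The computational cost of a transformer grows quadratically with sentence length (because of self-attention).

In other words, processing several short sentences is cheaper for a transformer than processing one long sentence.

Multiple Choice consists of several short answers, so it is cheap to compute.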
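The quadratic argument can be checked with simple arithmetic. The sketch below counts only the entries of the self-attention score matrix (n × n per sequence, per head and layer) and ignores every other cost; the token counts are made-up numbers.

```python
def attention_score_entries(sequence_lengths):
    """Count entries of the self-attention score matrices: n * n per sequence."""
    return sum(n * n for n in sequence_lengths)

one_long = attention_score_entries([500])        # a single 500-token sequence
five_short = attention_score_entries([100] * 5)  # five 100-token sequences

print(one_long, five_short, one_long / five_short)
```

The same 500 tokens cost five times fewer score entries when they are split into five sequences of 100 tokens.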

import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

model_name = 'klue/bert-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)
model

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForMultipleChoice: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultipleChoice were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

BertForMultipleChoice(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=1, bias=True)
)

- This is a classification that picks one of several options, yet num_labels was not set. Why?

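As in the other tasks, a single classifier layer is added at the end of the model. In Multiple Choice, however, each sample (question) comes with several candidates, each fed in as its own sentence, so after the embedding step a sample has three dimensions: number of candidates, sentence length, and embedding size.
Adding the batch dimension on top makes the data four-dimensional, which the encoder cannot consume directly.
So the data is flattened into three dimensions for inference, and the resulting scores are then regrouped per question as (batch_size, num_choices).
Each flattened sentence only needs a single score (a logit, not yet a probability), so num_labels is automatically fixed to 1.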

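Questions you might ask

Question 1. Doesn't flatten() make everything one-dimensional? How do we end up with three dimensions?
- In general flatten() does produce a one-dimensional result, but here the dimensions are merged along specific axes only.
- (batch_size, num_choices, seq_length, hidden_dim) -> (batch_size * num_choices, seq_length, hidden_dim)

Question 2. num_labels is 1 because each sentence only needs a single value?
- In an ordinary classification task with, say, num_labels = 3, softmax is applied to select one of the classes (a probability is output per class).
- Multiple Choice only needs a score for how likely each option is to be the answer, and the option with the largest score is selected. That is, one scalar per sentence is enough, so num_labels = 1. (The output shape is the same as in a regression problem.)

Sequence classification vs. multiple choice

Sequence classification: N probabilities are output per sentence (N = number of classes)

Multiple choice: N sentences are taken as input and one score is produced per sentence, N scores in total (N = number of answer options)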
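The flatten, score, and regroup steps can be sketched with plain nested lists. Everything here is an assumption for illustration: toy_scorer is a hypothetical stand-in for the encoder plus the one-output classifier, and the token ids are made up.

```python
def flatten_choices(batch):
    """(batch_size, num_choices, seq_length) -> (batch_size * num_choices, seq_length)."""
    return [choice for sample in batch for choice in sample]

def regroup(scores, num_choices):
    """(batch_size * num_choices,) -> (batch_size, num_choices)."""
    return [scores[i:i + num_choices] for i in range(0, len(scores), num_choices)]

def toy_scorer(token_ids):
    """Hypothetical stand-in for encoder + classifier: one scalar per sequence."""
    return sum(token_ids) / len(token_ids)

# 2 questions, 3 options each, 4 token ids per option (made-up numbers).
batch = [
    [[1, 2, 3, 4], [9, 9, 9, 9], [0, 0, 1, 1]],
    [[5, 5, 5, 5], [1, 1, 1, 1], [7, 8, 7, 8]],
]

flat = flatten_choices(batch)                # 6 sequences
scores = [toy_scorer(seq) for seq in flat]   # 6 scalars, one per sequence (num_labels = 1)
logits = regroup(scores, num_choices=3)      # 2 rows of 3 scores
predicted = [row.index(max(row)) for row in logits]
print(len(flat), logits, predicted)
```

Each row of logits holds one score per option, and the argmax over a row is the chosen answer for that question.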

3-2. Dataset

CSAT (Korean college entrance exam) Korean language questions

from datasets import load_dataset

dataset = load_dataset("HAERAE-HUB/csatqa", "full")
print(dataset["test"][0])

ending_names = ["option#1", "option#2", "option#3", "option#4", "option#5"]

def preprocess_function(examples): # examples receives a batch of the dataset.
  first_sentences = [
      [context] * 5 for context in examples["context"] # Each question has 5 options, and every option must be paired with the same context.
  ]
  question_headers = examples["question"]
  second_sentences = [
      [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers) # Joins each question with each of its options
      # 1. enumerate(question_headers) → yields (index, question_text) pairs from question_headers.
      # 2. for end in ending_names → loops over "option#1" to "option#5" and fetches each option.
      # 3. f"{header} {examples[end][i]}" → builds a new sentence that joins the question (header) with that option.
  ]
  # Flatten to one dimension for tokenization
  first_sentences = sum(first_sentences, []) # Same effect as flatten(). flatten() is a NumPy array method, so for lists we use sum(list, [])
  second_sentences = sum(second_sentences, [])

  # Handle None values
  first_sentences = [i if i else "" for i in first_sentences] # Replaces None with an empty string so that every pair can be tokenized
  second_sentences = [i if i else "" for i in second_sentences]

  tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
  # In Multiple Choice the question has to be combined with each answer. Passing the two sentence lists together makes the tokenizer join each pair.

  # Regroup into two dimensions after tokenization
  # k is the field name (e.g. input_ids) and v is the flat list over all options; every 5 consecutive entries belong to one question.
  result = {
      k: [v[i:i+5] for i in range(0, len(v), 5)] for k, v in tokenized_examples.items()
  }
  result["labels"] = [i-1 for i in examples["gold"]]  # gold is 1-based, so shift it to a 0-based label

  return result

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset["test"].column_names)

{'question': ' 이 이야기에서 얻을 수 있는 교훈으로 가장 적절한 것은?', 'context': '이제 한 편의 이야기를 들려 드립니다. 잘 듣고 물음에 답하십시오.\n자, 여러분! 안녕하십니까? 오늘은 제가 어제 꾼 꿈 이야기 하날 들려 드리겠습니다. 전 꿈속에서 낯선 거리를 걷고 있었습니다. 그러다가 홍미로운 간판을 발견했답니다. 행 복을 파는 가게. 그렇게 쓰여 있었습니다. 전 호기심으로 문을 열고 들어갔답니다. 그곳 에서는 한 노인이 물건을 팔고 있었습니다. 전 잠시 머뭇거리다가 노인에게 다가가서 물 었습니다. 여기서는 무슨 물건을 파느냐고요. 노인은 미소를 지으며, 원하는 것은 뭐든 다 살 수 있다고 말했습니다. 저는 제 귀를 의심했습니다. \'무엇이든 다?\' 전 무엇을 사야 할까 생각하다가 말했답니다. "사랑, 부귀 그리고 지혜하고 건강도 사고 싶습니다. 저 자신뿐 아니라 우리 가족 모두 를 위해서요. 지금 바로 살 수 있나요?" 그러자 노인은 빙긋이 웃으며 대답했습니다. "젊은이, 한번 잘 보게나. 여기에서 팔고 있는 것은 무르익은 과일이 아니라 씨앗이라 네. 앞으로 좋은 열매를 맺으려면 이 씨앗들을 잘 가꾸어야 할 걸세."', 'option#1': '새로운 세계에 대한 열망을 가져야 한다.', 'option#2': '주어진 기회를 능동적으로 활용해야 한다.', 'option#3': '큰 것을 얻으려면 작은 것은 버려야 한다.', 'option#4': '물질적 가치보다 정신적 가치를 중시해야 한다.', 'option#5': '소망하는 바를 성취하기 위해서는 노력을 해야 한다.', 'gold': 5, 'category': 'N/A', 'human_performance': 0.0}
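To see what the pairing and regrouping in preprocess_function produce, here is a toy run of the same list logic without a tokenizer. The batch contents are hypothetical and only three options are used to keep the output short.

```python
ending_names = ["option#1", "option#2", "option#3"]  # 3 options to keep the toy small

# Hypothetical batch in the column-oriented form that dataset.map(batched=True) passes in.
examples = {
    "context": ["ctx A", "ctx B"],
    "question": ["Q1?", "Q2?"],
    "option#1": ["a1", "b1"],
    "option#2": ["a2", "b2"],
    "option#3": ["a3", "b3"],
    "gold": [2, 3],
}

n = len(ending_names)
first = [[context] * n for context in examples["context"]]
second = [
    [f"{header} {examples[end][i]}" for end in ending_names]
    for i, header in enumerate(examples["question"])
]
first = sum(first, [])    # flatten to 6 contexts
second = sum(second, [])  # flatten to 6 "question option" strings

pairs = list(zip(first, second))                             # what the tokenizer would receive
grouped = [pairs[i:i + n] for i in range(0, len(pairs), n)]  # regroup per question
labels = [g - 1 for g in examples["gold"]]                   # 1-based gold -> 0-based label
print(grouped[1], labels)
```

Every question ends up as a group of (context, question + option) pairs, and the label is the 0-based index of the correct pair within the group.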

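In a multiple-choice task it is hard to use the commonly used DataCollatorWithPadding.

Instead, we have to write our own collator that handles padding and the other required steps.

To understand that collator, you first need to know the syntax below. I will explain it briefly, so get familiar with it before moving on.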

Aside 1. Python syntax

__init__()

In object-oriented programming, __init__() is the method that is called automatically when an object (instance) of a class is created.

class Example:
    def __init__(self, a, b):
        self.a = a
        self.b = b

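For example, with the class above, the __init__() method runs automatically whenever an instance of the class is created.

self.a = a -> stores the value a inside the object

self.b = b -> stores the value b inside the object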

obj = Example(3,5)
print(obj.a)
print(obj.b)

3
5

In short, __init__() initializes the variables an instance needs when it is created.

Decorators + dataclass

A decorator is a function that wraps (modifies) a function or a class. It is applied with the @ prefix. @dataclass automatically generates __init__() for a class.

from dataclasses import dataclass

@dataclass
class Example:
    a: int
    b: int

In the code above, __init__() was generated automatically even though we never wrote it; internally it is equivalent to the code below.

def __init__(self, a:int, b:int):
    self.a = a
    self.b = b

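In addition, writing a: int and b: int signals that a and b are expected to be of type int.

However, this does not mean that a must be an integer, so passing a float does not raise an error.

In other words, it is a recommendation.

Besides @dataclass, there are other libraries, such as attrs, that offer many more features. Pick whichever fits your needs.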
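A small sketch of both points above, using only the standard library: the annotation is not checked when an instance is created, and @dataclass generates more than __init__() (for example __repr__ and __eq__). The default value on b is added here only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Example:
    a: int
    b: int = 0  # a default value makes the argument optional

obj = Example(1.5)      # a float is accepted: the annotation is not checked
same = Example(1.5, 0)

print(obj)              # uses the generated __repr__
print(obj == same)      # the generated __eq__ compares field by field
```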

Aside 2. `Union`, `Optional`?

Union

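Union and Optional are concepts used in type hints. In Python's type system they are used to state more precisely which values a variable or function can take.

Union
- Union means "this variable can be one of several types". For example, Union[int, float] means that the variable or value may be either an int or a float.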

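from typing import Union

def foo(x: Union[int, float]) -> None:
    print(x)

# Note: these calls pass the class objects themselves (int, float, bool), not numeric values.
# Type hints are a recommendation, not a restriction, so every call runs without error.
# The outer print() shows the return value of foo, which is None.
print(foo(int))
print(foo(float))
print(foo(bool))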

<class 'int'>
None
<class 'float'>
None
<class 'bool'>
None

Optional

Optional
- Optional[X] = Union[X, None]. That is, the value may be X or it may be None.
- Using Optional makes it explicit that a value can be None.

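from typing import Optional

def foo(x: Optional[int]) -> None:
    print(x)

# Note: these calls pass the class objects themselves (int, float, bool), not an int or None.
# Optional[int] is a recommendation, not a restriction, so every call runs without error.
# The outer print() shows the return value of foo, which is None.
print(foo(int))
print(foo(float))
print(foo(bool))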

<class 'int'>
None
<class 'float'>
None
<class 'bool'>
None
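For comparison, here is a sketch that passes actual values instead of class objects and adds an explicit runtime check, since the hints alone enforce nothing. The function names are hypothetical.

```python
from typing import Optional, Union

def halve(x: Union[int, float]) -> float:
    """Accepts an int or a float, as the hint documents."""
    return x / 2

def describe(limit: Optional[int] = None) -> str:
    """Optional[int] means the argument may be an int or None."""
    if limit is None:
        return "no limit"
    return f"limit={limit}"

def checked_halve(x: Union[int, float]) -> float:
    """Hints are not enforced at runtime, so validate explicitly when it matters."""
    if isinstance(x, bool) or not isinstance(x, (int, float)):
        raise TypeError(f"expected int or float, got {type(x).__name__}")
    return x / 2

print(halve(3), halve(2.5), describe(), describe(10))
```

A static type checker such as mypy would flag a call like halve("3") before the program runs, whereas checked_halve rejects it at runtime.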

3-3. Collator

A collator is the object that builds a batch, and the batch is its result.

When the batch is passed in as model(**batch), the data format produced by the collator is carried over as is.

from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch


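@dataclass # decorator
# @dataclass -> a data class whose __init__ is generated automatically
# A collator is used by the data loader when it assembles a batch. An ordinary collator handles padding and tensor conversion.
# In multiple choice, however, input_ids is two-dimensional per sample, so an ordinary collator cannot be used.
class DataCollatorForMultipleChoice:
  tokenizer: PreTrainedTokenizerBase # The tokenizer; its pad() method is used below to pad the inputs.
  padding: Union[bool, str, PaddingStrategy] = True
  # Defines how padding is added so that the inputs in a batch share one length. The default True pads to the longest sequence in the batch.
  max_length: Optional[int] = None # Target length handed to tokenizer.pad(). Truncation was already done during preprocessing (truncation=True).
  pad_to_multiple_of: Optional[int] = None # If set, the padded length becomes a multiple of this value.
  # These fields are the values that are set when the class is instantiated.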
  def __call__(self, features): # Lets an instance of the class be called like a function
    label_name = "label" if "label" in features[0].keys() else "labels" # Use whichever of label / labels is present as the label key.
    labels = [feature.pop(label_name) for feature in features] # pop() removes the label from each sample and collects it separately

    batch_size = len(features) # Number of samples processed at once (batch_size)
    num_choices = len(features[0]["input_ids"]) # Number of options in each sample.

    # Flatten the options of each multiple-choice sample
    # The nested list comprehension splits every sample into one dict per option.
    # sum(..., []) then merges the per-sample lists into one flat list.
    flattened_features = [
        [
            {k: v[i] for k, v in feature.items()}
            for i in range(num_choices)
        ]
        for feature in features
    ]
    flattened_features = sum(flattened_features, []) # Merge the nested lists into one.

    # Pad, then convert back to the two-level structure.
    # self.tokenizer.pad(...) pads flattened_features (the inputs are already tokenized); return_tensors="pt" returns PyTorch tensors
    batch = self.tokenizer.pad(
        flattened_features,
        padding=self.padding,
        max_length=self.max_length,
        pad_to_multiple_of=self.pad_to_multiple_of,
        return_tensors="pt",
    ) # Every option is padded, so all inputs end up with the same length.

    # Restore the (batch size, number of options) layout.
    batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()} # Shape (batch_size (questions), num_choices (options), -1); -1 infers the remaining dimension.
    batch["labels"] = torch.tensor(labels, dtype=torch.int64) # Add the labels so that the index of the correct option is known.
    return batch

collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
batch = collator([tokenized_dataset["test"][i] for i in range(5)])
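The two ideas the collator relies on, a dataclass that is callable through __call__ and padding that keeps the (sample, option) grouping, can be tried out without a tokenizer. The sketch below is a toy under stated assumptions: ToyMultipleChoiceCollator, pad_id, and the token ids are made up, and plain lists stand in for tensors.

```python
from dataclasses import dataclass

@dataclass
class ToyMultipleChoiceCollator:
    pad_id: int = 0  # hypothetical padding token id

    def __call__(self, features):
        """Pad every option to the longest one, keeping the (sample, option) grouping."""
        num_choices = len(features[0]["input_ids"])
        flat = [ids for feature in features for ids in feature["input_ids"]]
        longest = max(len(ids) for ids in flat)
        padded = [ids + [self.pad_id] * (longest - len(ids)) for ids in flat]
        grouped = [padded[i:i + num_choices] for i in range(0, len(padded), num_choices)]
        return {"input_ids": grouped, "labels": [f["labels"] for f in features]}

# Two samples with two options each; the options have different lengths.
toy_features = [
    {"input_ids": [[2, 7, 3], [2, 7, 8, 9, 3]], "labels": 1},
    {"input_ids": [[2, 5, 3], [2, 6, 6, 3]], "labels": 0},
]

toy_collator = ToyMultipleChoiceCollator()
toy_batch = toy_collator(toy_features)  # calling the instance runs __call__
print(toy_batch["input_ids"][0], toy_batch["labels"])
```

After padding, every option has the same length, which is what allows the real collator to reshape the tensor to (batch_size, num_choices, seq_length).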

with torch.no_grad():
  logits = model(**batch).logits

logits

tensor([[0.0498, 0.0333, 0.2016, 0.0107, 0.1579],
        [0.0852, 0.0705, 0.0632, 0.0507, 0.0745],
        [0.1740, 0.1215, 0.2006, 0.2101, 0.2531],
        [0.1829, 0.2058, 0.1865, 0.1838, 0.3799],
        [0.2215, 0.2357, 0.2723, 0.2856, 0.3356]])

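If a model contains random operations such as Dropout and they are active (training mode, model.train()), feeding the same input to the same model yields different logits. A model loaded with from_pretrained is in evaluation mode, where Dropout is disabled. The logits above still change from run to run, because the classifier layer is newly initialized with random weights every time the model is loaded.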

3-4. evaluate

import evaluate

pred_labels = logits.argmax(dim=1).cpu().numpy()
true_labels = batch["labels"].numpy()
print(pred_labels)
print(true_labels)

f1 = evaluate.load("f1")
f1.compute(predictions=pred_labels, references=true_labels, average="micro")

[2 0 4 4 4]
[4 4 0 3 1]

{'f1': 0.0}

4. Token Classification

- As the name suggests, classification is performed per token. It is most often used for named entity recognition, which extracts meaningful entities from a sentence.

4-1. model

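- The model class is (ModelName)ForTokenClassification, which inherits from the base class (ModelName)PreTrainedModel, e.g. BertForTokenClassification and BertPreTrainedModel.

However, it does not apply pooling, which would reduce the sequence to a single sentence vector. Instead, an output head is attached to every input token, and each token is classified independently.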

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "klue/bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForTokenClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)

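- In (classifier): Linear(in_features=768, out_features=2, bias=True), out_features=2 because there are 2 classes (num_labels was not specified, so the default of 2 is used). If the tokens need to be divided into finer-grained classes, out_features has to be increased accordingly.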

4-2. Dataset

from datasets import load_dataset

dataset = load_dataset("klue", "ner")

sample = dataset["train"][0]
print("tokens : ", sample["tokens"][: 20])
print("ner tags : ", sample["ner_tags"][: 20])
print((len(sample["tokens"]), len(sample["ner_tags"])))

tokens :  ['특', '히', ' ', '영', '동', '고', '속', '도', '로', ' ', '강', '릉', ' ', '방', '향', ' ', '문', '막', '휴', '게']
ner tags :  [12, 12, 12, 2, 3, 3, 3, 3, 3, 12, 2, 3, 12, 12, 12, 12, 2, 3, 3, 3]
(66, 66)

for l in range(len(sample['ner_tags'])):
    print(sample['tokens'][l], '\t', sample['ner_tags'][l])

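- The tokens column, which is already split into characters, can be considered "tokenized". So when encoding the sentences, the code should not run the usual pipeline of tokenization followed by integer encoding on raw text; it should only map the given tokens to integer ids.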

# tokenization x , integer encoding o
def tokenize_and_align_labels(examples): # examples: the batch passed in by dataset.map()
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True) # With is_split_into_words=True the tokenizer treats the input as already split into words.
    # examples['tokens'] -> [['Hello','world'],['My','name','is','John']]
    # examples['ner_tags'] -> [[0,0],[0,1,0,2]] (label of each word)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Maps each token to its word; word_ids is short for word indices.
        previous_word_idx = None # Remembers the previous word index to detect the first token of a word
        label_ids = []
        for word_idx in word_ids:  # Assign a label to every token
            if word_idx is None: # None means the token is a special token such as [CLS], [SEP], or [PAD].
                label_ids.append(12) # 12 is the "O" (no entity) label, so special tokens still count toward the loss here.
                # label_ids.append(-100)
                # Alternative: -100 is the index the loss ignores, so it removes special tokens from the loss computation.
                # Special tokens are not real words and carry no sentence meaning, which is why they are often excluded this way.
            elif word_idx != previous_word_idx:  # Label only the first token of each word
                label_ids.append(label[word_idx])
            else: # If playing is split into play and ##ing, play gets the label in the elif branch and ##ing gets -100.
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

        # Why the if-elif-else structure is needed
        # word_idx is None (special tokens)
        #   [CLS], [SEP], [PAD]: labeled "O" (12) here, or -100 to exclude them from the loss
        # word_idx != previous_word_idx (first token of a word)
        #   only the first token of a word receives the word's label
        # else (remaining tokens of a word)
        #   the remaining tokens are excluded from the loss (-100)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True, remove_columns=dataset["train"].column_names)
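The alignment rule can be isolated from the tokenizer and tried on a hand-written word_ids list. This is a sketch under assumptions: align_labels is a hypothetical helper, the word_ids are made up, and special tokens default to -100 here, whereas the function above assigns 12 ("O").

```python
def align_labels(word_ids, word_labels, special_label=-100, ignore_index=-100):
    """Give each word's label to its first token; mark the other tokens to be ignored."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:            # special token
            aligned.append(special_label)
        elif word_idx != previous:      # first token of a word
            aligned.append(word_labels[word_idx])
        else:                           # later sub-token of the same word
            aligned.append(ignore_index)
        previous = word_idx
    return aligned

# Hypothetical word_ids for [CLS] w0 w1 w1 w2 [SEP]: word 1 was split into two tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [12, 2, 3]

print(align_labels(word_ids, word_labels))                    # special tokens ignored
print(align_labels(word_ids, word_labels, special_label=12))  # special tokens labeled "O"
```

The second call mirrors the choice made in tokenize_and_align_labels, where special tokens are labeled 12 and therefore contribute to the loss.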

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
batch = data_collator([tokenized_dataset["train"][i] for i in range(10)])

id2label = {
    0: "B-DT",
    1: "I-DT",
    2: "B-LC",
    3: "I-LC",
    4: "B-OG",
    5: "I-OG",
    6: "B-PS",
    7: "I-PS",
    8: "B-QT",
    9: "I-QT",
    10: "B-TI",
    11: "I-TI",
    12: "O",
}
label2id = {v:k for k,v in id2label.items()} # swap keys and values

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    'klue/bert-base',
    num_labels = 13,
    id2label = id2label,
    label2id = label2id
)

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForTokenClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

with torch.no_grad():
  logits = model(**batch).logits

predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
predicted_token_class

['I-OG',
 'O',
 'O',
 'O',
 'O',
 'B-PS',
 'O',
 'B-TI',
 'B-TI',
 'O',
 'B-OG',
 'B-OG',
 'I-DT',
 'B-OG',
 'B-TI',
 'O',
 'I-DT',
 'O',
 'I-OG',
 'O',
 'B-OG',
 'I-QT',
 'I-OG',
 'B-OG',
 'I-QT',
 'O',
 'B-TI',
 'I-QT',
 'O',
 'I-DT',
 'O',
 'B-TI',
 'B-TI',
 'B-OG',
 'B-TI',
 'B-TI',
 'I-QT',
 'I-DT',
 'B-OG',
 'I-DT',
 'B-OG',
 'B-QT',
 'B-DT',
 'B-TI',
 'B-TI',
 'O',
 'B-OG',
 'I-OG',
 'O',
 'B-DT',
 'I-TI',
 'O',
 'B-TI',
 'O',
 'O',
 'B-PS',
 'B-OG',
 'O',
 'O',
 'O',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'O',
 'B-OG',
 'B-OG',
 'O',
 'O',
 'B-OG',
 'B-OG',
 'O',
 'I-QT',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'O',
 'B-OG',
 'B-OG',
 'B-OG',
 'O',
 'B-OG',
 'O',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-OG',
 'B-OG',
 'B-OG',
 'B-OG',
 'O',
 'O',
 'B-OG',
 'B-OG',
 'B-OG']

4-3. evaluate

import evaluate

pred_labels = logits.argmax(dim=2).flatten().cpu().numpy() # convert the result of logits.argmax(dim=2) to a 1D vector
true_labels = batch["labels"].flatten().numpy() # convert the batch labels to a 1D vector

# For evaluate, the data has to be converted to 1D arrays.
f1 = evaluate.load("f1")
f1.compute(predictions=pred_labels, references=true_labels, average="micro")

{'f1': 0.06923076923076923}
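One thing to keep in mind when reading this score: the flattened labels still contain -100 (sub-tokens after the first, plus the label padding added by the collator), and a prediction can never equal -100, so those positions always count as errors. A minimal sketch of masking them out first is shown below. The helper name and the numbers are made up, and accuracy is used because micro-averaged F1 equals accuracy for single-label predictions.

```python
IGNORE_INDEX = -100

def masked_accuracy(predictions, references, ignore_index=IGNORE_INDEX):
    """Accuracy over positions whose reference label is not ignore_index."""
    kept = [(p, r) for p, r in zip(predictions, references) if r != ignore_index]
    if not kept:
        return 0.0
    return sum(p == r for p, r in kept) / len(kept)

# Made-up flattened predictions and labels; -100 marks padding or ignored tokens.
pred_labels = [12, 2, 3, 5, 12, 0]
true_labels = [12, 2, 3, -100, 4, -100]

unmasked = sum(p == r for p, r in zip(pred_labels, true_labels)) / len(true_labels)
masked = masked_accuracy(pred_labels, true_labels)
print(unmasked, masked)
```

In this toy case the score rises from 0.5 to 0.75 once the ignored positions are dropped.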

5. Question Answering

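Extractive: the answer is extracted from the given context.
- Extractive question answering literally extracts the answer to the question from the input context.

Generative: the answer is generated from the context so that it answers the question correctly.
- The answer to the question is newly written with reference to the input context.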
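For the extractive case, a question-answering head returns a start logit and an end logit for every token, and the answer is the span whose start and end scores are highest together. A simple sketch of that span selection follows; the function name, tokens, logits, and the max_answer_len limit are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = None
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Made-up context tokens and per-token logits.
tokens = ["the", "capital", "of", "korea", "is", "seoul", "."]
start_logits = [0.1, 0.2, 0.0, 0.3, 0.1, 2.5, 0.0]
end_logits = [0.0, 0.1, 0.0, 0.4, 0.2, 2.0, 0.3]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)
```

Requiring start <= end and capping the span length rules out invalid or implausibly long answers.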

5-1. model

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "klue/bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at klue/bert-base and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (qa_outputs): Linear(in_features=768, out_features=2, bias=True)
)

5-2. Dataset

from datasets import load_dataset

dataset = load_dataset("klue", "mrc")
sample = dataset["train"][0]

print(f"Context : {sample['context'][:50]}") # context: the background passage the model extracts the answer from
print(f"Question : {sample['question']}") # question: the question the model must answer
print(f"Answers : {sample['answers']}") # answers: the answer text(s) and their starting character position(s) in the context

Context : 올여름 장마가 17일 제주도에서 시작됐다. 서울 등 중부지방은 예년보다 사나흘 정도 늦은 
Question : 북태평양 기단과 오호츠크해 기단이 만나 국내에 머무르는 기간은?
Answers : {'answer_start': [478, 478], 'text': ['한 달가량', '한 달']}
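
The answers field pairs each answer text with the character offset at which it starts in the context, so an extractive answer can always be recovered by slicing the context. A minimal sketch on a made-up English sample follows; the sample and the helper name answer_span are illustrative and not taken from KLUE.

```python
def answer_span(context, answers, which=0):
    """Return (start_char, end_char, text) for one annotated answer."""
    start_char = answers["answer_start"][which]
    end_char = start_char + len(answers["text"][which])
    return start_char, end_char, context[start_char:end_char]

# Made-up sample with the same field layout as the dataset row shown above.
context = "The rainy season began on Jeju Island and lasts about a month."
texts = ["about a month", "a month"]
answers = {"answer_start": [context.index(t) for t in texts], "text": texts}

spans = [answer_span(context, answers, i) for i in range(len(texts))]
for start_char, end_char, text in spans:
    print(start_char, end_char, repr(text))
```

Slicing with answer_start and the answer length is exactly how preprocess_function in the next section derives start_char and end_char.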

5-3. Data preprocessing

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second", 
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)

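Additional notes

truncation="only_second"

Setting truncation to only_second truncates only the second sequence when the combined input exceeds max_length.
In a QA task the question and the context are usually fed to the model together. The context is typically long and the question short, so when the pair exceeds max_length it is the context that gets cut.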
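
To make the effect of only_second concrete, here is a toy sketch that works on plain token lists rather than the real tokenizer. It assumes a BERT-style pair layout of [CLS] question [SEP] context [SEP] with three special tokens; truncate_only_second is a hypothetical helper, not a transformers function.

```python
def truncate_only_second(question_tokens, context_tokens, max_length, num_special=3):
    """Keep the question whole and cut the context so the pair fits in max_length."""
    budget = max_length - num_special - len(question_tokens)
    if budget < 0:
        raise ValueError("question alone does not fit in max_length")
    kept_context = context_tokens[:budget]
    return ["[CLS]"] + question_tokens + ["[SEP]"] + kept_context + ["[SEP]"]

question = ["how", "long", "?"]
context = [f"c{i}" for i in range(20)]  # 20 placeholder context tokens

pair = truncate_only_second(question, context, max_length=10)
print(pair)
```

In this sketch the context tokens beyond the budget are simply dropped. The real tokenizer can instead return them as additional overlapping windows when return_overflowing_tokens=True and stride are set.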

return_offsets_mapping=True

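This option makes the tokenizer return, for each encoded token, the character span it covers in the original text.
In extractive QA the answer is taken from the context, so the start and end of the answer sit at specific character positions within the context. Tokenization splits words into subword pieces, however, which makes it hard to locate those positions in terms of tokens.
With return_offsets_mapping=True, each token is mapped to its (start, end) character range in the original text, so the character-level answer span can be converted into the start and end token positions used as labels.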
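
The mapping from a character span to token positions can be sketched without a real tokenizer. The example below uses a whitespace tokenizer as a stand-in (an assumption; BERT uses WordPiece) and mirrors the two scanning loops in preprocess_function. The names whitespace_offsets and char_span_to_token_span are illustrative.

```python
import re

def whitespace_offsets(text):
    """Return [(start_char, end_char), ...] for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token), or (0, 0) if it lies outside."""
    if offsets[0][0] > end_char or offsets[-1][1] < start_char:
        return 0, 0
    # Scan forward past every token that starts at or before the answer start.
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # Scan backward past every token that ends at or after the answer end.
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

context = "The rainy season lasts about a month in total"
answer = "about a month"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
start_token, end_token = char_span_to_token_span(offsets, start_char, end_char)
tokens = [context[s:e] for s, e in offsets]
print(tokens[start_token:end_token + 1])
```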

5-4. Collator

The input_ids, token_type_ids, and attention_mask columns form the model input, and start_positions and end_positions, which point to the token indices where the answer starts and ends, are the targets (labels).

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
batch = data_collator([tokenized_dataset["train"][i] for i in range(10)])
batch

{'input_ids': tensor([[    2,  1174, 18956,  ...,  2170,  2259,     3],
        [    2,  3920, 31221,  ...,  8055,  2867,     3],
        [    2,  8813,  2444,  ...,  3691,  4538,     3],
        ...,
        [    2,  6860, 19364,  ...,  2532,  6370,     3],
        [    2, 27463, 23413,  ..., 21786,  2069,     3],
        [    2,  3659,  2170,  ...,  2470,  3703,     3]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]), 'start_positions': tensor([260,  31,   0,  80,  72,  81, 216, 348, 323, 348]), 'end_positions': tensor([263,  33,   0,  81,  78,  87, 221, 352, 328, 353])}

5-5. prediction

with torch.no_grad():
    outputs = model(**batch)

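# Note: argmax() without a dim argument returns an index into the flattened (batch, seq_len) logits,
# so these two indices are taken over the whole batch rather than over example 0 alone.
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()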

predict_answer_tokens = batch["input_ids"][0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'. 서울 등 중부지방은 예년보다 사나흘 정도 늦은 이달 말께 장마가 시작될 전망이다. 17일 기상청에 따르면 제주도 남쪽 먼바다에 있는 장마전선의 영향으로 이날 제주도 산간 및 내륙지역에 호우주의보가 내려지면서 곳곳에 100㎜에 육박하는 많은 비가 내렸다. 제주의 장마는 평년보다 2 ~ 3일, 지난해보다는 하루 일찍 시작됐다. 장마는 고온다습한 북태평양 기단과 한랭 습윤한 오호츠크해 기단이 만나 형성되는 장마전선에서 내리는 비를 뜻한다. 장마전선은 18일 제주도 먼 남쪽 해상으로 내려갔다가 20일께 다시 북상해 전남 남해안까지 영향을 줄 것으로 보인다. 이에 따라 20 ~ 21일 남부지방에도 예년보다 사흘 정도 장마가 일찍 찾아올 전망이다. 그러나 장마전선을 밀어올리는 북태평양 고기압 세력이 약해 서울 등 중부지방은 평년보다 사나흘가량 늦은 이달 말부터 장마가 시작될 것이라는 게 기상청의 설명이다. 장마전선은 이후 한 달가량 한반도 중남부를 오르내리며 곳곳에 비를 뿌릴 전망이다. 최근 30년간 평균치에 따르면 중부지방의 장마 시작일은 6월24 ~ 25일이었으며 장마기간은 32일, 강수일수는 17. 2일이었다. 기상청은 올해 장마기간의 평균 강수량이 350 ~ 400㎜로 평년과 비슷하거나 적을 것으로 내다봤다. 브라질 월드컵 한국과 러시아의 경기가 열리는 18일 오전 서울은 대체로 구름'
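
Two things are worth noting about the output above. The qa_outputs head is newly initialized (see the loading warning in 5-1), so the predicted span is essentially random until the model is fine-tuned. In addition, taking the start and end argmax independently does not guarantee that the start comes before the end or that the span is reasonably short. A common remedy is to score every valid (start, end) pair for one example and keep the best. The sketch below does this on plain lists of logits for a single example; best_span and max_answer_len are illustrative names, and with the real model the lists would come from something like outputs.start_logits[0].tolist().

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Made-up logits for one example of six tokens.
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1, 0.0]
end_logits = [2.5, 0.1, 0.2, 0.4, 2.0, 0.3]

# Independent argmax can put the end before the start.
naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
start, end, score = best_span(start_logits, end_logits)
print(naive, (start, end))
```

Here the independent argmax picks an end position before the start position, while the constrained search returns a valid span.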

5-6. evaluate

# evaluate.load('squad')

Evaluation is possible with the code above, but it requires a substantial amount of post-processing and takes a long time, so it is omitted here.
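
For reference, the two scores reported by the SQuAD metric are exact match and token-overlap F1, each taken as the maximum over the reference answers. The sketch below is a simplified version that only strips and splits on whitespace; the official metric also normalizes text (for example lowercasing and removing punctuation and English articles), which is omitted here. The names exact_match and token_f1 are illustrative, and the reference answers are the ones from the sample in 5-2.

```python
from collections import Counter

def exact_match(prediction, references):
    """1.0 if the prediction equals any reference after stripping, else 0.0."""
    return float(any(prediction.strip() == ref.strip() for ref in references))

def token_f1(prediction, references):
    """Best whitespace-token F1 between the prediction and any reference."""
    pred_tokens = prediction.split()
    best = 0.0
    for ref in references:
        ref_tokens = ref.split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

references = ["한 달가량", "한 달"]
em_hit = exact_match("한 달", references)
f1_partial = token_f1("약 한 달", references)
print(em_hit, f1_partial)
```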