Based on Encoder-Decoder Models

Author

차상진

Published

April 1, 2025

1. Encoder-Decoder model

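- An encoder-decoder model takes a complete sentence as input and aims to generate a new sentence that is entirely different from that input.

- This resembles natural language generation with a decoder-based model, but there is a difference: a decoder-based model continues the input sentence, whereas an encoder-decoder model writes an entirely new one.

It is typically used for machine translation and summarization.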

1-1. BART

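- How BART is trained

1. Token masking

Identical to the standard masked LM objective used in BERT.

2. Token deletion

Random tokens are deleted and the model has to restore them. With masking, a token is replaced by [MASK], so the model knows which position is missing; with token deletion, the model cannot tell which positions were removed.

3. Text infilling

Several consecutive tokens in the input sentence are grouped into a span, and the whole span is replaced by a single [MASK] token. Span lengths are drawn from a Poisson distribution (λ = 3 in the BART paper), so a span may have length 0, 1, 2, or more. A span of length 0 simply inserts a [MASK] token into the otherwise intact sentence, while a span of length 2 or more collapses several tokens into a single [MASK] token. The model therefore also has to learn how many tokens are missing from each span.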

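3. Text infilling: supplementary notes

- Why a Poisson distribution?

First, a quick review of the Poisson distribution.

Poisson distribution: the distribution of the number of times a particular event A occurs within a fixed interval of time.

That is, given the average number of occurrences, the Poisson distribution gives the probability of each possible number of occurrences.

BART masks parts of the input sentence at random, but the length of each masked span is not picked arbitrarily; it is sampled from a Poisson distribution. The reason is to keep the number of tokens hidden in any one span unpredictable while holding the average span length constant.

As noted above, the Poisson distribution fixes how many tokens are masked on average, but the actual number masked in a given span is determined probabilistically.

Length 0: a [MASK] token is inserted into the otherwise intact sentence.
Length 2 or more: several consecutive tokens are replaced by a single [MASK] token. The length of the masked span is drawn from the Poisson distribution (how likely each length is depends on the mean, but very long spans are improbable).

Sampling span lengths this way is designed to help the model use context across a run of masked consecutive tokens and predict the missing tokens.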
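To make the span-length distribution concrete, the sketch below evaluates the Poisson probability mass function P(K = k) = λ^k e^(-λ) / k! for small k. The mean λ = 3 is the value reported in the BART paper; the helper name `poisson_pmf` is only illustrative.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a Poisson(lam) random variable equals k."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 3.0  # assumed mean span length (the value used in the BART paper)
for k in range(7):
    # Print the probability of sampling a span of exactly k tokens.
    print(f"span length {k}: {poisson_pmf(k, lam):.3f}")
```

With this mean, lengths 2 and 3 are the most likely, and a length of 0 (pure [MASK] insertion) is comparatively rare.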

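4. Sentence permutation

The input document is split into sentences, and the sentences are shuffled into a random order.

5. Document rotation

A token is chosen at random from the input, and the sequence is rotated so that it begins with that token. The tokens that preceded the chosen token move to the end.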
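The five noising schemes above can be sketched on plain token lists. This is a toy illustration under stated assumptions, not BART's actual preprocessing code: all function names are hypothetical, the positions to corrupt are passed in explicitly instead of being sampled, and `sample_poisson` uses Knuth's algorithm so that only the standard library is needed.

```python
import math
import random

MASK = "[MASK]"


def sample_poisson(lam, rng):
    """Knuth's algorithm: multiply uniform draws until the product falls below e^-lam."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def token_mask(tokens, positions):
    """Replace each chosen position with [MASK]; the length is unchanged."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def token_delete(tokens, positions):
    """Drop the chosen positions; nothing marks where they were."""
    return [t for i, t in enumerate(tokens) if i not in positions]


def text_infill(tokens, start, span_len):
    """Replace tokens[start:start + span_len] with one [MASK].
    A span_len of 0 just inserts a [MASK] at `start`."""
    return tokens[:start] + [MASK] + tokens[start + span_len:]


def sentence_permute(sentences, rng):
    """Return the sentences in a random order without modifying the input."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotate(tokens, start):
    """Rotate so that tokens[start] comes first; earlier tokens move to the end."""
    return tokens[start:] + tokens[:start]


rng = random.Random(0)
tokens = ["A", "B", "C", "D", "E"]
span_len = sample_poisson(3.0, rng)  # span length for one infilling step
print(text_infill(tokens, 1, min(span_len, len(tokens) - 1)))
```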

2. Conditional Generation

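Hugging Face calls the task of writing a new sentence based on a given sentence conditional generation.

- Formula

It is simple enough that it hardly counts as a formula, but…

- The basic idea of conditional generation

Conditional generation is the problem of generating an output \(y\) for a given input condition \(x\). In other words, it models the probability distribution over \(y\) given the input \(x\).

Conditional generation generally models a conditional probability. \[ P(y \mid x) \]

\(y\): the output data to generate (e.g., text, an image)
\(x\): the given information (e.g., the beginning of a text, specific keywords, an image description)

The output is generally chosen as follows. \[ \hat{y} = \arg\max_{y} P(y \mid x) \]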
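In practice, a sequence-to-sequence model does not score an entire output at once. The conditional probability is factorized token by token with the chain rule, and decoding approximates the argmax one token at a time. The symbols \(T\) (the output length) and \(y_{<t}\) (the tokens generated before step \(t\)) are notation introduced for this sketch.

```latex
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)
```

Each factor is what the decoder outputs at step \(t\): a distribution over the vocabulary, conditioned on the encoder input \(x\) and on the tokens produced so far.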

2-1. model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "hyunwoongko/kobart"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels will be overwritten to 2.

BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(30000, 768, padding_idx=3)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(30000, 768, padding_idx=3)
      (embed_positions): BartLearnedPositionalEmbedding(1028, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layernorm_embedding): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): BartDecoder(
      (embed_tokens): BartScaledWordEmbedding(30000, 768, padding_idx=3)
      (embed_positions): BartLearnedPositionalEmbedding(1028, 768)
      (layers): ModuleList(
        (0-5): 6 x BartDecoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layernorm_embedding): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
  )
  (lm_head): Linear(in_features=768, out_features=30000, bias=False)
)

2-2. Dataset

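Points to note

An encoder-decoder model needs two inputs, one for the encoder and one for the decoder, plus separate labels for the output.

Three data features are required: the encoder input, the decoder input, and the decoder labels.

- Why are encoder labels not needed, though? (Answered below.)

Because the encoder input and the target output are different texts, the target is passed through the tokenizer's text_target parameter so that the labels are tokenized in the same call.

- Why the encoder needs no labels

In short, only the decoder needs labels.

- The encoder only converts the input into vector representations, so it needs no labels of its own.
- The decoder generates the output, so it needs labels.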

from datasets import load_dataset

dataset = load_dataset("msarmi9/korean-english-multitarget-ted-talks-task")
print(dataset)
dataset['train'][0]

DatasetDict({
    train: Dataset({
        features: ['korean', 'english'],
        num_rows: 166215
    })
    validation: Dataset({
        features: ['korean', 'english'],
        num_rows: 1958
    })
    test: Dataset({
        features: ['korean', 'english'],
        num_rows: 1982
    })
})

{'korean': '(박수) 이쪽은 Bill Lange 이고, 저는 David Gallo입니다',
 'english': "(Applause) David Gallo: This is Bill Lange. I'm Dave Gallo."}

tokenized_dataset = dataset.map(
    lambda batch: (
        tokenizer(
            batch["korean"],
            text_target=batch["english"],
            max_length=512,
            truncation=True,
        )
    ),
    batched=True,
    batch_size=1000,
    num_proc=2,
    remove_columns=dataset["train"].column_names,
)
tokenized_dataset["train"][0]

{'input_ids': [0, 14338, 10770, 11372, 240, 14025, 12471, 12005, 15085, 29490, 14676, 24508, 300, 14025, 14161, 16530, 15529, 296, 317, 18509, 15464, 15585, 20858, 12049, 20211, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [0, 14338, 264, 311, 311, 17422, 316, 17223, 240, 15529, 296, 317, 18509, 15464, 15585, 20858, 257, 15054, 303, 15868, 1700, 15868, 15085, 29490, 14676, 24508, 300, 245, 14943, 238, 308, 15529, 296, 21518, 15464, 15585, 20858, 245, 1]}

tokenized_dataset

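# In this data, input_ids (the encoder input) and labels (the decoder targets) are present.
# decoder_input_ids (the decoder input) is not present yet; it still has to be derived from labels before the decoder can run.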

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 166215
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1958
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1982
    })
})

2-3. Collator

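The decoder input is simply the labels shifted one position to the right, with a start token placed in front.

DataCollatorForSeq2Seq handles this conveniently, together with padding.

Along with padding, it automatically builds the decoder input and returns it in the batch (this requires passing the model to the collator, as in the code below).

A collator is the function used to assemble a batch of data.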

from transformers import DataCollatorForSeq2Seq

collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding="max_length",
    max_length=512,
)
batch = collator([tokenized_dataset["train"][i] for i in range(2)])  # collate only the first two examples
batch

{'input_ids': tensor([[    0, 14338, 10770,  ...,     3,     3,     3],
        [    0, 15496, 18918,  ...,     3,     3,     3]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[    0, 14338,   264,  ...,  -100,  -100,  -100],
        [    0, 14603,   309,  ...,  -100,  -100,  -100]]), 'decoder_input_ids': tensor([[    1,     0, 14338,  ...,     3,     3,     3],
        [    1,     0, 14603,  ...,     3,     3,     3]])}
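The relationship between labels and decoder_input_ids in the output above can be reproduced with plain lists. This is a minimal sketch, not the library's implementation: the function name `shift_right` is illustrative, the pad id 3 is taken from padding_idx=3 in the model printout, and the start id 1 is read off the first column of decoder_input_ids.

```python
def shift_right(labels, pad_token_id=3, decoder_start_token_id=1):
    """Prepend the start token, drop the last label, and turn -100 back into padding."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    return [pad_token_id if tok == -100 else tok for tok in shifted]


# A shortened, illustrative label row: real tokens, the end token 1, then ignored positions.
labels = [0, 14338, 264, 1, -100, -100]
print(shift_right(labels))
```

At every position the decoder therefore sees the token that precedes the one it is asked to predict.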

2-4. Generation

import torch

with torch.no_grad():
    logits = model(**batch).logits

logits

tensor([[[  5.4885,  18.7849,  -0.5489,  ...,   0.0465,   0.5813,  -2.2851],
         [  3.7287,  18.9676,  -1.1747,  ...,  -0.2600,  -3.4647,  -0.0973],
         [ -1.2976,   8.6322,  -5.0410,  ...,  -7.0689,  -6.1346,  -4.4141],
         ...,
         [ -9.2638,   4.4483,  -8.4506,  ..., -12.6961, -13.2625,  -7.7570],
         [ -8.4581,   4.9268,  -7.2172,  ..., -11.5650, -11.8799,  -6.8108],
         [ -8.3191,   5.2101,  -6.8817,  ..., -11.1563, -11.7052,  -6.7644]],

        [[  4.7748,  16.2666,  -3.0011,  ...,  -0.8965,  -3.3187,  -3.1041],
         [  0.6535,  19.3665,  -1.4506,  ...,   0.1562,  -4.3976,   0.1983],
         [ -5.0934,  10.8673,  -7.5637,  ...,  -6.3808,  -1.6471,  -7.2105],
         ...,
         [ -1.5132,  19.0760,   0.3272,  ...,  -2.6680,  -3.9969,   2.7315],
         [ -2.3757,  20.0047,  -0.5301,  ...,  -1.7740,  -5.1750,   0.8077],
         [ -2.2504,  19.9756,  -0.4519,  ...,  -0.6850,  -5.1072,   0.4720]]])

logits.shape

torch.Size([2, 512, 30000])

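For each of the 2 sentences and each of the 512 token positions, the model outputs a score (logit) for every one of the 30,000 tokens in the vocabulary. These are raw scores rather than probabilities; applying softmax over the last dimension converts them into probabilities.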
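As a small illustration of how logits become probabilities, the sketch below applies a numerically stable softmax to three values. The toy vector reuses the first three logits printed above and stands in for the full 30,000-entry row; the function name is illustrative.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [5.4885, 18.7849, -0.5489]  # truncated row, for illustration only
probs = softmax(toy_logits)
print(probs)
```

The largest logit receives almost all of the probability mass, which is why greedy decoding can simply take the argmax of the logits.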

from transformers import GenerationConfig

gen_cfg = GenerationConfig(
    max_new_tokens=100,
    do_sample=True,
    temperature=1.2,
    top_k=50,
    top_p=0.95,
)
outputs = model.generate(batch["input_ids"], generation_config=gen_cfg)
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(result[0])

- Why no proper sentence was generated

model.config.eos_token_id is set to 1, so if the first token the model generates is token 1, generation stops immediately.

model.config.eos_token_id

2-5. Evaluate

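For text generation tasks, it is hard to check an evaluation metric while training is in progress.

During training, therefore, the cross-entropy loss is typically monitored instead: a steadily decreasing loss indicates that training is proceeding smoothly.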
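The loss being monitored can be sketched in plain Python. This is an illustrative implementation of token-level cross-entropy, not the code the model runs internally; positions labeled -100 (the padding value in the collator output above, and the default ignore index of PyTorch's cross-entropy loss) are skipped.

```python
import math

IGNORE_INDEX = -100  # label value for positions excluded from the loss


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the labels, skipping ignored positions."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        m = max(row)
        # log of the softmax denominator, computed stably
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Two positions over a toy 4-token vocabulary; the second position is padding.
toy_logits = [[0.0, 0.0, 0.0, 0.0], [2.0, 1.0, 0.0, -1.0]]
toy_labels = [2, IGNORE_INDEX]
print(cross_entropy(toy_logits, toy_labels))
```

With uniform logits over four tokens the loss equals log 4, the value expected from a model that has learned nothing yet.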

3. Sequence Classification

3-1. model

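In this exercise, the same sentence is fed to both the encoder and the decoder to perform sentence classification.

Because the input format does not change, inference uses the same code as in the earlier exercises with encoder-based and decoder-based models.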

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "hyunwoongko/kobart"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels will be overwritten to 2.
Some weights of BartForSequenceClassification were not initialized from the model checkpoint at hyunwoongko/kobart and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

BartForSequenceClassification(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(30000, 768, padding_idx=3)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(30000, 768, padding_idx=3)
      (embed_positions): BartLearnedPositionalEmbedding(1028, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layernorm_embedding): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): BartDecoder(
      (embed_tokens): BartScaledWordEmbedding(30000, 768, padding_idx=3)
      (embed_positions): BartLearnedPositionalEmbedding(1028, 768)
      (layers): ModuleList(
        (0-5): 6 x BartDecoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layernorm_embedding): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
  )
  (classification_head): BartClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=2, bias=True)
  )
)

3-2. Dataset

from datasets import load_dataset

dataset = load_dataset("klue", "sts")

def process_data(batch):
  result = tokenizer(batch["sentence1"], text_pair=batch["sentence2"])
  result["labels"] = [x["binary-label"] for x in batch["labels"]]
  return result

tokenized_dataset = dataset.map(
    process_data,
    batched=True,
    remove_columns=dataset["train"].column_names
)

tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 11668
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 519
    })
})

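The dataset contains only input_ids, the encoder input, but this is not a problem.

When decoder_input_ids is not supplied, the model automatically shifts the encoder's input_ids one position to the right and uses the result as the decoder input.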

3-3. Collator

As explained above, the collator batches the dataset so that the model can consume it directly.

import torch
from transformers import DataCollatorWithPadding

collator = DataCollatorWithPadding(tokenizer)
batch = collator([tokenized_dataset['train'][i] for i in range(4)])
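What padding in a collator amounts to can be sketched with plain lists. This is a simplified illustration, not DataCollatorWithPadding itself: the function name `pad_batch` is hypothetical, tensors are replaced by lists, and the pad id 3 is taken from padding_idx=3 in the model printout.

```python
def pad_batch(features, pad_token_id=3):
    """Pad every example to the length of the longest one in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get attention 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["labels"])
    return batch


# Two made-up examples of different lengths.
examples = [
    {"input_ids": [0, 11, 12, 1], "attention_mask": [1, 1, 1, 1], "labels": 1},
    {"input_ids": [0, 21, 1], "attention_mask": [1, 1, 1], "labels": 0},
]
print(pad_batch(examples))
```

Padding to the longest example in each batch, rather than to a fixed 512, keeps the batches as short as the data allows.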

3-4. Generation

with torch.no_grad():
    logits = model(**batch).logits

logits

tensor([[ 0.0255, -0.1499],
        [ 0.4134, -0.2986],
        [-0.0575,  0.0541],
        [ 0.1218, -0.8607]])

3-5. Evaluate

import evaluate

f1 = evaluate.load('f1')
f1.compute(
    predictions = logits.argmax(axis = -1),
    references = batch['labels'],
    average = 'micro'
)

{'f1': 0.5}

- Because this is a classification task rather than a generation task, it can be evaluated directly.
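For reference, micro-averaged F1 can be computed by hand. The sketch below is an illustrative reimplementation, not the evaluate library's code. The predictions are the argmax of the four logit rows printed above; the references are hypothetical, chosen only so that two of the four predictions are correct.

```python
def micro_f1(predictions, references):
    """Pool true positives, false positives, and false negatives over all classes."""
    tp = fp = fn = 0
    for c in set(predictions) | set(references):
        for p, r in zip(predictions, references):
            tp += int(p == c and r == c)
            fp += int(p == c and r != c)
            fn += int(p != c and r == c)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


predictions = [0, 0, 1, 0]  # argmax of the logits shown above
references = [0, 1, 1, 1]   # hypothetical labels, for illustration only
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(micro_f1(predictions, references), accuracy)
```

For single-label classification, micro-averaged F1 coincides with accuracy, so a score of 0.5 on four examples means two were classified correctly.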

4. Question Answering

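Extractive question answering is a simple task: only two values, the start and end positions of the answer within the text, need to be extracted.

An encoder-decoder model can therefore perform this task as well.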

4-1. model

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "hyunwoongko/kobart"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels will be overwritten to 2.
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at hyunwoongko/kobart and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

BartForQuestionAnswering(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(30000, 768, padding_idx=3)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(30000, 768, padding_idx=3)
      (embed_positions): BartLearnedPositionalEmbedding(1028, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layernorm_embedding): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): BartDecoder(
      (embed_tokens): BartScaledWordEmbedding(30000, 768, padding_idx=3)
      (embed_positions): BartLearnedPositionalEmbedding(1028, 768)
      (layers): ModuleList(
        (0-5): 6 x BartDecoderLayer(
          (self_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): BartSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
      (layernorm_embedding): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
  )
  (qa_outputs): Linear(in_features=768, out_features=2, bias=True)
)

- out_features=2

Only two values are needed per token, a start logit and an end logit, so out_features = 2.

4-2. Dataset

from datasets import load_dataset

dataset = load_dataset("klue", "mrc")  # klue: the benchmark; mrc: its machine reading comprehension subset

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=512,
        truncation="only_second",  # truncate only the context, which is the longer of the two
        return_offsets_mapping=True,  # keep each token's character span in the original text (needed for extraction)
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]  # gold answers: answer_start and text
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)
        # Locate the answer's start and end token positions

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx  # index of the first context token
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1  # index of the last context token

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)  # answer lies outside the context: set start and end to 0
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)  # token index where the answer starts

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)  # token index where the answer ends

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)
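The core of the preprocessing above, mapping an answer's character span to token indices, can be tried in isolation. The sketch below repeats the same two loops on a hand-written offset list; the function name, the example context "Seoul is the capital", and its offsets are made up for illustration and do not come from the KLUE data or the KoBART tokenizer.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Return (start_token, end_token) for an answer given as a character span."""
    # The answer lies outside the (possibly truncated) context: label it (0, 0).
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0
    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Hypothetical offsets for "Seoul is the capital"; index 0 and 5 are special tokens.
offsets = [(0, 0), (0, 5), (6, 8), (9, 12), (13, 20), (0, 0)]
print(char_span_to_token_span(offsets, 1, 4, 13, 20))  # answer "capital"
print(char_span_to_token_span(offsets, 1, 4, 6, 12))   # answer "is the"
```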

4-3. Collator

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
batch = data_collator([tokenized_dataset["train"][i] for i in range(10)])
batch

{'input_ids': tensor([[    0, 14337, 26225,  ...,     3,     3,     3],
         [    0, 25092, 18001,  ..., 11270, 19903,     1],
         [    0, 25788, 13679,  ..., 19903, 15599,     1],
         ...,
         [    0, 20437, 17814,  ...,     3,     3,     3],
         [    0, 14154, 12061,  ...,     3,     3,     3],
         [    0, 14295, 14120,  ...,     3,     3,     3]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'start_positions': tensor([233,  27,   0,  78,  60,  68, 202, 319, 306, 271]),
 'end_positions': tensor([235,  29,   0,  79,  66,  74, 210, 325, 312, 275])}

4-4. Generation

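import torch

with torch.no_grad():
    outputs = model(**batch)

# Take the argmax over the token positions of the first example only;
# calling argmax() on the whole (batch, seq_len) tensor would return an index into the flattened batch.
answer_start_index = outputs.start_logits[0].argmax()
answer_end_index = outputs.end_logits[0].argmax()

predict_answer_tokens = batch["input_ids"][0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)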

''

4-5. Evaluate

QA metrics can be computed with evaluate.load('squad'). However, doing so requires a substantial amount of post-processing and takes a long time.
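To give a sense of what such a metric measures, the sketch below computes exact match and a token-overlap F1 for a single prediction. It is a simplified illustration, not the squad metric itself: the function names and example strings are made up, and tokens are split on whitespace only, whereas the official SQuAD scoring also normalizes case, punctuation, and articles.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the stripped strings are identical, otherwise 0.0."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction, reference):
    """F1 over whitespace tokens shared by the prediction and the reference."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "the capital of Korea"  # illustrative strings only
reference = "capital of Korea"
print(exact_match(prediction, reference), token_f1(prediction, reference))
```

Most of the post-processing cost mentioned above comes from turning start and end logits back into answer strings before a comparison like this can be made.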