cau_dslab.log

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding(by 정다희)

Tue, 15 Mar 2022 13:09:49 GMT


XLNet: Generalized Autoregressive Pretraining for Language Understanding(by 강유진)

Thu, 03 Mar 2022 15:11:58 GMT

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Abstract

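Denoising-autoencoding pretraining such as BERT, which can model bidirectional context, outperforms pretraining approaches based on autoregressive (AR) language modeling. However, because it relies on corrupting the input with masks, BERT neglects the dependency between masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, the paper proposes XLNet, a generalized autoregressive pretraining method with the following properties.

It learns bidirectional context by maximizing the expected likelihood over all permutations of the factorization order.
Its autoregressive formulation overcomes the limitations of BERT.
It integrates ideas from Transformer-XL into pretraining.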

Introduction

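Unsupervised representation learning has been highly successful in natural language processing.
- Typically, these methods first pretrain a neural network on large-scale unlabeled text corpora,
- and then fine-tune the model or its representations on downstream tasks.
- The two most successful pretraining objectives are:
  - autoregressive (AR) language modeling
  - autoencoding (AE)
Autoregressive language model
- AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model.
- Given a text sequence $x = (x_1,...,x_T)$, AR language modeling factorizes the likelihood into
  - a forward product $p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$, or
  - a backward product $p(x) = \prod_{t=T}^{1} p(x_t \mid x_{>t})$
- An AR LM is trained to encode only a unidirectional context, so it is not effective at modeling deep bidirectional context.
- Downstream language understanding tasks often require bidirectional context information, which leaves a gap between AR language modeling and effective pretraining.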
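Autoencoder-based pretraining
- It does not perform explicit density estimation; instead it aims to reconstruct the original data from a corrupted input. BERT is the representative example.
  - Given an input token sequence, a certain portion of the tokens is replaced by the special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version.
  - Because density estimation is not part of the objective, BERT is free to use bidirectional context for reconstruction.
  - This closes the bidirectional information gap of AR language modeling mentioned above.
- Faced with the pros and cons of existing pretraining objectives, the paper proposes XLNet.
  - XLNet is a generalized autoregressive method that leverages the best of both AR language modeling and AE while avoiding their limitations.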
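XLNet
- Instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log likelihood of a sequence with respect to all possible permutations of the factorization order.
  - The context for each position can consist of tokens from both the left and the right.
  - In expectation, each position learns to use contextual information from all positions, i.e., it captures bidirectional context.
- As a generalized AR LM, XLNet does not rely on data corruption.
  - It therefore does not suffer from the pretrain-finetune discrepancy that BERT is subject to.
  - The autoregressive objective also provides a natural way to use the product rule to factorize the joint probability of the predicted tokens, which removes BERT's independence assumption.
- XLNet also improves the architectural design for pretraining.
- Inspired by the latest advances in AR language modeling, XLNet integrates the segment recurrence mechanism and the relative encoding scheme of Transformer-XL into pretraining.
- Naively applying a Transformer(-XL) architecture to permutation-based language modeling does not work, because the factorization order is arbitrary and the target is ambiguous. XLNet reparameterizes the network to remove this ambiguity.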

Related Work

The idea of permutation-based AR modeling has been explored before, but there are several key differences.
- Previous models aim to improve density estimation by baking an "orderless" inductive bias into the model.
  - ↔ XLNet is motivated by enabling AR language models to learn bidirectional contexts.
- Previous permutation-based AR models relied on the implicit position awareness inherited from their MLP (multi-layer perceptron) architectures.
  - ↔ Technically, to construct a valid target-aware prediction distribution, XLNet incorporates the target position into the hidden state via two-stream attention.
- For both orderless NADE (Neural Autoregressive Distribution Estimation) and XLNet, the authors emphasize that "orderless" does not mean the input sequence can be randomly permuted; it means the model allows different factorization orders of the distribution.

2 Proposed Method

2.1 Background

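We first compare conventional AR language modeling and BERT.
- Given a text sequence $x = [x_1,...,x_T]$, AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization:

    $\max_\theta \log p_\theta(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) = \sum_{t=1}^{T} \log \frac{\exp(h_\theta(x_{1:t-1})^\top e(x_t))}{\sum_{x'} \exp(h_\theta(x_{1:t-1})^\top e(x'))}$ (1)

  - $h_\theta(x_{1:t-1})$ = context representation produced by a neural model (RNN, Transformer, etc.)
  - $e(x)$ = embedding of x
  - x = text sequence
  - $\hat{x}$ = corrupted version of the text sequence
  - $\bar{x}$ = masked tokens
- BERT is based on denoising auto-encoding.
- Specifically, for a text sequence x, BERT first constructs a corrupted version $\hat{x}$ by randomly setting a portion (e.g. 15%) of the tokens in x to [MASK]. The masked tokens are denoted $\bar{x}$.
- The training objective is to reconstruct $\bar{x}$ from $\hat{x}$:

    $\max_\theta \log p_\theta(\bar{x} \mid \hat{x}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{x}) = \sum_{t=1}^{T} m_t \log \frac{\exp(H_\theta(\hat{x})_t^\top e(x_t))}{\sum_{x'} \exp(H_\theta(\hat{x})_t^\top e(x'))}$ (2)

    $m_t = 1$ indicates that $x_t$ is masked, and $H_\theta$ is a Transformer that maps a length-T text sequence **x** to a sequence of hidden vectors.

    $H_\theta(x) = [H_\theta(x)_1,H_\theta(x)_2,...,H_\theta(x)_T]$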

The pros and cons of the two pretraining objectives can be compared along the following aspects.
- Independence Assumption
  - As emphasized in Eq. (2), BERT factorizes the joint conditional probability $p(\bar{x} | \hat{x})$ into a product of per-token terms.
    - The masked tokens $\bar{x}$ are not actually independent of one another, but an independence assumption is needed to write the objective as this product.
  - The AR LM objective factorizes $p_\theta(x)$
    - using the product rule, which holds universally without any independence assumption.
  - AR language modeling is therefore preferable in that it needs no independence assumption.
- Input noise
  - Artificial symbols such as [MASK] in BERT's input never appear in real downstream tasks.
    - This causes a pretrain-finetune discrepancy.
  - Replacing [MASK] with the original tokens again does not solve the problem.
  - AR language modeling does not rely on input corruption, so it does not have this issue.
- Context dependency
  - The AR context representation $h_\theta(x_{1:t-1})$ can only capture unidirectional context, whereas the BERT context representation $H_\theta(x)_t$ captures context on both sides.

2.2 Objective: Permutation Language Modeling

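Having looked at the strengths and weaknesses of AR and BERT (AE), the question is how to keep the strengths of both while avoiding their weaknesses.
Borrowing ideas from orderless NADE (Neural Autoregressive Distribution Estimation), the paper proposes the permutation language modeling objective.
- It is an objective that keeps the advantages of AR models (no independence assumption, no pretrain-finetune discrepancy) while also capturing bidirectional contexts.
- For a sequence x of length $T$, there are $T!$ different orders
  - in which a valid autoregressive factorization can be performed.
  - Example with $T = 5$: there are $5! = 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 120$ orders; one alternative order is 3→5→2→1→4 (↔ the original order 1→2→3→4→5).
- If the model parameters are shared across all factorization orders, then in expectation the model learns to gather information from all positions on both sides.

- For a text sequence x, a factorization order z is sampled each time, and the likelihood $p_\theta(x)$ is decomposed according to z: $\max_\theta \; \mathbb{E}_{z \sim Z_T} \left[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid x_{z_{<t}}) \right]$, where $Z_T$ is the set of all permutations of the index sequence $[1, ..., T]$.
- Because the same $\theta$ is shared across all factorization orders during training, in expectation $x_t$ has seen every other element $x_i ≠ x_t$ of the sequence, and so it can capture context on both sides.
- Because this objective fits into the AR framework, it naturally avoids the independence assumption and the pretrain-finetune discrepancy.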

### Remark on Permutation

- The proposed objective permutes only the factorization order, not the sequence order!
    - The original sequence order is kept by using positional encodings that correspond to the original sequence, and the permutation of the factorization order is realized through a suitable attention mask in the Transformer.
    - Leaving the original sequence untouched is essential,
        - because the model only ever receives text sequences in their natural order during fine-tuning.
- The figure below shows the prediction of token $x_3$ for the same input sequence x under different factorization orders.
    ![](https://images.velog.io/images/cau_dslab/post/683fd305-1395-406e-8874-229ba48b55fc/%EC%8A%A4%ED%81%AC%EB%A6%B0%EC%83%B7%202022-02-18%20%EC%98%A4%ED%9B%84%201.40.21.png)



> In each order, the part that predicts x3 (in green) uses [x4], [x2,x4], or [x1,x2,x4]. Repeating this over all permutations ($Z_T$) yields the probability of x3 conditioned on every subset of [x1,x2,x4]. AR modeling that takes bidirectional context into account for a given token therefore becomes possible, which overcomes the limitation of conventional AR approaches. ([https://blog.pingpong.us/xlnet-review/](https://blog.pingpong.us/xlnet-review/))
>
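The claim in the quote can be checked by brute force. The following is a minimal sketch (the helper name `context_for` is hypothetical, not from the paper) that enumerates every factorization order of four positions and collects the set of positions preceding x3 in each order.

```python
from itertools import permutations

def context_for(target, order):
    """Positions that precede `target` in the given factorization order."""
    idx = order.index(target)
    return frozenset(order[:idx])

positions = [1, 2, 3, 4]
# Collect the distinct context sets that x3 is conditioned on across all 4! orders.
contexts = {context_for(3, list(z)) for z in permutations(positions)}

for ctx in sorted(contexts, key=lambda s: (len(s), sorted(s))):
    print(sorted(ctx))
```

Every subset of {1, 2, 4}, including the empty set, shows up as a context, which is why sharing parameters across orders exposes x3 to both its left and right neighbours.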

2.3 Architecture: Two-Stream Self-Attention for Target-Aware Representations

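A standard Transformer parameterization cannot deliver the properties that the permutation language modeling objective requires.
- The AR LM objective looks at the previous tokens and predicts the next token.
- During training, if $p(x) = p(x_2) p(x_3 \mid x_2) p(x_1 \mid x_2, x_3)...$, then every input except $x_2$ is masked when learning $x_3$, and everything except $x_2,x_3$ is masked when predicting $x_1$.
- If the next-token distribution $p_\theta(X_{z_t} \mid x_{z_{<t}})$ is parameterized with the standard softmax, we get
  $p_{\theta}(X_{z_t} = x \mid x_{z_{<t}}) = \frac{\exp(e(x)^\top h_\theta(x_{z_{<t}}))}{\sum_{x'} \exp(e(x')^\top h_\theta(x_{z_{<t}}))}$
- This value ends up being the same regardless of the target index position, which is a problem.
  - A conventional AR LM proceeds in one direction, so the target index position is unambiguous.
  - The XLNet objective, however, permutes the indices first, so several target index positions are possible even when the preceding indices are identical.

    Suppose we train on the input sequence [x1,x2,x3,x4] with index permutations $Z_T$=[[1,2,3,4],[1,3,2,4],…[4,3,2,1]].

    For [2,3,1,4], p(x1∣x2,x3) is computed from the representation hθ(x2,x3). For [2,3,4,1], p(x4∣x2,x3) is also computed from hθ(x2,x3). As a result, the same representation has to be used to predict both x4 and x1.

    https://blog.pingpong.us/xlnet-review/
- To solve this, XLNet proposes a new representation that uses not only the preceding context tokens ($x_{z_{<t}}$) but also the target position $z_t$:
  $h_{\theta}(x_{z_{<t}}) \rightarrow g_{\theta}(x_{z_{<t}}, z_t)$, which gives $p_{\theta}(X_{z_t} = x \mid x_{z_{<t}}) = \frac{\exp(e(x)^\top g_\theta(x_{z_{<t}}, z_t))}{\sum_{x'} \exp(e(x')^\top g_\theta(x_{z_{<t}}, z_t))}$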
Two-Stream Self-Attention

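To remove the ambiguity in target prediction, $g_\theta$ must satisfy the following two conditions.
1. To predict the token $x_{z_t}$, $g_\theta(x_{z_{<t}}, z_t)$ should use only the position $z_t$ and not the content $x_{z_t}$.
2. To predict a later token $x_{z_j}$ with $j > t$, $g_\theta(x_{z_{<t}}, z_t)$ should also encode the content $x_{z_t}$.
A standard Transformer has one representation per token. The conditions on the hidden representation $g_\theta$ above, however, call for two representations.
- Under condition 1, the hidden representation at step t must not contain the content at step t,
- whereas under condition 2 it must contain the content at step t.
To give each token two representations at every layer, the authors propose Two-Stream Self-Attention, which maintains two sets of hidden representations.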

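1. Content Stream
- $Content \space Stream: \space h_{z_t}^{(m)} \leftarrow Attention(Q=h_{z_t}^{(m-1)}, KV=h_{z_{\leq t}}^{(m-1)};\theta)$
- The content representation $h_\theta(x_{z_{\leq t}})$ = $h_{z_t}$, which satisfies condition 2, plays a role similar to the hidden states of a standard Transformer. It encodes both the context and $x_{z_t}$ itself.
- It uses the target position $z_t$ together with the token $x_{z_t}$ that carries the content at step t.
- The content stream uses the previous layer's h state at $z_t$ as the query, and the previous layer's h states at $z_{\leq t}$ as keys and values.
- Q, K, and V are the same as in standard Transformer self-attention; the states of tokens that come after step t in the factorization order are masked out.

2. Query Stream
- $Query \space Stream: \space g_{z_t}^{(m)} \leftarrow Attention(Q=g_{z_t}^{(m-1)}, KV=h_{z_{<t}}^{(m-1)};\theta)$
- The query representation $g_{z_t}$ satisfies condition 1: it has access to the context $x_{z_{<t}}$ and the position $z_t$, but not to the content $x_{z_t}$.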
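The difference between the two streams comes down to one inequality in the attention mask. The sketch below is an illustration under that reading, not the authors' implementation; `two_stream_masks` is a hypothetical helper and positions are 0-based.

```python
import numpy as np

def two_stream_masks(order):
    """Build attention masks from a factorization order.

    order: 0-based positions listed in factorization order z.
    Returns (content_mask, query_mask); entry [i, j] is True when
    position i may attend to position j.
    """
    T = len(order)
    rank = np.empty(T, dtype=int)
    rank[np.asarray(order)] = np.arange(T)          # rank[p] = step at which p is predicted
    content_mask = rank[None, :] <= rank[:, None]   # z_{<=t}: a position sees itself
    query_mask = rank[None, :] < rank[:, None]      # z_{<t}: a position does not see itself
    return content_mask, query_mask

# Order 3 -> 2 -> 4 -> 1 in the 1-based notation used in the text.
content, query = two_stream_masks([2, 1, 3, 0])
print(content.astype(int))
print(query.astype(int))
```

The content mask has a full diagonal and the query mask has an empty one; everything else is shared, so the sequence itself never has to be reordered.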

 



Partial Prediction

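The permutation language modeling objective is considerably harder to optimize,
because the permutations make convergence slow.


To reduce this difficulty, only the last few tokens of a factorization order are predicted.
z is split at a cutting point c into a non-target part and a target part, and the objective is computed only over the target part.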
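A minimal sketch of the split at the cutting point follows. The hyperparameter `K` (about 1/K of the tokens are selected as targets, so that |z| / (|z| − c) ≈ K) follows the paper's description; the function name `split_order` and the rounding rule are assumptions made for illustration.

```python
import random

def split_order(z, K):
    """Split a factorization order z into (non_target, target).

    K is a hyperparameter: roughly 1/K of the tokens become targets,
    so the cutting point c satisfies len(z) / (len(z) - c) ≈ K.
    """
    num_targets = max(1, round(len(z) / K))
    c = len(z) - num_targets
    return z[:c], z[c:]

random.seed(0)
z = list(range(1, 13))
random.shuffle(z)                      # one sampled factorization order
non_target, target = split_order(z, K=6)
print(non_target, target)
```

Only the positions in `target` contribute to the loss; they are the ones with the longest context in the sampled order.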




2.4  Incorporating Ideas from Transformer-XL

Two key techniques from Transformer-XL are adopted.

relative positional encoding scheme
A standard Transformer does not model token positions by itself, so it relies on positional encodings instead.
However, such positional encodings become a problem when segments are processed recurrently, as in Transformer-XL.


segment recurrence mechanism


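This part looks at how the recurrence mechanism is integrated into the proposed permutation setting and how the model reuses hidden states from previous segments.

Suppose we take two segments from a long sequence s:

$\tilde{\mathbf{x}} = s_{1:T}$ and $\mathbf{x} = s_{T+1:2T}$, with $\tilde{\mathbf{z}}$ and $\mathbf{z}$ being permutations of $[1...T]$ and $[T+1 ... 2T]$

Given the permutation $\tilde{\mathbf{z}}$, the first segment is processed and the content representations $\tilde{h}^{(m)}$ for each layer m are cached. Then, for the next segment $\mathbf{x}$, the attention update with memory is:

$h_{z_t}^{(m)} \leftarrow Attention(Q=h_{z_t}^{(m-1)}, KV=[\tilde{h}^{(m-1)}, h_{z_{\leq t}}^{(m-1)}];\theta)$

- Because positional encodings depend only on the actual positions in the original sequence, this attention update is independent of $\tilde{\mathbf{z}}$ once $\tilde{h}^{(m)}$ has been obtained.
- The memory can therefore be cached and reused without knowing the factorization order of the previous segment.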
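The memory-augmented update can be sketched with plain dot-product attention. This is a simplified illustration under stated assumptions: a single head, no learned projections, and no relative positional terms; `attend_with_memory` is a hypothetical name.

```python
import numpy as np

def attend_with_memory(h_query, h_current, memory):
    """Single-head dot-product attention whose keys/values are [memory, current].

    h_query:   (d,)    query vector for one position of the current segment
    h_current: (n, d)  visible hidden states of the current segment
    memory:    (m, d)  cached hidden states from the previous segment
    """
    kv = np.concatenate([memory, h_current], axis=0)     # (m + n, d)
    scores = kv @ h_query / np.sqrt(h_query.shape[0])    # (m + n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv, weights

rng = np.random.default_rng(0)
memory = rng.normal(size=(4, 8))       # previous segment, cached
h_current = rng.normal(size=(3, 8))    # current-segment states allowed by the mask
out, weights = attend_with_memory(h_current[-1], h_current, memory)
print(out.shape, weights.shape)
```

The function never needs the previous segment's factorization order: the cached states are simply prepended to the keys and values.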
2.5  Modeling Multiple Segments

Many downstream tasks have multiple input segments, so this section covers how XLNet is pretrained on multiple segments.

During pretraining, following BERT, two segments are randomly sampled, and their concatenation is treated as one sequence for permutation language modeling.

Memory is reused only when the two segments come from the same context.


The input format of XLNet is the same as BERT's: [CLS, A, SEP, B, SEP].

It uses the two special symbols SEP and CLS and the two segments A and B.
Relative Segment Encoding

BERT adds an absolute segment embedding to the word embedding at each position; XLNet differs on this point.

It extends the relative encoding idea of Transformer-XL to encode segments.

Given a pair of positions $i, j$ in the sequence, the segment encoding is $s_{ij} = s_+$ if $i$ and $j$ are in the same segment and $s_{ij} = s_-$ otherwise, so only whether the two positions are in the same segment matters.
$s_+, s_-$ are learnable model parameters for each attention head.
The segment encoding $s_{ij}$ is used to compute an attention weight.
$a_{ij} = (q_i +b)^\top s_{ij}$, where $q_i$ is the query vector and b is a learnable head-specific bias vector
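The formula $a_{ij} = (q_i + b)^\top s_{ij}$ can be evaluated for all position pairs at once. The sketch below uses random vectors in place of learned parameters; the function name and argument layout are assumptions for illustration.

```python
import numpy as np

def segment_attention_bias(q, b, s_plus, s_minus, segment_ids):
    """Compute a_ij = (q_i + b)^T s_ij for every pair of positions.

    q: (T, d) query vectors, b: (d,) bias, s_plus/s_minus: (d,) encodings,
    segment_ids: (T,) integer segment label of each position.
    """
    same = segment_ids[:, None] == segment_ids[None, :]   # (T, T) same-segment flags
    qb = q + b                                            # (T, d)
    a_same = qb @ s_plus                                  # (T,) value when i, j share a segment
    a_diff = qb @ s_minus                                 # (T,) value otherwise
    return np.where(same, a_same[:, None], a_diff[:, None])

rng = np.random.default_rng(0)
T, d = 5, 4
q, b = rng.normal(size=(T, d)), rng.normal(size=d)
s_plus, s_minus = rng.normal(size=d), rng.normal(size=d)
segment_ids = np.array([0, 0, 0, 1, 1])    # e.g. segment A followed by segment B
a = segment_attention_bias(q, b, s_plus, s_minus, segment_ids)
print(a.shape)
```

Because only the same/different relation enters the computation, nothing in it depends on how many segments there are or which one is "first".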






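Advantages of relative segment encoding

The inductive bias of relative encodings improves generalization.
It opens up the possibility of fine-tuning on tasks with more than two input segments, which is not possible with absolute segment encodings.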





2.6  Discussion

Note that XLNet can capture dependencies between pairs of tokens that BERT cannot.

3  Experiments
3.4  Ablation Study

The ablation study has three goals:
To show the effectiveness of the permutation language modeling objective, especially in comparison with BERT's denoising auto-encoding objective
To show the importance of using Transformer-XL as the backbone network
To show the necessity of some implementation details:
span-based prediction
bidirectional input pipeline
next-sentence prediction





   
Reference 1



Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks (by 이원민)
Wed, 02 Mar 2022 18:15:59 GMT
summary

Most recent NLP research uses pretrained models.

This paper focuses on the usefulness of using task- and domain-specific data during pretraining.

The experiments use the RoBERTa model.

In addition to the Wikipedia and BOOKCORPUS data used to pretrain BERT, RoBERTa adds unlabeled data such as CCNEWS and OPENWEB-TEXT, for 160GB of data in total.

It uses the same Transformer-based architecture as BERT, but with more training steps and a larger batch size.

It drops BERT's NSP objective and uses only MLM.

As a result, RoBERTa outperformed earlier pretrained models on a wider range of tasks, which is why this paper uses it.



Contributions of the paper
① Experiments on 4 domains and 8 tasks to compare performance under domain- and task-adaptive pretraining (DAPT, TAPT, Combined DAPT and TAPT)
② Experiments on whether adapting to one task transfers to another task (Transfer-TAPT)
③ Two ways of obtaining task-related unlabeled data so that more data can be used for TAPT (Human Curated-TAPT, Automated Data Selection for TAPT)


domain & task adapted pretrain
①-1. DAPT(Domain-Adaptive Pretraining)
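: Continue pretraining on data related to a specific domain

Measure the similarity between each target domain (the paper uses the biomedical, computer science, news, and reviews domains) and RoBERTa's original pretraining domain (fig2)

The similarity is measured with the 10,000 most frequent words in each domain's data
As a result, the news and reviews domains show high similarity, while cs and biomedical show relatively low similarity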
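The vocabulary-overlap measurement can be sketched as follows. This is a simplified illustration, assuming whitespace tokenization and no stopword filtering; the function names and toy corpora are hypothetical and not from the paper.

```python
from collections import Counter

def top_k_vocab(texts, k):
    """Return the set of the k most frequent whitespace tokens."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {tok for tok, _ in counts.most_common(k)}

def vocab_overlap(texts_a, texts_b, k=10_000):
    """Percentage of the top-k vocabularies that the two corpora share."""
    vocab_a, vocab_b = top_k_vocab(texts_a, k), top_k_vocab(texts_b, k)
    return 100.0 * len(vocab_a & vocab_b) / max(len(vocab_a), len(vocab_b), 1)

news = ["the senate passed the bill", "the market fell today"]
biomed = ["the protein binds the receptor", "the cells were cultured today"]
print(f"{vocab_overlap(news, biomed, k=5):.1f}%")
```

On real corpora the same function would be called with `k=10_000` on samples from each domain, giving one overlap percentage per domain pair.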


Continued pretraining on data from the relevant domain improves performance over the original RoBERTa (table3)

To show that this gain does not come simply from pretraining on more data, the authors also pretrain on data from a different domain

Domains with relatively low similarity are crossed
(news: cs // review: biomedical // cs: news // biomedical: review) (fig2)
As a result, performance is lower (table3-¬DAPT)


This shows that pretraining with domain relevance in mind, as in DAPT, improves final performance


 
①-2. TAPT (Task-Adaptive Pretraining)
: Pretrain on data tied to a specific task within the relevant domain

Compared with DAPT it uses less data, but the data is much more task-relevant

It outperforms the original RoBERTa and performs comparably to DAPT (table5), so it can be seen as an efficient approach because it costs less time and compute



①-3. DAPT+TAPT
: After RoBERTa's original pretraining, apply DAPT and then TAPT in sequence

DAPT+TAPT performs best compared with the original RoBERTa, DAPT, and TAPT (table5)

However, its computational cost is the highest compared with RoBERTa, DAPT, and TAPT


Transfer-TAPT
: Experiments on whether there is a transfer effect between two tasks in the same domain
e.g.) For the RCT and CHEMPROT tasks in the biomedical domain, pretrain on the unlabeled data of RCT, then fine-tune and evaluate on the labeled data of CHEMPROT

As a result, Transfer-TAPT performs worse than TAPT (table6)

This also shows that the data distribution can differ between tasks within the same domain



Augmenting Training data for TAPT
: Gather more unlabeled data for a specific task and run TAPT
③-1. Human Curated-TAPT
: TAPT on unlabeled data that was collected by humans

RCT: Of the 18040 training examples, 500 are kept as labeled and the rest are treated as unlabeled for Curated-TAPT

In the DAPT+TAPT experiment (table5), TAPT pretrains on the original 18040 examples (table2) and scores 87.8 (table5)
In the DAPT+Curated-TAPT experiment (table7), Curated-TAPT pretrains on 17540 (18040-500) examples (table2)
and scores 83.8 (table7)
→ Performance is lower when TAPT is run with less data


HYPERPARTISAN: Curated-TAPT with 5000 unlabeled examples

In the DAPT+TAPT experiment (table5), TAPT pretrains on the original 515 examples (table2) and scores 90.0 (table5)
In the DAPT+Curated-TAPT experiment (table7), Curated-TAPT pretrains on 5000 examples (table2) and scores

  92.1(table7)

→ Performance is higher when TAPT is run with more data


IMDB: 50000 unlabeled examples were collected for Curated-TAPT

In the DAPT+TAPT experiment (table5), TAPT pretrains on the original 20000 examples (table2) and scores 95.6 (table5)
In the DAPT+Curated-TAPT experiment (table7), Curated-TAPT pretrains on 50000 examples (table2) and scores

  95.8(table7)

→ Performance is higher when TAPT is run with more data



→ Because manually curating data has its limits, the paper proposes a way to obtain data automatically, as in ③-2
  
③-2. Automated Data Selection for TAPT
: A way to automatically find unlabeled data relevant to the task

The VAMPIRE model (a bag-of-words language model) embeds the task and domain data in a shared space

Domain examples similar to the task data are then selected

Candidates are selected either by kNN or at random
kNN uses k=50, 150, 500 and random selection uses k=50 (fig3)
As a result, kNN selection performs better than random selection (table8)
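The selection step can be sketched as a nearest-neighbour search in the shared embedding space. The embeddings below are random stand-ins for VAMPIRE outputs, Euclidean distance is an assumption, and `knn_select` / `random_select` are hypothetical helper names.

```python
import numpy as np

def knn_select(task_emb, domain_emb, k):
    """For each task example, pick its k nearest domain examples (Euclidean).

    task_emb: (n_task, d), domain_emb: (n_domain, d).
    Returns the sorted unique indices of the selected domain examples.
    """
    diff = task_emb[:, None, :] - domain_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                 # (n_task, n_domain)
    nearest = np.argsort(dist, axis=1)[:, :k]            # (n_task, k)
    return np.unique(nearest)

def random_select(n_task, n_domain, k, rng):
    """Baseline: draw k random domain examples per task example."""
    picks = [rng.choice(n_domain, size=k, replace=False) for _ in range(n_task)]
    return np.unique(np.concatenate(picks))

rng = np.random.default_rng(0)
task_emb = rng.normal(size=(3, 16))       # stand-in for embedded task examples
domain_emb = rng.normal(size=(100, 16))   # stand-in for embedded domain examples
print(knn_select(task_emb, domain_emb, k=5))
```

The selected domain examples would then be added to the task's unlabeled pool before running TAPT.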
 






XLNet: Generalized Autoregressive Pretraining for Language Understanding(by 안재혁)
Thu, 24 Feb 2022 07:58:14 GMT

I rendered factorize / factorization order as 인수 분해 (factorization); please keep this in mind in case that reading is inaccurate.

Extra. Transformer-XL
XLNet builds on Transformer-XL, so understanding what is distinctive about Transformer-XL makes XLNet easier to follow.
https://www.youtube.com/watch?v=lSTljZy8ag4
This section is based on the video above.

Attention is not recurrent; it can only deal with fixed-length context.

If a context longer than the fixed length comes in, it has to be split into smaller pieces or truncated so that the tail of the sequence is dropped.

Context fragmentation: if context is long, it should be split up to segments.

When a language model is trained on a large dataset, all tokens cannot be fed in at once, so the text is split into shorter sequences. As a result, information from one segment cannot be used for another segment. Unlike a human reader, the model has to treat each segment independently.
To address these two weaknesses (1 and 2) of the Transformer, the model needs to handle long contexts and make use of the context from previous sequences.
Transformer-XL keeps the Transformer but adds 1. segment-level recurrence and 2. relative positional embeddings.
Segment-level Recurrence
Transformer-XL reuses the previous segment when processing a new one, and produces all predictions for the new segment at once.

The hidden states of the previous segment are kept as a kind of cache. If the memory is large enough, this information can be fully used, and since the next predictions draw on the earlier context, better performance can be expected.
With more memory, the states of segments n-4, n-3, n-2, and n-1 could be concatenated and used, instead of only n-1.
Relative Positional Embedding
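This is where positional embeddings become a problem. Feeding in several segments means that positions overlap across segments: with three segments there are three tokens at position 0, and they collide.
To solve this, Transformer-XL defines position not by the absolute index but by the relative distance between the query vector and the key vector. The relative distance is then encoded with the same sinusoidal functions as the original Transformer positional encoding, giving values between -1 and 1.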
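A minimal sketch of this idea: compute the query-key distance for every pair and pass it through a sinusoid. The function names and the choice of `d_model` are assumptions, and the learned projections Transformer-XL applies on top are omitted.

```python
import numpy as np

def relative_distances(mem_len, seg_len):
    """Distance i - j between each query i (current segment) and key j (memory + segment)."""
    q_pos = np.arange(mem_len, mem_len + seg_len)[:, None]   # absolute query positions
    k_pos = np.arange(mem_len + seg_len)[None, :]            # absolute key positions
    return q_pos - k_pos                                     # (seg_len, mem_len + seg_len)

def sinusoid(distances, d_model):
    """Sinusoidal encoding of (relative) distances; values lie in [-1, 1]."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angles = distances[..., None] * inv_freq                 # (..., d_model / 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

rel = relative_distances(mem_len=3, seg_len=3)
print(rel)
enc = sinusoid(rel, d_model=8)
print(enc.shape)
```

Because only `i - j` enters the encoding, a cached token and a current token no longer collide just because both sat at index 0 of their own segment.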
Prediction
During evaluation, the vanilla model consumes a segment to make only one prediction at the last position, which is extremely expensive.
Transformer-XL uses representations (memory) from previous segments instead of computing from scratch.
Because Transformer-XL reuses the states of previous segments as memory, a single pass yields results for a whole segment rather than for a single token.

To compute the last (red) position, the information of the previous segment is not recomputed; the hidden states of the already-processed segment are loaded instead. This makes the computation faster than starting from scratch.
>> The computation for the current segment is required in both cases; the key difference is that one pass produces results for a whole segment instead of for a single token.


0. Abstract
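One approach is to write everything else first and summarize it in the abstract at the very end. Some conferences have the abstract submitted first and the full paper a week later. The first sentence summarizes autoencoding versus autoregressive pretraining and also mentions BERT, which was the hot topic when the paper was written.
Then comes "However", and BERT is described in negative terms (corrupting), as is [MASK] (discrepancy). There are dependencies between [MASK] positions, but they are assumed to be independent, so the correlation is ignored. Also, [MASK] is not used during fine-tuning, which creates a mismatch between pretraining and fine-tuning.
XLNet is then proposed. Since the model name comes out of nowhere, it is explained in detail. XLNet is an autoregressive pretraining method that (1) learns bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. The keyword here is permutation: permuting the factorization order is what overcomes the earlier weakness. (2) BERT is an autoencoding method, and the abstract states that the weaknesses of the AE approach are overcome through the AR formulation.
It also borrows from Transformer-XL, an autoregressive model that was the previous SOTA among AR models, and uses its ideas in pretraining.
After the good news, it should be wrapped up in one closing sentence (the brag): across various experimental settings, XLNet does not merely beat BERT, it outperforms it on 20 tasks.
The word margin here can only be used when the gap is larger than the usual gaps. In Table 3.2, the best result is set in bold so that it can be seen at a glance. A footnote with a link to the code is sometimes added at the end. David Blei (https://ko.wikipedia.org/wiki/데이비드_블라이) writes elegantly, and he adds a link at the end of his abstracts. (Imitating this is not plagiarism.)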
1. Introduction
Unsupervised representation learning has been highly successful in the domain of natural language processing [7, 22, 27, 28, 10]. Typically, these methods first pretrain neural networks on large-scale
unlabeled text corpora, and then finetune the models or representations on downstream tasks. Under this shared high-level idea, different unsupervised pretraining objectives have been explored in
literature. Among them, autoregressive (AR) language modeling and autoencoding (AE) have been the two most successful pretraining objectives.
Unsupervised representation learning has been very successful in natural language processing. Typically, a neural network is pretrained on large-scale unlabeled text corpora, and the model or its representations are then fine-tuned for downstream tasks. Under this shared high-level idea, various unsupervised pretraining objectives have been explored. Among them, autoregressive (AR) language modeling and autoencoding (AE) have been the most successful.
AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model [7, 27, 28]. Specifically, given a text sequence x = (x1, · · · , xT ), AR language modeling factorizes the likelihood into a forward product $p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$ or a backward one $p(x) = \prod_{t=T}^{1} p(x_t \mid x_{>t})$. A parametric model (e.g. a neural network) is trained to model each conditional distribution. Since an AR language model is only trained to encode a uni-directional context (either forward or backward), it is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information. This results in a gap between AR language modeling and effective pretraining.
AR language modeling estimates the probability distribution of a text corpus with an autoregressive model. Given a text sequence, it factorizes the likelihood into a forward or a backward product. A parametric model is trained to model each conditional distribution. Because an AR language model is trained to encode only a unidirectional context, it is not effective at modeling deep bidirectional contexts. Downstream language understanding tasks, on the other hand, require bidirectional context information, which creates a gap between AR language modeling and effective pretraining.
In comparison, AE based pretraining does not perform explicit density estimation but instead aims to reconstruct the original data from corrupted input. A notable example is BERT [10], which has been the state-of-the-art pretraining approach. Given the input token sequence, a certain portion of tokens are replaced by a special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version. Since density estimation is not part of the objective, BERT is allowed to utilize bidirectional contexts for reconstruction. As an immediate benefit, this closes the aforementioned bidirectional information gap in AR language modeling, leading to improved performance. However, the artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy. Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified as high-order, long-range dependency is prevalent in natural language [9].
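In contrast, AE-based pretraining does not perform density estimation; it aims to reconstruct the original data from a corrupted input. The representative example is BERT, which has been the SOTA pretraining approach. In a given input token sequence, a portion of the tokens is replaced by the special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version. Because density estimation is not part of the objective, BERT can use bidirectional context for reconstruction. As an immediate benefit, this closes the bidirectional information gap of AR language modeling and leads to better performance. However, the special [MASK] symbol used by BERT during pretraining does not exist at fine-tuning time, which creates a pretrain-finetune discrepancy. Moreover, since the predicted tokens are masked in the input, BERT cannot model the joint probability with the product rule as AR language modeling does. In other words, BERT assumes that the predicted tokens are independent of each other given the unmasked tokens, which is an oversimplification because high-order, long-range dependency is prevalent in natural language (such dependencies tend to be ignored).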
Faced with the pros and cons of existing language pretraining objectives, in this work, we propose XLNet, a generalized autoregressive method that leverages the best of both AR language modeling and AE while avoiding their limitations.

Firstly, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.
Secondly, as a generalized AR language model, XLNet does not rely on data corruption. Hence, XLNet does not suffer from the pretrain-finetune discrepancy that BERT is subject to. Meanwhile, the autoregressive objective also provides a natural way to use the product rule for factorizing the joint probability of the predicted tokens, eliminating the independence assumption made in BERT.

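Faced with the pros and cons of existing language pretraining objectives, the paper proposes XLNet. XLNet is a generalized AR method that takes only the strengths of AR and AE and avoids their limitations.

First, instead of using a fixed forward or backward factorization order as in conventional AR modeling, XLNet maximizes the expected log likelihood over all possible permutations of the factorization order. Thanks to the permutation, the context of each position consists of tokens from both directions. In expectation, each position learns to use contextual information from all positions, and thus captures bidirectional context.
Second, as a generalized AR language model, XLNet does not rely on data corruption. XLNet therefore does not suffer from the pretrain-finetune discrepancy that BERT is subject to. The autoregressive objective also provides a natural way to use the product rule to factorize the joint probability of the predicted tokens, which removes the independence assumption made in BERT.

product rule: the multiplication rule of probability (곱셈 규칙)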
In addition to a novel pretraining objective, XLNet improves architectural designs for pretraining.

Inspired by the latest advancements in AR language modeling, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL [9] into pretraining, which empirically improves the performance especially for tasks involving a longer text sequence.
Naively applying a Transformer(-XL) architecture to permutation-based language modeling does not work because the factorization order is arbitrary and the target is ambiguous. As a solution, we propose to reparameterize the Transformer(-XL) network to remove the ambiguity.

Empirically, under comparable experiment setting, XLNet consistently outperforms BERT [10] on a wide spectrum of problems including GLUE language understanding tasks, reading comprehension tasks like SQuAD and RACE, text classification tasks such as Yelp and IMDB, and the ClueWeb09-B document ranking task.
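Beyond the new pretraining objective, XLNet also improves the architectural design for pretraining.

Inspired by the latest advances in AR language modeling, XLNet integrates the segment recurrence mechanism and the relative encoding scheme of Transformer-XL into pretraining, which improves performance on tasks involving longer text sequences.
Naively applying Transformer(-XL) to permutation-based language modeling does not work, because the factorization order is arbitrary and the target is ambiguous. As a solution, the authors reparameterize the Transformer(-XL) network to remove the ambiguity.

Empirically, under comparable experimental settings, XLNet consistently outperforms BERT on a wide range of tasks, including the GLUE language understanding tasks, reading comprehension tasks such as SQuAD and RACE, text classification, and document ranking.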
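Pretraining generally uses large unlabeled data, followed by fine-tuning for downstream tasks. This high-level idea is shared, and the phrase "in literature" signals that different objectives have been studied in prior work.
There are two approaches to the language-modeling objective: AE and AR. The word objective is hard to translate simply as "goal". It should be read more broadly, as something that is achieved while moving forward. An objective function is thus read as the function that shapes the model for a given setup; translating it as just "goal function" does not quite fit.
Vectors are set in bold to distinguish them from scalars.
AR factorizes the likelihood in the forward direction. "~ to model each conditional distribution": the neural network is trained on the data to model each conditional. AR language modeling learns in one direction only: the forward model learns forward and the backward model learns backward. This is not effective for learning bidirectional context. The problem is that fine-tuning needs bidirectional information, so there is a gap: training is unidirectional, but the downstream task needs both directions.
"In comparison" introduces the AE model. What is the difference between AR's probability distribution and AE's density estimation? >> It is easier to see it as a difference in terminology rather than a substantial difference.
To understand a paper properly, it is also important to read the papers it cites.
has been the state-of-the-art pretraining approach: this implies that it is no longer SOTA.
BERT was already described in the abstract. Points (1) and (2): it ignores the relationship between [MASK] tokens. If [MASK] tokens are consecutive, the two words are clearly related, yet this is ignored. [MASK] is described with the nuance of being artificial (made up for the sake of training). AR uses the chain rule to compute later probabilities from earlier, related outcomes (joint probability), which BERT cannot do. Explaining the weaknesses of prior work is what justifies the new model, but writing this way requires confidence in one's own model.
Being simplified is good. Simplifying to improve performance is a valid approach, but here the approach is called oversimplified: BERT is said to simplify away the high-order, long-range dependency in natural language.
In short, the existing AR and AE approaches each have pros and cons. XLNet is announced as a new model that combines only the strengths of AR and AE (avoiding their limitations). In addition, XLNet adds two further improvements to pretraining.
First, conventional AR models fix the direction as forward or backward; XLNet instead uses all possible permutations of the factorization order. This lets it interpret information from every direction.
The authors treat BERT's reliance on reconstruction as a weakness and AR's independence from data corruption as a strength. That is, AR does not have the discrepancy that BERT cannot avoid. By obtaining the joint probability as a product through the chain rule, the authors resolve BERT's independence assumption. The most important point is that both of BERT's main weaknesses are overcome.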

2. Paper Review
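XLNet is the model that displaced BERT, which had reached SOTA on most tasks. XLNet ranks highly on QA tasks. Before describing the model, the introduction of the XLNet paper gives a detailed account of AR and AE. GPT and ELMo are the AR models that showed strong performance. They differ somewhat in how the representation is used: they rely on pretraining but do not use it directly. BERT, by contrast, fine-tunes after pretraining. GPT is a model that aims at zero-shot learning. AR models use the language-model training scheme, predicting the next token from the previous tokens.
In the vanilla AR model, the probability is written as a product of conditional probabilities. The conditional distribution is a forward or backward conditional probability: given the tokens before x_t, the model predicts the next token. Because the next token is predicted only from the previous tokens, the model cannot build a deep understanding of the sentence. → To address this, ELMo or an autoencoding approach has to be used.
ELMo is built on a BiLSTM. The input tokens are used to train a forward language model and a backward language model separately, after which the two are concatenated and combined linearly for the final training. ELMo extracts context from both directions, but since it learns left-to-right and right-to-left separately, it still cannot be said to have a deep understanding.
Autoencoding does not estimate a distribution in order to predict what comes next; it learns to recover the original information from all of the available context. Since autoencoding predicts [MASK], it is a denoising model that removes the [MASK] noise. AE makes an independence assumption, so it assumes there is no dependency between the answers even though there is one.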
XLNet’s Objective
In summary:

Suppose the original tokens are in the order [1,2,3,4], and a permutation changes the order in which they are predicted.

Top left: AR modeling proceeds in the order [3, 2, 4, 1]. AR modeling predicts from the left, and since 3 comes first, it only takes the information in $mem^{n}$ of each layer. Here $mem^{n}$ is the information of the previous segment, as in Transformer-XL.
Top right: predicting 3 in the order [2, 4, 3, 1]. In forward AR prediction, the information of 2 and 4 is needed to predict 3. Since the Transformer-XL scheme is also used, the information in $mem^{n}$ is taken as well.
Bottom left: predicting 3 in the order [1,4,2,3]. To predict 3, the information of 1, 2, 4 and $mem^{n}$ is used.
Bottom right: predicting 3 in the order [4,3,1,2]. It proceeds in the same way as the previous case, using the information of 4 and $mem^{n}$.

To sum up, given [New, York, is, a, city], a permutation language model first picks a factorization order such as [is, a, city, New, York]. It then uses [is, a, city] to predict ‘New’ and [is, a, city, New] to predict ‘York’.
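The "New York" example can be traced with a few lines of code. This is an illustrative sketch; `prediction_contexts` is a hypothetical helper, and positions are 0-based.

```python
def prediction_contexts(tokens, order, num_targets):
    """List (target, context) pairs for the last `num_targets` steps of an order.

    tokens: the sentence in its natural order.
    order:  0-based positions in factorization order.
    """
    pairs = []
    for step in range(len(order) - num_targets, len(order)):
        target = tokens[order[step]]
        context = [tokens[p] for p in order[:step]]
        pairs.append((target, context))
    return pairs

tokens = ["New", "York", "is", "a", "city"]
order = [2, 3, 4, 0, 1]            # is -> a -> city -> New -> York
pairs = prediction_contexts(tokens, order, num_targets=2)
for target, context in pairs:
    print(target, "<-", context)
```

Here ‘York’ is conditioned on ‘New’, a dependency that a masked model predicting both tokens from [is, a, city] alone would not capture.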
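Three ideas are introduced to improve on existing approaches.

First: $T$ is the sequence length and $Z_{T}$ is the set of all permutations, because XLNet is a model that considers every factorization order of the input sequence. Considering all permutations would be a waste of cost, so not all of them are used. Instead, because the parameters are shared across training steps, as the amount of data grows and training continues, the sampled subset of permutations approximates the full set of permutations.
  To predict the next-token distribution, conventional AR has a fixed prediction position given the context. In XLNet the position changes with the permutation, so it is no longer fixed and information about the target position is required.

GPT simply uses [x1, x2, x3, x4] to predict [x5]. A permutation LM, however, may use the order [x2, x1, x4, x3], and the model cannot tell whether the next token is [x5], [x6], or [x7]. In other words, the target cannot be supplied in advance. Two-Stream Self-Attention is the method used to solve this.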




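The two streams are the Query Stream and the Content Stream. The content stream also uses the information of the token to be predicted, and using the very information being predicted is contradictory. The query stream is used to resolve this. The query stream uses the preceding sequence information but brings in only the positional embedding of the target token. The vector $g_{t}^{layer}$ used by the query stream is randomly initialized and carries position information.
That is, the query stream adds the positional embedding. Where $h_{\theta}$ was the representation used before, $g_{\theta}$ is a representation with the positional embedding attached. $g_t$ gathers information from the context through attention under two constraints. 1. To predict $x_{z_t}$ at step t, it may use the context $x_{z_{<t}}$ and the position $z_t$, but not the content $x_{z_t}$. 2. To predict later tokens $x_{z_j}$ with $j > t$, the content $x_{z_t}$ must be encoded. The prediction is then $p_\theta(X_{z_t} = x \mid x_{z_{<t}}) = \frac{\exp(e(x)^\top g_\theta(x_{z_{<t}}, z_t))}{\sum_{x'} \exp(e(x')^\top g_\theta(x_{z_{<t}}, z_t))}$.
Here $x'$ ranges over all candidate tokens in the softmax denominator.

Second: the first-layer query representation has no previous layer to build on, so it is initialized randomly (as a trainable vector), and the last layer is used to predict the token at step t.
  The content representation corresponds to the hidden state of a vanilla Transformer. The content after step t must be masked to prevent cheating (a later step must not be used to predict an earlier one).
  Partial Prediction: with an order such as 1→3→2, only the last few tokens are predicted to reduce cost? —> The paper does not explain well why this is good. (It seems to be used because performance does not change.)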

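Techniques from Transformer-XL are used.
  Relative Positional Encoding: CNNs and RNNs do not use explicit position information, whereas the Transformer adds absolute position information to the input. Absolute positional encoding can express position within one segment, but it has a drawback when the model recurs over several $g_t$ under many permutations. For example, information that should be learned as "i meet you" could cause confusion if it is learned as "i you meet". Using absolute positions would be ideal, but relative positions are used to avoid wasted cost. The position information of segment $\tau$ is treated as the same as that of segment $\tau+1$. But if the order has been shuffled by permutation, isn't treating them as the same a problem? → This is resolved by using a model that relies only on relative information.
  Segment Recurrence Mechanism: this concerns how to reuse the hidden states obtained from the previous segment.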


How is XLNet trained on multiple segments? → Similar to BERT, it uses A segment [SEP] B Segment [CLS]. Two sentences are sampled at random and concatenated into one sequence, and permutation is then applied to it. Using learned information about similar (relative) positions amounts to exploiting an inductive bias, which improves generalization.
Training ran for 500,000 steps on data including Wikipedia. The model was still underfitting at the end, but training longer did not improve accuracy, and making the model larger actually lowered accuracy on downstream tasks.



XLNet
Wed, 23 Feb 2022 14:54:12 GMT
XLNet


XLNet is the model that, in 2019, overtook BERT (then SOTA on most NLP tasks) by a wide margin
Key contributions
A generalized autoregressive pre-training model that combines the strengths of autoregressive models, represented by GPT, and autoencoding models, represented by BERT
Proposes the permutation language modeling objective and the two-stream attention mechanism




Introduction

Both directly using representations obtained by pre-training (Word2Vec, ELMo) and fine-tuning a pre-trained model on downstream tasks (BERT, GPT) have been producing strong results
Representative pre-training objectives for NLP downstream tasks
Auto Regressive (AR) [ELMo, GPT, RNNLM]
The standard way to train a language model: predict the next token given the previous tokens
LM objective
$\text{input sequence}: x=(x_1,x_2,\dots,x_T)$
$\text{forward likelihood}: p(x)=\prod_{t=1}^{T}p(x_t \mid x_{<t})$
$\text{backward likelihood}: p(x)=\prod_{t=T}^{1}p(x_t \mid x_{>t})$
$\text{training objective (forward)}: \max_\theta \log p_\theta(x)=\max_\theta \sum_{t=1}^{T}\log p_\theta(x_t \mid x_{<t})$
$\max_\theta \sum_{t=1}^{T}\log p_\theta(x_t \mid x_{<t})=\max_\theta \sum_{t=1}^{T}\log \frac{\exp(h_\theta(x_{1:t-1})^{\top}e(x_t))}{\sum_{x'}\exp(h_\theta(x_{1:t-1})^{\top}e(x'))}$
$h_\theta(x_{1:t-1})$ is the context representation learned by the neural network
$e(x_t)$ is the embedding of $x_t$




Likelihood & Objective
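The likelihood of the input sequence is written as a product of conditional probabilities in the forward or backward direction
The model is trained with these conditional distributions as its objective


AR requires a fixed direction, so it can only use information from one side
This makes it hard to exploit bidirectional context for a deep understanding of the sentence
ELMo uses both directions, but with separately trained models for each direction, so only a shallow understanding is possible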
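As a rough illustration of the forward objective above, the following sketch computes $\sum_t \log p(x_t \mid x_{<t})$ with a softmax over $h_\theta(x_{1:t-1})^{\top}e(x')$. The toy embedding table and the `mean_context` function (which simply averages the prefix embeddings) are hypothetical stand-ins for a real network.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def forward_log_likelihood(token_ids, embed, context_fn):
    """Sum of log p(x_t | x_<t), scoring each candidate by h(x_<t) . e(x')."""
    total = 0.0
    for t, tok in enumerate(token_ids):
        h = context_fn(token_ids[:t])  # context representation of the prefix only
        scores = [sum(hi * ei for hi, ei in zip(h, e)) for e in embed]
        total += log_softmax(scores)[tok]
    return total


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (illustrative values).
embed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def mean_context(prefix):
    """Hypothetical context encoder: the mean of the prefix embeddings."""
    if not prefix:
        return [0.0, 0.0]
    return [sum(embed[i][d] for i in prefix) / len(prefix) for d in range(2)]


ll = forward_log_likelihood([0, 2, 1], embed, mean_context)
print(ll)
```

Because `context_fn` only ever sees `token_ids[:t]`, the sketch makes the one-directional nature of AR modeling explicit.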




Auto Encoding(AE)
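Unlike AR, AE-based pre-training does not estimate a distribution; it aims to reconstruct the original data from corrupted data
An AE solves the problem of predicting the given input as-is
It is mainly used for purposes such as dimensionality reduction


A Denoising Auto Encoder (DAE) predicts the original input from an input mixed with noise
BERT's approach can also be seen as a kind of DAE
It replaces sequence tokens with [MASK] at a fixed probability and restores the original tokens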




LM objective
$\text{input sequence}: \bar x = (x_1,x_2,\dots,x_T)$
$\text{corrupted input}: \hat x =(x_1,[MASK],\dots,x_T)$
$\text{likelihood}: p(\bar x \mid \hat x) \approx \prod_{t=1}^{T}p(x_t \mid \hat x)$
$\text{training objective}: \max_\theta \log p_\theta(\bar x \mid \hat x) \approx \max_\theta \sum_{t=1}^{T}m_t\log p_\theta(x_t \mid \hat x)$
$\max_\theta \sum_{t=1}^{T}m_t\log p_\theta(x_t \mid \hat x)=\max_\theta \sum_{t=1}^{T}m_t\log \frac{\exp(H_\theta(\hat x)_t^{\top}e(x_t))}{\sum_{x'}\exp(H_\theta(\hat x)_t^{\top}e(x'))}$
$\hat x$ = Corrupted Version
$\bar x$ = Masked Token




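It uses the usual DAE likelihood $p(\bar x | \hat x)$ and an objective that maximizes it
However, there are two differences in how it is computed
Independence assumption: given the input sequence, the probabilities of the correct tokens at the [MASK] positions are not independent, but they are assumed to be
Since they are treated as independent, the likelihood can be written as a product of the individual probabilities


$m_t=1$ when $x_t$ is a [MASK] token, and $m_t =0$ otherwise
Prediction is performed only for [MASK] tokens
With $m_t$, the objective predicts only the [MASK] tokens, which differs from the DAE objective (restore the whole input from input + noise, regardless of where the noise is) but is conceptually similar in that noise is restored to the original input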




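Unlike AR, AE can use bidirectional information to recover a given [MASK] token
However, because of the independence assumption every [MASK] token is predicted independently, so the dependency between them cannot be learned
In addition, the noise (the mask token) never appears during fine-tuning, which creates a discrepancy between pre-training and fine-tuning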
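The role of $m_t$ and of the independence assumption can be seen in a small sketch. Here `log_probs` stands for the per-position output of some model run once on the corrupted input $\hat x$; the uniform distribution below is a placeholder assumption, not a trained model.

```python
import math


def masked_lm_objective(log_probs, original, mask):
    """Sum m_t * log p(x_t | x_hat) over all positions.

    log_probs[t][v] is the log-probability of vocabulary id v at position t.
    All rows come from the same corrupted input, so each masked position is
    scored independently of the other masked positions.
    """
    return sum(m * log_probs[t][original[t]] for t, m in enumerate(mask))


vocab_size = 4
original = [0, 1, 2, 3, 1]   # original token ids
mask = [0, 1, 0, 1, 0]       # m_t = 1 where the token was replaced by [MASK]

# Placeholder model output: a uniform distribution at every position.
uniform = [[math.log(1.0 / vocab_size)] * vocab_size for _ in original]
objective = masked_lm_objective(uniform, original, mask)
print(objective)
```

Only the two masked positions contribute to the sum; unmasked positions are multiplied by $m_t = 0$.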








Proposed Method : XLNet

Three new ideas are proposed to keep the strengths of the two approaches above while overcoming their weaknesses

Objective
Target-Aware Representation for the objective
Two-stream self-attention architecture


Objective : Permutation Language Modeling

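Given a sequence $X=[x_1,x_2,\dots,x_T]$ of length T, the set $(Z_T)$ of all orders in which the sequence can be arranged contains $T!$ permutations, such as $[1,2,3,\dots,T],[2,1,3,\dots,T],\dots,[T,T-1,\dots,1]$
The new objective performs AR modeling while considering all orders in the set $(Z_T)$, as in the following expressions
$\text{input sequence}: x=(x_1,x_2,\dots,x_T)$
$\text{likelihood}: E_{z\sim Z_T}\left[\prod_{t=1}^{T}p_\theta(x_{z_t} \mid x_{z_{<t}})\right]$
$\text{training objective}: \max_\theta E_{z \sim Z_T}\left[\sum_{t=1}^{T}\log p_\theta(x_{z_t} \mid x_{z_{<t}})\right]$


Maximize the expected log likelihood over the orders
Conventional AR modeling considers only one of the permutations in this objective, the original order ($[1,2,3,\dots,T]$)


In other words, this method is an AR approach that considers all permutations of the input sequence indices (orders)
Many permutations exist in $Z_T$, but it is infeasible to consider every permutation for a single text, so a particular order $(z)$ is sampled and $p_\theta(x)$ is factorized according to that order
Because the parameter $\theta$ is shared across all orders during training, training on enough data can be regarded as considering all orders
In this way the model naturally gets to see bidirectional context without any approximation


Note that the model does not shuffle the input (sequence) itself; the order is permuted only when $p(x)$ is factorized into a product of conditional probabilities
As a result, the model still learns with knowledge of each token's absolute position in the original sequence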






Architecture : Two-Stream Self-Attention for Target-Aware Representation

Target-Aware Representation

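Applying the objective proposed above to a standard Transformer causes a problem
Normally, a model with parameters $\theta$ models the next-token distribution $p_\theta(X_{z_t} \mid x_{z_{<t}})$ with a softmax over the hidden representation of the context
$p_\theta(X_{z_t}=x \mid x_{z_{<t}})=\frac{\exp(h_\theta(x_{z_{<t}})^{\top}e(x))}{\sum_{x'}\exp(h_\theta(x_{z_{<t}})^{\top}e(x'))}$


Here the Transformer's $h_\theta(x_{z_{<t}})$ is the hidden representation of the context $x_{z_{<t}}$, and it does not depend on the position $z_t$ to be predicted
In conventional AR modeling, once the context is fixed the position of the token to predict is fixed to the next step (t), so no problem arises. Under the objective above, permuted orders are considered, so for a given context ($x_{z_{<t}}$) the position of the token to predict is not fixed, and the same distribution would be produced regardless of the target position
As a solution, the position of the target token $(z_t)$ is additionally provided to the model as input
$h_\theta(x_{z_{<t}}) \rightarrow g_\theta(x_{z_{<t}}, z_t)$
$p_\theta(X_{z_t}=x \mid x_{z_{<t}})=\frac{\exp(g_\theta(x_{z_{<t}},z_t)^{\top}e(x))}{\sum_{x'}\exp(g_\theta(x_{z_{<t}},z_t)^{\top}e(x'))}$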






Two-Stream Self-Attention

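$h_\theta(x_{z_{<t}}) \rightarrow g_\theta(x_{z_{<t}}, z_t)$

For time step $z_t$, information is gathered from the surrounding context ($x_{z_{<t}}$) through attention
Two constraints
To predict the token $x_{z_t}$ at target position $z_t$ at step t, the hidden representation $g(x_{z_{<t}}, z_t)$ must use only the position $z_t$ and the context before t, not the content $x_{z_t}$
Knowing the word itself would be cheating


To predict $x_{z_j}$ at a later step $j(>t)$, the hidden representation $g(x_{z_{<t}}, z_t)$ must also encode the content $x_{z_t}$


These conditions cannot be satisfied by the standard Transformer architecture, which encodes a single hidden state at each step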


Architecture
  

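The basic structure of the model is as shown above
Two hidden representations are proposed
Query Representation
$g^{(m)}_{z_t} \leftarrow \text{Attention}(Q=g^{(m-1)}_{z_t},KV=h^{(m-1)}_{z_{<t}};\theta)$
Proposed to satisfy the first constraint
A representation computed from the content of the tokens before the current step (excluding the current one) and the position of the current step
The query representation of the last layer is used to predict the token at the current position when computing the pre-training objective
The first-layer query is initialized to a trainable random value, and the state of each layer is computed as in the expression above


Content Representation
$h^{(m)}_{z_t} \leftarrow \text{Attention}(Q=h^{(m-1)}_{z_t},KV=h^{(m-1)}_{z_{\le t}};\theta)$
A representation computed from the content of the tokens at the current and earlier steps
Plays the same role as the hidden state of a vanilla Transformer
The first-layer content stream is initialized with the word embedding of the token at that position
The state of each layer is computed as in the expression above
The states of tokens after index t are masked out of the computation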
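The difference between the two streams comes down to their attention masks. The sketch below builds both masks for one factorization order, ignoring the Transformer-XL memory; `two_stream_masks` is a hypothetical helper name, and positions are 0-based.

```python
def two_stream_masks(order):
    """Build attention masks for one factorization order.

    order: list of 0-based positions in the order they are predicted.
    Returns (content_mask, query_mask), where mask[i][j] is True when
    position i may attend to position j.
    """
    n = len(order)
    rank = [0] * n
    for t, pos in enumerate(order):
        rank[pos] = t  # step at which each position is predicted

    # Content stream: sees earlier tokens in the order AND itself (z <= t).
    content = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    # Query stream: sees only strictly earlier tokens (z < t), never its own content.
    query = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content, query


# Order 3 -> 2 -> 4 -> 1 in 1-based numbering, i.e. [2, 1, 3, 0] 0-based.
content, query = two_stream_masks([2, 1, 3, 0])
for row in query:
    print(row)
```

The token predicted first has an empty query mask (it can only rely on memory and its position), while the token predicted last can attend to every other position but still not to its own content.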






Partial Prediction

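In practice, the approach above is very slow and hard to optimize

To address this, only the last few predictions in a given order are used
  



The objective restricted to those targets is maximized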
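A minimal sketch of the split, assuming the convention from the XLNet paper that a hyperparameter K controls the ratio so that roughly 1/K of the tokens are predicted. The name `split_targets` and the rounding rule are assumptions for illustration.

```python
def split_targets(order, K):
    """Split a factorization order into (context part, target part).

    Only the last ~len(order)/K positions in the order are prediction targets;
    the earlier positions serve purely as context.
    """
    n = len(order)
    num_targets = max(1, round(n / K))
    c = n - num_targets  # cutting point
    return order[:c], order[c:]


order = [7, 2, 10, 0, 5, 11, 3, 8, 1, 9, 4, 6]
context_part, target_part = split_targets(order, K=6)
print(context_part, target_part)
```

Targets late in the order have the longest contexts, which is the intuition for predicting only those.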





Incorporating Ideas from Transformer-XL

XLNet borrows two techniques from Transformer-XL to handle long text

Relative Positional Encoding

The vanilla Transformer is based on self-attention and, unlike CNNs or RNNs, does not directly model the relative or absolute positions of words

Instead, it models order by adding a representation of each word's absolute position (absolute positional encoding) to the input

However, while this can express position within a single segment, it is problematic when recurrence is modeled across multiple segments as in Transformer-XL
The expressions below describe segment-level recurrence
$h_{\tau+1}=f(h_\tau,E_{s_{\tau+1}}+U_{1:L})$
The hidden state for the text $s_{\tau+1}$ of segment $\tau+1$


$h_\tau=f(h_{\tau-1},E_{s_\tau}+U_{1:L})$
The hidden state for the text $s_\tau$ of segment $\tau$




In these expressions $\tau$ is the index of the segment
$E_{s_\tau}$ is the word embedding of the input text $s_\tau$
$f$ is the transformation function


Looking at the inputs, the word embedding $E$ correctly follows the segment order, but for $U_{1:L}$ the same positional encoding is used for both segments, even though the words of segment $\tau$ come before the words of segment $\tau+1$

That is, in the two expressions above the first word of segment $\tau$ and the first word of segment $\tau+1$ are treated as being at the same position

To solve this, Transformer-XL and XLNet use relative positional encoding, which models the relative position between words inside the self-attention mechanism rather than at the input level

Attention score in standard Transformer
  

Attention score with Relative Positional Encoding
  

In terms (b) and (d), the absolute embedding $U_j$ is replaced by the relative positional embedding $R_{i-j}$

$R$ is not a learnable parameter but a sinusoid encoding matrix


In terms (c) and (d), $U^T_iW^T_q$ is replaced by $u^T$ and $v^T$ respectively

Since the query vector is the same for all query positions, the attention bias toward different words stays the same regardless of the query position


$W_k$ is split into $W_{k,E}$ and $W_{k,R}$

This produces a content-based key vector and a location-based key vector


In the attention score with relative positional encoding, term (a) handles content, term (b) captures a content-dependent positional bias, term (c) encodes a global content bias, and term (d) encodes a global positional bias
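For reference, the two attention scores that terms (a) to (d) refer to can be written out as follows. This is a reconstruction from the Transformer-XL formulation using the symbols in the notes above ($E_{x_i}$ for the word embedding of token $x_i$, $U$ for absolute and $R$ for relative positional encodings); the superscripts $abs$ and $rel$ are labels chosen here.

```latex
% Standard Transformer (absolute positional encoding)
A^{abs}_{i,j} =
    \underbrace{E_{x_i}^{\top} W_q^{\top} W_k E_{x_j}}_{(a)}
  + \underbrace{E_{x_i}^{\top} W_q^{\top} W_k U_j}_{(b)}
  + \underbrace{U_i^{\top} W_q^{\top} W_k E_{x_j}}_{(c)}
  + \underbrace{U_i^{\top} W_q^{\top} W_k U_j}_{(d)}

% Relative positional encoding
A^{rel}_{i,j} =
    \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j}}_{(a)}
  + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j}}_{(b)}
  + \underbrace{u^{\top} W_{k,E} E_{x_j}}_{(c)}
  + \underbrace{v^{\top} W_{k,R} R_{i-j}}_{(d)}
```

Comparing the two term by term shows each substitution listed above: $U_j \rightarrow R_{i-j}$ in (b) and (d), $U_i^{\top}W_q^{\top} \rightarrow u^{\top}, v^{\top}$ in (c) and (d), and $W_k \rightarrow W_{k,E}, W_{k,R}$.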





Segment Recurrence Mechanism

XLNet splits long text into several segments and models them recurrently
Two points matter when applying the segment-level recurrence proposed in Transformer-XL to XLNet
How to apply the recurrence mechanism in the permutation setting
How to make the hidden states obtained from the previous segment reusable


Let $\tilde x=s_{1:T}, x=s_{T+1:2T}$, and let $\tilde z$ and $z$ be permutations of $[1,2,\dots,T]$ and $[T+1,\dots,2T]$. The first segment is processed according to $\tilde z$, and the content representation $\tilde h^{(m)}$ obtained at each layer m is cached
The computation for the second segment is then written as
$h^{(m)}_{z_t} \leftarrow \text{Attention}(Q=h_{z_t}^{(m-1)},KV=[\tilde h^{(m-1)},h^{(m-1)}_{z_{\le t}}];\theta)$


$[.,.]$ denotes concatenation. Once $\tilde h^{(m)}$ has been computed while processing the previous segment, the attention update for the current segment (z) is independent of $\tilde z$
This allows memory to be cached and reused without knowing the factorization order of the previous segment


The query stream $g^{(m)}_{z_t}$ can be computed in the same way
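The concatenation $[\tilde h^{(m-1)}, h^{(m-1)}_{z_{\le t}}]$ can be pictured with a toy sketch. Strings stand in for hidden vectors here, and `visible_keys` is a hypothetical helper; the point is only that the cached memory is prepended unchanged, whatever order the previous segment used.

```python
def visible_keys(memory, hidden, order, t):
    """Keys/values visible to the content stream at step t of the current segment.

    memory: cached states of the previous segment (all visible, order-independent).
    hidden: states of the current segment, indexed by position.
    order:  factorization order of the current segment (0-based positions).
    """
    current = [hidden[pos] for pos in order[: t + 1]]  # z <= t in the current order
    return memory + current


memory = ["m0", "m1"]               # cached from the previous segment
hidden = ["h0", "h1", "h2", "h3"]   # current segment, by position
keys = visible_keys(memory, hidden, order=[2, 0, 3, 1], t=1)
print(keys)
```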






Modeling Multiple Segments

How XLNet is modeled and trained autoregressively over multiple segments
Pre-training
Similar to BERT, the input takes the form [A, SEP, B, SEP, CLS]
SEP and CLS are special tokens
The two sentences (segments) for Segment A and Segment B are sampled at random
The two segments are then concatenated into one sequence and permutation is performed on it




Relative Segment Encoding
BERT uses absolute segment embeddings, while XLNet applies relative segment encoding, following the same principle as relative positional encoding
For positions $i,j$ in the full sequence, $s_{ij}=s_+$ is used if they belong to the same segment and $s_{ij}=s_-$ otherwise. $s_+, s_-$ are learnable parameters in each attention head


Benefits of this approach
The inductive bias of relative encoding improves generalization
It opens up the possibility of fine-tuning on tasks with more than two segments
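A small sketch of how $s_+$ and $s_-$ could enter the attention score. It assumes the form $a_{ij} = (q_i + b)^{\top} s_{ij}$, with $b$ a learnable head-specific bias, as I recall it from the XLNet paper; the function name and the toy vectors are illustrative only.

```python
def segment_bias(segment_ids, queries, b, s_plus, s_minus):
    """Compute a_ij = (q_i + b) . s_ij for every pair of positions.

    s_ij is s_plus when positions i and j are in the same segment, else s_minus.
    Only whether two positions share a segment matters, not which segment it is.
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    n = len(segment_ids)
    bias = []
    for i in range(n):
        qb = [q + bb for q, bb in zip(queries[i], b)]
        row = []
        for j in range(n):
            s = s_plus if segment_ids[i] == segment_ids[j] else s_minus
            row.append(dot(qb, s))
        bias.append(row)
    return bias


segment_ids = [0, 0, 1]              # two tokens from segment A, one from segment B
queries = [[1.0, 0.0]] * 3           # toy query vectors
bias = segment_bias(segment_ids, queries, b=[0.0, 0.0],
                    s_plus=[1.0, 0.0], s_minus=[-1.0, 0.0])
print(bias)
```

Relabeling the segments (for example swapping ids 0 and 1) leaves the result unchanged, which is what makes the encoding relative.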






Discussion

Comparison with BERT
For example, given the sentence (sequence of words) [New, York, is, a, city], suppose both BERT and XLNet select the two tokens [New, York] as prediction targets and need to maximize $\log p(\text{New York} \mid \text{is a city})$
The objectives of BERT and XLNet are then
$J_{BERT}=\log p(\text{New} \mid \text{is a city})+\log p(\text{York} \mid \text{is a city})$
$J_{XLNet}=\log p(\text{New} \mid \text{is a city})+ \log p(\text{York} \mid \text{New, is a city})$
That is, XLNet can capture the dependency between the target words (here New and York), which BERT's independent predictions miss


Comparison with LMs
AR LMs such as GPT can only take dependencies on the past into account
This can be a weakness in span-extraction problems such as QA
XLNet, in contrast, can use context on both sides












Improving Language Understanding by Generative Pre-Training(by 장준영)
Tue, 22 Feb 2022 11:45:37 GMT
Paper link: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Summary
Background of GPT

Labeled data is scarce
Unlabeled data is plentiful
Pre-train a language model on an unlabeled text corpus, then fine-tune it for each task

Problems with unlabeled text (why semi-supervised learning is hard)

Hard to obtain information beyond the word level
Unclear which optimization objective is effective
No consensus on how to effectively transfer learned representations to the target task

Approach used in GPT

unsupervised pre-training + supervised fine-tuning → semi-supervised
Learn a general-purpose representation that can be applied with little adaptation
Language modeling objective on unlabeled data → apply to the target task
Uses a Transformer

Background
Unlabeled Data

Hard to obtain information beyond the word level
Unclear which optimization objective is effective
No consensus on how to effectively transfer learned representations to the target task
→ semi-supervised learning is hard

How GPT-1 approaches it

Learn a general-purpose representation that applies to a wide range of tasks with little adaptation


Language modeling objective on unlabeled data
Apply to the target task


Uses a Transformer
Robust results on long-term dependencies


Related work
semi-supervised learning

Earlier: unlabeled data used for word-level or phrase-level statistics
Recent: word embeddings
sentence level



unsupervised pre-training

The goal is to find a good initialization point
LSTMs are limited to a short range
This can be addressed by using a Transformer

Auxiliary training objectives

An alternative form of semi-supervised learning
Used in GPT
Unsupervised pre-training already learns linguistic aspects relevant to the target task

Framework
high capacity language model → fine-tuning

unsupervised pre-training
 

maximizing the likelihood

k : size of context window

trained using stochastic gradient descent


multi layer transformer decoder



Supervised fine-tuning

The language modeling objective from pre-training is also added as an auxiliary objective during fine-tuning

convergence 가속화
improving generalization of supervised model




optimize (5)



Task specific input transformation
 

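Previously, task-specific architectures were learned for each target
This required heavy customization and did not use transfer learning for the added components
So, a traversal-style approach is used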
convert structured input into ordered sequence





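Experiments
Uses the BooksCorpus dataset

long range
diverse genres

1B Word Benchmark (the alternative dataset used by ELMo; shuffled at the sentence level)

Achieves SOTA

Analysis

Performance tends to improve with the number of layers transferred



About a 15% drop without pre-training
Removing the LM objective during fine-tuning hurts performance on large datasets and improves it on small ones
About a 6% drop when an LSTM is used instead of the Transformer

Conclusion

Better suited to language generation than BERT
Uses a Transformer architecture instead of an LSTM
Pre-training → fine-tuning
Achieves SOTA

Image source: original paper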



Improving Language Understanding by Generative Pre-training(by 안재혁)
Mon, 14 Feb 2022 08:45:18 GMT
1. Introduction
The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised learning in natural language processing (NLP). Most deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in many domains that suffer from a dearth of annotated resources.
raw text에서 효과적으로 학습할 수 있는 방법은 자연어 처리에서 지도 학습의 의존성을 완화하는데 필수적이다. 대부분의 딥러닝 방법은 많은 양의 손으로 만든 labeled data를 요구하는데, 이는 주석 자원의 부족 현상을 겪고 있는 도메인에서 적용하는데 어려움을 준다.

annotated resources: refers to POS tags and syntactic or semantic features. (http://www.cs.cmu.edu/~ytsvetko/jsalt-part1.pdf)
Plain text, like what I am writing now, yields plenty of unlabeled data. But for this text to be used as supervised data labeled "paper summary", I would have to tag it as "paper summary" myself, and that naturally takes a very long time.

In these situations, models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation, which can be time-consuming and expensive. Further, even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. The most compelling evidence for this so far has been the extensive use of pretrained word embeddings to improve performance on a range of NLP tasks.
이러한 상황에서, 비지도 데이터의 언어학 정보를 이용하는 모델은 많은 많은 시간과 비용이 드는 주석(annotation) 수집에서 유용한 대안을 제공한다. 더욱이, 지도 학습이 가능한 경우에서도 비지도 방식을 이용하여 좋은 파라미터를 학습하는 것은 상당한 성능 향상을 제공할 수 있다. 지금까지 이에 대한 설득력 있는 증거는 자연어 처리 분야에서 성능 개선을 위해 사전 훈련된 단어 임베딩을 광범위하게 사용한 것이다.

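For example, word2vec needs no labels; it learns embeddings from a target word and its context. Embeddings such as word2vec are therefore a form of unsupervised learning, and we have long used unsupervised learning to improve training performance.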

Leveraging more than word-level information from unlabeled text, however, is challenging for two main reasons. First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer. Recent research has looked at various objectives such as language modeling, machine translation, and discourse coherence, with each method outperforming the others on different tasks.
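However, leveraging more than word-level information from unlabeled data is difficult for two reasons. First, it is unclear which type of optimization objective is most effective at learning text representations that transfer well. Recent research has examined objectives such as language modeling, machine translation, and discourse coherence, with each method outperforming the others on different tasks.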
Second, there is no consensus on the most effective way to transfer these learned representations to the target task. Existing techniques involve a combination of making task-specific changes to the model architecture, using intricate learning schemes and adding auxiliary learning objectives. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing.
두 번째는 학습된 표현을 target task로 전이하는데 가장 최적화된 방법에 대한 일치된 의견이 없다는 것이다. 현존하는 기술들은 모델 구조에 목표에 특성화된 변화의 조합을 이용하는데, 이는 복잡한 학습 전략이나 보조 학습 목표를 사용해야 한다. 이러한 불확실성은 언어 처리를 위한 효율적인 준지도 학습 접근의 발전을 방해한다.
In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. We assume access to a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks).
본 논문에서는 언어 이해 임무(language understanding tasks)를 위한 준지도 학습 접근 방법을 탐색하는데, 비지도 학습의 사전 훈련과 지도 학습의 fine-tuning을 이용한다. 본 논문의 목표는 보편적인 표현을 얻어서 다양한 태스크에서 활용할 때 약간의 변화만으로 전이가 가능하도록 하는 것이다. 이를 위해 큰 비지도 데이터의 말뭉치와 여러개의 손수 태깅된 훈련 예시를 이용한다.

This is the unsupervised learning we already know. The points to note are the "large" unlabeled corpus and the "several" manually tagged training datasets.

Our setup does not require these target tasks to be in the same domain as the unlabeled corpus. We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective.
우리의 셋업은 비지도 말뭉치를 이용함으로써 target task가 같은 도메인일 필요가 없다는 것이다. 우리는 two-stage 훈련 절차를 가진다. 첫 째로, 우리는 비지도 데이터를 이용한 언어 모델링 목표(language modeling objective)를 이용하여 신경망 모델의 파라미터 초기값을 얻는 것이다. 그 다음 우리는 이 파라미터를 target task에 전이하여 일지하는 지도 목표에 사용한다.

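The idea is to build a general model from a very large unlabeled dataset. A very large model such as KoGPT-2 can be applied to scripts, Wikipedia, chatbots, and so on; KoGPT-2 is a Korean decoder model trained on more than 40GB of text. (https://github.com/SKT-AI/KoGPT2) As described later, GPT uses only the decoder part of the Transformer's encoder-decoder.

The Transformer itself was covered previously, so it is omitted here.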
We evaluate our approach on four types of language understanding tasks – natural language inference, question answering, semantic similarity, and text classification. Our general task-agnostic model outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. (omitted). We also analyzed zero-shot behaviors of the pre-trained model on four different settings and demonstrate that it acquires useful linguistic knowledge for downstream tasks.
우리는 4개의 언어 이해 태스크로 접근한다. 자연어 추론, QA, 텍스트 유사도, 텍스트 분류이다. 우리의 일반적인 태스크와 관련없는 모델은 각각의 태스크에 특별히 맞추어진 구조를 이용한 모델보다 분명하게 더 높은 성능을 보여준다. 이는 연구된 12개의 태스크 중 9개에서 SOTA를 보였다. 또한 각기 다른 4개의 세팅에서  사전 훈련된 모델의 zero-shot behavior를 분석했는데, 하위 태스크에서 유용한 언어 지식을 획득한다는 것을 입증했다. 

자연어 추론: 자연어 이해를 기반으로 모델의 추론 능력을 평가하는 작업으로 두 문장의 의미 관계를 함의(Entailment), 모순(Contradiction), 중립(Neutral)으로 분류하는 문장 쌍 분류의 일종이다. 두 문장은 전제와 가설로 나누어지는데, 전제를 참이라고 가정할 때 가설의 내용이 참인지, 거짓(모순)인지, 혹은 알 수 없는지(중립)에 따라 두 문장의 관계가 분류된다. (https://www.koreascience.or.kr/article/CFKO202130060562801.pdf, 의존 구문 분석을 활용한 자연어 추론)

텍스트 유사도 :  각기 다른 두 텍스트가 얼마나 유사한지를 나타내는지 나타내는 표현. ex) “이 요리를 누가 만들었지?” 와 “이 볶음밥 누가 만들었어?”는 같은 의미이지만 데이터 상으로는 다른 데이터가 되는데, 텍스트의 “의미”를 파악해서 두 텍스트의 유사도를 확인한다. (https://wiserloner.tistory.com/931)

Zero-shot behavior를 자세히 참고하기 위해선 다음의 논문을 참조하면 됩니다.(https://arxiv.org/abs/2011.08641, A Review of Generalized Zero-Shot Learning Methods).

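Translation of the abstract
  Zero-shot learning is a kind of machine learning that grew out of transfer learning. It trains a model on classes A and B, which have different characteristics, so that when it meets C, which has characteristics of both A and B, it predicts the new class C. In other words, the goal is for a model trained on seen (source) data to classify unseen (target) data under an unseen label.
  References: https://deep-learning-study.tistory.com/873, https://m.blog.naver.com/with_msip/221886769247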





2. Related Works
Semi-supervised learning for NLP
Our work broadly falls under the category of semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks like sequence labeling or text classification. The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model. 
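Our work broadly falls under semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks such as sequence labeling and text classification. The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model.

sequence labeling: the task of assigning a tag to each element of a sequence, here with a neural network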

Over the last few years, researchers have demonstrated the benefits of using word embeddings, which are trained on unlabeled corpora, to improve performance on a variety of tasks. These approaches, however, mainly transfer word-level information, whereas we aim to capture higher-level semantics. Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled corpus, have been used to encode text into suitable vector representations for various target tasks.
과거 몇 년 동안 연구자들은 비지도 말뭉치를 이용한 워드 임베딩에의 유효성을 입증했는데, 이는 다양한 태스크에서 성능 향상을 보였다. 하지만 이 접근은 주로 단어 수준의 정보를 전이하는 것이고 본 논문에선 단어보다 높은 의미를 획득하는 것을 목표로 한다. 최근 접근은 비지도 데이터로부터 단어 수준의 의미 이상으로 학습하고 활용하는 것을 조사했다. 비지도 말뭉치를 이용하여 학습될 수 있는 구문 단위 또는 문장 단위 임베딩은 다양한 target task를 위해 적합한 벡터 표현으로 부호화되어 사용된다.

Most word embeddings aim to represent individual words well, whereas GPT focuses on capturing higher-level meaning, such as the natural language inference and text similarity tasks mentioned above.

Unsupervised pre-training
Unsupervised pre-training is a special case of semi-supervised learning where the goal is to find a good initialization point instead of modifying the supervised learning objective. Early works explored the use of the technique in image classification and regression tasks. Subsequent research [15] demonstrated that pre-training acts as a regularization scheme, enabling better generalization in deep neural networks. In recent work, the method has been used to help train deep neural networks on various tasks like image classification [69], speech recognition [68], entity disambiguation [17] and machine translation [48].
비지도 사전훈련은 준지도 학습의 특별한 케이스이다. 목표는 지도 학습의 목표를 변경하지 않고 좋은 초기 값을 얻는 것이다. 초기 연구는 이미지 분류나 회귀 작업에서 사용되었다. 이후 연구에서 사전 훈련이 규제로 사용되어 신경망에서 좋은 일반화를 얻는데 사용할 수 있다는 것을 입증했다. 최근 연구에선 비지도 사전 훈련이 이미지 분류, 음성 인식, 엔티티 연결, 기계 번역 등에 사용되었다.

One example that illustrates this: https://proceedings.neurips.cc/paper/2018/file/2a38a4a9316c49e5a833517c45d31070-Paper.pdf (Supervised autoencoders: Improving generalization performance with unsupervised regularizers)
  In short, when an autoencoder's reconstruction loss is combined with a classifier loss, balancing the two losses makes the autoencoder's manifold learning more effective.


The closest line of work to ours involves pre-training a neural network using a language modeling objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and Ruder [21] follow this method to improve text classification. However, although the pre-training phase helps capture some linguistic information, their usage of LSTM models restricts their prediction ability to a short range. In contrast, our choice of transformer networks allows us to capture longer range linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the effectiveness of our model on a wider range of tasks including natural language inference, paraphrase detection and story completion.
우리의 연구와 가장 비슷한 연구는 언어 모델링 목표를 신경망을 사전 훈련하고 지도를 이용한 target task로 fine-tuning하는 것이다. Dai et al., Howard, Ruder는 이 방법을 이용해 텍스트 분류에서 개선을 보였다. 하지만 사전 훈련이 언어학 정보를 획득하는데 도움을 줄지라도, LSTM model의 사용은 모델의 예측 능력을 작은 분야로 한정하는 것이다. 반면, 트랜스포머는 더 긴 범위로 언어학 구조를 파악할 수 있고 이는 우리의 실험으로 입증되었다. 더욱이, 우리는 우리의 모델의 효율성을 자연어 추론, 구문 탐색, 이야기 완성을 포함한 넓은 범위의 task에서 확인하였다.
Other approaches [43, 44, 38] use hidden representations from a pre-trained language or machine translation model as auxiliary features while training a supervised model on the target task. This involves a substantial amount of new parameters for each separate target task, whereas we require minimal changes to our model architecture during transfer.
다른 접근으로는 target task에서 지도 모델을 훈련하면서 사전 훈련된 언어나 기계 번역에서의 hidden representation을 보조 기능으로서 사용하는 것이다. 이는 각 target task를 위해 상당한 양의 새로운 파라미터를 필요하는 반면, GPT는 전이 과정에서 기존 구조에 약간의 변경만을 이용하는 장점을 가진다.
Auxiliary training objectives
Adding auxiliary unsupervised training objectives is an alternative form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling objective to their target task objective and demonstrated performance gains on sequence labeling tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training already learns several linguistic aspects relevant to target tasks.
보조 비지도 학습 목표를 추가하는 것은 준지도 학습의 대안이다. Collobert and Weston에 의한 초기 작업은 POS tagging, chunking, 엔티티 인식, 모델링와 같은 보조적인 NLP task를 사용하여 semantic role labeling을 향상했다. 최근 Rei는 보조 언어 모델링 목표를 target task에 붙였는데 sequence labeling task에서 성능 향상을 입증했다. 우리의 실험은 보조 목표를 사용하지만, 비지도 사전 학습은 미리 target task와 관련된 다양한 언어적 aspect를 배운다.
3. Framework
Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.
우리의 훈련 과정은 두 개의 스테이지로 구성된다. 첫 스테이지는 큰 텍스트 말뭉치로 높은 가능성을 가진 언어 모델을 학습하는 것이다. 그 다음 fine-tuning 스테이지에선 labeled data로 각각의 task를 모델에 적응시킨다.
Unsupervised pre-training
Given an unsupervised corpus of tokens $U = (u_{1}, ..., u_{n})$, we use a standard language modeling objective to maximize the following likelihood:
Given an unlabeled corpus of tokens U, a standard language modeling objective is used to maximize the following likelihood.
$L_1(U) = \sum_{i} \log P(u_{i} \mid u_{i-k}, \dots, u_{i-1}; \Theta)$  (Eq. 1)
where k is the size of the context window, and the conditional probability P is modeled using a neural network with parameters Θ. These parameters are trained using stochastic gradient descent [51].
In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:
Here k is the size of the context window, and the conditional probability P is modeled by a neural network with parameters $\Theta$, trained with SGD. The experiments use a multi-layer Transformer decoder, a variant of the Transformer. The model applies multi-head self-attention over the input context tokens and then position-wise feedforward layers to produce an output distribution over target tokens.

where $U = (u_{-k}, ..., u_{-1})$ is the context vector of tokens, n is the number of layers, $W_{e}$ is the token embedding matrix, and $W_{p}$ is the position embedding matrix.
Here U is the context vector of tokens, n is the number of layers, $W_{e}$ is the token embedding matrix, and $W_{p}$ is the position embedding matrix.
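To make Eq. 1 concrete, the sketch below enumerates the (context window, target) pairs for a window size k and sums log-probabilities over them. The `uniform` scoring function is a placeholder assumption standing in for the Transformer decoder, and the function names are hypothetical.

```python
import math


def context_windows(tokens, k):
    """Return (context, target) pairs where the context is the previous k tokens."""
    return [(tokens[max(0, i - k): i], tokens[i]) for i in range(len(tokens))]


def lm_objective(tokens, k, log_prob):
    """L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}) for a given scoring function."""
    return sum(log_prob(target, context) for context, target in context_windows(tokens, k))


tokens = ["the", "cat", "sat", "down"]
pairs = context_windows(tokens, k=2)


def uniform(target, context):
    """Placeholder model: every token has probability 1/10 regardless of context."""
    return math.log(1 / 10)


value = lm_objective(tokens, 2, uniform)
print(pairs)
print(value)
```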
Supervised fine-tuning
After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset $C$, where each instance consists of a sequence of input tokens, $x_{1}, . . . , x_{m},$ along with a label y. The inputs are passed through our pre-trained model to obtain the final transformer block’s activation $h_{l}^{m}$, which is then fed into an added linear output layer with parameters $W_{y}$ to predict y:
After training with the objective in Eq. 1, the parameters are adapted to the supervised target task. A labeled dataset C is assumed, in which each instance consists of an input token sequence x1 ~ xm and a label y. The input passes through the pre-trained model to obtain the final Transformer block's activation $h_{l}^{m}$, which is fed to a linear output layer with parameters $W_{y}$ to predict y.

We additionally found that including language modeling as an auxiliary objective to the fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. This is in line with prior work [50, 43], who also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight λ):
Including language modeling as an auxiliary objective during fine-tuning helps by (a) improving the generalization of the supervised model and (b) accelerating convergence. Earlier papers observed the same effect. Specifically, the combined objective below is optimized, with the auxiliary term weighted by λ.

As mentioned under unsupervised learning in Related Works, balancing the two losses yields (a) and (b). The paper cited as "prior work" set this λ to 0.1. (https://arxiv.org/pdf/1704.07156.pdf, Semi-supervised Multitask Learning for Sequence Labeling)

Overall, the only extra parameters we require during fine-tuning are Wy, and embeddings for delimiter tokens (described below in Section 3.3).
Overall, the only parameters added during fine-tuning are Wy and the embeddings for the delimiter tokens.
Task-specific input transformations
For some tasks, like text classification, we can directly fine-tune our model as described above. Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model was trained on contiguous sequences of text, we require some modifications to apply it to these tasks.
텍스트 분류와 같은 몇몇 태스크에선 위에서 말한 것처럼 직접적으로 모델을 fine-tuning한다. 하지만 QA, textual entailment와 같은 몇몇 다른 태스크는 순서를 가진 문장 쌍, 문서의 세 쌍, 질문과 대답 같은 인풋을 가진다. 우리의 사전 훈련된 모델은 인접한 텍스트의 문장에서 훈련되므로, 이러한 태스크에 적용하기 위해 변경이 요구된다.

contiguous: the model is fed a continuous sequence and predicts what comes next. Where BERT predicts words in the middle using MASK, GPT is trained to predict the text that follows.

Previous work proposed learning task specific architectures on top of transferred representations [44]. Such an approach re-introduces a significant amount of task-specific customization and does not use transfer learning for these additional architectural components. Instead, we use a traversal-style approach [52], where we convert structured inputs into an ordered sequence that our pre-trained model can process.
이전 연구들은 전이된 표현 위에 태스크에 특화된 구조를 배우는 것을 제시했다. 이러한 접근은 task-specific customization에 상당한 양을 재도입하고, 추가적인 구조 요소를 위해 전이 학습을 사용하지 않는다. 대신에, 우리는 구조적인 input을 순서를 가진 문장으로 변환하는데 이 문장은 우리의 사전 훈련된 모델이 처리할 수 있다.

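"Transferred representations" here does not mean transfer learning as a whole. The additional customization in the previous work refers to ELMo. ELMo uses a bidirectional LSTM and provides word embeddings computed by a deep model. The representations produced by ELMo are passed into a separate network that performs further computation, which adds considerable overhead. GPT, in contrast, uses no additional model and relies on transfer learning to avoid that extra computation. (https://arxiv.org/pdf/1802.05365.pdf, Deep contextualized word representations)

<< discussion: I had understood ELMo to be a pre-trained model as well, so why does the paper say it "does not use transfer learning"? >> 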
These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens $(\langle s\rangle, \langle e\rangle)$.

Textual entailment
For entailment tasks, we concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between.
The token sequences of the premise p and the hypothesis h are concatenated, separated by the delimiter $.

Textual entailment is inference over two given sentences, i.e. natural language inference. Given two sentences, the model learns by deciding whether the first sentence entails or contradicts the second.

Similarity
For similarity tasks, there is no inherent ordering of the two sentences being compared. To reflect this, we modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations $h_{l}^{m}$ which are added element-wise before being fed into the linear output layer.
In similarity tasks the two sentences being compared have no inherent order. To reflect this, the input is built in both possible sentence orderings, with a delimiter in between as in entailment. Each ordering is processed independently to produce two sequence representations $h_{l}^{m}$, which are added element-wise before the linear layer.
Question Answering and Commonsense Reasoning
For these tasks, we are given a context document z, a question q, and a set of possible answers {$a_{k}$}. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get $[z; q; \$; a_{k}]$. Each of these sequences are processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers.
In QA, a document z, a question q, and a set of possible answers ak are given. The document context and question are concatenated with each possible answer, with a delimiter in between, to obtain [z; q; $; ak]. Each sequence is processed independently by the model and normalized by a softmax layer to produce a distribution over the possible answers.
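The three input transformations can be sketched as plain list operations. The token strings `<s>`, `$`, and `<e>` are placeholders for the start, delimiter, and end tokens, and the function names are hypothetical; a real implementation would work on token ids.

```python
START, DELIM, END = "<s>", "$", "<e>"


def entailment_input(premise, hypothesis):
    """[<s>; premise; $; hypothesis; <e>]"""
    return [START] + premise + [DELIM] + hypothesis + [END]


def similarity_inputs(text_a, text_b):
    """Both orderings; the two resulting representations are added element-wise."""
    return [entailment_input(text_a, text_b), entailment_input(text_b, text_a)]


def multiple_choice_inputs(context, question, answers):
    """One sequence [<s>; z; q; $; a_k; <e>] per candidate answer."""
    return [[START] + context + question + [DELIM] + a + [END] for a in answers]


pair = entailment_input(["it", "rains"], ["it", "is", "wet"])
both = similarity_inputs(["a", "b"], ["c"])
choices = multiple_choice_inputs(["doc"], ["why", "?"], [["yes"], ["no"]])
print(pair)
print(both)
print(choices)
```

Each sequence in `choices` would be scored independently, with a softmax over the scores giving the distribution over answers.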
4. Experiments
4.1 Setup
Unsupervised pre-training
We use the BooksCorpus dataset [71] for training the language model. It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information. An alternative dataset, the 1B Word Benchmark, which is used by a similar approach, ELMo [44], is approximately the same size but is shuffled at a sentence level - destroying long-range structure. Our language model achieves a very low token level perplexity of 18.4 on this corpus.
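The language model is trained on the BooksCorpus dataset, which contains over 7,000 unpublished books across genres such as Adventure, Fantasy, and Romance. Importantly, it contains long stretches of contiguous text, which lets the generative model learn to condition on long-range information. The 1B Word Benchmark used by the similar ELMo approach is about the same size but is shuffled at the sentence level, which destroys long-range structure. The language model reaches a token-level perplexity of 18.4 on this corpus.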

Lower perplexity is better, and the paper reports that GPT reached a low perplexity.
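As a reminder of what the number measures, token-level perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming natural-log probabilities (the function name is illustrative, not from the paper):

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that spreads probability uniformly over 4 choices at every step has perplexity 4, so a perplexity of 18.4 can be read loosely as the model being about as uncertain as a uniform choice among roughly 18 tokens.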

Model specifications
Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states. We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.
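The model largely follows the original Transformer. It is a 12-layer decoder-only Transformer with masked self-attention (768-dimensional states, 12 attention heads), and the position-wise feed-forward networks use 3072-dimensional inner states. Adam is used with a maximum learning rate of 2.5e-4: as with a warmup scheduler, the rate rises linearly from 0 over the first 2000 updates and is then annealed to 0 with a cosine schedule. Training runs for 100 epochs on minibatches of 64 randomly sampled contiguous sequences of 512 tokens each.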
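The learning rate schedule described here (linear warmup over 2000 updates, then cosine annealing to 0) could look like the sketch below. The total number of updates is not given in the text, so `total_steps` is an assumed parameter, and the function name is illustrative.

```python
import math


def pretrain_lr(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    """Linear warmup from 0 to max_lr, then cosine annealing down to 0.

    total_steps is an assumption: the text gives epochs, not update counts.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

The rate peaks at step 2000 and reaches 0 at `total_steps`.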
Since layernorm [2] is used extensively throughout the model, a simple weight initialization of $N(0, 0.02)$ was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [37], with $w = 0.01$ on all non bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work. We use the ftfy library to clean the raw text in BooksCorpus, standardize some punctuation and whitespace, and use the spaCy tokenizer.
Because layer normalization is used throughout the model, a simple weight initialization of $N(0, 0.02)$ is sufficient. The vocabulary is byte-pair encoding (BPE) with 40,000 merges, and residual, embedding, and attention dropout with a rate of 0.1 is used for regularization. A modified L2 regularization with $w = 0.01$ is applied to all weights except biases and gains. The activation function is GELU. Learned position embeddings are used instead of the sinusoidal version from the original work. The raw BooksCorpus text is cleaned with the ftfy library, some punctuation and whitespace are standardized, and the spaCy tokenizer is used.
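For reference, GELU is defined as $x \cdot \Phi(x)$, where $\Phi$ is the standard normal CDF. The sketch below shows the exact form and the commonly used tanh approximation; which of the two the authors used is not stated in this section, so treat the choice as open.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))
```

Unlike ReLU, GELU is smooth and slightly negative for small negative inputs.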
Fine-tuning details
Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. $\lambda$ was set to 0.5.
Unless stated otherwise, the hyperparameter settings from unsupervised pre-training are reused. Dropout with a rate of 0.1 is added to the classifier. Most tasks use a learning rate of 6.25e-5 and a batch size of 32. The model fine-tunes quickly, and 3 epochs of training are sufficient in most cases. A linear learning rate decay schedule is used with warmup over 0.2% of training, and $\lambda$ is set to 0.5.
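A rough sketch of these fine-tuning settings follows. Here $\lambda$ weights the auxiliary language modeling loss in the paper's fine-tuning objective, $L_3 = L_2 + \lambda L_1$. Decaying all the way to 0 and the `total_steps` parameter are assumptions, and the function names are illustrative.

```python
def finetune_lr(step, total_steps, base_lr=6.25e-5, warmup_frac=0.002):
    """Linear warmup over 0.2% of training, then linear decay.

    Decaying to exactly 0 at total_steps is an assumption.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


def combined_loss(task_loss, lm_loss, lam=0.5):
    """Supervised task loss plus lambda times the auxiliary LM loss."""
    return task_loss + lam * lm_loss
```

With 10,000 total updates, the warmup covers only the first 20 steps, which shows how short a 0.2% warmup is.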
4.2 Supervised fine-tuning
We perform experiments on a variety of supervised tasks including natural language inference, question answering, semantic similarity, and text classification. Some of these tasks are available as part of the recently released GLUE multi-task benchmark [64], which we make use of. Figure 1 provides an overview of all the tasks and datasets.
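Experiments are run on a variety of supervised tasks: natural language inference, question answering, semantic similarity, and text classification. Some of these tasks are available as part of the recently released GLUE multi-task benchmark. Figure 1 gives an overview of the tasks and datasets used in the paper.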

Natural Language Inference
The task of natural language inference (NLI), also known as recognizing textual entailment, involves reading a pair of sentences and judging the relationship between them from one of entailment, contradiction or neutral. Although there has been a lot of recent interest [58, 35, 44], the task remains challenging due to the presence of a wide variety of phenomena like lexical entailment, coreference, and lexical and syntactic ambiguity.
Natural language inference (NLI), also known as recognizing textual entailment, takes a pair of sentences and judges whether their relationship is entailment, contradiction, or neutral. Despite a lot of recent interest, the task remains difficult because of phenomena such as lexical entailment, coreference, and lexical and syntactic ambiguity.

lexical entailment : Lexical Entailment is concerned with identifying the semantic relation, if any, holding between two words, as in (pigeon, hyponym, animal).(https://paperswithcode.com/task/lexical-entailment)
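I read this as saying that NLI remains difficult because, when determining the relationship between sentences, the relationships between individual words can themselves be ambiguous.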

We evaluate on five datasets with diverse sources, including image captions (SNLI), transcribed speech, popular fiction, and government reports (MNLI), Wikipedia articles (QNLI), science exams (SciTail) or news articles (RTE).
The evaluation uses five datasets from diverse sources: image captions (SNLI); transcribed speech, popular fiction, and government reports (MNLI); Wikipedia articles (QNLI); science exams (SciTail); and news articles (RTE).
Table 2 details various results on the different NLI tasks for our model and previous state-of-the-art approaches. Our method significantly outperforms the baselines on four of the five datasets, achieving absolute improvements of up to 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI and 0.6% on SNLI over the previous best results. This demonstrates our model’s ability to better reason over multiple sentences, and handle aspects of linguistic ambiguity. On RTE, one of the smaller datasets we evaluate on (2490 examples), we achieve an accuracy of 56%, which is below the 61.7% reported by a multi-task biLSTM model. Given the strong performance of our approach on larger NLI datasets, it is likely our model will benefit from multi-task training as well but we have not explored this currently.

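Table 2 shows the results of our model and previous state-of-the-art approaches on the NLI tasks. Our method clearly outperforms the baselines on four of the five datasets, with absolute improvements of up to 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI, and 0.6% on SNLI. This shows that the model reasons better over multiple sentences and handles linguistic ambiguity better. On RTE, a small dataset with 2490 examples, our model reaches 56%, below the 61.7% of a multi-task biLSTM model. Given the strong results on the larger NLI datasets, the model would likely also benefit from multi-task training, but this has not been explored.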
Question answering and commonsense reasoning
Another task that requires aspects of single and multi-sentence reasoning is question answering. We use the recently released RACE dataset [30], consisting of English passages with associated questions from middle and high school exams. This corpus has been shown to contain more reasoning type questions than other datasets like CNN [19] or SQuAD [47], providing the perfect evaluation for our model which is trained to handle long-range contexts. In addition, we evaluate on the Story Cloze Test [40], which involves selecting the correct ending to multi-sentence stories from two options. On these tasks, our model again outperforms the previous best results by significant margins - up to 8.9% on Story Cloze, and 5.7% overall on RACE. This demonstrates the ability of our model to handle long-range contexts effectively.
Question answering is another task that requires single- and multi-sentence reasoning. We use the RACE dataset, which consists of English passages with associated questions from middle and high school exams. It contains more reasoning-type questions than datasets such as CNN or SQuAD, which makes it well suited to evaluating a model trained to handle long-range context. We also evaluate on the Story Cloze Test, where the correct ending of a multi-sentence story must be chosen from two options. On these tasks the model again beats the previous best results, by up to 8.9% on Story Cloze and 5.7% overall on RACE. This shows that the model handles long-range context effectively.

passage : a short piece of writing or music that is part of a larger piece of work(https://dictionary.cambridge.org/ko/사전/영어/passage). Here it is best read as a reading passage on an exam.

Semantic Similarity
Semantic similarity (or paraphrase detection) tasks involve predicting whether two sentences are semantically equivalent or not. The challenges lie in recognizing rephrasing of concepts, understanding negation, and handling syntactic ambiguity. We use three datasets for this task – the Microsoft Paraphrase corpus (MRPC) [14] (collected from news sources), the Quora Question Pairs (QQP) dataset [9], and the Semantic Textual Similarity benchmark (STS-B) [6]. We obtain state-of-the-art results on two of the three semantic similarity tasks (Table 4) with a 1 point absolute gain on STS-B. The performance delta on QQP is significant, with a 4.2% absolute improvement over Single-task BiLSTM + ELMo + Attn.
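Semantic similarity (or paraphrase detection) is the task of predicting whether two sentences are semantically equivalent. The difficulty lies in recognizing rephrased concepts, understanding negation, and handling syntactic ambiguity. Three datasets are used for this task: the Microsoft Paraphrase corpus (MRPC), Quora Question Pairs (QQP), and the Semantic Textual Similarity benchmark (STS-B). We obtain state-of-the-art results on two of the three.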

Classification
Finally, we also evaluate on two different text classification tasks. The Corpus of Linguistic Acceptability (CoLA) [65] contains expert judgements on whether a sentence is grammatical or not, and tests the innate linguistic bias of trained models. The Stanford Sentiment Treebank (SST-2) [54], on the other hand, is a standard binary classification task. Our model obtains a score of 45.4 on CoLA, which is an especially big jump over the previous best result of 35.0, showcasing the innate linguistic bias learned by our model. The model also achieves 91.3% accuracy on SST-2, which is competitive with the state-of-the-art results. We also achieve an overall score of 72.8 on the GLUE benchmark, which is significantly better than the previous best of 68.9.
Finally, we evaluate on two text classification tasks. The Corpus of Linguistic Acceptability (CoLA) contains expert judgements on whether a sentence is grammatical, and tests the innate linguistic bias of trained models. SST-2, in contrast, is a standard binary classification task. Our model scores 45.4 on CoLA, a large jump over the previous best of 35.0. It also reaches 91.3% accuracy on SST-2, which is competitive with the state of the art. On the GLUE benchmark we reach an overall score of 72.8, above the previous best of 68.9.
Overall, our approach achieves new state-of-the-art results in 9 out of the 12 datasets we evaluate on, outperforming ensembles in many cases. Our results also indicate that our approach works well across datasets of different sizes, from smaller datasets such as STS-B (5.7k training examples) – to the largest one – SNLI (550k training examples).
Overall, our approach sets a new state of the art on 9 of the 12 datasets. The results also show that it works well across dataset sizes, from small datasets to large ones.
5. Analysis
Impact of number of layers transferred
We observed the impact of transferring a variable number of layers from unsupervised pre-training to the supervised target task. Figure 2(left) illustrates the performance of our approach on MultiNLI and RACE as a function of the number of layers transferred. We observe the standard result that transferring embeddings improves performance and that each transformer layer provides further benefits up to 9% for full transfer on MultiNLI. This indicates that each layer in the pre-trained model contains useful functionality for solving target tasks.
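We examined how the number of layers transferred from unsupervised pre-training to the supervised target task affects performance. Figure 2 (left) shows performance on MultiNLI and RACE as a function of the number of layers transferred. Transferring the embeddings improves performance, and each additional Transformer layer adds further benefit, up to 9% on MultiNLI when all 12 layers are transferred (paraphrased). This indicates that each layer of the pre-trained model contains functionality that is useful for solving the target tasks.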


The right panel is the graph for zero-shot behavior.

Zero-shot Behaviors
We’d like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability and that the more structured attentional memory of the transformer assists in transfer compared to LSTMs. We designed a series of heuristic solutions that use the underlying generative model to perform tasks without supervised finetuning. We visualize the effectiveness of these heuristic solutions over the course of generative pre-training in Fig 2(right). 
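We want to understand why language model pre-training of Transformers is effective. One hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability, and that the more structured attentional memory of the Transformer helps transfer compared to LSTMs. We designed a series of heuristic solutions that use the underlying generative model to perform the tasks without supervised fine-tuning. Fig 2 (right) shows how effective these heuristics are over the course of generative pre-training.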
We observe the performance of these heuristics is stable and steadily increases over training suggesting that generative pretraining supports the learning of a wide variety of task relevant functionality. We also observe the LSTM exhibits higher variance in its zero-shot performance suggesting that the inductive bias of the Transformer architecture assists in transfer.
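The performance of these heuristics is stable and rises steadily over training, which suggests that generative pre-training supports learning a wide variety of task-relevant functionality. The LSTM also shows higher variance in its zero-shot performance, which suggests that the inductive bias of the Transformer architecture helps transfer.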

inductive bias : the additional assumptions a learning model brings in so that it can handle situations it has not encountered. For example, a CNN adds an assumption of locality: relations between entities are assumed to exist only between elements that are close to each other (proximity), which makes it an excellent structure when it matters which elements with certain properties are clustered together. Likewise, an RNN assumes sequential structure, which is why it performs well on time-series data. Both are inductive biases.
  In contrast, a Transformer uses the whole input at once, so it cannot make such additional assumptions; that is, it lacks inductive bias. As a result it can behave robustly, but it needs a large amount of data. (https://velog.io/@euisuk-chung/Inductive-Bias란, https://robot-vision-develop-story.tistory.com/29, https://enfow.github.io/paper-review/graph-neural-network/2021/01/11/relational_inductive_biases_deep_learning_and_graph_netowrks/)


<< The inductive bias of RNNs and CNNs is said to be stronger than that of the Transformer, so why does the Transformer's inductive bias help transfer? >>
For CoLA (linguistic acceptability), examples are scored as the average token log-probability the generative model assigns and predictions are made by thresholding. For SST-2 (sentiment analysis), we append the token very to each example and restrict the language model’s output distribution to only the words positive and negative and guess the token it assigns higher probability to as the prediction. For RACE (question answering), we pick the answer the generative model assigns the highest average token log-probability when conditioned on the document and question. For DPRD [46] (winograd schemas), we replace the definite pronoun with the two possible referrents and predict the resolution that the generative model assigns higher average token log-probability to the rest of the sequence after the substitution.
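For CoLA (the dataset used in the text classification task), each example is scored by the average token log-probability that the generative model assigns, and the prediction is made by thresholding that score. For SST-2, the token very is appended to each example, the model's output distribution is restricted to the words positive and negative, and the one given higher probability is taken as the prediction. For RACE, the answer to which the generative model assigns the highest average token log-probability, conditioned on the document and question, is chosen. For DPRD, the definite pronoun is replaced with each of the two possible referents (referent: one that refers or is referred to), and the prediction is the resolution for which the generative model assigns the higher average token log-probability to the rest of the sequence after the substitution.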
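The scoring rule shared by the CoLA and RACE heuristics (average token log-probability) can be sketched as follows. The language model is replaced by a toy stand-in, `toy_token_log_probs`, which is purely hypothetical; a real setup would query the pre-trained model for per-token log-probabilities instead.

```python
from typing import Callable, List, Sequence

LogProbFn = Callable[[Sequence[str], Sequence[str]], List[float]]


def avg_log_prob(log_probs: Sequence[float]) -> float:
    """Average token log-probability of a sequence."""
    return sum(log_probs) / len(log_probs)


def pick_answer(context: Sequence[str], answers: Sequence[Sequence[str]],
                token_log_probs: LogProbFn) -> int:
    """RACE-style: index of the answer with the highest average log-prob."""
    scores = [avg_log_prob(token_log_probs(context, a)) for a in answers]
    return max(range(len(answers)), key=scores.__getitem__)


def classify_by_threshold(log_probs: Sequence[float], threshold: float) -> bool:
    """CoLA-style: accept when the average log-prob clears a threshold."""
    return avg_log_prob(log_probs) >= threshold


def toy_token_log_probs(context: Sequence[str], answer: Sequence[str]) -> List[float]:
    """Toy stand-in for a language model (assumption): tokens already seen
    in the context get a higher log-probability than unseen ones."""
    return [-1.0 if tok in context else -5.0 for tok in answer]
```

Averaging rather than summing keeps longer candidate answers from being penalized simply for having more tokens.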
Ablation studies
We perform three different ablation studies (Table 5). First, we examine the performance of our method without the auxiliary LM objective during fine-tuning. We observe that the auxiliary objective helps on the NLI tasks and QQP. Overall, the trend suggests that larger datasets benefit from the auxiliary objective but smaller datasets do not. Second, we analyze the effect of the Transformer by comparing it with a single layer 2048 unit LSTM using the same framework. We  observe a 5.6 average score drop when using the LSTM instead of the Transformer. The LSTM only outperforms the Transformer on one dataset – MRPC. Finally, we also compare with our transformer architecture directly trained on supervised target tasks, without pre-training. We observe that the lack of pre-training hurts performance across all the tasks, resulting in a 14.8% decrease compared to our full model.
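We ran three ablation studies. First, we examine performance without the auxiliary language modeling objective during fine-tuning. The auxiliary objective helps on the NLI tasks and on QQP (the dataset used for semantic similarity). Overall, the trend suggests that larger datasets benefit from the auxiliary objective while smaller datasets do not. Second, we analyze the effect of the Transformer by comparing it with a single-layer LSTM with 2048 units in the same framework. Using the LSTM instead of the Transformer lowers the average score by 5.6, and the LSTM outperforms the Transformer only on MRPC. Finally, we compare with the same Transformer architecture trained directly on the supervised target tasks without pre-training. The lack of pre-training hurts performance on all tasks, a 14.8% decrease compared to the full model.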

Ablation studies : scientific experiments that remove building blocks of a machine learning system to gain insight into their effect on overall performance. For example, in a model with n layers, one removes the (n-1)-th layer and observes the change.(https://cumulu-s.tistory.com/8)


6. Conclusion
We introduced a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning. By pre-training on a diverse corpus with long stretches of contiguous text our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets we study. Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of Machine Learning research. Our work suggests that achieving significant performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets (text with long range dependencies) work best with this approach. We hope that this will help enable new research into unsupervised learning, for both natural language understanding and other domains, further improving our understanding of how and when unsupervised learning works.
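We introduced a framework that achieves strong natural language understanding with a single task-agnostic model (one that can be used regardless of the task) through generative pre-training and discriminative fine-tuning. By pre-training on a diverse corpus with long stretches of contiguous text, the model acquires substantial world knowledge and the ability to process long-range dependencies, and these transfer successfully to discriminative tasks such as QA, semantic similarity assessment, entailment determination, and text classification, giving state-of-the-art results on 9 of the 12 datasets. Using unsupervised (pre-)training to improve performance on discriminative tasks has long been an important goal of machine learning research. Our work suggests that significant gains are indeed possible, and offers hints about which models (Transformers) and which data (text with long-range dependencies) work best with this approach. We hope it will enable new research into unsupervised learning, for natural language understanding and for other domains, and further improve our understanding of how and when unsupervised learning works.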