[NLP Paper] Attention Is All You Need

NLP/Papers

[NLP Paper] Attention Is All You Need

iamzieun 2023. 4. 8. 12:44

본 포스팅은 논문 Attention Is All You Need를 읽고 요약한 글입니다.

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new

arxiv.org

Abstract

기존의 지배적인 sequence transduction 모델: RNN 기반의 encoder-decoder 구조, attention 기반의 encoder-decoder 구조
Transformer: attention만을 기반으로 함

1 Introduction

RNN은 재귀적인 연산으로 인해 parallelization(병렬화)가 불가능하여, 긴 sequence를 나눠서 처리할 수 없음
- 이를 해결하기 위해 factoriation tricks, conditional computation 등이 도입됐지만, 근본적인 문제점은 해결되지 않음
Attention은 input 및 output sequence의 길이에 대한 모델의 dependency를 근본적으로 해결
Transformer는 recurrence 없이 Attention만을 기반으로 하여 input과 output간의 dependency를 해결

2 Background

Self-Attention: sequence 내 각 position 사이의 관계를 파악하여 해당 sequence의 representation을 계산하는 attention mechanism
End-to-End memory network: recurrent attention mechanism을 기반으로 함
Transformer: self-attention만을 이용해 input과 output의 representation을 계산

3 Model Architecture

Encoder-Decoder 구조
- encoder: input sequence → sequence of continuous representation
- decoder: sequence of continuous representation → output sequence

3.1 Encoder and Decoder Stacks

Encoder
- multi-head self-attention mechanism
- position-wise fully connected feed-forward network
- residual connection
- layer normalization
Decoder
- masked multi-head self-attention mechanism: 이전 position까지의 sequence만을 바탕으로 현재 position을 predict
- multi-head self-attention mechanism: encoder의 output을 input으로 활용
- position-wise fully connected feed-forward network
- residual connection
- layer normalization

3.2 Attention

Attention: query, key-value 쌍과 output의 mapping
- output = value의 가중합 (가중치는 query와 key로부터 계산됨)

3.2.1 Scaled Dot-Product Attention

\(\sqrt{d_v}\)로 나누는 이유
- \(d_k\)가 커지면 \(Q \cdot K^T\) matrix의 분산이 커짐
- \(Q \cdot K^T\) matrix의 분산이 커지면 값이 sparse해져 softmax 연산에서 극단의 값을 가지게 되는 원소들이 생김 → gradient vanishing 문제 발생
- 이를 예방하기 위해 로 나누어주는 과정을 거침

3.2.2 Multi-Head Attention

\(\begin{aligned}\operatorname{MultiHead}(Q, K, V) & =\operatorname{Concat}\left(\operatorname{head}1, \ldots, \text { head }{\mathrm{h}}\right) W^O \\\text { where } \operatorname{head}_{\mathrm{i}} & =\operatorname{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)\end{aligned}\)
- query, key, value vector를 \(h\)개의 각각 다른 linear projection을 거쳐 새로운 query, key, value vector를 만듦 (\(d_{model}\) → \(d_k, d_v\))
- h개의 query, key, value vector로 attention을 실행한 결과를 concatenate한 후 linear projection
→ jointly attend to information from different representation subspaces at different positions

3.2.3 Applications of Attention in our Model

encoder-decoder attention (decoder의 두 번째 attention layer)
- query는 이전 decoder layer로부터, key / value는 encoder output으로부터 생성됨
- → decoder가 input sequence의 모든 부분에 대한 정보에 접근할 수 있음
encoder의 self-attention layer
- key / value / query 모두 이전 encoder output으로부터 생성됨
- → 이전 encoder의 모든 부분에 대한 정보에 접근할 수 있음
decoder의 self-attention layer (decoder의 첫 번째 attention layer)
- softmax의 input 중 미래 단어와의 유사도를 계산한 부분은 음의 무한대로 masking 처리
- → 오른쪽에서 왼쪽으로의 정보의 흐름을 차단함으로써 auto-regressive property를 보존

3.3 Position-wise Feed-Forward Networks

\(F F N(x)=M A X\left(0, x W_1+b_1\right) W_2+b_2\)
- 한 layer 내부의 position끼리는 같은 parameter를 적용, layer간에는 다른 parameter를 적용

3.4 Embeddings and Softmax

학습된 embedding: input tokens / output tokens → \(d_{model}\)차원의 vector
학습된 linear transformation / softmax function: decoder output → 다음 token이 될 확률

3.5 Positional Encoding

input embedding에 positional encoding을 더함 → 모델이 sequence의 순서 정보를 사용할 수 있게 함
\(\begin{gathered}P E_{(p o s, 2 i)}=\sin \left(p o s / 10000^{2 i / d_{\text {model }}}\right) \\P E_{(p o s, 2 i+1)}=\cos \left(p o s / 10000^{2 i / d_{\text {model }}}\right)\end{gathered}\)
- \(pos\): embedding vector에서 단어의 위치
- \(i\): embedding vector 내 차원의 index
- \(d_{model}\): Transformer의 모든 층의 출력 차원 (Transformer의 hyperparameter)

4 Why Self-Attention

total computational complexity per layer
- sequence length n < representation dimensionality d 인 경우 self-attention이 recurrent보다 빠름
- n > d인 경우에는, 주변 r개의 단어만을 고려하여 self-attention을 실행하게 함으로써 complexity를 줄일 수 있음
amount of computation that can be parallelized ← minimum number of sequential operations required
path length between long-range dependencies in the network
- 네트워크에서 멀리 떨어져 있는 정보들 사이를 연결하는 path의 길이가 짧을수록 해당 정보들 사이의 관계를 학습하기 쉬움
interpretable
- 각각의 attention head는 각각 다른 syntactic / semenatic task를 수행하도록 학습됨

5 Training

5.1 Training Data and Batching

Data
- WMT 2014 English-German dataset (4.5M sentence pairs)
- WMT 2014 English-French dataset (36M sentence pairs)
Batch
- 비슷한 길이를 가지는 문장 pair끼리 묶어서 batch 형성 (한 batch당 약 25000개의 source tokens / target tokens)

5.2 Hardward and Schedule

base model: NVIDIA P100 GPU * 8개로 12시간 학습
big model: NVIDIA P100 GPU * 8개로 3.5일 학습

5.3 Optimizer

optimizer: Adam
learning rate:

5.4 Regularization

Residual Dropout
Label Smoothing

저작자표시