[NLP classic paper intensive reading] Attention Is All You Need

Foreword

The Transformer is a model architecture whose standing now equals, or even exceeds, that of RNNs and CNNs, and it can fairly be said to have opened a new era. Why use attention? What are the shortcomings of CNNs and RNNs, and why does attention perform so well on language tasks? This article gives detailed answers, and I hope it offers some new ideas and inspiration to readers who are filling gaps in their understanding~


Paper: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Code: https://github.com/tensorflow/tensor2tensor

Abstract

This paper proposes a novel and simple network architecture, the Transformer, which is based entirely on the attention mechanism and requires neither recurrence nor convolution. Experiments on machine translation tasks show that the model performs well, is highly parallelizable, and requires significantly less training time.

1. Introduction

Recurrent neural networks such as LSTMs and GRUs are the most advanced methods in sequence modeling, but such models typically factor computation along the positions of the input and output sequences. This inherent sequentiality hinders parallelization of training.
The attention mechanism has become an important component of many sequence and transduction models for handling long-range dependencies (i.e., how the encoder's knowledge is effectively passed to the decoder), but with few exceptions it is still used in combination with a recurrent neural network.
This paper proposes the Transformer architecture, which completely relies on the attention mechanism and significantly improves the parallelism.

2. Background

To reduce sequential computation, several approaches use convolutional neural networks as the basic building block to compute hidden representations of the input and output, but the number of operations required to relate two arbitrary input or output tokens grows with their distance, which makes it difficult to learn long-range positional relationships. The Transformer reduces this to a constant number of operations through the attention mechanism, and uses multi-head attention to counteract the resulting loss of effective resolution.

With convolutions, two tokens that are far apart need many stacked layers before they are connected, whereas the self-attention mechanism connects them directly. "Effective resolution" here can be understood as a pattern: just as multi-channel convolution can learn multiple patterns in an image, multi-head attention has the same effect.

The self-attention mechanism connects different positions of a single sequence to calculate the representation of the sequence, and has been successfully applied to tasks such as reading comprehension and summarization.
At that time there was also an end-to-end memory network based on a recurrent attention mechanism that performed well on simple language question answering and language modeling tasks.
The Transformer is the first known transduction model that relies entirely on self-attention to compute representations of its input and output.

3. Model Architecture

Most competitive models use an encoder-decoder architecture. The encoder maps an input sequence of length $n$, $(x_1,...,x_n)$, to a sequence of representations $\mathbf{z}=(z_1,...,z_n)$, where each $z_i$ is a fixed-length hidden vector; the decoder then takes the encoder output and generates the final output sequence $(y_1,...,y_m)$ of length $m$. The decoder is autoregressive, i.e., the previously generated outputs are used as additional input at the next step. The Transformer architecture is shown in the figure below:
[Figure: The Transformer model architecture]

3.1 Encoder and Decoder Stacks

Encoder

The encoder is a stack of 6 identical blocks. Each block has two sublayers: the first is a multi-head self-attention mechanism and the second is an MLP. A residual connection is applied around each sublayer, followed by layer normalization, i.e. $\mathrm{LayerNorm}(x+\mathrm{Sublayer}(x))$. All vector dimensions are set to $d_{\mathrm{model}}=512$.
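To make $\mathrm{LayerNorm}(x+\mathrm{Sublayer}(x))$ concrete, here is a minimal PyTorch sketch of one such residual "Add & Norm" wrapper. The class name, dropout placement, and default sizes are my own assumptions for illustration, not the official tensor2tensor code.

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization:
    LayerNorm(x + Sublayer(x)), applied around every sublayer."""
    def __init__(self, d_model: int = 512, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # sublayer is any callable mapping (batch, seq, d_model) -> same shape,
        # e.g. multi-head self-attention or the position-wise MLP.
        return self.norm(x + self.dropout(sublayer(x)))
```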

Decoder

The decoder is likewise a stack of 6 identical blocks. In addition to the two sublayers found in each encoder block, the decoder inserts a third sublayer that performs multi-head attention over the output of the encoder. As in the encoder, a residual connection is applied around each sublayer, followed by layer normalization. The authors also modify the self-attention sublayer of the decoder with a mask, to prevent information about subsequent outputs from leaking into earlier positions.

It is worth explaining why LayerNorm is used instead of BatchNorm. In sequence data each sample has a different length, so a maximum length is set and the shorter samples are padded with zeros. The key difference between the two lies in how the mean and variance are computed: if sample lengths vary a lot, the per-batch statistics used by BatchNorm jitter severely and generalize poorly to newly arriving sequences that are much longer or shorter than those seen before, whereas LayerNorm computes the mean and variance within each sample itself, which is more stable. See the figure below for details.

[Figure: how BatchNorm and LayerNorm compute statistics over a batch of variable-length sequences]
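A small illustration of the axes over which the two normalizations compute their statistics, for a batch of shape (batch, seq_len, d_model); this is only meant to show the difference, not a full implementation.

```python
import torch

x = torch.randn(4, 10, 512)  # (batch, seq_len, d_model), possibly zero-padded

# LayerNorm: statistics over the feature dimension of each token, per sample.
ln_mean = x.mean(dim=-1, keepdim=True)                 # shape (4, 10, 1)
ln_var  = x.var(dim=-1, unbiased=False, keepdim=True)  # shape (4, 10, 1)

# BatchNorm (as it would be used on sequences): statistics over all samples and
# positions for each feature, so padding and length variation pollute them.
flat = x.reshape(-1, 512)                              # (batch * seq_len, d_model)
bn_mean = flat.mean(dim=0)                             # shape (512,)
bn_var  = flat.var(dim=0, unbiased=False)              # shape (512,)
```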

3.2 Attention

An attention function maps a query and a set of key-value pairs to an output. The output can be regarded as a weighted sum of the values, where the weight assigned to each value is computed from the similarity between the query and the corresponding key.

3.2.1 Scaled Dot-Product Attention

[Figure: Scaled Dot-Product Attention]
The attention used in this paper is called Scaled Dot-Product Attention. The input consists of queries and keys of dimension $d_k$ and values of dimension $d_v$. The dot products of the queries with all keys are computed, each is divided by $\sqrt{d_k}$, and a softmax is applied to obtain the weights on the values:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The two most common attention mechanisms are additive attention and dot-product attention; the attention in this paper is dot-product attention plus scaling. Dot-product attention is faster and more space-efficient in practice because it can use highly optimized matrix multiplication code, but additive attention outperforms unscaled dot-product attention for larger $d_k$. The scaling is therefore introduced to keep the dot products from growing so large that the softmax becomes severely polarized and its gradients become tiny, slowing training. The attention computation process is shown in the following figure:
[Figure: the Scaled Dot-Product Attention computation flow]
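A minimal PyTorch sketch of this formula (the function name and the optional mask argument are my own additions for illustration):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (..., seq_len, d_k); v: (..., seq_len, d_v)."""
    d_k = q.size(-1)
    # Compatibility scores: QK^T / sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 are excluded from attention.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # attention weights, rows sum to 1
    return torch.matmul(weights, v), weights
```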

3.2.2 Multi-Head Attention

[Figure: Multi-Head Attention]
The authors found that instead of performing a single attention function on the $d_{\mathrm{model}}$-dimensional keys, values and queries, it is more effective to project them $h$ times and execute the attention function on each projection in parallel, producing $d_v$-dimensional outputs; these $h$ outputs are concatenated and projected once more to obtain the final $d_{\mathrm{model}}$-dimensional output, as shown above. The advantage of multi-head attention is that it allows the model to jointly attend to information from different representation subspaces at different positions. The computation is:
$$\begin{aligned} \operatorname{MultiHead}(Q,K,V) &= \operatorname{Concat}(\operatorname{head}_1,\ldots,\operatorname{head}_h)W^O \\ \operatorname{head}_i &= \operatorname{Attention}(QW_i^Q, KW_i^K, VW_i^V) \end{aligned}$$
where the projection matrices are $W_i^Q \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\mathrm{model}}}$. In this paper the authors set $h=8$, so $d_k=d_v=d_{\mathrm{model}}/h=64$.
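A compact sketch of multi-head attention, reusing the scaled_dot_product_attention function defined above. This is a simplified illustration, not the tensor2tensor implementation; fusing the per-head projections into single linear layers is an implementation convenience.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # W^Q, W^K, W^V for all heads fused into single linear layers, plus W^O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # Project, then split d_model into h heads of size d_k: (b, h, seq, d_k)
        def split(x, w):
            return w(x).view(b, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate the heads back to (b, seq, d_model) and apply W^O.
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)
```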

3.2.3 Applications of Attention in our Model

This paper applies the multi-head attention mechanism in three places:

  • In the encoder-decoder attention layers, the queries come from the previous decoder layer, while the keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.
  • The encoder contains self-attention layers: all keys, values and queries come from the same place, namely the output of the previous encoder layer, so each position in the encoder can attend to all positions in that layer.
  • Similarly, the self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position; a mask is needed to prevent leftward positions from seeing information from rightward positions, preserving the autoregressive property (a sketch of such a mask follows this list).
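Below is a small sketch of such a causal (subsequent-position) mask, using the convention assumed in the attention sketch above (True = may attend, False/0 = blocked):

```python
import torch

def subsequent_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Example: a length-4 target sequence.
print(subsequent_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```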

3.3 Position-wise Feed-Forward Networks

Each layer of the encoder and decoder also contains a fully connected feed-forward network, which is applied to each token independently. It consists of two linear transformations with a ReLU in between:
$$\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$$
It can also be understood as two convolutions with kernel size 1.
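A short sketch of this position-wise feed-forward network (the paper's inner dimension is 2048; the class name is mine):

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # xW1 + b1
            nn.ReLU(),                  # max(0, .)
            nn.Linear(d_ff, d_model),   # (.)W2 + b2
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Applied independently to every position of shape (batch, seq, d_model).
        return self.net(x)
```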
[Figure: the position-wise MLP (top) and a comparison of attention and RNN sequence processing (bottom)]

The MLP can be understood from the upper part of the figure above: the output at each token is mapped to a higher dimension and then mapped back. The lower part compares attention and an RNN as a whole: both first aggregate sequence information and then apply a point-wise mapping as the next step, but the Transformer can compute over the information of the entire sequence in parallel, while an RNN can only process the earlier sequence information serially, which greatly reduces efficiency and effectiveness.

3.4 Embeddings and Softmax

The tokens of the input and output sequences are converted into vectors of dimension $d_{\mathrm{model}}$ through learned embeddings. In addition, the output of the decoder is converted into predicted next-token probabilities through a linear transformation (a mapping to the vocabulary) and a softmax. Note that the output of the embedding layer is multiplied by $\sqrt{d_{\mathrm{model}}}$, so that the positional information added next does not cause too much fluctuation relative to the embeddings.
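A minimal illustration of this scaling, assuming a shared embedding table of roughly the paper's vocabulary size (the names here are placeholders):

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 37000           # ~37000 shared BPE tokens in the paper
embed = nn.Embedding(vocab_size, d_model)

tokens = torch.tensor([[5, 42, 7]])        # (batch, seq_len) of token ids
x = embed(tokens) * math.sqrt(d_model)     # scale before adding positional encodings
```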

3.5 Positional Encoding

Attention computes only similarity-weighted sums, which contain no ordering information: if the sequence tokens are shuffled, the result stays the same, so positional information must be injected. The authors use sine and cosine functions of different frequencies:
$$\begin{aligned} PE_{(pos,2i)} &= \sin\left(pos/10000^{2i/d_{\mathrm{model}}}\right) \\ PE_{(pos,2i+1)} &= \cos\left(pos/10000^{2i/d_{\mathrm{model}}}\right) \end{aligned}$$
where $pos$ is the position and $i$ is the dimension. The positional encoding has the same dimension as the embedding, so the two can be added. The sinusoidal form was chosen because it may allow the model to extrapolate to sequence lengths longer than those encountered during training.
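A sketch of these formulas producing a (max_len, d_model) table to be added to the scaled embeddings (function name mine):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int = 512) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dimensions
    div = torch.exp(-(i / d_model) * math.log(10000.0))            # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                             # even dims: sin
    pe[:, 1::2] = torch.cos(pos * div)                             # odd dims: cos
    return pe

# Usage: for embeddings x of shape (batch, seq_len, d_model),
#   x = x + sinusoidal_positional_encoding(x.size(1), x.size(2))
```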

4. Why Self-Attention

Recurrent layers and convolutional layers are commonly used to map one variable-length sequence of representations to another sequence of equal length. The authors compare them with self-attention along three criteria:

  1. Total computational complexity per layer;
  2. The amount of computation that can be parallelized, measured by the minimum number of sequential operations required;
  3. The path length between long-range dependencies in the network (shorter paths are easier to learn).

[Table: per-layer complexity, minimum sequential operations, and maximum path length for self-attention, recurrent, convolutional, and restricted self-attention layers]
As the table above shows, a self-attention layer connects all positions with a constant number of sequential operations, whereas a recurrent layer requires $O(n)$ sequential operations. In terms of computational complexity, the self-attention layer is faster than the recurrent layer when the sequence length $n$ is smaller than the representation dimension $d$. To improve computational performance on very long sequences, restricted self-attention lets each position consider only a neighborhood of size $r$ centered on it, which increases the maximum path length to $O(n/r)$.
For convolutional layers, if the kernel width $k<n$, a single layer cannot connect all pairs of input and output positions; a stack of $O(n/k)$ such layers is required, which increases the length of the longest path between any two positions in the network. In terms of computational complexity, a convolutional layer is a factor of $k$ more expensive than a recurrent layer; even with separable convolutions, the complexity is at best equal to the combination of a self-attention layer and a point-wise feed-forward layer.
Finally, self-attention has the added benefit of yielding more interpretable models: the attention distributions can exhibit behavior related to the syntactic and semantic structure of sentences.

In practice, $n$ and $d$ are often of similar size, so the computational complexity of the three layer types can be considered to be on the same order of magnitude, but attention has clear advantages in parallelism and in handling long-range dependencies.

5. Training

5.1 Training Data and Batching

The authors run experiments on the WMT 2014 English-German dataset of about 4.5 million sentence pairs and the English-French dataset of 36 million sentence pairs.

5.2 Hardware and Schedule

Training was carried out on a single machine with 8 P100 GPUs.

5.3 Optimizer

The optimizer is Adam with a dynamic learning rate schedule (linear warmup followed by decay).
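For reference, the schedule described in the paper increases the learning rate linearly over the first warmup_steps training steps and then decays it proportionally to the inverse square root of the step number; a small sketch (paper values: warmup_steps = 4000):

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate ramps up during warmup, peaks at step 4000, then decays.
print(transformer_lr(100), transformer_lr(4000), transformer_lr(100000))
```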

5.4 Regularization

The authors employ three regularization methods.

  • Residual dropout: applied to the output of each sublayer before it enters the Add & Norm step, and also to the sums of the embeddings and positional encodings.
  • Label smoothing: this hurts perplexity, since the model learns to be more unsure, but it improves accuracy and BLEU score (a sketch follows this list).
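A hedged sketch of label-smoothed cross-entropy, mixing the one-hot target with a uniform distribution using the paper's value $\epsilon_{ls}=0.1$ (the function name is mine; newer PyTorch versions also expose this via the label_smoothing argument of F.cross_entropy):

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, target, eps: float = 0.1):
    """Cross-entropy against a smoothed target: weight (1 - eps) on the hard
    one-hot target plus eps on a uniform distribution over all classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)
    return ((1.0 - eps) * nll + eps * uniform).mean()

# Example usage with a toy batch.
logits = torch.randn(8, 37000)            # (batch, vocab)
target = torch.randint(0, 37000, (8,))
loss = label_smoothed_nll(logits, target, eps=0.1)
```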

6. Results

6.1 Machine Translation

The model performs well on both the English-German and the English-French translation tasks, while the training cost is greatly reduced. The table below summarizes the experimental results.
[Table: BLEU scores and training costs compared with previous state-of-the-art models]

6.2 Model Variations

To evaluate the importance of the different components of the Transformer, the authors measure how performance changes as hyperparameters are varied; the results are shown in the table below.
[Table: Transformer architecture variations and their effect on performance]
Rows (B) of the table show that reducing the attention key dimension $d_k$ hurts model quality. Rows (C) and (D) further show that bigger models do better and that dropout is very helpful for avoiding overfitting. In row (E), replacing the sinusoidal positional encoding with learned positional embeddings gives nearly identical results.

7. Conclusion

This paper proposes the Transformer, a transduction model that uses only the attention mechanism, replacing the recurrent layers of previous encoder-decoder architectures with multi-head attention. On machine translation tasks, the Transformer trains quickly and performs well. The authors therefore hope to apply the Transformer to other tasks, and to handle inputs and outputs such as images, audio, and video efficiently by studying local, restricted self-attention mechanisms.

Reading summary

Looking back at this classic of the NLP field, the cornerstone of LLMs and the creator of a new era, the Transformer, I have gained a lot. Constrained by its length, the paper leaves many details unexplained and cannot present the inspiration and ideas behind the model design as a complete story, but this does not diminish its outstanding contribution to NLP and to the whole AI field. I read it together with Mushen's explanation video, which gave me a deeper understanding of the Transformer architecture; he clearly explains many points the paper leaves unexplained or unclear, including why LayerNorm is used, the usage and meaning of Q, K and V, and the role of the attention layer. Over the past year I had been busy trying to build with the Transformer as a building block while overlooking the subtleties of the building block itself; I believe this intensive reading will give my subsequent research a fresh understanding and new ways of thinking.
