Intensive reading of Transformer papers - Attention Is All You Need

https://www.bilibili.com/video/BV1pu411o7BE

Summary

Sequence transduction models generate one sequence from another. The Transformer relies only on the attention mechanism, and it achieves strong results in machine translation.

Conclusion

One of the Transformer's key contributions is the proposal of multi-head self-attention.

Ⅰ、Introduction

How an RNN processes a sequence: it walks through the sequence step by step from left to right. If the sequence is a sentence, the RNN reads it word by word; for the t-th word it computes an output $h_t$ (the hidden state). $h_t$ is determined by the hidden state of the previous word, $h_{t-1}$, together with the current t-th word itself. In this way, the historical information learned so far is carried in $h_{t-1}$, combined with the current word, and turned into the output $h_t$. This is the key to the RNN's ability to handle sequential information: all the earlier information is kept in the hidden state and passed along one step at a time. (second paragraph)

The problems with this are: ① The RNN is a step-by-step computation, which is hard to parallelize. For example, to compute $h_t$ we must wait until $h_{t-1}$ has been computed; if a sentence has 100 words, 100 sequential steps are needed. ② Historical information is passed along step by step. If the sequence is very long, information from early time steps may be lost later on. If we do not want to lose it, $h_t$ has to be relatively large, which makes the hidden state of every subsequent step large as well and leads to a large memory overhead. (second paragraph)
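As a rough illustration of this bottleneck, here is a minimal sketch (in PyTorch, not from the paper) of the recurrence $h_t = f(h_{t-1}, x_t)$; the explicit loop over time steps is exactly what prevents parallelization:

```python
import torch

def rnn_forward(x, W_xh, W_hh, b_h):
    """x: (seq_len, d_in); W_xh: (d_in, d_hidden); W_hh: (d_hidden, d_hidden)."""
    d_hidden = W_hh.shape[0]
    h = torch.zeros(d_hidden)
    outputs = []
    for t in range(x.shape[0]):   # strictly sequential: step t must wait for step t-1
        h = torch.tanh(x[t] @ W_xh + h @ W_hh + b_h)
        outputs.append(h)
    return torch.stack(outputs)   # one hidden state per time step
```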

Ⅱ, Background (related work)

Some work replaces the RNN with a CNN to reduce sequential computation. The problem is that CNNs have difficulty modeling long sequences: a convolution only looks at a small window at a time, for example a $3 \times 3$ block of pixels. If two pixels are far apart, many convolution layers have to be stacked before those two distant pixels are finally combined. With the Transformer's attention mechanism, in contrast, every position can see all positions, so a single layer sees the entire sequence. One advantage of convolution is its multiple output channels (each output channel can be viewed as recognizing a different pattern). The Transformer wants the same effect, which is why Multi-Head Attention is proposed. (first paragraph)

The paper builds on prior work on self-attention. (second paragraph)

The Transformer is the first model that relies entirely on self-attention for its encoder-decoder architecture. (fourth paragraph, the last paragraph)

Ⅲ, Model Architecture (model)

A common and effective architecture for sequence models is the encoder-decoder structure. The encoder maps an input sequence $(x_1, ..., x_n)$ to an output $\mathbf{z} = (z_1, ..., z_n)$, where $z_t$ is a vector representation of $x_t$ ($t \in \{1, ..., n\}$). In other words, the encoder turns the raw input into a sequence of vectors that the model can work with. The decoder then takes the encoder output and generates an output sequence $(y_1, ..., y_m)$ of length $m$; note that $m$ may or may not equal $n$. The difference from the encoder is that the $y_t$ are generated one at a time, whereas the encoder can read the whole input sequence at once. This mode of generation is called auto-regressive: the model's past outputs become its inputs. Given $\mathbf{z}$, it generates the first output $y_1$; with $y_1$ it generates $y_2$; and to generate $y_t$ it needs all of $y_1$ through $y_{t-1}$. More concretely, the Transformer stacks self-attention and point-wise fully connected layers on top of each other. (first paragraph)

The encoder's input is the sequence $(x_1, ..., x_n)$; its output is $\mathbf{z} = (z_1, ..., z_n)$.

The decoder's input is $\mathbf{z} = (z_1, ..., z_n)$; its output is $(y_1, ..., y_m)$.
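A minimal sketch of the auto-regressive decoding loop described above; the `model.encode`/`model.decode` interface and the token names are hypothetical placeholders, not the paper's API:

```python
def greedy_decode(model, x, start_token, end_token, max_len=100):
    z = model.encode(x)                  # z = (z_1, ..., z_n), computed once
    y = [start_token]
    for _ in range(max_len):
        next_token = model.decode(z, y)  # conditions on all of y_1 .. y_{t-1}
        y.append(next_token)
        if next_token == end_token:
            break
    return y[1:]
```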

[Figure: the Transformer model architecture]

The left half is the encoder, and the right half is the decoder.

3.1 Encoder and Decoder Stacks (encoder and decoder architecture)

Encoder: it consists of a stack of 6 identical layers. Each layer has two sub-layers: ① a multi-head self-attention mechanism, and ② a position-wise fully connected feed-forward network (essentially an MLP). A residual connection is used around each sub-layer, followed by layer normalization; that is, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$. All sub-layers and the embedding layers of the Transformer produce 512-dimensional outputs.
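A minimal sketch of the sub-layer wrapper $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$; the class name is illustrative, and the dropout on the sub-layer output follows the paper's description:

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization (post-norm)."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # sublayer is either multi-head self-attention or the position-wise FFN
        return self.norm(x + self.dropout(sublayer(x)))
```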



Extension: Batch Norm and Layer Norm

In variable-length applications, Batch Norm is usually not used.

In the two-dimensional case, as shown in the figure, each row is a sample and each column is one feature across all samples in the batch. Batch Norm normalizes a column (subtract the mean and divide by the standard deviation); Layer Norm normalizes a row. In fact, Layer Norm can be viewed as transposing the data, applying Batch Norm, and then transposing back.

[Figure: Batch Norm vs. Layer Norm in the two-dimensional case]

In the three-dimensional case, as shown in the figure, Layer Norm is used more often for variable-length sequences. One reason is that the lengths of the samples may differ (the figure also shows how Batch Norm and Layer Norm handle this). With Batch Norm, if the sample lengths vary a lot, the mean and standard deviation computed in each mini-batch fluctuate heavily. Moreover, at prediction time the global mean and standard deviation must be stored; if a new sample is much longer than anything seen during training, these global statistics may not fit it well. Layer Norm has no such restriction and does not need to store global statistics, because the mean and standard deviation are computed per sample.

[Figure: Batch Norm vs. Layer Norm on three-dimensional (batch, sequence, feature) data]
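The difference in which axes the statistics are computed over can be shown with a small sketch on a (batch, sequence, feature) tensor (the shapes are assumptions for illustration):

```python
import torch

x = torch.randn(4, 10, 512)   # 4 sequences, length 10, feature dimension 512

# Layer Norm: mean/std per sample and per position, over the feature dimension.
x_ln = (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + 1e-5)

# Batch Norm: mean/std per feature, over the batch and sequence dimensions.
x_bn = (x - x.mean(dim=(0, 1), keepdim=True)) / (x.std(dim=(0, 1), keepdim=True) + 1e-5)
```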



Decoder: it also consists of 6 identical layers. Each layer has the same two sub-layers as the encoder, plus a third sub-layer that performs multi-head attention over the encoder output. The decoder likewise uses residual connections and Layer Norm. The decoder is auto-regressive: the inputs for producing the current output are the outputs of previous time steps, which means that at prediction time the outputs of later time steps must not be visible. However, the attention mechanism can see the entire input at once, so this has to be prevented explicitly: during training, when predicting the output at time step t, the decoder must not see the inputs after time t. The Transformer's solution is a masked attention mechanism, which guarantees that at time t the positions after t are invisible, keeping training consistent with prediction.
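A minimal sketch of how such a mask can be built: positions after t are set to negative infinity in the attention scores, so the softmax assigns them zero weight (the function name is illustrative):

```python
import torch

def apply_causal_mask(scores):
    """scores: (..., seq_len, seq_len) attention scores before softmax."""
    seq_len = scores.size(-1)
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()  # strictly upper triangle
    return scores.masked_fill(future, float("-inf"))
```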

3.2 Attention

An attention function maps a query and a set of key-value pairs to an output (queries, keys, values and outputs are all vectors). The output is a weighted sum of the values; the weight of each value is computed from the similarity between the query and that value's key (how the similarity is computed depends on the compatibility function).



Example:

[Figure: a query attending to three key-value pairs]

The figure above shows 3 key-value pairs. The yellow query is close to key① and key②, so the similarities computed with key① and key② are high, and the yellow query's output is a weighted sum of the three values that is dominated by value① and value②; similarly, the green query is close to keys ② and ③, so those similarities are high (similarity is indicated by the thickness of the lines).



3.2.1 Scaled Dot-Product Attention

(Different attention functions lead to different attention variants. In this section, the author explains the attention used by the Transformer.)

Queries and keys have the same length, denoted $d_k$; values have length $d_v$ (the output also has length $d_v$). The computation takes the inner product of each query with all keys, divides by $\sqrt{d_k}$, applies a softmax, and thereby obtains the weights over the values. (first paragraph)

In the second paragraph of this section, the author gives formula 1 (below) to describe the algorithm in the first paragraph from the perspective of matrix operations.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Two common attention functions: additive attention (which easily handles the case where query and key have different lengths) and dot-product (multiplicative) attention. Dot-product attention is identical to the one used here, except that this paper additionally divides by $\sqrt{d_k}$. The two attention functions perform similarly, but this paper chooses dot-product attention because it is simple to implement and computationally efficient. (third paragraph)

When $d_k$ is small, the two dot-product variants (with and without scaling) perform similarly. But when $d_k$ is large (queries and keys are long vectors), the dot products can become very large; after the softmax the largest value is pushed close to 1 while the rest are pushed close to 0 (the results are driven toward the two ends), and the resulting gradients are very small. In theory, the softmax output should end up close to 1 where we are confident and close to 0 elsewhere; once it gets there the model has nearly converged, and with tiny gradients it can hardly keep training. Moreover, the dimension used by the Transformer is relatively large ($d_{model} = 512$), so dividing by $\sqrt{d_k}$ is a good choice. (fourth paragraph)

[Figure: Scaled Dot-Product Attention]
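A minimal sketch of Equation 1 in code (the optional mask argument is for the decoder's masked attention):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., n_q, d_k); K: (..., n_k, d_k); V: (..., n_k, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)             # one weight per key
    return weights @ V                                  # weighted sum of the values
```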

3.2.2 Multi-Head Attention

Rather than performing a single attention function, it is better to project the queries, keys and values to lower dimensions, h times; run the attention function h times in parallel; and finally concatenate the outputs and project once more to obtain the final output (first paragraph). The figure below shows the paper's description of Multi-Head Attention.

[Figure: Multi-Head Attention, from the paper]

Why use Multi-Head Attention? Looking closely at Scaled Dot-Product Attention, there are no parameters to learn. But in order to recognize different patterns, we want different ways of computing similarity. The trick is to project to lower dimensions, where the projection matrices $W$ are learnable. Projecting h times lets the model learn h different projections, so that the projected spaces can capture the similarity functions needed by different patterns. Finally, the heads are concatenated and projected once more to obtain the result.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, ..., \mathrm{head}_h)W^O$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
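A minimal sketch of multi-head attention under the paper's setting ($h = 8$, $d_{model} = 512$, so $d_k = d_v = 64$); it reuses the `scaled_dot_product_attention` sketch above, and the class name is illustrative:

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)   # learnable projections W^Q, W^K, W^V
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, query, key, value, mask=None):
        B = query.size(0)
        def split_heads(x, proj):   # (B, seq, d_model) -> (B, h, seq, d_k)
            return proj(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        Q = split_heads(query, self.W_q)
        K = split_heads(key, self.W_k)
        V = split_heads(value, self.W_v)
        out = scaled_dot_product_attention(Q, K, V, mask)        # attention per head
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)  # concat heads
        return self.W_o(out)
```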

3.2.3 Applications of Attention in our Model

(In this section, the paper explains how attention is used inside the Transformer: the three attention blocks in the architecture figure above.)

3.3 Position-wise Feed-Forward Networks

A feed-forward network is essentially an MLP. A "position" in Position-wise Feed-Forward Networks is one word of the input sequence: each word is a position. The sub-layer applies the same MLP to each position separately. (first paragraph)

$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$

The formula above shows that the network consists of two linear layers. Here $x$ is the output of the attention sub-layer for one position, with dimension 512. The first layer expands it by a factor of four (to 2048), and the second layer projects it back to 512. (second paragraph)
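A minimal sketch of this sub-layer (the class name is illustrative; the dropout between the two layers is a common implementation detail, not spelled out in the formula):

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)   # 512 -> 2048
        self.w_2 = nn.Linear(d_ff, d_model)   # 2048 -> 512
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq_len, d_model); the same weights act on every position
        return self.w_2(self.dropout(torch.relu(self.w_1(x))))
```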



Transformer (simplified version) compared with RNN:

[Figure: a simplified Transformer block compared with an RNN]

As the figure shows, the RNN and the Transformer are similar in that both use a linear layer or an MLP to transform into the semantic space; the difference lies in how sequence information is passed along. The RNN feeds the output of the previous time step into the next time step, whereas the Transformer pulls information from the entire sequence globally through an attention layer and then applies an MLP for the semantic transformation. In both cases, the core question is how to use sequence information effectively.



3.4 Embeddings and Softmax

The inputs are words (tokens), which need to be mapped to vectors before any computation. The role of the embedding is to learn, for every word, a vector of length $d$ that represents it (here $d = 512$). An embedding is needed before the encoder input and before the decoder input; the linear layer before the softmax also uses an embedding, and the weights of these three embeddings are shared. These weights are additionally multiplied by $\sqrt{d}$. (When learning the embeddings, the $L_2$ norm of each vector tends to be relatively small, for example around 1 regardless of the dimension; as the dimension grows, the individual weight values become smaller. But the positional encoding, which is added next, does not shrink as the dimension grows. Multiplying the weights by $\sqrt{d}$ keeps the two terms on roughly the same scale.)
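A minimal sketch of the embedding with the $\sqrt{d}$ scaling (the class name is illustrative; the weight sharing with the pre-softmax linear layer is omitted here):

```python
import math
import torch.nn as nn

class ScaledEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, tokens):
        # scale by sqrt(d_model) so the embeddings match the positional encoding in magnitude
        return self.emb(tokens) * math.sqrt(self.d_model)
```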

3.5 Positional Encoding

Attention has no notion of order: the output is a weighted sum of the values, and the weights come from the similarity between queries and keys, which has nothing to do with position. In other words, it does not care where a key-value pair sits in the sequence: if the words of a sentence are shuffled, the attention output stays the same. For sequential data this is wrong; when the order changes, the result should change too. Therefore, positional information has to be added to attention. An RNN injects order by feeding the output of the previous step into the next step; the RNN is itself sequential, but attention is not. The Transformer's approach is to add positional information directly to the input: use numbers to encode each word's position and add them to the input. The paper calls this Positional Encoding.

(The paper gives the specific formulas for the positional encoding.)
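For reference, a sketch of the paper's sinusoidal encoding, where even dimensions use sine and odd dimensions use cosine with geometrically increasing wavelengths:

```python
import math
import torch

def positional_encoding(max_len, d_model=512):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                  # 1 / 10000^(2i/d)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe   # added to the (scaled) word embeddings before the first layer
```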

IV、Why Self-Attention

This chapter explains why self-attention is used. The paper mainly argues that self-attention compares favorably with recurrent layers and convolutional layers.

[Table 1: per-layer complexity, sequential operations, and maximum path length for different layer types]

(The explanation of Table 1 is at 1:06:21–1:12:50 in the video.)

Attention is especially helpful for very long sequences, because it can pull together information from the whole sequence more effectively.


Origin blog.csdn.net/Snakehj/article/details/130568614