Transformer: An Explanation of the "Attention Is All You Need" Paper

The Transformer was proposed in the paper "Attention Is All You Need".


The paper address is:

https://arxiv.org/pdf/1706.03762.pdf

1. The overall structure of Transformer

Let us first look at the overall structure of the Transformer. The figure below shows the overall structure of a Transformer used for Chinese-to-English translation:

[Figure: Overall structure of the Transformer]

The figure above shows the overall structure of the Transformer, with the Encoder on the left and the Decoder on the right.

As the figure shows, the Transformer consists of two parts, the Encoder and the Decoder, and each of them contains 6 blocks. The workflow of the Transformer is roughly as follows:

Step 1: Obtain the representation vector X of each word in the input sentence. X is obtained by adding the word's Embedding (the Embedding is the feature extracted from the raw data) and the Embedding of the word's position.

[Figure: Obtaining the input representation X from word Embedding and position Embedding]

Step 2: The resulting word representation matrix (each row of which, as shown in the figure above, is the representation $x$ of one word) is passed into the Encoder. After 6 Encoder blocks, the encoding information matrix $C$ of all the words in the sentence is obtained, as shown in the figure below. The word vector matrix is denoted $X_{n \times d}$, where $n$ is the number of words in the sentence and $d$ is the dimension of the representation vector (in the paper $d = 512$). The matrix output by each Encoder block has exactly the same dimensions as its input.

[Figure: The Transformer Encoder encoding the sentence information]

Step 3: The encoding information matrix $C$ output by the Encoder is passed to the Decoder, which translates word $i + 1$ based on the words $1 \sim i$ that have already been translated, as shown in the figure below. When translating word $i + 1$, the words after $i + 1$ must be covered up by a Mask.

[Figure: Transformer Decoder prediction]

The Decoder in the figure above receives the encoding matrix $C$ from the Encoder. It first takes a translation start token as input and predicts the first word "I"; it then takes the start token and the word "I" as input and predicts the word "have", and so on. This is the overall workflow of the Transformer; the following sections describe each part in detail.

2. Input of Transformer

The representation $x$ of a word in the Transformer is obtained by adding the word Embedding and the position Embedding (Positional Encoding).

[Figure: Transformer input representation]

2.1 Word Embedding

The Embedding of a word can be obtained in many ways; for example, it can come from pre-training algorithms such as Word2Vec and GloVe, or it can be learned as part of Transformer training.

2.2 Position Embedding

In the Transformer, besides the word Embedding, a position Embedding is also needed to indicate where each word appears in the sentence. Because the Transformer does not use an RNN structure but instead attends to global information, it cannot otherwise exploit the order of the words, and this information is very important for NLP.

So the Transformer uses a position Embedding to preserve the relative or absolute position of each word in the sequence.

The position Embedding is denoted by PE, and its dimension is the same as that of the word Embedding. PE can either be learned through training or computed with a formula; the Transformer adopts the latter, using the following formulas:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$$

$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$$

where $pos$ is the position of the word in the sentence, $d$ is the dimension of PE (the same as the word Embedding), $2i$ indexes the even dimensions and $2i+1$ the odd dimensions (i.e. $2i \leq d$, $2i+1 \leq d$). Computing PE with this formula has the following benefits:

  • It enables PE to handle sentences longer than any sentence in the training set. For example, if the longest training sentence has 20 words and a sentence of length 21 arrives, the position Embedding of the 21st position can still be computed with the formula.
  • It allows the model to easily attend to relative positions: for a fixed offset $k$, $PE_{(pos+k)}$ can be computed from $PE_{(pos)}$.

This is because $\sin(A + B) = \sin(A)\cos(B) + \cos(A)\sin(B)$ and $\cos(A + B) = \cos(A)\cos(B) - \sin(A)\sin(B)$.

Adding the word Embedding and the position Embedding of a word gives the word's representation vector $x$, which is the input of the Transformer.
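
To make the input construction concrete, here is a minimal sketch in PyTorch (the framework choice, the toy vocabulary size, and the example word ids are assumptions for illustration, not part of the paper):

```python
import torch

def positional_encoding(n, d):
    """Sinusoidal position Embedding:
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)      # (n, 1) word positions
    i = torch.arange(0, d, 2, dtype=torch.float32)               # even dimensions 2i
    angle = pos / torch.pow(10000.0, i / d)                      # (n, d/2)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(angle)                               # even dimensions use sin
    pe[:, 1::2] = torch.cos(angle)                               # odd dimensions use cos
    return pe

# Toy setting (assumed): vocabulary of 1000 words, d = 512, a sentence of 4 word ids.
vocab_size, d = 1000, 512
word_embedding = torch.nn.Embedding(vocab_size, d)               # word Embedding, learned in training
word_ids = torch.tensor([5, 17, 42, 8])                          # placeholder word ids
X = word_embedding(word_ids) + positional_encoding(len(word_ids), d)  # (n, d) Transformer input
```

Each row of X here corresponds to one word's representation vector $x$, matching the matrix $X_{n \times d}$ described above.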

3. Self-Attention (self-attention mechanism)

[Figure: Transformer internal structure (from the paper)]
The figure above is the internal structure diagram of the Transformer from the paper, with the Encoder block on the left and the Decoder block on the right.

The part circled in red is Multi-Head Attention, which is composed of multiple Self-Attention layers. You can see that the Encoder block contains one Multi-Head Attention, while the Decoder block contains two Multi-Head Attentions (one of which is Masked). An Add & Norm layer follows each Multi-Head Attention: Add denotes a Residual Connection, used to prevent network degradation, and Norm denotes Layer Normalization, used to normalize the activations of each layer.

Because Self-Attention is the core of the Transformer, we focus on Multi-Head Attention and Self-Attention, starting with the internal logic of Self-Attention.

3.1 Self-Attention structure

[Figure: Self-Attention structure]

The figure above shows the structure of Self-Attention. The computation requires the matrices $Q$ (query), $K$ (key), and $V$ (value).

In practice, Self-Attention receives either the input (the matrix $X$ composed of the word representation vectors $x$) or the output of the previous Encoder block, while $Q$, $K$, $V$ are obtained by linear transformations of the Self-Attention input.

3.2 Calculation of Q, K, V

The input of Self-Attention is represented by the matrix $X$, and the linear transformation matrices $W_Q$, $W_K$, $W_V$ are used to compute $Q$, $K$, $V$. The calculation is shown in the figure below; note that each row of $X$, $Q$, $K$, $V$ represents one word.

[Figure: Computing Q, K, V from X using W_Q, W_K, W_V]
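
As a small sketch of this step (assuming PyTorch; the sizes n = 4 and d_k = 64 and the random input X are placeholders), the three linear transformations can be written as:

```python
import torch

n, d, d_k = 4, 512, 64                      # n words, model dimension d, projected dimension d_k (assumed)
X = torch.randn(n, d)                       # input matrix X, one row per word

W_Q = torch.nn.Linear(d, d_k, bias=False)   # linear transformation matrices W_Q, W_K, W_V
W_K = torch.nn.Linear(d, d_k, bias=False)
W_V = torch.nn.Linear(d, d_k, bias=False)

Q, K, V = W_Q(X), W_K(X), W_V(X)            # each row of Q, K, V still corresponds to one word
print(Q.shape, K.shape, V.shape)            # torch.Size([4, 64]) for each
```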

3.3 Output of Self-Attention

Once the matrices $Q$, $K$, $V$ are obtained, the output of Self-Attention can be computed with the following formula:

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the number of columns of the $Q$ and $K$ matrices, i.e. the vector dimension.

The formula computes the inner product between each row vector of $Q$ and each row vector of $K$; to prevent the inner products from becoming too large, they are divided by the square root of $d_k$. Multiplying $Q$ by the transpose of $K$ gives a matrix with $n$ rows and $n$ columns, where $n$ is the number of words in the sentence, and this matrix represents the attention strength between words. The figure below shows $QK^T$, where 1, 2, 3, 4 denote the words in the sentence.

[Figure: Calculation of Q times the transpose of K]

After obtaining $QK^T$, Softmax is used to compute each word's attention coefficients over all the other words. The Softmax in the formula is applied to each row of the matrix, so that every row sums to 1.

[Figure: Performing Softmax on each row of the matrix]

After the Softmax matrix is obtained, it is multiplied by $V$ to get the final output $Z$.

[Figure: Self-Attention output Z]

The first row of the Softmax matrix in the figure above contains the attention coefficients of word 1 with respect to all the words. The final output $Z_1$ of word 1 is the sum of every word $i$'s value $V_i$ weighted by these attention coefficients, as shown in the figure below:
[Figure: How Z_i is computed]

3.4 Multi-Head Attention

From the previous steps we know how to compute the output matrix $Z$ with Self-Attention, while Multi-Head Attention is formed by combining multiple Self-Attention layers. The figure below shows the structure of Multi-Head Attention from the paper:

[Figure: Multi-Head Attention structure]

From the figure above, you can see that Multi-Head Attention contains multiple Self-Attention layers. The input $X$ is first passed to $h$ different Self-Attention layers, which compute $h$ output matrices $Z$. The figure below shows the case $h = 8$, where 8 output matrices $Z$ are obtained.

[Figure: The h = 8 output matrices Z_1 to Z_8]

After obtaining the 8 output matrices $Z_1$ to $Z_8$, Multi-Head Attention concatenates them (Concat) and passes the result through a Linear layer to get the final output $Z$ of Multi-Head Attention.

[Figure: Multi-Head Attention output]
Notice that the matrix $Z$ output by Multi-Head Attention has the same dimensions as its input matrix $X$.

4. Encoder structure

[Figure: Transformer Encoder block]

The red part in the figure above is the Encoder block structure of the Transformer, which consists of Multi-Head Attention, Add & Norm, Feed Forward, and Add & Norm. Having covered the computation of Multi-Head Attention, let's now look at the Add & Norm and Feed Forward parts.

4.1 Add & Norm

The Add & Norm layer consists of Add and Norm. Its calculation formulas are as follows:

$$LayerNorm(X + MultiHeadAttention(X))$$

$$LayerNorm(X + FeedForward(X))$$

where $X$ denotes the input of Multi-Head Attention or Feed Forward, and $MultiHeadAttention(X)$ and $FeedForward(X)$ denote the corresponding output (the output has the same dimensions as the input $X$, so they can be added).

Add refers to $X + MultiHeadAttention(X)$, i.e. a residual connection, which is commonly used to ease the training of deep networks by letting each layer focus only on the residual it needs to learn; it is widely used in ResNet:

[Figure: Residual connection]

Norm refers to Layer Normalization, which is commonly used in RNN structures. Layer Normalization normalizes the inputs of each layer of neurons to the same mean and variance, which speeds up convergence.
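
In code, Add & Norm is just a residual addition followed by LayerNorm; a minimal sketch (PyTorch, with a plain Linear layer standing in for the sub-layer) is:

```python
import torch

d = 512
layer_norm = torch.nn.LayerNorm(d)
sublayer = torch.nn.Linear(d, d)          # stand-in for Multi-Head Attention or Feed Forward

X = torch.randn(4, d)                     # toy input (assumed)
out = layer_norm(X + sublayer(X))         # Add (residual connection), then Norm (Layer Normalization)
```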

4.2 Feed Forward

The Feed Forward layer is relatively simple: it is a two-layer fully connected network, where the first layer uses a ReLU activation and the second layer uses no activation. The corresponding formula is as follows:

$$\max(0, XW_1 + b_1)W_2 + b_2$$

where $X$ is the input; the output matrix of the Feed Forward layer has the same dimensions as $X$.
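
A sketch of the Feed Forward layer (PyTorch; the inner dimension 2048 is the value used in the paper, the input is a random toy matrix):

```python
import torch

d, d_ff = 512, 2048                       # d_ff = 2048 is the inner dimension used in the paper
W_1 = torch.nn.Linear(d, d_ff)            # first layer, followed by ReLU
W_2 = torch.nn.Linear(d_ff, d)            # second layer, no activation

X = torch.randn(4, d)                     # toy input (assumed)
out = W_2(torch.relu(W_1(X)))             # max(0, X W_1 + b_1) W_2 + b_2; same shape as X
```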

4.3 Composition of Encoder

An Encoder block can be constructed from the Multi-Head Attention, Feed Forward, and Add & Norm components described above. The Encoder block receives an input matrix $X_{(n \times d)}$ and outputs a matrix $O_{(n \times d)}$. An Encoder is formed by stacking multiple Encoder blocks.

The input of the first Encoder block is the representation vector matrix of the sentence's words, the input of each subsequent Encoder block is the output of the previous one, and the matrix output by the last Encoder block is the encoding information matrix $C$, which will be used later in the Decoder.

[Figure: The Encoder encoding the sentence information]

5. Decoder structure

[Figure: Transformer Decoder block]
The red part in the figure above is the Decoder block structure of the Transformer. It is similar to the Encoder block, but with some differences:

  • Contains two Multi-Head Attention layers;
  • The first Multi-Head Attention layer uses the Masked operation;
  • The $K$, $V$ matrices of the second Multi-Head Attention layer are computed from the Encoder's encoding information matrix $C$, while $Q$ is computed from the output of the previous Decoder block;
  • Finally there is a Softmax layer that calculates the probability of the next translated word.

5.1 The first Multi-Head Attention

The first Multi-Head Attention of the Decoder block uses the Masked operation, because translation proceeds sequentially: word $i + 1$ can only be translated after the first $i$ words have been translated. The Masked operation prevents word $i$ from seeing the information of word $i + 1$ and the words after it. Let's take the translation "I have a cat" as an example to understand the Masked operation.

The following description uses a concept similar to Teacher Forcing; readers who are not familiar with Teacher Forcing can refer to explanations of the Seq2Seq model. In the Decoder, the most likely current word must be predicted based on the previously translated words, as shown in the figure below. First, the word "I" is predicted with the translation start token as input, and then the next word "have" is predicted with the start token and "I" as input.

[Figure: Decoder prediction]

During training, the Decoder can use Teacher Forcing and be trained in parallel: the correct target word sequence ("I have a cat") and the corresponding expected outputs are passed to the Decoder, and when predicting the $i$-th output, the words at positions $i + 1$ and after are covered up. Note that the Mask operation is applied before the Softmax inside Self-Attention.

Step 1: The Decoder's input matrix and Mask matrix. The input matrix contains the representation vectors of five tokens, the start token followed by "I have a cat" (positions 0, 1, 2, 3, 4), and the Mask is a $5 \times 5$ matrix. From the Mask it can be seen that word 0 can only use the information of word 0, while word 1 can use the information of words 0 and 1; that is, each word can only use the information that comes before it.

[Figure: Input matrix and Mask matrix]

Step 2: The next operation is the same as ordinary Self-Attention: the matrices $Q$, $K$, $V$ are computed from the input matrix $X$, and then the product $QK^T$ of $Q$ and $K^T$ is computed.

[Figure: Q times the transpose of K]

Step 3: After obtaining $QK^T$, Softmax is applied to compute the attention scores; but before the Softmax, the Mask matrix must be used to block out the information after each word. The blocking operation is as follows:
[Figure: Applying the Mask before Softmax]

After obtaining Mask $QK^T$, Softmax is applied to it so that each row sums to 1; note that word 0's attention scores on words 1, 2, 3, and 4 are all 0.

Step 4: Multiply Mask $QK^T$ by the matrix $V$ to get the output $Z$; the output vector $Z_1$ of word 1 then contains information only from the words up to and including word 1.

[Figure: Output after the Mask]

Step 5: Through the steps above, the output matrix $Z_i$ of one Masked Self-Attention head is obtained. Then, just as in the Encoder, Multi-Head Attention concatenates the multiple outputs $Z_i$ and computes the output $Z$ of the first Multi-Head Attention, which has the same dimensions as the input $X$.
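
The five steps above can be condensed into a short sketch of Masked Self-Attention for one head (PyTorch; the 5 tokens and the random Q, K, V are placeholders):

```python
import torch

n, d_k = 5, 64                                          # 5 tokens: start token + "I have a cat" (assumed sizes)
Q, K, V = (torch.randn(n, d_k) for _ in range(3))       # stand-ins for the projected matrices

scores = Q @ K.T / d_k ** 0.5                           # QK^T / sqrt(d_k), shape (5, 5)
mask = torch.tril(torch.ones(n, n)).bool()              # lower-triangular Mask matrix
scores = scores.masked_fill(~mask, float('-inf'))       # block information after each word, before Softmax
weights = torch.softmax(scores, dim=-1)                 # row i only attends to words 0..i
Z = weights @ V                                         # row i of Z uses only words up to and including i
```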

5.2 The second Multi-Head Attention

The second Multi-Head Attention of the Decoder block is largely unchanged; the main difference is that the $K$, $V$ matrices of its Self-Attention are computed not from the output of the previous Decoder block but from the Encoder's encoding information matrix $C$.

$K$ and $V$ are computed from the Encoder's output $C$, while $Q$ is computed from the output $Z$ of the previous Decoder block (for the first Decoder block, the input matrix $X$ is used instead); the subsequent computation is the same as described before.

The advantage of this is that in the Decoder, every word can use the information of all the Encoder's words (this information needs no Mask).
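
A sketch of this second attention, where K and V come from the encoding matrix C and Q from the Decoder side (PyTorch; the source length 10, target length 5, and random tensors are assumptions):

```python
import torch

d, d_k = 512, 64
C = torch.randn(10, d)            # encoding matrix C from the Encoder (10 source words, assumed)
Z_prev = torch.randn(5, d)        # output of the previous (Masked) Multi-Head Attention in the Decoder

W_Q = torch.nn.Linear(d, d_k, bias=False)
W_K = torch.nn.Linear(d, d_k, bias=False)
W_V = torch.nn.Linear(d, d_k, bias=False)

Q = W_Q(Z_prev)                   # Q comes from the Decoder side
K, V = W_K(C), W_V(C)             # K, V come from the Encoder's encoding matrix C

attn = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1) @ V   # no Mask: every target word sees all source words
```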

5.3 Softmax predicts output words

The last part of the Decoder block uses Softmax to predict the next word. The previous layers have produced a final output $Z$; because of the Mask, the output $Z_0$ of word 0 contains only the information of word 0, as follows:

[Figure: Z before the Decoder Softmax]

Softmax predicts the next word based on each row of the output matrix:

[Figure: Decoder Softmax prediction]

This completes the Decoder block. Like the Encoder, the Decoder is composed of multiple stacked Decoder blocks.

6. Summary of Transformer

  • Unlike an RNN, the Transformer can be trained in parallel more effectively;
  • The Transformer itself cannot use the order information of words, so a position Embedding must be added to the input; otherwise the Transformer is just a bag-of-words model;
  • The core of the Transformer is the Self-Attention structure, in which the $Q$, $K$, $V$ matrices are obtained by linear transformations of the input;
  • Multi-Head Attention in the Transformer contains multiple Self-Attention layers, which can capture the attention scores (correlation coefficients) between words along multiple different dimensions.


Origin: blog.csdn.net/wzk4869/article/details/130504630