Transformer: "Attention Is All You Need"

Table of contents

0. Transformer introduction

1. Self-attention and Multi-head self-attention

1.1 Self-attention (self-attention mechanism)

1.2 Multi-head self-attention (multi-head self-attention mechanism)

2. Network structure

2.1 Encoder

2.2 Decoder

2.3 Position-wise Feed-Forward Networks (FFN)

2.4 Positional Encoding

2.5 Add & Norm

3. Reference blog posts


0. Transformer introduction

In 2017, Vaswani et al. proposed the Transformer model in "Attention Is All You Need". It was the first model to rely entirely on self-attention to compute representations of its input and output, and it has since been widely used in natural language processing.

Paper Title: "Attention Is All You Need" 

Paper address: https://arxiv.org/abs/1706.03762

The Transformer is a typical Seq2Seq model. The core idea of a Seq2Seq model is to encode a sequence (such as a piece of text) into a fixed-size vector representation and then decode it into another sequence. Such models usually consist of an encoder and a decoder, responsible for encoding the input sequence into a hidden representation and for decoding that hidden representation into an output sequence, respectively. The Transformer follows this overall architecture: both the encoder and the decoder use stacked self-attention and feed-forward layers, as shown in the following figure:

1. Self-attention and Multi-head self-attention

1.1 Self-attention (self-attention mechanism)

The attention mechanism can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key. The attention formula is:

Attention\left ( Q,K,V \right )=softmax\left ( \frac{QK^{T}}{\sqrt{d_{k}}} \right )V

1) Initialize and generate q, k, v:

The following figure shows an example. The input consists of three nodes a1, a2, a3. The embedding layer converts the three inputs into vectors x1, x2, x3, which are then multiplied by the weight matrices Wq, Wk, Wv (these three matrices are trainable and shared across positions) to obtain qi, ki, vi, as shown in the figure below. Here q stands for query, which will later be matched against every k; k stands for key, which will later be matched by every q; and v represents the information extracted from the input vector x.

2) Calculate the relevance score:

The second step of self-attention is to calculate the relevance score, which corresponds to the QK^{T} part of the attention formula. The score is obtained as the dot product of the query vector (query) and the key vector (key) of the word being scored, and it represents the degree of correlation between the two vectors: the larger the value, the stronger the correlation.

Dot product formula: \vec{a}\cdot \vec{b}=\left | \vec{a} \right |\left | \vec{b} \right |\cos \theta

When calculating the scores, each qi is dotted with every kj to obtain a score value. For example, q1 is dotted with k1, k2, and k3 respectively, producing three score values.

3) Scale and normalize the score:

The original paper points out that when the dot products become large, the softmax is pushed into regions with very small gradients, so a scaling operation is applied to make training more stable. The scaling operation is the division by \sqrt{d_{k}} in the attention formula, where d_{k} is the dimension of q and k; it is 64 in the paper, so the divisor is 8 after taking the square root.

After scaling the scores, Softmax is used to normalize them so that they are all positive and sum to 1. This is equivalent to computing a weight for each v: the softmax score determines how important each position in the sentence is for the current position.

4) Redistribute the importance of each word:

Multiply each value vector (values) by the corresponding softmax score in order to redistribute the importance of each word; this is the multiplication of the softmax output by V in the attention formula, and the result is the attention value. For example, the output of the word "a1" after self-attention is 0.70 x v1 + 0.18 x v2 + 0.12 x v3. In other words, after the sentence passes through self-attention, the representation of "a1" contains 70% of its own meaning, 18% of the meaning of the word "a2", and 12% of the meaning of the word "a3". This is how self-attention captures the relationships in the textual context.

5) Summary: 

In actual computation, the input word vectors x1, x2, x3 are stacked into a matrix, and the Q, K, and V obtained through the weight matrices are likewise matrices, which allows faster processing. As shown in the figure above, Q, K, and V are the query matrix, key matrix, and value matrix, respectively, and d_k is the vector dimension. The formula weights V by the similarity between Q and K, and the weighted sum of V is the output matrix of the attention mechanism.
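To make the matrix form concrete, here is a minimal NumPy sketch of Attention(Q, K, V) = softmax(QK^{T} / \sqrt{d_{k}})V. The embedding values and the weight matrices Wq, Wk, Wv below are random placeholders, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # relevance scores, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)        # each row is positive and sums to 1
    return weights @ V, weights               # weighted sum of the value vectors

# Three input tokens (a1, a2, a3), each embedded as a 4-dim vector x_i,
# projected by placeholder "trained" weight matrices Wq, Wk, Wv.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)                  # (3, 4) (3, 3)
```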

1.2 Multi-head self-attention (multi-head self-attention mechanism)

The authors point out in the paper that multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, whereas with a single attention head, averaging inhibits this.

Therefore, in practice, multi-head attention (Multi-Head Attention) is almost always used. The expression for multi-head self-attention is as follows:

For multi-head attention, we have multiple sets of query vectors (queries), key vectors (keys), and value vectors (values). Each set of q, k, v vectors is called a head. The results produced by the individual heads are concatenated (concat), and the concatenated result is multiplied by an additional weight matrix W^{O} (learnable parameters) for fusion, which yields the final output matrix.

Eight attention heads are used in the paper, and the parameters of each head are trainable. After training, this expands the model's ability to focus on different positions of the input. The eight heads each produce an output, but we only need a single output, so we need a way to compress these eight outputs into one matrix. The method is simple: concatenate them and multiply by an additional weight matrix W^{O}. This operation can be implemented as a single linear layer, as shown in the following figure:
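The following is a minimal NumPy sketch of this multi-head computation: the model dimension is split across h heads, each head runs scaled dot-product attention on its own slice, the head outputs are concatenated, and the concatenation is fused with W^{O}. The sizes and all weight matrices here are illustrative placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    seq_len, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                                 # (seq_len, d_model)
    split = lambda M: M.reshape(seq_len, h, d_k).transpose(1, 0, 2)  # -> (h, seq_len, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)               # (h, seq_len, seq_len)
    heads = softmax(scores) @ Vh                                     # one output per head
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)      # concat the h heads
    return concat @ Wo                                               # fuse with W^O

rng = np.random.default_rng(0)
seq_len, d_model, h = 3, 8, 2
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)              # (3, 8)
```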

2. Network structure

2.1 Encoder

The encoder is the component responsible for encoding the input information into a feature vector. The input depends on the task: it can be text or an image, but neither can be fed into the encoder directly; it must first be vectorized (embedded) into vectors before being sent to the encoder.

The encoder mainly converts the input sequence into a fixed-length vector representation. The Encoder in the Transformer consists of 6 identical layers, and each layer contains 2 parts:

1) Multi-Head Self-Attention (multi-head self-attention layer)

2) Feed-Forward Neural Network (feed-forward layer)

In the encoder block, there is a residual connection around the self-attention layer, followed by Layer Norm (layer normalization). The normalized output is then passed through the feed-forward network (FFN) for further processing. The feed-forward network is essentially a small stack of fully connected layers with a ReLU activation in the middle, and it too is wrapped by a residual connection, like the MLP block described in Section 2.3 below.

In the network, residual connections help gradients propagate backwards, allowing the model to converge faster and better, while layer normalization stabilizes the network and alleviates the numerical instability of deep models.
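As a rough illustration of where the residual connections and Layer Norm sit, here is a simplified NumPy sketch of one encoder layer (self-attention, Add & Norm, FFN, Add & Norm). Dropout and the learnable scale/shift of layer normalization are omitted, the attention sub-layer is passed in as a function (just a placeholder here), and all weights are random.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each token vector over its own features (gain/bias omitted)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2      # two linear layers, ReLU in between

def encoder_layer(x, attention, ffn_params):
    x = layer_norm(x + attention(x))                 # residual connection around attention
    x = layer_norm(x + ffn(x, *ffn_params))          # residual connection around the FFN
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 3
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))

identity_attention = lambda t: t                     # stand-in for multi-head self-attention
print(encoder_layer(x, identity_attention, (W1, b1, W2, b2)).shape)   # (3, 8)
```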

2.2 Decoder

The decoder serves generation tasks; for a discriminative task, no decoder is needed. The decoder interprets the output of the encoder in order to generate the target sequence we want.

The decoder generates the output sequence from the vectors produced by the encoder. The Decoder is also composed of 6 identical layers, and each layer contains 3 parts:

1) Masked Multi-Head Attention (masked multi-head self-attention layer)

Its calculation principle is the same as the Encoder's Multi-Head Attention, but an additional mask is applied. The masking mechanism in the Transformer is used to prevent the model from accessing information it should not access when processing sequences. Two kinds of masks are involved: the padding mask and the sequence mask.

a. Padding mask

In natural language processing tasks, in order to feed sentences of different lengths into the model, we usually need to pad the shorter sentences to be the same length as the longest sentence. Padding is usually represented using special symbols such as <pad>.

The purpose of the padding mask is to ignore these padding positions during the self-attention computation. The padding symbols do not carry any meaningful information, and we do not want them to affect the attention weights between the other words. The padding mask is implemented by setting the attention logits at the padding positions to a very large negative number, so that after the softmax function is applied, the attention weights of the padded positions are close to zero.
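A small sketch of that idea, using a made-up <pad> token id and toy attention logits: the columns belonging to padded key positions get a large negative value before the softmax, so their weights end up near zero.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

PAD_ID = 0                                             # assumed id of the <pad> token
token_ids = np.array([5, 12, 7, PAD_ID, PAD_ID])       # sentence padded to length 5

pad_mask = token_ids == PAD_ID                         # True at the padded positions
logits = np.random.default_rng(0).normal(size=(5, 5))  # toy attention logits
logits[:, pad_mask] = -1e9                             # mask the padded key columns

weights = softmax(logits)
print(weights[:, pad_mask].max())                      # ~0: nothing attends to <pad>
```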

b. Sequence mask

The sequence mask prevents the decoder from seeing future information. For a sequence, at time step t the decoder's output should depend only on the outputs before t, not on the outputs after t, so we need a way to hide the information after t. This matters during training, where the complete target sequence is fed into the decoder at once; it is not needed during prediction, because at prediction time we only have the outputs predicted at previous steps.

The sequence mask is implemented by adding a mask matrix whose upper-triangular part is negative infinity to the attention logits. After the softmax function is applied, the attention weights of the words after the current position are then close to zero, so at each time step the decoder only attends to the current and preceding words.
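A small sketch of the sequence mask: a matrix whose entries above the diagonal are negative infinity is added to the (toy) attention logits, so after the softmax each position only attends to itself and earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
# -inf strictly above the diagonal, 0 on and below it
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

logits = np.zeros((seq_len, seq_len)) + causal_mask    # toy logits, all equal before masking
print(softmax(logits))   # lower-triangular weights: each row attends only to the past
```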

c. Summary

The padding mask is used to ignore the effect of padding symbols, while the sequence mask ensures that the decoder follows the autoregressive principle during the generation process. By masking, we can make the model more stable and reliable when processing sequences. The Multi-Head Attention in the Encoder also needs to be masked, but only the padding mask is required in the Encoder, while the padding mask and sequence mask are required in the Decoder.

2) Encoder-Decoder Multi-Head Attention (encoder-decoder multi-head attention layer)

The Multi-Head Attention in the Encoder is self-attention, whereas this second Multi-Head Attention in the Decoder is ordinary (cross) attention: its queries come from the output of the Masked Multi-Head Attention layer, while its keys and values come from the output of the last Encoder layer.
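A minimal sketch of this encoder-decoder (cross) attention with random placeholder projections: the queries are computed from the decoder states, the keys and values from the encoder output, so each target position produces a weighted sum over the source positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model = 8
enc_out = rng.normal(size=(5, d_model))    # encoder output, source length 5
dec_x = rng.normal(size=(3, d_model))      # decoder states, target length 3
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = dec_x @ Wq                             # queries from the decoder
K, V = enc_out @ Wk, enc_out @ Wv          # keys and values from the encoder
weights = softmax(Q @ K.T / np.sqrt(d_model))
print((weights @ V).shape)                 # (3, 8): one output per target position
```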

3) Feed-Forward Network (feed-forward layer)

It is the same as in the encoder and is implemented as the MLP block described in Section 2.3.

The final output of the decoder first undergoes a linear transformation, and Softmax then produces the output probability distribution. The word with the highest probability is looked up in the dictionary and emitted as the predicted output.
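A toy sketch of this last step with a made-up six-word vocabulary and a random projection matrix: the decoder vector is projected to vocabulary size, softmax turns it into a probability distribution, and the argmax index is looked up in the vocabulary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 6
id2word = ["<pad>", "I", "like", "cats", "dogs", "<eos>"]   # toy vocabulary
W_out = rng.normal(size=(d_model, vocab_size))              # final linear projection

dec_vec = rng.normal(size=(d_model,))       # decoder output for one position
probs = softmax(dec_vec @ W_out)            # probability distribution over the vocabulary
print(id2word[int(np.argmax(probs))])       # emit the highest-probability word
```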

2.3 Position-wise Feed-Forward Networks (FFN)

In the Transformer, the Feed-Forward (FFN) layer is an MLP that is applied independently to each vector in the sequence after the self-attention mechanism. The expression of the FFN layer is as follows:

FFN\left ( x \right )=\max \left ( 0,xW_{1}+b_{1} \right )W_{2}+b_{2}

The MLP block consists of two fully connected layers and a nonlinear activation function (such as ReLU or GELU). The first fully connected layer expands the number of nodes by a factor of 4; because a residual connection is added at the output, the second fully connected layer restores the original number of nodes, as shown in the following figure:
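A sketch of this MLP block with the paper's sizes (d_model = 512, d_ff = 2048, i.e. a 4x expansion), ReLU as the activation, and random placeholder weights; the second layer brings the dimension back to d_model so the residual connection can be added.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 3            # 2048 = 4 * 512
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
print(ffn(x, W1, b1, W2, b2).shape)              # (3, 512): dimension restored
```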

2.4 Positional Encoding 

The self-attention computation itself does not take position information into account. Since we want the model to distinguish the individual elements of the input sequence, Positional Encoding is applied at the input stage of both the encoder and the decoder. A positional encoding is a vector representing the information of each position in the sequence. Its dimension is the same as that of the input vectors, so the two can be added element-wise, preserving the position information.

Positional encodings can be fixed (such as those based on sine and cosine functions) or learnable (vectors obtained through training). In the original Transformer paper, the authors used a fixed positional encoding based on sine and cosine functions. For a given position i and encoding dimension 2k (or 2k+1), the positional encoding is computed as follows:

PE_{\left ( i,2k \right )}=\sin \left ( \frac{i}{10000^{2k/d}} \right )

PE_{\left ( i,2k+1 \right )}=\cos\left ( \frac{i}{10000^{2k/d}} \right )

Here, PE(i,2k) and PE(i,2k+1) are the values in row i, columns 2k and 2k+1 of the positional encoding matrix, and d is the dimension of the input vector. With this formula we can generate a positional encoding vector for each position in the input sequence; it follows a fixed pattern and represents where that element sits in the sequence.
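A direct NumPy implementation of these two formulas (sin on the even dimensions, cos on the odd dimensions); the sequence length and dimension below are arbitrary.

```python
import numpy as np

def positional_encoding(seq_len, d):
    pe = np.zeros((seq_len, d))
    pos = np.arange(seq_len)[:, None]              # position i = 0 .. seq_len-1
    two_k = np.arange(0, d, 2)[None, :]            # even dimension indices 2k
    angle = pos / np.power(10000, two_k / d)
    pe[:, 0::2] = np.sin(angle)                    # PE(i, 2k)
    pe[:, 1::2] = np.cos(angle)                    # PE(i, 2k+1)
    return pe

pe = positional_encoding(seq_len=10, d=16)
print(pe.shape)        # (10, 16): same dimension as the input embeddings
# The encodings would then be added element-wise to the embeddings: x = x + pe
```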

In order to add positional encoding to the input sequence, we can add each word vector in the input sequence to the corresponding positional encoding vector to obtain an input vector containing positional information, as shown in the following figure:

2.5 Add & Norm

1. Add

Add means adding a residual connection on top of the computed result. The purpose of the residual connection is to prevent the degradation problem in deep neural network training: as the number of layers increases, the loss first decreases and then saturates, but if the depth keeps increasing, the loss rises again.

2. Norm

Before a neural network is trained, the input data is usually normalized, for two purposes:

1. It can speed up the training.

2. Improve the stability of training. 

Layer Normalization is used in the Transformer because, in the original paper, the Transformer is applied to NLP, where text lengths vary; layer normalization operates within each sequence sample, which works better here than batch normalization.

The processing object of Batch Normalization is a batch of samples, while the processing object of Layer Normalization is a single sample. Batch Normalization normalizes the same feature dimension across the batch of samples, whereas Layer Normalization normalizes all feature dimensions of a single sample.
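A rough NumPy sketch of the different normalization axes on a (batch, seq_len, d_model) tensor; the learnable scale and shift parameters of both methods are omitted.

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(2, 3, 4))   # (batch, seq_len, d_model)

def normalize(t, axis):
    return (t - t.mean(axis=axis, keepdims=True)) / (t.std(axis=axis, keepdims=True) + 1e-6)

# BatchNorm-style: each feature is normalized over the batch and all positions
batch_norm_like = normalize(x, axis=(0, 1))
# LayerNorm-style: each token vector is normalized over its own features
layer_norm_like = normalize(x, axis=-1)

print(layer_norm_like.mean(axis=-1).round(6))          # ~0 mean per token vector
```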


3. Reference blog posts

1. Encoder-Decoder - qq_47537678's blog, CSDN Blog

2. Encoder-Decoder Model - Junior School of Magic

3. Transformers (Transformer) - Elementary School of Magic

4. Detailed explanation of Self-Attention and Multi-Head Attention in Transformer

5. The difference between batchNormalization and layerNormalization - Zhihu

6. The most beginner-friendly detailed explanation of the Transformer - Stink1995's blog, CSDN Blog
