Transformer process analysis and detailed notes

Transformer process

Inputs are the input data, for example a sentence in NLP. The Input Embedding converts it into numerical vectors, and Positional Encoding then adds position information to those vectors. The result enters the Encoder, which is composed of N stacked layers. Each layer first passes through the Multi-Head Attention layer, which computes self-attention; the following Add & Norm layer applies a residual connection and normalization. Unlike in CNNs, where batch normalization is typical, the normalization here is layer normalization. The result is then sent to the Feed Forward layer, which is essentially an MLP.
The sub-layer structure of the Decoder is basically the same as that of the Encoder, with two notable modifications. First, the multi-head attention layer adds a mask: in NLP, only part of a sentence is known at prediction time and the rest must be predicted, so the mask blocks out the tokens that should not yet be visible. Second, after the masked multi-head self-attention layer and its residual-and-normalization layer, the Decoder adds a cross-attention layer: its Q comes from the Decoder's Masked Multi-Head Attention, while its K and V come from the Encoder's output.
After the Encoder and Decoder, the result is mapped through a linear layer, and finally a softmax produces the output. The attention mechanism and each module of the Transformer are introduced in detail below.
Transformer network structure diagram

Attention mechanism

The attention mechanism draws on human attention: in the human visual system there is a selective attention mechanism.

The visual attention mechanism is a signal-processing mechanism specific to human vision. The eye quickly scans the whole image to find the region that deserves focus, generally called the focus of attention, and then devotes more attention resources to that region to obtain detailed information about the target while suppressing other, useless information.

This is a means for humans to quickly screen out high-value information from a large amount of information using limited attention resources, a survival mechanism formed over the long course of human evolution. The human visual attention mechanism greatly improves both the efficiency and the accuracy of visual information processing.
Figure 1 Human visual attention

Figure 1 visualizes how humans efficiently allocate limited attention resources when looking at an image: the red areas indicate the targets the visual system attends to most. For the scene in Figure 1, people clearly devote more attention to faces, to the title of the text, and to the first sentence of the article.

The attention mechanism in deep learning is essentially similar to the selective visual attention of human beings. The core goal is to select, from a large amount of information, the information that is most critical to the current task.

What is the ultimate goal of self-attention?
Given the current input sample $x_i \in \mathbb{R}^{1\times d}$ (we break the input apart for easier understanding), produce an output that is a weighted sum of all samples in the sequence. The output is assumed to be able to see the information of all input samples, and it chooses its own points of attention according to the different weights.

The attention mechanism in the Transformer includes self-attention and cross-attention. In self-attention, Q = K = V = X, i.e., attention(Q, K, V) = attention(X, X, X); in cross-attention, Q comes from a different source, i.e., Q ≠ K = V (or Q ≠ K ≠ V).

Imagine the constituent elements of the Source as a series of <Key, Value> pairs. Given a Query element from the Target, the similarity (correlation) between the Query and each Key is computed to obtain a weight coefficient for the corresponding Value; the Values are then summed with these weights to produce the final attention value. In essence, the attention mechanism is a weighted sum of the Values of the elements in the Source, with the Query and the Keys used to compute the weight of each Value.
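As a concrete illustration, here is a minimal NumPy sketch of this weighted sum; the query, keys and values are toy numbers invented for the example:

```python
import numpy as np

# Toy "Source": three <Key, Value> pairs, plus one Query from the Target.
query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])
values = np.array([[10.0, 0.0],
                   [0.0, 10.0],
                   [5.0, 5.0]])

# 1. similarity between the Query and each Key (dot product here)
scores = keys @ query                            # shape (3,)

# 2. turn similarities into weights that sum to 1
weights = np.exp(scores) / np.exp(scores).sum()  # softmax

# 3. the attention output is the weighted sum of the Values
output = weights @ values
print(weights, output)
```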

Schematic diagram of the Attention mechanism
For more information about the attention mechanism in deep learning, see the blog post "Attention Mechanism in Deep Learning (2017 Edition)" in the reference links below.

Module analysis

Embedding and Positional Encoding

Embedding

Simply put, an embedding uses a numerical vector to "represent" an object, where the object can be a word, a commodity, a movie, and so on. The key property of embedding vectors is that objects whose vectors lie close to each other have similar meanings. If you have doubts about this similarity property, see the reference links at the end of this article.
Embedding diagram

Embedding encodes objects with low-dimensional vectors while retaining their meaning, which makes it very suitable for deep learning.

First, embedding is a powerful tool for dealing with sparse features. Traditional machine-learning models often use one-hot encoding for discrete features, especially id-type features. Recommendation scenarios contain many categorical and ID features, so extensive one-hot encoding makes the sample feature vectors extremely sparse, and the structure of deep-learning models is not well suited to processing sparse feature vectors. Therefore almost all deep-learning recommendation models use an Embedding layer to convert sparse high-dimensional feature vectors into dense low-dimensional ones, and embedding techniques are a basic building block of deep-learning recommendation models.

Second, embedding can incorporate a large amount of valuable information and is itself an extremely important feature vector. Compared with feature vectors produced directly from the raw information, embeddings have stronger expressive power, especially since Graph Embedding techniques were proposed: embedding can encode almost any kind of information, so a pre-trained embedding vector is itself an extremely valuable feature.
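For illustration, a minimal PyTorch sketch of an embedding lookup; the vocabulary size, dimension and token ids are made-up example values:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512        # example sizes, not from the original post

# One-hot encoding would represent each token as a 10000-dimensional sparse vector.
# An embedding layer instead looks up a dense 512-dimensional vector per token id.
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[3, 17, 256, 9]])     # a batch with one 4-token sentence
dense = embedding(token_ids)                    # shape: (1, 4, 512)
print(dense.shape)
```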

Positional Encoding

In natural language processing, sentences composed of the same words can express completely different meanings in different orders. Text is sequential data, and the order of the words often affects the meaning of the whole sentence. For example, "I love you" and "You love me" are composed of the same words but express completely different meanings. The Transformer works very well and can process a sentence in parallel, that is, all the words of a sentence are input and computed at the same time, which greatly speeds up computation. But here comes the problem: parallel computation is good, but how do we let the model know the order of the words in a sentence? This leads us to Positional Encoding.

The purpose of position encoding is to embed the position information into the input vector to ensure that the position information is not lost during the calculation. There are two mainstream methods for positional encoding of the transformer model: absolute positional encoding and relative positional encoding.

Absolute position encoding directly initializes a random position embedding for each position, adds it to the word embedding before feeding the model, and trains it as a parameter. For the model, each element of the input sequence carries a "position label" indicating its absolute position.
Schematic diagram of absolute position coding
Absolute position encoding gives each position a different encoding, but the relationship between positions is not explicitly reflected. Relative position encoding explicitly reflects the relationship between positions and tells the model the distance between two elements.

The initial version of Transformer uses absolute position encoding, implemented with trigonometric functions.
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
There are many ways to encode absolute positions. The simplest is to count: given a text of length T, use 0, 1, 2, …, T-1 as the position encodings of the words. This has two disadvantages: 1. for long sentences, the positional encodings of later words are much larger than those of earlier words, which inevitably skews the feature values after they are merged with the word embeddings; 2. the values of such position encodings are larger than typical word-embedding values, which may interfere with the model.

To avoid these problems, one could normalize the counts so that all encodings fall in [0, 1]. But the problem is also obvious: the step size then depends on the text length, so two immediately adjacent words in a short text get a different encoding difference than two adjacent words in a long text, and the same distance is encoded inconsistently across sentences.

The advantages of using trigonometric functions are: 1. the PE values are bounded, falling in the [-1, 1] interval; 2. the PE of a character at a given position is the same in different sentences (for example, the encoding for pos = 0 is always identical).

Because a single trigonometric function is periodic, different pos values could map to the same PE value. To avoid this, the PE vector is made as long as the word embedding, and sin and cos with different frequencies are used alternately across the dimensions, which gives the calculation formula from the paper shown above.
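A small NumPy sketch of that formula (assuming an even d_model; the function name is ours):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cos
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
# The encoding is added to the word embeddings: x = embedding + pe[:seq_len]
```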

Why is the positional encoding added to the input vector instead of concatenated? Because the Transformer usually embeds the raw input first, mapping it to the required dimension, which can be realized as a matrix product with a transformation matrix; the input x discussed here is already this transformed representation rather than the raw input.

In fact, concatenating a vector that represents position information onto the original input is equivalent, after a linear transformation, to directly adding positional information to the transformed input. Let us try adding the positional encoding to the original input by concatenation:
[Figure: comparison of concatenating vs. adding the positional encoding]
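The argument can be sketched as follows (with $W_x$ and $W_p$ standing for hypothetical blocks of the transformation matrix):

$$W \begin{bmatrix} x \\ p \end{bmatrix} = \begin{bmatrix} W_x & W_p \end{bmatrix} \begin{bmatrix} x \\ p \end{bmatrix} = W_x x + W_p p$$

That is, applying a linear transformation to the concatenation $[x; p]$ is the same as transforming $x$ and $p$ separately and adding the results, so adding a (transformed) positional encoding loses no expressive power compared with concatenation.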

Encoder

Multi-Head Attention

In the first layer of the Encoder, the input to the multi-head self-attention layer is a feature-vector matrix (for example, a matrix composed of word vectors); in the later layers, the input is the output of the previous layer. A multi-head self-attention layer consists of multiple self-attention heads, and its output is the concatenation of the outputs of those heads.

Let's first look at the operation of a single self-attention head. When the self-attention layer receives its input, it applies three different linear transformations to generate the matrices Q (Query), K (Key) and V (Value). The weight matrices $W_Q$, $W_K$, $W_V$ are learnable parameters.
$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$
Then calculate the output according to the following formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
$d_k$ is the number of columns of the $Q$ and $K$ matrices, i.e., the dimension of the query/key vectors.

The operation of this calculation process is also called Scaled Dot-Product Attention.

So far, we have obtained the output of a single self-attention head. Multi-head self-attention runs several heads, with the number of heads usually set to 8; this gives 8 self-attention outputs, which are concatenated and then passed through one more linear transformation.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
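A compact PyTorch sketch of scaled dot-product attention and the multi-head wrapper (a hand-written illustration, not the original code; the default sizes d_model = 512 and 8 heads follow the paper):

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)
    return weights @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final linear after concatenation

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # project, then split d_model into (heads, d_k)
        q = self.w_q(q).view(b, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(b, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(b, -1, self.h, self.d_k).transpose(1, 2)
        out = scaled_dot_product_attention(q, k, v, mask)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)  # concat heads
        return self.w_o(out)

x = torch.randn(2, 10, 512)      # (batch, seq_len, d_model)
attn = MultiHeadAttention()
y = attn(x, x, x)                # self-attention: Q = K = V = x
```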

Feed Forward Network (FFN)

The essence of the FFN is two fully connected layers with a ReLU after the first one and Dropout between the two layers; it is equivalent to an MLP.
$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
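A minimal PyTorch sketch of this FFN (d_ff = 2048 as in the original paper; the dropout rate is an assumption):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    # Two linear layers with ReLU in between, plus Dropout.
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)
```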

Add & Norm

This layer contains two operations, residual add and layer normalization.

Residual Add

Assume that the input to a sub-layer is x and that the residual connection is computed after the sub-layer. The residual computation can be expressed as
$$y = x + \mathrm{Sublayer}(x)$$
where $y$ is the output of the residual module and $\mathrm{Sublayer}(x)$ is the output of the sub-layer.

Layer Normalization

The idea of Layer Normalization is very similar to Batch Normalization, except that Batch Normalization normalizes each neuron over the samples of a mini-batch, while Layer Normalization normalizes all the neurons of a single sample within a layer.
[Figure: comparison of Batch Normalization and Layer Normalization]
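A minimal PyTorch sketch of the Add & Norm step (the class name and the dropout placement are our own choices):

```python
import torch.nn as nn

class AddNorm(nn.Module):
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # layer normalization over the feature dimension
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_out):
        # residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))
        return self.norm(x + self.dropout(sublayer_out))
```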

Encoder summary

The sub-layers and the attention mechanism of the Encoder have been described in some detail above. The complete Encoder is a stack of N identical layers; the original Transformer code uses N = 6.
[Figure: the Encoder as a stack of N identical layers]
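A sketch of the stacked Encoder, assuming the MultiHeadAttention, FeedForward and AddNorm sketches above are in scope:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # one Encoder layer: self-attention -> Add & Norm -> FFN -> Add & Norm
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model)
        self.addnorm1 = AddNorm(d_model)
        self.addnorm2 = AddNorm(d_model)

    def forward(self, x, mask=None):
        x = self.addnorm1(x, self.self_attn(x, x, x, mask))
        return self.addnorm2(x, self.ffn(x))

class Encoder(nn.Module):
    def __init__(self, num_layers=6, d_model=512):
        super().__init__()
        self.layers = nn.ModuleList([EncoderLayer(d_model) for _ in range(num_layers)])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x

encoder = Encoder()
out = encoder(torch.randn(2, 10, 512))   # (batch, seq_len, d_model)
```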

Decoder

The structure of the Decoder is basically the same as that of the Encoder. The main changes are that the multi-head attention layer adds a mask, and that an additional multi-head attention layer is inserted before the Decoder's feed-forward layer, whose input differs from that of the first (masked) multi-head attention.

The role of the mask: in NLP, when producing predictions, only part of the ground-truth sentence should be known and the rest must be predicted; but the Transformer processes the whole sentence in parallel. To make training match the inference process and obtain correct behaviour, the model introduces a mask to hide the parts that should not yet be known, as in the sketch below.
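A minimal PyTorch sketch of such a mask (a lower-triangular "causal" mask: each position can only attend to itself and earlier positions):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular matrix: position i may attend to positions 0..i only.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```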

In the newly added multi-head attention (cross-attention) layer, K and V come from the final output of the Encoder, while Q comes from the output of the previous Decoder sub-layer; V is weighted according to the similarity between Q and K.
[Figure: Decoder layer with masked multi-head self-attention followed by cross-attention]
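A sketch of a Decoder layer with masked self-attention followed by cross-attention, again assuming the sketches above are in scope:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    # masked self-attention -> Add & Norm -> cross-attention -> Add & Norm -> FFN -> Add & Norm
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model)
        self.addnorm1 = AddNorm(d_model)
        self.addnorm2 = AddNorm(d_model)
        self.addnorm3 = AddNorm(d_model)

    def forward(self, x, enc_out, tgt_mask=None):
        # masked self-attention over what the Decoder has produced so far
        x = self.addnorm1(x, self.masked_self_attn(x, x, x, tgt_mask))
        # cross-attention: Q from the Decoder, K and V from the Encoder output
        x = self.addnorm2(x, self.cross_attn(x, enc_out, enc_out))
        return self.addnorm3(x, self.ffn(x))
```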

1. What is the input of the Decoder?

The Decoder's input differs between training and testing. In training mode the Decoder's input is the ground truth: regardless of what the model outputs, the correct answer is fed in as the next input. This mode is called teacher forcing. In test mode there is no ground truth to act as teacher, so the tokens that have already been produced (note: "produced" here means completing the whole Decoder computation to obtain a prediction, not the output of a single Decoder layer) become the input for the next Decoder step. I think this is also the meaning of "shifted right" in the paper: the input keeps shifting to the right.
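A tiny illustration of teacher forcing with a "shifted right" input (the token ids and the BOS/EOS values are invented for the example):

```python
import torch

# Hypothetical target sentence as token ids, with BOS = 1 and EOS = 2.
target = torch.tensor([1, 57, 994, 13, 2])

decoder_input = target[:-1]   # [BOS, 57, 994, 13]  -> "shifted right" Decoder input
labels = target[1:]           # [57, 994, 13, EOS]  -> what the Decoder should predict
```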

2. Is the Decoder calculated in parallel?

One of the most discussed advantages of the Transformer over RNN-type models is parallel computation, but full parallelism applies to the Encoder, where all words are input and computed together, and to the Decoder only during training (thanks to the mask). At inference time the Decoder still consumes tokens one by one, like an RNN: the Q computed from the tokens that have appeared so far is combined with the K and V produced by the Encoder, the result passes through all Decoder layers and then through FC + Softmax, and the predicted token is fed back as the next Decoder input. This repeats until the END tag is obtained.
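A sketch of this autoregressive loop as greedy decoding; model.encode and model.decode are hypothetical methods standing in for a full model:

```python
import torch

def greedy_decode(model, src, bos_id=1, eos_id=2, max_len=50):
    """Sketch of inference: feed the tokens generated so far back into the Decoder."""
    memory = model.encode(src)                       # the Encoder runs once
    ys = torch.tensor([[bos_id]])                    # start with the BOS token
    for _ in range(max_len):
        logits = model.decode(ys, memory)            # full Decoder pass over ys
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_token], dim=1)      # append prediction as new input
        if next_token.item() == eos_id:              # stop at the END tag
            break
    return ys
```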

3. Interaction between Encoder and Decoder

The interaction between the Encoder and the Decoder is not a layer-to-layer correspondence. Instead, after all 6 Encoder layers have been computed, the resulting K and V are passed to every Decoder layer and combined with that layer's Q. Other ways of connecting the Encoder and the Decoder are also possible.

Summary

The Transformer extracts features with the Encoder. During training, the Decoder's input is the ground truth together with the Encoder output, so all ground-truth tokens can be fed to the Decoder at the same time. During inference, the Decoder's input is its own previous output together with the Encoder output, so tokens must be fed in one by one.

Reference links

Attention Mechanism in Deep Learning (2017 Edition)

What exactly is Embedding that everyone is talking about?

What is embedding?

The principle and calculation of Positional Encoding

The Way of Transformer Practice (1), Input Embedding

Transformer practice (2), Encoder

Detailed explanation of layer normalization (Layer Normalization)

Detailed explanation of Transformer Decoder

Original post: blog.csdn.net/qq_37214693/article/details/126400453