Transformer [the most detailed explanation of the Transformer on the whole web]


1. Transformer structure

First look at the overall framework of Transformer:

It may look complicated, but it is essentially an Encoder-Decoder (seq2seq) framework. By default Nx = 6: the 6 Encoder layers and the 6 Decoder layers each build on the Self-Attention introduced in the previous post and then apply multiple nonlinear transformations.

[Figure: Transformer overall framework]

The framework in the figure above looks complicated. Since the Transformer was originally proposed as a translation model, let us start with an example to understand what it is for.

[Figure: the Transformer as a black box for translation]

The Transformer acts as a black box: input "Je suis etudiant" on the left, and get the translation "I am a student" on the right.

[Figure]

Let's describe how a model built on the Encoder-Decoder framework performs text translation:

[Figure]

The Transformer is also a Seq2Seq model (a model built on the Encoder-Decoder framework). The Encoders stack on the left reads the input, and the Decoders stack on the right produces the output. The Encoders and Decoders each have 6 layers by default, as shown in the following figure:
[Figure: the stacks of 6 Encoder layers and 6 Decoder layers]

  As the figure above shows, the output of the Encoders is combined with every Decoder layer.

The reason: the Encoder feeds K and V into every Decoder layer, and the Q generated by the Decoder queries information from the Encoder's K and V (described below).

Taking one layer of each as an example:
[Figure: one Encoder layer and one Decoder layer]
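As a quick orientation, here is a minimal sketch of the 6-layer Encoders / 6-layer Decoders structure described above, assuming PyTorch; the dimensions (d_model = 512, 8 attention heads) are assumptions taken from the original paper's defaults, not values stated in this article:

```python
import torch
import torch.nn as nn

# Assumed hyper-parameters: d_model=512 and nhead=8 follow the original paper's defaults.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.rand(1, 10, 512)  # source sentence: 10 token vectors
tgt = torch.rand(1, 7, 512)   # target words generated so far: 7 token vectors
out = model(src, tgt)         # the 6 Encoder layers run once; every Decoder layer
print(out.shape)              # attends to the Encoder output -> torch.Size([1, 7, 512])
```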

2. Seq2seq, from RNN to Transformer (understanding by example)

Example 1: Seq2seq - speech recognition, machine translation
[Figure]

Example 2: Seq2seq - training a speech recognition model from the audio and subtitles of TV dramas
[Figure]

Example 3: Seq2seq - input text, output speech
[Figure]

Example 4: Seq2seq - training a chatbot from dialogue data
[Figure]

Example 5: Seq2seq - image feature extraction
[Figure]

Example 6: Seq2seq - Transformer
[Figure]


3. Encoder

What is the Encoder doing?
Word vectors, image vectors... in short, the Encoder uses vectors so that a computer can represent, more reasonably (though not perfectly), things that objectively exist in the human world.
The Encoder turns a sentence into better word vectors (K, V)!
[Figure]

[Figure]

With all the background above, we know that the Encoders stack has N = 6 layers. From the figure above, each Encoder layer contains two sub-layers:

  • The first sub-layer is multi-head self-attention, which computes self-attention over the input;
  • The second sub-layer is a simple feed-forward neural network layer (Feed Forward);
  • Note: each sub-layer is wrapped with a residual connection (detailed below), so the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the sub-layer itself (a minimal sketch of this pattern follows this list);
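A minimal sketch of the LayerNorm(x + Sublayer(x)) pattern, assuming PyTorch; the dimensions (d_model = 512, 8 heads, 10 tokens) are illustrative assumptions:

```python
import torch
import torch.nn as nn

d_model = 512  # assumed model dimension

self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
norm1 = nn.LayerNorm(d_model)

x = torch.rand(1, 10, d_model)     # 10 input token vectors
attn_out, _ = self_attn(x, x, x)   # Sublayer(x): multi-head self-attention
x = norm1(x + attn_out)            # residual connection, then Layer Norm
```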

Now let's analyze the Encoder's data flow diagram step by step:

[Figure]

[Figure]

  1. The dark green x1 is the output of the Embedding layer. After adding the Positional Embedding vector to it, we obtain the feature vector actually fed into the Encoder, the light green vector x1;
  2. The light green vector x1 is the feature vector of the word "Thinking"; x1 passes through the Self-Attention layer and becomes the light pink vector z1;
  3. x1, as the shortcut branch of the residual connection, is added directly to z1, i.e. w3(w2(w1x + b1) + b2) + b3 + x, and Layer Norm is then applied to obtain the pink vector z1;
    • The role of the residual connection: to avoid vanishing gradients. In w3(w2(w1x + b1) + b2) + b3, if w1, w2, w3 are extremely small (0.0000000...1), the term approaches 0; the added x keeps the information and the gradient flowing;
    • The role of Layer Norm: to keep the distribution of the features stable and speed up model convergence;
  4. z1 passes through the feed-forward neural network (Feed Forward) layer, the residual connection adds z1 back to the result, and Layer Norm is applied to obtain the output vector r1. The feed-forward network consists of two linear transformations and a ReLU activation: FFN(x) = max(0, xW1 + b1)W2 + b2 (see the sketch after this list);
    • The role of the feed-forward (Feed Forward) network: every step so far has been a linear transformation, wx + b, and stacking linear transformations is still a linear transformation (a linear transformation is just translation and scaling in space); the ReLU inside Feed Forward adds a nonlinear transformation, and such space transformations can fit almost any shape;
  5. Since the Transformer's Encoders stack has 6 Encoder layers, r1 also serves as the input to the next Encoder layer, taking the place of x1, and so on until the last Encoder layer;
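The feed-forward formula in step 4, FFN(x) = max(0, xW1 + b1)W2 + b2, can be sketched as follows, assuming PyTorch; the inner dimension of 2048 follows the original paper's default and is an assumption:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048          # assumed dimensions (original paper's defaults)

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),      # xW1 + b1
    nn.ReLU(),                     # max(0, .)
    nn.Linear(d_ff, d_model),      # (...)W2 + b2
)
norm2 = nn.LayerNorm(d_model)

z1 = torch.rand(1, 10, d_model)    # output of the self-attention sub-layer
r1 = norm2(z1 + feed_forward(z1))  # step 4: FFN, residual add, Layer Norm
```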

It does not matter if you do not follow every detail: all of the operations above are just building word vectors, only better ones that represent the word and the sentence (the source) more accurately.


4. Decoder

The Decoder receives the word vectors (K, V) generated by the Encoder and uses these "word vectors" to generate the translation result.

[Figure]
[Figure]
The Decoders stack also has N = 6 layers. From the figure above, each Decoder layer contains three sub-layers:

  • The first sub-layer is Masked multi-head self-attention, which also computes self-attention over the (decoder-side) input;
    • We will not yet explain why it is Masked here; this is explained in the section "Transformer dynamic process demonstration".
  • The second sub-layer is the Encoder-Decoder Attention, which computes attention between the Encoder output and the output of the Decoder's Masked multi-head self-attention;
    • Why this cross-attention between the Encoder output and the Decoder output is needed is also explained in the section "Transformer dynamic process demonstration".
  • The third sub-layer is the feed-forward neural network layer, the same as in the Encoder (a minimal sketch of a Decoder layer follows this list);
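Here is a minimal sketch of one Decoder layer, assuming PyTorch's built-in layer; all sizes are illustrative assumptions. The tgt_mask argument corresponds to the Masked self-attention, and memory is the Encoder output used by the Encoder-Decoder Attention:

```python
import torch
import torch.nn as nn

d_model = 512  # assumed model dimension
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)

memory = torch.rand(1, 10, d_model)   # Encoder output, provides K and V
tgt = torch.rand(1, 7, d_model)       # words generated so far, provides Q
# mask that hides future positions (Masked multi-head self-attention)
tgt_mask = torch.triu(torch.full((7, 7), float('-inf')), diagonal=1)

out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)   # sub-layers 1, 2, 3 in order
print(out.shape)                                      # torch.Size([1, 7, 512])
```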

Decoder input:

[Figure: Decoder input]

The Decoder takes the output of its previous time step (which becomes the query vector Q) together with the feature vectors (K, V) output by the Encoder as its new input;
that is, the Encoder provides the K_e and V_e matrices, and the Decoder provides the Q matrix.

Why does the Encoder give the Decoder the K and V matrices?

  • Q comes from the Decoder; K = V come from the Encoder;
  • Q is the query variable, the words being generated; K = V are the source sentence;
  • When generating a new word, the words already generated by the Decoder act as Q and attend to the K and V provided by the source sentence;
    this attention determines which words in the source sentence matter most for generating the next word, and the newly generated word then joins Q for the following step.

For example: translating the source sentence "我爱中国" into "I love China".

When we translate "I", the Decoder provides the Q matrix; by computing attention against the K_e and V_e matrices, it can find which of the four source words in "我爱中国" are most useful for producing "I", and translates "I" on that basis (sketched below). This is exactly what the attention mechanism is meant to achieve: focusing on the information that matters most to the current step.
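To make the division of labor concrete, here is a minimal sketch of the Encoder-Decoder Attention, assuming PyTorch and treating "我爱中国" as 4 source tokens; Q comes from the Decoder side, while K and V come from the Encoder output:

```python
import torch
import torch.nn as nn

d_model = 512  # assumed model dimension
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

enc_out = torch.rand(1, 4, d_model)   # K = V: the 4 source tokens of "我爱中国"
dec_q   = torch.rand(1, 1, d_model)   # Q: the word currently being generated ("I")

out, attn_weights = cross_attn(query=dec_q, key=enc_out, value=enc_out)
print(attn_weights.shape)   # torch.Size([1, 1, 4]): how much each source word matters
```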


5. Transformer output result

The encoding and decoding modules of the Transformer have now both been explained, so let us return to the original question: translating "机器学习" into "machine learning". The Decoder outputs a vector of floating-point numbers; how is it converted into the two words "machine" and "learning"? Let's look at how the Encoders and Decoders interact to find the answer:

[Figure: the Decoder output passing through Linear and softmax]
As the figure above shows, the Transformer's final step is to pass the Decoder output through a linear layer (Linear) and then a softmax:

  1. The Linear layer is a simple fully connected neural network that projects the vector A produced by the Decoder onto a much higher-dimensional vector B. Assuming our model's vocabulary has 10,000 words, vector B has 10,000 dimensions, each dimension corresponding to the score of one unique word.
  2. A softmax layer then converts these scores into probabilities. The dimension with the highest probability is selected, and the word associated with it is generated as the output of this time step; that is the final output (a minimal sketch follows this list)!
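A minimal sketch of this Linear + softmax step, assuming PyTorch; the 10,000-word vocabulary follows the example above and d_model = 512 is an assumption:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000          # assumed model dimension; vocab size from the example

linear = nn.Linear(d_model, vocab_size)   # projects the Decoder vector onto word scores

dec_out = torch.rand(1, d_model)          # vector A produced by the Decoder at one time step
logits = linear(dec_out)                  # vector B: one score per word (10,000 dimensions)
probs = torch.softmax(logits, dim=-1)     # softmax turns the scores into probabilities
next_word_id = probs.argmax(dim=-1)       # pick the highest-probability word for this step
```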

6. Transformer dynamic process demonstration

[Figure]
Assume the figure above shows a certain stage during training; let's describe this dynamic flow in combination with the complete framework of the Transformer:
[Figure: the complete Transformer framework]

Now let's explain why the Decoder needs the Mask:

Answer: to close the gap (mismatch) between the training phase and the testing phase.

Example 1: machine translation: source sentence ("我爱中国"), target sentence ("I love China")

  • Training phase: the Decoder has an input, the target sentence "I love China"; feeding in the already-known words helps the Decoder generate better outputs (at every step, all of the target information would be shown to the Decoder).
  • Working (testing) phase: the Decoder also has an input, but now the target sentence is unknown. Every time a word is generated, it is appended to the target-so-far, and each generation step only sees the words generated so far (the testing phase can only show the Decoder the words it has already produced).
    To make the two phases match and close this gap, masked Self-Attention steps in: during training a mask is applied so that when the first word is generated, nothing is shown; when the second word is generated, only the first word is shown; and so on (see the sketch below).
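A minimal sketch of the mask used by masked Self-Attention, assuming PyTorch and a 4-word target: position i may only attend to positions up to i, so the first word sees nothing before it, the second word sees only the first, and so on:

```python
import torch

seq_len = 4
# -inf above the diagonal: attention scores toward future words are erased before softmax
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```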

[Figure]
Example 2:
[Figure]

The smaller the cross-entropy between the output and the correct answer, the better (a minimal sketch follows the figure):

[Figure]
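A minimal sketch of this training objective, assuming PyTorch; the vocabulary size and the target indices are made-up illustrative values:

```python
import torch
import torch.nn as nn

vocab_size = 10000                                       # assumed vocabulary size
criterion = nn.CrossEntropyLoss()

logits = torch.rand(4, vocab_size, requires_grad=True)   # Decoder scores for 4 target positions
target = torch.tensor([23, 7, 512, 3])                   # indices of the 4 correct words (hypothetical)
loss = criterion(logits, target)                         # smaller cross-entropy = closer to the answer
loss.backward()                                          # gradients then update the whole Transformer
```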
But there is another problem:

  • There is a correct answer during training, but not while working. During testing the Decoder receives its own (possibly wrong) output, whereas during training it receives the completely correct answer; this inconsistency between testing and training is called "exposure bias".

  • If the Decoder only ever sees correct information during training, then a single wrong token received during testing can push it onto a wrong path and magnify the error.

  • Therefore, deliberately feeding the Decoder some wrong information during training can actually improve the accuracy of the model (a sketch of this idea follows).
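One common way to realize "feeding the Decoder some wrong information during training" is scheduled sampling: with some probability, a ground-truth input token is replaced by the model's own prediction. A minimal sketch of the idea, assuming PyTorch; the token ids and the 30% replacement rate are arbitrary assumptions:

```python
import torch

# decoder_input: ground-truth target tokens normally fed during training (teacher forcing)
# model_preds:   tokens the model itself predicted at the previous steps
decoder_input = torch.tensor([[101, 7, 23, 512]])
model_preds   = torch.tensor([[101, 7, 99, 512]])

replace = torch.rand(decoder_input.shape) < 0.3                 # ~30% of positions get "wrong" info
noisy_input = torch.where(replace, model_preds, decoder_input)  # mix predictions into the input
```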


This concludes the explanation of the Transformer. The reference materials are as follows:

  1. National Taiwan University Li Hongyi's self-attention mechanism and Transformer detailed explanation
  2. The past and present of pre-trained language models


Origin blog.csdn.net/weixin_68191319/article/details/129228502