Getting to the bottom of it: 18 questions about the Transformer

Author: Wang Chen (who loves asking questions) @ Zhihu | Editor: Jishi Platform

Source: https://www.zhihu.com/question/362131975/answer/3058958207

Why summarize the Transformer through eighteen questions?

There are two reasons:

First, the Transformer is the fourth major feature extractor after the MLP, RNN, and CNN, and is also called the fourth major basic model; the recently popular ChatGPT is also built on the Transformer at its core, which shows how important the Transformer is.

Second, I hope that by asking questions, I can better help everyone understand the content and principles of Transformer.

1. What is the major breakthrough in the field of deep learning in 2017?

Transformer. There are two reasons:

1.1 On the one hand, the Transformer is the fourth major feature extractor (also called a basic model) in deep learning, after the MLP, RNN, and CNN. What is a feature extractor? The brain is how people interact with the external world (images, text, speech, etc.); a feature extractor is how a computer, imitating the brain, interacts with that same external world, as shown in Figure 1. For example, the ImageNet dataset contains images in 1,000 categories. People divided these million-plus images into 1,000 categories based on their own experience, and each category of image (such as jaguar) has its own distinctive characteristics. A neural network (such as ResNet-18) then tries, through this classification task, to extract or identify the distinctive features of each category of image as well as it can. Classification is not the ultimate goal but a means of extracting image features; mask-based image completion is another way of extracting features, and shuffling the order of image patches is yet another.

Figure 1 Neural networks mimic the neurons of the brain

1.2 On the other hand, the role of the Transformer in deep learning: it is the cornerstone of the third and fourth waves of deep learning, as shown in Figure 2 below.

Figure 2 Four stages of deep learning development

2. What is the background of Transformer?

2.1 Development of the field: by 2017, deep learning had already been popular in computer vision for several years, from AlexNet, VGG, GoogLeNet, ResNet, and DenseNet, and from image classification to object detection and semantic segmentation; but it had not yet made a comparable impact in natural language processing.

2.2 Technical background: (1) The mainstream solution to sequence transduction tasks (such as machine translation) at the time is shown in Figure 3 below. Under the Sequence-to-Sequence architecture (a type of Encoder-Decoder), an RNN extracts features, and the Attention mechanism efficiently passes the features extracted by the Encoder to the Decoder. (2) This approach has two drawbacks. First, the RNN's inherent front-to-back sequential structure means it cannot extract features in parallel. Second, when the sequence is too long, information from the early part of the sequence may be forgotten. It is clear that under this framework the RNN was the weak link and urgently needed improvement.

Figure 3 The mainstream solution for sequence transduction tasks

3. What exactly is Transformer?

3.1 The Transformer is an architecture composed of an Encoder and a Decoder. What is an architecture? At its simplest, it is a composition of modules, such as A + B + C.

3.2 The Transformer can also be understood as a function: the input is a sentence such as the Chinese "我爱学习", and the output is its translation, "I love studying".

3.3 If the Transformer architecture is split open, it looks as shown in Figure 4.

Figure 4 Architecture diagram of Transformer

4. What is Transformer Encoder?

4.1 Functionally, the core role of the Transformer Encoder is to extract features; the Transformer Decoder is also used to extract features. For example, when a person learns to dance, the Encoder is like watching how others dance, while the Decoder is like performing based on the learned experience and memory.

4.2 From a structural point of view, as shown in Figure 5, Transformer Encoder = Embedding + Positional Embedding + N* (sub-Encoder block1 + sub-Encoder block2);

sub-Encoder block1 = Multi-head attention + ADD + Norm;

sub-Encoder block2 = Feed Forward + ADD + Norm;

4.3 From the perspective of input and output: the input to the first of the N Transformer Encoder blocks is a set of vectors X = (Embedding + Positional Embedding), usually a 512*512 matrix. The input to each of the remaining Transformer Encoder blocks is the output of the previous block, and the output is also a 512*512 matrix (the input and output sizes are the same).

4.4 Why 512*512? The first 512 is the number of tokens. For example, "I love learning" has four tokens; it is set to 512 here to cover sequences of different lengths, with shorter sequences padded. The second 512 is the dimension of the vector generated for each token, i.e., each token is represented by a 512-dimensional vector. When people say the Transformer "cannot exceed 512", they mean the former, the number of tokens, because every token has to attend to every other token; the latter 512 should also not be too large, otherwise the computation becomes very slow. (A shape sketch in code follows Figure 5 below.)

Figure 5 Architecture diagram of Transformer Encoder
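As a quick sanity check of the shapes described in 4.3 and 4.4, here is a minimal sketch using PyTorch's built-in nn.TransformerEncoder (this is not the article's own code; the 512 tokens and 512 dimensions are just the example values used above):

```python
import torch
import torch.nn as nn

# Sketch only: PyTorch's built-in encoder, with the example sizes from this article.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# (seq_len, batch, d_model) = (512, 1, 512); in practice this would be
# Embedding + Positional Embedding rather than random numbers.
x = torch.rand(512, 1, 512)
out = encoder(x)
print(out.shape)  # torch.Size([512, 1, 512]) -- input and output sizes are the same
```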

5. What is Transformer Decoder?

5.1 From a functional perspective, the Transformer Decoder is better suited to generative tasks than the Transformer Encoder, especially for natural language processing problems.

5.2 From a structural point of view, as shown in Figure 6, Transformer Decoder = Embedding + Positional Embedding + N* (sub-Decoder block1 + sub-Decoder block2 + sub-Decoder block3) + Linear + Softmax;

sub-Decoder block1 = Mask Multi-head attention + ADD + Norm;

sub-Decoder block2 = Multi-head attention + ADD + Norm;

sub-Decoder block3 = Feed Forward + ADD + Norm;

Figure 6 Architecture diagram of Transformer Decoder

5.3 Looking at each of the three components, (Embedding + Positional Embedding), (N Decoder blocks), and (Linear + Softmax), individually:

Embedding + Positional Embedding: taking machine translation as an example, the input is "机器学习" and the output is "machine learning"; the Embedding here converts each token of the input into vector form.

N Decoder blocks : feature processing and transfer process.

Linear + Softmax: the Softmax outputs the probability of each possible next word, as shown in Figure 7. The Linear layer before it plays a role similar to the MLP layer before the final classification layer of a classification network (e.g., ResNet-18).

Figure 7 The role of softmax in Transformer Decoder

5.4 What are the input and output of the Transformer Decoder? They differ between training and testing.

In the training phase, as shown in Figure 8, the labels are known. The Decoder's first input is the BEGIN token, and its first output vector is compared with the first token of the label using a cross-entropy loss. The Decoder's second input is the first ground-truth token of the label (teacher forcing), and so on; the output corresponding to the Decoder's N-th input should be the END token, which ends the sequence. This also shows that training can be done in parallel.

Figure 8 Input and output of Transformer Decoder in the training phase

In the test phase, the input at the next time step is the output at the previous time step, as shown in Figure 9. This creates a mismatch between the Decoder's inputs during training and testing, and at test time one wrong step can indeed lead the following steps astray. A remedy is to occasionally feed the Decoder some erroneous tokens during training; scheduled sampling is a systematic way of doing this.

Figure 9 Transformer Decoder input and output in the Test phase
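A minimal sketch of the two modes just described, assuming a hypothetical model(src, decoder_inputs) that returns one row of logits per decoder position (the BOS/EOS token ids are also made up for illustration):

```python
import torch

BOS, EOS = 1, 2  # hypothetical begin/end token ids

def train_step(model, src, tgt, loss_fn):
    # Training: teacher forcing. The decoder sees the ground-truth tokens shifted
    # right by one, so every position can be computed in parallel.
    decoder_in = torch.cat([torch.tensor([BOS]), tgt[:-1]])
    logits = model(src, decoder_in)          # (tgt_len, vocab_size)
    return loss_fn(logits, tgt)              # cross-entropy against the label

@torch.no_grad()
def greedy_decode(model, src, max_len=50):
    # Testing: the input at the next step is the output of the previous step.
    out = [BOS]
    for _ in range(max_len):
        logits = model(src, torch.tensor(out))
        next_token = int(logits[-1].argmax())   # most probable next word
        out.append(next_token)
        if next_token == EOS:
            break
    return out[1:]
```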

5.5 What are the input and output inside a Transformer Decoder block?

The above describes the Decoder's overall input and output in the training and test phases; so what are the input and output of a Transformer Decoder block inside the Transformer Decoder, as shown in Figure 10?

Figure 10 Architecture diagram of Transformer Decoder block

For the first of the N = 6 blocks (N = 1): the input of sub-Decoder block1 is Embedding + Positional Embedding; the Q input of sub-Decoder block2 comes from the output of sub-Decoder block1, while K and V come from the output of the last layer of the Transformer Encoder.

For the second of the N = 6 blocks (N = 2): the input of sub-Decoder block1 is the output of sub-Decoder block3 of the first block (N = 1), and K and V again come from the output of the last layer of the Transformer Encoder.

In general, whether in training or testing, the Transformer Decoder's input comes not only from the ground truth or from the Decoder's output at the previous time step, but also from the last layer of the Transformer Encoder.

During training: the input at the i-th step of the Decoder = Encoder output + ground-truth embedding.

During prediction: the input at the i-th step of the Decoder = Encoder output + the Decoder's output at step i-1.

6. What are the differences between Transformer Encoder and Transformer Decoder?

6.1 Functionally, the Transformer Encoder is often used to extract features, while the Transformer Decoder is often used for generative tasks. They represent two different technical routes: BERT uses the former, and the GPT series uses the latter.

6.2 Structurally, a Transformer Decoder block contains 3 sub-Decoder blocks, while a Transformer Encoder block contains 2 sub-Encoder blocks; in addition, the Transformer Decoder uses Mask Multi-head attention.

6.3 From the perspective of their inputs and outputs: after the N Transformer Encoder blocks have finished their computation, the output is fed into the Transformer Decoder, where it serves as the K and V of the QKV. So how is the output of the last layer of the Transformer Encoder sent to the Decoder? As shown in Figure 11.

Figure 11 How Transformer Encoder and Transformer Decoder interact

So why must the Encoder and Decoder interact in this particular way? In fact, they need not; different interaction schemes were proposed later, as shown in Figure 12.

Figure 12 Interaction between Transformer Encoder and Decoder

7. What is Embedding?

7.1 The position of Embedding in the Transformer architecture is shown in Figure 13.

7.2 Background: A computer cannot directly process a word or a Chinese character. It needs to convert a token into a vector that the computer can recognize. This is the embedding process.

7.3 Implementation: the simplest embedding is the one-hot vector, but one-hot vectors do not capture the relationships between words, which is why Word Embedding was developed later, as shown in Figure 13.

Figure 13 Some descriptions of Embedding, from left to right: the position of embedding in Transformer, one hot vector, Word embedding.
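For illustration, a one-hot vector versus a learned embedding in PyTorch (the vocabulary size and token index here are made up):

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 10000, 512       # example sizes only

token_id = torch.tensor([42])          # index of one token in the vocabulary

# One-hot: sparse, and carries no notion of similarity between words.
one_hot = F.one_hot(token_id, num_classes=vocab_size).float()   # shape (1, 10000)

# Word embedding: a dense, trainable vector learned together with the model.
embedding = torch.nn.Embedding(vocab_size, d_model)
dense_vec = embedding(token_id)                                  # shape (1, 512)
```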

8. What is Positional Embedding?

8.1 The position of Positional Embedding in the Transformer architecture is shown in Figure 14.

8.2 Background: an RNN, as a feature extractor, carries the order information of the words by construction; the Attention mechanism, by contrast, ignores order, yet order has a great impact on meaning. Positional Embedding is therefore used to add position information to the input Embedding.

8.3 Implementation: either a fixed, hand-designed positional encoding (the sinusoidal encoding used in the original paper) or position embeddings learned automatically by the network. (A code sketch of the sinusoidal encoding follows Figure 14.)

Figure 14 Some descriptions of Positional Embedding, from left to right: the position of Positional Embedding in the Transformer, the formula of the fixed positional encoding, and a visualization of the positional encoding e_i, where each column is the positional encoding of one token.
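A sketch of the fixed sinusoidal positional encoding from the original paper, using the 512-token, 512-dimension example sizes from this article:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len=512, d_model=512):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                      # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                        # added element-wise to the token embeddings

print(sinusoidal_positional_encoding().shape)   # torch.Size([512, 512])
```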

9. What is Attention?

9.1 When introducing the Transformer, why introduce Attention first? Because the Multi-head attention and Mask Multi-head attention used throughout the Transformer are built on Scaled Dot-Product Attention, Scaled Dot-Product Attention is a form of Self-Attention, and Self-Attention is a kind of Attention; so one first needs to understand Attention, as shown in Figure 15.

Figure 15 Relationship between Attention and Transformer

9.2 What exactly does Attention mean?

For images, attention is the core region a person focuses on when looking at an image, the focal point of the image, as shown in Figure 16. For sequences, the Attention mechanism essentially finds the relationships between different tokens of the input, learning the relationships between words through a weight matrix.

Figure 16 attention in the image

9.3 How is Attention implemented?

It is realized through QKV.

So what are Q, K, and V? Q is the query, K the keys, and V the values. As shown in Figure 17, for example: Q is the signal sent by the brain, "I am thirsty"; K is the environmental information, the world seen by the eyes; V assigns different weights to the items in the environment, raising the weight of water.

In short, Attention computes the similarity between Q and K and uses it to weight V, giving the attention value.

Figure 17 Implementation of Attention

9.4 Why must there be all three of Q, K, and V?

Why not just Q? Because the relationship between two tokens needs both weights a12 (how token 1 attends to token 2) and a21 (how token 2 attends to token 1). You may ask: can't we simply force a12 = a21? You can try, but in principle the result should not be as good as having a12 and a21 separately.

Why not just Q and K? The weight coefficients that are obtained need to be applied back to the input, and in principle they could be multiplied with Q or with K. Why multiply by a separate V? Probably because an extra set of trainable parameters W_V gives the network stronger learning capacity.

10. What is Self attention?

10.1 When introducing the Transformer, why introduce Self-Attention? Because the Multi-head attention and Mask Multi-head attention in the Transformer are built on Scaled Dot-Product Attention, which is a form of Self-Attention, as shown in Figure 15.

10.2 What is Self-Attention? Self-attention, local attention, and stride attention are all types of attention. In self-attention, every Q computes an attention coefficient with every K, as shown in Figure 18; in local attention, a Q only computes attention coefficients with neighboring Ks; and in stride attention, a Q computes attention coefficients with Ks at fixed strides (skipping over tokens).

Figure 18 From left to right: self attention, local attention, stride attention

10.3 Why can Self attention be used to process sequence data like machine translation?

The data of each position in the input sequence can pay attention to the information of other positions, so as to extract features or capture the relationship between each token of the input sequence through the Attention score.

10.4 How is Self-Attention implemented? There are 4 steps in total, as shown in Figure 19.

Figure 19 Implementation process of Self attention
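A bare-bones sketch of the process with plain tensors (single head, toy sizes; the exact split into four steps is my reading of the figure, and the division by sqrt(d_k) is deliberately left to the next question):

```python
import torch

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # step 1: project the input into Q, K, V
    scores = Q @ K.transpose(-2, -1)             # step 2: similarity of every Q with every K
    weights = torch.softmax(scores, dim=-1)      # step 3: normalize scores into attention weights
    return weights @ V                           # step 4: weighted sum of V

seq_len, d = 4, 8                                # toy sizes for illustration
X = torch.rand(seq_len, d)
W_q, W_k, W_v = torch.rand(d, d), torch.rand(d, d), torch.rand(d, d)
print(self_attention(X, W_q, W_k, W_v).shape)    # torch.Size([4, 8])
```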

11. What is Scaled dot product attention?

11.1 The two most common forms of attention are dot-product attention and additive attention. As shown in Figure 20, the former is more computationally efficient.

Figure 20 The difference between dot product attention and additive attention

11.2 What is Scaled?

The specific implementation of the scaling is shown in Figure 21. Its purpose is to keep the inner products from becoming too large: if they are too large, the softmax outputs saturate near 0 and 1 and the gradients become tiny, which makes training hard. In that respect it plays a role loosely similar to normalization.

Figure 21 The position of the scaled operation in the attention
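A small numerical illustration of the point above, assuming the standard formula softmax(QK^T / sqrt(d_k))V from the paper: without scaling, the dot products grow with d_k and the softmax output becomes nearly one-hot, which is exactly the regime where its gradients vanish.

```python
import torch

d_k = 512
q = torch.randn(d_k)
k = torch.randn(10, d_k)              # 10 keys

raw = k @ q                           # dot products have variance of roughly d_k
scaled = raw / d_k ** 0.5             # scaling brings them back to roughly unit variance

print(torch.softmax(raw, dim=0))      # nearly one-hot: hard to train through
print(torch.softmax(scaled, dim=0))   # smoother distribution: easier to train
```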

12. What is Multi head attention?

12.1 The position of Multi head attention in the Transformer architecture is shown in Figure 15.

12.2 Background: a CNN has multiple channels and can extract feature information along different dimensions of an image. Can Self-Attention do something similar, extracting several kinds of information about tokens at different distances at the same time?

12.3 What is group convolution? As shown in Figure 22, the channels of the input feature map are split into several groups, each group is convolved separately, and the results are finally concatenated.

Figure 22 group convolution

12.4 How is Multi-head attention implemented? How does it fundamentally differ from Self-Attention?

As shown in Figure 23, taking two heads as an example: the inputs Q, K, and V are each split into two parts, each part of Q attends to its corresponding part of K and V separately, and the resulting vectors are finally concatenated. Multi-head attention is therefore implemented in a way very similar to group convolution.

Figure 23 The difference between Multi head attention and self attention

12.5 How to understand Multi head attention from the perspective of input and output dimensions? As shown in Figure 24.

Figure 24 Input and output dimensions of Multi head attention
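A dimensions-only sketch of the split-and-concatenate idea (toy sequence length; the real model uses 8 heads with d_model = 512, i.e., 64 dimensions per head):

```python
import torch

seq_len, d_model, n_heads = 4, 512, 8
d_head = d_model // n_heads                       # 64 dimensions per head

Q = torch.rand(seq_len, d_model)                  # K and V have the same shape

# Split into heads: (seq_len, d_model) -> (n_heads, seq_len, d_head)
Q_heads = Q.view(seq_len, n_heads, d_head).transpose(0, 1)

# ... each head runs its own scaled dot-product attention with its own K and V ...

# Concatenate the heads back: (n_heads, seq_len, d_head) -> (seq_len, d_model)
out = Q_heads.transpose(0, 1).reshape(seq_len, d_model)
print(out.shape)                                  # torch.Size([4, 512])
```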

13. What is Mask Multi head attention?

13.1 The position of Mask Multi head attention in the transformer architecture is shown in Figure 15.

13.2 Why is there such an operation as Mask?

When the Transformer predicts the output at time step T, it must not see the inputs after time step T; this keeps training and prediction consistent.

The mask operation prevents the i-th word from seeing information from word i+1 onwards, as shown in Figure 25.

Figure 25 Position of Mask operation in Transformer

13.3 How is the Mask operation implemented?

Q1 is computed only with K1; Q2 is computed only with K1 and K2; for K3, K4, and so on, a very large negative number is added before the softmax so that those positions become 0 after the softmax. The matrix form of this computation is shown in Figure 26.

Figure 26 The implementation of the matrix calculation of the Mask operation
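A sketch of that masking in matrix form (small sizes for readability; -inf plays the role of the "very large negative number"):

```python
import torch

seq_len = 4
scores = torch.rand(seq_len, seq_len)     # raw Q-K similarity scores

# Positions with column index j > row index i are future tokens:
# set them to -inf before the softmax.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))

weights = torch.softmax(scores, dim=-1)
print(weights)   # row i has non-zero weights only for columns 0..i
```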

14. What is ADD?

14.1 ADD is the residual connection, popularized by the ResNet paper in 2015 (now cited more than 160,000 times); it differs from a general skip connection in that the two tensors being added must have exactly the same size and dimensions.

14.2 As an extremely simple idea, it is used in almost every deep learning model; it prevents network degradation and is often used to make deep, multi-layer networks trainable.

Figure 27 The position of ADD in the Transformer architecture (left) and the schematic diagram of the residual connection principle (right)
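In code, the residual connection is just an element-wise add, which is exactly why the two shapes must match (a sketch; the Linear layer merely stands in for attention or the FFN):

```python
import torch
import torch.nn as nn

d_model = 512
x = torch.rand(10, d_model)               # input to a sub-layer
sublayer = nn.Linear(d_model, d_model)    # stand-in for Multi-head attention or FFN

y = x + sublayer(x)                       # ADD: requires identical size and dimension
# In the Transformer this sum is immediately followed by Norm (next question).
```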

15. What is Norm?

15.1 Norm is layer normalization.

15.2 Core function: to make training more stable; like batch normalization, it normalizes the input to zero mean and unit variance.

15.3 Why use layer normalization rather than batch normalization? Because for sequence data, sentences can be long or short; with batch normalization, statistics computed across samples of very different lengths easily make training unstable. BN normalizes the same feature dimension across all samples in a batch, while LN normalizes across the features of a single sample.

Figure 28 The difference between the position of layer Normalization in the Transformer architecture (left) and batch normalization (right)
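A sketch contrasting the two on sequence-shaped data with PyTorch modules (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 512
x = torch.rand(batch, seq_len, d_model)

# LayerNorm: normalizes the 512 features of each token, one sample at a time,
# so it does not care how long each sentence is.
ln = nn.LayerNorm(d_model)
out_ln = ln(x)

# BatchNorm1d expects (batch, channels, length) and normalizes each feature
# across the whole batch, so its statistics mix tokens from sentences of very
# different lengths (and padding), which is what makes training unstable.
bn = nn.BatchNorm1d(d_model)
out_bn = bn(x.transpose(1, 2)).transpose(1, 2)
```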

16. What is FFN?

16.1 FFN stands for feed-forward network.

16.2 Why is an FFN needed after the Self-Attention layer? Attention already captures the desired sequence information; the role of the MLP is to project that information into another space and apply a nonlinear mapping, and the two are used alternately.

16.3 Structurally, it consists of two MLP layers: the first layer's weight matrix is 512*2048, the second layer's is 2048*512, and the second layer uses no activation function, as shown in Figure 29. (A code sketch follows the figure.)

Figure 29 The specific implementation process of FFN
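The FFN in code form, following the 512 to 2048 to 512 shapes described above (ReLU after the first layer as in the original paper, no activation after the second; dropout is omitted for brevity):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # 512 -> 2048
            nn.ReLU(),                  # nonlinearity only after the first layer
            nn.Linear(d_ff, d_model),   # 2048 -> 512, no activation
        )

    def forward(self, x):
        # Applied to each position (token) independently.
        return self.net(x)
```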

17. How is Transformer trained?

17.1 Data: the Transformer paper mentions using 4.5M and 36M translation sentence pairs.

17.2 Hardware: the base model was trained on 8 P100 GPUs for 12 hours, and the large model for 3.5 days.

17.3 Model parameters and tunable hyperparameters:

First, the trainable parameters include WQ, WK, WV, WO, and the parameters of the FFN layer.

Second, the tunable hyperparameters include: the dimension of each token's vector (d_model), the number of heads, the number of repeated blocks N in the Encoder and Decoder, the dimension of the FFN's intermediate layer, label smoothing (confidence 0.1), and dropout (0.1).
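Gathered into one place, the base-model values reported in the original paper look roughly like this (a summary, not training code):

```python
transformer_base_config = {
    "d_model": 512,          # dimension of each token's vector
    "n_heads": 8,            # number of attention heads
    "N": 6,                  # number of Encoder blocks and of Decoder blocks
    "d_ff": 2048,            # dimension of the FFN intermediate layer
    "dropout": 0.1,
    "label_smoothing": 0.1,  # confidence 0.1, as mentioned above
}
```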

18. Why does Transformer work so well?

18.1 Although the title is "Attention Is All You Need", later studies suggest that Attention, the residual connection, layer normalization, and the FFN together are what make the Transformer work.

18.2 Transformer advantages include:

First, it proposed the fourth major feature extractor in deep learning, after the MLP, CNN, and RNN.

Second, it was first used for machine translation and then, with GPT and BERT, broke into the mainstream; it was a turning point after which the NLP field developed rapidly, followed by the rise of multimodality, large models, and vision Transformers.

Third, it gave people confidence that there could be feature extractors better than the original CNN and RNN.

18.3 What are the shortcomings of Transformer?

First, the amount of calculation is large and the hardware requirements are high.

Second, because there is no inductive bias, a lot of data is needed to achieve good results.

Finally, the references for this article are the Transformer paper, Li Hongyi's course, Li Mu's course, and some excellent writings about the Transformer on Zhihu; they are not listed individually here (the references were not recorded in time during study). If anything infringes, please let me know and I will credit or revise it promptly.
