Transformer model learning route


I didn't understand Transformers at all and only recently started learning them, so here is my learning route. Transformer and CNN are two separate branches, so they need to be learned separately.
The Transformer is a Seq2seq model, the Seq2seq model uses the self-attention mechanism, and self-attention lives inside the Encoder and the Decoder.
So the learning path goes: Self-Attention -> Seq2seq -> Encoder-Decoder -> Transformer

1. Problem background:

So far our models take a single vector as input and output a category or a value (for example, an image classification or detection task).
Now suppose the input becomes a row of vectors whose length can change from sample to sample. How do we handle that?
Insert image description here

The first example: text.
A word corresponds to a vector, so a sentence is a row of vectors of varying length. How do we represent a word as a vector?
There are two methods: one-hot encoding and word embedding.
The drawback of one-hot encoding is that it shows no relationship between words.
With word embeddings, each word vector carries semantic information: when plotted, animals cluster together (dog, cat, rabbit), plants cluster together (tree, flower), and so on.
Insert image description here

Second example: speech.
A piece of audio is also a row of vectors. We take a window over the speech signal and describe its content as a vector (a frame). Sliding this window over the whole signal produces all the vectors for that utterance, roughly 6,000 vectors per minute of audio.
Insert image description here

The third example: graphs.
A graph or knowledge graph is also a set of vectors. In a social network each node is a person and the edges are the relationships between them; every person is represented by a vector.
Insert image description here

The input can be a piece of text, a speech signal, or a graph, so what does the output look like?

The output is divided into three types:

  • A word/short piece of speech corresponds to a label (first line in the picture below)
  • A whole sentence/speech corresponds to a label (second line in the picture below)
  • The amount of output is determined by the machine itself (third line in the picture below)
    Insert image description here

The first type: one-to-one (Sequence Labeling)

Insert image description here
Text processing: sequence labeling such as POS tagging, where each input word gets its part of speech as output.
Speech processing: a sound signal is a string of vectors, and each vector corresponds to a phoneme.
Graph processing: in a social network, predict something for each node (user), such as whether they will buy a recommended product.

Sequence Labeling problem

Each input word outputs its part of speech, but when the same word can take different parts of speech, the context has to be considered (e.g., in "I saw a saw", the first "saw" is a verb and the second is a noun). Solution:
Use a sliding window, so each vector looks at the neighboring vectors inside the window (for example, the window in the red box).
However, if the sentence is long, this method cannot capture the whole sentence, that is, it cannot do full semantic analysis.
This leads to the Self-Attention technique.
Insert image description here

Type 2: many-to-one

Insert image description here
Sentiment analysis: decide whether a whole sentence is a positive or a negative review.
Speech: identify the speaker from a whole piece of audio.
Graph: given the structure of a molecule, predict its hydrophilicity.

Type 3: Self-determined by the model (seq2seq)

It is not known in advance how many labels should be output; the machine decides on its own.
Translation: language A to language B, where the number of words differs.
Speech recognition, etc.


2. Self-Attention

The biggest difference from the sliding window is that with Self-Attention every word takes the semantics of the entire sentence into account.
Insert image description here

Principle:

What does self-attention ultimately want?
The essential problem: given an input, the output at each position should be able to see all of the input, compute how relevant every other position is (different weights), and use those weights to decide what to attend to.
Principle: take an input matrix I (containing the vectors a1, a2, a3, a4) and multiply it by Wq, Wk, Wv to obtain the three matrices Q, K, V. For each pair of vectors, the query Q is matched against the key K to compute a relevance score A, and softmax converts A into weights A'. A' is then multiplied by V to form a weighted sum, which gives the final output. Each attention score is a single number (a relevance weight), and the output at each position is the corresponding weighted sum of the value vectors.

(Here I is equivalent to a sentence, and a1, a2, a3, a4 are equivalent to each word in the sentence)
Insert image description here
Insert image description here
Insert image description here

Specific steps reference: self-attention

How is the relevance computed?
Two methods: 1. dot product (the most common) 2. additive
Insert image description here
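As a concrete reference, here is a minimal NumPy sketch of the dot-product version described above: I is multiplied by Wq, Wk, Wv to get Q, K, V, the scores A come from Q times K transposed, softmax turns them into weights A', and A' times V gives the outputs. The toy dimensions, the random weights, and the scaling by the square root of d_k (used in the original Transformer paper) are my own additions for illustration.

```python
# Minimal dot-product self-attention sketch in NumPy (toy sizes, random weights).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(I, Wq, Wk, Wv):
    """I: (n, d_model) matrix whose rows are the input vectors a1..an."""
    Q = I @ Wq                                 # queries
    K = I @ Wk                                 # keys
    V = I @ Wv                                 # values
    d_k = K.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)                 # pairwise relevance scores
    A_prime = softmax(A, axis=-1)              # each row sums to 1: the attention weights
    return A_prime @ V                         # weighted sum of the value vectors

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
I = rng.normal(size=(n, d_model))              # a1..a4
Wq, Wk, Wv = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
print(self_attention(I, Wq, Wk, Wv).shape)     # (4, 8): one output vector per input vector
```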

Mathematical derivation: hands-on derivation of Self-Attention


3. Multi-head Self-Attention

Multi-head attention is usually illustrated with two inputs ai and aj processed in parallel by several heads; it is somewhat like the multi-channel feature maps in a CNN.

After splitting into heads, each head is computed on its own and the results are then combined: head 1's q, k, v are only matched against head 1's, the heads do not interfere with each other when computing similarity, and each head's result is computed independently and then concatenated.

The advantage of multi-head is diversity: simply put, each head can attend to different things and form a different view of the input.
Insert image description here
Insert image description here
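To make the "split into heads, compute independently, then concatenate" idea concrete, here is a rough NumPy sketch. The head count, toy dimensions, random weights, and the final output projection Wo are illustrative assumptions, not taken from the text above.

```python
# Rough multi-head attention sketch: split Q, K, V into h heads, attend per head,
# then concatenate the heads and apply an output projection.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(I, Wq, Wk, Wv, Wo, h):
    n, d_model = I.shape
    d_head = d_model // h
    Q, K, V = I @ Wq, I @ Wk, I @ Wv
    # reshape to (h, n, d_head): each head gets its own slice of Q/K/V
    Qh = Q.reshape(n, h, d_head).transpose(1, 0, 2)
    Kh = K.reshape(n, h, d_head).transpose(1, 0, 2)
    Vh = V.reshape(n, h, d_head).transpose(1, 0, 2)
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head), axis=-1)  # per-head weights
    Oh = A @ Vh                                    # (h, n, d_head): heads do not interact
    O = Oh.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return O @ Wo                                  # final output projection

rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
I = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
print(multi_head_attention(I, Wq, Wk, Wv, Wo, h).shape)  # (4, 8)
```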


4. Positional Encoding

Self-attention by itself carries no information about where a token sits in the sequence. For example, a verb is unlikely to appear at the beginning of a sentence, so a model could down-weight that possibility, but plain self-attention has no way of knowing it is looking at the first position. Therefore Positional Encoding is added to mark the position of each word in the sentence.
Principle: simply add a positional vector ei to the input ai.
Insert image description here
So how is the Positional Encoding obtained?
1. It can be learned from data, similar to how word vectors are trained; Google's later BERT obtains its positional encodings through training.
2. Sine and cosine positional encoding: positional encodings are generated with sine and cosine functions of different frequencies and then added to the word vector at the corresponding position. The positional-encoding dimension must match the word-vector dimension.
Insert image description here
pos is the absolute position of the word in the sentence, pos = 0, 1, 2, ...; for example, in "Tom chase Jerry" the pos of "Jerry" is 2. dmodel is the dimension of the word vector, here dmodel = 512. 2i and 2i+1 index the even and odd dimensions, where i indexes pairs of dimensions; with dmodel = 512, i = 0, 1, 2, ..., 255.
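A short sketch of the sine/cosine scheme just described, using PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)); the function name and the max_len value are just placeholders for illustration.

```python
# Sinusoidal positional encoding: even dimensions use sin, odd dimensions use cos,
# with frequencies 1 / 10000^(2i/d_model).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]           # word positions 0, 1, 2, ...
    i = np.arange(d_model // 2)[None, :]        # dimension pair index i = 0 .. d_model/2 - 1
    angle = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions 2i
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions 2i + 1
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512): add pe[pos] to the word vector at that position
```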


Comparison with earlier applications that did not use self-attention

  • In speech:
    previously, a sliding window was used to read through the utterance, with either each segment of speech mapping to one label or the entire utterance mapping to one label. If the utterance is very long, the sliding-window approach requires far too many parameters, so Self-Attention is introduced:
    each speech segment is fed into Self-Attention so that the semantics of the preceding and following segments can be learned.
    Insert image description here

  • In images:

    A picture in a CNN can be regarded as one very long vector, but it can also be viewed as a set of vectors: a 5×10 RGB image is three 5×10 matrices (one per channel), and the same position across the three channels forms a three-dimensional vector.
Insert image description here

Insert image description here
  • Graphs (GNN)
Insert image description here

The drawback of the self-attention mechanism is that its computational cost is very high, so reducing that cost is a focus of ongoing research.

5. Self-Attention VS CNN

Self-Attention finds relevant pixels globally, as if the window (receptive field) in a CNN were learned automatically.

For example: pixel 1 generates a query and every other pixel generates a key. When taking inner products, the model considers not a small neighborhood but the entire picture.
Insert image description here

Self-attention is a more complex version of CNN. A CNN only considers the information inside the receptive field (the red box), whose range and size are set by hand, whereas self-attention finds the relevant pixels through attention, as if the range and size of the receptive field were learned automatically.
Insert image description here
Reference paper: On the Relationship between Self-Attention and Convolution Layers

Training CNN and self-attention on different amounts of data gives different results:
use a CNN when the amount of data is small, and use Self-Attention when the amount of data is large.


6. Self-Attention VS RNN

An RNN cannot process the sequence in parallel; information has to be stored in memory and passed along step by step.
Self-Attention, by contrast, computes Q, K, V for every vector and only has to match the vectors against each other, so all four outputs can be produced in parallel.
Insert image description here

 
 
 
 
 

---------------------------------------- Transformer update below ----------------------------------------

1. Seq2seq

Transformer is a Seq2seq (Sequence-to-sequence) model

Seq2seq corresponds to the third output category: input a sequence, output a sequence whose length is determined by the model. In other words, input a row of vectors and let the machine decide how long the output is.

1.1 Background

The first example is speech recognition: the input speech signal is a series of vectors, and the output is the text corresponding to that speech. There is no direct relationship between the length of the speech signal and the number of words output, so the machine has to decide the output length by itself. The same is true for machine translation and speech translation.
Insert image description here
Seq2seq can also be used to train chatbots: input 'Hi', and the Seq2seq model outputs a longer sentence such as 'Hello! How are you today'.
Insert image description here
Many NLP problems can be viewed as question-answering (QA) problems, for example a QA system that lets the machine read an article and a question and output an answer. This kind of problem can be solved with the Seq2seq model:
Insert image description here

The second example is syntactic parsing:
input a piece of text, and the machine generates a tree structure telling us which words or phrases form noun phrases, which form adjective phrases, and so on.
Insert image description here

The third example is Multi-label Classification.
One item can belong to several categories at once.
For example, given an article, the machine decides which classes it belongs to (e.g., class 9, class 7, class 13).
Insert image description here

2. The principle of seq2seq (Encoder-Decoder)

Generally, Seq2seq will be divided into two parts: Encoder and Decoder

2.1 Encoder part:

What the Encoder does is take a row of vectors and output a row of vectors.
The Encoder part can be implemented with a variety of models, such as CNNs or RNNs (upper-left picture).
This article only discusses the Encoder in the Transformer; BERT is essentially the Transformer's Encoder. It is composed of multiple Blocks that repeatedly take in a row of vectors and output a row of vectors (upper-right picture).

My own short summary (bottom-left and bottom-right pictures):
Input a row of vectors plus the Positional Encoding -----> send it into Multi-head Attention -----> the row of vectors does self-attention (take one head as an example) and its output is added to the input as a residual, followed by Layer Norm -----> send the result into the FC layer -----> add the FC input and output as another residual, followed by Layer Norm -----> finally output a row of vectors (the Block is repeated several times to get the output).

The overall flow is shown in the figure below (this figure is very important!):
Insert image description here

The specific steps of each Block in the Transformer Encoder are as follows:
1. A row of vectors B is fed into Self-Attention, which outputs a row of vectors A; the two are added as a residual, A + B.
2. Layer Norm is applied to A + B (difference between LN and BN: BN normalizes each feature dimension across the batch, LN normalizes each sample across its own dimensions).

Insert image description here
3. The LN output is sent into the FC layer, again with a residual connection; the result of the FC residual goes through another LN layer, and that final LN output is the output of the Encoder Block.
Insert image description here
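Putting steps 1-3 together, here is a schematic NumPy version of one Encoder Block. To keep the sketch short it uses a single attention head, a ReLU between the two FC matrices, and random stand-in weights; a real Transformer Block uses multi-head attention and trained parameters.

```python
# Schematic Encoder Block: self-attention -> residual + LayerNorm -> FC -> residual + LayerNorm.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)        # LN normalizes each vector (each row) on its own
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_block(B, Wq, Wk, Wv, W1, b1, W2, b2):
    Q, K, V = B @ Wq, B @ Wk, B @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V  # self-attention output A
    X = layer_norm(A + B)                         # residual A + B, then LayerNorm
    F = np.maximum(0, X @ W1 + b1) @ W2 + b2      # FC layer (ReLU between two linear maps)
    return layer_norm(F + X)                      # residual around the FC layer, then LayerNorm

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 16
B = rng.normal(size=(n, d))
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
print(encoder_block(B, Wq, Wk, Wv, W1, b1, W2, b2).shape)  # (4, 8): a row of vectors in, a row out
```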
Tips:
1. The residual structure alleviates the vanishing-gradient problem and allows the model to be deeper.
2. BatchNorm normalizes a batch of samples per feature, while LayerNorm normalizes each sample on its own.
3. The Encoder input has two parts: the sequence plus the positional embedding, whose positions are computed with sine and cosine functions (sine for the even dimensions, cosine for the odd dimensions).

LN and BN: one normalizes by row (per sample), the other by column (per feature).
Insert image description here
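A tiny NumPy illustration of that row/column difference: BN normalizes each feature (column) across the batch, while LN normalizes each sample (row) across its own features. The 3×4 toy matrix is only there to show the axes.

```python
# LayerNorm vs BatchNorm on a (batch, features) matrix: which axis gets normalized.
import numpy as np

x = np.arange(12, dtype=float).reshape(3, 4)   # 3 samples (rows), 4 features (columns)

# BatchNorm: normalize each feature column over the batch
bn = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-6)
# LayerNorm: normalize each sample row over its own features
ln = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + 1e-6)

print(bn.mean(axis=0))   # ~0 for every feature column
print(ln.mean(axis=1))   # ~0 for every sample row
```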

Reference paper on why LN is used instead of BN (it ultimately proposes PowerNorm):
PowerNorm: Rethinking Batch Normalization in Transformers


2.2 Decoder part:

First assume that the Decoder is an independent block and study its inputs and outputs.

The Decoder has two inputs: one is the output of the Encoder, and the other is the BEGIN token (a special flag, represented as a one-hot vector).
The output is the probability distribution over words at each position.

The Decoder comes in two modes: autoregressive (AT) and non-autoregressive (NAT); the Transformer uses the autoregressive mode. You can skip the NAT part if you like.

1. Autoregressive (AT)
If the Decoder is viewed as a black box, it first receives a special START symbol and reads the Encoder output. It produces a vector, turns it into scores with softmax, and selects the highest-scoring word as output. From then on, each Decoder input includes its own previous output (somewhat like an RNN remembering earlier information), and decoding ends when an END token is output.

Input: the Encoder output and the BEGIN token.
Output: the probability of the word at each position.
Insert image description here
But when one output word is wrong, the Decoder then conditions on that wrong result (one wrong step leads to the next):
Insert image description here
The most common problem caused by Teacher Forcing is Exposure Bias; the solution is discussed in point 4 of the Tips below.

2. Non-autoregressive (NAT)
Autoregressive means inputting and outputting one token at a time.
Non-autoregressive means feeding in a whole row of START tokens and outputting the entire sentence directly.
So how is the output length of the NAT decoder determined?
1. A separate predictor network can predict the output length, and the sentence is then generated to that length.
2. Alternatively, keep generating a long output and truncate the sentence at the END token.
Insert image description here
Advantages of NAT: parallel decoding (everything is input together and output together), so it is fast, and the output length is controllable.
Disadvantage: NAT usually performs worse than AT (the multi-modality problem).
 

AT and NAT can be considered identical on the Encoder side; they differ on the Decoder side. In AT, the Decoder uses all previous outputs to predict the word at the next step, whereas NAT uses a latent variable to generate the result at every position of the Decoder at once, making the whole decoding process independent and parallel; the key question is how to define that latent variable.

 
In fact, the Decoder is not an independent block; it is connected to the Encoder. How they are connected in the Transformer is covered in the Cross Attention part (point 2) below.


3. How to connect Encoder and Decoder in Transformer

The following figure shows the concrete structure of the Transformer, which uses the Encoder-Decoder architecture.
Insert image description here
The Transformer Decoder differs from the Encoder in two places:
1. Self-Attention becomes Masked Self-Attention.
2. The Decoder has an additional Cross Attention layer.

1. Self-Attention becomes Masked Self-Attention.

Ordinary Self-Attention takes the whole row of vectors as input and outputs the whole row, whereas Masked Self-Attention receives its inputs one by one and each output may only consider the inputs that came before it; the vectors are output one at a time without looking at later positions.
For example, when producing b2, Masked Self-Attention only considers the inputs a1 and a2; a3 and a4 are not taken into account.
Insert image description here
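A sketch of that masking trick, under the same toy setup as the earlier self-attention example: the scores for positions after the current one are set to negative infinity before the softmax, so b2 can only depend on a1 and a2.

```python
# Masked self-attention sketch: block attention to future positions with an upper-triangular mask.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(I, Wq, Wk, Wv):
    Q, K, V = I @ Wq, I @ Wk, I @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal (future inputs)
    scores = np.where(mask, -np.inf, scores)                # their weights become 0 after softmax
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
I = rng.normal(size=(4, 8))                     # a1..a4
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
B = masked_self_attention(I, Wq, Wk, Wv)        # row 1 (b2) depends only on a1 and a2
print(B.shape)                                  # (4, 8)
```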

2. The Decoder has one extra layer: Cross Attention.

How do the Encoder and Decoder connect to each other through Cross Attention?

Cross Attention is located in the Decoder and is the bridge connecting the Encoder and the Decoder. It has three inputs: two come from the Encoder and one from the Decoder.
The two inputs taken from the Encoder output (K and V) plus the one input from the Decoder (Q) are sent into a multi-head attention layer.
Insert image description here
Cross Attention did not originate with the Transformer; it existed before the Transformer, and only after Self-Attention appeared did the Transformer come about.
In the Decoder, the output of Masked Self-Attention generates a query Q. This Q is matched against the keys K produced from the Encoder's self-attention output to obtain scores α, and the weighted sum of α with the values V is then sent to the FC layer.
Insert image description here
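A minimal sketch of Cross Attention matching the description above: Q comes from the Decoder side, while K and V come from the Encoder output. The shapes and random weights are illustrative stand-ins.

```python
# Cross Attention sketch: queries from the Decoder, keys and values from the Encoder output.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_x, enc_out, Wq, Wk, Wv):
    Q = dec_x @ Wq             # queries from the Decoder's masked self-attention output
    K = enc_out @ Wk           # keys from the Encoder output
    V = enc_out @ Wv           # values from the Encoder output
    alpha = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # decoder-to-encoder weights α
    return alpha @ V           # weighted sum of encoder values, then sent on to the FC layer

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(6, 8))    # 6 Encoder output vectors
dec_x = rng.normal(size=(3, 8))      # 3 Decoder positions generated so far
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
print(cross_attention(dec_x, enc_out, Wq, Wk, Wv).shape)  # (3, 8)
```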


4. Training and inference

Transformer's Decoder needs to be divided into the training part and the inference part to discuss.

The Encoder works the same way in both cases: it can be computed in parallel and encodes everything in one pass.
The Decoder behaves differently: during training it outputs the predictions for all words in the sequence at once, while during inference it predicts one word of the sequence per step and needs multiple steps to produce the whole sequence.

My own understanding: during training the Decoder input is the ground truth, while during inference the input is the output from the previous step. Training uses Teacher Forcing and runs in parallel; inference runs serially.
 

1. Training
During training the Decoder is fed the ground truth; this is called Teacher Forcing.
Training is similar to classification: the model outputs the most likely word at each position, and cross entropy between the predicted words and the true labels is used as the loss.
Insert image description here
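A rough sketch of that teacher-forced objective: the Decoder input is BEGIN plus the ground-truth tokens shifted right, every position is predicted in parallel, and cross entropy compares each prediction with the true next token. The vocabulary size, token ids, and the random logits standing in for the real Decoder are all assumptions for illustration.

```python
# Teacher-forcing training sketch: shifted ground-truth input + cross-entropy loss.
import numpy as np

def cross_entropy(logits, targets):
    # logits: (seq_len, vocab) scores, targets: (seq_len,) integer token ids
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))  # log-softmax
    return -log_probs[np.arange(len(targets)), targets].mean()

BEGIN, vocab = 0, 10
targets = np.array([4, 7, 2, 9])                          # ground-truth token ids
decoder_input = np.concatenate(([BEGIN], targets[:-1]))   # teacher forcing: feed the ground truth

rng = np.random.default_rng(0)
logits = rng.normal(size=(len(targets), vocab))           # stand-in for decoder(decoder_input, enc_out)
print(cross_entropy(logits, targets))                     # the loss minimized during training
```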

2. Inference
During inference the Decoder is fed its own output from the previous step.
The Decoder does not decode the whole sequence at once but one token at a time, because the output at the previous position is needed to form the attention query for the next step.

For example, suppose the target output is the Chinese word for "machine learning", which consists of four characters, so the Decoder produces one character per step:
Insert image description here
Step 1 (of both the training and inference phases):
Initial input: START symbol + Positional Encoding
Intermediate input: Encoder Embedding (the Encoder output)
Final output: the prediction of the 1st character ("machine")

Step 2:
Initial input: START + 1st character + Positional Encoding
Intermediate input: Encoder Embedding
Final output: the prediction of the 2nd character ("device")

Step 3:
Initial input: START + 1st + 2nd characters + Positional Encoding
Intermediate input: Encoder Embedding
Final output: the prediction of the 3rd character ("learn")

Step 4:
Initial input: START + 1st + 2nd + 3rd characters + Positional Encoding
Intermediate input: Encoder Embedding
Final output: the prediction of the 4th character ("practice")

The training phase corresponds to steps 1-4 carried out at the same time, because we already know the final output is "machine learning", whereas the inference phase can only proceed step by step, predicting the next character from the previous ones, because we do not know in advance what the final output will be.

Insert image description here
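The inference loop above can be sketched as a simple greedy autoregressive decoder: call the Decoder once per step, feed the previously generated tokens back in, and stop at the END token. Here decoder_step is a random stand-in for the real Decoder plus softmax, and the token ids are made up for illustration.

```python
# Autoregressive greedy decoding sketch: one Decoder call per generated token.
import numpy as np

BEGIN, END, vocab, max_len = 0, 1, 10, 20
rng = np.random.default_rng(0)

def decoder_step(tokens, enc_out):
    """Stand-in returning a score for every vocabulary word, given the tokens so far."""
    return rng.normal(size=vocab)

enc_out = rng.normal(size=(6, 8))          # pretend Encoder output
tokens = [BEGIN]
while len(tokens) < max_len:
    scores = decoder_step(tokens, enc_out)
    next_token = int(np.argmax(scores))    # greedy: take the highest-scoring word
    if next_token == END:                  # stop when the END token is produced
        break
    tokens.append(next_token)              # the output becomes part of the next step's input
print(tokens[1:])                          # generated sequence, one token per step
```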

Detailed reference: What exactly are the inputs and outputs of the Transformer's Decoder, and what does each part do?


5. Tips

1. Copy Mechanism
For many tasks the Decoder does not need to generate all of its output from scratch; it can "copy" parts of the input into the output.
For example, a chatbot does not need to process a person's name entered by the user (such as Kurolo) and can output it directly; it may also repeat a piece of text it cannot recognize, or copy sentences from articles when producing summaries.
Insert image description here
Reference paper: Get To The Point: Summarization with Pointer-Generator Networks

2. Guided Attention
Often used in speech recognition and speech synthesis.
A speech-synthesis model may read four repeated words without any pronunciation problem yet drop part of the pronunciation when reading a single word, so Guided Attention is needed: the machine is required to attend in a fixed pattern.
For example, in speech synthesis the attention should move from left to right rather than jumping around.
Insert image description here
Specific methods: Monotonic Attention, Location-aware Attention

3. Beam Search
Picking the highest-scoring word at every step traces out the red path; this is called Greedy Decoding.
Giving up a little score at the first step can lead to a better overall result (the green path ends up better than the red one); the technique for searching for such paths is called Beam Search.
Insert image description here
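A small sketch contrasting the two: greedy decoding keeps only the single best word at each step, while Beam Search keeps the k best partial sequences and expands each of them. The step_log_probs stand-in, vocabulary size, and beam width are illustrative assumptions.

```python
# Beam Search sketch: keep the k highest-scoring partial sequences at every step.
import numpy as np

def beam_search(step_log_probs, k=2, max_len=3):
    # step_log_probs(tokens) -> log-probabilities over the vocabulary for the next word
    beams = [([], 0.0)]                              # (tokens so far, accumulated log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_p = step_log_probs(tokens)
            for w in range(len(log_p)):              # expand every beam with every word
                candidates.append((tokens + [w], score + log_p[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]                       # keep only the k best partial sequences
    return beams[0][0]                               # best full sequence found

rng = np.random.default_rng(0)
vocab = 5
def step_log_probs(tokens):
    """Random stand-in for the Decoder's softmax output."""
    return np.log(rng.dirichlet(np.ones(vocab)))

print(beam_search(step_log_probs, k=2))              # greedy decoding is the special case k = 1
```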

4. Exposure Bias
The most common problem with Teacher Forcing is Exposure Bias:
during training the Decoder only ever sees the ground truth, while during testing it sees its own (possibly wrong) outputs. This train/test mismatch is called Exposure Bias.
Insert image description here

How can the problem caused by Teacher Forcing be solved? The answer is Scheduled Sampling:
simply inject some errors during training, similar to adding a perturbation. Note that this method hurts the parallelization of the Transformer.

Reference papers:
1. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
2. Scheduled Sampling for Transformers
3. Bridging the Gap between Training and Inference for Neural Machine Translation

Related:

1. Li Hongyi's "Deep Learning" - Self-attention Mechanism
2. Three articles to help you understand Query, Key and Value in Transformer
3. Hands-on derivation of Self-Attention
4. Detailed explanation of the most simple Transformer in history
5. Vision Transformer , Vision MLP Super Detailed Interpretation (Principle Analysis + Code Interpretation) (Table of Contents)
6. Random thoughts about Teacher Forcing and Exposure Bias
7. What exactly are the inputs and outputs of the Transformer's Decoder, and what does each part do?


Origin blog.csdn.net/qq_42740834/article/details/124943826