Study Notes: Deep Learning (7) - From Encoder-Decoder to Transformer

Study time: 2022.04.22~2022.04.24

6. From Encoder-Decoder to Transformer

6.1 Encoder-Decoder framework

Encoder-Decoder is a common model framework in deep learning, and many common applications are built on it.

  • Encoding: the encoder converts the input sequence into a dense vector of fixed dimension;

  • Decoding: the decoder converts that fixed-dimension dense vector into an output sequence.


Encoder-Decoder is not a specific model or algorithm but a general framework. The inputs and outputs can be text, speech, image, or video data, and the encoder and decoder themselves can be CNNs, RNNs, LSTMs, GRUs, Attention modules, etc. Many different models can therefore be designed on top of the Encoder-Decoder framework.

Special Note:

  • Regardless of how long the input and output sequences are, the intermediate "vector C" always has a fixed length, while the input and output sequences themselves can vary in length;
  • This fixed-length intermediate vector is also a defect: compressing the sequence into one vector loses information, and long-range dependencies suffer from vanishing gradients. For longer sentences it is hard to expect a fixed-length vector to preserve all of the useful information; even though LSTM adds a gating mechanism to selectively forget and remember, results still degrade as the sentences to be translated get harder;
  • Different tasks can choose different encoders and decoders (RNN, CNN, LSTM, GRU);
  • A salient feature of Encoder-Decoder is that it is end-to-end: any model that fits this framework structure can be collectively referred to as an Encoder-Decoder model.
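To make the framework concrete, here is a minimal PyTorch sketch of the encode-to-fixed-vector-then-decode pattern; the GRU choice, vocabulary sizes, and hidden size are illustrative assumptions, not prescribed by the text.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len)
        _, h = self.rnn(self.embed(src))     # h:   (1, batch, hidden)
        return h                             # the fixed-length "vector C"

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, h):               # tgt: (batch, tgt_len)
        y, h = self.rnn(self.embed(tgt), h)  # decode conditioned on C
        return self.out(y), h                # logits over the target vocabulary

enc, dec = Encoder(1000, 256), Decoder(1200, 256)
src = torch.randint(0, 1000, (2, 7))         # variable-length input sequence
tgt = torch.randint(0, 1200, (2, 5))         # variable-length output sequence
logits, _ = dec(tgt, enc(src))               # logits: (2, 5, 1200)
```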

6.2 Seq2Seq sequence-to-sequence

Seq2Seq (Sequence-to-Sequence) literally means that one sequence goes in and another sequence comes out. Seq2Seq is a typical representative of the Encoder-Decoder framework and can be regarded as the Encoder-Decoder framework applied to a particular class of tasks.


The so-called Seq2Seq task refers to sequence-to-sequence mapping problems. A sequence here can be understood as a string of tokens: given one sequence, we want to obtain another sequence that corresponds to it (for example its translation, or a semantically equivalent sequence). Such a task can be called Seq2Seq.

Encoder-Decoder emphasizes the model design (an encode-decode process), while Seq2Seq emphasizes the task type (sequence-to-sequence problems); any model whose input and output are both sequences can be referred to as a Seq2Seq model.

Application scenarios: machine translation, text generation, language modeling, speech recognition, abstractive text summarization, semantic analysis, question answering, and voice conversion.

Taking machine translation as an example, French can be translated into English, and a model that satisfies such a task can also be called Seq2Seq.


Early Seq2Seq models used RNN, LSTM, GRU, etc. for encoding and decoding (the earliest proposal used two RNNs), but the shortcomings are obvious:

First, the semantic vector cannot fully represent the information of the entire sequence; second, the information carried by early inputs is diluted or overwritten by later inputs, and the longer the input sequence, the more serious this becomes. As a result, the decoder does not have enough information about the input sequence right from the start of decoding, so decoding accuracy naturally drops.

Therefore, the Attention mechanism was proposed later, and CNNs were also used (CNNs can be parallelized, which relieves the performance bottleneck of LSTMs) to improve results.

The current Seq2Seq models are mainly divided into three types:

  1. One is an RNN-based model, generally LSTM+Attention, which processes input information sequentially;
  2. One is a CNN-based model, such as Fairseq;
  3. One is a model that relies entirely on Attention, such as Google's Transformer.

Supplement: Seq2Seq Model Usage Tips - "Teacher Forcing"

Teacher Forcing is used in the training phase, mainly for the third Decoder variant above, in which each step's input includes the previous step's output $y'$. If the previous output is wrong, the next output is also prone to error, so the error propagates forward.

Teacher Forcing can alleviate this problem to a certain extent. When training a Seq2Seq model, each step of the Decoder does not necessarily take the previous step's output as input; instead, with a certain probability the correct token from the target sequence is used. That is, the ground-truth output is treated as part of the input.
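A minimal sketch of teacher forcing for a single training step, reusing the toy Encoder/Decoder sketched in 6.1; the 0.5 ratio and the loss setup are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def train_step(enc, dec, src, tgt, teacher_forcing_ratio=0.5):
    h = enc(src)                                   # fixed context vector C
    inp = tgt[:, :1]                               # start-of-sequence token
    loss = 0.0
    for t in range(1, tgt.size(1)):
        logits, h = dec(inp, h)                    # predict the next token
        loss = loss + F.cross_entropy(logits[:, -1], tgt[:, t])
        if random.random() < teacher_forcing_ratio:
            inp = tgt[:, t:t+1]                    # feed the ground-truth token
        else:
            inp = logits[:, -1].argmax(-1, keepdim=True)  # feed the model's own prediction
    return loss / (tgt.size(1) - 1)
```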

6.3 Attention mechanism

To overcome the drawbacks above and improve the model, the Attention mechanism is introduced (Attention is generally used together with a Seq2Seq model).

When decoding, the generated words no longer rely only on the single global semantic vector C; instead an "attention range" is added and the context is computed dynamically (that is, the content is no longer all squeezed into one vector). When decoding the current word, the model looks at the corresponding words in the source sentence and combines them with the sequence translated so far to produce the next word.

In this way, the model can selectively focus on the useful parts of the input sequence and thus learn the alignment between input and output, which helps it handle longer input sentences.

Intuitive understanding:

When translating the sentence "Knowledge is power", we only need to focus on "Knowledge" while translating "knowledge", on "is" while translating "is", and on "power" while translating "power". In this way, when the Decoder predicts the target word, it can see all of the Encoder's information rather than being limited to a fixed-length hidden vector, and important information is not lost.


6.3.1 Model overview

This section refers to: Detailed explanation of the Attention principle . The implementation of Seq2Seq + Attention is roughly as follows:

The semantic encoding C changes dynamically from one decoding step to the next. At each decoding step, the decoder's current hidden state and all of the encoder's hidden states are used to compute an Attention Score, i.e. the correlation between the current decoding step and each encoding step. A weighted sum over the encoder hidden states then gives the semantic encoding C for the current decoding step. This C is concatenated with the decoder's hidden state and passed through a fully connected layer (to match the dimensions) to produce the output of the current decoding step.

In the Attention model, the semantic encodings $C_1$, $C_2$, $C_3$ are different, and the outputs are generated as $y_1 = f_1(C_1)$, $y_2 = f_1(C_2, y_1)$, $y_3 = f_1(C_3, y_1, y_2)$.


For example, if the input is the English sentence "Tom chase Jerry", the model generates the target words for "Tom", "Chase", and "Jerry" one by one. The attention distribution is computed roughly as follows (taking the first target word, aligned with "Tom", as an example):

The attention weight that the current output word $Y_i$ assigns to an input word $j$ is jointly determined by the current decoder hidden state $H_i$ and the hidden state $h_j$ of input word $j$; a Softmax then turns these scores into probabilities in $[0, 1]$. That is, a function $F(h_j, H_i)$ gives the alignment likelihood of the target word $Y_i$ with each input word. $a_{ij}$ measures the correlation between the $j$-th encoding stage ($h_j$) and the $i$-th decoding stage, and the context information $C_i$ for the $i$-th decoding stage is the weighted sum of all $h_j$ with weights $a_{ij}$.

Then the whole translation process can be expressed as follows (each $C$ automatically selects the contextual information most appropriate for the current $y$):

$$C_{Tom} = g(0.6 \times f_2(\mathrm{Tom}),\ 0.2 \times f_2(\mathrm{Chase}),\ 0.2 \times f_2(\mathrm{Jerry}))$$
$$C_{Chase} = g(0.2 \times f_2(\mathrm{Tom}),\ 0.7 \times f_2(\mathrm{Chase}),\ 0.1 \times f_2(\mathrm{Jerry}))$$
$$C_{Jerry} = g(0.3 \times f_2(\mathrm{Tom}),\ 0.2 \times f_2(\mathrm{Chase}),\ 0.5 \times f_2(\mathrm{Jerry}))$$

In the attention matrix for translating the Chinese sentence "我爱中国" into "I love China", red indicates that an input and an output are highly correlated and the corresponding weight is large. For example, the output "I" is strongly correlated with the input "我", and "China" is strongly correlated with the two inputs "中" and "国".

The input sequence is "我爱中国" ("I love China"), so h1, h2, h3, h4 in the Encoder can be regarded as the information representing "我", "爱", "中", "国" respectively. When translating into English, the first context c1 should be most relevant to the word "I", so the corresponding a11 is relatively large while a12, a13, a14 are relatively small. c2 should be most related to "love", so a22 is relatively large. The last context c3 is most related to h3 and h4, so a33 and a34 are relatively large.

To sum up: by assigning a weight to every source word, the attention mechanism lets each translated word focus differently on each word of the original text (i.e., it aligns the translation with the source). Since the raw scores are unbounded, we apply softmax to normalize them into weights, and then take the weighted sum of the Encoder hidden states with these normalized weights to obtain the context vector C (the semantic encoding).

6.3.2 Process details

The implementation of the attention layer can be divided into 7 steps.

  1. Calculate the hidden state of the Encoder and the hidden state of the Decoder

First, compute the initial Decoder hidden state and all available Encoder hidden states (in the illustrated example there are 4 Encoder hidden states plus the current Decoder hidden state). To produce the Decoder's first hidden state, the Decoder needs an initial state and an input; for example, the Encoder's last state can be used as the initial state, with 0 as the input.

  2. Get a score for each Encoder hidden state (Attention Score)

Calculate the correlation between the Decoder's current hidden state (the current word) and every Encoder hidden state. There are many ways to compute this similarity (dot, general, concat, etc.); the dot product is used here (which assumes the two vectors have the same dimension).

  3. Normalize the scores with softmax

The scores are fed into a softmax layer for normalization; the normalized scores (scalars) add up to 1 and represent the attention weights.

  4. Multiply each Encoder hidden state by its softmax score

Multiplying each Encoder hidden state by its softmax score (a scalar) gives the alignment vectors. This is exactly the mechanism by which alignment arises.

  5. Add up all the alignment vectors

Summing the alignment vectors produces the context vector $C_1$ (the semantic encoding). The context vector aggregates the information of the alignment vectors from the previous step.

  6. Feed the context vector into the Decoder

The context vector is fed into the Decoder for decoding; exactly how it is fed in depends on the model.

At every time step the context vector obtained in step 5 is supplied to the Decoder, so the Decoder's inputs are the aggregated Encoder output (the context vector) and the Decoder's hidden vector from the previous time step.

  7. Backpropagation

    Through repeated iterations that update the Encoder and Decoder weight parameters (and the score function's weights, if any), the Decoder learns to output the final translated sequence.

The complete process simply chains steps 1-6 for every decoding time step, with step 7 (backpropagation) updating the weights during training.
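To make the steps concrete, here is a minimal NumPy sketch of steps 1-6 for a single decoder time step, using the dot-product score; the shapes, random values, and the choice to concatenate the context with the decoder state are illustrative assumptions, not taken from the text.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

np.random.seed(0)
encoder_states = np.random.randn(4, 8)       # 4 source positions, hidden size 8
decoder_state  = np.random.randn(8)          # current decoder hidden state (step 1)

scores  = encoder_states @ decoder_state     # step 2: dot-product attention scores
weights = softmax(scores)                    # step 3: normalized weights, sum to 1
aligned = weights[:, None] * encoder_states  # step 4: weight each encoder state
context = aligned.sum(axis=0)                # step 5: context vector C
decoder_input = np.concatenate([context, decoder_state])  # step 6: one common choice
```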

6.3.3 Score function

Score functions based on dot products (dot product, cosine similarity, etc.) measure the similarity between two vectors. Score functions based on feed-forward neural networks instead let the model learn the alignment weights together with a transformation.

The main score functions include dot, general, location-based, and additive/concat; a sketch of the most common ones follows.
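As an illustration (not from the original post), the three most common score functions can be written as follows; the randomly initialized matrices stand in for learned parameters, and the dimension is arbitrary.

```python
import numpy as np

d = 8
h_enc = np.random.randn(d)       # one encoder hidden state
h_dec = np.random.randn(d)       # current decoder hidden state
W  = np.random.randn(d, d)       # learned matrix for "general"
Wa = np.random.randn(d, 2 * d)   # learned matrices for "additive/concat"
v  = np.random.randn(d)

score_dot      = h_dec @ h_enc                                     # dot product
score_general  = h_dec @ W @ h_enc                                 # general (bilinear)
score_additive = v @ np.tanh(Wa @ np.concatenate([h_dec, h_enc]))  # additive / concat
```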

6.3.4 Attention examples based on seq2seq

Next, three Seq2Seq-based NMT (Neural Machine Translation) architectures are introduced; seeing how Attention is used in each of them deepens the understanding. Source: Detailed explanation of the Attention principle.

  1. Neural Machine Translation by Jointly Learning to Align and Translate

This is the pioneering work on attention mechanisms for machine translation; the authors achieved a BLEU score of 26.75 on the WMT'14 English-French dataset.

[BLEU score](https://blog.csdn.net/weixin_36815313/article/details/106649367): a useful single real-number evaluation metric for algorithms that generate text, judging whether the output is similar in meaning to a human-written reference text (it is not used for speech recognition).

The "Align" in the title means directly adjusting the weights responsible for the score while training the model. Here are the features of this model:

  • The encoder is a bidirectional (forward + backward) gated recurrent unit (BiGRU). The decoder is a GRU whose initial hidden state is a vector derived from the last hidden state of the backward encoder GRU;

  • The score function in the attention layer is additive/concat;

  • The input to the next decoder time step is the concatenation between the output of the previous decoder time step (pink) and the context vector of the current time step (dark green).

  2. [Effective Approaches to Attention-based Neural Machine Translation](http://diyhpl.us/~bryan/papers2/ai/speech-recognition/Effective approaches to attention-based neural machine translation - 2015.pdf)

In the WMT'15 English-German test, the model achieved a BLEU score of 25.9. The key points of this paper are as follows:

  • The Encoder is a two-layer LSTM network, and so is the Decoder, whose initial hidden state is the Encoder's last hidden state;
  • The score functions they experimented with were (i) additive/concat, (ii) dot product, (iii) location-based, and (iv) "general";
  • The concatenation between the output from the current decoder timestep and the context vector from the current timestep is fed into the feedforward neural network to give the final output (pink) for the current decoder timestep.
  3. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

GNMT, Google's Neural Machine Translation, is a combination of the two previous examples (heavily inspired by the first). The model achieved 38.95 BLEU on the WMT'14 English-French test and 24.17 BLEU on the WMT'14 English-German test. The key points of this paper are as follows:

  • The encoder consists of 8 LSTM layers, where the first layer is bidirectional (its outputs are concatenated), with residual connections between the outputs of successive layers (starting from layer 3). The decoder is a separate stack of 8 unidirectional LSTM layers;
  • The score function used is additive/concat, similar to the first example;
  • Again, just like in the first example, the input to the next decoder time step is the concatenation between the output of the previous decoder time step (pink) and the context vector of the current time step (dark green).

6.4 Self-Attention Mechanism

Self Attention is a new attention mechanism proposed in the Transformer paper " Attention is all you need ".

In machine translation, the input Source and the output Target are generally different content. For example, when translating English into Chinese, the Source is English and the Target is Chinese. The Attention mechanism described in the previous section occurs between a Target element and all elements of the Source.

As the name suggests, Self-Attention is not Attention between Target and Source, but Attention that occurs among the elements within the Source, or among the elements within the Target. It can also be understood as Attention in the special case Target = Source. The calculation is the same; only the objects being attended over change.

Therefore, we can call the Attention mechanism mentioned in the previous section Seq2Seq Attention or the traditional Attention mechanism, and Self Attention can also be called intra Attention (internal Attention).

Seq2Seq Attention: It is essentially a word alignment mechanism between target language words and source language words ;

Self Attention: It can capture some syntactic features or semantic features between words in the same sentence .

6.4.1 Detailed Explanation of Mechanism

Source of this section: Super-detailed illustration Self-Attention .

Before talking about Self Attention, let's first understand the Attention mechanism explained by the KQV model , which is also the core part of Transformer. The core formula of the key-value pair Attention is as follows.
$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where $Q$ is the Query, $K$ is the Key, and $V$ is the Value. Think of the input word as a Query $Q$, with its context stored in memory as key-value pairs $(K, V)$. Any attention mechanism is then a mapping from a Query to a series of key-value pairs (Key, Value); in Self Attention, Key = Value = Query (they all come from the same input).

Now let's break it down step by step:

1. Key-value pair attention

Let's put aside the three matrices Q, K, and V first. The most primitive form of Self-Attention actually looks like this (X is a matrix):
$$Softmax(XX^T)X$$

  1. First, look at $XX^T$: a matrix multiplied by its own transpose.

    $XX^T$ is a square matrix. Viewed in terms of row vectors, it stores the inner product of each row vector with itself and with every other row vector. The inner product of two vectors reflects the angle between them, i.e. the projection of one vector onto the other: the larger the projection, the more strongly the two vectors are correlated.

  2. The second step is to perform the Softmax operation.

    Softmax normalizes the scores: after Softmax, each row of values sums to 1.

    The core of the Attention mechanism is a weighted sum, and the weights come from this normalized data.

  3. Finally, multiply by the matrix $X$ again.

    Multiplying a row vector of $Softmax(XX^T)$ with $X$ yields a row vector with the same dimension as a row of $X$. Each dimension of the new vector is a weighted sum of that dimension across all the word vectors (three in the illustrated example). This new row vector is the word vector's representation after the attention-weighted summation.

A more vivid way to picture this is as a grid of weights between words, where the intensity of each cell indicates how strongly the corresponding pair of words is correlated.

2. Q, K, V matrix

The three matrices Q, K, and V do not change the essence of the formula; the question is simply where they come from. Each of Q, K, and V is the product of $X$ with a weight matrix $W$, i.e. a linear transformation of $X$. The weight matrices are introduced to improve the model's fitting capacity, and the $W$ matrices are optimized as the network is trained.

3. The scaling factor $\sqrt{d_k}$

In the full formula $Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, the product $QK^T$ is additionally divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors ($K$). The reason for this is:

Suppose the elements of $Q$ and $K$ have mean 0 and variance 1. Then the elements of $A = QK^T$ have mean 0 and variance $d_k$. When $d_k$ becomes very large, the variance of the elements of $A$ becomes large as well, and the distribution of $Softmax(A)$ becomes very steep (a distribution with large variance concentrates its mass on the entries with the largest absolute values).

In short, the distribution of $Softmax(A)$ depends on $d_k$. Dividing every element of $A$ by $\sqrt{d_k}$ brings the variance back to 1, which decouples the "steepness" of $Softmax(A)$ from $d_k$ and keeps the gradients stable during training.
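Putting the pieces together, here is a minimal NumPy sketch of the formula above, with $Q$, $K$, $V$ obtained as linear transformations of the same $X$ and the scores scaled by $\sqrt{d_k}$; the matrix shapes and random initializations are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, d_k = 3, 16, 8           # 3 tokens, model dim 16, key/query dim 8
X = np.random.randn(n, d_model)      # input word vectors

W_Q = np.random.randn(d_model, d_k)  # learned projection matrices
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
A = Q @ K.T / np.sqrt(d_k)           # scaled similarity scores, shape (n, n)
Z = softmax(A, axis=-1) @ V          # weighted sum of values, shape (n, d_k)
```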

6.4.2 Calculation process

Knowing the core formula, we can have an understanding of the entire calculation process of Self-Attention (Source: The Illustrated Transformer ), a total of 6 steps:

1. Input vector

The first step in calculating Self-Attention is to create three vectors from each of the encoder's input vectors (here, the word embedding of each word). So for each word we create a Query vector $Q$, a Key vector $K$, and a Value vector $V$.

These 3 vectors are obtained by multiplying the word vector (Embedding) by three weight matrices learned during training ($W^Q$, $W^K$, $W^V$). The dimension of these 3 vectors is smaller than that of the Embedding vector: 64 dimensions in this example.

Note: the Embedding and the encoder/decoder input/output vectors are 512-dimensional. The smaller 64-dimensional vectors are an architectural choice of the Transformer that keeps the Multi-Head Attention computation (mostly) constant; the dimension does not have to be smaller.

2. Calculate the score

The second step in calculating Self-Attention is to calculate the score (relevance). Suppose we are computing the Self-Attention for the first word "Thinking" in this example, we need to score this word against each word of the input sentence. The score determines how much weight (attention) is placed on other words of the input sentence when we encode a word in a certain position.

The score is calculated as the dot product of the query vector with the key vector of the word being scored ($QK^T$). So when we process the self-attention for "Thinking", the first score is the dot product of $q_1$ and $k_1$, and the second score is the dot product of $q_1$ and $k_2$.

3. Normalization

Divide the scores from the second step by $\sqrt{d_k}$ ($\sqrt{64} = 8$ in this example) to obtain more stable gradients, and then normalize the scores with softmax.

At this point the Softmax score determines how much each word is expressed at this position (for the current word). Obviously the word at this position (the current word itself) will have the highest softmax score, but sometimes it is useful to focus on another word that is related to the current word.


4. Multiply vector

The fourth step is to multiply each Value vector $V$ by the softmax score computed in the previous step.

The purpose of this step is to keep the values of the words we want to focus on and to reduce the weight of irrelevant words (for example, by multiplying them by small numbers like 0.001).

5. Aggregate Weighting

The fifth step is to sum up the weighted Value vectors from the previous step. This is the Self-Attention output generated for the current word ("Thinking" in this example).

6. Repeat for each word

Finally, Self-Attention performs the same computation once for every input word vector, and the network's parameters are updated through continued training iterations.

Precisely because attention is computed with every input vector, the order of the input sequence is not taken into account. Each word vector simply takes inner products with the other word vectors, so the result loses the order information of the original text: shuffling the word vectors still gives each word the same result (a small check of this follows below).

But text is time-series data, and this Self-Attention structure by itself destroys the ordering. The Transformer paper addresses this with positional embeddings, though that is only a stopgap measure.
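A small NumPy check of this order-blindness, under illustrative random inputs: permuting the input rows only permutes the output rows in the same way, so each token's representation is unchanged by word order.

```python
import numpy as np

def self_attention(X):
    A = X @ X.T / np.sqrt(X.shape[1])            # scaled scores
    W = np.exp(A - A.max(axis=-1, keepdims=True))
    W = W / W.sum(axis=-1, keepdims=True)        # row-wise softmax
    return W @ X                                 # weighted sum of inputs

np.random.seed(0)
X = np.random.randn(4, 8)                        # 4 "words", dimension 8
perm = np.array([2, 0, 3, 1])                    # shuffle the word order

out, out_perm = self_attention(X), self_attention(X[perm])
print(np.allclose(out[perm], out_perm))          # True: same representations, reordered
```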

6.4.3 Three types

The Transformer uses Self-Attention in three places: the Encoder's Multi-Head Self-Attention, the Decoder's Masked Multi-Head Self-Attention, and the Encoder-Decoder Attention. Each is detailed later.

6.4.4 Functions and advantages

  1. Relative to RNN:
    • An RNN has some ability to capture long-distance dependencies, since the sequence model keeps information flowing through gating units and transmits it selectively. But as the text grows longer its ability to capture dependencies keeps dropping, because every recursion loses some information; the Attention mechanism was introduced to strengthen the capture of the dependencies we care about. In addition, a sequence model cannot effectively express hierarchical information.

    • With Self Attention it becomes much easier to capture long-distance interdependent features within a sentence: any two words in the sentence are connected directly by a single computation step, so the distance between long-range dependent features is greatly shortened, which makes these features easier to use effectively;

    • Self Attention does not rely on sequential operations, which increases the parallelism of the computation.

  2. Relative to CNN:
    • CNN is also widely used in NLP. A CNN model can be regarded as an n-gram detector, where the n of the n-gram corresponds to the size of the convolution kernel. CNN rests on the assumption that local information is interdependent, and the convolution kernel extracts these dependencies in an n-gram-like form. In addition, CNN has a hierarchical receptive field, so the path length between any two positions is logarithmic.
    • Self Attention shortens the path between elements further, from CNN's logarithmic path length to a constant path length;
    • It also moves from CNN's fixed-size receptive field to a variable-sized one whose extent equals the (variable) text length, which is also the advantage of Self Attention over ordinary attention.
  3. Relative to Seq2Seq Attention:
    • If an ordinary attention mechanism computes the attention scores within a window, its receptive field is only that window, and the computation has to be repeated as the window moves.
    • Self Attention obtains the similarity between any two elements of the text sequence in a single matrix computation, taking the entire text as its scope.

6.5 Transformer

The Transformer architecture was first introduced by Google in the 2017 paper "Attention Is All You Need", abandoning the CNNs and RNNs used for previous deep learning tasks. A main reason for its popularity is that the architecture is highly parallelizable: the Transformer takes advantage of powerful TPUs and parallel training, which reduces training time. Since then, excellent models such as GPT and BERT have been built on top of it; they are all derived from the Transformer.

6.5.1 Overall structure

Here is an introduction to the general framework of Transformer, and the specific explanation will be given later.

The Transformer is composed of 6 stacked Encoders and 6 stacked Decoders; the $N\times$ in its architecture diagram indicates that the block is repeated $N = 6$ times.

1. Encoder layer structure

First, the model performs an Embedding operation on the input data, which can be understood as an operation similar to word2vec. Once the Embedding is done, the result is fed into the Encoder layer.

Each Encoder has two sub-layers: a Multi-Head Self-Attention layer and a Position-Wise Feed-Forward Network (a fully connected feed-forward layer). Residual connections and LayerNorm are applied around each sub-layer to avoid vanishing and exploding gradients. The fully connected layer can be computed in parallel across positions, and the resulting output is fed into the next Encoder layer.

Add stands for the Residual Connection, which addresses the difficulty of training deep multi-layer networks. By passing the previous layer's information on to the next layer unchanged, each layer only has to focus on the residual (difference) part. The idea comes from the ResNet model in CNNs.

Norm stands for Layer Normalization, which accelerates the training process of the model and makes it converge faster by normalizing the activation value of the current layer.

2. Decoder layer structure

Each Decoder has a layered structure similar to the Encoder's, except that between the Masked Multi-Head Self-Attention layer and the Position-Wise Feed-Forward Network there is an additional Encoder-Decoder Attention layer (also called Multi-Head Context-Attention), which helps the Decoder focus on the relevant words of the input sentence while decoding.

Masked operation: its role is to prevent future output words from being used during training. For example, during training the first word must not see the generation result of the second word. Masking zeroes out this information, ensuring that the prediction for position i can only depend on the outputs before i.

3. Positional Encoding

Unlike sequence models, the Transformer has no inherent way of interpreting the order of words in the input sequence. To deal with this, the Transformer adds an extra vector, the Positional Encoding, to the inputs of the Encoder and Decoder layers; it encodes the relative or absolute position of each word in the sequence and has the same dimension as the Embedding.

This vector follows a pattern that the model can learn to use; it determines the position of the current word, or the distance between different words in a sentence. There are many ways to compute such a position vector; the paper chooses trigonometric functions, as follows:

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the position of the current word in the sentence, $i$ indexes the values of the vector, and $d$ is the dimension of the $PE$ (the same as the word Embedding). At even positions ($2i$) the sine encoding is used; at odd positions ($2i+1$) the cosine encoding is used.

Finally, this Positional Encoding and the value of Embedding are added (addition) and sent to the next layer as input.


If we assume that the dimension of the word vector is 4, the Positional Encoding for each position is a 4-dimensional vector of interleaved sine and cosine values.
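A short sketch of this sinusoidal encoding, following the formula above; max_len and d_model are arbitrary illustrative values.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)  # one row per position
# input_to_encoder = word_embeddings + pe[:sequence_length]
```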

4. Calculation process

  • Step 1: Obtain the representation vector X of each word in the input sentence; X is the sum of the word's Embedding and its positional Embedding;

  • Step 2: Pass the resulting word representation matrix into the encoder. After the 6 Encoder blocks, the encoded information matrix $Z$ for all the words in the sentence is obtained (the first Encoder's input is the sentence's representation matrix, each subsequent Encoder takes the previous Encoder's output as input, and the matrix output by the last Encoder is the encoded information matrix $Z$);

  • Step 3: The encoded information matrix $Z$ output by the Encoder is passed to the decoder, which translates the next word i+1 based on the words 1~i translated so far. During this process, when translating word i+1, the words after position i must be covered by the Mask operation;

  • Step 4: The decoder receives the Encoder's information matrix $Z$. First the translation start token "<Begin>" is passed in and the first word "I" is predicted and output; then "<Begin>" together with the predicted first word "I" is passed in, the next word "have" is predicted and output, and so on.

The overall dynamic flow chart is as follows:


6.5.2 Details

Next, according to the order from the Encoder layer to the Decoder layer, the components inside the Transformer are analyzed in detail one by one.

1. Multi-Head Attention (multi-head attention mechanism)

Multi-Head Attention is also known as Multi-Head Self-Attention. The Self-Attention section already showed how Self-Attention computes its output; Multi-Head Self-Attention combines several Self-Attention heads.

In the Multi-Head Attention structure from the paper, $h$ is the number of Self-Attention heads.

The input matrix $X$ is passed to $h$ different Self-Attention heads. Each head has its own Query/Key/Value weight matrices (i.e. different $W^Q$, $W^K$, $W^V$), which yields $h$ different output matrices $Z_i,\ i = 1, 2, \dots, h$. In the original paper $h = 8$, i.e. there are 8 Self-Attention heads.

But the fully connected layer that follows Self-Attention does not expect 8 matrices, only 1 (one vector per word), so we need a way to compress these 8 matrices into one. The approach taken in the paper is to concatenate the 8 matrices $Z_i$ and then multiply by an additional weight matrix $W^O$.

In summary, the whole Multi-Head Attention computation is: project $X$ into $h$ heads, run Self-Attention in each head, concatenate the head outputs, and multiply by $W^O$; a sketch follows.
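A minimal NumPy sketch of that computation, following the single-head Self-Attention described in 6.4; the head count and dimensions follow the paper's h = 8 and d_model = 512, while the random matrices stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

n, d_model, h = 3, 512, 8
d_k = d_model // h                        # 64 per head
X = np.random.randn(n, d_model)

heads = []
for _ in range(h):                        # one set of projections per head
    W_Q = np.random.randn(d_model, d_k)
    W_K = np.random.randn(d_model, d_k)
    W_V = np.random.randn(d_model, d_k)
    heads.append(scaled_dot_attention(X @ W_Q, X @ W_K, X @ W_V))

W_O = np.random.randn(h * d_k, d_model)
Z = np.concatenate(heads, axis=-1) @ W_O  # back to one (n, d_model) matrix
```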

The use of Multi-Head can improve the performance of Self-Attention from two aspects:

  • It extends the model's ability to focus on different positions. In a single Self-Attention, $Z$ may be dominated by the word itself, even though it contains a little information from the other encodings. When translating a sentence like "The animal didn't cross the street because it was too tired", we want the model to know that "it" refers to "animal";

  • It provides multiple "representation subspaces" for the Attention layer. After training, each set of Query/Key/Value weight matrices will project the input word embedding (or the vector of the previous encoder/decoder) into a different representation subspace.

    Comparable to the effect of using multiple convolution kernels at the same time in CNN, intuitively speaking, multiple heads help the network to capture richer features/information.

2. Add&Norm

In the Transformer, each sub-layer in each encoder is followed by an Add (residual connection) and a Norm (layer normalization) operation, represented as:
$$LN(X + F(X))$$
where $F(X)$ is the sub-layer applied to $X$, i.e. Multi-Head Attention or Feed Forward.

  1. Residual Connection

Residual: In mathematical statistics, it refers to the difference between the actual observed value and the estimated value (fitted value).

A simplified ResNet residual connection computes the output of several layers, $F(X)$, and then adds $X$ back to it. This lets the network focus only on the current residual, which effectively alleviates vanishing gradients and network degradation, and is commonly used to make deep multi-layer networks trainable.

  2. Layer Normalization

Norm refers to Normalization (Layer Normalization is commonly used with RNN structures). There are many ways to normalize data, but they share a common goal: to transform each layer's neuron inputs into data with mean 0 and variance 1, which regularizes the optimization landscape and speeds up convergence. The data is then fed into the activation function (this keeps the inputs from concentrating in the activation function's saturation region).

  • Batch Normalization (BN): Normalization is performed on each batch of data in each layer.

    • When we optimize with gradient descent we may normalize the input data, but after passing through a network layer the data is no longer normalized. As the number of layers grows, the data distribution keeps shifting and the deviation grows, which forces us to use a smaller learning rate to keep the gradients stable and avoid vanishing or exploding gradients.
    • Concretely, BN normalizes each mini-batch of data along the batch dimension.
  • Layer Normalization (LN): compute the mean and variance over each individual sample.
    $$LN(X_i) = \alpha \frac{X_i - \mu_i}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

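A brief PyTorch sketch of the Add & Norm wrapper $LN(X + F(X))$; the linear layer here is only a stand-in for the attention or feed-forward sub-layer, and the shapes are illustrative.

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)            # per-sample normalization over features
sublayer = nn.Linear(d_model, d_model)  # stand-in for F(X): attention or FFN

x = torch.randn(2, 10, d_model)         # (batch, seq_len, d_model)
out = norm(x + sublayer(x))             # residual connection, then LayerNorm
```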

3. Feed Forward

The Feed Forward layer is relatively simple: a 2-layer fully connected network in which the first layer uses the ReLU activation and the second layer uses no activation. The corresponding formula is $FFN(X) = \max(0,\ X \cdot W_1 + b_1) \cdot W_2 + b_2$.

The FFN performs a spatial transformation: it introduces nonlinearity (the ReLU activation) and transforms the space of the Attention output, which improves the model's performance (the model still works with the FFN removed, but the results are much worse).
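A short PyTorch sketch of this position-wise feed-forward layer; d_model = 512 and the inner dimension 2048 follow the original paper's settings, but the module itself is only an illustrative reconstruction.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                      # the only nonlinearity in the block
            nn.Linear(d_ff, d_model),       # no activation after the second layer
        )

    def forward(self, x):                   # applied to every position independently
        return self.net(x)

y = FeedForward()(torch.randn(2, 10, 512))  # output keeps shape (2, 10, 512)
```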

4. Masked Multi-Head Attention

The first Multi-Head Attention in the Decoder uses the Masked operation because translation proceeds sequentially: only after the i-th word has been translated can the (i+1)-th word be translated. The Masked operation prevents position i from seeing the information of the words from position i+1 onward (those words are masked).

Mask represents a mask, which masks certain values ​​so that they have no effect when parameters are updated.

During training the Decoder uses Teacher Forcing, i.e. the correct word sequence and the corresponding outputs are passed to the Decoder. The steps are basically the same as in Multi-Head Attention:

  • Step 1: Take the Decoder's input matrix $X$ and the Mask matrix; compute the $Q$, $K$, $V$ matrices from $X$, then compute the product $QK^T$ of $Q$ and $K^T$;

  • Step 2: Before the Softmax, use the Mask matrix to block out the information after each word (the masked entries are set to 0);
  • Step 3: Multiply the masked $QK^T$ by the matrix $V$ to obtain the output matrix $Z_i$. Then, just as in the Encoder, the multiple head outputs $Z_i$ are concatenated (Multi-Head) to produce the output $Z$ of this first Multi-Head Attention.

In fact, the Transformer involves two kinds of masks, the Padding Mask and the Sequence Mask. The Padding Mask is needed in every Scaled Dot-Product Attention, while the Sequence Mask is only used in the Decoder's Self-Attention (the Sequence Mask is the one described above).

  • Padding Mask:
    • The input sequences in a batch have different lengths, so they need to be aligned: shorter sequences are padded with zeros at the end, while sequences that are too long are truncated and the excess discarded. Since the padded positions carry no meaning, the attention mechanism should not attend to them, so some processing is needed.
    • Concretely, a very large negative number (negative infinity) is added to the scores at these positions, so that after Softmax their probabilities are close to 0. The Padding Mask itself is a Boolean tensor that marks which positions are to be masked out.
  • Sequence Mask:
    • The Sequence Mask prevents the Decoder from seeing future information.
    • The method is the same as described above: generate an upper-triangular matrix whose upper-triangular positions are masked out, and apply it to every sequence; a sketch of both masks follows.
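A minimal PyTorch sketch of the two masks, assuming the common implementation in which masked scores are filled with a large negative value before Softmax.

```python
import torch

def sequence_mask(size):
    # True above the diagonal: future positions to be hidden from the Decoder.
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

def padding_mask(lengths, max_len):
    # True where a position lies beyond the sequence length, i.e. is padding.
    positions = torch.arange(max_len)[None, :]
    return positions >= lengths[:, None]

scores = torch.randn(5, 5)                           # raw QK^T scores for one sequence
scores = scores.masked_fill(sequence_mask(5), float('-inf'))
weights = torch.softmax(scores, dim=-1)              # row i attends only to positions <= i

pad = padding_mask(torch.tensor([3, 5]), max_len=5)  # batch of two sequences
```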

5. Encoder-Decoder Attention

Encoder-Decoder Attention is the second Multi-Head Attention in the Decoder and differs little from ordinary Multi-Head Attention. The main difference is that its $K$ and $V$ matrices are computed not from the previous Decoder sub-layer's output but from the Encoder's encoded information matrix $Z$.

That is, $K$ and $V$ are computed from the Encoder's output $Z$, while $Q$ is computed from the previous Decoder block's output (or from the input matrix $X$ in the case of the first Decoder block); the subsequent computation is the same as described earlier. The advantage is that during decoding every word can use the information of all the words in the Encoder (this information does not need to be masked).

6. Linear & Softmax

Output layer: once all the Decoder layers have run, a vector of real numbers is produced at the end. A fully connected layer plus a softmax layer at the end map this vector to the word we need.

The fully connected layer projects the vector produced by the decoder stack into a much larger vector called the logits. The Softmax layer then turns those scores into probabilities (all positive, summing to 1.0). The cell with the highest probability is chosen, and its corresponding word is the output for this time step.

If our dictionary has 10,000 words, the output vector is expanded to 10,000 dimensions (the logits vector) by the fully connected layer and then converted into probabilities by softmax; the word with the largest probability is the final result (a one-hot choice).
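A brief PyTorch sketch of this output layer; the 10,000-word vocabulary matches the example above, while d_model = 512 and the random decoder output are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10_000
generator = nn.Linear(d_model, vocab_size)  # projects to vocabulary-size logits

decoder_output = torch.randn(1, d_model)    # vector for the current time step
logits = generator(decoder_output)          # (1, vocab_size)
probs = torch.softmax(logits, dim=-1)       # all positive, sums to 1.0
next_word_id = probs.argmax(dim=-1)         # index of the predicted word
```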


6.5.3 Summary

At this point, the content of Transformer is introduced. Finally, let's review the structure of Transformer as follows:


Advantages:

  • Although the Transformer ultimately does not escape the traditional deep learning recipe (it is just a combination of fully connected layers, or one-dimensional convolutions, with Attention), its design is innovative enough: it abandons RNNs and CNNs, the most fundamental building blocks in NLP, and still achieves very good results. The design is exciting and worth careful study by anyone working on deep learning;
  • The key design choice behind the Transformer's biggest performance gain is making the distance between any two words equal to 1, which is very effective against the thorny long-term dependency problem in NLP;
  • The Transformer is not limited to machine translation, or even to NLP; it is a direction with great research potential;
  • The algorithm parallelizes very well, which suits the current hardware environment (mainly GPUs).

Disadvantages:

  • Dropping RNN and CNN outright is striking, but it also costs the model the ability to capture local features; combining RNN + CNN + Transformer may bring better results;
  • The position information that the Transformer discards is actually very important in NLP; adding a Position Embedding to the feature vectors, as the paper does, is only a stopgap and does not change this inherent limitation of the Transformer structure.

Other

For specific tasks, the modules inside the Transformer can be adjusted so that the model fits a variety of concrete applications. Several models introduce extra Embeddings on top of the Transformer, modify the Attention mechanism, or otherwise extend the Transformer to adapt it to particular tasks. For details see: Transformer Structure Extension (1), Transformer Structure Extension (2), Transformer Structure Extension (3).

6.6 Transformer-XL

A new paper " Transformer-XL: Attentive Language Models beyond a Fixed-Length Context " jointly launched by CMU and Google Brain in January 2019 combines the advantages of RNN sequence modeling and Transformer's self-attention mechanism. Transformer's attention module is used on each segment and a recurrent mechanism is used to learn the dependencies between consecutive segments.

The specific introduction can be found in: Transformer-XL Interpretation .


Origin blog.csdn.net/Morganfs/article/details/124373914