Transformer model analysis

This is my first time writing a blog by myself... recording my study and daily life... Have fun during the school closure, and I hope to keep working hard in the future.

Reference blog @ https://blog.csdn.net/weixin_42691585/article/details/108994556

And @ https://zhuanlan.zhihu.com/p/48508221

“Stay hungry, stay young”

Transformer's theoretical derivation

Preface

The background of the model:

Existing RNN-based algorithms can only be computed sequentially, from left to right or from right to left. This mechanism brings two problems:

  1. The calculation at time step t depends on the result at time step t-1, which limits the parallelism of the model;

  2. Information is lost during sequential computation. Although gate mechanisms such as LSTM alleviate the long-term dependency problem to some extent, LSTM is still powerless against particularly long dependencies.

    Therefore, the Transformer was proposed to solve the above two problems. First, the attention mechanism reduces the distance between any two positions in the sequence to a constant; second, it is not a sequential structure like an RNN, so it has better parallelism and fits existing GPU frameworks.

Introduction

Transformer is a model based on the encoder-decoder structure. It discards the RNN used in earlier seq2seq models and instead uses self-attention and multi-head self-attention, so the input data can be processed in parallel and operating efficiency improves. The specific structure is shown in the following figure:
[Figure: the Transformer encoder-decoder architecture]
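For orientation (this is my own sketch, not part of the original post): frameworks such as PyTorch ship a reference implementation of this encoder-decoder stack, so the overall structure can be instantiated in a few lines. The sizes below are simply PyTorch's defaults.

```python
import torch
import torch.nn as nn

# Minimal sketch of the full encoder-decoder stack using PyTorch's built-in Transformer.
# Default layout is (sequence_length, batch, d_model).
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)   # source sequence: 10 tokens, batch of 32
tgt = torch.rand(20, 32, 512)   # target sequence: 20 tokens, batch of 32
out = model(src, tgt)           # shape: (20, 32, 512)
print(out.shape)
```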

Specific structure

1. Encoder part

It is mainly composed of a self-attention component and a feed-forward neural network; let's focus on the self-attention component.

1.1 Self-attention component

1. First, we embed the features, converting each word into a vector (or tensor).

Embedding: a dense vector representation that transforms natural language into mathematical language. Compared with a one-hot sparse matrix representation, it takes up less space through dimensionality reduction, and that dimensionality reduction is realized by matrix multiplication. In terms of semantics, mapping different characters to vectors lets us perform mathematical operations on them.
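As a small illustration (a sketch; the vocabulary size, dimension and token ids below are made up, not values from the post), an embedding layer is just a learned lookup table that replaces a one-hot index with a dense vector:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512          # hypothetical vocabulary and embedding sizes
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[3, 17, 256]])  # e.g. the ids of "Head", "First", "Java"
word_vectors = embedding(token_ids)       # shape: (1, 3, 512), one dense vector per word
# Equivalent to multiplying each one-hot vector by the embedding matrix.
print(word_vectors.shape)
```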

In the Transformer, each word vector flows along its own path into the encoder, and in the self-attention layer these paths depend on each other.

2. Entering the attention mechanism

Step 1: after the word vectors enter the self-attention module, each is multiplied by three matrices to produce three vectors: q, k, and v. Below I will walk through the process with a simple example, and the use of the different vectors will become clear along the way:

For example, when we process the sentence "Head First Java": for the word vector of "Head", q1 is dotted with k1, k2, and k3 to obtain three scores. These represent how much attention "Head" (through q1) pays to each of the three words "Head", "First", and "Java".

Okay, now we have three scores representing how much attention q1 pays to each of the three parts:
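Concretely, here is a small sketch of step 1 with made-up numbers and a toy dimension (everything in it is illustrative, not the post's actual values): each word vector is multiplied by three weight matrices to obtain q, k, v, and q1 is then dotted with every k:

```python
import torch

torch.manual_seed(0)
d_model, d_k = 8, 8                      # toy sizes for illustration
x = torch.randn(3, d_model)              # embeddings of "Head", "First", "Java"

W_q = torch.randn(d_model, d_k)          # learned projection matrices (random here)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

q, k, v = x @ W_q, x @ W_k, x @ W_v      # one q, k, v vector per word

# Attention scores of "Head" (q1) against all three words: q1·k1, q1·k2, q1·k3
scores = q[0] @ k.T
print(scores)                            # three raw scores
```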

Step 2: we divide each score by √d_k, the square root of the dimension of the key vectors:

The specific formula is:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Step 3: pass these scores through the softmax function (a normalization operation) to obtain the attention weights.

The softmax function is a classic activation function, often used in multi-class problems: it maps the outputs of multiple neurons into the interval (0, 1), so they can be interpreted as probabilities:

[Figure illustrating softmax, from Zhihu user @PP鲁]
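For concreteness, here is a generic, numerically stable softmax (my own sketch, not tied to the figure above):

```python
import torch

def softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtract the max for numerical stability; the result sums to 1.
    e = torch.exp(x - x.max())
    return e / e.sum()

print(softmax(torch.tensor([2.0, 1.0, 0.1])))  # tensor([0.6590, 0.2424, 0.0986])
```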

In this way, we finally obtain this word vector's degree of attention to every other part.

Step 4: multiply each softmax score by the corresponding value vector v. This keeps the values of the words we focus on and weakens the values of the non-focused words; summing all the weighted vectors gives the final output z for this position. In practice, for speed, the whole process above is implemented with matrices.
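A sketch of that matrix form (my own minimal implementation of the softmax(QKᵀ/√d_k)·V formula above, with toy shapes):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # steps 1 and 2
    weights = torch.softmax(scores, dim=-1)              # step 3
    return weights @ V                                   # step 4: weighted sum of values

Q = K = V = torch.randn(3, 8)        # 3 words, toy dimension 8
z = scaled_dot_product_attention(Q, K, V)
print(z.shape)                       # (3, 8): one output vector per word
```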

A new mechanism introduced in the paper: Multi-head Attention.

Building on the self-attention mechanism, each q, k, v vector is split into multiple heads. Each head performs dot-product attention only with the corresponding heads of the other vectors, producing one output per head (for example, two b's for two heads); these outputs are concatenated and multiplied by a weight matrix to form a single b.
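A minimal sketch of the splitting-and-concatenating idea, assuming 2 heads (the sizes, random matrices and names are my own illustrative choices):

```python
import math
import torch

torch.manual_seed(0)
d_model, num_heads = 8, 2
d_head = d_model // num_heads

x = torch.randn(3, d_model)                        # 3 word vectors
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v                # full q, k, v as before

heads = []
for h in range(num_heads):
    sl = slice(h * d_head, (h + 1) * d_head)       # this head's slice of q, k, v
    qh, kh, vh = q[:, sl], k[:, sl], v[:, sl]
    w = torch.softmax(qh @ kh.T / math.sqrt(d_head), dim=-1)
    heads.append(w @ vh)                           # this head's output b_i

W_o = torch.randn(d_model, d_model)                # weight matrix that merges the heads
b = torch.cat(heads, dim=-1) @ W_o                 # concatenate heads, project to one b
print(b.shape)                                     # torch.Size([3, 8])
```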

1.2 Add & Normalization operation

For the resulting vector b, we perform Add & Normalization; the processing flow is as follows:

For each b, we compute b' = a + b, where a is the input vector (a residual connection), and then apply Layer Normalization (normalization across the feature dimension of each position).

Layer Normalization: standardize with the classic normal-distribution standardization (subtract the mean and divide by the standard deviation), then pass the result through the activation function to produce the output.

Finally, we feed the output into the feed-forward neural network and perform the Add & Normalization operation again.
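A sketch of one encoder sub-layer pass under the arrangement just described (toy sizes; the layer names and the random inputs standing in for the attention output are illustrative):

```python
import torch
import torch.nn as nn

d_model, d_ff = 8, 32
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)  # one LayerNorm per sub-layer
feed_forward = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                             nn.Linear(d_ff, d_model))

a = torch.randn(3, d_model)            # sub-layer input (the word vectors)
b = torch.randn(3, d_model)            # self-attention output b for the same positions

h = norm1(a + b)                       # b' = a + b, then Layer Normalization
out = norm2(h + feed_forward(h))       # feed-forward, then Add & Normalization again
print(out.shape)                       # (3, 8)
```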

2. Decoder part

Compared with the encoder, it has one extra masked Multi-head Self-Attention, which hides future information so the decoder only attends to the sequence generated so far.

The input of the decoder is the output generated at the previous time step; it enters the decoder through the output embedding plus positional information. The specific process is as follows:

Step 1: enter the masked Multi-head Self-Attention, which hides future information and attends only to the current sequence (a sketch of the mask follows these steps).

Step 2: Add & Normalization.

Step 3: enter the Multi-head Attention component. Its inputs this time are the encoder's final outputs, used as K and V, together with the (context) vector from the previous normalization step, which serves as the query.

Step 4: the same as in the encoder: Add & Normalization, feed-forward neural network, and Add & Normalization again.

Step 5: convert the floating-point vector into a word; this requires a linear layer followed by a softmax layer.
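Before turning to the linear and softmax layers, here is a minimal sketch of the mask used in step 1 (my own illustration; the idea is to set future positions to -inf before softmax so they receive zero weight):

```python
import torch

T = 4                                             # length of the generated sequence so far
scores = torch.randn(T, T)                        # raw decoder self-attention scores

# Causal mask: position i may only attend to positions <= i.
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))

weights = torch.softmax(scores, dim=-1)
print(weights)                                    # the upper triangle is exactly 0
```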

The linear layer is a simple fully connected layer that maps the final output of the decoder to a very large logits vector. Suppose the model knows 10,000 words (the output vocabulary) learned from the training set; then the logits vector has 10,000 dimensions, and each value is the score of one word.

The softmax layer converts these scores into probabilities (this process was discussed above). The word corresponding to the dimension with the highest probability is the final predicted output word.
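A sketch of step 5 under the 10,000-word vocabulary assumption above (the layer names are my own):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
linear = nn.Linear(d_model, vocab_size)       # maps the decoder output to logits

decoder_output = torch.randn(1, d_model)      # final decoder vector for one position
logits = linear(decoder_output)               # 10,000 scores, one per vocabulary word
probs = torch.softmax(logits, dim=-1)         # convert scores to probabilities

predicted_word_id = probs.argmax(dim=-1)      # pick the most probable word
print(predicted_word_id)
```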

Original post: blog.csdn.net/weixin_45717055/article/details/112506230