Transformer model overview

The Transformer model was proposed in the paper "Attention is All You Need". It is a complete encoder-decoder framework built mainly around the attention mechanism. Paper address: https://arxiv.org/abs/1706.03762 .

Its overall structure is shown below:

 

The model consists of two parts, an encoder (Encoder) and a decoder (Decoder); their internal structure is shown in Figure II below:

 Figure II


In the paper, the encoder part is a stack of 6 identical encoders and the decoder part is a stack of 6 identical decoders; parameters are not shared between the encoders. (The number does not have to be 6.)

Before the word-vector representation enters the encoder and the decoder, positional encoding is applied first. Positional encoding, encoding, and decoding are introduced in turn below:

1、positional encoding

 

 

As shown, the attention mechanism itself carries no position information. The sentence is therefore first embedded to obtain a word-vector representation, and then, to add position information, a positional-encoding vector describing each word's position in the sentence is added to the word embedding. The method used in the paper is to construct a matrix with the same dimensions as the input embedding and add it to the input embedding; the sum serves as the input to the multi-head attention.

The positional encoding used by the authors is as follows:
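Written out, the formulas from the paper are

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$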

 

Here PE is a two-dimensional matrix with the same size as the input embedding: each row corresponds to a word and each column to a dimension of the word vector. pos is the position of the word in the sentence, d_model is the dimension of the word vector, and i indexes the dimensions of the word vector. The formula therefore adds a sin value at the even-numbered dimensions of each word's vector and a cos value at the odd-numbered dimensions, filling the PE matrix; this matrix is then added to the input embedding, which completes the introduction of positional encoding.
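As a minimal sketch of this step (NumPy, with illustrative sizes; the helper name positional_encoding is my own, not from the paper), the PE matrix can be built and added to the embedding like this:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # pos: position of the word in the sentence; i: dimension index of the word vector
    pos = np.arange(max_len)[:, None]                    # shape (max_len, 1)
    i = np.arange(d_model)[None, :]                      # shape (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                 # sin at even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])                 # cos at odd dimensions
    return pe

# toy usage: a "sentence" of 5 words with 8-dimensional embeddings
embedding = np.random.randn(5, 8)
encoder_input = embedding + positional_encoding(5, 8)    # this sum is fed to multi-head attention
```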

2、encoding

As shown in the structure on the left of Figure II, each encoder layer consists mainly of a self-attention layer and a feed-forward neural network. Note that each sub-layer in every encoder (the self-attention layer and the feed-forward network) has a residual connection around it, followed by a "layer normalization" step, as sketched below.
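A minimal sketch of this residual "add & norm" wrapper (NumPy; the learnable scale and bias of layer normalization are omitted, and self_attention / feed_forward are placeholder names):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # residual connection around the sub-layer, followed by layer normalization
    return layer_norm(x + sublayer(x))

# usage inside one encoder layer (self_attention and feed_forward are placeholders):
# x = add_and_norm(x, self_attention)
# x = add_and_norm(x, feed_forward)
```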

Now let's introduce the attention mechanism with an example. Suppose we want to translate this sentence:

“The animal didn't cross the street because it was too tired”

Does "it" in this sentence refer to the animal or to the street? This is easy for a human to understand, but very hard for a machine. When the model processes the word "it", the self-attention mechanism allows "it" to be associated with "animal". As the model processes each word of the input sequence, self-attention attends over the entire input sequence, which helps the model encode the current word better. As shown below.

 

When we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention focuses on "The animal" and folds part of its representation into the encoding of "it".

Next, let's look at how self-attention is computed.

The first step in computing self-attention is to generate three vectors from each input vector of the encoder (the word vector of each word). That is, for each word we create a query vector, a key vector, and a value vector. These three vectors are created by multiplying the word embedding by three weight matrices. In the paper the dimension of these three vectors is lower than that of the word embedding; a lower dimension is not strictly required, it is simply an architectural choice that keeps the computation of multi-head attention largely constant.
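As a sketch of this first step (NumPy; the 512/64 sizes follow the paper, and the random weight matrices only stand in for learned parameters):

```python
import numpy as np

d_model, d_k = 512, 64                        # embedding size and query/key/value size from the paper
W_q = np.random.randn(d_model, d_k) * 0.01    # query weight matrix (learned in a real model)
W_k = np.random.randn(d_model, d_k) * 0.01    # key weight matrix
W_v = np.random.randn(d_model, d_k) * 0.01    # value weight matrix

x1 = np.random.randn(d_model)                 # embedding of the first word, e.g. "Thinking"
q1, k1, v1 = x1 @ W_q, x1 @ W_k, x1 @ W_v     # its query, key and value vectors (each 64-dim)
```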

The second step in computing self-attention is to calculate a score. Suppose we are computing self-attention for the first word, "Thinking"; we need to score every word of the input sentence against "Thinking". These scores determine how much attention is placed on other parts of the sentence while encoding the word "Thinking".

The scores are calculated by taking the dot product of the query vector of "Thinking" with the key vector of each word being scored (every word of the input sentence). So if we are computing self-attention for the word in the first position, the first score is the dot product of q1 and k1, and the second score is the dot product of q1 and k2.

 

The third and fourth steps are to divide the scores by 8 (8 is the square root of the key-vector dimension used in the paper, 64; this makes the gradients more stable. Other values could be used here, but 8 is the default) and then to pass the results through a softmax. The role of the softmax is to normalize the scores of all the words so that they are all positive and sum to 1.

 

The softmax score determines how much each word contributes to the encoding of the current position ("Thinking"). Obviously, the word at this position itself will receive the highest softmax score, but sometimes attending to other words related to the current word also helps.

The fifth step is to multiply each value vector by its softmax score (in preparation for summing them). The intuition here is to keep the values of semantically relevant words intact and to drown out irrelevant words.

The sixth step is to sum the weighted value vectors, which produces the output of the self-attention layer at this position.

 

This completes the self-attention calculation. The resulting vector can then be passed on to the feed-forward neural network.
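Putting steps two through six together for a single position, here is a minimal NumPy sketch (the function and helper names are my own, not something from the paper):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def self_attention_at_position(q, keys, values, d_k=64):
    scores = keys @ q                  # step 2: dot product of q with every key vector
    scores = scores / np.sqrt(d_k)     # step 3: divide by sqrt(d_k) (8 in the paper)
    weights = softmax(scores)          # step 4: positive weights that sum to 1
    return weights @ values            # steps 5-6: weight each value vector and sum them up

# toy usage: a 3-word sentence with 64-dimensional q/k/v vectors
keys = np.random.randn(3, 64)
values = np.random.randn(3, 64)
q1 = np.random.randn(64)
z1 = self_attention_at_position(q1, keys, values)   # self-attention output for the first word
```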

In practice, self-attention is implemented with matrix operations, following the same ideas as above:

The first step is to compute the query matrix, the key matrix, and the value matrix, as shown below:

 

The remaining steps can be combined into a single calculation:
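In the notation of the paper, this combined calculation is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$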

 

Having introduced the self-attention mechanism, we now introduce the "multi-headed" attention used in the paper.

 

Each head has its own query/key/value weight matrices, which produce different query/key/value matrices. The paper uses 8 heads, so after running the input through 8 different sets of weight matrices we get 8 different Z matrices.

 

Then we compress these 8 matrices into a single matrix: we concatenate the 8 matrices together and multiply them by an additional weight matrix, which gives a matrix Z that fuses the information from all the attention heads. This matrix is then passed on to the feed-forward layer (after the residual addition and layer normalization).
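A compact sketch of the whole multi-head computation in matrix form (NumPy; 8 heads as in the paper, with random weights standing in for learned parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads=8, d_model=512):
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # each head has its own query/key/value weight matrices (random here, learned in practice)
        W_q = np.random.randn(d_model, d_k) * 0.01
        W_k = np.random.randn(d_model, d_k) * 0.01
        W_v = np.random.randn(d_model, d_k) * 0.01
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # the Z matrix of this head
    W_o = np.random.randn(n_heads * d_k, d_model) * 0.01    # extra weight matrix that fuses the heads
    return np.concatenate(heads, axis=-1) @ W_o             # splice the 8 Z matrices, then project

Z = multi_head_attention(np.random.randn(5, 512))           # 5 input words in, fused matrix Z out
```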

 

3、decoding (Decoder)

The internal components of the decoder are similar to those of the encoder. Note that the first attention layer of the decoder is called Masked Multi-Head Attention: the mask operation ensures that each position can only attend to earlier positions of the output sequence, i.e. to what has already been generated. The second attention layer is called the encoder-decoder attention layer; as can be seen from Figure II, its queries come from the output of the previous decoder layer, while its keys and values come from the output of the encoder, so the encoder output helps the decoder focus on the appropriate positions of the input sequence. Next comes the feed-forward layer. These steps are repeated until a special termination symbol is produced, which indicates that the Transformer decoder has finished its output. The output of each step is fed to the bottom decoder at the next time step, and the decoders propagate their decoding results upward just as the encoders did. In addition, just as we did for the encoder input, we embed the decoder inputs and add positional encoding to indicate the position of each word.
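As a sketch of how the mask operation can be realized (NumPy; the large-negative-number trick and the helper names are illustrative assumptions, not quoted from the paper):

```python
import numpy as np

def causal_mask(seq_len):
    # position i is only allowed to attend to positions j <= i (the already-generated words)
    return np.tril(np.ones((seq_len, seq_len)))

def apply_mask(scores):
    # replace the scores of "future" positions with a large negative number,
    # so that softmax assigns them (almost) zero attention weight
    mask = causal_mask(scores.shape[-1])
    return np.where(mask == 1, scores, -1e9)

scores = np.random.randn(4, 4)      # raw QK^T / sqrt(d_k) scores for 4 decoder positions
print(apply_mask(scores))           # the upper triangle (future positions) is masked out
```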

 

After decoding is complete, the decoder outputs a vector of real numbers, which is mapped by a simple fully connected neural network (the linear transformation layer) to a much larger vector called the logits vector. Suppose the model has learned 10,000 words from the training set; then the logits vector is 10,000 cells long, each cell holding the score of one word. The next layer, the softmax, turns those scores into probabilities (all positive, summing to 1.0). The cell with the highest probability is selected, and the word corresponding to it is output for this time step.
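A minimal sketch of this final linear + softmax step (NumPy; the 10,000-word vocabulary comes from the example above, and the random weights stand in for the learned linear layer):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab_size, d_model = 10_000, 512
W_out = np.random.randn(d_model, vocab_size) * 0.01   # the linear transformation layer (learned)

decoder_output = np.random.randn(d_model)             # real-valued vector produced by the decoder
logits = decoder_output @ W_out                       # one score per word in the vocabulary
probs = softmax(logits)                               # all positive, summing to 1.0
predicted_word_id = int(np.argmax(probs))             # the word emitted at this time step
```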

 

 

References:

Attention Is All You Need (https://arxiv.org/abs/1706.03762)

BERT is all the rage, but still don't know the Transformer? Reading this one article is enough

The Transformer: positional encoding

 

 

