Transformer Summary (self-attention, multi-head attention)

Paper: "Attention Is All You Need": https://arxiv.org/abs/1706.03762

Note: This article is just a brief summary of key points, written for my own future review. For details, please refer to: http://t.csdn.cn/dz2TH

Transformer

advantages:

It addresses the main shortcoming of RNNs, slow training: the self-attention mechanism allows computation to be parallelized, making training fast.

It can be stacked to a great depth, fully exploiting the capacity of deep neural network models and improving model accuracy.

Overall architecture diagram:

 

All encoders are structurally identical, but they share no parameters. The decoder also contains an attention layer that focuses on the relevant parts of the input sentence (similar to the attention mechanism in seq2seq models).

Encoder process:

1. The input first goes through a self-attention layer.

This layer helps the encoder attend to the other words in the input sentence while encoding each word.

2. The output of the self-attention layer is passed to a feed-forward neural network.

The feed-forward neural network applied at each position is exactly the same (Note: another interpretation is a one-dimensional convolutional neural network whose window is a single word).
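As a rough sketch of this point (not the original implementation), applying the same feed-forward network at every position amounts to one matrix operation over the whole sequence. The sizes d_model = 512 and d_ff = 2048 are the values from the paper; the variable names, random weights, and sequence length are just for illustration.

```python
import numpy as np

d_model, d_ff, seq_len = 512, 2048, 10            # sizes from the paper; seq_len is arbitrary

# One set of feed-forward weights, shared by every position in the sequence
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

def position_wise_ffn(x):
    """Apply the same two-layer network to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between the two linear layers

x = np.random.randn(seq_len, d_model)             # e.g. the output of the self-attention layer
print(position_wise_ffn(x).shape)                 # (10, 512): one output vector per position
```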

self-attention

Self-attention uses three vectors: the query vector (Query), the key vector (Key), and the value vector (Value).

calculation steps:

1. Generate three vectors QKV

For each input Xi, three vectors are created by multiplying its word embedding with three weight matrices: a query vector, a key vector, and a value vector.

2. Calculate the score

A self-attention vector is computed for each input Xi by letting every other input X score the current Xi. These scores determine how much attention is paid to the other parts of the input while encoding Xi.

3. Scale for stable gradients

Divide each score by 8 (the default value), which is the square root of the key-vector dimension (64) used in the paper. This makes the gradients more stable; other values could also be used.

4. Normalization

The results are passed through a softmax, which normalizes the scores across all X so that they are all positive and sum to 1.

5. Each value vector is multiplied by the softmax score

This prepares for the later summation: it keeps the values of semantically related words intact and drowns out irrelevant ones (by multiplying them by very small numbers).

6. Sum the weighted value vectors

Meaning: when encoding a given Xi, the representations (value vectors) of all inputs are weighted and summed, where each weight is obtained from the dot product of that input's representation (key vector) with the representation of the Xi being encoded (query vector), passed through the softmax. The result is the output of the self-attention layer at this position.

(Figure: generating the three vectors Q, K, V from the input)

(Figure: the self-attention calculation)
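The six steps above can be condensed into a few lines of matrix arithmetic. The sketch below is a minimal NumPy illustration, not the original code: d_model = 512 and d_k = 64 follow the paper, while the sequence length, random weights, and names are placeholders.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k, seq_len = 512, 64, 5

# Step 1: weight matrices that turn each embedding into Q, K, V
WQ = np.random.randn(d_model, d_k) * 0.02
WK = np.random.randn(d_model, d_k) * 0.02
WV = np.random.randn(d_model, d_k) * 0.02

X = np.random.randn(seq_len, d_model)     # word embeddings for the whole sequence
Q, K, V = X @ WQ, X @ WK, X @ WV

# Steps 2-3: score every pair of positions, then divide by sqrt(d_k) = 8
scores = Q @ K.T / np.sqrt(d_k)           # shape (seq_len, seq_len)

# Step 4: softmax so each row is positive and sums to 1
weights = softmax(scores)

# Steps 5-6: weight the value vectors and sum them
Z = weights @ V                           # output of the self-attention layer
print(Z.shape)                            # (5, 64)
```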

multi-head attention

The multi-head attention mechanism further improves the self-attention layer.

advantages:

1. It expands the model's ability to focus on different positions.

In the structure above, every encoding contributes more or less to z1, but z1 may be dominated by the actual word itself. If we were translating a sentence such as "The animal didn't cross the street because it was too tired", we would want to know which word "it" refers to, and this is where the model's multi-head attention mechanism comes into play.

2. It gives the attention layer multiple "representation subspaces".

With multi-head attention there are multiple sets of query/key/value weight matrices (the Transformer uses eight attention heads, so each encoder/decoder has eight sets of matrices). Each set is randomly initialized, and after training each set is used to project the input word embeddings (or the vectors from lower encoders/decoders) into a different representation subspace.

Challenges and approaches:

Running the same self-attention calculation as above with eight different sets of weight matrices yields eight different Z matrices. However, the feed-forward layer does not expect eight matrices; it expects a single matrix (one representation vector per word). Therefore the eight matrices have to be compressed into one: they are simply concatenated and then multiplied by an additional weight matrix WO.
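A minimal sketch of this "run eight heads, concatenate, multiply by WO" step, again in NumPy and not the original implementation; the eight heads and the d_k = 64 / d_model = 512 sizes follow the paper, while the helper names and random weights are my own.

```python
import numpy as np

num_heads, d_k, d_model, seq_len = 8, 64, 512, 5

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, WQ, WK, WV):
    """One head: the same scaled dot-product attention as above."""
    Q, K, V = X @ WQ, X @ WK, X @ WV
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V      # one Z matrix, shape (seq_len, d_k)

X = np.random.randn(seq_len, d_model)

# Eight independent, randomly initialized sets of Q/K/V weight matrices
heads = [tuple(np.random.randn(d_model, d_k) * 0.02 for _ in range(3))
         for _ in range(num_heads)]

# Run every head, then stitch the eight Z matrices together side by side
Z_concat = np.concatenate([attention_head(X, *w) for w in heads], axis=-1)  # (seq_len, 512)

# Project back to d_model with the extra weight matrix WO
WO = np.random.randn(num_heads * d_k, d_model) * 0.02
Z = Z_concat @ WO                                   # the single matrix fed to the feed-forward layer
print(Z.shape)                                      # (5, 512)
```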

positional encoding

The sequence order is indicated using positional encoding.

The Transformer adds a vector to each input Xi embedding. This helps the model determine the position of each word, and the distance between different words, in the sequence. Adding position vectors to the Xi embeddings lets the model better express word-to-word distances in its computations.

Each row corresponds to the positional encoding of one word vector, i.e. the first row corresponds to the first word of the input sequence. Each row contains 512 values, each between -1 and 1.

Example of positional encodings for 20 words (rows) with word-embedding size 512 (columns). You can see the matrix split in half down the middle: the values in the left half are generated by one function (sine) and those in the right half by another (cosine). Concatenating them yields each positional-encoding vector.

An advantage of this approach: it can be extended to sequence lengths that were never seen during training.
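A small sketch of the encoding described above (sine values in the left half of each row, cosine values in the right half, all between -1 and 1). The sinusoidal formula follows the paper; the column layout mirrors the figure description, and the function name is made up for the example.

```python
import numpy as np

def positional_encoding(num_positions, d_model=512):
    """Sine values fill the left half of each row, cosine values the right half."""
    half = d_model // 2
    positions = np.arange(num_positions)[:, None]          # (num_positions, 1)
    freqs = 1.0 / (10000 ** (np.arange(half) / half))      # one frequency per column pair
    angles = positions * freqs                             # (num_positions, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

pe = positional_encoding(20)                      # 20 words (rows) x 512 values (columns)
print(pe.shape, pe.min() >= -1, pe.max() <= 1)    # (20, 512) True True

# The position vector is added element-wise to the word embeddings before the first encoder layer:
# x = word_embeddings + pe[:seq_len]
```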

residual connections

Each sub-layer (self-attention, feed-forward network) in every encoder is wrapped in a residual connection and followed by a layer-normalization step.

Layer normalization: https://arxiv.org/abs/1607.06450
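A minimal sketch of "residual connection plus layer normalization" around a sub-layer, where `sublayer` stands for either the self-attention layer or the feed-forward network; the epsilon value and function names are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

# Usage inside one encoder layer (with self_attention and feed_forward defined as above):
# x = add_and_norm(x, self_attention)
# x = add_and_norm(x, feed_forward)
```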


Original post: blog.csdn.net/qq_41750911/article/details/124189983