Translation: A Detailed Illustration of the Transformer's Multi-Head Self-Attention Mechanism (Attention Is All You Need)

1. Introduction

The Transformer is a model that uses attention to boost the speed at which neural machine translation models can be trained. The Transformer outperforms the Google Neural Machine Translation model on certain tasks, but its biggest benefit is how well it lends itself to parallelization. In fact, Google Cloud recommends using the Transformer as a reference model for their Cloud TPU offering. So let's take the model apart and see how it works.

The Transformer was proposed in the paper Attention Is All You Need. Its TensorFlow implementation is available as part of the Tensor2Tensor package, and the NLP group at Harvard University has created a guide annotating the paper with a PyTorch implementation. In this article we will try to simplify things a bit and introduce the concepts one by one, to hopefully make them easier to understand for readers without deep knowledge of the subject.


The concepts of Query, Key, and Value come from information retrieval systems; think of a simple search. When you search for a product on an e-commerce site (say, "a thin red down jacket for young women in winter"), the text you type into the search box is the Query. The search engine matches Keys for you based on that Query (such as the product's type, color, and description), and then returns the matching content (the Values) according to the similarity between the Query and each Key.

Q, K, and V in self-attention play similar roles. In matrix terms, the dot product is one way to measure the similarity between two sets of vectors, so the $QK^T$ term in Equation 1 computes these similarities. The output is then assembled by weighted matching, where the weights are the similarities between the Query and each Key.
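To make the weighted-matching idea concrete, here is a minimal NumPy sketch of dot-product similarity followed by softmax weighting; the query, keys, and values below are tiny made-up vectors rather than real embeddings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

query = np.array([1.0, 0.5])                 # what we are looking for
keys = np.array([[1.0, 0.4],                 # "descriptions" of each candidate item
                 [0.1, 0.9],
                 [0.9, 0.6]])
values = np.array([[10.0], [20.0], [30.0]])  # the content attached to each key

scores = keys @ query                        # dot products measure Query-Key similarity
weights = softmax(scores)                    # similarities become weights that sum to 1
output = weights @ values                    # similarity-weighted mix of the values
```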

Below is an example of encoding and decoding for translating the sentence beginning "I arrived at the..." from English to French:

2. Self-Attention and Transformer

The attention mechanism [2] was proposed by Bengio's team in 2014 and has since been widely used across deep learning, for example to capture receptive fields on images in computer vision, or to locate key tokens or features in NLP. The BERT [3] algorithm recently proposed by the Google team for generating word vectors achieved significant improvements on 11 NLP tasks, arguably the most exciting news in deep learning in 2018. The most important building block of BERT is the Transformer proposed in this paper.

As the title of the paper suggests, the Transformer abandons the traditional CNN and RNN: the entire network structure is composed entirely of attention mechanisms. More precisely, the Transformer consists of nothing but self-attention and feed-forward neural networks. A trainable network can be built simply by stacking Transformer layers; the authors' experiment builds an encoder-decoder with six encoder layers and six decoder layers, twelve layers in total, and achieves new state-of-the-art BLEU scores in machine translation.

The reason the authors adopt the attention mechanism is that the computation of an RNN (or LSTM, GRU, etc.) is inherently sequential: RNN-style algorithms can only process a sequence from left to right or from right to left. This brings two problems:

  1. The computation at time step t depends on the result at time step t-1, which limits the model's ability to parallelize;
  2. Information is lost during sequential computation. Gating structures such as the LSTM alleviate the long-range dependency problem to some extent, but the LSTM is still powerless against particularly long dependencies.

The Transformer solves both problems. First, it uses the attention mechanism to reduce the distance between any two positions in the sequence to a constant; second, it is not a sequential structure like an RNN, so it parallelizes better and fits existing GPU frameworks. The definition given in the paper is: the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

3. Transformer simplified architecture diagram

Let's start by treating the model as a black box. In a machine translation application, it would input a sentence in one language and output its translation in another language. Example: French to English translation.
Opening up this black box, we see an encoding component (Encoders), a decoding component (Decoders), and the connections between them.
The encoding component is a stack of encoders (the paper stacks six of them; there is nothing magical about the number six, and you can certainly experiment with other arrangements). The decoding component is a stack of the same number of decoders.
Encoders are all identical in structure (but they do not share weights). Each is divided into two sublayers:


  • The input to the encoder first flows through a self-attention layer, a layer that helps the encoder look at other words in the input sentence while encoding a specific word. We'll take a closer look at self-attention later in this article.

  • The output of the self-attention layer is fed to a feed-forward neural network. The exact same feed-forward network is applied to each position independently (a minimal sketch of this position-wise network follows below).
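As a rough illustration of "the same network applied to each position independently", here is a minimal PyTorch sketch of the position-wise feed-forward layer, assuming the dimensions from the paper (d_model = 512, d_ff = 2048); the class and variable names are ours.

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Two linear layers with a ReLU in between, applied to every position with the same weights."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        return self.fc2(torch.relu(self.fc1(x)))  # each position is transformed independently

# usage: a batch of one sentence with three positions
ffn = PositionWiseFFN()
out = ffn(torch.randn(1, 3, 512))                 # out.shape == (1, 3, 512)
```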

The decoder has both of these layers, but between them sits an additional attention layer, Encoder-Decoder Attention, which helps the decoder focus on relevant parts of the input sentence (similar to the attention in seq2seq models).


4. Input Encoding

Now that we understand the main components of the model, let's look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into its output.

As in NLP applications generally, we begin by converting each input word into a vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We'll represent these vectors with these simple boxes.

Embedding only happens in the bottom-most encoder. The abstraction common to all encoders is that they receive a list of vectors, each of size 512: in the bottom encoder that is the word embeddings, while in the others it is the output of the encoder directly below. The size of this list is a hyperparameter we can set; basically, it would be the length of the longest sentence in our training dataset.
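A minimal PyTorch sketch of this embedding step, assuming a toy vocabulary size and the 512-dimensional embeddings mentioned above (the token ids are made up for illustration):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512          # assumed vocabulary size; 512 as in the text
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 42, 7]])    # a 3-word sentence as (batch=1, seq_len=3) ids
x = embedding(token_ids)                  # x.shape == (1, 3, 512): one 512-dim vector per word
```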

After embedding words in our input sequence, each of them flows through each of the two layers of the encoder.
Here we begin to see a key property of the Transformer: the word at each position flows through its own path in the encoder. In the self-attention layer there are dependencies between these paths, but the feed-forward layer has no such dependencies, so the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we'll switch to a shorter example sentence and look at what happens in each sublayer of the encoder.

4.1 Now We're Encoding!

As we already mentioned, an encoder receives a list of vectors as input. It processes this list by passing these vectors to a "Self-Attention" layer, then to a Feed Forward Neural Network, which then sends the output up to the next encoder.
Words at each position go through a self-attention process. Then, they each pass through a feed-forward neural network — the exact same network, with each vector flowing through it separately.

5. The Self-Attention Mechanism

Suppose the following sentence is the input sentence we want to translate:

"The animal didn't cross the street because it was too tired"

What does the "it" in this sentence refer to? Does it refer to street or animal? This is a simple problem for humans, but not so simple for algorithms.

When the model processes the word "it", self-attention allows it to associate "it" with "animal".

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that help encode that word better.

If you're familiar with RNNs, consider how maintaining a hidden state allows the RNN to combine representations of previous words/vectors it has processed with the current word/vector it is processing. Self-attention is what the Transformer uses to incorporate "understanding" of other related words into the word we are currently processing.
When we encode the word "it" in Encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The Animal" (the connections to it carry relatively large weights) and bakes a part of its representation into the encoding of "it".

Be sure to check out the Tensor2Tensor jupyter notebook , where you can load a Transformer model and inspect it with this interactive visualization.

5.1 Self-attention details

Let's first see how to compute self-attention using vectors, and then move on to how it's actually implemented - using matrices.

The first step in computing self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word we create a query vector Q, a key vector K, and a value vector V. These vectors are created by multiplying the embedding by three matrices that are learned during training.

Note that these new vectors Q, K, V are smaller than the embedding vectors: their dimension is 64, compared with 512 for the embedding and the encoder input/output vectors. They don't have to be smaller; this is an architectural choice that keeps the computation of multi-head attention (mostly) constant.
Multiplying $x_1$ by the $W^Q$ weight matrix produces $q_1$, the "query" vector associated with that word. We end up creating a "query" (Q), a "key" (K), and a "value" (V) projection for each word in the input sentence.

5.2 What are the "query Q", "key K" and "value V" vectors?

They are abstractions for computing and thinking about attention. Once you read on how attention is calculated below, you'll know pretty much everything you need to know about the role each vector plays.

The second step in computing self-attention is to compute a score. Suppose we are computing self-attention for the first word "Thinking" in this example. We need to score each word of the input sentence against this word. When we encode a word at a certain position, the score determines how much attention is paid to other parts of the input sentence.

Scores are computed by taking the dot product of the query vector with the key vector of each word we are scoring. So if we are computing self-attention for the word in position #1, the first score is the dot product of $q_1$ and $k_1$, and the second score is the dot product of $q_1$ and $k_2$.
The third and fourth steps are to divide the scores by 8 (the square root of the key-vector dimension used in the paper, 64; this leads to more stable gradients. There could be other values here, but this is the default), and then pass the result through a softmax operation. Softmax normalizes the scores so they are all positive and sum to 1.

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position itself gets the highest softmax score, but sometimes it is useful to attend to another word that is relevant to the current one.

The fifth step is to multiply each value vector by its softmax score (in preparation for summing them). The intuition here is to keep the values of the words we want to focus on intact, and to drown out irrelevant words (for example, by multiplying them by a tiny number like 0.001).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
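Here is a minimal NumPy sketch of these six steps for a two-word sentence. The embeddings, the weight matrices, and the dimensions (4-dimensional embeddings and 3-dimensional q/k/v instead of 512 and 64) are toy values chosen only for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_k = 4, 3                      # toy sizes instead of 512 and 64

x1, x2 = rng.normal(size=d_model), rng.normal(size=d_model)         # word embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Step 1: create query/key/value vectors for each word
q1 = x1 @ W_Q
k1, k2 = x1 @ W_K, x2 @ W_K
v1, v2 = x1 @ W_V, x2 @ W_V

# Step 2: score word #1 against every word in the sentence
scores = np.array([q1 @ k1, q1 @ k2])

# Steps 3-4: scale by sqrt(d_k) (8 in the paper, where d_k = 64) and apply softmax
weights = softmax(scores / np.sqrt(d_k))

# Steps 5-6: weight the value vectors and sum them up
z1 = weights[0] * v1 + weights[1] * v2   # self-attention output for word #1
```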
That concludes the self-attention calculation. The resulting vector is one we can send on to the feed-forward neural network. In practice, however, this computation is done in matrix form for faster processing. Now that we've seen the intuition at the word level, let's look at the matrix form.

5.3 Matrix calculation of self-attention

The first step is to compute the query (Q), key (K), and value (V) matrices. We do this by packing the embeddings into a matrix X and multiplying it by the weight matrices we trained ($W^Q$, $W^K$, $W^V$).
Each row in the X matrix corresponds to a word in the input sentence. Again we see the difference in size between the embedding vectors (512, or 4 boxes in the figure) and the q/k/v vectors (64, or 3 boxes in the figure).

Finally, since we are dealing with matrices, we can condense steps two through six into a single formula for the output of the self-attention layer.
Self-attention calculation in matrix form
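Condensed into one formula, this matrix computation is Equation 1 from the paper, where $d_k = 64$ is the key-vector dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$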

5.4 Multi-Head Attention

The paper further refines the self-attention layer by adding a mechanism called "multi-head" attention.

This improves the performance of the attention layer in two ways:

  1. It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we were translating a sentence like "The animal didn't cross the street because it was too tired", it would be useful to know which word "it" refers to.

  2. It gives the attention layer multiple "representation subspaces". As we'll see next, with multi-head attention we have not just one but several sets of query/key/value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets). Each set is randomly initialized, and after training each set is used to project the input embeddings (or the vectors from lower encoders/decoders) into a different representation subspace.

With multi-head attention, we maintain separate Q/K/V weight matrices for each head, resulting in different Q/K/V matrices. As before, we multiply X by the $W^Q$/$W^K$/$W^V$ matrices to produce the Q/K/V matrices.

If we do the same self-attention calculation as above, but do it eight different times with different weight matrices, we end up with eight different Z matrices.
This presents us with a bit of a challenge. The feed-forward layer does not expect eight matrices; it expects a single matrix (one vector per word). So we need a way to condense these eight matrices into one.

How do we do this? We concatenate the matrices and then multiply them by an additional weight matrix $W^O$.
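A minimal NumPy sketch of this per-head attention plus the concatenate-and-project step, assuming eight heads and the dimensions used earlier; all variable names are ours and the weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 3, 512, 8
d_k = d_model // n_heads                          # 64 dimensions per head

X = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(n_heads, d_model, d_k))    # a separate projection per head
W_K = rng.normal(size=(n_heads, d_model, d_k))
W_V = rng.normal(size=(n_heads, d_model, d_k))
W_O = rng.normal(size=(n_heads * d_k, d_model))

# run scaled dot-product attention once per head, giving eight different Z matrices
heads = [attention(X @ W_Q[h], X @ W_K[h], X @ W_V[h]) for h in range(n_heads)]

# concatenate the heads and project with W_O to get one (seq_len, d_model) matrix
Z = np.concatenate(heads, axis=-1) @ W_O
```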
That's pretty much all there is to multi-head self-attention. It's quite a lot of matrices, I realize; let me try to put them all in one visual so we can look at them in one place.

Here is an example of where the different attention heads focus as we encode the word "it":
As we encode the word "it", one attention head pays most attention to "the animal" while another focuses on "tired". In a sense, the model's representation of the word "it" bakes in some of the representations of both "animal" and "tired".

But things can be harder to interpret if we add all the attention heads to the picture:

6. Using positional encodings to indicate sequence order

As we've described so far, one thing that's missing from the model is a way to account for the order of words in the input sequence.

To solve this problem, the Transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they are projected into Q/K/V vectors and during dot-product attention.
In order for the model to understand the order of the words, we add positional encoding vectors, whose values follow a specific pattern.

If we assume the dimensionality of the embedding to be 4, then the actual positional encoding would look like this:
Real example of positional encoding with embedding size 4

What might this pattern look like?

In the figure below, each row corresponds to the positional encoding of one vector. So the first row is the vector we would add to the embedding of the first word in the input sequence. Each row contains 512 values, each between 1 and -1. We've color-coded them so the pattern is visible.
A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the middle. That's because the values of the left half are generated by one function (which uses sine), and the right half by another function (which uses cosine). They are then concatenated to form each positional encoding vector.

The formulation of positional encoding is described in the paper (Section 3.5). You can see the code that generates the positional encoding in get_timing_signal_1d() . This is not the only possible way of position encoding. However, it has the advantage of being able to scale to unseen sequence lengths (for example, if the model we train is asked to translate sentences longer than any of the sentences in our training set).
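A minimal NumPy sketch of the sinusoidal encoding from section 3.5 of the paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$; this follows the paper's interleaved form rather than the concatenated Tensor2Tensor variant described below.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding, interleaving sine (even dims) and cosine (odd dims)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    dim = np.arange(0, d_model, 2)[None, :]        # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, dim / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=20, d_model=512)  # same 20 x 512 shape as the figure above
# the encoding is simply added to the word embeddings: x = embeddings + pe[:seq_len]
```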

The positional encodings shown above are from the Tensor2Tensor implementation of the Transformer. The method presented in the paper is slightly different: instead of directly concatenating the two signals, it interleaves them. The image below shows what that looks like; here's the code that generates it:

7. The Residuals

One detail of the encoder architecture we should mention before moving on is that each sublayer (self-attention, feed-forward) in each encoder has a residual connection around it, followed by a layer-normalization step.
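A minimal PyTorch sketch of this Add & Norm pattern around an arbitrary sublayer; `sublayer` stands in for either the self-attention or the feed-forward sublayer, and the wrapper class name is ours.

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection around a sublayer, followed by layer normalization."""
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))     # LayerNorm(x + Sublayer(x))

# usage with a stand-in sublayer (a single linear layer here, just for shape checking)
add_norm = AddAndNorm()
x = torch.randn(1, 3, 512)
out = add_norm(x, nn.Linear(512, 512))        # out.shape == (1, 3, 512)
```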
If we were to visualize the vector and layer-norm operations associated with self attention, it would look like this:

This also applies to the sublayers of the decoder. If we imagine a Transformer consisting of two stacked encoders and decoders, it would look like this:

8. The Decoder Side

Now that we've covered most of the concepts on the encoder side, we basically know how the components of the decoder work. But let's take a look at how they work together.

The encoder first processes the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are used by each decoder in its "Encoder-Decoder Attention" layer, which helps the decoder focus on the appropriate places in the input sequence. The following steps repeat the process until a special end-of-sequence symbol is reached, indicating that the Transformer decoder has completed its output. The output of each step is fed to the bottom decoder at the next time step, and the decoders bubble up their decoding results just as the encoders did. Just as we did for the encoder inputs, we embed and add positional encoding to these decoder inputs to indicate the position of each word.

The self-attention layer in the decoder operates slightly differently than in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
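A minimal NumPy sketch of this masking step for a four-position output sequence; the raw scores are random placeholders, and only the masking pattern matters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))   # raw attention scores

# mask out future positions: position i may only attend to positions <= i
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

weights = softmax(scores)   # after softmax, every future position receives exactly 0 weight
```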

The "Encoder-Decoder Attention" layer works just like multi-head self-attention, except that it creates its Queries matrix from the layer below it, and takes the Keys and Values matrices from the output of the encoder stack.
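A sketch of how this differs from self-attention, reusing the scaled dot-product form from earlier; `dec_x` and `enc_out` are placeholder tensors standing in for the decoder layer below and the encoder stack output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
dec_x = rng.normal(size=(2, d_model))       # decoder states so far (2 output positions)
enc_out = rng.normal(size=(5, d_model))     # output of the encoder stack (5 input positions)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q = dec_x @ W_Q                             # queries come from the decoder layer below
K, V = enc_out @ W_K, enc_out @ W_V         # keys and values come from the encoder output
Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V     # each output position attends over the whole input
```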

9. Final Linear and Softmax Layers

The decoder stack outputs a float vector. How do we turn this into a word? This is the job of the last linear layer followed by a Softmax layer.

The linear layer is a simple fully connected neural network that projects the vector produced by the decoder stack into a larger vector called the logits vector.

Suppose our model knows 10,000 unique English words (our model's "output vocabulary") learned from the training dataset. The logits vector is then 10,000 cells wide, with each cell holding the score of one unique word. That is how we interpret the output of the model after the linear layer.

A softmax layer then converts these scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is selected and the word associated with it is generated as output for that time step.
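A minimal PyTorch sketch of this projection-plus-softmax step, assuming the 10,000-word vocabulary from the example above; the decoder output is a random placeholder.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
linear = nn.Linear(d_model, vocab_size)     # projects the decoder output to logits

decoder_output = torch.randn(1, d_model)    # vector for the current time step
logits = linear(decoder_output)             # shape (1, 10000): one score per vocabulary word
probs = torch.softmax(logits, dim=-1)       # all positive, summing to 1.0
next_word_id = probs.argmax(dim=-1)         # choose the highest-probability word
```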

10. Recap of Training

Now that we've gone through the entire forward pass through a trained Transformer, it can be useful to look at the intuition for training a model.

During training, the untrained model goes through the exact same forward pass. But since we're training it on a labeled training dataset, we can compare its output to the actual correct output.

To visualize this, assume our output vocabulary contains only six words: "a", "am", "i", "thanks", "student", and "<eos>" (short for "end of sentence").
The model's output vocabulary is created during the preprocessing phase, before we even begin training.

Once we define our output vocabulary, we can use a vector of the same width to represent each word in our vocabulary. This is also known as one-hot encoding. So, for example, we can represent the word "am" using the following vector:
Example: one-hot encoding of our output vocabulary

After this review, let's discuss the model's loss function - the metric we are optimizing during the training phase to produce a trained and hopefully very accurate model.

10.1 Loss function

Suppose we are training our model. Assuming this is our first step in the training phase, we are training it on a simple example - translating "merci" to "thanks".

This means we want the output to be a probability distribution indicating the word "thanks". But since the model hasn't been trained yet, that is unlikely to happen just now.
Since the model's parameters (weights) are all randomly initialized, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output and then use backpropagation to adjust all the model's weights, bringing the output closer to the desired output.

How do you compare two probability distributions? We simply subtract one from the other. For more details, see cross-entropy and Kullback-Leibler divergence.
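A minimal NumPy sketch of the cross-entropy between the model's predicted distribution and the one-hot target, using the six-word toy vocabulary from above; the predicted probabilities are made up.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
target = np.array([0, 0, 0, 1, 0, 0], dtype=float)        # one-hot distribution for "thanks"
predicted = np.array([0.2, 0.1, 0.3, 0.25, 0.1, 0.05])    # an untrained model's guess

cross_entropy = -np.sum(target * np.log(predicted))       # lower is better
# here: -log(0.25) ≈ 1.386; a perfect prediction of "thanks" would give a loss of 0
```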

Note, however, that this is an oversimplified example. More realistically, we'll use sentences longer than a single word. For example, input: "je suis étudiant", expected output: "I am a student". What this really means is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our example, but more realistically a number such as 30,000 or 50,000)
  • The first probability distribution has the highest probability in the cell associated with the word "i"
  • The second probability distribution has the highest probability in the cell associated with the word "am"
  • And so on, until the fifth output distribution indicates the '<end of sentence>' symbol, which also has a cell from the 10,000-element vocabulary associated with it.
    The target probability distributions we will train the model against for one sample sentence in the training set.

After training the model for a sufficient amount of time on a sufficiently large dataset, we would like the resulting probability distribution to look like this:

Hopefully, after training, the model will output the correct translation we expect. Of course, this is no real indication of anything if the phrase was part of the training dataset (see: cross-validation). Note that every position gets a little bit of probability even if it's unlikely to be the output at that time step; this is a very useful property of softmax that helps the training process.

Now, because the model produces one output at a time, we can assume that the model selects the word with the highest probability from the distribution and discards the rest. That is one approach, called greedy decoding. Another approach is to hold on to, say, the top two words (for example "I" and "a"), then in the next step run the model twice: once assuming the first output position was the word "I", and once assuming it was the word "a". Whichever version produces less error considering both positions #1 and #2 is kept, and we repeat this for positions #2 and #3, and so on. This method is called "beam search"; in our example, beam_size was two (meaning that at any time, two partial hypotheses (unfinished translations) are kept in memory) and top_beams is also two (meaning we will return two translations). These are both hyperparameters you can experiment with.
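A minimal sketch of greedy decoding, assuming some trained `model(src_ids, out_ids)` callable that returns next-word probabilities; the function and the end-of-sentence id are hypothetical stand-ins, not an actual Tensor2Tensor API.

```python
import numpy as np

EOS_ID = 2      # hypothetical id of the end-of-sentence symbol
MAX_LEN = 50

def greedy_decode(model, src_ids):
    """At every step, keep only the highest-probability word and feed it back in."""
    out_ids = []
    for _ in range(MAX_LEN):
        probs = model(src_ids, out_ids)   # probability distribution over the output vocabulary
        next_id = int(np.argmax(probs))   # greedy choice: discard everything but the best word
        out_ids.append(next_id)
        if next_id == EOS_ID:             # stop once the model emits end-of-sentence
            break
    return out_ids
```

Beam search with beam_size = 2 would instead keep the two best partial hypotheses alive at every step and only commit to a full translation at the end.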

11. Further reading

I hope you find this a useful place to start breaking the ice with the main concepts of Transformer. If you want to go deeper, I recommend the following steps:
Watch the video of the original work: https://youtu.be/-QH8fRhqFHM

References

  • https://arxiv.org/abs/1706.03762
  • https://jalammar.github.io/illustrated-transformer/
  • https://zhuanlan.zhihu.com/p/48508221

Source: blog.csdn.net/zgpeace/article/details/126635650