[NLP] The Illustrated Transformer

1. Description

        In this post, we'll take a look at The Transformer, a model that uses attention to boost the speed with which neural machine translation models can be trained. The Transformer outperforms the Google Neural Machine Translation model on specific tasks. Its biggest benefit, however, comes from how well the Transformer lends itself to parallelization. In fact, Google Cloud recommends The Transformer as a reference model for their Cloud TPU product. So let's try to break the model apart and see how it works.

2. Appearance of the outermost layer

        Let's first consider the model as a single black box. In a machine translation application, it would take a sentence in one language and output a translation in another language.

        Popping open that hood, we see an encoding component, a decoding component, and connections between them.

        The encoding component is a stack of encoders (the paper stacks six of them; there's nothing magical about the number six, and one can definitely experiment with other arrangements). The decoding component is a stack of the same number of decoders.

        The encoders are all identical in structure (but they do not share weights). Each encoder is divided into two sublayers:

        The encoder's input first flows through a self-attention layer, a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We'll take a closer look at self-attention later in the post.

        The output of the self-attention layer is fed to a feed-forward neural network. The exact same feed-forward network is applied to each location independently.

        The decoder has both of these layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to attention in seq2seq models).

3. Bring the tensor into the picture

Now that we understand the main components of the model, let's start looking at the various vectors/tensors and how they flow between these components to transform the input to the output of the trained model.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.


Each word is embedded into a vector of size 512. We'll represent these vectors with these simple boxes.

Embedding only happens in the bottommost encoder. The common abstraction for all encoders is that they receive a list of vectors of size 512 - in the bottom encoder this will be the word embedding, but in the others it will be the output of the encoder directly below. The size of this list is a hyperparameter we can set - basically it's the length of the longest sentence in our training dataset.
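As a rough illustration of what "converting each word into a vector of size 512" amounts to, here is a minimal NumPy sketch of an embedding lookup. The toy vocabulary, the random table, and the names (embedding_table, d_model) are made up for the example; a real model learns this table during training.

```python
import numpy as np

# Minimal sketch of an embedding lookup (illustrative names and values only).
d_model = 512                                  # embedding size used in the paper
vocab = {"je": 0, "suis": 1, "etudiant": 2}    # toy vocabulary
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # stands in for learned embeddings

sentence = ["je", "suis", "etudiant"]
x = np.stack([embedding_table[vocab[w]] for w in sentence])  # one 512-d vector per word
print(x.shape)  # (3, 512)
```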

        After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.

        Here we begin to see one key property of the Transformer: the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, so the various paths can be executed in parallel while flowing through the feed-forward layer.

        Next, we'll switch to a shorter example sentence and look at what happens in each sublayer of the encoder.

4. Now we're encoding!

        As we already mentioned, the encoder receives a list of vectors as input. It processes this list by passing these vectors to a "self-attention" layer, then to a feed-forward neural network, which then sends the output up to the next encoder.


Words at each position go through a process of self-attention. Then, they each pass through a feed-forward neural network — the exact same network, with each vector flowing through it separately.

5. Self-attention at a high level

        Don't be fooled by me throwing around the term "self-attention" like it's a concept everyone should be familiar with. I personally never came across this concept until reading the "Attention Is All You Need" paper. Let's distill how it works.

        Suppose the following sentence is the input sentence we want to translate:

""The animal didn't cross the street because it was too tired

        What does the "it" in this sentence refer to? Does it refer to the street or the animals? This is a simple problem for humans, but not so simple for algorithms.

        When the model processes the word "it", self-attention allows it to associate "it" with "animal".

        As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that help encode that word better.

        If you're familiar with RNNs, consider how keeping the hidden state allows the RNN to merge the representation of the word/vector it processed previously with the word/vector it's currently processing. Self-attention is what Transformers use to bake "understanding" of other related words into the word we're currently working on.


When we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "animal" and bakes part of its representation into the encoding of "it" .

        Be sure to check out the Tensor2Tensor notebook, where you can load a Transformer model and inspect it with this interactive visualization.

6. Self-Attention to Details

        Let's first look at how to compute self-attention using vectors, and then move on to see how it's actually implemented using matrices.

        The first step in computing self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word we create a query vector, a key vector, and a value vector. These vectors are created by multiplying the embedding by three matrices that we train during the training process.

        Note that these new vectors have smaller dimensions than the embedding vectors. Their dimension is 64, compared to 512 for the embedding and encoder input/output vectors. They don't have to be smaller; this is an architectural choice that keeps the computation of multi-headed attention (mostly) constant.


Multiplying x1 by the WQ weight matrix yields q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
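Here is a minimal NumPy sketch of this first step. The names (WQ, WK, WV, x1, q1) follow the text, but the random values are only placeholders standing in for trained weight matrices.

```python
import numpy as np

# Sketch of step 1: project an embedding into query/key/value vectors.
d_model, d_k = 512, 64
rng = np.random.default_rng(1)
WQ = rng.normal(size=(d_model, d_k))   # placeholder for the trained query projection
WK = rng.normal(size=(d_model, d_k))   # placeholder for the trained key projection
WV = rng.normal(size=(d_model, d_k))   # placeholder for the trained value projection

x1 = rng.normal(size=(d_model,))       # embedding of the first word (placeholder)
q1, k1, v1 = x1 @ WQ, x1 @ WK, x1 @ WV
print(q1.shape, k1.shape, v1.shape)    # (64,) (64,) (64,)
```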


 

What are "query", "key" and "value" vectors?

They are abstractions that are useful for calculating and thinking about attention. Once you read on about how attention is calculated below, you'll know pretty much everything you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Suppose we're calculating the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a given position.

The score is calculated by taking the dot product of the query vector with the key vector of the word we're scoring. So if we're processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64; this leads to more stable gradients, and other values are possible, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. Clearly, the word at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by its softmax score (in preparation for summing them up). The intuition here is to keep intact the values of the words we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
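To make the six steps concrete, here is a small NumPy sketch of steps two through six for position #1 of a two-word sentence. All the vectors are random placeholders; in the real model they come from the trained Q/K/V projections described above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d_k = 64
rng = np.random.default_rng(2)
q1 = rng.normal(size=(d_k,))         # query for position #1 (placeholder)
keys = rng.normal(size=(2, d_k))     # k1, k2 for a two-word sentence (placeholders)
values = rng.normal(size=(2, d_k))   # v1, v2 (placeholders)

scores = keys @ q1                   # step 2: q1·k1, q1·k2
scores = scores / np.sqrt(d_k)       # step 3: divide by 8 (the square root of 64)
weights = softmax(scores)            # step 4: softmax
z1 = (weights[:, None] * values).sum(axis=0)   # steps 5 and 6: weight and sum the values
print(z1.shape)                      # (64,) -- the self-attention output for position #1
```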

        That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So now that we've seen the intuition of the calculation at the word level, let's look at that.

7. Matrix calculation of self-attention

        The first step is to compute the query, key, and value matrices. To do this, we pack the embeddings into a matrix X and multiply it by our trained weight matrices (WQ, WK, WV).


Each row in the X matrix corresponds to a word in the input sentence. Again we see the difference in size between the embedding vector (512, or 4 boxes in the figure) and the q/k/v vectors (64, or 3 boxes in the figure).

Finally, since we are dealing with matrices, we can condense steps two through six into one formula to calculate the outputs of the self-attention layer.


Self-Attention Computation in Matrix Form
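A minimal NumPy version of that condensed formula, softmax(QK^T / sqrt(d_k))V, might look like the sketch below; the function name self_attention and the random inputs are illustrative, not taken from any particular implementation.

```python
import numpy as np

def self_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the condensed matrix form of steps 2-6."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 512))                          # two words, embedding size 512
WQ, WK, WV = (rng.normal(size=(512, 64)) for _ in range(3))  # placeholder trained weights
Z = self_attention(X @ WQ, X @ WK, X @ WV)
print(Z.shape)  # (2, 64): one self-attention output row per input word
```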

8. The beast with many heads (multi-head attention)

        The paper further refines the self-attention layer by adding a mechanism called "multi-head" attention. This improves the performance of the attention layer in two ways:

  1. It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we're translating a sentence such as "The animal didn't cross the street because it was too tired", it would be useful to know which word "it" refers to.

  2. It gives the attention layer multiple "representation subspaces". As we'll see next, with multi-head attention we have not just one, but multiple sets of query/key/value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.


With multi-head attention, we maintain separate Q/K/V weight matrices for each head, resulting in different Q/K/V matrices. As before, we multiply X by the WQ/WK/WV matrices to produce the Q/K/V matrices.


If we do the same self-attention calculation outlined above, only with different weight matrices eight different times, we end up with eight different Z matrices

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices; it expects a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do this? We concatenate the matrices and multiply them by an additional weight matrix WO.
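A short NumPy sketch of that idea, under the same toy assumptions as the earlier sketches (eight heads, random stand-in weights, a small attention helper attn defined inline), could look like this:

```python
import numpy as np

def attn(Q, K, V):
    # softmax(QK^T / sqrt(d_k)) V, as in the matrix-form sketch above
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(4)
n_heads, d_model, d_k = 8, 512, 64
X = rng.normal(size=(2, d_model))                       # two words (placeholder input)

heads = []
for _ in range(n_heads):
    WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))  # per-head weights
    heads.append(attn(X @ WQ, X @ WK, X @ WV))          # each Z_i is (2, 64)

WO = rng.normal(size=(n_heads * d_k, d_model))          # additional output weight matrix
Z = np.concatenate(heads, axis=-1) @ WO                 # concat to (2, 512), then project
print(Z.shape)                                          # (2, 512): one row per word again
```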

That's pretty much all there is to multi-headed self-attention. I realize this is quite a handful of matrices. Let me try to put them all in one visual so we can look at them in one place.

Now that we have touched upon attention heads, let's revisit our example from before to see where the different attention heads are focusing as we encode the word "it" in our example sentence:


As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired". In a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".

But things can be harder to interpret if we add all the attention heads to the picture:

9. Representing the order of the sequence using positional encoding

        One thing missing from the model so far is a way to account for the order of words in the input sequence.

        To address this, the Transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they're projected into Q/K/V vectors and during dot-product attention.

To give the model a sense of the order of the words, we add positional encoding vectors, whose values follow a specific pattern.

        If we assume the dimensionality of the embedding is 4, the actual positional encoding looks like this:


A real example of positional encoding with a toy embedding of size 4

        What might this pattern look like?

        In the figure below, each row corresponds to a positional encoding vector. So the first row is the vector we'd add to the embedding of the first word in the input sequence. Each row contains 512 values, each with a value between 1 and -1. We've color-coded them so the pattern is visible.


A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the values of the right half are generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.

        The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It does, however, have the advantage of being able to scale to unseen sequence lengths (for example, if our trained model is asked to translate a sentence longer than any sentence in our training set).
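For illustration, here is a small NumPy sketch of sinusoidal positional encodings in the concatenated style described above (sine values in the left half, cosine values in the right half). It is an approximation of the idea, not the actual get_timing_signal_1d() code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: left half sine, right half cosine, concatenated."""
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    half = d_model // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))    # one frequency per channel
    angles = positions * freqs[None, :]                  # (max_len, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512); row i is added to the embedding at position i
```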

        Update, February 2020: The positional encoding shown above is from the Tensor2Tensor implementation of the Transformer. The method shown in the paper is slightly different in that it doesn't concatenate the two signals directly, but interleaves them. The figure below shows what that looks like. Here's the code to generate it:

10. The residuals

        One detail in the architecture of the encoder that we need to mention before continuing is that each sublayer (self-attention, ffnn) in each encoder has a residual connection around it, followed by a layer-normalization step.
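A minimal NumPy sketch of that "add and normalize" wrapping, with a placeholder standing in for the self-attention or feed-forward sublayer, might look like this (a simplified layer norm without the learned gain and bias parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each vector to zero mean and unit variance (no learned scale/shift here)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # residual connection around the sublayer, followed by layer normalization
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(5)
x = rng.normal(size=(2, 512))
out = add_and_norm(x, lambda v: v * 0.5)   # placeholder for self-attention or the ffnn
print(out.shape)                           # (2, 512)
```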

        If we were to visualize the vector and layer norm operations associated with self-attention, it would look like this:

        This goes for the sublayers of the decoder as well. If we were to think of a Transformer consisting of 2 stacked encoders and decoders, it would look like this:

11. Decoder side

        Now that we've covered most of the concepts on the encoder side, we also basically know how the components of the decoder work. But, let's take a look at how they work together.

        The encoder first processes the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. Each decoder uses these in its "encoder-decoder attention" layer, which helps the decoder focus on the appropriate position in the input sequence:

After finishing the encoding phase, we begin the decoding phase. Each step of the decoding phase outputs an element of the output sequence (the English translation sentence in this case).

The following steps repeat the process until a special symbol is reached indicating that the Transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. Just like we did with the encoder inputs, we embed and add positional encoding to these decoder inputs to indicate the position of each word.

The self-attention layer in the decoder works a little differently than the self-attention layer in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
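As a concrete illustration, here is a small NumPy sketch of such a look-ahead mask; the score matrix is a zero-valued placeholder, and -1e9 is used as a practical stand-in for -inf.

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))              # placeholder for QK^T / sqrt(d_k) scores
mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal mark future positions
masked = scores + mask * -1e9                      # effectively -inf for future positions

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
print(np.round(weights, 2))                        # row i attends only to positions <= i
```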

The "encoder-decoder attention" layer works similarly to multi-head self-attention, except that it creates its query matrix from the layer below it, and takes the key and value matrices from the output of the encoder stack.

12. The final linear and softmax layer

The decoder stack outputs a vector of floating point numbers. How do we turn this into a word? This is the job of the final linear layer, followed by the Softmax layer.

The linear layer is a simple fully connected neural network that projects the vector produced by the decoder stack into a larger vector called the logits vector.

Let's say our model knows 10,000 unique English words (our model's "output vocabulary") that it learned from its training dataset. This would make the logits vector 10,000 cells wide, with each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the linear layer.

A softmax layer then converts these scores into probabilities (all positive and add up to 1.0). The cell with the highest probability is selected and the word associated with it is generated as output for this time step.
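Put together, a minimal NumPy sketch of this final step (a linear projection to 10,000 logits, a softmax, and a greedy pick) might look like the following; the weight matrix and the decoder output are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
vocab_size, d_model = 10_000, 512
W_linear = rng.normal(size=(d_model, vocab_size))   # placeholder for the final linear layer

decoder_output = rng.normal(size=(d_model,))        # decoder stack output for this time step
logits = decoder_output @ W_linear                  # (10000,): one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax: positive, sums to 1.0
predicted_word_id = int(np.argmax(probs))           # index of the chosen output word
print(predicted_word_id, probs[predicted_word_id])
```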


The figure starts from the bottom with the vector produced as the output of the decoder stack. It is then converted into an output word.

13. Training review

        Now that we've covered the entire forward pass process with a trained transformer, it's useful to take a look at the intuition of training a model.

        During training, the untrained model will go through the exact same forward pass. But since we trained it on a labeled training dataset, we can compare its output with the actual correct output.

        To visualize this, let's assume our output vocabulary contains only six words ("a", "am", "i", "thanks", "student", and "<eos>" (short for "end of sentence")).


The output vocabulary of our model is created in a preprocessing phase before we start training.

Once we have defined the output vocabulary, we can use vectors of the same width to denote each word in the vocabulary. This is also known as one-hot encoding. For example, we can use the following vector to represent the word "am":


Example: One-Hot Encoding of Output Vocabulary
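For reference, a tiny NumPy sketch of this one-hot encoding over the six-word toy vocabulary could look like this:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # the toy output vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # a vector as wide as the vocabulary, with a 1 in the cell for this word
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(one_hot("am"))   # [0. 1. 0. 0. 0. 0.]
```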

After this review, let's discuss the model's loss function - the metric we optimize during the training phase to result in a trained and hopefully surprisingly accurate model.

14. Loss function

        Suppose we are training our model. Suppose it's our first step in the training phase, and we're training it on a simple example: translating "merci" into "thanks".

        This means we want the output to be a probability distribution indicating the word "thanks". But since the model hasn't been trained yet, that's unlikely to happen just yet.


Since the parameters (weights) of the model are all randomly initialized, the (untrained) model generates a probability distribution with arbitrary values ​​for each cell/word. We can compare this to the actual output and then use backpropagation to adjust all the weights of the model to bring the output closer to the desired output.

How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback-Leibler divergence.
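As a concrete example, here is a small NumPy sketch of cross-entropy between a one-hot target distribution and a model's predicted distribution for a single position; the predicted numbers are made up.

```python
import numpy as np

# Cross-entropy for one output position over the six-word toy vocabulary.
target = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])       # one-hot target for "thanks"
predicted = np.array([0.1, 0.1, 0.2, 0.3, 0.2, 0.1])     # an untrained model's guess

cross_entropy = -np.sum(target * np.log(predicted))
print(cross_entropy)   # gets smaller as the model puts more probability on "thanks"
```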

Note, however, that this is an oversimplified example. More realistically, we'll use a sentence longer than a single word. For example, input: "je suis étudiant" and expected output: "I am a student". What this really means is that we want our model to successively output probability distributions where:

  • Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistic numbers are 30,000 or 50,000)
  • The first probability distribution has the highest probability at the cell associated with the word "i"
  • The second probability distribution has the highest probability at the cell associated with the word "am"
  • And so on, until the fifth output distribution indicates the "<end of sentence>" symbol, which also has a cell associated with it in the 10,000-element vocabulary.


The target probability distributions we'll train our model against in the training example for one sample sentence.

        After training a model on a large enough dataset for a long enough time, we would like the resulting probability distribution to look like this:


Hopefully, after training, the model would output the correct translation we expect. Of course, this is no real indication of whether the phrase was part of the training dataset (see: cross-validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output at that time step; that's a very useful property of softmax which helps the training process.

        Now, because the model produces its outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That's one way to do it (called greedy decoding). Another way would be to hold on to, say, the top two words (for example, "I" and "a"), then in the next step run the model twice: once assuming the first output position was the word "I", and once assuming it was the word "a". Whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3, and so on. This method is called "beam search"; in our example, beam_size was two (meaning that at all times, two partial hypotheses (unfinished translations) are kept in memory) and top_beams was also two (meaning we'll return two translations). These are both hyperparameters that you can experiment with.
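To make the greedy variant concrete, here is a minimal sketch of a greedy decoding loop over the toy vocabulary. The next_word_probs() function is a made-up placeholder that returns random probabilities; a real system would call the Transformer decoder, linear layer, and softmax there. Beam search would instead keep the top beam_size partial sequences at each step rather than only the single best word.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
rng = np.random.default_rng(7)

def next_word_probs(decoded_so_far):
    # Placeholder: stands in for running the decoder + linear layer + softmax.
    logits = rng.normal(size=len(vocab))
    p = np.exp(logits - logits.max())
    return p / p.sum()

decoded = []
for _ in range(10):                             # cap the output length
    probs = next_word_probs(decoded)
    word = vocab[int(np.argmax(probs))]         # greedy: keep only the most probable word
    decoded.append(word)
    if word == "<eos>":                         # stop at the end-of-sentence symbol
        break
print(decoded)
```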

15. Go forth and transform

        I hope you've found this a useful place to start breaking down the main concepts of Transformers. If you want to go deeper, I suggest these next steps:

References and follow-up work:


Origin blog.csdn.net/gongdiwudu/article/details/131896915