[Transformers 01] Everything you need to know about attention and Transformers

Figure 1. Source: Photo by Arseny Togulev on Unsplash

1. Description

        This is a long post that covers pretty much everything you need to know about attention mechanisms: self-attention, queries, keys and values, multi-head attention, masked multi-head attention, and Transformers, along with some details on BERT and GPT. I have therefore divided the article into two parts. In this part, I will introduce all the attention blocks; in the next story, I will dive into the Transformer network architecture.

2. Summary of RNN background knowledge

2.1 Introduction
2.2 Challenges of RNNs and how transformer models can help overcome them

3. Attention mechanism

3.1 Self-attention
3.2 Queries, keys and values
3.3 Neural Network Representation of Attention
3.4 Multi-head attention

4. Transformers (continued in part two)

2.1 Introduction

        Attention mechanisms were first used in computer vision, around 2014, to try to understand what a neural network is looking at while making a prediction. This was one of the first steps toward interpreting the outputs of Convolutional Neural Networks (CNNs). In 2015, attention was first used in natural language processing (NLP), for alignment in machine translation. Finally, in 2017, attention mechanisms were used for language modeling in Transformer networks. Transformers have since surpassed the prediction accuracy of recurrent neural networks (RNNs) and become the state of the art for NLP tasks.

2.2. Challenges of RNNs and how transformer models can help overcome them

        RNN problem 1: Long-term dependencies. RNNs struggle to relate tokens that are far apart in a sequence, so they are not well suited to long text documents.

        Transformer solution: Transformer networks rely almost exclusively on attention blocks. Attention creates connections between any two positions in the sequence, so long-term dependencies are no longer an issue: for a Transformer, a long-range dependency is just as easy to capture as a short-range one.

        RNN problem 2: Vanishing and exploding gradients.

        Transformer solution: Little to no vanishing or exploding gradients. In a Transformer network, the whole sequence is processed simultaneously and only a few layers are stacked on top of it, so vanishing or exploding gradients are rarely a problem.

        RNN problem 3: RNNs need more training steps to reach a local/global minimum. An RNN can be visualized as a very deep unrolled network whose size depends on the length of the sequence. This produces many parameters, most of which are interrelated, so optimization requires a longer training time and many more steps.

        Transformer solution: Transformers require fewer training steps than RNNs.

        RNN problem 4: RNNs do not allow parallel computation. GPUs are built for parallel computing, but an RNN is a sequential model: all computations in the network happen one after another and cannot be parallelized.

        Transformer solution: There is no recurrence in a Transformer network, so each step can be computed in parallel.

3. Attention mechanism

3.1 Self-attention

Figure 2. Example explaining self-attention (Source: Image created by the author)

        Consider this sentence: "Bark is very cute and he is a dog". This sentence has 9 words or tokens. If we consider only the word "he" in the sentence, we see that "and" and "is" are the two words closest to it. But these words do not give the word "he" any context. Instead, the words "Bark" and "dog" are far more related to "he" in the sentence. From this, we learn that proximity is not always relevant; context matters more in a sentence.

        When this sentence is fed to a computer, it treats each word as a token t, and each token has a word embedding V. But these word embeddings have no context. So the idea is to apply some kind of weighting or similarity to obtain a final word embedding Y that has more context than the initial embedding V.

        In the embedding space, similar words appear closer or have similar embeddings. For example, the word "king" will be more related to the words "queen" and "royalty" than to the word "zebra". Likewise, "zebra" has more to do with "horse" and "stripes" than with the word "emotion." To learn more about embedding spaces, visit this video by Andrew Ng ( NLP and word embeddings ).

        So, intuitively, if the word "king" appears at the beginning of a sentence and the word "queen" appears at the end of it, they should provide each other with better context. We use this idea to find weight vectors W by multiplying (taking the dot product of) word embeddings, in order to gain more context. So, in the sentence "Bark is very cute and he is a dog", instead of using the word embeddings as they are, we multiply the embedding of each word with the embeddings of all the others. Figure 3 should illustrate this better.

Figure 3. Finding the weights and getting the final embedding (Source: Image created by the author)

        As shown in Figure 3, we first find the weights by multiplying (dot-product) the initial embedding of the first word with the embeddings of all other words in the sentence. These weights (W11 to W19) are also normalized to sum to 1. Next, these weights are multiplied by the initial embeddings of all the words in the sentence.

        Y1 = W11 · V1 + W12 · V2 + ... + W19 · V9

        W11 to W19 are all weights computed with respect to the first word V1. So when we multiply these weights with the embeddings of the words, we are effectively re-weighting all the other words toward the first word. In a sense, the word "Bark" now leans more toward the words "dog" and "cute" than toward the words that simply come right before or after it. This provides some context.

        Repeat this for all the words so that each word gets some context from the other words in the sentence.

Figure 4. Graphical representation of the above steps (Source: Image created by the author)

        Figure 4 gives a graphical view of the above steps for obtaining Y1.

        Interestingly, no weights are trained here, and the order or proximity of the words has no influence on each other. Furthermore, the process does not depend on the length of the sentence; it does not matter whether the sentence has more or fewer words. This way of adding some context to the words of a sentence is known as self-attention.
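        To make this concrete, below is a minimal sketch in Python/NumPy of the untrained self-attention just described. The 9 toy embeddings and the embedding size are made up purely for illustration, and a softmax is used here as one way to normalize the weights so they sum to 1.

```python
import numpy as np

np.random.seed(0)

# Toy setup: 9 tokens ("Bark is very cute and he is a dog"), embedding size k = 4.
# In practice these embeddings would come from a trained embedding layer.
V = np.random.randn(9, 4)          # initial word embeddings, one row per token

# Step 1: raw weights = dot product of every embedding with every other embedding.
W = V @ V.T                        # shape (9, 9); W[i, j] = similarity of word i and word j

# Step 2: normalize each row so the weights sum to 1 (here with a softmax).
W = np.exp(W) / np.exp(W).sum(axis=1, keepdims=True)

# Step 3: re-weight the embeddings, e.g. Y1 = W11*V1 + W12*V2 + ... + W19*V9.
Y = W @ V                          # shape (9, 4): one context-aware embedding per word

print(Y.shape)                     # (9, 4)
```

        Note that nothing here is learned: the result depends only on the initial embeddings, which is exactly the limitation addressed next.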

3.2 Queries, keys and values

        The problem with this form of self-attention is that nothing is trained. But perhaps if we add some trainable parameters, the network can learn patterns that provide better context. These trainable parameters can be matrices whose values are learned. This is where the concepts of query, key and value come in.

        Let's consider again the previous sentence: "Bark is very cute and he is a dog". In Figure 4 on self-attention, we saw that the initial word embeddings (V) are used 3 times: first and second, as the dot product between the embedding of the first word and the embeddings of all the other words in the sentence (including itself) to obtain the weights, and a third time when those weights are multiplied with the embeddings again to get the final embedding with context. These 3 occurrences of V can be replaced by the three terms query, key and value.

        Suppose we want to make all the words similar with respect to the first word V1. Then we send V1 as the query. This query then takes a dot product with all the words of the sentence (V1 to V9); these are the keys. The combination of the query and the keys gives us the weights. These weights are then multiplied with all the words (V1 to V9), which act as the values. There we have it: query, key and value. If you still have some doubts, Figure 5 should clear them up.

Figure 5. Representing queries, keys and values ​​(Source: Image created by the author)

        But wait, we haven't added any trainable matrices yet. That is quite simple. We know that if a vector of shape 1 x k is multiplied by a matrix of shape k x k, we get a vector of shape 1 x k as output. Keeping this in mind, let's multiply each of the keys from V1 to V9 (each of shape 1 x k) by a matrix Mk (the key matrix) of shape k x k. Similarly, the query vector is multiplied by a matrix Mq (the query matrix), and the value vectors are multiplied by a value matrix Mv. All the values in these matrices Mk, Mq and Mv can now be trained by the neural network, and they provide better context than self-attention alone. Again, for better understanding, Figure 6 shows a graphical representation of what was just explained.

Figure 6. Key Matrix, Query Matrix, and Value Matrix (Source: Image created by the author)
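        As a rough sketch of the idea in Figure 6, the snippet below projects the embeddings with three matrices Mq, Mk and Mv before running the same self-attention computation as before. The random matrices are only placeholders for values that a network would actually learn during training.

```python
import numpy as np

np.random.seed(0)
k = 4                               # embedding dimension
V = np.random.randn(9, k)           # initial embeddings of the 9 words

# Trainable matrices (random placeholders here; a network would learn them).
Mq = np.random.randn(k, k)          # query matrix
Mk = np.random.randn(k, k)          # key matrix
Mv = np.random.randn(k, k)          # value matrix

# Each 1 x k embedding times a k x k matrix gives a new 1 x k vector.
queries = V @ Mq                    # shape (9, k)
keys    = V @ Mk                    # shape (9, k)
values  = V @ Mv                    # shape (9, k)

# The rest proceeds as in self-attention, but with the projected vectors.
weights = np.exp(queries @ keys.T)
weights = weights / weights.sum(axis=1, keepdims=True)
Y = weights @ values                # context-aware embeddings, shape (9, k)
```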

        Now that we have the intuition behind keys, queries and values, let's look at the formal steps and formulas behind attention, starting with an analogy to retrieving values from a database.

        Let's try to understand the attention mechanism through the example of a database. In a database, if we want to retrieve a certain value vi based on a query q and a key ki, there are operations that use the query to identify the key corresponding to that value. Attention can be thought of as a similar process, but in a more probabilistic manner, as the figure below demonstrates.

        Figure 7 shows the steps for retrieving data from a database. Suppose we send a query to the database; some operation figures out which key in the database is most similar to the query. Once that key is found, the value corresponding to it is returned as the output. In the figure, the operation finds that the query is most similar to key 5, so it gives us value 5 as the output.

Figure 7. The value retrieval process in the database (source: image created by the author)

        An attention mechanism is a neural architecture that mimics this retrieval process.

  1. The attention mechanism measures the similarity between the query q and each key ki.
  2. This similarity returns a weight for each key (and hence for its value).
  3. Finally, it produces an output that is a weighted combination of all the values in our database.

        In a sense, the only difference between database retrieval and attention is that in database retrieval we get only a single value as output, whereas in attention we get a weighted combination of values. For example, if the query is most similar to key 1 and key 4, then these two keys get the highest weights, and the output is a combination of value 1 and value 4.
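        Here is a tiny sketch of the contrast between a hard database lookup and the "soft", probabilistic lookup that attention performs; the keys, values and query below are made-up numbers purely for illustration.

```python
import numpy as np

keys   = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])   # 3 keys
values = np.array([10.0, 20.0, 30.0])                      # corresponding values
query  = np.array([0.9, 0.1])

# Hard retrieval: pick the single value whose key best matches the query.
best = np.argmax(keys @ query)
hard_result = values[best]                                  # -> 10.0

# Attention-style retrieval: weight every value by how well its key matches.
scores  = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()             # softmax
soft_result = weights @ values                              # weighted combination of all values
```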

        Figure 8 shows the steps required to obtain the final attention value from the query, keys and values. Each step is explained in detail below. (Each key k is a vector, each similarity S is a scalar, each weight a, obtained from the softmax, is a scalar, and each value V is a vector.)

Figure 8. Steps to get the attention value (Source: Image created by the author)

Step 1.

        In Step 1 we have the keys and the query, and we compute a similarity measure between them. The similarity S is some function of the query q and a key k, both of which are embedding vectors. The similarity S can be calculated using various methods, as shown in Figure 9.

Figure 9. Methods to calculate similarity (Source: Image created by the author)

        The similarity can be a simple dot product of the query and the key. It can also be a scaled dot product, where the dot product of q and k is divided by the square root of the dimensionality d of each key: S = q · k / √d. These are the two most commonly used techniques for computing similarity.

        The query can also be projected into a new space using a weight matrix W and then multiplied (dot product) with the key k. Kernel methods can also be used as the similarity function.
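        Here is a short sketch of the similarity functions just mentioned; the query and key vectors are arbitrary examples, and the projection matrix W is a random stand-in for a learned one.

```python
import numpy as np

np.random.seed(0)
q = np.array([0.3, -1.2, 0.8, 0.5])      # query vector
k = np.array([0.1,  0.4, 0.9, -0.2])     # key vector
d = q.shape[0]                            # dimensionality of the key

S_dot    = q @ k                          # plain dot-product similarity
S_scaled = (q @ k) / np.sqrt(d)           # scaled dot product (used in Transformers)

# Projected similarity: project the query with a weight matrix W first.
W = np.random.randn(d, d)                 # would be learned in practice
S_projected = (q @ W) @ k
```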

Step 2.

        Step 2 is to find the weights a. This is done by applying a softmax to the similarities (exp is the exponential function):

        a_i = exp(S_i) / Σ_j exp(S_j)

        The similarities are connected to the weights much like a fully connected layer.
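        A small worked example of Step 2, turning three made-up similarity scores into weights with the softmax formula above:

```python
import numpy as np

S = np.array([2.0, 0.5, -1.0])       # similarity scores S_i from Step 1
a = np.exp(S) / np.exp(S).sum()      # a_i = exp(S_i) / sum_j exp(S_j)
print(a, a.sum())                    # roughly [0.79 0.18 0.04], sums to 1.0
```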

Step 3.

        Step 3 is a weighted combination of the softmax results (a) and the corresponding values (V). The 1st value of a is multiplied by the 1st value of V, then added to the product of the 2nd value of a and the 2nd value of V, and so on. The final output we obtain is the desired attention value.

Summary of the three steps:

With the help of the query q and the keys k, we obtain the attention value, which is a weighted sum / linear combination of the values V, where the weights come from some similarity between the query and the keys.
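        Putting the three steps together, here is a minimal sketch of computing the attention value for a single query, assuming the scaled dot product as the similarity; the shapes and inputs are arbitrary.

```python
import numpy as np

def attention_value(q, K, V):
    """q: (d,) query; K: (n, d) keys; V: (n, d_v) values."""
    S = K @ q / np.sqrt(q.shape[0])        # Step 1: scaled dot-product similarities
    a = np.exp(S) / np.exp(S).sum()        # Step 2: softmax weights
    return a @ V                           # Step 3: weighted combination of the values

np.random.seed(0)
q = np.random.randn(4)
K = np.random.randn(9, 4)
V = np.random.randn(9, 4)
print(attention_value(q, K, V).shape)      # (4,)
```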

3.3 Neural Network Representation of Attention

Figure 10. Neural network representation of the attention block (Source: Image created by the author)

Figure 10 shows the neural network representation of an attention block. The word embeddings are first passed through some linear layers. These linear layers have no "bias" term, so they are nothing more than matrix multiplications. One of these layers is labelled "keys", another "queries", and the last one "values". If we perform a matrix multiplication between the keys and the queries, followed by a normalization, we get the weights. These weights are then multiplied with the values and summed up to obtain the final attention vector. This block can now be used in a neural network and is known as an attention block. Multiple such attention blocks can be added to provide more context. And the best part is that we can use gradient backpropagation to update the attention block (the weights for the keys, queries and values).
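        As a sketch of Figure 10, the block below combines the earlier pieces into one bias-free "layer": three matrix multiplications for keys, queries and values, a key-query matrix multiplication, a softmax normalization, and a weighted sum. Random matrices again stand in for trained weights.

```python
import numpy as np

def attention_block(X, Mq, Mk, Mv):
    """X: (n, k) word embeddings; Mq, Mk, Mv: (k, k) bias-free linear layers."""
    Q, K, V = X @ Mq, X @ Mk, X @ Mv               # linear layers = matrix multiplications
    scores = Q @ K.T / np.sqrt(X.shape[1])         # query-key matrix multiplication
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # normalization (softmax)
    return weights @ V                             # weighted sum of the values

np.random.seed(0)
k = 4
X = np.random.randn(9, k)                          # embeddings for the 9 words
Mq, Mk, Mv = (np.random.randn(k, k) for _ in range(3))
out = attention_block(X, Mq, Mk, Mv)               # (9, k) attention vectors
```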

3.4 Multi-head attention

        Multi-head attention is used to overcome some pitfalls of single-head attention. Let's go back to that sentence: "Bark is very cute and he is a dog". Here, if we take the word "dog", grammatically we understand that "Bark", "cute" and "he" should all have some significance or relevance to the word "dog": they tell us that the dog's name is Bark, that it is a male dog, and that it is a cute dog. A single attention mechanism may not be able to correctly identify that all three of these words are relevant to "dog". It would be better to use three attentions here to relate these three words to the word "dog". This reduces the burden on any one attention to find all the important words, and also increases the chances of easily finding more relevant words.

        So let's add more linear layers for the keys, queries and values. These linear layers are trained in parallel and have weights that are independent of each other. So now, instead of one, we get three sets of queries, keys and values. The three sets of queries and keys produce three different sets of weights. Each set of weights is then multiplied (matrix multiplication) with its corresponding values, giving three outputs. These three attention outputs are finally concatenated to give the final attention output. This representation is shown in Figure 11.

Figure 11. Multi-head attention with 3 linear layers (Source: Image created by the author)

        But 3 is just an arbitrary number we chose. In a real scenario, there can be any number of linear layers, and these are called heads (h). That is, there can be h linear layers giving h attention outputs, which are then concatenated together. This is exactly why it is called multi-head attention. A simplified version of Figure 11, but with h heads, is shown in Figure 12.

Figure 12. Multi-head attention with h linear layers (Source: Image created by the author)
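        Finally, a compact sketch of multi-head attention with h heads as in Figure 12: each head has its own independent Mq, Mk and Mv, and the h outputs are concatenated. (In full Transformer implementations, a final linear projection usually follows the concatenation, a detail the figures here abstract away.)

```python
import numpy as np

def single_head(X, Mq, Mk, Mv):
    Q, K, V = X @ Mq, X @ Mk, X @ Mv
    w = np.exp(Q @ K.T / np.sqrt(Q.shape[1]))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def multi_head_attention(X, heads):
    """heads: a list of (Mq, Mk, Mv) tuples, one per head; outputs are concatenated."""
    outputs = [single_head(X, Mq, Mk, Mv) for Mq, Mk, Mv in heads]
    return np.concatenate(outputs, axis=-1)        # shape (n, h * k)

np.random.seed(0)
n, k, h = 9, 4, 3                                  # 9 tokens, embedding size 4, 3 heads
X = np.random.randn(n, k)
heads = [tuple(np.random.randn(k, k) for _ in range(3)) for _ in range(h)]
out = multi_head_attention(X, heads)               # (9, 12): three concatenated heads
```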

        We have now covered all the important building blocks of Transformer networks: the mechanisms and ideas behind attention, queries, keys, values and multi-head attention. In the next story, I will discuss how all these blocks stack together to form the Transformer network, and cover some Transformer-based networks such as BERT and GPT.

4. Citation:

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.
