[Transformers 02] Attention mechanism, BERT, and GPT

 

1. Description

        In Part 1, I explained what the attention mechanism is and introduced the key concepts and blocks behind transformers: self-attention, query, key and value, and multi-head attention. In this part, I explain how these attention blocks (self-attention, multi-head attention, masked multi-head attention) are combined to build the Transformer network, and how that leads to BERT and GPT.

2. Content:

  1. Challenges of RNNs and how Transformer models can help overcome them (introduced in Part 1)
  2. Attention mechanisms - self-attention, query, key, value, multi-head attention (introduced in Part 1)
  3. The Transformer network (this part)
  4. Basics of GPT (covered in Part 3)
  5. Basics of BERT (covered in Part 3)

3. Transformer network

        Paper - Attention is All You Need (2017)

Figure 1. The Transformer network (Source: image from the original paper)

        Figure 1 shows the Transformer network. This architecture has largely replaced RNNs as the dominant model for NLP, and it has even reached computer vision (the Vision Transformer).

        The network consists of two parts - encoder and decoder.

        In machine translation, the encoder encodes the original sentence and the decoder generates the translated sentence. The Transformer's encoder can process an entire sentence in parallel, which makes it faster and more effective than RNNs, which process sentences one word at a time.

3.1 Encoder block

Figure 2. The encoder part of the Transformer network (Source: image from the original paper)

        The encoder network starts with the input: the entire sentence is fed in at once and converted to vectors in the "input embedding" block. A "positional encoding" is then added to each word in the sentence. This encoding is crucial for capturing the position of each word; without positional embeddings, the model sees the sentence as a bag of words, with no order or structure.

In detail:

3.1.1 Input embedding 

        The word "dog" in a sentence is mapped to a vector through an embedding space. Embedding simply means converting a word in any language into its vector representation. An example is shown in Figure 3. In the embedding space, similar words have similar embeddings: for example, "cat" and "kitty" lie very close together, while "cat" and "emotion" fall much farther apart.

Figure 3. Input embedding (Source: Image created by the author)
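To make this concrete, here is a minimal sketch of an embedding lookup in PyTorch. The toy vocabulary, the embedding dimension, and the example sentence are all made up for illustration; real models use a learned tokenizer vocabulary and a much larger dimension (512 in the original paper).

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary for illustration only.
vocab = {"i": 0, "have": 1, "a": 2, "cute": 3, "dog": 4}
d_model = 8  # embedding dimension (512 in the original paper)

# Lookup table mapping each token id to a learned d_model-dimensional vector.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

token_ids = torch.tensor([[vocab[w] for w in "i have a cute dog".split()]])
word_vectors = embedding(token_ids)   # shape: (1, 5, d_model)
print(word_vectors.shape)             # torch.Size([1, 5, 8])
```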

3.1.2 Positional encoding

        Words can have different meanings in different sentences. For example, the word "dog" in (a) "I have a cute dog" (an animal/pet, position 5) and (b) "What a lazy dog you are!" (used as an insult, position 4) carries different meanings. Positional encoding helps with this: it is a vector that provides information based on the context and position of a word in the sentence.

        In any sentence, words appear one after the other, and that order carries meaning. If the words in a sentence are jumbled, the sentence no longer makes sense. But when the Transformer loads a sentence, it does not load it sequentially; it loads it in parallel. Since the architecture does not see the order of words when they are loaded in parallel, we must explicitly encode the position of each word in the sentence. This helps the Transformer understand that one word comes after another. This is where positional embeddings come in: a vector encoding that defines word positions, which is added to the input embedding before it enters the attention network. Figure 4 gives an intuitive picture of the input embeddings and positional embeddings before they are fed into the attention network.

Figure 4. Intuitive understanding of positional embeddings (Source: Image created by the author)

        There are various ways to define these positional embeddings. For example, in the original paper "Attention is All You Need", the authors define the embeddings using alternating sine and cosine functions, as shown in Figure 5.

Figure 5. Positional embeddings used in the original paper (Source: Image from the original paper)
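For reference, the paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Below is a minimal PyTorch sketch of these fixed sinusoidal encodings; the sequence length and dimension are arbitrary choices for the example.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)               # (seq_len, 1)
    div_term = torch.pow(10000.0,
                         torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)  # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=5, d_model=8)
print(pe.shape)  # torch.Size([5, 8])
# This is added to the word embeddings before they enter the attention layers,
# e.g. x = word_vectors + pe.unsqueeze(0)
```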

        Although this fixed embedding works well for text data, it does not carry over directly to image data. There can therefore be many ways to encode the positions of objects (text/images), and they can be fixed or learned during training. The basic idea is that these embeddings let the Transformer architecture understand where each word sits in the sentence, instead of destroying the meaning by jumbling the words.

After the word/input embedding and positional embedding are computed, the result flows into the most important part of the encoder, which contains two key blocks: the "multi-head attention" block and the "feed-forward" network.

3.1.3 Multi-head attention

        This is the main block where the magic happens. To learn more about multi-head attention, see 2.4 Multi-Head Attention (Part 1).

        As input, the block receives a vector (the sentence) containing sub-vectors (the words in the sentence). Multi-head attention then computes the attention between every position and every other position in the vector.

Figure 6. Scaled dot product attention (source: image from original paper)

        The figure above shows scaled dot-product attention. It is essentially the same as self-attention, with two additional blocks (scale and mask). To learn more about self-attention, see 2.1 Self-Attention (Part 1).

        As shown in Figure 6, scaled dot-product attention is the same except that a scaling step is added after the first matrix multiplication (MatMul).

        The scaling factor is 1/sqrt(d_k), where d_k is the dimension of the keys, so the full computation is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

        The scaled output then goes into a mask layer. This layer is optional and is useful in tasks such as machine translation.
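Here is a minimal PyTorch sketch of scaled dot-product attention as drawn in Figure 6 (MatMul, scale, optional mask, softmax, MatMul); the tensor shapes are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional mask applied before the softmax."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)            # MatMul + scale
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))    # optional mask layer
    weights = F.softmax(scores, dim=-1)                          # attention weights
    return weights @ v                                           # weighted sum of the values

# Toy example: one sentence of 5 words, d_k = d_v = 8
q = k = v = torch.randn(1, 5, 8)
out = scaled_dot_product_attention(q, k, v)   # shape: (1, 5, 8)
```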

Figure 7. Attention block (Source: Image created by the author)

        Figure 7 shows the neural-network view of the attention block. The word embeddings are first passed through a few linear layers. These linear layers have no "bias" term, so they are nothing more than matrix multiplications. One of these layers produces the "keys", another the "queries", and the last one the "values". A matrix multiplication between the queries and the keys, followed by a normalization, gives the attention weights. These weights are then multiplied by the values and summed to obtain the final attention vector. This composite block can be dropped into a neural network and is called an attention block. Several such attention blocks can be added to provide more context. Best of all, gradients backpropagate through the block, so the weights that produce the keys, queries, and values are learned.

        Multi-head attention takes in multiple sets of keys, queries, and values, feeds them through multiple scaled dot-product attention blocks, and finally concatenates the resulting attentions to produce the final output, as shown in Figure 8.

Figure 8. Multi-head attention (Source: Image created by the author)

        A simpler explanation: the main vector (the sentence) contains sub-vectors (the words), each with its own positional embedding. The attention computation treats each word as a "query" and finds "keys" corresponding to the other words in the sentence, then takes a convex combination of the corresponding "values". In multi-head attention, multiple sets of queries, keys, and values are used to compute multiple attentions (better word embeddings with more context). These multiple attentions are concatenated to give the final attention value (a combination of context from all words across all heads), which works much better than a single attention block.

        In simple words, the idea of multi-head attention is to take a word embedding and combine it with other word embeddings using attention (or several attentions) to produce a better embedding for that word, one that carries more context from the surrounding words.

        The idea is to compute multiple attentions per query, with different weights.
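Below is a minimal sketch of multi-head attention along these lines, reusing the scaled_dot_product_attention function from the earlier sketch. The class name, head count, and dimensions are my own illustrative choices, following the usual d_model = num_heads x d_k convention.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        # Linear layers with no bias term, producing queries, keys, and values.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)  # projection after concatenation

    def forward(self, x, mask=None):
        b, t, _ = x.shape

        def split(proj):  # (b, t, d_model) -> (b, num_heads, t, d_k)
            return proj.view(b, t, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # One scaled dot-product attention per head (function from the sketch above).
        heads = scaled_dot_product_attention(q, k, v, mask)
        concat = heads.transpose(1, 2).reshape(b, t, -1)   # concatenate the heads
        return self.w_o(concat)

mha = MultiHeadAttention(d_model=8, num_heads=2)
out = mha(torch.randn(1, 5, 8))   # shape: (1, 5, 8)
```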

3.1.4 Add & norm and feed-forward

        The next block is "add & norm", which takes a residual connection from the original word embeddings, adds it to the multi-head attention output, and then normalizes the result to zero mean and unit variance.

        This is fed into a "feed-forward" block, which also has an "add & norm" block on its output.

        The whole multi-head attention and feed-forward block is repeated N times (a hyperparameter) in the encoder.
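Putting the pieces together, here is a rough sketch of one encoder layer (multi-head attention, add & norm, feed-forward, add & norm) stacked N times. It reuses the MultiHeadAttention sketch above, and the feed-forward hidden size is an arbitrary choice.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # from the sketch above
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))   # residual connection + add & norm
        x = self.norm2(x + self.ff(x))     # feed-forward + add & norm
        return x

# The encoder stacks N identical layers (N = 6 in the original paper).
encoder = nn.Sequential(*[EncoderLayer(d_model=8, num_heads=2, d_ff=32) for _ in range(6)])
out = encoder(torch.randn(1, 5, 8))   # shape: (1, 5, 8)
```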

3.2 Decoder block

Figure 9. The decoder part of the Transformer network (Source: image from the original paper)

        The output of the encoder is again a sequence of embeddings, one per position, where each position's embedding contains not only the original word's embedding at that position but also information about the other words, learned through attention.

        This is then fed into the decoder part of the transformer network, as shown in Figure 9. The purpose of a decoder is to produce some output. In the paper "Attention is All You Need", this decoder is used for sentence translation (e.g. from English to French). So, the encoder will receive an English sentence and the decoder will translate it into French. In other applications, the decoder part of the network is not necessary, so I won't elaborate on it too much.

        Steps in the decoder block—

1. In sentence translation, the decoder block receives a French sentence (for English to French translation). Like the encoder, here we add a word embedding and a position embedding and feed it to a multi-head attention block.

2. The self-attention block generates an attention vector for each word in a French sentence to show the relevance of one word to another in the sentence.

3. The attention vectors of the French sentence are then compared with those of the English sentence. This is where the English-to-French word mapping happens (a sketch of this cross-attention step follows Figure 10 below).

4. In the last few layers, the decoder predicts the translation of the English word to the best possible French word.

5. The whole process is repeated multiple times to obtain a translation of the entire text data.

The modules used for each of the above steps are shown in Figure 10.

Figure 10. The role of different decoder blocks in sentence translation (Source: Image created by the author)
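Step 3 above is the encoder-decoder ("cross") attention block: the queries come from the decoder's own states, while the keys and values come from the encoder output. Here is a rough sketch of just that step, reusing the scaled_dot_product_attention function from earlier; all shapes and layer names are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model = 8
w_q = nn.Linear(d_model, d_model, bias=False)   # queries come from the decoder side
w_k = nn.Linear(d_model, d_model, bias=False)   # keys come from the encoder output
w_v = nn.Linear(d_model, d_model, bias=False)   # values come from the encoder output

decoder_states = torch.randn(1, 7, d_model)   # e.g. the French tokens produced so far
encoder_output = torch.randn(1, 5, d_model)   # embeddings of the English sentence

# Each decoder position attends over all encoder positions.
cross = scaled_dot_product_attention(
    w_q(decoder_states), w_k(encoder_output), w_v(encoder_output)
)
print(cross.shape)   # torch.Size([1, 7, 8]) - one context vector per French position
```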

There is one new block in the decoder - masked multi-head attention. All the other blocks we have already seen in the encoder.

3.2.1 Masked multi-head attention

        This is a multi-head attention block in which certain values are masked out, so that the probability of attending to a masked value becomes zero and it is never selected.

        For example, when decoding, the output at a given position should depend only on previous outputs, not on future ones, so we mask the future outputs.
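A minimal sketch of such a causal ("look-ahead") mask: a lower-triangular matrix that lets each position attend only to itself and earlier positions (the sequence length here is arbitrary).

```python
import torch

seq_len = 5
# Lower-triangular matrix: row i has 1s only for positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# Passing this mask to scaled_dot_product_attention sets the scores of all
# future positions to -inf before the softmax, so their attention weights
# become zero and each output depends only on previous outputs.
```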

3.3 Results and conclusions

Figure 11. Results (Source: image from the original paper)

        In the paper, English-to-German and English-to-French translation are compared against other state-of-the-art language models. BLEU is the metric used for these translation comparisons. From Figure 11, we see that the big Transformer model achieves higher BLEU scores on both translation tasks, while also significantly reducing training cost.

        In conclusion, the Transformer model can reduce computational cost while still obtaining state-of-the-art results.

        In this part, I explained the encoder and decoder blocks of the Transformer network and how each block is used in language translation. In the next and final part (Part 3), I will discuss some important Transformer networks that have recently become very famous, such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).

4. Citation 

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.
