Transformer Explained in Detail

July 12, 2020 • Deep Learning

A video explanation is available on Bilibili.

Transformer is a seq2seq model proposed by Google Brain in the paper Attention Is All You Need, published at the end of 2017. It has since seen a wide range of applications and extensions; BERT, for example, is a pre-trained language model derived from Transformer.

This article is divided into the following sections

  1. Transformer intuitive understanding
  2. Positional Encoding
  3. Self Attention Mechanism
  4. Residual connection and Layer Normalization
  5. Transformer Encoder overall structure
  6. Transformer Decoder overall structure
  7. Summary
  8. Reference articles

0. Intuitive understanding of Transformer

The biggest difference between the Transformer and the LSTM is that LSTM training is iterative and serial: the current word must be processed before the next word can be. Transformer training is parallel, meaning all words are processed at the same time, which greatly increases computational efficiency. The Transformer uses Positional Encoding to understand word order and uses the Self-Attention Mechanism together with fully connected layers for its computation; both are discussed below.

The Transformer model is mainly divided into two parts: the Encoder and the Decoder. The Encoder maps the input (a language sequence) into a hidden representation (the part represented by the nine-square grid in step 2 of the figure below), and the Decoder then maps that hidden representation into a natural-language sequence, as in the machine-translation example below.

Most of this article explains the Encoder part, that is, the process of mapping a natural-language sequence into a hidden mathematical representation. Once you understand the structure of the Encoder, understanding the Decoder is very simple.

The picture above is the structure diagram of a Transformer Encoder block. Note: the section numbers below correspond to the numbers of boxes 1, 2, 3, and 4 in the picture.

1. Positional Encoding

Since the Transformer model has no iterative operation like a recurrent neural network, we must provide the position information of each word to the Transformer so that it can recognize the order relationships in language.

We now define the concept of a positional embedding, i.e. Positional Encoding. The positional embedding has dimensions [max_sequence_length, embedding_dimension]; its embedding dimension is the same as the word-vector dimension, namely embedding_dimension. max_sequence_length is a hyperparameter that limits how many words each sentence may consist of.

Note that we generally train the Transformer model in units of individual words (characters, for Chinese). We first initialize a word-embedding table of size [vocab_size, embedding_dimension], where vocab_size is the number of words (characters) in the vocabulary and embedding_dimension is the word-vector dimension; this corresponds to nn.Embedding(vocab_size, embedding_dimension) in PyTorch.
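
As a minimal sketch of this lookup (the vocabulary size, embedding dimension, and batch shape below are made-up values for illustration, not from the article), the word-embedding step in PyTorch might look like this:

    import torch
    import torch.nn as nn

    # Hypothetical sizes, for illustration only
    vocab_size = 5000          # number of words/characters in the vocabulary
    embedding_dimension = 512  # dimension of each word vector

    embedding = nn.Embedding(vocab_size, embedding_dimension)

    # A toy batch of token ids with shape [batch_size, max_sequence_length]
    token_ids = torch.randint(0, vocab_size, (2, 10))
    word_vectors = embedding(token_ids)
    print(word_vectors.shape)  # torch.Size([2, 10, 512])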

The paper uses a combination of sin and cos functions to provide the model with position information:

PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{\text{model}}}) \\ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{\text{model}}})

In the formula above, pos refers to the position of a word in the sentence, in the range [0, max_sequence_length); i refers to the index of the word-vector dimension, in the range [0, embedding_dimension / 2); and d_{\text{model}} refers to the value of embedding_dimension.

The sin and cos above form a pair of equations corresponding to the even- and odd-numbered dimensions of embedding_dimension: for example, dimensions 0 and 1 form one pair and dimensions 2 and 3 form another, and each pair is processed with the sin and cos functions above. This gives the positional embedding a different period in each dimension; as the dimension index increases, the period grows longer and longer, and the result is a texture that contains position information. As mentioned on page six of the original paper, the wavelengths of the positional-embedding functions range from 2\pi to 10000 \cdot 2\pi, so each position receives, across the embedding_dimension dimensions, a combination of sin and cos values with different periods, producing a unique positional texture. From this the model can ultimately learn the dependencies between positions and the ordering characteristics of natural language.

If you do not understand why this design works, you can refer to the article on Positional Encoding in Transformer mentioned here. Observing the heatmap produced by the code below vertically, you can see that as the embedding_dimension index increases, the period of the positional-embedding function changes more and more slowly.



    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    def get_positional_encoding(max_seq_len, embed_dim):
        # Initialize the positional encoding table
        # embed_dim: dimension of the word embedding
        # max_seq_len: maximum sequence length
        # 2 * (i // 2) makes dimensions 2i and 2i+1 share the same frequency, matching the formula above
        positional_encoding = np.array([
            [pos / np.power(10000, 2 * (i // 2) / embed_dim) for i in range(embed_dim)]
            if pos != 0 else np.zeros(embed_dim) for pos in range(max_seq_len)])
        positional_encoding[1:, 0::2] = np.sin(positional_encoding[1:, 0::2])  # dim 2i (even)
        positional_encoding[1:, 1::2] = np.cos(positional_encoding[1:, 1::2])  # dim 2i+1 (odd)
        return positional_encoding

    positional_encoding = get_positional_encoding(max_seq_len=100, embed_dim=16)

    plt.figure(figsize=(10, 10))
    sns.heatmap(positional_encoding)
    plt.title("Sinusoidal Function")
    plt.xlabel("hidden dimension")
    plt.ylabel("sequence length")

    plt.figure(figsize=(8, 5))
    plt.plot(positional_encoding[1:, 1], label="dimension 1")
    plt.plot(positional_encoding[1:, 2], label="dimension 2")
    plt.plot(positional_encoding[1:, 3], label="dimension 3")
    plt.legend()
    plt.xlabel("Sequence length")
    plt.ylabel("Period of Positional Encoding")
    plt.show()

2. Self Attention Mechanism

For the input sentence X, the word vector of each word is obtained through WordEmbedding and the position vector of each word through Positional Encoding; the two are added (their dimensions are the same, so they can be added directly) to get the true vector representation of each word. The vector of the t-th word is denoted x_t.

Then we define three matrices W_Q, W_K, W_V and use them to perform three linear transformations on all the word vectors, so each word vector x_t derives three new vectors q_t, k_t, v_t. All the q_t vectors are stacked into a large matrix, denoted the query matrix Q; all the k_t vectors form the key matrix K; and all the v_t vectors form the value matrix V (see the figure below).

In order to get the attention weights of the first word, we multiply the query vector q_1 of the first word by the key matrix K (see the figure below):

            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]

 

We then pass the obtained scores through softmax so that they sum to 1 (see the figure below):

 
  • softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]

Once we have the weights, we multiply each weight by the value vector v_t of the corresponding word (see the figure below):

0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]

0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]

0.5 * [2, 6, 3] = [1.0, 3.0, 1.5] 

Finally, these weighted value vectors are summed to get the output of the first word (see the figure below)

[0.0, 0.0, 0.0]

+ [1.0, 4.0, 0.0]

+ [1.0, 3.0, 1.5]

-----------------

= [2.0, 7.0, 1.5] 

Perform the same operation on the other input vectors to obtain all the self-attention outputs.
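
As a quick sanity check of the worked example above, here is a minimal numpy sketch using the same toy numbers; the 3x3 matrix is taken exactly as displayed in the figure (each column treated as one key vector), and because the text rounds the softmax to [0.0, 0.5, 0.5], the printed output only approximately matches [2.0, 7.0, 1.5]:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())  # subtract the max for numerical stability
        return e / e.sum()

    q1 = np.array([1, 0, 2])          # query vector of the first word
    K_T = np.array([[0, 4, 2],        # key matrix exactly as displayed above
                    [1, 4, 3],
                    [1, 0, 1]])
    V = np.array([[1, 2, 3],          # value vectors, one row per word
                  [2, 8, 0],
                  [2, 6, 3]])

    scores = q1 @ K_T          # -> [2, 4, 4]
    weights = softmax(scores)  # -> roughly [0.06, 0.47, 0.47], rounded to [0.0, 0.5, 0.5] in the text
    output = weights @ V       # -> roughly [1.9, 6.7, 1.6], i.e. about [2.0, 7.0, 1.5]
    print(scores, weights.round(2), output.round(1))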

Matrix calculation

The method described above requires a loop over all the words x_t. We can rewrite the vector computation above in matrix form and compute the output at all positions at once.

The first step is to compute not q_t, k_t, v_t at a single position, but Q, K, and V for all positions at once. The computation is shown below, where the input is a matrix X whose t-th row is the vector representation x_t of the t-th word.

Next, multiply Q by K^T, divide by \sqrt{d_k} (this is the scaling trick mentioned in the paper), apply softmax, and multiply by V to obtain the output.
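
A minimal numpy sketch of this matrix form (the sequence length, d_model, and random weights below are illustrative assumptions, not values from the article):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Q, K: [seq_len, d_k], V: [seq_len, d_v]
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # [seq_len, seq_len]
        weights = softmax(scores, axis=-1)   # each row sums to 1
        return weights @ V                   # [seq_len, d_v]

    # Toy example: 5 words, d_model = 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))
    W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
    output = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
    print(output.shape)  # (5, 8)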

Multi-Head Attention

The paper also proposes the concept of Multi-Head Attention, which is actually very simple. The single group Q, K, V defined earlier lets one word attend to related words; we can define multiple groups of Q, K, V to attend to different contexts. The procedure for computing Q, K, V is the same, except that the linear-transformation matrices go from one set (W^Q, W^K, W^V) to multiple sets (W^Q_0, W^K_0, W^V_0), (W^Q_1, W^K_1, W^V_1), ..., as shown in the figure.

For the input matrix X, each group of Q, K, and V produces an output matrix, as shown below.
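
A minimal numpy sketch of multi-head attention; splitting d_model evenly across the heads and projecting the concatenated heads with an output matrix W_O are assumptions on my part (they follow the common implementation rather than anything stated explicitly in this article):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

    def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
        # Each head works on a slice of size d_model // num_heads
        d_model = X.shape[-1]
        d_head = d_model // num_heads
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        heads = [attention(Q[:, h * d_head:(h + 1) * d_head],
                           K[:, h * d_head:(h + 1) * d_head],
                           V[:, h * d_head:(h + 1) * d_head])
                 for h in range(num_heads)]
        # Concatenate the per-head outputs, then apply the output projection
        return np.concatenate(heads, axis=-1) @ W_O

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))  # 5 words, d_model = 16 (toy sizes)
    W_Q, W_K, W_V, W_O = (rng.normal(size=(16, 16)) for _ in range(4))
    print(multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads=4).shape)  # (5, 16)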

Padding Mask

In the Self-Attention computation above we usually use mini-batches, that is, we compute several sentences at once; the input has dimensions [batch_size, sequence_length], where sequence_length is the sentence length. A mini-batch consists of several sentences of unequal length, so we pad the shorter sentences up to the length of the longest sentence in the mini-batch, usually with 0. This process is called padding.

But this creates a problem when computing softmax. Recall the softmax function \sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}: since e^0 = 1, the padded positions still contribute to the softmax, which amounts to letting invalid positions participate in the computation and may cause serious problems. We therefore need a mask operation to keep these invalid regions out of the computation; in general, a large negative bias is added to the invalid regions, namely

Z_{illegal} = Z_{illegal}+bias_{illegal}

bias_{illegal} \rightarrow -\infty
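
A minimal numpy sketch of the padding mask; treating token id 0 as the padding symbol and using -1e9 to stand in for negative infinity are assumptions for illustration:

    import numpy as np

    def padding_mask_bias(token_ids, pad_id=0, neg_inf=-1e9):
        # token_ids: [batch_size, seq_len]; returns a bias to add to the attention scores:
        # 0 for real tokens, a very large negative number for padded positions
        is_pad = (token_ids == pad_id)
        return np.where(is_pad, neg_inf, 0.0)[:, None, :]  # broadcast over query positions

    # Two sentences of unequal length, padded with 0 up to length 5
    token_ids = np.array([[5, 7, 2, 0, 0],
                          [3, 9, 4, 8, 1]])
    bias = padding_mask_bias(token_ids)      # shape [2, 1, 5]
    scores = np.zeros((2, 5, 5)) + bias      # padded key positions get -1e9 before softmax
    print(scores[0, 0])                      # [0, 0, 0, -1e9, -1e9]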

3. Residual connection and Layer Normalization

Residual connection

In the previous step we obtained the self-attention-weighted output Attention(Q,\ K,\ V); we then add it to the original input to form the residual connection:

X_{embedding} + Self\ Attention(Q, \ K, \ V)

Layer Normalization

The role of Layer Normalization is to normalize the hidden-layer activations toward a standard normal distribution, i.e. toward being independent and identically distributed (i.i.d.), which speeds up training and accelerates convergence.

\mu_{j}=\frac{1}{m} \sum^{m}_{i=1}x_{ij}

The formula above computes the mean over each column of the matrix;

\sigma^{2}_{j}=\frac{1}{m} \sum^{m}_{i=1}(x_{ij}-\mu_{j})^{2}

the formula above computes the variance over each column of the matrix.

LayerNorm(x)=\frac{x_{ij}-\mu_{j}}{\sqrt{\sigma^{2}_{j}+\epsilon}}

Then each element of every column subtracts that column's mean and is divided by that column's standard deviation to obtain the normalized value; \epsilon is added to prevent the denominator from being 0.
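
A minimal numpy sketch that follows the three formulas above literally (statistics per column). Note that library implementations such as PyTorch's nn.LayerNorm compute the statistics over the embedding dimension of each token and also include learnable scale and shift parameters, which this article does not discuss:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # x: [m, n]; per the formulas above, mean and variance are taken over each column j
        mu = x.mean(axis=0, keepdims=True)    # mu_j
        var = x.var(axis=0, keepdims=True)    # sigma^2_j
        return (x - mu) / np.sqrt(var + eps)

    x = np.array([[1.0, 2.0, 6.0],
                  [3.0, 8.0, 0.0]])
    print(layer_norm(x))  # every column now has mean 0 and (roughly) unit variance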

The figure below shows more details: the inputs x_1, x_2 become z_1, z_2 after the self-attention layer; z_1, z_2 are then added to the inputs x_1, x_2 through a residual connection, passed through LayerNorm, and sent to the fully connected layer. The fully connected layer also has a residual connection and a LayerNorm, and the result is finally output to the next Encoder (the FeedForward layer weights in each Encoder block are shared).

4. Transformer Encoder overall structure

After the above three steps, we have basically understood the main components of the Encoder. Below we use formulas to organize the computation process of one Encoder block:

1). Word vector and position coding

X = Embedding\ Lookup(X) + Positional\ Encoding

2). Self-attention mechanism

Q = Linear(X) = XW_{Q}\\ K = Linear(X) = XW_{K}\\ V = Linear(X) = XW_{V}\\ X_{attention} = SelfAttention(Q, \ K, \ V)

3). Self-attention residual connection and Layer Normalization

X_{attention} = X + X_{attention}\\ X_{attention} = LayerNorm(X_{attention})

4). The fourth part of the Encoder block structure diagram is the FeedForward layer, which is simply two linear layers with an activation function in between, for example ReLU:

X_{hidden} = Linear(ReLU(Linear(X_{attention})))

5). FeedForward residual connection and Layer Normalization

X_{hidden} = X_{attention} + X_{hidden}\\ X_{hidden} = LayerNorm(X_{hidden})

where

X_{hidden} \in \mathbb{R}^{batch\_size \ \times \ seq\_len \ \times \ embed\_dim}
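
Putting the five steps together, here is a minimal single-head numpy sketch of one Encoder block; the single head, the omission of dropout, the toy sizes, the random weights, and the LayerNorm convention (normalizing each token over its embedding dimension, as standard libraries do) are all simplifying assumptions, not the author's code:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def layer_norm(x, eps=1e-6):
        # Normalize each token vector over the embedding dimension (standard convention)
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def encoder_block(X, W_Q, W_K, W_V, W_1, b_1, W_2, b_2):
        # 2). self-attention
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        X_attention = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V
        # 3). residual connection + LayerNorm
        X_attention = layer_norm(X + X_attention)
        # 4). FeedForward: two linear layers with a ReLU in between
        X_hidden = np.maximum(0, X_attention @ W_1 + b_1) @ W_2 + b_2
        # 5). residual connection + LayerNorm
        return layer_norm(X_attention + X_hidden)

    # Toy sizes: seq_len = 5, embed_dim = 16, feed-forward dim = 32
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))  # step 1): word embedding + positional encoding
    W_Q, W_K, W_V = (rng.normal(size=(16, 16)) for _ in range(3))
    W_1, b_1 = rng.normal(size=(16, 32)), np.zeros(32)
    W_2, b_2 = rng.normal(size=(32, 16)), np.zeros(16)
    print(encoder_block(X, W_Q, W_K, W_V, W_1, b_1, W_2, b_2).shape)  # (5, 16)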

5. Transformer Decoder overall structure

Let's first look at the Decoder structure from a high-level perspective, from bottom to top:

  • Masked Multi-Head Self-Attention
  • Multi-Head Encoder-Decoder Attention
  • FeedForward Network

Like the Encoder, each of the above three parts has a residual connection followed by a Layer Normalization. The individual modules of the Decoder are not complicated, and most of them have already been introduced in the Encoder section, but because of its special role the Decoder involves some extra details during training.

Masked Self-Attention

Specifically, the Decoder in traditional Seq2Seq uses an RNN, so when the word at time step t is fed in during training, the model cannot see future words anyway, because a recurrent network is driven step by step: the word at time step t+1 can only be seen after the computation at time step t is finished. The Transformer Decoder abandons the RNN in favor of Self-Attention, which creates a problem: during training the entire ground truth is exposed to the Decoder, which is obviously wrong. We need to process the Decoder's input, and this process is called Mask.

For example, suppose the ground truth of the Decoder is "<start> I am fine". We feed this sentence into the Decoder; after WordEmbedding and Positional Encoding, the resulting matrix undergoes three linear transformations (W_Q, W_K, W_V). We then perform the self-attention operation: first compute Q \cdot K^T / \sqrt{d_k} to obtain the Scaled Scores. The next step is very important: we must mask the Scaled Scores. For example, when "I" is input, the model should currently know only the information of "I" and all the words before it, namely "<start>" and "I"; it must not be allowed to see the words after "I". The reason is simple: at prediction time we predict one word at a time, in order, so before a word has been predicted there is no way to know the words that follow it. The Mask itself is very simple: first generate a matrix whose lower triangle is all 0 and whose upper triangle is negative infinity, then add it to the Scaled Scores.

After softmax, the -inf entries become 0, and the resulting matrix is the attention weight between each pair of words.
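
A minimal numpy sketch of this look-ahead mask, using -1e9 to stand in for negative infinity and toy all-zero Scaled Scores for a 4-token sentence such as "<start> I am fine":

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def look_ahead_mask(seq_len, neg_inf=-1e9):
        # 0 on and below the diagonal, a very large negative number strictly above it
        return np.triu(np.full((seq_len, seq_len), neg_inf), k=1)

    scaled_scores = np.zeros((4, 4))  # toy Scaled Scores
    weights = softmax(scaled_scores + look_ahead_mask(4), axis=-1)
    print(weights.round(2))
    # row t only has non-zero weights on positions <= t, e.g. the first row is [1, 0, 0, 0]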

Multi-Head Self-Attention is nothing more than performing the above steps several times in parallel; it was already introduced for the Encoder, so it will not be repeated here.

Masked Encoder-Decoder Attention

In fact, the computation of this part is very similar to the Masked Self-Attention above, and the structure is exactly the same. The only difference is that here K and V come from the output of the Encoder, while Q comes from the output of the Masked Self-Attention in the Decoder.
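
A minimal numpy sketch of this Encoder-Decoder attention; the shapes and random weights are illustrative only, the point being where Q, K, and V come from:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def encoder_decoder_attention(decoder_x, encoder_out, W_Q, W_K, W_V):
        Q = decoder_x @ W_Q      # queries come from the Decoder (output of Masked Self-Attention)
        K = encoder_out @ W_K    # keys come from the Encoder output
        V = encoder_out @ W_V    # values come from the Encoder output
        return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

    rng = np.random.default_rng(0)
    encoder_out = rng.normal(size=(6, 16))  # 6 source tokens (toy sizes)
    decoder_x = rng.normal(size=(4, 16))    # 4 target tokens
    W_Q, W_K, W_V = (rng.normal(size=(16, 16)) for _ in range(3))
    print(encoder_decoder_attention(decoder_x, encoder_out, W_Q, W_K, W_V).shape)  # (4, 16)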

6. Summary

So far, 95% of the content of Transformer has been covered; one picture shows its complete structure. It has to be said that the design of Transformer is very ingenious.

Below are a few questions I found on the Internet; I feel that after reading them you can gain a deeper understanding of Transformer.

Why does Transformer need Multi-head Attention?

The original paper says that the purpose of Multi-head Attention is to split the model into multiple heads that form multiple subspaces, allowing the model to attend to different aspects of the information and finally integrate them. Intuitively, if you were designing such a model yourself, you would certainly not perform attention only once; combining the results of multiple attention operations can at least strengthen the model. It can also be compared to using multiple convolution kernels simultaneously in a CNN. Intuitively speaking, multi-head attention helps the network capture richer features and information.

Compared with RNN/LSTM, what are the advantages of Transformer? why?

  1. RNN-series models cannot be computed in parallel, because the computation at time T depends on the hidden-state result at time T-1, which in turn depends on the result at time T-2, and so on
  2. The feature-extraction ability of Transformer is better than that of RNN-series models

Why can Transformer replace seq2seq?

The word "replace" is a little inappropriate here; although seq2seq is old, it still has its place. The biggest problem with seq2seq is that it compresses all the information on the Encoder side into a fixed-length vector and uses it as the first hidden-state input of the Decoder to predict the hidden state of the first word (token) on the Decoder side. When the input sequence is relatively long, this obviously loses a lot of information on the Encoder side, and because the fixed vector is handed to the Decoder all at once, the Decoder cannot attend to the information it actually wants to attend to. Transformer not only substantially improves on these two shortcomings of the seq2seq model (through the multi-head interactive attention module), but also introduces a self-attention module so that the source sequence and the target sequence each "self-associate" first; the embedding representations of the source and target sequences themselves then contain richer information. The subsequent FFN layer also enhances the expressive power of the model, and the parallel-computing ability of Transformer far exceeds that of the seq2seq series of models.

7. Reference articles


Origin blog.csdn.net/sunlin972913894/article/details/110310909