Transformer model detailed explanation

1. Introduction

The Transformer was proposed in the paper "Attention is All You Need" and is now the recommended reference model for Google Cloud TPU. The TensorFlow code accompanying the paper is available on GitHub as part of the Tensor2Tensor package.

1.1 Transformer overall structure

The overall structure of Transformer for Chinese-English translation

img
The overall structure of the Transformer, with the Encoder on the left and the Decoder on the right

The Transformer consists of two parts, the Encoder and the Decoder. Both the Encoder and the Decoder contain 6 blocks.

1.2 Transformer workflow

  1. **Obtain the representation vector X of each word of the input sentence.** X is obtained by adding the word Embedding (the Embedding is the feature extracted from the raw data) and the positional Embedding of the word.

img
Transformer's input representation

  2. Pass the resulting word representation matrix (as shown in the figure below, each row is the representation vector x of one word) into the Encoder. After the 6 Encoder blocks, the encoding information matrix $\mathbf{C}$ of all the words in the sentence is obtained. The word vector matrix is denoted $X_{n \times d}$, where $n$ is the number of words in the sentence and $d$ is the dimension of the representation vector ($d = 512$ in the paper). The matrix output by each Encoder block has exactly the same dimensions as its input.

img
Transformer Encoder encodes the sentence information

  3. Pass the encoding information matrix C output by the Encoder to the Decoder. The Decoder then translates word i+1 based on the words 1 to i that have already been translated, as shown in the figure below. When translating word i+1, the words after position i+1 must be covered by the Mask operation.

img
Transformer Decoder prediction

The Decoder in the figure above receives the encoding matrix $\mathbf{C}$, then first inputs the translation start symbol "<Begin>" to predict the first word "I"; it then inputs "<Begin>" and the word "I" to predict the word "have", and so on. This is the general workflow of the Transformer; the details of each part are described below.

2. Transformer input

The input representation x of a word in Transformer is obtained by adding the word Embedding and the position Embedding (Positional Encoding).

img

2.1 Word Embedding

# The word Embedding can be implemented with nn.Embedding
# i_vocab_size: size of the input vocabulary; hidden_size: hidden layer size
import torch.nn as nn

i_vocab_embedding = nn.Embedding(i_vocab_size, hidden_size)

There are many ways to obtain the word Embedding. For example, it can be pre-trained with algorithms such as Word2Vec and GloVe, or it can be learned directly in the Transformer.

import math
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        """
        Initialization.
        d_model: dimension of the word embedding
        vocab: size of the vocabulary
        """
        super(Embeddings, self).__init__()
        # Use the predefined nn.Embedding layer to obtain the embedding lookup table self.lut
        self.lut = nn.Embedding(vocab, d_model)
        # Store d_model on the instance
        self.d_model = d_model

    def forward(self, x):
        """
        Forward pass of the Embedding layer.
        x: the word indices of the input text after mapping through the vocabulary
        Look x up in self.lut and multiply by sqrt(self.d_model) before returning
        """
        embedds = self.lut(x)
        return embedds * math.sqrt(self.d_model)

2.2 Positional Embedding

In the Transformer, in addition to the word Embedding, a positional Embedding is also needed to indicate where each word appears in the sentence. **Because the Transformer does not use an RNN structure but instead uses global information, it cannot make use of word order information by itself, and this information is very important for NLP.** Therefore the Transformer uses a positional Embedding to store the relative or absolute positions of the words in the sequence.

Position Embedding is represented by PE, and the dimensions of PE are the same as the word Embedding. PE can be obtained through training, or it can be calculated using some formula. The latter is adopted in Transformer, and the calculation formula is as follows:
$$
\begin{gathered}
PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d}\right) \\
PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d}\right)
\end{gathered}
$$

where $pos$ is the position of the word in the sentence, $d$ is the dimension of PE (the same as the word Embedding), $2i$ denotes the even dimensions and $2i+1$ the odd dimensions (i.e. $2i \le d$, $2i+1 \le d$). Computing $PE$ with this formula has the following benefits:

  • It allows PE to generalize to sentences longer than any sentence in the training set. For example, if the longest training sentence has 20 words and a sentence of length 21 arrives, the positional Embedding of the 21st position can still be computed with the formula.
  • It allows the model to easily compute relative positions. For a fixed offset k, PE(pos+k) can be expressed in terms of PE(pos) and PE(k), because

$$\sin(A+B) = \sin(A)\cos(B) + \cos(A)\sin(B), \qquad \cos(A+B) = \cos(A)\cos(B) - \sin(A)\sin(B)$$
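Concretely, writing $\omega_i = 1/10000^{2i/d}$ for the frequency of dimension pair $i$, the even dimension of $PE_{(pos+k)}$ expands as

$$PE_{(pos+k,\,2i)} = \sin\big(\omega_i (pos+k)\big) = PE_{(pos,\,2i)}\cos(\omega_i k) + PE_{(pos,\,2i+1)}\sin(\omega_i k)$$

and similarly for the odd dimension, so $PE_{(pos+k)}$ is a linear combination of the components of $PE_{(pos)}$ with coefficients that depend only on the fixed offset $k$.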

## PositionalEncoding implementation
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()

        ## The positional encoding is simple to implement: just translate the formula into code. The version below is one possible implementation.
        ## Note that the even and odd dimensions share a common factor in the formula; taking the log brings the exponent down and makes the computation easier.
        ## pos is the index of the word in the sentence; e.g. if max_len is 128, the indices run 0, 1, 2, ..., 127.
        ## If d_model is 512, then i in the symbol 2i runs from 0 to 255, so 2i takes the values 0, 2, 4, ..., 510.
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(dim=1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  ## pe[:, 0::2] takes every second column starting from 0, i.e. the even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  ## pe[:, 1::2] takes every second column starting from 1, i.e. the odd dimensions
        ## At this point pe has shape [max_len, d_model]

        ## After the line below, pe has shape [max_len, 1, d_model]
        pe = pe.unsqueeze(0).transpose(0, 1)

        self.register_buffer('pe', pe)  ## register pe as a buffer: it is saved with the module but not updated during training

    def forward(self, x):
        """
        x: [seq_len, batch_size, d_model]
        """
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

3. Self-Attention (self-attention mechanism)

img

The figure above is the internal structure diagram of the Transformer from the paper, with the Encoder block on the left and the Decoder block on the right. The part in the red circle is Multi-Head Attention, which is composed of multiple Self-Attention layers. You can see that the Encoder block contains one Multi-Head Attention, while the Decoder block contains two (one of which is Masked). There is also an Add & Norm layer above each Multi-Head Attention: Add denotes a Residual Connection, used to prevent network degradation, and Norm denotes Layer Normalization, used to normalize the activations of each layer.

Because Self-Attention is the focus of Transformer, we focus on Multi-Head Attention and Self-Attention. First, we will learn more about the internal logic of Self-Attention.

3.1 Self-Attention structure

img

The figure above shows the structure of Self-Attention. The matrices Q (query), K (key), and V (value) are needed during the calculation. In practice, Self-Attention receives either the input (the matrix X composed of the word representation vectors x) or the output of the previous Encoder block, and Q, K, V are obtained by linear transformations of that input.

3.2 Calculation of Q, K, V

The input of Self-Attention is represented by matrix X, then Q, K, V can be calculated using linear transformation matrices WQ, WK, WV . The calculation is as shown in the figure below. Note that each row of X, Q, K, V represents a word.

img
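As a minimal sketch of this step (hypothetical sizes; the weight matrices W_Q, W_K, W_V are learned parameters), the linear transformations can be written in PyTorch as:

import torch
import torch.nn as nn

d_model = 512                     # representation dimension d
n = 4                             # number of words in the sentence
X = torch.randn(n, d_model)       # each row is the representation vector x of one word

# Three independent linear transformations with learned weights W_Q, W_K, W_V
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_Q(X), W_K(X), W_V(X)  # each has shape [n, d_model]; row i still corresponds to word i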

3.3 Output of Self-Attention

After obtaining the matrices Q, K, and V, the output of Self-Attention can be calculated. The calculation formula is as follows:

$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

where $d_k$ is the number of columns of the $Q$ and $K$ matrices, i.e. the vector dimension. The formula computes the inner product between every row vector of $Q$ and every row vector of $K$; the result is divided by $\sqrt{d_k}$ to keep the inner products from becoming too large. $Q$ multiplied by the transpose of $K$ yields an $n \times n$ matrix, where $n$ is the number of words in the sentence, and this matrix represents the attention intensity between words. The figure below shows $QK^T$, where 1, 2, 3, 4 represent the words in the sentence.

img

After obtaining $QK^T$, Softmax is used to compute the attention coefficients of each word with respect to every other word. The Softmax in the formula is applied to each row of the matrix, so that each row sums to 1.

img

After obtaining the Softmax matrix, it can be multiplied by $V$ to get the final output $Z$.

img

The first row of the Softmax matrix in the figure above contains the attention coefficients of word 1 with respect to all other words. The final output $Z_1$ of word 1 is the sum of the value vectors $V_i$ of all words $i$, weighted by these attention coefficients, as shown in the figure below:

img
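Putting the three steps together, a minimal scaled dot-product attention sketch (reusing the hypothetical Q, K, V from the sketch in section 3.2) could look like this:

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # [n, n] attention intensities between words
    attn = torch.softmax(scores, dim=-1)               # Softmax over each row, so each row sums to 1
    return attn @ V                                    # output Z; Z_i is the attention-weighted sum of the V_i

Z = scaled_dot_product_attention(Q, K, V)              # same shape as V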

3.4 Multi-Head Attention

In the previous step, we already know how to calculate the output matrix Z through Self-Attention, and Multi-Head Attention is formed by combining multiple Self-Attentions. The following figure is the structure diagram of Multi-Head Attention in the paper.

img

As can be seen from the figure above, Multi-Head Attention contains multiple Self-Attention layers. The input $X$ is first passed to $h$ different Self-Attention layers, producing $h$ output matrices $Z$. The figure below shows the case $h = 8$, in which 8 output matrices $Z$ are obtained.

img

After obtaining the 8 output matrices $Z_1$ to $Z_8$, Multi-Head Attention concatenates them together (Concat) and then passes the result through a Linear layer to obtain the final output $Z$ of Multi-Head Attention.

You can see that the matrix Z output by Multi-Head Attention has the same dimension as its input matrix X.
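A small sketch of this split-concat-project pattern (hypothetical sizes, with h = 8 heads and d = 512, so each head works in dimension 64; a single fused Q/K/V projection is used here for brevity) might be:

import torch
import torch.nn as nn

h, d_model, n = 8, 512, 4
d_k = d_model // h                                     # dimension per head (64)
X = torch.randn(n, d_model)

W_qkv = nn.Linear(d_model, 3 * d_model, bias=False)    # produces Q, K, V for all heads at once
W_o = nn.Linear(d_model, d_model, bias=False)          # the Linear layer applied after Concat

Q, K, V = W_qkv(X).chunk(3, dim=-1)
Q, K, V = (t.view(n, h, d_k).transpose(0, 1) for t in (Q, K, V))          # [h, n, d_k]
heads = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1) @ V   # h output matrices Z_1..Z_h
Z = W_o(heads.transpose(0, 1).reshape(n, d_model))     # Concat the h heads, then the Linear layer
print(Z.shape)                                         # torch.Size([4, 512]) -- same as the input X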

4. Encoder structure

img

The red part in the figure above is the Encoder block structure of the Transformer. You can see that it is composed of Multi-Head Attention, Add & Norm, Feed Forward, and another Add & Norm. Having just covered the calculation process of Multi-Head Attention, let us now look at the Add & Norm and Feed Forward parts.

4.1 Add & Norm

The Add & Norm layer consists of Add and Norm. Its calculation formula is as follows:
$$\mathrm{LayerNorm}(X + \mathrm{MultiHeadAttention}(X))$$

$$\mathrm{LayerNorm}(X + \mathrm{FeedForward}(X))$$

where $\mathbf{X}$ represents the input of Multi-Head Attention or Feed Forward, and $\mathrm{MultiHeadAttention}(\mathbf{X})$ and $\mathrm{FeedForward}(\mathbf{X})$ represent the outputs (the output and the input $\mathbf{X}$ have the same dimensions, so they can be added).

Add refers to $\mathbf{X} + \mathrm{MultiHeadAttention}(\mathbf{X})$, a residual connection, which is usually used to help train deep networks by letting the network focus only on the current residual. It is widely used in ResNet:

img

Norm refers to Layer Normalization, which is often used in RNN structures. Layer Normalization normalizes the inputs of each layer of neurons to the same mean and variance, which speeds up convergence.
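As a small sketch of this sub-layer wrapper (post-norm, matching the formulas above; the sublayer argument stands for Multi-Head Attention or Feed Forward):

import torch.nn as nn

class AddNorm(nn.Module):
    def __init__(self, d_model, dropout_rate=0.1):
        super(AddNorm, self).__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x, sublayer):
        # LayerNorm(X + Sublayer(X)): residual connection followed by layer normalization
        return self.norm(x + self.dropout(sublayer(x)))

Note that the DecoderLayer code shown later in this article applies LayerNorm before each sub-layer instead (a pre-norm variant); both arrangements implement the Add & Norm idea.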

4.2 Feed Forward

The Feed Forward layer is relatively simple: it is a two-layer fully connected network in which the first layer uses the ReLU activation function and the second layer uses no activation function. The corresponding formula is as follows.
$$\max\left(0, X W_1 + b_1\right) W_2 + b_2$$

$\mathbf{X}$ is the input, and the dimension of the matrix output by Feed Forward is the same as that of $\mathbf{X}$.

class FeedForwardNetwork(nn.Module):
    def __init__(self, hidden_size, filter_size, dropout_rate):
        super(FeedForwardNetwork, self).__init__()

        self.layer1 = nn.Linear(hidden_size, filter_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_rate)
        self.layer2 = nn.Linear(filter_size, hidden_size)
        initialize_weight(self.layer1)
        initialize_weight(self.layer2)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.layer2(x)
        return x

4.3 Composition of Encoder

An Encoder block can be constructed from the Multi-Head Attention, Feed Forward, and Add & Norm described above. An Encoder block receives an input matrix $X_{(n \times d)}$ and outputs a matrix $O_{(n \times d)}$ of the same size, and an Encoder is formed by stacking multiple Encoder blocks.
The input of the first Encoder block is the representation vector matrix of the sentence's words; the input of each subsequent Encoder block is the output of the previous one. The matrix output by the last Encoder block is the encoding information matrix C, which is used later by the Decoder.

img

## The Encoder consists of three parts: the word embedding, the positional encoding, and the attention layers with the subsequent feed-forward network
class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        self.src_emb = nn.Embedding(src_vocab_size, d_model)  ## defines an embedding matrix of size src_vocab_size * d_model
        self.pos_emb = PositionalEncoding(d_model)  ## positional encoding; fixed sine/cosine here, but a learnable nn.Embedding (like the word embedding) could also be used
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])  ## stack multiple encoder layers with ModuleList; the word embedding and positional encoding are kept outside because the stacked layers do not use them

    def forward(self, enc_inputs):
        ## enc_inputs has shape [batch_size, source_len]

        ## look up the word embeddings via src_emb; enc_outputs has shape [batch_size, src_len, d_model]
        enc_outputs = self.src_emb(enc_inputs)

        ## add the positional encoding (the addition happens inside PositionalEncoding.forward)
        enc_outputs = self.pos_emb(enc_outputs.transpose(0, 1)).transpose(0, 1)

        ## get_attn_pad_mask marks the positions of the pad tokens so that their influence can be removed when computing self-attention and cross-attention
        enc_self_attn_mask = get_attn_pad_mask(enc_inputs, enc_inputs)
        enc_self_attns = []
        for layer in self.layers:
            ## each EncoderLayer applies multi-head self-attention followed by the feed-forward network
            enc_outputs, enc_self_attn = layer(enc_outputs, enc_self_attn_mask)
            enc_self_attns.append(enc_self_attn)
        return enc_outputs, enc_self_attns

5. Decoder structure

img

The red part in the figure above is the Decoder block structure of the Transformer. It is similar to the Encoder block, but with some differences:

  • It contains two Multi-Head Attention layers.
  • The first Multi-Head Attention layer uses the Masked operation.
  • In the second Multi-Head Attention layer, the K and V matrices are calculated from the encoding information matrix $\mathbf{C}$ of the Encoder, while $\mathbf{Q}$ is calculated from the output of the previous Decoder block.
  • Finally there is a Softmax layer that calculates the probability of the next translated word.

Complete decoder

class Decoder(nn.Module):
    def __init__(self, hidden_size, filter_size, dropout_rate, n_layers):
        super(Decoder, self).__init__()

        decoders = [DecoderLayer(hidden_size, filter_size, dropout_rate)
                    for _ in range(n_layers)]
        self.layers = nn.ModuleList(decoders)

        self.last_norm = nn.LayerNorm(hidden_size, eps=1e-6)

    def forward(self, targets, enc_output, i_mask, t_self_mask, cache):
        decoder_output = targets
        for i, dec_layer in enumerate(self.layers):
            layer_cache = None
            if cache is not None:
                if i not in cache:
                    cache[i] = {}
                layer_cache = cache[i]
            decoder_output = dec_layer(decoder_output, enc_output,
                                       t_self_mask, i_mask, layer_cache)
        return self.last_norm(decoder_output)
class DecoderLayer(nn.Module):
    def __init__(self, hidden_size, filter_size, dropout_rate):
        super(DecoderLayer, self).__init__()

        self.self_attention_norm = nn.LayerNorm(hidden_size, eps=1e-6)
        self.self_attention = MultiHeadAttention(hidden_size, dropout_rate)
        self.self_attention_dropout = nn.Dropout(dropout_rate)

        self.enc_dec_attention_norm = nn.LayerNorm(hidden_size, eps=1e-6)
        self.enc_dec_attention = MultiHeadAttention(hidden_size, dropout_rate)
        self.enc_dec_attention_dropout = nn.Dropout(dropout_rate)

        self.ffn_norm = nn.LayerNorm(hidden_size, eps=1e-6)
        self.ffn = FeedForwardNetwork(hidden_size, filter_size, dropout_rate)
        self.ffn_dropout = nn.Dropout(dropout_rate)

    def forward(self, x, enc_output, self_mask, i_mask, cache):
        y = self.self_attention_norm(x)
        y = self.self_attention(y, y, y, self_mask)
        y = self.self_attention_dropout(y)
        x = x + y

        if enc_output is not None:
            y = self.enc_dec_attention_norm(x)
            y = self.enc_dec_attention(y, enc_output, enc_output, i_mask,
                                       cache)
            y = self.enc_dec_attention_dropout(y)
            x = x + y

        y = self.ffn_norm(x)
        y = self.ffn(y)
        y = self.ffn_dropout(y)
        x = x + y
        return x

5.1 The first Multi-Head Attention

The first Multi-Head Attention of the Decoder block uses the Masked operation, because translation proceeds sequentially: word i+1 can only be translated after the first i words have been translated, and the Masked operation prevents word i from seeing the information of the words that come after it. Let us take translating a sentence into "I have a cat" as an example to understand the Masked operation.
The following description uses a concept similar to Teacher Forcing; readers who are not familiar with Teacher Forcing can refer to an introduction to the Seq2Seq model. During decoding, the most likely current word must be produced based on the previous translations, as shown in the figure below. First, the first word is predicted as "I" from the input "<Begin>", and then the next word "have" is predicted from the input "<Begin> I".

img

The Decoder can use Teacher Forcing during training and thus be trained in parallel, that is, the correct word sequence (<Begin> I have a cat) and the corresponding expected output (I have a cat <end>) are passed to the Decoder. Then, when predicting the i-th output, the words at positions after i must be masked. Note that the Mask operation is applied before the Softmax inside Self-Attention. Below, 0 1 2 3 4 5 are used to represent "<Begin> I have a cat <end>".

  1. The figure below shows the input matrix and the Mask matrix of the Decoder. The input matrix contains the representation vectors of the five words 0, 1, 2, 3, 4 ("<Begin> I have a cat"), and the Mask is a $5 \times 5$ matrix. From the Mask it can be seen that word 0 can only use the information of word 0, while word 1 can use the information of words 0 and 1, i.e. each word can only use the information of the words before it.

img
Input matrix and Mask matrix

  2. The next operation is the same as in ordinary Self-Attention: the $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ matrices are computed from the input matrix $\mathbf{X}$, and then the product $QK^T$ of $\mathbf{Q}$ and $K^T$ is computed.

img
Q multiplied by the transpose of K

  3. After obtaining $QK^T$, Softmax must be applied to compute the attention scores. Before the Softmax, the Mask matrix is used to block out the information after each word (a small mask sketch follows this list). The blocking operation is as follows:

img
Mask before the Softmax

After applying the Mask to $QK^T$, Softmax is performed on the masked matrix so that each row sums to 1. However, the attention scores of word 0 on words 1, 2, and 3 are all 0.

  4. The masked $QK^T$ is multiplied by the matrix $\mathbf{V}$ to obtain the output $\mathbf{Z}$; the output vector $Z_1$ of word 1 then only contains the information of word 1.

img
Output after the Mask

  5. Through the above steps we obtain an output matrix $Z_i$ of one Masked Self-Attention head. Then, just as in the Encoder, Multi-Head Attention concatenates the multiple outputs $Z_i$ and computes the output $\mathbf{Z}$ of the first Multi-Head Attention, which has the same dimension as the input $\mathbf{X}$.
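As a small illustration of the masking step (a sketch, not the article's exact helper; an upper-triangular "subsequent words" mask is built with torch.triu):

import torch

def subsequent_mask(size):
    # True marks the positions that must be blocked (the words after the current one)
    return torch.triu(torch.ones(size, size), diagonal=1).bool()

scores = torch.randn(5, 5)                                        # stands in for QK^T of the 5 words 0..4
scores = scores.masked_fill(subsequent_mask(5), float('-inf'))    # block future positions before the Softmax
attn = torch.softmax(scores, dim=-1)                              # row i has non-zero weights only on words 0..i

The MultiHeadAttention implementation below applies exactly this kind of mask with masked_fill_ (using -1e9 instead of -inf) before the softmax: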
class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_size, dropout_rate, head_size=8):
        super(MultiHeadAttention, self).__init__()

        self.head_size = head_size

        self.att_size = att_size = hidden_size // head_size
        self.scale = att_size ** -0.5

        self.linear_q = nn.Linear(hidden_size, head_size * att_size, bias=False)
        self.linear_k = nn.Linear(hidden_size, head_size * att_size, bias=False)
        self.linear_v = nn.Linear(hidden_size, head_size * att_size, bias=False)
        initialize_weight(self.linear_q)
        initialize_weight(self.linear_k)
        initialize_weight(self.linear_v)

        self.att_dropout = nn.Dropout(dropout_rate)

        self.output_layer = nn.Linear(head_size * att_size, hidden_size,
                                      bias=False)
        initialize_weight(self.output_layer)

    def forward(self, q, k, v, mask, cache=None):
        orig_q_size = q.size()

        d_k = self.att_size
        d_v = self.att_size
        batch_size = q.size(0)

        # head_i = Attention(Q(W^Q)_i, K(W^K)_i, V(W^V)_i)
        # [b,q_len,h,d_k]
        q = self.linear_q(q).view(batch_size, -1, self.head_size, d_k)
        if cache is not None and 'encdec_k' in cache:
            k, v = cache['encdec_k'], cache['encdec_v']
        else:
            k = self.linear_k(k).view(batch_size, -1, self.head_size, d_k)
            v = self.linear_v(v).view(batch_size, -1, self.head_size, d_v)
            if cache is not None:
                cache['encdec_k'], cache['encdec_v'] = k, v

        q = q.transpose(1, 2)                  # [b, h, q_len, d_k]
        v = v.transpose(1, 2)                  # [b, h, v_len, d_v]
        k = k.transpose(1, 2).transpose(2, 3)  # [b, h, d_k, k_len]

        # Scaled Dot-Product Attention.
        # Attention(Q, K, V) = softmax((QK^T)/sqrt(d_k))V
        q.mul_(self.scale)
        x = torch.matmul(q, k)  # [b, h, q_len, k_len]
        x.masked_fill_(mask.unsqueeze(dim=1), -1e9)
        x = torch.softmax(x, dim=3)
        x = self.att_dropout(x)
        x = x.matmul(v)  # [b, h, q_len, d_v]

        # Merge the heads back together and apply the output projection
        x = x.transpose(1, 2).contiguous()                # [b, q_len, h, d_v]
        x = x.view(batch_size, -1, self.head_size * d_v)  # [b, q_len, h * d_v]
        x = self.output_layer(x)                          # [b, q_len, hidden_size]

        assert x.size() == orig_q_size
        return x

5.2 The second Multi-Head Attention

The second Multi-Head Attention of the Decoder block changes little. The main difference is that the K and V matrices of its Self-Attention are not calculated from the output of the previous Decoder block, but from the encoding information matrix C of the Encoder.
$\mathbf{K}$ and $\mathbf{V}$ are calculated from the Encoder's output $\mathbf{C}$, while $\mathbf{Q}$ is calculated from the output $\mathbf{Z}$ of the previous Decoder block (for the first Decoder block, the input matrix $\mathbf{X}$ is used). The subsequent calculation is the same as described before.
The advantage of this is that, when decoding, every word can attend to the information of all the words in the Encoder output (this information does not need to be masked).
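A single-head sketch with hypothetical shapes makes this data flow explicit (in the real model this happens per head inside Multi-Head Attention, as in the DecoderLayer's enc_dec_attention above):

import torch

tgt_len, src_len, d_model = 3, 5, 512
Z = torch.randn(tgt_len, d_model)          # output of the previous (Masked) Decoder sub-layer
C = torch.randn(src_len, d_model)          # encoding information matrix from the Encoder

W_Q = torch.randn(d_model, d_model)        # learned projection weights (random here for illustration)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)

Q = Z @ W_Q                                # queries come from the Decoder side
K, V = C @ W_K, C @ W_V                    # keys and values come from the Encoder output C
attn = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)   # no Mask: every Encoder word is visible
out = attn @ V                             # shape [tgt_len, d_model]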

5.3 Softmax predicts output words

The last part of the Decoder block uses Softmax to predict the next word. From the previous layers we obtain a final output Z. Because of the Mask, the output Z0 of word 0 only contains the information of word 0, as follows:

img
Z before the Decoder Softmax

Softmax predicts the next word based on each row of the output matrix:

img
Decoder Softmax prediction

This concludes the Decoder block. Like the Encoder, the Decoder is composed of multiple Decoder blocks.
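The article does not show the code for this last step; a minimal sketch (hypothetical names, assuming the Decoder output has shape [batch, tgt_len, d_model]) is a Linear layer that maps to the vocabulary, followed by a Softmax:

import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000                 # hypothetical sizes
generator = nn.Linear(d_model, vocab_size)       # final projection onto the vocabulary

decoder_output = torch.randn(1, 5, d_model)      # Z from the last Decoder block
logits = generator(decoder_output)               # [1, 5, vocab_size]
probs = torch.softmax(logits, dim=-1)            # each row: probability distribution over the next word
next_words = probs.argmax(dim=-1)                # predicted word index for each position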

6. Transformer summary

  • Unlike an RNN, the Transformer can be trained in parallel more effectively.
  • The Transformer itself cannot use the order information of words, so a positional Embedding must be added to the input; otherwise the Transformer degenerates into a bag-of-words model.
  • The core of the Transformer is the Self-Attention structure, in which the $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ matrices are obtained by linear transformations of the input.
  • Multi-Head Attention contains multiple Self-Attention heads, which can capture the attention scores (correlations) between words in several different representation subspaces.

reference

  1. Attention Is All You Need
  2. Detailed explanation of the Transformer model (the most complete version with illustrations)

Origin blog.csdn.net/weixin_42917352/article/details/128699231