[Transformer Series] Understanding the Transformer network model in simple terms (comprehensive article)

1. Reference materials

The Illustrated Transformer
Illustrated Transformer (full version)
Attention Is All You Need: The Core Idea of the Transformer
transformer summary (super detailed - first edition)
Detailed explanation of the network structure of each layer of Transformer! A must for interviews! (Attached with code implementation)
Core technology of large language model-Detailed explanation of Transformer

Paper: Attention Is All You Need

2. Introduction to Transformer

1. Disadvantages of RNN

Due to its sequential structure, the training speed of RNN is often limited.

RNN-family models have poor parallel computing capability. The root of the problem is that the computation at time step T depends on the hidden-layer result at time step T-1, which in turn depends on the hidden-layer result at time step T-2, and so on, forming a chain of so-called sequential dependencies.

2. Disadvantages of seq2seq

The biggest problem with seq2seq is that it compresses all of the information on the Encoder side into one fixed-length vector, which is used as the input to the Decoder's first hidden state to predict the first word (token) on the Decoder side. When the input sequence is long, this clearly loses a great deal of Encoder-side information; moreover, since the fixed vector is handed to the Decoder all at once, the Decoder cannot selectively attend to the information it needs at each step.

3. Advantages and Disadvantages of Transformer

3.1 Advantages

  • Representation ability. The Transformer allows the elements of the source sequence and the target sequence to attend to one another ("self-associate"), so the embeddings of the source and target sequences carry richer information, and the subsequent FFN (feed-forward network) layers further enhance the model's expressive power.

  • Feature extraction capability. The Transformer's feature extraction capability is better than that of RNN-family models. For experimental comparisons, see: Give up illusions and fully embrace Transformer: Comparison of three major feature extractors (CNN/RNN/TF) for natural language processing.

  • Parallel computing capability. The Transformer's parallel computing capability far exceeds that of seq2seq-style models.

  • Semantic structure. Transformers are good at capturing semantic structure in text and even in images, and do so better than many other techniques. Combining the generalization ability of the Transformer with the detail-preserving ability of the Diffusion Model makes it possible to generate fine-grained, highly detailed images while preserving the semantic structure of the image.

  • Generalization ability. In vision applications, the Transformer exhibits good generalization and adaptation, making it suitable as a general-purpose learner.

3.2 Disadvantages

Transformer requires large amounts of data and faces performance issues in many vision domains.

4. Transformer input and output dimension transformation

The overall structure of Transformer can be roughly divided into: Input, Encoder, Decoder, and Output.

  • Input: assuming the length of the input sequence is T, the raw input to the Encoder has shape [batch_size, T] (token indices);
  • Encoder: after the embedding layer, positional encoding, and related processing, the input becomes a tensor of shape [batch_size, T, D], where D is the hidden dimension of the model; since each encoder block keeps the input and output dimensions the same, the Encoder output is also [batch_size, T, D];
  • Decoder: the Decoder input is obtained through a similar transformation and has shape [batch_size, T', D], where T' is the Decoder input length. It then passes through several identical blocks (a quick shape sketch follows this list), each of which contains:
    • Self Multi-Head Attention, where attention is computed among the elements of the Decoder sequence itself, the same as in the Encoder;
    • Add&Norm
    • Cross Multi-Head Attention, where each Decoder position attends to each Encoder position, similar to the attention in a traditional seq2seq model, used to align the Decoder with the Encoder;
    • Add&Norm
    • Feed Forward
    • Add&Norm
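As a quick sanity check of these shapes, here is a minimal sketch using PyTorch's built-in nn.Transformer (the toy sizes batch_size=2, T=10, T'=8, D=512 and the vocabulary size 1000 are illustrative assumptions; nn.Transformer does not add positional encoding itself, so this only illustrates the tensor shapes):

import torch
from torch import nn

batch_size, T, T_dec, D = 2, 10, 8, 512                    # assumed toy sizes; D is the hidden dimension
src_tokens = torch.randint(0, 1000, (batch_size, T))       # [batch_size, T]
trg_tokens = torch.randint(0, 1000, (batch_size, T_dec))   # [batch_size, T']

embed = nn.Embedding(1000, D)                              # embedding layer (positional encoding omitted here)
model = nn.Transformer(d_model=D, nhead=8, batch_first=True)

src = embed(src_tokens)    # [batch_size, T, D]
trg = embed(trg_tokens)    # [batch_size, T', D]
out = model(src, trg)      # [batch_size, T', D]
print(out.shape)           # torch.Size([2, 8, 512])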

5. Scaled Dot-Product Attention

The mechanism that determines the weight distribution over the values according to the similarity between the query and the keys is called Scaled Dot-Product Attention.

Why does the Transformer use Scaled Dot-Product Attention?

  • The most common attention mechanisms are additive attention and dot-product attention. Their theoretical complexity is similar, but because matrix multiplication code is highly optimized, dot-product attention is more efficient in practice in both computation and memory.

  • Dot-product attention adds a scaling factor of 1/√(d_k). The reason is: when d_k is small, the two attention mechanisms perform similarly, but as d_k grows, the performance of unscaled dot-product attention degrades. The presumed cause is that the dot products become too large, pushing the softmax function into its saturated regions where the gradients are very small.
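Putting the pieces together, the scaled dot-product attention defined in the paper is:

\text{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V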

6. Machine translation

Let's use an example of translating French to English to explain the process of machine translation by Transformer:

6.1 Step 1

First, the embedding (word embedding) of the input sentence is passed to the encoder block, which contains a multi-head attention mechanism. The Q, K, and V values computed from the word embeddings and the weight matrices are fed into this module, which then produces a matrix that can be passed to the feed-forward neural network. In the Transformer paper, this encoder block is repeated N times, with N typically equal to 6.

6.2 Step 2

The output of the encoder is then fed into the decoder block. The decoder's job is to output the English translation. At each step, the decoder block takes as input the words of the translation that have already been generated; initially, the translation starts with a start-of-sentence token. This token is fed into the first multi-head attention block and is used to compute that block's Q, K, and V. The output of this block is used to generate the Q matrix of the next multi-head attention block, while the encoder output is used to generate its K and V.

6.3 Step 3

The output of the decoder block is fed into a feed-forward neural network, whose task is to predict the next word in the sentence.

7. Application of Transformer in CV field

  • In May 2020, Facebook AI Lab released Detection Transformer (DETR) for object detection and panoptic segmentation. This is the first object detection framework that successfully uses the Transformer as the central building block of the detection pipeline, and its detection performance on large objects is better than that of Faster R-CNN.
  • In October 2020, Google proposed Vision Transformer (ViT), which classifies images directly with a Transformer, without a convolutional network. The ViT model achieves results comparable to state-of-the-art convolutional networks while requiring significantly fewer computational resources to train.

3. Transformer structure

The most powerful ViT (Vision Transformer) principle and code analysis on the entire network

Like the attention-based seq2seq model, the Transformer uses an encoder-decoder architecture, but its structure is more complex. In the paper, the encoder is a stack of 6 encoder blocks, and the decoder is likewise a stack of 6 decoder blocks.

0. Overall structure

The Transformer is built from a stack of encoders and decoders. Each encoder and decoder is composed of multi-head attention layers and fully connected feed-forward networks. The high-level structure of the network is as follows:

  • Encoder: consists of N encoder blocks (Encoder Block) connected in series. Each encoder block contains:
    • a Multi-Head Attention layer;
    • a feed-forward fully connected network (Feed Forward Neural Network);
  • Decoder: also consists of N decoder blocks (Decoder Block) connected in series. Each decoder block contains:
    • a multi-head self-attention layer;
    • a multi-head attention layer over the Encoder output;
    • a feed-forward fully connected network;

1. Transformer input

The input representation x of a word in the Transformer is obtained by adding its Embedding (word embedding) and its Positional Encoding.

1.1 Embedding

For a detailed introduction to the principle of Embedding, please check the blog: [Transformer Series] Understanding Embedding (Word Embedding) in simple terms

Embedding can be obtained in many ways. For example, it can be pre-trained with algorithms such as Word2Vec and GloVe, or it can be trained within the Transformer itself.

1.2 Positional Embedding

For a detailed introduction to the principle of Positional Embedding, please check the blog: [Transformer Series] Understand Positional Encoding in simple terms

In the Transformer, in addition to the Embedding (word embedding), a Positional Embedding is also needed to indicate where each word appears in the sentence. Because the Transformer does not use an RNN structure but instead operates on global information, it cannot by itself exploit word-order information, which is very important for NLP. The Transformer therefore uses a Positional Embedding to encode the relative or absolute position of each word in the sequence.

2. Encoder

2.0 Encoder structure

The Encoder is a stack of N = 6 identical encoder blocks, and each encoder block consists of Multi-Head Attention, Add & Norm, Feed Forward, and Add & Norm layers. The input and output matrices of each encoder block have the same dimensions, and each block has two main sub-layers:

  1. The first sub-layer is the multi-head attention mechanism (Multi-Head Attention);
  2. The second sub-layer is a simple position-wise fully connected feed-forward network (Position-wise Feed Forward).

2.1 Encoder encoding process

First, the model performs an embedding (word embedding) operation on the input data, similar in spirit to Word2Vec. Once the embedding is complete, the result is fed into the encoder layer: self-attention processes the data, and the output is then sent to the feed-forward neural network, whose computation can be parallelized across positions. The resulting output is passed on to the next encoder block.

2.2 Embedding

For a detailed introduction to the principle of Embedding, please check the blog: [Transformer Series] Understanding Embedding (Word Embedding) in simple terms

2.3 Positional Encoding

For a detailed introduction to the principle of Positional Encoding, please check the blog: [Transformer Series] Understand Positional Encoding in simple terms

The input is positionally encoded so that the position of the word in the sentence is taken into account in the translation. This is done using a set of sine and cosine equations.

Unlike sequence models (RNNs), attention by itself has no way to account for the order of words in the input sequence. To handle this, the Transformer adds positional encoding to the input so that each word's position in the sentence is taken into account in the translation. Specifically, the Transformer adds an extra Positional Encoding vector to the inputs of the encoder and decoder layers; its dimension is the same as the embedding dimension. This vector follows a specific pattern that helps the model determine the position of the current word, or the distance between different words in the sentence. There are many ways to compute such a position vector; the Transformer paper uses a set of sine and cosine functions, calculated as follows:
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)

Here, pos is the position of the current word in the sentence, and i indexes the dimensions of the encoding vector. In other words, sine encoding is used for the even dimensions and cosine encoding for the odd dimensions.

Finally, the Positional Encoding and the embedding values are added together and sent to the next layer as input.
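A minimal sketch of this sinusoidal positional encoding (the sizes max_len = 50 and d_model = 512 are illustrative assumptions):

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(max_len).unsqueeze(1)       # [max_len, 1]
    i = torch.arange(0, d_model, 2)                # even dimension indices 0, 2, 4, ...
    angle = pos / torch.pow(10000, i / d_model)    # [max_len, d_model/2]
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                      # [max_len, d_model]

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # torch.Size([50, 512])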

2.4 Self-Attention

For a detailed introduction to the principle of Self-Attention, please check the blog: [Transformer Series] Understand the Attention and Self-Attention mechanisms in simple terms

2.5 Multi-Headed Attention

For a detailed introduction to the principle of Multi-Headed Attention, please check the blog: [Transformer Series] Understand the Attention and Self-Attention mechanisms in simple terms

Multi-Head Attention is an improvement on Self-Attention: when q, k, and v are generated, they are split into num_heads parts, a self-attention operation is performed on each part, and the results are finally concatenated. This achieves a degree of parameter isolation between heads.

Multi-Head Attention initializes not just one set of Q, K, and V weight matrices but several sets; the Transformer paper uses 8 heads, so the attention step produces 8 output matrices.
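A minimal shape-level sketch of this split-and-concatenate operation (it mirrors the split() and concat() methods in the Multi-Head Attention code later in this article; batch_size = 2, seq_len = 10, d_model = 512, n_head = 8 are illustrative assumptions):

import torch

batch_size, seq_len, d_model, n_head = 2, 10, 512, 8
d_head = d_model // n_head                       # 64 dimensions per head

x = torch.randn(batch_size, seq_len, d_model)    # a projected q / k / v tensor
# split: [batch_size, seq_len, d_model] -> [batch_size, n_head, seq_len, d_head]
heads = x.view(batch_size, seq_len, n_head, d_head).transpose(1, 2)
print(heads.shape)   # torch.Size([2, 8, 10, 64])

# concat: the inverse operation, back to [batch_size, seq_len, d_model]
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
print(merged.shape)  # torch.Size([2, 10, 512])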


2.6 Add & Norm module


In the Transformer, each sub-layer (self-attention, feed-forward neural network) is followed by a residual connection and a Layer Normalization.
The Add & Norm layer consists of two parts: Add and Norm. Add refers to X + MultiHeadAttention(X), i.e., a residual connection, and Norm is Layer Normalization. The computation of the Add & Norm layer can be expressed as:

\text{LayerNorm}(X + \text{MultiHeadAttention}(X))

Here, Add is a residual connection (Residual Connection), used to address the difficulty of training deep multi-layer networks: by passing the previous layer's information to the next layer unchanged, the network can focus on learning only the difference (the residual). This technique is widely used in image models such as ResNet.

Layer Norm is a commonly used normalization technique that makes model training more stable and converge faster. Its main role is to normalize each sample along the feature dimension, reducing the dependence between different features and improving the model's generalization ability. For the underlying principle, see the paper: Layer Normalization.

2.7 Feed Forward


The full name of the Feed Forward layer is Feed Forward Neural Network (FFN). It is essentially two fully connected layers: the first layer uses a ReLU activation, and the second layer uses no activation. Its computation can be expressed as:

\operatorname{FFN}(X) = \max(0, XW_1 + b_1)W_2 + b_2

Besides using two fully connected layers for the linear transformations, an equivalent alternative is to use two convolutional layers with kernel size 1 (1×1 convolutions): the input and output dimensions stay at 512, and the intermediate dimension is 2048.
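A minimal sketch of this 1×1-convolution formulation (the dimensions 512 and 2048 follow the paper; the example tensor sizes are illustrative assumptions):

import torch
from torch import nn

d_model, d_ff = 512, 2048
x = torch.randn(2, 10, d_model)                    # [batch_size, seq_len, d_model]

# nn.Conv1d expects [batch_size, channels, seq_len], so transpose the last two dimensions
conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)

out = conv2(torch.relu(conv1(x.transpose(1, 2)))).transpose(1, 2)
print(out.shape)  # torch.Size([2, 10, 512])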

The Feed Forward layer cannot take 8 matrices as input, so the 8 matrices produced by multi-head attention must be reduced to one. First, the 8 matrices are concatenated into one large matrix, which is then multiplied by an additional (randomly initialized) weight matrix to produce a single output matrix.

3. Decoder

From the overall structure diagram above, the Decoder is similar to the Encoder: a Positional Encoding vector is first added, followed by a masked multi-head attention layer. The mask is a key technique of the Transformer and is introduced in detail in this section. The remaining layers are the same as in the Encoder; refer to the Encoder layer structure.

3.1 Decoder structure

(Figure: structure of a Decoder block)

3.2 Masked Multi-Head Attention

A mask hides certain values so that they have no effect when parameters are updated. The Transformer model involves two kinds of masks: the padding mask and the sequence mask. The padding mask is needed in every scaled dot-product attention layer, while the sequence mask is used only in the decoder's self-attention.


  • For the decoder's self-attention (which uses scaled dot-product attention), both the padding mask and the sequence mask are required; in implementation, attn_mask is the combination (sum) of the two masks.

  • In all other cases, attn_mask is simply the padding mask.

3.2.1 Padding mask

What is a padding mask? The input sequences in each batch have different lengths, so they must be aligned: shorter sequences are padded with 0s, and sequences that are too long are truncated, keeping the left part and discarding the excess. The padded positions carry no meaning, so the attention mechanism should not attend to them, which requires some extra processing.

The specific method is to set the values at these positions to a very large negative number (effectively negative infinity), so that after softmax the probabilities of these positions are close to 0.

The padding mask is simply a Boolean tensor: the positions whose value is false are exactly the positions to be masked out.
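A minimal sketch of building a padding mask from token ids (pad_idx = 0 and the toy batch are illustrative assumptions):

import torch

pad_idx = 0
# two sequences padded to length 5
tokens = torch.tensor([[5, 7, 2, 0, 0],
                       [3, 9, 4, 6, 1]])

# True where the token is real, False where it is padding
padding_mask = tokens.ne(pad_idx)                    # [batch_size, seq_len]
print(padding_mask)
# tensor([[ True,  True,  True, False, False],
#         [ True,  True,  True,  True,  True]])

# broadcastable against attention scores of shape [batch_size, n_head, len_q, len_k]
attn_mask = padding_mask.unsqueeze(1).unsqueeze(2)   # [batch_size, 1, 1, len_k]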

3.2.2 Sequence mask

The sequence mask prevents the decoder from seeing future information. That is, for a sequence at time step t, the decoder's output should depend only on the outputs before time t, not on the outputs after t. We therefore need a way to hide the information after t.

How is this done? It is very simple: generate a matrix whose upper triangle (above the diagonal) is all 0 and whose lower triangle is all 1, and apply this matrix to each sequence; the positions marked 0 are then masked out, which achieves our goal.
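A minimal sketch (seq_len = 5 is an illustrative assumption); it matches the make_no_peak_mask method in the full Transformer code later in this article:

import torch

seq_len = 5
# lower-triangular matrix of ones: position i may only attend to positions <= i
seq_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(seq_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)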

3.3 Output layer

After all the decoder layers have run, how do we map the resulting vector to the word we want? It is simple: we just add a fully connected layer and a softmax layer at the end. If the vocabulary contains 10,000 words, the softmax outputs a probability for each of the 10,000 words, and the word with the highest probability is the final result.
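A minimal sketch of this output head (d_model = 512 and vocab_size = 10000 are illustrative assumptions):

import torch
from torch import nn

d_model, vocab_size = 512, 10000
decoder_out = torch.randn(2, 8, d_model)          # [batch_size, T', d_model]

lm_head = nn.Linear(d_model, vocab_size)
logits = lm_head(decoder_out)                     # [batch_size, T', vocab_size]
probs = torch.softmax(logits, dim=-1)             # probability over the vocabulary
next_tokens = probs.argmax(dim=-1)                # [batch_size, T'] predicted word ids
print(next_tokens.shape)                          # torch.Size([2, 8])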

4. Dynamic flow chart

In the encoding stage, the encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors containing K (key vectors) and V (value vectors); this is a parallel operation. These vectors are used by each decoder in its own "encoder-decoder attention" layer, which helps the decoder attend to the relevant (important) positions of the input sequence:

After completing the encoding phase, the decoding phase begins. Each step of the decoding phase outputs an element of the output sequence (e.g. English to German).

In the following steps, this process repeats until a special end-of-sequence symbol is produced, indicating that the Transformer's decoder has finished its output. The output of each step is fed to the bottom decoder at the next time step, and the decoders pass their decoding results upward just as the encoders did.

5. Transformer Summary

  • Unlike RNN, the Transformer can be trained in parallel much more effectively.
  • The Transformer itself cannot exploit word-order information, so Positional Encoding must be added to the input; otherwise the Transformer degenerates into a bag-of-words model.
  • The core of the Transformer is the Self-Attention structure, in which the Q, K, and V matrices are obtained through linear transformations of the input.
  • Multi-Head Attention contains multiple Self-Attention modules, which can capture attention scores (correlation coefficients) between words in multiple representation subspaces.

4. Relevant experience

A TensorFlow Implementation of Attention Is All You Need
Transformer analysis and TensorFlow code interpretation
After staying up all night, I implemented the Transformer model from scratch and will walk you through the code
Zero-based understanding: why is it called Transformer? What is a Transformer? (Understand Transformer and its PyTorch source code in a simple and easy way)

1. Open source projects

Transformers
NLP_ability
ML-NLP

2. Self-Attention code implementation

import math

import torch
from torch import nn

class ScaleDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaleDotProductAttention, self).__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, Q, K, V, mask=None):
        K_T = K.transpose(-1, -2) # transpose of matrix K
        d_k = Q.size(-1)
        # 1, compute the dot product of Q and K^T, then divide by sqrt(d_k) to get the attention score matrix
        scores = torch.matmul(Q, K_T) / math.sqrt(d_k)
        # 2, if a mask is given, set the scores at the masked positions to a very large negative number
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        # 3, apply softmax along the last dimension to get the attention weight matrix, with values in [0, 1]
        attn_weights = self.softmax(scores)
        # 4, multiply the attention weights by V to get the final output matrix
        output = torch.matmul(attn_weights, V)

        return output, attn_weights

# create the Q, K, V tensors
Q = torch.randn(5, 10, 64)  # (batch_size, sequence_length, d_k)
K = torch.randn(5, 10, 64)  # (batch_size, sequence_length, d_k)
V = torch.randn(5, 10, 64)  # (batch_size, sequence_length, d_k)

# create the ScaleDotProductAttention layer
attention = ScaleDotProductAttention()

# pass Q, K, V to the ScaleDotProductAttention layer
output, attn_weights = attention(Q, K, V)

# print the shapes of the output matrix and the attention weight matrix
print(f"ScaleDotProductAttention output shape: {output.shape}")  # torch.Size([5, 10, 64])
print(f"attn_weights shape: {attn_weights.shape}")               # torch.Size([5, 10, 10])

3. Multi-head Attention code implementation

class MultiHeadAttention(nn.Module):
    """Multi-Head Attention Layer
    Args:
        d_model: Dimensions of the input embedding vector, equal to input and output dimensions of each head
        n_head: number of heads, which is also the number of parallel attention layers
    """
    def __init__(self, d_model, n_head):
        super(MultiHeadAttention, self).__init__()
        self.n_head = n_head
        self.attention = ScaleDotProductAttention()
        self.w_q = nn.Linear(d_model, d_model)  # linear projection layer for Q
        self.w_k = nn.Linear(d_model, d_model)  # linear projection layer for K
        self.w_v = nn.Linear(d_model, d_model)  # linear projection layer for V
        self.fc = nn.Linear(d_model, d_model)   # output linear projection layer
        
    def forward(self, q, k, v, mask=None):
        # 1. dot product with weight matrices
        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v) # size is [batch_size, seq_len, d_model]
        # 2, split by number of heads(n_head) # size is [batch_size, n_head, seq_len, d_model//n_head]
        q, k, v = self.split(q), self.split(k), self.split(v)
        # 3, compute attention
        sa_output, attn_weights = self.attention(q, k, v, mask)
        # 4, concat attention and linear transformation
        concat_tensor = self.concat(sa_output)
        mha_output = self.fc(concat_tensor)
        
        return mha_output
    
    def split(self, tensor):
        """
        split tensor by number of head(n_head)

        :param tensor: [batch_size, seq_len, d_model]
        :return: [batch_size, n_head, seq_len, d_model//n_head]; the output tensor is 4-D, and the second dimension is the head dimension
        
        # split Q, K, V into n_head heads via reshape, e.g.:
        batch_size, seq_len, _ = q.shape
        q = q.reshape(batch_size, seq_len, n_head, d_model // n_head)
        k = k.reshape(batch_size, seq_len, n_head, d_model // n_head)
        v = v.reshape(batch_size, seq_len, n_head, d_model // n_head)
        """
        
        batch_size, seq_len, d_model = tensor.size()
        d_tensor = d_model // self.n_head
        split_tensor = tensor.view(batch_size, seq_len, self.n_head, d_tensor).transpose(1, 2)
        # it is similar with group convolution (split by number of heads)
        
        return split_tensor
    
    def concat(self, sa_output):
        """ merge multiple heads back together

        Args:
            sa_output: [batch_size, n_head, seq_len, d_tensor]
            return: [batch_size, seq_len, d_model]
        """
        batch_size, n_head, seq_len, d_tensor = sa_output.size()
        d_model = n_head * d_tensor
        concat_tensor = sa_output.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        
        return concat_tensor

4. Add & Norm code implementation

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-12):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps
    
    def forward(self, x):
        mean = x.mean(-1, keepdim=True) # '-1' means last dimension. 
        var = x.var(-1, unbiased=False, keepdim=True)  # use the biased variance to match nn.LayerNorm
        
        out = (x - mean) / torch.sqrt(var + self.eps)
        out = self.gamma * out + self.beta
        
        return out

# NLP Example
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)

# 1,Activate nn.LayerNorm module
layer_norm1 = nn.LayerNorm(embedding_dim)
pytorch_ln_out = layer_norm1(embedding)

# 2,Activate my nn.LayerNorm module
layer_norm2 = LayerNorm(embedding_dim)
my_ln_out = layer_norm2(embedding)

# compare the two results
print(torch.allclose(pytorch_ln_out, my_ln_out, rtol=0.1, atol=0.01))  # prints True

5. Feed Forward code implementation

The PyTorch implementation of the PositionwiseFeedForward layer is as follows:

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_diff, drop_prob=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_diff)
        self.fc2 = nn.Linear(d_diff, d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(drop_prob)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        
        return x

6. Encoder code implementation

Based on the Multi-Head Attention, Feed Forward, and Add & Norm components implemented above, we can now implement the complete Encoder structure.

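Note that the Encoder and Decoder below rely on a TransformerEmbedding module (token embedding + positional encoding + dropout) that is not defined in this article. A minimal sketch of such a module, assuming the sinusoidal positional encoding described earlier, might look like this:

import torch
from torch import nn

class PositionalEncoding(nn.Module):
    """Precomputed sinusoidal positional encoding for up to max_len positions."""
    def __init__(self, d_model, max_len, device='cpu'):
        super().__init__()
        pe = torch.zeros(max_len, d_model, device=device)
        pos = torch.arange(0, max_len, device=device).float().unsqueeze(1)
        i = torch.arange(0, d_model, 2, device=device).float()
        angle = pos / torch.pow(10000.0, i / d_model)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: [batch_size, seq_len] token indices; return the encoding for the first seq_len positions
        return self.pe[:x.size(1), :]

class TransformerEmbedding(nn.Module):
    """Token embedding + positional encoding + dropout (a sketch, not the original author's implementation)."""
    def __init__(self, vocab_size, d_model, max_len, drop_prob, device='cpu'):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = PositionalEncoding(d_model, max_len, device)
        self.dropout = nn.Dropout(drop_prob)

    def forward(self, x):
        # x: [batch_size, seq_len] -> [batch_size, seq_len, d_model]
        return self.dropout(self.tok_emb(x) + self.pos_emb(x))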

class EncoderLayer(nn.Module):
    def __init__(self, d_model, ffn_hidden, n_head, drop_prob=0.1):
        super(EncoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model, n_head)
        self.ffn = PositionwiseFeedForward(d_model, ffn_hidden)
        self.ln1 = LayerNorm(d_model)
        self.ln2 = LayerNorm(d_model)
        self.dropout1 = nn.Dropout(drop_prob)
        self.dropout2 = nn.Dropout(drop_prob)
    
    def forward(self, x, mask=None):
        x_residual1 = x
        
        # 1, compute multi-head attention
        x = self.mha(q=x, k=x, v=x, mask=mask)
        
        # 2, add residual connection and apply layer norm
        x = self.ln1( x_residual1 + self.dropout1(x) )
        x_residual2 = x
        
        # 3, compute position-wise feed forward
        x = self.ffn(x)
        
        # 4, add residual connection and apply layer norm
        x = self.ln2( x_residual2 + self.dropout2(x) )
        
        return x

class Encoder(nn.Module):
    def __init__(self, enc_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob=0.1, device='cpu'):
        super().__init__()
        self.emb = TransformerEmbedding(vocab_size = enc_voc_size,
                                        max_len = max_len,
                                        d_model = d_model,
                                        drop_prob = drop_prob,
                                        device=device)
        self.layers = nn.ModuleList([EncoderLayer(d_model, ffn_hidden, n_head, drop_prob) 
                                     for _ in range(n_layers)])
    
    def forward(self, x, mask=None):
        
        x = self.emb(x)
        
        for layer in self.layers:
            x = layer(x, mask)
        return x

7. Decoder code implementation


The code implementation of the Decoder component is as follows:

class DecoderLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):
        super(DecoderLayer, self).__init__()
        self.mha1 = MultiHeadAttention(d_model, n_head)
        self.ln1 = LayerNorm(d_model)
        self.dropout1 = nn.Dropout(p=drop_prob)
        
        self.mha2 = MultiHeadAttention(d_model, n_head)
        self.ln2 = LayerNorm(d_model)
        self.dropout2 = nn.Dropout(p=drop_prob)
        
        self.ffn = PositionwiseFeedForward(d_model, ffn_hidden)
        self.ln3 = LayerNorm(d_model)
        self.dropout3 = nn.Dropout(p=drop_prob)
    
    def forward(self, dec_out, enc_out, trg_mask, src_mask):
        x_residual1 = dec_out
        
        # 1, compute multi-head attention
        x = self.mha1(q=dec_out, k=dec_out, v=dec_out, mask=trg_mask)
        
        # 2, add residual connection and apply layer norm
        x = self.ln1( x_residual1 + self.dropout1(x) )
        
        if enc_out is not None:
            # 3, compute encoder - decoder attention
            x_residual2 = x
            x = self.mha2(q=x, k=enc_out, v=enc_out, mask=src_mask)
    
            # 4, add residual connection and apply layer norm
            x = self.ln2( x_residual2 + self.dropout2(x) )
        
        # 5. positionwise feed forward network
        x_residual3 = x
        x = self.ffn(x)
        # 6, add residual connection and apply layer norm
        x = self.ln3( x_residual3 + self.dropout3(x) )
        
        return x
    
class Decoder(nn.Module):
    def __init__(self, dec_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):
        super().__init__()
        self.emb = TransformerEmbedding(d_model=d_model,
                                        drop_prob=drop_prob,
                                        max_len=max_len,
                                        vocab_size=dec_voc_size,
                                        device=device)

        self.layers = nn.ModuleList([DecoderLayer(d_model=d_model,
                                                  ffn_hidden=ffn_hidden,
                                                  n_head=n_head,
                                                  drop_prob=drop_prob)
                                     for _ in range(n_layers)])

        self.linear = nn.Linear(d_model, dec_voc_size)

    def forward(self, trg, src, trg_mask, src_mask):
        trg = self.emb(trg)

        for layer in self.layers:
            trg = layer(trg, src, trg_mask, src_mask)

        # pass to LM head
        output = self.linear(trg)
        return output

8. Transformer code implementation

Core technology of large language model-Detailed explanation of Transformer

Based on the Encoder and Decoder components implemented previously, we can implement the complete code of the Transformer model, as shown below:

import torch
from torch import nn

from models.model.decoder import Decoder
from models.model.encoder import Encoder


class Transformer(nn.Module):

    def __init__(self, src_pad_idx, trg_pad_idx, trg_sos_idx, enc_voc_size, dec_voc_size, d_model, n_head, max_len,
                 ffn_hidden, n_layers, drop_prob, device):
        super().__init__()
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.trg_sos_idx = trg_sos_idx
        self.device = device
        self.encoder = Encoder(d_model=d_model,
                               n_head=n_head,
                               max_len=max_len,
                               ffn_hidden=ffn_hidden,
                               enc_voc_size=enc_voc_size,
                               drop_prob=drop_prob,
                               n_layers=n_layers,
                               device=device)

        self.decoder = Decoder(d_model=d_model,
                               n_head=n_head,
                               max_len=max_len,
                               ffn_hidden=ffn_hidden,
                               dec_voc_size=dec_voc_size,
                               drop_prob=drop_prob,
                               n_layers=n_layers,
                               device=device)

    def forward(self, src, trg):
        src_mask = self.make_pad_mask(src, src, self.src_pad_idx, self.src_pad_idx)

        src_trg_mask = self.make_pad_mask(trg, src, self.trg_pad_idx, self.src_pad_idx)

        trg_mask = self.make_pad_mask(trg, trg, self.trg_pad_idx, self.trg_pad_idx) * \
                   self.make_no_peak_mask(trg, trg)

        enc_src = self.encoder(src, src_mask)
        output = self.decoder(trg, enc_src, trg_mask, src_trg_mask)
        return output

    def make_pad_mask(self, q, k, q_pad_idx, k_pad_idx):
        len_q, len_k = q.size(1), k.size(1)

        # batch_size x 1 x 1 x len_k
        k = k.ne(k_pad_idx).unsqueeze(1).unsqueeze(2)
        # batch_size x 1 x len_q x len_k
        k = k.repeat(1, 1, len_q, 1)

        # batch_size x 1 x len_q x 1
        q = q.ne(q_pad_idx).unsqueeze(1).unsqueeze(3)
        # batch_size x 1 x len_q x len_k
        q = q.repeat(1, 1, 1, len_k)

        mask = k & q
        return mask

    def make_no_peak_mask(self, q, k):
        len_q, len_k = q.size(1), k.size(1)

        # len_q x len_k
        mask = torch.tril(torch.ones(len_q, len_k)).type(torch.BoolTensor).to(self.device)

        return mask
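A minimal usage sketch of the model above (the vocabulary sizes, pad/sos indices, and toy token batches are illustrative assumptions; the other hyperparameters follow the base configuration in the paper). It assumes all of the classes defined in this section live in one module:

model = Transformer(src_pad_idx=0, trg_pad_idx=0, trg_sos_idx=1,
                    enc_voc_size=10000, dec_voc_size=10000,
                    d_model=512, n_head=8, max_len=256,
                    ffn_hidden=2048, n_layers=6, drop_prob=0.1, device='cpu')

src = torch.randint(2, 10000, (2, 10))   # [batch_size, src_len] source token ids
trg = torch.randint(2, 10000, (2, 8))    # [batch_size, trg_len] target token ids

logits = model(src, trg)
print(logits.shape)                      # torch.Size([2, 8, 10000]): per-token logits over the vocabulary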

5. FAQ

Answer analysis (1) - The most comprehensive Transformer interview questions in history: 20 soul-searching questions to help you thoroughly understand Transformer
Several questions about Transformer, compiled and recorded
Details and techniques of Transformer

1. Why is Multi-head Attention needed?

Why does Transformer need Multi-head Attention?

The original paper notes that Multi-head Attention splits the model into multiple heads, forming multiple subspaces, which lets the model attend to information from different aspects (different angles) and finally combine all of them. Intuitively, if you were designing such a model yourself, you would not compute attention only once: combining the results of multiple attention computations can at least strengthen the model, comparable to using multiple convolution kernels simultaneously in a CNN. In short, multi-head attention ensures that the Transformer can attend to information in different subspaces, helping the network capture richer features/information.

2. Why do Q and K use different weight matrices?

Why are different K and Q used in the transformer? Why can't the same value be used?

Simply put, using different weight matrices for Q, K, and V ensures that they are projected into different spaces, which enhances expressive power and improves generalization.

3. Why choose dot product instead of addition when calculating Attention?

For faster computation. Matrix addition itself is cheap, but additive attention as a whole is equivalent to inserting a hidden layer, so its total amount of computation is similar to that of the dot product. In terms of quality, the relative performance of the two depends on the vector dimension d_k: the larger d_k is, the more pronounced the advantage of additive attention over unscaled dot-product attention becomes (which is exactly why the scaling factor is introduced).

4. Does attention need to be scaled before performing softmax?

Please check the [Scaled Dot-Product Attention] chapter.


Origin blog.csdn.net/m0_37605642/article/details/132887513