Transformer code detailed analysis

This article is the text version of the Bilibili tutorial [The best [Transformer hands-on] tutorial on Bilibili so far! It takes you from zero to a detailed interpretation of the Transformer model so you can learn it all at once! ——Artificial Intelligence, Deep Learning, Neural Networks], organized from personal study and continuously updated.

The Bilibili video tutorial itself is organized following a document from Harvard University: The Annotated Transformer.


1. Background introduction of Transformer

1.1 The birth of Transformer

In October 2018, Google released the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". The BERT model was born and swept the best results on 11 tasks in the NLP field!


Paper address: https://arxiv.org/pdf/1810.04805.pdf


The structure that plays the key role in BERT is the Transformer. Models such as XLNet and RoBERTa later surpassed BERT, but their core has not changed: it is still the Transformer.

1.2 Advantages of Transformer

Compared with the LSTM and GRU models that previously dominated the field, Transformer has two significant advantages:

  1. Transformer can use distributed GPUs for parallel training, improving training efficiency.
  2. When analyzing and predicting longer texts, it is better at capturing semantic associations across long distances.
    The following is a comparison chart from the evaluation:
    insert image description here

1.3 Market of Transformer

On the well-known SOTA machine translation leaderboard, almost all of the top-ranked models use the Transformer.
insert image description here

It can basically be regarded as the industry's bellwether, so there is no need to say more about its market potential!


2. Transformer architecture analysis

2.1 Understand the Transformer architecture

2.1.1 The role of the Transformer model

  • learning target:
    • Understand the role of the Transformer model.
    • Learn the names of the various components in Transformer's overall architecture diagram.

  • The role of the Transformer model:
    • Based on the seq2seq architecture, the Transformer model can complete typical tasks in NLP research, such as machine translation and text generation. At the same time, it can be used to build pre-trained language models for transfer learning on different tasks.

  • Note:
    • In the following architecture analysis, we assume the Transformer model is used to translate text from one language into another, so much of the naming follows NLP conventions. For example, the Embedding layer will be called the text embedding layer, the tensor produced by the Embedding layer will be called the word embedding tensor, and its last dimension will be called the word vector, and so on.

2.1.2 Transformer overall architecture diagram

insert image description here

  • Transformer's overall architecture can be divided into four parts:
    • Input part
    • Output part
    • Encoder part
    • Decoder part

  • The input section contains:
    • Source text embedding layer and its positional encoder
    • Target text embedding layer and its positional encoder
      insert image description here

  • Output section:
    • linear layer
    • softmax layer
      insert image description here

  • Encoder part:
    • Stacked by N encoder layers
    • Each encoder layer consists of two sub-layer connection structures
    • The first sublayer connection structure consists of a multi-head self-attention sublayer, a normalization layer and a residual connection
    • The second sublayer connection structure consists of a feed-forward fully connected sublayer, a normalization layer and a residual connection
      insert image description here

  • Decoder part:
    • It is stacked by N decoder layers
    • Each decoder layer consists of three sublayer connection structures
    • The first sublayer connection structure consists of a multi-head self-attention sublayer, a normalization layer and a residual connection
    • The second sublayer connection structure consists of a multi-head attention sublayer, a normalization layer and a residual connection
    • The third sublayer connection structure consists of a feed-forward fully connected sublayer, a normalization layer and a residual connection
      insert image description here

2.2 Implementation of the input part

  • The input section contains:
    • Source text embedding layer and its positional encoder
    • Target text embedding layer and its positional encoder
      insert image description here

  • The role of the text embedding layer:
    • Whether for the source text or the target text, the embedding layer converts the numeric (index) representation of the words into vector representations, in the hope of capturing the relationships between words in this high-dimensional space.

  • Code analysis of the text embedding layer:
# Import the required packages
import torch

# torch.nn provides predefined network layers (convolution, LSTM, embedding, etc.)
# that the framework developers have already implemented for us, so there is no need to reinvent the wheel
import torch.nn as nn

# Math utilities
import math

# Variable, torch's tensor wrapper.
from torch.autograd import Variable  # Note: in old PyTorch, a plain tensor could not backpropagate, while a Variable could.

# Define the Embeddings class to implement the text embedding layer; the plural "s" indicates that it stands
# for two identical embedding layers (source and target) that share parameters.
# The class inherits nn.Module, which gives it the functionality of a standard layer; this can be seen as a
# pattern that every layer we implement ourselves will follow.
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        """Class initialization function with two parameters: d_model, the word embedding dimension, and vocab, the vocabulary size."""
        # Use super to invoke the initialization function of nn.Module; every layer we implement does this.
        super(Embeddings, self).__init__()
        # Then call nn's predefined Embedding layer to obtain the word embedding object self.lut
        self.lut = nn.Embedding(vocab, d_model)
        # Finally store d_model on the instance
        self.d_model = d_model

    def forward(self, x):
        """
        This can be understood as the layer's forward-propagation logic; every layer has this function.
        It is called automatically when input is passed to an instance of this class.
        Parameter x: since the Embedding layer is the first layer, x is the tensor obtained by mapping
        the input text through the vocabulary (word-to-index).
        """

        # Pass x through self.lut and multiply the result by sqrt(self.d_model) before returning.
        # The result is scaled here; think about why. (One common reason: it keeps the embedding values on a
        # scale comparable to the positional encoding that will be added next.)
        return self.lut(x) * math.sqrt(self.d_model)

embedding = nn.Embedding(10, 3)  # 10 is vocab_size, i.e. the total number of targets to encode
input = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])
embedding(input)

"""
tensor([[[-0.3748,  0.9854,  2.1664],
         [-0.3865,  0.5358, -0.9346],
         [ 1.1182,  0.9395, -0.0367],
         [-0.6363,  0.5225,  1.1308]],

        [[ 1.1182,  0.9395, -0.0367],
         [-1.6431, -0.7007,  0.3045],
         [-0.3865,  0.5358, -0.9346],
         [-0.4872, -0.5939, -2.2481]]], grad_fn=<EmbeddingBackward0>)
"""

embedding = nn.Embedding(10, 3, padding_idx=0)  # padding_idx=0 means index 0 is always mapped to the all-zero vector
input = torch.LongTensor([[0, 2, 0, 5]])
embedding(input)

"""
tensor([[[ 0.0000,  0.0000,  0.0000],
         [ 0.3094, -1.1555, -0.3782],
         [ 0.0000,  0.0000,  0.0000],
         [ 0.7858, -0.8773, -1.7635]]], grad_fn=<EmbeddingBackward0>)
"""

  • Instantiation parameters:
# The word embedding dimension is 512
d_model = 512

# The vocabulary size is 1000
vocab = 1000

  • Input parameters:
# The input x is a long integer tensor wrapped in Variable, with shape 2 x 4
# Note: in old PyTorch, a plain tensor could not backpropagate, while a Variable could.
x = Variable(torch.LongTensor([[100, 2, 421, 508], [491, 998, 1, 221]]))

  • Call:
emb = Embeddings(d_model, vocab)
embr = emb(x)
print("embr:", embr)

"""
embr: tensor([[[ 18.0835,  18.4307,  25.7455,  ...,  18.6767,  15.9241,   6.8471],
         [ 11.9021,  -8.0713, -26.0825,  ...,  10.6016, -31.4030,  10.6847],
         [  9.1287,  17.2753,  14.2908,  ...,   4.8957,   7.2101, -18.9200],
         [ 14.0852, -38.7575,  46.5348,  ...,  -3.5360,  10.5081, -10.5434]],

        [[-25.3910,  21.7453,  18.1822,  ...,  -4.4432,  -8.6873, -31.0001],
         [ -6.7744,  16.6366,  13.0639,  ..., -27.6570, -25.0405,   6.2767],
         [ 24.7707, -19.4100, -12.0361,  ...,  12.6369,   6.8837,  -0.7358],
         [-26.4784,  -9.7061, -20.1255,  ...,  49.6167, -16.2520,  -1.8203]]],
       grad_fn=<MulBackward0>)
[torch.FloatTensor of size 2x4x512]
"""

  • The role of the position encoder:
    • Because Transformer's encoder structure does not itself process word position information, a position encoder must be added after the Embedding layer. It adds the information that different word positions may produce different semantics into the word embedding vectors, compensating for the missing position information.

  • Code analysis of the position encoder:
# Define the positional encoder class; we also treat it as a layer, so it inherits nn.Module
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        """Initialization function with three parameters: d_model, the word embedding dimension;
        dropout, the zeroing rate; and max_len, the maximum sentence length."""
        super(PositionalEncoding, self).__init__()

        # Instantiate nn's predefined Dropout layer with rate dropout, obtaining the object self.dropout
        self.dropout = nn.Dropout(p=dropout)

        # Initialize a positional encoding matrix as a zero matrix of size max_len x d_model.
        pe = torch.zeros(max_len, d_model)  # 5000 x 512

        # Initialize an absolute position matrix; here a word's absolute position is simply its index.
        # First use arange to get a vector of consecutive natural numbers, then use unsqueeze to add a dimension;
        # passing 1 specifies where the dimension is added, turning it into a max_len x 1 matrix.
        position = torch.arange(0, max_len).unsqueeze(1)  # 5000 x 1

        # With the absolute position matrix initialized, the next question is how to put this position
        # information into the positional encoding matrix.
        # The simplest idea is to transform the max_len x 1 absolute position matrix into max_len x d_model
        # and overwrite the zero-initialized encoding matrix.
        # For that transformation we need a transformation vector div_term; besides its shape, we also want it
        # to scale the absolute positions down to sufficiently small numbers, which helps gradient descent
        # converge faster later on.
        # Careful readers will notice that we do not initialize a 1 x d_model vector as one might expect,
        # but only half of it, 1 x d_model/2. This is not really "half a matrix": think of it as doing the
        # transformation twice, once producing values on a sine wave and once on a cosine wave,
        # and filling those two results into the even and odd columns of the positional encoding matrix respectively.
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))  # shape: (d_model/2,)
        pe[:, 0::2] = torch.sin(position * div_term)  # fill the even columns of d_model
        pe[:, 1::2] = torch.cos(position * div_term)  # fill the odd columns of d_model

        # pe is still a 2D matrix; to add it to the output of the embedding (a 3D tensor)
        # we must add a dimension, so unsqueeze is used here.
        pe = pe.unsqueeze(0)  # 1 x 5000 x 512

        # Finally register pe as a buffer of the model. What is a buffer?
        # Think of it as something that helps the model but is neither a hyperparameter nor a parameter of
        # the model structure, and does not need to be updated during the optimization steps.
        # After registration it is saved and reloaded together with the model's structure and parameters.
        self.register_buffer('pe', pe)

    def forward(self, x):
        """The forward function's parameter x is the word embedding representation of the text sequence."""
        # Before adding, adapt pe to x: slice its second dimension (the maximum sentence length dimension)
        # down to x.size(1), because the default max_len of 5000 is far larger than any real sentence.
        # Wrap it in Variable so it matches x's style, but with requires_grad=False since no gradient is needed for it.
        x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)  # 1 x x.size(1) x 512
        # Finally apply self.dropout to the result and return it.
        return self.dropout(x)
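
  • For reference, the formula implemented by the code above (the standard sinusoidal encoding from the original paper, restated here for clarity rather than taken from the tutorial text):

    PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
    PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

    In the code, div_term = exp( 2i * (-ln 10000) / d_model ) = 10000^(-2i / d_model), so position * div_term is exactly the argument of the sine and cosine above; the sine values fill the even columns of pe and the cosine values fill the odd columns.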

  • nn.Dropout demo:
m = nn.Dropout(p=0.2)
input = torch.rand(4, 5)
output = m(input)
output

"""
tensor([[0.4195, 0.6860, 1.0761, 0.0000, 0.0000],
        [0.1527, 1.1528, 0.9567, 0.8003, 0.7162],
        [1.1279, 0.3140, 0.2020, 0.0000, 1.1303],
        [0.2858, 0.0206, 0.0000, 1.0863, 0.4558]])
[torch.FloatTensor of size 4x5]
"""

  • torch.unsqueeze demo:
x = torch.tensor([1, 2, 3, 4])
torch.unsqueeze(x, 0)

"""
tensor([[1, 2, 3, 4]])
"""

torch.unsqueeze(x, 1)

"""
tensor([[1],
        [2],
        [3],
        [4]])
"""

  • Instantiation parameters:
# The word embedding dimension is 512
d_model = 512

# The zeroing rate is 0.1
dropout = 0.1

# Maximum sentence length
max_len = 60

  • Input parameters:
# The input x is the tensor output by the Embedding layer, with shape 2 x 4 x 512
x = embr

"""
Variable containing:
(0,.,.)=
    18.0835,  18.4307,  25.7455,  ...,  18.6767,  15.9241,   6.8471
    11.9021,  -8.0713, -26.0825,  ...,  10.6016, -31.4030,  10.6847
    9.1287,  17.2753,  14.2908,  ...,   4.8957,   7.2101, -18.9200
    14.0852, -38.7575,  46.5348,  ...,  -3.5360,  10.5081, -10.5434

(1,.,.)=
    -25.3910,  21.7453,  18.1822,  ...,  -4.4432,  -8.6873, -31.0001
    -6.7744,  16.6366,  13.0639,  ..., -27.6570, -25.0405,   6.2767
    24.7707, -19.4100, -12.0361,  ...,  12.6369,   6.8837,  -0.7358
    -26.4784,  -9.7061, -20.1255,  ...,  49.6167, -16.2520,  -1.8203
[torch.FloatTensor of size 2x4x512]
"""

  • Call:
pe = PositionalEncoding(d_model, dropout, max_len)
pe_result = pe(x)
print("pe_result:", pe_result)

"""
pe_result: Variable containing:
(0,.,.)=
    18.0835,  18.4307,  25.7455,  ...,  18.6767,  15.9241,   6.8471
    11.9021,  -8.0713, -26.0825,  ...,  10.6016, -31.4030,  10.6847
    9.1287,  17.2753,  14.2908,  ...,   4.8957,   7.2101, -18.9200
    14.0852, -38.7575,  46.5348,  ...,  -3.5360,  10.5081, -10.5434

(1,.,.)=
    -25.3910,  21.7453,  18.1822,  ...,  -4.4432,  -8.6873, -31.0001
    -6.7744,  16.6366,  13.0639,  ..., -27.6570, -25.0405,   6.2767
    24.7707, -19.4100, -12.0361,  ...,  12.6369,   6.8837,  -0.7358
    -26.4784,  -9.7061, -20.1255,  ...,  49.6167, -16.2520,  -1.8203
[torch.FloatTensor of size 2x4x512]
"""
  • Plot the distribution curves of the features in the word vectors:
import numpy as np
import matplotlib.pyplot as plt

# Create a figure of size 15 x 5
plt.figure(figsize=(15, 5))

# Instantiate the PositionalEncoding class to get the pe object; the input parameters are 20 and 0
pe = PositionalEncoding(20, 0)

# Then pass a Variable-wrapped tensor to pe so that pe directly runs its forward function.
# Since all values in this tensor are 0, after processing the result is effectively the positional encoding tensor itself
y = pe(Variable(torch.zeros(1, 100, 20)))

# Define the axes: the horizontal axis runs over positions up to 100, and the vertical axis is the value of a
# given feature dimension of the encoding at each position.
# There are 20 dimensions in total; here we only look at dimensions 4, 5, 6 and 7.
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())

# Add a legend with the dimension information
plt.legend(['dim %d' %p for p in [4, 5, 6, 7]])

insert image description here


  • Output effect analysis:
    • Each colored curve shows how one feature dimension of the encoding varies with position.
    • This guarantees that the embedding vector of the same word changes as its position in the sentence changes.
    • The sine and cosine waves range from -1 to 1, which keeps the added values well bounded and helps gradients be computed quickly.

Section summary:

  • Learned the role of the text embedding layer:
    • Whether for the source text or the target text, it converts the numeric (index) representation of the words (the text is first mapped word-to-index) into vector representations, in the hope of capturing the relationships between words in this high-dimensional space.
  • Learned and implemented the text embedding layer class: Embeddings
    • Its initialization function takes d_model, the word embedding dimension, and vocab, the vocabulary size, as parameters, and internally uses nn's predefined Embedding layer for the word embedding.
    • In the forward function, the input x is passed to the instantiated Embedding object, and the result is multiplied by sqrt(d_model) to scale and control the magnitude of the values.
    • Its output is the text embedding result.
  • Learned the role of the position encoder:
    • Because Transformer's encoder structure does not itself process word position information, a position encoder must be added after the Embedding layer. It adds the information that different word positions may produce different semantics into the word embedding tensor, compensating for the missing position information.
  • Learned and implemented the position encoder class: PositionalEncoding
    • Its initialization function takes d_model, dropout and max_len as parameters, representing the word embedding dimension, the zeroing rate and the maximum sentence length respectively.
    • The input parameter of the forward function is x, the output of the Embedding layer (to which pe is added: x + pe).
    • Its final output is a word embedding tensor that carries positional encoding information.
  • Implemented plotting the distribution curves of the features in the word vectors:
    • It is guaranteed that the embedding vector of the same word changes as its position changes.
    • The sine and cosine waves range from -1 to 1, which keeps the added values well bounded and helps gradients be computed quickly.

2.3 Encoder Partial Implementation

  • learning target:
    • Understand the role of each component in the encoder.
    • Master the implementation process of each component in the encoder.

  • Encoder part:
    • It is formed by stacking N encoder layers.
    • Each encoder layer consists of two sublayer connection structures.
    • The first sublayer connection structure consists of a multi-head self-attention sublayer, a normalization layer and a residual connection.
    • The second sublayer connection structure consists of a feed-forward fully connected sublayer, a normalization layer and a residual connection.
      insert image description here

2.3.1 Mask tensor

  • Learn what a mask tensor is and what it does.
  • Master the implementation process of generating mask tensors.

  • What is a mask tensor:
    • "Mask" means to cover up, and the "code" refers to the values in our tensor. Its size is variable, and it generally contains only 1s and 0s, indicating whether each position is masked or not; whether the 0 positions or the 1 positions are the masked ones can be customized. Its function is therefore to cover up, or replace, some values in another tensor, and it is itself expressed as a tensor.

  • The role of the mask tensor:
    • In the Transformer, the mask tensor is mainly applied in attention. Some values in the computed attention tensor may be obtained by "seeing the future": during training the whole target output is embedded and fed in at once, whereas in theory the decoder does not produce the final result in one step but generates it step by step from previous results, so future information could otherwise be used ahead of time. Therefore we mask it out.

  • Code analysis to generate a masked tensor:
import torch
import numpy as np

def subsequent_mask(size):
    """Generate a mask tensor that masks subsequent positions. The parameter size is the size of the
    last two dimensions of the mask tensor, which form a square matrix."""
    # First define the shape of the mask tensor
    attn_shape = (1, size, size)

    # Then use np.ones to fill this shape with 1s and np.triu to form an upper-triangular matrix;
    # finally, to save space, cast the data type to unsigned 8-bit integers (uint8)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')

    # Finally convert the numpy array to a torch tensor and apply a "1 -" operation,
    # which flips the triangular matrix: every element of subsequent_mask is subtracted from 1.
    # If an element is 0, that position becomes 1.
    # If an element is 1, that position becomes 0.
    return torch.from_numpy(1 - subsequent_mask)

  • np.triu demo:
np.triu([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], k=-1)

"""
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 0,  8,  9],
       [ 0,  0, 12]])
"""

np.triu([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], k=0)

"""
array([[1, 2, 3],
       [0, 5, 6],
       [0, 0, 9],
       [0, 0, 0]])
"""

np.triu([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], k=1)

"""
array([[0, 2, 3],
       [0, 0, 6],
       [0, 0, 0],
       [0, 0, 0]])
"""

  • Input example:
# The size of the last two dimensions of the generated mask tensor
size = 5

  • Call:
sm = subsequent_mask(size)
print('sm:', sm)

"""最后两维形成一个下三角阵
tensor([[[1, 0, 0, 0, 0],
         [1, 1, 0, 0, 0],
         [1, 1, 1, 0, 0],
         [1, 1, 1, 1, 0],
         [1, 1, 1, 1, 1]]], dtype=torch.uint8)
torch.ByteTensor of size 1x5x5
"""

  • Visualization of masked tensors:
plt.figure(figsize=(5, 5))
plt.imshow(subsequent_mask(20)[0])  # take index 0 along the first dimension; the dimensions are indexed 0, 1, 2 from outermost to innermost

  • Output effect:
    insert image description here

  • Output effect analysis:
    • In the visualized square matrix, yellow marks the cells with value 1, i.e. the positions that are allowed to be attended to, while purple marks the value-0 cells, i.e. the masked positions; the ordinate is the position of the target word being generated and the abscissa is the position being looked at.
    • At row 0, only position 0 is visible: when generating the first word, the model can only use the first position. At row 1, positions 0 and 1 are visible, and so on; each word can attend to itself and the words before it, but never to the words after it.

2.3.1 Mask tensor summary

  • Learned what a mask tensor is:
    • "Mask" means to cover up, and the "code" refers to the values in our tensor. Its size is variable, and it generally contains only 1s and 0s, indicating whether each position is masked or not; whether the 0 positions or the 1 positions are the masked ones can be customized. Its function is therefore to cover up, or replace, some values in another tensor, and it is itself expressed as a tensor.

  • Learned the role of the mask tensor:
    • In the Transformer, the mask tensor is mainly applied in attention. Some values in the computed attention tensor may be obtained by "seeing the future": during training the whole target output is embedded and fed in at once, whereas in theory the decoder does not produce the final result in one step but generates it step by step from previous results, so future information could otherwise be used ahead of time. Therefore we mask it out.

  • Learned and implemented the function that generates a mask for subsequent positions: subsequent_mask
    • Its input is size, the size of the last two dimensions of the mask tensor.
    • Its output is a tensor whose last two dimensions form a size x size lower-triangular matrix.
    • Finally, a visual analysis of the generated mask tensor was performed to further understand its purpose.

2.3.2 Attention mechanism

  • learning target:
    • Understand what the attention calculation rule and the attention mechanism are.
    • Master the implementation process of attention calculation rules.

  • What is attention:
    • When we observe things, we can quickly judge what a thing is (even allowing that the judgment may be wrong) because our brain can quickly focus attention on the most recognizable parts of the thing, rather than having to observe it completely from beginning to end before reaching a conclusion. The attention mechanism is built on this idea.

  • What is the attention calculation rule:
    • It requires three specified inputs: Q (query), K (key) and V (value). The attention result is then obtained through a formula; it represents the query's representation under the influence of key and value. There are many concrete calculation rules; here we only introduce the one we use.

  • The calculation rules of attention we use here:
    insert image description here
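
  • For reference (the figure above is only a placeholder), the rule used throughout this tutorial is the scaled dot-product attention from the original paper, which the attention function implemented below follows directly:

    Attention(Q, K, V) = softmax( Q * K^T / sqrt(d_k) ) * V

    Here d_k is the size of the last dimension of the query/key; dividing by sqrt(d_k) keeps the dot products in a range where softmax still produces useful gradients.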

  • Metaphorical interpretation of Q, K, V:

Consider a problem: given a piece of text, describe it with some keywords!
To make the answers uniform and correct, some keywords may have been written down in advance as hints. These hints can be regarded as the key, while the whole text is the query. The value is more abstract: it can be compared to the answer that comes to mind after you read the text. Assume that at first nobody is very clever: the first time you see the text, pretty much the only thing that comes to mind is the hints themselves, so at that point key and value are basically the same. As you understand the problem more deeply, your thinking brings more and more to mind, and you can start to answer the query. This is the process of attention: through this process, the value in your mind changes, and a keyword representation of the query is produced based on the hint key, which is another way of representing its features.
The key and value just mentioned are generally the same by default and different from the query; this is the general form of attention input. There is, however, a special case in which query, key and value are all the same, which we call the self-attention mechanism. In the example above, general attention uses keywords different from the given text to represent it, whereas self-attention uses the given text itself to represent it, i.e. it extracts keywords from the text to express the text, which amounts to a feature extraction of the text itself.


  • What is the attention mechanism:
    • The attention mechanism is the part of a deep network in which the attention calculation rule is applied. Besides the calculation rule itself, it also includes some necessary fully connected layers and related tensor processing so that it can be integrated with the application network. An attention mechanism that uses the self-attention calculation rule is called a self-attention mechanism.

  • A graphical representation of the attention mechanism implemented in the network:
    insert image description here

  • Code analysis of attention calculation rules:
import torch
import math
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    """Implementation of the attention mechanism. The inputs are query, key, value, mask (the mask tensor)
    and dropout (an instantiated nn.Dropout object); mask and dropout default to None."""
    # First take the size of the last dimension of query, which generally equals our word embedding dimension; call it d_k
    d_k = query.size(-1)  # 512
    # Following the attention formula, multiply query with the transpose of key (transposing key's last two
    # dimensions), then divide by the scaling factor sqrt(d_k). This is known as scaled dot-product attention.
    # The result is the attention score tensor scores
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)  # 4 x 4

    # Next check whether a mask tensor is used
    if mask is not None:
        # Use the tensor's masked_fill method to compare the mask tensor and the scores tensor position by position;
        # wherever the mask is 0, the corresponding score is replaced with -1e9, as demonstrated below:
        scores = scores.masked_fill(mask == 0, -1e9)  # 4 x 4

    # Apply softmax to the last dimension of scores using F.softmax; the first argument is the tensor,
    # the second the target dimension. This gives the final attention tensor.
    # The softmax is taken over the last dimension, dim=-1 (i.e. row-wise here)
    p_attn = F.softmax(scores, dim=-1)  # 4 x 4

    # Then check whether dropout should be used to randomly zero out entries
    if dropout is not None:
        # Pass p_attn through the dropout object.
        p_attn = dropout(p_attn)  # 4 x 4

    # Finally, following the formula, multiply p_attn with the value tensor to obtain the final attention
    # representation of the query, and also return the attention tensor itself.
    return torch.matmul(p_attn, value), p_attn  # 4 x 512, 4 x 4

  • tensor.masked_fill demo:
input = Variable(torch.randn(5, 5))
print(input)

"""
Variable containing:
 2.0344 -0.5450  0.3365 -0.1888 -2.1803
 1.5221 -0.3823  0.8414  0.7836 -0.8481
-0.0345 -0.8643  0.6476 -0.2713  1.5645
 0.8788 -2.2142  0.4022  0.1997  0.1474
 2.9109  0.6006 -0.6745 -1.7262  0.6977
[torch.FloatTensor of size 5x5]
"""

mask = Variable(torch.zeros(5, 5))
print(mask)

"""
Variable containing:
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
[torch.FloatTensor of size 5x5]
"""

input.masked_fill(mask == 0, -1e9)

"""
Variable containing:
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
[torch.FloatTensor of size 5x5]
"""

  • Input parameters:
# We let the input query, key and value all be the same: the output of the positional encoding
query = key = value = pe_result

"""
Variable containing:
(0,.,.) = 
	-2.4706,   34.0796,   2.7505,  ...,  -3.4972, -47.7639,  -3.3239
	 19.8931,  21.1933, -22.7047,  ...,  -6.5685, -57.0562,  -8.0796
	 3.4568,    9.4423,  -3.3226,  ...,  -3.1958,  17.6394, -28.5494
	-13.1135,  32.7553,  49.7123,  ...,  17.4657,  47.5091,  12.2267

(1,.,.) = 
	 12.9825,   7.8147,  24.1951,  ...,  22.3257,   9.8316, -15.4033
	 -7.6471,  -8.1521,  -1.1315,  ...,  -0.2127,   3.8205, -19.2361
	-37.0878,  50.6874,  40.1957,  ...,  12.9805, -16.8716, -35.2725
	 5.0624,    0.4235, -17.8792,  ...,  16.9755, -36.1472,   7.2271
[torch.FloatTensor of size 2x4x512]
"""

  • Call:
attn, p_attn = attention(query, key, value)
print('attn:', attn)
print('p_attn:', p_attn)

# We get two results
# The attention representation of query:
"""
attn: Variable containing:
(0,.,.) = 
	-2.4706,   34.0796,   2.7505,  ...,  -3.4972, -47.7639,  -3.3239
	 19.8931,  21.1933, -22.7047,  ...,  -6.5685, -57.0562,  -8.0796
	 3.4568,    9.4423,  -3.3226,  ...,  -3.1958,  17.6394, -28.5494
	-13.1135,  32.7553,  49.7123,  ...,  17.4657,  47.5091,  12.2267

(1,.,.) = 
	 12.9825,   7.8147,  24.1951,  ...,  22.3257,   9.8316, -15.4033
	 -7.6471,  -8.1521,  -1.1315,  ...,  -0.2127,   3.8205, -19.2361
	-37.0878,  50.6874,  40.1957,  ...,  12.9805, -16.8716, -35.2725
	 5.0624,    0.4235, -17.8792,  ...,  16.9755, -36.1472,   7.2271
[torch.FloatTensor of size 2x4x512]
"""
# The attention tensor: each row shows how strongly one position is related to the others (1 = fully related, 0 = completely unrelated)
"""
p_attn: Variable containing:
(0 ,.,.) = 
  1  0  0  0
  0  1  0  0
  0  0  1  0
  0  0  0  1

(1 ,.,.) = 
  1  0  0  0
  0  1  0  0
  0  0  1  0
  0  0  0  1
[torch.FloatTensor of size 2x4x4]

"""

  • Input parameters with mask:
query = key = value = pe_result

# Let mask be a 2 x 4 x 4 zero tensor
mask = Variable(torch.zeros(2, 4, 4))

  • Call:
attn, p_attn = attention(query, key, value, mask=mask)
print('attn:', attn)
print('p_attn:', p_attn)

# The attention representation of query
"""
attn: Variable containing:
( 0 ,.,.) = 
   0.4284  -7.4741   8.8839  ...    1.5618   0.5063   0.5770
   0.4284  -7.4741   8.8839  ...    1.5618   0.5063   0.5770
   0.4284  -7.4741   8.8839  ...    1.5618   0.5063   0.5770
   0.4284  -7.4741   8.8839  ...    1.5618   0.5063   0.5770

( 1 ,.,.) = 
  -2.8890   9.9972 -12.9505  ...    9.1657  -4.6164  -0.5491
  -2.8890   9.9972 -12.9505  ...    9.1657  -4.6164  -0.5491
  -2.8890   9.9972 -12.9505  ...    9.1657  -4.6164  -0.5491
  -2.8890   9.9972 -12.9505  ...    9.1657  -4.6164  -0.5491
[torch.FloatTensor of size 2x4x512]
"""
# The attention tensor: each row shows how strongly one position is related to the others (1 = fully related, 0 = completely unrelated)
"""
p_attn: Variable containing:
(0 ,.,.) = 
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500

(1 ,.,.) = 
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
[torch.FloatTensor of size 2x4x4]
"""

2.3.3 Multi-head attention mechanism

  • What is a multi-head attention mechanism:
    • From the structure diagram of multi-head attention, the so-called multiple "heads" may appear to be multiple sets of linear transformation layers, but they are not: only one set of linear layers is used, i.e. three transformation matrices that linearly transform Q, K and V respectively. These transformations do not change the size of the original tensor, so each transformation matrix is a square matrix. It is after this output is obtained that the heads come into play: each head splits the output tensor along the semantic (last) dimension, so each head gets its own Q, K and V for the attention computation, but only a portion of each word's representation, i.e. only a slice of the word embedding's last dimension. This is what "multi-head" means; the input obtained by each head is sent into the attention mechanism, forming the multi-head attention mechanism (see the small shape-tracing sketch before the code below).

  • Structural diagram of multi-head attention mechanism:
    insert image description here

  • The role of the multi-head attention mechanism:
    • This design lets each attention head optimize a different feature subspace of each word, balancing out biases that a single attention mechanism might produce and giving word meanings more diverse expression. Experiments show that this improves model performance.
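
  • A minimal shape-tracing sketch of the head splitting described above (toy sizes chosen purely for illustration, not taken from the tutorial): a (batch, seq_len, embedding_dim) tensor is reshaped so that each head sees only embedding_dim / head features of every word; this is exactly the view + transpose pattern used inside MultiHeadedAttention below.
import torch

# Hypothetical small sizes: batch=2, seq_len=4, embedding_dim=8, head=2, so d_k = 8 // 2 = 4
x = torch.randn(2, 4, 8)                             # (batch, seq_len, embedding_dim)
head, d_k = 2, 8 // 2

# Split the last dimension into heads, then move the head dimension forward,
# giving each head a (seq_len, d_k) slice of every sample
x_heads = x.view(2, 4, head, d_k).transpose(1, 2)    # (batch, head, seq_len, d_k)
print(x_heads.shape)                                 # torch.Size([2, 2, 4, 4])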

  • The code implementation of the multi-head attention mechanism:
# copy package, used for deep copies
import copy

# First define a clone function, because the multi-head attention implementation uses several linear layers
# with identical structure. We use this function to initialize them together in one network-layer list object;
# it will also be used in later structures.
def clones(module, N):
    """A clone function that produces identical network layers. The parameter module is the target layer to
    clone, and N is the number of clones needed."""
    # In the function we deep-copy module N times in a for loop so that each copy becomes an independent layer,
    # then store them in a list of type nn.ModuleList.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


# We implement the multi-head attention mechanism with a class
class MultiHeadedAttention(nn.Module):
    def __init__(self, head, embedding_dim, dropout=0.1):
        """The class initialization takes three parameters: head, the number of heads; embedding_dim,
        the word embedding dimension; and dropout, the zeroing rate for the dropout operation, default 0.1."""
        super(MultiHeadedAttention, self).__init__()

        # First an assert statement, common in testing, checks that embedding_dim is divisible by head,
        # because we will later give each head an equal share of the word features, i.e. embedding_dim/head each.
        assert embedding_dim % head == 0

        # The dimension d_k of the split word vector each head receives
        self.d_k = embedding_dim // head

        # Store the number of heads
        self.head = head

        # Then obtain the linear layer objects via nn.Linear; the internal transformation matrix is
        # embedding_dim x embedding_dim, and we clone four of them with the clones function.
        # Why four? In multi-head attention, Q, K and V each need one, and the final concatenated matrix needs another, so four in total.
        self.linears = clones(nn.Linear(embedding_dim, embedding_dim), 4)

        # self.attn is None; it will hold the final attention tensor, and since there is no result yet it is None
        self.attn = None

        # Finally a dropout object, instantiated from nn.Dropout with the zeroing rate passed in as dropout.
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        """The forward logic function has four inputs: the first three are the Q, K and V needed by the
        attention mechanism, and the last is the mask tensor that attention may need, default None."""

        # If a mask tensor exists
        if mask is not None:
            # Expand its dimensions with unsqueeze
            mask = mask.unsqueeze(0)

        # Next obtain a batch_size variable: the first number of query's size, i.e. how many samples there are.
        batch_size = query.size(0)

        # Then comes the multi-head processing step.
        # First zip the inputs Q, K, V with three of the linear layers, then in a for loop pass each of Q, K, V
        # through its linear layer. After the linear transformation, split the input for each head: the view
        # method reshapes the result, adding a head dimension, which means each head gets a sentence made of a
        # portion of the word features. The -1 is an adaptive dimension whose value is computed automatically.
        # Then transpose dimensions 1 and 2 so that the sentence-length dimension and the word-vector dimension
        # are adjacent, which lets the attention mechanism relate word meanings to sentence positions:
        # as can be seen in the attention function, it operates on the last two dimensions of its input.
        # This gives us each head's input.
        query, key, value = \
            [model(x).view(batch_size, -1, self.head, self.d_k).transpose(1, 2)
             for model, x in zip(self.linears, (query, key, value))]

        # With each head's input obtained, pass them into attention;
        # we directly call the attention function implemented earlier, also passing in mask and dropout.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)

        # After the multi-head attention computation we have a 4-dimensional tensor made of each head's result;
        # we need to convert it back to the input shape for subsequent computation, so we reverse the earlier
        # steps: first transpose dimensions 1 and 2, then call contiguous, which makes the transposed tensor
        # usable with view (otherwise view could not be applied directly).
        # The next step is to use view to reshape it back to the same shape as the input.
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.head * self.d_k)

        # Finally apply the last linear layer in the list to obtain the final output of the multi-head attention structure.
        return self.linears[-1](x)

  • tensor.view demo:
x = torch.randn(4, 4)
x.size()

"""
torch.Size([4, 4])
"""

y = x.view(16)
y.size()

"""
torch.Size([16])
"""

z = x.view(-1, 8)  # the size -1 is inferred from other dimensions
z.size()

"""
torch.Size([2, 8])
"""

a = torch.randn(1, 2, 3, 4)
a.size()

"""
torch.Size([1, 2, 3, 4])
"""

b = a.transpose(1, 2)  # Swaps 2nd and 3rd dimension
b.size()

"""
torch.Size([1, 3, 2, 4])
"""

c = a.view(1, 3, 2, 4)  # Does not change tensor layout in memory
c.size()

"""
torch.Size([1, 3, 2, 4])
"""

torch.equal(b, c)
"""
False
"""

  • torch.transpose demo:
x = torch.randn(2, 3)
x

"""
tensor([[ 1.0028, -0.9893,  0.5809],
        [-0.1669,  0.7299,  0.4942]])
"""

torch.transpose(x, 0, 1)

"""
tensor([[ 1.0028, -0.1669],
        [-0.9893,  0.7299],
        [ 0.5809,  0.4942]])
"""

  • Instantiation parameters:
# Number of heads: head
head = 8

# Word embedding dimension: embedding_dim
embedding_dim = 512

# Zeroing rate: dropout
dropout = 0.2

  • Input parameters:
# Assume the input Q, K, V are still all equal
query = value = key = pe_result

# The input mask tensor
mask = Variable(torch.zeros(8, 4, 4))

  • Call:
mha = MultiHeadedAttention(head, embedding_dim, dropout)
mha_result = mha(query, key, value, mask)
print(mha_result)

"""
tensor([[[-0.3075,  1.5687, -2.5693,  ..., -1.1098,  0.0878, -3.3609],
         [ 3.8065, -2.4538, -0.3708,  ..., -1.5205, -1.1488, -1.3984],
         [ 2.4190,  0.5376, -2.8475,  ...,  1.4218, -0.4488, -0.2984],
         [ 2.9356,  0.3620, -3.8722,  ..., -0.7996,  0.1468,  1.0345]],

        [[ 1.1423,  0.6038,  0.0954,  ...,  2.2679, -5.7749,  1.4132],
         [ 2.4066, -0.2777,  2.8102,  ...,  0.1137, -3.9517, -2.9246],
         [ 5.8201,  1.1534, -1.9191,  ...,  0.1410, -7.6110,  1.0046],
         [ 3.1209,  1.0008, -0.5317,  ...,  2.8619, -6.3204, -1.3435]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

2.3.4 Feedforward fully connected layer

  • learning target:
    • Learn what a feedforward layer is and what it does.
    • Master the implementation process of the feedforward fully connected layer

  • What is a feedforward fully connected layer:
    • The feed-forward fully connected layer in Transformer is a fully connected network with two linear layers.

  • The role of the feedforward fully connected layer:
    • Considering that the attention mechanism alone may not fit complex processes well enough, a two-layer fully connected network is added to enhance the model's capacity.
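
  • Written as a formula (the form from the original paper; the implementation below additionally applies dropout after the activation), with x of dimension d_model:

    FFN(x) = max(0, x * W1 + b1) * W2 + b2

    where W1 maps d_model to d_ff and W2 maps d_ff back to d_model, so the input and output dimensions stay the same.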

  • Code analysis of the feedforward fully connected layer:
# The feed-forward fully connected layer is implemented by the class PositionwiseFeedForward
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        """The initialization function has three parameters: d_model, d_ff and dropout=0.1.
        The first is the input dimension of the first linear layer and also the output dimension of the second,
        because we want the dimension to stay unchanged after passing through the feed-forward layer.
        The second parameter, d_ff, is the output dimension of the first linear layer and the input dimension of the second.
        The last is the dropout zeroing rate."""
        super(PositionwiseFeedForward, self).__init__()

        # As planned, instantiate two linear layer objects with nn, self.w1 and self.w2,
        # with parameters (d_model, d_ff) and (d_ff, d_model) respectively
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        # Then instantiate the dropout object self.dropout from nn.Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """The input parameter x is the output from the previous layer"""
        # First pass through the first linear layer, then activate with the relu function from torch.nn.functional,
        # then apply dropout for random zeroing, and finally pass through the second linear layer w2 and return the result.
        return self.w2(self.dropout(F.relu(self.w1(x))))

  • ReLU function formula: ReLU(x) = max(0, x)
  • ReLU function image:
    insert image description here

  • Instantiation parameters:
d_model = 512

# The dimension of the linear transformation
d_ff = 64

dropout = 0.2
  • Input parameters:
# The input x can be the output of the multi-head attention mechanism
x = mha_result

"""
tensor([[[-0.3075,  1.5687, -2.5693,  ..., -1.1098,  0.0878, -3.3609],
         [ 3.8065, -2.4538, -0.3708,  ..., -1.5205, -1.1488, -1.3984],
         [ 2.4190,  0.5376, -2.8475,  ...,  1.4218, -0.4488, -0.2984],
         [ 2.9356,  0.3620, -3.8722,  ..., -0.7996,  0.1468,  1.0345]],

        [[ 1.1423,  0.6038,  0.0954,  ...,  2.2679, -5.7749,  1.4132],
         [ 2.4066, -0.2777,  2.8102,  ...,  0.1137, -3.9517, -2.9246],
         [ 5.8201,  1.1534, -1.9191,  ...,  0.1410, -7.6110,  1.0046],
         [ 3.1209,  1.0008, -0.5317,  ...,  2.8619, -6.3204, -1.3435]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

  • Call:
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
ff_result = ff(x)
print(ff_result)

"""
tensor([[[-1.9488e+00, -3.4060e-01, -1.1216e+00,  ...,  1.8203e-01,
          -2.6336e+00,  2.0917e-03],
         [-2.5875e-02,  1.1523e-01, -9.5437e-01,  ..., -2.6257e-01,
          -5.7620e-01, -1.9225e-01],
         [-8.7508e-01,  1.0092e+00, -1.6515e+00,  ...,  3.4446e-02,
          -1.5933e+00, -3.1760e-01],
         [-2.7507e-01,  4.7225e-01, -2.0318e-01,  ...,  1.0530e+00,
          -3.7910e-01, -9.7730e-01]],

        [[-2.2575e+00, -2.0904e+00,  2.9427e+00,  ...,  9.6574e-01,
          -1.9754e+00,  1.2797e+00],
         [-1.5114e+00, -4.7963e-01,  1.2881e+00,  ..., -2.4882e-02,
          -1.5896e+00, -1.0350e+00],
         [ 1.7416e-01, -4.0688e-01,  1.9289e+00,  ..., -4.9754e-01,
          -1.6320e+00, -1.5217e+00],
         [-1.0874e-01, -3.3842e-01,  2.9379e-01,  ..., -5.1276e-01,
          -1.6150e+00, -1.1295e+00]]], grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

2.3.5 Normalization layer

  • The role of the normalization layer:
    • It is a standard layer needed by all deep network models. As the number of layers increases, the values computed after passing through many layers may become too large or too small, which can make the learning process abnormal and the model converge very slowly. Therefore, a normalization layer is added after a certain number of layers to bring the values back into a reasonable range.
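
  • In formula form, the layer normalization implemented below computes, for each position's feature vector x (mean and standard deviation taken over the last, i.e. feature, dimension):

    LN(x) = a2 * (x - mean(x)) / (std(x) + eps) + b2

    where a2 and b2 are learnable scale and shift parameters, and eps prevents division by zero.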

  • The code implementation of the normalization layer:
# The normalization layer is implemented by the class LayerNorm
class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        """The initialization function has two parameters: features, the word embedding dimension,
        and eps, a sufficiently small number that appears in the denominator of the normalization formula
        to prevent division by zero; the default is 1e-6."""
        super(LayerNorm, self).__init__()

        # Initialize two parameter tensors a2 and b2 according to the shape of features:
        # the first is initialized as an all-ones tensor and the second as an all-zeros tensor.
        # These two tensors are the parameters of the normalization layer. Applying the normalization formula
        # directly to the previous layer's result would change its natural representation, so we need parameters
        # acting as adjustment factors that satisfy the normalization requirement without changing the
        # representation of the target. Finally they are wrapped in nn.Parameter, marking them as model parameters.
        self.a2 = nn.Parameter(torch.ones(features))
        self.b2 = nn.Parameter(torch.zeros(features))

        # Store eps on the instance
        self.eps = eps

    def forward(self, x):
        """The input parameter x is the output from the previous layer"""
        # First compute the mean of x over the last dimension, keeping the output dimensions equal to the input's.
        # Then compute the standard deviation over the last dimension, and apply the normalization formula:
        # subtract the mean from x and divide by the standard deviation.
        # Finally multiply the result by the scale parameter a2 (* denotes element-wise multiplication,
        # i.e. position-by-position) and add the shift parameter b2, then return.
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a2 * (x - mean) / (std + self.eps) + self.b2

  • Instantiation parameters:
features = d_model = 512
eps = 1e-6

  • Input parameters:
# The input x comes from the output of the feed-forward fully connected layer
x = ff_result

"""
tensor([[[-1.9488e+00, -3.4060e-01, -1.1216e+00,  ...,  1.8203e-01,
          -2.6336e+00,  2.0917e-03],
         [-2.5875e-02,  1.1523e-01, -9.5437e-01,  ..., -2.6257e-01,
          -5.7620e-01, -1.9225e-01],
         [-8.7508e-01,  1.0092e+00, -1.6515e+00,  ...,  3.4446e-02,
          -1.5933e+00, -3.1760e-01],
         [-2.7507e-01,  4.7225e-01, -2.0318e-01,  ...,  1.0530e+00,
          -3.7910e-01, -9.7730e-01]],

        [[-2.2575e+00, -2.0904e+00,  2.9427e+00,  ...,  9.6574e-01,
          -1.9754e+00,  1.2797e+00],
         [-1.5114e+00, -4.7963e-01,  1.2881e+00,  ..., -2.4882e-02,
          -1.5896e+00, -1.0350e+00],
         [ 1.7416e-01, -4.0688e-01,  1.9289e+00,  ..., -4.9754e-01,
          -1.6320e+00, -1.5217e+00],
         [-1.0874e-01, -3.3842e-01,  2.9379e-01,  ..., -5.1276e-01,
          -1.6150e+00, -1.1295e+00]]], grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

  • Call:
ln = LayerNorm(features, eps)
ln_result = ln(x)
print(ln_result)

"""
tensor([[[ 2.2697,  1.3911, -0.4417,  ...,  0.9937,  0.6589, -1.1902],
         [ 1.5876,  0.5182,  0.6220,  ...,  0.9836,  0.0338, -1.3393],
         [ 1.8261,  2.0161,  0.2272,  ...,  0.3004,  0.5660, -0.9044],
         [ 1.5429,  1.3221, -0.2933,  ...,  0.0406,  1.0603,  1.4666]],

        [[ 0.2378,  0.9952,  1.2621,  ..., -0.4334, -1.1644,  1.2082],
         [-1.0209,  0.6435,  0.4235,  ..., -0.3448, -1.0560,  1.2347],
         [-0.8158,  0.7118,  0.4110,  ...,  0.0990, -1.4833,  1.9434],
         [ 0.9857,  2.3924,  0.3819,  ...,  0.0157, -1.6300,  1.2251]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

2.3.5 Normalization layer summary:

  • Learned the role of the normalization layer:
    • It is a standard layer needed by all deep network models. As the number of layers increases, the values computed after passing through many layers may become too large or too small, which can make the learning process abnormal and the model converge very slowly. Therefore, a normalization layer is added after a certain number of layers to bring the values back into a reasonable range.

  • Learned and implemented the normalization layer class: LayerNorm
    • It has two instantiation parameters, features and eps, which respectively represent the size of word embedding features, and a small enough number.
    • Its input parameter x represents the output from the previous layer.
    • Its output is the normalized feature representation.

2.3.6 Sublayer connection structure

  • What is the sublayer connection structure:
    • As shown in the figure, a residual connection (skip connection) is used on the path from the input of each sublayer through the normalization layer, so we call this part of the structure a sublayer connection (the sublayer together with its connection structure). Each encoder layer has two sublayers, and these two sublayers plus their surrounding connection structure form two sublayer connection structures.
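
  • In one line, the sublayer connection implemented below computes

    output = x + Dropout( Sublayer( LayerNorm(x) ) )

    i.e. normalize first, run the sublayer, apply dropout, then add the residual input x. (As in The Annotated Transformer, the normalization is applied before the sublayer for code simplicity, while the paper's figure places it after the residual addition.)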

  • Sublayer connection structure diagram:
    insert image description here

  • Code analysis of sub-layer connection structure:
# The sublayer connection structure is implemented by the class SublayerConnection
class SublayerConnection(nn.Module):
    def __init__(self, size, dropout=0.1):
        """It has two input parameters, size and dropout. size is generally the word embedding dimension;
        dropout is the rate at which nodes in the model structure are randomly suppressed. Since a suppressed
        node is equivalent to a node whose output is 0, dropout can also be viewed as the rate at which entries
        of the output matrix are randomly set to zero."""
        super(SublayerConnection, self).__init__()
        # Instantiate the normalization object self.norm
        self.norm = LayerNorm(size)
        # And instantiate a self.dropout object from nn's predefined Dropout.
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, sublayer):
        """The forward logic function receives the input from the previous layer (or sublayer) as its first
        parameter, and the sublayer function contained in this sublayer connection as its second parameter."""
        # First normalize the input, then pass the result to the sublayer for processing, then apply dropout to
        # the sublayer output, randomly switching off some neurons in the network to prevent overfitting.
        # Finally there is an add operation: because of the skip connection, the input x is added to the
        # dropped-out sublayer output to form the final output of the sublayer connection.
        return x + self.dropout(sublayer(self.norm(x)))

  • Instantiation parameters:
size = 512
dropout = 0.2
head = 8
d_model = 512
  • Input parameters:
# Let x be the output of the positional encoding
x = pe_result
mask = Variable(torch.zeros(8, 4, 4))

# Assume the sublayer holds a multi-head attention layer; instantiate that class
self_attn = MultiHeadedAttention(head, d_model)

# Use a lambda to obtain a function-type sublayer
sublayer = lambda x: self_attn(x, x, x, mask)

  • Call:
sc = SublayerConnection(size, dropout)
sc_result = sc(x, sublayer)
print(sc_result)
print(sc_result.shape)

"""
tensor([[[ 14.8830,  22.4106, -31.4739,  ...,  21.0882, -10.0338,  -0.2588],
         [-25.1435,   2.9246, -16.1235,  ...,  10.5069,  -7.1007,  -3.7396],
         [  0.1374,  32.6438,  12.3680,  ..., -12.0251, -40.5829,   2.2297],
         [-13.3123,  55.4689,   9.5420,  ..., -12.6622,  23.4496,  21.1531]],

        [[ 13.3533,  17.5674, -13.3354,  ...,  29.1366,  -6.4898,  35.8614],
         [-35.2286,  18.7378, -31.4337,  ...,  11.1726,  20.6372,  29.8689],
         [-30.7627,   0.0000, -57.0587,  ...,  15.0724, -10.7196, -18.6290],
         [ -2.7757, -19.6408,   0.0000,  ...,  12.7660,  21.6843, -35.4784]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

2.3.6 Summary of sub-layer connection structure:

  • What is the sublayer connection structure:
    • As shown in the figure, a residual connection (skip connection) is used on the path from the input of each sublayer through the normalization layer, so we call this part of the structure a sublayer connection (the sublayer together with its connection structure). Each encoder layer has two sublayers, and these two sublayers plus their surrounding connection structure form two sublayer connection structures.
      insert image description here
      insert image description here

  • Learn and implement the class of sublayer connection structure: SublayerConnection
    • The input parameters of the initialization function of the class are size and dropout, which represent the word embedding size and zeroing ratio respectively.
    • Its instantiated object input parameters are x, sublayer, which respectively represent the output of the previous layer and the function representation of the sublayer.
    • Its output is the output processed through the sublayer connection structure.

2.3.7 Encoder layer

  • The role of the encoder layer:
    • As a constituent unit of the encoder, each encoder layer completes a feature extraction process for the input, that is, the encoding process.

  • The composition diagram of the encoder layer:
    insert image description here

  • Code analysis for the encoder layer:
# The encoder layer is implemented by the class EncoderLayer
class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        """Its initialization function has four parameters: size, which is the word embedding dimension and
        also serves as the size of the encoder layer; self_attn, to which we will pass an instantiated
        multi-head self-attention sublayer object (it is a self-attention mechanism); feed_forward, to which we
        will pass an instantiated feed-forward fully connected layer object; and finally the zeroing rate dropout."""
        super(EncoderLayer, self).__init__()

        # First store self_attn and feed_forward.
        self.self_attn = self_attn
        self.feed_forward = feed_forward

        # As shown in the figure, an encoder layer has two sublayer connection structures, so clone them with the clones function
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        # Store size
        self.size = size

    def forward(self, x, mask):
        """The forward function has two input parameters: x, the output from the previous layer, and the mask tensor mask."""
        # Following the flow on the left of the structure diagram, first go through the first sublayer
        # connection structure, which contains the multi-head self-attention sublayer,
        # then through the second sublayer connection structure, which contains the feed-forward fully connected layer, and return the result.
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

  • Instantiation parameters:
size = 512
head = 8
d_model = 512
d_ff = 64
x = pe_result
dropout = 0.2
self_attn = MultiHeadedAttention(head, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
mask = Variable(torch.zeros(8, 4, 4))

  • Call:
el = EncoderLayer(size, self_attn, ff, dropout)
el_result = el(x, mask)
print(el_result)
print(el_result.shape)

"""
tensor([[[ 1.9921e+01,  3.3425e+01,  6.1629e+00,  ...,  6.2678e+01,
           2.9171e+01,  1.1156e+01],
         [ 2.3325e-01,  8.0859e-02, -3.5207e+01,  ...,  3.7640e+01,
          -1.7566e+01,  1.2176e+01],
         [ 4.4443e+00,  4.8411e+01,  3.3100e+01,  ...,  1.9235e+01,
           1.5410e+01, -1.3317e-02],
         [-5.4771e+00,  2.3913e+01,  3.2659e+01,  ...,  3.3492e+01,
          -1.5238e+01, -1.7955e+01]],

        [[-1.0243e+01, -7.3980e+00,  4.0956e+01,  ...,  1.2323e+01,
           4.2930e+00,  1.0701e+01],
         [ 1.3229e+01, -1.2794e+01,  3.0340e+01,  ...,  7.0137e-01,
          -1.1191e+01, -1.1884e+01],
         [-1.0737e+01,  1.7524e+01,  3.9035e+01,  ..., -1.7176e+01,
          -2.2373e+01, -2.2339e+00],
         [-1.0236e+01,  2.0050e+01, -3.4379e+01,  ...,  3.6887e+01,
           7.5028e+00,  1.4658e+01]]], grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

2.3.7 Encoder layer summary:

  • Learned the role of the encoder layer:
    • As a constituent unit of the encoder, each encoder layer completes a feature extraction process for the input, that is, the encoding process.

  • Learned and implemented the encoder layer class: EncoderLayer
    • The initialization function of the class has four parameters: size, which is actually the word embedding dimension and also serves as the size of the encoder layer; self_attn, to which we pass the instantiated multi-head self-attention sublayer object (a self-attention mechanism); feed_forward, to which we pass the instantiated feed-forward fully connected layer object; and finally the zeroing rate dropout.
    • There are 2 input parameters for the instantiated object, x represents the output from the previous layer, and mask represents the mask tensor.
    • Its output represents the feature representation of the entire encoding layer.

2.3.8 Encoder

  • The role of the encoder:
    • The encoder performs the specified feature extraction process on the input, also called encoding; it is formed by stacking N encoder layers.

  • The structure diagram of the encoder:
    insert image description here

  • Code analysis for encoders:
# The encoder is implemented by the class Encoder
class Encoder(nn.Module):
    def __init__(self, layer, N):
        """The two parameters of the initialization function are the encoder layer and the number of encoder layers"""
        super(Encoder, self).__init__()
        # First use the clones function to clone N encoder layers and store them in self.layers
        self.layers = clones(layer, N)
        # Then initialize a normalization layer, which will be used at the very end of the encoder.
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        """The forward function's inputs are the same as the encoder layer's: x is the output of the previous layer and mask is the mask tensor"""
        # Loop over the cloned encoder layers; each iteration produces a new x.
        # This loop is equivalent to passing x through N encoder layers.
        # Finally process the result with the normalization object self.norm and return it.
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

  • Instantiation parameters:
# The first instantiation parameter is layer, an instantiated encoder layer object, so the encoder layer's parameters must be passed in.
# Since the sublayers inside an encoder layer are not shared, each object must be deep-copied.
size = 512
head = 8
d_model = 512
d_ff = 64
c = copy.deepcopy
attn = MultiHeadedAttention(head, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
dropout = 0.2
layer = EncoderLayer(size, c(attn), c(ff), dropout)

# N, the number of encoder layers in the encoder
N = 8
mask = Variable(torch.zeros(8, 4, 4))

  • Call:
en = Encoder(layer, N)
en_result = en(x, mask)
print(en_result)
print(en_result.shape)

"""
tensor([[[ 0.9845,  1.3802,  0.0177,  ...,  2.7576,  1.1826,  0.5717],
         [ 0.0362, -0.1015, -1.6067,  ...,  1.4773, -0.8498,  0.6493],
         [ 0.3246,  1.7864,  1.2121,  ...,  0.9372,  0.5335,  0.1605],
         [-0.2135,  0.8947,  1.2349,  ...,  1.3519, -0.9314, -0.5688]],

        [[-0.6183, -0.1443,  1.6978,  ...,  0.5230, -0.0546,  0.7401],
         [ 0.3202, -0.5143,  0.9937,  ..., -0.0110, -0.6322, -0.3755],
         [-0.5319,  0.7588,  1.5268,  ..., -0.8102, -1.0899,  0.0276],
         [-0.5355,  0.9336, -1.6778,  ...,  1.6370,  0.2085,  0.7601]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

2.4 Partial implementation of the decoder

  • Decoder part:
    • Stacked by N decoder layers
    • Each decoder layer consists of three sublayer connection structures
    • The first sublayer connection structure consists of a multi-head self-attention sublayer, a normalization layer and a residual connection
    • The second sublayer connection structure consists of a multi-head attention sublayer, a normalization layer and a residual connection
    • The third sublayer connection structure consists of a feed-forward fully connected sublayer, a normalization layer and a residual connection
      insert image description here

  • Note:
    • The components of the decoder layer, such as the multi-head attention mechanism, the normalization layer, the feed-forward fully connected network and the sublayer connection structure, are the same as their implementations in the encoder, so they can be reused directly here to build the decoder layer.

2.4.1 Decoder layer

  • The role of the decoder layer:
    • As the constituent unit of the decoder, each decoder layer performs feature extraction on its given input, oriented toward the target output, i.e. the decoding process.

  • The code implementation of the decoder layer:
# The decoder layer is implemented by the class DecoderLayer
class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        """The initialization function has five parameters: size, the word embedding dimension, which also
        serves as the size of the decoder layer; self_attn, a multi-head self-attention object, meaning this
        attention mechanism requires Q=K=V; src_attn, a regular multi-head attention object, where Q!=K=V;
        feed_forward, the feed-forward fully connected layer object; and finally the dropout zeroing rate."""
        super(DecoderLayer, self).__init__()
        # In the initialization function, mainly store these inputs on the class
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        # As in the structure diagram, clone three sublayer connection objects with the clones function.
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, source_mask, target_mask):
        """The forward function has four parameters: x, the input from the previous layer;
        memory, the semantic memory from the encoder; and the source-data and target-data mask tensors."""
        # Denote memory as m for convenience
        m = memory

        # Pass x into the first sublayer structure. Its inputs are x and the self_attn function; because this is
        # self-attention, Q, K and V are all x. The last argument is the target-data mask tensor: the target data
        # must be masked because at this point the model may not yet have generated any target data.
        # For example, when the decoder is about to generate the first token, we have in fact already fed in the
        # first token in order to compute the loss, but we do not want the model to use this information when
        # generating it; likewise, when generating the second token, the model may only use the first token's
        # information, and the second token and everything after it must not be available to the model.
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, target_mask))

        # Then enter the second sublayer, a regular attention mechanism: q is the input x, while k and v are the
        # encoder output memory. source_mask is also passed in, but the reason for masking the source data is not
        # to prevent information leakage; it is to mask out attention values produced by tokens that are
        # meaningless for the result, improving model performance and training speed. This completes the second sublayer.
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, source_mask))

        # The last sublayer is the feed-forward fully connected sublayer; after it the result can be returned.
        # This is our decoder layer structure.
        return self.sublayer[2](x, self.feed_forward)

  • Instantiation parameters:
# The class's instantiation parameters are similar to the encoder layer's, with the extra src_attn, which is the same class as self_attn.
head = 8
size = 512
d_model = 512
d_ff = 64
dropout = 0.2
self_attn = src_attn = MultiHeadedAttention(head, d_model, dropout)

# The feed-forward fully connected layer is also the same as before
ff = PositionwiseFeedForward(d_model, d_ff, dropout)

  • Input parameters:
# x is the word embedding representation of the target data; its form is the same as that of the source data, so pe_result is used as a stand-in here.
x = pe_result

# memory is the output of the encoder
memory = en_result

# In practice source_mask and target_mask are not the same; here, for ease of computation, both are set to mask
mask = Variable(torch.zeros(8, 4, 4))
source_mask = target_mask = mask

  • Call:
dl = DecoderLayer(size, self_attn, src_attn, ff, dropout)
dl_result = dl(x, memory, source_mask, target_mask)
print(dl_result)
print(dl_result.shape)

"""
tensor([[[ 4.4165e+00,  3.9459e+01, -1.8993e+01,  ...,  2.8640e+01,
           4.5007e+01, -2.6191e+00],
         [-3.1433e+01, -3.3698e-02,  9.7387e+00,  ..., -3.1304e+01,
          -2.6359e+01,  2.5803e+01],
         [ 1.0853e-01, -8.5879e+00, -3.1416e+01,  ...,  9.3420e+00,
           9.9350e+00, -4.9279e+01],
         [-2.7589e+00, -1.2277e+01, -1.7585e+01,  ...,  5.6965e+00,
           2.1629e+00, -3.1873e+00]],

        [[ 1.5855e+01, -2.0839e+01,  2.7341e+01,  ..., -2.9785e+01,
           5.4703e+01, -5.4269e+01],
         [-2.5164e+01,  2.5174e+01,  4.4220e+01,  ...,  6.1001e+01,
           1.7924e+01,  1.7182e+01],
         [-1.4131e-01, -9.9527e-02, -1.1019e+01,  ...,  3.1327e+01,
           1.4808e+01, -3.9333e+00],
         [-2.6996e+01, -2.0339e+01,  7.2715e-02,  ..., -1.0746e+01,
           2.1724e+01,  9.0271e+00]]], grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

2.4.1 Decoder layer summary:

  • Learned the role of the decoder layer:
    • As the building block of the decoder, each decoder layer performs feature extraction on its input toward the target sequence, i.e., the decoding process.

  • Learned and implemented the decoder layer class: DecoderLayer
    • The initialization function of the class has 5 parameters: size, the word embedding dimension, which also serves as the size of the decoder layer; self_attn, a multi-head self-attention object, meaning this attention mechanism requires Q=K=V; src_attn, a multi-head attention object where Q!=K=V; the feed-forward fully connected layer object; and finally the dropout zero-setting rate.
    • The forward function has 4 parameters: the input x from the previous layer, the semantic storage variable memory from the encoder, and the source-data and target-data mask tensors.
    • The final output is the result of feature extraction over the encoder output and the target data.

2.4.2 Decoder

  • The role of the decoder:
    • Based on the encoder's output and the previous predictions, the decoder produces a feature representation of the next possible "value" (token).

  • Code analysis of the decoder:
# The Decoder class implements the decoder
class Decoder(nn.Module):
    def __init__(self, layer, N):
        """The initialization function takes two parameters: layer, a decoder layer,
        and N, the number of decoder layers."""
        super(Decoder, self).__init__()
        # First clone N copies of layer with the clones method, then instantiate a normalization layer,
        # because after the data has passed through all decoder layers it must be normalized one last time.
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, source_mask, target_mask):
        """The forward function takes 4 parameters: x, the embedded representation of the target data;
        memory, the output of the encoder; and source_mask and target_mask, the mask tensors
        of the source data and the target data."""

        # Loop over the layers: x is processed by each layer in turn to produce the final result,
        # which is then normalized once more and returned.
        for layer in self.layers:
            x = layer(x, memory, source_mask, target_mask)
        return self.norm(x)
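
  • A quick standalone check (illustrative only; the helper below mirrors the clones function used above) that the N cloned layers do not share parameters: modifying one clone leaves the others untouched.
import copy
import torch
from torch import nn

def clones(module, N):
    # Same idea as the clones helper used above: deep-copied, independent modules.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

layers = clones(nn.Linear(4, 4), 2)
with torch.no_grad():
    layers[0].weight.zero_()

print(layers[0].weight.abs().sum().item())   # 0.0 -- only the first clone was changed
print(layers[1].weight.abs().sum().item())   # non-zero: the second clone keeps its own weights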

  • Instantiation parameters:
# The two arguments are the decoder layer `layer` and the number of decoder layers N
size = 512
d_model = 512
head = 8
d_ff = 64
dropout = 0.2
c = copy.deepcopy
attn = MultiHeadedAttention(head, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
layer = DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout)
N = 8

  • Input parameters:
# The input parameters are the same as those of the decoder layer
x = pe_result
memory = en_result
mask = Variable(torch.zeros(8, 4, 4))
source_mask = target_mask = mask

  • Call:
de = Decoder(layer, N)
de_result = de(x, memory, source_mask, target_mask)
print(de_result)
print(de_result.shape)

"""
tensor([[[-0.0100,  1.6852, -0.7575,  ...,  1.0972,  1.6985, -0.1922],
         [-1.3588, -0.1017,  0.3757,  ..., -1.3335, -1.5396,  0.9734],
         [-0.2805, -0.2980, -1.4023,  ...,  0.2735,  0.0846, -2.1428],
         [-0.1335, -0.3747, -0.5711,  ...,  0.1190, -0.0894, -0.0129]],

        [[ 0.7397, -0.9673,  1.0315,  ..., -1.4146,  2.1222, -2.3236],
         [-0.9619,  0.8483,  1.7075,  ...,  2.2987,  0.6200,  0.5135],
         [ 0.1447,  0.0146, -0.5066,  ...,  1.0641,  0.3366, -0.3634],
         [-0.8395, -0.8780, -0.1027,  ..., -0.4926,  0.7596,  0.2395]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
"""

2.4.2 Decoder summary

  • Learned the role of the decoder:
    • Based on the encoder's output and the previous predictions, the decoder produces a feature representation of the next possible "value" (token).

  • Learned and implemented the decoder class: Decoder
    • The initialization function of the class has two parameters: the decoder layer `layer` and the number of decoder layers N.
    • The forward function has 4 parameters: x, the embedded representation of the target data; memory, the output of the encoder; and source_mask and target_mask, the mask tensors of the source data and the target data.
    • The output is the final feature representation of the decoding process.

2.5 Output Part Implementation

  • The output section includes
    • linear layer
    • softmax layer
      insert image description here

  • The role of the linear layer:
    • Through a linear transformation of the previous step's output, it produces an output of the specified dimension; that is, it transforms the dimension.
  • The role of the softmax layer:
    • It scales the values in the last dimension of the vector to probabilities in the range 0-1 that sum to 1.

  • Code analysis of linear layer and softmax layer:
# The nn.functional toolkit contains layers that only perform computation and have no parameters
import torch.nn.functional as F

# The linear layer and the softmax computation are implemented together, since their common goal
# is to produce the final result; the class is therefore named Generator.
class Generator(nn.Module):
    def __init__(self, d_model, vocab_size):
        """The initialization function takes two parameters: d_model, the word embedding dimension,
        and vocab_size, the vocabulary size."""
        super(Generator, self).__init__()
        # First instantiate nn's predefined linear layer as self.project for later use.
        # This linear layer takes the two parameters passed into the initialization function:
        # d_model and vocab_size.
        self.project = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        """The input of the forward function is the output tensor x of the previous layer."""
        # First apply the linear transformation self.project obtained above to x,
        # then apply the log_softmax already implemented in F.
        # log_softmax is used here because of the loss-function implementation used with this
        # version of PyTorch; this may differ in other versions.
        # log_softmax simply takes the logarithm of the softmax result; since the logarithm is
        # monotonically increasing, it does not change which probability is the largest,
        # so the result can be returned directly.
        return F.log_softmax(self.project(x), dim=-1)

  • nn.Linear demo:
m = nn.Linear(20, 30)
input = torch.randn(128, 20)
output = m(input)
print(output.size())

"""
torch.Size([128, 30])
"""

  • Instantiation parameters:
# The word embedding dimension is 512
d_model = 512

# The vocabulary size is 1000
vocab_size = 1000

  • Input parameters:
# The input x is the output of the previous network layer; here we use the output of the decoder
x = de_result

  • Call:
gen = Generator(d_model, vocab_size)
gen_result = gen(x)
print(gen_result)
print(gen_result.shape)

"""
tensor([[[-6.2595, -7.7115, -6.7399,  ..., -6.2820, -6.5677, -7.8326],
         [-7.5693, -7.0031, -8.5070,  ..., -7.1465, -6.5088, -7.0358],
         [-6.6488, -6.7498, -8.0701,  ..., -7.7222, -6.7835, -6.9165],
         [-7.1597, -6.9232, -7.2582,  ..., -6.2172, -6.8709, -6.6347]],

        [[-6.9703, -7.5174, -8.0617,  ..., -6.7344, -7.5092, -6.8835],
         [-6.6768, -6.9989, -7.0754,  ..., -7.2584, -6.9826, -7.2480],
         [-7.2625, -6.5267, -7.1707,  ..., -7.4458, -6.8468, -8.1583],
         [-7.2854, -7.1081, -7.2286,  ..., -7.0400, -8.2852, -7.4947]]],
       grad_fn=<LogSoftmaxBackward0>)
torch.Size([2, 4, 1000])
"""

2.5 Output part implementation summary:

  • Learned that the output part contains:
    • linear layer
    • softmax layer

  • The role of the linear layer:
    • Through a linear transformation of the previous step's output, it produces an output of the specified dimension; that is, it transforms the dimension.

  • The role of the softmax layer:
    • It scales the values in the last dimension of the vector to probabilities in the range 0-1 that sum to 1.

  • Learned and implemented the class of linear layer and softmax layer: Generator
    • There are two input parameters for the initialization function, d_model represents the word embedding dimension, and vocab_size represents the vocabulary size.
    • The forward function accepts the output of the previous layer.
    • Finally, the results processed by the linear layer and softmax layer are obtained.

2.6 Model Construction

  • learning target:
    • Master the implementation process of the encoder-decoder structure.
    • Master the construction process of the Transformer model.

  • Through the sections above we have completed the implementation of all the components; next we will implement the complete encoder-decoder structure.

  • Transformer overall architecture diagram:
    insert image description here

  • Code implementation of encoder-decoder structure:
# Build the encoder-decoder structure class
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, source_embed, target_embed, generator):
        # encoder: the encoder object
        # decoder: the decoder object
        # source_embed: the embedding function for the source data
        # target_embed: the embedding function for the target data
        # generator: the category generator object of the output part
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = source_embed
        self.tgt_embed = target_embed
        self.generator = generator

    def forward(self, source, target, source_mask, target_mask):
        # source: the source data
        # target: the target data
        # source_mask: the mask tensor of the source data
        # target_mask: the mask tensor of the target data
        return self.generator(self.decode(self.encode(source, source_mask), source_mask, target, target_mask))

    def encode(self, source, source_mask):
        return self.encoder(self.src_embed(source), source_mask)

    def decode(self, memory, source_mask, target, target_mask):
        # memory: the output tensor produced by the encoder
        return self.decoder(self.tgt_embed(target), memory, source_mask, target_mask)

  • Instantiation parameters:
vocab_size = 1000
d_model = 512
encoder = en
decoder = de
source_embed = nn.Embedding(vocab_size, d_model)
target_embed = nn.Embedding(vocab_size, d_model)
generator = gen

  • Input parameters:
# Assume the source data and the target data are identical (in practice they are not)
source = target = Variable(torch.LongTensor([[100, 2, 421, 508], [491, 998, 1, 221]]))

# Assume source_mask and target_mask are identical (in practice they are not)
source_mask = target_mask = Variable(torch.zeros(8, 4, 4))

  • Call:
ed = EncoderDecoder(encoder, decoder, source_embed, target_embed, generator)
ed_result = ed(source, target, source_mask, target_mask)
print(ed_result)
print(ed_result.shape)

"""
tensor([[[ 0.2102, -0.0826, -0.0550,  ...,  1.5555,  1.3025, -0.6296],
         [ 0.8270, -0.5372, -0.9559,  ...,  0.3665,  0.4338, -0.7505],
         [ 0.4956, -0.5133, -0.9323,  ...,  1.0773,  1.1913, -0.6240],
         [ 0.5770, -0.6258, -0.4833,  ...,  0.1171,  1.0069, -1.9030]],

        [[-0.4355, -1.7115, -1.5685,  ..., -0.6941, -0.1878, -0.1137],
         [-0.8867, -1.2207, -1.4151,  ..., -0.9618,  0.1722, -0.9562],
         [-0.0946, -0.9012, -1.6388,  ..., -0.2604, -0.3357, -0.6436],
         [-1.1204, -1.4481, -1.5888,  ..., -0.8816, -0.6497,  0.0606]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 1000])
"""

  • Next, a model for training will be constructed based on the above structure.
  • Code analysis of the Transformer model building process:
def make_model(source_vocab, target_vocab, N=6, d_model=512, d_ff=2048, head=8, dropout=0.1):
    """This function builds the model. It has 7 parameters: the total number of source features (tokens),
    the total number of target features (tokens), the number of stacked encoder and decoder layers,
    the word embedding dimension, the dimension of the transformation matrices in the feed-forward
    fully connected network, the number of heads in the multi-head attention structure,
    and the zero-setting rate dropout."""

    # First get a deep-copy function; many of the structures below need to be deep-copied
    # so that they are independent of each other and do not interfere.
    c = copy.deepcopy

    # Instantiate the multi-head attention class, obtaining the object attn.
    attn = MultiHeadedAttention(head, d_model)

    # Then instantiate the feed-forward fully connected class, obtaining the object ff.
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)

    # Instantiate the positional encoding class, obtaining the object position.
    position = PositionalEncoding(d_model, dropout)

    # Following the architecture diagram, the outermost structure is EncoderDecoder.
    # Inside EncoderDecoder are the encoder, the decoder, the ordered structure formed by the
    # source-data Embedding layer and the positional encoding, the ordered structure formed by
    # the target-data Embedding layer and the positional encoding, and the category generator layer.
    # The encoder layer contains an attention sublayer and a feed-forward fully connected sublayer;
    # the decoder layer contains two attention sublayers and a feed-forward fully connected sublayer.
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, source_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, target_vocab), c(position)),
        Generator(d_model, target_vocab)
    )

    # Once the model structure is complete, initialize its parameters, e.g. the transformation
    # matrices in the linear layers. Whenever a parameter has more than one dimension,
    # it is initialized as a matrix drawn from a (Xavier) uniform distribution.
    # xavier_uniform_ is the in-place, non-deprecated spelling of xavier_uniform.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

  • nn.init.xavier_uniform_ demo:
# The resulting values follow a uniform distribution U(-a, a)
w = torch.empty(3, 5)
w = nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))
print(w)

"""
tensor([[ 0.1434,  1.0356,  1.0044, -0.3106, -0.6682],
        [-0.4416,  0.6925,  1.1832, -0.5946, -1.2158],
        [-0.6423, -0.1506,  0.0246,  0.2574,  0.5394]])
"""

  • Input parameters:
source_vocab = 11
target_vocab = 11
N = 6

# All other parameters use their default values

  • Call:
if __name__ == '__main__':
    res = make_model(source_vocab, target_vocab, N)
    print(res)

  • Output result:
# The final model structure built according to the Transformer architecture diagram
EncoderDecoder(
  (encoder): Encoder(
    (layers): ModuleList(
      (0): EncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048)
          (w_2): Linear(in_features=2048, out_features=512)
          (dropout): Dropout(p=0.1)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (1): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
        )
      )
      (1): EncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048)
          (w_2): Linear(in_features=2048, out_features=512)
          (dropout): Dropout(p=0.1)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (1): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
        )
      )
    )
    (norm): LayerNorm(
    )
  )
  (decoder): Decoder(
    (layers): ModuleList(
      (0): DecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (src_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048)
          (w_2): Linear(in_features=2048, out_features=512)
          (dropout): Dropout(p=0.1)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (1): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (2): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
        )
      )
      (1): DecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (src_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048)
          (w_2): Linear(in_features=2048, out_features=512)
          (dropout): Dropout(p=0.1)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (1): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (2): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
        )
      )
    )
    (norm): LayerNorm(
    )
  )
  (src_embed): Sequential(
    (0): Embeddings(
      (lut): Embedding(11, 512)
    )
    (1): PositionalEncoding(
      (dropout): Dropout(p=0.1)
    )
  )
  (tgt_embed): Sequential(
    (0): Embeddings(
      (lut): Embedding(11, 512)
    )
    (1): PositionalEncoding(
      (dropout): Dropout(p=0.1)
    )
  )
  (generator): Generator(
    (project): Linear(in_features=512, out_features=11)
  )
)
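
  • As a final sanity check, a dummy batch can be pushed through the constructed model to confirm that the output has shape (batch size, target length, target vocabulary size). This is an illustrative sketch that relies on the components defined earlier in this article and reuses the same mask convention as the demos above.
model = make_model(source_vocab, target_vocab, N)

# Dummy inputs: token indices must be smaller than the vocabulary size (11 here).
source = target = Variable(torch.LongTensor([[1, 2, 3, 4], [5, 6, 7, 8]]))
source_mask = target_mask = Variable(torch.zeros(8, 4, 4))

out = model(source, target, source_mask, target_mask)
print(out.shape)   # expected: torch.Size([2, 4, 11])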
