Detailed code explanation - Transformer

Overall structure

Source code (PyTorch): https://github.com/jadore801120/attention-is-all-you-need-pytorch
Paper: Attention Is All You Need
✨✨✨It is strongly recommended to read "Detailed Attention Mechanism and Transformer" first, to understand how the Transformer works before reading the code in this article.

The overall structure of the project is as follows. The files under the transformer package are the core code for building the Transformer model; the files outside the package handle preprocessing, training and testing for the concrete translation task.
(figure: project directory structure)
This article focuses on the code in the red box (the core code for building the Transformer), which implements the architecture shown in the figure below.
Because the Transformer is reused in many other tasks, this core code is frequently ported into other projects.
(figure: Transformer architecture)

Modules.py

Modules.py mainly defines the scaled dot-product attention (the part in the red box in the figure below).
(figure: scaled dot-product attention)

Scaled dot-product attention is computed as follows:

$$\operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)\mathbf{V} \in \mathbb{R}^{n \times v}$$

ScaledDotProductAttention

# Scaled dot-product attention
class ScaledDotProductAttention(nn.Module):
    ''' Scaled Dot-Product Attention '''

    def __init__(self, temperature, attn_dropout=0.1):
        super().__init__()
        self.temperature = temperature
        self.dropout = nn.Dropout(attn_dropout)

    def forward(self, q, k, v, mask=None):
        # q:  [sz_b, n_head, len, d_q]
        # k:  [sz_b, n_head, len, d_k] -> after transpose: [sz_b, n_head, d_k, len]
        # v:  [sz_b, n_head, len, d_v]
        # usually d_q = d_k = d_v
        attn = torch.matmul(q / self.temperature, k.transpose(2, 3)) # score = QK^T / temperature
        # attn: [sz_b, n_head, len, len]
        if mask is not None: # apply the mask if one is given
            attn = attn.masked_fill(mask == 0, -1e9) # mask out invalid positions
        attn = self.dropout(F.softmax(attn, dim=-1)) # a = softmax(score), then dropout
        output = torch.matmul(attn, v) # z = a * v
        # output: [sz_b, n_head, len, d_v]
        return output, attn

The meaning of relevant parameters:

  • q, k, v are the query, key and value, corresponding to Q, K, V in the figure below; their sizes are all [sz_b,n_head,len,d_x] (d_x stands for d_q, d_k, d_v)
    • sz_b is the batch size
    • n_head is the number of attention heads
    • len is the number of words (2 in the figure below)
    • d_x is the number of features per head (64 in the figure below)
  • temperature is the scaling factor $\sqrt{d_k}$, where d_k is the per-head feature dimension; in the paper's setting d_k = 64, so the temperature is $\sqrt{64}=8$
  • mask indicates whether a mask is passed in. The Transformer uses two kinds of masks: the padding mask and the sequence mask
    (figure)

Related code interpretation:

  • attn = torch.matmul(q / self.temperature, k.transpose(2, 3)) computes the attention scores and scales them: $score = \frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}$
    k.transpose(2, 3) transposes the last two dimensions (len_k, d_k) of k.
    By the rules of matrix multiplication, attn has size [sz_b,n_head,len_q,len_k], i.e. [sz_b,n_head,2,2] in the figure below.
    (figure)

  • attn = attn.masked_fill(mask == 0, -1e9): if a mask is passed in (mask is not None), every position where the mask equals 0 has its attn value replaced with a very large negative number, -1e9.
    Why a very large negative value? The figure below shows the softmax function: when x is very negative, the softmax output approaches 0, so the attention weight at that position becomes essentially 0, which is equivalent to masking it out.
    (figure: the softmax curve)

  • attn = self.dropout(F.softmax(attn, dim=-1)) applies softmax over the last dimension of the attention scores, turning attn into a probability matrix α whose values lie in [0,1] and sum to 1 along each row,
    (figure)
    and then applies dropout after the softmax to prevent overfitting.

  • output = torch.matmul(attn, v): the final output is the product of attn and the value matrix. Its size is [sz_b,len_q,d_v] per head, i.e. [sz_b,n_head,2,64] in the figure below. Note that the output has the same size as the inputs q, k and v.
    (figure)
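
A minimal shape check of the module above (a sketch: it assumes the ScaledDotProductAttention class defined above is already in scope, and the tensor sizes are only illustrative):

import torch

sz_b, n_head, seq_len, d_k = 2, 8, 5, 64
q = torch.rand(sz_b, n_head, seq_len, d_k)
k = torch.rand(sz_b, n_head, seq_len, d_k)
v = torch.rand(sz_b, n_head, seq_len, d_k)

attn_layer = ScaledDotProductAttention(temperature=d_k ** 0.5)
attn_layer.eval()  # disable dropout for a deterministic check

# mask out the last key position for every query (0 = masked)
mask = torch.ones(sz_b, 1, seq_len, seq_len, dtype=torch.bool)
mask[..., -1] = 0

output, attn = attn_layer(q, k, v, mask=mask)
print(output.shape)       # torch.Size([2, 8, 5, 64])
print(attn.shape)         # torch.Size([2, 8, 5, 5])
print(attn[0, 0, 0, -1])  # ~0: the masked position receives almost no attention weight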

SubLayers.py

MultiHeadAttention

MultiHeadAttention defines multi-head attention together with Add & Norm (the red box in the figure below).
It can realize all three kinds of multi-head attention used in the Transformer:
1) Multi-Head Self-Attention: K, Q, V come from the same source
2) Masked Multi-Head Self-Attention: a sequence mask is passed in via the mask parameter, and K, Q, V come from the same source
3) Multi-Head Cross-Attention: K and V come from a different source than Q
(figure: Multi-Head Attention and Add & Norm in the Transformer architecture)

# Multi-head attention
class MultiHeadAttention(nn.Module):
    ''' Multi-Head Attention module '''

    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        super().__init__()

        self.n_head = n_head # number of heads
        self.d_k = d_k # dimension of the keys
        self.d_v = d_v # dimension of the values

        self.w_qs = nn.Linear(d_model, n_head * d_k, bias=False) # [sz_b,len_q,d_model]->[sz_b,len_q,n*d_k]
        self.w_ks = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_vs = nn.Linear(d_model, n_head * d_v, bias=False)
        self.fc = nn.Linear(n_head * d_v, d_model, bias=False)

        self.attention = ScaledDotProductAttention(temperature=d_k ** 0.5) # scaled dot-product attention

        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)


    def forward(self, q, k, v, mask=None):
        # raw inputs q/k/v: [sz_b,len,d_model]
        # sz_b: batch size
        # len: number of words (usually len = len_q = len_k = len_v)
        # d_model: word-embedding dimension
        # n_head: number of heads
        d_k, d_v, n_head = self.d_k, self.d_v, self.n_head
        sz_b, len_q, len_k, len_v = q.size(0), q.size(1), k.size(1), v.size(1)
        residual = q # saved for the residual connection

        q = self.w_qs(q).view(sz_b, len_q, n_head, d_k)
        # after w_qs: [sz_b,len,d_model]->[sz_b,len,n*d_k]
        # after view, split into n_head heads: [sz_b,len_q,n_head,d_k]
        k = self.w_ks(k).view(sz_b, len_k, n_head, d_k)
        v = self.w_vs(v).view(sz_b, len_v, n_head, d_v)

        # Transpose for attention dot product: b x n x lq x dv
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2) # [sz_b,n_head,len_q,d_k]

        if mask is not None:
            mask = mask.unsqueeze(1)   # add a head dimension for broadcasting
            # mask: [sz_b,len_q,len_k] -> [sz_b,1,len_q,len_k]
        q, attn = self.attention(q, k, v, mask=mask) # scaled dot-product attention
        # q: [sz_b,n_head,len_q,d_v]
        # attn: [sz_b, n_head, len_q, len_k]
        q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
        # [sz_b,len_q,n_head,d_v] -> [sz_b,len_q,n_head*d_v]
        q = self.dropout(self.fc(q)) # [sz_b,len_q,n_head*d_v] -> [sz_b,len_q,d_model]
        q += residual # residual connection
        q = self.layer_norm(q) # layer normalization
        return q, attn # q: [sz_b,len_q,d_model]  attn: [sz_b, n_head, len_q, len_k]

The meaning of relevant parameters:

  • The inputs q, k, v of the forward function are not yet the Q, K, V in the figure below; they are the contents of the green box (the raw inputs used to generate Q, K, V), each of size [sz_b,len_x,d_model]
    • sz_b: batch size
    • len_x: the number of words (len_x stands for len_q, len_k, len_v); 2 in the figure below
    • d_model: dimension of the word embeddings; 512 in the figure below
    • d_x: the per-head feature dimension (d_x stands for d_q, d_k, d_v); 64 in the figure below
      (figure)

Related code interpretation:

  • q = self.w_qs(q).view(sz_b, len_q, n_head, d_k), k = self.w_ks(k).view(sz_b, len_k, n_head, d_k) and v = self.w_vs(v).view(sz_b, len_v, n_head, d_v) produce the n_head groups of Q, K, V from the original input.
    w_qs is a Linear layer that changes the size from [sz_b,len,d_model] to [sz_b,len,n_head*d_k];
    the view call then splits the heads, giving [sz_b,len_q,n_head,d_k].
  • q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2) moves the n_head dimension to the second position, so the size becomes [sz_b,n_head,len_q,d_k].
    These first two steps implement the arrows shown in the figure below: from the original input, obtain the n_head groups of K, Q, V.

(figure: projecting the input into n_head groups of Q, K, V)

  • mask = mask.unsqueeze(1): if mask is not None, a head dimension is added to the mask (so it can broadcast later). Its size changes from [sz_b,len_q,len_k] to [sz_b,1,len_q,len_k].

  • q, attn = self.attention(q, k, v, mask=mask): the scaled dot-product attention produces q of size [sz_b,n_head,len_q,d_v] and attn of size [sz_b,n_head,len_q,len_k].
    The output q corresponds to $Z_0, Z_1, \dots, Z_7$ in the figure below.
    (figure)

  • q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1): after the transpose the size becomes [sz_b,len_q,n_head,d_v], and after the view it becomes [sz_b,len_q,n_head*d_v].
    This is equivalent to concatenating $Z_0, \dots, Z_7$ along the direction of the blue line in the figure below to obtain $Z'$.
    (figure)

  • q = self.dropout(self.fc(q)): the fc layer maps the size from [sz_b,len_q,n_head*d_v] to [sz_b,len_q,d_model].
    This step is equivalent to multiplying the concatenated $Z'$ by $W^O$ to obtain $Z$,
    followed by a dropout.

  • q += residual: adds the residual connection

  • q = self.layer_norm(q): applies layer normalization
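
A quick self-attention shape check (a sketch: it assumes the MultiHeadAttention class above is in scope and uses the paper's default sizes):

import torch

sz_b, len_q, d_model, n_head, d_k, d_v = 2, 5, 512, 8, 64, 64
x = torch.rand(sz_b, len_q, d_model)   # the same tensor plays q, k and v (self-attention)

mha = MultiHeadAttention(n_head, d_model, d_k, d_v)
out, attn = mha(x, x, x)               # no mask: plain multi-head self-attention

print(out.shape)    # torch.Size([2, 5, 512])  -> same size as the input
print(attn.shape)   # torch.Size([2, 8, 5, 5]) -> one len_q x len_k map per head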

PositionwiseFeedForward

PositionwiseFeedForward defines the Feed Forward and Add & Norm module (the red box in the figure below).
(figure: Feed Forward and Add & Norm in the Transformer architecture)

class PositionwiseFeedForward(nn.Module):
    ''' A two-feed-forward-layer module '''

    def __init__(self, d_in, d_hid, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_in, d_hid) # position-wise
        self.w_2 = nn.Linear(d_hid, d_in) # position-wise
        self.layer_norm = nn.LayerNorm(d_in, eps=1e-6)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x): # x: [sz_b,len_q,d_model]
        residual = x # saved for the residual connection
        x = self.w_2(F.relu(self.w_1(x)))
        # after w_1: [sz_b,len_q,d_hid]  after w_2: [sz_b,len_q,d_model]
        x = self.dropout(x)
        x += residual # residual connection
        x = self.layer_norm(x) # layer normalization

        return x # [sz_b,len_q,d_model]

The meaning of relevant parameters:

  • The forward input x has size [sz_b,len_q,d_model]
    • sz_b: batch size
    • len_q: the number of words
    • d_model: dimension of the word embeddings
  • d_in: the input (and output) feature dimension of the feed-forward sub-layer, equal to d_model
  • d_hid: the hidden feature dimension of the feed-forward sub-layer (2048 in the paper)

Related code interpretation:

  • x = self.w_2(F.relu(self.w_1(x))) implements the feed-forward layer, whose formula is:
    $$\operatorname{FFN}(x)=\max\left(0, xW_1+b_1\right)W_2+b_2$$
    The feed-forward layer is a two-layer network: x first goes through the linear transformation w_1, giving size [sz_b,len_q,d_hid]; then the ReLU activation; then the linear transformation w_2, bringing the size back to [sz_b,len_q,d_model].
  • x += residual: adds the residual connection
  • x = self.layer_norm(x): applies layer normalization
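
A minimal shape check (a sketch, assuming the PositionwiseFeedForward class above is in scope; d_hid=2048 follows the paper's default):

import torch

sz_b, len_q, d_model, d_hid = 2, 5, 512, 2048
x = torch.rand(sz_b, len_q, d_model)

ffn = PositionwiseFeedForward(d_in=d_model, d_hid=d_hid)
y = ffn(x)

print(y.shape)   # torch.Size([2, 5, 512]) -> the sub-layer preserves [sz_b, len_q, d_model]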

Layers.py

EncoderLayer

EncoderLayer defines one Encoder Block (the red box in the figure below).
(figure: one Encoder Block in the Transformer architecture)

# Encoder Block
class EncoderLayer(nn.Module):
    ''' Compose with two layers '''

    def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout)

    def forward(self, enc_input, slf_attn_mask=None): # q, k and v are all enc_input
        enc_output, enc_slf_attn = self.slf_attn(
            enc_input, enc_input, enc_input, mask=slf_attn_mask)  # multi-head self-attention
        enc_output = self.pos_ffn(enc_output) # feed-forward layer
        return enc_output, enc_slf_attn
        # enc_output: [sz_b,len_q,d_model]
        # enc_slf_attn: [sz_b, n_head, len_q, len_k]

The meaning of relevant parameters:

  • enc_input: the encoder input (word embedding plus positional encoding), of size [sz_b,len_q,d_model]
    • d_model: dimension of the word embeddings
    • len_q: the number of words
    • sz_b: batch size
  • d_x: the per-head feature dimension (d_x stands for d_q, d_k, d_v)
  • slf_attn_mask: the self-attention mask

Related code interpretation:

  • enc_output, enc_slf_attn = self.slf_attn(enc_input, enc_input, enc_input, mask=slf_attn_mask)
    first applies the MultiHeadAttention defined in SubLayers.py; the output keeps the size [sz_b,len_q,d_model]
  • enc_output = self.pos_ffn(enc_output) then feeds the attention output to the PositionwiseFeedForward defined in SubLayers.py; the output size is still [sz_b,len_q,d_model]
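
A sketch showing one Encoder Block end to end (it assumes the EncoderLayer class above is in scope; the sizes are the paper defaults and the mask is omitted for brevity):

import torch

sz_b, len_q = 2, 5
enc_input = torch.rand(sz_b, len_q, 512)

enc_block = EncoderLayer(d_model=512, d_inner=2048, n_head=8, d_k=64, d_v=64)
enc_output, enc_slf_attn = enc_block(enc_input)   # no mask in this sketch

print(enc_output.shape)    # torch.Size([2, 5, 512])
print(enc_slf_attn.shape)  # torch.Size([2, 8, 5, 5])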

DecoderLayer

DecoderLayer defines one Decoder Block (the red box in the figure below).
(figure: one Decoder Block in the Transformer architecture)

# Decoder Block
class DecoderLayer(nn.Module):
    ''' Compose with three layers '''

    def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.enc_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout)

    def forward(
            self, dec_input, enc_output,
            slf_attn_mask=None, dec_enc_attn_mask=None):
        # dec_input: [sz_b,len_q,d_model]
        dec_output, dec_slf_attn = self.slf_attn(
            dec_input, dec_input, dec_input, mask=slf_attn_mask) # first multi-head attention: self-attention
        # q, k and v are all dec_input
        # dec_output: [sz_b,len_q,d_model]
        # dec_slf_attn: [sz_b, n_head, len_q, len_k]

        dec_output, dec_enc_attn = self.enc_attn(
            dec_output, enc_output, enc_output, mask=dec_enc_attn_mask) # second multi-head attention: cross-attention
        # q is the output of the decoder self-attention above; k and v are the encoder output
        dec_output = self.pos_ffn(dec_output) # feed-forward network
        return dec_output, dec_slf_attn, dec_enc_attn
        # dec_output: final output of this block  [sz_b,len_q,d_model]
        # dec_slf_attn: attention scores of the first (self) attention  [sz_b, n_head, len_q, len_k]
        # dec_enc_attn: attention scores of the second (cross) attention  [sz_b, n_head, len_q, len_k]

The meaning of relevant parameters:

  • dec_input: the decoder input, of size [sz_b,len_q,d_model]
  • enc_output: the encoder output, of size [sz_b,len_q,d_model]
  • slf_attn_mask: the self-attention mask
  • dec_enc_attn_mask: the cross-attention mask

Related code interpretation:

  • dec_output, dec_slf_attn = self.slf_attn(dec_input, dec_input, dec_input, mask=slf_attn_mask) implements the first sub-layer of the decoder: Masked Multi-Head Self-Attention plus Add & Norm (the blue box in the figure below).
    q, k and v are all dec_input (the red circle in the figure below). The resulting dec_output has size [sz_b,len_q,d_model].
    (figure)
  • dec_output, dec_enc_attn = self.enc_attn(dec_output, enc_output, enc_output, mask=dec_enc_attn_mask) implements the second sub-layer of the decoder: Multi-Head Cross-Attention plus Add & Norm (the blue box in the figure below).
    q comes from dec_output (the output of the decoder's self-attention above, the green circle in the figure below); k and v come from enc_output (the encoder output, the red circle in the figure below). Because q comes from a different source than k and v, this multi-head attention is called cross-attention; when q, k and v share the same source it is called self-attention.
    The resulting dec_output still has size [sz_b,len_q,d_model].

(figure)

  • dec_output = self.pos_ffn(dec_output) passes the result through the PositionwiseFeedForward defined in SubLayers.py to obtain the final output of the block; the size is still [sz_b,len_q,d_model]
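
A sketch of one Decoder Block, showing the cross-attention shapes (it assumes the DecoderLayer class above is in scope; the encoder output is a random stand-in and the masks are omitted):

import torch

sz_b, len_dec, len_enc = 2, 4, 5
dec_input = torch.rand(sz_b, len_dec, 512)    # decoder-side representations
enc_output = torch.rand(sz_b, len_enc, 512)   # stand-in for the encoder output

dec_block = DecoderLayer(d_model=512, d_inner=2048, n_head=8, d_k=64, d_v=64)
dec_output, dec_slf_attn, dec_enc_attn = dec_block(dec_input, enc_output)

print(dec_output.shape)    # torch.Size([2, 4, 512])
print(dec_slf_attn.shape)  # torch.Size([2, 8, 4, 4])  self-attention over decoder positions
print(dec_enc_attn.shape)  # torch.Size([2, 8, 4, 5])  cross-attention: decoder queries vs. encoder keys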

Models.py

get_pad_mask

get_pad_mask implements the padding mask. Because the sequences in a batch have different lengths, they are padded to a common length, and the padded positions must be masked out.

# padding mask
def get_pad_mask(seq, pad_idx): # seq: [sz_b,len_q]   pad_idx: the index of the padding token (an int)
    return (seq != pad_idx).unsqueeze(-2) # [sz_b,1,len_q]

The meaning of relevant parameters:

  • seq: the input sequence of word indices
  • pad_idx: the index of the padding token; positions whose index equals pad_idx are padding. For example pad_idx=3

Related code interpretation:

  • (seq != pad_idx).unsqueeze(-2) generates the padding mask.
    Suppose the dictionary contains three words {0: 'a', 1: 'b', 2: 'c'} and pad_idx=3.
    For one sample in a batch, the input sentence "a b c" corresponds to the index sequence [0,1,2]; if the required length is 5, the sequence is padded to [0,1,2,3,3].
    seq != pad_idx then gives [True,True,True,False,False]; the False positions are the ones to be masked out.
    Recall the line in ScaledDotProductAttention where the mask is applied: attn = attn.masked_fill(mask == 0, -1e9), where attn has size [sz_b, n_head, len_q, len_k].
    unsqueeze(-2) inserts an extra dimension so the mask has size [sz_b,1,len_q]; inside MultiHeadAttention a head dimension is added with unsqueeze(1), and the size-1 dimensions are broadcast over n_head and the query positions, while the last dimension lines up with len_k.
    Where the mask value is False, attn is filled with a very large negative number, which becomes ~0 after softmax, so those positions are effectively masked.
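
The toy example above, run directly (a sketch assuming get_pad_mask is in scope):

import torch

pad_idx = 3
seq = torch.tensor([[0, 1, 2, 3, 3]])   # "a b c" padded to length 5 with index 3

mask = get_pad_mask(seq, pad_idx)
print(mask)        # tensor([[[ True,  True,  True, False, False]]])
print(mask.shape)  # torch.Size([1, 1, 5])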

get_subsequent_mask

get_subsequent_mask generates the sequence mask.
The sequence mask prevents the decoder from seeing future information. For a sequence at time step t, the decoder's output should depend only on the outputs before t, not on those at or after t, so the information after t must be hidden. This matters during training, because the full target sequence is fed into the decoder at once; during inference it is not needed, since only the outputs predicted so far are available anyway.

# sequence mask
def get_subsequent_mask(seq):
    sz_b, len_s = seq.size()
    # sz_b: batch size
    # len_s: number of words in the sentence
    subsequent_mask = (1 - torch.triu(
        torch.ones((1, len_s, len_s), device=seq.device), diagonal=1)).bool()
    '''
    x = torch.ones(1, len_s, len_s): a [1,len_s,len_s] matrix of ones
    y = torch.triu(x, diagonal=1): y is strictly upper-triangular, e.g.
    0 1 1
    0 0 1
    0 0 0
    1 - y is then lower-triangular (including the diagonal):
    1 0 0
    1 1 0
    1 1 1
    and finally converted to bool
    '''
    return subsequent_mask

The meaning of relevant parameters:

  • sz_b : batch_size
  • len_s: the number of words in the input sequence

Related code interpretation:

  • x = torch.ones((1, len_s, len_s)) generates a [1, len_s, len_s] matrix of all ones.
    Assuming len_s is 3, x is
    1 1 1
    1 1 1
    1 1 1
  • y = torch.triu(x, diagonal=1) turns it into a strictly upper-triangular matrix
    0 1 1
    0 0 1
    0 0 0
  • 1 - y finally gives a lower-triangular matrix, which is the sequence mask
    1 0 0
    1 1 0
    1 1 1
    Combined with the explanation of the mask in ScaledDotProductAttention: wherever the mask value is 0, the corresponding entry of attn is set to a very large negative number and tends to 0 after softmax, so the 0 positions are masked out.

For example:
(figure: a 5×5 sequence mask, where yellow cells are 0 and green cells are 1)
When the decoder's input matrix contains the token vectors of "I have a cat" at positions 0, 1, 2, 3, 4, the mask is a 5×5 matrix. From the mask it can be seen that position 0 can only use the information of position 0, position 1 can use the information of positions 0 and 1, and so on: every position can only use the information of the positions up to and including itself.
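
A small sketch combining the two masks exactly as the Transformer's forward pass does (it assumes get_pad_mask and get_subsequent_mask are in scope; the sequence has 3 real tokens and 2 padding tokens):

import torch

pad_idx = 3
trg_seq = torch.tensor([[0, 1, 2, 3, 3]])          # 3 real tokens + 2 padding tokens

seq_mask = get_subsequent_mask(trg_seq)            # [1, 5, 5] lower-triangular
pad_mask = get_pad_mask(trg_seq, pad_idx)          # [1, 1, 5]
trg_mask = pad_mask & seq_mask                     # broadcasts to [1, 5, 5]

print(trg_mask[0].int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 0, 0]])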

PositionalEncoding

PositionalEncoding adds a positional encoding to the word embeddings fed into the encoder and the decoder (the red box in the figure below).
The Transformer uses sinusoidal positional encoding: the position codes are generated with sine and cosine functions of different frequencies and added to the word vector at the corresponding position, so the dimension of the position vector must match the dimension of the word vector.
(figure: positional encoding in the Transformer architecture)

# Positional encoding
class PositionalEncoding(nn.Module):

    def __init__(self, d_hid, n_position=200):
        super(PositionalEncoding, self).__init__()

        # Not a parameter
        self.register_buffer('pos_table', self._get_sinusoid_encoding_table(n_position, d_hid))

    def _get_sinusoid_encoding_table(self, n_position, d_hid):
        ''' Sinusoid position encoding table '''
        # TODO: make it with torch instead of numpy

        def get_position_angle_vec(position):
            return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]

        sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
        sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2j
        sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2j+1

        return torch.FloatTensor(sinusoid_table).unsqueeze(0)

    def forward(self, x):
        # x: word embeddings  [sz_b,len_q,d_model]
        # pos_table: the positional-encoding table
        return x + self.pos_table[:, :x.size(1)].clone().detach()

The meaning of relevant parameters:

  • x: the input word embeddings (Input Embedding), of size [sz_b,len_q,d_model]
    • sz_b: batch size
    • len_q: the number of words
    • d_model: dimension of the word embeddings
  • pos_table: the generated positional-encoding table

Related code interpretation:

  • _get_sinusoid_encoding_table generates the positional-encoding table.
    Suppose the input $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains the d-dimensional embeddings of the n tokens in a sequence. Positional encoding uses a position-embedding matrix of the same shape, $\mathbf{P} \in \mathbb{R}^{n \times d}$, and outputs $\mathbf{X}+\mathbf{P}$, whose elements in row $i$, columns $2j$ and $2j+1$, are:

$$
\begin{aligned}
p_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right) \\
p_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right)
\end{aligned}
$$

Here $i$ is the absolute position of the word in the sentence, $i = 0, 1, 2, \dots$ (for example, $i=2$ for "Jerry" in "Tom chase Jerry"); $d_{model}$ is the dimension of the word vector, here $d_{model}=512$; $2j$ and $2j+1$ distinguish the even and odd dimensions, with $j = 0, 1, 2, \dots, 255$ when $d_{model}=512$.
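
A quick sanity check of the table (a sketch, assuming the PositionalEncoding class above is in scope; feeding zeros exposes the raw position code):

import torch

pe = PositionalEncoding(d_hid=512, n_position=200)

x = torch.zeros(2, 10, 512)        # zero "embeddings": the output is just the position code
out = pe(x)

print(out.shape)                   # torch.Size([2, 10, 512])
print(out[0, 0, 0], out[0, 0, 1])  # position 0: sin(0) = 0 and cos(0) = 1
print(out[0, 1, 0])                # position 1, dim 0: sin(1 / 10000^0) = sin(1) ≈ 0.8415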

Encoder

Encoder implements the part in the red box in the figure below.
(figure: the encoder stack in the Transformer architecture)

# Encoder
class Encoder(nn.Module):
    ''' A encoder model with self attention mechanism. '''
    def __init__(
            self, n_src_vocab, d_word_vec, n_layers, n_head, d_k, d_v,
            d_model, d_inner, pad_idx, dropout=0.1, n_position=200, scale_emb=False):

        super().__init__()

        self.src_word_emb = nn.Embedding(n_src_vocab, d_word_vec, padding_idx=pad_idx)  # word embedding
        self.position_enc = PositionalEncoding(d_word_vec, n_position=n_position) # positional encoding
        self.dropout = nn.Dropout(p=dropout)
        self.layer_stack = nn.ModuleList([
            EncoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout)
            for _ in range(n_layers)]) # n_layers encoder blocks
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
        self.scale_emb = scale_emb
        self.d_model = d_model

    def forward(self, src_seq, src_mask, return_attns=False):

        enc_slf_attn_list = []

        # -- Forward
        enc_output = self.src_word_emb(src_seq) # word embedding: [sz_b,len_q] -> [sz_b,len_q,d_model]
        if self.scale_emb:
            enc_output *= self.d_model ** 0.5 # scale the embeddings
        enc_output = self.dropout(self.position_enc(enc_output)) # add positional encoding, then dropout
        enc_output = self.layer_norm(enc_output) # layer normalization

        for enc_layer in self.layer_stack: # N encoder blocks in series
            enc_output, enc_slf_attn = enc_layer(enc_output, slf_attn_mask=src_mask)
            enc_slf_attn_list += [enc_slf_attn] if return_attns else []
        # enc_output: [sz_b,len_q,d_model]
        # enc_slf_attn: [sz_b, n_head, len_q, len_k]
        # enc_slf_attn_list: the enc_slf_attn produced by the n encoder blocks
        if return_attns:
            return enc_output, enc_slf_attn_list
        return enc_output,

The meaning of relevant parameters:

  • src_seq: the sequence of word indices fed into the encoder
  • scale_emb: controls whether to scale the word embeddings
  • layer_stack: an nn.ModuleList of n_layers encoder blocks
  • n_src_vocab: the vocabulary size of the nn.Embedding layer
  • d_word_vec: the feature dimension of the word embeddings produced by the nn.Embedding layer, equal to d_model
  • padding_idx: word indices equal to padding_idx are mapped to an all-zero embedding

Related code interpretation:

  • enc_output = self.src_word_emb(src_seq): the Input Embedding obtained from the word-embedding layer, of size [sz_b,len_q,d_model]
  • enc_output *= self.d_model ** 0.5: if scale_emb is set, the word embeddings are scaled by $\sqrt{d_{model}}$
  • enc_output = self.dropout(self.position_enc(enc_output)): the positional encoding is added to the word embeddings, followed by a dropout
  • enc_output = self.layer_norm(enc_output): layer normalization
  • enc_output, enc_slf_attn = enc_layer(enc_output, slf_attn_mask=src_mask): the loop traverses the layer_stack ModuleList, feeding the output of each EncoderLayer into the next, i.e. n_layers Encoder Blocks in series. The final enc_output has size [sz_b,len_q,d_model]
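
A sketch running the whole encoder on a toy vocabulary (it assumes the Encoder class and get_pad_mask above are in scope; vocabulary size and sequence length are made up for illustration):

import torch

n_src_vocab, pad_idx = 100, 1
encoder = Encoder(
    n_src_vocab=n_src_vocab, d_word_vec=512, n_layers=6, n_head=8,
    d_k=64, d_v=64, d_model=512, d_inner=2048, pad_idx=pad_idx)

src_seq = torch.randint(2, n_src_vocab, (2, 7))   # [sz_b=2, len_q=7], indices avoid pad_idx
src_mask = get_pad_mask(src_seq, pad_idx)         # [2, 1, 7]

enc_output, = encoder(src_seq, src_mask)
print(enc_output.shape)   # torch.Size([2, 7, 512])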

Decoder

Decoder implements the part in the red box in the figure below.
(figure: the decoder stack in the Transformer architecture)

# Decoder
class Decoder(nn.Module):
    ''' A decoder model with self attention mechanism. '''

    def __init__(
            self, n_trg_vocab, d_word_vec, n_layers, n_head, d_k, d_v,
            d_model, d_inner, pad_idx, n_position=200, dropout=0.1, scale_emb=False):

        super().__init__()

        self.trg_word_emb = nn.Embedding(n_trg_vocab, d_word_vec, padding_idx=pad_idx) # word embedding
        self.position_enc = PositionalEncoding(d_word_vec, n_position=n_position) # positional encoding
        self.dropout = nn.Dropout(p=dropout)
        self.layer_stack = nn.ModuleList([
            DecoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout)
            for _ in range(n_layers)])
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
        self.scale_emb = scale_emb
        self.d_model = d_model

    def forward(self, trg_seq, trg_mask, enc_output, src_mask, return_attns=False):

        dec_slf_attn_list, dec_enc_attn_list = [], []

        # -- Forward
        dec_output = self.trg_word_emb(trg_seq) # word embedding: [sz_b,len_q,d_model]
        if self.scale_emb:
            dec_output *= self.d_model ** 0.5
        dec_output = self.dropout(self.position_enc(dec_output)) # add positional encoding, then dropout
        dec_output = self.layer_norm(dec_output) # layer normalization

        for dec_layer in self.layer_stack: # N decoder blocks in series
            dec_output, dec_slf_attn, dec_enc_attn = dec_layer(
                dec_output, enc_output, slf_attn_mask=trg_mask, dec_enc_attn_mask=src_mask)
            dec_slf_attn_list += [dec_slf_attn] if return_attns else []
            dec_enc_attn_list += [dec_enc_attn] if return_attns else []
        # dec_output: [sz_b,len_q,d_model]
        # dec_slf_attn: self-attention scores [sz_b, n_head, len_q, len_k]
        # dec_enc_attn: cross-attention scores [sz_b, n_head, len_q, len_k]
        # dec_slf_attn_list: the dec_slf_attn produced by the n decoder blocks
        # dec_enc_attn_list: the dec_enc_attn produced by the n decoder blocks
        if return_attns:
            return dec_output, dec_slf_attn_list, dec_enc_attn_list
        return dec_output,

The meaning of relevant parameters:

  • trg_seq: the sequence of word indices fed into the decoder
  • scale_emb: controls whether to scale the word embeddings
  • layer_stack: an nn.ModuleList of n_layers decoder blocks

Related code interpretation:

  • dec_output = self.trg_word_emb(trg_seq): the input word sequence is first embedded, giving an Input Embedding of size [sz_b,len_q,d_model]
  • dec_output *= self.d_model ** 0.5: if scale_emb is set, the word embeddings are scaled by $\sqrt{d_{model}}$
  • dec_output = self.dropout(self.position_enc(dec_output)): the positional encoding is added to the word embeddings, followed by a dropout
  • dec_output = self.layer_norm(dec_output): layer normalization
  • dec_output, dec_slf_attn, dec_enc_attn = dec_layer(dec_output, enc_output, slf_attn_mask=trg_mask, dec_enc_attn_mask=src_mask): the n_layers decoder blocks are applied in series; the final dec_output has size [sz_b,len_q,d_model]
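
A sketch running the whole decoder (it assumes Decoder, get_pad_mask and get_subsequent_mask above are in scope; the encoder output is a random stand-in and all sizes are illustrative):

import torch

n_trg_vocab, pad_idx = 100, 1
decoder = Decoder(
    n_trg_vocab=n_trg_vocab, d_word_vec=512, n_layers=6, n_head=8,
    d_k=64, d_v=64, d_model=512, d_inner=2048, pad_idx=pad_idx)

trg_seq = torch.randint(2, n_trg_vocab, (2, 6))                       # [sz_b=2, len=6]
trg_mask = get_pad_mask(trg_seq, pad_idx) & get_subsequent_mask(trg_seq)
enc_output = torch.rand(2, 7, 512)                                    # stand-in encoder output
src_mask = torch.ones(2, 1, 7, dtype=torch.bool)                      # nothing padded on the source side

dec_output, = decoder(trg_seq, trg_mask, enc_output, src_mask)
print(dec_output.shape)   # torch.Size([2, 6, 512])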

Transformer

Transformer implements the overall architecture (the content of the red box in the figure below).
(figure: the full Transformer architecture)

# Transformer
class Transformer(nn.Module):
    ''' A sequence to sequence model with attention mechanism. '''

    def __init__(
            self, n_src_vocab, n_trg_vocab, src_pad_idx, trg_pad_idx,
            d_word_vec=512, d_model=512, d_inner=2048,
            n_layers=6, n_head=8, d_k=64, d_v=64, dropout=0.1, n_position=200,
            trg_emb_prj_weight_sharing=True, emb_src_trg_weight_sharing=True,
            scale_emb_or_prj='prj'):

        super().__init__()

        self.src_pad_idx, self.trg_pad_idx = src_pad_idx, trg_pad_idx

        # In section 3.4 of paper "Attention Is All You Need", there is such detail:
        # "In our model, we share the same weight matrix between the two
        # embedding layers and the pre-softmax linear transformation...
        # In the embedding layers, we multiply those weights by \sqrt{d_model}".
        #
        # Options here:
        #   'emb': multiply \sqrt{d_model} to embedding output
        #   'prj': multiply (\sqrt{d_model} ^ -1) to linear projection output
        #   'none': no multiplication

        assert scale_emb_or_prj in ['emb', 'prj', 'none']
        scale_emb = (scale_emb_or_prj == 'emb') if trg_emb_prj_weight_sharing else False
        self.scale_prj = (scale_emb_or_prj == 'prj') if trg_emb_prj_weight_sharing else False
        self.d_model = d_model

        self.encoder = Encoder(
            n_src_vocab=n_src_vocab, n_position=n_position,
            d_word_vec=d_word_vec, d_model=d_model, d_inner=d_inner,
            n_layers=n_layers, n_head=n_head, d_k=d_k, d_v=d_v,
            pad_idx=src_pad_idx, dropout=dropout, scale_emb=scale_emb)

        self.decoder = Decoder(
            n_trg_vocab=n_trg_vocab, n_position=n_position,
            d_word_vec=d_word_vec, d_model=d_model, d_inner=d_inner,
            n_layers=n_layers, n_head=n_head, d_k=d_k, d_v=d_v,
            pad_idx=trg_pad_idx, dropout=dropout, scale_emb=scale_emb)

        self.trg_word_prj = nn.Linear(d_model, n_trg_vocab, bias=False)

        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p) 

        assert d_model == d_word_vec, \
        'To facilitate the residual connections, \
         the dimensions of all module outputs shall be the same.'

        if trg_emb_prj_weight_sharing:
            # Share the weight between target word embedding & last dense layer
            self.trg_word_prj.weight = self.decoder.trg_word_emb.weight

        if emb_src_trg_weight_sharing:
            self.encoder.src_word_emb.weight = self.decoder.trg_word_emb.weight


    def forward(self, src_seq, trg_seq):
        # src_seq: (b_sz, len_q)
        src_mask = get_pad_mask(src_seq, self.src_pad_idx) # for the input: padding mask
        trg_mask = get_pad_mask(trg_seq, self.trg_pad_idx) & get_subsequent_mask(trg_seq) # for the output: padding mask + sequence mask

        enc_output, *_ = self.encoder(src_seq, src_mask) # Encoder
        dec_output, *_ = self.decoder(trg_seq, trg_mask, enc_output, src_mask) # Decoder
        # enc_output: (b_sz,len_q,d_model)
        # dec_output: (b_sz,len_q,d_model)
        seq_logit = self.trg_word_prj(dec_output)
        #seq_logit: (b_sz,len_q,n_trg_vocab) 
        if self.scale_prj:
            seq_logit *= self.d_model ** -0.5

        return seq_logit.view(-1, seq_logit.size(2)) # (b_sz*len_q,n_trg_vocab) 

The meaning of relevant parameters:

  • src_seq: the sequence of word indices fed into the encoder
  • trg_seq: the sequence of word indices fed into the decoder
  • n_trg_vocab: the size of the target vocabulary

Related code interpretation:

  • src_mask = get_pad_mask(src_seq, self.src_pad_idx): the encoder input only needs a padding mask, which marks the padded positions of the length-aligned word sequences
  • trg_mask = get_pad_mask(trg_seq, self.trg_pad_idx) & get_subsequent_mask(trg_seq): the decoder input needs both the padding mask and the sequence mask, so that each position can only use the outputs predicted at earlier positions and cannot see the following words
  • enc_output, *_ = self.encoder(src_seq, src_mask): run the encoder first
  • dec_output, *_ = self.decoder(trg_seq, trg_mask, enc_output, src_mask): then run the decoder
  • seq_logit = self.trg_word_prj(dec_output): a linear layer maps the word-embedding dimension to the vocabulary dimension; the size changes from (b_sz,len_q,d_model) to (b_sz,len_q,n_trg_vocab), as shown in the red box below
    (figure)
  • seq_logit *= self.d_model ** -0.5: if scale_prj is true, the output seq_logit is multiplied by $\frac{1}{\sqrt{d_{model}}}$
  • seq_logit.view(-1, seq_logit.size(2)): the first two dimensions of seq_logit are merged, and the size becomes (b_sz*len_q,n_trg_vocab)
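
An end-to-end sketch of the full model (it assumes the Transformer class above is in scope; the vocabulary sizes and sequence lengths are made up, and weight sharing between the source and target embeddings is disabled because the two toy vocabularies differ in size):

import torch

n_src_vocab, n_trg_vocab, pad_idx = 100, 120, 1
model = Transformer(
    n_src_vocab=n_src_vocab, n_trg_vocab=n_trg_vocab,
    src_pad_idx=pad_idx, trg_pad_idx=pad_idx,
    emb_src_trg_weight_sharing=False)   # sharing would require n_src_vocab == n_trg_vocab

src_seq = torch.randint(2, n_src_vocab, (2, 7))   # [sz_b=2, src_len=7]
trg_seq = torch.randint(2, n_trg_vocab, (2, 6))   # [sz_b=2, trg_len=6]

seq_logit = model(src_seq, trg_seq)
print(seq_logit.shape)    # torch.Size([12, 120]) = (sz_b * trg_len, n_trg_vocab)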

Original post: blog.csdn.net/zyw2002/article/details/132252670