Overall structure
Source code address (pytorch): https://github.com/jadore801120/attention-is-all-you-need-pytorch
Paper: Attention Is All You Need
✨✨✨ It is strongly recommended to read "Detailed Attention Mechanism and Transformer" first, so that you understand how the Transformer works before reading the code in this article.
The overall structure of the project is as follows. The files under the Transformer package are the code that builds the Transformer model; the files outside the package are the preprocessing, training, and testing code that the Transformer uses to complete a specific translation task.
This article focuses on the code in the red box (the core code for building the Transformer), which implements the Transformer architecture shown in the figure below.
Because the Transformer is often used in other tasks, this core code is often ported into other projects.
Modules.py
The Modules.py file mainly defines scaled dot-product attention
(the part in the red box in the figure below).
Scaled dot-product attention is calculated as follows:
$$\operatorname{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{d}}\right) \mathbf{V} \in \mathbb{R}^{n \times v}$$
ScaledDotProductAttention
import torch
import torch.nn as nn
import torch.nn.functional as F

# Scaled dot-product attention
class ScaledDotProductAttention(nn.Module):
    ''' Scaled Dot-Product Attention '''

    def __init__(self, temperature, attn_dropout=0.1):
        super().__init__()
        self.temperature = temperature
        self.dropout = nn.Dropout(attn_dropout)

    def forward(self, q, k, v, mask=None):
        # q: [sz_b,n_head,len,d_q]
        # k: [sz_b,n_head,len,d_k] -> after transpose: [sz_b,n_head,d_k,len]
        # v: [sz_b,n_head,len,d_v]
        # in general, d_q = d_k = d_v
        attn = torch.matmul(q / self.temperature, k.transpose(2, 3))  # score = q k^T / temperature
        # attn: [sz_b,n_head,len,len]
        if mask is not None:  # a mask was passed in
            attn = attn.masked_fill(mask == 0, -1e9)  # mask out positions where mask == 0
        attn = self.dropout(F.softmax(attn, dim=-1))  # a = softmax(score), then dropout
        output = torch.matmul(attn, v)  # z = a * v
        # output: [sz_b,n_head,len,d_v]
        return output, attn
The meaning of relevant parameters:
- q, k, v: the query, key, and value, corresponding to Q, K, V in the figure below; each has size [sz_b,n_head,len,d_x] (d_x stands for d_q, d_k, d_v).
- sz_b: the batch size.
- n_head: the number of heads in multi-head attention.
- len: the number of words; in the figure below it is 2.
- d_x: the number of features per head; in the figure below it is 64.
- temperature: the scaling factor $\sqrt{d_k}$, where $d_k$ is the number of features per head (MultiHeadAttention passes temperature=d_k ** 0.5). Dividing by it keeps the dot products from growing with the feature dimension. In the figure below it is $\sqrt{64}=8$.
- mask: the optional mask. There are two kinds of masks in the Transformer, namely the padding mask and the sequence mask.
Related code interpretation:
- attn = torch.matmul(q / self.temperature, k.transpose(2, 3))
  This computes the scaled attention score $\text{score}=\frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{d}}$.
  k.transpose(2, 3) transposes the last two dimensions (len_k, d_k) of k.
  By the rules of matrix multiplication, the size of attn is [sz_b,n_head,len_q,len_k]; in the figure below it is [sz_b,n_head,2,2].
- attn = attn.masked_fill(mask == 0, -1e9)
  If a mask is passed in (the mask argument is not None), every position where the mask equals 0 has its attn value replaced with -1e9 (that is, $-10^{9}$), a very large negative number.
  Why such a large negative number? The figure below shows the softmax function. When x is very negative, the output of softmax approaches 0, so the attention weight at that position becomes 0, which is equivalent to masking it out.
- attn = self.dropout(F.softmax(attn, dim=-1))
  This applies softmax to the attention score attn over its last dimension (dim=-1), converting attn into a probability distribution matrix α whose values lie in [0,1],
  and then applies dropout after the softmax to prevent overfitting.
- output = torch.matmul(attn, v)
  The final output is the product of attn and the value v. Its size is [sz_b,n_head,len_q,d_v],
  which is [sz_b,n_head,2,64] in the figure below. When d_q = d_k = d_v and the sequence lengths match, the output has the same size as the inputs Q, K, and V.
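The steps above can be traced with a small standalone sketch. It does not use the ScaledDotProductAttention class itself; it repeats the same tensor operations on random data with the sizes from the figure (len=2, d_k=64), omits dropout, and uses a made-up padding mask that marks the second key position as padding. The mask values and the random inputs are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
sz_b, n_head, seq_len, d_k = 1, 8, 2, 64   # sizes borrowed from the figure

q = torch.randn(sz_b, n_head, seq_len, d_k)
k = torch.randn(sz_b, n_head, seq_len, d_k)
v = torch.randn(sz_b, n_head, seq_len, d_k)

# Hypothetical mask: the second key position is padding (0 = masked).
# Shape [sz_b, 1, 1, len_k] broadcasts over heads and query positions.
mask = torch.tensor([[[[1, 0]]]])

scores = torch.matmul(q / d_k ** 0.5, k.transpose(2, 3))  # [1, 8, 2, 2]
scores = scores.masked_fill(mask == 0, -1e9)               # masked keys get -1e9
attn = F.softmax(scores, dim=-1)                           # each row sums to 1
output = torch.matmul(attn, v)                             # [1, 8, 2, 64]

print(scores.shape)   # torch.Size([1, 8, 2, 2])
print(output.shape)   # torch.Size([1, 8, 2, 64])
print(attn[0, 0])     # the masked column is 0, the remaining column is 1
```

Because only one key position is left unmasked here, every query puts all of its weight on that position, which makes the effect of the -1e9 fill easy to see.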
SubLayers.py
MultiHeadAttention
MultiHeadAttention defines a multi-head attention layer together with Add & Norm (the red box in the figure below).
It can realize the following three types of multi-head attention:
1) Multi-Head Self-Attention: K, Q, V come from the same source.
2) Masked Multi-Head Self-Attention: K, Q, V come from the same source, and the sequence mask is passed in through the mask parameter.
3) Multi-Head Cross-Attention: K and V come from a different source than Q.
# Multi-head attention
class MultiHeadAttention(nn.Module):
    ''' Multi-Head Attention module '''

    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        super().__init__()
        self.n_head = n_head  # number of heads
        self.d_k = d_k  # dimension of each key (and query)
        self.d_v = d_v  # dimension of each value
        self.w_qs = nn.Linear(d_model, n_head * d_k, bias=False)  # [sz_b,len_q,d_model] -> [sz_b,len_q,n_head*d_k]
        self.w_ks = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_vs = nn.Linear(d_model, n_head * d_v, bias=False)
        self.fc = nn.Linear(n_head * d_v, d_model, bias=False)
        self.attention = ScaledDotProductAttention(temperature=d_k ** 0.5)  # scaled dot-product attention
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)

    def forward(self, q, k, v, mask=None):
        # original inputs q/k/v: [sz_b,len,d_model]
        # sz_b: batch size
        # len: number of words (for self-attention, len = len_q = len_k = len_v)
        # d_model: dimension of the word embedding (in general d_model = n_head*d_k = n_head*d_v)
        # n_head: number of heads
        d_k, d_v, n_head = self.d_k, self.d_v, self.n_head
        sz_b, len_q, len_k, len_v = q.size(0), q.size(1), k.size(1), v.size(1)
        residual = q  # kept for the residual connection
        q = self.w_qs(q).view(sz_b, len_q, n_head, d_k)
        # after w_qs: [sz_b,len,d_model] -> [sz_b,len,n_head*d_k]
        # after view, split into n_head heads: [sz_b,len_q,n_head,d_k]
        k = self.w_ks(k).view(sz_b, len_k, n_head, d_k)
        v = self.w_vs(v).view(sz_b, len_v, n_head, d_v)
        # Transpose for attention dot product: b x n x lq x dv
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)  # [sz_b,n_head,len_q,d_k]
        if mask is not None:
            mask = mask.unsqueeze(1)  # add a head dimension so the mask broadcasts over heads
            # mask: [sz_b,len_q,len_k] -> [sz_b,1,len_q,len_k]
        q, attn = self.attention(q, k, v, mask=mask)  # scaled dot-product attention
        # q: [sz_b,n_head,len_q,d_v]
        # attn: [sz_b,n_head,len_q,len_k]
        q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
        # [sz_b,len_q,n_head,d_v] -> [sz_b,len_q,n_head*d_v]
        q = self.dropout(self.fc(q))  # [sz_b,len_q,n_head*d_v] -> [sz_b,len_q,d_model]
        q += residual  # residual connection
        q = self.layer_norm(q)  # layer normalization
        return q, attn  # q: [sz_b,len_q,d_model]  attn: [sz_b,n_head,len_q,len_k]
The meaning of relevant parameters:
- q, k, v: the initial inputs of the forward function. Note that these are not the Q, K, V in the figure below, but the content of the green box in the figure below (the original input from which Q, K, V are generated). Their size is [sz_b,len_x,d_model].
- sz_b: the batch size.
- len_x: the number of words (len_x stands for len_q, len_k, len_v). In the figure below it is 2.
- d_model: the dimension of the word embeddings. In the figure below it is 512.
- d_x: the feature dimension per head (d_x stands for d_q, d_k, d_v). In the figure below it is 64.
Related code interpretation:
- q = self.w_qs(q).view(sz_b, len_q, n_head, d_k), k = self.w_ks(k).view(sz_b, len_k, n_head, d_k), and v = self.w_vs(v).view(sz_b, len_v, n_head, d_v)
  These produce the n_head groups of Q, K, V from the original input.
  w_qs is a Linear layer; it changes the size from [sz_b,len,d_model] to [sz_b,len,n_head*d_k], and the view call then reshapes it to [sz_b,len_q,n_head,d_k].
- q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
  This moves the n_head dimension to the second position, so the size becomes [sz_b,n_head,len_q,d_k].
  These first two steps carry out the arrows shown in the figure below: from the original input, obtain the n_head groups of K, Q, V.
- mask = mask.unsqueeze(1)
  If mask is not None, a head dimension is added to the mask (so that it broadcasts over the heads). The size of the mask changes from [sz_b,len_q,len_k] to [sz_b,1,len_q,len_k].
- q, attn = self.attention(q, k, v, mask=mask)
  After scaled dot-product attention, the size of the output q is [sz_b,n_head,len_q,d_v] and the size of attn is [sz_b,n_head,len_q,len_k].
  The output q is in fact $Z_0, Z_1, \ldots, Z_7$ in the figure below.
- q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
  After the transpose the size becomes [sz_b,len_q,n_head,d_v], and after the view it becomes [sz_b,len_q,n_head*d_v].
  This is equivalent to concatenating the outputs $Z_0, \ldots, Z_7$ along the direction of the blue line in the figure below to get $Z'$.
- q = self.dropout(self.fc(q))
  The result first passes through the fc layer, which changes the size from [sz_b,len_q,n_head*d_v] to [sz_b,len_q,d_model].
  This step is equivalent to multiplying $Z'$ by $W^O$ to get $Z$. It then goes through dropout.
- q += residual
  This is the residual connection.
- q = self.layer_norm(q)
  This applies layer normalization.
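The reshaping steps are the part of this module that is easiest to get wrong, so here is a minimal sketch of just the view/transpose bookkeeping. The tensor x is random data standing in for the output of w_qs (an assumption for illustration); the sizes follow the figure (n_head=8, d_k=64, d_model=512).

```python
import torch

sz_b, seq_len, n_head, d_k = 2, 3, 8, 64
d_model = n_head * d_k  # 512

x = torch.randn(sz_b, seq_len, d_model)  # stands in for the output of w_qs

# Split into heads: [sz_b, len, n_head*d_k] -> [sz_b, len, n_head, d_k] -> [sz_b, n_head, len, d_k]
heads = x.view(sz_b, seq_len, n_head, d_k).transpose(1, 2)
print(heads.shape)   # torch.Size([2, 8, 3, 64])

# Head h is simply the h-th contiguous slice of d_k features.
print(torch.equal(heads[:, 1], x[:, :, d_k:2 * d_k]))  # True

# Merge heads again: transpose back, make contiguous, then flatten the last two dims.
merged = heads.transpose(1, 2).contiguous().view(sz_b, seq_len, -1)
print(merged.shape)            # torch.Size([2, 3, 512])
print(torch.equal(merged, x))  # True
```

The contiguous() call is needed because transpose only changes strides, and view requires memory laid out in the order of the new shape.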
PositionwiseFeedForward
PositionwiseFeedForward defines a Feed Forward and Add & Norm module (the red box in the picture below).
class PositionwiseFeedForward(nn.Module):
    ''' A two-feed-forward-layer module '''

    def __init__(self, d_in, d_hid, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_in, d_hid)  # position-wise
        self.w_2 = nn.Linear(d_hid, d_in)  # position-wise
        self.layer_norm = nn.LayerNorm(d_in, eps=1e-6)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x: [sz_b,len_q,d_model]
        residual = x  # kept for the residual connection
        x = self.w_2(F.relu(self.w_1(x)))
        # after w_1: [sz_b,len_q,d_hid]  after w_2: [sz_b,len_q,d_model]
        x = self.dropout(x)
        x += residual  # residual connection
        x = self.layer_norm(x)  # layer normalization
        return x  # [sz_b,len_q,d_model]
The meaning of relevant parameters:
- x: the input of forward, with size [sz_b,len_q,d_model].
- sz_b: the batch size.
- len_q: the number of words.
- d_model: the dimension of the word embeddings.
- d_in: the input feature dimension of the feed-forward module (equal to d_model here).
- d_hid: the hidden dimension, that is, the output feature dimension of the first fully connected layer w_1.
Related code interpretation:
- x = self.w_2(F.relu(self.w_1(x)))
  This implements the feed-forward layer, whose formula is:
  $$\operatorname{FFN}(\mathrm{x})=\max\left(0, \mathrm{x}\mathrm{W}_1+\mathrm{b}_1\right)\mathrm{W}_2+\mathrm{b}_2$$
  The feed-forward layer is a two-layer neural network: x first passes through the linear transformation w_1, and its size becomes [sz_b,len_q,d_hid]; then through the ReLU nonlinear activation function; then through the linear transformation w_2, and its size becomes [sz_b,len_q,d_model].
- x += residual
  This is the residual connection.
- x = self.layer_norm(x)
  This applies layer normalization.
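The name "position-wise" means that the same two linear layers are applied to every position independently. The following sketch checks this on random data: running the network on the whole sequence gives the same result as running it on each position separately. The sizes are small illustrative values (not the paper's 512 and 2048), and dropout, the residual connection, and layer normalization are left out to keep the comparison simple.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_hid = 8, 32   # illustrative sizes only
w_1 = nn.Linear(d_model, d_hid)
w_2 = nn.Linear(d_hid, d_model)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return w_2(F.relu(w_1(x)))

x = torch.randn(2, 5, d_model)                 # [sz_b, len_q, d_model]
whole = ffn(x)                                 # all positions at once
per_position = torch.stack(                    # one position at a time
    [ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(whole.shape)   # torch.Size([2, 5, 8])
print(torch.allclose(whole, per_position, atol=1e-5))
```

nn.Linear acts on the last dimension only, which is why no explicit loop over positions appears in PositionwiseFeedForward.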
Layers.py
EncoderLayer
EncoderLayer defines an Encoder Block module (the red box in the picture below).
# Encoder Block
class EncoderLayer(nn.Module):
    ''' Compose with two layers '''

    def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout)

    def forward(self, enc_input, slf_attn_mask=None):  # q, k, v are all enc_input
        enc_output, enc_slf_attn = self.slf_attn(
            enc_input, enc_input, enc_input, mask=slf_attn_mask)  # multi-head attention
        enc_output = self.pos_ffn(enc_output)  # feed-forward layer
        # enc_output: [sz_b,len_q,d_model]
        # enc_slf_attn: [sz_b,n_head,len_q,len_k]
        return enc_output, enc_slf_attn
The meaning of relevant parameters:
- enc_input: the input of the encoder block (that is, the word embedding plus the position encoding), with size [sz_b,len_q,d_model].
- d_model: the dimension of the word embeddings.
- len_q: the number of words.
- sz_b: the batch size.
- d_x: the feature dimension per head (d_x stands for d_q, d_k, d_v).
- slf_attn_mask: the mask for self-attention.
Related code interpretation:
- enc_output, enc_slf_attn = self.slf_attn(enc_input, enc_input, enc_input, mask=slf_attn_mask)
  The input first passes through the MultiHeadAttention defined in the SubLayers.py file. The size of the output is unchanged, still [sz_b,len_q,d_model].
- enc_output = self.pos_ffn(enc_output)
  The output of MultiHeadAttention is then sent to the PositionwiseFeedForward defined in the SubLayers.py file. The size of the output is again unchanged, still [sz_b,len_q,d_model].
DecoderLayer
DecoderLayer defines a Decoder Block module (the red box in the picture below).
# Decoder Block
class DecoderLayer(nn.Module):
    ''' Compose with three layers '''

    def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
        super(DecoderLayer, self).__init__()
        self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.enc_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout)

    def forward(
            self, dec_input, enc_output,
            slf_attn_mask=None, dec_enc_attn_mask=None):
        # dec_input: [sz_b,len_q,d_model]
        dec_output, dec_slf_attn = self.slf_attn(
            dec_input, dec_input, dec_input, mask=slf_attn_mask)  # first multi-head attention: self-attention
        # q, k, v are all dec_input
        # dec_output: [sz_b,len_q,d_model]
        # dec_slf_attn: [sz_b,n_head,len_q,len_k]
        dec_output, dec_enc_attn = self.enc_attn(
            dec_output, enc_output, enc_output, mask=dec_enc_attn_mask)  # second multi-head attention: cross-attention
        # q is the output of the preceding multi-head attention in the decoder; k, v are the encoder output
        dec_output = self.pos_ffn(dec_output)  # feed-forward network
        # dec_output: final output of the decoder block [sz_b,len_q,d_model]
        # dec_slf_attn: attention weights of the first multi-head attention [sz_b,n_head,len_q,len_k]
        # dec_enc_attn: attention weights of the second multi-head attention [sz_b,n_head,len_q,len_k]
        return dec_output, dec_slf_attn, dec_enc_attn
The meaning of relevant parameters:
- dec_input: the input to the decoder, of size [sz_b,len_q,d_model].
- enc_output: the output of the encoder, of size [sz_b,len_k,d_model], where len_k is the length of the source sequence.
- slf_attn_mask: the mask for self-attention.
- dec_enc_attn_mask: the mask for cross-attention.
Related code interpretation:
- dec_output, dec_slf_attn = self.slf_attn(dec_input, dec_input, dec_input, mask=slf_attn_mask)
  This implements the first Masked Multi-Head Self-Attention and Add & Norm layer in the decoder (as shown in the blue box).
  The inputs q, k, and v are all dec_input (as shown in the red circle in the figure below). The resulting output dec_output is of size [sz_b,len_q,d_model].
- dec_output, dec_enc_attn = self.enc_attn(dec_output, enc_output, enc_output, mask=dec_enc_attn_mask)
  This implements the second Multi-Head Cross-Attention and Add & Norm layer in the decoder (as shown in the blue box in the figure below).
  The input q comes from dec_output (the output of the preceding self-attention in the decoder, the green circle in the figure below); the inputs k and v come from enc_output (the output of the encoder, the red circle in the figure below). Because q and k, v come from different sources, this multi-head attention is called Cross-Attention. When q, k, and v come from the same source, the multi-head attention is called Self-Attention.
  The resulting output dec_output is still of size [sz_b,len_q,d_model].
- dec_output = self.pos_ffn(dec_output)
  The final output of the decoder block is obtained through the PositionwiseFeedForward defined in the SubLayers.py file, and the size is still [sz_b,len_q,d_model].
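To make the difference between the two attention calls concrete, here is a simplified single-head sketch of cross-attention shapes. It is not the MultiHeadAttention module: the learned projections, dropout, and masks are omitted, and the tensors are random placeholders. The names len_trg and len_src are illustrative and do not appear in the project code.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
sz_b, len_trg, len_src, d_model = 2, 3, 5, 16   # illustrative sizes

dec_output = torch.randn(sz_b, len_trg, d_model)  # stands in for the decoder self-attention output
enc_output = torch.randn(sz_b, len_src, d_model)  # stands in for the encoder output

# Cross-attention: q comes from the decoder, k and v come from the encoder.
q, k, v = dec_output, enc_output, enc_output
attn = F.softmax(q @ k.transpose(1, 2) / math.sqrt(d_model), dim=-1)
out = attn @ v

print(attn.shape)  # torch.Size([2, 3, 5])  one row per target position, one column per source position
print(out.shape)   # torch.Size([2, 3, 16]) same length as the query sequence
```

The output length always follows the query, which is why dec_output keeps the size [sz_b,len_q,d_model] even when the source sentence has a different length.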
Models.py
get_pad_mask
get_pad_mask implements the padding mask. Because the input sequences in a batch have different lengths, they have to be padded to a common length, and the padding mask marks the padded positions so that attention ignores them.
# padding mask
def get_pad_mask(seq, pad_idx):  # seq: [sz_b,len_q]; pad_idx: int, the index of the padding token
    return (seq != pad_idx).unsqueeze(-2)  # [sz_b,1,len_q]
The meaning of relevant parameters:
- seq: the input word sequence.
- pad_idx: the index of the padding token; the word embedding at this index is filled with 0. For example, pad_idx=3.
Related code interpretation:
- (seq != pad_idx).unsqueeze(-2)
  This generates the padding mask.
  Suppose there is a dictionary containing three words {0: 'a', 1: 'b', 2: 'c'}, and pad_idx=3.
  For one batch, the input sentence "abc" corresponds to the indices [0,1,2]. If the sentence length is required to be 5, the sequence is padded to [0,1,2,3,3].
  The output of seq != pad_idx is then [True,True,True,False,False]. The False positions are the ones that get masked out.
  Going back to ScaledDotProductAttention, the mask is used in attn = attn.masked_fill(mask == 0, -1e9), where the size of attn is [sz_b,n_head,len_q,len_k].
  unsqueeze(-2) therefore inserts a singleton dimension in front of the sequence dimension, giving an output of size [sz_b,1,len]. The last dimension lines up with the key positions (len_k), and the inserted dimension broadcasts over the query positions (len_q); the head dimension is added later by mask.unsqueeze(1) in MultiHeadAttention.
  Where the mask is False, attn is filled with a very large negative number, which approaches 0 after softmax, so the attention weights at those positions are masked out.
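The "abc" example above can be run directly. The sketch below copies get_pad_mask as listed and applies the mask to a made-up all-zero score tensor (an assumption for illustration) to show how the [sz_b,1,len] mask broadcasts against [sz_b,n_head,len_q,len_k].

```python
import torch

def get_pad_mask(seq, pad_idx):
    return (seq != pad_idx).unsqueeze(-2)

seq = torch.tensor([[0, 1, 2, 3, 3]])   # "abc" padded to length 5 with pad_idx=3
mask = get_pad_mask(seq, pad_idx=3)
print(mask.shape)   # torch.Size([1, 1, 5])
print(mask)         # tensor([[[ True,  True,  True, False, False]]])

# Hypothetical attention scores [sz_b, n_head, len_q, len_k], all zero for readability.
attn = torch.zeros(1, 2, 5, 5)
masked = attn.masked_fill(mask.unsqueeze(1) == 0, -1e9)  # unsqueeze(1) adds the head dimension
print(masked[0, 0, 0])  # the last two key positions hold -1e9
```

Every query row receives the same pattern, because the padding mask only depends on which key positions are padding.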
get_subsequent_mask
get_subsequent_mask is used to generate the sequence mask.
The sequence mask prevents the decoder from seeing future information. For a sequence, at time step t the decoding output should depend only on the outputs before t and not on the outputs after t, so the information after t has to be hidden. This matters during training, because the complete target sequence is fed into the decoder at once. During prediction it is not needed, because only the outputs predicted at previous time steps are available.
# sequence mask
def get_subsequent_mask(seq):
    sz_b, len_s = seq.size()
    # sz_b: batch size
    # len_s: number of words in the sentence
    subsequent_mask = (1 - torch.triu(
        torch.ones((1, len_s, len_s), device=seq.device), diagonal=1)).bool()
    # x = torch.ones((1, len_s, len_s)): an all-ones tensor of size [1,len_s,len_s]
    # y = torch.triu(x, diagonal=1) looks like:
    #   0 1 1
    #   0 0 1
    #   0 0 0
    # 1 - y:
    #   1 0 0
    #   1 1 0
    #   1 1 1
    # which is then converted to bool values
    return subsequent_mask
The meaning of relevant parameters:
- sz_b: the batch size.
- len_s: the number of words in the input sequence.
Related code interpretation:
- x = torch.ones((1, len_s, len_s))
  This generates an all-ones matrix of size [1,len_s,len_s].
  Assuming len_s is 3, the generated x matrix is
  1 1 1
  1 1 1
  1 1 1
- y = torch.triu(x, diagonal=1)
  After this, y is a strictly upper triangular matrix
  0 1 1
  0 0 1
  0 0 0
- 1 - y
  Finally, this gives a lower triangular matrix, which is the sequence mask
  1 0 0
  1 1 0
  1 1 1
Combined with the explanation of the mask in ScaledDotProductAttention: where the mask is 0, attn is assigned a very large negative number, which tends to 0 after softmax. The positions holding 0 are therefore masked out.
For example:
the yellow rectangles in the above figure correspond to 0, and the green rectangles correspond to 1.
When the Decoder's input matrix contains the representation vectors of five tokens (0, 1, 2, 3, 4) for "I have a cat", the Mask is a 5×5 matrix. In the Mask, word 0 can only use the information of word 0, while word 1 can use the information of words 0 and 1; that is, each word can only use the information at and before its own position.
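As a concrete check, the sketch below builds the sequence mask for a short target sequence and combines it with the padding mask using &, in the same way Transformer.forward builds trg_mask. The two helper functions are copied from the listings in this article; the example sequence and pad_idx=3 are made up for illustration.

```python
import torch

def get_pad_mask(seq, pad_idx):
    return (seq != pad_idx).unsqueeze(-2)

def get_subsequent_mask(seq):
    sz_b, len_s = seq.size()
    return (1 - torch.triu(
        torch.ones((1, len_s, len_s), device=seq.device), diagonal=1)).bool()

trg_seq = torch.tensor([[0, 1, 2, 3]])       # the last position is padding (pad_idx=3)
sub_mask = get_subsequent_mask(trg_seq)      # [1, 4, 4], lower triangular
trg_mask = get_pad_mask(trg_seq, 3) & sub_mask   # [1, 1, 4] & [1, 4, 4] -> [1, 4, 4]

print(sub_mask.int())
print(trg_mask.int())
# tensor([[[1, 0, 0, 0],
#          [1, 1, 0, 0],
#          [1, 1, 1, 0],
#          [1, 1, 1, 0]]], dtype=torch.int32)
```

In the combined mask, row t allows only positions 0..t, and the padded last column is blocked for every row.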
PositionalEncoding
PositionalEncoding adds positional encoding to the word embeddings input to the encoder and decoder (as shown in the red box in the figure below).
The Transformer uses sinusoidal positional encoding. The position code is generated with sine and cosine functions of different frequencies and then added to the word vector at the corresponding position. The dimension of the position vector must match the dimension of the word vector.
import numpy as np

# Positional encoding
class PositionalEncoding(nn.Module):

    def __init__(self, d_hid, n_position=200):
        super(PositionalEncoding, self).__init__()
        # Not a parameter
        self.register_buffer('pos_table', self._get_sinusoid_encoding_table(n_position, d_hid))

    def _get_sinusoid_encoding_table(self, n_position, d_hid):
        ''' Sinusoid position encoding table '''
        # TODO: make it with torch instead of numpy
        def get_position_angle_vec(position):
            return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]

        sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
        sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2j
        sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2j+1
        return torch.FloatTensor(sinusoid_table).unsqueeze(0)

    def forward(self, x):
        # x: word embedding [sz_b,len_q,d_model]
        # pos_table: position encoding table
        return x + self.pos_table[:, :x.size(1)].clone().detach()
The meaning of relevant parameters:
- x: the input word embedding (input embedding), of size [sz_b,len_q,d_model].
- sz_b: the batch size.
- len_q: the number of words.
- d_model: the dimension of the word embeddings.
- pos_table: the generated position encoding table.
Related code interpretation:
- _get_sinusoid_encoding_table
  This function generates the position encoding table.
  Assume the input representation $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains the d-dimensional embeddings of the n tokens of a sequence. Position encoding uses a position embedding matrix $\mathbf{P} \in \mathbb{R}^{n \times d}$ of the same shape and outputs $\mathbf{X}+\mathbf{P}$. The elements of $\mathbf{P}$ in row $i$, columns $2j$ and $2j+1$, are:
  $$\begin{aligned} p_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right) \\ p_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right) \end{aligned}$$
  Here $i$ is the absolute position of the word in the sentence, $i=0,1,2,\ldots$; for example, Jerry in "Tom chase Jerry" has $i=2$. $d$ is the dimension of the word vector, $d_{model}$, where $d_{model}=512$. $2j$ and $2j+1$ distinguish the even and odd dimensions, and $j$ indexes the dimension pairs in the word vector; because $d_{model}=512$ here, $j=0,1,2,\ldots,255$.
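The formula can be evaluated with plain tensor operations. The following is a sketch of one possible torch-only version of the table (the listing above uses numpy and leaves this as a TODO); the function name sinusoid_table is hypothetical and is not part of the project code. It follows the same indexing as the listing: column hid_j uses the exponent 2 * (hid_j // 2) / d_hid, even columns take the sine, odd columns take the cosine.

```python
import math
import torch

def sinusoid_table(n_position, d_hid):
    # Hypothetical torch-only counterpart of _get_sinusoid_encoding_table.
    position = torch.arange(n_position, dtype=torch.float64).unsqueeze(1)  # [n_position, 1]
    hid_j = torch.arange(d_hid)
    exponent = (hid_j - hid_j % 2).to(torch.float64) / d_hid   # 2 * (hid_j // 2) / d_hid
    angle = position / torch.pow(10000.0, exponent)            # [n_position, d_hid]
    table = torch.zeros_like(angle)
    table[:, 0::2] = torch.sin(angle[:, 0::2])   # dim 2j
    table[:, 1::2] = torch.cos(angle[:, 1::2])   # dim 2j+1
    return table.float().unsqueeze(0)            # [1, n_position, d_hid]

table = sinusoid_table(n_position=4, d_hid=6)
print(table.shape)    # torch.Size([1, 4, 6])
print(table[0, 0])    # position 0: sin(0)=0 on even dims, cos(0)=1 on odd dims
print(table[0, 1, 0].item(), math.sin(1.0))   # p_{1,0} = sin(1 / 10000^0)
```

Computing the angles in float64 and casting at the end mirrors the numpy version, which also works in double precision before torch.FloatTensor converts the result.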
Encoder
The Encoder implements the part of the red box in the figure below.
# Encoder
class Encoder(nn.Module):
    ''' A encoder model with self attention mechanism. '''

    def __init__(
            self, n_src_vocab, d_word_vec, n_layers, n_head, d_k, d_v,
            d_model, d_inner, pad_idx, dropout=0.1, n_position=200, scale_emb=False):
        super().__init__()
        self.src_word_emb = nn.Embedding(n_src_vocab, d_word_vec, padding_idx=pad_idx)  # word embedding
        self.position_enc = PositionalEncoding(d_word_vec, n_position=n_position)  # positional encoding
        self.dropout = nn.Dropout(p=dropout)
        self.layer_stack = nn.ModuleList([
            EncoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout)
            for _ in range(n_layers)])  # n_layers encoder blocks
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
        self.scale_emb = scale_emb
        self.d_model = d_model

    def forward(self, src_seq, src_mask, return_attns=False):
        enc_slf_attn_list = []
        # -- Forward
        enc_output = self.src_word_emb(src_seq)  # word embedding: [sz_b,len_q] -> [sz_b,len_q,d_model]
        if self.scale_emb:
            enc_output *= self.d_model ** 0.5  # scale the embeddings by sqrt(d_model)
        enc_output = self.dropout(self.position_enc(enc_output))  # add positional encoding
        enc_output = self.layer_norm(enc_output)  # layer normalization
        for enc_layer in self.layer_stack:  # N encoder blocks
            enc_output, enc_slf_attn = enc_layer(enc_output, slf_attn_mask=src_mask)
            enc_slf_attn_list += [enc_slf_attn] if return_attns else []
            # enc_output: [sz_b,len_q,d_model]
            # enc_slf_attn: [sz_b,n_head,len_q,len_k]
            # enc_slf_attn_list: list of the enc_slf_attn produced by the n encoder blocks
        if return_attns:
            return enc_output, enc_slf_attn_list
        return enc_output,
The meaning of relevant parameters:
- src_seq: the original word sequence input to the encoder.
- scale_emb: controls whether the word embeddings are scaled.
- layer_stack: an nn.ModuleList composed of n_layers encoder blocks.
- n_src_vocab: the total number of words in the vocabulary of the nn.Embedding layer.
- d_word_vec: the feature dimension of the word embeddings output by the nn.Embedding layer, equal to d_model.
- padding_idx: when a word index equals padding_idx, the output word embedding is filled with 0.
Related code interpretation:
- enc_output = self.src_word_emb(src_seq)
  The Input Embedding is obtained through word embedding; its size is [sz_b,len_q,d_model].
- enc_output *= self.d_model ** 0.5
  If scale_emb is True, the word embeddings are multiplied by $\sqrt{d_{model}}$.
- enc_output = self.dropout(self.position_enc(enc_output))
  The position encoding is added to the word embedding, and the result then passes through dropout.
- enc_output = self.layer_norm(enc_output)
  The result passes through layer normalization.
- enc_output, enc_slf_attn = enc_layer(enc_output, slf_attn_mask=src_mask)
  The loop then traverses the nn.ModuleList layer_stack, feeding the output of each Encoder Block into the next one, so that the data passes through n_layers Encoder Blocks in series. The size of the final output enc_output is [sz_b,len_q,d_model].
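The first two steps of Encoder.forward can be tried in isolation. This sketch builds a freshly constructed nn.Embedding with padding_idx set, embeds the padded "abc" sequence used earlier, and applies the scaling that scale_emb=True would apply. The vocabulary size and embedding dimension are small made-up values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_src_vocab, d_word_vec, pad_idx = 4, 8, 3   # illustrative sizes

src_word_emb = nn.Embedding(n_src_vocab, d_word_vec, padding_idx=pad_idx)
src_seq = torch.tensor([[0, 1, 2, 3, 3]])    # "abc" padded to length 5

enc_output = src_word_emb(src_seq)           # [sz_b, len_q] -> [sz_b, len_q, d_word_vec]
scaled = enc_output * d_word_vec ** 0.5      # what scale_emb=True applies

print(enc_output.shape)                       # torch.Size([1, 5, 8])
print(enc_output[0, 3:].abs().sum().item())   # 0.0, the padding rows are all zeros
```

The multiplication is written out of place here; the listing uses the in-place form enc_output *= self.d_model ** 0.5, which has the same numerical effect.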
Decoder
Decoder implements the part of the red box in the figure below.
# Decoder
class Decoder(nn.Module):
    ''' A decoder model with self attention mechanism. '''

    def __init__(
            self, n_trg_vocab, d_word_vec, n_layers, n_head, d_k, d_v,
            d_model, d_inner, pad_idx, n_position=200, dropout=0.1, scale_emb=False):
        super().__init__()
        self.trg_word_emb = nn.Embedding(n_trg_vocab, d_word_vec, padding_idx=pad_idx)  # word embedding
        self.position_enc = PositionalEncoding(d_word_vec, n_position=n_position)  # positional encoding
        self.dropout = nn.Dropout(p=dropout)
        self.layer_stack = nn.ModuleList([
            DecoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout)
            for _ in range(n_layers)])
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)
        self.scale_emb = scale_emb
        self.d_model = d_model

    def forward(self, trg_seq, trg_mask, enc_output, src_mask, return_attns=False):
        dec_slf_attn_list, dec_enc_attn_list = [], []
        # -- Forward
        dec_output = self.trg_word_emb(trg_seq)  # word embedding [sz_b,len_q,d_model]
        if self.scale_emb:
            dec_output *= self.d_model ** 0.5
        dec_output = self.dropout(self.position_enc(dec_output))  # add positional encoding
        dec_output = self.layer_norm(dec_output)  # layer normalization
        for dec_layer in self.layer_stack:  # N decoder blocks
            dec_output, dec_slf_attn, dec_enc_attn = dec_layer(
                dec_output, enc_output, slf_attn_mask=trg_mask, dec_enc_attn_mask=src_mask)
            dec_slf_attn_list += [dec_slf_attn] if return_attns else []
            dec_enc_attn_list += [dec_enc_attn] if return_attns else []
            # dec_output: [sz_b,len_q,d_model]
            # dec_slf_attn: attn of the self-attention [sz_b,n_head,len_q,len_k]
            # dec_enc_attn: attn of the cross-attention [sz_b,n_head,len_q,len_k]
            # dec_slf_attn_list: list of the dec_slf_attn produced by the n decoder blocks
            # dec_enc_attn_list: list of the dec_enc_attn produced by the n decoder blocks
        if return_attns:
            return dec_output, dec_slf_attn_list, dec_enc_attn_list
        return dec_output,
The meaning of relevant parameters:
- trg_seq: the original word sequence input to the decoder.
- scale_emb: controls whether the word embeddings are scaled.
- layer_stack: an nn.ModuleList composed of n_layers decoder blocks.
Related code interpretation:
- dec_output = self.trg_word_emb(trg_seq)
  First, word embedding is applied to the input word sequence to obtain the Input Embedding, whose size is [sz_b,len_q,d_model].
- dec_output *= self.d_model ** 0.5
  If scale_emb is True, the word embeddings are multiplied by $\sqrt{d_{model}}$.
- dec_output = self.dropout(self.position_enc(dec_output))
  The position encoding is added to the word embedding, followed by dropout.
- dec_output = self.layer_norm(dec_output)
  The result passes through layer normalization.
- dec_output, dec_slf_attn, dec_enc_attn = dec_layer(dec_output, enc_output, slf_attn_mask=trg_mask, dec_enc_attn_mask=src_mask)
  The data passes through n_layers decoder blocks in series. The size of the final output dec_output is [sz_b,len_q,d_model].
Transformer
What Transformer implements is the overall architecture. (The content in the red box in the figure below)
# Transformer
class Transformer(nn.Module):
    ''' A sequence to sequence model with attention mechanism. '''

    def __init__(
            self, n_src_vocab, n_trg_vocab, src_pad_idx, trg_pad_idx,
            d_word_vec=512, d_model=512, d_inner=2048,
            n_layers=6, n_head=8, d_k=64, d_v=64, dropout=0.1, n_position=200,
            trg_emb_prj_weight_sharing=True, emb_src_trg_weight_sharing=True,
            scale_emb_or_prj='prj'):
        super().__init__()
        self.src_pad_idx, self.trg_pad_idx = src_pad_idx, trg_pad_idx
        # In section 3.4 of paper "Attention Is All You Need", there is such detail:
        # "In our model, we share the same weight matrix between the two
        # embedding layers and the pre-softmax linear transformation...
        # In the embedding layers, we multiply those weights by \sqrt{d_model}".
        #
        # Options here:
        #   'emb': multiply \sqrt{d_model} to embedding output
        #   'prj': multiply (\sqrt{d_model} ^ -1) to linear projection output
        #   'none': no multiplication
        assert scale_emb_or_prj in ['emb', 'prj', 'none']
        scale_emb = (scale_emb_or_prj == 'emb') if trg_emb_prj_weight_sharing else False
        self.scale_prj = (scale_emb_or_prj == 'prj') if trg_emb_prj_weight_sharing else False
        self.d_model = d_model
        self.encoder = Encoder(
            n_src_vocab=n_src_vocab, n_position=n_position,
            d_word_vec=d_word_vec, d_model=d_model, d_inner=d_inner,
            n_layers=n_layers, n_head=n_head, d_k=d_k, d_v=d_v,
            pad_idx=src_pad_idx, dropout=dropout, scale_emb=scale_emb)
        self.decoder = Decoder(
            n_trg_vocab=n_trg_vocab, n_position=n_position,
            d_word_vec=d_word_vec, d_model=d_model, d_inner=d_inner,
            n_layers=n_layers, n_head=n_head, d_k=d_k, d_v=d_v,
            pad_idx=trg_pad_idx, dropout=dropout, scale_emb=scale_emb)
        self.trg_word_prj = nn.Linear(d_model, n_trg_vocab, bias=False)
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
        assert d_model == d_word_vec, \
            'To facilitate the residual connections, \
            the dimensions of all module outputs shall be the same.'
        if trg_emb_prj_weight_sharing:
            # Share the weight between target word embedding & last dense layer
            self.trg_word_prj.weight = self.decoder.trg_word_emb.weight
        if emb_src_trg_weight_sharing:
            self.encoder.src_word_emb.weight = self.decoder.trg_word_emb.weight

    def forward(self, src_seq, trg_seq):
        # src_seq: (b_sz,len_q)
        src_mask = get_pad_mask(src_seq, self.src_pad_idx)  # source input: padding mask
        trg_mask = get_pad_mask(trg_seq, self.trg_pad_idx) & get_subsequent_mask(trg_seq)  # target input: padding mask + sequence mask
        enc_output, *_ = self.encoder(src_seq, src_mask)  # Encoder
        dec_output, *_ = self.decoder(trg_seq, trg_mask, enc_output, src_mask)  # Decoder
        # enc_output: (b_sz,len_q,d_model)
        # dec_output: (b_sz,len_q,d_model)
        seq_logit = self.trg_word_prj(dec_output)
        # seq_logit: (b_sz,len_q,n_trg_vocab)
        if self.scale_prj:
            seq_logit *= self.d_model ** -0.5
        return seq_logit.view(-1, seq_logit.size(2))  # (b_sz*len_q,n_trg_vocab)
The meaning of relevant parameters:
- src_seq: the original word sequence input to the encoder.
- trg_seq: the original word sequence input to the decoder.
- n_trg_vocab: the size of the target vocabulary.
Related code interpretation:
- src_mask = get_pad_mask(src_seq, self.src_pad_idx)
  For the encoder input, a padding mask is required, because the word sequences have been padded to a common length.
- trg_mask = get_pad_mask(trg_seq, self.trg_pad_idx) & get_subsequent_mask(trg_seq)
  For the decoder input, the padding mask is required for the same reason, and the sequence mask is required as well, so that each position only sees the words before it and cannot see the following words.
- enc_output, *_ = self.encoder(src_seq, src_mask)
  The data first goes through the encoder.
- dec_output, *_ = self.decoder(trg_seq, trg_mask, enc_output, src_mask)
  It then goes through the decoder.
- seq_logit = self.trg_word_prj(dec_output)
  A linear layer maps the dimension of the word embedding to the size of the vocabulary, so the size changes from (b_sz,len_q,d_model) to (b_sz,len_q,n_trg_vocab),
  as shown in the red box below.
- seq_logit *= self.d_model ** -0.5
  If scale_prj is True, the output seq_logit is multiplied by $d_{model}^{-0.5}$, that is, divided by $\sqrt{d_{model}}$.
- seq_logit.view(-1, seq_logit.size(2))
  This merges the first two dimensions of seq_logit, and the size becomes (b_sz*len_q,n_trg_vocab).
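The weight sharing in __init__ works because nn.Embedding(n_trg_vocab, d_model) and nn.Linear(d_model, n_trg_vocab) both store a weight of shape [n_trg_vocab, d_model]. The sketch below shows the sharing, the 'prj' scaling, and the final flattening on stand-alone layers. The sizes and the random dec_output are assumptions for illustration, and the layers here are not taken from a built Transformer instance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_trg_vocab, d_model = 7, 16   # illustrative sizes

trg_word_emb = nn.Embedding(n_trg_vocab, d_model)            # weight: [n_trg_vocab, d_model]
trg_word_prj = nn.Linear(d_model, n_trg_vocab, bias=False)   # weight: [n_trg_vocab, d_model]
trg_word_prj.weight = trg_word_emb.weight                    # both layers now use one Parameter

dec_output = torch.randn(2, 5, d_model)                      # stands in for the decoder output
seq_logit = trg_word_prj(dec_output) * d_model ** -0.5       # the 'prj' scaling option
flat = seq_logit.view(-1, seq_logit.size(2))                 # (b_sz*len_q, n_trg_vocab)

print(trg_word_prj.weight is trg_word_emb.weight)  # True
print(seq_logit.shape)                             # torch.Size([2, 5, 7])
print(flat.shape)                                  # torch.Size([10, 7])
```

The flattened shape gives one row of vocabulary logits per target token, which is the layout that loss functions such as cross-entropy over class logits expect.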