[Code Notes] Detailed Interpretation of Transformer Code


Introduction

Transformer is one of the most important classic models in natural language processing and in deep learning as a whole. It is the basic architecture of large-scale pre-trained language models such as BERT, GPT, and BART, so understanding it is helpful for fine-tuning and improving models built on top of it. The key points when reading the code are:

  1. Work out the overall framework of the Transformer and the details of each module. You can refer to another blog, [Study Notes] Transformer Model Interpretation.
  2. From the whole to the parts, trace the shape of the data as it flows through each module's inputs and outputs.

Other notes:

  • The source code comes from link 1 and can be obtained there;
  • The interpretation mainly follows link 2, a video walkthrough by a Bilibili uploader; some small mistakes in the uploader's comments are corrected here, and some code details that are hard for beginners to follow are explained in more detail;
  • The code in this post is also available at link 3;
  • Implementation framework: PyTorch.

1. Data preparation

This article uses a simple German-to-English machine translation demo as an example.
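
The snippets below use PyTorch and NumPy; a minimal set of imports along these lines is assumed throughout (the original source may organize them differently):

import math
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim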

1.1 Vocabulary construction

Word embedding is essentially a lookup, so a vocabulary mapping each token to its index must be built first. In real tasks, this is usually obtained directly from APIs such as the Tokenizer of the Hugging Face Transformers library.

src_vocab = {'P': 0, 'ich': 1, 'mochte': 2, 'ein': 3, 'bier': 4}
src_vocab_size = len(src_vocab)

tgt_vocab = {'P': 0, 'i': 1, 'want': 2, 'a': 3, 'beer': 4, 'S': 5, 'E': 6}
tgt_vocab_size = len(tgt_vocab)

1.2 Data construction

In a real task, you would read from a dataset and build a DataLoader. For ease of interpretation, only a toy example is constructed here.

'S' (Start) marks the start of a sentence, 'E' (End) marks the end, and 'P' (Pad) is the padding symbol.

The input text is a string; it needs to be converted into vocabulary indices and then into a tensor.

sentences = ['ich mochte ein bier P', 'S i want a beer', 'i want a beer E']

def make_batch(sentences):
    # convert the text into vocabulary indices
    input_batch = [[src_vocab[n] for n in sentences[0].split()]]
    output_batch = [[tgt_vocab[n] for n in sentences[1].split()]]
    target_batch = [[tgt_vocab[n] for n in sentences[2].split()]]
    # convert the indices into LongTensors
    return torch.LongTensor(input_batch), torch.LongTensor(output_batch), torch.LongTensor(target_batch)
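
As a quick check with the toy sentences above (the values in the comments follow directly from the vocabularies defined earlier):

enc_inputs, dec_inputs, target_batch = make_batch(sentences)
print(enc_inputs)    # tensor([[1, 2, 3, 4, 0]])  <- 'ich mochte ein bier P'
print(dec_inputs)    # tensor([[5, 1, 2, 3, 4]])  <- 'S i want a beer'
print(target_batch)  # tensor([[1, 2, 3, 4, 6]])  <- 'i want a beer E'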

2. The overall structure of the model

2.1 Hyperparameter Settings

Some important model hyperparameters include:

  • the lengths of the source and target sentences;
  • the word embedding size of the model;
  • the hidden dimension of the feed-forward (FeedForward) layers;
  • the size of the K (equal to Q in self-attention) and V vectors;
  • the number of layers in the encoder and decoder;
  • the number of heads in multi-head self-attention.
src_len = 5 # length of source
tgt_len = 5 # length of target

## Model hyperparameters
d_model = 512  # Embedding Size
d_ff = 2048  # FeedForward dimension
d_k = d_v = 64  # dimension of K (=Q) and V
n_layers = 6  # number of encoder and decoder layers
n_heads = 8  # number of heads in Multi-Head Attention

2.2 Overall Architecture

The overall Transformer network consists of three parts: the encoder, the decoder, and the output layer.


  • Process

    • The input text goes through word embedding and positional encoding, which are added together to give the final text embedding;
    • The text embedding is encoded by the Encoder; after attention weighting, the encoder outputs the encoded vectors and the self-attention weight matrices;
    • The encoded vectors, together with the ground truth of the sample, are then fed into the decoder; after attention weighting and other operations, the final context vectors are produced and mapped by a linear layer of vocabulary size to decode and generate text;
    • Finally, the logits matrix representing the predictions is returned.
  • Data shapes

    enc_inputs:[batch_size,src_len]

    dec_inputs:[batch_size,tgt_len]

    enc_outputs:[batch_size,src_len,d_model]

    enc_self_attns:[batch_size,n_heads,src_len,src_len]

    dec_outputs:[batch_size,tgt_len,d_model]

    dec_self_attns:[batch_size,n_heads,tgt_len,tgt_len]

    dec_enc_attns:[batch_size,n_heads,tgt_len,src_len]

    dec_logits:[batch_size,tgt_len,tgt_vocab_size]

class Transformer(nn.Module):
    def __init__(self):
        super(Transformer, self).__init__()
        # Encoder
        self.encoder = Encoder()
        # Decoder
        self.decoder = Decoder()
        # Output layer: d_model is the dimension of each token output by the decoder; a softmax over
        # tgt_vocab_size follows, because generating each output token is essentially a classification
        # problem over the target vocabulary
        self.projection = nn.Linear(d_model, tgt_vocab_size, bias=False)
    def forward(self, enc_inputs, dec_inputs):
        # What the forward pass takes and returns can be customized for your own task and improvements;
        # the internal computation stays the same.
        # enc_outputs is the encoder output; enc_self_attns are the attention matrices obtained after
        # softmax(Q @ K^T), representing how related each word is to the other words.
        # Since multi-head attention computes attention per head, each attention weight matrix is
        # 4-dimensional: [batch_size, n_heads, src_len, src_len]
        enc_outputs, enc_self_attns = self.encoder(enc_inputs)
        # dec_self_attns is analogous to enc_self_attns: how each word attends to the other words of the decoder input.
        # dec_enc_attns is how each decoder word attends to each encoder word, i.e. the attention weights
        # output by cross-attention, with shape [batch_size, n_heads, tgt_len, src_len].
        # Note: all of the xxx_attns are lists, because the Encoder and Decoder collect the attention
        # weights of every layer and return them
        dec_outputs, dec_self_attns, dec_enc_attns = self.decoder(dec_inputs, enc_inputs, enc_outputs)
        # project dec_outputs to the vocabulary size
        dec_logits = self.projection(dec_outputs) # dec_logits : [batch_size, tgt_len, tgt_vocab_size]
        # the view() here reshapes dec_logits to match what the CrossEntropyLoss API expects
        return dec_logits.view(-1, dec_logits.size(-1)), enc_self_attns, dec_self_attns, dec_enc_attns

2.3 Model training

model = Transformer()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

enc_inputs, dec_inputs, target_batch = make_batch(sentences)

for epoch in range(10):
    optimizer.zero_grad()
    outputs, enc_self_attns, dec_self_attns, dec_enc_attns = model(enc_inputs, dec_inputs)
    # outputs: [batch_size * tgt_len, tgt_vocab_size]
    # For this particular code, .contiguous() is not strictly necessary, since target_batch is already a contiguous tensor
    loss = criterion(outputs, target_batch.contiguous().view(-1))
    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))
    loss.backward()
    optimizer.step()
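
After training, the toy model can be tried out with a simple greedy decoding loop. This is not part of the original code; it is a rough sketch in which greedy_decode is a hypothetical helper that feeds the source sentence and grows the target one token at a time, always taking the argmax prediction:

def greedy_decode(model, enc_input, start_symbol=tgt_vocab['S'], max_len=tgt_len):
    # start from 'S' followed by padding, then fill one position per step with the argmax prediction
    dec_input = torch.LongTensor([[start_symbol] + [tgt_vocab['P']] * (max_len - 1)])
    for i in range(1, max_len):
        logits, _, _, _ = model(enc_input, dec_input)   # [batch_size * tgt_len, tgt_vocab_size]
        next_token = logits.view(-1, max_len, tgt_vocab_size)[0, i - 1].argmax().item()
        dec_input[0, i] = next_token
    return dec_input

model.eval()
with torch.no_grad():
    pred = greedy_decode(model, enc_inputs)
idx2word = {v: k for k, v in tgt_vocab.items()}
print([idx2word[i] for i in pred[0].tolist()])   # e.g. ['S', 'i', 'want', 'a', 'beer'] after training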

3. Encoder

3.1 Encoder

The encoder is formed by stacking N encoding layers.

  • Process

    • The index tensor of the input text is passed through the word embedding layer, and the result is added element-wise to the positional encoding to form the final output of the input layer;

    • The output of each layer then serves as the input of the next encoder layer; inside each layer, attention computation, the feed-forward network, residual connections, and layer normalization are applied;

    • Finally, the output of the last encoder layer and the attention weight matrices of every layer are returned.

class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        # This simply defines a word embedding matrix of size src_vocab_size * d_model
        self.src_emb = nn.Embedding(src_vocab_size, d_model)
        # Positional encoding; here it is the fixed sine/cosine version, but a learnable positional
        # embedding (an nn.Embedding, like the word embedding) could be used instead
        self.pos_emb = PositionalEncoding(d_model)
        # Stack several encoder layers with ModuleList; the word embedding and positional encoding are
        # kept outside the layers because the layers themselves do not use them
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])

    def forward(self, enc_inputs):
        # enc_inputs shape: [batch_size, src_len]
        # src_emb looks up the embedding for each index; enc_outputs: [batch_size, src_len, d_model]
        enc_outputs = self.src_emb(enc_inputs)

        # Add the positional encoding to the word embedding (implemented inside PositionalEncoding);
        # enc_outputs: [batch_size, src_len, d_model]
        enc_outputs = self.pos_emb(enc_outputs.transpose(0, 1)).transpose(0, 1)
        # get_attn_pad_mask records where the pad symbols are, so that attention can ignore them
        enc_self_attn_mask = get_attn_pad_mask(enc_inputs, enc_inputs)
        enc_self_attns = []
        for layer in self.layers:
            # The output of each layer is the input of the next; enc_outputs: [batch_size, src_len, d_model]
            enc_outputs, enc_self_attn = layer(enc_outputs, enc_self_attn_mask)
            # Collect the attention weight matrix of every layer and return them as a list;
            # enc_self_attn: [batch_size, n_heads, src_len, src_len]
            enc_self_attns.append(enc_self_attn)
        return enc_outputs, enc_self_attns

3.2 Single encoding layer

class EncoderLayer(nn.Module):
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, enc_inputs, enc_self_attn_mask):
        # enc_inputs shape: [batch_size, src_len, d_model]; note that the very first Q, K and V all equal this input
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask) # enc_inputs used as Q, K and V
        enc_outputs = self.pos_ffn(enc_outputs) # enc_outputs: [batch_size, src_len, d_model]
        return enc_outputs, attn

3.3 Padding Mask

In the attention mechanism later on, after computing $Q K^T$ and dividing it by $\sqrt{d_k}$, the matrix obtained before the softmax has shape [len_q, len_k]; it represents the influence of each word on every word (including itself). This function produces a mask of the same shape that marks which positions are the PAD symbol, so that those positions can later be set to negative infinity and the Query is prevented from attending to these meaningless PAD symbols.

Note that the matrix returned by this function has shape [batch_size, len_q, len_k]: it marks the pad symbols in K, not those in Q, because masking Q's pads is unnecessary.

seq_q and seq_k are not necessarily the same. For example, in cross-attention Q comes from the decoder and K comes from the encoder; it is enough to tell the model where the pad symbols are on the encoder side, and the decoder-side pad information is not used in the cross-attention layer.


def get_attn_pad_mask(seq_q, seq_k):
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()
    # eq(0) marks the positions of the PAD token
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)  # [batch_size, 1, len_k]; 1 (True) means masked
    # For this toy data the result is a matrix whose last n columns are 1, i.e. the last n tokens of K are PAD
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # [batch_size, len_q, len_k]
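
As a small sanity check with the toy batch from Section 1 (the expected output follows from the vocabularies defined earlier):

enc_inputs, _, _ = make_batch(sentences)          # enc_inputs: tensor([[1, 2, 3, 4, 0]])
mask = get_attn_pad_mask(enc_inputs, enc_inputs)
print(mask.shape)    # torch.Size([1, 5, 5])
print(mask[0, 0])    # tensor([False, False, False, False,  True]) -- only the PAD column of K is masked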

4. Decoder

The decoder is formed by stacking N decoder layers.

The decoder is similar to the encoder; the difference is that each decoder layer has two multi-head attention blocks. In the first one, future words must be masked; the second is a cross-attention block, which uses the output of the Encoder as K and V and the output of the preceding decoder sub-layer as Q. Its purpose is to let decoding make use of the context information produced by the encoder.

4.1 Single decoding layer

class DecoderLayer(nn.Module):
    def __init__(self):
        super(DecoderLayer, self).__init__()
        self.dec_self_attn = MultiHeadAttention()
        self.dec_enc_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, dec_inputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask):
        dec_outputs, dec_self_attn = self.dec_self_attn(dec_inputs, dec_inputs, dec_inputs, dec_self_attn_mask)
        # Cross-attention: dec_outputs is used as Q, while enc_outputs provides K and V
        dec_outputs, dec_enc_attn = self.dec_enc_attn(dec_outputs, enc_outputs, enc_outputs, dec_enc_attn_mask)
        dec_outputs = self.pos_ffn(dec_outputs)
        return dec_outputs, dec_self_attn, dec_enc_attn

4.2 Decoder

class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        self.tgt_emb = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_emb = PositionalEncoding(d_model)
        self.layers = nn.ModuleList([DecoderLayer() for _ in range(n_layers)])

    # dec_inputs : [batch_size x target_len]
    def forward(self, dec_inputs, enc_inputs, enc_outputs):
        dec_outputs = self.tgt_emb(dec_inputs)  # [batch_size, tgt_len, d_model]
        dec_outputs = self.pos_emb(dec_outputs.transpose(0, 1)).transpose(0, 1) # [batch_size, tgt_len, d_model]
        # Pad mask for the decoder self-attention; 1 means masked
        dec_self_attn_pad_mask = get_attn_pad_mask(dec_inputs, dec_inputs)

        # Look-ahead mask so that attention cannot see future words: an upper-triangular matrix where 1 means masked
        dec_self_attn_subsequent_mask = get_attn_subsequent_mask(dec_inputs)

        # Add the two masks: positions greater than 0 become 1 (True) and the rest 0 (False), so both the pad
        # positions and the future positions are masked; the 1 positions are later filled with negative infinity.
        # gt() is used because a position can be masked by both matrices at once (1 + 1 = 2)
        dec_self_attn_mask = torch.gt((dec_self_attn_pad_mask + dec_self_attn_subsequent_mask), 0)

        # Mask for cross-attention: the decoder input provides Q and the encoder input provides K,
        # so we need to know which positions in K are pad symbols.
        # Q also contains pad symbols, but there is no need to mask them.
        # No look-ahead mask is needed here either: the keys come from the encoder, so there are no future target words among them
        dec_enc_attn_mask = get_attn_pad_mask(dec_inputs, enc_inputs)

        dec_self_attns, dec_enc_attns = [], []
        for layer in self.layers:
            dec_outputs, dec_self_attn, dec_enc_attn = layer(dec_outputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask)
            dec_self_attns.append(dec_self_attn)
            dec_enc_attns.append(dec_enc_attn)
        return dec_outputs, dec_self_attns, dec_enc_attns

4.3 Sequence Mask

Mask future words so that the current word cannot see the words after it. This function marks which positions in the decoder input are future words; the resulting mask is an upper-triangular matrix.


def get_attn_subsequent_mask(seq):
    """
    seq: [batch_size, tgt_len]
    """
    attn_shape = [seq.size(0), seq.size(1), seq.size(1)]
    # attn_shape: [batch_size, tgt_len, tgt_len]
    # np.triu() keeps the upper triangle and zeros everything below the k-th diagonal (k=0 is the main
    # diagonal); with k=1 the main diagonal is zeroed as well
    subsequence_mask = np.triu(np.ones(attn_shape), k=1)  # upper-triangular matrix of ones above the diagonal
    # Convert to byte: otherwise the dtype defaults to double (float64), which wastes memory; byte is enough
    subsequence_mask = torch.from_numpy(subsequence_mask).byte()
    return subsequence_mask  # [batch_size, tgt_len, tgt_len]
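
For the toy decoder input, the look-ahead mask looks like this (a quick check; 1 marks a masked, i.e. future, position):

_, dec_inputs, _ = make_batch(sentences)          # dec_inputs: tensor([[5, 1, 2, 3, 4]])
print(get_attn_subsequent_mask(dec_inputs)[0])
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0]], dtype=torch.uint8)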

5. Positional encoding

The sinusoidal positional encoding is defined as

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where the term shared by the sine and cosine satisfies

$$e^{\frac{-2i}{d_{model}} \cdot \log 10000} = \frac{1}{10000^{\frac{2i}{d_{model}}}}$$
The positional encoding can be implemented directly from these formulas; the code below is just one possible implementation.
Note that the even and odd dimensions share a common term. Here it is computed with the exponential function e and the natural log, bringing the exponent down to make the computation easier.
pos is the absolute index of the word in the sentence. For example, with max_len = 128 the indices run 0, 1, 2, ..., 127. Assuming d_model = 512, i.e. a 512-dimensional vector encodes one position, we have 0 <= 2i < 512 and therefore 0 <= i <= 255; 2i then takes the values 0, 2, 4, ..., 510 (the even dimensions), and 2i + 1 takes 1, 3, 5, ..., 511 (the odd dimensions).
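
As a quick numerical check of the identity above (a small sketch, not part of the original code; it assumes the hyperparameters and imports defined earlier):

i2 = torch.arange(0, d_model, 2).float()                      # the values of 2i: 0, 2, ..., 510
div_term = torch.exp(i2 * (-math.log(10000.0) / d_model))     # e^{(-2i/d_model) * log 10000}
direct = 1.0 / torch.pow(10000.0, i2 / d_model)               # 1 / 10000^{2i/d_model}
print(torch.allclose(div_term, direct))                       # True (up to floating-point error)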

The final text embedding representation is obtained by adding word embedding and position encoding.


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()

        self.dropout = nn.Dropout(p=dropout)
        # Create a tensor of zeros with shape [max_len, d_model]
        pe = torch.zeros(max_len, d_model)
        # position: [max_len, 1], i.e. [5000, 1]; the extra dimension is inserted so that it can later
        # be broadcast and multiplied directly with div_term.
        # Note the shape of position here: every pos needs d_model (512) encoding values.
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # The shared term, computed via the exponential and log functions to bring the exponent down
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # position * div_term relies on broadcasting: div_term has shape [d_model/2], i.e. [256], which is
        # broadcastable, so after broadcasting both tensors behave as if they had shape [5000, 256];
        # * means element-wise multiplication.
        # Note the slicing pe[:, 0::2]: start at 0 and step by 2, i.e. assign to the even dimensions of pe
        pe[:, 0::2] = torch.sin(position * div_term)
        # Likewise, the odd dimensions
        pe[:, 1::2] = torch.cos(position * div_term)
        # After the code above, pe has shape [max_len, d_model]

        # After the line below, pe has shape [max_len, 1, d_model]
        pe = pe.unsqueeze(0).transpose(0, 1)
        # Register a buffer: simply put, this tensor is not updated during training,
        # but it is still saved as part of the model's state
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        x: [seq_len, batch_size, d_model]
        """
        # self.pe here is read from the registered buffer.
        # Slicing takes the first seq_len entries of pe's first dimension and adds them to x.
        # Broadcasting applies again: pe is [max_len, 1, d_model], and the middle dimension of size 1
        # is automatically expanded to batch_size.
        # This implements the element-wise addition of word embedding and positional encoding
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
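
A minimal usage sketch (note that this module expects its input in [seq_len, batch_size, d_model] order, which is why the Encoder and Decoder transpose before and after calling it):

pos_enc = PositionalEncoding(d_model)
x = torch.zeros(5, 2, d_model)       # [seq_len, batch_size, d_model]
print(pos_enc(x).shape)              # torch.Size([5, 2, 512])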

6. Multi-Head Attention

6.1 Multi-Head Attention Mechanism


The actual code here may differ slightly from how the mechanism is described in some explanations of the principle.

Map first, split later. That is, the input is first projected to d_k * n_heads dimensions and then split into n_heads heads by reshaping and transposing; this way there is no need to define n_heads separate parameter matrices, and no explicit concatenation of the per-head outputs is needed afterwards.

After the multi-head attention itself, the result passes through a linear output layer.
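
A shape-only sketch of "map first, split later" (the tensors here are random placeholders, not taken from the model):

x = torch.randn(2, 5, d_model)                            # [batch_size, len_q, d_model]
W_Q = nn.Linear(d_model, d_k * n_heads)                    # one projection shared by all heads
q_s = W_Q(x).view(2, -1, n_heads, d_k).transpose(1, 2)     # split into heads
print(q_s.shape)                                           # torch.Size([2, 8, 5, 64])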

class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        # W_Q, W_K, W_V are simply linear layers that project the input to Q, K, V.
        # The output size is d_k * n_heads because we project first and split into heads afterwards.
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
        self.linear = nn.Linear(n_heads * d_v, d_model)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, Q, K, V, attn_mask):
        # attn_mask: [batch_size, len_q, len_k]
        # Input shapes: Q: [batch_size, len_q, d_model], K: [batch_size, len_k, d_model],
        # V: [batch_size, len_k, d_model]
        residual, batch_size = Q, Q.size(0)
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)

        # Split into heads; note that q and k have the same per-head dimension, which is why d_k appears for both
        # q_s: [batch_size, n_heads, len_q, d_k]
        q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1,2)
        # k_s: [batch_size, n_heads, len_k, d_k]
        k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1,2)
        # v_s: [batch_size, n_heads, len_k, d_v]
        v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1,2)

        # attn_mask: [batch_size, len_q, len_k] ---> [batch_size, n_heads, len_q, len_k]
        # i.e. copy the pad information to every head so that multi-head attention can use it
        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1)

        # Compute ScaledDotProductAttention.
        # Two results are returned: context: [batch_size, n_heads, len_q, d_v],
        # attn: [batch_size, n_heads, len_q, len_k]
        context, attn = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
        # This is effectively the concatenation of the n heads: the weighted outputs of all heads are merged
        # and then passed through a linear layer; context becomes [batch_size, len_q, n_heads * d_v].
        # contiguous() is needed because transpose makes the tensor non-contiguous, and view requires a
        # contiguous tensor.
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v)
        output = self.linear(context)
        # Residual connection and LayerNorm; returns output: [batch_size, len_q, d_model] and this layer's attention weights
        return self.layer_norm(output + residual), attn

6.2 Scaled dot-product attention (ScaledDotProductAttention)

A key detail here is how the mask matrix is combined with the attention score matrix: the PAD positions of the score matrix are set to a very large negative value so that the Query is shielded from them.


class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, attn_mask):
        # Input shapes: Q: [batch_size, n_heads, len_q, d_k], K: [batch_size, n_heads, len_k, d_k],
        # V: [batch_size, n_heads, len_k, d_v]
        # matmul is matrix multiplication:
        # [batch_size, n_heads, len_q, d_k] matmul [batch_size, n_heads, d_k, len_k] -> [batch_size, n_heads, len_q, len_k]
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k)

        # masked_fill_(mask, value) fills the elements of the source tensor with value wherever the mask
        # is 1 (True); the mask must have the same shape as the tensor being filled.
        # Masked positions are set to a very large negative number, so after the softmax their weights
        # are close to 0 and Q ignores them
        scores.masked_fill_(attn_mask, -1e9)
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)
        # context: [batch_size, n_heads, len_q, d_v]
        # attn: [batch_size, n_heads, len_q, len_k]
        return context, attn
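
A tiny illustration of the mask + softmax interaction (made-up scores, not from the model): a position filled with -1e9 receives essentially zero attention weight after the softmax.

scores = torch.zeros(1, 1, 1, 5)                                   # fake scores for a single query
mask = torch.tensor([[[[False, False, False, False, True]]]])      # last key position is PAD
scores.masked_fill_(mask, -1e9)
print(nn.Softmax(dim=-1)(scores))
# tensor([[[[0.2500, 0.2500, 0.2500, 0.2500, 0.0000]]]])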

7. Position-wise feed-forward network (PoswiseFeedForwardNet)

FeedForward is simply two linear layers that transform the input. Position-wise means the transformation is applied to each position independently, i.e. every token in the sequence passes through the same MLP, which acts on the last dimension of the input.


There are two ways to implement this MLP: with 1-D convolutions or with linear layers. The two differ not only conceptually but also in code details. For example, Conv1d requires its input to be [batch_size, channels, length], i.e. exactly a three-dimensional tensor, while Linear expects [batch_size, *, d_model] and allows any number of intermediate dimensions.

7.1 Implementation 1: Conv1d

class PoswiseFeedForwardNet(nn.Module):
    def __init__(self):
        super(PoswiseFeedForwardNet, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, inputs):
        residual = inputs # inputs : [batch_size, len_q, d_model]
        # Conv1d expects [batch, channel, length] and operates on the channel (second) dimension, hence the transpose
        output = nn.ReLU()(self.conv1(inputs.transpose(1, 2)))
        output = self.conv2(output).transpose(1, 2)
        return self.layer_norm(output + residual)

7.2 Implementation 2: Linear

class PoswiseFeedForwardNet(nn.Module):
    def __init__(self):
        super(PoswiseFeedForwardNet, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(d_model, d_ff, bias=False),
            nn.ReLU(),
            nn.Linear(d_ff, d_model, bias=False))
        # register LayerNorm as a submodule instead of creating a fresh, untrained one in forward
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, inputs):              # inputs: [batch_size, seq_len, d_model]
        residual = inputs
        output = self.fc(inputs)
        return self.layer_norm(output + residual)   # [batch_size, seq_len, d_model]
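
A quick shape check of either variant (the Conv1d and Linear versions take and return the same shapes):

ffn = PoswiseFeedForwardNet()
x = torch.randn(1, 5, d_model)       # [batch_size, seq_len, d_model]
print(ffn(x).shape)                  # torch.Size([1, 5, 512])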


Source: blog.csdn.net/m0_47779101/article/details/128087403