The principle and implementation of the seq2seq model

As the name suggests, seq2seq (sequence-to-sequence) refers to generating a sequence from a sequence, and it is widely used in machine translation. Its structure consists of an encoder and a decoder, both built from recurrent neural networks (RNNs).

The blue module in the figure above is the encoder. It uses an embedding module to convert words into word vectors and a multi-layer GRU recurrent module to produce hidden-state vectors. Its structure is shown below: a two-layer GRU recurrent network whose input is the one-hot vector of each word. After all the words have been fed into the encoder, the encoder's final hidden-state vector is passed on to the decoder module.

The network structure of the encoder can be described by the following PaddlePaddle code:

import paddle
from paddle import nn

#@save
class Seq2SeqEncoder(nn.Layer):
    """RNN encoder for sequence-to-sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        weight_ih_attr = paddle.ParamAttr(initializer=nn.initializer.XavierUniform())
        weight_hh_attr = paddle.ParamAttr(initializer=nn.initializer.XavierUniform())
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size,
                          num_hiddens,
                          num_layers,
                          direction="forward", # "forward" is a unidirectional GRU running from the start of the sequence to the end; "bidirectional" additionally runs from the end back to the start
                          dropout=dropout,
                          time_major=True, # with time_major=True the tensor shape is [time_steps, batch_size, input_size]; otherwise it is [batch_size, time_steps, input_size]
                          weight_ih_attr=weight_ih_attr,
                          weight_hh_attr=weight_hh_attr)

    def forward(self, X, *args):
        # After embedding, 'X' has shape (batch_size, num_steps, embed_size)
        X = self.embedding(X)
        # In a recurrent network the first axis corresponds to the time step
        X = X.transpose([1, 0, 2])
        # If no initial state is given, it defaults to zeros
        output, state = self.rnn(X)
        # With time_major=True, output has shape (num_steps, batch_size, num_directions * num_hiddens)
        # state has shape (num_layers, batch_size, num_hiddens)
        return output, state
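
As a quick sanity check (a minimal sketch; the vocabulary size and other hyperparameters below are arbitrary), we can instantiate the encoder on a dummy batch and verify the shapes of its outputs:

encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, num_layers=2)
encoder.eval()
X = paddle.zeros([4, 7], dtype=paddle.int64)  # dummy batch of token indices: (batch_size=4, num_steps=7)
output, state = encoder(X)
print(output.shape)  # [7, 4, 16] -> (num_steps, batch_size, num_hiddens)
print(state.shape)   # [2, 4, 16] -> (num_layers, batch_size, num_hiddens)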

The decoder is likewise built around a two-layer GRU recurrent network. However, in addition to the input and embedding modules, it has a linear module at the output that maps the hidden state back to a distribution over the vocabulary (i.e., restores the word to one-hot form). Its structure is shown in the figure below. Initially, the decoder's hidden-state vector is set to the encoder's final hidden-state vector, and the decoder's input at each step is the concatenation of the encoder's context vector with the embedding of the previous step's predicted output.

The network structure of the decoder can be described by the following PaddlePaddle code:

class Seq2SeqDecoder(nn.Layer):
    """RNN decoder for sequence-to-sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqDecoder, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        weight_attr = paddle.ParamAttr(initializer=nn.initializer.XavierUniform())
        weight_ih_attr = paddle.ParamAttr(initializer=nn.initializer.XavierUniform())
        weight_hh_attr = paddle.ParamAttr(initializer=nn.initializer.XavierUniform())
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers, dropout=dropout,
                          time_major=True, weight_ih_attr=weight_ih_attr, weight_hh_attr=weight_hh_attr)
        self.dense = nn.Linear(num_hiddens, vocab_size, weight_attr=weight_attr)

    def init_state(self, enc_outputs, *args):
        # Use the encoder's final hidden state as the decoder's initial state
        return enc_outputs[1]

    def forward(self, X, state):
        # After embedding, 'X' has shape (batch_size, num_steps, embed_size)
        X = self.embedding(X).transpose([1, 0, 2])  # shape: (num_steps, batch_size, embed_size)
        # Broadcast the context so that it has the same num_steps as X
        context = state[-1].tile([X.shape[0], 1, 1])
        X_and_context = paddle.concat((X, context), 2)
        output, state = self.rnn(X_and_context, state)
        output = self.dense(output).transpose([1, 0, 2])
        # output has shape (batch_size, num_steps, vocab_size)
        # state has shape (num_layers, batch_size, num_hiddens)
        return output, state
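
Continuing the shape check from the encoder above (again a minimal sketch, reusing the same made-up hyperparameters and dummy batch X), the decoder can be initialized from the encoder's state like this:

decoder = Seq2SeqDecoder(vocab_size=10, embed_size=8, num_hiddens=16, num_layers=2)
decoder.eval()
state = decoder.init_state(encoder(X))
output, state = decoder(X, state)
print(output.shape)  # [4, 7, 10] -> (batch_size, num_steps, vocab_size)
print(state.shape)   # [2, 4, 16] -> (num_layers, batch_size, num_hiddens)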

Finally, the encoder and decoder are combined into the complete seq2seq model by the following code:

#@save
class EncoderDecoder(nn.Layer):
    """Base class for the encoder-decoder architecture."""
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)
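
Putting the pieces together, building a model then looks roughly like the sketch below; the hyperparameter values and vocabulary sizes are placeholders chosen only for illustration, and the dummy batch X from the earlier shape checks is reused as both source and target input.

embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
src_vocab_size, tgt_vocab_size = 10, 10  # assumed vocabulary sizes (placeholders)
encoder = Seq2SeqEncoder(src_vocab_size, embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqDecoder(tgt_vocab_size, embed_size, num_hiddens, num_layers, dropout)
net = EncoderDecoder(encoder, decoder)
output, state = net(X, X)    # encoder input and decoder input
print(output.shape)          # [4, 7, 10] -> (batch_size, num_steps, tgt_vocab_size)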

Another important issue is the loss function of the seq2seq model. As in classification, the loss for machine translation can be expressed with the cross-entropy loss function; however, the padded positions that merely fill out shorter sequences within a batch are excluded from the loss through a mask. The specific loss-function code is as follows:

class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """Softmax cross-entropy loss with masking."""
    def sequence_mask(self, X, valid_len, value=0):
        """Mask irrelevant (padded) entries in a sequence."""
        maxlen = X.shape[1]
        mask = paddle.arange((maxlen), dtype=paddle.float32)[None, :] < valid_len[:, None]
        Xtype = X.dtype
        X = X.astype(paddle.float32)
        X[~mask] = float(value)
        return X.astype(Xtype)

    # pred has shape (batch_size, num_steps, vocab_size)
    # label has shape (batch_size, num_steps)
    # valid_len has shape (batch_size,)
    def forward(self, pred, label, valid_len):
        weights = paddle.ones_like(label)
        weights = self.sequence_mask(weights, valid_len)
        self.reduction = 'none'
        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
            pred, label)
        weighted_loss = (unweighted_loss * weights).mean(axis=1)
        return weighted_loss
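
To see the masking in action, here is a minimal sketch: with uniform (all-ones) logits over a vocabulary of 10, the per-token cross entropy is ln(10) ≈ 2.30; averaging over 4 time steps, a sequence with valid_len 4 keeps the full loss, valid_len 2 keeps half of it, and valid_len 0 contributes nothing.

loss = MaskedSoftmaxCELoss()
pred = paddle.ones([3, 4, 10])                    # (batch_size=3, num_steps=4, vocab_size=10)
label = paddle.ones([3, 4], dtype=paddle.int64)   # dummy labels
valid_len = paddle.to_tensor([4, 2, 0])           # valid length of each sequence
print(loss(pred, label, valid_len))               # roughly [2.30, 1.15, 0.00]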
