Introduction to Deep Learning (65) Recurrent Neural Network - Sequence to Sequence Learning (seq2seq)

Foreword

The core content comes from blog link 1 and blog link 2; please support the original authors.
This article is kept as a personal record to prevent forgetting.

Recurrent Neural Networks - Sequence to Sequence Learning (seq2seq)

Courseware

Machine translation

  • Given a sentence in the source language, automatically translate it into the target language
  • The two sentences can have different lengths

seq2seq

[Figure: the seq2seq model]

The encoder is an RNN that reads the input sentence

  • It can be bidirectional

The decoder uses another RNN to generate the output sentence

Encoder-Decoder Details

[Figure: encoder-decoder details]

  • The encoder is an RNN without an output layer
  • The hidden state of the encoder at the last time step is used as the initial hidden state of the decoder

Training

  • During training, the decoder uses the target sentence as its input

BLEU: measuring how good the generated sequence is

$p_n$ is the precision of all n-grams in the prediction

  • For the label sequence A, B, C, D, E, F and the prediction sequence A, B, B, C, D: $p_1 = 4/5$, $p_2 = 3/4$, $p_3 = 1/3$, $p_4 = 0$

BLEU is defined as

$$\exp\left(\min\left(0, 1 - \frac{\mathrm{len}_{\text{label}}}{\mathrm{len}_{\text{pred}}}\right)\right) \prod_{n=1}^{k} p_n^{1/2^n}$$

Summary

  • Seq2seq generates one sentence from another sentence
  • Both the encoder and the decoder are RNNs
  • The final hidden state of the encoder is used as the initial hidden state of the decoder to transfer the information
  • BLEU is commonly used to measure the quality of the generated sequence

Textbook

As we saw in the section on machine translation datasets, both input and output sequences in machine translation have variable lengths. To solve this kind of problem, we designed a general "encoder-decoder" architecture in the Encoder-Decoder Architecture section. In this section, we will use two recurrent neural networks as the encoder and decoder, and apply them to sequence-to-sequence (seq2seq) learning tasks.

Following the design principles of the encoder-decoder architecture, a recurrent neural network encoder takes a variable-length sequence as input and transforms it into a fixed-shape hidden state. In other words, the information of the input sequence is encoded into the hidden state of the RNN encoder. To continuously generate tokens of the output sequence, a separate RNN decoder predicts the next token based on the encoded information of the input sequence and tokens already seen or generated in the output sequence. The figure below demonstrates how to use two recurrent neural networks for sequence-to-sequence learning in machine translation.

[Figure: sequence-to-sequence learning with an RNN encoder and an RNN decoder]
In the figure, the special "<eos>" token marks the end of a sequence; the model can stop predicting once the output sequence produces this token. At the initialization time step of the RNN decoder there are two specific design decisions. First, the special "<bos>" token marks the beginning of a sequence and is the first input token of the decoder. Second, the hidden state of the decoder is initialized with the final hidden state of the RNN encoder. In designs such as (Sutskever et al., 2014), this is exactly how the encoded information of the input sequence is fed into the decoder to generate the output sequence. In some other designs, as shown in the figure, the encoder's final hidden state is also included as part of the decoder's input at every time step. Similar to the training of the language model in Section 8.3, the labels can be the original output sequence shifted by one token: from "<bos>", "Ils", "regardent", "." to "Ils", "regardent", ".", "<eos>".

Next, we will build the design above and train this machine translation model based on the "English-French" dataset.

import collections
import math
import torch
from torch import nn
from d2l import torch as d2l

1 Encoder

Technically, the encoder converts a variable-length input sequence into a context variable $\mathbf{c}$ of fixed shape, and encodes the information of the input sequence in this context variable. As shown in the figure, we can design the encoder with a recurrent neural network.

Consider a single-sequence example (batch size 1). Suppose the input sequence is $x_1, \ldots, x_T$, where $x_t$ is the $t$-th token of the input text sequence. At time step $t$, the RNN transforms the input feature vector $\mathbf{x}_t$ of token $x_t$ and the hidden state $\mathbf{h}_{t-1}$ of the previous time step into the current hidden state $\mathbf{h}_t$. We can use a function $f$ to describe the transformation performed by the recurrent layer of the RNN:

$$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}).$$

In general, the encoder transforms the hidden states of all time steps into a context variable through a selected function $q$:

$$\mathbf{c} = q(\mathbf{h}_1, \ldots, \mathbf{h}_T).$$

For example, when choosing $q(\mathbf{h}_1, \ldots, \mathbf{h}_T) = \mathbf{h}_T$ (as in the figure), the context variable is simply the hidden state $\mathbf{h}_T$ of the input sequence at the last time step.

So far, we have used a unidirectional recurrent neural network to design the encoder, where the hidden state depends only on the input subsequence from the beginning of the input sequence up to and including the time step at which the hidden state is computed. We could also construct the encoder with a bidirectional recurrent neural network, in which case the hidden state depends on the subsequences before and after the current time step (including the current time step), so the hidden state encodes information of the entire sequence.
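For illustration only (not part of the original code), a bidirectional GRU encoder would look like the sketch below; note that the output feature size doubles and the state gains one layer per direction, so the decoder's init_state would have to be adapted (for example, by summing or concatenating the two directions) before it could be used as shown later.

# Sketch of a bidirectional GRU encoder (illustrative sizes, not the original code)
birnn = nn.GRU(input_size=8, hidden_size=16, num_layers=2, bidirectional=True)
X = torch.zeros(7, 4, 8)   # (num_steps, batch_size, embed_size)
output, state = birnn(X)
print(output.shape)        # torch.Size([7, 4, 32]) -> 2 * num_hiddens
print(state.shape)         # torch.Size([4, 4, 16]) -> num_layers * 2 directions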

Now, let's implement the recurrent neural network encoder. Note that we use an embedding layer to obtain the feature vector of each token in the input sequence. The weight of the embedding layer is a matrix whose number of rows equals the size of the input vocabulary (vocab_size) and whose number of columns equals the dimension of the feature vectors (embed_size). For any input token index $i$, the embedding layer fetches the $i$-th row of the weight matrix (starting from 0) and returns it as the token's feature vector. In addition, we choose a multi-layer gated recurrent unit (GRU) to implement the encoder.
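To make that row lookup concrete, here is a quick self-contained check (illustrative sizes, not from the original post): indexing the embedding returns the corresponding row of its weight matrix.

# Quick check of the embedding lookup described above (illustrative sizes)
emb = nn.Embedding(num_embeddings=10, embedding_dim=8)   # a 10 x 8 weight matrix
i = torch.tensor([3])
print(torch.equal(emb(i)[0], emb.weight[3]))             # True: row 3 is returned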

class Seq2SeqEncoder(d2l.Encoder):
    """用于序列到序列学习的循环神经网络编码器"""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        # 嵌入层
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers,
                          dropout=dropout)

    def forward(self, X, *args):
        # 输出'X'的形状:(batch_size,num_steps,embed_size)
        X = self.embedding(X)
        # 在循环神经网络模型中,第一个轴对应于时间步
        X = X.permute(1, 0, 2)
        # 如果未提及状态,则默认为0
        output, state = self.rnn(X)
        # output的形状:(num_steps,batch_size,num_hiddens)
        # state的形状:(num_layers,batch_size,num_hiddens)
        return output, state

For a description of the variables returned by the recurrent layer, please refer to the section on the concise implementation of RNNs.

Below, we instantiate the encoder implemented above: we use a two-layer GRU encoder with 16 hidden units. Given a mini-batch of input sequences X (batch size 4, 7 time steps), the hidden states of the last layer at all time steps (output returned by the encoder's recurrent layer) form a tensor of shape (num_steps, batch_size, num_hiddens).

encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16,
                         num_layers=2)
encoder.eval()
X = torch.zeros((4, 7), dtype=torch.long)
output, state = encoder(X)
output.shape

output:

torch.Size([7, 4, 16])

Since gated recurrent units are used here, the shape of the multilayer hidden state at the last time step is (num_layers, batch_size, num_hiddens). If an LSTM were used, state would also contain the memory cell information.

state.shape

output:

torch.Size([2, 4, 16])
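For illustration only (not part of the original code), here is what that would look like with an LSTM of the same size; the returned state is a tuple of the hidden state and the memory cell.

# Sketch: with an LSTM, the state is a (hidden state, memory cell) tuple
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2)
out, (h, c) = lstm(torch.zeros(7, 4, 8))
print(h.shape, c.shape)   # torch.Size([2, 4, 16]) torch.Size([2, 4, 16])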

2 Decoder

As mentioned above, the context variable $\mathbf{c}$ output by the encoder encodes the entire input sequence $x_1, \ldots, x_T$. For each time step $t'$ of the output sequence $y_1, y_2, \ldots, y_{T'}$ from the training dataset (different from the time step $t$ of the input sequence or the encoder), the probability of the decoder output $y_{t'}$ depends on the previous output subsequence $y_1, \ldots, y_{t'-1}$ and the context variable $\mathbf{c}$, that is, $P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$.

To model this conditional probability on sequences, we can use another recurrent neural network as the decoder. At any time step $t'$ of the output sequence, the RNN takes the output $y_{t'-1}$ from the previous time step and the context variable $\mathbf{c}$ as its input, and together with the previous hidden state $\mathbf{s}_{t'-1}$ transforms them into the current hidden state $\mathbf{s}_{t'}$. Therefore, we can use a function $g$ to represent the transformation of the decoder's hidden layer:

$$\mathbf{s}_{t'} = g(y_{t'-1}, \mathbf{c}, \mathbf{s}_{t'-1}).$$

After obtaining the hidden state of the decoder, we can use an output layer and the softmax operation to compute the conditional probability distribution of the output $y_{t'}$ at time step $t'$, namely $P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$.

According to the above figure, when implementing the decoder, we directly use the hidden state of the encoder at the last time step to initialize the hidden state of the decoder. This requires that the encoder and decoder implemented using recurrent neural networks have the same number of layers and hidden units. To further contain information about the encoded input sequence, context variables are concatenated with the decoder input at all time steps. In order to predict the probability distribution of output tokens, a fully connected layer is used in the last layer of the RNN decoder to transform the hidden state.

class Seq2SeqDecoder(d2l.Decoder):
    """用于序列到序列学习的循环神经网络解码器"""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqDecoder, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers,
                          dropout=dropout)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]

    def forward(self, X, state):
        # 输出'X'的形状:(batch_size,num_steps,embed_size)
        X = self.embedding(X).permute(1, 0, 2)
        # 广播context,使其具有与X相同的num_steps
        context = state[-1].repeat(X.shape[0], 1, 1)
        X_and_context = torch.cat((X, context), 2)
        output, state = self.rnn(X_and_context, state)
        output = self.dense(output).permute(1, 0, 2)
        # output的形状:(batch_size,num_steps,vocab_size)
        # state的形状:(num_layers,batch_size,num_hiddens)
        return output, state

Below, we instantiate the decoder with the same hyperparameters as the encoder above. As we can see, the output shape of the decoder becomes (batch_size, num_steps, vocab_size), where the last dimension of the tensor stores the predicted token distribution.

decoder = Seq2SeqDecoder(vocab_size=10, embed_size=8, num_hiddens=16,
                         num_layers=2)
decoder.eval()
state = decoder.init_state(encoder(X))
output, state = decoder(X, state)
output.shape, state.shape

output

(torch.Size([4, 7, 10]), torch.Size([2, 4, 16]))

In summary, the layers in the above recurrent neural network "encoder-decoder" model are shown in the figure below.
[Figure: layers in the RNN encoder-decoder model]
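The d2l.EncoderDecoder wrapper used in the training code below essentially just chains the two modules together. A paraphrased sketch of what it does (my approximation for clarity, not copied from the library):

# Paraphrased sketch of the encoder-decoder wrapper (approximation, for clarity)
class EncoderDecoder(nn.Module):
    """Chain an encoder and a decoder for sequence-to-sequence learning."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)                  # encode the source
        dec_state = self.decoder.init_state(enc_outputs, *args)   # hand the state over
        return self.decoder(dec_X, dec_state)                     # decode the target inputs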

3 Loss function

At each time step, the decoder predicts the probability distribution of the output tokens. Similar to language models, distributions can be obtained using softmax and optimized by computing a cross-entropy loss function. Recall from the section on Machine Translation Datasets that specific filler tokens are added to the end of the sequences so that sequences of different lengths can be loaded in mini-batches of the same shape. However, we should exclude the prediction of filler tokens from the calculation of the loss function.

To do this, we can use the sequence_mask function defined below to mask irrelevant entries by zeroing them out, so that any later computation involving irrelevant predictions multiplies by zero and yields zero. For example, if the effective lengths of two sequences (excluding padding tokens) are 1 and 2 respectively, the entries after the first item of the first sequence and after the first two items of the second sequence are cleared to zero.

def sequence_mask(X, valid_len, value=0):
    """在序列中屏蔽不相关的项"""
    maxlen = X.size(1)
    mask = torch.arange((maxlen), dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X

X = torch.tensor([[1, 2, 3], [4, 5, 6]])
sequence_mask(X, torch.tensor([1, 2]))

output

tensor([[1, 0, 0],
        [4, 5, 0]])

We can also use this function to mask all terms in the last few axes. These items can also be replaced with specified non-zero values ​​if desired.

X = torch.ones(2, 3, 4)
sequence_mask(X, torch.tensor([1, 2]), value=-1)

output

tensor([[[ 1.,  1.,  1.,  1.],
         [-1., -1., -1., -1.],
         [-1., -1., -1., -1.]],

        [[ 1.,  1.,  1.,  1.],
         [ 1.,  1.,  1.,  1.],
         [-1., -1., -1., -1.]]])

We can now mask irrelevant predictions by extending the softmax cross-entropy loss function. Initially, the masks for all predicted tokens are set to 1. Once a valid length is given, the mask corresponding to the filler token will be set to 0. Finally, the loss for all tokens is multiplied by the mask to filter out irrelevant predictions produced by padding tokens in the loss.

class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """带遮蔽的softmax交叉熵损失函数"""
    # pred的形状:(batch_size,num_steps,vocab_size)
    # label的形状:(batch_size,num_steps)
    # valid_len的形状:(batch_size,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        self.reduction='none'
        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
            pred.permute(0, 2, 1), label)
        weighted_loss = (unweighted_loss * weights).mean(dim=1)
        return weighted_loss

We can create three identical sequences for code sanity checking, and then specify effective lengths of these sequences to be 4, 2, and 0, respectively. As a result, the loss for the first sequence should be twice that of the second sequence, and the loss for the third sequence should be zero.

loss = MaskedSoftmaxCELoss()
loss(torch.ones(3, 4, 10), torch.ones((3, 4), dtype=torch.long),
     torch.tensor([4, 2, 0]))

output

tensor([2.3026, 1.1513, 0.0000])
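These values check out: the predictions are all-ones logits over a vocabulary of 10 tokens, so each unmasked time step contributes a cross-entropy of $\ln 10 \approx 2.3026$, and the mean over the 4 time steps (with masked steps contributing zero) gives

$$\frac{4\ln 10}{4} \approx 2.3026, \qquad \frac{2\ln 10}{4} \approx 1.1513, \qquad \frac{0}{4} = 0.$$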

4 Training

In the training loop below, as shown in the first figure, the special start-of-sequence token ("<bos>") and the original output sequence (excluding the end-of-sequence token "<eos>") are concatenated as the input of the decoder. This is called teacher forcing, because the original output sequence (the token labels) is fed into the decoder. Alternatively, the token predicted at the previous time step could be used as the current input to the decoder.
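Concretely, the decoder input is built by prepending "<bos>" and dropping the last target token, which is exactly what the dec_input line in the training function below does. A tiny illustration with made-up token indices:

# Teacher-forcing input construction (made-up token indices for illustration)
Y = torch.tensor([[5, 8, 9, 2],                      # a target sequence ending with <eos> = 2
                  [6, 7, 2, 0]])                     # another one, padded with <pad> = 0
bos = torch.full((Y.shape[0], 1), 1, dtype=torch.long)  # assume <bos> has index 1 in tgt_vocab
dec_input = torch.cat([bos, Y[:, :-1]], dim=1)
print(dec_input)                                     # tensor([[1, 5, 8, 9],
                                                     #         [1, 6, 7, 2]])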

#@save
def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
    """Train a sequence-to-sequence model."""
    def xavier_init_weights(m):
        if type(m) == nn.Linear:
            nn.init.xavier_uniform_(m.weight)
        if type(m) == nn.GRU:
            for param in m._flat_weights_names:
                if "weight" in param:
                    nn.init.xavier_uniform_(m._parameters[param])

    net.apply(xavier_init_weights)
    net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss = MaskedSoftmaxCELoss()
    net.train()
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                     xlim=[10, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # Sum of training loss, number of tokens
        for batch in data_iter:
            optimizer.zero_grad()
            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
                          device=device).reshape(-1, 1)
            dec_input = torch.cat([bos, Y[:, :-1]], 1)  # Teacher forcing
            Y_hat, _ = net(X, dec_input, X_valid_len)
            l = loss(Y_hat, Y, Y_valid_len)
            l.sum().backward()      # Back-propagate on the scalar of the loss
            d2l.grad_clipping(net, 1)
            num_tokens = Y_valid_len.sum()
            optimizer.step()
            with torch.no_grad():
                metric.add(l.sum(), num_tokens)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
          f'tokens/sec on {str(device)}')

Now, on a machine translation dataset, we can create and train a recurrent neural network "encoder-decoder" model for sequence-to-sequence learning.

embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
batch_size, num_steps = 64, 10
lr, num_epochs, device = 0.005, 300, d2l.try_gpu()

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
encoder = Seq2SeqEncoder(len(src_vocab), embed_size, num_hiddens, num_layers,
                        dropout)
decoder = Seq2SeqDecoder(len(tgt_vocab), embed_size, num_hiddens, num_layers,
                        dropout)
net = d2l.EncoderDecoder(encoder, decoder)
train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)

output

loss 0.019, 11451.2 tokens/sec on cuda:0

[Figure: training loss curve]

If you encounter the following error:

Cell In[19], line 5
      2 batch_size, num_steps = 64, 10
      3 lr, num_epochs, device = 0.005, 300, d2l.try_gpu()
----> 5 train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
      6 encoder = Seq2SeqEncoder(len(src_vocab), embed_size, num_hiddens, num_layers,
      7                         dropout)
      8 decoder = Seq2SeqDecoder(len(tgt_vocab), embed_size, num_hiddens, num_layers,
      9                         dropout)

File D:\python\lib\site-packages\d2l\torch.py:883, in load_data_nmt(batch_size, num_steps, num_examples)
    881 def load_data_nmt(batch_size, num_steps, num_examples=600):
    882     """Return the iterator and the vocabularies of the translation dataset."""
--> 883     text = preprocess_nmt(read_data_nmt())
    884     source, target = tokenize_nmt(text, num_examples)
    885     src_vocab = d2l.Vocab(source, min_freq=2,
    886                           reserved_tokens=['<pad>', '<bos>', '<eos>'])

File D:\python\lib\site-packages\d2l\torch.py:828, in read_data_nmt()
    826 data_dir = d2l.download_extract('fra-eng')
    827 with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
--> 828     return f.read()

UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 33: illegal multibyte sequence


This is an encoding problem. You can specify encoding='UTF-8' when opening the file.

Find the corresponding function and add the encoding argument:

import os

def read_data_nmt():
    """Load the English-French dataset."""
    data_dir = d2l.download_extract('fra-eng')
    with open(os.path.join(data_dir, 'fra.txt'), 'r', encoding='UTF-8') as f:
        return f.read()
raw_text = read_data_nmt()
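Note that d2l.load_data_nmt calls the read_data_nmt defined inside the d2l package, so a redefinition in your own notebook only takes effect if you either edit the installed d2l/torch.py (the file shown in the traceback) or point the package at your patched function. One possible workaround (my suggestion, not from the original post):

# Possible workaround (suggestion, not from the original post): replace the
# function inside the d2l module so that d2l.load_data_nmt uses the UTF-8 version
d2l.read_data_nmt = read_data_nmt
train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)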

5 Prediction

To predict the output sequence token by token, the input to the decoder at each time step is the token predicted at the previous time step. As in training, the start-of-sequence token ("<bos>") is fed into the decoder at the initial time step. The prediction process is shown in the figure below; when the output sequence produces the end-of-sequence token ("<eos>"), prediction stops.
[Figure: predicting the output sequence token by token with the RNN encoder-decoder]
We will introduce different sequence generation strategies in the next section.

#@save
def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
                    device, save_attention_weights=False):
    """序列到序列模型的预测"""
    # 在预测时将net设置为评估模式
    net.eval()
    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
        src_vocab['<eos>']]
    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
    # 添加批量轴
    enc_X = torch.unsqueeze(
        torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
    enc_outputs = net.encoder(enc_X, enc_valid_len)
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    # 添加批量轴
    dec_X = torch.unsqueeze(torch.tensor(
        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)
    output_seq, attention_weight_seq = [], []
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, dec_state)
        # 我们使用具有预测最高可能性的词元,作为解码器在下一时间步的输入
        dec_X = Y.argmax(dim=2)
        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
        # 保存注意力权重(稍后讨论)
        if save_attention_weights:
            attention_weight_seq.append(net.decoder.attention_weights)
        # 一旦序列结束词元被预测,输出序列的生成就完成了
        if pred == tgt_vocab['<eos>']:
            break
        output_seq.append(pred)
    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq

6 Evaluation of the predicted sequence

We can evaluate a predicted sequence by comparing it with the true label sequence. Although the BLEU (bilingual evaluation understudy) method proposed by Papineni et al. was originally used to evaluate machine translation results, it is now widely used to measure the quality of output sequences for many applications. In principle, for any n-gram in the predicted sequence, BLEU evaluates whether this n-gram appears in the label sequence.

We define BLEU as:

$$\exp\left(\min\left(0, 1 - \frac{\mathrm{len}_{\text{label}}}{\mathrm{len}_{\text{pred}}}\right)\right) \prod_{n=1}^{k} p_n^{1/2^n},$$

where $\mathrm{len}_{\text{label}}$ denotes the number of tokens in the label sequence, $\mathrm{len}_{\text{pred}}$ denotes the number of tokens in the predicted sequence, and $k$ is the longest n-gram used for matching. Moreover, $p_n$ denotes the precision of the n-grams, which is the ratio of two quantities: the first is the number of matched n-grams between the predicted and label sequences, and the second is the number of n-grams in the predicted sequence. Specifically, given the label sequence A, B, C, D, E, F and the predicted sequence A, B, B, C, D, we have $p_1 = 4/5$, $p_2 = 3/4$, $p_3 = 1/3$, and $p_4 = 0$.

According to the definition of BLEU, BLEU is 1 whenever the predicted sequence is identical to the label sequence. Moreover, since matching longer n-grams is more difficult, BLEU assigns greater weight to the precision of longer n-grams. Specifically, when $p_n$ is fixed, $p_n^{1/2^n}$ increases as $n$ grows (the original paper uses $p_n^{1/n}$). Also, since predicting shorter sequences tends to yield higher $p_n$ values, the coefficient before the product term in the formula above penalizes short predicted sequences. For example, when $k=2$, given the label sequence A, B, C, D, E, F and the predicted sequence A, B, although $p_1 = p_2 = 1$, the penalty factor $\exp(1-6/2) \approx 0.14$ lowers the BLEU.

The code implementation of BLEU is as follows.

def bleu(pred_seq, label_seq, k):  #@save
    """计算BLEU"""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score
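As a quick check against the worked examples above (the token strings are chosen to match those examples):

# Check the worked BLEU examples above
print(bleu('A B B C D', 'A B C D E F', k=2))   # about 0.68
print(bleu('A B B C D', 'A B C D E F', k=4))   # 0.0, because p_4 = 0
print(bleu('A B', 'A B C D E F', k=2))         # about 0.14: the length penalty dominates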

Finally, using the trained recurrent neural network "encoder-decoder" model, several English sentences are translated into French and the final result of BLEU is calculated.

engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    translation, attention_weight_seq = predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device)
    print(f'{eng} => {translation}, bleu {bleu(translation, fra, k=2):.3f}')

output:

go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
he's calm . => il est bon ?, bleu 0.537
i'm home . => je suis chez moi debout ., bleu 0.803
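The imperfect scores are easy to trace with the formula above. For example, for "he's calm ." the prediction "il est bon ?" has the same length as the label "il est calme .", so there is no length penalty, and

$$p_1 = \tfrac{2}{4}, \quad p_2 = \tfrac{1}{3}, \quad \text{BLEU} = \left(\tfrac{1}{2}\right)^{1/2}\left(\tfrac{1}{3}\right)^{1/4} \approx 0.537,$$

which matches the printed value.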

7 Summary

  • According to the design of the "encoder-decoder" architecture, we can use two recurrent neural networks to design a sequence-to-sequence learning model.

  • When implementing the encoder and decoder, we can use a multi-layer recurrent neural network.

  • We can use masking to filter irrelevant computations, for example when computing losses.

  • In "encoder-decoder" training, the forced teaching method feeds the raw output sequence (rather than the predicted result) into the decoder.

  • BLEU is a commonly used evaluation method; it evaluates a prediction by measuring how well its n-grams match those of the label sequence.

Source: blog.csdn.net/qq_52358603/article/details/128391591