[Hands-on deep learning] Li Mu - Recurrent Neural Network

In this chapter, Teacher Li Mu mainly covers recurrent neural networks and the encoder-decoder architecture. He focuses on the practical side and does not go deeply into the theory. If you are interested, you can read these two posts of mine, which are study notes on Teacher Li Hongyi's course and explain the theoretical part in more detail:
[Machine Learning] Li Hongyi - Recurrent Neural Network (Recurrent Neural Network)
[Machine Learning] Li Hongyi——AE Auto-encoder (Auto-encoder)

Sequence model

In real life a lot of data has a sequential structure, so studying such structure is also necessary.

Generally, for time series the observation $x_t$ at time step $t$ is related to the observations at the previous $t-1$ time steps, but the reverse direction (using future observations to explain the past) is usually not feasible in reality, that is:
Insert image description here

Then we model the conditional probability, that is:
Insert image description here

Then if you can learn the model f and the probability calculation method p, you can make predictions.

So there are two common research methods for this problem:


Markov assumption:

In the description above, the current observation depends on all previous observations. The Markov assumption instead assumes that the current data is related only to the past $\tau$ data points, so the input of the function f changes from variable length to fixed length, which is much more convenient:

Insert image description here

Then it can be implemented with a simple MLP.


Latent variable model:

Introduce a latent variable $h_t$ to summarize the past information, $h_t=f(x_1,\ldots,x_{t-1})$; then $x_t$ is modeled by $p(x_t \mid h_t)$.

Insert image description here


The teacher then used a small example to demonstrate training and prediction:

import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

T = 1000  # 总共产生1000个点
time = torch.arange(1, T + 1, dtype=torch.float32)
x = torch.sin(0.01 * time) + torch.normal(0, 0.2, (T,))  # 加上噪音
# d2l.plot(time, [x], 'time', 'x', xlim=[1, 1000], figsize=(6, 3))

tau = 4
features = torch.zeros((T - tau, tau))  # 因为前tau个之间没有tau个可以作为输入
for i in range(tau):
    features[:, i] = x[i: T - tau + i]  # 例如第0列就是每个数据的前面第4个
labels = x[tau:].reshape((-1, 1))  # 从第4个往后都是前面tau个造成的输出了

batch_size, n_train = 16, 600
# 用前600个样本来训练,然后后面400个完成预测任务
train_iter = d2l.load_array((features[:n_train], labels[:n_train]),
                            batch_size, is_train=True)


# 初始化网络权重的函数
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)


# 一个简单的多层感知机
def get_net():
    net = nn.Sequential(nn.Linear(4, 10),
                        nn.ReLU(),
                        nn.Linear(10, 1))
    net.apply(init_weights)
    return net


loss = nn.MSELoss(reduction='none')


def train(net, train_iter, loss, epochs, lr):
    trainer = torch.optim.Adam(net.parameters(), lr)
    for epoch in range(epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.sum().backward()
            trainer.step()
        print(f'epoch {epoch + 1}, '
              f'loss: {d2l.evaluate_loss(net, train_iter, loss):f}')


net = get_net()
train(net, train_iter, loss, 5, 0.01)

onestep_preds = net(features)
# 单步预测,就是每次都给定4个真实值来让你预测下一个
# 注意这里采用detach是将本来含有梯度的变量,复制一个不含有梯度,不过都是指向同一个数值
# 不含有梯度是因为画图不需要计算梯度,防止在画图中发生计算过程而改变梯度
d2l.plot([time, time[tau:]],
         [x.detach().numpy(), onestep_preds.detach().numpy()], 'time',
         'x', legend=['data', '1-step preds'], xlim=[1, 1000],figsize=(6, 3))
plt.show()
# 多步预测,只知道600个,然后可以结合真实数据预测到604个,那么后面都是靠预测值来预测
multistep_preds = torch.zeros(T)
multistep_preds[: n_train + tau] = x[: n_train + tau]
for i in range(n_train + tau, T):
    multistep_preds[i] = net(multistep_preds[i - tau:i].reshape((1, -1)))

d2l.plot([time, time[tau:], time[n_train + tau:]],
         [x.detach().numpy(), onestep_preds.detach().numpy(),
          multistep_preds[n_train + tau:].detach().numpy()], 'time',
         'x', legend=['data', '1-step preds', 'multistep preds'],
         xlim=[1, 1000], figsize=(6, 3))
plt.show()
max_steps = 64
features = torch.zeros((T - tau - max_steps + 1, tau + max_steps))
# 列i(i<tau)是来自x的观测,其时间步从(i)到(i+T-tau-max_steps+1)
for i in range(tau):
    features[:, i] = x[i: i + T - tau - max_steps + 1]

# 列i(i>=tau)是来自(i-tau+1)步的预测,其时间步从(i)到(i+T-tau-max_steps+1)
for i in range(tau, tau + max_steps):
    features[:, i] = net(features[:, i - tau:i]).reshape(-1)

steps = (1, 4, 16, 64)
d2l.plot([time[tau + i - 1: T - max_steps + i] for i in steps],
         [features[:, (tau + i - 1)].detach().numpy() for i in steps], 'time', 'x',
         legend=[f'{i}-step preds' for i in steps], xlim=[5, 1000],
         figsize=(6, 3))

plt.show()

Insert image description here

It can be seen that the results of single-step prediction are still very accurate.

Insert image description here

But in multi-step prediction, if we only give the first 600 data points and then ask it to predict the next 400, the results will be ridiculously bad.

Insert image description here

It can be seen that the 1-, 4-, and 16-step predictions are still reasonably good, but at 64 steps the deviation becomes significant.


Summary

  • In the time series model, the current data is related to the previously observed data.
  • Autoregressive models use their own past data to predict the future
  • The Markov model assumes that the current situation is only related to a small number of recent data, thus simplifying the model.
  • Latent variable models use latent variables to summarize historical information

Text preprocessing

This section mainly introduces how to process a simple text file and turn it into a usable dataset.

import collections
import re
from d2l import torch as d2l

# @save
d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',
                                '090b5e7e70c295757f55df93cb0a180b9691891a')


# 下载数据集

def read_time_machine():
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]


# 就是将除了A-Z和a-z,还有空格,其他的符号都去掉,再去掉回车,再转成小写

lines = read_time_machine()
print(f'# 文本总行数: {len(lines)}')
print(lines[0])
print(lines[10])


def tokenize(lines, token='word'):  # @save
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('错误:未知词元类型:' + token)


tokens = tokenize(lines)
for i in range(11):
    print(tokens[i])


def count_corpus(tokens):  # @save
    """统计词元的频率"""
    # 这里的tokens是1D列表或2D列表
    if len(tokens) == 0 or isinstance(tokens[0], list):
        # 所有单词都展开成一个列表
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)


class Vocab:
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # 按照出现的频率来进行排序
        counter = count_corpus(tokens)
        self._token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                   reverse=True)
        # 按照出现频率从大到小排序
        # 未知词元索引在0,包括出现频率太少,还有句子起始和结尾的标志
        self.idx_to_token = ['<unk>'] + reserved_tokens
        self.token_to_idx = {token: idx for idx, token in enumerate(self.idx_to_token)}

        for token, freq in self._token_freqs:
            if freq < min_freq:
                break
            if token not in self.token_to_idx:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1
        # 不断添加进去词汇并建立和位置之间的索引关系

    def __len__(self):
        return len(self.idx_to_token)

    @property
    def unk(self):  # 未知词元的索引为0,装饰器,可以直接self.unk,不用加括号
        return 0

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):  # 如果是单个
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]  # 如果是多个

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]  # 从索引到单词
        return [self.idx_to_token[index] for index in indices]

    @property
    def token_freqs(self):
        return self._token_freqs


vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[:10])


def load_corpus_time_machine(max_tokens=-1):  # @save
    """返回时光机器数据集的词元索引列表和词表"""
    lines = read_time_machine()
    tokens = tokenize(lines, 'char')
    vocab = Vocab(tokens)
    # 因为时光机器数据集中的每个文本行不一定是一个句子或一个段落,
    # 所以将所有文本行展平到一个列表中
    corpus = [vocab[token] for line in tokens for token in line]
    if max_tokens > 0:
        corpus = corpus[:max_tokens]
    return corpus, vocab


corpus, vocab = load_corpus_time_machine()
len(corpus), len(vocab)

Language models and datasets

A language model takes a text sequence $x_1,\ldots,x_T$ and aims to estimate the joint probability $p(x_1,\ldots,x_T)$, that is, the probability that this text sequence occurs.

Assume the sequence length is 2; then we can estimate the probabilities simply by counting:
$$p(x,x^{\prime})=p(x)\,p(x^{\prime}\mid x)=\frac{n(x)}{n_{all}}\cdot\frac{n(x,x^{\prime})}{n(x)}$$
Then a similar counting method can be used to continue to expand the sequence length.
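
As a minimal sketch of this counting idea (a toy example of my own, not from the lecture), the unigram and bigram counts directly give the estimates above:

from collections import Counter

# A tiny hypothetical corpus of word tokens
corpus = "the time traveller sat by the time machine".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus[:-1], corpus[1:]))

n_all = len(corpus)
p_x = unigram['the'] / n_all                             # p(x) = n(x) / n_all
p_x2_given_x = bigram[('the', 'time')] / unigram['the']  # p(x'|x) = n(x, x') / n(x)
print(p_x * p_x2_given_x)                                # p(x, x') = p(x) p(x'|x)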

But if the sequence is too long and the corpus is not large enough, we may encounter the case $n(x_1,\ldots,x_T)\leq 1$; the Markov assumption can then be used to alleviate this problem:

  • Unigram: $p(x_1,x_2,x_3,x_4)=p(x_1)p(x_2)p(x_3)p(x_4)$
  • Bigram: $p(x_1,x_2,x_3,x_4)=p(x_1)p(x_2\mid x_1)p(x_3\mid x_2)p(x_4\mid x_3)$
  • Trigram: $p(x_1,x_2,x_3,x_4)=p(x_1)p(x_2\mid x_1)p(x_3\mid x_1,x_2)p(x_4\mid x_2,x_3)$

The code is:

import random
import torch
from d2l import torch as d2l
from matplotlib import pyplot as plt

tokens = d2l.tokenize(d2l.read_time_machine())
# 因为每个文本行不一定是一个句子或一个段落,因此我们把所有文本行拼接到一起
corpus = [token for line in tokens for token in line]
vocab = d2l.Vocab(corpus)  # 计算频率得到的词汇列表

freqs = [freq for token, freq in vocab.token_freqs]  # 将频率变化画出来
d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)', xscale='log', yscale='log')
plt.show()  # 以上这是单个单词的情况

# 我们来看看连续的两个单词和三个单词的情况,即二元语法和三元语法
bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]
bigram_vocab = d2l.Vocab(bigram_tokens)
trigram_tokens = [triple for triple in zip(corpus[:-2], corpus[1:-1], corpus[2:])]
trigram_vocab = d2l.Vocab(trigram_tokens)

bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]
trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]
d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token: x',
         ylabel='frequency: n(x)', xscale='log', yscale='log',
         legend=['unigram', 'bigram', 'trigram'])
plt.show()  # 画出来对比


# 下面我们对一个很长的文本序列,随机在上面采样得到我们指定长度的子序列,方便我们输入到模型中
def seq_data_iter_random(corpus, batch_size, num_steps):  # @save
    """使用随机抽样生成一个小批量子序列"""
    # 从随机偏移量开始对序列进行分区,随机范围包括num_steps-1
    corpus = corpus[random.randint(0, num_steps - 1):]
    # 因为长度为num_steps是肯定的,那我们如果每次都从0开始,那么例如2-7这种就得不到
    # 因此每次都随机的初始点开始就可以保证我们能够采样得到不同的数据
    # 减去1,是因为我们需要考虑标签
    num_subseqs = (len(corpus) - 1) // num_steps
    # 长度为num_steps的子序列的起始索引
    initial_indices = list(range(0, num_subseqs * num_steps, num_steps))
    # 在随机抽样的迭代过程中,
    # 来自两个相邻的、随机的、小批量中的子序列不一定在原始序列上相邻
    random.shuffle(initial_indices)

    def data(pos):
        # 返回从pos位置开始的长度为num_steps的序列
        return corpus[pos: pos + num_steps]

    num_batches = num_subseqs // batch_size
    for i in range(0, batch_size * num_batches, batch_size):
        # 在这里,initial_indices包含子序列的随机起始索引
        initial_indices_per_batch = initial_indices[i: i + batch_size]
        X = [data(j) for j in initial_indices_per_batch]
        Y = [data(j + 1) for j in initial_indices_per_batch]
        # 这里解释一下,一开始我认为应该输入序列x之后我们要输出x之后的下一个单词,因此认为y应该为长度为1
        # 但是实际上在训练时我们并不是5个丢进去,然后生成1个出来
        # 我们是丢进去第一个,然后生成第二个,然后结合1,2的真实标签,去预测第三个,以此类推
        # 直到后面结合5个去预测第6个
        yield torch.tensor(X), torch.tensor(Y)


# 这个函数是让相邻两个小批量中的子序列在原始序列上是相邻的
def seq_data_iter_sequential(corpus, batch_size, num_steps):  # @save
    """使用顺序分区生成一个小批量子序列"""
    # 从随机偏移量开始划分序列
    offset = random.randint(0, num_steps)
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    Xs = torch.tensor(corpus[offset: offset + num_tokens])
    Ys = torch.tensor(corpus[offset + 1: offset + 1 + num_tokens])
    Xs, Ys = Xs.reshape(batch_size, -1), Ys.reshape(batch_size, -1)
    num_batches = Xs.shape[1] // num_steps
    for i in range(0, num_steps * num_batches, num_steps):
        X = Xs[:, i: i + num_steps]
        Y = Ys[:, i: i + num_steps]
        yield X, Y


class SeqDataLoader:  # @save
    """加载序列数据的迭代器"""

    def __init__(self, batch_size, num_steps, use_random_iter, max_tokens):
        if use_random_iter:
            self.data_iter_fn = d2l.seq_data_iter_random
        else:
            self.data_iter_fn = d2l.seq_data_iter_sequential
        self.corpus, self.vocab = d2l.load_corpus_time_machine(max_tokens)
        self.batch_size, self.num_steps = batch_size, num_steps

    def __iter__(self):
        return self.data_iter_fn(self.corpus, self.batch_size, self.num_steps)

# 封装,同时返回数据迭代器和词汇表
def load_data_time_machine(batch_size, num_steps,  #@save
                           use_random_iter=False, max_tokens=10000):
    """返回时光机器数据集的迭代器和词表"""
    data_iter = SeqDataLoader(
        batch_size, num_steps, use_random_iter, max_tokens)
    return data_iter, data_iter.vocab

Insert image description here
Insert image description here

Recurrent neural network

Its model can be represented by the following figure:

Insert image description here

That is, an intermediate hidden variable is used to capture and retain the sequence's historical information up to the current time step. The internal computation is:

  • Update the hidden state: $\pmb{h}_t=\phi(\pmb{W}_{hh}\pmb{h}_{t-1}+\pmb{W}_{hx}\pmb{x}_{t-1}+\pmb{b}_h)$
  • Output: $\pmb{o}_t=\pmb{W}_{ho}\pmb{h}_t+\pmb{b}_o$

For example, at time $t_1$ the input is $x_1=$ "你"; we hope the model computes $h_1$ and outputs $o_1=$ "好". The next input is then $x_2=$ "好", and we hope $o_2=$ "世", and so on.
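
To make the update equations concrete, here is a minimal single-step sketch of the recurrence (the shapes and values are arbitrary, chosen only for illustration):

import torch

batch_size, num_inputs, num_hiddens = 2, 5, 4
X = torch.randn(batch_size, num_inputs)        # input at the current time step
H = torch.zeros(batch_size, num_hiddens)       # hidden state from the previous step

W_hx = torch.randn(num_inputs, num_hiddens) * 0.01
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01
b_h = torch.zeros(num_hiddens)
W_ho = torch.randn(num_hiddens, num_inputs) * 0.01
b_o = torch.zeros(num_inputs)

H = torch.tanh(H @ W_hh + X @ W_hx + b_h)      # update the hidden state
O = H @ W_ho + b_o                             # compute the output
print(H.shape, O.shape)                        # torch.Size([2, 4]) torch.Size([2, 5])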

Insert image description here

Insert image description here


To measure the quality of a generated sentence, perplexity is used, which is based on the average cross-entropy:
$$\pi=\frac{1}{n}\sum_{t=1}^n-\log p(x_t\mid x_{t-1},\ldots,x_1)$$
Here $p(x_t\mid x_{t-1},\ldots,x_1)$ is the probability assigned to the correct token $x_t$ given $x_1,\ldots,x_{t-1}$ (all true labels); if the model predicts correctly every time, then $p=1$ and the log term is 0. It is common to report $\exp(\pi)$, where 1 means a perfect model and larger values are worse (infinity being the worst case).
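
As a minimal sketch (my own example), perplexity can be computed from the average cross-entropy of the model's predictions:

import torch
from torch.nn import functional as F

# Hypothetical logits for 3 time steps over a vocabulary of 5 tokens
logits = torch.randn(3, 5)
targets = torch.tensor([2, 0, 4])            # the true next tokens

avg_ce = F.cross_entropy(logits, targets)    # average of -log p(x_t | x_1, ..., x_{t-1})
perplexity = torch.exp(avg_ce)
print(perplexity)                            # 1.0 would mean a perfect model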


The next knowledge point is gradient clipping.

During backpropagation over T time steps the gradients are repeatedly multiplied together, which can cause numerical instability, so gradient clipping is introduced: the gradients of all layers are concatenated into a single vector g, and if the L2 norm of this vector exceeds a threshold $\theta$, the gradient is rescaled so that its norm becomes $\theta$, i.e.:
$$\pmb{g}\leftarrow \min\left(1,\frac{\theta}{\Vert \pmb{g} \Vert}\right)\pmb{g}$$
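
A minimal sketch of this clipping rule (PyTorch's built-in torch.nn.utils.clip_grad_norm_ implements the same idea):

import torch
from torch import nn

net = nn.Linear(4, 2)
net(torch.randn(8, 4)).sum().backward()

theta = 1.0
params = [p for p in net.parameters() if p.grad is not None]
norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
if norm > theta:
    for p in params:
        p.grad[:] *= theta / norm            # g <- (theta / ||g||) * g when ||g|| > theta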


RNN has many application scenarios:

Insert image description here


Summary

  • The output of RNN depends on the current input and the hidden variables of the previous moment.
  • When applied to a language model, RNN predicts the next word based on the current word.
  • Perplexity is often used to measure the quality of a language model

Implementation of RNN from scratch

The complete code is as follows; the points that need attention are explained in the comments.

import math
import torch
from matplotlib import pyplot as plt
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)


# 接下来引入独热编码的使用
# print(F.one_hot(torch.tensor([0, 2]), len(vocab)))
# 第一个参数0和2,代表我有两个编码,第一个编码在0的位置取1,第二个在2的位置取1,而长度就是第二个参数

# 而我们每次采样得到都是批量大小*时间步数,将每个取值(标量)转换为独热编码就是三维
# 批量大小*时间步数*独热编码,那为了方便,我们将维度转换为时间步数*批量大小再去变成独热编码
# 这样每个时刻的数值就连在一起了方便使用,如下:
X = torch.arange(10).reshape((2, 5))  # 批量为2,时间步长为5
# print(F.one_hot(X.T, 28).shape)  # 输出为5,2,28


# 初始化模型参数
def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size
    # 因为输入是一个字符,就是1个独特编码,输出是预测的下一个字符也是独热编码,因此长度都是独特编码的长度
    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    # 隐藏层参数
    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = torch.zeros(num_hiddens, device=device)
    # 输出层参数
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # 附加梯度
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)  # 我们后面要计算梯度
    return params


# 下面是对RNN模型的定义
# 定义初始隐藏层的状态
def init_rnn_state(batch_size, num_hiddens, device):
    # 这里用元组的原因是为了和后面LSTM统一
    return (torch.zeros((batch_size, num_hiddens), device=device), )

def rnn(inputs, state, params):
    # inputs的形状:(时间步数量,批量大小,词表大小)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state  # 注意state是元组,第二个参数我们暂时不要,所有不用接受,但是要有逗号,否则H是元组
    outputs = []
    # X的形状:(批量大小,词表大小),这就是我们前面转置的原因,方便对同一时间步的输入做预测
    for X in inputs:
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)  # 更新H
        Y = torch.mm(H, W_hq) + b_q  # 对Y做出预测
        outputs.append(Y)  # 这里outputs是数组,长度为时间步数量,每个元素都是批量大小*词表大小
    # 那个下面对output进行堆叠,就是将时间步维度去掉,行数为(时间步*批量大小),列为词表大小
    return torch.cat(outputs, dim=0), (H,)


# 用类来封装这些函数
class RNNModelScratch: #@save
    """从零开始实现的循环神经网络模型"""
    def __init__(self, vocab_size, num_hiddens, device,get_params, init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        # 下面这两个其实是函数,第一个就是刚才初始化隐状态的函数,第二个就是rnn函数进行前向计算
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        return self.forward_fn(X, state, self.params)  # 进行前向计算后返回

    def begin_state(self, batch_size, device):
        return self.init_state(batch_size, self.num_hiddens, device)

# 检查输出是否具有正确的形状
num_hiddens = 512
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params,
                      init_rnn_state, rnn)
state = net.begin_state(X.shape[0], d2l.try_gpu())
Y, new_state = net(X.to(d2l.try_gpu()), state)
print(Y.shape, "\n",len(new_state),"\n", new_state[0].shape)


# 定义预测函数
def predict_ch8(prefix, num_preds, net, vocab, device):  #@save
    """在prefix后面生成新字符"""
    # prefix是用户提供的一个包含多个字符的字符串
    state = net.begin_state(batch_size=1, device=device)
    outputs = [vocab[prefix[0]]]  # 将其第一个字符转换为数字放入其中
    get_input = lambda: torch.tensor([outputs[-1]], device=device).reshape((1, 1))
    # 上是一个匿名函数,可以在每次outputs更新后都调用outputs的最后一个元素
    for y in prefix[1:]:  # 预热期,此时不做预测,我们用这些字符不断来更新state
        _, state = net(get_input(), state)
        outputs.append(vocab[y])  # 将下一个待作为输入的转换为数字进入
    for _ in range(num_preds):  # 预测num_preds步
        y, state = net(get_input(), state)  # 预测并更新state
        outputs.append(int(y.argmax(dim=1).reshape(1)))  # 这就是将预测的放入,并作为下一次的输入
    return ''.join([vocab.idx_to_token[i] for i in outputs])  # 拼接成字符

print(predict_ch8('time traveller ', 10, net, vocab, d2l.try_gpu()))  # 看看效果


# 梯度裁剪
def grad_clipping(net, theta):  #@save
    """裁剪梯度"""
    if isinstance(net, nn.Module):
        params = [p for p in net.parameters() if p.requires_grad]
        # 取出那些需要更新的梯度
    else:
        params = net.params
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm  # 对梯度进行修剪


# 训练模型
#@save
def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """训练网络一个迭代周期(定义见第8章)"""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # 训练损失之和,词元数量
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # 如果用的是打乱的,那么后一个小批量和前一个小批量的样本之间并不是连接在一起的
            # 那么它们的隐变量不存在关系,所以必须初始化
            # 在第一次迭代或使用随机抽样时初始化state
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:  # 否则的话,我们就可以沿用上次计算完的隐变量,只不过detach是断掉链式求导,我们现在隐变量是数值了,跟之前的没有关系了
            if isinstance(net, nn.Module) and not isinstance(state, tuple):
                # state对于nn.GRU是个张量,这部分可以认为我们将state变换为常数
                # 那么梯度更新时就不会再和前面批次的梯度进行相乘,这里就直接断掉梯度的链式法则了
                state.detach_()
            else:
                # state对于nn.LSTM或对于我们从零开始实现的模型是个张量,这部分在后面有用
                for s in state:
                    s.detach_()
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        y_hat, state = net(X, state)
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()  # 清空梯度
            l.backward()
            grad_clipping(net, 1)
            updater.step()  # 更新参数
        else:
            l.backward()
            grad_clipping(net, 1)
            # 因为已经调用了mean函数
            updater(batch_size=1)
        metric.add(l * y.numel(), y.numel())
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()


#@save
def train_ch8(net, train_iter, vocab, lr, num_epochs, device,use_random_iter=False):
    """训练模型(定义见第8章)"""
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    # 初始化
    if isinstance(net, nn.Module):
        updater = torch.optim.SGD(net.parameters(), lr)
    else:
        updater = lambda batch_size: d2l.sgd(net.params, lr, batch_size)
    predict = lambda prefix: predict_ch8(prefix, 50, net, vocab, device)  # 预测函数
    # 训练和预测
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter)
        if (epoch + 1) % 10 == 0:
            print(predict('time traveller'))
            animator.add(epoch + 1, [ppl])
    print(f'困惑度 {ppl:.1f}, {speed:.1f} 词元/秒 {str(device)}')
    print(predict('time traveller'))
    print(predict('traveller'))


# 顺序采样
num_epochs, lr = 500, 1
train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu())
plt.show()
# 随机采用
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params,init_rnn_state, rnn)
train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu(),use_random_iter=True)
plt.show()

Insert image description here

困惑度 1.0, 50189.8 词元/秒 cuda:0
time traveller for so it will be convenient to speak of himwas e
traveller with a slight accession ofcheerfulness really thi

Insert image description here

困惑度 1.5, 47149.4 词元/秒 cuda:0
time traveller proceeded anyreal body must have extension in fou
traveller held in his hand was a glitteringmetallic furmime

Summary

  • We can train a character-level language model based on a recurrent neural network to generate subsequent text based on the prefix of the user-supplied text
  • A simple recurrent neural network language model includes input encoding, recurrent neural network model and output generation
  • The recurrent neural network model needs to be initialized before training, but random sampling and sequential division use different initialization methods.
  • When using sequential partitioning, we need to detach the gradient to reduce the amount of computation (detach)
  • Before making any predictions, the model updates itself through a warm-up period (obtaining a better hidden state than the initial value, training only modifies the parameters, not the state)
  • Gradient clipping can prevent gradient explosion, but it cannot cope with gradient disappearance.

Concise implementation of RNN

import torch
from matplotlib import pyplot as plt
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)


# 定义模型
num_hiddens = 256
rnn_layer = nn.RNN(len(vocab), num_hiddens)  # 直接调用模型
# 初始化隐状态
state = torch.zeros((1, batch_size, num_hiddens))
X = torch.rand(size=(num_steps, batch_size, len(vocab)))
Y, state_new = rnn_layer(X, state)
# 这里要注意的是rnn_layer的输出Y并不是我们想要的预测变量!而是隐状态!里面只进行了隐状态的计算而已


# 完成的RNN模型
#@save
class RNNModel(nn.Module):
    """循环神经网络模型"""
    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer  # 计算隐状态
        self.vocab_size = vocab_size
        self.num_hiddens = self.rnn.hidden_size
        # 如果RNN是双向的(之后将介绍),num_directions应该是2,否则应该是1
        if not self.rnn.bidirectional:
            self.num_directions = 1
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)  # 输出层计算Y
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        X = F.one_hot(inputs.T.long(), self.vocab_size)
        X = X.to(torch.float32)
        Y, state = self.rnn(X, state)
        # 全连接层首先将Y的形状改为(时间步数*批量大小,隐藏单元数)
        # 它的输出形状是(时间步数*批量大小,词表大小)。
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, device, batch_size=1):
        if not isinstance(self.rnn, nn.LSTM):
            # nn.GRU以张量作为隐状态
            return  torch.zeros((self.num_directions * self.rnn.num_layers,
                                 batch_size, self.num_hiddens),device=device)
        else:
            # nn.LSTM以元组作为隐状态
            return (torch.zeros((
                self.num_directions * self.rnn.num_layers,
                batch_size, self.num_hiddens), device=device),
                    torch.zeros((self.num_directions * self.rnn.num_layers,
                        batch_size, self.num_hiddens), device=device))

# 训练与预测
device = d2l.try_gpu()
net = RNNModel(rnn_layer, vocab_size=len(vocab))
net = net.to(device)
num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)
plt.show()

Insert image description here

perplexity 1.3, 390784.5 tokens/sec on cuda:0
time travellerit s against reatou dimensions of space generally 
traveller pus he iryed it apredinnen it a mamul redoun abs 

Summary

  • The high-level API of the deep learning framework provides the implementation of the RNN layer
  • The RNN layer of the high-level API returns an output and an updated hidden state. We also need another linear layer to calculate the output of the entire model.
  • Compared to implementing RNN from scratch, implementing it using high-level APIs can speed up training

Back propagation through time

In an RNN, forward propagation is relatively simple, but backpropagation through time requires us to unroll the RNN one time step at a time to obtain the dependencies between model variables and parameters, and then apply the chain rule to compute and store the gradients. When the sequence length T is large, this dependency chain can become very long.

Suppose RNN can be expressed as:
$$h_t=f(x_t, h_{t-1},w_h),\qquad o_t=g(h_t,w_o)$$
and the loss function is:
$$L(x_1,\ldots,x_T,y_1,\ldots,y_T,w_h,w_o)=\frac{1}{T}\sum_{t=1}^T l(y_t,o_t)$$
Then, when computing the gradient:
$$\frac{\partial L}{\partial w_h}=\frac{1}{T}\sum_{t=1}^T \frac{\partial l(y_t,o_t)}{\partial w_h}=\frac{1}{T}\sum_{t=1}^T\frac{\partial l(y_t,o_t)}{\partial o_t}\frac{\partial g(h_t,w_o)}{\partial h_t}\frac{\partial h_t}{\partial w_h}$$
The most troublesome factor above is the last one, $\frac{\partial h_t}{\partial w_h}$, because $h_t$ depends not only on $w_h$ but also on $h_{t-1}$, and $h_{t-1}$ in turn depends on $w_h$, so the computation keeps unrolling, i.e.:
$$\frac{\partial h_t}{\partial w_h}=\frac{\partial f(x_t,h_{t-1},w_h)}{\partial w_h}+\sum_{i=1}^{t-1}\left(\prod_{j=i+1}^t \frac{\partial f(x_j,h_{j-1},w_h)}{\partial h_{j-1}}\right)\frac{\partial f(x_i,h_{i-1},w_h)}{\partial w_h}$$
If this full chain is computed exactly, then for large t the chain becomes very long and expensive to evaluate. There are several ways to deal with this:


Full computation

The simplest idea is to compute the full chain directly, but this is very slow and gradient explosion is likely: small changes in the initial conditions are amplified by the repeated multiplications and can have a huge impact on the result, similar to the butterfly effect, which is undesirable.


Truncating the time steps

We can truncate the summation above after $\tau$ steps, i.e. terminate the chain rule at $\frac{\partial h_{t-\tau}}{\partial w_h}$. This is usually called truncated backpropagation through time. Doing so yields a model that focuses mainly on short-term rather than long-term effects, which biases the estimate toward simpler and more stable models.
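
In code, truncation is usually implemented by detaching the hidden state between chunks, which is exactly what the from-scratch training loop above does with detach_. A minimal sketch (my own illustration):

import torch
from torch import nn

rnn = nn.RNN(input_size=3, hidden_size=4)
state = torch.zeros(1, 1, 4)

for chunk in range(10):
    X = torch.randn(5, 1, 3)                 # a chunk of 5 time steps
    output, state = rnn(X, state)
    output.sum().backward()                  # gradients flow back only within this chunk
    rnn.zero_grad()
    # Detach so that the next chunk's backward pass stops here instead of
    # unrolling through every previous chunk
    state = state.detach()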


Random truncation

Introduce a random variable $\xi_t$ to replace $\frac{\partial h_t}{\partial w_h}$, with $P(\xi_t=0)=1-\pi_t$ and $P(\xi_t=\pi_t^{-1})=\pi_t$, so that $E[\xi_t]=1$. Define:
$$z_t=\frac{\partial f(x_t,h_{t-1},w_h)}{\partial w_h}+\xi_t \frac{\partial f(x_t,h_{t-1},w_h)}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial w_h}$$
It can then be shown that $E[z_t]=\frac{\partial h_t}{\partial w_h}$, and this results in truncations of random lengths.

Gated recurrent unit (GRU)

This mechanism introduces a reset gate and an update gate to better control how sequential information is passed along, as follows:

Insert image description here

Here $R_t$ and $Z_t$ are called the reset gate and the update gate respectively. When $Z_t=1$ we get $H_t=H_{t-1}$, i.e. the information is passed through directly without being updated at all; when $Z_t=0$ and $R_t=0$, the information in $H_{t-1}$ is ignored, which cuts off the temporal transmission and is equivalent to reinitializing the state.

Note that $R_t$ and $H_{t-1}$ are combined by element-wise multiplication.

import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)


# 初始化模型参数,这部分和RNN不同
def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size  # 输入输出都是这个长度的向量

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    def three():  # 用这个函数可以减少重复写
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                torch.zeros(num_hiddens, device=device))

    W_xz, W_hz, b_z = three()  # 更新门参数
    W_xr, W_hr, b_r = three()  # 重置门参数
    W_xh, W_hh, b_h = three()  # 候选隐状态参数
    # 输出层参数
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # 附加梯度
    params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params


# 初始化隐状态
def init_gru_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),)


# 定义模型
def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params  # 获取参数
    H, = state  # 隐状态
    outputs = []
    for X in inputs:
        Z = torch.sigmoid((X @ W_xz) + (H @ W_hz) + b_z)  # 计算更新门,@是矩阵乘法
        R = torch.sigmoid((X @ W_xr) + (H @ W_hr) + b_r)
        H_tilda = torch.tanh((X @ W_xh) + ((R * H) @ W_hh) + b_h)  # 注意这里R*H是按元素
        H = Z * H + (1 - Z) * H_tilda  # 这里也是按元素
        Y = H @ W_hq + b_q  # 输出
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H,)  # 同样是叠在一起


vocab_size, num_hiddens, device = len(vocab), 256, d2l.try_gpu()
num_epochs, lr = 500, 1
model = d2l.RNNModelScratch(len(vocab), num_hiddens, device, get_params,
                            init_gru_state, gru)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
plt.show()

Insert image description here

perplexity 1.1, 16015.2 tokens/sec on cuda:0
time traveller for so it will be convenient to speak of himwas e
traveller for so it will be convenient to speak of himwas e

Then the concise implementation of GRU is also very simple:

num_inputs = vocab_size
gru_layer = nn.GRU(num_inputs, num_hiddens)
model = d2l.RNNModel(gru_layer, len(vocab))  # 封装成model的同时会加上线型层
model = model.to(device)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
plt.show()

Insert image description here

perplexity 1.0, 256679.5 tokens/sec on cuda:0
time traveller for so it will be convenient to speak of himwas e
travelleryou can show black is white by argument said filby

You can see that calling high-level APIs is much faster than implementing it from scratch.


Summary

  • Gated RNNs better capture dependencies on sequences with long time steps apart
  • Reset gate helps capture short-term dependencies in a series
  • Update gates help capture long-term dependencies in sequences
  • When the reset gate is opened, the gated recurrent unit contains the basic recurrent neural network; when the update gate is opened, the gated recurrent unit can skip subsequences

Long short-term memory network (LSTM)

This part of the teacher’s lecture is relatively simple and focuses more on implementation. For a more comprehensive introduction to LSTM, you can watch the relevant chapters in Teacher Li Hongyi’s course, or read my blog [here]( [Machine Learning] Li Hongyi - Recurrent Neural Network (Recurrent Neural Network)_FavoriteStar's Blog-CSDN Blog).

The structure of LSTM is as follows:

Insert image description here

Its main feature is that it introduces three gates and an additional state $C_t$ (the memory cell) to better store and control information. The three gates are:

  • Input gate: determines whether to ignore input data
  • Forget gate: Decrease value towards zero
  • Output gate: determines whether to use hidden state

$$I_t=\sigma(X_tW_{xi}+H_{t-1}W_{hi}+b_i)$$
$$F_t=\sigma(X_tW_{xf}+H_{t-1}W_{hf}+b_f)$$
$$O_t=\sigma(X_tW_{xo}+H_{t-1}W_{ho}+b_o)$$
$$\tilde{C}_t=\tanh(X_tW_{xc}+H_{t-1}W_{hc}+b_c)$$
$$C_t=F_t\odot C_{t-1}+I_t\odot \tilde{C}_t$$
$$H_t=O_t\odot \tanh(C_t)$$

For a more detailed explanation, see the blog I mentioned above.

import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

# 初始化模型参数
def get_lstm_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device)*0.01

    def three():
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                torch.zeros(num_hiddens, device=device))

    W_xi, W_hi, b_i = three()  # 输入门参数
    W_xf, W_hf, b_f = three()  # 遗忘门参数
    W_xo, W_ho, b_o = three()  # 输出门参数
    W_xc, W_hc, b_c = three()  # 候选记忆元参数
    # 输出层参数
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # 附加梯度
    params = [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc,
              b_c, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params

# 初始化隐状态,这部分就是两个了
def init_lstm_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),
            torch.zeros((batch_size, num_hiddens), device=device))

# 定义模型
def lstm(inputs, state, params):
    [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c,
     W_hq, b_q] = params  # 获取参数
    (H, C) = state  # 获取隐状态
    outputs = []
    for X in inputs:
        I = torch.sigmoid((X @ W_xi) + (H @ W_hi) + b_i)
        F = torch.sigmoid((X @ W_xf) + (H @ W_hf) + b_f)
        O = torch.sigmoid((X @ W_xo) + (H @ W_ho) + b_o)
        C_tilda = torch.tanh((X @ W_xc) + (H @ W_hc) + b_c)
        C = F * C + I * C_tilda
        H = O * torch.tanh(C)
        Y = (H @ W_hq) + b_q  # 计算输出
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H, C)

vocab_size, num_hiddens, device = len(vocab), 512, d2l.try_gpu()
num_epochs, lr = 500, 1
model = d2l.RNNModelScratch(len(vocab), num_hiddens, device, get_lstm_params,
                            init_lstm_state, lstm)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
plt.show()

Insert image description here

perplexity 1.1, 13369.1 tokens/sec on cuda:0
time traveller well the wild the urais diff me time srivelly are
travelleryou can show black is white by argument said filby

Here is a concise implementation:

num_inputs = vocab_size
lstm_layer = nn.LSTM(num_inputs, num_hiddens)
model = d2l.RNNModel(lstm_layer, len(vocab))  # 同样会补上输出的线型层实现
model = model.to(device)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
plt.show()

Insert image description here

perplexity 1.0, 147043.6 tokens/sec on cuda:0
time travelleryou can show black is white by argument said filby
travelleryou can show black is white by argument said filby

Calling the high-level API is about ten times faster than the implementation from scratch.


Summary

  • LSTM has three types of gates: input gates, forget gates and output gates
  • The output of the hidden layer of LSTM includes hidden states and memory elements. Only the hidden state will be passed to the output layer, while the memory elements are completely internal information.
  • LSTM can alleviate the vanishing and exploding gradient problems; tanh is used repeatedly to map values back into [-1, 1]. For details, see the end of my blog.

Deep recurrent neural network

In order to obtain more nonlinearity and stronger representation capabilities, we can extend the recurrent neural network in depth:

Insert image description here

This part is still very simple and easy to understand, and can also be used for GRU and LSTM.

import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

vocab_size, num_hiddens, num_layers = len(vocab), 256, 2  # 指定隐藏层的层数
num_inputs = vocab_size
device = d2l.try_gpu()
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers)  # 第三个参数指定隐藏层数目
model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)

num_epochs, lr = 500, 2
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
plt.show()

Insert image description here

perplexity 1.0, 128068.2 tokens/sec on cuda:0
time travelleryou can show black is white by argument said filby
travelleryou can show black is white by argument said filby

Summary

  • In deep recurrent neural networks, the information about the hidden state is passed to the next time step of the current layer and the current time step of the next layer.
  • There are many different styles of deep recurrent neural networks, such as LSTM, GRU, RNN, etc. These models can all be implemented using the high-level API of the deep learning framework.
  • In general, deep recurrent neural networks require a lot of tuning (such as the learning rate and gradient clipping) to ensure proper convergence, and model initialization also needs care.

Bidirectional recurrent neural network

The previous models observe historical data to predict future data, but in tasks such as fill-in-the-blank, future information is also crucial for predicting the blank:

Insert image description here

So a bidirectional recurrent neural network is one that can also observe future information. It has a forward RNN hidden layer and a backward RNN hidden layer, and the input to the output layer is the concatenation of the hidden states of these two layers, as follows:

Insert image description here

Although this is fine during training, this model cannot be used for next-token prediction, because at prediction time it cannot see future information, which leads to very bad results. Its main use is to extract features from sequences: since it can observe future information, the extracted features are more comprehensive.
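
A minimal sketch (my own example) showing that with bidirectional=True the output dimension doubles, which is why the RNNModel class above uses a linear layer of size num_hiddens * 2 for the bidirectional case:

import torch
from torch import nn

birnn = nn.GRU(input_size=8, hidden_size=16, bidirectional=True)
X = torch.randn(5, 2, 8)                     # (num_steps, batch_size, input_size)
output, state = birnn(X)
print(output.shape)                          # torch.Size([5, 2, 32]): forward and backward concatenated
print(state.shape)                           # torch.Size([2, 2, 16]): final state of each direction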

import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

# 加载数据
batch_size, num_steps, device = 32, 35, d2l.try_gpu()
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
# 通过设置“bidirective=True”来定义双向LSTM模型
vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
num_inputs = vocab_size
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers, bidirectional=True)
model = d2l.RNNModel(lstm_layer, len(vocab))  # 里面已经设置了当为双向时线性层会不同
model = model.to(device)
# 训练模型
num_epochs, lr = 500, 1
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
plt.show()

Insert image description here

perplexity 1.1, 76187.7 tokens/sec on cuda:0
time travellerererererererererererererererererererererererererer
travellerererererererererererererererererererererererererer

It can be seen that the prediction effect is extremely poor.


Summary

  • In a bidirectional recurrent neural network, the hidden state of each time step is determined simultaneously by the data before and after the current time step.
  • Bidirectional recurrent neural networks are similar to the "forward-backward" algorithm in probabilistic graphical models.
  • Bidirectional recurrent neural networks are mainly used for sequence encoding and observation estimation given bidirectional context.
  • Bidirectional recurrent neural networks are very expensive to train due to longer gradient chains.

Machine Translation and Datasets

import os
import torch
from d2l import torch as d2l

#@save
from matplotlib import pyplot as plt

d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
                           '94646ad1522d915e7b0f9296181140edcf86a4f5')

#@save
def read_data_nmt():
    """载入“英语-法语”数据集"""
    data_dir = d2l.download_extract('fra-eng')
    with open(os.path.join(data_dir, 'fra.txt'), 'r',encoding='utf-8') as f:
        return f.read()

raw_text = read_data_nmt()
print(raw_text[:75])

#@save
def preprocess_nmt(text):
    """预处理“英语-法语”数据集"""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # 使用空格替换不间断空格
    # 使用小写字母替换大写字母
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()  # 将utf-8中半角全角空格都换成空格
    # 在单词和标点符号之间插入空格
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)

text = preprocess_nmt(raw_text)
print(text[:80])


#@save
def tokenize_nmt(text, num_examples=None):
    """词元化“英语-法语”数据数据集"""
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if num_examples and i > num_examples:
            break
        parts = line.split('\t')  # 按照制表符将英文和法文分开
        if len(parts) == 2:  # 说明前面是英文,后面是法文
            source.append(parts[0].split(' '))  # 按照我们前面插入的空格来划分
            target.append(parts[1].split(' '))
    return source, target

source, target = tokenize_nmt(text)
print(source[:6], target[:6])

def show_list_len_pair_hist(legend, xlabel, ylabel, xlist, ylist):
    """绘制列表长度对的直方图"""
    d2l.set_figsize()
    _, _, patches = d2l.plt.hist(
        [[len(l) for l in xlist], [len(l) for l in ylist]])
    d2l.plt.xlabel(xlabel)
    d2l.plt.ylabel(ylabel)
    for patch in patches[1].patches:
        patch.set_hatch('/')
    d2l.plt.legend(legend)

show_list_len_pair_hist(['source', 'target'], '# tokens per sequence',
                        'count', source, target)
plt.show()

src_vocab = d2l.Vocab(source, min_freq=2,reserved_tokens=['<pad>', '<bos>', '<eos>'])
# 转换成词表,然后加入一些特殊的词,分别是填充、开始、结尾
print(len(src_vocab))

#@save
def truncate_pad(line, num_steps, padding_token):  # 这是为了保证我们的输入都是等长的
    """截断或填充文本序列"""
    if len(line) > num_steps:  # 如果这个句子的长度大于设定长度,我们就截断
        return line[:num_steps]  # 截断
    return line + [padding_token] * (num_steps - len(line))  # 如果小于就进行填充

print(truncate_pad(src_vocab[source[0]], 10, src_vocab['<pad>']))

#@save
def build_array_nmt(lines, vocab, num_steps):
    """将机器翻译的文本序列转换成小批量"""
    lines = [vocab[l] for l in lines]  # 将文本转换为向量
    lines = [l + [vocab['<eos>']] for l in lines]  # 每一个都要加上结尾符
    array = torch.tensor([truncate_pad(l, num_steps, vocab['<pad>']) for l in lines])
    # 进行填充或截断
    valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)
    # 这是把每个句子除填充外的有效长度都标注出来,之后计算会用到
    return array, valid_len

#@save
def load_data_nmt(batch_size, num_steps, num_examples=600):
    """返回翻译数据集的迭代器和词表"""
    text = preprocess_nmt(read_data_nmt())  # 预处理
    source, target = tokenize_nmt(text, num_examples)  # 生成英文和法文两部分
    # 转成词典
    src_vocab = d2l.Vocab(source, min_freq=2,reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = d2l.Vocab(target, min_freq=2,reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
    data_iter = d2l.load_array(data_arrays, batch_size)  # 一次迭代含有4个变量
    return data_iter, src_vocab, tgt_vocab

train_iter, src_vocab, tgt_vocab = load_data_nmt(batch_size=2, num_steps=8)
for X, X_valid_len, Y, Y_valid_len in train_iter:
    print('X:', X.type(torch.int32))
    print('X的有效长度:', X_valid_len)
    print('Y:', Y.type(torch.int32))
    print('Y的有效长度:', Y_valid_len)
    break

Insert image description here

Go.	Va !
Hi.	Salut !
Run!	Cours !
Run!	Courez !
Who?	Qui ?
Wow!	Ça alors !

go .	va !
hi .	salut !
run !	cours !
run !	courez !
who ?	qui ?
wow !	ça alors !
[['go', '.'], ['hi', '.'], ['run', '!'], ['run', '!'], ['who', '?'], ['wow', '!']] [['va', '!'], ['salut', '!'], ['cours', '!'], ['courez', '!'], ['qui', '?'], ['ça', 'alors', '!']]
10012
[47, 4, 1, 1, 1, 1, 1, 1, 1, 1]
X: tensor([[  7,   0,   4,   3,   1,   1,   1,   1],
        [118,  55,   4,   3,   1,   1,   1,   1]], dtype=torch.int32)
X的有效长度: tensor([4, 4])
Y: tensor([[6, 7, 0, 4, 3, 1, 1, 1],
        [0, 4, 3, 1, 1, 1, 1, 1]], dtype=torch.int32)
Y的有效长度: tensor([5, 3])

Summary

  • Machine translation refers to the automatic translation of text sequences from one language to another.
  • The vocabulary size when using word-level tokenization is significantly larger than when using character-level tokenization. To alleviate this problem, we can treat low-frequency tokens as the same unknown token.
  • By truncating and padding text sequences, you can ensure that all text sequences are the same length so that they can be loaded in small batches.

Encoder-decoder architecture

This is a very important architecture. Machine translation is a core problem of sequence transduction models, whose inputs and outputs are both variable-length sequences. To handle this type of structure, we use the encoder-decoder architecture.

First there is the encoder, which accepts a variable-length sequence as input and converts it into an encoding state with a fixed shape; then there is the decoder, which maps the fixed-shape encoding state to a variable-length sequence, as follows:

Insert image description here

For an introduction to the AE autoencoder, you can read my blog. It is explained in more detail and will help you understand this structure.

from torch import nn


#@save
class Encoder(nn.Module):
    """编码器-解码器架构的基本编码器接口"""
    def __init__(self, **kwargs):
        super(Encoder, self).__init__(**kwargs)

    def forward(self, X, *args):
        raise NotImplementedError

#@save
class Decoder(nn.Module):
    """编码器-解码器架构的基本解码器接口"""
    def __init__(self, **kwargs):
        super(Decoder, self).__init__(**kwargs)

    def init_state(self, enc_outputs, *args):  # 这部分就是编码后的状态
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError

#@save
class EncoderDecoder(nn.Module):
    """编码器-解码器架构的基类"""
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)  # 计算编码后的值
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)  # 解码

Summary

  • The encoder-decoder architecture can take variable-length sequences as input and output, so it is suitable for sequence transformation problems such as machine translation.
  • The encoder takes a sequence of variable length as input and converts it into an encoded state with a fixed shape.
  • The decoder maps encoding states with fixed shapes into sequences of variable length.

Sequence-to-sequence learning (Seq2Seq)

In this kind of task we are given a sequence and want to transform it into another sequence. The most typical application is machine translation: given a sentence in the source language, translate it into the target language. This requires handling source sentences of variable length, and the translated sentences may have different lengths as well.

Then this task was initially done using the encoder-decoder architecture:
Insert image description here

Both the encoder and decoder use RNN models.

In the encoder, the RNN takes a variable-length sequence as input and converts it into a fixed-shape hidden state, so the information of the input sequence is encoded into that hidden state. The hidden state of the encoder's last time step is then used as the initial hidden state of the decoder, and the decoder's RNN makes predictions based on this initial hidden state and its own inputs.

This architecture behaves differently during training and prediction: during training the decoder's input is always the ground-truth target sequence (teacher forcing), while during prediction the decoder's input is its own previous prediction, which may not be correct.
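
A minimal sketch of how the decoder input is built for teacher forcing (the same construction appears in the training code below; the indices here are made up):

import torch

# Hypothetical batch of padded target sequences (token indices)
Y = torch.tensor([[12, 7, 4, 3, 1],
                  [ 9, 4, 3, 1, 1]])
bos = torch.full((Y.shape[0], 1), 2)         # assume index 2 is '<bos>'

# Teacher forcing: feed <bos> plus the targets shifted right by one,
# and train the decoder to predict Y from this input
dec_input = torch.cat([bos, Y[:, :-1]], dim=1)
print(dec_input)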

Insert image description here

Since we are no longer predicting single characters but entire sentences, we need a metric to quantify how good a predicted sentence is. The commonly used one is BLEU; its details are as follows:

$p_n$ is the precision of the n-grams in the prediction. For example, given the true sequence ABCDEF and the predicted sequence ABBCD, $p_1$ checks whether each single token in the prediction (A, B, B, C, D) appears in the true sequence; 4 of them can be matched (B is only matched once), so $p_1=\frac{4}{5}$. By the same logic, $p_2=\frac{3}{4}$, $p_3=\frac{1}{3}$, $p_4=0$.

And BLEU is defined as follows:
$$\exp\left(\min\left(0,\;1-\frac{len_{label}}{len_{pred}}\right)\right)\prod_{n=1}^k p_n^{\frac{1}{2^n}}$$
The exponential term penalizes predictions that are too short: if I predicted only a single token and it appeared in the reference, all my $p_n$ (in fact just $p_1$) would be 1, which should not count as perfect. In the product, since each $p_n\leq 1$, longer n-gram matches get a smaller exponent $\frac{1}{2^n}$ and thus carry a larger weight.
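
A minimal sketch of this formula (my own implementation of the definition above, not taken from the d2l library):

import collections
import math

def bleu_score(pred_tokens, label_tokens, k):
    """BLEU as defined above: a brevity penalty times weighted n-gram precisions."""
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches = 0
        label_counts = collections.Counter(
            tuple(label_tokens[i: i + n]) for i in range(len_label - n + 1))
        for i in range(len_pred - n + 1):
            ngram = tuple(pred_tokens[i: i + n])
            if label_counts[ngram] > 0:       # each reference n-gram may only be matched once
                num_matches += 1
                label_counts[ngram] -= 1
        score *= (num_matches / (len_pred - n + 1)) ** (0.5 ** n)
    return score

print(bleu_score(list('ABBCD'), list('ABCDEF'), 2))  # uses p1 = 4/5 and p2 = 3/4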


import collections
import math
import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l


# @save
class Seq2SeqEncoder(d2l.Encoder):
    """用于序列到序列学习的循环神经网络编码器"""

    def __init__(self, vocab_size, embed_size, num_hiddens,
                 num_layers, dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        # 嵌入层
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # 词嵌入,将文字自动转换成词向量
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers, dropout=dropout)
        # 因为已经转换为词向量,因此输入为词向量的长度

    def forward(self, X, *args):
        # 输出'X'的形状:(batch_size,num_steps,embed_size)
        X = self.embedding(X)  # 先转换为词向量
        # 在循环神经网络模型中,第一个轴对应于时间步
        X = X.permute(1, 0, 2)  # 转换为时间步*批量大小*长度
        # 如果未提及状态,则默认为0
        output, state = self.rnn(X)
        # output的形状:(num_steps,batch_size,num_hiddens),
        # 因为有多层,它可以认为是最后一层的所有时间步的隐状态输出
        # state的形状:(num_layers,batch_size,num_hiddens)
        # 它是所有层的最后一个时间步的隐状态输出
        return output, state


class Seq2SeqDecoder(d2l.Decoder):
    """RNN decoder for sequence-to-sequence learning"""

    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqDecoder, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers,
                          dropout=dropout)
        # Because of the concatenation below, the input size here is embed_size + num_hiddens
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]  # enc_outputs is (output, state); [1] picks out the state

    def forward(self, X, state):
        # The output 'X' has shape (batch_size, num_steps, embed_size)
        X = self.embedding(X).permute(1, 0, 2)
        # Broadcast the context so it has the same num_steps as X;
        # state[-1] is the encoder's last layer at the last time step
        context = state[-1].repeat(X.shape[0], 1, 1)
        # context has shape (num_steps, batch_size, num_hiddens)
        X_and_context = torch.cat((X, context), 2)
        # Concatenate them, so the RNN input size is embed_size + num_hiddens.
        # The intuition: passing the hidden state alone may not be enough, so the last
        # layer's final hidden state, which condenses a lot of information, is also
        # concatenated with the input at every time step
        output, state = self.rnn(X_and_context, state)
        output = self.dense(output).permute(1, 0, 2)
        # output has shape (batch_size, num_steps, vocab_size)
        # state has shape (num_layers, batch_size, num_hiddens)
        return output, state
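
# Sanity check with hypothetical sizes (my own sketch, not part of the course code):
# the decoder takes the encoder's final state and returns output of shape
# (batch_size, num_steps, vocab_size)
demo_enc = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, num_layers=2)
demo_dec = Seq2SeqDecoder(vocab_size=10, embed_size=8, num_hiddens=16, num_layers=2)
demo_enc.eval()
demo_dec.eval()
demo_X = torch.zeros((4, 7), dtype=torch.long)  # batch_size=4, num_steps=7
demo_state = demo_dec.init_state(demo_enc(demo_X))
demo_out, demo_state = demo_dec(demo_X, demo_state)
print(demo_out.shape)    # torch.Size([4, 7, 10])
print(demo_state.shape)  # torch.Size([2, 4, 16])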


# @save
def sequence_mask(X, valid_len, value=0):  # generates the mask and applies it
    """Mask irrelevant entries in sequences"""
    maxlen = X.size(1)  # size of dimension 1 of X (the sequence length)
    mask = torch.arange((maxlen), dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    # arange produces a 1-D tensor and [None, :] turns it into a 1*maxlen 2-D tensor,
    # while valid_len has one entry per row of X and [:, None] makes it a column tensor.
    # The "<" comparison then broadcasts: e.g. with maxlen=4 the row [[0, 1, 2, 3]] is
    # repeated for every sample and compared with that sample's valid_len, so if
    # valid_len is 2 the first two positions are True and the rest False, which is
    # exactly the mask of the valid positions
    X[~mask] = value
    return X
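
# Small usage check (my own sketch): only the first valid_len entries of each row
# survive, the rest are set to `value` (0 by default)
mask_demo = sequence_mask(torch.tensor([[1, 2, 3], [4, 5, 6]]), torch.tensor([1, 2]))
print(mask_demo)  # tensor([[1, 0, 0], [4, 5, 0]])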


# @save
class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """The softmax cross-entropy loss with masking"""

    # pred has shape (batch_size, num_steps, vocab_size)
    # label has shape (batch_size, num_steps)
    # valid_len has shape (batch_size,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        self.reduction = 'none'  # do not sum or average the per-element losses
        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
            pred.permute(0, 2, 1), label)  # permute to the layout PyTorch itself expects
        weighted_loss = (unweighted_loss * weights).mean(dim=1)  # element-wise product, then mean over steps
        return weighted_loss
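
# Sanity check (my own sketch): with uniform all-ones logits the per-token loss is
# log(10) ≈ 2.3026, and rows with shorter valid_len get a proportionally smaller mean
loss_demo = MaskedSoftmaxCELoss()
print(loss_demo(torch.ones(3, 4, 10), torch.ones((3, 4), dtype=torch.long),
                torch.tensor([4, 2, 0])))
# tensor([2.3026, 1.1513, 0.0000])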


# @save
def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
    """Train a sequence-to-sequence model"""

    def xavier_init_weights(m):
        if type(m) == nn.Linear:
            nn.init.xavier_uniform_(m.weight)
        if type(m) == nn.GRU:
            for param in m._flat_weights_names:
                if "weight" in param:
                    nn.init.xavier_uniform_(m._parameters[param])

    net.apply(xavier_init_weights)
    net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss = MaskedSoftmaxCELoss()
    net.train()  # switch to training mode
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[10, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # sum of training loss, number of tokens
        for batch in data_iter:
            optimizer.zero_grad()  # clear the gradients
            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
            # English, English valid lengths, French, French valid lengths
            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0], device=device).reshape(-1, 1)  # reshape to a column
            # Y is the French target and serves as the decoder input during training, so a
            # begin-of-sequence marker is needed: drop the last token of Y and prepend <bos>.
            # This forces the model to learn to predict the first word from <bos>; the last
            # token is never used as an input, so dropping it does no harm.
            # At prediction time we then only need to give the decoder <bos> first and feed
            # its own outputs back in as subsequent inputs
            dec_input = torch.cat([bos, Y[:, :-1]], 1)  # teacher forcing
            Y_hat, _ = net(X, dec_input, X_valid_len)  # first argument: encoder input,
            # second: decoder input, third: valid length of the encoder input
            l = loss(Y_hat, Y, Y_valid_len)  # loss between the original Y and the prediction
            l.sum().backward()  # back-propagate from the scalar loss
            d2l.grad_clipping(net, 1)  # gradient clipping
            num_tokens = Y_valid_len.sum()
            optimizer.step()
            with torch.no_grad():
                metric.add(l.sum(), num_tokens)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
          f'tokens/sec on {str(device)}')
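
# Minimal illustration of the teacher-forcing input above (my own sketch with made-up
# token ids, assuming <bos> has id 1): the decoder sees <bos> plus the target shifted right
Y_demo = torch.tensor([[5, 6, 7], [8, 9, 10]])   # two target sentences of 3 tokens each
bos_demo = torch.full((Y_demo.shape[0], 1), 1)   # hypothetical <bos> id
print(torch.cat([bos_demo, Y_demo[:, :-1]], 1))  # tensor([[1, 5, 6], [1, 8, 9]])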


embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
batch_size, num_steps = 64, 10  # sentences are at most 10 tokens: longer ones are truncated, shorter ones padded
lr, num_epochs, device = 0.005, 300, d2l.try_gpu()

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
encoder = Seq2SeqEncoder(len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqDecoder(len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
net = d2l.EncoderDecoder(encoder, decoder)
train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)
plt.show()


# @save
def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps, device,
                    save_attention_weights=False):
    """Prediction for sequence-to-sequence models"""
    # Set net to evaluation mode for prediction
    net.eval()
    # Lowercase the input sentence, split it on spaces, append the end-of-sequence token,
    # and map everything through src_vocab to token indices
    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [src_vocab['<eos>']]
    # Valid length of this sentence
    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
    # Pad or truncate the sentence to the required length
    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
    # Add a batch axis to src_tokens, giving shape (batch_size, num_steps)
    enc_X = torch.unsqueeze(torch.tensor(src_tokens, dtype=torch.long, device=device),
                            dim=0)
    # Compute the encoder output
    enc_outputs = net.encoder(enc_X, enc_valid_len)
    # Compute the initial state that the decoder should receive
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    # Add a batch axis: at prediction time the decoder input is only <bos>,
    # so we give it a batch dimension as well
    dec_X = torch.unsqueeze(torch.tensor(
        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)  # add a dimension
    output_seq, attention_weight_seq = [], []
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, dec_state)  # output and hidden state
        # Use the token with the highest predicted probability as the decoder input
        # at the next time step
        dec_X = Y.argmax(dim=2)  # update the input
        pred = dec_X.squeeze(dim=0).type(torch.int32).item()  # drop the batch dimension
        # Save attention weights (covered later)
        if save_attention_weights:
            attention_weight_seq.append(net.decoder.attention_weights)
        # Once the end-of-sequence token is predicted, the output sequence is complete
        if pred == tgt_vocab['<eos>']:
            break
        output_seq.append(pred)
    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq


def bleu(pred_seq, label_seq, k):  #@save
    """Compute BLEU"""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            # Count every length-n subsequence of the label sequence in a dictionary
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1  # this predicted n-gram appears in the label sequence
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1  # decrement so it cannot be matched twice
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score
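
# Quick check against the worked example earlier in the text (label ABCDEF vs. prediction
# ABBCD): with k=2 the score is exp(1 - 6/5) * (4/5)**0.5 * (3/4)**0.25 ≈ 0.681
print(round(bleu('A B B C D', 'A B C D E F', k=2), 3))  # 0.681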

engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    translation, attention_weight_seq = predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device)
    print(f'{eng} => {translation}, bleu {bleu(translation, fra, k=2):.3f}')

Insert image description here

loss 0.019, 12068.0 tokens/sec on cuda:0
go . => va !, bleu 1.000
i lost . => j'ai perdu ., bleu 1.000
he's calm . => il est riche ., bleu 0.658
i'm home . => je suis chez moi chez moi chez moi juste ., bleu 0.537

Summary

  • According to the design of the "encoder-decoder" architecture, we can use two recurrent neural networks to design a sequence-to-sequence learning model.
  • When implementing encoders and decoders, we can use multi-layer recurrent neural networks.
  • We can use masking to filter out irrelevant calculations, such as when calculating losses.
  • In encoder-decoder training, the forced teaching method feeds the original output sequence (rather than the predicted results) into the decoder.
  • BLEU is a commonly used evaluation method that evaluates predictions by measuring the match of n-grams between the prediction sequence and the label sequence.

beam search

In the previous prediction, the strategy we adopted was the greedy strategy: at every time step we take the token with the highest current probability as the result. The final result of the greedy strategy is usually not optimal, while the computational complexity of exhaustive search is far too large, so there is a third method, beam search, which trades off between the two.

Beam search has one key parameter, the beam width $k$. At time step 1, i.e. when making the first prediction from <bos>, we keep not only the single highest-probability token but the $k$ tokens with the highest probabilities. For example, in the figure below we selected A and C at the first time step. At every subsequent time step, based on the $k$ candidate sequences kept so far, we select from the $k\vert\mathcal{Y}\vert$ possible continuations the $k$ candidate output sequences with the highest probabilities:

Insert image description here

Moreover, we consider not only the final long sequences but also every sequence kept during the search, i.e. the six candidates A, C, AB, CE, ABD, and CED. To evaluate them we use the following formula:
$$\frac{1}{L^{\alpha}}\log P(y_1,\ldots,y_L)=\frac{1}{L^{\alpha}}\sum_{t'=1}^{L}\log P(y_{t'}\mid y_1,\ldots,y_{t'-1})$$
where $L$ is the length of the sequence and $\alpha$ is usually taken as 0.75. The $\frac{1}{L^{\alpha}}$ factor balances long and short sequences: the product of probabilities of a short sequence is always larger, so this normalization effectively adds a penalty to choosing short sequences.
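As a small numeric illustration (with made-up log-probabilities, not from the lecture), the normalization can flip which candidate wins: without it the shorter sequence has the larger log-probability, while after dividing by $L^{0.75}$ the longer sequence is preferred:

alpha = 0.75
candidates = {'C': [-0.9], 'A B D': [-0.4, -0.5, -0.3]}  # hypothetical log-probabilities
for seq, logps in candidates.items():
    L = len(logps)
    print(seq, 'raw:', round(sum(logps), 3), 'normalized:', round(sum(logps) / L ** alpha, 3))
# C raw: -0.9 normalized: -0.9
# A B D raw: -1.2 normalized: -0.526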

The time complexity of beam search is $\mathcal{O}(k\vert\mathcal{Y}\vert T)$.


Summary

  • Sequence search strategies include greedy search, exhaustive search, and beam search.
  • The sequence selected by greedy search requires the least amount of calculation, but the accuracy is relatively low.
  • The sequence selected by exhaustive search has the highest accuracy, but requires the most calculations.
  • Beam search trades off accuracy and computational cost by flexibly selecting the beam width.


Origin blog.csdn.net/StarandTiAmo/article/details/127682727