Deep Learning Practice - Recurrent Neural Network (RNN, LSTM, GRU)

       For Yiru's complete project and code, please refer to GitHub: https://github.com/yiru1225 (please credit the source when reprinting; there is no need to star the projects, thanks)

Table of contents

Series Article Directory

1. Experimental summary

1. Experimental tools and content

2. Experimental data

3. Experimental objectives

4. Experimental steps

2. Overview of Recurrent Neural Networks

1. Introduction to Recurrent Neural Networks

1.1 Recurrent Neural Network Background

1.2 Concept and Principle of Recurrent Neural Network

1.3 Development History of Recurrent Neural Network

2. Prerequisite knowledge for recurrent neural networks

2.1 Sequence Model

2.2 Text preprocessing

2.3 Language Model

2.4 Datasets

3. The principle, implementation and optimization of classic recurrent neural networks

1.RNN

1.1 Principle

1.2 Code implementation (from scratch)

1.3 Code implementation (API)

1.4 Ablation experiment

    1.4.1 num_hiddens

    1.4.2 num_steps

    1.4.3 batch_size

    1.4.4 lr

    1.4.5 epoch

    1.4.6 Overview

2.LSTM

2.1 Principle

2.2 Code implementation

2.3 Ablation experiment

3.GRU

3.1 Principle

3.2 Code implementation

3.3 Ablation experiment

4. Comparative Analysis

4. Introduction and selection of advanced recurrent neural network architecture

1. Deep Recurrent Neural Network

1.1 Principle

1.2 Code implementation

1.3 Ablation experiment

2. Bidirectional Recurrent Neural Network

3. Densely connected network

4. Machine Translation and Datasets

5. Encoder-Decoder Architecture

6. Sequence-to-sequence learning

5. Summary

1. Experimental conclusion

2. References


Series Article Directory

This series of blogs focuses on the practice of deep learning (if you have any questions, please discuss and point them out in the comment area, or contact me directly by private message).

Chapter 1   Deep Learning Practice - Different Ways of Model Deployment (CNN, Yolo)_How to deploy cnn_@李梦如的博客

Chapter 2   Deep Learning Practice - Convolutional Neural Network/CNN Practice (LeNet, Resnet)_@李忆如的博客-CSDN博客

Chapter 3 Deep Learning Practice - Recurrent Neural Network (RNN, LSTM, GRU)


Synopsis

This blog mainly introduces the principles of several recurrent neural networks and works through code practice and optimization (code and datasets included).


1. Experimental summary

This chapter summarizes the experimental approach, environment, and steps, and lays out the structure of the whole report so that individual parts are easy to locate.

1. Experimental tools and content

This experiment mainly uses PyCharm to implement and optimize several recurrent neural networks in PyTorch, and compares their performance by collecting and analyzing data from ablation experiments with different parameters. In addition, drawing on papers and other materials, I studied advanced recurrent neural networks, attempted to implement and compare their training/inference, and gave some optimization ideas.

2. Experimental data

Most of the data in this experiment comes from the official datasets used with recurrent neural network models, and some test data comes from the Internet.

3. Experimental objectives

The main purpose of this experiment is to analyze in depth the principles and model definitions of recurrent neural networks, to understand what the different parameters mean and how they affect model performance, and to compare different models and parameters in practice, so as to guide the development of real projects.

4. Experimental steps

The general process of this experiment is shown in Table 1:

Table 1 Experimental process

1. Summary of Experimental Ideas

2. Overview of Recurrent Neural Networks

3. The principle, implementation and optimization of classic recurrent neural network

4. Introduction and selection of advanced recurrent neural network architecture

2. Overview of Recurrent Neural Networks

Whether this experiment works with the RNN or with other advanced/modern architectures, they all belong to the family of recurrent neural networks. This chapter therefore first summarizes the concept and principle of recurrent neural networks and briefly describes their development.

1. Introduction to Recurrent Neural Networks

1.1 Recurrent Neural Network Background

In Experiment 1 and Experiment 2 we mainly dealt with two types of data: tabular data and image data. For image data we designed a dedicated convolutional neural network architecture to model that kind of structure, so that information such as pixel positions and image labels can be used effectively. The previous experiments covered the principles, implementation, and the whole process of training, evaluation, and deployment, so I won't go into details here.

So far, however, we have assumed by default that the data comes from some distribution and that all samples are independent and identically distributed; that is, we have paid little attention to the order/context of the data. This does not hold for most data. For example, the words in an article are written in order, and if that order is shuffled randomly the original meaning becomes hard to recover. Likewise, image frames in videos, audio signals in conversations, and browsing behavior on websites are all sequential.

We demonstrate the above with an example of NLP named entity recognition, see Table 2, and the network comparison is shown in Figure 1:

Table 2 Examples of named entity recognition

The first sentence: I like eating apple ! (I love apples!)

The second sentence: The Apple is a great company! (Apple is such a great company!)

Analysis: The task is to label the word "apple". We know that the two occurrences refer to a fruit and to a company, respectively. Assume a large amount of labeled data is available for training. With a fully connected or convolutional neural network, the approach is to feed the feature vector of the word "apple" into the model and train it so that the correct label receives the highest output probability. In our corpus, however, some occurrences of "apple" are labeled as fruit and others as company. The prediction accuracy then depends only on which label is more frequent in the training set, and such a model is meaningless. The problem is that we did not train the model on the context; we trained it on the label of the word "apple" in isolation.

Figure 1 Comparison of network architectures (FCN vs CNN vs RNN)

Another problem arises from the fact that not only can we receive a sequence as input, but we may also expect to keep guessing the successor of this sequence . This is fairly common in time series analysis and can be used to predict the volatility of the stock market, the temperature profile of a patient, or the required acceleration of a race car. For the same reason, we need specific models that can handle this data.

1.2 Concept and Principle of Recurrent Neural Network

Tips: Like CNN, the recurrent neural network is a class of networks. The specific principles are explained in detail in the next two chapters.

According to 1.1 and the analysis in the previous experiments, CNNs process spatial information effectively but remain limited when it comes to correlations within the data and to prediction. Recurrent neural networks (the RNN, the main subject of this experiment, is the classic representative) came into being to handle sequence information and semantic information better.

The core principle of the recurrent neural network is to store past information and current input by introducing state variables , so that the current output can be determined. That is, it has the ability to remember and make inferences based on the content of these memories, which is also an important reason why it can use context to process sequence information.

1.3 Development History of Recurrent Neural Network

In 1982, John Hopfield, a physicist at the California Institute of Technology, invented the Hopfield network, a single-layer feedback neural network, to solve combinatorial optimization problems. This was the earliest prototype of the RNN. In 1986 the Jordan network, which uses recurrent connections, was proposed. In 1990 the Jordan network was simplified and trained with the BP algorithm, giving the simplest RNN model, which contains a single self-connected node.

Later, LSTM appeared to address the exploding and vanishing gradient problems. To address other problems, modern recurrent architectures such as GRU, bidirectional recurrent neural networks, and seq2seq appeared. The specific development is shown in Figure 2 and Figure 3:

Figure 2 Development process of recurrent neural network - graphic form 

Figure 3 Development history of recurrent neural network - table form

2. Prerequisite knowledge for recurrent neural networks

Before formally turning to the architecture and implementation of recurrent neural networks, we need to introduce some related knowledge.

Many examples of recurrent networks are based on text data, so this experiment focuses on language models. After a more detailed review of sequence data, we introduce practical techniques for text preprocessing. We then discuss the basic concepts of language models and use this discussion as inspiration for the design of recurrent neural networks. Finally, we describe gradient computation for recurrent neural networks to explore issues that may arise when training such networks.

2.1 Sequence Model

From 1.1 and 1.2 we know that recurrent neural networks exist to process sequence data better. This section supplements the definition of sequence data and sequence models.

Sequence data, as mentioned in the introduction above, is essentially data that has context or changes over time, such as user reviews and stock prices. Taking user reviews as an example, movie ratings over time can show anchoring effects, hedonic adaptation, seasonality, and so on. Some other scenarios are summarized in Table 3:

Table 3 Sequence data samples of different scenarios

  1. Many users have strong, specific habits when using programs. For example, social media apps are more popular after students are out of school. Stock market trading software is more commonly used when the market is open.
  2. Predicting tomorrow's stock price is more difficult than past stock price, although both are just estimates of a number. After all, foresight is much harder than hindsight. In statistics, the former (predicting beyond known observations) is called extrapolation, while the latter (estimating between existing observations) is called interpolation.
  3. Music, speech, text, and video are all continuous in nature. If their sequence is rearranged by us, then the original meaning will be lost. For example, a text title "dog bites man" is far less surprising than "man bites dog", even though the words that make up the two sentences are identical.
  4. Earthquakes have a strong correlation, that is, after a large earthquake, there are likely to be several small aftershocks, and these aftershocks are much stronger than aftershocks after a non-major earthquake. In fact, earthquakes are spatiotemporally correlated, i.e. aftershocks usually occur in short time spans and within close distances.
  5. Interactions between humans are also continuous, as can be seen in the quarrels and debates on Weibo.

    In line with the background of RNN, the core question of a "sequence model" is how to use sequence data and the correlations in it to build a model (for example, by exploiting temporal dynamics). Building a sequence model comes down to statistical tools plus model selection. Taking stock forecasting as an example, a statistical sample is shown in Figure 4:

Fig. 4 Statistical sample of sequence data (FTSE 100 index price)

    The model input has to be expressed mathematically: let x_t denote the price at time step t. Note that for the sequences in this article t is usually discrete and ranges over the integers or a subset of them. Suppose a trader who wants to do well in the stock market on day t predicts x_t according to Formula 1:

x_{t} \sim P\left(x_{t} \mid x_{t-1}, \ldots, x_{1}\right)

Formula 1 stock forecast sample formula 

To make such predictions we usually need to introduce models or strategies. Common choices include latent (hidden) variable autoregressive models, Markov models, causality, and so on, as shown in Figure 5:

Figure 5 Common choices and core definitions of sequence models 

Tips: Given the observed sequence up to time step t, the predicted output at time step t+k is called the "k-step prediction". As we increase k, errors accumulate quickly and the quality of the forecast drops rapidly.
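The error accumulation described in the tip can be illustrated with a toy autoregressive model. This is only a minimal sketch: the "true" coefficient and the slightly mis-estimated model coefficient are made-up values chosen for illustration, not data from this experiment.

```python
def k_step_forecast(x_last, coef, k):
    """Roll a first-order autoregressive model forward k steps,
    feeding each prediction back in as the next input."""
    x = x_last
    for _ in range(k):
        x = coef * x
    return x

true_coef = 1.02   # hypothetical true dynamics: x_{t+1} = 1.02 * x_t
model_coef = 1.00  # hypothetical fitted model with a small estimation error
x_t = 1.0          # last observed value

errors = []
for k in (1, 4, 16, 64):
    truth = k_step_forecast(x_t, true_coef, k)
    guess = k_step_forecast(x_t, model_coef, k)
    errors.append(abs(truth - guess))
    print(f"k={k:2d}  |error|={errors[-1]:.4f}")
```

Because every step reuses the previous (already slightly wrong) prediction, the gap to the true value widens as k grows.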

2.2 Text preprocessing

For the sequence data processing problem, we evaluate the required statistical tools and challenges in forecasting in Section 2.1. Such data exists in many forms, text being one of the most common examples. For example, an article can be viewed simply as a sequence of words, or even a sequence of characters. In this section, we will analyze the common preprocessing steps of the text as shown in Table 4:

Table 4 Text preprocessing steps

1. Read the dataset :

    Load the text into memory as a string.

2. Tokenization :

Split a string into tokens (such as words or characters, e.g. ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']).

3. Build a vocabulary :

For the model's convenience, build a vocabulary (string->num, e.g. [('<unk>', 0), ('the', 1), ('i', 2), ('and', 3), ('of', 4), ('a', 5), ('to', 6), ('was', 7), ('in', 8)]) that maps each token to a numerical index.

4. Function integration :

Convert the text into a sequence of numerical indices and package all the functions: load_corpus_time_machine returns corpus (the list of token indices) and vocab (the vocabulary of the corpus), for example with sizes (170580, 28).
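The steps in Table 4 can be sketched in a few lines of plain Python. This is a simplified illustration, not the d2l implementation: the helper names `tokenize` and `build_vocab` and the two sample lines are assumptions made here.

```python
import collections
import re

def tokenize(lines, token="word"):
    """Split each text line into word or character tokens."""
    if token == "word":
        return [line.split() for line in lines]
    if token == "char":
        return [list(line) for line in lines]
    raise ValueError(f"unknown token type: {token}")

def build_vocab(tokens, min_freq=1):
    """Map tokens to indices; index 0 is reserved for '<unk>'."""
    counter = collections.Counter(tok for line in tokens for tok in line)
    # Sort by descending frequency; ties are broken alphabetically
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    idx_to_token = ["<unk>"] + [tok for tok, freq in ordered if freq >= min_freq]
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

# Step 1: read the text (two made-up lines stand in for the dataset)
raw = ["The Time Machine, by H. G. Wells", "The time traveller smiled"]
# Keep letters only and lowercase everything
lines = [re.sub("[^A-Za-z]+", " ", line).strip().lower() for line in raw]
# Step 2: tokenization
tokens = tokenize(lines)
# Step 3: vocabulary
idx_to_token, token_to_idx = build_vocab(tokens)
# Step 4: the corpus as one flat list of token indices
corpus = [token_to_idx.get(tok, 0) for line in tokens for tok in line]
print(tokens[0])
print(corpus)
```

Passing `token="char"` instead yields a character-level corpus, which is why the vocabulary of a character-level model is so small.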

2.3 Language Model

In 2.2 we learned how to map text data to tokens. These are still sequence data in essence, so we can predict them with the methods of 2.1, but this only yields a "reasonable" prediction; the model still does not truly "understand" the text. This is nevertheless useful (for example for resolving semantic ambiguity), so this section supplements the core concepts of language models and datasets, which also eases the transition to the principles of recurrent neural networks.

First, the core problem of a language model is the same as that of sequence data/models: "how do we model a document, or even a sequence of tokens?" The basic probabilistic model is likewise "autoregression + assumptions", as shown in Figure 5. To train a language model we need to compute the probability of a word and the conditional probability of a word given the preceding words. These probabilities are essentially the parameters of the language model. Some common methods are shown in Figure 6:

Figure 6 Common calculation methods of word probability/conditional probability

But such a model is easily invalid, mainly because we need to store all the counts, and the meaning of the words is not considered .

Supplement: The approximate formula for sequence modeling is derived from the Markov model and n-grams, such as formula 2:

\begin{array}{c} P\left(x_{1}, x_{2}, x_{3}, x_{4}\right)=P\left(x_{1}\right) P\left(x_{2}\right) P\left(x_{3}\right) P\left(x_{4}\right) \\ P\left(x_{1}, x_{2}, x_{3}, x_{4}\right)=P\left(x_{1}\right) P\left(x_{2} \mid x_{1}\right) P\left(x_{3} \mid x_{2}\right) P\left(x_{4} \mid x_{3}\right) \\ P\left(x_{1}, x_{2}, x_{3}, x_{4}\right)=P\left(x_{1}\right) P\left(x_{2} \mid x_{1}\right) P\left(x_{3} \mid x_{1}, x_{2}\right) P\left(x_{4} \mid x_{2}, x_{3}\right) \end{array}

Equation 2 Approximate formula of Markov model and n-gram -> sequence modeling

Tips: Usually, probabilistic formulations involving one, two, and three variables are called unigram, bigram, and trigram models, respectively. This can guide us on how to design better models.
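As a small illustration of the counting approach behind these models, the sketch below estimates bigram probabilities P(x_t | x_{t-1}) by relative frequency. The toy sentence and the helper name `bigram_prob` are assumptions made for this example.

```python
from collections import Counter

def bigram_prob(tokens, prev, word):
    """Estimate P(word | prev) by relative frequency: n(prev, word) / n(prev)."""
    unigrams = Counter(tokens[:-1])                  # tokens that have a successor
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))  # adjacent token pairs
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

tokens = "the cat sat on the mat the cat ran".split()
p_cat = bigram_prob(tokens, "the", "cat")
p_mat = bigram_prob(tokens, "the", "mat")
p_dog = bigram_prob(tokens, "the", "dog")
print(p_cat, p_mat, p_dog)
```

The pair ("the", "dog") never occurs, so its estimate is zero. This is the sparsity problem of count-based models: every count has to be stored, and unseen combinations receive no probability mass.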

Next, let us look at some further core knowledge through natural language statistics on real data. From the token frequency statistics of unigrams, bigrams, and trigrams (a sample is shown in Figure 7), we find that word frequency decays rapidly in a well-defined way. After setting aside the first few words as exceptions, all remaining words roughly follow a straight line on a log-log plot. This means that word frequency satisfies Zipf's law, that is, the frequency n_i of the i-th most frequent word is as in Equation 3:

\begin{array}{c} n_{i} \propto \frac{1}{i^{\alpha}} \\ \log n_{i}=-\alpha \log i+c \end{array}

Tips: where α is the exponent characterizing the distribution and c is a constant.

Figure 7 Example of lexical frequency statistics
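To make the log-log relationship of Zipf's law concrete, the following sketch fits the exponent α by least squares on log n_i versus log i. The frequencies are synthetic (they follow n_i = 1000 / i exactly) and the function name is an assumption made for this example; real corpus counts would only follow the line approximately.

```python
import math

def fit_zipf_exponent(freqs):
    """Least-squares fit of log n_i = -alpha * log i + c; returns (alpha, c).
    freqs must be sorted in descending order; rank i starts at 1."""
    xs = [math.log(i) for i in range(1, len(freqs) + 1)]
    ys = [math.log(n) for n in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope, my - slope * mx

# Synthetic counts that follow n_i = 1000 / i exactly (made-up data)
freqs = [1000 / i for i in range(1, 51)]
alpha, c = fit_zipf_exponent(freqs)
print(f"alpha={alpha:.3f}, c={c:.3f}")
```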

From Figure 7 we can also summarize several characteristics (the core background of recurrent neural networks):

  1. Beyond unigrams, sequences of words also appear to follow Zipf's law, although with a smaller exponent α in Equation 3 (the size of the exponent depends on the sequence length).
  2. The number of distinct n-grams is not that large, which suggests there is quite a lot of structure in language and gives us hope of modeling it.
  3. Many n-grams occur very rarely, which makes Laplace smoothing rather unsuitable for language modeling. Instead, we will use deep learning based models.

Finally, one question remains: how do we read long sequence data?

Since sequence data is sequential by nature, we must deal with this when processing the data. A sample problem (how to split the text) is shown in Figure 8. Common solutions are random sampling (each sample is a subsequence captured at an arbitrary position in the original long sequence) and sequential partitioning (the order of the split subsequences is preserved while iterating over minibatches).

Figure 8 Long sequence data reading/processing problems
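The two reading strategies can be sketched as follows. This is a simplified illustration that yields individual (input, label) subsequence pairs and leaves out minibatching; the function names are assumptions, and the d2l data iterators differ in detail.

```python
import random

def seq_pairs_random(corpus, num_steps, seed=0):
    """Random sampling: start from a random offset, cut the corpus into
    subsequences of length num_steps, and visit them in shuffled order."""
    rng = random.Random(seed)
    corpus = corpus[rng.randint(0, num_steps - 1):]
    num_subseqs = (len(corpus) - 1) // num_steps   # -1 leaves room for the label
    starts = [i * num_steps for i in range(num_subseqs)]
    rng.shuffle(starts)
    return [(corpus[s:s + num_steps], corpus[s + 1:s + 1 + num_steps])
            for s in starts]

def seq_pairs_sequential(corpus, num_steps):
    """Sequential partitioning: same cutting, but subsequences keep their order."""
    num_subseqs = (len(corpus) - 1) // num_steps
    starts = [i * num_steps for i in range(num_subseqs)]
    return [(corpus[s:s + num_steps], corpus[s + 1:s + 1 + num_steps])
            for s in starts]

corpus = list(range(20))  # stand-in for a list of token indices
seq = seq_pairs_sequential(corpus, num_steps=5)
rnd = seq_pairs_random(corpus, num_steps=5)
print(seq[0])
print(rnd[0])
```

In both cases the label sequence is the input shifted by one token, since the language model is trained to predict the next token at every position.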

How do we measure the quality of a language model? This is key to evaluating the recurrent-network-based models later on, and the answer is perplexity.

A good language model can predict with high accuracy the tokens we will see next. For example, suppose we want the language model to continue the phrase "It is raining ...". A few examples are shown in Table 5:

Table 5 Language model prediction examples

"It is raining outside"

"It is raining banana tree"

"It is raining piouw;kcj pwepoiut"

Obviously the first answer is the most reasonable. To measure this quantitatively, the core idea is to compute the likelihood of the sequence with the cross-entropy loss of softmax regression: we take the cross-entropy loss averaged over all n tokens of a sequence, and the perplexity is its exponential, as shown in Equation 4. Since this is the exponential of an average log value, perplexity can be understood as "the geometric mean of the number of real choices for the next token".

\exp \left(-\frac{1}{\mathrm{n}} \sum_{\mathrm{t}=1}^{\mathrm{n}} \log P\left(x_{t} \mid x_{t-1}, \ldots, x_{1}\right)\right)
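The formula above translates directly into code. The sketch below assumes we are given the probabilities a model assigned to the tokens that actually occurred; the probability lists are made-up values for illustration.

```python
import math

def perplexity(probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

perfect = perplexity([1.0, 1.0, 1.0])       # model is always certain and right
uniform = perplexity([1 / 28] * 10)         # uniform guess over 28 tokens
mixed = perplexity([0.5, 0.25, 0.5, 0.25])  # somewhere in between
print(perfect, uniform, mixed)
```

A perfect model reaches the lower bound of 1, while a model that guesses uniformly over a vocabulary of 28 tokens has a perplexity of 28, which gives a useful baseline for judging the perplexities reported later.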

This completes the introduction of the core prerequisite knowledge for recurrent neural networks.

2.4 Datasets

This part introduces the main dataset used in the later experiments. Text data is used as the example for exploring different recurrent neural networks. The main dataset is H. G. Wells's The Time Machine, which is essentially a book; for details see: The Time Machine (Douban) (douban.com). The import method is shown in Code1:

Code1 Dataset Import (Time Machine - Text)

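from d2l import torch as d2l

# Minibatch size and number of time steps per subsequence (defaults from Table 8)
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)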

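3. The principle, implementation and optimization of classic recurrent neural networks

In the previous chapter we reviewed the background, concepts, and development history of recurrent neural networks, and covered the core prerequisite knowledge: sequence models, text preprocessing, language models, and datasets. This chapter moves on to the principles, code implementation, and parameter and network optimization of classic recurrent neural networks, using RNN (the classic), LSTM, and GRU as examples.

1.RNN

1.1 Principle

Reference paper: Finding Structure in Time - 1990 (wiley.com)

Reference material: The Unreasonable Effectiveness of Recurrent Neural Networks

After all the groundwork in the previous chapter, let us formally turn to the most classic recurrent neural network. The most important and central innovation of the RNN is recurrence: it is precisely because of recurrence that the network can exploit the correlation/context in the data and thus perform well on sequence data. Consider the unrolled/iterative example (standard structure) in Figure 9:

Figure 9 RNN standard structure - the nature of recurrence / iterative derivation

Analysis: From CNNs we know that a neural network is activated layer by layer in order, whereas an RNN uses recurrence to store what it "learns" during training in the weights W.

    Supplement: The left side shows the folded form and the right side the unrolled form. The arrow next to h on the left indicates that the "recurrence" in this structure lies in the hidden layer. In the figure, O denotes the output, y the target value given by the sample, and L the loss function.

More generally, the core structure of the RNN is shown in Figure 10:

Figure 10 RNN core structure

Tips: The neural network A (which may contain several layers) takes the vector x_t as input and produces the vector h_t as output. It allows the network to pass the output of one step to the next step as input; when stacked/unrolled it is consistent with Figure 9.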

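If we regard the neural network as a function f with weights w, the RNN is essentially a recurrent/recursive function, as in Equation 5:

h_{t}=f\left(h_{t-1}, x_{t}, w\right)

Equation 5 The essential function of the RNN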

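Based on the introduction and core definition of the RNN, its characteristics are summarized as follows:

  1. Weight sharing: all the W in the figure are the same, and likewise for U and V.
  2. Earlier outputs affect later outputs, which makes the network suitable for sequence data.
  3. The loss also accumulates as the sequence progresses.

Depending on how the RNN is stacked and on the input/output form of the data, several variants (not optimizations) arise, as shown in Figure 11:

Figure 11 Summary of common RNN variants

Tips: In the figure above, each square represents a vector and each arrow a function. Input vectors are red, output vectors are blue, and green vectors hold the RNN state. A summary is given in Table 6:

Table 6 RNN states in the different variants (left to right in Figure 11)

1. one to one

An ordinary non-RNN process from a fixed-size input to a fixed-size output (e.g. image classification): the input x passes through the transformation Wx+b and the activation function f to give the output y.

2. one to many

The output is a sequence (e.g. image captioning: the input is an image and the output is a sequence of words). There is also a structure in which the input X is fed in at every step.

3. many to one

The input is a sequence (e.g. sentiment analysis: the input is a sentence and the output is a classification of the sentence as positive or negative).

4. many to many (n to n)

Both input and output are sequences (e.g. machine translation: the RNN reads an English sentence and outputs a French sentence), or synchronized input and output sequences (e.g. video classification, where every frame of the video is labeled).

Tips: Besides Figure 11 there are other important RNN variants, such as Encoder-Decoder (n to m, Seq2Seq), which are not all shown here.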

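The above covers the principle mainly from the conceptual and structural side; next we turn to the mathematical derivation. For a neural network the most important parts are forward propagation and backpropagation (how the parameters are updated), and the RNN is no exception.

Again we first unroll a typical RNN, as shown in Figure 12:

Figure 12 A typical RNN unrolled

In Figure 12, one unidirectional flow of information runs from the input units to the hidden units, and another runs from the hidden units to the output units. In some cases RNNs break the latter restriction and feed information from the output units back to the hidden units; these are called "Back Projections". In addition, the input to the hidden layer includes the state of the previous hidden layer, i.e. nodes within the hidden layer can be self-connected or interconnected. (This idea leads toward LSTM, which is explained in detail later.)

The right side is the structure unrolled to make the computation easier to follow. In short, x is the input layer, o the output layer, s the hidden layer, and t indexes the computation step; V, W, U are weights. The hidden state at step t is computed as in Equation 6:

s_{t}=f\left(U * x_{t}+W * s_{t-1}\right)

Equation 6 Hidden state computation

    This is how the current result is tied to the previous computations; a more intuitive view is given in Figure 13:

Figure 13 The core of RNN "memory"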
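The recurrence of Equation 6 can be traced by hand with a scalar toy version. This is a minimal sketch: the weights are hypothetical numbers chosen for illustration, tanh is assumed as the activation f, and the output is taken as o_t = V * s_t without a softmax.

```python
import math

def rnn_forward(xs, U, W, V, s0=0.0):
    """Scalar RNN: s_t = tanh(U*x_t + W*s_{t-1}), o_t = V*s_t.
    The same U, W, V are reused at every time step (weight sharing)."""
    s, states, outputs = s0, [], []
    for x in xs:
        s = math.tanh(U * x + W * s)
        states.append(s)
        outputs.append(V * s)
    return states, outputs

# Hypothetical weights, chosen only for illustration
U, W, V = 0.5, 0.8, 2.0
states_a, _ = rnn_forward([1.0, 0.0, 0.0], U, W, V)  # one early nonzero input
states_b, _ = rnn_forward([0.0, 0.0, 0.0], U, W, V)  # no input at all
print(states_a)
print(states_b)
```

In the first run the inputs at steps 2 and 3 are zero, yet the hidden state stays nonzero: the input from step 1 is still "remembered" through the W * s_{t-1} term, although it fades with every step.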

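From the description above and Figure 13 we can derive the RNN forward propagation as in Equation 7; the loss is usually the cross-entropy reconstruction error, as in Equations 8 and 9: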
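For reference, one commonly used way to write the forward pass and the cross-entropy loss in the notation of Equation 6 is sketched below. The softmax output layer and the one-hot target y_t are assumptions made here for illustration, so the numbered equations of the original derivation may differ in detail.

```latex
\begin{aligned}
s_{t} &= f\left(U x_{t} + W s_{t-1}\right), \qquad o_{t} = \operatorname{softmax}\left(V s_{t}\right) \\
L_{t} &= -\, y_{t}^{\top} \log o_{t} \\
L &= \sum_{t} L_{t} = -\sum_{t} y_{t}^{\top} \log o_{t}
\end{aligned}
```

Here L_t is the loss at time step t and L is the total loss accumulated over the sequence, which matches the earlier observation that the loss builds up as the sequence progresses.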

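Similarly, deriving backpropagation from the unrolled RNN shows that vanishing gradients often occur, again caused by the activation function. We do not expand on this here; for details see: 循环神经网络RNN论文解读_循环神经网络论文_纸上得来终觉浅~的博客-CSDN博客.

Finally, we close the RNN principle part by looking at networks with and without hidden states and at the RNN-based character-level language model. The core knowledge is summarized in Figure 14:

Figure 14 Summary of the RNN architecture and a character-level language model example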

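1.2 Code implementation (from scratch)

    Tips: RNN and its variants are classic and significant work, so there are many ways to implement them. Broadly they fall into building from scratch and calling an API. In this experiment the RNN is implemented both ways as a dual example, while the other architectures are mostly implemented via the API only. The reference code comes from Mu Li; see: 8.5. 循环神经网络的从零开始实现 — 动手学深度学习 2.0.0 documentation (d2l.ai)

Based on the analysis of the RNN principle/architecture in 1.1 and on the definition of the RNN-based character-level language model, this part implements an RNN from scratch. The code file is RNN(0to1).py; only the core code is analyzed here. The steps of the from-scratch RNN are summarized in Table 7, and the input/output encoding is shown in Figure 15: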

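Table 7 From-scratch RNN workflow

Input: dataset (in this experiment almost always H. G. Wells's The Time Machine - text)

1. One-hot encoding

The basic NLP operation of one-hot encoding: preprocess the text (string->num) and map each index to a distinct unit vector so that the model can read it.

2. Initialize model parameters

Define the hidden-layer parameters (important) and the output-layer parameters, and attach gradients to them.

3. Model/network definition

Build the model according to the requirements and the RNN definition, including returning the hidden state (at initialization), computing the hidden state and output, and the activation and iteration of the model.

4. Prediction

Define a prediction function that generates new characters following the prefix (a user-supplied string of several characters).

5. Gradient clipping

As discussed in 1.1, ordinary RNN backpropagation produces a chain of matrix products of length O(T). When T is large this can cause gradients to explode or vanish, so gradient clipping is needed to keep training stable.

6. Training

"Feed" the processed data to the model and train iteratively (sequential partitioning / random sampling), using perplexity or the number of epochs as the stopping criterion.

Output: trained model / text prediction results

Figure 15 RNN input/output encoding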
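One-hot encoding itself is simple enough to write by hand. The sketch below is a plain-Python illustration with a made-up vocabulary size of 4; in PyTorch, torch.nn.functional.one_hot serves the same purpose.

```python
def one_hot(indices, vocab_size):
    """Turn each token index into a unit vector of length vocab_size."""
    vectors = []
    for idx in indices:
        vec = [0.0] * vocab_size
        vec[idx] = 1.0   # the only nonzero entry marks the token
        vectors.append(vec)
    return vectors

encoded = one_hot([0, 2], vocab_size=4)
print(encoded)
```

Each token becomes a vector as long as the vocabulary, so a minibatch of shape (batch size, number of time steps) turns into a three-dimensional tensor once encoded.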

This part mainly analyzes the RNN model code and the gradient clipping code. The network model and its analysis are shown in Code2:

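import torch
from torch.nn import functional as F
from d2l import torch as d2l

# Initialize the model parameters (following d2l section 8.5; vocab comes from Code1)
def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    # Hidden-layer parameters
    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = torch.zeros(num_hiddens, device=device)
    # Output-layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # Attach gradients
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params

# Return the hidden state at initialization (a tensor of shape (batch size, number of hidden units))
def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )

# Define how the hidden state and output are computed at each time step (tanh as the activation function)
def rnn(inputs, state, params):
    # Shape of inputs: (number of time steps, batch size, vocabulary size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    # Shape of X: (batch size, vocabulary size)
    for X in inputs:
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
        Y = torch.mm(H, W_hq) + b_q
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H,)

# Wrap the functions in a class (which also stores the parameters of the from-scratch RNN model)
class RNNModelScratch: #@save
    """A recurrent neural network model implemented from scratch"""
    def __init__(self, vocab_size, num_hiddens, device,
                 get_params, init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, device):
        return self.init_state(batch_size, self.num_hiddens, device)

# Sample model: check that the outputs have the expected shapes
X = torch.arange(10).reshape((2, 5))  # toy minibatch: batch size 2, 5 time steps
num_hiddens = 512
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params, init_rnn_state, rnn)
state = net.begin_state(X.shape[0], d2l.try_gpu())
Y, new_state = net(X.to(d2l.try_gpu()), state)
print(Y.shape, len(new_state), new_state[0].shape)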

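Tips: The output shape is (number of time steps x batch size, vocabulary size), while the hidden state shape stays the same, i.e. (batch size, number of hidden units).

As for gradient clipping, a common mathematical scheme is given in Equation 10 (the gradient g is clipped by projecting it back onto a ball of a given radius, e.g. θ). A code sample is shown in Code3:

g \leftarrow \min \left(1, \frac{\theta}{\|g\|}\right) g

Equation 10 A common gradient clipping scheme

Analysis: This way the gradient norm never exceeds θ, and the updated gradient stays exactly aligned with the original direction of g, which gives some stability.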

import torch
from torch import nn

def grad_clipping(net, theta):  #@save
    """Clip the gradient"""
    if isinstance(net, nn.Module):
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        params = net.params
    # Global L2 norm over the gradients of all parameters
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
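A worked example makes the clipping rule of Equation 10 concrete. The sketch below applies the same rule to a plain Python list instead of PyTorch tensors; the gradient values are made-up numbers.

```python
import math

def clip_by_norm(grad, theta):
    """Rescale grad so that its L2 norm is at most theta (direction unchanged)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, theta / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], theta=1.0)    # norm 5 -> rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4], theta=1.0)  # norm 0.5 -> left as is
print(clipped, untouched)
```

The gradient (3, 4) has norm 5, so it is scaled by 1/5 to roughly (0.6, 0.8); the ratio between its components, and therefore its direction, is unchanged. A gradient whose norm is already below θ passes through untouched.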

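The rest of the code is in the attachment. With the code written, we move on to the experiment / testing the RNN. The default sample parameters used in this experiment are summarized in Table 8 and serve as the baseline for the later comparison and ablation experiments.

Table 8 Default RNN model parameters (this experiment)

Parameter | Value
batch_size | 32
num_steps (time steps per minibatch) | 35
num_hiddens | 512
vocab (number of tokens) | 10000
num_epoch | 500
lr | 1
optimizer | SGD
Activation function | Tanh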

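    With everything ready, let us start training and testing the from-scratch model. Dataset loading/initialization and the training process are shown in Figure 16, and the result of a single test run is shown in Figure 17:

Figure 18 Sample result of the from-scratch RNN

Analysis: According to Figure 18, the processed data enters the RNN model normally. After 500 epochs the final perplexity is 1.2, at a speed of 24505.1 tokens/sec on a personal CPU, which supports the soundness and correctness of the from-scratch model design and code.

From Section 2.3 of the previous chapter we know that there are two strategies for reading the data: sequential partitioning and random sampling. The example above uses sequential partitioning; let us also test an RNN with random sampling. In the code we only need to pass "use_random_iter=True" to the training function, as in Code4. A sample test result is shown in Figure 19: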

train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu(),
          use_random_iter=True)  # random sampling

 

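Figure 19 Sample result of the from-scratch RNN (random sampling)

Analysis: According to Figure 19, after 500 epochs the final perplexity of the random-sampling RNN is 1.4, at 22995.0 tokens/sec on a personal CPU; both speed and performance are slightly below sequential partitioning (for this parameter combination).

To compare the speed and performance of the two from-scratch schemes, each RNN is run 20 times, with average tokens/sec and average perplexity as the metrics. The data are summarized in Table 9 and the comparison is shown in Figure 20:

Table 9 Sequential-partitioning RNN vs random-sampling RNN (data summary)

Scheme | Perplexity | Tokens/sec
Sequential partitioning | 1.26 | 24701.5
Random sampling | 1.43 | 23078.2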

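Figure 20 Sequential-partitioning RNN vs random-sampling RNN (comparison)

Analysis: According to Table 9 and Figure 20, the average perplexity of the sequential-partitioning RNN is lower than that of the random-sampling RNN, and its tokens/sec is higher. For this dataset and this parameter combination, the sequential-partitioning RNN is therefore better than the random-sampling RNN in both speed and performance.

As for exploding or vanishing gradients, this experiment can also explore the issue by removing the gradient clipping step. A simple way is to "comment out the gradient_clip function in train_epoch_ch8 and print the loss" (the function is grad_clipping in Code3). In testing, problems occurred far more often with sequential partitioning than with random sampling in this example (about 99% vs 1%). This example is only a tiny one; in many other cases, training without gradient clipping makes the loss become nan. The reasons are not elaborated here.

1.3 Code implementation (API)

The from-scratch approach in 1.2 can implement RNNs with different schemes/strategies, but it is not the best choice in terms of implementation difficulty, efficiency, or performance. Because RNN-type models are classic, mainstream frameworks such as Tensorflow and Pytorch define and optimize them (as APIs) so that we can build and apply models quickly. This part explores that route.

The API-based implementation is very concise. The full workflow is: read the dataset -> define/import the model (via the API) -> train and predict. The core of the code is importing the model, as shown in Code5. The RNNModel class used to control and manage the functions is similar to RNNModelScratch in the from-scratch RNN and is not repeated here; the complete code is in RNN(API).py.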

rnn_layer = nn.RNN(len(vocab), num_hiddens)

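Tips: This contains only the hidden recurrent layer; the output layer has to be created separately.

With the code complete, we test the model using the same dataset and parameters as in Table 8. A sample result is shown in Figure 21:

Figure 21 Sample result of RNN (API)

Analysis: As shown in Figure 21, after 500 epochs the final perplexity is 1.0, at 28759.6 tokens/sec on a personal CPU, which supports the soundness and correctness of the API model design and code. Compared with Table 9, the API implementation is better than the from-scratch RNN in implementation difficulty, speed, and performance.

To compare the speed and performance of RNN (API) and the from-scratch RNN, each RNN is run 20 times, with average tokens/sec and average perplexity as the metrics. The data are summarized in Table 10 and the comparison is shown in Figure 22: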

Table 10 RNN (API) vs from-scratch RNN (data summary)

Model | Perplexity | Tokens/sec
From-scratch RNN (sequential) | 1.26 | 24701.5
From-scratch RNN (random) | 1.43 | 23078.2
RNN (API) | 1.03 | 28992.3

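Figure 22 RNN (API) vs from-scratch RNN (comparison)

Analysis: According to Table 10 and Figure 22, the perplexity of RNN (API) is lower than that of both from-scratch RNNs, and its tokens/sec is the highest. For this dataset and this parameter combination, RNN (API) is therefore better than the from-scratch RNN on every metric measured.

This concludes the code analysis and testing of RNNs with different schemes and implementations. Overall, the RNN provided by the API is the better choice and saves time and effort. Beyond the (self-chosen) default parameters in Table 8 and the basic RNN implementation, however, there is still considerable room for exploration and optimization, which the next two parts focus on.

1.4 Ablation experiment

The previous two parts, whether the from-scratch RNN (sequential partitioning and random sampling) or RNN (API), were all based on the parameters in Table 8. From Experiment 2, however, we know that parameter choice strongly affects how the same model performs on the same dataset. Common hyperparameters are summarized in Table 11:

Table 11 Summary of important/common hyperparameters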

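1. Loss function:

The loss measures the inconsistency between the model's predictions and the true values, and is defined by a non-negative real-valued loss function.

2. Optimizer:

To minimize the loss, once the loss is defined a corresponding optimizer can be chosen according to the optimization method.

3. epoch:

The number of learning rounds, i.e. how many times the whole training set is traversed during training.

4. Learning rate:

The learning rate describes how large a step the weight parameters take in the direction of gradient descent after each training iteration.

5. Normalization:

    When training neural networks, the raw data usually needs to be normalized to improve network performance.

6. Batchsize:

The number of training samples used each time the loss is computed.

7. Network hyperparameters:

    These include the input image size and the hyperparameters of each layer (number, size, and stride of convolution kernels; pooling size, stride, and method; activation function; and so on).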

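For recurrent neural networks the main focus is num_hiddens. Next we run some ablation experiments to explore how parameter choices affect the RNN.

Tips: All RNNs built for the ablation experiments are based on the API.

1.4.1 num_hiddens

num_hiddens is the number of hidden units and an important hyperparameter for RNN performance. Keeping the other parameters in Table 8 fixed, we vary only num_hiddens, run 20 experiments per value and take the average, and examine the trends in perplexity and tokens/sec. The data are summarized in Table 12 and the comparison is shown in Figure 23: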

Table 12 Effect of num_hiddens on the RNN (data summary)

num_hiddens | 32 | 64 | 128 | 256 | 512 | 1024
Perplexity | 5.21 | 3.69 | 1.98 | 1.28 | 1.03 | 1.02
Tokens/sec | 289066.5 | 182857.1 | 160003.4 | 84529.1 | 28992.3 | 9542.2

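Figure 23 Effect of num_hiddens on the RNN (comparison)

Analysis: As Table 12 and Figure 23 show, perplexity keeps decreasing as num_hiddens grows until it levels off (performance improves), while tokens/sec gradually decreases (efficiency drops). The balance or trade-off between speed and performance can therefore be set through the choice of num_hiddens, and the value of 512 in Table 8 is a good choice.

1.4.2 num_steps

num_steps is the number of time steps in each minibatch subsequence. Keeping the other parameters in Table 8 fixed, we vary only num_steps, run 20 experiments per value and take the average, and examine the trends in perplexity and tokens/sec. The data are summarized in Table 13 and the comparison is shown in Figure 24: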

Table 13 Effect of num_steps on the RNN (data summary)

num_steps | 15 | 20 | 25 | 30 | 35 | 40
Perplexity | 1.5 | 1.42 | 1.3 | 1.21 | 1.03 | 1.02
Tokens/sec | 28070.8 | 21622.1 | 24836.2 | 23272.4 | 28992.3 | 26157.7

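Figure 24 Effect of num_steps on the RNN (comparison)

Analysis: As Table 13 and Figure 24 show, perplexity keeps decreasing as num_steps grows until it levels off (performance improves), while tokens/sec fluctuates without a clear pattern. Choosing a relatively high num_steps therefore gives better performance, and the value of 35 in Table 8 is a good choice.

1.4.3 batch_size

batch_size, similar to num_steps, is the number of sequences the RNN processes at the same time. Keeping the other parameters in Table 8 fixed, we vary only batch_size, run 20 experiments per value and take the average, and examine the trends in perplexity and tokens/sec. The data are summarized in Table 14 and the comparison is shown in Figure 25: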

Table 14 Effect of batch_size on the RNN (data summary)

batch_size | 16 | 32 | 64 | 128 | 256
Perplexity | 13.7 | 1.03 | 1.1 | 1.38 | 6.11
Tokens/sec | 14027.9 | 28992.3 | 31438.9 | 53808.4 | 47510.8

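Figure 25 Effect of batch_size on the RNN (comparison)

Analysis: As Table 14 and Figure 25 show, perplexity first decreases and then increases as batch_size grows (performance improves and then degrades), while tokens/sec rises and then falls. The choice of batch_size therefore strongly affects both speed and performance; in practice it is usually selected after several rounds of testing. The batch_size of 32 in Table 8 gives the best performance here.

1.4.4 lr

lr is the learning rate, which determines the convergence/iteration speed of the model. Keeping the other parameters in Table 8 fixed, we vary only lr, run 20 experiments per value and take the average, and examine the trends in perplexity and tokens/sec. The data are summarized in Table 15 and the comparison is shown in Figure 26: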

Table 15 Effect of lr on the RNN (data summary)

lr | 0.01 | 0.1 | 0.25 | 0.5 | 1
Perplexity | 13.3 | 3.52 | 1.19 | 1.11 | 1.03
Tokens/sec | 27826.5 | 27234.4 | 28626.6 | 27655.1 | 28992.3

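Figure 26 Effect of lr on the RNN (comparison)

Analysis: As Table 15 and Figure 26 show, perplexity keeps decreasing as lr grows until it levels off (performance improves, although with a small lr the model has most likely just not converged yet), while tokens/sec fluctuates without a clear pattern. Choosing lr is something of a "black art", so this ablation experiment only serves as a reference for the approach; the lr of 1 in Table 8 is a fairly good choice.

1.4.5 epoch

epoch is the number of times the whole training set is traversed during training. In the four ablation experiments above most models had converged (no obvious change in perplexity/loss); the change of perplexity with epoch is shown in Figure 27. To investigate the cause of the large perplexity values in the lr experiment, we take lr=0.01 and change epoch to 1500; the effect is shown in Figure 28:

Figure 27 Effect of epoch on perplexity

Analysis: As shown in Figure 27, perplexity keeps decreasing as epoch increases until it stabilizes (converges), consistent with other deep models.

Figure 28 Test sample with lr=0.01, epoch=1500

Analysis: As Figure 28 shows, in this test the perplexity still stays at 9.1 with epoch=1500, which shows how lr affects convergence speed and indirectly confirms the influence of the choice of lr on model performance.

1.4.6 Overview

In the previous five parts we ran ablation experiments on five different hyperparameters. Beyond these, we could also change vocab, the activation function (e.g. to ReLU), and so on to explore their effect on the RNN; the method is similar, so we do not expand on it. Looking back at the parameter combination in Table 8 in light of the analysis, it is overall a set of parameters that balances speed and performance.

The reason for tuning a parameter is tied to how it affects the model. For example, the learning rate and batch_size determine the speed and step size of the iterative solution and need repeated testing to find good values, while num_hiddens is closely tied to the model's principle and has to be chosen with the architecture in mind. In general, though, there is no parameter combination that is always optimal; it has to be tested and adjusted according to the dataset, task, and model.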

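2.LSTM

In the first section of this chapter we analyzed the classic RNN in depth from three angles: principle, code implementation (from scratch and API), and parameter tuning and optimization. But just as with LeNet among CNNs, knowing only the most classic architecture is of limited value: it was highly innovative but has major limitations in today's real-world application scenarios. We therefore move on to the analysis and code implementation of modern recurrent neural networks, starting with LSTM (long short-term memory).

2.1 Principle

Reference paper: LST-1997(baulab.info)

Reference paper: LSTM.pdf (arxiv.org)

Reference blog: Understanding LSTM Networks -- colah's blog

In the previous section we kept talking about "exploding and vanishing gradients". Latent-variable models have also long struggled with preserving long-term information and skipping short-term inputs; one of the earliest solutions is the long short-term memory (LSTM).

First, let us look back at the RNN from the perspective of LSTM. It is a short-term memory model; an example is shown in Figure 29, namely:

- Layers in the RNN that receive small gradient updates stop learning

- For example, the earlier layers

- The longer the sequence, the more memory is lost

Figure 29 Limitation of RNN (short-term memory)

As its name suggests, LSTM therefore introduces long-term memory. Architecturally it adds a memory cell and several gates for controlling the state (input, forget, output); the design is inspired by the logic gates of a computer. The architectures of RNN and LSTM are compared in Figure 30:

Figure 30 RNN vs LSTM (architecture comparison)

Analysis: From Figure 30 we can see:

  1. A traditional RNN neuron by default receives the hidden state h_{t-1} of the previous time step and the current input x_t.
  2. An LSTM neuron additionally receives a cell state c_{t-1}. The cell state c is similar to the hidden state h in the RNN: both store historical information, passed along from c_{t-2} to c_{t-1} to c_t. In LSTM, h mostly stores the output information of the previous time step.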

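Let us focus on the LSTM model. Its core idea is the memory cell, also called the cell state, which works like a conveyor belt: it runs straight along the entire chain with only a few minor linear interactions, so information can easily flow along it unchanged. As mentioned above, carefully designed structures called "gates" give the LSTM the ability to remove information from, or add information to, the cell state. A gate is a way of letting information through selectively. It consists of a sigmoid neural network layer and a pointwise multiplication operation, as shown in Figure 31:

Figure 31 Core idea and structure of LSTM

The overall flow of LSTM and a brief introduction to the gate design are summarized in Chart 1:

Chart 1 Overall flow of LSTM and brief introduction to the gate design

Input: feed the dataset into the model

1. Sigmoid layer

    Outputs a value between 0 and 1 describing how much of each component may pass. 0 means "let nothing through" and 1 means "let everything through". Three gates protect and control the cell state.

2. Forget gate (LSTM-1)

Decides what information to discard from the "cell". This layer reads the current input x and the previous hidden state h, and f_t determines what is discarded. An output of 1 means "keep completely" and 0 means "discard completely".

3. Input gate (LSTM-2)

Determines what new information is stored in the cell state. This step has two layers: the sigmoid layer acts as the "input gate layer" and decides which values i_t to update, and the tanh layer creates a new candidate value vector c~_t to be added to the state. In the language model example, we want to add the new subject to the cell state to replace the old subject that should be forgotten.

4. Cell state update (LSTM-3)

Updates the old cell state from c_{t-1} to c_t. We multiply the old state by f_t, discarding the information we decided to discard, and then add i_t * c~_t. This is the new candidate value, scaled by how much we decided to update each state. In the language model example, this is where we actually discard the information about the old pronoun and add the new information, as decided in the previous steps.

5. Output gate / determining the output

The last step determines the output. The output is based on the cell state, but it is a filtered version. First we run a sigmoid layer to decide which parts of the cell state will be output. Then we pass the cell state through tanh (giving a value between -1 and 1) and multiply it by the output of the sigmoid gate, so that only the parts we decided on are output. In the language model example, since the context contains a pronoun, information related to it may need to be output. For example, if what follows is a verb, the verb has to be inflected according to whether the pronoun is singular or plural.

Output: processed data / predicted data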
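The gate steps listed above can be traced with a scalar toy LSTM step. This is a minimal sketch rather than the implementation used in the experiment: all quantities are plain floats, and the parameter dictionary with its large biases is a hypothetical choice that pushes the gates close to 0 or 1 so the effect is easy to see.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step on scalars; p maps gate names to (w_x, w_h, b)."""
    def gate(name, act):
        w_x, w_h, b = p[name]
        return act(w_x * x + w_h * h_prev + b)
    f = gate("forget", sigmoid)        # how much of c_prev to keep
    i = gate("input", sigmoid)         # how much of the candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c_tilde = gate("cand", math.tanh)  # candidate cell value
    c = f * c_prev + i * c_tilde       # cell state update
    h = o * math.tanh(c)               # filtered output
    return h, c

# Hypothetical parameters: forget gate ~1, input gate ~0, output gate ~1
keep = {"forget": (0, 0, 20), "input": (0, 0, -20),
        "output": (0, 0, 20), "cand": (1, 0, 0)}
h, c = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, p=keep)
print(h, c)
```

With the forget gate open and the input gate closed, the old cell state 0.7 passes through almost unchanged even though the current input is large. This is the "conveyor belt" behaviour that lets the LSTM carry information across many time steps.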

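At this point the principle and structure of LSTM should be fairly clear. LSTM has many variants; several common ones are summarized in Figure 32 (GRU is covered in detail later, the others are not expanded):

Figure 32 Common LSTM variant architectures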

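Chart 1 makes it easy to see how the long-term memory introduced by LSTM addresses the short-term limitation of the RNN. As for why it mitigates vanishing or exploding gradients, some additional explanation is given here. From the mathematical derivation for the RNN we know that vanishing gradients arise mainly because the gradient contains a product of many per-step factors. LSTM deals with this product by using the gates to push each factor toward 0 or 1, as in Eq. 11. In detail:

  1. When the factor controlled by the gate is close to 1, the product lets the gradient propagate well through the LSTM, avoiding vanishing gradients.

  2. When the factor controlled by the gate is close to 0, the information from the previous time step has no effect on the current one, so there is no need to propagate the gradient back.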

\begin{array}{l} \text { remove }: \prod_{j=k+1}^{t} \frac{\partial h_{j}}{\partial h_{j-1}} \\ \text { todo: } \frac{\partial h_{j}}{\partial h_{j-1}} \approx 0 \text { or } \frac{\partial h_{j}}{\partial h_{j-1}} \approx 1 \end{array}

Eq. 11 LSTM gradient mitigation strategy
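To make the role of the product in Eq. 11 concrete, here is a small numeric sketch. It assumes, purely for illustration, that every per-step factor is the same scalar; the factor values 0.5 and 1.5 and the sequence length 50 are arbitrary choices, not measurements from the experiments.

```python
# Toy illustration of the product of per-step factors in Eq. 11.
# Assumption: all factors dh_j/dh_{j-1} equal one scalar value.

def product_of_factors(factor, num_steps):
    """Return the product of num_steps identical per-step factors."""
    result = 1.0
    for _ in range(num_steps):
        result *= factor
    return result

num_steps = 50
vanishing = product_of_factors(0.5, num_steps)     # factors below 1 shrink the gradient
exploding = product_of_factors(1.5, num_steps)     # factors above 1 blow it up
gate_open = product_of_factors(1.0, num_steps)     # factor ~ 1: gradient passes unchanged
gate_closed = product_of_factors(0.0, num_steps)   # factor ~ 0: the path is cut on purpose

print(f"0.5^{num_steps} = {vanishing:.3e}")
print(f"1.5^{num_steps} = {exploding:.3e}")
print(f"1.0^{num_steps} = {gate_open}")
print(f"0.0^{num_steps} = {gate_closed}")
```

The two gated cases are the ones Eq. 11 aims for: either the gradient is carried through intact, or the dependency is dropped deliberately rather than fading by accident.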

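For the mathematical derivation of LSTM, see the papers for details. Here we give the full BPTT derivation in terms of error signals; the network structure is shown in Figure 33, and the derivations are collected in Equation Set 1 and Equation Set 2:

Figure 33 LSTM network structure overview

 

This completes the analysis of the LSTM architecture and its mathematics.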

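2.2 Code implementation

In this part we turn to the LSTM code. Based on the analysis of the LSTM principle and architecture in 2.1, the code is written in LSTM.py; only the core code is discussed here. The core flow of the LSTM implementation is dataset loading -> parameter initialization -> model definition -> training and prediction. As before there are two approaches, from scratch and via the API; the from-scratch model definition is shown in Code6 (the experiments, however, use the API version):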

import torch

# Initialize the model state: hidden state H and memory cell C, both zeros
def init_lstm_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),
            torch.zeros((batch_size, num_hiddens), device=device))

# Model definition
def lstm(inputs, state, params):
    [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c,
     W_hq, b_q] = params
    (H, C) = state
    outputs = []
    for X in inputs:  # iterate over time steps
        I = torch.sigmoid((X @ W_xi) + (H @ W_hi) + b_i)      # input gate
        F = torch.sigmoid((X @ W_xf) + (H @ W_hf) + b_f)      # forget gate
        O = torch.sigmoid((X @ W_xo) + (H @ W_ho) + b_o)      # output gate
        C_tilda = torch.tanh((X @ W_xc) + (H @ W_hc) + b_c)   # candidate memory cell
        C = F * C + I * C_tilda                               # memory cell update
        H = O * torch.tanh(C)                                 # new hidden state
        Y = (H @ W_hq) + b_q                                  # output layer
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H, C)
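Since the experiments use the API version rather than Code6, a minimal sketch of that route may help. It uses PyTorch's `nn.LSTM` on one-hot encoded tokens and only checks tensor shapes; the sizes (vocabulary of 28, 32 hidden units, batch of 4, 5 time steps) and the separate `nn.Linear` output layer are illustrative assumptions, not the settings or structure of LSTM.py.

```python
import torch
from torch import nn
from torch.nn import functional as F

# Illustrative sizes (assumptions, not the values used in the experiments)
vocab_size, num_hiddens = 28, 32
batch_size, num_steps = 4, 5

lstm_layer = nn.LSTM(vocab_size, num_hiddens)      # input width = one-hot width
output_layer = nn.Linear(num_hiddens, vocab_size)  # maps hidden states to token scores

# Random token ids stand in for a real minibatch
tokens = torch.randint(0, vocab_size, (batch_size, num_steps))
X = F.one_hot(tokens.T, vocab_size).float()        # (num_steps, batch_size, vocab_size)

# Zero initial state, analogous to init_lstm_state but with a leading layer dimension
H0 = torch.zeros((1, batch_size, num_hiddens))
C0 = torch.zeros((1, batch_size, num_hiddens))

Y, (H, C) = lstm_layer(X, (H0, C0))                # Y holds the hidden state at every step
logits = output_layer(Y.reshape(-1, num_hiddens))  # (num_steps * batch_size, vocab_size)

print(Y.shape, H.shape, C.shape, logits.shape)
```

The API layer returns only hidden states, so the output projection (W_hq, b_q in Code6) has to be added separately, here as `output_layer`.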

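With the code written, we move on to testing the LSTM. The parameter combination is essentially the same as the baseline (Table 8); a sample result is shown in Figure 35:

 

Figure 35 LSTM sample result

Analysis: from Figures 34 and 35, the processed data flows through the LSTM model normally. After 500 epochs the final perplexity is 1.0, at a speed of 2401.5 tokens/sec on a personal CPU, which supports the soundness of the LSTM model design and code implementation.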

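2.3 Ablation experiment

As with the RNN, we can run ablation experiments on the LSTM parameters to study how different choices affect model speed and performance. This part uses num_hiddens as the example; the other studies follow the same logic and steps as 1.4 and are not repeated here.

We keep the other parameters fixed and vary only num_hiddens, running 20 experiments per value and averaging, to examine the trend of perplexity and tokens/sec. The data are summarized in Table 16 and the comparison is shown in Figure 36: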

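Table 16 Effect of num_hiddens on LSTM (data summary)

| num_hiddens | 32 | 64 | 128 | 256 | 512 | 1024 |
| --- | --- | --- | --- | --- | --- | --- |
| Perplexity | 4.2 | 2.2 | 1.17 | 1.11 | 1.02 | 1.02 |
| Tokens/sec | 99672.4 | 47770 | 43921.7 | 14154.9 | 2408.6 | 1989.9 |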

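Figure 36 Effect of num_hiddens on LSTM (comparison)

Analysis: as Table 16 and Figure 36 show, perplexity falls as num_hiddens grows and then levels off (the model gets better), while tokens/sec steadily drops (efficiency gets worse). The balance between speed and performance can therefore be set through the choice of num_hiddens, and the value of 512 in Table 8 is a good choice.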
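One plausible reason throughput falls so quickly is that the number of parameters grows roughly quadratically in num_hiddens. The sketch below counts the parameters of the from-scratch LSTM in Code6 (four weight groups plus the output layer). The vocabulary size of 28 is an assumption for the character-level Time Machine data, and `lstm_param_count` is a helper introduced here for illustration; the API layer `nn.LSTM` stores two bias vectors per gate, so its count differs slightly.

```python
def lstm_param_count(num_inputs, num_hiddens, num_outputs):
    """Parameter count for the Code6-style LSTM (hypothetical helper)."""
    # Input gate, forget gate, output gate and candidate cell each have
    # W_x (num_inputs x num_hiddens), W_h (num_hiddens x num_hiddens), b (num_hiddens)
    per_group = num_inputs * num_hiddens + num_hiddens * num_hiddens + num_hiddens
    # Output layer: W_hq (num_hiddens x num_outputs) and b_q (num_outputs)
    output_layer = num_hiddens * num_outputs + num_outputs
    return 4 * per_group + output_layer

vocab_size = 28  # assumed character-level vocabulary size
counts = {h: lstm_param_count(vocab_size, h, vocab_size)
          for h in [32, 64, 128, 256, 512, 1024]}
for h, n in counts.items():
    print(f"num_hiddens={h:5d}  parameters={n}")
```

Doubling num_hiddens roughly quadruples the dominant h x h terms, which is consistent with the direction of the trend in Table 16, although the measured speed also depends on hardware and batching.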

    This completes the theory and experiments for LSTM.

3.GRU

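3.1 Principle

Reference paper: RNN Encoder–Decoder.pdf (arxiv.org)

Reference paper: GRU.pdf (arxiv.org)

LSTM is a classic improvement on the RNN: it mitigates exploding and vanishing gradients fairly well and addresses the latent-variable model's problems with long-term information preservation and short-term input skipping. However, LSTM is structurally complex and relatively inefficient, which led to a classic variant, the GRU (gated recurrent unit). The architectures are compared in Figure 37:

Figure 37 LSTM vs GRU (architecture comparison)

In short, the GRU merges the forget gate and input gate into a single "update gate", merges the cell state and hidden state, and makes a few other changes, yielding a simpler model. The core flow is reset gate -> update gate -> candidate hidden state -> hidden state; for the detailed computation see Chart 1 and Figure 32, which is not repeated here.

Overall, the GRU has two distinctive features:

  1. The reset gate helps capture short-term dependencies in a sequence
  2. The update gate helps capture long-term dependencies in a sequence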
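The flow reset gate -> update gate -> candidate hidden state -> hidden state can be written out as follows. This is a sketch in the notation of the from-scratch code in section 3.2; sigma denotes the sigmoid and the circled dot element-wise multiplication, both notational assumptions.

```latex
\begin{aligned}
\mathbf{R}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r) && \text{(reset gate)} \\
\mathbf{Z}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z) && \text{(update gate)} \\
\tilde{\mathbf{H}}_t &= \tanh(\mathbf{X}_t \mathbf{W}_{xh} + (\mathbf{R}_t \odot \mathbf{H}_{t-1}) \mathbf{W}_{hh} + \mathbf{b}_h) && \text{(candidate hidden state)} \\
\mathbf{H}_t &= \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t && \text{(hidden state)}
\end{aligned}
```

When Z_t is close to 1 the old state is carried over almost unchanged (long-term dependency); when R_t is close to 0 the candidate ignores the old state and depends mostly on the current input (short-term dependency).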

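3.2 Code implementation

In this part we turn to the GRU code. Based on the analysis of the GRU principle and architecture in 3.1, the code is written in GRU.py; only the core code is discussed here. The core flow of the GRU implementation is dataset loading -> parameter initialization -> model definition -> training and prediction. As before there are two approaches, from scratch and via the API; the from-scratch model definition is shown in Code7 (the experiments, however, use the API version):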

import torch

# Initialize the model state: a single hidden state H of zeros
def init_gru_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )

# Model definition
def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:  # iterate over time steps
        Z = torch.sigmoid((X @ W_xz) + (H @ W_hz) + b_z)            # update gate
        R = torch.sigmoid((X @ W_xr) + (H @ W_hr) + b_r)            # reset gate
        H_tilda = torch.tanh((X @ W_xh) + ((R * H) @ W_hh) + b_h)   # candidate hidden state
        H = Z * H + (1 - Z) * H_tilda                               # new hidden state
        Y = H @ W_hq + b_q                                          # output layer
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H,)

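With the code written, we move on to testing the GRU. The parameter combination is essentially the same as the baseline (Table 8); a sample result is shown in Figure 39:

 

Figure 39 GRU sample result

Analysis: from Figures 38 and 39, the processed data flows through the GRU model normally. After 500 epochs the final perplexity is 1.0, at a speed of 4538.1 tokens/sec on a personal CPU, which supports the soundness of the GRU model design and code implementation.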

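3.3 Ablation experiment

As with the RNN and LSTM, we can run ablation experiments on the GRU parameters to study how different choices affect model speed and performance. This part uses num_hiddens as the example; the other studies follow the same logic and steps as 1.4 and are not repeated here.

We keep the other parameters fixed and vary only num_hiddens, running 20 experiments per value and averaging, to examine the trend of perplexity and tokens/sec. The data are summarized in Table 17 and the comparison is shown in Figure 40: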

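Table 17 Effect of num_hiddens on GRU (data summary)

| num_hiddens | 32 | 64 | 128 | 256 | 512 | 1024 |
| --- | --- | --- | --- | --- | --- | --- |
| Perplexity | 3.7 | 1.7 | 1.1 | 1.08 | 1 | 1 |
| Tokens/sec | 82965.5 | 57806.2 | 47659.9 | 24216.3 | 4692.1 | 1764.5 |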

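Figure 40 Effect of num_hiddens on GRU (comparison)

Analysis: as Table 17 and Figure 40 show, perplexity falls as num_hiddens grows and then levels off (the model gets better), while tokens/sec steadily drops (efficiency gets worse). The balance between speed and performance can therefore be set through the choice of num_hiddens, and the value of 512 in Table 8 is a good choice.

    This completes the theory and experiments for GRU.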

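4. Comparative Analysis

In the previous three sections we built three models, RNN, LSTM, and GRU, using different approaches (from scratch, API) and different parameter combinations (ablation tests). In this section we pull data from those ablation experiments to compare the three models in perplexity (performance) and tokens/sec (speed), using num_hiddens as the grouping dimension (as an example; other dimensions work the same way and are not expanded here). The data are summarized in Tables 18 and 19 (perplexity and tokens/sec respectively), and the comparisons are shown in Figures 41 and 42:

Tips: all compared models are built with the API, the dataset is the Time Machine book, and the other parameters are essentially the same as Table 8.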

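Table 18 Model perplexity comparison (data summary)

| num_hiddens | 32 | 64 | 128 | 256 | 512 | 1024 |
| --- | --- | --- | --- | --- | --- | --- |
| RNN | 5.21 | 3.69 | 1.98 | 1.28 | 1.03 | 1.02 |
| LSTM | 4.2 | 2.2 | 1.17 | 1.11 | 1.02 | 1.02 |
| GRU | 3.7 | 1.7 | 1.1 | 1.08 | 1 | 1 |

Figure 41 Model perplexity comparison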

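Table 19 Model tokens/sec comparison (data summary)

| num_hiddens | 32 | 64 | 128 | 256 | 512 | 1024 |
| --- | --- | --- | --- | --- | --- | --- |
| RNN | 289066.5 | 182857.1 | 160003 | 84529.1 | 28992.3 | 9542.2 |
| LSTM | 99672.4 | 47770 | 43921.7 | 14154.9 | 2408.6 | 1989.9 |
| GRU | 82965.5 | 57806.2 | 47659.9 | 24216.3 | 4692.1 | 1764.5 |

Figure 42 Model tokens/sec comparison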

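Analysis: from Figures 41 and 42 and Tables 18 and 19, we can draw the following conclusions on speed and performance:

  1. Performance: under these experimental conditions, for every num_hiddens the ranking is GRU >= LSTM >= RNN (the reverse order in perplexity; LSTM and RNN tie at 1.02 for num_hiddens = 1024). As num_hiddens grows, all three models perform better and the gap between them narrows.
  2. Speed: under these experimental conditions, for every num_hiddens both GRU and LSTM are slower than RNN (GRU is faster than LSTM in most cases, namely for num_hiddens from 64 to 512). As num_hiddens grows, all three models slow down and the absolute gap between them narrows.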
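The speed claims can be checked directly against Table 19. The snippet below is a small sketch that copies the table values and computes where GRU beats LSTM, the absolute RNN-LSTM gap, and the RNN/LSTM speed ratio; the variable names are introduced here for illustration.

```python
# Values copied from Table 19 (tokens/sec)
num_hiddens = [32, 64, 128, 256, 512, 1024]
tokens_per_sec = {
    "RNN":  [289066.5, 182857.1, 160003, 84529.1, 28992.3, 9542.2],
    "LSTM": [99672.4, 47770, 43921.7, 14154.9, 2408.6, 1989.9],
    "GRU":  [82965.5, 57806.2, 47659.9, 24216.3, 4692.1, 1764.5],
}

rnn, lstm, gru = (tokens_per_sec[name] for name in ("RNN", "LSTM", "GRU"))

# Settings where GRU is faster than LSTM
gru_faster_than_lstm = [h for h, g, l in zip(num_hiddens, gru, lstm) if g > l]
# Is RNN the fastest model at every setting?
rnn_always_fastest = all(r > max(l, g) for r, l, g in zip(rnn, lstm, gru))
# Absolute and relative gap between RNN and LSTM
abs_gap = [r - l for r, l in zip(rnn, lstm)]
ratio = [r / l for r, l in zip(rnn, lstm)]

print("GRU faster than LSTM at:", gru_faster_than_lstm)
print("RNN always fastest:", rnn_always_fastest)
for h, a, q in zip(num_hiddens, abs_gap, ratio):
    print(f"num_hiddens={h:5d}  RNN-LSTM gap={a:10.1f}  RNN/LSTM ratio={q:5.2f}")
```

The absolute gap shrinks steadily, but the ratio does not shrink monotonically, so "the gap narrows" holds in absolute terms only.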

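    Summary: GRU is therefore not a better choice than the conventional RNN in every situation (and this experiment only used num_hiddens as the grouping dimension). In practice the model should be chosen according to the task, the dataset, the available compute, and other real constraints.

4. Introduction and selection of advanced recurrent neural network architecture

In the previous chapter we explained the architecture design and mathematical derivation of several classic recurrent neural networks (RNN, LSTM, GRU), implemented them in different ways, used ablation experiments to study how several hyperparameters affect model speed and performance, and compared the strengths and weaknesses of the three models.

In this chapter we introduce some advanced recurrent neural networks (a name used in this experiment, not an official one), taking deep recurrent neural networks, bidirectional recurrent neural networks, the encoder-decoder structure, and sequence-to-sequence learning as examples, and implement a selection of them.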

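1. Deep Recurrent Neural Network

1.1 Principle

So far we have mainly discussed recurrent neural networks with a single unidirectional hidden layer. In fact recurrent layers can be stacked, and this is the essence of a deep recurrent neural network; an example is shown in Figure 43. This is instructive both for adding layers and for adding nonlinearity. The functional dependencies are formalized in Eq. 12, and finally the output layer is computed only from the hidden state of the last (L-th) hidden layer, as in Eq. 13:

Tips: the symbols have the same meaning as in the iterative derivation of the network and are not repeated here.

Figure 43 Deep recurrent neural network example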
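Eq. 12 and Eq. 13 are not reproduced in this text. A common formulation of the two relations, following the d2l.ai notation and given here as an assumption rather than as the original figures, is:

```latex
\begin{aligned}
\mathbf{H}_t^{(l)} &= \phi_l\!\left(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}\right),
\qquad \mathbf{H}_t^{(0)} = \mathbf{X}_t, \quad l = 1, \ldots, L \\
\mathbf{O}_t &= \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q
\end{aligned}
```

Each layer l receives the hidden state of the layer below at the same time step and its own hidden state from the previous time step; only the top layer L feeds the output.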

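1.2 Code implementation

Many of the logistical details needed to implement a multi-layer recurrent neural network are readily available in the high-level API. Taking LSTM as the example, the code is similar to section 2.2 of Chapter 3; the only difference is that we specify the number of layers instead of using the default of a single layer. The core code is shown in Code8 and the complete code is in DeepLSTM.Py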

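from torch import nn
from d2l import torch as d2l

# vocab comes from the data-loading step (d2l.load_data_time_machine)
vocab_size, num_hiddens, num_layers = len(vocab), 256, 2  # num_layers=2 stacks two LSTM layers
num_inputs = vocab_size
device = d2l.try_gpu()
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers)
model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)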

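With the code written, we move on to testing the multi-layer LSTM. The parameter combination is essentially the same as the baseline (Table 8), except that num_hiddens is changed from 512 to 32; a sample result is shown in Figure 45:

Figure 45 Two-layer LSTM sample result

Analysis: from Figures 44 and 45, the processed data flows through the two-layer LSTM model normally. After 500 epochs the final perplexity is 2.1, at a speed of 56000.6 tokens/sec on a personal CPU, which supports the soundness of the two-layer LSTM model design and code implementation.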

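1.3 Ablation experiment

As with the models in Chapter 3, we can run ablation experiments on the parameters of the deep recurrent neural network to study how different choices affect model speed and performance. The most important one is num_layers, so this part uses it as the example; the other studies follow the same logic and steps as section 1.4 of Chapter 3 and are not repeated here.

We keep the other parameters fixed and vary only num_layers, running 20 experiments per value and averaging, to examine the trend of perplexity and tokens/sec. The data are summarized in Table 20 and the comparison is shown in Figure 46: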

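Table 20 Effect of num_layers on LSTM (data summary)

| num_layers | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Perplexity | 4.2 | 2.09 | 2.54 | 17.4 | 17.5 |
| Tokens/sec | 99672.4 | 55921.8 | 38788.8 | 28000.5 | 23579.3 |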

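Figure 46 Effect of num_layers on LSTM (comparison)

Analysis: as Table 20 and Figure 46 show, perplexity first decreases, then increases, then levels off as num_layers grows, while tokens/sec steadily drops (efficiency gets worse). A larger num_layers is therefore not always better; its choice is closely tied to the other hyperparameters and requires repeated testing to find the best value. In this experiment, num_layers = 2 is a good choice.

    This completes the theory and experiments for deep recurrent neural networks.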

2. Bidirectional Recurrent Neural Network

Reference paper: Bidirectional Recurrent Neural Networks - (cmu.edu)

Let us first look at an example that shows why the "future" matters, as in Table 21:

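Table 21 Text sequence fill-in-the-blank example (the filled-in word is shown in brackets)

[hungry].

[not] very hungry.

[very] very hungry, I could eat a whole pig.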

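Depending on how much information is available, we can fill the blank with different words. In this example the following text (the "future") conveys important information and constrains the choice, whereas an RNN only looks at the preceding text and is therefore limited here. This motivated the BRNN (bidirectional recurrent neural network). An example architecture is shown in Figure 47, the forward and backward updates are given in Eq. 14, and the output in Eq. 15:

Tips: the symbols have the same meaning as in the iterative derivation of the network and are not repeated here.

Figure 47 BRNN architecture example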
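Eq. 14 and Eq. 15 are not reproduced in this text. A common formulation, following the d2l.ai notation and given here as an assumption rather than as the original figures, uses one hidden state per direction:

```latex
\begin{aligned}
\overrightarrow{\mathbf{H}}_t &= \phi\!\left(\mathbf{X}_t \mathbf{W}_{xh}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{hh}^{(f)} + \mathbf{b}_h^{(f)}\right) \\
\overleftarrow{\mathbf{H}}_t &= \phi\!\left(\mathbf{X}_t \mathbf{W}_{xh}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{hh}^{(b)} + \mathbf{b}_h^{(b)}\right) \\
\mathbf{O}_t &= \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q,
\qquad \mathbf{H}_t = \left[\overrightarrow{\mathbf{H}}_t,\ \overleftarrow{\mathbf{H}}_t\right]
\end{aligned}
```

The backward state at step t depends on step t+1, which is exactly the "future" information that a unidirectional RNN cannot see.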

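The key feature of a BRNN is thus that it uses information from both ends of the sequence to estimate the output. When predicting the next token, however, this is of limited value and it greatly slows down computation, so bidirectional layers are used very sparingly in practice and only for certain applications: for example, filling in missing words, annotating tokens (e.g. for named entity recognition), and encoding sequences as one step in a sequence-processing pipeline (e.g. for machine translation).

The BRNN code is shown in Code9 and a sample result in Figure 48: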

import torch
from torch import nn
from d2l import torch as d2l
# Load the data
batch_size, num_steps, device = 32, 35, d2l.try_gpu()
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
# Define a bidirectional LSTM model by setting "bidirectional=True"
vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
num_inputs = vocab_size
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers, bidirectional=True)
model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)
# Train the model
num_epochs, lr = 500, 1
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)

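Tips: this example is a deliberate misuse of a bidirectional recurrent neural network, namely using it for sequence prediction.

Figure 48 Bidirectional LSTM sample result

Analysis: Figure 48 shows that the final predicted output is unreasonable, which confirms the limitation of bidirectional recurrent neural networks on tasks such as sequence prediction.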
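The reason for this failure can be seen in a small check: in a bidirectional layer, the output at the first time step already depends on later inputs, which are not available when predicting the next token. The sketch below perturbs only the last time step of a random input and tests whether the first-step output changes; the sizes and the helper `first_step_changes` are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
num_steps, batch_size, num_inputs, num_hiddens = 5, 1, 4, 8

X = torch.randn(num_steps, batch_size, num_inputs)
X_future = X.clone()
X_future[-1] += 1.0  # change only the last time step (the "future")

def first_step_changes(layer):
    """Return True if the output at step 0 reacts to a change at the last step."""
    with torch.no_grad():
        Y, _ = layer(X)
        Y_future, _ = layer(X_future)
    return not torch.allclose(Y[0], Y_future[0])

uni = nn.LSTM(num_inputs, num_hiddens)
bi = nn.LSTM(num_inputs, num_hiddens, bidirectional=True)

uni_sees_future = first_step_changes(uni)
bi_sees_future = first_step_changes(bi)
bi_output_width = bi(X)[0].shape[-1]  # forward and backward states are concatenated

print("unidirectional output at step 0 depends on the future:", uni_sees_future)
print("bidirectional output at step 0 depends on the future:", bi_sees_future)
print("bidirectional output width:", bi_output_width)
```

During training the backward direction can read the tokens it is supposed to predict, while at prediction time those tokens do not exist yet, which is consistent with the unreasonable output in Figure 48.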

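3. Densely connected network

When using deep recurrent neural networks in the first section, one question naturally arises: does adding layers improve accuracy? Figure 49 illustrates this, and the ablation experiment above gave a simple test. In Experiment 2, when studying CNNs, we analyzed ResNet (not repeated here). In fact, deep recurrent neural networks can be combined with residual networks, which gives the densely connected network; an example architecture is shown in Figure 50:

 Figure 49 Relationship between accuracy and number of layers

Figure 50 Densely connected network architecture example

From Figure 50 and the definitions of the two related networks, the main features of a densely connected network can be summarized as follows:

  1. The output of the previous layer is concatenated into the input of the next layer
  2. Transition layers are occasionally added to reduce dimensionality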

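4. Machine Translation and Datasets

    The models so far have all used sequence prediction as the benchmark task, which is somewhat limited and dull. NLP is a fast-moving area of computing with many other classic tasks, such as machine translation, which is the most successful benchmark for language models. Because machine translation is the core problem for sequence transduction models that convert an input sequence into an output sequence, we introduce the related concepts and dataset here; the architectures presented afterwards are all based on the machine translation task.

Machine translation, as the name suggests, means automatically translating a sequence from one language into another. It is generally divided into statistical machine translation (based on statistical methods) and neural machine translation (based on neural networks).

A machine translation dataset consists of pairs of text sequences in the source and target languages. We therefore need a completely different way of preprocessing it rather than reusing the language-model preprocessing routines. A sample workflow, up to loading the preprocessed data into minibatches for training, is summarized in Table 22:

Tips: this experiment uses the English-French dataset made of bilingual sentence pairs from the Tatoeba Project as an example; for details see: Tab-delimited Bilingual Sentence Pairs from the Tatoeba Project (manythings.org)

Table 22 Machine translation dataset preprocessing

Input: dataset

Each line of the dataset is a tab-separated pair of text sequences, consisting of an English text sequence and its French translation. Note that each text sequence may be a single sentence or a paragraph containing several sentences, for example: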

Go. Va !

Hi. Salut !

Run!        Cours !

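1. Preprocessing

    After downloading and loading the dataset, several preprocessing steps are needed, such as replacing non-breaking spaces with ordinary spaces, converting uppercase letters to lowercase, and inserting a space between words and punctuation marks.

2. Tokenization

Unlike the earlier character-level tokenization, machine translation generally prefers word-level tokenization (state-of-the-art models may use more advanced tokenization techniques), for example: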

([['go', '.'],

[['ça', 'alors', '!']])

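3. Vocabulary construction and dataset loading

Build two separate vocabularies for the source and target languages, and use truncation and padding so that each minibatch contains text sequences of the same length.

Output: pass the processed data to the model for training.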
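The three steps of Table 22 can be sketched in plain Python on the three sample pairs shown above. The function names `preprocess_nmt`, `tokenize_nmt`, and `truncate_pad`, the padding token `<pad>`, and the target length 4 are choices made here for illustration and are not claimed to match the project's own code.

```python
def preprocess_nmt(text):
    """Normalize spaces, lowercase, and insert a space before punctuation."""
    def needs_space(char, prev_char):
        return char in ',.!?' and prev_char != ' '

    # Replace non-breaking spaces with ordinary spaces, then lowercase
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    chars = [' ' + c if i > 0 and needs_space(c, text[i - 1]) else c
             for i, c in enumerate(text)]
    return ''.join(chars)

def tokenize_nmt(text):
    """Split tab-separated pairs into word-level source and target tokens."""
    source, target = [], []
    for line in text.split('\n'):
        parts = line.split('\t')
        if len(parts) == 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    return source, target

def truncate_pad(tokens, num_steps, padding_token):
    """Truncate or pad a token list to exactly num_steps items."""
    if len(tokens) > num_steps:
        return tokens[:num_steps]
    return tokens + [padding_token] * (num_steps - len(tokens))

raw_text = "Go.\tVa !\nHi.\tSalut !\nRun!\tCours\u202f!"
source, target = tokenize_nmt(preprocess_nmt(raw_text))
padded_source = [truncate_pad(tokens, 4, '<pad>') for tokens in source]

print(source)
print(target)
print(padded_source)
```

In a full pipeline the padded token lists would then be mapped to indices through the source and target vocabularies and grouped into minibatches.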

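5. Encoder-Decoder Architecture

    Encoder-Decoder is an abstract concept for deep learning models. It is generally held that many models originate from, or can be jointly described by, this architecture, including but not limited to CNN, RNN, and Transformer. The general architecture is shown in Figure 51:

Figure 51 General encoder-decoder architecture

From Figure 51, the two core parts of the architecture are easy to identify:

  1. Encoder: converts the input (Input) into features (Feature)
  2. Decoder: converts the features (Feature) into the target (Target)

As mentioned, many models can be described under this architecture. Taking CNN and RNN as examples, as shown in Figure 52, they can be understood simply as follows:

  1. A CNN can be seen as the case where the decoder does not need to receive the input
  2. An RNN can be seen as the case where the decoder also receives the input

Figure 52 CNN vs RNN (Encoder-Decoder)

Returning to RNNs: as noted in section 4, machine translation is a core problem for sequence transduction models, and both its input and output are variable-length sequences, so the encoder-decoder architecture is a good fit. The encoder takes a variable-length sequence as input and converts it into an encoded state with a fixed shape; the decoder maps the fixed-shape encoded state to a variable-length sequence. The code is shown in Code10 and Code11 respectively: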

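from torch import nn

# Code10: base encoder interface for the encoder-decoder architecture
class Encoder(nn.Module):
    def __init__(self, **kwargs):
        super(Encoder, self).__init__(**kwargs)

    def forward(self, X, *args):
        # Subclasses map the input sequence X to an encoded state
        raise NotImplementedError

# Code11: base decoder interface for the encoder-decoder architecture
class Decoder(nn.Module):
    def __init__(self, **kwargs):
        super(Decoder, self).__init__(**kwargs)

    def init_state(self, enc_outputs, *args):
        # Subclasses convert the encoder outputs into the initial decoder state
        raise NotImplementedError

    def forward(self, X, state):
        # Subclasses map the decoder input X and the state to the output sequence
        raise NotImplementedError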

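6. Sequence-to-sequence learning

Reference paper: Sequence to Sequence Learning with Neural Networks 14 Dec 2014

In the previous section we analyzed the encoder-decoder architecture, which naturally suggests using neural networks that carry state. In this section we discuss a sequence transduction model that uses recurrent neural networks within the encoder-decoder architecture: seq2seq (sequence-to-sequence learning). An example architecture is shown in Figure 53, and its layers are shown in Figure 54:

Figure 53 Example architecture of sequence-to-sequence learning with an RNN encoder-decoder

Figure 54 Layers in the RNN encoder-decoder model

As in Figure 54, a key point of sequence-to-sequence learning is that the model stops predicting once the output sequence produces the end-of-sequence token <eos>. To measure training, the usual loss can be used (softmax to obtain a distribution, optimized through the cross-entropy loss). To evaluate a predicted sequence, we compare it with the ground-truth label sequence, using BLEU, which is widely used to measure the quality of output sequences in many applications. In principle, for any n-gram in the predicted sequence, BLEU evaluates whether that n-gram appears in the label sequence, as in Eq. 16: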

\exp \left(\min \left(0,1-\frac{\text { len }_{\text {label }}}{\text { len }_{\text {pred }}}\right)\right) \prod_{n=1}^{k} p_{n}^{1 / 2^{n}}

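Eq. 16 Definition of BLEU

Tips: here len_label is the number of tokens in the label sequence and len_pred is the number of tokens in the predicted sequence; k is the longest n-gram used for matching, and p_n denotes the precision of n-grams.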
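Eq. 16 translates into a few lines of Python. The function below is a sketch modelled on the well-known d2l.ai implementation; it assumes space-separated token strings and that k does not exceed the number of predicted tokens (otherwise the n-gram count in the denominator would be zero).

```python
import collections
import math

def bleu(pred_seq, label_seq, k):
    """Compute BLEU as in Eq. 16 for space-separated token strings."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    # Brevity penalty: exp(min(0, 1 - len_label / len_pred))
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, k + 1):
        num_matches = 0
        label_subs = collections.defaultdict(int)
        # Count every n-gram in the label sequence
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        # Match predicted n-grams against the remaining label n-grams
        for i in range(len_pred - n + 1):
            ngram = ' '.join(pred_tokens[i: i + n])
            if label_subs[ngram] > 0:
                num_matches += 1
                label_subs[ngram] -= 1
        # p_n raised to the power 1 / 2^n
        p_n = num_matches / (len_pred - n + 1)
        score *= math.pow(p_n, math.pow(0.5, n))
    return score

perfect = bleu("va !", "va !", k=2)
no_overlap = bleu("a b", "c d", k=2)
partial = bleu("a b c", "a b d", k=2)
print(perfect, no_overlap, partial)
```

Longer n-grams get a smaller exponent and therefore more weight, and a prediction shorter than the label is penalized by the exponential factor.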

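For the code, see seq2seq.py for details; only the core code is discussed here. First comes the design of the encoder and decoder, whose core is consistent with Code10 and Code11 (and needs to be extended). The training and model initialization code is shown in Code12, and two sample results are shown in Figures 55 and 56 (training and prediction):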

import torch
from torch import nn
from d2l import torch as d2l

# MaskedSoftmaxCELoss, Seq2SeqEncoder and Seq2SeqDecoder are defined in seq2seq.py

#@save Training
def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
    """Train a sequence-to-sequence model."""
    def xavier_init_weights(m):
        if type(m) == nn.Linear:
            nn.init.xavier_uniform_(m.weight)
        if type(m) == nn.GRU:
            for param in m._flat_weights_names:
                if "weight" in param:
                    nn.init.xavier_uniform_(m._parameters[param])

    net.apply(xavier_init_weights)
    net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss = MaskedSoftmaxCELoss()
    net.train()
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                     xlim=[10, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # sum of training loss, number of tokens
        for batch in data_iter:
            optimizer.zero_grad()
            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
                          device=device).reshape(-1, 1)
            dec_input = torch.cat([bos, Y[:, :-1]], 1)  # teacher forcing
            Y_hat, _ = net(X, dec_input, X_valid_len)
            l = loss(Y_hat, Y, Y_valid_len)
            l.sum().backward()      # backpropagate the scalar loss
            d2l.grad_clipping(net, 1)
            num_tokens = Y_valid_len.sum()
            optimizer.step()
            with torch.no_grad():
                metric.add(l.sum(), num_tokens)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
        f'tokens/sec on {str(device)}')

# Model initialization
embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
batch_size, num_steps = 64, 10
lr, num_epochs, device = 0.005, 300, d2l.try_gpu()
train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
encoder = Seq2SeqEncoder(len(src_vocab), embed_size, num_hiddens, num_layers,
                        dropout)
decoder = Seq2SeqDecoder(len(tgt_vocab), embed_size, num_hiddens, num_layers,
                        dropout)
net = d2l.EncoderDecoder(encoder, decoder)
train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)

Figure 55 Sample result of seq2seq for machine translation (training)

Figure 56 Sample result of seq2seq for machine translation (prediction)
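Code12 relies on `Seq2SeqEncoder` and `Seq2SeqDecoder` from seq2seq.py, which are not shown in this article. The sketch below is one possible GRU-based version, modelled on the d2l.ai implementation; it is an assumption about their structure (the GRU branch in `xavier_init_weights` suggests GRU layers), not the author's actual code. The small sizes at the bottom are only for a shape check.

```python
import torch
from torch import nn

class Seq2SeqEncoder(nn.Module):
    """GRU encoder: token ids -> per-step outputs and final hidden state."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers, dropout=dropout)

    def forward(self, X, *args):
        # (batch_size, num_steps) -> (num_steps, batch_size, embed_size)
        X = self.embedding(X).permute(1, 0, 2)
        output, state = self.rnn(X)
        return output, state

class Seq2SeqDecoder(nn.Module):
    """GRU decoder that receives the encoder's final state as context at every step."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers,
                          dropout=dropout)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]  # final hidden state of the encoder

    def forward(self, X, state):
        X = self.embedding(X).permute(1, 0, 2)
        # Repeat the top-layer state so it can be concatenated at every time step
        context = state[-1].repeat(X.shape[0], 1, 1)
        output, state = self.rnn(torch.cat((X, context), 2), state)
        # (num_steps, batch_size, vocab_size) -> (batch_size, num_steps, vocab_size)
        output = self.dense(output).permute(1, 0, 2)
        return output, state

# Shape check with small illustrative sizes
encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, num_layers=2)
decoder = Seq2SeqDecoder(vocab_size=12, embed_size=8, num_hiddens=16, num_layers=2)
encoder.eval()
decoder.eval()
X = torch.zeros((4, 7), dtype=torch.long)
enc_output, enc_state = encoder(X)
dec_output, dec_state = decoder(X, decoder.init_state((enc_output, enc_state)))
print(enc_output.shape, enc_state.shape, dec_output.shape, dec_state.shape)
```

The encoder compresses a variable-length input into a fixed-shape state, and the decoder unrolls that state into per-step scores over the target vocabulary, matching the description in section 5.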

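As with all the RNN models above, seq2seq can also be studied with ablation experiments to see how different parameter combinations affect the results; the logic and steps were analyzed in detail earlier and are not repeated here.

This completes the analysis of most mainstream RNN model architectures.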

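V. Summary

1. Experimental conclusion

The tasks completed in this experiment are listed in Table 23, and the different RNNs are introduced and compared in Chart 2:

Table 23 Tasks completed in Experiment 3

1. Theory review:

    Chapter 2 gave an overview of recurrent neural networks (background, concept, principle, development history) and introduced the basic principle and workflow of RNN training. Chapter 3 analyzed RNN (tanh), LSTM, and GRU in chronological order, covering both architecture and mathematics. Chapter 4 summarized some other recurrent neural networks and their optimizations.

2. Practice and optimization of several RNNs

    The RNN was implemented and compared in two ways, from scratch and via the API. Ablation experiments on different parameters were run for RNN, LSTM, and GRU to quantify how the parameters and the architecture design affect model speed and performance, and the three models were compared. For the advanced recurrent neural networks, several optimization strategies were presented, and deep recurrent neural networks, bidirectional recurrent neural networks, and sequence-to-sequence learning were each explored in practice.

3. Supplementary work

   For the implementation and optimization of advanced RNN architectures, extensions were made beyond the given requirements; for example, on the theory side, the core prerequisites of language models (sequence models/prediction, text preprocessing, machine translation, etc.) were explained in detail.

Chart 2 Introduction to and comparison of different RNNs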

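1. Classic RNN (tanh):

The pioneering RNN in the broad sense. Through recurrence it stores what is "learned" during training in the weights W; in essence it is a recurrent/recursive function.

2. LSTM

Builds on RNN (tanh) by introducing a memory cell and controlling its state through the forget gate, input gate, and output gate. It combines long- and short-term memory and also mitigates exploding/vanishing gradients.

3. GRU

GRU is mainly a simplification of LSTM: it merges the forget gate and input gate into a single "update gate", merges the cell state and hidden state, and makes a few other changes.

4. Deep recurrent neural network

The core of a deep recurrent neural network is stacking RNN layers (changing the number of hidden layers).

5. Densely connected network

A densely connected network is the combination of a deep recurrent neural network and a residual network.

6. Bidirectional recurrent neural network

A bidirectional recurrent neural network is an RNN that attends to both preceding and following context (it uses information from both ends of the sequence to estimate the output). When predicting the next token this is of limited value and it greatly slows down computation, so bidirectional layers are used very sparingly in practice and only for certain applications.

7. Encoder-decoder structure

The encoder-decoder is an abstract concept for deep learning models; it is generally held that many models originate from, or can be jointly described by, this architecture. For RNNs, the encoder takes a variable-length sequence as input and converts it into an encoded state with a fixed shape, and the decoder maps the fixed-shape encoded state to a variable-length sequence.

8. Sequence-to-sequence learning

Sequence-to-sequence learning uses recurrent neural networks to build a sequence transduction model on the encoder-decoder architecture.

Note: the choice of RNN has to be closely tied to actual needs; no single model or algorithm suits every dataset, task, and compute budget.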

2. References

1. 8. Recurrent Neural Networks — Deep Learning 2.0.0 documentation (d2l.ai)

2. The most detailed explanation of cyclic neural network in history (RNN/LSTM/GRU) - Zhihu (zhihu.com)

3. RNN research and development process - short book (jianshu.com)

4. Starting from SRNN in the 1990s, review the research progress of cyclic neural network for 27 years- Zhihu (zhihu.com)

5. Sequence model evolution and study notes in deep learning (including RNN/LSTM/GRU/Seq2Seq/Attention mechanism)

6. Interpretation of RNN Papers on Recurrent Neural Networks

7. RNN Detailed Explanation (Recurrent Neural Network)_bestrivern's Blog-CSDN Blog

8. Introduction to LSTM and mathematical derivation (FULL BPTT)_lstm mathematical expression_a635661820's blog-CSDN blog

9. Recurrent Neural Network RNN, LSTM, GRU - Short Book (jianshu.com)

10. A detailed LSTM and GRU diagram - Zhihu (zhihu.com)

11. Deep Learning: Encoder-Decoder Architecture - Zhihu (zhihu.com)


Origin blog.csdn.net/weixin_51426083/article/details/130148220