Deep Learning 05-RNN Recurrent Neural Network

Overview

Recurrent Neural Network (RNN) is a neural network structure with recurrent connections and is widely used in natural language processing, speech recognition, time series data analysis and other tasks. Compared with traditional neural networks, the main feature of RNN is that it can process sequence data and capture the timing information in the sequence.

The basic unit of RNN is a Recurrent Unit, which receives an input and a hidden state from the previous time step, and outputs the hidden state of the current time step. In traditional RNN, the recurrent unit usually uses activation functions such as tanh or ReLU.

Basic recurrent neural network

principle

The basic recurrent neural network structure consists of an input layer, a hidden layer and an output layer.

[Figure: basic RNN structure with input layer, hidden layer and output layer]
Here $x$ is the input vector, $o$ is the output vector and $s$ is the value of the hidden layer; $U$ is the weight matrix from the input layer to the hidden layer and $V$ is the weight matrix from the hidden layer to the output layer. The value $s$ of the hidden layer of a recurrent neural network depends not only on the current input $x$ but also on the previous value of the hidden layer $s$. The weight matrix $W$ weights the previous hidden-layer value when it is fed back in together with the current input.
  
Expanding the basic RNN structure above along the time dimension (an RNN is a chain structure, and every time slice uses the same parameters; t denotes time step t) makes it much clearer:
[Figure: the RNN unrolled along the time dimension]
At time t the network receives the input $x_t$; the value of the hidden layer is $s_t$ and the output value is $o_t$. The key point is that $s_t$ depends not only on $x_t$ but also on $s_{t-1}$.
Formula 1: $s_t = f(U x_t + W s_{t-1} + B_1)$
Formula 2: $o_t = g(V s_t + B_2)$

  • Formula 1 computes the hidden layer, which is the recurrent layer. U is the weight matrix for the input x, W is the weight matrix applied to the previous hidden-layer value $s_{t-1}$ when it is used as this step's input, and f is the activation function.
  • Formula 2 computes the output layer. V is the weight matrix of the output layer, g is its activation function, and the biases $B_1$ and $B_2$ are assumed to be 0.

The hidden layer has two inputs: the first is the product of U and the vector $x_t$, and the second is the product of W and the state $s_{t-1}$ output by the hidden layer at the previous step. The value $s_{t-1}$ computed at the previous moment must therefore be cached, combined with the current input $x_t$, and used to produce the final output $o_t$.

If we repeatedly substitute Formula 1 into Formula 2, we get:

$o_t = g(V s_t) = g(V f(U x_t + W s_{t-1})) = g(V f(U x_t + W f(U x_{t-1} + W s_{t-2}))) = g(V f(U x_t + W f(U x_{t-1} + W f(U x_{t-2} + \cdots))))$

As can be seen above, the output value $o_t$ of the recurrent neural network is influenced by all of the previous inputs $x_t, x_{t-1}, x_{t-2}, x_{t-3}, \ldots$, which is why a recurrent neural network can look back over any number of input values. This is not always a good thing: if the earlier values have no relationship with the later ones, the recurrent neural network still takes them into account, which can hurt its judgment of the later values.

The above is the forward-propagation process of a complete unidirectional single-layer RNN.
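As a quick illustration, here is a minimal sketch of that forward pass in NumPy, following Formulas 1 and 2 directly with the biases set to 0 (the sizes, the random weights and the choice of f = tanh, g = identity are only for illustration):

import numpy as np

# Minimal sketch of Formulas 1 and 2 (illustrative sizes; f = tanh, g = identity, biases = 0).
input_size, hidden_size, output_size, T = 4, 3, 2, 5
U = np.random.randn(hidden_size, input_size)    # input-to-hidden weights
W = np.random.randn(hidden_size, hidden_size)   # hidden-to-hidden weights
V = np.random.randn(output_size, hidden_size)   # hidden-to-output weights
xs = np.random.randn(T, input_size)             # a sequence of T input vectors

s = np.zeros(hidden_size)                       # initial hidden state s_0
for step, x_t in enumerate(xs, start=1):
    s = np.tanh(U @ x_t + W @ s)                # Formula 1: s_t = f(U x_t + W s_{t-1})
    o = V @ s                                   # Formula 2: o_t = g(V s_t)
    print("t =", step, "o_t =", o)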

To understand the format of the input x more quickly, let's use word embedding in NLP to explain it.

Word Embedding

First, we need to encode the input text x into a form the computer can read. When encoding, we want to preserve the similarity relationships between words across sentences. The vector representation of words is the basis for machine learning and deep learning.

A basic idea of word embedding is to map each word to a point in a low-dimensional dense semantic space. This mapping places semantically similar words close together in the semantic space, while words whose relationship is weak end up relatively far apart.

[Figure: English and Spanish number words mapped into a shared semantic space]
As shown in the figure above, when English and Spanish are mapped into a semantic space, numbers with the same meaning occupy similar positions. Let's briefly review word embedding. For NLP, the input consists of discrete symbols, while a neural network works with vectors or matrices. So the first step is to encode each word as a vector. The simplest method is the one-hot representation, as shown below:
[Figure: one-hot representation of words]
Python code for one-hot encoding, for example:

import numpy as np

word_array = ['apple', 'kiwi', 'mango']
word_dict = {'apple': 0, 'banana': 1, 'orange': 2, 'grape': 3, 'melon': 4, 'peach': 5, 'pear': 6, 'kiwi': 7, 'plum': 8, 'mango': 9}

# Create an all-zero matrix
one_hot_matrix = np.zeros((len(word_array), len(word_dict)))

# One-hot encode each word
for i, word in enumerate(word_array):
    word_index = word_dict[word]
    one_hot_matrix[i, word_index] = 1

print(one_hot_matrix)

Output:

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]   # the one-hot encoding of apple
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]    # the one-hot encoding of kiwi
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]    # the one-hot encoding of mango

Each row represents one input word, and each column corresponds to one word of the vocabulary, i.e. one feature column.

Although one-hot encoding is a simple and effective feature representation method, it also has some shortcomings:

  1. High-dimensional representation: with one-hot encoding, each feature requires a large sparse vector whose dimension equals the number of unique values of that feature. This produces high-dimensional input data and increases computation and storage overhead, especially when dealing with a large number of discrete features, where the feature space becomes very large.

  2. Dimension independence: One-hot encoding represents each feature as an independent binary feature, without taking into account the correlation and semantic relationship between features. This may make it difficult for the model to capture the interactions and correlations between features, thus affecting the performance of the model.

  3. Unable to handle unknown features: one-hot encoding requires that every unique value of a feature appears in the training set. If a feature value that did not appear in the training set is encountered in the test set or in a real application, it cannot be one-hot encoded, and the model may be unable to handle these unknown values.

  4. Feature sparsity: Since the feature vector of one-hot encoding is sparse, most elements are 0, which will lead to increased data sparsity, which may cause some problems for some algorithms (such as linear models).

In summary, although one-hot encoding is a simple and effective feature representation in some cases, it has shortcomings, and problems may arise especially when dealing with high-dimensional discrete features, when relationships between features matter, and when unknown feature values appear.

There are two main reasons for using nn.Embedding instead of one-hot encoding:

  1. Dimensionality flexibility: When using one-hot encoding, each feature needs to create a large sparse vector with dimensions equal to the number of unique values ​​of the feature. This results in high-dimensional input, increasing computational and storage overhead. The use of embedding can map discrete features into low-dimensional continuous vector representations, reducing storage and computing costs.

  2. Semantic relationships and similarities: Embedding vectors can capture the semantic relationships and similarities between features. For example, in natural language processing tasks, using embedding vectors can map words into continuous vector representations, so that words with similar semantic meanings are closer in the embedding space. Such features can help the model better understand and learn the relationship between features and improve the performance of the model.

Therefore, using nn.Embedding instead of one-hot encoding can improve the efficiency and performance of the model, especially when dealing with high-dimensional discrete features.

Okay, let's look at a simple example to work out the effect of the two parameters of nn.Embedding.

Suppose we have a sentence classification task, our input is a sentence and each word is a feature. We have 5 different words, namely ["I", "love", "deep", "learning", "!"].

We can use nn.Embedding to map these words into embedding vectors (each vector is a point in a coordinate space that represents the word). Suppose we embed each word as a 3-dimensional vector. Here num_embeddings is 5, which means we have 5 different words, and embedding_dim is 3, which means each word is embedded as a 3-dimensional vector.

We can use the following table to represent the embedding vector of each word:

word embedding vector
“I” [0.1, 0.2, 0.3]
“love” [0.4, 0.5, 0.6]
“deep” [0.7, 0.8, 0.9]
“learning” [0.2, 0.3, 0.4]
“!” [0.5, 0.6, 0.7]

With nn.embedding, we can convert each word in the sentence into the corresponding embedding vector. For example, the sentence "I love deep learning!" can be converted into the following sequence of embedding vectors:

[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]

In this way, we can convert discrete word features into continuous embedding vectors for use in deep learning models.

The following shows how to use it in PyTorch:

import torch as t
import torch.nn as nn

# Create the vocabulary
vocab = {"I": 0, "love": 1, "deep": 2, "learning": 3, "!": 4}
strings = ["I", "love", "deep", "learning", "!"]
# Convert the string sequence into a sequence of integer indices
input = t.LongTensor([vocab[word] for word in strings])
# Note that the first argument is the vocabulary size, not the length of the input; even if you
# passed 100 here the output dimension would not change -- this value only determines how many
# row vectors the embedding matrix has.
# nn.Embedding initializes the embedding matrix randomly. In deep learning, model parameters are
# usually randomly initialized so that suitable values can be learned during training.
# Each element of the embedding matrix starts as a small random value; these are trainable
# parameters, and the randomness can be fixed with manual_seed.
t.manual_seed(1234)
embedding = nn.Embedding(len(vocab), 3)
print(embedding(input))

The output results are:

tensor([[-0.1117, -0.4966,  0.1631],
        [-0.8817,  0.0539,  0.6684],
        [-0.0597, -0.4675, -0.2153],
        [ 0.8840, -0.7584, -0.3689],
        [-0.3424, -1.4020,  0.3206]], grad_fn=<EmbeddingBackward0>)

Note that the first parameter of Embedding is not the length of the input characters but the size of the vocabulary. For example, given the vocabulary
{"I": 0, "love": 1, "deep": 2, "learning": 3, "!": 4}, the input might be just "I love". Even then you should pass 5, not 2, because prediction from the last hidden layer needs a fully connected layer over the whole vocabulary to give the probability of each word being the current output.

pytorch rnn

The following is the simplest example of using an RNN in PyTorch, to get familiar with PyTorch's RNN.
Note that PyTorch's RNN does not handle the logic from the hidden layer to the output layer; it only produces the hidden-layer outputs. If you need to turn the hidden state into an output, you can add a fully connected layer. We will not focus on that part here.

#%%

import torch
import torch.nn as nn

# Define the input dimensions
input_size = 10   # dimension of the input features
sequence_length = 5   # number of time steps
batch_size = 3   # batch size

# Create random input data
# The input has shape (sequence_length, batch_size, input_size): sequence_length time steps,
# batch_size samples per time step, each sample with input_size features.
input_data = torch.randn(sequence_length, batch_size, input_size)
print("input data", input_data)
# Define the RNN model
# When defining the RNN model we specify the input feature dimension input_size, the hidden
# dimension hidden_size, the number of hidden layers num_layers, and so on.
# batch_first=False means the batch size is not the first dimension of the input; here it is the second.
rnn = nn.RNN(input_size, hidden_size=20, num_layers=1, batch_first=False)
"""
During the forward pass we feed the input data to the RNN model and get the output tensor `output`
and the hidden state of the last time step `hidden`.
The output tensor has shape (sequence_length, batch_size, hidden_size): the hidden-layer output at every time step.
The hidden state of the last time step has shape (num_layers, batch_size, hidden_size).
"""
# Forward pass; the second argument h0 is not passed and defaults to zeros
output, hidden = rnn(input_data)
print("last hidden state", hidden.shape)
print("hidden outputs at all time steps", output.shape)

# Print the weights and biases of each layer
# weight_ih is the input-to-hidden weight, weight_hh is the hidden-to-hidden weight
# (note that what is printed here is the transposed form).
# bias_ih is the input-to-hidden bias, bias_hh is the hidden-to-hidden bias.
for name, param in rnn.named_parameters():
    if 'weight' in name or 'bias' in name:
        print(name, param.data)

output

last hidden state torch.Size([1, 3, 20])
hidden outputs at all time steps torch.Size([5, 3, 20])

[Figure: the printed RNN weight and bias tensors]

Why do the weights have these shapes? They follow directly from the parameters: weight_ih_l0 has shape (hidden_size, input_size) = (20, 10) and weight_hh_l0 has shape (hidden_size, hidden_size) = (20, 20).
The length of the outermost dimension of the input data determines the number of time steps in forward propagation.
input_size is the dimension of each input vector; for example, after a word is converted to one-hot, it equals the feature length of the dictionary.
hidden_size is the number of hidden-layer neurons, i.e. the feature dimension of the hidden state that is produced.
num_layers stacks multiple hidden layers on top of each other.
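A quick way to check these shapes (continuing the example above, where rnn, output and hidden are already defined):

print(rnn.weight_ih_l0.shape)                  # torch.Size([20, 10]) -> (hidden_size, input_size)
print(rnn.weight_hh_l0.shape)                  # torch.Size([20, 20]) -> (hidden_size, hidden_size)
# For a single-layer, unidirectional RNN the last time step of `output` equals the final hidden state:
print(torch.allclose(output[-1], hidden[0]))   # True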

common structures

Commonly used structures of RNNs (Recurrent Neural Networks) include single-input single-output, single-input multiple-output, multiple-input multiple-output, and multiple-input single-output. Each structure is explained below along with how it can be used; a small code sketch of the many-to-one pattern is shown after this overview.

  1. Single Input Single Output (SISO): the most common RNN structure. The input is a sequence and the output is a single predicted value. For example, given a piece of text, predict the next word; given a stretch of sequence data, predict the value of the next time step. This structure suits many sequence-prediction tasks, such as language models and time-series forecasting.
    For example, suppose we want to predict the price of a house. We may use several features, such as the area of the house, the number of bedrooms and the number of bathrooms. We combine these features into one feature vector as the model input, and the model output is the predicted price. Linear regression can therefore handle multiple features mapping to a single output, which is why it is called a single-input single-output model.

  2. Single Input Multiple Output (SIMO): the input is one sequence but the output consists of several predictions. For example, given a piece of text, predict the next word and its part-of-speech tag at the same time; given an audio signal, predict the speech emotion and the speaker identity simultaneously. This structure suits situations where several related tasks must be predicted at once.

  3. Multiple Input Multiple Output (MIMO): there are multiple input sequences and multiple output sequences. For example, in machine translation the input is the source-language sentence and the output is the target-language sentence; in a dialogue system the input is the sequence of the user's questions and the output is the sequence of the system's answers. This structure suits tasks that process multiple input and output sequences. MIMO comes in two forms: the input and output lengths may be equal or unequal.

  4. Multiple Input Single Output (MISO): there are multiple inputs but only one output. For example, in image-caption generation the input is an image sequence and the output is a description of the image; in autonomous driving the input is data from multiple sensors and the output is a vehicle control command. This structure suits tasks that map multiple input sequences to a single output.

Linear regression is a simple machine-learning model whose input can be multiple features but which has only one output. "Single input single output" here means that the model input is one vector (a combination of several features) and the output is a scalar (one predicted value). In linear regression we obtain the prediction by linearly combining the input features, so although the input may contain several elements, the output is only one.
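As a concrete illustration of the many-to-one pattern mentioned above, here is a minimal sketch that classifies a whole sequence by feeding the last hidden state into a fully connected layer (the sizes and the 4-class head are illustrative assumptions, not something from the original text):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=1)
classifier = nn.Linear(20, 4)            # e.g. 4 intent classes
x = torch.randn(5, 3, 10)                # (sequence_length, batch_size, input_size)
output, hidden = rnn(x)
logits = classifier(hidden[-1])          # use only the hidden state of the last time step
print(logits.shape)                      # torch.Size([3, 4]): one prediction per sequence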

[Figure: common RNN input-output structures]

bidirectional recurrent neural network

Ordinary RNN can only predict the output of the next moment based on the timing information of the previous moment. However, in some problems, the output of the current moment is not only related to the previous state, but also to the future state.

For example, predicting a missing word in a sentence not only needs to be judged based on the previous text, but also the content behind it needs to be considered to truly make a judgment based on context.

BRNN is composed of two RNNs superimposed one above the other, and the output is determined by the states of the two RNNs.
[Figure 1-1: bidirectional RNN structure]
First, let’s focus on the symbols in the pictures and formulas for easy reference when needed:

  • $h_t^1$ is the memory (information) accumulated from left to right in Cell1 at time t;
  • $W^1, U^1$ are the learnable parameters of Cell1 in the figure; W is the hidden-layer parameter and U is the input-layer parameter;
  • $f_1$ is the activation function of Cell1;
  • $h_t^2$ is the memory accumulated from right to left in Cell2 at time t;
  • $W^2, U^2$ are the learnable parameters of Cell2 in the figure;
  • $f_2$ is the activation function of Cell2;
  • $V$ is the parameter of the output layer, which can be understood as an MLP;
  • $f_3$ is the activation function of the output layer;
  • $y_t$ is the output value at time t.

In Figure 1-1, for the input $x_t$ at time t, we can combine it with the left-to-right memory $h^1_{t-1}$ to get the memory at the current moment, $h^1_t$:

$h^1_t = f_1(U^1 x_t + W^1 h^1_{t-1})$

Similarly, combining $x_t$ with the right-to-left memory of the neighboring step ($h^2_{t+1}$ in the original time order) gives the memory at the current moment, $h^2_t$:

$h^2_t = f_2(U^2 x_t + W^2 h^2_{t+1})$

Then $h^1_t$ and $h^2_t$ are concatenated head to tail and passed through the output-layer network $V$ to get the output $y_t$:

$y_t = f_3(V [h^1_t ; h^2_t])$
In this way, for any moment t, you can see the memory obtained from different directions, making the model easier to optimize and accelerating the convergence speed of the model.

pytorch rnn

The following is the simplest example of implementing a bidirectional RNN using the nn.RNN module in PyTorch:

import torch
import torch.nn as nn

# Define the input dimensions
input_size = 10   # dimension of the input features
sequence_length = 5   # number of time steps
batch_size = 3   # batch size

# Create random input data
input_data = torch.randn(sequence_length, batch_size, input_size)

# Define the bidirectional RNN model
rnn = nn.RNN(input_size, hidden_size=20, num_layers=1, batch_first=False, bidirectional=True)

# Forward pass
output, hidden = rnn(input_data)

# Print the results
print("output tensor size:", output.size())
print("hidden state size at the last time step:", hidden.size())

output

output tensor size: torch.Size([5, 3, 40])
hidden state size at the last time step: torch.Size([2, 3, 20])

In this example, the dimensions of the input data are the same as in the previous example.

When defining a bidirectional RNN model, we set bidirectional=True in the parameters of the RNN model, indicating that we want to build a bidirectional RNN model.

During forward propagation, we pass the input data to the bidirectional RNN model and get the output tensor output and the hidden state of the last time step. The size of the output tensor is (sequence_length, batch_size, hidden_size * num_directions), where num_directions is 2, covering the forward and reverse directions. The size of the hidden state at the last time step is (num_layers * num_directions, batch_size, hidden_size).

Bidirectional RNN can utilize both past and future information and can better capture the features in time series data. You can adjust the size of the input data, parameters of the RNN model, etc. as needed to conduct experiments.

The output of a bidirectional RNN is a combination of the forward and backward hidden states, stored together in one tensor. Specifically, if you implement a bidirectional RNN with PyTorch's nn.RNN module, the output tensor has shape (sequence_length, batch_size, hidden_size * 2), where hidden_size * 2 is the concatenation of the forward and backward hidden states. This tensor contains the forward and backward hidden-state information at every time step and can be used in subsequent tasks.
The final hidden state of the bidirectional RNN has shape (num_layers * 2, batch_size, hidden_size), i.e. (2, batch_size, hidden_size) here.
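To make this layout concrete, here is a small check (assuming the rnn, output and hidden variables from the example above):

# The last dimension of `output` stacks the forward and backward hidden states,
# which can be separated by reshaping.
seq_len, batch, _ = output.shape
out_dirs = output.view(seq_len, batch, 2, 20)             # (seq_len, batch, num_directions, hidden_size)
print(torch.allclose(out_dirs[-1, :, 0, :], hidden[0]))   # True: forward direction, last time step
print(torch.allclose(out_dirs[0, :, 1, :], hidden[1]))    # True: backward direction, first time step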

Deep RNN (multi-layer RNN)

The RNN we introduced earlier is the transformation of data in the time dimension. No matter how long the time dimension is, there is only one RNN module, that is, there is only one set of parameters to be learned (W, U), which belongs to a single-layer RNN. Deep RNN is also called multi-layer RNN. As the name suggests, it is composed of multiple RNN cascades, which transforms the input data in the spatial dimension. As shown in the figure, this is the L-layer RNN architecture. Each layer is a separate RNN, and there are L RNNs in total.
[Figure: L-layer deep RNN architecture]

In the horizontal direction of each layer there is only one set of learnable parameters, e.g. $W^l, U^l$ for layer $l$. Horizontally, the data is transformed along the time dimension with the same mechanism as a single RNN (see the previous section for details). In the vertical direction, at each time $t$ there are $L$ sets of learnable parameters $(W^i, U^i),\ i = 1, 2, \ldots, L$. The input to the cell of layer $l$ at time $t$ comes from two directions: one is the output of layer $l-1$ at the same time step, $h^{l-1}_t$; the other is the memory of layer $l$ at time $t-1$, $h^l_{t-1}$. The output of the cell, $h^l_t$, is therefore:

$h^l_t = f(U^l h^{l-1}_t + W^l h^l_{t-1})$
Essentially, on top of a single RNN, a Deep RNN replaces the input at the current moment with the output of the previous layer; in this way the RNN also transforms the data in the spatial dimension. One more thing worth mentioning: each layer of a Deep RNN can also be a bidirectional RNN.

pytorch rnn

The following is the simplest example of using the nn.RNN module to implement a multi-layer RNN:

import torch
import torch.nn as nn

# Define the input data and parameters
input_size = 5
hidden_size = 10
num_layers = 2
batch_size = 3
sequence_length = 4

# Create the input tensor
input_tensor = torch.randn(sequence_length, batch_size, input_size)

# Create a multi-layer RNN model
rnn = nn.RNN(input_size, hidden_size, num_layers)

# Forward pass
output, hidden = rnn(input_tensor)

# Print the sizes of the output tensor and the hidden state
print("Output shape:", output.shape)
print("Hidden state shape:", hidden.shape)

In the above example, we first defined the dimensions of the input data, the parameters of the RNN model (input size, hidden state size, and number of layers), as well as the batch size and sequence length. Then, we create an input tensor with shape (sequence_length, batch_size, input_size). Next, we use the nn.RNN module to create a multi-layer RNN model with two layers. Finally, we perform forward propagation by passing the input tensor to the forward method of the RNN model and print the size of the output tensor and hidden state.

Note that the output tensor has shape (sequence_length, batch_size, hidden_size), where sequence_length and batch_size remain constant and hidden_size is the size of the hidden state. The shape of the hidden state is (num_layers, batch_size, hidden_size), where num_layers is the number of layers of the RNN model.

Disadvantages of RNN

Exploding and vanishing gradient problems

In practice, the RNNs introduced so far cannot handle longer sequences well. RNNs are prone to gradient explosion and gradient vanishing during training, which prevents the gradient from propagating across longer sequences and makes the RNN unable to capture long-distance dependencies.

Generally speaking, gradient explosion is easier to deal with, because when the gradient explodes the program reports NaN errors. We can also set a gradient threshold and simply clip the gradient when it exceeds that threshold.
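A minimal sketch of such gradient clipping in PyTorch (the model, loss and optimizer here are placeholders purely for illustration):

import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=20)              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(5, 3, 10)
output, hidden = model(x)
loss = output.sum()                                        # placeholder loss, just for illustration
loss.backward()
# Clip the gradient norm to a threshold before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()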

Vanishing gradients are harder to detect, and a bit more difficult to deal with. In general, we have three methods to deal with the vanishing gradient problem:

1. Initialize the weights sensibly, so that each neuron avoids the saturated regions where the gradient vanishes as far as possible.

2. Use ReLU instead of sigmoid and tanh as the activation function.

3. Use RNNs with other structures, such as the Long Short-Term Memory network (LSTM) and the Gated Recurrent Unit (GRU); this is the most popular approach.

short term memory

If we need to determine the intent of an utterance (asking about the weather, asking the time, setting an alarm, ...) and the user says "what time is it?", we first segment the sentence and then feed the words into the RNN in order.
[Figure: the sentence "what time is it?" segmented into words]
First, "what" is used as the input of the RNN, and the output "01" is obtained.
[Figure: feeding "what" into the RNN produces output 01]
Next, we feed "time" into the RNN network in sequence and obtain the output "02".

In this process we can see that when "time" is input, the output of the earlier "what" still has an influence (half of the hidden layer is black).
[Figure: feeding "time" produces output 02; the hidden state still carries the color of "what"]
By analogy, all previous inputs influence the future outputs; you can see that the circular hidden layer contains all of the previous colors, as shown in the figure below:
[Figure: the hidden state accumulates the colors of all previous inputs]
When we judge the intent, we only need the output "05" of the last step, as shown below:
[Figure: intent classification uses only the final output 05]
The shortcoming of RNN is also obvious here.
[Figure: recent inputs dominate the hidden state while older ones fade]
From the example above we can see that short-term memory has a large influence (the orange area) while long-term memory has very little influence (the black and green areas). This is the short-term memory problem of RNN.

  1. RNN has short-term memory issues and cannot handle very long input sequences
  2. Training RNN requires a huge cost

Optimization algorithm of RNN

LSTM – Long short-term memory network

RNN follows a rigid logic: the later an input arrives, the greater its influence, and the earlier it arrives, the smaller its influence; this logic cannot be changed.
The biggest change made by LSTM is to break this rigid logic and use a flexible logic - only retaining important information.
To put it simply: focus on the key points!
[Figure: LSTM keeps only the important information]
For example, let's read the following paragraph quickly:
[Figure: a paragraph of text]
After reading it quickly, we may only remember the following key points:
[Figure: the same paragraph with only the key points highlighted]
LSTM works like the highlighting above: it can retain the "important information" in longer sequences and ignore the unimportant information. This solves the short-term memory problem of RNN.

principle

The hidden layer of the original RNN has only one state, h, which is very sensitive to short-term input. If we add another state c, controlled by gate mechanisms that govern how information flows and is forgotten, and let it store the long-term state, we obtain the Long Short-Term Memory network (LSTM).
[Figure: RNN hidden layer vs. LSTM hidden layer with the added cell state c]
The newly added state c is called the cell state. Unrolling the LSTM along the time dimension:
In the figures, the $\sigma$ symbol denotes the sigmoid activation, which squashes values into [0, 1], and $\tanh$ squashes values into [-1, 1].
⨀ is a mathematical symbol representing element-wise product or Hadamard product. When two matrices, vectors, or tensors of the same dimensions are multiplied element-wise, the ⨀ symbol can be used to indicate this.

For example, for two vectors, [a1, a2, a3] ⨀ [b1, b2, b3] = [a1*b1, a2*b2, a3*b3] expresses their element-wise product.
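In PyTorch the element-wise product is simply the `*` operator between tensors of the same shape, for example:

import torch
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a * b)   # tensor([ 4., 10., 18.])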
[Figure: LSTM unrolled along the time dimension]
It can be seen that at time t,

The LSTM has three inputs: the network input at the current moment $x_t$, the LSTM output value at the previous moment $h_{t-1}$, and the memory cell vector at the previous moment $c_{t-1}$.

The LSTM has two outputs: the hidden state vector at the current moment $h_t$, which is also the LSTM output value, and the memory cell state vector at the current moment $c_t$.

Note: the memory cell c does its work entirely inside the LSTM layer and is not output to other layers; the only thing the LSTM outputs to other layers is the hidden state vector h.

The key to LSTM is the cell state, the horizontal line that runs across the top of the diagram, a bit like a conveyor belt. It is generally called the cell state, and it runs through the whole LSTM chain from beginning to end.
[Figure: the cell state running across the top of the LSTM like a conveyor belt]

Forget gate

$f_t$ is called the forget gate; it determines which features of $C_{t-1}$ are used to compute $C_t$. $f_t$ is a vector whose elements all lie in the range (0, 1). We usually use sigmoid as the activation function, whose output is a value between 0 and 1; when you inspect a trained LSTM you will find that most gate values are very close to 0 or 1, and values in between are rare.
[Figure: forget gate equation]

input gate

$\tilde{C}_t$ represents the cell-state update value, which is obtained from the input $x_t$ and the hidden state $h_{t-1}$ through a neural network layer; the activation function for the update value is usually tanh. $i_t$ is called the input gate; like $f_t$ it is a vector whose elements lie in (0, 1), and it is likewise computed from $x_t$ and $h_{t-1}$ through a sigmoid activation.
[Figure: input gate and cell-state update equations]

output gate

Finally, in order to compute the predicted value $y_t$ and produce the complete input for the next time slice, we need to compute the output of the hidden node, $h_t$.
[Figure: output gate and hidden-state equations]
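For reference, the gate equations shown in the figures above follow the standard LSTM formulation; written out under that assumption (with $\sigma$ the sigmoid function, $\odot$ the element-wise product, and the weight symbols only illustrative), they are:

$f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$
$i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$
$\tilde{C}_t = \tanh(W_C h_{t-1} + U_C x_t + b_C)$
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
$o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$
$h_t = o_t \odot \tanh(C_t)$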

lstm writes poetry

First, let’s study the usage of lstm in pytorch.
Single-layer lstm

import torch as t
import torch.nn as nn

sequence_length = 3
batch_size = 2
input_size = 4
# If the input were, say, [Zhang San, Li Si, Wang Wu], in practice it would first go through an
# embedding, giving a tensor of shape [3 time steps, batch 1 (1 here, but a real dataset may be
# processed in batches, in which case it is the actual batch size), 3 (a coordinate of three values
# representing one name such as Zhang San or Li Si)].
input = t.randn(sequence_length, batch_size, input_size)
lstmModel = nn.LSTM(input_size, 3, 1)
# output is the hidden output at every time step; (h, c) are the hidden state and cell state of the last time step.
output, (h, c) = lstmModel(input)
# There are 3 time steps, each with a hidden output holding 2 samples, and the hidden dimension is 3,
# so the final shape is (3, 2, 3).
print("LSTM hidden outputs shape", output.shape)
print("LSTM last time step hidden state shape", h.shape)
print("LSTM last time step cell state shape", c.shape)

output

LSTM hidden outputs shape torch.Size([3, 2, 3])
LSTM last time step hidden state shape torch.Size([1, 2, 3])
LSTM last time step cell state shape torch.Size([1, 2, 3])

Double layer lstm

sequence_length = 3
batch_size = 2
input_size = 4
input = t.randn(sequence_length, batch_size, input_size)
lstmModel = nn.LSTM(input_size, 3, num_layers=2)
# output is the output at every time step; (h, c) are the hidden state and cell state of the last time step.
output, (h, c) = lstmModel(input)
print("2-layer LSTM hidden outputs shape", output.shape)
print("2-layer LSTM last time step hidden state shape", h.shape)
print("2-layer LSTM last time step cell state shape", c.shape)

Output:

2-layer LSTM hidden outputs shape torch.Size([3, 2, 3])
2-layer LSTM last time step hidden state shape torch.Size([2, 2, 3])
2-layer LSTM last time step cell state shape torch.Size([2, 2, 3])

With 2 layers, output still holds the outputs of the topmost hidden layer at every time step, while h and c hold the hidden states and memory cells of both layers at the last time step.

Example: generating a poem from a given opening
This is the directory structure of the project:
[Figure: project directory structure]

Download Data

The experimental data comes from more than 50,000 Tang poems collected by enthusiasts on GitHub. The author did some additional data processing on top of it; since data processing is time-consuming and not the focus of learning PyTorch, it is omitted here. The author provides a compressed NumPy package, tang.npz (download address).
The specific structure of the data can be seen from the main part of the code below.

from torch.utils.data import  Dataset,DataLoader
import numpy as np
class PoetryDataset(Dataset):
    def __init__(self,root):
        self.data=np.load(root, allow_pickle=True)
    def __len__(self):
        return len(self.data["data"])
    def __getitem__(self, index):
        return self.data["data"][index]
    def getData(self):
        return self.data["data"],self.data["ix2word"].item(),self.data["word2ix"].item()
if __name__=="__main__":
    datas=PoetryDataset("./tang.npz").data
    # data is a 57580 x 125 numpy array: 57580 poems in total, each 125 characters long
    # (poems shorter than 125 are padded with spaces; longer ones are discarded)
    print(datas["data"].shape)
    # the characters here have already been converted to indices
    print(datas["data"][0])
    # use item() to turn the numpy object into a dict; ix2word maps an index to its character, e.g. {0: '憁', 1: '耀'}
    ix2word = datas['ix2word'].item()
    print(ix2word)
    # word2ix maps a character to its index, e.g. {'憁': 0, '耀': 1}
    word2ix = datas['word2ix'].item()
    print(word2ix)
    # convert one poem into its index representation; result: [5272, 4236, 3286, 6933, 6010, 7066, 774, 4167, 2018, 70, 3951]
    str = "床前明月光,疑是地上霜"
    print([word2ix[i] for i in str])
    # print the first poem as characters
    print([ix2word[i] for i in datas["data"][0]])

Define model

import torch.nn as nn
class Net(nn.Module):
    """
        :param vocab_size 表示输入单词的格式
        :param embedding_dim 表示将一个单词映射到embedding_dim维度空间
        :param hidden_dim 表示lstm输出隐藏层的维度
    """
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Net, self).__init__()
        self.hidden_dim = hidden_dim
        #Embedding层,将单词映射成vocab_size行embedding_dim列的矩阵,一行的坐标代表第一行的词
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        #两层lstm,输入词向量的维度和隐藏层维度
        self.lstm = nn.LSTM(embedding_dim, self.hidden_dim, num_layers=2, batch_first=False)
        #最后将隐藏层的维度转换为词汇表的维度
        self.linear1 = nn.Linear(self.hidden_dim, vocab_size)

    def forward(self, input, hidden=None):
        #获取输入的数据的时间步和批次数
        seq_len, batch_size = input.size()
        #如果没有传入上一个时间的隐藏值,初始一个,注意是2层
        if hidden is None:
            h_0 = input.data.new(2, batch_size, self.hidden_dim).fill_(0).float()
            c_0 = input.data.new(2, batch_size, self.hidden_dim).fill_(0).float()
        else:
            h_0, c_0 = hidden
        #将输入的数据embeddings为(input行数,embedding_dim)
        embeds = self.embeddings(input)     # (seq_len, batch_size, embedding_dim), (1,1,128)
        output, hidden = self.lstm(embeds, (h_0, c_0))      #(seq_len, batch_size, hidden_dim), (1,1,256)
        output = self.linear1(output.view(seq_len*batch_size, -1))      # ((seq_len * batch_size),hidden_dim), (1,256) → (1,8293)
        return output, hidden

train

An explanation of the following line of code: input, target = (data[:-1, :]), (data[1:, :])

When using LSTM for word prediction, the input and labels are set up to align the input sequence with the target sequence.

In a language model, we want to predict the next word based on the previous words. Therefore, the input sequence is the previous word, and the target sequence is the next word.

Consider the following example:
Suppose we have a sentence: "I love deep learning."
We can break it down into input and target sequences of the form:
Input sequence: ["I", "love", "deep"]
Target sequence: ["love", "deep", "learning"]

In this example, the input sequence is the previous word ["I", "love", "deep"], and the target sequence is the corresponding next word ["love", "deep", "learning"].

In the code, data is a dataset containing all the words, where each row represents a word. When slicing data into input and target, we use data[:-1, :] as the input sequence (everything except the last word) and data[1:, :] as the target sequence (starting from the second word).

The purpose of setting the input and target sequences in this way is to align the input and labels so that the model can predict the next word based on the previous words.
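A tiny illustration of how this slicing aligns input and target (a made-up 4-step, batch-size-1 tensor, purely for illustration):

import torch as t

data = t.tensor([[10], [11], [12], [13]])   # 4 time steps, batch size 1
input, target = data[:-1, :], data[1:, :]
print(input.squeeze(1).tolist())            # [10, 11, 12] -> the words fed to the model
print(target.squeeze(1).tolist())           # [11, 12, 13] -> the next word to predict at each step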

import fire
import torch.nn as nn
import torch as t
from data.dataset import PoetryDataset
from models.model import Net
num_epochs=5
data_root="./data/tang.npz"
batch_size=10
def train(**kwargs):
    datasets=PoetryDataset(data_root)
    data,ix2word,word2ix=datasets.getData()
    lenData=len(data)
    data = t.from_numpy(data)
    dataloader = t.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True, num_workers=1)
    # 8293 words in total. Model definition: vocab_size, embedding_dim, hidden_dim = 8293, 128, 256
    model=Net(len(word2ix),128,256)
    # define the loss function
    criterion = nn.CrossEntropyLoss()
    model=model.cuda()
    optimizer = t.optim.Adam(model.parameters(), lr=1e-3)
    iteri=0
    filename = "example.txt"
    totalIter=lenData*num_epochs/batch_size
    for epoch in range(num_epochs):  # iterate over the epochs
        for i, data in enumerate(dataloader):  # one batch of data: batch_size x 125
            data = data.long().transpose(0, 1).contiguous().cuda()
            optimizer.zero_grad()
            input, target = (data[:-1, :]), (data[1:, :])
            output, _ = model(input)
            loss = criterion(output, target.view(-1))  # torch.Size([15872, 8293]), torch.Size([15872])
            loss.backward()
            optimizer.step()
            iteri+=1
            if(iteri%500==0):
                print(str(iteri+1)+"/"+str(totalIter)+"epoch")
            if (1 + i) % 1000 == 0:  # write out a generated sample every 1000 batches
                with open(filename, "a") as file:
                    file.write(str(i) + ':' + generate(model, '床前明月光', ix2word, word2ix)+"\n")
    t.save(model.state_dict(), './checkpoints/model_poet_2.pth')
def generate(model, start_words, ix2word, word2ix):     # given a few starting words, generate a complete poem from them
    txt = []
    for word in start_words:
        txt.append(word)
    input = t.Tensor([word2ix['<START>']]).view(1,1).long()      # tensor([8291.]) → tensor([[8291.]]) → tensor([[8291]])
    input = input.cuda()
    hidden = None
    num = len(txt)
    for i in range(48):      # maximum generation length
        output, hidden = model(input, hidden)
        if i < num:
            w = txt[i]
            input = (input.data.new([word2ix[w]])).view(1, 1)
        else:
            top_index = output.data[0].topk(1)[1][0]
            w = ix2word[top_index.item()]
            txt.append(w)
            input = (input.data.new([top_index])).view(1, 1)
        if w == '<EOP>':
            break
    return ''.join(txt)
if __name__=="__main__":
    fire.Fire()

5 epochs, batch size 10, on an ordinary PC with a GTX 1050 (2 GB of video memory): training takes about 30 minutes.
50 epochs, batch size 128, on a free Colab GPU with 16 GB of memory: training takes about 1 hour.

test

def test():
    datasets = PoetryDataset(data_root)
    data, ix2word, word2ix = datasets.getData()
    model = Net(len(word2ix), 128, 256)  # model definition: vocab_size, embedding_dim, hidden_dim = 8293, 128, 256
    if t.cuda.is_available():
        model.cuda()
        model.load_state_dict(t.load('./checkpoints/model_poet_2.pth'))
        model.eval()
        name = input("请输入您的开头:")
        txt = generate(model, name, ix2word, word2ix)
        print(txt)

Since the model has only been trained for 5 epochs, the results are not very good; training for more epochs (and watching the loss) improves them. There is another problem: if the input does not change, the generated result is always the same, so some randomness may need to be injected during generation; one possible sketch follows.
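One possible way to add that randomness (a sketch, not part of the original code) is to sample from the softmax distribution with a temperature instead of always taking the top-1 word in generate():

import torch as t

def sample_next(output, temperature=0.8):
    # output.data[0] is the score vector over the whole vocabulary for the current step
    probs = t.softmax(output.data[0] / temperature, dim=0)
    return t.multinomial(probs, 1)[0]       # index of the sampled word

# in generate(), the line `top_index = output.data[0].topk(1)[1][0]`
# could then be replaced with: top_index = sample_next(output)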
5epoch version effect

(env380) D:\code\deeplearn\learn_rnn\pytorch\4.nn模块\案例\生成古诗\tang>python main.py test
请输入您的开头:唧唧复唧唧
唧唧复唧唧,不知何所如?君不见此地,不如此中生。一朝一杯酒,一日相追寻。一朝一杯酒,一醉一相逢。

(env380) D:\code\deeplearn\learn_rnn\pytorch\4.nn模块\案例\生成古诗\tang>python main.py test
请输入您的开头:我儿小谦谦
我儿小谦谦,不是天地间。有时有所用,不是无为名。有时有所用,不是无生源。有时有所用,不是无生源。

50epoch version effect

(env380) D:\code\deeplearn\learn_rnn\pytorch\4.nn模块\案例\生成古诗\tang>python main.py test
请输入您的开头:我家小谦谦
我家小谦谦,今古何为郎。我生不相识,我心不可忘。我来不我见,我亦不得尝。君今不我见,我亦不足伤。

(env380) D:\code\deeplearn\learn_rnn\pytorch\4.nn模块\案例\生成古诗\tang>python main.py test
请输入您的开头:床前明月光
床前明月光,上客不可见。玉楼金阁深,玉瑟风光紧。玉指滴芭蕉,飘飘出罗幕。玉堂无尘埃,玉节凌风雷。
(env380) D:\code\deeplearn\learn_rnn\pytorch\4.nn模块\案例\生成古诗\tang>python main.py test
请输入您的开头:唧唧复唧唧
唧唧复唧唧,胡儿女卿侯。妾本邯郸道,相逢两不游。妾心不可再,妾意不能休。妾本不相见,妾心如有钩。

GRU

The Gated Recurrent Unit (GRU) is a variant of LSTM. It retains LSTM's ability to highlight important information and forget unimportant information, and that information is not lost over long-range propagation.

LSTM has many parameters and takes a long time to compute, so GRU (Gated Recurrent Unit) was proposed. GRU keeps LSTM's idea of using gates but reduces the number of parameters and shortens the computation time.

Whereas LSTM uses two paths, the hidden state and the memory cell, GRU uses only the hidden state. The similarities and differences are as follows:
[Figure: LSTM and GRU units compared]
GRU computation graph
[Figure: GRU computation graph; the σ and tanh nodes have their own weights, and the affine transformation happens inside the nodes (the "1−" node takes x as input and outputs 1 − x)]
[Figure: the four GRU equations]
The computations performed by the GRU are expressed by the four equations above (here $x_t$ and $h_{t-1}$ are row vectors). As the figure shows, the GRU has no memory cell, only a hidden state h propagated along the time direction. It uses two gates, r and z (LSTM uses three gates): r is called the reset gate and z the update gate.

r (the reset gate) determines to what extent the past hidden state is "ignored". According to Equation 2.3, if r is 0, the new hidden state $\tilde{h}$ depends only on the input $x_t$; that is, the past hidden state is completely ignored at that point.

z (the update gate) updates the hidden state; it plays the roles of both LSTM's forget gate and input gate. In Equation 2.4, the $(1-z) \odot h_{t-1}$ part acts as the forget gate, removing from the past hidden state the information that should be forgotten, while the $z \odot \tilde{h}$ part acts as the input gate, weighting the newly added information.
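For reference, the four GRU equations described above can be written as follows, assuming the standard formulation with $x_t$ and $h_{t-1}$ as row vectors and $\sigma$ the sigmoid function (Equations 2.3 and 2.4 referenced above correspond to $\tilde{h}$ and $h_t$):

$z = \sigma(x_t W_x^{(z)} + h_{t-1} W_h^{(z)} + b^{(z)})$
$r = \sigma(x_t W_x^{(r)} + h_{t-1} W_h^{(r)} + b^{(r)})$
$\tilde{h} = \tanh(x_t W_x + (r \odot h_{t-1}) W_h + b)$
$h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}$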

Origin blog.csdn.net/liaomin416100569/article/details/131380370