Preface
Recurrent neural networks (RNNs), like convolutional neural networks, play a very important role in deep learning. The fully connected and convolutional networks covered in the previous notes already have strong representational power, but their structure limits them to fixed-length inputs, and because they are directed acyclic graphs, their output is predicted from the current input alone, regardless of the order of the input data or of any relationship between earlier and later data. Sequential data such as text, speech, and video, however, is often of variable length, and neighboring elements are highly correlated; once the order is changed, the original information is often lost. What is needed is a network that takes the order of the input into account and can remember information from past data: the recurrent neural network. This note covers the principle of the simplest recurrent neural network, the RNN, and PyTorch's RNN layer.
This note mainly draws on the Deep Eyes Pytorch course, the official torch documentation, PyTorch Deep Learning Introduction and Practical Combat by Sun Yulin, and the course Neural Network (CNN/RNN/GAN) Algorithm Principles + Practical Courses of Deep Learning. The data used comes from the Internet, and the text reflects the author's own limited understanding of these courses; if you find an error, please point it out. The author has also found these notes copied to other platforms without attribution, in some cases as paid articles, so an identification mark is inserted at arbitrary places in the article. Apologies if it gets in the way of reading.
For the linear layer, see: Eye of Depth Pytorch punch card (13): Pytorch fully connected neural network components: linear layer, nonlinear activation layer, and dropout layer
RNN compared with other networks
Fully connected and convolutional neural networks are suited to classification and detection tasks. They take a single input, either an image or a group of features, and produce a single output, either a predicted value or a class vector. Because of the network structure (matrix multiplication requires matching dimensions), their inputs must be resized to a specified size, and once the network architecture is fixed, so is the output size. A schematic of the structure is shown in Figure 1, adapted from the figure source.
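The dimension constraint can be seen in a minimal sketch. The helper `matvec` and the 3x4 weight matrix are hypothetical, chosen only for illustration; they mimic what a linear layer does with its weight matrix and are not part of the original note.

```python
def matvec(weight, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    cols = len(weight[0])
    if len(x) != cols:
        # A fixed weight matrix only accepts inputs of one size
        raise ValueError(f"expected input of length {cols}, got {len(x)}")
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

# Illustrative 3x4 weight matrix: input size 4, output size 3
weight = [[1, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 1, 1]]

y = matvec(weight, [1, 2, 3, 4])      # fits: 4 inputs -> 3 outputs

try:
    matvec(weight, [1, 2, 3, 4, 5])   # a length-5 input does not fit
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

Once the weight matrix is fixed, both the accepted input length and the output length are fixed with it, which is why such networks need their inputs resized.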
The recurrent neural network is not fundamentally different from these two: the input and output of its model unit are also fixed-length. What gives it variable-length input and output and a "memory" of past data is the recurrence. A variable-length input sequence is processed by calling the model unit repeatedly. On each call the unit processes one fixed-length sequence element, such as one character of a string, and the number of calls is set by the number of elements in the input sequence. Each call produces two outputs: the current predicted value and the state value. The current state value, after weighting, takes part in generating the next state value, so the current input element affects the state value for the next input element and, in turn, its predicted output. This state value is how the RNN implements its so-called "memory". The unit produces as many predictions as it is called; selecting some of them according to the task gives the model output, which is how a variable-length output sequence is obtained. The structure is shown in Figure 1, modified from the figure source. Depending on the task, a recurrent neural network can flexibly adjust its input and output sizes.
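The repeated calling of one model unit can be sketched with a toy scalar unit. The names `unit` and `run` and the constants `u`, `w`, `v` are assumptions made up for this illustration; the point is only that the same unit handles sequences of any length, and that the state makes the result depend on the order of the elements.

```python
import math

def unit(x, s, u=0.5, w=0.9, v=1.0):
    """One call of a toy scalar model unit: returns (prediction, new state)."""
    s_new = math.tanh(u * x + w * s)   # new state mixes the input with the old state
    return v * s_new, s_new

def run(sequence):
    """Call the same unit once per element, carrying the state along."""
    s, outputs = 0.0, []
    for x in sequence:
        o, s = unit(x, s)
        outputs.append(o)
    return outputs, s

outs_short, _ = run([1.0, 2.0, 3.0])             # 3 elements -> 3 predictions
outs_long, _ = run([1.0, 2.0, 3.0, 4.0, 5.0])    # 5 elements -> 5 predictions
_, state_fwd = run([1.0, -1.0])
_, state_rev = run([-1.0, 1.0])                  # same elements, different order
```

The two sequences with the same elements in a different order end in different states, which is the order sensitivity that a feed-forward network lacks.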
RNN principle and simple example application
The RNN model unit generally consists of three linear layers and two nonlinear activation layers, as shown in Figure 2, whose left and right sides show the folded (recurrent) form and the unrolled form, modified from the figure source. The output O is preceded by a nonlinear activation layer f2, and the state value S is preceded by a nonlinear activation layer f1. Between the input and f1 there is a linear layer with weight matrix U, and between the previous state value and f1 there is a linear layer with weight matrix W. Between the current state value S and the activation layer f2 there is a linear layer with weight matrix V. In use, the number of times the model unit is called is determined by the number of data elements in the input, and the state value produced by each call is weighted and passed on to the next call. The model unit is expressed by formulas (1) and (2).
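Formulas (1) and (2) are not reproduced in this text. The following is a reconstruction that agrees with the description above and with the forward pass in the code later in this note (`S = tanh(U_out + W_out)`, then softmax over `V_Linear(S)`). The time index t and the symbol x_t for the t-th input element are notation assumed here, and bias terms are omitted.

```latex
\begin{align}
S_t &= f_1\left(U x_t + W S_{t-1}\right) \tag{1} \\
O_t &= f_2\left(V S_t\right) \tag{2}
\end{align}
```

Here S_{t-1} is the state value from the previous call of the model unit and O_t is the prediction produced by the current call.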
To go further, consider training a greatly simplified version of a model that is very common in input methods: predicting the next input character. Suppose all samples contain only 4 distinct characters, h, e, l, and o (in practice the characters would have to be counted), and the model predicts only these 4 characters (in practice 128), that is, it is a 4-class problem. The flow is shown in Figure 3, modified from the figure source.
With vectors of length 4, the 4 characters can be represented orthogonally. Let o=[1 0 0 0], h=[0 1 0 0], e=[0 0 1 0], l=[0 0 0 1]. Then 4 is the input size of the model unit, that is, the input size of linear layer 1, and a single character is one input to the model unit. In practice, f1 in formulas (1) and (2) is commonly tanh, and since this is a multi-class classification, f2 is softmax. As the figure shows, the input to the model unit has shape (4, 1), the weight matrix U of linear layer 1 has shape (3, 4), and the weight matrix W of linear layer 2 has shape (3, 3). With 4 classes the output size is 4, so the weight matrix V of linear layer 3 has shape (4, 3). These sizes and activation functions define the basic structure of the model unit. The unit is called repeatedly to process the characters of the input sequence; each character processed yields a prediction of the next character and a state value that "memorizes" current and past information. Minimizing the distance between the predictions and the labels establishes the mapping between an input character and the character expected to follow it, which is in essence a mapping between vectors.
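The shapes above can be checked with a small sketch in plain Python. The random weights, the seed, and the helper names (`rand_matrix`, `matvec`, `softmax`, `step`) are assumptions for illustration only; the sketch runs one call of the model unit on the character h with a zero initial state.

```python
import math
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def softmax(z):
    m = max(z)                                  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

one_hot = {'o': [1, 0, 0, 0], 'h': [0, 1, 0, 0],
           'e': [0, 0, 1, 0], 'l': [0, 0, 0, 1]}

U = rand_matrix(3, 4)   # input (4) -> state (3)
W = rand_matrix(3, 3)   # previous state (3) -> state (3)
V = rand_matrix(4, 3)   # state (3) -> output (4)

def step(char, s_prev):
    """One call of the model unit: S = tanh(U x + W S_prev), O = softmax(V S)."""
    x = one_hot[char]
    s = [math.tanh(a + b) for a, b in zip(matvec(U, x), matvec(W, s_prev))]
    return softmax(matvec(V, s)), s

o, s = step('h', [0.0, 0.0, 0.0])   # o has 4 entries, s has 3
```

The output `o` is a probability vector over the 4 characters, and `s` is the state that would be passed to the next call.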
The total loss is the sum of the losses between each predicted output and its label, and it is this total that is used in gradient descent. Note that U, V, and W stay the same across the repeated calls of the model unit within one pass. Backpropagation therefore takes partial derivatives of a function composed many times with respect to U, V, and W, using the chain rule in a structure like the one shown in Figure 4 (see the figure source).
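The chain rule over repeated calls can be made concrete with a scalar sketch. Everything here is an assumption for illustration: scalar weights `u`, `w`, `v`, a linear output with squared error instead of softmax, and made-up data. The sketch accumulates the derivative of the total loss with respect to the shared weight `w` over all steps and compares it with a finite-difference estimate.

```python
import math

def forward(u, w, v, xs, ys):
    """Run a scalar RNN over xs; return the total loss and the list of states."""
    s, states, loss = 0.0, [0.0], 0.0
    for x, y in zip(xs, ys):
        s = math.tanh(u * x + w * s)
        states.append(s)
        loss += (v * s - y) ** 2       # per-step loss, summed into the total
    return loss, states

def grad_w(u, w, v, xs, ys):
    """dL/dw by applying the chain rule backwards through every step."""
    _, states = forward(u, w, v, xs, ys)
    grad, ds_next = 0.0, 0.0           # ds_next: gradient reaching s_t from later steps
    for t in range(len(xs), 0, -1):
        s_t, s_prev = states[t], states[t - 1]
        ds = 2 * (v * s_t - ys[t - 1]) * v + ds_next   # this step's loss plus the future
        da = ds * (1 - s_t ** 2)                       # back through tanh
        grad += da * s_prev                            # the same w is used at every step
        ds_next = da * w                               # pass the gradient on to s_{t-1}
    return grad

u, w, v = 0.5, -0.3, 0.8
xs, ys = [1.0, 0.5, -1.0], [0.2, -0.1, 0.4]

analytic = grad_w(u, w, v, xs, ys)
eps = 1e-6
numeric = (forward(u, w + eps, v, xs, ys)[0] - forward(u, w - eps, v, xs, ys)[0]) / (2 * eps)
```

Because `w` is shared, its gradient is a sum of one contribution per step, and each contribution depends on the gradient flowing back from all later steps.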
The process above can be implemented with the following code. The key points are constructing the three linear layers and passing the state value along inside the loop. For the simple task in this note, it does not take many iterations to predict everything correctly.
import torch
import torch.nn as nn

# Data: one-hot vectors for the input sequence "hello"
input_data = (
    ([0, 1, 0, 0], 'h'),
    ([0, 0, 1, 0], 'e'),
    ([0, 0, 0, 1], 'l'),
    ([0, 0, 0, 1], 'l'),
    ([1, 0, 0, 0], 'o')
)
# Mapping between classes and vectors
class_data = (
    ([1, 0, 0, 0], 'o'),
    ([0, 1, 0, 0], 'h'),
    ([0, 0, 1, 0], 'e'),
    ([0, 0, 0, 1], 'l')
)

U_Linear = nn.Linear(4, 3, bias=True)  # input -> state
V_Linear = nn.Linear(3, 4, bias=True)  # state -> output
W_Linear = nn.Linear(3, 3, bias=True)  # previous state -> state
tanh = nn.Tanh()
softMax = nn.Softmax(dim=0)  # outputs are 1-D, so normalize over dim 0
Loss_function = nn.MSELoss()
# One optimizer per linear layer
optimizer = torch.optim.SGD(U_Linear.parameters(), lr=0.5)
optimizer1 = torch.optim.SGD(V_Linear.parameters(), lr=0.5)
# The third optimizer must own W_Linear; passing V_Linear here would leave W untrained
optimizer2 = torch.optim.SGD(W_Linear.parameters(), lr=0.5)

for n in range(1000):
    # Reset the state and the loss at the start of every pass over the sequence,
    # so each pass builds a fresh graph and backward() needs no retain_graph
    S = torch.zeros(3)
    loss = 0
    for i in range(len(input_data) - 1):  # loop over the elements of the input sequence
        # Forward pass of the RNN model unit
        U_out = U_Linear(torch.tensor(input_data[i][0], dtype=torch.float))
        W_out = W_Linear(S)
        S = tanh(U_out + W_out)
        V_out = V_Linear(S)
        outputs = softMax(V_out)
        # The label is the character that follows the current input character
        targets = torch.tensor(input_data[i + 1][0], dtype=torch.float)
        # Accumulate the total loss
        loss = loss + Loss_function(outputs, targets)
        if n % 10 == 0:
            idx = torch.argmax(outputs).item()  # index of the most probable class
            print('Outputs', outputs.detach().numpy(), 'Predict:', class_data[idx][1])
    if n % 10 == 0:
        print('inters:', n, 'Loss:', loss.detach().numpy())
    optimizer.zero_grad()
    optimizer1.zero_grad()
    optimizer2.zero_grad()
    # Backpropagate the error and update the model
    loss.backward()
    optimizer.step()
    optimizer1.step()
    optimizer2.step()
# (Original note by 意疏 on CSDN: https://blog.csdn.net/sinat_35907936/article/details/107833112)
inters: 110 Loss: 0.014474688
Outputs [0.03512129 0.01285427 0.9469078 0.00511664] Predict: e
Outputs [0.02170071 0.01946862 0.00924533 0.9495853 ] Predict: l
Outputs [0.08608606 0.01253206 0.00805425 0.8933276 ] Predict: l
Outputs [0.88951653 0.00719438 0.01199281 0.0912962 ] Predict: o
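The mapping from an output vector back to a character can be sketched as follows, using the four output vectors printed above. The helper `decode` and the list `classes` are names assumed for this illustration; the class order follows the one-hot definitions used in this note (o, h, e, l).

```python
classes = ['o', 'h', 'e', 'l']   # index i matches the one-hot vector with a 1 at position i

def decode(probs):
    """Return the character whose probability is largest."""
    return classes[max(range(len(probs)), key=lambda i: probs[i])]

# The four output vectors from the sample run above
sample_outputs = [
    [0.03512129, 0.01285427, 0.9469078, 0.00511664],
    [0.02170071, 0.01946862, 0.00924533, 0.9495853],
    [0.08608606, 0.01253206, 0.00805425, 0.8933276],
    [0.88951653, 0.00719438, 0.01199281, 0.0912962],
]
decoded = ''.join(decode(p) for p in sample_outputs)   # 'ello'
```

For the inputs h, e, l, l the decoded predictions are e, l, l, o, which matches the labels: each prediction is the character that follows the input.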