Deep Eye Pytorch punch card (17): Recurrent neural network components - RNN and the RNN layer (a detailed analysis of RNN principles and implementation)

Preface


  The Recurrent Neural Network (RNN), like the Convolutional Neural Network, plays a very important role in deep learning. Although the fully connected networks and convolutional neural networks covered in previous notes already have strong representation capabilities, the limitations of their structure mean that they can only handle fixed-length input data, and because they are directed acyclic structures, the output is predicted from the current input alone, regardless of the order of the inputs or of whether earlier and later data are related. However, sequential data such as text, speech and video are usually of variable length, and neighboring data are highly correlated; once the order is changed, much of the original information is lost. A network is therefore needed that takes the order of the input into account and can memorize past information: the recurrent neural network. This note covers the principle of the simplest recurrent neural network, the RNN, and PyTorch's RNN layer.

  This note mainly draws on the Deep Eye Pytorch course, the official torch documentation, PyTorch Deep Learning: Introduction and Practice by Sun Yulin, and the deep learning course Neural Network (CNN/RNN/GAN) Algorithm Principles + Practice. The data used comes from the Internet, and the text is based on the author's own, possibly superficial, understanding of these courses; if a reader finds an error, please point it out. In addition, the author has found these notes copied on other platforms without attribution, some even offered as paid articles, so an identification mark may be inserted anywhere in the article; please forgive it if it affects reading.

  See the linear layer: Deep Eye Pytorch punch card (13): Pytorch fully connected neural network components - linear layer, nonlinear activation layer and dropout layer


RNN compared with other networks


  Fully connected neural networks and convolutional neural networks are suited to classification and detection tasks. They have a single input, either an image or a group of features, and a single output, either a predicted value or a category vector. Because of the constraints of the network structure (matrix multiplication requires matching dimensions), the input must be resized to a specified size, and once the network architecture is fixed, the output size is fixed as well. A structural diagram is shown in Figure 1, adapted from the source of the figure.
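
  As a minimal sketch of this constraint (the layer sizes below are arbitrary, chosen only for illustration), a fully connected layer fixes both its input and output dimensions when it is constructed, and feeding it an input of a different length simply fails:

import torch
import torch.nn as nn

# A fully connected layer maps a fixed input size to a fixed output size.
fc = nn.Linear(in_features=8, out_features=3)

x_ok = torch.randn(8)         # length matches in_features -> works
print(fc(x_ok).shape)         # torch.Size([3])

x_bad = torch.randn(10)       # wrong length -> the matrix multiplication fails
try:
    fc(x_bad)
except RuntimeError as e:
    print('Shape mismatch:', e)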

  There is no essential difference between the recurrent neural network and the two networks above: the input and output of its model unit are also of fixed length. The reason a recurrent network can accept variable-length inputs and outputs and "remember" past data is that it is recurrent. A variable-length input sequence is processed by calling the model unit in a loop; on each iteration the model unit handles one fixed-length sequence element, such as a single character of a string, so the number of iterations is determined by the number of elements in the input sequence. Every call of the model unit produces two outputs: the current predicted value and the current state value. The current state value is weighted and fed into the computation of the next state value, so the current input element influences the state value at the next step and, through it, the next prediction. This is how an RNN uses the state value to achieve its so-called "memory". The model unit yields as many predictions as the number of times it is called, and several of them are selected as the model's output according to the task, which is how variable-length output sequences are obtained. The structure is shown in Figure 1, modified from the source of the figure; a minimal sketch of the loop is given after Figure 1. Depending on the task, a recurrent neural network can flexibly adjust its input and output sizes.

Figure 1. Comparison of recurrent neural networks with other networks, and application fields of recurrent neural networks
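
  The loop described above can be written as the following minimal schematic (model_unit and the sizes used here are placeholders, not the author's implementation): one fixed-size unit, called once per element with the state threaded through, handles a sequence of any length and yields one prediction per element.

import torch

def model_unit(x, state):
    # Placeholder for a fixed-size RNN unit: one sequence element in,
    # one prediction and one new state out.
    new_state = torch.tanh(x + state)   # stands in for the weighted combination of input and previous state
    prediction = new_state              # stands in for the output computed from the state
    return prediction, new_state

sequence = [torch.randn(3) for _ in range(5)]   # 5 elements, each of fixed length 3
state = torch.zeros(3)                          # initial state

predictions = []
for x in sequence:                      # one call per element: the loop length adapts to the sequence
    y, state = model_unit(x, state)     # the state carries past information to the next step
    predictions.append(y)

print(len(predictions))                 # 5 -- one prediction per input element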

RNN principle and simple example application


  • RNN principle

  The RNN model unit generally consists of three linear layers and two nonlinear activation layers, as shown in Figure 2; the left and right parts of the figure are the closed-loop and unrolled views of the same recurrence, modified from the source of the figure. There is a nonlinear activation layer f2 before the output O and a nonlinear activation layer f1 before the state value S. Between the input and f1 there is a linear layer whose weight matrix is U; between the previous state value and f1 there is a linear layer whose weight matrix is W; and between the current state value and f2 there is a linear layer whose weight matrix is V. In use, the number of times the model unit is called is determined by the number of data elements in the input, and the state value produced at each step is weighted and passed on to the next step. The model unit is expressed by formulas (1) and (2).

$$S_t = f_1\left(U x_t + W S_{t-1}\right) \tag{1}$$

$$O_t = f_2\left(V S_t\right) \tag{2}$$

Figure 2. Principle of RNN
  • A simple RNN example application

  A heavily simplified version of a task common in input methods, predicting the next character to be typed, is used for further learning. Suppose that all samples contain only 4 different characters, h, e, l, o (in practice the character set has to be counted), and that the model only predicts these 4 characters (in practice it might be, say, 128), i.e. a 4-class classification. The flowchart is shown in Figure 3, modified from the source of the figure.

  With a vector of length 4, the 4 characters can be represented orthogonally (one-hot encoding): let o = [1 0 0 0], h = [0 1 0 0], e = [0 0 1 0], l = [0 0 0 1]. Then 4 is the input size of the model unit, i.e. the input size of linear layer 1, and a single character is one input to the model unit. In practice, f1 in formulas (1) and (2) is commonly tanh, and since this is a multi-class classification, f2 is softmax. As can be seen from the figure, the input size of the model unit is (4, 1), the weight matrix U of linear layer 1 has shape (3, 4), and the weight matrix W of linear layer 2 has shape (3, 3). Since there are 4 classes, the output size is 4, so the weight matrix V of linear layer 3 has shape (4, 3). The basic structure of the model unit can be defined from these sizes and the activation functions; a small shape check is given after Figure 3. The model unit is then called in a loop to process the characters of the input sequence: each character processed yields a prediction of the next character, and the state value "memorizes" the current and past information. Minimizing the distance between the predicted values and the labels establishes the mapping from an input character to the character expected to follow it; in essence, it is a mapping between vectors.

Figure 3. RNN implementation process
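
  A quick check of these shapes (assuming the state size of 3 read off the figure; nn.Linear(in, out) stores its weight as an (out, in) matrix):

import torch.nn as nn

U_Linear = nn.Linear(4, 3)    # input character (4) -> state (3), weight U: (3, 4)
W_Linear = nn.Linear(3, 3)    # previous state (3)  -> state (3), weight W: (3, 3)
V_Linear = nn.Linear(3, 4)    # state (3)           -> 4 classes,  weight V: (4, 3)

print(U_Linear.weight.shape)  # torch.Size([3, 4])
print(W_Linear.weight.shape)  # torch.Size([3, 3])
print(V_Linear.weight.shape)  # torch.Size([4, 3])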

  The total loss used in gradient descent is the sum of the losses between each predicted output and its label. It is worth noting that U, V and W remain unchanged while the model unit is called in the loop. Back-propagation therefore takes partial derivatives of a function that has been composed many times with respect to U, V and W, using a chain-rule structure similar to the one shown in Figure 4 (source of the figure); the resulting gradient is written out after Figure 4.

Figure 4. The chain-rule structure for taking partial derivatives of the composed function
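
  Written out with the notation of formulas (1) and (2), letting $L_t$ be the loss at step $t$ so that the total loss is $L=\sum_t L_t$, the chain structure of Figure 4 gives the following illustrative form of the standard backpropagation-through-time gradient (not copied from the original article): because $W$ is shared across steps, every step contributes a term, and each term runs back through all earlier states.

$$\frac{\partial L}{\partial W} \;=\; \sum_{t}\sum_{k=1}^{t}\frac{\partial L_t}{\partial O_t}\,\frac{\partial O_t}{\partial S_t}\left(\prod_{j=k+1}^{t}\frac{\partial S_j}{\partial S_{j-1}}\right)\frac{\partial S_k}{\partial W}$$

  The gradient with respect to $U$ has the same structure, while the gradient with respect to $V$ sums only one term per step, since $V$ sits outside the recurrence.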

  The above process can be implemented with the code below. The key points are constructing the three linear layers and passing the state value through the loop. For such a simple task, only a few iterations are needed before every character is predicted correctly.

import torch
import torch.nn as nn

# Data: each element is (one-hot vector, character)
input_data = (
    ([0, 1, 0, 0], 'h'),
    ([0, 0, 1, 0], 'e'),
    ([0, 0, 0, 1], 'l'),
    ([0, 0, 0, 1], 'l'),
    ([1, 0, 0, 0], 'o')
)

# Mapping between class index (one-hot position) and character
class_data = (
    ([1, 0, 0, 0], 'o'),
    ([0, 1, 0, 0], 'h'),
    ([0, 0, 1, 0], 'e'),
    ([0, 0, 0, 1], 'l')
)

loss = 0
U_Linear = nn.Linear(4, 3, bias=True)   # input character (4) -> state (3), weight U: (3, 4)
V_Linear = nn.Linear(3, 4, bias=True)   # state (3) -> class scores (4), weight V: (4, 3)
W_Linear = nn.Linear(3, 3, bias=True)   # previous state (3) -> state (3), weight W: (3, 3)
tanh = nn.Tanh()
softMax = nn.Softmax(dim=0)

Loss_function = nn.MSELoss()

# One optimizer for each of the three linear layers
optimizer = torch.optim.SGD(U_Linear.parameters(), lr=0.5)
optimizer1 = torch.optim.SGD(V_Linear.parameters(), lr=0.5)
optimizer2 = torch.optim.SGD(W_Linear.parameters(), lr=0.5)

for n in range(1000):
    S = torch.zeros(3)                     # reset the state value at the start of each pass
    for i in range(len(input_data)-1):     # loop over the elements of the input sequence

        # Forward pass of the RNN model unit, formulas (1) and (2)
        U_out = U_Linear(torch.tensor(input_data[i][0], dtype=torch.float))
        W_out = W_Linear(S)
        S = tanh(U_out + W_out)
        V_out = V_Linear(S)
        outputs = softMax(V_out)

        # The label is the one-hot vector of the next character in the sequence
        targets = torch.tensor(input_data[i+1][0], dtype=torch.float)

        # Accumulate the total loss over the whole sequence
        loss = loss + Loss_function(outputs, targets)
        if n % 10 == 0:
            idx = int(torch.argmax(outputs))
            print('Outputs', outputs.detach().numpy(), 'Predict:', class_data[idx][1])
    if n % 10 == 0:
        print('inters:', n, 'Loss:', loss.detach().numpy())

    # Back-propagate the error and update the three layers
    optimizer.zero_grad()
    optimizer1.zero_grad()
    optimizer2.zero_grad()
    loss.backward()
    optimizer.step()
    optimizer1.step()
    optimizer2.step()

    loss = 0

# (Original CSDN note by Yishu: https://blog.csdn.net/sinat_35907936/article/details/107833112)
inters: 110 Loss: 0.014474688
Outputs [0.03512129 0.01285427 0.9469078  0.00511664] Predict: e
Outputs [0.02170071 0.01946862 0.00924533 0.9495853 ] Predict: l
Outputs [0.08608606 0.01253206 0.00805425 0.8933276 ] Predict: l
Outputs [0.88951653 0.00719438 0.01199281 0.0912962 ] Predict: o


RNN layer
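
  As a minimal sketch of PyTorch's built-in layer (the sizes and tensor shapes below are illustrative, reusing the 4-character toy setting, and are not taken from the original article), torch.nn.RNN implements the state recurrence of formula (1), applying tanh to a weighted input plus a weighted previous state at every step; the output layer V and the softmax of formula (2) still have to be added on top.

import torch
import torch.nn as nn

# nn.RNN covers formula (1): h_t = tanh(W_ih @ x_t + b_ih + W_hh @ h_{t-1} + b_hh)
rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=1, batch_first=False)
V_Linear = nn.Linear(3, 4)        # formula (2): state -> class scores
softmax = nn.Softmax(dim=-1)

# Input shape (batch_first=False): (sequence length, batch size, input size)
x = torch.randn(5, 1, 4)          # a sequence of 5 character vectors
h0 = torch.zeros(1, 1, 3)         # initial state: (num_layers, batch size, hidden size)

states, h_n = rnn(x, h0)          # states: all per-step state values, h_n: the final state
outputs = softmax(V_Linear(states))
print(states.shape, outputs.shape)   # torch.Size([5, 1, 3]) torch.Size([5, 1, 4])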

