[Pytorch Framework] 2.5 Recurrent Neural Network

import torch
torch.__version__
'1.4.0'

2.5 Recurrent Neural Network

2.5.1 Introduction to RNN

One of the biggest characteristics that distinguishes our brains from machines is that we have memory and can reason about unknown things based on what we remember. Our thoughts are persistent. However, the network structures introduced so far in this tutorial treat their elements independently: inputs and outputs are independent of one another.

The motivation for RNN

In the real world, many elements are connected to each other. For example, the outdoor temperature changes periodically with the seasons, and in language we need context to determine the meaning of a sentence. It is quite difficult for a machine to do this, which is why the recurrent neural network exists. Its essence is that it has the ability to remember and makes inferences based on what it remembers. Its output therefore depends on both the current input and its memory.

Why do we need RNN

The idea behind RNN is to use sequential information. In traditional neural networks, we assume that all inputs (and outputs) are independent of each other. But if you want to predict the next word in a sentence, you need to know which words come before it (and sometimes the words that follow) in order to give the correct answer.
RNNs are called recurrent because they perform the same task on each element of the sequence, and every output depends on the previous computations.
From another perspective, an RNN has "memory" and can capture the information computed so far. In theory, RNNs can use information from arbitrarily long sequences, but in practice they are limited to looking back only a few steps.
The recurrent neural network was proposed based on the idea of a memory model: the network is expected to remember features that appeared earlier and use them to infer later results, and the overall structure keeps looping, which is where the name "recurrent neural network" comes from.

What can RNN do

RNNs have achieved great success in many NLP tasks. At this point, it is worth mentioning that the most commonly used type of RNN is the LSTM, which is much better than the plain RNN at capturing long-term dependencies. But don't worry: the LSTM is essentially the same as the RNN we will develop in this tutorial, it just computes the hidden state in a different way. We will introduce the LSTM in more detail later. Here are some examples of RNN applications in NLP:
language modeling and text generation

With a language model, we can generate plausible text that reads as if written by a human, starting from given words.

machine translation

Machine translation is similar to language modeling. We input a sequence of words in the source language, and through the model's computation we output the corresponding content in the target language.

Speech Recognition

Given an input sequence of acoustic signals from a sound wave, we can predict a sequence of phonetic segments together with their probabilities, converting the speech into text.

Generate image description

Together with convolutional neural networks, RNN can generate descriptions of unlabeled images.

2.5.2 RNN network structure and principle

RNN

The basic structure of a recurrent neural network is particularly simple: the output of the network is stored in a memory unit, and this memory unit enters the network together with the next input. In other words, at each step the network takes the current input combined with the memory unit, produces an output, and also stores that output back in the memory unit. The figure in the original post shows a schematic of the simplest recurrent neural network at a single input step.

An RNN can be seen as multiple copies of the same network, each passing a message on to the next. We can unroll this loop to see the structure more clearly.

The network has a loop structure, which is where the name recurrent neural network comes from. At the same time, from this structure it is clear that the RNN has a natural advantage in processing sequential data, because the network itself is a sequence structure. This is also the most essential feature of all recurrent neural networks.

Recurrent neural networks have particularly good memory characteristics and can apply the remembered content to the current situation, but the network's memory capacity is not as effective as one might imagine. The biggest problem with memory is forgetting: we always remember recent events more clearly and forget events that happened long ago. Recurrent neural networks have the same problem.

PyTorch uses the nn.RNN class to build a sequence-based recurrent neural network. Its constructor has the following parameters:

  • input_size: the number of features of the input data X.
  • hidden_size: the number of neurons in the hidden layer, i.e. the number of hidden features.
  • num_layers: the number of recurrent layers, default 1.
  • bias: default True. If False, the neurons do not use bias parameters.
  • batch_first: if set to True, the first dimension of the input is the batch size; the default is False, in which case the first dimension is the sequence length, the second is the batch size, and the third is the number of features (see the example after the code below).
  • dropout: if non-zero, a dropout layer is applied to the output of every RNN layer except the last, and the dropout probability is given by this parameter.

The most important parameters of the RNN are input_size and hidden_size; these two must be specified. The remaining parameters usually do not need to be set, and their default values work fine.

# RNN with input_size=20, hidden_size=50, num_layers=2
rnn = torch.nn.RNN(20, 50, 2)
input = torch.randn(100, 32, 20)   # (seq_len=100, batch=32, input_size=20)
h_0 = torch.randn(2, 32, 50)       # initial hidden state: (num_layers=2, batch=32, hidden_size=50)
output, hn = rnn(input, h_0)
print(output.size(), hn.size())
torch.Size([100, 32, 50]) torch.Size([2, 32, 50])
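As a quick illustration of the batch_first option described above, here is a small sketch (not from the original post) that builds the same two-layer RNN with batch_first=True, so the input is laid out as (batch, seq_len, features) while the hidden state keeps its (num_layers, batch, hidden_size) layout:

import torch

rnn_bf = torch.nn.RNN(input_size=20, hidden_size=50, num_layers=2, batch_first=True)
x = torch.randn(32, 100, 20)     # (batch=32, seq_len=100, input_size=20)
h0 = torch.randn(2, 32, 50)      # hidden state stays (num_layers, batch, hidden_size)
out, hn = rnn_bf(x, h0)
print(out.size(), hn.size())     # expected: torch.Size([32, 100, 50]) torch.Size([2, 32, 50])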

Beginners reading the description above may still be at a loss: what are these things, and how are they used in practice?
Next, we will write our own implementation of an RNN in PyTorch. By implementing it ourselves, we gain a deeper understanding of the RNN structure.

Before implementing it, let's look more closely at how the RNN works. An RNN is really just an ordinary neural network, except that it has an extra hidden_state that saves historical information. The role of this hidden_state is to preserve the previous state; the memory that we say an RNN keeps is exactly this hidden_state.

For the RNN, we only need to remember one formula:

h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{(t-1)} + b_{hh})

This formula comes from the official website:
https://pytorch.org/docs/stable/nn.html?highlight=rnn#torch.nn.RNN

In this formula, x_t is the input at the current step, and h_{(t-1)} is the hidden_state passed in from the previous step, i.e. the memory part mentioned above.
The parts of the network to be trained are W_{ih}, the weights applied to the current input, W_{hh}, the weights applied to the previous hidden_state, and the two bias terms. These four terms are added together and passed through tanh for activation. PyTorch uses tanh as the activation by default; you can also switch to relu via the nonlinearity setting.
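To make the formula concrete, here is a small sketch (not from the original post) that applies it by hand for a single time step and compares the result with nn.RNN. weight_ih_l0, weight_hh_l0, bias_ih_l0 and bias_hh_l0 are the parameter attributes PyTorch exposes for the first layer:

import torch

rnn = torch.nn.RNN(input_size=4, hidden_size=3, num_layers=1)

x_t = torch.randn(1, 4)        # one time step, batch of 1
h_prev = torch.zeros(1, 3)     # previous hidden state

# manual application of h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
h_manual = torch.tanh(
    x_t @ rnn.weight_ih_l0.t() + rnn.bias_ih_l0
    + h_prev @ rnn.weight_hh_l0.t() + rnn.bias_hh_l0
)

# the same step through nn.RNN (input shape: seq_len=1, batch=1, features=4)
out, h_n = rnn(x_t.unsqueeze(0), h_prev.unsqueeze(0))
print(torch.allclose(h_manual, h_n.squeeze(0)))   # expected: True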

The steps described above correspond to a single computation step (the part circled in red in the figure from the original post).

This step is no different from an ordinary neural network, but because the RNN has an extra sequence dimension, the same model must run forward propagation n times, where n is the length of the sequence.
Let's start implementing our RNN by hand, following Karpathy's article: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

class RNN(object):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # the final operation is an addition, so both linear layers map to hidden_size
        self.W_xh = torch.nn.Linear(input_size, hidden_size)
        self.W_hh = torch.nn.Linear(hidden_size, hidden_size)

    def __call__(self, x, hidden):
        return self.step(x, hidden)

    def step(self, x, hidden):
        # one step of forward propagation
        h1 = self.W_hh(hidden)
        w1 = self.W_xh(x)
        out = torch.tanh(h1 + w1)
        # the new hidden state is the tanh output itself; it is fed back in at the next step
        hidden = out
        return out, hidden

rnn = RNN(20, 50)
seq_len = 100
input = torch.randn(seq_len, 32, 20)   # (seq_len, batch=32, input_size=20)
hidden = torch.randn(32, 50)           # initial hidden state h_0
for i in range(seq_len):
    output, hidden = rnn(input[i, :, :], hidden)
print(output.size(), hidden.size())
torch.Size([32, 50]) torch.Size([32, 50])

LSTM

LSTM is the abbreviation of Long Short-Term Memory networks. The LSTM structure was proposed by Hochreiter and Schmidhuber in 1997 and has since become very popular.
The LSTM is deliberately designed to deal with the long-term dependency problem that plain RNNs suffer from. This design has proven very effective in practice, and a lot of follow-up work has built on it to solve many practical problems, so LSTM is still widely used today.

The standard recurrent neural network cell has only a single simple layer, while the LSTM cell has four interacting layers (a code sketch of these computations follows the descriptions below):

The first layer is the forget gate layer: it decides what information to discard from the cell state.

The second layer, a tanh layer, generates candidate values for the update, indicating which dimensions of the state should be strengthened and which should be weakened.

The third layer, a sigmoid layer (the input gate layer), produces values that are multiplied with the output of the tanh layer, acting as a scaling factor. In the extreme case, a sigmoid output of 0 means the corresponding dimension of the state does not need to be updated.

The last layer decides what to output; the output depends on the cell state. Which parts of the state are ultimately output is determined by a sigmoid layer (the output gate).
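As an illustration of these four layers, here is a minimal hand-written LSTM step (not from the original post). It follows the standard LSTM equations using concatenated [x, h] weights for brevity, which differs slightly from nn.LSTM's internal parameter layout:

import torch

input_size, hidden_size = 10, 20

# one linear layer per gate: forget (f), input (i), candidate (g), output (o)
W_f = torch.nn.Linear(input_size + hidden_size, hidden_size)
W_i = torch.nn.Linear(input_size + hidden_size, hidden_size)
W_g = torch.nn.Linear(input_size + hidden_size, hidden_size)
W_o = torch.nn.Linear(input_size + hidden_size, hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([x_t, h_prev], dim=1)   # concatenate input and previous hidden state
    f = torch.sigmoid(W_f(z))             # forget gate: what to discard from the cell state
    i = torch.sigmoid(W_i(z))             # input gate: how much of the candidate to let in
    g = torch.tanh(W_g(z))                # candidate values for the update
    o = torch.sigmoid(W_o(z))             # output gate: which parts of the state to expose
    c_t = f * c_prev + i * g              # new cell state
    h_t = o * torch.tanh(c_t)             # new hidden state / output
    return h_t, c_t

x = torch.randn(3, input_size)            # batch of 3
h, c = torch.zeros(3, hidden_size), torch.zeros(3, hidden_size)
h, c = lstm_step(x, h, c)
print(h.size(), c.size())                 # torch.Size([3, 20]) torch.Size([3, 20])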

PyTorch uses the nn.LSTM class to build a sequence-based recurrent neural network. Its parameters are basically the same as nn.RNN's, so they are not listed again here.

# LSTM with input_size=10, hidden_size=20, num_layers=2
lstm = torch.nn.LSTM(10, 20, 2)
input = torch.randn(5, 3, 10)      # (seq_len=5, batch=3, input_size=10)
h0 = torch.randn(2, 3, 20)         # initial hidden state (num_layers, batch, hidden_size)
c0 = torch.randn(2, 3, 20)         # initial cell state (num_layers, batch, hidden_size)
output, hn = lstm(input, (h0, c0)) # hn is the tuple (h_n, c_n)
print(output.size(), hn[0].size(), hn[1].size())
torch.Size([5, 3, 20]) torch.Size([2, 3, 20]) torch.Size([2, 3, 20])

GRU

GRU is the abbreviation of Gated Recurrent Unit and was proposed by Cho et al. in 2014. The biggest difference from the LSTM is that the GRU merges the forget gate and the input gate into a single "update gate". The GRU also drops the separate memory cell and instead passes its output forward as the memory state, which makes the network's inputs and outputs particularly simple.
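For comparison with the LSTM sketch above, here is a minimal hand-written GRU step (not from the original post). It follows the gating equations used by torch.nn.GRU, with the reset gate applied to the hidden-to-hidden part of the candidate:

import torch

input_size, hidden_size = 10, 20

W_z = torch.nn.Linear(input_size + hidden_size, hidden_size)  # update gate
W_r = torch.nn.Linear(input_size + hidden_size, hidden_size)  # reset gate
W_x = torch.nn.Linear(input_size, hidden_size)                # input part of the candidate
W_h = torch.nn.Linear(hidden_size, hidden_size)               # hidden part of the candidate

def gru_step(x_t, h_prev):
    zr_in = torch.cat([x_t, h_prev], dim=1)
    z = torch.sigmoid(W_z(zr_in))               # update gate: how much old state to keep
    r = torch.sigmoid(W_r(zr_in))               # reset gate: how much old state feeds the candidate
    n = torch.tanh(W_x(x_t) + r * W_h(h_prev))  # candidate hidden state
    h_t = (1 - z) * n + z * h_prev              # interpolate between candidate and old state
    return h_t

x = torch.randn(3, input_size)
h = torch.zeros(3, hidden_size)
h = gru_step(x, h)
print(h.size())                                 # torch.Size([3, 20])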

# GRU with input_size=10, hidden_size=20, num_layers=2
rnn = torch.nn.GRU(10, 20, 2)
input = torch.randn(5, 3, 10)   # (seq_len=5, batch=3, input_size=10)
h0 = torch.randn(2, 3, 20)      # initial hidden state (num_layers, batch, hidden_size)
output, hn = rnn(input, h0)
print(output.size(), hn.size())
torch.Size([5, 3, 20]) torch.Size([2, 3, 20])

2.5.3 Backpropagation through time (BPTT)

In forward propagation, the RNN's input advances one time step at a time. In backpropagation, we "go back in time" to update the weights, which is why this is called backpropagation through time (BPTT).

We usually treat an entire sequence (e.g. a sentence) as one training sample, so the total error is the sum of the errors at each time step (e.g. each character). The weights are the same at every time step, so they can be updated together after the total error has been computed. The procedure is:

  1. Compute the cross-entropy error using the predicted output and the actual output
  2. Unroll the network across all time steps
  3. For the unrolled network, compute the gradient of the weights at each time step
  4. Because the weights are the same at every time step, the gradients from all time steps can be combined (rather than getting a different gradient for each hidden layer, as in a feed-forward network)
  5. Then update the weights of the recurrent neurons (a minimal sketch of this procedure follows the list)
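Here is a minimal BPTT sketch using the hand-written RNN formula from above (not from the original post; the data and the classifier head are made up for illustration). The losses of all time steps are summed, and a single backward pass accumulates the gradients of the shared weights across the whole sequence:

import torch

seq_len, batch, input_size, hidden_size, num_classes = 6, 4, 10, 16, 5

W_xh = torch.nn.Linear(input_size, hidden_size)
W_hh = torch.nn.Linear(hidden_size, hidden_size)
classifier = torch.nn.Linear(hidden_size, num_classes)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(seq_len, batch, input_size)                # toy input sequence
targets = torch.randint(0, num_classes, (seq_len, batch))  # toy labels, one per time step

h = torch.zeros(batch, hidden_size)
total_loss = 0.0
for t in range(seq_len):
    h = torch.tanh(W_xh(x[t]) + W_hh(h))                   # the same weights are reused at every step
    total_loss = total_loss + criterion(classifier(h), targets[t])

total_loss.backward()                                      # one backward pass through all time steps
print(W_hh.weight.grad.shape)                              # gradient accumulated over the whole sequence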

The unrolled RNN looks like an ordinary feed-forward network, and backpropagation works much like in an ordinary network, except that we obtain the gradients of all time steps at once. If there are 100 time steps, the unrolled network becomes very large and gradients can vanish or explode over such long chains; structures such as LSTM and GRU were introduced to address this problem.

Recurrent neural networks are currently the most popular models in natural language processing, so the following sections introduce some additional concepts that recurrent networks rely on when processing NLP.

2.5.4 Word embedding

When we humans communicate, vocabulary is represented directly by words, but a computer cannot recognize words directly. In order for the computer to understand our language better and to build a better language model, we need a representation for the vocabulary.

One-hot encoding is often used for image classification. For example, LeNet classifies the 10 digits 0-9; if the digit is 2, its encoding is (0,0,1,0,0,0,0,0,0,0), and for a classification problem this representation is perfectly clear. In natural language processing, however, the number of words is far too large. If there are 10,000 different words, one-hot encoding becomes extremely inefficient: each word is a 10,000-dimensional vector with a single 1 and the rest 0. This wastes memory and tells us nothing about the word itself, because every word is just a one-hot vector; some words are semantically close, but one-hot encoding cannot express that. To capture this property, we need a different way to represent each word.

Instead, each word is characterized by a set of features, and different words take different values on these features. This is word embedding. The figure referenced here is a screenshot from Andrew Ng's course.

Word embeddings not only provide a feature representation for different words, they also allow us to compute the similarity between words: in a multi-dimensional space we can measure the distance between word vectors along each dimension. This enables analogical reasoning, for example summer relates to hot as winter relates to cold.

In PyTorch, we use the nn.Embedding layer to build the embedding model. The first argument of the Embedding layer is how many words we have, and the second is how many dimensions the vector representing each word should have.

# an Embedding module containing 10 tensors of size 3
embedding = torch.nn.Embedding(10, 3)
# a batch of 2 samples of 4 indices each
input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
output=embedding(input)
print(output.size())
torch.Size([2, 4, 3])
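Since word embeddings make it possible to measure similarity between words, here is a small sketch (not from the original post; the indices are arbitrary and the embedding is untrained, so the numbers are meaningless except to show the API) that compares two embedding vectors with cosine similarity:

import torch
import torch.nn.functional as F

embedding = torch.nn.Embedding(10, 3)

# look up the (untrained) vectors for two hypothetical word indices
w1 = embedding(torch.LongTensor([2]))
w2 = embedding(torch.LongTensor([5]))

# cosine similarity in [-1, 1]; with trained embeddings, related words score higher
print(F.cosine_similarity(w1, w2))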

2.5.5 Other important concepts

Beam search

After the model has produced a distribution over the first word, we could use greedy search to pick the single most likely word at each step according to our conditional language model. But our vocabulary contains a huge number of words, so computing the probability of every possible combination of words is not feasible. We therefore use an approximate search method to maximize (or approximately maximize) the conditional probability of the whole sentence, rather than choosing greedily word by word.

Beam search is a heuristic graph search algorithm. It is usually used when the solution space of a graph is relatively large: to reduce the space and time taken by the search, at each depth-expansion step some lower-quality nodes are pruned and only the higher-quality nodes are kept. Although beam search is not complete, it reduces the space and time consumed when searching a large solution space.

Beam search can be regarded as breadth-first search with a pruning constraint. First a breadth-first strategy builds a search tree; at each level of the tree, the nodes are sorted by a heuristic cost, and only a predetermined number of nodes (the beam width) are kept. Only these nodes are expanded at the next level; all other nodes are pruned.

  1. Insert the initial node into the heap.
  2. Pop a node from the heap; if it is the goal node, the algorithm ends.
  3. Otherwise, expand the node and push its best beam-width successors into the heap, then go back to step 2 and continue the loop.
  4. The algorithm ends when the optimal solution is found or the heap is empty.

In practice, the beam width can be fixed in advance or variable; the exact setting can be tuned to the actual scenario. A minimal sketch of the procedure follows.
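Here is a minimal beam search sketch (not from the original post) over a toy next-token distribution. next_log_probs is a made-up stand-in for a language model that returns log-probabilities for every word in the vocabulary given a prefix:

import torch

vocab_size, beam_width, max_len = 8, 3, 4

def next_log_probs(prefix):
    # toy stand-in for a language model: pseudo-random scores that depend on the last token
    g = torch.Generator().manual_seed(prefix[-1])
    return torch.log_softmax(torch.randn(vocab_size, generator=g), dim=0)

# each beam entry is (accumulated log-probability, token sequence)
beams = [(0.0, [0])]              # start from a hypothetical start token with index 0
for _ in range(max_len):
    candidates = []
    for score, seq in beams:
        log_probs = next_log_probs(seq)
        for token in range(vocab_size):
            candidates.append((score + log_probs[token].item(), seq + [token]))
    # keep only the beam_width best partial sequences, prune the rest
    candidates.sort(key=lambda c: c[0], reverse=True)
    beams = candidates[:beam_width]

best_score, best_seq = beams[0]
print(best_seq, best_score)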

Attention model

For encoder-decoder RNN models, we can achieve reasonably accurate machine translation results. For short sentences the performance is very good, but for very long sentences the translation quality degrades.
When we humans translate, we translate a long sentence part by part. The attention mechanism works very similarly to this human translation process: it also translates long sentences part by part.

The specific content will not be introduced in detail here
