Dahua Recurrent Neural Network (RNN)

The original article was published on my WeChat public account "Big Data and Artificial Intelligence Lab" (BigdataAILab). You are welcome to follow it.

 

 

The previous article introduced the algorithmic principles of the Convolutional Neural Network (CNN). CNN is powerful and widely applied in image recognition, but there are scenarios it cannot handle effectively, such as:

  • Speech recognition: the sound of each frame must be processed in sequence, and some results can only be recognized from context;
  • Natural language processing: each word must be read in turn in order to determine the semantics of a piece of text.

These scenarios share one characteristic: they all involve time series, and the length of the input sequence is not fixed.

In the classic artificial neural network, the deep neural network (DNN), and even the convolutional neural network (CNN), two assumptions hold: the input data all have the same dimension, and each input is independent of the others. The signal of each layer of neurons can only propagate to the next layer, and samples are processed independently at each moment.
 
In real life, however, when performing speech recognition on a recording, the duration of each sentence the speaker says is almost always different, and the content must be recognized in the order in which it was spoken.
This calls for a more capable model: one with a certain memory that can process time-series information of arbitrary length. That model is today's protagonist, the Recurrent Neural Network (RNN).


In a Recurrent Neural Network (RNN), the output of a neuron can act directly on itself (as input) at the next time step. Take a look at the comparison diagram below:
 
The two simplified diagrams above show that, compared with a classic neural network, the RNN structure has an additional loop: the output of a neuron is fed back as part of its input at the next time step. These loops make RNNs seem mysterious, but viewed from another angle they are no harder to understand than classic neural networks. An RNN can be regarded as the same network being evaluated repeatedly over time: the input to the i-th layer of neurons at time t includes not only the output of the (i-1)-th layer at that moment, but also its own output at time t-1. If we unfold the RNN along the time axis, we get the following structure diagram:
 
At each time point, the input of the RNN is related to the previous state, and the output of the network at time t_n is the combined result of the input at that moment and all of the history, which achieves the purpose of modeling time series.
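To make the unfolded structure concrete, here is a minimal NumPy sketch of that recurrence; the tanh activation, the weight names, and the toy dimensions are illustrative assumptions rather than anything fixed by the figures above:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state depends on the current input x_t
    # AND on the previous hidden state h_prev (the loop in the diagram).
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))   # input -> hidden weights (toy sizes)
W_hh = rng.normal(size=(4, 4))   # hidden -> hidden (the recurrent loop)
b_h  = np.zeros(4)

h = np.zeros(4)                          # initial state
for x_t in rng.normal(size=(5, 3)):      # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)   # the final state mixes the current input with the whole history
```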

[Here comes the problem] RNNs and long-term dependencies (Long-Term Dependencies)

In theory, an RNN can use information from all previous time points for the current task; this is the long-term dependency mentioned above. If RNNs could really do this, they would be very useful: in automatic question answering, for example, they could answer more intelligently based on context. In real applications, however, different situations arise, such as:
(1) Consider a language model that predicts the next word from the preceding words. If we want to predict the last word of the sentence "white clouds are floating in (the sky)", we do not need any other context: the last word should obviously be "the sky". In such a scenario, the gap between the relevant information and the position being predicted is very small, as shown in the following figure:
 
(2) Suppose we want to predict the last word of "I grew up in Sichuan... I can speak fluent (Sichuan) dialect". The nearby words suggest that the last word is probably the name of a language, but to figure out which language, we need to go all the way back to the word "Sichuan", which is far from the current position. This means the gap between the relevant information and the current prediction position has become quite large, as shown in the figure below:
 
Unfortunately, as the gap keeps growing, the RNN suffers from "vanishing gradients" or "exploding gradients"; this is the long-term dependency problem of RNNs. For example, we often use the sigmoid as the neuron activation function. For a signal with amplitude 1, the gradient decays to at most 0.25 of its value for every layer it passes through backwards, so after a few layers almost no valid signal remains; this situation is "gradient vanishing". Therefore, as the gap increases, the RNN loses the ability to learn to connect information that far away.
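A quick numerical sketch of that decay, assuming the best case in which every backward step multiplies the gradient by the sigmoid derivative at its maximum value of 0.25 (in practice the factor is usually even smaller):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

grad = 1.0   # a gradient signal with amplitude 1
z = 0.0      # the point where sigmoid'(z) is largest, namely 0.25
for step in range(1, 21):
    grad *= sigmoid(z) * (1.0 - sigmoid(z))   # multiply by sigmoid'(z)
    if step in (1, 5, 10, 20):
        print(f"after {step:2d} steps the gradient is about {grad:.1e}")
# After 20 steps the gradient has shrunk to roughly 1e-12:
# information that far back can barely influence learning any more.
```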

 

[So what do we do?] Here comes the secret weapon: the Long Short-Term Memory network (LSTM, Long Short Term Memory Network)


LSTM is a special type of RNN that can learn long-term dependency information. On many problems, LSTMs have achieved considerable success and have been widely used.

The structure of an LSTM unit is shown in the following figure:
 
As can be seen from the figure above, there is a cell in the middle, which is the "processor" the LSTM uses to judge whether information is useful. Three gates are placed around the cell: the Input Gate, the Forget Gate, and the Output Gate. When a piece of information enters the LSTM network, it is judged against the rules: only information that meets the requirements is kept, and information that does not is forgotten through the forget gate.
LSTM cleverly uses "gates" as switches to implement memory over time, which is an effective technique for solving the long-term dependency problem. In a digital circuit, a gate is a binary variable {0, 1}: 0 represents the closed state, letting no information through; 1 represents the open state, letting all information through. The "gates" in an LSTM are similar, except that they are "soft" gates with values between (0, 1), meaning information is let through in a certain proportion.
This may not sound very concrete at first, so how does the LSTM actually do it?

Let's first look at the simplified diagram of an RNN unfolded in time. Its structure is very simple: the repeating module in a standard RNN contains only a single layer, such as a tanh layer, as shown in the figure below:
 
An LSTM has a similar chain-like structure, but its repeating module is different: the repeating module in an LSTM contains four interacting layers, among which are the input gate (Input Gate), the forget gate (Forget Gate), and the output gate (Output Gate), as shown below:
 
Now let's introduce how LSTM works, combining the structure diagrams and formulas below. First, review the structure diagram and formula of the most basic single-layer neural network: the input is x, and the output y is obtained by applying the transformation Wx + b followed by the activation function f, that is, y = f(Wx + b). Similar formulas will appear many times below.
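For reference, that basic building block takes only a few lines; the same pattern (a weight matrix, a bias, an activation) reappears in every gate formula below. The function and variable names here are placeholders:

```python
import numpy as np

def dense(x, W, b, f=np.tanh):
    # The basic single-layer unit: y = f(Wx + b)
    return f(W @ x + b)

y = dense(np.ones(3), np.eye(2, 3), np.zeros(2))   # toy call with placeholder sizes
```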

Below we take as an example a language model that predicts the next word based on the words it has already seen, for instance:


Xiao Ming has just finished eating rice; now he is about to eat some fruit, so he picks up a ( )

(1) Forget Gate
The schematic diagram of this gate is as follows. The gate reads h_{t-1} and x_t and, through a sigmoid layer, outputs a number between 0 and 1 for each entry of the cell state C_{t-1}, where 0 means "discard completely" and 1 means "retain completely".
 
Combined with the language-prediction example above: in "Xiao Ming has just finished eating rice", the subject is "Xiao Ming" and the object is "rice"; in the next clause, "now he is about to eat some fruit", the object has changed to the new word "fruit". The word to be predicted in the third clause is related to "fruit" and has nothing to do with "rice", so the information about "rice" can be forgotten.
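Following the standard LSTM equations, the forget gate can be sketched as follows; the weight names W_f, b_f and the convention of concatenating h_{t-1} with x_t are the usual textbook ones, assumed here rather than read off the article's figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    # f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f): one value in (0, 1) per entry
    # of the cell state C_{t-1}; near 0 -> "discard completely",
    # near 1 -> "retain completely".
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
```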

(2) Input Gate
The next step is to decide what new information to store in the cell state. This has two parts:
first, an "input gate" (a sigmoid layer) decides which values we will update; then, a tanh layer creates a vector of new candidate values to be added to the state, as shown below:
 
In this example of the language-prediction model, we want to add the new word "fruit" to the cell state, to replace the old word "rice" that is to be forgotten.
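A matching sketch of this step, again using the conventional (assumed) weight names W_i, b_i for the input gate and W_C, b_C for the candidate values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(h_prev, x_t, W_i, b_i, W_C, b_C):
    hx = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ hx + b_i)        # i_t: which cell-state entries to update
    C_tilde = np.tanh(W_C @ hx + b_C)    # C̃_t: the new candidate values to write
    return i_t, C_tilde
```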

Now let's update the old cell state from C_{t-1} to C_t. The update is done as follows: (1) multiply the old state C_{t-1} by f_t (recall that f_t is the output of the forget gate, a value between 0 and 1 indicating the degree of forgetting), discarding the information that needs to be discarded (for example, if the forget gate outputs 0, multiplication turns that entry into 0 and the information is dropped); (2) then add i_t multiplied by the candidate values C̃_t (the formula is shown above). Combining the two gives the new cell state. In this example of the language-prediction model, this is where the old information ("rice") is discarded and the new information ("fruit") is added, in line with the decisions made in the earlier steps.
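The update itself is just element-wise arithmetic on the outputs of the two gates above; a minimal sketch:

```python
import numpy as np

def update_cell_state(C_prev, f_t, i_t, C_tilde):
    # C_t = f_t * C_{t-1} + i_t * C̃_t:
    # f_t scales down (forgets) the old state, i_t scales the new candidates.
    return f_t * C_prev + i_t * C_tilde

# toy illustration: forget the old entry completely, write the new candidate fully
print(update_cell_state(np.array([0.9]), np.array([0.0]),
                        np.array([1.0]), np.array([0.7])))   # -> [0.7]
```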

(3) Output Gate
Finally, we need to decide what to output. First, a sigmoid layer determines which parts of the cell state will be output; then the cell state is passed through tanh (producing values between -1 and 1) and multiplied by the sigmoid output, so that only the parts of the information we need are actually output.

 
In this language-model example, having seen the new word ("fruit"), the model might need to output information related to it (apple, pear, banana...).
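The output step in the same style; W_o and b_o are again the conventional names and are assumptions here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate(h_prev, x_t, C_t, W_o, b_o):
    # o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o): which parts of the state to expose
    o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)
    # h_t = o_t * tanh(C_t): squash the cell state to (-1, 1), then filter it
    return o_t * np.tanh(C_t)
```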

The above is the principle of the standard LSTM. There are also many variants of LSTM. One of the most popular is the Gated Recurrent Unit (GRU), which merges the forget gate and the input gate into a single update gate, mixes the cell state and the hidden state, and makes some other changes. The resulting GRU model is somewhat simpler than the standard LSTM, as shown in the following figure:
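For comparison, here is a minimal sketch of a GRU cell under the commonly used formulation (update gate z_t, reset gate r_t, candidate state h̃_t); the weight names and toy dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx + b_z)                 # update gate: forget + input in one
    r_t = sigmoid(W_r @ hx + b_r)                 # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    return (1.0 - z_t) * h_prev + z_t * h_tilde   # single state: no separate cell

# toy usage with 3-dim input and 4-dim state
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(size=(4, 7)) for _ in range(3))
b_z = b_r = b_h = np.zeros(4)
h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)
print(h)
```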

 


 

You are welcome to follow my WeChat public account "Big Data and Artificial Intelligence Lab" (BigdataAILab) for more content.
