RNN (recurrent neural network) and LSTM (long short-term memory network)

[Figure: a basic multi-input, single-output RNN]
This is the most basic multi-input, single-output RNN: for example, reading a whole paragraph and then picking out its key point.
The most obvious weakness of an RNN is that every time step shares the same set of parameters U, W, and b, which never change across steps. Looking at the output y and the hidden states h_i (h1, h2, h3 in this figure), you can also see that h2 depends on the result of h1, h3 depends on the result of h2, and so on.
The inputs to an RNN can be many and ordered. Speech or actions, for example, unfold in a time sequence, and an RNN can imitate that, producing one step after another, rather than making a simple linear prediction by fitting a whole document directly with gradient descent. Because the RNN steps through time, it has a kind of memory: one sentence is connected to the next, and the weight w linking them is relatively large. This makes RNNs very practical for time-series data, and they fit this kind of step-by-step prediction over time well.
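The recurrence described above can be sketched in a few lines. This is a minimal NumPy illustration under assumed names and shapes (input size 4, hidden size 5, a function called `rnn_forward`); it only shows how the same U, W, b are reused and how each h_t depends on h_{t-1}, not any particular library's API.

```python
import numpy as np

def rnn_forward(xs, U, W, b):
    """Vanilla RNN: the same U, W, b are reused at every time step."""
    h = np.zeros(W.shape[0])            # initial hidden state h0
    hs = []
    for x in xs:                        # steps are ordered: h_t needs h_{t-1}
        h = np.tanh(U @ x + W @ h + b)
        hs.append(h)
    return hs                           # h1, h2, h3, ...; the last one can feed a single output y

# Example: a sequence of 3 inputs of dimension 4, hidden size 5
rng = np.random.default_rng(0)
U, W, b = rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), np.zeros(5)
h1, h2, h3 = rnn_forward([rng.normal(size=4) for _ in range(3)], U, W, b)
```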

**Two forms of Seq2seq**

[Figure: the first Seq2seq form (Encoder-Decoder)]

Of the two, Encoder-Decoder is the more common one. The encoder and the decoder are each a black box inside, computed according to the formulas. In this form, the output of the encoder becomes the input of the decoder that follows, which then decodes step by step to produce the Y values one by one. Here h4 is handed to h5 through h8 (h6 contains components of h5), and h3 is likewise handed to h5 through h8.
The second form:
[Figure: the second Seq2seq form]
This form differs from the one above: all of h3 is fed into h4.
But there is a small drawback: the input to the decoder can end up being the same at every step. For example, if three sentences are fed in (equivalent to three inputs x1, x2, x3), the intermediate semantic vector c they produce is a single fixed value. The decoder then receives the same input at each step, and the y values it generates are very likely to be similar, or even identical.
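To make the bottleneck concrete, here is a minimal NumPy sketch (the names and shapes are illustrative assumptions, not taken from the post's figures): the encoder compresses the whole input sequence into one context vector c, and every decoder step then sees that same c.

```python
import numpy as np

def encode(xs, U_e, W_e, b_e):
    """Run the encoder RNN and return only its final state: the intermediate semantics c."""
    h = np.zeros(W_e.shape[0])
    for x in xs:
        h = np.tanh(U_e @ x + W_e @ h + b_e)
    return h                                    # one vector c for the whole input

def decode(c, steps, W_d, V_d, b_d):
    """Every decoder step receives the same fixed c, which is the drawback described above."""
    h, ys = c, []
    for _ in range(steps):
        h = np.tanh(W_d @ h + V_d @ c + b_d)    # same c at every step
        ys.append(h)
    return ys
```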
To change this situation, the attention mechanism is introduced.
What the attention mechanism does is adjust the weights over the encoder states, so that the intermediate semantics c1, c2, c3 fed into each decoder step are different and a better, clearer y value can be generated.

[Figure: attention weights over the input words]
Look at the weights w: 0.6, 0.2, 0.1, 0.1; whatever the values, they sum to 1. Here w1 is 0.6, which marks that word as the key point, since its weight is relatively large; this is what the attention mechanism contributes.
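As a minimal sketch of the idea (dot-product scoring is an assumption here, just one common choice), each decoder step gets its own context c_i: a weighted sum of the encoder states, with non-negative weights that sum to 1.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Return a context vector c_i and the attention weights for one decoder step."""
    scores = np.array([decoder_state @ h for h in encoder_states])  # one score per input position
    w = np.exp(scores - scores.max())
    w = w / w.sum()                            # softmax: e.g. 0.6, 0.2, 0.1, 0.1, summing to 1
    c = sum(wi * hi for wi, hi in zip(w, encoder_states))
    return c, w                                # a large weight marks the word being attended to
```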
If you want to classify, you need a softmax layer on top; that topic belongs with CNNs, so it is not covered here.
Summary of the plain RNN (without variants): training means using gradient descent to keep shrinking the gap between the real value and the predicted value (if this is unfamiliar, look up the principle of linear regression with code and it will become clear). Backpropagation (BP) is used, and this exposes a weak point of the RNN: the parameters (U, W, b) are shared and unchanged across all time steps. What does that lead to? Take the weight w and suppose it is 0.9 or 1.1 everywhere. In backpropagation, roughly speaking, the error signal is carried backwards from the final value by multiplying by w step after step. If w is 1.1, multiplying it too many times makes the number enormous and the error blows up; training cannot continue. That is the gradient explosion. And if w is 0.9? Then the product shrinks towards zero and the gradient vanishes, so the error can hardly be corrected. To solve this problem a better variant is introduced: the LSTM.
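A tiny numeric illustration of the point (100 steps and the weights 1.1 and 0.9 are just assumed for the example): repeatedly multiplying by the same shared weight either blows up or dies out.

```python
# Repeated multiplication by a shared weight across 100 time steps
for w in (1.1, 0.9):
    print(w, "->", w ** 100)
# 1.1 -> roughly 13780.6     (the factor explodes)
# 0.9 -> roughly 2.66e-05    (the factor vanishes)
```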

[Figure: LSTM cell structure]
[Figure: the forget gate]
The forget gate is easy to understand: here w, u, and b are no longer fixed. Wf keeps being adjusted according to the results, and its output measures the degree of forgetting.
[Figure: the input gate]
Here α is an activation function; in many cases the sigmoid is chosen (look it up if it is unfamiliar). Its job is to decide whether the input value is used or not.
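A minimal sketch of the two gates, assuming the usual formulation (separate trainable parameters Wf, Uf, bf and Wi, Ui, bi, with the sigmoid squashing each gate into the range 0 to 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_and_input_gates(x_t, h_prev, Wf, Uf, bf, Wi, Ui, bi):
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)   # forget gate: how much of the old memory to keep
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)   # input gate: how much of the new input to use
    return f_t, i_t
```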

[Figure: updating the cell state]
This step can be understood as adjusting the parameters for a better fit; the quantities c_{t-1}, i_t, and f_t can be understood by reading the formulas. (Why tanh? Its output lies between -1 and 1, which keeps the values centred, again in the service of fitting. Why this particular formula? The experience of those who came before.) Finally, there is the output gate.

[Figure: the output gate]

The output gate is similar in principle to the RNN described before, and it also needs the previous hidden value h_{t-1}.
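Putting the remaining pieces together, here is a minimal sketch of one LSTM step under the standard equations (the parameter names Wc, Uc, bc, Wo, Uo, bo are assumptions for illustration): the cell state mixes old and new memory through the gates, and the output gate decides how much of tanh(c_t) becomes the new hidden state h_t.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_update(x_t, h_prev, c_prev, f_t, i_t, Wc, Uc, bc, Wo, Uo, bo):
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)   # candidate memory, squashed to (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde               # keep part of the old memory, add part of the new
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)       # output gate, also uses h_{t-1}
    h_t = o_t * np.tanh(c_t)                         # new hidden state
    return h_t, c_t
```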
Having gone through the LSTM, it is clearer why it is called a long short-term memory network: it adds a forget gate and an internal memory cell, and the input and output values are also processed through the derived formulas. This is closer to how the human brain handles problems.

Origin blog.csdn.net/qq_42671505/article/details/107413153