Introduction
Disadvantages of RNN
RNN was introduced last time; you can refer to the article recurrent neural network (RNN).
At the end, only the advantages of RNN were mentioned. As a point of comparison for LSTM, it is necessary to point out RNN's obvious disadvantage: its representation of historical information in the hidden state $h_t$ is not very reasonable.
Why? Consider the case where the activation function of the hidden layer is ReLU (with all pre-activations positive), or where there is no activation function at all. Then:
$h_t = W(\dots(W(Wh_0 + Ux_1) + Ux_2)\dots) + Ux_t$
Expanding this gives $h_t = W^{t-1}Ux_1 + W^{t-2}Ux_2 + \dots + Ux_t$. From this result, we can conclude:
- If $U$ is positive and the weights in $W$ are greater than 1, then $h_t$ will be dominated by the $x_1$ term; in effect, only the information from $x_1$ is preserved.
- If $U$ is positive and the weights in $W$ lie between 0 and 1, then $h_t$ will be dominated by $x_t$; the information from $x_1$ contributes the least, with $x_2$ contributing the next least.
In short, this analysis suggests that RNN forgets in a fixed order. In the second case above, for example, the farther back an input lies in the sequence, the more thoroughly it is forgotten.
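This decay can be checked numerically. Below is a minimal sketch (not from the original article) using a scalar linear RNN with no activation function, feeding a single impulse at the first step and watching its influence shrink:

```python
import numpy as np

# Scalar linear RNN: h_t = W * h_{t-1} + U * x_t, with |W| < 1
# (the second case discussed above). x_1 = 1, all later inputs are 0,
# so the final h equals x_1's surviving contribution, W**(T-1) * U * x_1.
W, U = 0.5, 1.0
T = 10
x = np.zeros(T)
x[0] = 1.0            # a single impulse at the first time step
h = 0.0
for t in range(T):
    h = W * h + U * x[t]
print(h)              # prints 0.001953125, i.e. 0.5**9
```

With weights above 1 the opposite happens: the same loop with `W = 2.0` makes the $x_1$ contribution explode and dominate everything that follows.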
LSTM improvements
Now, we would like a network architecture that can forget *selectively*. That is, by adding some parameters beyond those of RNN, we want the network to weigh historical information against the current input: if the current input is useless, give it a low weight so that history is better preserved; if the historical information will have no influence on the future, give it a low weight and give the current input a higher one. How can this be done?
LSTM
The overall network architecture is the same as RNN's, but each unit in the hidden layer is expanded into the following cell:
Anatomy
Parameters:
The obvious difference from RNN is not only that more parameters are added, but also that $x_t$ and $h_{t-1}$ are each reused four times.
In which four places?
Explanation: $f_t$, $i_t$, and $o_t$ above are all numbers in $[0, 1]$; they act as weights.
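For reference, the standard LSTM gate definitions consistent with the notation above are as follows (the weight-matrix names $W_*$ and $U_*$ are assumed here, and bias terms are omitted; the original article's formulas are not reproduced in the text):

```latex
f_t = \sigma(W_f h_{t-1} + U_f x_t) \qquad \text{(forget gate)} \\
i_t = \sigma(W_i h_{t-1} + U_i x_t) \qquad \text{(input gate)} \\
o_t = \sigma(W_o h_{t-1} + U_o x_t) \qquad \text{(output gate)} \\
c_t = \tanh(W_c h_{t-1} + U_c x_t) \qquad \text{(candidate memory)}
```

The sigmoid $\sigma$ squashes each gate into $[0,1]$, which is exactly why $f_t$, $i_t$, $o_t$ can serve as weights.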
Having consumed $x_t$ and $h_{t-1}$, the cell next generates the new cell memory $s_t$. This is a significant difference from RNN: this $s_t$, together with the $h_t$ computed afterward, must both be saved for use by the next time step.
$s_t$
$s_t = f_t \odot s_{t-1} + i_t \odot c_t$. This expresses how much $f_t$ forgets of the past memory $s_{t-1}$, and to what extent $i_t$ accepts the candidate memory $c_t$ produced from the input.
$h_t$
With $h_t = o_t \odot \tanh(s_t)$, we see that the historical information is now escorted by both $h_t$ and $s_t$, which strengthens the model's capability.
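Putting the pieces together, one full LSTM step can be sketched in NumPy. This is a minimal illustration of the standard equations under the article's notation, not the article's own code; the parameter names (`W_f`, `U_f`, ...) are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, s_prev, params):
    """One LSTM time step: returns (h_t, s_t). Biases are zero-initialized here."""
    f = sigmoid(params["W_f"] @ h_prev + params["U_f"] @ x_t + params["b_f"])  # forget gate
    i = sigmoid(params["W_i"] @ h_prev + params["U_i"] @ x_t + params["b_i"])  # input gate
    o = sigmoid(params["W_o"] @ h_prev + params["U_o"] @ x_t + params["b_o"])  # output gate
    c = np.tanh(params["W_c"] @ h_prev + params["U_c"] @ x_t + params["b_c"])  # candidate memory
    s = f * s_prev + i * c        # cell memory s_t: gated mix of old memory and candidate
    h = o * np.tanh(s)            # hidden state h_t: gated view of the cell memory
    return h, s

# Smoke test with random weights: hidden size 3, input size 2.
rng = np.random.default_rng(0)
d_h, d_x = 3, 2
params = {f"{n}_{g}": rng.standard_normal((d_h, d_h if n == "W" else d_x))
          for n in ("W", "U") for g in "fioc"}
params.update({f"b_{g}": np.zeros(d_h) for g in "fioc"})
h, s = lstm_cell(rng.standard_normal(d_x), np.zeros(d_h), np.zeros(d_h), params)
print(h.shape, s.shape)  # (3,) (3,)
```

Note how the four matrix products against `h_prev` and `x_t` account for the "reused four times" observation above, and how `f * s_prev` is precisely the selective forgetting the previous section asked for.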
Finally, the output $z_t$ is computed from $h_t$, just as in a standard RNN:
To sum up
As an innovative improvement on RNN, LSTM uses three gate mechanisms to achieve selective forgetting, which greatly enhances the expressive power of the architecture and delivers very good results in practice. LSTM has become one of the most frequently used networks, a go-to tool for time-series and contextual problems.