Long short-term memory (LSTM)

Introduction

Disadvantages of RNN

RNN was introduced last time; you can refer to the article recurrent neural network (RNN).
At the end of that article, only the advantages of RNN were mentioned. As a point of comparison for LSTM, it is necessary to point out an obvious disadvantage: the way RNN represents historical information in $h_t$ is not very reasonable.
Why? Consider the situation where the activation function of the hidden layer is ReLU, or where there is no activation function at all. Then:
$h_t = W(\dots(W(Wh_0 + Ux_1) + Ux_2)\dots) + Ux_t$
Expanding (and ignoring the $h_0$ term), we get $h_t = W^{t-1}Ux_1 + W^{t-2}Ux_2 + \dots + Ux_t$. From this result we can see:

  1. If U is positive and the weights in W are greater than 1, then $h_t$ will be dominated mainly by $x_1$; that is, only the information of $x_1$ is retained.
  2. If U is positive and the weights in W are between 0 and 1, then $h_t$ will be dominated mainly by $x_t$; the information from $x_1$ contributes the least, and $x_2$ the second least.

In short, according to the above analysis, RNN forgets monotonically with distance. For example, in the second case above, the farther back an input is, the more thoroughly it is forgotten, as the sketch below illustrates.
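
To make this concrete, here is a minimal numerical sketch of the expansion above, assuming scalar weights $w$ and $u$ with no activation function; the values chosen are hypothetical and only for illustration:

```python
# Coefficient of x_k in the expansion h_T = w^(T-1) u x_1 + w^(T-2) u x_2 + ... + u x_T
w, u = 0.5, 1.0   # hypothetical scalar weights; case 2 above (weight in W between 0 and 1)
T = 10            # sequence length

for k in range(1, T + 1):
    coeff = w ** (T - k) * u
    print(f"weight of x_{k} in h_{T}: {coeff:.6f}")
# x_1 is scaled by w^(T-1) (about 0.002) while x_T keeps a weight of 1.0:
# the farther back an input is, the more thoroughly it is forgotten.
```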

LSTM improvements

Now we hope to have a network architecture that can perform selective forgetting. That is, compared with RNN, we add some parameters so that historical information and the current input can be weighed against each other: if the current input is of no use, it is selectively given a low weight so that the historical information is better inherited; if the historical information will have no influence on the future, it is given a low weight and the current input is given a higher weight. How can this be done? A conceptual sketch of this weighting idea follows.
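
Before looking at the full cell, here is a conceptual sketch of the weighting idea only. It is not the actual LSTM formulation; the gate function, the parameters $w_g$ and $u_g$, and all values are hypothetical and chosen purely for illustration:

```python
import math

def gate(h_prev, x_t, w_g, u_g):
    # A learned weight in [0, 1], computed from the history and the current input (sigmoid).
    return 1.0 / (1.0 + math.exp(-(w_g * h_prev + u_g * x_t)))

h_prev, x_t = 0.8, 0.1                    # hypothetical history and current input
g = gate(h_prev, x_t, w_g=2.0, u_g=-3.0)  # hypothetical learned parameters
h_new = g * h_prev + (1.0 - g) * x_t      # keep more of the history when g is large
print(f"gate = {g:.3f}, new state = {h_new:.3f}")
```

The actual LSTM cell, described next, uses several such gates rather than a single one.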

LSTM

The network architecture is the same as RNN, but each neuron in the hidden layer is expanded to:
[Figure: internal structure of an LSTM cell]

Anatomy

Parameters:
[Figure: parameters of the LSTM cell]

Obviously, the difference from RNN is not only that the parameters are increased, but also that $x_t$ and $h_{t-1}$ are each reused 4 times.
Which 4 times?
Formulas:
[Figure: the four formulas that use $x_t$ and $h_{t-1}$: the gates $f_t$, $i_t$, $o_t$ and the candidate memory $c_t$]
Explanation: the $f_t$, $i_t$, $o_t$ above are numbers in $[0, 1]$; they act as weights.
After the two inputs $x_t$ and $h_{t-1}$ have been used as above, the next cell memory $s_t$ needs to be generated. This is a significant difference from RNN: this $s_t$, together with the $h_t$ computed later, must be saved for use by the next cell.
$s_t$:
[Figure: formula for the cell memory $s_t$]
This means that $f_t$ controls how much of the past memory $s_{t-1}$ is forgotten, and $i_t$ controls to what extent the input candidate memory $c_t$ is accepted.
$h_t$:
[Figure: formula for the hidden state $h_t$]
So far, we can see that the historical information is carried jointly by $h_t$ and $s_t$, which enhances the model's capability.
Finally, the formula for the output $z_t$ is:
[Figure: formula for the output $z_t$]
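
The figures above are not reproduced here, but a minimal NumPy sketch of one cell step in the article's notation ($f_t$, $i_t$, $o_t$, candidate memory $c_t$, cell memory $s_t$, hidden state $h_t$, output $z_t$) is given below. It follows the standard LSTM formulation and is only a sketch: bias terms are omitted, the output layer $V$ and all dimensions are hypothetical, and the exact equations in the original figures may differ in detail.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, s_prev, params):
    """One step of an LSTM cell (s_t: cell memory, c_t: candidate memory)."""
    W_f, U_f, W_i, U_i, W_o, U_o, W_c, U_c, V = params

    f_t = sigmoid(W_f @ h_prev + U_f @ x_t)   # forget gate, each entry in [0, 1]
    i_t = sigmoid(W_i @ h_prev + U_i @ x_t)   # input gate
    o_t = sigmoid(W_o @ h_prev + U_o @ x_t)   # output gate
    c_t = np.tanh(W_c @ h_prev + U_c @ x_t)   # candidate memory

    s_t = f_t * s_prev + i_t * c_t            # forget part of s_{t-1}, accept part of c_t
    h_t = o_t * np.tanh(s_t)                  # hidden state handed to the next cell
    z_t = V @ h_t                             # output (e.g. logits before a softmax)
    return h_t, s_t, z_t

# Tiny usage example with random parameters; all dimensions are arbitrary.
rng = np.random.default_rng(0)
d_x, d_h, d_out = 3, 4, 2
params = (
    rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x)),   # W_f, U_f
    rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x)),   # W_i, U_i
    rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x)),   # W_o, U_o
    rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x)),   # W_c, U_c
    rng.normal(size=(d_out, d_h)),                              # V
)
h, s = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):   # a length-5 input sequence
    h, s, z = lstm_step(x, h, s, params)
print("final output z_t:", z)
```

Note how $h_t$ and $s_t$ are both returned and fed into the next step, matching the point above that the history is carried by the two of them together.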

Summary

As an innovative improvement on RNN, LSTM uses three gate mechanisms to achieve selective forgetting, which greatly enhances the expressive power of the network architecture and achieves very good results in practice. LSTM has now become a frequently used network and is well suited to time-series and context-dependent problems.

Origin: blog.csdn.net/qq_43391414/article/details/111476183