Principles of Deep Learning ----- Recurrent Neural Network (RNN, LSTM)

Series Article Directory

Principles of Deep Learning ----- Linear Regression + Gradient Descent
Principles of Deep Learning ----- Logistic Regression
Principles of Deep Learning ----- Fully Connected Neural Network
Principles of Deep Learning ----- Convolutional Neural Network
Principles of Deep Learning ----- Recurrent Neural Network (RNN, LSTM)
Time Series Forecasting ----- Single-feature electricity load forecasting based on BP, LSTM, and CNN-LSTM neural networks
Time Series Forecasting (Multi-feature) ----- Multi-feature electricity load forecasting based on BP, LSTM, and CNN-LSTM neural networks


Series of teaching videos

Quick Introduction to Deep Learning in Practice
[Hands-on] Single-feature electricity load forecasting based on a BP neural network
[Hands-on] Single-feature electricity load forecasting based on RNN and LSTM neural networks
[Hands-on] Single-feature electricity load forecasting based on a CNN-LSTM neural network
[Multi-feature forecasting] Multi-feature electricity load forecasting based on a BP neural network
[Multi-feature forecasting] Multi-feature electricity load forecasting based on RNN and LSTM
[Multi-feature forecasting] Multi-feature electricity load forecasting based on a CNN-LSTM network



Foreword

  Compared with data such as images, which have structure in space, there is another type of data that has structure in time: time series data. By definition, a time series is a string of data indexed by the time dimension. Natural language is a very typical example: characters or words are produced one after another along the time axis, with a strong sequential order. Besides natural language, weather data, traffic flow data, and electricity load data are all time series data indexed by time.
  One of the most important characteristics of time series data is that the data from an earlier period strongly influence the data that follow. Natural language again provides an example. Take the sentence: "I have lived in China for many years, so I can speak __ fluently." Obviously the blank should be filled with something related to Chinese, such as "Chinese" or "Mandarin". Filling in "Japanese" or "English" would make the sentence sound odd, because the earlier clause "I have lived in China for many years" strongly constrains what comes after.
  Weather data are another example: a weather record contains features such as daily temperature, humidity, air pressure, wind speed, and whether it rained, and these features change gradually over time with obvious periodic and seasonal patterns.
  For this kind of time series data, neither the fully connected neural network nor the convolutional neural network can capture the sequential relationships well, because both are feed-forward networks: the model's output at one moment has no influence on its computation at the next. A recurrent neural network, by contrast, feeds its output at the previous moment back in as part of the input at the next moment, so it can learn the information carried by earlier time steps. This is why recurrent neural networks usually give good results on time series data.


1. RNN neural network

  The most classic and basic recurrent neural network is the RNN. This network essentially explains how recurrent networks operate, so to understand improved variants such as LSTM and GRU you must first understand the RNN.
[Figure: basic structure of the RNN]
  The figure above shows the basic structure of the RNN. To be honest, I was confused the first time I saw this structure; compared with the networks introduced earlier, the recurrent network is harder to grasp, precisely because this diagram looks too simple and it is hard to read the recurrence out of it. The diagram is explained below.
  $X_t$ is the input time series data; the data are fed in according to time, that is, the input is ordered.
  $h_t$ is the hidden layer output computed by the network from the data at each moment, so A is the hidden layer of the RNN. This output is combined with the input at the next moment and fed back into the network as a new input for the next computation;
  finally an output is produced, and the process repeats until the input at the last moment has been processed. Because the RNN computes the inputs step by step in this way, its computation is relatively slow.
  The operation described above may still not feel intuitive, so it helps to unroll the diagram into the form below. A special reminder: it is best not to read this picture globally, which easily leads to a misunderstanding. Instead, read it from left to right, one computing unit at a time. Only then does it become clear that the RNN is computed moment by moment, which is why earlier moments influence the output at later moments.
[Figure: the RNN unrolled over time]
  As shown in the figure, the input data contain features from time 0 to time t.
  First the features at time 0 are fed into the network to obtain the hidden layer output $h_0$. This $h_0$ is combined by a weighted sum with the input $X_1$ at time 1, so the input at time 1 already contains the information from time 0;
  the combined data are then fed into the network to obtain the hidden layer output $h_1$ at time 1, and $h_1$ is in turn combined with $X_2$, so the input at time 2 already contains the information from times 0 and 1;
  the combined data are again passed through the network to obtain the hidden layer output $h_2$ at time 2, and this process loops until time t, giving the hidden layer output $h_t$;
  in theory this $h_t$ is determined by the inputs from time 0 through t-1 together with the input at time t, so the RNN has a memory: what it remembers of earlier inputs affects the output at later moments.
  Let us now understand the RNN from its mathematical expression. The RNN is described by the following formulas:
$$\begin{gathered} O_t=g\left(V \cdot h_t\right) \\ h_t=f\left(U \cdot X_t+W \cdot h_{t-1}\right) \end{gathered}$$
Here $X_t$ is the input feature data at time t and $U$ is the corresponding weight; $h_{t-1}$ is the hidden layer output at the previous moment and $W$ is the corresponding weight; $h_t$ is the hidden layer output at the current moment and $V$ is the corresponding weight; $O_t$ is the output of the output layer at the current time t. Drawing these weights into the diagram gives the figure below:
[Figure: the RNN unit with the weights U, W, and V labeled]
  This figure has one more output layer than the RNN unit shown at the beginning. If we now remove the recurrent weight W on the right, the RNN becomes an ordinary fully connected neural network, as shown below.
[Figure: the RNN with the recurrent connection removed]
  Of course, drawn this way the fully connected network looks a bit twisted, because the diagram we usually see for a fully connected network is the one below.
[Figure: the usual drawing of a fully connected neural network]
  Therefore, adding the hidden layer output from the previous moment and its corresponding weight to this fully connected network diagram gives the figure below; the correspondence between the two figures should make things very clear.
[Figure: the fully connected network with the recurrent connection added back]
  The two figures above correspond to each other element by element.
[Figure: side-by-side comparison of the RNN and its fully connected counterpart]
  This comparison should make the RNN easier to understand. When beginners learn the RNN, they easily read the middle layer as containing only one neuron, because many drawings, in order to highlight the loop, hide the fact that the input is a feature vector (containing one or more values) and that the hidden layer may also contain one or more neurons; this is a common source of confusion.
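  To connect the formulas above with code, here is a minimal numpy sketch of the RNN forward pass; the sizes, the random weights, and the tanh/identity activations are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 2, 2

U = rng.standard_normal((n_hidden, n_in))      # input  -> hidden weights
W = rng.standard_normal((n_hidden, n_hidden))  # hidden -> hidden weights (the recurrence)
V = rng.standard_normal((n_out, n_hidden))     # hidden -> output weights

h = np.zeros(n_hidden)                         # initial hidden state
for x in [np.array([1.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, 2.0])]:
    h = np.tanh(U @ x + W @ h)                 # h_t = f(U·X_t + W·h_{t-1})
    o = V @ h                                  # O_t = g(V·h_t), with g the identity here
    print(o)
```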

1.1. Weight sharing

  The concept of weight sharing appears in recurrent neural networks; in deep learning it also appears in convolutional neural networks. In a convolutional layer the convolution kernel is the weight: the kernel slides over the feature map and is repeatedly applied to the values in it, so an entire feature map shares one set of weights. A fully connected network, by contrast, gives every input feature its own weight parameter, which greatly increases the amount of computation and also makes the model easier to overfit, so its generalization is weaker.
[Figure: the unrolled RNN with the shared weights W, V, and U]
  Look carefully at the recurrent network diagram above and you will find that the weights W, V, and U used at different moments are the same: these weights are shared across every time step. This is what allows the RNN to handle inputs of variable length; the data are fed into the network step by step in time order and the corresponding outputs are produced with the same parameters. As a result, a recurrent network has far fewer parameters than a comparable fully connected network, so updating them requires much less computation.
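  A minimal sketch of this point using PyTorch's built-in RNN (the layer sizes are assumptions): the parameter set is the same whether the sequence has 3 steps or 50, because the same weights are reused at every step.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=2, hidden_size=2, batch_first=True)
for name, p in rnn.named_parameters():
    print(name, tuple(p.shape))        # input-hidden and hidden-hidden weights plus biases: independent of sequence length

out_short, _ = rnn(torch.randn(1, 3, 2))   # batch of 1, 3 time steps, 2 features
out_long, _  = rnn(torch.randn(1, 50, 2))  # same network, 50 time steps
print(out_short.shape, out_long.shape)     # torch.Size([1, 3, 2]) torch.Size([1, 50, 2])
```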

1.2. Case

  Let's use a worked example to consolidate these points about the RNN.
  Assume an RNN consisting of an input layer, a hidden layer, and an output layer. The input data contain 3 time steps, and each time step contains 2 features, so the input layer has 2 neurons; the hidden layer has 2 neurons; and the network produces 2 outputs, so the output layer has 2 neurons as well, as shown below:
[Figure: the example RNN with 2 input, 2 hidden, and 2 output neurons]
  The input data of the network at the three time steps are:
$$\left[\begin{array}{l} 1 \\ 1 \end{array}\right]\left[\begin{array}{l} 1 \\ 1 \end{array}\right]\left[\begin{array}{l} 2 \\ 2 \end{array}\right]$$
To keep the calculation simple while still showing the character of the RNN, all weights are set to 1, there are no biases, and the activation functions are linear. When the first time step is processed there is no hidden layer output from a previous moment to use as input, so the previous hidden state is set to [0, 0]. The calculation for the first time step is shown below:
[Figure: calculation at the first time step]
  According to the RNN formulas, the values of the two hidden neurons are:
  First neuron: $1 \times 1+1 \times 1+0 \times 1+0 \times 1=2$
  Second neuron: $1 \times 1+1 \times 1+0 \times 1+0 \times 1=2$
  First output of the output layer: $2 \times 1+2 \times 1=4$
  Second output of the output layer: $2 \times 1+2 \times 1=4$
  The hidden state of the RNN is now updated to [2, 2], and the data of the second time step are fed in. The calculation is shown below:
[Figure: calculation at the second time step]
  Because the weights are shared across moments, the weights at this moment are unchanged and all equal 1. The values of the two hidden neurons are therefore:
  First neuron: $1 \times 1+1 \times 1+2 \times 1+2 \times 1=6$
  Second neuron: $1 \times 1+1 \times 1+2 \times 1+2 \times 1=6$
  First output of the output layer: $6 \times 1+6 \times 1=12$
  Second output of the output layer: $6 \times 1+6 \times 1=12$
  The hidden state is now updated to [6, 6], and the data of the third time step are fed in. The calculation is shown below:
[Figure: calculation at the third time step]
  The values of the two hidden neurons are:
  First neuron: $2 \times 1+2 \times 1+6 \times 1+6 \times 1=16$
  Second neuron: $2 \times 1+2 \times 1+6 \times 1+6 \times 1=16$
  First output of the output layer: $16 \times 1+16 \times 1=32$
  Second output of the output layer: $16 \times 1+16 \times 1=32$
  The outputs at the three moments are therefore:
$$\left[\begin{array}{l} 4 \\ 4 \end{array}\right]\left[\begin{array}{l} 12 \\ 12 \end{array}\right]\left[\begin{array}{l} 32 \\ 32 \end{array}\right]$$
  We have now gone through a complete RNN forward pass. Notice that the output at each moment depends heavily on the inputs at earlier moments: if we change the order of the input sequence, the results will be completely different. This is the defining characteristic of the RNN; it can process sequence data and is sensitive to the order of the sequence.
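  As a check, here is a minimal numpy sketch that reproduces the example above (all weights 1, no biases, linear activations):

```python
import numpy as np

U = np.ones((2, 2))   # input  -> hidden weights
W = np.ones((2, 2))   # hidden -> hidden weights
V = np.ones((2, 2))   # hidden -> output weights

h = np.zeros(2)       # hidden state before the first time step
for x in [np.array([1, 1]), np.array([1, 1]), np.array([2, 2])]:
    h = U @ x + W @ h          # linear activation: h_t = U·X_t + W·h_{t-1}
    o = V @ h                  # O_t = V·h_t
    print(h, o)
# prints [2. 2.] [4. 4.], then [6. 6.] [12. 12.], then [16. 16.] [32. 32.]
```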

1.3. Problems with the RNN

  Although in theory the RNN can learn useful information from any moment, when its parameters are updated by backpropagation it is prone to gradient explosion and gradient vanishing: the gradient values become too large or too small, and the parameters can no longer be updated properly.
  To understand the cause in more detail, we need to look at the gradients of the RNN parameters. Recall the mathematical model of the RNN:
$$\begin{gathered} O_t=g\left(V \cdot h_t\right) \\ h_t=f\left(U \cdot X_t+W \cdot h_{t-1}\right) \end{gathered}$$
We now use the parameter W to explain why the gradient of the RNN explodes or vanishes.
  Suppose there is an error $L_t$ and we want its derivative with respect to the parameter W. Backpropagation in the RNN is backpropagation through time (BPTT): the gradients of all time steps must be accumulated. Using the multivariate chain rule, the gradient is
$$\frac{\partial L_t}{\partial W}=\sum_{i=0}^{t} \frac{\partial L_t}{\partial O_t} \frac{\partial O_t}{\partial h_t} \frac{\partial h_t}{\partial h_i} \frac{\partial h_i}{\partial W}$$
All the partial derivatives here are relatively easy to compute except $\frac{\partial h_t}{\partial h_i}$, which the chain rule expands as
$$\frac{\partial h_t}{\partial h_i}=\frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{i+1}}{\partial h_i}=\prod_{k=i}^{t-1} \frac{\partial h_{k+1}}{\partial h_k}$$
Now note that
$$\frac{\partial h_k}{\partial h_{k-1}}=f^{\prime} \cdot W$$
where $f^{\prime}$ is the derivative of the activation function. Take sigmoid as an example: $f \in (0,1)$, so its derivative is $f^{\prime}=f(1-f) \in \left(0, \frac{1}{4}\right)$. When $W<4$ we have $\frac{\partial h_k}{\partial h_{k-1}}<1$, and after many multiplications $\frac{\partial L_t}{\partial W}$ approaches 0: the gradient vanishes. When $W>4$ we have $\frac{\partial h_k}{\partial h_{k-1}}>1$, and after many multiplications $\frac{\partial L_t}{\partial W}$ grows larger and larger: the gradient explodes.
  Note that the derivative of the error with respect to the parameter is a sum over time steps, $\frac{\partial L_t}{\partial W}=\sum_{i=0}^{t} \frac{\partial L_t}{\partial O_t} \frac{\partial O_t}{\partial h_t} \frac{\partial h_t}{\partial h_i} \frac{\partial h_i}{\partial W}$, so the total gradient does not vanish: the terms from nearby time steps always remain. But when those near-term gradients dominate, the model cannot establish long-range dependencies.
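  A minimal numeric sketch of the mechanism (the 0.25 below is the upper bound of the sigmoid derivative; the values of W and the number of steps are chosen only for illustration):

```python
import numpy as np

def long_range_factor(W, steps, f_prime=0.25):
    # product of (f' · W) over `steps` time steps, i.e. the factor in d h_t / d h_{t-steps}
    return np.prod(np.full(steps, f_prime * W))

print(long_range_factor(W=3.0, steps=50))   # about 6e-7: the gradient vanishes
print(long_range_factor(W=5.0, steps=50))   # about 7e+4: the gradient explodes
```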


2. LSTM neural network

  Because the RNN is prone to gradient vanishing or explosion when its parameters are updated, it cannot learn information from long sequences well; in effect the RNN has only short-term memory. Fortunately, through the efforts of researchers the RNN was improved and the LSTM network was invented. The LSTM alleviates the vanishing gradient problem to a certain extent and learns long sequences better than the RNN. Today, when people speak of recurrent neural networks they usually mean the LSTM. However, the RNN is the foundation of the LSTM: you must understand the RNN before learning the LSTM, otherwise the LSTM will be much harder to follow. Compared with the RNN, the LSTM only adds a few control-gate computations inside the network unit, so once those gates are figured out on top of the RNN, the LSTM is understood. To show that going from the RNN to the LSTM really is just a somewhat more complex unit with a few control gates, we can redraw the RNN diagram as follows:
[Figure: the RNN redrawn with its unit made explicit]
  The structure of the LSTM network, improved from the RNN, can then be drawn like this:
[Figure: the LSTM network structure]
  Comparing the two diagrams, you can see that apart from the computation inside the unit being a little more complicated, the LSTM is basically the same as the RNN. So it is enough to figure out the LSTM network unit; forward propagation and backpropagation over the input data and the final output then work just as in the RNN. This is why I stress once more that you should understand the principle and the whole operation of the RNN before learning the LSTM. First, the mathematical model of the LSTM is laid out below:
$$\begin{gathered} f_t=\sigma\left(W_f \cdot\left[h_{t-1}, X_t\right]+b_f\right) \\ i_t=\sigma\left(W_i \cdot\left[h_{t-1}, X_t\right]+b_i\right) \\ \tilde{C}_t=\tanh \left(W_C \cdot\left[h_{t-1}, X_t\right]+b_C\right) \\ C_t=f_t * C_{t-1}+i_t * \tilde{C}_t \\ o_t=\sigma\left(W_o \cdot\left[h_{t-1}, X_t\right]+b_o\right) \\ h_t=o_t * \tanh \left(C_t\right) \end{gathered}$$
These formulas are indeed more complicated than the RNN's, but don't be afraid; everything is a paper tiger, and analyzed carefully they are not as complex as they look. The LSTM unit is analyzed in detail below.
[Figure: the LSTM unit structure]
  The unit structure of the LSTM is shown in the figure above. The σ in the figure is the sigmoid activation function, which maps values into the interval between 0 and 1; this is what makes it possible to update or forget information. tanh is the tanh activation function.
  Because any number multiplied by 0 is 0, the corresponding information is removed; likewise, any number multiplied by 1 is unchanged, so that information is preserved completely. In other words, 1 means remember, 0 means forget, and a value between 0 and 1 means remember selectively. The guiding principle of the LSTM is therefore: memory capacity is limited, so remember what is important and forget what is irrelevant.
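  A tiny numpy sketch of this "gating by multiplication" idea (the numbers are made up for illustration):

```python
import numpy as np

memory = np.array([3.0, -2.0, 5.0])
gate   = np.array([1.0,  0.0, 0.5])   # remember / forget / partially remember
print(memory * gate)                   # [ 3.  -0.   2.5]
```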
  As for the symbols used in these diagrams: each black line carries an entire vector from the output of one node to the inputs of other nodes; the pink circles denote pointwise operations such as vector addition; the yellow boxes are learned neural network layers; lines that merge indicate that vectors are concatenated, and lines that fork indicate that the content is copied and sent to different places.
[Figure: legend of the diagram elements]
  The most important design in the LSTM is the cell-state conveyor belt, shown in the figure below:
[Figure: the cell-state conveyor belt]
  The LSTM relies on this conveyor belt to carry the useful information that has already been screened from earlier moments and to let it take part in the current computation, so the output at the current moment is shaped both by the filtered earlier information and by the current input. Compared with the RNN, which simply takes all previous information as input, this is an essential improvement.
  The LSTM has three kinds of gate structures, the forget gate, the input gate, and the output gate, which protect and control the cell state C. These three gates are introduced below.

2.1. Forget gate

  The first step in the LSTM is to decide which information to discard from the cell state. This decision is made by the structure known as the "forget gate". The forget gate reads the previous output and the current input, applies a sigmoid mapping, and outputs a vector whose components all lie between 0 and 1 (1 means keep completely, 0 means discard completely; again, remember the important and forget the irrelevant). This vector is multiplied with the cell state, keeping the useful information in the cell state and discarding the useless.
[Figure: the forget gate inside the LSTM unit]
  The part of the LSTM unit that computes the forget gate is highlighted in the figure. The formula is:
$$f_t=\sigma\left(W_f \cdot\left[h_{t-1}, X_t\right]+b_f\right)$$
If the formula is not easy to follow, the figure below illustrates the computation of the forget gate more concretely.
[Figure: computation of the forget gate]
  First the vectors $h_{t-1}$ and $X_t$ are concatenated into a new vector; then the parameter matrix $W_f$ is multiplied with this concatenated vector; the result is added to the bias $b_f$ and passed through the sigmoid activation function to obtain the vector $f_t$, each element of which lies between 0 and 1. The parameter $W_f$ must of course be learned from the training data by backpropagation.

2.2. Input gate

  The input gate involves two computations, given by the formulas below; the computation of the vector $i_t$ is very similar to the forget gate.
[Figure: the input gate inside the LSTM unit]
$$\begin{aligned} i_t &=\sigma\left(W_i \cdot\left[h_{t-1}, X_t\right]+b_i\right) \\ \tilde{C}_t &=\tanh \left(W_C \cdot\left[h_{t-1}, X_t\right]+b_C\right) \end{aligned}$$
  The computation of $i_t$ can be illustrated by the figure below:
[Figure: computation of the input gate vector $i_t$]
  First the vectors $h_{t-1}$ and $X_t$ are concatenated into a new vector; then the parameter matrix $W_i$ is multiplied with the concatenated vector; the result is added to the bias $b_i$ and passed through the sigmoid activation function to obtain the vector $i_t$, each element of which lies between 0 and 1. The parameter $W_i$ is learned from the training data by backpropagation. As you can see, this operation is almost the same as the forget gate; only the parameter matrix differs.
  The computation of $\tilde{C}_t$ can be illustrated by the figure below:
[Figure: computation of the candidate cell state $\tilde{C}_t$]
  First the vectors $h_{t-1}$ and $X_t$ are concatenated into a new vector; then the parameter matrix $W_C$ is multiplied with the concatenated vector; the result, plus the bias $b_C$, is passed through the tanh activation function to obtain the vector $\tilde{C}_t$, each element of which lies between -1 and 1. The parameter $W_C$ is learned from the training data by backpropagation. The difference from the forget gate is that the activation function here is tanh.

2.3. Update cell state

  At this point $f_t$, $i_t$, and $\tilde{C}_t$ have all been computed, and these vectors can be used to update the cell state $C_{t-1}$ on the conveyor belt.
[Figure: updating the cell state on the conveyor belt]
  The formula is given below. Note that the symbol "*" here denotes multiplication of corresponding elements, whereas the forget gate and input gate above used matrix multiplication:
$$C_t=f_t * C_{t-1}+i_t * \tilde{C}_t$$
The calculation can be illustrated by the figure below:
[Figure: computation of the new cell state $C_t$]
  First the vector $f_t$ is multiplied elementwise with the vector $C_{t-1}$; since the elements of $f_t$ computed by the forget gate all lie between 0 and 1, this selectively forgets elements of $C_{t-1}$ and yields a new vector. Then the vector $i_t$ computed by the input gate is multiplied elementwise with $\tilde{C}_t$, yielding another new vector. The two new vectors are summed, which updates the cell state on the conveyor belt to $C_t$; this cell state now contains the filtered earlier information together with the information of the current moment.
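  A minimal numpy sketch of this update, emphasizing that "*" is elementwise multiplication (the numbers are made up for illustration):

```python
import numpy as np

f_t     = np.array([0.9, 0.1])    # forget gate output, elements in (0, 1)
i_t     = np.array([0.5, 0.8])    # input gate output, elements in (0, 1)
C_prev  = np.array([2.0, -1.0])   # previous cell state C_{t-1}
C_tilde = np.array([0.3, 0.7])    # candidate cell state, elements in (-1, 1)

C_t = f_t * C_prev + i_t * C_tilde   # elementwise, not matrix, multiplication
print(C_t)                           # [1.95 0.46]
```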

2.4. Output gate

  After the cell state has been updated, the last step is to compute the output of the LSTM, which is produced by the output gate.
[Figure: the output gate inside the LSTM unit]
  The formulas for the output gate are given below; it clearly consists of two parts:
$$\begin{aligned} o_t&=\sigma\left(W_o \cdot\left[h_{t-1}, X_t\right]+b_o\right) \\ h_t&=o_t * \tanh \left(C_t\right) \end{aligned}$$
The first part computes the vector $o_t$; its computation can be illustrated by the figure below.
[Figure: computation of the output gate vector $o_t$]
  The computation of $o_t$ here is essentially the same as for the forget gate vector $f_t$ and the input gate vector $i_t$; only the parameter matrix differs. First the vectors $h_{t-1}$ and $X_t$ are concatenated into a new vector; then the parameter matrix $W_o$ is multiplied with the concatenated vector, the bias $b_o$ is added, and the result is passed through the sigmoid activation function to obtain the vector $o_t$, each element of which lies between 0 and 1. The parameter $W_o$ is learned from the training data by backpropagation.
  The last step, computing the LSTM output $h_t$, can be illustrated by the figure below:
[Figure: computation of the LSTM output $h_t$]
  First the cell state vector $C_t$ is passed through the tanh activation function, producing a new vector whose elements all lie in [-1, 1]. Then the vector $o_t$ is multiplied elementwise with this new vector to obtain the LSTM output $h_t$. As the figure shows, $h_t$ goes in two directions: one copy is the output of the LSTM at this moment, and the other becomes part of the input at the next moment.
  This completes the structure of the LSTM. From the analysis above we know that the LSTM has four parameter matrices, $W_f$, $W_i$, $W_C$, and $W_o$, together with their biases; all of them can be updated with gradient descent, eventually giving an LSTM model that can carry out the task of the corresponding scenario.
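  Putting the pieces together, here is a minimal numpy sketch of a single LSTM step following the formulas above; the sizes and the random initialization are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hidden = 2, 3

# Four parameter matrices acting on the concatenation [h_{t-1}, X_t], plus biases.
W_f, W_i, W_C, W_o = (rng.standard_normal((n_hidden, n_hidden + n_in)) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(n_hidden)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, X_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde     # update the conveyor belt (elementwise)
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(C_t)               # LSTM output
    return h_t, C_t

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x in [np.array([1.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, 2.0])]:
    h, C = lstm_step(x, h, C)
    print(h)
```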

2.5. How the LSTM alleviates the RNN's vanishing gradient problem

  In the RNN, the gradient vanishes because each factor in $\prod_{k=i}^{t-1} \frac{\partial h_{k+1}}{\partial h_k}$ is less than 1, so the product of many such factors approaches 0; this recursive product is the main reason the gradient disappears. As the improved version of the RNN, the LSTM alleviates this problem. In the mathematical model of the LSTM, the recursion appears between $C_t$ and $C_{t-1}$, so we examine the partial derivative $\frac{\partial C_t}{\partial C_{t-1}}$. Note that $f_t$, $i_t$, and $\tilde{C}_t$ are all composite functions of $C_{t-1}$ (through $h_{t-1}$), so the derivative expands as:
$$\begin{aligned} \frac{\partial C_t}{\partial C_{t-1}} &=\frac{\partial C_t}{\partial f_t} \frac{\partial f_t}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial C_{t-1}}+\frac{\partial C_t}{\partial i_t} \frac{\partial i_t}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial C_{t-1}} \\ &+\frac{\partial C_t}{\partial \tilde{C}_t} \frac{\partial \tilde{C}_t}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial C_{t-1}}+\frac{\partial C_t}{\partial C_{t-1}} \end{aligned}$$
Simplifying this expression gives:
$$\begin{aligned} \frac{\partial C_t}{\partial C_{t-1}} &=C_{t-1} \sigma^{\prime}(\cdot) W_f * o_{t-1} \tanh ^{\prime}\left(C_{t-1}\right) \\ &+\tilde{C}_t \sigma^{\prime}(\cdot) W_i * o_{t-1} \tanh ^{\prime}\left(C_{t-1}\right) \\ &+i_t \tanh ^{\prime}(\cdot) W_C * o_{t-1} \tanh ^{\prime}\left(C_{t-1}\right) \\ &+f_t \end{aligned}$$
One thing stands out in this result: $\frac{\partial C_t}{\partial C_{t-1}}$ is a sum, and one of its terms is $f_t$, the output of the forget gate, which is 1 when everything is kept and 0 when everything is discarded. Clearly, when $f_t$ is close to 1, $\frac{\partial C_t}{\partial C_{t-1}}$ does not approach 0 the way $\prod_{k=i}^{t-1} \frac{\partial h_{k+1}}{\partial h_k}$ does, so the vanishing gradient problem is alleviated. In fact the LSTM does not fundamentally solve vanishing gradients: it only solves them along the path from $C_{t-1}$ to $C_t$; other paths can still suffer from vanishing gradients.
  It should be emphasized that the LSTM does not keep every long-range gradient from dissipating; it only lets the gradients at positions carrying key temporal information flow all the way through. If the task depends heavily on historical information, $f_t$ will be close to 1 and the historical gradient information will not easily vanish; if $f_t$ is close to 0, the task does not depend on that history, and it does not matter if the gradient vanishes there.


Summary

  This article went from the basic unit of the RNN all the way to the LSTM. From the model's point of view, the biggest difference between them is that the LSTM adds several gate units and a conveyor belt, which control how important each piece of temporal information is, retaining the useful information and removing the useless. At the same time, the LSTM alleviates the RNN's vanishing gradient problem to a certain extent. But the RNN remains the foundation: in terms of how the model runs over a sequence, the RNN and the LSTM operate almost exactly the same way. So the LSTM should be learned on the basis of first learning and understanding the RNN.

Origin: blog.csdn.net/didiaopao/article/details/126483407