Recurrent Neural Networks (RNN & LSTM)

1. Recurrent Neural Networks

1.1 What is a Recurrent Neural Network?

A Recurrent Neural Network (RNN) is a neural network designed for processing sequence data; it can extract temporal and semantic information from the data. Thanks to this ability, RNN-based deep learning models have achieved breakthroughs in NLP problems such as speech recognition, language modeling, machine translation, and time series analysis.

Fully connected neural networks have a flaw when processing sequence data: in a stock forecasting problem, for example, the stock price depends not only on the current input but also on historical information, which a fully connected network cannot take into account.

1.2 Network structure of recurrent neural network

The figure below shows a typical RNN network structure.
[Figure: a typical RNN network structure]
In the figure, $U$ is the weight matrix from the input layer to the hidden layer, $W$ is the weight matrix of the hidden-to-hidden recurrent connections, and $V$ is the weight matrix from the hidden layer to the output layer. $x_t$ denotes the input at time $t$, $h_t$ the hidden layer vector at time $t$, and $o_t$ the output at time $t$.

$$h_t = f(U \cdot x_t + W \cdot h_{t-1} + b_h)$$
$$o_t = g(V \cdot h_t + b_o)$$

where $f(\cdot)$ and $g(\cdot)$ are activation functions.

  • The activation function $f(\cdot)$ used to compute the hidden layer vector $h_t$ is usually chosen to be tanh in RNNs; ReLU is also sometimes used;
  • The activation function $g(\cdot)$ used to compute the output $o_t$:
    • for a binary classification problem, we may use the sigmoid function;
    • for a k-class classification problem, we may use the softmax function.

As we can see, the hidden layer vector of an RNN at each time step is determined not only by the input at the current time step, but also by the hidden layer vector at the previous time step, which gives the RNN the ability to remember past information in the sequence.

One thing worth noting is that RNNs share the same weights across different time steps.
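As a minimal sketch of this forward pass (a NumPy toy example with made-up dimensions, not code from the original post), the loop below reuses the same $U$, $W$, $V$ and biases at every time step:

```python
import numpy as np

# Toy RNN forward pass: h_t = tanh(U·x_t + W·h_{t-1} + b_h),
# o_t = softmax(V·h_t + b_o). All dimensions here are hypothetical.
input_dim, hidden_dim, output_dim, T = 4, 8, 3, 5

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input  -> hidden
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden -> output
b_h, b_o = np.zeros(hidden_dim), np.zeros(output_dim)

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

xs = rng.normal(size=(T, input_dim))  # a toy input sequence x_1 .. x_T
h = np.zeros(hidden_dim)              # h_0

for x_t in xs:
    h = np.tanh(U @ x_t + W @ h + b_h)  # same U, W, b_h reused at every step
    o = softmax(V @ h + b_o)            # per-step output distribution
    print(o)
```

Only $x_t$ and $h_{t-1}$ change from step to step; the parameters are shared.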

1.3 Loss function

In RNN, we define the loss function as the standard logistic regression loss, i.e. the cross-entropy loss.

You can refer to the introduction to cross-entropy in this article.

The output of our neural network, $\widehat{y}^{(t)}$, is a probability value, while $y^{(t)}$ is a fixed label; for the correct (label) class we have $y^{(t)} = 1$, so the cross-entropy reduces to $-\log \widehat{y}^{(t)}$.

Loss function for a single word (a single time step) in the sequence data:
$$L^{(t)}(\widehat{y}^{(t)}, y^{(t)}) = -y^{(t)} \log \widehat{y}^{(t)} = -\log \widehat{y}^{(t)}$$
Overall loss for the sequence (the average of the per-step cross-entropy losses):
$$L(\widehat{y}, y) = \frac{1}{T} \sum_{t=1}^{T} L^{(t)}(\widehat{y}^{(t)}, y^{(t)})$$
We then compute the gradients of the model parameters by backpropagation and update the parameters with gradient descent.
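A small sketch of this averaging (hypothetical NumPy code; `probs[t]` stands for $\widehat{y}^{(t)}$ and `labels[t]` for the index of the true token $y^{(t)}$):

```python
import numpy as np

def sequence_loss(probs, labels):
    # Average per-step cross-entropy: mean over t of -log probs[t][labels[t]]
    per_step = -np.log(probs[np.arange(len(labels)), labels])
    return per_step.mean()

# toy example: 3 time steps, vocabulary of size 3
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])
print(sequence_loss(probs, labels))  # average of -log(0.7), -log(0.8), -log(0.4)
```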

For language models, a good model can predict the next token with high accuracy, for example the suggestions an input method shows as we type.
In the best case, the model estimates the label token perfectly, with probability 1;
in the worst case, the model predicts the label token with probability 0;
as a baseline, the model's prediction is a uniform distribution over all available tokens in the vocabulary.
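As a quick check of the baseline case: with a vocabulary of size $|V|$, a uniform prediction assigns probability $1/|V|$ to the label token, so the per-token cross-entropy is
$$-\log \frac{1}{|V|} = \log |V|,$$
which is about $9.21$ nats when $|V| = 10000$.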


1.4 Problems with Recurrent Neural Networks

For a sequence of length $T$, we iterate over the $T$ time steps and compute gradients through them, which produces a chain of $O(T)$ matrix multiplications during backpropagation. When $T$ is large, this can lead to numerical instability, e.g. exploding or vanishing gradients. As a result, an RNN mainly reflects recent inputs (a recency effect) and is not good at capturing long-term dependencies.
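A tiny illustration of why such a chain is unstable (a hypothetical NumPy sketch, using a fixed matrix as a stand-in for the per-step Jacobian):

```python
import numpy as np

# Repeatedly multiplying a vector by the same matrix over T steps scales it
# roughly like (largest singular value)^T: it collapses toward 0 or blows up.
T, hidden_dim = 100, 8

for scale in (0.5, 1.5):                 # contracting vs. expanding matrix
    W = scale * np.eye(hidden_dim)       # stand-in for the recurrent Jacobian
    g = np.ones(hidden_dim)              # stand-in for a backpropagated gradient
    for _ in range(T):
        g = W.T @ g
    print(scale, np.linalg.norm(g))      # vanishingly small norm, then an exploding one
```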


2. LSTM

The following content draws on reference [5], Detailed explanation of LSTM.

2.1 Network structure of LSTM

LSTM stands for Long Short-Term Memory. The motivation of this algorithm is to solve the long-term dependency problem of RNNs mentioned above. LSTM can do this because it introduces a gate mechanism that controls how features flow through the network and how they are forgotten. An LSTM is composed of a series of LSTM units (LSTM Unit), and its chain structure is shown in the figure below.
[Figure: the chain structure of LSTM units]
In the LSTM unit diagrams, each yellow box represents a neural network layer consisting of weights, biases, and an activation function; each pink circle represents an element-wise operation; arrows represent vector flow; merging arrows represent vector concatenation; and branching arrows represent vector copying.
[Figure: notation used in the LSTM unit diagram]

2.2 Interpretation of LSTM unit

The core of the LSTM is the conveyor-belt-like path through the LSTM unit (shown in the figure below). This path is generally called the cell state, and it runs through the entire LSTM chain from beginning to end.

Cell state formula:
$$C_t = f_t \cdot C_{t-1} + i_t \cdot \widetilde{C}_t$$

where,

  • $f_t$ is called the forget gate; it determines which features of $C_{t-1}$ are used to compute $C_t$. $f_t$ is a vector whose elements lie in the range [0, 1]. We usually use the sigmoid function as the activation function, because the output of the sigmoid function is a value between [0, 1].
  • $i_t$ is called the input gate; it controls which features of $\widetilde{C}_t$ are used to update $C_t$. Like $f_t$, it is a vector with elements between [0, 1], and we usually use the sigmoid function as the activation function.

Forget gate $f_t$ formula:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Input gate $i_t$ and cell state update value $\widetilde{C}_t$ formulas:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\widetilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

where,

  • $i_t$ is called the input gate; it controls which features of $\widetilde{C}_t$ are used to update $C_t$. Like $f_t$, it is a vector with elements between [0, 1], and we usually use the sigmoid function as the activation function.
  • $\widetilde{C}_t$ represents the cell state update value; its activation function is usually $\tanh$.

Hidden layer output $h_t$ formulas:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \cdot \tanh(C_t)$$

Here, $o_t$ denotes the output gate (be careful not to confuse it with the output $o_t$ of the RNN above); it is computed in the same way as $f_t$ and $i_t$.
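Putting the equations together, here is a minimal NumPy sketch of one LSTM unit step (hypothetical dimensions and random weights; the weight matrices act on the concatenation $[h_{t-1}, x_t]$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    # One LSTM unit step following the equations above.
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate
    i_t = sigmoid(W_i @ z + b_i)            # input gate
    C_tilde = np.tanh(W_C @ z + b_C)        # cell state update value
    C_t = f_t * C_prev + i_t * C_tilde      # new cell state
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(C_t)                # new hidden layer output
    return h_t, C_t

# toy usage with made-up sizes
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
def mk():
    return rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
params = (mk(), np.zeros(hidden_dim), mk(), np.zeros(hidden_dim),
          mk(), np.zeros(hidden_dim), mk(), np.zeros(hidden_dim))

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # run a short toy sequence
    h, C = lstm_step(x_t, h, C, params)
print(h)
```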





References:
[1] "Deep Learning"
[2] Andrew Ng's Deep Learning course
[3] "Hands-on Deep Learning"
[4] "The most detailed explanation of recurrent neural networks in history (RNN/LSTM/GRU)"
[5] "Detailed explanation of LSTM"
