Introduction to LSTM, GRU and other variants

1. Preface

This article also covers variants such as peephole connections and coupled gates.

This material counts as some of the most fundamental of the developments of recent years, but I found that I had forgotten most of it and that many details were no longer clear to me. I am therefore writing this post in the hope of explaining this content in a simpler and clearer way and thereby helping more readers; please leave a comment if you spot anything I have understood wrongly, and don't forget to like the post as encouragement. Readers with the time can read this paper, which explains the computation of RNN, LSTM and GRU in detail. Personally I think it is worth reading for anyone heading in an academic direction: it covers more than recurrent neural networks, the calculations are laid out in detail, and many methods are backed by experiments. The TensorFlow code implementation is also based on this paper.

2. Recurrent neural networks (RNN)

1. Neural network
Before starting with RNNs, let's briefly review the basic neural network. The figure below shows a simple example:
[Figure: a simple feedforward neural network with input, hidden and output layers]

  • Input Layer: the input layer. The input $(x_1, x_2, \dots, x_n)$ is the vector representation of one sample, for example the vector of a character, word or sentence.

  • Hidden Layer: the hidden layer. The circles in each layer are neurons; in general, a linear transformation of $(x_1, x_2, \dots, x_n)$ is computed and each neuron then applies a further non-linear transformation, i.e. every neuron has an activation function that transforms its input non-linearly.

  • Output Layer: the output layer, which turns the features extracted by the hidden layer into the final result. In neural networks the softmax classifier is commonly used: in essence, it takes a linear combination of the features extracted by the hidden layer and outputs values between 0 and 1; the larger the value, the higher the probability that the sample belongs to that class.
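
To make the three layers concrete, here is a minimal NumPy sketch of such a network; the layer sizes, the random initialization and the softmax helper are illustrative assumptions, not something fixed by the figure.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 5, 3     # assumed layer sizes

x = rng.normal(size=(n_in, 1))             # one sample as a column vector
W1 = rng.normal(size=(n_hidden, n_in))     # input -> hidden weights
b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(size=(n_out, n_hidden))    # hidden -> output weights
b2 = np.zeros((n_out, 1))

h = np.tanh(W1 @ x + b1)   # each hidden neuron: linear transform followed by a non-linearity
p = softmax(W2 @ h + b2)   # output layer: values in (0, 1) that sum to 1
print(p.ravel())
```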
2. Recurrent neural networks
Why is there an RNN at all? In the neural network described above, the hidden-layer features of each sample are computed independently of one another, but in many tasks the features of successive samples are related, so we need to take this dependence into account. The need arises mainly in NLP tasks, and the RNN therefore adds a 'memory' mechanism that brings the related information between sample features into the computation. An example of the RNN network structure is shown below:
[Figure: an unrolled recurrent neural network with Input, Hidden inputs, Bias node and Output]
Before getting into the details, here is an example of the input, so that readers who are new to this are not misled. Suppose the sample is 'I love you'. We turn each word into a word vector: $((x_1^1, x_1^2, x_1^3), (x_2^1, x_2^2, x_2^3), (x_3^1, x_3^2, x_3^3))$, corresponding to the three words 'I', 'love', 'you'; each word is a three-dimensional vector, and the input in the figure above is the three-dimensional vector of one of these words. The number of such vectors determines how many moments, or time steps, the RNN has. Incidentally, in practice the input at each step is a matrix: dimension 0, the number of rows, is your word-vector dimension, and dimension 1, the number of columns, is your batch_size.
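
As a tiny illustration of that layout (the numbers are just the toy dimensions from the example above, with a batch_size of 2 assumed):

```python
import numpy as np

embedding_dim, batch_size, timesteps = 3, 2, 3   # 'I love you' has 3 words of dimension 3

# input at one time step: rows = word-vector dimension, columns = batch_size
x_t = np.zeros((embedding_dim, batch_size))
print(x_t.shape)                                 # (3, 2)

# the full sequence is one such matrix per time step
sequence = [np.zeros((embedding_dim, batch_size)) for _ in range(timesteps)]
print(len(sequence))                             # 3 time steps
```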

  • Input: the input layer; in the example it is a three-dimensional vector, i.e. one feature vector of the sample. This is different from the traditional neural network above: here the input is a single feature vector, whereas above it was the vector of a whole sample.

  • Hidden inputs: the hidden layer, which plays the same role as the hidden layer of an ordinary neural network.

  • Bias node: the bias term (the intercept term of the linear functions learned in school), added when computing the hidden layer and the output layer.

  • Output: the output layer, the same as the output layer of the neural network described above.

To explain the RNN forward and backward propagation more clearly, the parameters are described in detail below:

  • $x_t$: the input at time t, a vector.
  • $h_t$: the hidden state at time t, i.e. the output of each neuron of the Hidden inputs layer in the figure above.
  • $O_t$: the output at time t, i.e. the probability output of the Output layer; note that it is not the label value.
  • U: the weights between the input layer and the hidden layer. They map our raw input into the input of the hidden layer, the grey dashed part of the figure above.
  • V: the weights between the hidden layer and the output layer. They map the representation learned by the hidden layer to the final output, the red dashed part of the figure above.
  • W: the weights between the hidden layer and itself across time steps, represented by W throughout, the red solid part of the figure above.
To make the rest of the article easier to follow and give you a picture to hold in mind, I drew a simple diagram:
[Figure: RNN with input x_t, hidden state h_t, output O_t and weights U, V, W]
The next picture is a simplified version of the one above; the structure is identical, so don't let it confuse you.
By the way, the circle in the figure below is what RNNs call a cell, and it is the entire hidden layer. Readers new to RNNs, please keep this in mind.
[Figure: simplified RNN cell diagram]

Next we carry out forward propagation

At t=1, U, V and W are all initialized randomly and $h_0$ is usually initialized to 0; then the following is computed:

  • $h_1 = f(Ux_1 + Wh_0 + b_1)$ (the output of the hidden layer; $f$ is the activation function)
  • $O_1 = g(Vh_1 + b_2)$ (the final output; $g$ is a function such as softmax)

At t=2, $h_1$ serves as the memory of time step 1 and is used for the next prediction. The computation is:

  • $h_2 = f(Ux_2 + Wh_1 + b_1)$
  • $O_2 = g(Vh_2 + b_2)$

By analogy, we can get:

  • $h_t = f(Ux_t + Wh_{t-1} + b_1)$
  • $O_t = g(Vh_t + b_2)$
By the way, after checking the TensorFlow source code I found the following, which suggests that in most basic RNNs the memory information is simply the output of the hidden layer. Since I could not locate the original RNN paper, I cannot confirm whether the original RNN was defined exactly this way; the source code also shows that RNNs have been improved quite a bit, for example dropout applied to the input or to the hidden-layer neurons. Those who are interested can read the TensorFlow source code; the RNN part is very well encapsulated, so understanding the principle is enough.
[Figure: excerpt from the TensorFlow RNN source code]

Here $f$ can be an activation function such as tanh, ReLU or the logistic function, and $g$ is usually softmax or something similar. We say that the RNN has memory, and this ability comes from summarizing the previous input states through W and feeding the summary in as an auxiliary input to the next step. The hidden state can be understood as $h = f(\text{current input} + \text{summary of previous memory})$.
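
As a minimal sketch of this forward pass in NumPy (the dimensions, the random initialization and the softmax helper are my own illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, timesteps = 3, 5, 2, 4   # assumed sizes

U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))     # input -> hidden
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))    # hidden -> hidden, shared over time
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))    # hidden -> output
b1 = np.zeros((hidden_dim, 1))
b2 = np.zeros((output_dim, 1))

xs = [rng.normal(size=(input_dim, 1)) for _ in range(timesteps)]  # x_1 .. x_t
h = np.zeros((hidden_dim, 1))                                     # h_0 initialized to 0
outputs = []
for x_t in xs:
    h = np.tanh(U @ x_t + W @ h + b1)   # h_t = f(U x_t + W h_{t-1} + b_1)
    O_t = softmax(V @ h + b2)           # O_t = g(V h_t + b_2)
    outputs.append(O_t)

print(len(outputs), outputs[0].ravel())
```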

Next we perform back propagation (BPTT)

In backpropagation we differentiate the sum of the output-layer errors with respect to each weight to obtain the gradients $\nabla U, \nabla V, \nabla W$, and then update each parameter with gradient descent. Readers not familiar with gradient descent can refer to my earlier blog post on the principle of the gradient descent algorithm and its computation. The output $O_t$ at each moment produces an error $e_t$; the loss function for this error can be cross-entropy, squared error and so on. The total error is then $E = \sum_1^t e_t$. Our ultimate goal is to compute:

  1. $\nabla U = \cfrac{\partial E}{\partial U} = \sum_1^t \cfrac{\partial e_t}{\partial U}$
  2. $\nabla V = \cfrac{\partial E}{\partial V} = \sum_1^t \cfrac{\partial e_t}{\partial V}$
  3. $\nabla W = \cfrac{\partial E}{\partial W} = \sum_1^t \cfrac{\partial e_t}{\partial W}$
For ease of understanding, we use the squared loss as the loss function: $L(\theta) = \cfrac12(\hat y - y)^2$.
The formulas may make your head spin, but the derivation below may be the simplest you will see: as long as you understand the chain rule and gradient descent, even a primary-school pupil can follow it; corrections are welcome. Here is an example of the chain rule: let $f(x)=2x+1$ and $g(u)=u^2$; for the composite function $g(f(x))=(2x+1)^2$ the derivative is $\cfrac{\partial g}{\partial x}=\cfrac{\partial g}{\partial u}\cdot\cfrac{\partial u}{\partial x}=2u\cdot 2=4u=8x+4$. Now we can start BPTT.
    First, we summarize all the formulas:
$s_t = Ux_t + Wh_{t-1} + b_1$: this splits $h_1 = f(Ux_1 + Wh_0 + b_1)$ into two parts so that the later formulas are easier to write.
$h_t = f(s_t)$: $f$ is the activation function.
$o_t = g(Vh_t + b_2)$
$e_t = \cfrac12(o_t - y)^2$: the error at each moment.
$E = \sum_1^t e_t$: the total error.
The parameters are then updated by gradient descent; if it is not fresh in your mind, review the gradient descent algorithm.
Because the parameters of an RNN are shared, no matter how many moments there are, there are only five parameters: $U, V, W, b_1, b_2$. To update the parameters by backpropagating the error, the key is to compute the gradient of our loss function $E=\sum_1^t e_t$. Let the gradient be $\Delta$; then:
$\Delta = \left<\dfrac{\partial E}{\partial U}, \dfrac{\partial E}{\partial V}, \dfrac{\partial E}{\partial W}, \dfrac{\partial E}{\partial b_1}, \dfrac{\partial E}{\partial b_2}\right>$
Using the chain rule, let's find each partial derivative:
$\dfrac{\partial E}{\partial U}=\sum_1^t\dfrac{\partial e_t}{\partial U}=\sum_1^t\dfrac{\partial e_t}{\partial (o_t-y)}\cdot\dfrac{\partial (o_t-y)}{\partial o_t}\cdot\dfrac{\partial o_t}{\partial (Vh_t+b_2)}\cdot\dfrac{\partial (Vh_t+b_2)}{\partial h_t}\cdot\dfrac{\partial h_t}{\partial s_t}\cdot\dfrac{\partial s_t}{\partial U}=\sum_1^t(o_t-y)\cdot 1\cdot g'(Vh_t+b_2)\cdot V\cdot f'(s_t)\cdot x_t$
The remaining terms are left to the diligent reader; proceeding in the same way gives the value of $\Delta$, after which we backpropagate and keep updating the parameters so as to minimize the loss. In practice these are all matrix computations, which take some effort to work through; you can go deeper if you are interested. The BP algorithm is important, but it is already encapsulated in the frameworks, so we do not need to implement the BP process ourselves.
Once you have mastered the chain rule and the gradient descent algorithm, you have mastered not only BPTT: it is also not hard to find the root cause of exploding and vanishing gradients. Our parameter updates move in the negative gradient direction with a given learning rate, and if the gradient keeps being multiplied by numbers smaller than 1 it vanishes, while if it keeps being multiplied by very large numbers it explodes. To deal with exploding gradients, the gradient is given a threshold C: values greater than C or less than -C are set to C or -C. I may write a separate post on exploding and vanishing gradients later. Next we step through the door of LSTM.
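
A minimal NumPy sketch of that clipping idea (the threshold value and the gradient matrix are made-up illustrations):

```python
import numpy as np

C = 5.0                                          # assumed clipping threshold
grad_W = np.array([[12.0, -0.3],                 # a made-up 'exploding' gradient
                   [-9.0,  4.2]])

clipped = np.clip(grad_W, -C, C)                 # entries above C or below -C are set to +/-C
print(clipped)
# [[ 5.  -0.3]
#  [-5.   4.2]]
```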
One more remark on why the BP process of an RNN is called BPTT, short for Backpropagation Through Time: the difference is that the output at each step depends not only on the network at the current step but also on the network states of the preceding steps, hence the name.

3. Detailed explanation of the LSTM principle

Readers who already know something about LSTM will know that it appeared in order to alleviate one of the biggest shortcomings of the RNN: the long-term dependency problem. As the sequence length increases, earlier information gets lost, so let's first look at why the RNN has this problem. We use the following computation to see where the problem lies; for simplicity the bias term is ignored for now.
$h_0 = 0$
$h_1 = f(Ux_1 + Wh_0)$
$h_2 = f(Ux_2 + Wh_1)$
$h_3 = f(Ux_3 + Wh_2)$
$\cdots$
$h_t = f(Ux_t + Wh_{t-1})$
From this simple chain we can see that the transmission of memory information depends on the parameter $W$. As time goes on, $W$ is multiplied in again and again: if $W$ is too small, the earlier information is lost; if $W$ is too large, the weight of the earlier information outweighs the input at the current moment. Based on this, LSTM improves on the traditional RNN mainly by adding a cell state mechanism: for the final output, the important parts of the earlier memory keep being passed on, while the unimportant parts are cut off. Next, let's analyze the structure of LSTM (Figure 1).
This picture is widely available online. Although many people say it is actually rather unfriendly to newcomers, I still do not know what tool can draw such a picture; let me know if you do. I will explain every detail of this picture and the common misunderstandings around it in great depth, to make up for what the original picture lacks.

3.1 Detailed explanation of LSTM structure

[Figure: a single LSTM cell]
This is one entire hidden layer of the RNN. Remember: it is the whole hidden layer; many newcomers mistake it for a single neuron.
[Figure: legend of the symbols in the LSTM diagram]
From left to right these are:
1. A neural network layer, equivalent to a fully connected layer containing many units; this corresponds to the units parameter in LSTM code.
2. An element-wise operation, multiplication or addition; the multiplication here is ordinary element-wise multiplication of corresponding entries.
3. The transfer direction of a vector, indicating where the data flows next.
4. Concat, i.e. concatenating two vectors into one longer vector (for example a 10×1 input and a 20×1 hidden state become a 30×1 vector).
5. Copy, i.e. both destinations receive the same data.
[Figure: the three gates of the LSTM highlighted]
The three parts framed in the figure above are the familiar gates, which are also what newcomers to LSTM most easily confuse. Next we analyze how these three gates work and understand the principle in depth. First, a useful way to think about it: in an LSTM, $h_t$ can be regarded as the memory at the current moment, while $C_t$ accumulates the memory of the current moment together with earlier moments; this is why it is called a long short-term memory network. From left to right, the gates are:
[Figure: the forget gate]
1. Forget gate:
First, the data flow of the first box. The previous hidden-layer output $h_{t-1}$ and the current input $x_t$ are concatenated and fed into the yellow $\sigma$ neural network layer; after the fully connected layer and the sigmoid activation, it outputs a value between 0 and 1, $f_t = \sigma(W_f\cdot[h_{t-1}, x_t] + b_f)$, which describes how much of each component may pass: 0 lets nothing through, 1 lets everything through. This output is then multiplied element-wise with the cell state of the previous moment, $C_{t-1}$, and the result is passed to the right into the next box. That is the verbal description, and thoughtful readers should already be able to follow it. Next, let's go through a concrete example of the computation in this step so that everyone fully understands the forget gate.
Note: for convenience of writing and understanding, the data in the following computation is described by its matrix shape. Let the input dimension be 10 and the batch size be 1, meaning only one sequence is fed in at a time, and let units=20, i.e. the fully connected layer has 20 neurons. The shapes are then as follows:
$x_t$: $10\times1$, the shape depends on the dimension of your word vectors.
$h_{t-1}$: $20\times1$, the shape depends on the number of neurons in the hidden layer.
$[h_{t-1}, x_t]$: $30\times1$
$W_f\cdot[h_{t-1}, x_t]$: $(20\times30)\cdot(30\times1) = 20\times1$
$W_f\cdot[h_{t-1}, x_t] + b_f$: $20\times1 + 20\times1 = 20\times1$
$\sigma(W_f\cdot[h_{t-1}, x_t] + b_f)$: $20\times1$
$f_t \times C_{t-1}$: $(20\times1)\times(20\times1) = 20\times1$ (element-wise)
A reminder: in the TensorFlow implementation the weights for $h_{t-1}$ and $x_t$ are kept separate and can use different initialization and regularization methods. This article focuses on describing the principle, so it may differ from the actual code; interested readers can check the source directly.
At this point I believe everyone can fully understand the computation of the forget gate. Now let's think about why the cell state C can decide which memories to keep and which to forget. A simple example; suppose:
$C_{t-1} = [0.8, 1.2, 3.2]$
$f_t = [0, 0.5, 1]$
Then $f_t \times C_{t-1} = [0, 0.6, 3.2]$: the information at the position where $f_t$ is 0 is discarded, the position with 0.5 keeps only half, and the position with 1 is kept entirely and passed on. In summary, this gate lets us decide which information in the cell state C to discard, which is why it is called the forget gate.
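
Here is a small NumPy sketch of the forget-gate computation with exactly the shapes used above; the random weights are placeholders, and as noted above the real TensorFlow implementation keeps separate weights for x and h:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, units = 10, 20                      # as in the text: input dimension 10, units=20

x_t    = rng.normal(size=(input_dim, 1))       # 10 x 1
h_prev = rng.normal(size=(units, 1))           # 20 x 1 previous hidden state
C_prev = rng.normal(size=(units, 1))           # 20 x 1 previous cell state

concat = np.vstack([h_prev, x_t])              # [h_{t-1}, x_t]: 30 x 1
W_f = rng.normal(scale=0.1, size=(units, units + input_dim))   # 20 x 30
b_f = np.zeros((units, 1))                     # 20 x 1

f_t = sigmoid(W_f @ concat + b_f)              # 20 x 1, every entry in (0, 1)
kept = f_t * C_prev                            # element-wise: how much of each memory survives
print(f_t.shape, kept.shape)                   # (20, 1) (20, 1)
```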
[Figure: the input gate]
2. Input gate (information-adding gate):
With the forget gate as a foundation, the input gate follows naturally. As the name suggests, the forget gate decides which information should be discarded, and the input gate then determines which new information should be added to the cell state C; that is the role of this gate.
The formula is as follows:
$i_t = \sigma(W_i\cdot[h_{t-1}, x_t] + b_i)$
$\tilde C_t = \tanh(W_c\cdot[h_{t-1}, x_t] + b_c)$
$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde C_t$: note that the range of values of C can therefore be fairly large.
The sigmoid layer decides which values need to be updated; the tanh layer creates a new candidate vector $\tilde C_t$; then $i_t \times \tilde C_t$ gives the information to be added at the current moment, which is added to $f_t \cdot C_{t-1}$ to obtain $C_t$.
To sum up, these two gates together determine what is removed from and what is added to the transmitted information, i.e. how the cell state is updated. A small sketch of this step follows below.
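
A hedged sketch of the input gate and the cell-state update, continuing the same toy setup (weights are again random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
input_dim, units = 10, 20

x_t    = rng.normal(size=(input_dim, 1))
h_prev = rng.normal(size=(units, 1))
C_prev = rng.normal(size=(units, 1))
concat = np.vstack([h_prev, x_t])                               # 30 x 1

W_f, W_i, W_c = (rng.normal(scale=0.1, size=(units, units + input_dim)) for _ in range(3))
b_f = b_i = b_c = np.zeros((units, 1))

f_t       = sigmoid(W_f @ concat + b_f)        # forget gate: what to discard
i_t       = sigmoid(W_i @ concat + b_i)        # input gate: what to update
C_tilde_t = np.tanh(W_c @ concat + b_c)        # candidate values to add
C_t = f_t * C_prev + i_t * C_tilde_t           # C_t = f_t * C_{t-1} + i_t * C~_t
print(C_t.shape)                               # (20, 1)
```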
[Figure: the output gate]
3. Output gate
This part produces the output based on the cell state.
The formula is as follows:
$o_t = \sigma(W_o\cdot[h_{t-1}, x_t] + b_o)$
$h_t = o_t \times \tanh(C_t)$
First, a sigmoid layer decides which part of the cell state will be output; then the cell state is passed through tanh to obtain values between -1 and 1, which are multiplied by the sigmoid output to give the current output $h_t$. This concludes the explanation of the LSTM structure; next let's look at LSTM's forward and backward propagation.
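
And a matching sketch of the output gate, reusing the toy shapes (again an illustration, not the exact framework implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
input_dim, units = 10, 20

x_t    = rng.normal(size=(input_dim, 1))
h_prev = rng.normal(size=(units, 1))
C_t    = rng.normal(size=(units, 1))           # stand-in for the freshly updated cell state
concat = np.vstack([h_prev, x_t])

W_o = rng.normal(scale=0.1, size=(units, units + input_dim))
b_o = np.zeros((units, 1))

o_t = sigmoid(W_o @ concat + b_o)              # which parts of the cell state to expose
h_t = o_t * np.tanh(C_t)                       # h_t = o_t * tanh(C_t)
print(h_t.shape)                               # (20, 1)
```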

3.2 BPTT of LSTM

Note: honestly, once you understand the structure of LSTM I don't think mastering the BPTT process is that important, since the frameworks already implement it and we don't need to write it ourselves. This part is included for the completeness of the article, and a description of LSTM's BPTT also helps in understanding its structure. LSTM's BPTT is similar to the RNN's and not complicated; the derivation below will not strain your brain at all.
1. Forward propagation
Forward propagation is simply the combination of the gate formulas above:
$f_t = \sigma(W_f\cdot[h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i\cdot[h_{t-1}, x_t] + b_i)$
$\tilde C_t = \tanh(W_c\cdot[h_{t-1}, x_t] + b_c)$
$o_t = \sigma(W_o\cdot[h_{t-1}, x_t] + b_o)$
$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde C_t$
$h_t = o_t \times \tanh(C_t)$
$\hat y_t = S(W_y\cdot h_t + b_y)$: the probability output at each moment, where S is a classifier such as softmax.
$e_t = \cfrac12(\hat y_t - y)^2$
$E = \sum_1^t e_t$
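
Putting the formulas together, here is a sketch of one full LSTM forward step as a single function; the shapes, the random initialization and the softmax output follow the simplified setup of this article rather than any particular framework:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def lstm_step(x_t, h_prev, C_prev, params):
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o, W_y, b_y = params
    concat = np.vstack([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)          # forget gate
    i_t = sigmoid(W_i @ concat + b_i)          # input gate
    C_tilde = np.tanh(W_c @ concat + b_c)      # candidate cell values
    o_t = sigmoid(W_o @ concat + b_o)          # output gate
    C_t = f_t * C_prev + i_t * C_tilde         # new cell state
    h_t = o_t * np.tanh(C_t)                   # new hidden state
    y_hat = softmax(W_y @ h_t + b_y)           # per-step prediction
    return h_t, C_t, y_hat

rng = np.random.default_rng(0)
input_dim, units, n_classes = 10, 20, 3
def dense(rows, cols): return rng.normal(scale=0.1, size=(rows, cols))
params = (dense(units, units + input_dim), np.zeros((units, 1)),
          dense(units, units + input_dim), np.zeros((units, 1)),
          dense(units, units + input_dim), np.zeros((units, 1)),
          dense(units, units + input_dim), np.zeros((units, 1)),
          dense(n_classes, units), np.zeros((n_classes, 1)))

h, C = np.zeros((units, 1)), np.zeros((units, 1))
for x_t in [rng.normal(size=(input_dim, 1)) for _ in range(4)]:
    h, C, y_hat = lstm_step(x_t, h, C, params)
print(y_hat.ravel())                           # probabilities at the last step
```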
2. Backward propagation
In fact we already went through this derivation once in the RNN part and the principle is the same. The LSTM has nine parameters in total: $W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o, b_y$. I should point out that some articles count eight sets of parameters; as mentioned earlier, in the actual code implementation the W weights for $x_t$ and $h_{t-1}$ are kept separate. The formulas are as follows:
[Figure: the full set of LSTM formulas]
The only thing I have added is the output at each moment; in code you can set the relevant parameters to decide whether to obtain the output at every time step (see the Keras sketch at the end of this subsection).
Backpropagation obtains the gradients from the loss function E and then updates the parameters by gradient descent. The process is simple but a bit tiring to write out; as long as you have read the above carefully and master the chain rule and gradient descent, you can do the derivation yourself without pressure, so I leave it to those who need it. Next we move on to another important RNN variant, the GRU.
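
For completeness, a minimal TensorFlow/Keras usage sketch of the output-per-moment option mentioned above (this assumes TensorFlow 2.x, and note that tf.keras.layers.LSTM by default expects inputs shaped (batch, timesteps, features), unlike the column-vector convention used in the shape walk-through earlier):

```python
import numpy as np
import tensorflow as tf

batch, timesteps, features, units = 2, 5, 10, 20
x = np.random.randn(batch, timesteps, features).astype("float32")

# return_sequences=True returns the output h_t at every time step;
# return_state=True additionally returns the final h_t and C_t
lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
seq_out, last_h, last_c = lstm(x)

print(seq_out.shape)   # (2, 5, 20): one 20-dimensional output per time step
print(last_h.shape)    # (2, 20)
print(last_c.shape)    # (2, 20)
```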

4. GRU structure and principle

Before starting with GRU, a note: based on the structure of LSTM we can make changes that suit the actual situation of our own task, so there are in fact many variants. Below we list three well-known variants and describe one of them, the GRU, in detail.
The first variant:
[Figure: LSTM with peephole connections]
This structure adds peephole connections, so that each gate also receives the cell state C as an input.
The second variant:
[Figure: LSTM with coupled forget and input gates]
This variant couples the forget gate and the input gate (the first and second gates): instead of deciding separately what to forget and what new information to add, the two decisions are made together. The reason this works is that the first gate controls which information is forgotten and the second decides which information is added; their effects are exactly complementary, so the two can be merged, as in the formula below.
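As a hedged note on what this coupling usually looks like (this is the commonly cited formulation of the coupled variant; the formula itself is not spelled out in the figure above): the cell-state update becomes $C_t = f_t \cdot C_{t-1} + (1 - f_t)\cdot \tilde C_t$, so whatever is forgotten is replaced, in exactly that proportion, by new candidate information.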
The third variant: the Gated Recurrent Unit (GRU)
[Figure: GRU cell]
This structure was proposed in 2014 and can be seen as a combination of the previous two. It does away with the LSTM's separate cell state: the forget gate and the input gate are merged into a single update gate, and the cell state and the hidden state are combined, making it simpler than the LSTM structure.
The formulas are as follows:
$r_t = \sigma(W_r\cdot[h_{t-1}, x_t])$
$z_t = \sigma(W_z\cdot[h_{t-1}, x_t])$
$\hat h_t = \tanh(W\cdot[r_t * h_{t-1}, x_t])$
$h_t = (1 - z_t) * h_{t-1} + z_t * \hat h_t$
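
A minimal NumPy sketch of one GRU step following these formulas (shapes and random weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, units = 10, 20

x_t    = rng.normal(size=(input_dim, 1))
h_prev = rng.normal(size=(units, 1))

W_r, W_z, W = (rng.normal(scale=0.1, size=(units, units + input_dim)) for _ in range(3))

concat = np.vstack([h_prev, x_t])                  # [h_{t-1}, x_t]
r_t = sigmoid(W_r @ concat)                        # reset gate
z_t = sigmoid(W_z @ concat)                        # update gate
concat_reset = np.vstack([r_t * h_prev, x_t])      # [r_t * h_{t-1}, x_t]
h_hat = np.tanh(W @ concat_reset)                  # candidate hidden state
h_t = (1 - z_t) * h_prev + z_t * h_hat             # blend old state and candidate
print(h_t.shape)                                   # (20, 1)
```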
With the understanding built up above, the GRU is very easy to grasp: however much the variants change, they cannot escape our eyes; one glance lays them bare, hahaha. On this basis we can also design our own network structures that better fit our actual business scenarios and get surprisingly good results.

5. Closing remarks

That's it for the RNN material. I wrote it to record what I have learned, to help more new readers, and also to exchange ideas with more people and find out where my understanding is inaccurate. If you think it is well written, remember to like it. By the way, here is another Google paper; the LSTM in that paper differs from our common structure, and readers with the time can take a look as well. The address of the paper.

Origin blog.csdn.net/u010451780/article/details/111191691