0. Preface
In keeping with convention, a disclaimer first: this article reflects only my own understanding. Although I have drawn on other people's valuable insights, the content may contain inaccuracies. If you find mistakes in the text, please point them out so that we can improve together.
The purpose of this article is to deepen understanding of the LSTM (long short-term memory) neural network model by building an LSTM module from scratch and applying it to an example problem.
Compared with other common neural network models (FNN, CNN, GAN), the RNN has the most complicated underlying mathematics, and the LSTM, as one of the improved variants of the RNN, raises that complexity another level. It is therefore worth studying the LSTM algorithm and its code implementation carefully, both to master LSTM and to be able to innovate at the algorithm level.
0.1 Required background before reading this article
- This article is a companion to "Build an RNN module based on NumPy and implement an example application (with code)". If you are not familiar with how the underlying RNN algorithm is implemented, it is strongly recommended to study the RNN material first; otherwise this article will be hard to follow.
- "Algorithm introduction and mathematical derivation of LSTM (long short-term memory) network" covers the underlying mathematics of LSTM (if the derivation is hard to follow, knowing the results is enough). This article focuses on building LSTM from scratch with NumPy, so the formulas for LSTM forward and backward propagation are only touched on briefly.
1. LSTM architecture
In fact, there are many introductory articles on this topic on CSDN, and colah's famous blog post Understanding LSTM Networks already explains the LSTM architecture very clearly. But I ran into some practical problems when coding, so I will go through this part carefully:
The schematic diagram above illustrates the forward propagation of the LSTM network from time 0 to time n. Pay attention to the order of the variables at each time step; the code that follows must strictly follow this order.
If you look carefully, you may have noticed that the cell state output at the last time step, $C_{n+1}$, is also marked. As explained later, this is because computing the partial derivative of the loss $E$ with respect to $C_n$ in backpropagation requires the partial derivative of $E$ with respect to $C_{n+1}$, which is propagated backward iteratively.
For parameter passing inside the LSTM, refer to the schematic diagram below.
2. LSTM forward propagation code implementation
Forward propagation is not difficult as long as you strictly follow the schematic diagram above.
2.1 Hidden layer forward propagation
Each gate at time $t$ is:
- Forget gate: $f_t = \sigma(w_f \cdot x_t + v_f \cdot h_{t-1} + b_f)$
- Input gate: $i_t = \sigma(w_i \cdot x_t + v_i \cdot h_{t-1} + b_i)$
- New memory gate: $g_t = \tanh(w_g \cdot x_t + v_g \cdot h_{t-1} + b_g)$
- Output gate: $o_t = \sigma(w_o \cdot x_t + v_o \cdot h_{t-1} + b_o)$
The cell state $C_t$ at time $t$ is:
$C_t = f_t \odot C_{t-1} + i_t \odot g_t$
The hidden layer output $h_t$ at time $t$ is:
$h_t = o_t \odot \tanh(C_t)$
Code:
def forward(self, x, h_pre, c_pre):  # h_pre is h_{t-1}, c_pre is C_{t-1}
    # Gate activations, one line per gate, following the formulas above
    self.Fgate = sigmoid(np.dot(self.w_f, x) + np.dot(self.v_f, h_pre) + self.b_f)
    self.Igate = sigmoid(np.dot(self.w_i, x) + np.dot(self.v_i, h_pre) + self.b_i)
    self.Ggate = np.tanh(np.dot(self.w_g, x) + np.dot(self.v_g, h_pre) + self.b_g)
    self.Ogate = sigmoid(np.dot(self.w_o, x) + np.dot(self.v_o, h_pre) + self.b_o)
    c_cur = self.Fgate * c_pre + self.Igate * self.Ggate  # c_cur is C_t
    h_cur = self.Ogate * np.tanh(c_cur)                   # h_cur is h_t
    return h_cur, c_cur
The four gate computations could be merged into fewer lines by stacking the weights into one multidimensional array. They are written out separately here to show each gate more clearly.
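The idea of merging the gate computations can be sketched as follows. This is a minimal illustration and not the code used in this article: the function name `fused_forward`, the stacked arrays `W`, `V`, `b`, and the gate order f, i, g, o are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fused_forward(W, V, b, x, h_pre, c_pre):
    """One LSTM step with the four gates stacked in the (assumed) order f, i, g, o.

    W: (4*hidden, input), V: (4*hidden, hidden), b: (4*hidden, 1).
    """
    hidden = h_pre.shape[0]
    z = np.dot(W, x) + np.dot(V, h_pre) + b   # all four pre-activations at once
    f = sigmoid(z[0 * hidden:1 * hidden])     # forget gate
    i = sigmoid(z[1 * hidden:2 * hidden])     # input gate
    g = np.tanh(z[2 * hidden:3 * hidden])     # new memory gate
    o = sigmoid(z[3 * hidden:4 * hidden])     # output gate
    c_cur = f * c_pre + i * g
    h_cur = o * np.tanh(c_cur)
    return h_cur, c_cur

# Usage with the sizes used later in this article (input 6, hidden 10)
rng = np.random.default_rng(0)
input_size, hidden_size = 6, 10
W = rng.standard_normal((4 * hidden_size, input_size))
V = rng.standard_normal((4 * hidden_size, hidden_size))
b = np.zeros((4 * hidden_size, 1))
x = rng.standard_normal((input_size, 1))
h0 = np.zeros((hidden_size, 1))
c0 = np.zeros((hidden_size, 1))
h1, c1 = fused_forward(W, V, b, x, h0, c0)
print(h1.shape, c1.shape)   # (10, 1) (10, 1)
```

The trade-off is readability: one matrix product replaces four, but the gates have to be recovered by slicing, which makes the backward pass harder to follow.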
2.2 Output Layer Forward Propagation
The final output at time $t$ is:
$y_t = w_h \cdot h_t + b_h$
The forward propagation formula of the output layer is generally written as $y_t = \mathrm{softmax}(w_h \cdot h_t + b_h)$. The softmax is omitted here, which is equivalent to applying an inverse softmax to the data to be learned.
Code:
def forward(self, h_cur):  # h_cur is h_t
    return np.dot(self.w_h, h_cur) + self.b_h
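To make the "inverse softmax" remark concrete, the small sketch below shows that taking the logarithm of a probability vector gives logits that softmax maps back to the same vector, and that these logits are only determined up to an additive constant. The vector `p` is made-up illustration data. In this article's regression task the targets are not probabilities, so the sketch only illustrates the identity itself.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)      # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / np.sum(e)

p = np.array([0.2, 0.3, 0.5])   # made-up target that is already a probability vector
z = np.log(p)                   # "inverse softmax": logits that map back to p
recovered = softmax(z)
shifted = softmax(z + 3.0)      # adding a constant to the logits gives the same p
print(np.allclose(recovered, p), np.allclose(shifted, p))   # True True
```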
3. LSTM backpropagation code implementation
Backpropagation is the hard part of the whole implementation.
3.1 Output Layer Backpropagation
The loss here is the MSE (mean squared error), implemented in code as $E = 0.5 \cdot (y - y_{train})^2$.
The coefficient 0.5 in front cancels the factor 2 that the square produces when taking the derivative.
Code:
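def backward(self, y, h_cur, train_data):
    delta = y - train_data                      # dE/dy for E = 0.5 * (y - y_train)^2
    self.grad_wh = np.dot(delta, h_cur.T)       # dE/dw_h
    self.grad_hcur = np.dot(self.w_h.T, delta)  # dE/dh_t, passed on to the hidden layer
    self.grad_bh = delta                        # dE/db_h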
Besides the partial derivatives of the loss $E$ with respect to the weights $w_h$ and $b_h$, this code also computes the partial derivative of $E$ with respect to $h$, which is needed later to compute the partial derivatives of the hidden layer weights.
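The three output layer gradients above can be checked numerically. The sketch below compares them with central finite differences on random data. It assumes the scalar loss is the sum of $0.5 \cdot (y - y_{train})^2$ over the output components. The helper names `out_forward`, `loss_fn` and `numeric_grad` are hypothetical and are not part of this article's code.

```python
import numpy as np

def out_forward(w_h, b_h, h):
    return np.dot(w_h, h) + b_h

def loss_fn(w_h, b_h, h, target):
    # Assumption: scalar loss = sum over output components of 0.5 * (y - target)^2
    y = out_forward(w_h, b_h, h)
    return 0.5 * np.sum((y - target) ** 2)

rng = np.random.default_rng(0)
hidden_size, output_size = 10, 6
w_h = rng.standard_normal((output_size, hidden_size))
b_h = rng.standard_normal((output_size, 1))
h = rng.standard_normal((hidden_size, 1))
target = rng.standard_normal((output_size, 1))

# Analytic gradients, same expressions as in backward()
delta = out_forward(w_h, b_h, h) - target
grad_wh = np.dot(delta, h.T)
grad_hcur = np.dot(w_h.T, delta)
grad_bh = delta

def numeric_grad(f, arr, eps=1e-6):
    """Central finite differences of the scalar f() with respect to each entry of arr."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = f()
        arr[idx] = old - eps
        minus = f()
        arr[idx] = old          # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

f = lambda: loss_fn(w_h, b_h, h, target)
err_w = np.max(np.abs(numeric_grad(f, w_h) - grad_wh))
err_h = np.max(np.abs(numeric_grad(f, h) - grad_hcur))
err_b = np.max(np.abs(numeric_grad(f, b_h) - grad_bh))
print(err_w, err_h, err_b)   # all three should be tiny
```

A check like this is cheap to run once and catches transposition mistakes in the matrix products before they show up as a model that does not train.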
3.2 Hidden Layer Backpropagation
This is the core and most difficult part of the entire LSTM algorithm.
In hidden layer backpropagation, the most critical intermediate variable is the partial derivative of the loss $E$ with respect to the cell state $C_t$. In the form implemented by the code below, it is:
$\frac{\partial E}{\partial C_t} = \frac{\partial E}{\partial C_{t+1}} \odot f_{t+1} + \frac{\partial E}{\partial h_t} \odot o_t \odot (1 - \tanh^2(C_t))$
For the derivation process, please see: LSTM (Long Short-Term Memory) Network Algorithm Introduction and Mathematical Derivation
Here $\frac{\partial E}{\partial C_t}$ is computed iteratively from $\frac{\partial E}{\partial C_{t+1}}$ of the next time step, so the actual code needs a variable that stores $\frac{\partial E}{\partial C_t}$ at every time step.
$\frac{\partial E}{\partial h_t}$ is also obtained iteratively: at time $t$, $\frac{\partial E}{\partial h_{t-1}}$ can be computed, and this value must also be stored for use in the backpropagation calculation at time $t-1$.
Code:
def backward(self, Fgate, Igate, Ggate, Ogate, x, grad_cnext, Fgate_next, grad_hcur, c_cur, c_pre, h_pre):
    # dE/dC_t: the path through C_{t+1} (via the next forget gate) plus the path through h_t
    self.grad_ccur = grad_cnext * Fgate_next + grad_hcur * Ogate * (1 - np.tanh(c_cur) * np.tanh(c_cur))
    # dE/dh_{t-1}: each gate's pre-activation gradient is mapped back through its recurrent weight.
    # The upstream gradient belongs inside each matrix product, and the output gate path is included.
    self.grad_hpre = (np.dot(self.v_f.T, self.grad_ccur * c_pre * Fgate * (1 - Fgate))
                      + np.dot(self.v_i.T, self.grad_ccur * Ggate * Igate * (1 - Igate))
                      + np.dot(self.v_g.T, self.grad_ccur * Igate * (1 - Ggate * Ggate))
                      + np.dot(self.v_o.T, grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate)))
    # Input weights (note the matrix transpose here!)
    self.grad_wf = np.dot(self.grad_ccur * c_pre * Fgate * (1 - Fgate), x.T)
    self.grad_wi = np.dot(self.grad_ccur * Ggate * Igate * (1 - Igate), x.T)
    self.grad_wg = np.dot(self.grad_ccur * Igate * (1 - Ggate * Ggate), x.T)
    self.grad_wo = np.dot(grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate), x.T)
    # Recurrent weights
    self.grad_vf = np.dot(self.grad_ccur * c_pre * Fgate * (1 - Fgate), h_pre.T)
    self.grad_vi = np.dot(self.grad_ccur * Ggate * Igate * (1 - Igate), h_pre.T)
    self.grad_vg = np.dot(self.grad_ccur * Igate * (1 - Ggate * Ggate), h_pre.T)
    self.grad_vo = np.dot(grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate), h_pre.T)
    # Biases
    self.grad_bf = self.grad_ccur * c_pre * Fgate * (1 - Fgate)
    self.grad_bi = self.grad_ccur * Ggate * Igate * (1 - Igate)
    self.grad_bg = self.grad_ccur * Igate * (1 - Ggate * Ggate)
    self.grad_bo = grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate)
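Because this part is easy to get wrong, it helps to check the single-step gradients against finite differences. The sketch below does this for one LSTM step, using the standard chain-rule expressions for the forward formulas of section 2.1. It is a self-contained illustration and not this article's class: the parameter dictionary `p`, the helpers `step`, `loss` and `numeric_grad`, and the random vectors `up_h` and `up_c` (stand-ins for $\frac{\partial E}{\partial h_t}$ and for $\frac{\partial E}{\partial C_{t+1}} \odot f_{t+1}$) are all assumptions of the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(p, x, h_pre, c_pre):
    """One LSTM step; p maps names such as 'w_f', 'v_f', 'b_f' to arrays."""
    f = sigmoid(p["w_f"] @ x + p["v_f"] @ h_pre + p["b_f"])
    i = sigmoid(p["w_i"] @ x + p["v_i"] @ h_pre + p["b_i"])
    g = np.tanh(p["w_g"] @ x + p["v_g"] @ h_pre + p["b_g"])
    o = sigmoid(p["w_o"] @ x + p["v_o"] @ h_pre + p["b_o"])
    c = f * c_pre + i * g
    h = o * np.tanh(c)
    return h, c, (f, i, g, o)

rng = np.random.default_rng(0)
n_in, n_hid = 6, 10
p = {}
for gate in "figo":
    p["w_" + gate] = 0.5 * rng.standard_normal((n_hid, n_in))
    p["v_" + gate] = 0.5 * rng.standard_normal((n_hid, n_hid))
    p["b_" + gate] = np.zeros((n_hid, 1))
x = rng.standard_normal((n_in, 1))
h_pre = rng.standard_normal((n_hid, 1))
c_pre = rng.standard_normal((n_hid, 1))
up_h = rng.standard_normal((n_hid, 1))   # stand-in for dE/dh_t
up_c = rng.standard_normal((n_hid, 1))   # stand-in for dE/dC_{t+1} * f_{t+1}

def loss():
    # A scalar whose gradients with respect to h_t and C_t are exactly up_h and up_c
    h, c, _ = step(p, x, h_pre, c_pre)
    return float(np.sum(up_h * h) + np.sum(up_c * c))

# Analytic gradients from the chain rule
h, c, (f, i, g, o) = step(p, x, h_pre, c_pre)
grad_c = up_c + up_h * o * (1 - np.tanh(c) ** 2)   # dE/dC_t
d_f = grad_c * c_pre * f * (1 - f)                 # gradient at the forget gate pre-activation
d_i = grad_c * g * i * (1 - i)                     # input gate
d_g = grad_c * i * (1 - g ** 2)                    # new memory gate
d_o = up_h * np.tanh(c) * o * (1 - o)              # output gate
grad_hpre = p["v_f"].T @ d_f + p["v_i"].T @ d_i + p["v_g"].T @ d_g + p["v_o"].T @ d_o
grad_wf = d_f @ x.T
grad_vo = d_o @ h_pre.T

def numeric_grad(arr, eps=1e-6):
    """Central finite differences of loss() with respect to each entry of arr."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = loss()
        arr[idx] = old - eps
        minus = loss()
        arr[idx] = old          # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

err_hpre = np.max(np.abs(numeric_grad(h_pre) - grad_hpre))
err_wf = np.max(np.abs(numeric_grad(p["w_f"]) - grad_wf))
err_vo = np.max(np.abs(numeric_grad(p["v_o"]) - grad_vo))
print(err_hpre, err_wf, err_vo)   # all three should be tiny
```

The same pattern extends to the remaining weights and biases by adding one comparison per parameter.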
4. Example application description
This example fits the curve $y = x^2$. The training input train_x consists of 600 equally spaced points from 0.01 to 1, split into groups of 6, which gives 100 groups. The training output train_y is the square of train_x plus random noise.
Code:
train_x = np.linspace(0.01,1,600).reshape(100,6,1)
train_y = train_x * train_x + np.random.randn(100,6,1)/200
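To see how this data is fed to the network, the sketch below prints the shapes involved: each of the 100 groups is treated as one time step, and each time step supplies a column vector of 6 values. The seeded generator is an assumption added only to make the sketch repeatable; the article's code uses np.random.randn without a seed.

```python
import numpy as np

train_x = np.linspace(0.01, 1, 600).reshape(100, 6, 1)
rng = np.random.default_rng(0)   # assumption: seeded only for repeatability
train_y = train_x * train_x + rng.standard_normal((100, 6, 1)) / 200

print(train_x.shape)        # (100, 6, 1): 100 time steps, 6 inputs per step
x_t = train_x[3]            # input to the hidden layer at time step t = 3
print(x_t.shape)            # (6, 1): a column vector
print(train_x[0].ravel())   # the first group: 6 consecutive x values
```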
5. Running results
Set the number of epochs to 5000 and train with different learning rates; the learning process is as follows (the blue points are the training data, and the orange points are the output of the network model):
6. Epilogue
First of all, thank you for reading this far. Coding and debugging for this article took a month, mainly because the backpropagation of the hidden layer is really not easy to get right. When I did the LSTM mathematical derivation earlier, I promised to implement LSTM in Python, so this article makes good on that promise. I never expected, though, that the code implementation of LSTM would be so much more complicated than that of RNN.
Moreover, numerical overflow occurs very easily when the code runs, and when it does, the output is always NaN. I tried many fixes, but none of them worked. All I could do was run the code again and hope that it would not overflow the next time.
The overflow is caused by exploding gradients. My guess is that the gradients explode because LSTM is "picky" about the initial values of the weights: as long as the code gets through the first epoch smoothly, no problem occurs afterwards.
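Two mitigations that are commonly used against this kind of problem are clipping the gradient norm before each update and computing the sigmoid in a way that never exponentiates a large positive number. The sketch below shows both. It is not part of this article's code and has not been verified to cure the NaN issue in this model: the helper names `clip_by_norm` and `stable_sigmoid` and the threshold `max_norm=1.0` are assumptions chosen for illustration.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def stable_sigmoid(x):
    """Sigmoid that only ever calls exp on non-positive values."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp of a non-positive number
    ex = np.exp(x[~pos])                       # exp of a negative number
    out[~pos] = ex / (1.0 + ex)
    return out

# Usage: a very large gradient is scaled down, a small one is left alone
big = np.full((10, 1), 1e6)
small = np.full((10, 1), 1e-3)
clipped = clip_by_norm(big, max_norm=1.0)
untouched = clip_by_norm(small, max_norm=1.0)
print(np.linalg.norm(clipped))

# Usage: extreme inputs stay finite and produce no overflow
values = stable_sigmoid(np.array([[-1000.0], [0.0], [1000.0]]))
print(values.ravel())
```

In a training loop like the one in this article, such clipping would be applied to each `grad_*` array just before `step()` is called.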
7. Complete code
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

train_x = np.linspace(0.01,1,600).reshape(100,6,1)
train_y = train_x * train_x + np.random.randn(100,6,1)/200

def sigmoid(x):
    return 1/(1+np.exp(-x))

class HiddenLayer():
    def __init__(self,input_size, hidden_size):
        self.w_f = np.random.randn(hidden_size, input_size)  # input weights of each gate: forget gate
        self.w_i = np.random.randn(hidden_size, input_size)  # input gate
        self.w_g = np.random.randn(hidden_size, input_size)  # new memory gate
        self.w_o = np.random.randn(hidden_size, input_size)  # output gate
        self.v_f = np.random.randn(hidden_size,hidden_size)  # recurrent weights of each gate
        self.v_i = np.random.randn(hidden_size,hidden_size)
        self.v_g = np.random.randn(hidden_size,hidden_size)
        self.v_o = np.random.randn(hidden_size,hidden_size)
        self.b_f = np.zeros([hidden_size, 1])  # the input is restricted to a column vector
        self.b_i = np.zeros([hidden_size, 1])
        self.b_g = np.zeros([hidden_size, 1])
        self.b_o = np.zeros([hidden_size, 1])

    def forward(self, x, h_pre, c_pre):  # h_pre is h_{t-1}, c_pre is C_{t-1}
        self.Fgate = sigmoid(np.dot(self.w_f, x) + np.dot(self.v_f, h_pre) + self.b_f)
        self.Igate = sigmoid(np.dot(self.w_i, x) + np.dot(self.v_i, h_pre) + self.b_i)
        self.Ggate = np.tanh(np.dot(self.w_g, x) + np.dot(self.v_g, h_pre) + self.b_g)
        self.Ogate = sigmoid(np.dot(self.w_o, x) + np.dot(self.v_o, h_pre) + self.b_o)
        c_cur = self.Fgate * c_pre + self.Igate * self.Ggate  # c_cur is C_t
        h_cur = self.Ogate * np.tanh(c_cur)
        return h_cur, c_cur

    def backward(self, Fgate, Igate, Ggate, Ogate, x, grad_cnext, Fgate_next, grad_hcur, c_cur, c_pre, h_pre):
        # dE/dC_t: the path through C_{t+1} (via the next forget gate) plus the path through h_t
        self.grad_ccur = grad_cnext * Fgate_next + grad_hcur * Ogate * (1 - np.tanh(c_cur) * np.tanh(c_cur))
        # dE/dh_{t-1}: each gate's pre-activation gradient is mapped back through its recurrent weight.
        # The upstream gradient belongs inside each matrix product, and the output gate path is included.
        self.grad_hpre = (np.dot(self.v_f.T, self.grad_ccur * c_pre * Fgate * (1 - Fgate))
                          + np.dot(self.v_i.T, self.grad_ccur * Ggate * Igate * (1 - Igate))
                          + np.dot(self.v_g.T, self.grad_ccur * Igate * (1 - Ggate * Ggate))
                          + np.dot(self.v_o.T, grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate)))
        # Input weights (note the matrix transpose here!)
        self.grad_wf = np.dot(self.grad_ccur * c_pre * Fgate * (1 - Fgate), x.T)
        self.grad_wi = np.dot(self.grad_ccur * Ggate * Igate * (1 - Igate), x.T)
        self.grad_wg = np.dot(self.grad_ccur * Igate * (1 - Ggate * Ggate), x.T)
        self.grad_wo = np.dot(grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate), x.T)
        # Recurrent weights
        self.grad_vf = np.dot(self.grad_ccur * c_pre * Fgate * (1 - Fgate), h_pre.T)
        self.grad_vi = np.dot(self.grad_ccur * Ggate * Igate * (1 - Igate), h_pre.T)
        self.grad_vg = np.dot(self.grad_ccur * Igate * (1 - Ggate * Ggate), h_pre.T)
        self.grad_vo = np.dot(grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate), h_pre.T)
        # Biases
        self.grad_bf = self.grad_ccur * c_pre * Fgate * (1 - Fgate)
        self.grad_bi = self.grad_ccur * Ggate * Igate * (1 - Igate)
        self.grad_bg = self.grad_ccur * Igate * (1 - Ggate * Ggate)
        self.grad_bo = grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate)

    def step(self, lr=0.01):
        self.w_f = self.w_f - lr * self.grad_wf
        self.w_i = self.w_i - lr * self.grad_wi
        self.w_g = self.w_g - lr * self.grad_wg
        self.w_o = self.w_o - lr * self.grad_wo
        self.v_f = self.v_f - lr * self.grad_vf
        self.v_i = self.v_i - lr * self.grad_vi
        self.v_g = self.v_g - lr * self.grad_vg
        self.v_o = self.v_o - lr * self.grad_vo
        self.b_f = self.b_f - lr * self.grad_bf
        self.b_i = self.b_i - lr * self.grad_bi
        self.b_g = self.b_g - lr * self.grad_bg
        self.b_o = self.b_o - lr * self.grad_bo

class OutputLayer():
    def __init__(self, hidden_size, output_size):
        self.w_h = np.ones([output_size, hidden_size])
        self.b_h = np.zeros([output_size, 1])

    def forward(self, h_cur):
        return np.dot(self.w_h, h_cur) + self.b_h

    def backward(self, y, h_cur, train_data):
        delta = y - train_data
        self.grad_wh = np.dot(delta, h_cur.T)
        self.grad_hcur = np.dot(self.w_h.T, delta)
        self.grad_bh = delta

    def step(self, lr=0.001):
        self.w_h = self.w_h - lr * self.grad_wh
        self.b_h = self.b_h - lr * self.grad_bh

#---------------------------------------------------
LstmHidden = HiddenLayer(6, 10)
LstmOut = OutputLayer(10, 6)
Fgate_data = np.zeros([101,10,1])  # values stored during the forward pass for backpropagation
Igate_data = np.zeros([100,10,1])
Ggate_data = np.zeros([100,10,1])
Ogate_data = np.zeros([100,10,1])
gradc_data = np.zeros([101,10,1])  # 101 entries because c and h also hold the state at time 0
gradh_data = np.zeros([101,10,1])
c_data = np.zeros([101,10,1])
h_data = np.zeros([101,10,1])
y = np.zeros([100,6,1])
epoch = 5001
total_time = len(train_x)

for e in tqdm(range(epoch)):
    # Forward pass over all time steps
    for t in range(total_time):
        h_data[t + 1],c_data[t + 1] = LstmHidden.forward(train_x[t], h_data[t], c_data[t])
        Fgate_data[t] = LstmHidden.Fgate
        Igate_data[t] = LstmHidden.Igate
        Ggate_data[t] = LstmHidden.Ggate
        Ogate_data[t] = LstmHidden.Ogate
        y[t] = LstmOut.forward(h_data[t + 1])

    # Seed the backward pass with the gradients of the last two time steps
    LstmOut.backward(y[total_time-1], h_data[total_time], train_y[total_time-1])
    gradh_data[total_time] = LstmOut.grad_hcur
    # dh/dC = o * (1 - tanh(C)^2), so the tanh of the stored cell state is needed here
    gradc_data[total_time] = gradh_data[total_time] * Ogate_data[total_time-1] * (1 - np.tanh(c_data[total_time]) * np.tanh(c_data[total_time]))
    LstmOut.backward(y[total_time-2], h_data[total_time-1], train_y[total_time-2])
    gradh_data[total_time-1] = LstmOut.grad_hcur

    # Backward pass; the gradients are not accumulated, so the weights are updated at every time step
    for t in reversed(range(total_time-1)):
        LstmOut.backward(y[t], h_data[t + 1], train_y[t])
        LstmHidden.backward(Fgate_data[t],Igate_data[t],Ggate_data[t],Ogate_data[t],train_x[t],
                            gradc_data[t+2],Fgate_data[t+1], gradh_data[t+1], c_data[t+1], c_data[t], h_data[t])
        gradc_data[t+1] = LstmHidden.grad_ccur
        gradh_data[t] = LstmHidden.grad_hpre
        LstmHidden.step(lr=0.00037)
        LstmOut.step(lr=0.00037)

    if e%200 == 0 :
        plt.clf()
        plt.scatter(train_x, train_y, c="blue", s=15)  # blue points: training data
        plt.scatter(train_x, y, c="orange", s=15)      # orange points: network output
        plt.savefig('x^2_epoch5000_lr00037_%s'%e)

loss = (y-train_y)**2
print(loss)