Build an LSTM module based on NumPy and apply it with examples (with code)

0. Preface

As is customary, a disclaimer first: this article reflects only my own understanding. Although I have drawn on other people's valuable insights, the content may contain inaccuracies. If you find mistakes, please point them out so that we can improve together.

The purpose of this article is to deepen understanding of the LSTM (long short-term memory) neural network model by building an LSTM module from scratch and applying it to an example problem.

Compared with other common neural network models (FNN, CNN, GAN), the RNN has the most complicated underlying mathematics, and the LSTM, as one of the improved variants of the RNN, raises that complexity another level. It is therefore worth studying the LSTM algorithm and its code implementation carefully, both to master the LSTM and to be able to innovate on the algorithm at a lower level.

0.1 The necessary knowledge before reading this article

This article is a companion to "Build an RNN module based on NumPy and implement an example application (with code)". If you are not familiar with how the underlying RNN algorithm is implemented, I strongly recommend studying the RNN material first; otherwise this article will be hard to follow.
"Algorithm introduction and mathematical derivation of LSTM (long short-term memory) network" covers the underlying mathematics of the LSTM (if the derivation is hard to follow, knowing the results is enough). This article focuses on building an LSTM from scratch with NumPy, so the formulas for forward and backward propagation are only touched on briefly.

1. LSTM architecture

Many introductory articles on CSDN already cover this topic, and colah's well-known blog post Understanding LSTM Networks explains the LSTM architecture very clearly. However, I ran into some practical problems while coding, so I will go through the architecture carefully here:

[Figure: schematic of LSTM forward propagation from time 0 to time n]
The schematic above illustrates forward propagation through the LSTM network from time 0 to time n. Pay attention to the time index of each variable at each step; the code that follows must respect these indices strictly.

Careful readers may have noticed that the cell state output at the last time step, $C_{n+1}$, is also marked. As explained later, this is because in backpropagation the partial derivative of the loss $E$ with respect to $C_n$ is computed iteratively from the partial derivative of $E$ with respect to $C_{n+1}$.

For how parameters are passed inside the LSTM, again refer to the schematic diagram below.

2. LSTM forward propagation code implementation

Forward propagation is not difficult as long as you strictly follow the schematic diagram above.

2.1 Hidden layer forward propagation

The gates at time $t$ are:

Forget gate: $f_t = \sigma(w_f·x_t+v_f·h_{t-1}+b_f)$
Input gate: $i_t = \sigma(w_i·x_t+v_i·h_{t-1}+b_i)$
New memory gate: $g_t = tanh(w_g·x_t+v_g·h_{t-1}+b_g)$
Output gate: $o_t = \sigma(w_o·x_t+v_o·h_{t-1}+b_o)$

The cell state $C_t$ at time $t$ is:

$C_t = f_t \bigodot C_{t-1} + i_t \bigodot g_t$

The hidden state $h_t$ at time $t$ is:

$h_t = o_t \bigodot tanh(C_t)$

Code:

    def forward(self, x, h_pre, c_pre):  # h_pre is h_{t-1}, c_pre is C_{t-1}

        self.Fgate = sigmoid(np.dot(self.w_f, x) + np.dot(self.v_f, h_pre) + self.b_f)
        self.Igate = sigmoid(np.dot(self.w_i, x) + np.dot(self.v_i, h_pre) + self.b_i)
        self.Ggate = np.tanh(np.dot(self.w_g, x) + np.dot(self.v_g, h_pre) + self.b_g)
        self.Ogate = sigmoid(np.dot(self.w_o, x) + np.dot(self.v_o, h_pre) + self.b_o)

        c_cur = self.Fgate * c_pre + self.Igate * self.Ggate  # c_cur is C_t
        h_cur = self.Ogate * np.tanh(c_cur)

        return h_cur, c_cur

A few lines could be saved by stacking the gate weights into one multidimensional array. Here every gate is written out separately so that each one is easier to see.
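As a rough illustration of the stacked alternative, the sketch below computes all four gate pre-activations with one matrix product and then slices the result. The function name `forward_stacked`, the stacked matrices `W`, `V`, `b` and the gate order f, i, g, o are assumptions made for this example; they are not part of the article's code.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward_stacked(W, V, b, x, h_pre, c_pre):
    """One LSTM step with the four gates stacked in the (assumed) order f, i, g, o.

    W: (4*H, input_size), V: (4*H, H), b: (4*H, 1) -- hypothetical stacked layout.
    """
    H = h_pre.shape[0]
    z = np.dot(W, x) + np.dot(V, h_pre) + b  # all four pre-activations at once
    f = sigmoid(z[0:H])
    i = sigmoid(z[H:2 * H])
    g = np.tanh(z[2 * H:3 * H])
    o = sigmoid(z[3 * H:4 * H])
    c_cur = f * c_pre + i * g
    h_cur = o * np.tanh(c_cur)
    return h_cur, c_cur

rng = np.random.default_rng(0)
input_size, H = 6, 10
W = rng.standard_normal((4 * H, input_size))
V = rng.standard_normal((4 * H, H))
b = np.zeros((4 * H, 1))
x = rng.standard_normal((input_size, 1))
h0 = np.zeros((H, 1))
c0 = np.zeros((H, 1))

h1, c1 = forward_stacked(W, V, b, x, h0, c0)
print(h1.shape, c1.shape)
```

The trade-off is the one mentioned above: fewer lines and a single matrix product, at the cost of the gates no longer being visible as separate attributes.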

2.2 Output Layer Forward Propagation

The final output $y_t$ at time $t$ is:

$y_t = w_h·h_t + b_h$

The output-layer forward formula is usually written as $y_t = softmax(w_h·h_t + b_h)$. The softmax is omitted here, which is equivalent to applying an inverse softmax to the data to be learned.

Code:

    def forward(self, h_cur):  # h_cur is h_t
        return np.dot(self.w_h, h_cur) + self.b_h

3. LSTM backpropagation code implementation

Backpropagation is the hardest part of the whole implementation.

3.1 Output Layer Backpropagation

The loss is the squared error (MSE), implemented in code as $E = 0.5*(y - y_{train})^2$.

The coefficient 0.5 is there to cancel the factor 2 that the square produces when differentiating.
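Written out, the cancellation looks as follows. This is a one-line derivation using the symbols $y$ and $y_{train}$ from the loss above; the result is the quantity the output-layer code calls `delta`.

```latex
E = \tfrac{1}{2}\,(y - y_{train})^2
\quad\Longrightarrow\quad
\frac{\partial E}{\partial y} = \tfrac{1}{2}\cdot 2\,(y - y_{train}) = y - y_{train}
```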

Code:

    def backward(self,y,h_cur, train_data):
        delta = y - train_data
        self.grad_wh = np.dot(delta, h_cur.T)
        self.grad_hcur = np.dot(self.w_h.T, delta)
        self.grad_bh = delta

Besides the partial derivatives of the loss $E$ with respect to the weight $w_h$ and the bias $b_h$, this code also computes the partial derivative of $E$ with respect to $h_t$, which is used later when computing the partial derivatives of the hidden-layer weights.
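One way to gain confidence in these formulas is a finite-difference check. The sketch below is not part of the article's code: it rebuilds the output layer with random values (the sizes 10 and 6 are taken from the complete code, everything else is made up for the test) and compares the analytic gradients with central differences.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, output_size = 10, 6
w_h = rng.standard_normal((output_size, hidden_size))
b_h = rng.standard_normal((output_size, 1))
h_cur = rng.standard_normal((hidden_size, 1))
target = rng.standard_normal((output_size, 1))  # stands in for train_data

def loss(w, b, h):
    y = np.dot(w, h) + b
    return 0.5 * np.sum((y - target) ** 2)

# Analytic gradients, same formulas as in backward()
delta = np.dot(w_h, h_cur) + b_h - target
grad_wh = np.dot(delta, h_cur.T)
grad_hcur = np.dot(w_h.T, delta)

# Central finite difference for one weight entry and one hidden-state entry
eps = 1e-6
w_plus, w_minus = w_h.copy(), w_h.copy()
w_plus[2, 3] += eps
w_minus[2, 3] -= eps
num_wh = (loss(w_plus, b_h, h_cur) - loss(w_minus, b_h, h_cur)) / (2 * eps)

h_plus, h_minus = h_cur.copy(), h_cur.copy()
h_plus[4, 0] += eps
h_minus[4, 0] -= eps
num_h = (loss(w_h, b_h, h_plus) - loss(w_h, b_h, h_minus)) / (2 * eps)

# Both differences should be close to zero
print(abs(num_wh - grad_wh[2, 3]), abs(num_h - grad_hcur[4, 0]))
```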

3.2 Hidden Layer Backpropagation

This is the core and most difficult part of the entire LSTM algorithm.

In the backpropagation of the hidden layer, the most critical intermediate variable is the partial derivative of the loss $E$ with respect to the cell state $C_t$, $\frac{\partial E}{\partial C_t}$.

For the derivation, see: LSTM (Long Short-Term Memory) Network Algorithm Introduction and Mathematical Derivation

Computing $\frac{\partial E}{\partial C_t}$ requires $\frac{\partial E}{\partial C_{t+1}}$ from the next time step, so in the actual code $\frac{\partial E}{\partial C_t}$ must be saved at every time step.

$\frac{\partial E}{\partial h_t}$ is also obtained iteratively: at time $t$ we can compute $\frac{\partial E}{\partial h_{t-1}}$, and this value must also be stored for use in the backpropagation calculation at time $t-1$.
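Reading the recursion off the backward code in this section (an inference from the code, not a quotation of the referenced derivation), the cell-state gradient and, as two representative examples, the gradients of $w_f$ and $w_o$ can be written as follows, where $f_{t+1}$ is the forget gate of the next time step:

```latex
\frac{\partial E}{\partial C_t}
  = \frac{\partial E}{\partial C_{t+1}} \bigodot f_{t+1}
  + \frac{\partial E}{\partial h_t} \bigodot o_t \bigodot \left(1 - \tanh^2(C_t)\right)

\frac{\partial E}{\partial w_f}
  = \left(\frac{\partial E}{\partial C_t} \bigodot C_{t-1} \bigodot f_t \bigodot (1 - f_t)\right) x_t^{T}

\frac{\partial E}{\partial w_o}
  = \left(\frac{\partial E}{\partial h_t} \bigodot \tanh(C_t) \bigodot o_t \bigodot (1 - o_t)\right) x_t^{T}
```

The gradients for $w_i$ and $w_g$ follow the same pattern as $w_f$, and the $v$ gradients replace $x_t^{T}$ with $h_{t-1}^{T}$.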

Code:

    def backward(self, Fgate, Igate, Ggate, Ogate, x, grad_cnext, Fgate_next, grad_hcur, c_cur, c_pre, h_pre):

        # dE/dC_t: the part from C_{t+1} through the next forget gate, plus the part from h_t
        self.grad_ccur = grad_cnext * Fgate_next + grad_hcur * Ogate * (1 - np.tanh(c_cur) * np.tanh(c_cur))

        # Gradients with respect to each gate's pre-activation
        delta_f = self.grad_ccur * c_pre * Fgate * (1 - Fgate)
        delta_i = self.grad_ccur * Ggate * Igate * (1 - Igate)
        delta_g = self.grad_ccur * Igate * (1 - Ggate * Ggate)
        delta_o = grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate)

        # dE/dh_{t-1}: each gate gradient is pushed back through its recurrent weights
        self.grad_hpre = (np.dot(self.v_f.T, delta_f) + np.dot(self.v_i.T, delta_i)
                          + np.dot(self.v_g.T, delta_g) + np.dot(self.v_o.T, delta_o))

        self.grad_wf = np.dot(delta_f, x.T)  # note the transpose here
        self.grad_wi = np.dot(delta_i, x.T)
        self.grad_wg = np.dot(delta_g, x.T)
        self.grad_wo = np.dot(delta_o, x.T)

        self.grad_vf = np.dot(delta_f, h_pre.T)
        self.grad_vi = np.dot(delta_i, h_pre.T)
        self.grad_vg = np.dot(delta_g, h_pre.T)
        self.grad_vo = np.dot(delta_o, h_pre.T)

        self.grad_bf = delta_f
        self.grad_bi = delta_i
        self.grad_bg = delta_g
        self.grad_bo = delta_o
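Because these gradients are easy to get wrong, a numerical check on a single time step can help. The sketch below is an illustration only, under stated assumptions: the parameter dictionary `params`, the helper names `pre`, `step`, `loss` and `numeric`, the 0.5 weight scale, and the single-step loss $E = 0.5\sum(h_t - target)^2$ (so the `grad_cnext` term is zero) are all invented for the test. It compares three of the weight-gradient formulas above with central finite differences.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(1)
n_in, n_hid = 6, 10

# Hypothetical parameter dictionary using the article's names (w_*, v_*, b_*)
params = {}
for gate in "figo":
    params["w_" + gate] = 0.5 * rng.standard_normal((n_hid, n_in))
    params["v_" + gate] = 0.5 * rng.standard_normal((n_hid, n_hid))
    params["b_" + gate] = np.zeros((n_hid, 1))

x = rng.standard_normal((n_in, 1))
h_pre = rng.standard_normal((n_hid, 1))
c_pre = rng.standard_normal((n_hid, 1))
target = rng.standard_normal((n_hid, 1))

def pre(p, gate):
    # Pre-activation of one gate
    return np.dot(p["w_" + gate], x) + np.dot(p["v_" + gate], h_pre) + p["b_" + gate]

def step(p):
    f, i, o = sigmoid(pre(p, "f")), sigmoid(pre(p, "i")), sigmoid(pre(p, "o"))
    g = np.tanh(pre(p, "g"))
    c = f * c_pre + i * g
    return f, i, g, o, c, o * np.tanh(c)

def loss(p):
    h = step(p)[-1]
    return 0.5 * np.sum((h - target) ** 2)

# Analytic gradients for one step (no later time step, so grad_cnext = 0)
f, i, g, o, c, h = step(params)
grad_hcur = h - target
grad_ccur = grad_hcur * o * (1 - np.tanh(c) * np.tanh(c))
analytic = {
    "w_f": np.dot(grad_ccur * c_pre * f * (1 - f), x.T),
    "v_g": np.dot(grad_ccur * i * (1 - g * g), h_pre.T),
    "w_o": np.dot(grad_hcur * np.tanh(c) * o * (1 - o), x.T),
}

def numeric(name, row, col, eps=1e-6):
    # Central difference on a single entry of one parameter matrix
    plus, minus = dict(params), dict(params)
    plus[name] = params[name].copy()
    minus[name] = params[name].copy()
    plus[name][row, col] += eps
    minus[name][row, col] -= eps
    return (loss(plus) - loss(minus)) / (2 * eps)

errors = [abs(numeric(name, 1, 2) - grad[1, 2]) for name, grad in analytic.items()]
print(max(errors))  # should be close to zero
```

The same pattern can be extended to the remaining matrices and biases by adding entries to `analytic`.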

4. Example application description

This example fits the curve $y = x^2$. The training input train_x consists of 600 equally spaced values from 0.01 to 1, grouped six at a time into 100 groups. The training output train_y is the square of train_x plus random noise.
Code:

train_x = np.linspace(0.01,1,600).reshape(100,6,1)
train_y = train_x * train_x + np.random.randn(100,6,1)/200
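To make the layout concrete, the short sketch below prints the shape of the input and the first group. Judging by the training loop in the complete code, each group of six values is fed to the network as one time step, giving 100 time steps with a 6-dimensional input each.

```python
import numpy as np

train_x = np.linspace(0.01, 1, 600).reshape(100, 6, 1)

print(train_x.shape)       # 100 groups, each a column vector of 6 values
print(train_x[0].shape)    # shape of the input at one time step
print(train_x[0].ravel())  # the six smallest x values form the first group
```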

5. Running results

With the number of epochs set to 5000, the learning process for different learning rates is shown below (the blue points are the training data, and the yellow points are the output of the network model):

[Figure: training data (blue) and network output (yellow) for different learning rates]

6. Epilogue

First of all, thank you for reading this far. Coding and debugging for this article took me a month, mainly because the hidden-layer backpropagation is really not easy to get right. When I did the LSTM mathematical derivation earlier, I promised to implement the LSTM in Python. This article makes good on that promise, but I never expected the LSTM implementation to be so much more complicated than the RNN one.

Moreover, numerical overflow occurs very easily when the code runs:
[Figure: overflow at run time]
The output in this case is always NaN. I have tried many fixes, but none of them worked. All I can do is run the code again and hope that it does not overflow next time.

The overflow is caused by exploding gradients. My guess is that the gradients explode because the LSTM is "picky" about the initial weight values: as long as the code gets through the first epoch without trouble, no problem appears later.
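For reference, three mitigations that are commonly tried against this kind of overflow are a clipped sigmoid, gradient-norm clipping, and a smaller weight initialization. The sketch below only shows what they look like in NumPy. The helper names `stable_sigmoid`, `clip_gradient` and `scaled_init`, the clipping range and the `max_norm` value are all assumptions, and none of this has been tested against the code in this article, so whether it helps here is an open question.

```python
import numpy as np

def stable_sigmoid(x):
    # Clip the pre-activation so np.exp cannot overflow;
    # the sigmoid is fully saturated well before |x| = 50.
    return 1 / (1 + np.exp(-np.clip(x, -50, 50)))

def clip_gradient(grad, max_norm=1.0):
    """Rescale grad so that its L2 norm is at most max_norm (hypothetical helper)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def scaled_init(rows, cols, rng):
    # Hypothetical alternative to np.random.randn(rows, cols):
    # shrink the weights by 1/sqrt(cols) to keep pre-activations small.
    return rng.standard_normal((rows, cols)) / np.sqrt(cols)

rng = np.random.default_rng(0)

huge_grad = np.full((10, 6), 1e6)
clipped = clip_gradient(huge_grad, max_norm=5.0)
print(np.linalg.norm(clipped))  # norm after clipping

print(stable_sigmoid(np.array([-1e4, 0.0, 1e4])))  # no overflow warning
print(scaled_init(10, 10, rng).std())
```

In the article's code, `clip_gradient` would be applied to each `grad_*` attribute before `step()`, and `scaled_init` would replace the `np.random.randn` calls in `__init__`.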

7. Complete code

import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

train_x = np.linspace(0.01,1,600).reshape(100,6,1)
train_y = train_x * train_x + np.random.randn(100,6,1)/200


def sigmoid(x):
    return 1/(1+np.exp(-x))


class HiddenLayer():
    def __init__(self, input_size, hidden_size):
        self.w_f = np.random.randn(hidden_size, input_size)  # input weights of each gate: forget gate
        self.w_i = np.random.randn(hidden_size, input_size)  # input gate
        self.w_g = np.random.randn(hidden_size, input_size)  # new memory gate
        self.w_o = np.random.randn(hidden_size, input_size)  # output gate

        self.v_f = np.random.randn(hidden_size, hidden_size)
        self.v_i = np.random.randn(hidden_size, hidden_size)
        self.v_g = np.random.randn(hidden_size, hidden_size)
        self.v_o = np.random.randn(hidden_size, hidden_size)

        self.b_f = np.zeros([hidden_size, 1])  # the input is restricted to a one-dimensional (column) vector
        self.b_i = np.zeros([hidden_size, 1])
        self.b_g = np.zeros([hidden_size, 1])
        self.b_o = np.zeros([hidden_size, 1])

    def forward(self, x, h_pre, c_pre):  # h_pre is h_{t-1}, c_pre is C_{t-1}

        self.Fgate = sigmoid(np.dot(self.w_f, x) + np.dot(self.v_f, h_pre) + self.b_f)
        self.Igate = sigmoid(np.dot(self.w_i, x) + np.dot(self.v_i, h_pre) + self.b_i)
        self.Ggate = np.tanh(np.dot(self.w_g, x) + np.dot(self.v_g, h_pre) + self.b_g)
        self.Ogate = sigmoid(np.dot(self.w_o, x) + np.dot(self.v_o, h_pre) + self.b_o)

        c_cur = self.Fgate * c_pre + self.Igate * self.Ggate  # c_cur is C_t
        h_cur = self.Ogate * np.tanh(c_cur)

        return h_cur, c_cur

    def backward(self, Fgate, Igate, Ggate, Ogate, x, grad_cnext, Fgate_next, grad_hcur, c_cur, c_pre, h_pre):

        # dE/dC_t: the part from C_{t+1} through the next forget gate, plus the part from h_t
        self.grad_ccur = grad_cnext * Fgate_next + grad_hcur * Ogate * (1 - np.tanh(c_cur) * np.tanh(c_cur))

        # Gradients with respect to each gate's pre-activation
        delta_f = self.grad_ccur * c_pre * Fgate * (1 - Fgate)
        delta_i = self.grad_ccur * Ggate * Igate * (1 - Igate)
        delta_g = self.grad_ccur * Igate * (1 - Ggate * Ggate)
        delta_o = grad_hcur * np.tanh(c_cur) * Ogate * (1 - Ogate)

        # dE/dh_{t-1}: each gate gradient is pushed back through its recurrent weights
        self.grad_hpre = (np.dot(self.v_f.T, delta_f) + np.dot(self.v_i.T, delta_i)
                          + np.dot(self.v_g.T, delta_g) + np.dot(self.v_o.T, delta_o))

        self.grad_wf = np.dot(delta_f, x.T)  # note the transpose here
        self.grad_wi = np.dot(delta_i, x.T)
        self.grad_wg = np.dot(delta_g, x.T)
        self.grad_wo = np.dot(delta_o, x.T)

        self.grad_vf = np.dot(delta_f, h_pre.T)
        self.grad_vi = np.dot(delta_i, h_pre.T)
        self.grad_vg = np.dot(delta_g, h_pre.T)
        self.grad_vo = np.dot(delta_o, h_pre.T)

        self.grad_bf = delta_f
        self.grad_bi = delta_i
        self.grad_bg = delta_g
        self.grad_bo = delta_o

    def step(self, lr=0.01):
        self.w_f = self.w_f - lr * self.grad_wf
        self.w_i = self.w_i - lr * self.grad_wi
        self.w_g = self.w_g - lr * self.grad_wg
        self.w_o = self.w_o - lr * self.grad_wo

        self.v_f = self.v_f - lr*self.grad_vf
        self.v_i = self.v_i - lr * self.grad_vi
        self.v_g = self.v_g - lr * self.grad_vg
        self.v_o = self.v_o - lr * self.grad_vo

        self.b_f = self.b_f - lr*self.grad_bf
        self.b_i = self.b_i - lr * self.grad_bi
        self.b_g = self.b_g - lr * self.grad_bg
        self.b_o = self.b_o - lr * self.grad_bo


class OutputLayer():
    def __init__(self, hidden_size, output_size):

        self.w_h = np.ones([output_size, hidden_size])
        self.b_h = np.zeros([output_size, 1])

    def forward(self, h_cur):
        return np.dot(self.w_h, h_cur) + self.b_h

    def backward(self,y,h_cur, train_data):
        delta = y - train_data
        self.grad_wh = np.dot(delta, h_cur.T)
        self.grad_hcur = np.dot(self.w_h.T, delta)
        self.grad_bh = delta

    def step(self, lr=0.001):
        self.w_h = self.w_h - lr * self.grad_wh
        self.b_h = self.b_h - lr * self.grad_bh

#---------------------------------------------------
LstmHidden = HiddenLayer(6, 10)
LstmOut = OutputLayer(10, 6)

Fgate_data = np.zeros([101,10,1])  # all of these are values that must be stored
Igate_data = np.zeros([100,10,1])
Ggate_data = np.zeros([100,10,1])
Ogate_data = np.zeros([100,10,1])
gradc_data = np.zeros([101,10,1])  # 101 because c and h each hold one extra entry for time 0
gradh_data = np.zeros([101,10,1])
c_data = np.zeros([101,10,1])
h_data = np.zeros([101,10,1])
y = np.zeros([100,6,1])

epoch = 5001
total_time = len(train_x)

for e in tqdm(range(epoch)):
    # Forward pass over all time steps, caching the gates for backpropagation
    for t in range(total_time):

        h_data[t + 1], c_data[t + 1] = LstmHidden.forward(train_x[t], h_data[t], c_data[t])
        Fgate_data[t] = LstmHidden.Fgate
        Igate_data[t] = LstmHidden.Igate
        Ggate_data[t] = LstmHidden.Ggate
        Ogate_data[t] = LstmHidden.Ogate

        y[t] = LstmOut.forward(h_data[t + 1])

    # Backward pass through time; nothing flows back into the last step from the future
    grad_cnext = np.zeros_like(c_data[0])
    grad_hrec = np.zeros_like(h_data[0])
    for t in reversed(range(total_time)):
        LstmOut.backward(y[t], h_data[t + 1], train_y[t])
        # dE/dh_t = gradient from the output layer + gradient passed back from step t+1
        gradh_data[t + 1] = LstmOut.grad_hcur + grad_hrec

        LstmHidden.backward(Fgate_data[t], Igate_data[t], Ggate_data[t], Ogate_data[t], train_x[t],
                            grad_cnext, Fgate_data[t + 1], gradh_data[t + 1], c_data[t + 1], c_data[t], h_data[t])
        gradc_data[t + 1] = LstmHidden.grad_ccur
        grad_cnext = LstmHidden.grad_ccur
        grad_hrec = LstmHidden.grad_hpre

        LstmHidden.step(lr=0.00037)
        LstmOut.step(lr=0.00037)


    if e%200 == 0 :
        plt.clf()
        plt.scatter(train_x, train_y, c="blue", s=15)  # true values (the blue points)
        plt.scatter(train_x, y, c="orange", s=15)  # predicted values (the yellow points)
        plt.savefig('x^2_epoch5000_lr00037_%s'%e)


loss = (y-train_y)**2

print(loss)