Neural Network Study Notes 9 - LSTM and GRU Model Understanding and Code Analysis in Recurrent Neural Networks


Foreword

A recurrent neural network (RNN) is a neural network for processing sequence data. Unlike an ordinary feed-forward network, it can handle data that changes over a sequence. For example, the meaning of a word can differ depending on the content that came before it, and an RNN handles this kind of problem well.
LSTM is a kind of RNN that addresses the vanishing and exploding gradient problems that arise when training a plain RNN on long sequences. When a sequence is long enough, a plain RNN finds it difficult to carry information from earlier time steps to later ones. An LSTM can learn long-term dependencies and remember information from earlier time steps, so it can relate the current input to its context.
Unlike a plain RNN, which tries to remember all the information regardless of whether it is useful, an LSTM has a memory cell that acts as a filter with selective memory: it selects and remembers the important information while filtering out noise and unimportant information, reducing the memory load. This design alleviates the gradient problems, and an LSTM usually converges much faster than an ordinary RNN.

The core concepts of the LSTM are the cell state and the "gate" structures. The cell state acts as a pathway for information, allowing it to be passed along the chain of cells; you can think of it as the "memory" of the network. In theory, the cell state can carry relevant information all the way through the sequence.

GRU (Gated Recurrent Unit) is another kind of recurrent neural network that addresses the same problems: the lack of long-term memory and the gradients in backpropagation. It is similar to the LSTM but simpler and easier to train. Compared with an LSTM, a GRU removes the cell state and uses the hidden state alone to transmit information, and it contains only two gates: an update gate and a reset gate.

GRU is a variant of the LSTM, proposed for the same long-term memory and backpropagation gradient issues. In many cases GRU and LSTM perform almost identically, but the GRU computation is simpler and easier to implement.

1. LSTM model structure

[Figure: RNN structure]

[Figure: LSTM structure]

Comparing the LSTM model with the RNN model:

  1. The internal structure at each time step $X_t$ is far more complex in the LSTM.
  2. An LSTM cell has two outputs (the hidden state and the cell state), while an RNN cell has only one (the hidden state).

[Figure: internal structure of an LSTM cell at time step $t$]

Taking the cell at time step $X_t$ out separately, the notation is as follows:
$C$ denotes the memory cell, i.e. the Cell State
$h$ denotes the hidden state
$X$ denotes the new input content
$\sigma$ denotes a gate unit (a sigmoid)
$f_t$ is the forget gate (Forget Gate)
$i_t$ is the input gate (Input Gate)
$o_t$ is the output gate (Output Gate)

Comparing $f_t$, $i_t$, and $o_t$, one can see that they all follow the pattern $\sigma(W \times X_t + W \times h_{t-1} + b)$. From the model structure diagram above, $(X_t, h_{t-1})$ is fed into each $\sigma$ gate, so the $X$ and $h$ inputs of the three gates are the same. From $(X_t, h_{t-1})$ four values can be computed: $f_t$, $i_t$, $\widetilde{C}_t$, and $o_t$. Here $\tanh$ is an activation function whose output elements lie between -1 and 1, and $W, b$ are the parameters of each gate's neurons, learned during training.
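Written out in full, the standard per-step LSTM equations are as follows (the weight subscripts are illustrative, chosen to mirror the GRU notation used later in this article):

$f_t = \sigma(W_{xf} X_t + W_{hf} h_{t-1} + b_f)$
$i_t = \sigma(W_{xi} X_t + W_{hi} h_{t-1} + b_i)$
$\widetilde{C}_t = \tanh(W_{xc} X_t + W_{hc} h_{t-1} + b_c)$
$o_t = \sigma(W_{xo} X_t + W_{ho} h_{t-1} + b_o)$
$C_t = f_t \odot C_{t-1} + i_t \odot \widetilde{C}_t$
$h_t = o_t \odot \tanh(C_t)$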

  1. Forget gate: decides what information should be discarded or kept. The previous hidden state and the current input are passed to a sigmoid function, whose output lies between 0 and 1; values close to 0 mean the information should be discarded and values close to 1 mean it should be kept. The resulting $f_t$ is multiplied element-wise with the previous cell state $C_{t-1}$.
  2. Input gate: used to update the cell state. First, the previous hidden state and the current input are passed to a sigmoid function, which squashes the values to between 0 and 1 to decide which information to update (0 means unimportant, 1 means important), giving $i_t$. Next, the previous hidden state and the current input are also passed to a tanh function to create a new candidate vector $\widetilde{C}_t$. Finally, the sigmoid output $i_t$ and the tanh output $\widetilde{C}_t$ are multiplied element-wise, so the sigmoid output decides which parts of the candidate are important enough to keep. The retained information is added to the output of the forget gate, giving the new memory cell $C_t = (f_t \odot C_{t-1}) + (i_t \odot \widetilde{C}_t)$.
  3. Output gate: determines the value of the next hidden state, which carries information about the previous inputs. First, the previous hidden state and the current input are passed into a sigmoid function, and the newly obtained cell state is passed into a tanh function. The tanh output is then multiplied element-wise by the sigmoid output to decide what information the hidden state should carry. The hidden state is used as the output of the current cell, and both the new cell state $C_t$ and the new hidden state $h_t$ are passed to the next time step. That is, the new memory cell is squashed as $m_t = \tanh(C_t)$, and then $o_t$ gates it to give the new hidden state $h_t = o_t \odot m_t$, controlling which part is output as the answer we need. A minimal code sketch of one such LSTM step follows this list.
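A minimal sketch of one LSTM time step, in the same style as the GRU code in section 4, assuming illustrative weight names and sizes (nothing here comes from a library API):

import torch

def lstm_step(X, H, C, params):
    # One LSTM time step implementing f_t, i_t, C~_t, o_t, C_t and h_t as above
    (W_xf, W_hf, b_f,   # forget gate parameters
     W_xi, W_hi, b_i,   # input gate parameters
     W_xc, W_hc, b_c,   # candidate memory cell parameters
     W_xo, W_ho, b_o) = params
    F = torch.sigmoid(X @ W_xf + H @ W_hf + b_f)     # forget gate f_t
    I = torch.sigmoid(X @ W_xi + H @ W_hi + b_i)     # input gate i_t
    C_tilde = torch.tanh(X @ W_xc + H @ W_hc + b_c)  # candidate cell state C~_t
    O = torch.sigmoid(X @ W_xo + H @ W_ho + b_o)     # output gate o_t
    C = F * C + I * C_tilde                          # C_t = f_t ⊙ C_{t-1} + i_t ⊙ C~_t
    H = O * torch.tanh(C)                            # h_t = o_t ⊙ tanh(C_t)
    return H, C

# Illustrative sizes: batch of 4, 8 input features, 16 hidden units
num_inputs, num_hiddens, batch_size = 8, 16, 4
normal = lambda shape: torch.randn(shape) * 0.01
params = []
for _ in range(4):  # forget, input, candidate, output
    params += [normal((num_inputs, num_hiddens)),
               normal((num_hiddens, num_hiddens)),
               torch.zeros(num_hiddens)]
X = torch.randn(batch_size, num_inputs)
H = torch.zeros(batch_size, num_hiddens)
C = torch.zeros(batch_size, num_hiddens)
H, C = lstm_step(X, H, C, params)
print(H.shape, C.shape)  # torch.Size([4, 16]) torch.Size([4, 16])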

2. GRU model structure

[Figure: GRU model structure]

Formula expression:

$z_t = \sigma(W_{xz} X_t + W_{hz} h_{t-1} + b_z)$
$r_t = \sigma(W_{xr} X_t + W_{hr} h_{t-1} + b_r)$
$\widetilde{h}_t = \tanh(W_{xh} X_t + W_{hh} (r_t \odot h_{t-1}) + b_h)$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \widetilde{h}_t$

A GRU has only two gates. The gate that combines the roles of the LSTM's input gate and forget gate is called the update gate, written $z_t$; the other gate, $r_t$, is called the reset gate.
Observe that both gates again follow the pattern $\sigma(W \times X_t + W \times h_{t-1} + b)$, but here $\widetilde{h}_t$ differs from the LSTM's $\widetilde{C}_t$: the LSTM's $\widetilde{C}_t$ is computed from $(X_t, h_{t-1})$, while the GRU's $\widetilde{h}_t$ is computed from $(X_t, r_t \odot h_{t-1})$. As before, $\tanh$ is the activation function whose output elements lie between -1 and 1, and $W, b$ are the parameters of each gate's neurons, learned during training.

  1. Update gate $z_t$: plays an attention-like role, controlling how much of the state from the previous time step is carried into the current state; in other words, the update gate helps the model decide how much past information to pass on to the future. In simple terms, it is used to update the memory.
  2. Reset gate $r_t$: plays a forgetting role, determining how to combine the new input with the previous memory. It controls how much past information to forget, letting the hidden state drop information that is irrelevant to future predictions, which also allows a more compact representation. A one-step sketch using PyTorch's built-in GRU cell follows this list.
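As a minimal sketch, the same single step can be taken with PyTorch's built-in nn.GRUCell, which holds the $W$ and $b$ parameters above internally (the sizes here are illustrative). Note that, like the d2l code in section 4, PyTorch writes the final update with the roles of $z_t$ and $1 - z_t$ swapped relative to the formula above; the two conventions are equivalent.

import torch
from torch import nn

batch_size, num_inputs, num_hiddens = 4, 8, 16  # illustrative sizes
cell = nn.GRUCell(num_inputs, num_hiddens)      # update/reset gates and candidate state, learned internally
X = torch.randn(batch_size, num_inputs)         # input X_t at one time step
H = torch.zeros(batch_size, num_hiddens)        # previous hidden state h_{t-1}
H_next = cell(X, H)                             # one application of the GRU equations
print(H_next.shape)                             # torch.Size([4, 16])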

3. Comparison between GRU and LSTM

Advantages:

  1. Compared with an LSTM, a GRU has one fewer gate, so it has fewer parameters and requires less computation; it is simpler, easier to implement, and trains faster (a parameter-count sketch follows at the end of this section);
  2. A GRU has only two gates while an LSTM has three, so the GRU's parameters are easier to control and tune;

Disadvantages:

  1. A GRU's memory capacity is not as strong as an LSTM's. For long-term dependency problems, an LSTM is strictly more powerful than a GRU because it can perform unbounded counting while a GRU cannot; this is why LSTMs can learn certain simple formal languages that GRUs cannot;
  2. A GRU has fewer parameters than an LSTM and is therefore more prone to underfitting.
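To make the parameter comparison concrete, here is a small sketch (the layer sizes are arbitrary) that counts the parameters of PyTorch's built-in single-layer LSTM and GRU:

from torch import nn

num_inputs, num_hiddens = 128, 256  # arbitrary sizes for illustration
lstm = nn.LSTM(num_inputs, num_hiddens)
gru = nn.GRU(num_inputs, num_hiddens)
count = lambda m: sum(p.numel() for p in m.parameters())
# An LSTM layer stores weights and biases for 4 gate/candidate computations, a GRU only for 3
print("LSTM parameters:", count(lstm))  # 4 * (num_hiddens * (num_inputs + num_hiddens) + 2 * num_hiddens)
print("GRU parameters: ", count(gru))   # 3 * (num_hiddens * (num_inputs + num_hiddens) + 2 * num_hiddens)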

4. Code implementation

1. GRU code

From Dive into Deep Learning (d2l)

import torch
from torch import nn
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device)*0.01

    def three():
        # normal(shape): returns a tensor of the given shape with values drawn from a standard normal distribution, scaled by 0.01
        # torch.zeros: creates a zero-filled tensor of size num_hiddens; device sets the device of the output tensor
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                torch.zeros(num_hiddens, device=device))

    # Initialize the model parameters: the trainable parameters of the update gate z_t,
    # the reset gate r_t and the candidate hidden state; three() supplies the random weights and zero bias for each
    W_xz, W_hz, b_z = three()  # update gate parameters
    W_xr, W_hr, b_r = three()  # reset gate parameters
    W_xh, W_hh, b_h = three()  # candidate hidden state parameters

    # Output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)

    # requires_grad_(True): track all operations on these tensors;
    # after calling .backward(), all gradients are computed automatically and accumulated into the .grad attribute
    params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params

# Hidden-state initialization function: returns a tuple containing one all-zero tensor of shape (batch_size, num_hiddens)
def init_gru_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),)

def gru(inputs, state, params):
    # Unpack the trainable (gradient-tracking) parameters
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []

    # Reproduce the GRU formulas for each time step
    # torch.sigmoid maps the values to between 0 and 1
    for X in inputs:
        Z = torch.sigmoid((X @ W_xz) + (H @ W_hz) + b_z)
        R = torch.sigmoid((X @ W_xr) + (H @ W_hr) + b_r)
        H_tilda = torch.tanh((X @ W_xh) + ((R * H) @ W_hh) + b_h)
        H = Z * H + (1 - Z) * H_tilda  # d2l convention: Z gates the old state; equivalent to the formula above with z and 1-z swapped
        Y = H @ W_hq + b_q
        outputs.append(Y)
    # torch.cat: concatenate the per-step outputs along the given dimension
    return torch.cat(outputs, dim=0), (H,)

vocab_size, num_hiddens, device = len(vocab), 256, d2l.try_gpu()
num_epochs, lr = 500, 1
model = d2l.RNNModelScratch(len(vocab), num_hiddens, device, get_params,
                            init_gru_state, gru)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)

[Figure: training output of d2l.train_ch8]


Origin blog.csdn.net/qq_45848817/article/details/128466074