(3) RNN variants: LSTM and GRU

Introduction to LSTMs

LSTM (Long Short-Term Memory) is a network structure proposed by Hochreiter and Schmidhuber in 1997. Although the model has outstanding properties for sequence modeling, it did not attract much attention from the academic community at the time, because neural networks were then in a period of decline. As deep learning gradually developed, applications of LSTM have steadily increased.

The difference between LSTM and a SimpleRNN is that LSTM adds a "processor" to the algorithm that judges whether information is useful or not. The structure of this processor is called a memory block (Memory Block).

The memory block structure mainly includes three gates: the Forget Gate, the Input Gate, and the Output Gate, plus a memory unit (Cell).

All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.
LSTM has the same chain-like structure, but the repeating module is different. Whereas a standard RNN repeats a single neural network layer, the repeating module in LSTM contains four interacting layers, three sigmoid layers and one tanh layer, which interact in a very specific way.


The core idea of LSTM

The key to LSTM is the cell state, the horizontal line running across the top of the diagram.
The cell state is like a conveyor belt: it runs straight along the entire chain with only a few small linear interactions, so it is easy for information to flow along it unchanged.

LSTMs have the ability to remove or add information to the cell state through carefully designed structures called "gates". A gate is a way of selectively letting information through. Each gate consists of a sigmoid neural network layer and a pointwise multiplication operation.

The sigmoid layer outputs values between 0 and 1, describing how much of each component should be let through: 0 means "let nothing through", and 1 means "let everything through". In this way the network can learn which information needs to be forgotten and which needs to be kept.
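A minimal numpy sketch of this gating idea (the vectors below are made up purely for illustration):

```python
import numpy as np

cell_state = np.array([2.0, 1.5, 0.7, 3.0])   # information flowing along the cell state
gate       = np.array([1.0, 0.0, 0.5, 0.9])   # sigmoid outputs, each between 0 and 1

# Pointwise multiplication: 1.0 lets a component through untouched,
# 0.0 blocks it completely, values in between let part of it through.
print(gate * cell_state)   # [2.   0.   0.35 2.7 ]
```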

Forget gate (memory gate)


The first step in our LSTM is to decide what information to discard from the cell state. This decision is made by a structure known as the "forget gate". The forget gate reads the previous output $h_{t-1}$ and the current input $x_t$, applies a sigmoid mapping, and outputs a vector $f_t$ whose components all lie between 0 and 1 (1 means "keep completely", 0 means "discard completely", which corresponds to remembering what is important and forgetting what is irrelevant). This vector is finally multiplied elementwise with the cell state $C_{t-1}$.
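In the usual notation (with $W_f$ and $b_f$ denoting the forget gate's weight matrix and bias, and $[h_{t-1}, x_t]$ the concatenation of the previous output and the current input), this step is written as:

$$ f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) $$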

Input gate and cell state

The next step is to determine what new information is stored in the cell state. There are two parts here:

  1. A sigmoid layer, called the "input gate layer", decides which values we are going to update;
  2. A tanh layer creates a vector of new candidate values $\widetilde{C}_t$ that could be added to the state.
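With the same notation as before ($W_i$, $b_i$, $W_C$, $b_C$ being the corresponding weights and biases), these two parts are:

$$ i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) $$
$$ \widetilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) $$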

Now it is time to update the old cell state $C_{t-1}$ to the new state $C_t$. The previous steps have already decided what to do; now we actually carry it out.

We multiply the old state by $f_t$, discarding the information we decided to discard, and then add $i_t * \widetilde{C}_t$, the new candidate values scaled by how much we decided to update each component of the state.
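In formula form, the cell state update is simply:

$$ C_t = f_t * C_{t-1} + i_t * \widetilde{C}_t $$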

Output gate

Finally, we need to decide what to output. This output is based on the cell state, but in a filtered form.

First, we run a sigmoid layer to decide which parts of the cell state will be output.
Next, we pass the cell state through tanh (squashing each value to between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to output.
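In the same notation, the output gate and the final output are:

$$ o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) $$
$$ h_t = o_t * \tanh(C_t) $$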

Complete LSTM structure

[Figure: the complete LSTM cell, combining the forget, input, and output gates]
[Figure: intuition for how the input gate and the forget gate work together]
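Putting the previous steps together, here is a minimal numpy sketch of one LSTM time step (the weight layout, variable names, and toy sizes are my own illustration, not a fixed API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, hidden + input) and b has shape (4 * hidden,);
    the four row blocks hold the forget, input, candidate, and output parameters.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b        # all four pre-activations at once
    f = sigmoid(z[0 * hidden:1 * hidden])            # forget gate f_t
    i = sigmoid(z[1 * hidden:2 * hidden])            # input gate i_t
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])      # candidate values
    o = sigmoid(z[3 * hidden:4 * hidden])            # output gate o_t
    c_t = f * c_prev + i * c_tilde                   # update the cell state
    h_t = o * np.tanh(c_t)                           # filtered output
    return h_t, c_t

# Toy usage with random weights.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5
W = 0.1 * rng.standard_normal((4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.standard_normal((7, input_size)):       # a length-7 toy sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)                              # (5,) (5,)
```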

How does LSTM avoid vanishing gradients?

Unlike the ordinary RNN introduced before, the gradient flowing through the LSTM cell state is a sum of several terms, and its direct path does not involve repeated multiplication by activation-function derivatives. This is why LSTM does not suffer from the vanishing-gradient problem in the same way.
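Concretely, looking only at the direct path between consecutive cell states in the update formula $C_t = f_t * C_{t-1} + i_t * \widetilde{C}_t$ (and ignoring the indirect dependence through $h_{t-1}$), the local gradient is just the forget gate itself:

$$ \frac{\partial C_t}{\partial C_{t-1}} = f_t $$

As long as the forget gate stays close to 1, this term neither vanishes nor explodes, so gradients can flow back over many time steps.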

Gated Recurrent Unit (GRU)

The GRU (Gated Recurrent Unit) structure only appeared in 2014. Its performance is similar to that of LSTM, but it uses fewer parameters, so computation is faster. A GRU combines the forget gate and input gate into a single update gate, merges the memory cell and the hidden state into one state, and adds a reset gate that controls how much of the previous hidden state is used when computing the candidate state. This makes the whole structure simpler while keeping strong performance.
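In one common formulation (bias terms omitted for brevity; $z_t$ is the update gate, $r_t$ the reset gate, and $\widetilde{h}_t$ the candidate state), the GRU computes:

$$ z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) $$
$$ r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) $$
$$ \widetilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right) $$
$$ h_t = (1 - z_t) * h_{t-1} + z_t * \widetilde{h}_t $$

The parameter saving is easy to check, for example in PyTorch (assuming it is installed; the layer sizes below are arbitrary):

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256)
gru  = nn.GRU(input_size=128, hidden_size=256)

# LSTM has four weight blocks (forget, input, candidate, output),
# GRU has three (update, reset, candidate), so GRU is roughly 3/4 the size.
print(sum(p.numel() for p in lstm.parameters()))
print(sum(p.numel() for p in gru.parameters()))
```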



Source: blog.csdn.net/qq_34184505/article/details/128938140