RNN-LSTM

Typical RNN input/output patterns:

  • one to one: image classification
  • one to many: image captioning (describing a picture in words)
  • many to one: sentiment classification, music classification
  • many to many: machine translation (sequence to sequence)
  • many to many: language modeling, NER tagging

RNN

Structure: only an input x and a hidden state h

Recurrence: an RNN is a chain structure in which every time step (time slice) reuses the same parameters.

It takes a sequence as input and recurses along the direction in which the sequence evolves.

Along the time dimension it is a deep model: if a sentence has 100 words, the unrolled RNN is 100 steps deep, and it can handle sentences of different lengths.
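
As an illustration (not from the original post), here is a minimal NumPy sketch of the vanilla RNN recurrence h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b); the sizes and names are made-up placeholders:

    import numpy as np

    def rnn_forward(xs, W_xh, W_hh, b_h):
        # Vanilla RNN: the same parameters (W_xh, W_hh, b_h) are reused at every
        # time step, so the model size does not depend on the sequence length.
        h = np.zeros(W_hh.shape[0])              # initial hidden state
        hs = []
        for x in xs:                             # one step per sequence element
            h = np.tanh(W_xh @ x + W_hh @ h + b_h)
            hs.append(h)
        return hs                                # hidden state at every time step

    # Example: sequence of 5 input vectors of size 3, hidden size 4 (arbitrary numbers)
    rng = np.random.default_rng(0)
    xs = [rng.standard_normal(3) for _ in range(5)]
    W_xh = 0.1 * rng.standard_normal((4, 3))
    W_hh = 0.1 * rng.standard_normal((4, 4))
    b_h = np.zeros(4)
    print(rnn_forward(xs, W_xh, W_hh, b_h)[-1])  # final hidden state

Because the loop reuses the same W_xh and W_hh at every step, the parameter count is independent of the sequence length.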

Advantages

  • Can handle input of any length
  • Model size does not grow with input length
  • Computation takes past (historical) information into account
  • Weights are shared across time steps

Disadvantages

  • Computation is slow (the steps must run sequentially)
  • Sensitive to short-term information; struggles to capture long-term dependencies

In deep learning (especially with RNNs), the problem of "long-term dependencies" is ubiquitous. It arises because, after the network's state has passed through many steps of computation, features from time slices far in the past have been overwritten.

Vanishing and exploding gradients are among the key problems that plague RNN training. They are caused by the repeated multiplication of the RNN's recurrent weight matrix during backpropagation: composing the same function many times leads to extreme nonlinear behavior.

Gradient Explosion/Gradient Vanishing

BPTT (backpropagation through time)

Problem:

If the norm of the recurrent weight matrix is less than 1, the gradient vanishes.

If the norm is greater than 1, the gradient explodes.
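
As a toy sketch (a deliberate simplification, not from the post): in BPTT the gradient is multiplied by roughly the recurrent weight matrix once per time step, so its norm shrinks or grows geometrically with the sequence length:

    import numpy as np

    def grad_norm_after_steps(W, T):
        # In BPTT the gradient is (roughly) multiplied by the recurrent weight
        # matrix once per time step; the nonlinearity's derivative is ignored here.
        g = np.ones(W.shape[0])
        for _ in range(T):
            g = W.T @ g
        return np.linalg.norm(g)

    I = np.eye(4)
    print(grad_norm_after_steps(0.9 * I, 100))   # norm ~ 0.9**100: vanishes
    print(grad_norm_after_steps(1.1 * I, 100))   # norm ~ 1.1**100: explodes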

How to solve?

Gradient explosion

Gradient clipping for exploding gradients:

If the gradient norm exceeds a chosen threshold, scale it down to the threshold (clip it), as in the sketch below.
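
A hedged PyTorch sketch of gradient clipping by norm; the model, data, and threshold below are placeholders:

    import torch
    import torch.nn as nn

    model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)   # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(4, 20, 8)        # dummy batch: (batch, seq_len, input_size)
    out, h = model(x)
    loss = out.pow(2).mean()         # dummy loss, only to produce gradients
    loss.backward()

    # If the total gradient norm exceeds max_norm, rescale the gradients down to it.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()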

Vanishing gradients are harder to fix; this motivates architectures such as LSTM.

LSTM

Long Short-Term Memory (LSTM): used for tasks involving time-series / sequential data

Structure: memory (cell) state c + hidden state h

1. Cell state c(t)

Acts like a conveyor belt; the information it carries is controlled by the forget gate and the input gate.

2. Forget gate f(t)

Decides which of the information on the conveyor belt (the cell state) should be forgotten.

3. Input gate i(t)

Decides, based on the current context, how much of the new candidate information is written into the cell state (an element-wise product of the gate values and the candidate values).

4. Output gate o(t)

Produces one output that goes in two directions: one copy is the input to the next unit (time step), the other is the LSTM's output value at this step.

Purpose: Selective retention and extraction of information
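
Putting the four pieces together, here is a minimal NumPy sketch of a single LSTM time step (the gate equations are the standard ones; the shapes and parameter layout are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W, U, b hold one weight matrix / bias per gate: forget (f), input (i),
        # output (o) and the candidate values (g).
        f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate: what to erase from c
        i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate: how much new info to write
        o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate: how much of c to expose
        g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate information
        c = f * c_prev + i * g       # cell state ("conveyor belt"): forget, then write
        h = o * np.tanh(c)           # hidden state: next unit's input and this step's output
        return h, c

    # Example with input size 3 and hidden size 4 (arbitrary numbers)
    rng = np.random.default_rng(0)
    W = {k: 0.1 * rng.standard_normal((4, 3)) for k in "fiog"}
    U = {k: 0.1 * rng.standard_normal((4, 4)) for k in "fiog"}
    b = {k: np.zeros(4) for k in "fiog"}
    h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), W, U, b)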

Advantages:

Mitigates gradient vanishing/explosion, though this is not 100% guaranteed

Can capture longer-range dependencies in time-series data than a vanilla RNN

LSTM variants:

  • Stacked LSTM: multiple LSTM layers stacked on top of each other (see the sketch after this list)
  • CNN-LSTM: a CNN processes the image, an LSTM generates the text
  • Encoder-Decoder LSTM: the encoder LSTM and decoder LSTM in the seq2seq model
  • Bidirectional LSTM: processes the sequence in both directions, helping with the long-term dependency problem
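
For the stacked variant, a brief PyTorch sketch (the sizes are arbitrary placeholders): stacking just means num_layers > 1, so each layer's hidden states feed the layer above:

    import torch
    import torch.nn as nn

    # Two LSTM layers stacked: the hidden states of layer 1 are the inputs of layer 2.
    stacked = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)

    x = torch.randn(4, 20, 8)        # dummy batch: (batch, seq_len, input_size)
    out, (h_n, c_n) = stacked(x)
    print(out.shape)                 # (4, 20, 16): hidden states of the top layer
    print(h_n.shape)                 # (2, 4, 16): final hidden state of each layer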

Bidirectional LSTM

Bidirectional: the prediction is based not only on the preceding context but also on the following context.

Generally more accurate than a unidirectional LSTM; can be used for speech models.
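
A short PyTorch sketch of a bidirectional LSTM (the layer sizes are arbitrary placeholders):

    import torch
    import torch.nn as nn

    # bidirectional=True runs one LSTM left-to-right and another right-to-left
    # and concatenates their hidden states at every time step.
    bilstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

    x = torch.randn(4, 20, 8)        # dummy batch: (batch, seq_len, input_size)
    out, (h_n, c_n) = bilstm(x)
    print(out.shape)                 # (4, 20, 32): forward and backward states concatenated
    print(h_n.shape)                 # (2, 4, 16): final hidden state of each direction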

RNN/LSTM/Bi-LSTM

RNN suffers from vanishing gradients: it cannot capture information from long ago.
LSTM can only capture past information.

Bi-LSTM can capture both past and future context.

Source: blog.csdn.net/weixin_46489969/article/details/125572721