one to one: image classification
one to many: image captioning (look at a picture and describe it in words)
many to one: sentiment classification / music classification
many to many: machine translation (sequence to sequence)
many to many (aligned): language modeling / NER tagging
RNN
Structure: only an input x and a hidden state h
Recurrence: an RNN is a chain structure, and every time step uses the same parameters.
It takes a sequence as input and recurses along the direction in which the sequence evolves.
Along the time dimension it is a deep model: if a sentence contains 100 words, the unrolled RNN is 100 layers deep, which lets it handle sentences of different lengths.
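The recurrence above can be sketched as a minimal forward pass. This is a toy sketch, not a production implementation: the tanh activation, the small made-up dimensions, and the random initialization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: input size 4, hidden size 3.
W_xh = rng.normal(scale=0.1, size=(3, 4))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(3, 3))  # hidden-to-hidden weights, shared across all time steps
b_h = np.zeros(3)

def rnn_forward(xs):
    """Apply the same cell at every time step; the sequence length is arbitrary."""
    h = np.zeros(3)  # initial hidden state
    states = []
    for x in xs:  # one recursion step per sequence element
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return states

# The same parameters handle sequences of different lengths:
short = rnn_forward(rng.normal(size=(5, 4)))
long = rnn_forward(rng.normal(size=(100, 4)))
print(len(short), len(long))  # 5 100
```

The 100-step call is the "depth 100" unrolling described above: the model size stays fixed while the chain grows with the input.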
Advantages
- Can handle input of any length
- Model size does not vary with input length
- Can use information from past time steps
- weight sharing
Shortcomings
- Computation is slow (time steps must run sequentially)
- Sensitive to short-term information; struggles to capture long-term dependencies
In deep learning (especially with RNNs), the problem of long-term dependencies is ubiquitous. It arises because, after the network's state passes through many stages of computation, the features from time slices far in the past have already been overwritten.
Vanishing and exploding gradients are among the key obstacles to training RNN models. Both are caused by the repeated multiplication of the RNN's weight matrix during backpropagation: composing the same function many times leads to extreme nonlinear behavior.
Gradient Explosion/Gradient Vanishing
BPTT (backpropagation through time)
Problem:
If the norm of the recurrent weight matrix is less than 1, the gradient vanishes
If the norm is greater than 1, the gradient explodes
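The effect of repeated multiplication during BPTT can be seen numerically. This toy illustration uses a scalar as a stand-in for the norm of the recurrent weight matrix:

```python
# Backpropagating through T time steps multiplies the gradient by the
# recurrent weight's Jacobian T times; a scalar stand-in shows both regimes.
T = 100
vanish = 0.9 ** T   # norm < 1: the gradient shrinks toward 0
explode = 1.1 ** T  # norm > 1: the gradient blows up
print(vanish)       # ~2.7e-05
print(explode)      # ~1.4e+04
```

Over 100 steps, even mild deviations from norm 1 compound into extreme values, which is exactly why long chains are hard to train.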
How to solve this?
Gradient explosion:
Gradient clipping for exploding gradients
If the gradient norm exceeds a chosen threshold, scale it down to the threshold
Vanishing gradients are harder to solve
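The clipping rule can be sketched as follows. This is a minimal sketch of clipping by global norm (the strategy used by common frameworks); the function name and threshold are illustrative choices, not a specific library API.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale all gradients if their combined L2 norm exceeds the threshold."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / total_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]  # L2 norm = 5
clipped = clip_by_global_norm(grads, threshold=1.0)
print(np.linalg.norm(clipped[0]))  # 1.0
```

The direction of the gradient is preserved; only its magnitude is capped at the threshold.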
LSTM
Long Short-Term Memory: for tasks involving time-series data
Structure: cell state (memory) c + hidden state h
1. Cell state c(t)
The information on the "conveyor belt", controlled by the forget gate and the input gate
2. Forget gate f(t)
Decides which information on the conveyor belt should be forgotten
3. Input gate i(t)
Decides how much of the new candidate information (computed from the current input and context) to write to the cell state; the new cell state is the element-wise sum of the forgotten old state and the gated candidate
4. Output gate o(t)
Produces one output, split into two directions (one as the hidden state passed to the next unit, one as the LSTM's output at this step)
Purpose: Selective retention and extraction of information
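The four components above can be sketched as a single LSTM cell step. This is a minimal sketch with made-up dimensions and random initialization; real implementations usually fuse the four weight matrices into one for speed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 4, 3  # hypothetical input and hidden sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate plus the candidate (illustrative init).
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for _ in range(4))
b_f, b_i, b_o, b_c = (np.zeros(n_hid) for _ in range(4))

def lstm_step(x, h, c):
    z = np.concatenate([x, h])        # current input + previous hidden state
    f = sigmoid(W_f @ z + b_f)        # forget gate: what to erase from c
    i = sigmoid(W_i @ z + b_i)        # input gate: what to write to c
    o = sigmoid(W_o @ z + b_o)        # output gate: what to expose as h
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate cell content
    c = f * c + i * c_tilde           # update the "conveyor belt"
    h = o * np.tanh(c)                # one value, used as both next input and output
    return h, c

h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)  # (3,) (3,)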
Advantages:
- Mitigates vanishing and exploding gradients, though not 100% guaranteed
- Can capture longer-range dependencies in time-series data than a vanilla RNN
LSTM variants:
- Stacked LSTM: multiple LSTM layers stacked on top of each other
- CNN-LSTM: a CNN processes the images, an LSTM generates the text
- Encoder-Decoder LSTM: encoder LSTM and decoder LSTM, as in the seq2seq model
- Bidirectional LSTM: processes the sequence in both directions, easing the long-term dependence problem
Bidirectional LSTM
Bidirectional: uses not only the preceding context but also the following context
Generally achieves higher accuracy than a unidirectional LSTM; can be used for speech models
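The two-direction idea can be sketched as follows. For brevity this uses a simple tanh cell as a stand-in for the LSTM cell, and (unlike a real Bi-LSTM, which has separate parameters per direction) reuses one set of weights; the point is only how forward and backward passes are combined.

```python
import numpy as np

rng = np.random.default_rng(2)
W_xh = rng.normal(scale=0.1, size=(3, 4))
W_hh = rng.normal(scale=0.1, size=(3, 3))

def run_rnn(xs):
    """Plain recurrent pass (stand-in for one LSTM direction)."""
    h = np.zeros(3)
    out = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        out.append(h)
    return out

xs = rng.normal(size=(6, 4))
fwd = run_rnn(xs)              # left-to-right: each state summarizes the past
bwd = run_rnn(xs[::-1])[::-1]  # right-to-left: each state summarizes the future
# Each time step sees both directions via concatenation:
bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(bi), bi[0].shape)  # 6 (6,)
```

Every position thus gets a representation built from both its past and its future context, which is why Bi-LSTMs help on tasks where the whole sequence is available at once.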
RNN/LSTM/Bi-LSTM
RNN suffers from vanishing gradients: it cannot capture information from far in the past
LSTM captures only past information
Bi-LSTM captures both past and future information