1.8 [Wang Xiaocao Deep Learning Notes] Vanishing gradients with RNNs

Course source: Andrew Ng's (Wu Enda's) deep learning course "Sequence Models"
Notes compiled by: Wang Xiaocao
Date: April 10, 2018


Earlier we introduced the structure of RNNs, forward propagation and backpropagation, and the application of RNNs to named entity recognition and language models. This section describes a problem encountered when building RNNs: vanishing gradients. Here we explain what vanishing gradients are, and the next few sections describe how to solve the problem.

1. What are vanishing gradients

Let’s first look at the traditional deep neural network:
[Figure: a very deep feedforward neural network]
Suppose the network is very deep, with 1,000 layers or more. After the output y is obtained through forward propagation, it is very difficult for the gradient computed at y to propagate all the way back during gradient descent; that is, the gradient of the loss hardly affects the weights of the earliest layers.
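To make this concrete, here is a minimal numpy sketch (not from the course; the layer count, weight scale, and sigmoid activations are illustrative assumptions) of how the gradient signal shrinks as it is propagated back through many layers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 100, 64

grad = np.ones(width)                                # gradient arriving at the top layer
for layer in range(n_layers):
    W = rng.normal(scale=0.1, size=(width, width))   # small random weights (assumed)
    a = rng.uniform(size=width)                      # pretend sigmoid activations in (0, 1)
    sigmoid_deriv = a * (1 - a)                      # sigmoid'(z) = a(1 - a) <= 0.25
    grad = (W.T @ grad) * sigmoid_deriv              # one step of backpropagation
    if layer % 20 == 0:
        print(f"layer {layer:3d}: ||grad|| = {np.linalg.norm(grad):.3e}")
```

After enough layers the norm has dropped by many orders of magnitude, which is exactly why the gradient "hardly affects the previous weights".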

The same is true for RNNs; you are already familiar with their structure:

Suppose there are two training samples, and the sentences are very long:
The cat, which already ate …, was full.
The cats, which already ate …, were full.
The "was" in the first sentence depends on "cat", and the "were" in the second sentence depends on "cats", but these words are so far apart that it is difficult for the later word to be predicted correctly based on the earlier one.

Just as in a traditional deep neural network, after forward propagation it is difficult for the error at the output y<t> to affect, through the gradient, the computations of much earlier time steps. So when the RNN is predicting the later word "was" or "were", the "cat" or "cats" that appeared much earlier has been almost forgotten.

Therefore, in practice each output y<t> can only influence, through backpropagation, the computations of nearby time steps. In other words, no matter whether the prediction y<t> is right or wrong during training, or whether its loss is large or small, the error signal cannot reach the very early layers through backpropagation and effectively adjust their weights.
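The same effect can be seen directly in a toy RNN. The sketch below (a plain tanh RNN with small random weights; all sizes and scales are illustrative assumptions) accumulates the Jacobian of the hidden state a<t> with respect to a<1>, i.e. the factor through which an error at time t would reach time step 1, and prints how quickly its norm decays:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 32
Waa = rng.normal(scale=0.1, size=(hidden, hidden))   # recurrent weights (assumed)
Wax = rng.normal(scale=0.1, size=(hidden, hidden))   # input weights (assumed)

a = np.tanh(Wax @ rng.normal(size=hidden))           # hidden state a<1>
jacobian = np.eye(hidden)                            # d a<t> / d a<1>, starting at t = 1
for t in range(2, 51):
    x = rng.normal(size=hidden)                      # dummy input x<t>
    a = np.tanh(Waa @ a + Wax @ x)
    step = np.diag(1 - a**2) @ Waa                   # d a<t> / d a<t-1>
    jacobian = step @ jacobian                       # chain rule through time
    if t % 10 == 0:
        print(f"t = {t:2d}: ||d a<t>/d a<1>|| = {np.linalg.norm(jacobian):.3e}")
```

The norm falls off roughly geometrically with the distance between the two time steps, which is why "was"/"were" effectively cannot see "cat"/"cats".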

This is a shortcoming of the basic RNN algorithm, but don't worry: the next few sections will describe in detail how to solve this problem.

2. Exploding gradients

In addition to vanishing gradients, RNNs can also run into exploding gradients. This problem is easier to solve, so it is only mentioned briefly here.

How to spot exploding gradients?
Exploding gradients are easy to spot: exponentially large gradients make your parameters blow up until the network's parameters collapse, and you will see many NaN (not-a-number) values, which means a numerical overflow has occurred in your network.
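The opposite failure is just as easy to reproduce in a toy setting. Below is a minimal sketch (the oversized weight scale and float32 precision are assumptions chosen to trigger the blow-up quickly) in which the backpropagated gradient grows exponentially until it overflows, producing exactly the inf/NaN symptom described above:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden = 32
# Deliberately oversized recurrent weights (an assumption made to force the blow-up)
Waa = rng.normal(scale=2.0, size=(hidden, hidden)).astype(np.float32)

grad = np.ones(hidden, dtype=np.float32)
for t in range(1, 101):
    grad = Waa.T @ grad                              # linearised backprop step through time
    if not np.all(np.isfinite(grad)):
        print(f"t = {t:3d}: gradient has overflowed to inf/NaN")
        break
    if t % 10 == 0:
        print(f"t = {t:3d}: ||grad|| = {np.linalg.norm(grad):.3e}")
```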

How to solve exploding gradients?
When exploding gradients are found, one solution is gradient clipping.
Watch your gradient vectors, and once one exceeds a certain threshold, rescale it so that it is not too large, for example by clipping it according to some maximum value.
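Here is a minimal sketch of that idea (the threshold of 5 and the helper name clip_by_norm are illustrative assumptions, not from the course): if the gradients' overall norm exceeds the threshold, rescale them so the direction is kept but the length is capped.

```python
import numpy as np

def clip_by_norm(gradients, max_norm=5.0):
    """Rescale a list of gradient arrays so their global norm is <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        gradients = [g * scale for g in gradients]
    return gradients

# Example: a gradient that is far too large gets scaled back to norm 5.
grads = [np.full((3, 3), 100.0), np.full(3, 50.0)]
clipped = clip_by_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # ~5.0
```

Deep learning frameworks also provide this operation built in, for example torch.nn.utils.clip_grad_norm_ in PyTorch.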
