Deep learning: vanishing gradients and exploding gradients

Reference for this article: Deep Learning 3 - Gradient Explosion and Gradient Disappearance

The root causes of vanishing and exploding gradients: the deep network structure and the backpropagation algorithm.

Current methods for optimizing neural networks are all based on the idea of backpropagation: the error computed from the loss function is propagated backward through the deep network to guide the update of its weights.

Why does neural network optimization use gradient descent?

A deep network is a stack of many nonlinear layers (layers with activation functions). Each nonlinear layer can be regarded as a nonlinear function f(x), so the entire deep network can be regarded as a composite nonlinear multivariate function:

F(x)=f_{n}(\dots f_{2}(f_{1}(x)\times w_{1}+b_{1})\times w_{2}+b_{2}\dots)

Our goal is for this multivariate function to map inputs to outputs well. Suppose that, for the various inputs, the optimal output is g(x); then optimizing the deep network means finding weights that minimize the loss, for example the simple squared-error (MSE) loss:

LOSS=(g(x)-F(x))^{2}

For this kind of minimization problem, gradient descent is a natural fit.
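
As a rough illustration (my own sketch, not from the original article), the snippet below runs plain gradient descent on a squared-error loss for a tiny one-parameter model; the target function g(x) = 3x and the learning rate are made-up assumptions.

```python
import numpy as np

# Hypothetical setup: fit F(x) = w * x to a target g(x) = 3 * x
# by minimizing the squared error with plain gradient descent.
x = np.linspace(-1.0, 1.0, 50)
g = 3.0 * x                          # "optimal" outputs g(x)
w = 0.0                              # initial weight
lr = 0.1                             # learning rate (assumed)

for step in range(100):
    F = w * x                        # model output F(x)
    loss = np.mean((g - F) ** 2)     # MSE loss
    grad = np.mean(-2.0 * (g - F) * x)  # dLoss/dw
    w -= lr * grad                   # gradient descent update

print(f"learned w = {w:.3f}, final loss = {loss:.6f}")  # w approaches 3
```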

1. The structure of the deep network

When backpropagation differentiates through the activation functions, the chain rule multiplies one such derivative factor per layer. If each factor is greater than 1, the computed gradient grows exponentially as the number of layers increases, and the gradient explodes. If each factor is less than 1, the gradient update decays exponentially with depth; and if any factor equals 0, the whole product becomes 0. This is the vanishing gradient. Both problems occur because the network is too deep and the weight updates become unstable, essentially due to the multiplicative effect in gradient backpropagation (factors less than 1 multiplied many times).

Consider how backpropagation updates a parameter w. When gradients vanish, the parameters w closer to the input layer barely move; when gradients explode, those parameters jump around violently. Neither stagnation nor wild jumps is what we want; we want w to change steadily in the direction that reduces the error.
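
To make the multiplicative effect concrete, here is a small numerical sketch (my own illustration, not from the article): it multiplies a fixed per-layer derivative factor across many layers and shows how the product shrinks or grows with depth.

```python
# Illustration of the chain-rule product across layers (factors are assumed values).
layers = 50

for factor in (0.25, 1.1):
    grad = 1.0
    for _ in range(layers):
        grad *= factor           # one derivative factor per layer
    print(f"per-layer factor {factor}: gradient after {layers} layers = {grad:.3e}")

# factor 0.25 -> ~7.9e-31 (vanishing); factor 1.1 -> ~1.2e+02 (growing toward explosion)
```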

Looking at the deep network as a whole, the learning speed differs greatly across layers: the layers close to the output learn well, while the layers close to the input learn very slowly. Even after long training, the weights of the early layers remain close to their initial values. So the root cause of vanishing and exploding gradients lies in this weakness of the backpropagation algorithm.

2. The activation-function perspective:

If the activation function is chosen poorly, for example sigmoid, vanishing gradients become obvious.

Consider the derivatives of the sigmoid (also called Logistic) function and of the tanh function. The derivative of the sigmoid peaks at only 0.25 and is much smaller everywhere else, so if every layer uses a sigmoid activation, vanishing gradients arise easily. The derivative of tanh peaks at 1, but only at input 0, and is less than 1 everywhere else, so after chained differentiation tanh can also easily cause gradients to vanish.
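
A quick numerical check of those derivative bounds (my own sketch): the sigmoid derivative σ(x)(1 − σ(x)) peaks at 0.25 and the tanh derivative 1 − tanh²(x) peaks at 1, both at x = 0.

```python
import numpy as np

x = np.linspace(-10, 10, 10001)

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # derivative of sigmoid
d_tanh = 1.0 - np.tanh(x) ** 2          # derivative of tanh

print("max sigmoid derivative:", d_sigmoid.max())  # ~0.25, at x = 0
print("max tanh derivative:   ", d_tanh.max())     # ~1.0,  at x = 0
```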

3. Solutions to vanishing and exploding gradients

1. Pre-training and fine-tuning

Pre-training: unsupervised layer-by-layer training. One hidden layer is trained at a time, taking the output of the previously trained hidden layer as its input; its own output then serves as the input for the next hidden layer. This is called layer-wise pre-training. After pre-training is complete, the whole network is fine-tuned.
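
A hedged sketch of the idea in PyTorch (my own illustration; the layer sizes, autoencoder objective, and training loop are assumptions, not details from the article): each layer is first trained unsupervised as a small autoencoder, then the stacked encoders are fine-tuned end to end.

```python
import torch
import torch.nn as nn

# Hypothetical greedy layer-wise pre-training with tiny autoencoders,
# followed by supervised fine-tuning of the stacked encoder.
dims = [784, 256, 64]                      # assumed layer sizes
data = torch.randn(128, 784)               # placeholder unlabeled data

encoders = []
inputs = data
for d_in, d_out in zip(dims[:-1], dims[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(50):                    # train this layer as an autoencoder
        opt.zero_grad()
        recon = dec(torch.sigmoid(enc(inputs)))
        loss = nn.functional.mse_loss(recon, inputs)
        loss.backward()
        opt.step()
    encoders.append(enc)
    inputs = torch.sigmoid(enc(inputs)).detach()  # this layer's output feeds the next layer

# Fine-tuning: stack the pre-trained encoders, add a head, train end to end.
model = nn.Sequential(encoders[0], nn.Sigmoid(), encoders[1], nn.Sigmoid(), nn.Linear(64, 10))
```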

2. Gradient clipping, regularization

Gradient clipping, also known as gradient truncation, is a method to prevent exploding gradients. The idea is to set a clipping threshold: when updating, if the gradient exceeds this threshold, it is forced back into that range.
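
As a hedged sketch of how this looks in practice with PyTorch (a tooling choice I am assuming, not one named in the article), clipping is applied between the backward pass and the optimizer step; the model, loss, and max_norm value here are placeholders.

```python
import torch

# Hypothetical model and data, just to show where clipping goes in the loop.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                                                    # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip total gradient norm
optimizer.step()                                                   # update with clipped gradients
```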

3. ReLU, LeakyReLU, ELU and other activation functions

From the shape of ReLU we know that its gradient is 0 for inputs below 0 and a constant 1 for inputs above 0. In the positive region the activation neither shrinks nor amplifies the gradient, so the vanishing and exploding gradient problems largely go away, because every layer passes the gradient through at the same rate.

The main contributions of ReLU:

  • Alleviates the vanishing and exploding gradient problems
  • Cheap and fast to compute (the gradient is a constant 0 or 1)
  • Speeds up network training

Drawbacks:

  • Because the negative part is always 0, some neurons can never be activated (this can be partially mitigated by using a small learning rate)
  • The output is not zero-centered

Although ReLU has these shortcomings, it is still the most widely used activation function at present.

LeakyReLU addresses the zero-gradient region of ReLU: for inputs below 0 the gradient is a small nonzero number. It fixes the dead region while keeping all the advantages of ReLU.

The ELU activation function also addresses the zero-gradient region of ReLU, but compared with LeakyReLU it is somewhat more expensive to compute (it involves a power of e).
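
For reference, here is a small NumPy sketch of the three activations as they are usually defined (the alpha values are common defaults I am assuming, not values given in the article):

```python
import numpy as np

def relu(x):
    # 0 for x < 0, identity for x >= 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small nonzero slope alpha in the negative region
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth negative region using the exponential (hence the extra cost)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```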

4. Batch Normalization (BatchNorm)


Regarding BN, here is a simple example to illustrate:

During the forward pass: f_{2}=f_{1}(w^{T}x+b); during the backward pass: \frac{\partial f_{2}}{\partial x}=\frac{\partial f_{2}}{\partial f_{1}}\times w

The size of w in the backpropagation process drives the vanishing or explosion of the gradient. By normalizing each layer's output to a fixed mean and variance, BN removes the amplifying or shrinking effect of w and thereby mitigates vanishing and exploding gradients. Equivalently, BN can be understood as pulling the outputs from the saturated region back into the unsaturated region of the activation (such as the sigmoid function).
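
A minimal sketch of the normalization step itself (my own illustration; gamma, beta, and eps are the usual learnable scale, shift, and numerical-stability constant assumed here):

```python
import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: (batch_size, features); normalize each feature over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learnable rescale and shift

x = np.random.randn(32, 4) * 10.0 + 5.0      # badly scaled activations
y = batch_norm_forward(x)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```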

5. Residual structure


Since the residual network was proposed, almost all deep networks have used residual connections. Compared with earlier deep networks of a few or a few dozen layers, a residual network can easily reach hundreds of layers without worrying about gradients vanishing too quickly, thanks to its shortcut connections. Abstractly, the residual structure turns the pure multiplication inside the chain-rule product into a sum of terms, which removes the damage done when some weight's derivative goes to 0: a local failure no longer affects the whole.
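
A hedged PyTorch sketch of a basic residual block (the layer sizes and the use of Linear layers are my own simplifications; real ResNets use convolutions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.fc2(self.act(self.fc1(x)))
        return self.act(out + x)   # shortcut: gradient also flows through the identity path

x = torch.randn(8, 64)
block = ResidualBlock(64)
print(block(x).shape)              # torch.Size([8, 64])
```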

6. LSTM (Long Short-Term Memory Network)

In the RNN (recurrent neural network) structure, the use of sigmoid or tanh functions easily causes vanishing gradients, meaning that when two time steps are far apart, the earlier one has almost no influence on the later one. The gating mechanism of LSTM is designed to address this long-term dependency problem. I will write a separate article with a detailed explanation of RNN and LSTM after I have studied them.
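
For completeness, a minimal PyTorch sketch of running an LSTM over a sequence (the batch size, sequence length, and feature sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: batch of 8 sequences, 20 time steps, 16 input features.
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(8, 20, 16)

output, (h_n, c_n) = lstm(x)   # gated cell state carries long-range information
print(output.shape)            # torch.Size([8, 20, 32]) - hidden state at every step
print(h_n.shape, c_n.shape)    # torch.Size([1, 8, 32]) each - final hidden/cell state
```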


Source: blog.csdn.net/GWENGJING/article/details/126804613