How to Identify and Solve Vanishing/Exploding Gradients

0. Causes of and solutions for vanishing and exploding gradients

  1. https://blog.csdn.net/qq_25737169/article/details/78847691

1. Causes and symptoms of vanishing gradients

1.1. Causes of vanishing gradients

In a deep network, if the derivatives of the activation functions are less than 1, then by the chain rule the gradients of parameters near the input layer are products of many factors less than 1, so they shrink toward 0. The sigmoid function, whose derivative f'(x) = f(x)(1 − f(x)) takes values in (0, 1/4], is especially prone to this.

Therefore, vanishing gradients usually stem from a network that is too deep combined with a poorly chosen activation function, such as sigmoid.
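
The shrinking chain-rule product above can be sketched numerically in pure Python (the 30-layer depth is an illustrative assumption):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # f'(x) = f(x) * (1 - f(x)); maximized at x = 0, where it equals exactly 1/4
    s = sigmoid(x)
    return s * (1.0 - s)

# Backpropagating through 30 sigmoid layers multiplies 30 factors <= 1/4,
# so even the best-case gradient reaching the input layer is vanishingly small.
n_layers = 30
best_case_gradient = sigmoid_deriv(0.0) ** n_layers
print(best_case_gradient)  # 0.25 ** 30, about 8.7e-19
```

Real gradients also include weight factors, but the activation-derivative ceiling alone already forces the product toward zero.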

1.2. Symptoms of vanishing gradients

The model fails to get updates from the training data, and the loss remains almost constant.

2. Causes and symptoms of exploding gradients

2.1. Causes of exploding gradients

Exploding gradients often arise when the initial weights are too large: the earlier layers then change faster than the later layers, the weights grow larger and larger, and the gradients explode.

In deep or recurrent neural networks, error gradients can accumulate across updates into very large gradients, which produce large weight updates and make the network unstable. In extreme cases, the weights become so large that they overflow, resulting in NaN values.

Repeated multiplication of gradient factors greater than 1.0 between network layers leads to exponential growth and thus to exploding gradients.
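
Running the same chain-rule arithmetic with factors above 1.0 shows the explosion (the per-layer factor 1.5 and depth 50 are assumed, illustrative values):

```python
# If each layer contributes a gradient factor greater than 1.0, the product
# grows exponentially with depth instead of shrinking.
per_layer_factor = 1.5   # assumed magnitude of each layer's gradient factor
n_layers = 50
gradient_scale = per_layer_factor ** n_layers
print(gradient_scale)  # roughly 6.4e8: weight updates this large destabilize training
```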

In a deep multilayer perceptron network, exploding gradients make training unstable: at best the network fails to learn from the training data, and at worst the weights become NaN values that can no longer be updated. In recurrent neural networks, exploding gradients likewise make the network unstable and unable to learn from the training data; at best, the network cannot learn from long input sequences.

2.2. Symptoms of exploding gradients

Exploding gradients during training are accompanied by some subtle signals, such as:

  1. The model cannot get updates from the training data (e.g., the loss barely improves).

  2. The model is unstable, causing large changes in the loss from update to update.

  3. During training, the model loss becomes NaN.

If you find these problems, then you need to look carefully for exploding gradients.

Here are some more obvious signals that can help confirm whether an exploding-gradient problem is occurring.

  1. The gradients of the model grow large rapidly during training.

  2. The model weights become NaN values during training.

  3. During training, the error gradient values of each node and layer consistently exceed 1.0.

3. Solutions

3.1. Redesign the network model

  1. In deep neural networks, exploding gradients can be addressed by redesigning the network with fewer layers.

  2. Using a smaller batch size is also beneficial for network training.

  3. In recurrent neural networks, updating over fewer previous time steps during training (truncated backpropagation through time) can alleviate the exploding-gradient problem.
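
The truncation idea can be sketched as splitting a sequence into bounded backprop windows (the helper name and chunk size are illustrative; real frameworks implement this inside their RNN training loops):

```python
def tbptt_chunks(sequence, k):
    # Truncated BPTT backpropagates only within windows of at most k time
    # steps, so the chain-rule product never contains more than k factors.
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

chunks = tbptt_chunks(list(range(10)), 4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Gradients never flow across chunk boundaries, which bounds how large the accumulated product can grow.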

3.2. Use the ReLU activation function

  1. In deep multilayer perceptron networks, exploding gradients may arise from the activation functions, such as the previously popular sigmoid and tanh functions.

  2. Exploding gradients can be reduced by using the ReLU activation function.
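
A sketch of why ReLU helps (pure Python; the 30-layer depth is illustrative): ReLU's derivative is either 0 or exactly 1, so the activation contributes no layer factor greater than 1 (explosion), and on active paths no factor below 1 (vanishing) either:

```python
def relu_deriv(x):
    # ReLU'(x) = 1 for x > 0, else 0: active paths contribute no shrinking
    # factor, and no factor ever exceeds 1.
    return 1.0 if x > 0 else 0.0

n_layers = 30
relu_path_gradient = relu_deriv(2.0) ** n_layers  # exactly 1.0 through all 30 layers
sigmoid_ceiling = 0.25 ** n_layers                # sigmoid's per-layer maximum is 1/4
print(relu_path_gradient, sigmoid_ceiling)
```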

3.3. Use long short-term memory networks

  1. In recurrent neural networks, exploding gradients can arise from the inherent instability of training such networks; backpropagation through time, for example, essentially converts the recurrent network into a deep multilayer perceptron.

  2. The exploding-gradient problem can be reduced by using long short-term memory (LSTM) units and their associated gating structures.

3.4. Use gradient clipping (gradient truncation)

  • Exploding gradients can still occur in very deep multilayer perceptron networks with large batch sizes, and in LSTMs with long input sequences. If gradient explosion still occurs, you can check and limit the size of the gradients during training; this is gradient clipping.

  • There is a simple and effective solution for dealing with exploding gradients: clip gradients that exceed a threshold.
    Specifically, check whether the error gradient exceeds a threshold and, if so, truncate it by setting it to the threshold.
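
A minimal pure-Python sketch of both common clipping variants (function names are illustrative; frameworks ship equivalents, e.g. PyTorch's `clip_grad_norm_`):

```python
import math

def clip_by_value(grads, threshold):
    # The scheme described above: any component beyond +/-threshold is set
    # to the threshold itself.
    return [max(-threshold, min(threshold, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    # Common variant: rescale the whole gradient vector when its L2 norm
    # exceeds max_norm, preserving its direction.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        return [g * (max_norm / norm) for g in grads]
    return list(grads)

grads = [3.0, -4.0]                # L2 norm is 5.0
print(clip_by_value(grads, 1.0))   # [1.0, -1.0]
print(clip_by_norm(grads, 1.0))    # [0.6, -0.8] up to float rounding
```

Norm-based clipping is usually preferred because it caps the update size without changing the gradient's direction.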

3.5. Use weight regularization

  • If gradient explosion persists, another approach is to check the size of the network weights and add a penalty to the loss function for large weight values. This process is called weight regularization and typically uses either an L1 penalty (the absolute values of the weights) or an L2 penalty (the squares of the weights).
    Applying an L1 or L2 penalty to the recurrent weights can help mitigate exploding gradients.
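
A minimal sketch of the L2 penalty and its gradient contribution (the names and the strength `lam` are illustrative assumptions):

```python
lam = 0.01                  # regularization strength (illustrative hyperparameter)
weights = [5.0, -3.0, 0.5]  # toy weight vector

def l2_penalty(ws, lam):
    # Term added to the task loss: lam * sum(w^2); here 0.01 * 34.25 = 0.3425
    return lam * sum(w * w for w in ws)

def l2_penalty_grad(ws, lam):
    # Term added to each weight's gradient: 2 * lam * w. Large weights get
    # pulled back toward zero, which in turn limits how large gradients can grow.
    return [2.0 * lam * w for w in ws]

penalty = l2_penalty(weights, lam)
penalty_grad = l2_penalty_grad(weights, lam)
```

In most frameworks this is exposed as a "weight decay" option on the optimizer rather than written by hand.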

3.6. Use residual modules to avoid vanishing gradients
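
A residual block computes y = x + f(x), so its local derivative is 1 + f'(x): the identity term keeps the chain-rule product from collapsing even when f'(x) is tiny. A minimal numerical sketch (assuming a sigmoid sub-block; the saturated input 5.0 and depth 30 are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f_deriv(x):
    # Derivative of a plain sigmoid sub-block; tiny when the unit saturates.
    s = sigmoid(x)
    return s * (1.0 - s)

def residual_deriv(x):
    # Residual block y = x + f(x): dy/dx = 1 + f'(x). The "+1" identity path
    # guarantees the per-layer factor never drops below 1 here.
    return 1.0 + f_deriv(x)

n_layers = 30
x = 5.0  # saturated regime: sigmoid'(5) is about 0.0066
plain_gradient = f_deriv(x) ** n_layers            # vanishes with depth
residual_gradient = residual_deriv(x) ** n_layers  # stays above 1
print(plain_gradient, residual_gradient)
```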

Summary: vanishing gradients arise when many activation derivatives less than 1 are multiplied together in a deep network; exploding gradients arise when weights are initialized too large or factors greater than 1 are multiplied repeatedly. The remedies are redesigning the network, ReLU activations, LSTM units, gradient clipping, weight regularization, and residual modules.


Origin blog.csdn.net/ytusdc/article/details/128635206