Vanishing gradients, exploding gradients, gradient dispersion, and the solutions to each

Causes of vanishing gradients

  1. Deep network (too many layers)
  2. Inappropriate loss function

Causes of exploding gradients

  1. Initial weight values that are too large

Why gradients vanish or explode: during backpropagation, the chain rule multiplies together the derivatives of every layer between the loss and the layer being updated (the final update is then scaled by the learning rate).

When each layer's partial derivative is greater than 1, the product grows exponentially with the number of layers; this is the exploding gradient. For example, 5 to the 30th power is roughly 9.3 × 10^20.

When each layer's partial derivative is less than 1, the product decays exponentially with the number of layers; this is the vanishing gradient. For example, 0.2 raised to the 30th power is roughly 1.1 × 10^-21.
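A minimal sketch of this effect, assuming (purely for illustration) that every layer contributes the same constant derivative, using the two values from the text:

```python
# Illustration only: the chain-rule product of 30 identical per-layer derivatives.
num_layers = 30

exploding = 1.0
vanishing = 1.0
for _ in range(num_layers):
    exploding *= 5.0   # |derivative| > 1 at every layer
    vanishing *= 0.2   # |derivative| < 1 at every layer

print(f"5^30   = {exploding:.3e}")   # ~9.313e+20 -> exploding gradient
print(f"0.2^30 = {vanishing:.3e}")   # ~1.074e-21 -> vanishing gradient
```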

Solutions

Gradient clipping:

1: For exploding gradients: set a threshold, and when updating the gradients, clip any gradient that exceeds the threshold back to the threshold value. This prevents gradient explosion (see the PyTorch example below).

2: Weight regularization & weight decay: add a norm of the weights (L2, L1, etc.) as a penalty term to the loss function, which discourages the weights from growing too large; alternatively, use the ReLU activation.

Setting the weight_decay argument of an optimizer in PyTorch enables weight decay.
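A minimal sketch of both options above (the model and the hyperparameter values are placeholders): an explicit L2 penalty added to the loss, and the built-in weight_decay argument of a PyTorch optimizer.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
criterion = nn.MSELoss()

# Option A: built-in weight decay (L2 regularization applied by the optimizer).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Option B: add the squared L2 norm of the weights to the loss manually.
x, y = torch.randn(32, 10), torch.randn(32, 1)
l2_lambda = 1e-4  # placeholder coefficient
loss = criterion(model(x), y)
loss = loss + l2_lambda * sum(p.pow(2).sum() for p in model.parameters())

optimizer.zero_grad()
loss.backward()
optimizer.step()
```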

Gradient clipping can be implemented in PyTorch as follows.
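A minimal sketch (the model, data, and threshold values are placeholders) showing PyTorch's two clipping utilities: clip_grad_value_ clips each gradient element to a threshold, while clip_grad_norm_ rescales the whole gradient so its norm does not exceed a maximum.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Clip each gradient element into [-1.0, 1.0] (value clipping, as described above) ...
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
# ... or rescale the gradients so their total L2 norm is at most 5.0 (norm clipping).
# In practice you would pick one of the two.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)

optimizer.step()
```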

Gradient dispersion

The technical reason why gradient descent (and related methods such as L-BFGS) does not work well on deep networks with randomly initialized weights is that the gradients become very small. Specifically, when backpropagation is used to compute the derivatives, the magnitude of the backpropagated gradient (flowing from the output layer toward the first few layers of the network) shrinks dramatically as the depth of the network increases. As a result, the derivative of the overall loss function with respect to the weights of the first few layers is very small, so under gradient descent those weights change so slowly that they cannot effectively learn from the samples. This problem is often called "gradient dispersion".

  • Gradient dispersion (see the sketch after this list):
    • Use the BN algorithm (batch normalization)
    • Change the activation function
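A minimal sketch of both remedies (the layer sizes are placeholders): a small fully connected network that inserts batch normalization after each linear layer and uses ReLU instead of a saturating activation such as sigmoid.

```python
import torch.nn as nn

# Placeholder layer sizes; the point is the BatchNorm + ReLU pattern.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),  # normalizes each layer's pre-activations
    nn.ReLU(),           # non-saturating activation instead of sigmoid/tanh
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 10),
)
```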

Benefits of the BN algorithm

1: It speeds up training and helps prevent overfitting: without normalization, the distribution of each layer's outputs changes after every update, so the network needs extra effort to keep adapting to the new distributions; this makes the model more complex, more prone to overfitting, and slower to converge.

2: It prevents the inputs to the activation function from drifting into the nonlinear saturation zone, which would otherwise cause gradient dispersion.

3: Because BN improves the generalization ability of the network, dropout and other regularization parameters can often be reduced or removed, which cuts down on tedious hyperparameter tuning.

4: The local response normalization (LRN) layer can be omitted.
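A minimal sketch of point 2 (the layer size, input scale, and saturation threshold are illustrative assumptions): with large pre-activations a sigmoid saturates and its gradient is nearly zero, while normalizing the pre-activations first keeps them in the sensitive region of the nonlinearity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 64) * 20.0   # deliberately large pre-activations
bn = nn.BatchNorm1d(64)

# A unit is "saturated" here if its sigmoid output is within 0.01 of 0 or 1.
saturated_raw = (torch.sigmoid(x).sub(0.5).abs() > 0.49).float().mean().item()
saturated_bn = (torch.sigmoid(bn(x)).sub(0.5).abs() > 0.49).float().mean().item()

print(f"saturated without BN: {saturated_raw:.2%}")  # most units saturated, gradient ~ 0
print(f"saturated with BN:    {saturated_bn:.2%}")   # almost none saturated
```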

Origin blog.csdn.net/weixin_43852823/article/details/127561792