The vanishing gradient problem: explanation and a simple example

Vanishing gradient

The vanishing gradient is an important problem in deep learning, especially when training deep neural networks. At its core, during backpropagation the gradient gradually becomes so small that the weights are barely updated, which makes the model hard to train or very slow to converge. The problem typically arises in deep networks where many layers are stacked on top of each other.

The main cause of the vanishing gradient problem is the chain rule of differentiation in deep networks. During backpropagation, the gradient is propagated from the output layer back toward the input layer: each layer's local gradient is computed and multiplied into the gradient passed on to the previous layer. If the derivative of the activation function is close to 0, or the weights (the eigenvalues or singular values of the weight matrices) are close to 0, the gradient shrinks rapidly as it propagates and eventually approaches zero, so the earlier weights are almost never updated.
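As a rough numerical illustration (a sketch I added, not from the original post), the snippet below multiplies the per-layer factors that the chain rule produces. The sigmoid derivative is at most 0.25, so with moderate weights the product collapses after only a few layers; the weight value and depth here are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # maximum value is 0.25, reached at z = 0

# Chain rule through n stacked layers: the gradient w.r.t. the input
# is the product of each layer's weight times its activation derivative.
w = 0.5      # a modest weight shared by every layer (illustrative)
z = 0.0      # pre-activation at which the derivative is evaluated
grad = 1.0
for layer in range(10):
    grad *= w * sigmoid_prime(z)   # each factor is at most 0.5 * 0.25

print(f"gradient after 10 layers: {grad:.2e}")  # about 9e-10
```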

Here are some common causes of the vanishing gradient problem, together with techniques for mitigating it:

  • Choice of activation function: activation functions such as Sigmoid and Tanh have derivatives close to zero when the input is large in magnitude, which causes the gradient to vanish. To address this, try an activation function with better gradient properties, such as ReLU (Rectified Linear Unit) or Leaky ReLU (several of these remedies are combined in the sketch after this list).

  • Weight initialization: an inappropriate weight initialization can also lead to vanishing gradients. Initializing with appropriately scaled random weights, or using specially designed schemes such as Xavier initialization or He initialization, helps alleviate the problem.

  • Batch Normalization: Batch Normalization is a technique that helps alleviate the vanishing gradient problem. By normalizing the input distribution of each layer, it keeps activations away from saturated regions and helps gradients flow more smoothly.

  • Gradient clipping: during training, gradient clipping can be used to cap the magnitude of gradients. Strictly speaking it targets the opposite issue, exploding gradients, which often appears alongside vanishing gradients in deep and recurrent networks.

  • Layer design: Sometimes, architectural designs such as reducing the depth of the network or using skip connections (such as ResNet) can alleviate the vanishing gradient problem.

  • Use LSTM or GRU: When processing sequence data, using recurrent neural network structures such as long short-term memory networks (LSTM) or gated recurrent units (GRU) can reduce the vanishing gradient problem.
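
To make the list above concrete, here is a minimal PyTorch sketch (my own illustration; the model, sizes, and hyperparameters are arbitrary) combining several of the remedies: ReLU activations, He (Kaiming) initialization, batch normalization, and gradient clipping via torch.nn.utils.clip_grad_norm_.

```python
import torch
import torch.nn as nn

# A small feed-forward block that applies several of the remedies above.
class MitigatedMLP(nn.Module):
    def __init__(self, dim=64, depth=6):
        super().__init__()
        layers = []
        for _ in range(depth):
            linear = nn.Linear(dim, dim)
            # He initialization is designed for ReLU-family activations.
            nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
            nn.init.zeros_(linear.bias)
            layers += [linear, nn.BatchNorm1d(dim), nn.ReLU()]
        self.net = nn.Sequential(*layers, nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x)

model = MitigatedMLP()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(32, 64)          # dummy batch
target = torch.randn(32, 1)

loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
# Cap the overall gradient norm before the parameter update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Note that the clipping step mainly guards against exploding gradients; the ReLU + He initialization + BatchNorm combination is what directly helps gradients keep flowing through the depth of the network.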

Example

Take a simple two-layer example. The input $x$ first passes through a linear layer, $x' = w_1 x + b_1$, and then through a second one, $y = w_2 x' + b_2$. What is the gradient of $y$ with respect to $x$?

Compute the gradient of $y$ with respect to $x'$:

$\frac{dy}{dx'} = \frac{d}{dx'}(w_2 x' + b_2) = w_2$

Compute the gradient of $x'$ with respect to $x$:

$\frac{dx'}{dx} = \frac{d}{dx}(w_1 x + b_1) = w_1$

Finally, multiply them to obtain the gradient of $y$ with respect to $x$:

$\frac{dy}{dx} = \frac{dy}{dx'} \cdot \frac{dx'}{dx} = w_2 \cdot w_1$

In this example, the gradient vanishes when $w_1$ and $w_2$ are both small in magnitude: their product $w_2 \cdot w_1$ is even smaller, so the gradient that reaches $x$ is tiny.

The same reasoning extends to networks with many layers: the gradient with respect to the input is a product of one factor per layer, and if every factor is smaller than 1 in magnitude, the product shrinks exponentially with depth.
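
This two-factor chain can be checked numerically; the sketch below is my own illustration (assuming PyTorch autograd, not code from the original post), and it also stacks many such layers with small weights to show the gradient shrinking toward zero.

```python
import torch

# Two-layer example: x' = w1*x + b1, y = w2*x' + b2, so dy/dx = w2 * w1.
w1, b1 = 0.1, 0.0
w2, b2 = 0.1, 0.0

x = torch.tensor(1.0, requires_grad=True)
x_prime = w1 * x + b1
y = w2 * x_prime + b2
y.backward()
print(x.grad.item())        # 0.01 == w2 * w1

# Many-layer version: with |w| < 1 at every layer, the gradient
# shrinks geometrically, which is the vanishing-gradient effect.
x = torch.tensor(1.0, requires_grad=True)
h = x
for _ in range(20):
    h = 0.1 * h
h.backward()
print(x.grad.item())        # about 1e-20, effectively zero
```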

Origin blog.csdn.net/weixin_46483785/article/details/132814010