Let's talk about gradient vanishing and gradient explosion again

As the number of layers in a neural network increases and its structure becomes more complex, the model may run into gradient vanishing and gradient explosion during training, preventing it from converging effectively. So how is this problem solved?

1 Gradient Vanishing and Gradient Explosion

In the process of backpropagation, the gradient calculation for each parameter layer involves the value of the derivative of the activation function. Specifically, assume a three-layer network where $x$ is the input, $w_1, w_2, w_3$ are the parameters of each layer, and $\sigma$ is the activation function:

$$z_1 = w_1 x,\qquad z_2 = w_2\,\sigma(z_1),\qquad z_3 = w_3\,\sigma(z_2),\qquad \hat{y} = \sigma(z_3)$$

When the parameters are updated during backpropagation, the chain rule gives, for the first layer,

$$\frac{\partial \hat{y}}{\partial w_1} = \sigma'(z_3)\,w_3 \cdot \sigma'(z_2)\,w_2 \cdot \sigma'(z_1)\,x$$

So the gradient of the first layer is a product that contains one "derivative of the activation times weight" factor per layer, and the more layers the network has, the more such factors are multiplied together.

When the magnitude of these factors is greater than 1, the product grows exponentially with depth and the gradient explodes; when it is close to 0, the product shrinks exponentially and the gradient vanishes.
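To make this concrete, here is a minimal NumPy sketch (the weight values, pre-activation value, and depths are illustrative assumptions, not from the original article) showing how the product of per-layer factors collapses or blows up as depth grows:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def gradient_scale(depth, w, z=1.0):
    """Product of the per-layer factors sigma'(z) * w for a chain of scalar layers."""
    prod = 1.0
    for _ in range(depth):
        prod *= sigmoid_grad(z) * w
    return prod

for depth in (5, 20, 50):
    print(depth,
          gradient_scale(depth, w=1.0),    # factor < 1  -> gradient vanishes
          gradient_scale(depth, w=10.0))   # factor > 1  -> gradient explodes
```

With w = 1 each factor is at most 0.25 (the maximum of the sigmoid derivative), so the product vanishes; with w = 10 each factor exceeds 1 and the product explodes.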

2 Gradient update problem of sigmoid

For the sigmoid activation function, simply stacking layers very easily leads to gradient vanishing.

[Figure: the sigmoid function and its derivative]

From the figure above, it is not hard to see the following (a quick numeric check follows the list):

  • The intervals at the far left and far right of the sigmoid function are its saturation intervals

  • The maximum value of the sigmoid derivative is 0.25 (attained at 0), and as x gets larger or smaller the derivative tends to 0
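A minimal sketch verifying both observations (the sample points are arbitrary assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

xs = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print(sigmoid_grad(xs))   # peaks at 0.25 for x = 0, ~0 in the saturation intervals
```

Because every sigmoid layer contributes a factor of at most 0.25 to the gradient product, stacking even a handful of sigmoid layers can shrink the gradients of the early layers dramatically.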

3 tanh gradient update

First, let's observe the properties of the derivative function of the tanh activation function.

[Figure: the tanh function and its derivative]

For the tanh function, the values of the derivative lie between 0 and 1, which avoids gradient vanishing to a certain extent; but when the other factors entering the gradients of the earlier layers (for example, weights with absolute value greater than 1) are greater than 1, the gradient can still explode (see the sketch below).
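A minimal sketch of the tanh derivative (the sample points are arbitrary assumptions):

```python
import numpy as np

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
tanh_grad = 1 - np.tanh(xs) ** 2   # derivative of tanh
print(tanh_grad)                   # peaks at 1.0 for x = 0, falls toward 0 in the tails
```

Even though the peak derivative is 1 rather than 0.25, per-layer factors of the form tanh'(z) * w can still exceed 1 when |w| > 1, so depth alone does not make the gradients safe.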

As an "upgraded version" of the sigmoid activation function, tanh not only avoids the gradient vanishing problem to a certain extent, it also produces Zero-Centered Data (zero-point symmetric output), ensuring that the next layer receives Zero-Centered Data. This data distribution is key to solving the gradient vanishing and gradient explosion problems.

4 Zero-Centered Data and Glorot Condition

By analyzing how the gradients change when sigmoid and tanh activations are stacked, we can conclude that for deep networks, gradient instability is the core factor limiting modeling performance.

There are five categories of solutions (optimization methods) for gradient instability, which are:

  • Parameter initialization methods

  • Normalization of the input data

  • Improved (derived) activation functions

  • Learning rate scheduling methods

  • Gradient descent optimization methods

And a basic theory underlying all of the above optimization methods is the Glorot condition, proposed by Xavier Glorot in 2010.

5 Zero-centered Data

Before introducing the Glorot condition, let's first discuss the effect of Zero-Centered Data, which will help us understand the Glorot condition afterwards.

To solve the gradient vanishing and gradient explosion problems, the multi-layer neural network must be kept effective, that is, the gradient of each layer should be neither too large nor too small. One of the most basic ideas is to make all the input data and the parameters of every layer Zero-Centered Data (zero-point symmetric data).

Since both the inputs and the parameters are symmetric about zero, the values of the derivative terms in each linear layer are also more stable, which helps keep the gradient of every layer in a basically stable state (the small simulation below illustrates this).
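A minimal simulation of the effect (the layer size, weight scale, and offset are illustrative assumptions): it passes the same data through one tanh layer twice, once zero-centered and once shifted away from zero, and compares the local derivative of the activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# One linear layer with zero-centered weights, followed by tanh.
w = rng.normal(0.0, 0.1, size=(32, 32))

zero_centered = rng.normal(0.0, 1.0, size=(1000, 32))
shifted = zero_centered + 5.0            # same spread, but mean pushed away from 0

for name, x in [("zero-centered", zero_centered), ("shifted", shifted)]:
    z = x @ w
    local_grad = 1 - np.tanh(z) ** 2     # derivative of tanh at the pre-activations
    print(name, local_grad.mean())
```

The zero-centered input keeps the pre-activations near 0, where tanh's derivative is large; the shifted input pushes many of them into the saturation zone, where the derivative (and hence the gradient flowing back) shrinks.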

6 The Glorot Condition

The Glorot condition was put forward by Xavier Glorot in a paper published in 2010. To guarantee the validity and stability of the model, it requires that during forward propagation the variance of the input data of each linear layer equals the variance of its output data, and that during backpropagation the gradients before and after the data flow through a layer also have equal variance.
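Written out, with z^(i) denoting the activations of layer i and s^(i) its pre-activations (the notation here is our own paraphrase, chosen to roughly follow the 2010 paper), the two requirements read:

```latex
% Forward pass: equal activation variance across all layers
\forall\, (i, i')\colon \quad \operatorname{Var}\!\left[ z^{(i)} \right] = \operatorname{Var}\!\left[ z^{(i')} \right]

% Backward pass: equal gradient variance across all layers
\forall\, (i, i')\colon \quad \operatorname{Var}\!\left[ \frac{\partial \mathrm{Cost}}{\partial s^{(i)}} \right] = \operatorname{Var}\!\left[ \frac{\partial \mathrm{Cost}}{\partial s^{(i')}} \right]
```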

Although it is difficult to satisfy both requirements at the same time, Glorot and Bengio (the second author of the paper) pointed out that, with an appropriate modification of the calculation, a compromise for designing the initial parameter values can be found so that both conditions are satisfied as far as possible. This method of designing the initial values of the parameters is known as the Xavier method.

In the Xavier method, the core question is what the variance of the zero-centered initial parameters should be.
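A minimal NumPy sketch of the answer commonly cited for the Xavier/Glorot scheme, namely Var(W) = 2 / (fan_in + fan_out); the layer sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform initialization: Var(W) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out):
    """Glorot/Xavier normal initialization targeting the same variance."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

w = xavier_uniform(256, 128)
print(w.var(), 2.0 / (256 + 128))   # empirical variance ~ target variance
```

Deep learning frameworks ship equivalent helpers, for example PyTorch's torch.nn.init.xavier_uniform_ and torch.nn.init.xavier_normal_.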

7 Conclusion

The Glorot condition and the Xavier method were proposed in 2010, before the ReLU activation function had become widespread, so the Xavier method was designed mainly around the possible gradient explosion or gradient vanishing of the tanh activation function, and secondarily the sigmoid activation function.

Even so, the Glorot condition is a general condition: later optimization methods built around the ReLU activation function, such as the He initialization method proposed to deal with dying neurons, are also designed according to the Glorot condition (a brief sketch follows).
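For comparison, a minimal sketch of He initialization as it is commonly described (Var(W) = 2 / fan_in, tuned to ReLU); the layer sizes are again illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    """He (Kaiming) normal initialization for ReLU layers: Var(W) = 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

w = he_normal(256, 128)
print(w.std(), np.sqrt(2.0 / 256))   # empirical std ~ target std
```

The guiding idea is the same as in the Xavier method: pick the initial variance so that signal and gradient variances stay roughly constant from layer to layer; only the constant changes, because ReLU zeroes out half of its inputs on average.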

It can be said that the Glorot condition is the core guiding idea behind all model parameter initialization.
