Gradient Vanishing and Gradient Explosion

● one word per week

Time is a necessary price for growth.

Introduction

Besides overfitting and underfitting, another class of problems frequently arises when training machine learning models: the gradient problem. What exactly is the gradient problem, and what are the ways to deal with it?

Gradient problem

The gradient problem arises during error back-propagation in deep neural networks. Because the gradients of the different layers are obtained through the chain rule, the long chain of multiplications in the middle can make the gradient computation unstable and render training ineffective. The gradient problem has two concrete manifestations: gradient vanishing and gradient explosion.

Gradient vanishing is also called gradient dispersion. By the chain rule, if the per-layer factor (the layer's weight multiplied by the derivative of its activation) has magnitude less than 1, then after the error has been propagated back through enough layers the resulting gradient approaches 0.

Example of a vanishing gradient: y_1 = w_1 x_1, x_2 = f(y_1), y_2 = w_2 x_2, x_3 = f(y_2), y_3 = w_3 x_3, ..., x_n = f(y_{n-1}).

Here x_1 is the input of the input layer, x_2 and x_3 are the outputs of the first two hidden layers, and x_n is the output of the output layer. Solving for the gradient of w_1 with the chain rule gives: ∂L/∂w_1 = L' * f'(y_1) * f'(y_2) * ... * f'(y_{n-1}) * w_2 * ... * w_{n-1} * x_1.
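To make the product structure concrete, here is a minimal NumPy sketch of this scalar chain (the layer count, weight range, and input value are made up for illustration):

```python
import numpy as np

def f(y):                      # sigmoid activation, as in the example above
    return 1.0 / (1.0 + np.exp(-y))

def f_prime(y):                # its derivative f'(y) = f(y) * (1 - f(y))
    s = f(y)
    return s * (1.0 - s)

np.random.seed(0)
n_layers = 20
w = np.random.uniform(0.5, 1.5, size=n_layers)   # per-layer scalar weights

# Forward pass through the scalar chain: y_k = w_k * x_k, x_{k+1} = f(y_k)
x = 0.7
ys = []
for k in range(n_layers):
    y = w[k] * x
    ys.append(y)
    x = f(y)

# The gradient of w_1 is proportional to the product of all f'(y_k) and the
# later weights; with sigmoid (f' <= 0.25) this product collapses toward 0.
grad_factor = np.prod([f_prime(y) for y in ys]) * np.prod(w[1:])
print(f"product of the f'(y_k) and w_k factors: {grad_factor:.3e}")
```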

It can be seen from the example above that the gradient vanishes when either the weights w are too small or the activation derivatives f'(y) are too small. Small weights are generally a matter of parameter initialization, while small activation derivatives are a matter of the activation function.

For example, for the commonly used sigmoid activation, the derivative f'(y) = f(y) * (1 - f(y)) has a maximum value of only 0.25, so a long chain of sigmoid layers easily makes the gradient vanish.
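A quick numerical check of this claim (purely illustrative):

```python
import numpy as np

y = np.linspace(-10, 10, 10001)
s = 1.0 / (1.0 + np.exp(-y))        # sigmoid
ds = s * (1.0 - s)                  # its derivative f'(y) = f(y) * (1 - f(y))

print(f"max of the sigmoid derivative: {ds.max():.4f}")   # ~0.25, reached at y = 0
print(f"best case after 20 layers: {0.25 ** 20:.3e}")     # already around 1e-12
```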

As in the example above, the concrete cause of gradient explosion is that the initial weights w are too large. For instance, if every layer uses the sigmoid activation, a necessary condition for the gradient to explode is that the initial weights be greater than 4, since only then can the per-layer factor w * f'(y) exceed 1 (because f'(y) is at most 0.25).
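A small sketch of why 4 is the threshold (the weight values below are made up; they only illustrate how the per-layer factor behaves):

```python
import numpy as np

def sigmoid_prime(y):
    s = 1.0 / (1.0 + np.exp(-y))
    return s * (1.0 - s)

# The per-layer factor in the chain-rule product is w * f'(y). With sigmoid,
# f'(y) <= 0.25, so the factor can exceed 1 only if |w| > 4.
for w in (1.0, 4.0, 8.0):
    factor = w * sigmoid_prime(0.0)          # best case: f'(0) = 0.25
    print(f"w = {w:4.1f}: per-layer factor = {factor:.2f}, "
          f"after 20 layers ~ {factor ** 20:.3e}")
```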

There are six main methods to solve the gradient problem:

Pre-training & fine-tuning. This method was proposed by Hinton in 2006 and follows the idea of "first the parts, then the whole": the neurons of each hidden layer are first pre-trained layer by layer, and then the entire network is fine-tuned end to end with BP.
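A minimal PyTorch sketch of the idea, with made-up layer widths, toy random data, and simple autoencoders standing in for the layer-wise pre-training step (an illustration of the scheme, not Hinton's original setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 32)                       # toy unlabeled data
y = torch.randint(0, 2, (256,))                # toy labels for fine-tuning
sizes = [32, 16, 8]                            # made-up layer widths

# Pre-training: train each hidden layer as a small autoencoder on the output
# of the layers already trained, one layer at a time.
layers, inputs = [], X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=0.1)
    for _ in range(100):
        opt.zero_grad()
        h = torch.sigmoid(enc(inputs))
        loss = nn.functional.mse_loss(dec(h), inputs)   # reconstruct the input
        loss.backward()
        opt.step()
    layers += [enc, nn.Sigmoid()]
    inputs = torch.sigmoid(enc(inputs)).detach()        # feed the next layer

# Fine-tuning: stack the pre-trained encoders, add a head, train end to end.
model = nn.Sequential(*layers, nn.Linear(sizes[-1], 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), y)
    loss.backward()
    opt.step()
```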

Gradient clipping. As the name implies, when the error is propagated backwards, the gradient is truncated according to a preset threshold. This technique is mainly used to prevent gradient explosion.
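A short sketch of gradient clipping in PyTorch (model, data, and threshold are made up for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Sigmoid(), nn.Linear(10, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, target = torch.randn(32, 10), torch.randn(32, 1)
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Clip the overall gradient norm to a chosen threshold before the update,
# so that one oversized gradient cannot blow up the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```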

Adding a regularization term to the loss function. The regularization term penalizes and thereby bounds the weights. For a detailed explanation of regularization, please refer to the earlier article Regular term: the coachman who controls the fitting direction. This technique mainly prevents gradient explosion.
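A sketch of the same idea using the optimizer's weight_decay argument, which adds an L2 penalty on the weights (model and data are made up):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Sigmoid(), nn.Linear(10, 1))

# weight_decay adds an L2 penalty on the weights to the objective, which keeps
# them (and hence the chain-rule products containing them) from growing freely.
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

x, target = torch.randn(32, 10), torch.randn(32, 1)
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
opt.step()
```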

Using piecewise-linear activation functions. As mentioned above, the derivative of the sigmoid activation is at most 0.25 and shrinks toward 0 as the input approaches the saturated regions. Replacing sigmoid with a piecewise-linear activation such as ReLU, whose derivative is a constant 1 for positive inputs, effectively prevents the gradient from vanishing.
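A rough comparison of the two activations on a deep stack of toy linear layers (depth, width, and initialization are arbitrary, so the exact numbers will vary, but the gap between the two first-layer gradient norms illustrates the point):

```python
import torch
import torch.nn as nn

def deep_stack(act, depth=30, width=64):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), act()]
    return nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.randn(16, 64)
for name, act in [("sigmoid", nn.Sigmoid), ("relu", nn.ReLU)]:
    net = deep_stack(act)
    net(x).sum().backward()
    grad_norm = net[0].weight.grad.norm().item()   # gradient reaching layer 1
    print(f"{name:>7}: first-layer gradient norm = {grad_norm:.3e}")
```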

Batch Normalization. This "batch normalization" scheme was proposed by Google researchers in an ICML 2015 paper. In short, the output of each layer is normalized (the batch mean is subtracted and the result is divided by the batch standard deviation) before being activated and passed on to the next layer.
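A minimal sketch of where batch normalization sits in a hidden block (sizes are made up; the check at the end just confirms that the normalized pre-activations have roughly zero mean and unit variance):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),   # per feature: subtract the batch mean, divide by the
    nn.ReLU(),            # batch std, then apply a learnable scale and shift
)

x = torch.randn(32, 64)
out = block(x)

# The normalized pre-activations have roughly zero mean and unit variance,
# which keeps the per-layer factors in the back-propagated product well scaled.
z = block[1](block[0](x))
print(z.mean(dim=0).abs().max().item(), z.var(dim=0, unbiased=False).mean().item())
```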

Adding residual networks. A traditional BP network passes the error back serially, layer by layer, whereas a residual network lets the error propagate across layers through shortcut connections, which directly avoids the computational instability caused by the long chain of multiplications.
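A minimal residual-block sketch (widths and depth are made up); the identity path in y = x + F(x) gives the gradient a route that bypasses the multiplications inside F:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the identity path lets the gradient skip the layers in F."""
    def __init__(self, width):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width),
        )

    def forward(self, x):
        return x + self.body(x)     # skip connection

net = nn.Sequential(*[ResidualBlock(64) for _ in range(20)])
x = torch.randn(8, 64, requires_grad=True)
net(x).sum().backward()
print(f"gradient norm at the input: {x.grad.norm().item():.3e}")
```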

The above is the explanation of the gradient problem, so stay tuned for the next section.

Epilogue

Thank you for your patience in reading. Follow-up articles will be published every Sunday, so stay tuned. Everyone is welcome to follow Xiaodou's public account for the half monologue!
