Vanishing and exploding gradients and their solutions

1. Vanishing gradients

By the chain rule, the gradient reaching a layer is the product of the partial derivatives along the path back from the output layer. If the per-layer factor (activation derivative times weight) is less than 1, then even a value as large as 0.99 shrinks toward 0 once it is multiplied across enough layers, so the error derivative reaching the input layer tends to 0. As a result, the hidden layers close to the input layer receive only minimal weight adjustments.

2. Exploding gradients

By the chain rule, if the per-layer factor (activation derivative times weight) is greater than 1, then after propagating back through enough layers the error derivative at the input layer tends to infinity. As a result, the hidden layers close to the input layer receive extremely large weight adjustments.
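A minimal numerical sketch (plain NumPy, not from the original post; the depth and the per-layer factors 0.9 and 1.1 are illustrative) shows how quickly the product of per-layer factors collapses toward 0 or blows up:

import numpy as np

depth = 50  # number of layers the error signal is propagated back through

# each factor stands for (activation derivative * weight) at one layer
vanishing = np.prod(np.full(depth, 0.9))  # about 5e-3: the gradient shrinks toward 0
exploding = np.prod(np.full(depth, 1.1))  # about 117: the gradient blows up

print(vanishing, exploding)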

3. Solutions to vanishing and exploding gradients

3.1 Pre-training plus fine-tuning
This method comes from a paper Hinton published in 2006. To get around the gradient problem, Hinton adopted an unsupervised layer-by-layer training scheme: the hidden layers are trained one at a time, with the output of the previously trained hidden layer serving as the input of the current one, and the output of the current hidden layer serving as the input of the next. This layer-by-layer process is the "pre-training"; once pre-training is complete, the whole network is "fine-tuned".
Hinton used this method when training Deep Belief Networks: after every layer has been pre-trained, the BP algorithm is used to train the entire network. The idea amounts to first finding a local optimum for each layer and then combining them to search for the global optimum. The method has certain advantages, but it is not used much anymore.
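A rough sketch of the idea in Keras-style TensorFlow (the layer sizes, epoch counts and data are placeholders, and simple autoencoders stand in for the RBMs used in Hinton's original work):

import numpy as np
import tensorflow as tf

x_train = np.random.rand(1000, 784).astype("float32")   # placeholder data
y_train = np.random.randint(0, 10, size=(1000,))
layer_sizes = [256, 128, 64]                             # hypothetical hidden layer sizes

# "pre-training": train each hidden layer on the output of the previous one
pretrained_layers = []
current_input = x_train
for size in layer_sizes:
    inp = tf.keras.Input(shape=(current_input.shape[1],))
    hidden = tf.keras.layers.Dense(size, activation="sigmoid")(inp)
    decoded = tf.keras.layers.Dense(current_input.shape[1])(hidden)
    autoencoder = tf.keras.Model(inp, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(current_input, current_input, epochs=5, verbose=0)
    encoder = autoencoder.layers[1]                      # keep the trained hidden layer
    pretrained_layers.append(encoder)
    current_input = encoder(current_input).numpy()       # its output feeds the next layer

# "fine-tuning": stack the pre-trained layers, add an output layer,
# and train the whole network end to end with backpropagation
model = tf.keras.Sequential(pretrained_layers +
                            [tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=5, verbose=0)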
3.2 Gradient clipping and regularization
Gradient clipping is a scheme aimed mainly at exploding gradients. The idea is to set a clipping threshold; when the gradients are updated, any gradient that exceeds the threshold is forcibly scaled back into that range. This direct approach prevents gradient explosion.
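A minimal TensorFlow 1.x-style sketch of clipping by global norm (the tiny model, the loss and the threshold of 5.0 are placeholders, not from the original post):

import tensorflow as tf

# a tiny placeholder model just so the snippet is self-contained
x = tf.placeholder(tf.float32, [None, 10])
w = tf.Variable(tf.random_normal([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)   # rescale if the global norm exceeds 5.0
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))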
Another way to handle exploding gradients is weight regularization (weights regularization). The most common forms are l1 regularization and l2 regularization, and every major deep learning framework has a corresponding API. For example, in tensorflow, if the regularization parameters were already set when building the network, the regularization loss can be computed directly by calling the following code:

regularization_loss = tf.add_n(tf.losses.get_regularization_losses(scope='my_resnet_50'))

If the regularization parameters were not set at initialization, the l2 regularization loss can be computed with the following code:

l2_loss = tf.add_n([tf.nn.l2_loss(var) for var in tf.trainable_variables() if 'weights' in var.name])

Regularization limits overfitting by penalizing the network weights. Looking closely at a loss function with an l2 regularization term:

Loss = (y - W^T x)^2 + α ||W||^2

where α is the regularization coefficient. If the gradients explode, the norm of the weights becomes very large, so the regularization term can partially restrain the explosion. Note: in practice, vanishing gradients occur more often than exploding gradients in deep neural networks.
3.3 Activation functions such as relu and leakyrelu
Relu: the idea is very simple. If the derivative of the activation function is 1, the activation introduces neither vanishing nor exploding gradients, and every layer of the network gets updated at the same rate. This is how relu came about.
Relu's main contributions are: it alleviates the vanishing and exploding gradient problems; it is simple and fast to compute; it speeds up network training. It also has some drawbacks: because the negative part is fixed at 0, some neurons can never be activated (this can be partially mitigated with a small learning rate), and its output is not zero-centered.
Leakyrelu was proposed to fix the effect of relu's zero region. Its mathematical expression is: leakyrelu(x) = max(k * x, x),
where k is the leak coefficient, typically chosen as 0.01 or 0.02, or learned during training. Leakyrelu removes the impact of the zero region while keeping all the advantages of relu.
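A quick sketch of both activations in plain NumPy (k = 0.01 is just an illustrative leak coefficient):

import numpy as np

def relu(x):
    # derivative is 1 for x > 0, so the gradient passes through unchanged there
    return np.maximum(0.0, x)

def leaky_relu(x, k=0.01):
    # leakyrelu(x) = max(k * x, x); the negative side keeps a small slope k
    return np.maximum(k * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]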
3.4 Batchnorm
Batchnorm is one of the most important techniques proposed since the rise of deep learning, and it is now widely used in all the major networks. It speeds up network convergence and improves training stability. In essence, Batchnorm addresses the gradient problem during backpropagation.
Batchnorm, whose full name is batch normalization (BN for short), standardizes the output signal x of each layer to keep the network stable. By normalizing every layer's output to a consistent mean and variance, batchnorm removes the scaling effect introduced by the weights w and thereby mitigates vanishing and exploding gradients.
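A minimal sketch of the normalization step itself (per-feature mean and variance over the batch; gamma, beta and eps are illustrative values, and a real implementation also tracks running statistics for inference):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each feature over the batch to zero mean and unit variance,
    # then rescale and shift with the learnable parameters gamma and beta
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(32, 64) * 10 + 5       # activations with a large scale and offset
normalized = batch_norm(batch)
print(normalized.mean(), normalized.std())     # roughly 0 and 1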
3.5 Residual structures
In fact, it was the appearance of residual networks that brought the ImageNet competition era to a close. Since residual connections were proposed, almost no deep network has been built without them. Compared with the earlier networks of a few layers or a few dozen layers, residual networks can easily be built with hundreds or even more than a thousand layers without worrying about gradients vanishing too quickly, and the reason lies in the residual shortcut connections. Speaking of residual structures, one paper has to be mentioned:
Deep Residual Learning for Image Recognition
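A minimal sketch of a residual block in Keras-style TensorFlow (the feature size of 64 is a placeholder, and a real ResNet block uses convolutions, batchnorm and a projection shortcut when shapes differ):

import tensorflow as tf

def residual_block(x, units=64):
    # the shortcut carries x around the two layers, so the gradient always
    # has a direct path back through the addition
    shortcut = x
    out = tf.keras.layers.Dense(units, activation="relu")(x)
    out = tf.keras.layers.Dense(units)(out)
    out = tf.keras.layers.Add()([out, shortcut])        # F(x) + x
    return tf.keras.layers.Activation("relu")(out)

inputs = tf.keras.Input(shape=(64,))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)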
3.6 LSTM
LSTM stands for long short-term memory network (long short-term memory networks). It is much less prone to vanishing gradients, mainly because of the complex "gates" inside the LSTM: through these gates, the next update can "remember" the "residual memory" of the previous steps, which is why LSTMs are often used for text generation.
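A hedged Keras-style example of plugging an LSTM layer into a text model (the vocabulary size, embedding dimension and number of units are placeholders):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=128),   # placeholder vocabulary size
    tf.keras.layers.LSTM(256),       # the gated recurrent layer; its gates regulate gradient flow
    tf.keras.layers.Dense(10000, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")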
