Vanishing gradients--study notes

梯度消亡 (vanishing gradients)

1. What vanishing gradients look like
The weights of the layers near the input end of the network gradually stop changing with training, or change very slowly.
The more layers the network has, the more pronounced this phenomenon becomes.

Vanishing gradients arise under two conditions: the network is trained with a gradient-based method (such as gradient descent), and it uses an activation function whose output range is much smaller than its input range, such as the logistic (sigmoid) function or tanh (the hyperbolic tangent).

[Figure: the logistic and tanh activation functions]

2. Analysis of the vanishing gradient problem
Gradient descent learns the network's weights by measuring how a small change in each weight affects the output.
If a small change in a weight has little or no effect on the network's output, there is no signal telling us how to adjust that weight, or the adjustment is extremely slow, which makes training difficult.
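A minimal sketch of why this matters (the weight, gradient, and learning-rate values are hypothetical): a gradient-descent update changes a weight in proportion to its gradient, so a near-zero gradient leaves the weight essentially frozen.

```python
# One gradient-descent step: w <- w - lr * grad.
# If grad is tiny, w barely moves, no matter how many steps we take.

def gd_step(w, grad, lr=0.1):
    """Apply a single gradient-descent update to a scalar weight."""
    return w - lr * grad

w_large = gd_step(0.5, grad=2.0)    # noticeable update, w moves to ~0.3
w_small = gd_step(0.5, grad=1e-7)   # w stays at ~0.5: learning has stalled

print(w_large)
print(w_small)
```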

The cause of vanishing gradients:
When a neural network is trained with gradient descent, vanishing gradients occur if the activation function compresses its output range far below its input range. For example, the hyperbolic tangent (tanh) squashes inputs ranging from negative infinity to positive infinity into the interval (-1, 1); outside roughly [-3, 3], the corresponding gradient is already very small, close to 0.

[Figure: the tanh function and its derivative]

As the figure shows, once the input exceeds about 3 the gradient is already close to 0: inputs of 5, 500, or 5,000 produce gradients that are barely distinguishable, and the same holds on the negative side. The same effect appears inside a neural network: during backpropagation, each layer's gradient depends on the gradient flowing back from the layer after it. With tanh as the activation function, its derivative (the red curve in the figure above) becomes very small once the input is greater than 2 or less than -2. Multiplying one such factor per layer, the backpropagated gradient shrinks rapidly, and when the network is deep enough the vanishing gradient phenomenon appears.
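The shrinking effect described above can be checked numerically. The sketch below (the layer count and the pre-activation value of 2 are hypothetical choices for illustration) uses the fact that tanh'(x) = 1 - tanh(x)^2, whose maximum is 1 at x = 0.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

print(tanh_grad(0.0))   # 1.0, the maximum
print(tanh_grad(3.0))   # already below 0.01

# Gradient reaching the input of a 10-layer tanh network whose
# pre-activations all sit at x = 2: one factor of tanh'(2) per layer.
grad = 1.0
for _ in range(10):
    grad *= tanh_grad(2.0)
print(grad)  # on the order of 1e-12: vanishingly small
```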

3. Solutions to vanishing gradients

Replace tanh with a different activation function, such as ReLU.
[Figure: the ReLU function and its derivative]

During optimization, unlike the sigmoid function, which saturates at both ends (its gradient is close to 0 for both large positive and large negative inputs), ReLU is a left-saturating function: its derivative is exactly 1 for x > 0 and is trivial to compute. To a large extent this solves the vanishing gradient problem and speeds up the convergence of gradient descent.

4. Gradient explosion
When gradients are too small, they vanish as they propagate back through the layers. Conversely, when gradients are too large, they grow layer by layer during backpropagation, eventually exploding.
Solution
Gradient clipping: set a clipping threshold before updating the weights; any gradient that exceeds the threshold is forced back within that range.
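A minimal sketch of clipping by gradient norm (the threshold and gradient values are hypothetical; frameworks provide equivalents, e.g. PyTorch's torch.nn.utils.clip_grad_norm_): if the gradient's norm exceeds the threshold, rescale it so its norm equals the threshold; otherwise leave it unchanged.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])          # norm = 50, too large
clipped = clip_by_norm(g, 5.0)      # rescaled to [3, 4], norm = 5
print(clipped)
print(np.linalg.norm(clipped))
```

Clipping preserves the gradient's direction and only limits its magnitude, which is why it is preferred over element-wise truncation in many implementations.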

5. Overfitting
The model's accuracy is very high on the training set but noticeably lower on the test set.

The solution to overfitting:
(1) DropOut
(2) L2 regularization
(3) L1 regularization
(4) MaxNorm
6. Dropout
Dropout works as follows: during forward propagation, each layer of the network is assigned a keep_prob (node retention probability) between 0 and 1, and every node in that layer is kept with probability keep_prob and dropped otherwise. Because any node may be deleted, the network cannot rely too heavily on any single node, so no individual weight grows too large. The effect is similar to L2 regularization and reduces the network's overfitting.
Network structure without DropOut

Network structure with DropOut added

Dropout is generally used only in the training phase and is switched off at test time. In other words, the forward pass during training uses only the neurons that were not dropped, while the test-time forward pass uses all of them.
Because some neurons are dropped during training, the weights must be scaled by keep_prob at test time so that test-time activations match what the network saw during training. Equivalently, the surviving activations can be divided by keep_prob during training ("inverted dropout"), which leaves the test-time network unchanged.
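A minimal sketch of the inverted-dropout variant just mentioned (keep_prob, the array shape, and the seed are hypothetical choices): dividing the surviving activations by keep_prob during training keeps their expected value equal to the test-time value.

```python
import numpy as np

def dropout_forward(a, keep_prob, train=True):
    """Inverted dropout: drop units during training, rescale survivors."""
    if not train:
        return a                      # test phase: use every neuron
    mask = np.random.rand(*a.shape) < keep_prob
    return a * mask / keep_prob       # zero dropped units, rescale the rest

np.random.seed(0)
a = np.ones((4, 5))
out = dropout_forward(a, keep_prob=0.8)
print(out)  # surviving entries equal 1/0.8 = 1.25, dropped ones are 0
```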
7. Regularization

In the training phase, to keep the model from overfitting, we usually add a regularization term to improve its generalization ability. L1 and L2 regularization are the most widely used. These terms shrink the model's parameters, which makes the learned model simpler and prevents overfitting.
[Figure: the mathematical expressions of the L1 and L2 regularization terms]

Characteristics of L1 and L2 regularization:
Both L1 and L2 regularization can be used to prevent overfitting, but L1 has one feature that L2 lacks: it makes the model's parameters sparse. To see what sparsity means: roughly speaking, if a model has 100 parameters, then after introducing L1 regularization more than 90 of the learned parameters may end up being exactly 0. This is sparsity. L2 regularization, by contrast, does not have this property.
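One way to see this contrast in code (the penalty strength lam and step size lr are hypothetical settings): a gradient step on the L2 penalty lam * w**2 shrinks every weight multiplicatively and never reaches exactly 0, while the proximal "soft-thresholding" step associated with the L1 penalty lam * |w| sets any weight below the threshold exactly to 0.

```python
import numpy as np

def l2_shrink(w, lam=0.5, lr=0.1):
    """One gradient step on the L2 penalty: w <- (1 - 2*lr*lam) * w."""
    return w - lr * 2 * lam * w

def l1_soft_threshold(w, lam=0.5, lr=0.1):
    """Proximal step for the L1 penalty: shrink toward 0, clamp at 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w = np.array([0.03, -0.02, 1.0])
print(l2_shrink(w))          # [0.027, -0.018, 0.9]: shrunk, none are 0
print(l1_soft_threshold(w))  # [0.0, 0.0, 0.95]: small weights zeroed out
```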
(1) L1 regularization:
From the mathematical expression of L1 regularization we can conclude that, geometrically, its constraint region is a rhombus, as shown in the figure below.

[Figure: the rhombus-shaped L1 constraint region intersecting the objective's contours]

As the figure above shows, when L1 regularization is added to the objective function, the learned W must make both f(w) and the L1 term small, so the solution lies where the contours of the objective meet the L1 region. That meeting point is very likely to be a vertex of the rhombus, because the L1 region has many vertices, and since the vertices sit on the coordinate axes, some components of w become exactly 0. This is what makes the parameters sparse.
Disadvantages of L1:
(1) The L1 term is not differentiable everywhere (it has no gradient at 0), so the optimization needs some extra handling, such as subgradient or proximal methods.

(2) The sparsity of L1
When several features are correlated, L1 tends to select one of them arbitrarily rather than the best one.

(2) L2 regularization
From the mathematical expression of L2 regularization we can conclude that, geometrically, its constraint region is a smooth shape (a circle in two dimensions), as shown in the figure below.

[Figure: the circular L2 constraint region intersecting the objective's contours]

With L2 regularization, each element of w ends up relatively small and close to 0, shrunk smoothly rather than set exactly to 0.


Origin blog.csdn.net/weixin_43391596/article/details/128000074