Mathematics in Machine Learning - The Challenges of Deep Learning Optimization: Vanishing and Exploding Gradients

Categories: General Catalogue of Mathematics in Machine Learning
Related Articles: Ill-Conditioned · Local Minima · Plateaus, Saddle Points, and Other Flat Regions · Vanishing and Exploding Gradients · Inexact Gradients · Weak Correspondence Between Local and Global Structures


Multilayer neural networks often have regions with extremely steep slopes resembling cliffs, as shown in the figure below. These cliffs arise from the multiplication of several large weights. When a gradient update encounters such a cliff structure, it can change the parameter values drastically, often jumping over the cliff structure entirely.

(Figure: a cliff structure in the objective function of a deep network)
The objective function of a highly nonlinear deep neural network or recurrent neural network usually contains sharp nonlinearities in parameter space, caused by the multiplication of several parameters. These nonlinearities produce very large derivatives in certain regions. When the parameters approach such a cliff region, a gradient descent update can catapult the parameters very far, potentially undoing much of the optimization work already done.

The consequences can be serious whether we approach the cliff from above or from below, but fortunately they can be avoided with the gradient clipping heuristic. The basic idea is that the gradient specifies only the best direction within an infinitesimally small region, not the best step size. When the traditional gradient descent algorithm proposes a very large step, the gradient clipping heuristic intervenes to reduce the step size, making it less likely that the update walks off the cliff; near the cliff, the gradient still approximates the direction of steepest descent. Cliff structures are especially common in the cost functions of recurrent neural networks, because such models involve the multiplication of many factors, one per time step, so long sequences produce a large number of multiplications.
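As a minimal illustration of this heuristic, the sketch below rescales a gradient vector whenever its L2 norm exceeds a chosen threshold; the function name, the toy gradient values, and the threshold of 5.0 are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

def clip_gradient_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# A gradient computed near a cliff can be enormous.
grad = np.array([250.0, -1200.0, 40.0])
clipped = clip_gradient_by_norm(grad, max_norm=5.0)

print(np.linalg.norm(grad))     # ~1226 -- a raw step this large would overshoot the cliff
print(np.linalg.norm(clipped))  # 5.0   -- same direction, much smaller step
```

Clipping by norm preserves the direction of the update, which is exactly the part of the gradient that remains trustworthy in a small neighborhood.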

Problems caused by long-term dependencies

Another difficulty that neural network optimization algorithms face when the computational graph becomes extremely deep is the long-term dependency problem: the deep structure makes the model effectively lose the ability to learn from earlier information, so optimization becomes extremely difficult. Deep computational graphs exist not only in feedforward networks but also in recurrent networks. The problem is exacerbated in recurrent networks, which build very deep computational graphs by repeatedly applying the same operation at each step of a long sequence, with model parameters shared across time steps. For example, suppose a computational graph contains a path that repeatedly multiplies by a matrix $W$. After $t$ steps, this is equivalent to multiplying by $W^t$. Suppose $W$ has the eigenvalue decomposition $W = V\text{diag}(\lambda)V^{-1}$. In this simple case, it is easy to see that
$$W^t = \left(V\text{diag}(\lambda)V^{-1}\right)^t = V\text{diag}(\lambda)^t V^{-1}$$
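This identity can be checked numerically; the 2x2 matrix and the power $t = 10$ below are arbitrary choices for illustration, assuming only that the matrix is diagonalizable.

```python
import numpy as np

W = np.array([[0.9, 0.2],
              [0.1, 0.8]])
t = 10

# Eigendecomposition W = V diag(lambda) V^{-1}
lam, V = np.linalg.eig(W)

W_t_direct = np.linalg.matrix_power(W, t)           # multiply W by itself t times
W_t_eig = V @ np.diag(lam ** t) @ np.linalg.inv(V)  # raise only the eigenvalues to the power t

print(np.allclose(W_t_direct, W_t_eig))  # True
```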

When the eigenvalues $\lambda_i$ are not near 1, those with magnitude greater than 1 explode and those with magnitude less than 1 vanish as $t$ grows. The vanishing and exploding gradient problem refers to the fact that gradients through such a computational graph are likewise scaled by $\text{diag}(\lambda)^t$ and therefore shrink or grow dramatically. Vanishing gradients make it difficult to know in which direction the parameters should move to improve the cost function, while exploding gradients make learning unstable. The cliff structure described earlier, which motivates gradient clipping, is an example of the exploding gradient phenomenon.
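The scaling effect of the eigenvalue magnitudes can be seen directly with diagonal matrices whose entries sit just above and just below 1; the values 1.1 and 0.9 and the depth of 100 steps are illustrative assumptions.

```python
import numpy as np

t = 100                 # number of repeated multiplications (e.g. time steps)
x = np.ones(2)          # an arbitrary starting vector

for lam in (1.1, 0.9):
    W = np.diag([lam, lam])                  # simplest case: every eigenvalue equals lam
    x_t = np.linalg.matrix_power(W, t) @ x   # apply W a hundred times
    print(lam, np.linalg.norm(x_t))

# lam = 1.1: the norm grows like 1.1**100, on the order of 1e4   (explodes)
# lam = 0.9: the norm shrinks like 0.9**100, on the order of 1e-5 (vanishes)
```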

The repeated multiplication by $W$ at each time step described here is very similar to the power method, which finds the largest eigenvalue of a matrix $W$ and its corresponding eigenvector. From this point of view, $x^\top W^t$ will eventually discard all components of $x$ that are orthogonal to the principal eigenvector of $W$. A recurrent network uses the same matrix $W$ at every time step, whereas a feedforward network uses different weights in each layer, so even very deep feedforward networks can largely avoid the vanishing and exploding gradient problem.
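The connection to the power method can also be demonstrated numerically: repeatedly multiplying a vector by $W$ and renormalizing drives it toward the principal eigenvector. The symmetric matrix, the random seed, and the 50 iterations below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Power iteration: x <- W x / ||W x||
x = rng.normal(size=2)
for _ in range(50):
    x = W @ x
    x /= np.linalg.norm(x)

# Principal eigenvector obtained directly from an eigendecomposition
lam, V = np.linalg.eig(W)
v1 = V[:, np.argmax(np.abs(lam))]

# |cosine similarity| close to 1: x has lost all components orthogonal to v1
print(abs(x @ v1))  # ~1.0
```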


Origin blog.csdn.net/hy592070616/article/details/123285048