Mathematics in Machine Learning - The Challenges of Deep Learning Optimization: Plateaus, Saddle Points, and Other Flat Regions

Categories: General Catalogue of Mathematics in Machine Learning
Related Articles:
· Ill-conditioned
· Local Minima
· Plateaus, Saddle Points, and Other Flat Regions
· Vanishing and Exploding Gradients
· Inexact Gradients
· Weak Correspondence Between Local and Global Structures


For many high-dimensional non-convex functions, local minima (and maxima) are in fact far rarer than another class of points with zero gradient: saddle points. Some points near a saddle point have higher cost than the saddle point, while others have lower cost. At a saddle point, the Hessian matrix has both positive and negative eigenvalues. Points lying along the eigenvectors associated with positive eigenvalues have higher cost than the saddle point, while points lying along the eigenvectors associated with negative eigenvalues have lower cost. We can therefore regard a saddle point as a local minimum along one cross-section of the cost function and as a local maximum along another cross-section.
Figure: a saddle point
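To make this concrete, here is a minimal sketch (a toy example assumed for illustration, not taken from the article) using the canonical saddle f(x, y) = x^2 - y^2: its Hessian has one positive and one negative eigenvalue, and stepping along the corresponding eigenvectors raises or lowers the cost relative to the saddle point.

```python
# Toy saddle f(x, y) = x^2 - y^2 at the origin: the Hessian has mixed-sign eigenvalues.
import numpy as np

def f(p):
    x, y = p
    return x**2 - y**2

# Hessian of f (constant for this quadratic).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals, eigvecs = np.linalg.eigh(H)
print("Hessian eigenvalues:", eigvals)            # [-2.  2.] -> mixed signs: a saddle

saddle = np.array([0.0, 0.0])
v_pos = eigvecs[:, eigvals > 0].ravel()           # eigenvector of the positive eigenvalue (x-axis)
v_neg = eigvecs[:, eigvals < 0].ravel()           # eigenvector of the negative eigenvalue (y-axis)

eps = 0.1
print("cost at saddle:             ", f(saddle))                # 0.0
print("cost along +eigenvalue dir.:", f(saddle + eps * v_pos))  # > 0, higher than the saddle
print("cost along -eigenvalue dir.:", f(saddle + eps * v_neg))  # < 0, lower than the saddle
```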

Many classes of random functions exhibit the following property: in low-dimensional spaces, local minima are common, while in higher-dimensional spaces local minima are rare and saddle points are common. For a function of this kind, $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the expected ratio of the number of saddle points to local minima grows exponentially with $n$. We can understand this phenomenon intuitively: at a local minimum, the Hessian matrix has only positive eigenvalues, whereas at a saddle point it has a mixture of positive and negative eigenvalues. Imagine that the sign of each eigenvalue is determined by a coin toss. In one dimension, it is easy to obtain a local minimum by flipping heads once; in $n$-dimensional space, flipping heads $n$ times in a row is exponentially unlikely. A rough simulation of this intuition is sketched below.
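The following Monte Carlo sketch is an illustrative assumption (random symmetric Gaussian matrices as stand-in Hessians, not the precise random-function ensembles studied in the literature): it estimates how often all eigenvalues come up positive, and that fraction collapses quickly as the dimension n grows, so almost all zero-gradient points look like saddles.

```python
# "Coin toss" intuition: how often does a random symmetric matrix have all-positive eigenvalues?
import numpy as np

rng = np.random.default_rng(0)
trials = 20_000

for n in (1, 2, 5, 10):
    all_positive = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2.0                      # random symmetric "Hessian" (illustrative ensemble)
        if np.all(np.linalg.eigvalsh(H) > 0):    # every direction positively curved -> candidate minimum
            all_positive += 1
    print(f"n={n:2d}  P(all eigenvalues > 0) ~= {all_positive / trials:.4f}")
```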

A surprising property of many random functions is that the eigenvalues of the Hessian are more likely to be positive at critical points in regions of lower cost. In terms of the coin-toss analogy, this means that if we are at a critical point with low cost, the coin is more likely to have come up heads all $n$ times. It also means that local minima are far more likely to have low cost than high cost: critical points with high cost are far more likely to be saddle points, and critical points with extremely high cost are likely to be local maxima. This behavior occurs in many classes of random functions.

Shallow autoencoders with no nonlinearities have only global minima and saddle points; they have no local minima with cost higher than the global minimum. These results were also found to extend to deeper networks without nonlinearities, although this was not proven. The output of such networks is a linear function of their input, but they are still useful for analyzing nonlinear neural network models because their loss functions are non-convex functions of the parameters: such a network is essentially a product of several matrices. Saxe et al. precisely characterized the complete learning dynamics of these networks and showed that learning in these models captures many of the qualitative features observed when training deep models with nonlinear activation functions. Dauphin et al. showed experimentally that real neural networks also have loss functions containing many high-cost saddle points. Choromanska et al. provided additional theoretical arguments showing that another class of high-dimensional random functions related to neural networks behaves in the same way.
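A minimal sketch of why even linear networks matter here (a toy example assumed for illustration): a two-layer scalar "network" y_hat = w2 * w1 * x fit to the target y = x. The output is linear in x, yet the loss is a non-convex function of the parameters, with two global minima and a saddle point at the origin.

```python
# Deep-linear-network toy: loss L(w1, w2) = (w1*w2 - 1)^2 is non-convex in the parameters.
import numpy as np

def loss(w1, w2):
    return (w1 * w2 - 1.0) ** 2

def hessian(w1, w2):
    # Analytic Hessian of L with respect to (w1, w2).
    return np.array([[2.0 * w2**2,               2.0 * (2.0 * w1 * w2 - 1.0)],
                     [2.0 * (2.0 * w1 * w2 - 1.0), 2.0 * w1**2]])

# Two distinct global minima, and the midpoint between them has higher loss -> not convex.
print(loss(1.0, 1.0), loss(-1.0, -1.0))        # 0.0 0.0  (both global minima)
print(loss(0.0, 0.0))                          # 1.0 > 0  at the midpoint of the two minima

# The origin is a critical point whose Hessian has mixed-sign eigenvalues: a saddle point.
print(np.linalg.eigvalsh(hessian(0.0, 0.0)))   # [-2.  2.]
```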

For first-order optimization algorithms that use only gradient information, the gradient near a saddle point is usually very small. On the other hand, gradient descent empirically seems able to escape saddle points in many cases. Goodfellow et al. visualized several learning trajectories of state-of-the-art neural networks. These visualizations show that near a prominent saddle point, at which the weights are all zero, the cost function is flat, but they also show that gradient descent trajectories can escape this region quickly. Goodfellow et al. also argue that it should be possible to show analytically that continuous-time gradient descent escapes rather than is attracted to saddle points, although the situation may differ for more realistic uses of gradient descent.
Figure: visualization of neural network cost functions
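The escape behavior can be sketched on a toy cost function (an assumed example, not the actual networks visualized above): gradient descent started just beside a saddle point barely moves at first, because the gradient there is tiny, yet it is repelled along the negative-curvature direction and eventually reaches a minimum.

```python
# Gradient descent escaping a saddle of f(x, y) = x**4/4 - x**2/2 + y**2/2,
# which has a saddle at (0, 0) (Hessian eigenvalues -1 and +1) and minima at (+-1, 0).
import numpy as np

def grad(p):
    x, y = p
    return np.array([x**3 - x, y])

p = np.array([1e-3, 1e-3])       # start just beside the saddle point
lr = 0.1
for step in range(200):
    p = p - lr * grad(p)
    if step in (0, 9, 49, 99, 199):
        print(f"step {step + 1:3d}: x = {p[0]: .4f}, y = {p[1]: .4f}, "
              f"|grad| = {np.linalg.norm(grad(p)):.4f}")
# The gradient is tiny at first and the iterate barely moves, but x drifts away from
# the saddle exponentially and eventually settles at the minimum x = 1, while y decays.
```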
Saddle points are clearly a problem for Newton's method. Gradient descent is designed to move "downhill" rather than to explicitly seek critical points, whereas the goal of Newton's method is to find a point where the gradient is zero; without appropriate modification, it can jump to a saddle point. The proliferation of saddle points in high-dimensional spaces may explain why second-order methods have not succeeded in replacing gradient descent for neural network training. Dauphin et al. (2014) introduced a saddle-free Newton method for second-order optimization and showed significant improvements over the traditional algorithm. Second-order methods remain difficult to scale to large neural networks, but this saddle-free approach would be promising if it could be scaled up.

Besides minima and saddle points, there are other points where the gradient is zero. From an optimization point of view, maxima are very similar to saddle points: many algorithms are not attracted to them, with the exception of the unmodified Newton's method. Like minima, maxima of many classes of random functions are exponentially rare in high-dimensional space.

There may also be wide, flat regions of constant value, in which both the gradient and the Hessian are zero. Such degenerate regions are a major problem for all numerical optimization algorithms. In a convex problem, a wide flat region must consist entirely of global minima, but in a general optimization problem such a region may correspond to a high value of the objective function.
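The contrast between Newton's method and a saddle-free variant can be sketched on the same toy saddle (an assumed quadratic example, not Dauphin et al.'s full algorithm): the unmodified Newton step jumps exactly onto the saddle point, while replacing the Hessian's eigenvalues by their absolute values yields a step that still moves downhill along the negative-curvature direction.

```python
# Newton step vs. a "saddle-free" step on f(x, y) = x^2 - y^2.
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])                      # Hessian of f(x, y) = x^2 - y^2

def grad(p):
    return H @ p                                 # gradient of this quadratic

def newton_step(p):
    return p - np.linalg.solve(H, grad(p))

def saddle_free_step(p):
    # Rebuild the Hessian with |eigenvalues| so every direction is treated as
    # positively curved, then take a Newton-like step (the idea behind saddle-free Newton).
    eigvals, eigvecs = np.linalg.eigh(H)
    H_abs = eigvecs @ np.diag(np.abs(eigvals)) @ eigvecs.T
    return p - np.linalg.solve(H_abs, grad(p))

p0 = np.array([0.5, 0.3])
print("Newton step:     ", newton_step(p0))       # [0. 0.]  -> lands exactly on the saddle
print("saddle-free step:", saddle_free_step(p0))  # [0.  0.6] -> moves downhill along y
```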

Source: blog.csdn.net/hy592070616/article/details/123284368