Mathematics in Machine Learning - The Challenge of Deep Learning Optimization: Weak Correspondence Between Local and Global Structures

Categories: General Catalogue of Mathematics in Machine Learning
Related Articles:
· Ill-Conditioning
· Local Minima
· Plateaus, Saddle Points, and Other Flat Regions
· Vanishing and Exploding Gradients
· Inexact Gradients
· Weak Correspondence Between Local and Global Structures


Many of the issues we've discussed so far concern the properties of the loss function at a single point: if $J(\theta)$ is ill-conditioned at the current point $\theta$, or $\theta$ lies on a cliff, or $\theta$ is a saddle point where the descent direction is not obvious, then the current update step becomes difficult.
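As a minimal sketch of such a single-point pathology, consider the saddle of the toy function $f(x, y) = x^2 - y^2$ (an illustrative choice, not from the text): the gradient vanishes there, yet the Hessian has eigenvalues of both signs, so plain gradient descent receives no useful direction.

```python
import numpy as np

# At the saddle of f(x, y) = x^2 - y^2 the gradient is zero, but the
# Hessian has one positive and one negative eigenvalue, so the point is
# neither a minimum nor a maximum and local descent stalls there.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
print(np.linalg.norm(grad(np.zeros(2))))  # 0.0: the gradient vanishes
print(np.linalg.eigvalsh(hessian))        # eigenvalues of mixed sign: a saddle
```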

Even if we overcome all of the above difficulties at a single point, we can still perform poorly if the locally improving direction does not point toward a far lower-cost region. Goodfellow et al. argue that most of the runtime of training is due to the length of the trajectory needed to arrive at the solution. As shown in the figure below, the learning trajectory can spend most of its time tracing a wide arc around a mountain-shaped structure.
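This effect can be sketched numerically on an ill-conditioned quadratic bowl (an illustrative stand-in, not the figure's actual surface): every step is a genuine local improvement, yet the trajectory zig-zags, so the distance travelled greatly exceeds the straight-line distance to the minimum.

```python
import numpy as np

# Gradient descent on J(theta) = 0.5 * theta^T H theta with curvatures
# that differ by a factor of 100. The minimum is at the origin.
H = np.diag([1.0, 100.0])
theta = np.array([10.0, 1.0])
start = theta.copy()
lr = 0.015                      # stability requires lr < 2 / 100

path_length = 0.0
for _ in range(1000):
    step = -lr * (H @ theta)
    path_length += np.linalg.norm(step)
    theta = theta + step

straight = np.linalg.norm(start)   # straight-line distance to the minimum
print(path_length > straight)      # True: the path travelled is longer
```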
Figure: problems caused by local optima
Much research on optimization difficulty focuses on whether training arrives at a global minimum, a local minimum, or a saddle point, but in practice neural networks do not reach any of these critical points. The figure below shows that neural networks often do not even reach regions where the gradient is small. Moreover, such critical points need not exist at all. For example, the loss function $-\log p(y \mid x; \theta)$ may have no global minimum point, but instead approach some value asymptotically as the model becomes more confident during training. For a classifier with discrete $y$ and a softmax distribution $p(y \mid x)$, if the model can correctly classify every example in the training set, the negative log-likelihood can get arbitrarily close to zero without ever equaling it. Likewise, for a real-valued model $p(y \mid x) = \mathcal{N}(y; f(\theta), \beta^{-1})$, the negative log-likelihood can tend to negative infinity: if $f(\theta)$ can predict the targets $y$ exactly, the learning algorithm will increase $\beta$ without bound. The figure above shows an example where local optimization fails to find a good cost value even in the absence of local minima and saddle points.
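The softmax asymptote above can be sketched in a few lines (the logits here are arbitrary illustrative values): once the classifier labels an example correctly, scaling its logits up keeps shrinking the negative log-likelihood toward zero without ever reaching it.

```python
import numpy as np

# Negative log-likelihood of the true class under a softmax over logits,
# computed in a numerically stable way.
def nll(logits, label):
    z = logits - logits.max()              # stable log-softmax shift
    return -(z[label] - np.log(np.exp(z).sum()))

logits = np.array([2.0, -1.0, 0.5])        # class 0 is (correctly) the argmax
losses = [nll(s * logits, label=0) for s in (1.0, 2.0, 5.0, 10.0)]
print(losses)  # strictly decreasing, every value > 0
```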
Figure: gradient descent usually does not arrive at a critical point of any kind
Future research will need to further explore the factors that affect the length of the learning trajectory and better characterize the outcome of the training process. Many existing methods aim to find good initial points for problems that have a difficult global structure, rather than to develop algorithms that make non-local updates.

Gradient descent, and essentially all learning algorithms that can train neural networks effectively, are based on small local updates. The previous subsections focused on why the correct direction of these local updates is difficult to compute: we may be able to compute only some properties of the objective function, such as a biased approximation of its gradient. In these cases, local descent may or may not define a sufficiently short path to a valid solution, but either way we cannot actually follow the path of local descent. The objective function may also have problems such as ill-conditioning or discontinuous gradients, so that the region over which the gradient provides a good approximation of the objective function is very small. In these cases, local descent with step size $\epsilon$ may define a reasonably short path to the solution, but we can only compute the local descent direction with step size $\delta \ll \epsilon$; the path then involves many updates, and following it incurs a high computational cost. Sometimes local information provides no guidance at all, for example when the objective function has a wide, flat region, or when we land exactly on a critical point (the latter usually occurs only with methods that explicitly solve for critical points, such as Newton's method). In these cases, local descent cannot define a path to the solution at all. In other cases, local moves may be too greedy, moving downhill in a direction that leads away from any feasible solution, or approaching the solution by an unnecessarily roundabout route.
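The cost of being forced to take tiny steps can be sketched on a one-dimensional quadratic $J(\theta) = \tfrac{1}{2} c\,\theta^2$ (an illustrative toy problem; the helper name is ours): gradient descent is stable only for learning rates below $2/c$, and a much smaller step size multiplies the number of updates needed to traverse the same path.

```python
# Count gradient-descent updates until |theta| falls below a tolerance
# on J(theta) = 0.5 * c * theta^2, whose gradient is c * theta.
def steps_to_converge(c, lr, theta=1.0, tol=1e-6, cap=10 ** 6):
    for k in range(cap):
        if abs(theta) < tol:
            return k
        theta -= lr * c * theta
    return cap

# With c = 1000, stability requires lr < 0.002. A step size near that
# limit converges quickly; a step size hundreds of times smaller still
# converges, but the path takes many more updates.
print(steps_to_converge(c=1000.0, lr=1.9e-3))  # few steps
print(steps_to_converge(c=1000.0, lr=1e-5))    # many more steps, same endpoint
```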

Regardless of which of these problems is most significant, all of them could be avoided if there existed a region of space that local descent connects reasonably directly to some solution, and if we could initialize learning within that good region. This final point of view suggests that researching how to choose better initialization points for traditional optimization algorithms is the more practical goal.
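The importance of the initial region can be sketched on a hypothetical one-dimensional double-well objective (the function and helper below are illustrative, not from the text): purely local descent reaches whichever basin it starts in, so only one of two initializations finds the deeper well.

```python
# A double-well objective with one deep and one shallow minimum.
def f(x):
    return x ** 4 - 3 * x ** 2 + x

# Plain gradient descent; f'(x) = 4x^3 - 6x + 1.
def descend(x, lr=0.01, iters=5000):
    for _ in range(iters):
        x -= lr * (4 * x ** 3 - 6 * x + 1)
    return x

left, right = descend(-2.0), descend(2.0)
print(f(left) < f(right))  # True: only the left initialization finds the deeper well
```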

Some theoretical results show that any optimization algorithm we design for neural networks has performance limits. Usually, however, these results have little bearing on the use of neural networks in practice.

Some theoretical results apply only when the units of a neural network output discrete values. However, most neural network units output smooth, continuous values, which makes optimization via local search feasible. Some theoretical results show that certain classes of problems are intractable, but it can be hard to tell whether a particular problem belongs to such a class. Other results show that finding a solution for a network of a given size is difficult, but in practice we can easily find an acceptable solution by using a larger network with more parameters. Moreover, in neural network training we usually do not care about finding the exact minimum of a function, but only about reducing its value enough to obtain good generalization error. It is very difficult to analyze theoretically whether an optimization algorithm can achieve this goal, so establishing more realistic bounds on the performance of optimization algorithms remains an important research goal.

Origin blog.csdn.net/hy592070616/article/details/123285338