A new way of understanding the gradient descent algorithm

Benjamin Grimmer, assistant professor of applied mathematics and statistics at Johns Hopkins University, offers a whole new way of understanding the gradient descent algorithm.

In the world of machine learning, optimization problems are everywhere and enormously important. An optimization problem seeks the best way to accomplish something, such as a mobile phone's GPS calculating the shortest route to a destination, or a travel website searching for the cheapest flight that matches an itinerary. Machine learning applications, in turn, learn by analyzing patterns in data and try to give the most accurate possible answer to whatever optimization problem they are handed.

For simple optimization problems, finding the optimal solution is just a matter of arithmetic. In 1847, while working on a harder example, astronomical calculations, the French mathematician Augustin-Louis Cauchy pioneered a now-ubiquitous optimization method known as gradient descent, one of the simplest and most classical first-order methods in optimization.
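
Before getting to the new result, here is a minimal sketch of plain gradient descent with a constant step size, written in Python. The least-squares objective, the 1/L step choice, and all variable names are illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

def gradient_descent(grad, x0, step_size, n_iters):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step_size * grad(x)
    return x

# Toy example: minimize f(x) = 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A.T @ (A @ x - b)

L = np.linalg.norm(A.T @ A, 2)                       # smoothness constant of f
x_hat = gradient_descent(grad, np.zeros(2), 1.0 / L, 500)
print(x_hat)                                         # approaches the least-squares solution
```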

Today, thanks to its simplicity and low computational cost, gradient descent underpins most machine learning programs, and other fields use it to analyze data and solve engineering problems as well. Mathematicians have been refining the method for over a hundred years. Yet a paper published last month suggests that a fundamental assumption about gradient descent may be wrong.

The paper is "Provably Faster Gradient Descent via Long Steps," and its sole author is Benjamin Grimmer, assistant professor of applied mathematics and statistics at Johns Hopkins University. What he found astonished him and upended his intuition.

His counterintuitive results show that gradient descent can run nearly three times faster if a long-established rule for choosing step sizes is broken. More specifically, he argues that the algorithm converges faster when it includes unexpectedly large step sizes, contrary to what researchers have long believed.

Paper address: https://arxiv.org/pdf/2307.06324.pdf

While this theoretical advance may not carry over directly to the tougher problems machine learning has to solve, it could prompt researchers to rethink their understanding of gradient descent.

Shuvomoy Das Gupta, an optimization researcher at MIT, said, "It turns out that we don't fully understand the theory behind gradient descent. Now, this study brings us closer to understanding the role of gradient descent."

The paper establishes provably faster convergence rates for gradient descent on smooth convex optimization problems via a computer-assisted analysis technique. Instead of the typical one-iteration-at-a-time inductive analysis used for most first-order methods, the author analyzes the combined effect of many iterations at once, which opens the door to non-constant step-size strategies.

The results show that long steps may increase the objective value in the short term but yield provably faster convergence in the long run. In addition, backed by simple numerical validation, the author conjectures that gradient descent with long steps can reach an even faster O(1/(T log T)) rate.
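
For context, the textbook guarantee for gradient descent with constant step size 1/L on an L-smooth convex function is the classical O(1/T) bound below (a standard result stated here only for comparison; the constant in Grimmer's improved bound is omitted).

```latex
% Classical guarantee: constant step size 1/L on an L-smooth convex f with minimizer x^*.
\[
x_{k+1} = x_k - \tfrac{1}{L}\,\nabla f(x_k),
\qquad
f(x_T) - f(x^*) \;\le\; \frac{L\,\lVert x_0 - x^*\rVert^2}{2T}.
\]
% Long-step patterns provably improve the constant in this O(1/T) bound;
% the paper's conjecture is that the rate itself improves to O\!\bigl(1/(T\log T)\bigr).
```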

Specifically, the proof builds on the performance estimation problem (PEP) framework, which computes or bounds the worst-case behavior of a given algorithm by solving a semidefinite program (SDP). By exhibiting a feasible solution to the associated SDP, the author certifies a descent guarantee for the non-constant step-size patterns and thereby obtains faster convergence guarantees.

In practice, designing provably faster non-constant-step gradient descent methods amounts to finding "straightforward" step-size patterns whose average step size is large. Verifying that a given pattern is straightforward is easy and can be done with semidefinite programming (see Theorem 3.1 of the paper). Table 1 of the paper lists straightforward patterns with increasingly fast convergence guarantees, each verified by a computer-generated, exact-arithmetic semidefinite programming solution. Future work may identify straightforward patterns with even larger step sizes, along with other tractable non-constant, periodic long-step strategies. However, finding long straightforward patterns h is difficult: the set of all straightforward patterns is non-convex, so local searches often come up empty. As shown in Table 1, a pattern of length t = 2^m − 1 is built by repeating the pattern of length 2^(m−1) − 1 twice, inserting a new long step between the two copies, and slightly shortening the long steps within the length-2^(m−1) − 1 subpatterns, as sketched below. According to the author, this recursive pattern bears a strong resemblance to the cyclic and fractal Chebyshev step-size patterns for quadratic minimization studied in prior work, though the connection between them has not yet been established.
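
The recursive construction just described can be sketched structurally in Python. This is only a sketch of the doubling rule (repeat the previous pattern twice and insert a new long step between the copies); the shrink factor, the threshold for what counts as a long step, and the numeric values below are placeholders, not the paper's Table 1 entries.

```python
def double_pattern(h, new_long_step, shrink=0.98):
    """Structural sketch: build a step-size pattern of length 2*len(h) + 1
    from a pattern h of length 2**(m-1) - 1.

    The previous pattern is repeated twice, a brand-new long step is
    inserted between the copies, and the long steps inside the copies are
    shortened slightly. The shrink factor and the >2 threshold are
    placeholders for illustration, not values from the paper.
    """
    adjusted = [s * shrink if s > 2 else s for s in h]  # shorten only the long steps
    return adjusted + [new_long_step] + adjusted


# Example: grow a pattern of length 3 into one of length 7 (toy numbers).
h3 = [1.5, 4.9, 1.5]            # length 2**2 - 1; the 4.9 middle step is cited later in this article
h7 = double_pattern(h3, 12.0)   # length 2**3 - 1; 12.0 is an arbitrary placeholder
print(h7)
```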

The author says his approach is very similar to one first proposed by the optimization researcher Jason Altschuler of the University of Pennsylvania, who showed that repeating step-size patterns of length 2 or 3 yields faster contraction toward the minimizer in smooth, strongly convex minimization.

For more details, please refer to the original paper.

From small steps to long steps: breaking through the step-length limit

For decades, the conventional wisdom in the field has been to use small step sizes, even though no one could prove that smaller steps are actually better. Concretely, in the standard gradient descent analysis (with step sizes measured in units of the smoothness constant), the step size is kept no larger than 2.
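
A one-line justification of that cap, from the standard textbook analysis rather than from Grimmer's paper: for an L-smooth function, the descent lemma bounds the progress of a single step of size h/L, and the bound guarantees a decrease only when 0 < h < 2.

```latex
% Descent lemma for an L-smooth f, applied to the step x^+ = x - (h/L)\nabla f(x):
\[
f(x^+) \;\le\; f(x) - \frac{h}{L}\lVert\nabla f(x)\rVert^2
              + \frac{L}{2}\cdot\frac{h^2}{L^2}\lVert\nabla f(x)\rVert^2
       \;=\; f(x) - \frac{1}{L}\Bigl(h - \frac{h^2}{2}\Bigr)\lVert\nabla f(x)\rVert^2 .
\]
% The coefficient h - h^2/2 is positive only for 0 < h < 2, which is why
% one-step-at-a-time analyses keep the normalized step size below 2.
```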

With advances in computer-aided analysis, optimization theorists have begun to test ever more extreme techniques. In work recently published in the journal Mathematical Programming, Das Gupta and others asked a computer to find the optimal step sizes for an algorithm limited to 50 steps, a kind of meta-optimization problem. They found that the optimal 50 step lengths varied widely, with one step in the sequence nearly reaching a length of 37, far above the usual cap of 2.

Paper address: https://link.springer.com/article/10.1007/s10107-023-01973-1

This result suggested that optimization researchers were missing something. Curious, Grimmer set out to turn Das Gupta's numerical findings into a more general theorem. To get past the arbitrary 50-step cap, he looked for optimal step-size sequences that can be repeated over and over, drawing closer to the optimal answer with each repetition. He had a computer test millions of candidate sequences of steps to find those that converged to an answer the fastest.

Grimmer found that the fastest sequences have one thing in common: the middle step is always a big one, and its size depends on the number of steps in the repeated sequence. For the 3-step sequence, the big step has length 4.9; for the 15-step sequence, the algorithm recommends a step of length 29.7; and for the longest sequence tested, 127 steps, the big middle step has length 370. Overall, these sequences reach the optimum nearly three times faster than a series of uniformly small steps.
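
A quick illustration of the effect on a toy problem (a poorly conditioned quadratic, not the worst-case smooth convex functions the paper analyzes). The 4.9 middle step is the value quoted above for the 3-step sequence; the flanking 1.5 values and everything else in the snippet are placeholder choices for illustration, not taken from the paper.

```python
import numpy as np

def gradient_descent(grad, x0, step_sizes, n_iters):
    """Run gradient descent, cycling through the given step-size pattern."""
    x = x0.copy()
    for k in range(n_iters):
        x = x - step_sizes[k % len(step_sizes)] * grad(x)
    return x

# Ill-conditioned quadratic f(x) = 0.5 * sum(lam * x**2) with smoothness L = 1.
rng = np.random.default_rng(0)
lam = np.logspace(-3, 0, 200)            # eigenvalues spread over (0, 1]
x0 = rng.standard_normal(200)
f = lambda x: 0.5 * np.sum(lam * x**2)
grad = lambda x: lam * x

n_iters = 300
x_const = gradient_descent(grad, x0, [1.0], n_iters)             # conventional small steps
x_cyclic = gradient_descent(grad, x0, [1.5, 4.9, 1.5], n_iters)  # long-step pattern (4.9 from the article)

print(f"constant step 1.0 : f = {f(x_const):.3e}")
print(f"cyclic long steps : f = {f(x_cyclic):.3e}")   # typically noticeably smaller
```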

Although the theory is new, it won't change how gradient descent is used today

This looping approach represents a different way of thinking about gradient descent, says Aymeric Dieuleveut, an optimization researcher at the École Polytechnique in France. “My gut tells me that I shouldn’t be thinking about a problem step-by-step, but multiple steps in succession. I think a lot of people miss that,” he said.

But while these insights may change how researchers think about gradient descent, they probably won't change how the technique is used in practice. After all, Grimmer's paper considered only smooth functions, which have no sharp bends, and convex functions, which are shaped like a bowl with a single optimal value at the bottom. Such functions are fundamental in theory but less relevant in practice; the optimization problems machine learning researchers face are usually far more complicated.

Gauthier Gidel, an optimization and machine learning researcher at the Université de Montréal, notes that there are more sophisticated techniques that already converge faster than Grimmer's long-step method, but they cost more to run. Researchers have therefore long hoped that plain gradient descent could compete given the right combination of step sizes; unfortunately, the new study's threefold speedup is not enough to get there.

Gidel posed his own question, "Although it shows a slight improvement, I think the real question is: can we really close this gap?"

These results also present another theoretical mystery, one that has kept Grimmer up at night: Why do the ideal step-size patterns all have such symmetrical shapes? Not only is the largest step always exactly in the middle, but the same structure appears on either side of it: zooming in and subdividing the sequence reveals an "almost fractal pattern" of large steps surrounded by smaller ones. This repetition implies that an underlying structure governs the optimal solution, one that no one has yet been able to explain.

But Grimmer, at least, is hopeful: "If I can't solve this puzzle, someone else will."

Original link: https://www.quantamagazine.org/risky-giant-steps-can-solve-optimization-problems-faster-20230811/

Origin: blog.csdn.net/qq_29788741/article/details/132266675