Those things about SGD and Adam

(1) A framework for understanding optimization algorithms

There is a group of alchemists in the machine learning world. Their daily routine goes like this:

Gather the herbs (data), set up the Bagua furnace (model), light the furnace fire (optimization algorithm), then fan the flames and wait for the elixir to come out.

However, anyone who has cooked knows that with the same ingredients and the same recipe, different heat control produces very different flavors. Too low a flame and the dish stays raw; too high and it burns; uneven heat and it comes out half-cooked.

The same is true in machine learning: the choice of optimization algorithm directly affects the performance of the final model. When the results are poor, the culprit may not be the features or the model design, but the optimization algorithm.

When it comes to optimization algorithms, beginners start with SGD, and veterans will tell you there are better options such as AdaGrad / AdaDelta, or that you can simply use Adam without a second thought. Yet browsing recent academic papers, I found that many top researchers still use plain SGD, at most with Momentum or Nesterov, and often take a swipe at Adam. For example, a UC Berkeley paper wrote in its conclusion:

Despite the fact that our experimental evidence demonstrates that adaptive methods are not advantageous for machine learning, the Adam algorithm remains incredibly popular. We are not sure exactly as to why ……

The helplessness and bitterness between the lines are palpable.

Why is this? Could it be that plain and simple is best after all?

01 Reviewing optimization algorithms with a single framework

First, let's review various optimization algorithms.

Deep learning optimization algorithms have gone through the evolution SGD -> SGDM -> NAG -> AdaGrad -> AdaDelta -> Adam -> Nadam. You can find many tutorials on Google explaining in detail how these algorithms evolved step by step. Here we take a different approach: we use a single framework to organize all of these optimization algorithms and compare them from a higher vantage point.

General framework of optimization algorithms

First define: the parameter to be optimized $w$, the objective function $f(w)$, and the initial learning rate $\alpha$. Then iterative optimization begins. At each step $t$:

1. Calculate the gradient of the objective function with respect to the current parameters: $g_t = \nabla f(w_t)$

2. Calculate the first-order and second-order momentum from the historical gradients: $m_t = \phi(g_1, g_2, \dots, g_t)$, $V_t = \psi(g_1, g_2, \dots, g_t)$

3. Calculate the descent step at the current moment: $\eta_t = \alpha \cdot m_t / \sqrt{V_t}$

4. Update the parameters with that step: $w_{t+1} = w_t - \eta_t$

Once you master this framework, you can easily design your own optimization algorithms.

Armed with this framework, let's unmask the various mysterious optimization algorithms one by one. Steps 3 and 4 are the same for every algorithm; the main differences lie in steps 1 and 2.
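To make the framework concrete, here is a minimal NumPy sketch (my own illustration, not code from any paper; `phi` and `psi` are the placeholder hooks from step 2, and all names are hypothetical):

```python
import numpy as np

def optimizer_step(w, grad_fn, state, alpha=0.1, eps=1e-8):
    g = grad_fn(w)                              # step 1: g_t = grad f(w_t)
    m = state["phi"](g, state)                  # step 2a: first-order momentum m_t
    V = state["psi"](g, state)                  # step 2b: second-order momentum V_t
    eta = alpha * m / (np.sqrt(V) + eps)        # step 3: eta_t = alpha * m_t / sqrt(V_t)
    return w - eta                              # step 4: w_{t+1} = w_t - eta_t

# Plain SGD falls out by choosing phi(g) = g and psi(g) = 1 (no momentum):
state = {"phi": lambda g, s: g, "psi": lambda g, s: np.ones_like(g)}
w = np.array([1.0, -2.0])
for _ in range(100):
    w = optimizer_step(w, lambda x: 2 * x, state)   # minimizes f(w) = ||w||^2
```

Each algorithm below only changes how `phi` and `psi` (steps 1 and 2) are computed.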

02 Optimization algorithms with a fixed learning rate

SGD

Let's look at SGD first. SGD has no concept of momentum, that is: $m_t = g_t$, $V_t = I$ (no scaling).

Substituting into step 3, the descent step is the simplest possible one: $\eta_t = \alpha \cdot g_t$

SGD's biggest drawback is its slow descent; it may also keep oscillating between the walls of a ravine and get stuck at a local optimum.

SGD with Momentum

To damp SGD's oscillations, SGDM adds inertia to the descent process: when running downhill, momentum carries you faster along a steep slope. SGDM stands for SGD with Momentum; it introduces first-order momentum on top of SGD:

$m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$

The first-order momentum is an exponential moving average of the gradient directions over time, roughly the average of the gradients over the most recent $1/(1-\beta_1)$ steps.

In other words, the descent direction at time $t$ is determined not only by the gradient at the current point but also by the accumulated descent direction so far. The empirical value of $\beta_1$ is 0.9, meaning the descent direction is mostly the previously accumulated one, nudged slightly toward the current gradient. Think of a car cornering on a highway: moving at high speed, it can only deviate slightly; a sharp turn would cause an accident.
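As a quick illustration, here is a minimal sketch of this update under the formula above (illustrative names, not library code):

```python
import numpy as np

def sgdm_step(w, g, m_prev, alpha=0.01, beta1=0.9):
    """One SGD-with-Momentum update; V_t stays the identity, so only m_t changes."""
    m = beta1 * m_prev + (1 - beta1) * g   # first-order momentum: EMA of gradients
    return w - alpha * m, m                # step 4 with eta_t = alpha * m_t
```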

SGD with Nesterov Acceleration

Another problem with SGD is that it can get trapped oscillating in the ravine of a local optimum. Imagine walking into a basin surrounded by slightly higher hills: you feel there is no downhill direction and can only stay put. But if you climb up onto the high ground, you will see that the world outside is still vast. So instead of judging the future direction from where we stand now, we should take one step forward, look one step ahead, and see farther.

(source: http://cs231n.github.io/neural-networks-3)

This method is called NAG, Nesterov Accelerated Gradient, a further improvement on SGD and SGD-M. The change is in step 1. We know the main descent direction at time $t$ is determined by the accumulated momentum; the current gradient barely matters. So rather than looking at the gradient where we stand, we should first see where one step along the accumulated momentum would take us, and ask which way to go from there. Therefore, in step 1, NAG does not compute the gradient at the current position, but the gradient at the point reached by taking one step along the accumulated momentum:

$g_t = \nabla f\left(w_t - \alpha \cdot m_{t-1} / \sqrt{V_{t-1}}\right)$

Step 2 then combines this look-ahead gradient with the historical momentum to compute the accumulated momentum at the current moment.
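A minimal sketch of the NAG update, assuming the plain SGD case where $V_t = I$ (names are illustrative):

```python
import numpy as np

def nag_step(w, grad_fn, m_prev, alpha=0.01, beta1=0.9):
    """Nesterov look-ahead: the gradient is taken one momentum step ahead."""
    g_ahead = grad_fn(w - alpha * m_prev)        # step 1 at the look-ahead point
    m = beta1 * m_prev + (1 - beta1) * g_ahead   # step 2: momentum from the look-ahead gradient
    return w - alpha * m, m
```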

03 Optimization algorithms with an adaptive learning rate

So far we have not used the second-order momentum. Its appearance marks the era of "adaptive learning rate" optimization algorithms. SGD and its variants update every parameter with the same learning rate, but deep neural networks often contain huge numbers of parameters, not all of which are used all the time (think of large-scale embeddings).

For frequently updated parameters, we have already accumulated a lot of knowledge about them and do not want them swayed too much by any single sample, so we prefer a slower learning rate. For rarely updated parameters, we know too little and want to learn more from each sample that does appear, that is, a higher learning rate.

AdaGrad

How do we measure historical update frequency? With the second-order momentum: the sum of squares of all gradients seen so far in that dimension:

$V_t = \sum_{\tau=1}^{t} g_\tau^2$

Now revisit the descent step in step 3:

$\eta_t = \alpha \cdot m_t / \sqrt{V_t}$

We can see that the effective learning rate has changed from $\alpha$ to $\alpha / \sqrt{V_t}$. In practice, a small smoothing term $\epsilon$ is added to the denominator to avoid division by zero, so the denominator is always positive. The more frequently a parameter has been updated, the larger its second-order momentum and the smaller its learning rate.

This method works very well on sparse data. But it has a problem: because $V_t$ is monotonically increasing, the learning rate decays monotonically toward 0, which may end training prematurely; even if useful data arrives later, the model can no longer learn from it.
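A minimal per-dimension AdaGrad sketch under the formulas above (illustrative, with the usual smoothing term $\epsilon$):

```python
import numpy as np

def adagrad_step(w, g, V_prev, alpha=0.01, eps=1e-8):
    """AdaGrad: per-dimension learning rate alpha / sqrt(sum of squared gradients)."""
    V = V_prev + g ** 2                            # second-order momentum: running sum of g^2
    return w - alpha * g / (np.sqrt(V) + eps), V   # frequently-updated dims get smaller steps
```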

AdaDelta / RMSProp

Since AdaGrad's monotonic learning-rate decay is too aggressive, we can change how the second-order momentum is computed: instead of accumulating all historical gradients, only look at the gradients within a recent time window. This is where the "Delta" in AdaDelta comes from.

The modification is simple. As mentioned earlier, an exponential moving average approximates an average over a recent window, so we use it to compute the second-order momentum:

$V_t = \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2$

This avoids the unbounded accumulation of second-order momentum and the premature end of training that it causes.
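A minimal sketch of this windowed second-order momentum (illustrative names; `beta2=0.9` here is just a common default, not prescribed by the text):

```python
import numpy as np

def rmsprop_step(w, g, V_prev, alpha=0.001, beta2=0.9, eps=1e-8):
    """AdaDelta/RMSProp-style update: V_t is an EMA of squared gradients, not a running sum."""
    V = beta2 * V_prev + (1 - beta2) * g ** 2      # second-order momentum over a recent window
    return w - alpha * g / (np.sqrt(V) + eps), V
```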

Adam

At this point the arrival of Adam and Nadam feels natural: they bring all of the above together. SGD-M added first-order momentum to SGD; AdaGrad and AdaDelta added second-order momentum. Use both first-order and second-order momentum and you get Adam: Adaptive + Momentum.

SGDM's first-order momentum:

$m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$

plus AdaDelta's second-order momentum:

$V_t = \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2$

The two most common hyperparameters in optimization, $\beta_1$ and $\beta_2$, both appear here: the former controls the first-order momentum, the latter the second-order momentum.

Finally, Nadam. We said Adam brings everything together, yet it leaves out Nesterov. That cannot stand, so we add it: simply compute the gradient as in step 1 of NAG:

$g_t = \nabla f\left(w_t - \alpha \cdot m_{t-1} / \sqrt{V_t}\right)$

This is Nesterov + Adam = Nadam.
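Putting the pieces together, here is a minimal sketch of the Adam update as laid out above, with an optional Nadam-style look-ahead (my own illustration; the published Adam also applies bias correction to $m_t$ and $V_t$, which is omitted here):

```python
import numpy as np

def adam_step(w, grad_fn, m_prev, V_prev, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8, nesterov=False):
    """Adam as described above; set nesterov=True for the Nadam-style look-ahead gradient."""
    look_ahead = w - alpha * m_prev / (np.sqrt(V_prev) + eps) if nesterov else w
    g = grad_fn(look_ahead)                       # step 1 (NAG-style if nesterov=True)
    m = beta1 * m_prev + (1 - beta1) * g          # first-order momentum (from SGDM)
    V = beta2 * V_prev + (1 - beta2) * g ** 2     # second-order momentum (from AdaDelta/RMSProp)
    return w - alpha * m / (np.sqrt(V) + eps), m, V
```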

Having come this far, you can probably see why Adam / Nadam is currently the most popular and easiest-to-use algorithm: plug it in without thinking, convergence is blazingly fast, and the results look great.

So why does Adam still attract so much criticism and scorn from academia? Are those papers just padding?

(2) Adam's two sins

In the previous part, we used a single framework to review the mainstream deep learning optimization algorithms. Generations of researchers have clearly worked hard to refine better and better elixirs for us (that is, to train better models). In theory, each generation is more complete than the last, and Adam/Nadam appears to be the peak. Why, then, can't people let go of good old SGD?

Let me give an example. Many years ago, photography was far out of reach for ordinary people. Ten years ago, point-and-shoot cameras became popular and almost every tourist carried one. Once smartphones arrived, photography entered every household: you can shoot casually with your phone, with 20-megapixel cameras front and back to "light up your beauty" (wait, which ad am I quoting?). Yet professional photographers still prefer SLRs, tirelessly adjusting aperture, shutter speed, ISO, white balance, a pile of terms that selfie-takers never worry about. Advances in technology let point-and-shoot operation produce decent results, but to get the best result in a particular scene you still need a deep understanding of light, composition, and your equipment.

The same goes for optimization algorithms. In the previous part we fit all of these algorithms into one framework; you can see that they all reach the same destination by different routes, essentially adding various forms of active learning-rate control on top of SGD. If you don't want to do any fine tuning, then Adam is obviously the easiest to use out of the box.

But point-and-shoot operation is not suitable for every occasion. A researcher who understands the data deeply can control the parameters of the optimization process far more freely, and it is no surprise that they can achieve better results. After all, if carefully tuned parameters could not beat out-of-the-box Adam, that would be a direct challenge to the alchemy skills of top researchers!

Recently, a number of papers have taken aim at Adam. Let's briefly look at what they say:

04 Adam's first sin: it may fail to converge

This comes from "On the Convergence of Adam and Beyond", a paper then under anonymous review at ICLR 2018, one of the top conferences in deep learning. It examines the convergence of the Adam algorithm and, through counterexamples, proves that Adam may fail to converge in some cases.

Recall the effective learning rate of the algorithms above:

$\eta_t = \alpha / \sqrt{V_t}$

Among them, SGD has no second-order momentum, so its learning rate is constant (in practice a decay schedule is applied, so the learning rate decreases). AdaGrad's second-order momentum keeps accumulating and grows monotonically, so its learning rate decreases monotonically. In both cases the learning rate keeps shrinking and eventually approaches 0, so the model can converge.

But AdaDelta and Adam are different: their second-order momentum is accumulated over a sliding time window, and as the window moves, the data it sees may change dramatically, so $V_t$ can fluctuate up and down rather than change monotonically. This can make the learning rate oscillate in the later stages of training and prevent the model from converging.

The paper also proposes a fix. Since Adam's learning rate is mainly governed by the second-order momentum, convergence can be ensured by constraining how the second-order momentum changes so that it cannot fluctuate:

$V_t = \max\left(\beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2,\ V_{t-1}\right)$

This modification guarantees

$\|V_t\| \geq \|V_{t-1}\|$

so the learning rate decreases monotonically.
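A minimal sketch of this constrained second-order momentum (in the spirit of the fix described above; names are illustrative):

```python
import numpy as np

def clipped_second_moment(V_prev, g, beta2=0.999):
    """Second-order momentum with the non-decreasing constraint described above;
    keeps the effective learning rate from growing again."""
    V_candidate = beta2 * V_prev + (1 - beta2) * g ** 2
    return np.maximum(V_candidate, V_prev)   # enforce V_t >= V_{t-1}
```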

05 Adam's second sin: it may miss the global optimum

 

Deep neural networks contain huge numbers of parameters. In such an extremely high-dimensional space, the non-convex objective function rises and falls everywhere, with countless highlands and basins. Some are ridges, which momentum may help us jump over; but others are plateaus, where we may wander for a long time without finding a way down, and training stops.

Two papers on arXiv have looked into this issue.

The first is the UC Berkeley paper mentioned above, "The Marginal Value of Adaptive Gradient Methods in Machine Learning". It argues that for the same optimization problem, different algorithms may find different answers, and algorithms with adaptive learning rates often find particularly poor solutions. The authors construct a specific data example in which an adaptive learning rate algorithm may overfit the features that appear early on, while features that appear later struggle to correct that early fit. However, the example is extreme and may not arise in practice.

The other is "Improving Generalization Performance by Switching from Adam to SGD", which takes an experimental approach. On the CIFAR-10 dataset, Adam converges faster than SGD, but its final result is worse. Further experiments suggest the main reason is that Adam's learning rate becomes too low late in training, which hurts effective convergence. When the authors force a lower bound on Adam's learning rate, the results improve considerably.

So they propose an improvement to Adam: use Adam early on to enjoy its fast convergence, then switch to SGD to slowly hunt for the optimum. Practitioners had used this trick before, but mainly chose the switching time and the post-switch learning rate by experience. This paper makes the switch automatic, giving a criterion for when to switch from Adam to SGD and a formula for the learning rate to use after switching. The results look good.

This algorithm is quite interesting, and we will come back to it in the next part; the full algorithm framework diagram can be found in the original paper.

06 Should I use Adam or SGD?

So, after all this, is Adam better or is SGD? It is hard to say in one sentence. Browse the papers at major conferences: many use SGD, many use Adam, and plenty prefer AdaGrad or AdaDelta. Researchers have probably tried every algorithm and use whichever works. After all, the point of a paper is to highlight its own contribution in one particular aspect; everything else gets whatever works best, because who wants to lose on the details?

Judging from the papers criticizing Adam, most construct extreme examples to demonstrate cases where Adam can fail. These examples are generally too extreme to occur in practice, but they remind us that understanding your data is essential to designing an algorithm. The evolution of optimization algorithms is built on certain assumptions about data; whether a given algorithm works depends on whether your data suits its appetite.

Algorithms are elegant, but data is the foundation.

On the other hand, although Adam and its kind simplify tuning, they do not solve the problem once and for all. The default parameters are good but not universal. So, on top of a solid understanding of your data, you still need to run proper tuning experiments tailored to the characteristics of the data and the algorithm.

Young man, keep refining those elixirs.

(3) How to choose and use optimization algorithms

"In the previous two articles, we used a framework to sort out the major optimization algorithms, and pointed out the possible problems of the adaptive learning rate optimization algorithm represented by Adam. So, how should we choose in practice? Introduce the combination strategy of Adam + SGD and some useful tricks."

07 The core difference between algorithms: the descent direction

From the framework in the first part, we can see that the core difference between optimization algorithms lies in the descent step computed in step 3:

$\eta_t = \frac{\alpha}{\sqrt{V_t}} \cdot m_t$

In this formula, the first factor is the effective learning rate (the step length) and the second is the actual descent direction. SGD descends along the negative gradient at the current position; SGD with first-order momentum descends along the momentum direction at that position. Adaptive learning rate algorithms give each parameter its own learning rate, that is, different step lengths in different dimensions, so their descent direction is a scaled version of the first-order momentum direction.

Because the descent directions differ, different algorithms may end up at completely different local optima. The paper "An empirical analysis of the optimization of deep network loss surfaces" ran an interesting experiment: they projected the surface formed by the objective value over the parameters into a three-dimensional space, so that we can see directly how each algorithm finds its lowest point on that surface.

(Figure from the paper: projected loss surfaces, colored by objective value, showing where different optimizers end up.)

The figure above shows the paper's results. The horizontal and vertical axes are the feature space after dimensionality reduction, and the color encodes the objective value: red for plateaus, blue for basins. The experiments are paired: two algorithms start from the same initial point, and the optimized results are compared. Almost every pair of algorithms ends up in a different basin, often separated by a high plateau. This shows that different algorithms choose different descent directions when crossing a plateau.

08 Adam+SGD combination strategy

It is the choice made at each crossroads that determines your destination. If heaven gave me another chance, I would say three letters to that girl: S-G-D!

The pros and cons of the different optimization algorithms are still debated. From what I have seen in papers and community discussions, the mainstream view is: adaptive learning rate algorithms such as Adam have an advantage on sparse data and converge quickly, but carefully tuned SGD (with Momentum) usually reaches a better final result.

Naturally, we wonder: can we combine the two, descending quickly with Adam first and then fine-tuning with SGD, and get the best of both worlds? The idea is simple, but there are two technical problems:

  1. When should we switch algorithms? If we switch too late, Adam may already have run into its own basin, and SGD will not be able to escape it no matter how well it is tuned.

  2. What learning rate should we use after switching? Adam uses an adaptive learning rate that relies on the accumulated second-order momentum; if SGD takes over training, what learning rate should it use?

The paper mentioned earlier, Improving Generalization Performance by Switching from Adam to SGD, proposes answers to both questions.

Let's first look at the second question, the learning rate after switching.

Adam's descent direction is

$\eta_t^{Adam} = \frac{\alpha}{\sqrt{V_t}} \cdot m_t$

while SGD's descent direction is

$\eta_t^{SGD} = \alpha^{SGD} \cdot g_t$

SGD's step can be decomposed into two components: one along Adam's descent direction and one orthogonal to it. Its projection onto Adam's descent direction is the distance SGD travels along the direction Adam has chosen; its projection onto the orthogonal direction is the distance SGD travels along the correction direction it picks for itself.

 

(Figure from the original paper: p is Adam's descent direction, g is the gradient direction, and r is SGD's learning rate.)

If SGD is to finish the road Adam left unfinished, it must first take up Adam's banner: walk one step along Adam's descent direction, and then take the corresponding step in the orthogonal direction.

This tells us how to determine SGD's step size (learning rate): the projection of SGD's step onto Adam's descent direction should be exactly equal to Adam's step (direction and length included). That is:

$\mathrm{proj}_{\eta_t^{Adam}}\left(\alpha_t^{SGD} \cdot g_t\right) = \eta_t^{Adam}$

Solving this equation gives the learning rate for the SGD that takes over:

$\alpha_t^{SGD} = \dfrac{(\eta_t^{Adam})^T \eta_t^{Adam}}{(\eta_t^{Adam})^T g_t}$

To reduce the influence of noise, a moving average is used to smooth this learning rate estimate:

$\lambda_t^{SGD} = \beta_2 \cdot \lambda_{t-1}^{SGD} + (1-\beta_2) \cdot \alpha_t^{SGD}, \qquad \tilde{\lambda}_t^{SGD} = \lambda_t^{SGD} / (1-\beta_2^t)$

Here Adam's $\beta_2$ parameter is reused directly.

Now for the first question: when to switch algorithms.

The method the authors propose is very simple: switch when the moving average of the corresponding SGD learning rate is essentially stable, namely:

$\left|\tilde{\lambda}_t^{SGD} - \alpha_t^{SGD}\right| < \epsilon$

After each iteration, compute the learning rate the successor SGD would use; once it is found to be essentially stable, switch to SGD with that learning rate and continue training. Whether this is the optimal moment to switch, however, is not proven mathematically; the authors only verify the effect experimentally. The switching time remains a topic worth deeper study.
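To make the two formulas concrete, here is a minimal sketch of the learning rate estimate and the switching test (my own reconstruction from the formulas above, not the authors' code; names are assumptions):

```python
import numpy as np

def sgd_lr_estimate(eta_adam, g):
    """Projected SGD learning rate: (eta_adam . eta_adam) / (eta_adam . g)."""
    return float(eta_adam @ eta_adam) / float(eta_adam @ g)

def update_and_check_switch(lam_prev, alpha_sgd, t, beta2=0.999, eps=1e-9):
    """Smooth the estimate with a bias-corrected EMA and test the stability criterion."""
    lam = beta2 * lam_prev + (1 - beta2) * alpha_sgd   # moving average of alpha_sgd
    lam_corrected = lam / (1 - beta2 ** t)             # bias correction, reusing Adam's beta2
    switch = abs(lam_corrected - alpha_sgd) < eps      # essentially stable -> switch to SGD
    return switch, lam, lam_corrected
```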

09 Common tricks for optimization algorithms

Finally, here are some practical tips for choosing and using optimization algorithms.

1. First, there is no settled conclusion about which algorithm is better or worse. If you are just getting started, prefer SGD+Nesterov Momentum or Adam. (Stanford CS231n: "The two recommended updates to use are either SGD+Nesterov Momentum or Adam.")

2. Choose an algorithm you are familiar with, so you can apply your tuning experience more skillfully.

3. Understand your data thoroughly. If the model is very sparse, prefer an adaptive learning rate algorithm.

4. Choose according to your needs. While designing and experimenting with a model, use Adam for fast optimization to quickly verify whether a new model works; before the model goes online or results are published, use carefully tuned SGD to push the model to its best.

5. Experiment on a small dataset first. Some work points out that the convergence behaviour of stochastic gradient descent has little to do with the size of the dataset. ("The mathematics of stochastic gradient descent are amazingly independent of the training set size. In particular, the asymptotic SGD convergence rates are independent from the sample size." [2]) So you can first experiment on a small, representative dataset, test which optimizer works best, and search for the best training hyperparameters.

6. Consider combining algorithms: first use Adam for a rapid descent, then switch to SGD for thorough fine-tuning. For the switching strategy, refer to the method introduced above.

7. Shuffle the dataset thoroughly. Otherwise certain features appear in clusters, which can make an adaptive learning rate algorithm over-learn at some times and under-learn at others, skewing the descent direction.

8. During training, continuously monitor the objective value and metrics such as accuracy or AUC on both the training data and the validation data. Monitoring the training data ensures the model is being trained adequately (the descent direction is correct and the learning rate is high enough); monitoring the validation data guards against overfitting.

9. Use an appropriate learning rate decay strategy. You can decay on a fixed schedule, for example once per epoch, or monitor a metric such as accuracy or AUC and reduce the learning rate when the metric on the validation set stops improving or drops; a minimal sketch of the latter follows after this list.
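As referenced in tip 9, here is a minimal sketch of a plateau-based decay rule (illustrative only; `metric_history` would hold, say, per-epoch validation AUC):

```python
def reduce_lr_on_plateau(lr, metric_history, factor=0.5, patience=3, min_lr=1e-6):
    """Halve the learning rate when the monitored metric has not improved
    over the last `patience` epochs."""
    if len(metric_history) > patience and \
            max(metric_history[-patience:]) <= max(metric_history[:-patience]):
        lr = max(lr * factor, min_lr)
    return lr
```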

These are just some of the tricks for choosing and using optimization algorithms. If I have missed anything, please add it in the comments. Thanks in advance!

 

This article is reposted from: https://blog.csdn.net/jiachen0212/article/details/80086926 (it will be removed upon request if it infringes).
