Learning notes: getting started with optimizers in neural network models, and how to choose a more suitable one

I have started learning how to tune the parameters of neural network models. Training really is time-consuming, laborious, and mentally exhausting; there is a sense of "metaphysics" to it, and I still cannot figure out the way in. The entry threshold of these AI models is hard to clear, but my interest is inexplicably strong!!! It feels a bit above my level, but that's okay; I get simple satisfaction from working on related things. Hahaha

Below is a summary of my optimizer-related learning notes.

First: why the choice of optimizer matters

As a "device" for model optimization, the optimizer is one of the important components in the field of deep learning. Different optimizers will produce completely different results when performing deep learning tasks.

Choosing the right optimizer for a machine learning project is both important and difficult. Every optimizer has its own strengths, weaknesses, and suitable application scenarios, so the choice has a large impact on the model: a suitable optimizer can help a great deal, while an unsuitable one can hurt just as much.

So this article records how to choose an optimizer that better suits your project and model.

Second: common optimizers

Rationale: most popular optimizers in deep learning are based on gradient descent, i.e. they iteratively estimate the slope of a given loss function and shift the parameters in the opposite direction (thereby moving downhill toward an assumed global minimum).

Common optimization algorithms include gradient descent (stochastic gradient descent SGD, and its variants batch gradient descent BGD and mini-batch gradient descent MBGD), as well as AdaGrad, Adam, Momentum, and other optimizers.

Stochastic gradient descent (SGD) has been in use since the 1950s, while adaptive gradient methods such as AdaGrad and Adam rose to popularity in the 2010s.
Recent trends, however, show some studies switching back to plain SGD rather than adaptive gradient methods. Meanwhile, current challenges in deep learning have given rise to new SGD variants such as LARS and LAMB.
The takeaway: an old optimization algorithm is not necessarily worse, and a new one is not necessarily better; what matters, given the situation at hand, is which algorithm is more suitable.

1. Stochastic gradient descent (SGD) algorithm

In Stochastic Gradient Descent (SGD), the optimizer estimates the direction of steepest descent from a mini-batch and takes a step in that direction. Because the step size is fixed, SGD can quickly stall on plateaus or in local minima. (Imagine a 'trough' or a long flat stretch: it is hard to jump out, and the optimizer falls into a local optimum.)
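
A minimal sketch of this update rule (NumPy; the function name and the default learning rate are illustrative assumptions, not from the original post):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Move the parameters a fixed-size step against the mini-batch gradient.
    return w - lr * grad
```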

2. SGD with momentum

With momentum (governed by a constant β < 1), SGD accelerates in directions of consistent descent, which is why the method is also called the "heavy ball method". This acceleration helps the model escape plateaus and makes it less prone to getting stuck in local minima. (Imagine the stride lengthening as the descent continues: with some probability, the heavy ball can roll past a valley bottom or coast across a flat stretch.)
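
A sketch of the momentum update under the same assumptions (the velocity buffer is extra per-parameter state the optimizer must keep):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # The velocity accumulates past gradients (beta < 1), so steps
    # accelerate along directions of consistently decreasing loss.
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity
```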

3. AdaGrad

AdaGrad was one of the first methods to successfully exploit adaptive learning rates. It scales the learning rate of each parameter by the inverse square root of the accumulated sum of squared gradients. This relatively amplifies sparse gradient directions, allowing larger updates along them. As a result, AdaGrad converges faster in settings with sparse features. (Conjecture: the randomness is large, so the effect can be unstable and limited.)
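
An illustrative sketch of the AdaGrad update (names and default values assumed for illustration):

```python
import numpy as np

def adagrad_step(w, grad, sq_sum, lr=0.01, eps=1e-8):
    # Accumulate the SUM of squared gradients over all steps...
    sq_sum = sq_sum + grad ** 2
    # ...and divide the step by its square root: rarely-updated (sparse)
    # directions keep a larger effective learning rate.
    w = w - lr * grad / (np.sqrt(sq_sum) + eps)
    return w, sq_sum
```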

4. RMSprop

RMSprop is an unpublished optimizer that has nevertheless been used heavily in recent years. The idea is similar to AdaGrad, but the rescaling of gradients is less aggressive: the sum of squared gradients is replaced by a moving average of squared gradients. RMSprop is often used together with momentum, and can be understood as an adaptation of Rprop to the mini-batch setting.
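
A comparable sketch for RMSprop; the only change from the AdaGrad sketch above is the moving average (the decay rate `rho` is an assumed conventional default):

```python
import numpy as np

def rmsprop_step(w, grad, sq_avg, lr=0.001, rho=0.9, eps=1e-8):
    # Exponential moving average of squared gradients: unlike AdaGrad's
    # ever-growing sum, old gradients are gradually forgotten.
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg
```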

5. Adam

Adam combines the ideas of AdaGrad, RMSprop, and momentum. The direction of the next step is determined by a moving average of the gradient, and the step size is capped by the global learning rate. Like RMSprop, Adam rescales each dimension of the gradient. A major difference between Adam and RMSprop (and AdaGrad) is the bias correction applied to the moment estimates m and v, which are initialized at zero. Adam is known for achieving good performance with little hyperparameter fine-tuning.
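
A sketch of the Adam update showing the bias correction described above (hyperparameter defaults follow common convention and are not from the original post):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient (m) and squared gradient (v).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: m and v start at zero, so early estimates are
    # biased toward zero; t is the 1-based step counter.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```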


6. AdamW

Loshchilov and Hutter identified the inequivalence of L2 regularization and weight decay in adaptive gradient methods, and hypothesized that this inequivalence limits Adam's performance. They therefore proposed decoupling weight decay from the gradient-based update. Experiments show that AdamW generalizes better than Adam (narrowing the gap to SGD with momentum), and that its range of well-performing hyperparameters is wider.
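
A sketch of the decoupled update; the only change from the Adam sketch above is the extra weight-shrinking line (the `wd` coefficient is an assumed example value):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the weights directly instead of
    # adding an L2 term to the gradient (the key difference from Adam).
    w = w - lr * wd * w
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```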

7. LARS

LARS is a momentum-based extension of SGD that adapts the learning rate for each layer. It has recently attracted attention in the research community, owing to the steady growth of available data and the popularity of distributed training, which push batch sizes up and can make training unstable. Some researchers (You et al.) attribute these instabilities to an imbalance between the gradient norm and the weight norm in certain layers. They therefore proposed an optimizer that rescales each layer's learning rate using a 'trust' parameter η < 1 and the ratio of the layer's weight norm to its gradient norm.
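
A sketch of just the layer-wise trust ratio (full LARS also folds in momentum and weight decay, which are omitted here; the function name and η default are illustrative):

```python
import numpy as np

def lars_local_lr(w, grad, eta=0.001, eps=1e-8):
    # Trust ratio for one layer: eta * ||w|| / ||grad||.  Layers whose
    # gradient norm is large relative to their weight norm get a smaller
    # step, which stabilizes large-batch training.
    return eta * np.linalg.norm(w) / (np.linalg.norm(grad) + eps)
```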

Third: how to choose a suitable optimizer

Choosing a suitable optimizer is very difficult, and there is no one-size-fits-all solution. We can only pick the optimizer that best fits our own specific task and problem.

  1. Find and read research papers on similar datasets and tasks: what are the SOTA results, which optimizers did they use, and why? Start with the same optimizer and observe how it performs on your own task.
  2. Summarize the characteristics of your own dataset and check whether some optimizer matches them, i.e. whether the data has properties that a particular optimizer can exploit.
  3. Consider the resources available to the project: compute constraints, memory constraints, and project timelines will also narrow the range of viable optimizers.

Example: suppose we want to train a self-supervised model (such as SimCLR) on an image dataset on a home computer. For models like SimCLR, performance improves as the batch size grows, so we want to save as much memory as possible for large-batch training. Plain stochastic gradient descent without momentum is a sensible choice of optimizer, because compared to the others it needs the least extra memory to store optimizer state.
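
A minimal illustration in PyTorch (the `encoder` module is a hypothetical stand-in for the real model; with `momentum=0.0`, `torch.optim.SGD` allocates no momentum buffers):

```python
import torch

encoder = torch.nn.Linear(2048, 128)  # stand-in for a SimCLR encoder/projection head
# Plain SGD keeps no per-parameter state (no velocity, no moment
# estimates), so almost all remaining memory can go to a larger batch.
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.1, momentum=0.0)
```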

These notes are for learning and academic purposes only; if anything infringes, please contact me to have it removed.

Source: blog.csdn.net/qq_53250079/article/details/128981653