Summary of commonly used PyTorch optimizers

1. The principle of gradient descent

        The main purpose of gradient descent is to find, or converge to, the minimum of an objective function through iteration. The classic analogy is walking down a mountain: each move covers one step length, and at every step you head in the steepest downhill direction (ignoring the risk of falling off a cliff...). Assuming the mountain is roughly smooth and has no intermediate valleys, this procedure eventually traces the shortest and fastest path to the bottom, which is exactly the path along which the gradient is steepest.

                                                                           Formula: x1 = x0 - λ∇J(x0)        (1)

    J is a function of x and our current position is x0; we want to move from this point toward the minimum of J. First determine the direction of travel, which is the opposite of the gradient, then take a step of size λ to reach the point x1. This is the intuitive picture behind gradient descent.
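As a minimal sketch of Formula 1 in plain Python (the toy objective J(x) = x² and all names below are illustrative assumptions, not part of the original text):

def grad_J(x):
    return 2 * x              # derivative of the toy objective J(x) = x**2

x = 10.0                      # starting point x0
lr = 0.1                      # step size (learning rate) lambda
for _ in range(100):
    x = x - lr * grad_J(x)    # Formula 1: move against the gradient
print(x)                      # ends close to the minimum at x = 0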

2. Gradient descent in deep learning

2.1 SGD

        Deep learning generally uses mini-batch gradient descent. In practice the name SGD usually refers to this mini-batch variant: nobody optimizes over the entire data set, or over a single sample, at every step. The core update is still Formula 1; a minimal sketch of one such step follows the list below.

Pros and cons:

(1) The learning rate and update strategy are hard to choose, although with mini-batches the method can converge quickly.

(2) The learning rate is not adaptive: it treats every dimension of the parameters the same way.

(3) It struggles with local minima and saddle points, and cannot guarantee escaping a local optimum.

(4) The random sampling of mini-batches introduces gradient noise, so individual weight updates may not point in the right direction.
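To make the mini-batch update concrete, here is a small hand-rolled sketch of SGD in PyTorch without torch.optim; the toy linear model, the batch size, and the learning rate are choices made for illustration only.

import torch

# Hand-rolled mini-batch SGD on a toy linear model y = 3x + 1.
torch.manual_seed(0)
w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.05

for _ in range(200):
    x_batch = torch.randn(32, 1)        # draw one mini-batch of inputs
    y_batch = 3 * x_batch + 1           # targets from the underlying function
    loss = ((x_batch * w + b - y_batch) ** 2).mean()
    loss.backward()                     # gradient averaged over the mini-batch
    with torch.no_grad():               # Formula 1: parameter -= lr * gradient
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()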

2.2 Momentum method

        In the momentum method, each step combines an accumulation of the previous descent directions with the gradient at the current point, which reduces the oscillation seen with the method of Section 2.1 and makes the trajectory smoother. The formulas are as follows:

                                                                                         vt = β·vt-1 + λ∇J(x0)                                        (2)

                                                                                         x1 = x0 - vt                                                  (3)

The momentum term shown in the figure is the β·vt-1 term of Formula 2, the gradient at point B is the λ∇J(x0) term of Formula 2, and the actual update direction is given by the whole of Formula 2.

                                          [Figure: the momentum term, the gradient at point B, and the combined update direction]

  The comparison chart below shows that the oscillation is clearly smaller when momentum is used.

                                          [Figure: comparison of update trajectories with and without momentum]
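A minimal sketch of Formulas 2 and 3 in plain Python (the toy objective J(x) = x² and the coefficient values are assumptions chosen for illustration):

def grad_J(x):
    return 2 * x                      # gradient of the toy objective J(x) = x**2

x, v = 10.0, 0.0                      # starting point and zero initial velocity
beta, lr = 0.9, 0.1
for _ in range(100):
    v = beta * v + lr * grad_J(x)     # Formula 2: accumulate past descent directions
    x = x - v                         # Formula 3: step along the smoothed direction
print(x)                              # approaches the minimum at x = 0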

 

2.3 NAG algorithm

        NAG (Nesterov Accelerated Gradient) refines the momentum method: the gradient is evaluated at the look-ahead point x0 - β·vt-1 rather than at x0, so the accumulated direction is corrected before the step is taken. In PyTorch it is enabled through the nesterov flag of torch.optim.SGD.

The three algorithms above are all covered by the same PyTorch optimizer, shown below:

 

class torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
Implements stochastic gradient descent (optionally with momentum).
Parameters:

params (iterable) – iterable of parameters to optimize, or dicts defining parameter groups
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
Example:

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
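For the NAG variant of Section 2.3 the same class is used; a brief usage sketch with the nesterov flag (PyTorch requires momentum > 0 and dampening = 0 when nesterov=True; model, loss_fn, input, and target are assumed to exist as in the example above):

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()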

To be continued...
