Quick start with PyTorch (8) ----- Introduction to the PyTorch optimizer


There are five steps in deep learning: data ——> model ——> loss function ——> optimizer ——> iterative training. Through forward propagation we obtain the difference between the model's output and the ground-truth labels, i.e., the loss. Once the loss is computed, backpropagation yields the gradients of the parameters, and the optimizer then uses these gradients to update the parameters.
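As a minimal sketch of these five steps in code (the linear model, random data, and hyperparameters below are placeholders chosen purely for illustration):

import torch
import torch.nn as nn
import torch.optim as optim

# 1. data (toy regression samples)
x = torch.randn(64, 3)
y = torch.randn(64, 1)

# 2. model
model = nn.Linear(3, 1)

# 3. loss function
criterion = nn.MSELoss()

# 4. optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# 5. iterative training
for epoch in range(10):
    out = model(x)              # forward propagation
    loss = criterion(out, y)    # difference between output and labels
    optimizer.zero_grad()       # clear the old gradients
    loss.backward()             # backpropagation -> gradients of the parameters
    optimizer.step()            # the optimizer updates the parameters from the gradients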

1. Introduction

The job of a PyTorch optimizer: update the model parameters.

Gradient descent is generally used when updating parameters. Some basic concepts related to gradient descent:

  • Derivative: the rate of change of a function along a specified coordinate axis;
  • Directional derivative: the rate of change in a specified direction;
  • Gradient: a vector whose direction is the direction in which the directional derivative attains its maximum value.

So the gradient is a vector whose direction is the direction in which the directional derivative attains its maximum value, i.e., the direction of fastest growth; gradient descent moves along the negative direction of the gradient.
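A tiny hand-rolled example of this idea, minimizing y = 4x^2 (the same toy function used in the momentum demo later in this post; the starting point, step count, and learning rate here are arbitrary):

x, lr = 2.0, 0.05
for i in range(20):
    grad = 8 * x          # dy/dx of y = 4x^2
    x = x - lr * grad     # move along the negative direction of the gradient
print(x)                  # x approaches 0, the minimum of y = 4x^2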

2. Optimizer

class Optimizer:
    defaults: dict
    state: dict
    param_groups: List[dict]
 
    def __init__(self, params: _params_t, defaults: dict) -> None: ...
    def __setstate__(self, state: dict) -> None: ...
    def state_dict(self) -> dict: ...
    def load_state_dict(self, state_dict: dict) -> None: ...
    def zero_grad(self, set_to_none: Optional[bool]=...) -> None: ...
    def step(self, closure: Optional[Callable[[], float]]=...) -> Optional[float]: ...
    def add_param_group(self, param_group: dict) -> None: ...

Attributes


  • defaults: the optimizer's hyperparameters, mainly storing values such as the learning rate, momentum, and so on.
  • state: caches associated with the parameters. For example, when momentum is used, the gradients from previous steps are needed, and they are stored here.
  • param_groups: manages the parameter groups. It is a list, and each element of the list is a dict. The dict contains a 'params' key whose value holds the actual parameters (see the sketch below).
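A small sketch of what these attributes look like on a freshly created SGD optimizer (the weight tensor is just a stand-in):

import torch
import torch.optim as optim

weight = torch.randn((2, 2), requires_grad=True)
optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)

print(optimizer.defaults)      # e.g. {'lr': 0.1, 'momentum': 0.9, ...}; exact keys depend on the PyTorch version
print(optimizer.state)         # empty until step() has been called
print(optimizer.param_groups)  # a list with one dict whose 'params' entry holds the weight tensor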

Methods

1. zero_grad()

Clears (zeroes) the gradients of all managed parameters.

Each managed parameter is a tensor, and each tensor carries its gradient in its grad attribute.

PyTorch has a characteristic worth remembering: tensor gradients are not cleared automatically. Every time autograd computes gradients during backpropagation, they are accumulated into grad.

So the gradients should be cleared before the next gradient computation (i.e., before calling backward()), as the example below shows.
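A minimal sketch showing the accumulation behaviour and how zero_grad() resets it (the tensor values are arbitrary):

import torch
import torch.optim as optim

w = torch.tensor([1.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

for i in range(3):
    loss = (w * 2).sum()
    loss.backward()
    print(w.grad)        # 2, 4, 6 ... the gradients keep accumulating

optimizer.zero_grad()
print(w.grad)            # cleared (None or zeros, depending on set_to_none and the PyTorch version)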

2. step()

step() executes the update strategy of the optimizer currently in use to update the parameters. There are many concrete strategies, such as stochastic gradient descent, momentum-based methods, adaptive learning rate methods, and so on, which are introduced in detail later.
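For plain SGD without momentum, one call to step() simply performs w ← w − lr · w.grad. A quick sketch to verify this (the gradient is set by hand instead of calling backward(), just for illustration):

import torch
import torch.optim as optim

w = torch.tensor([1.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

w.grad = torch.tensor([2.0])   # pretend backward() produced this gradient
optimizer.step()
print(w)                       # 1.0 - 0.1 * 2.0 = 0.8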

3. add_param_group()

Adds a set of parameters to the optimizer.

The optimizer can manage many parameters, and these parameters can be grouped; different groups can have different hyperparameter settings. For example, when fine-tuning a model, we usually want the feature-extraction layers at the front to use a smaller learning rate and update more slowly, while the fully connected layers we define at the end use a larger learning rate. In that case the whole model can be split into two groups: one group for the parameters of the feature extractor, and another for the parameters of the fully connected layers, as sketched below.
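A minimal sketch of such a two-group setup (the backbone/head split and the learning rates below are made up for illustration):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(16, 8), nn.ReLU(),   # "feature extraction" part
    nn.Linear(8, 2)                # "fully connected" head
)
features, head = model[0], model[2]

optimizer = optim.SGD(features.parameters(), lr=1e-4, momentum=0.9)
optimizer.add_param_group({'params': head.parameters(), 'lr': 1e-2})

for group in optimizer.param_groups:
    print(group['lr'])             # 0.0001 for the feature extractor, 0.01 for the head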

4. state_dict()

Gets a dictionary containing the optimizer's current state information.

import os
import torch
import torch.optim as optim

BASE_DIR = "."                                   # directory used to save the state dict
weight = torch.randn((2, 2), requires_grad=True)

optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
opt_state_dict = optimizer.state_dict()
 
print("state_dict before step:\n", opt_state_dict)
 
for i in range(10):
    weight.grad = torch.ones_like(weight)        # fake a gradient so that step() has something to apply
    optimizer.step()
 
print("state_dict after step:\n", optimizer.state_dict())
# save the optimizer's state after 10 updates
torch.save(optimizer.state_dict(), os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))

5. load_state_dict()

Loads a state information dictionary (for example, to resume an interrupted training run).

# continuing from the example above: weight and BASE_DIR are already defined
# and optimizer_state_dict.pkl has been saved
optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
state_dict = torch.load(os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))
 
print("state_dict before load state:\n", optimizer.state_dict())
optimizer.load_state_dict(state_dict)
print("state_dict after load state:\n", optimizer.state_dict())

Learning rate

During gradient descent, the learning rate controls the step size of the parameter updates.
Without a suitable learning rate, the loss can grow larger and larger as the number of iterations increases, which means the step taken during each parameter update is too large and the optimum is skipped over. A parameter is therefore needed to control this step size, and that parameter is the learning rate.
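A quick numeric sketch of this on y = 4x^2 (dy/dx = 8x): with lr = 0.1 the iterates shrink toward the minimum, while with lr = 0.3 the factor |1 − 8·lr| is larger than 1 and the iterates blow up (the toy function and values are chosen only for illustration):

for lr in (0.1, 0.3):
    x = 2.0
    for i in range(5):
        x = x - lr * 8 * x   # one gradient descent step on y = 4x^2
    print(lr, x)             # lr=0.1: x -> 0; lr=0.3: |x| keeps growing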

Momentum

Momentum (impulse): combines the current gradient with the information from the previous update and uses both for the current update.
So when momentum is taken into account, you can reach the foot of the mountain faster; in other words, the parameters are updated faster. How exactly is momentum used in the parameter update?

Let's first look at the concept of the exponentially weighted average. The exponentially weighted average is often used for time series as a way of computing an average: when computing the average at the current moment, values closer to the current moment matter more and therefore receive larger weights. The weights decay exponentially as the time interval grows, which is why it is also called the exponential moving average. The formula is as follows:
v_t = β·v_(t−1) + (1 − β)·θ_t
v_t is the average value at the current moment, and it consists of two terms:

  • one term is the parameter value θ_t at the current moment, with weight 1 − β, where β is a hyperparameter;
  • the other term is the average value v_(t−1) at the previous moment, with weight β.

Suppose we are given a series of daily temperature data and want the exponentially weighted average temperature at day 100. Expanding the recurrence gives:

v_100 = (1 − β)·θ_100 + (1 − β)·β·θ_99 + (1 − β)·β^2·θ_98 + … + (1 − β)·β^99·θ_1

It can be seen that the smaller β is, the shorter the stretch of past days it effectively pays attention to. With β = 0.8, the weights about 20 days back are essentially 0, so the result is roughly the average temperature of the past 20 days; with β = 0.98 the window reaches much further back, roughly the past 50 days. So β controls the length of the memory period, i.e., over how many past days the data is effectively averaged. β is usually set to 0.9; since 1/(1 − β) = 10, this corresponds to paying attention to roughly the past 10 days of temperatures. The figure below shows the temperature curves under different values of β:
(figure: exponentially weighted averages of the temperature data under different β)

  • the red curve is β = 0.9, roughly the average temperature of the past 10 days;
  • the green curve is β = 0.98, roughly the average temperature of the past 50 days;
  • the yellow curve is β = 0.5, roughly the average temperature of the past 2 days.
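A small sketch of the exponentially weighted average itself (the temperature list is invented purely for illustration):

def exp_weighted_average(values, beta):
    v = 0.0
    averaged = []
    for theta in values:
        v = beta * v + (1 - beta) * theta   # v_t = beta * v_(t-1) + (1 - beta) * theta_t
        averaged.append(v)
    return averaged

temps = [20, 21, 23, 22, 25, 27, 26, 28, 30, 29]    # made-up daily temperatures
print(exp_weighted_average(temps, beta=0.9)[-1])    # long memory: reacts slowly to recent days
print(exp_weighted_average(temps, beta=0.5)[-1])    # short memory: tracks the most recent days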

After understanding the exponentially weighted average, let's look at gradient descent with momentum added. The basic idea is to compute an exponentially weighted average of the gradients and use that to update the weights. In PyTorch it is implemented as:

  • Ordinary gradient descent:
    w_(t+1) = w_t − lr · g(w_t)
  • Gradient descent with momentum:
    v_t = m · v_(t−1) + g(w_t)
    w_(t+1) = w_t − lr · v_t
    where w_(t+1) is the updated parameter, lr is the learning rate, v_t is the update amount, m is the momentum coefficient, and g(w_t) is the gradient of w_t. Expanding v_t shows that the current update amount takes into account not only the current gradient but also the gradients of the previous steps, with smaller and smaller weights the further back they are. Let's look at the role of momentum through the code below:
import torch
import torch.optim as optim
import matplotlib.pyplot as plt


def func(x):
    return torch.pow(2*x, 2)    # y = (2x)^2 = 4*x^2        dy/dx = 8x


iteration = 100
m = 0.0     # momentum used for lr=0.01; try 0.9 or 0.63 to see the effect of momentum

lr_list = [0.01, 0.03]

momentum_list = list()
loss_rec = [[] for l in range(len(lr_list))]

for i, lr in enumerate(lr_list):
    x = torch.tensor([2.], requires_grad=True)

    # lr=0.03 always runs without momentum as a baseline; lr=0.01 uses momentum m
    momentum = 0. if lr == 0.03 else m
    momentum_list.append(momentum)

    optimizer = optim.SGD([x], lr=lr, momentum=momentum)

    for it in range(iteration):

        y = func(x)
        y.backward()

        optimizer.step()
        optimizer.zero_grad()

        loss_rec[i].append(y.item())

for i, loss_r in enumerate(loss_rec):
    plt.plot(range(len(loss_r)), loss_r, label="LR: {} M:{}".format(lr_list[i], momentum_list[i]))
plt.legend()
plt.xlabel('Iterations')
plt.ylabel('Loss value')
plt.show()

(figures: loss curves produced by the code above for the two learning rate / momentum settings)

3. Introduction to common optimizers

The optimizers in pytorch can be roughly divided into two categories:

  • One class is SGD and its variants;
  • the other class is per-parameter adaptive learning rate methods, such as AdaGrad, RMSProp, Adam, and so on.

1. BGD(Batch Gradient Descent)

Gradient update rule:
BGD uses the data of the entire training set to compute the gradient of the cost function with respect to the parameters.

Disadvantages:
Since each update computes the gradient over the entire data set, training is slow; if the training set is large, a lot of memory is consumed, and full-batch gradient descent cannot update the model parameters online.

2. Stochastic Gradient Descent(SGD)

SGD performs one update per sample. If the sample size is large, it may be possible to update the parameters to a near-optimal solution using only part of the samples, whereas in BGD a single iteration requires all the data, a single iteration cannot reach the optimum, and 10 iterations require 10 full passes over the training set.

Disadvantages:
1. If the samples contain a lot of noise, each SGD iteration does not necessarily move in the direction of the overall optimum;
2. Because SGD updates frequently, the cost function oscillates severely;
3. It may converge to a local optimum, but it may also skip over the optimum because of the oscillation.

3. Mini-Batch Gradient Descent(MBGD)

Gradient update rule:
MBGD uses a small batch of samples, i.e., n samples, for each computation, which reduces the variance of the parameter updates and makes convergence more stable; on the other hand, matrix operations can be used to compute gradients more efficiently.

Disadvantages:
1. MBGD does not guarantee good convergence. If the learning rate is chosen too small, convergence is slow; if it is too large, the cost function oscillates around the minimum (one remedy is to start with a larger learning rate and shrink it once some threshold is reached, but the threshold has to be set in advance);
2. The same learning rate is applied to all parameters during an update. If the data is sparse, we would rather apply larger updates to the low-frequency features.

Note: the SGD optimization algorithm referred to in deep learning usually means mini-batch SGD (MBGD).

torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
  • params: the parameters (or parameter groups) to be managed
  • lr: initial learning rate
  • momentum: momentum coefficient β
  • weight_decay: L2 regularization coefficient
  • nesterov: whether to use NAG
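A short usage sketch combining these arguments (the model and the hyperparameter values are placeholders):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(),
                      lr=0.01,
                      momentum=0.9,       # momentum coefficient
                      weight_decay=1e-4,  # L2 regularization
                      nesterov=True)      # use Nesterov accelerated gradient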

4. SGD + Momentum (momentum gradient descent)

It links up the previous gradients so that individual gradients are no longer independent: the update direction of each parameter depends not only on the gradient at the current position but is also influenced by the direction of the previous parameter update.
v_t = m·v_(t−1) + g(w_t)
w_(t+1) = w_t − lr·v_t
Advantages:
The descent is steered by past gradient information: if the current gradient points in the same direction as the previous gradients, convergence is reinforced; otherwise it is damped. In other words, it speeds up convergence while reducing oscillation.

Disadvantages:
The accumulated momentum may become too large on the way down, so the minimum point may be overshot.

In addition, SGD with momentum is already implemented in PyTorch via the momentum argument of optim.SGD.

5. Nesterov accelerated gradient(NAG)

NAG (Nesterov accelerated gradient) differs from momentum gradient descent in that the momentum is updated using a look-ahead gradient: the predicted gradient ∇θJ(θ − γ·m) at the position the momentum is about to carry us to is taken into account.
The parameter update formula is:
m_t = γ·m_(t−1) + η·∇θJ(θ − γ·m_(t−1))
θ = θ − m_t
The difference from ordinary momentum is illustrated in the figure below.
(figure: update directions of Momentum vs. NAG)
In PyTorch, Nesterov momentum is enabled through the parameter nesterov=True.
Advantages:
1. Compared with momentum gradient descent, NAG converges faster because it takes the predicted future gradient into account (as shown in the figure above).
2. When the update magnitude is large, NAG can suppress oscillation. For example, suppose the starting point is to the left of the optimum and the point corresponding to γm lies to its right. Plain momentum then superimposes η∇1 on top of γm, so the iterate ends up even further to the right of the optimum. NAG instead first jumps to the point corresponding to γm, finds that the gradient there is positive, and then superimposes η∇2 in the opposite direction, thereby suppressing the oscillation.
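A minimal usage sketch of nesterov=True, reusing the toy function y = 4x^2 from the momentum demo (the hyperparameters are illustrative):

import torch
import torch.optim as optim

x = torch.tensor([2.], requires_grad=True)
optimizer = optim.SGD([x], lr=0.01, momentum=0.9, nesterov=True)  # Nesterov momentum

for i in range(100):
    y = torch.pow(2 * x, 2)   # y = 4x^2
    optimizer.zero_grad()
    y.backward()
    optimizer.step()
print(x)                      # close to 0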

6. AdaGrad (Adaptive Gradient)

AdaGrad dynamically adjusts the learning rate during training: based on the accumulated sum of squared gradients, it uses a different learning rate for each parameter.
Parameter update formula:
s_t = s_(t−1) + g_t ⊙ g_t
θ_(t+1) = θ_t − (η / (√s_t + ε)) ⊙ g_t
where ⊙ is the element-wise product, which here amounts to squaring the gradient, and ε is a very small term, usually 10^(-6), that prevents division by zero and keeps the computation numerically stable.
Because s is the accumulated sum of squared gradients:

1. For parameters whose gradients have been consistently large, the learning rate decays faster, i.e., high-frequency features use a smaller learning rate.
2. For parameters whose gradients change little, the learning rate decays slowly, i.e., low-frequency features use a larger learning rate.
3. Because of the accumulation, the learning rate keeps decaying, which also matches the intuition that a smaller learning rate is needed when the iterates approach an extreme point in the later stages.

Advantages: each parameter has its own learning rate.

Disadvantages: because the learning rate decays continuously, decaying too fast early in the iterative process may leave too little convergence power later on, so AdaGrad may fail to obtain a satisfactory result.

PyTorch implementation:

torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0)
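To make the update rule concrete, here is a hand-rolled sketch of a single-parameter AdaGrad step in plain Python (the gradient sequence is invented; torch.optim.Adagrad does all of this internally):

eps, lr = 1e-6, 0.1
s = 0.0                               # accumulated sum of squared gradients
theta = 2.0

for g in [1.6, 1.2, 0.8, 0.5]:        # made-up gradient sequence
    s += g * g                        # s_t = s_(t-1) + g_t^2
    theta -= lr / (s ** 0.5 + eps) * g
    print(theta, lr / (s ** 0.5 + eps))   # the effective step size keeps shrinking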

7. RMSProp(Root Mean Square Propagation)

To address AdaGrad's overly fast learning-rate decay, RMSProp improves on AdaGrad by replacing the accumulated sum of squared gradients with an exponentially weighted moving average (accumulating local gradient information), so that gradients far away from the current point contribute little.
Iterative update formula:
s_t = β·s_(t−1) + (1 − β)·g_t ⊙ g_t
θ_(t+1) = θ_t − (η / (√s_t + ε)) ⊙ g_t
where β is RMSProp's decay factor, s is the exponentially weighted moving average of the squared gradients with an initial value of 0, and ⊙ is the element-wise product.

Advantages: on top of AdaGrad, a decay factor is added that weighs past against current gradient information when the learning rate is updated, reducing the sharp drop in the learning rate caused by continual gradient accumulation and preventing learning from ending prematurely.

Disadvantages: The hyperparameter β is introduced, which increases the complexity of the model. It also depends on the global learning rate η.
Implementation in pytorch:

torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
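A brief usage sketch; note that the decay factor β from the formula above corresponds to the alpha argument of torch.optim.RMSprop (the model and values are placeholders):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.RMSprop(model.parameters(),
                          lr=0.01,
                          alpha=0.99,   # the decay factor beta in the formula above
                          eps=1e-08)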

8. AdaDelta (Adaptive Delta)

AdaDelta is another refinement of AdaGrad. Compared with RMSProp, it additionally replaces the global learning rate η with an exponentially weighted moving average of the squared parameter changes Δθ. The idea is to approximate the second-order Newton method with a first-order method.
s_(g,t) = β·s_(g,t−1) + (1 − β)·g_t ⊙ g_t
Δθ_t = − (√(s_(Δθ,t−1) + ε) / √(s_(g,t) + ε)) ⊙ g_t
θ_(t+1) = θ_t + Δθ_t
s_(Δθ,t) = β·s_(Δθ,t−1) + (1 − β)·Δθ_t ⊙ Δθ_t
s_g is the exponentially weighted moving average of the squared gradients, and s_Δθ is the exponentially weighted moving average of the squared changes of the parameter θ; both are initialized to 0. ε is a constant for numerical stability, usually set to 10^(-6).
In the AdaDelta update, the numerator can be regarded as a momentum-like acceleration term that accumulates the previous parameter changes through exponential weighting, while the denominator term is the same as in RMSProp, so RMSProp can also be regarded as a special case of AdaDelta.
Advantages:
No need to manually set the learning rate.
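PyTorch also provides torch.optim.Adadelta; a minimal construction sketch, assuming the standard signature where rho plays the role of β above and lr only rescales the computed update:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.Adadelta(model.parameters(), rho=0.9, eps=1e-06)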

9. Adam (Adaptive Moment Estimation)

Adam combines the ideas of RMSProp and Momentum to achieve the effect of learning rate adaptation and momentum accelerated convergence.
The parameter update formula is:
m_t = γ·m_(t−1) + (1 − γ)·g_t
s_t = β·s_(t−1) + (1 − β)·g_t ⊙ g_t
m̂_t = m_t / (1 − γ^t)
ŝ_t = s_t / (1 − β^t)
θ_(t+1) = θ_t − (η / (√ŝ_t + ε)) ⊙ m̂_t
The third and fourth lines are the bias-correction terms for m and s; they make the sum of the past gradient weights equal to 1 and prevent the values from being too small early on. The hyperparameters are usually set to β = 0.999, γ = 0.9, ε = 10^(-8).

torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
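A final end-to-end sketch using Adam (the model, data, and hyperparameters are placeholders for illustration):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

x = torch.randn(32, 3)
y = torch.randn(32, 1)

for epoch in range(100):
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # adaptive learning rate + momentum in a single update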

Reference link: https://blog.csdn.net/Dear_learner/article/details/123219459


Origin: blog.csdn.net/All_In_gzx_cc/article/details/128001991