Summary of commonly used optimizers for deep learning

1. Definition of optimizer

An optimizer is an algorithm that minimizes a deep learning model's loss by iteratively updating the model's parameters. When choosing an optimizer, factors such as the model's architecture, the amount of training data, and the objective function need to be considered.

2. Commonly used optimizers

BGD (Batch Gradient Descent)

definition

BGD is the most basic form of gradient descent. Its basic idea is to use all of the training samples in every parameter update.

formula

The formula is as follows, assuming that the total number of samples is N:

\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \frac{1}{N}\sum_{i=1}^{N}\nabla_{\theta}J(\theta^{(t)}; x^{(i)}, y^{(i)})

features

BGD converges toward an optimum of the full objective (the global optimum when the loss is convex), but each iteration requires all of the data in the training set. When the number of samples is huge, every iteration of the formula above is very time-consuming and model training is slow; on the other hand, relatively few iterations are needed.

code example

import torch

# data
inputs = ...

# labels
labels = ...

# model
model = ...

# loss function
criterion = ...

# optimizer (the learning rate 0.01 is just an example value)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# training
for i in range(epochs):
    # compute the loss on the entire dataset
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # compute gradients
    optimizer.zero_grad()
    loss.backward()

    # update parameters
    optimizer.step()
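For comparison with the PyTorch loop above, the BGD update rule can also be written out by hand. The following is a minimal from-scratch sketch on a hypothetical least-squares problem (the data, shapes, and learning rate are illustrative assumptions, not part of the original example); it makes explicit that every update averages the gradient over all N samples.

import numpy as np

# toy data: N samples with d features (illustrative values)
N, d = 1000, 5
X = np.random.randn(N, d)
y = np.random.randn(N)

theta = np.zeros(d)   # model parameters
alpha = 0.1           # learning rate

for t in range(100):
    # gradient of the least-squares loss averaged over the entire training set
    grad = X.T @ (X @ theta - y) / N
    # BGD update: theta <- theta - alpha * (1/N) * sum of per-sample gradients
    theta -= alpha * grad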

SGD (Stochastic Gradient Descent)

definition

The basic idea of SGD is to use a single, randomly selected sample to update the parameters.

formula

The formula is as follows:

\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \nabla_{\theta}J(\theta^{(t)}; x^{(i)}, y^{(i)})

where \theta^{(t)} is the parameter value of the model at the t-th iteration, \alpha is the learning rate, and \nabla_{\theta}J(\theta) is the gradient of the loss function J(\theta) with respect to the model parameters \theta.

features

The advantages of SGD are simple implementation and low per-update cost; the disadvantages are noisy, slow convergence, a tendency to get stuck in local minima, and the need for many iterations.

code example

import torch

# data
inputs = ...

# labels
labels = ...

# model
model = ...

# loss function
criterion = ...

# optimizer (the learning rate 0.01 is just an example value)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# training
for i in range(epochs):
    for input, label in zip(inputs, labels):
        # compute the loss on a single sample
        output = model(input)
        loss = criterion(output, label)

        # compute gradients
        optimizer.zero_grad()
        loss.backward()

        # update parameters
        optimizer.step()
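Note that the loop above visits the samples in their stored order. To match the definition of SGD, which uses a randomly selected sample for each update, the sample order is usually shuffled at every epoch. A minimal sketch of that variant, assuming inputs and labels are indexable sequences of equal length (same placeholders as above):

import random

for epoch in range(epochs):
    # visit the samples in a new random order each epoch
    indices = list(range(len(inputs)))
    random.shuffle(indices)

    for idx in indices:
        # compute the loss on a single randomly chosen sample
        output = model(inputs[idx])
        loss = criterion(output, labels[idx])

        # compute gradients and update the parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()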

MBGD (Mini-batch Gradient Descent)

definition

It can be seen from BGD and SGD that each has its own advantages and disadvantages, so can a compromise be made between the two? That is, can training remain relatively fast while the accuracy of the final parameters is still guaranteed? This is the original intention of MBGD: MBGD uses a batch of b samples for each parameter update.

formula

The formula is as follows, for a mini-batch of b samples starting at index k:

\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \frac{1}{b}\sum_{i=k}^{k+b-1}\nabla_{\theta}J(\theta^{(t)}; x^{(i)}, y^{(i)})

features

The training process is relatively stable; like the other gradient descent variants, it generally finds a local optimum, not necessarily the global optimum; if the loss function is convex, the solution it finds is guaranteed to be the global optimum.

code example

import torch

# data
inputs = ...

# labels
labels = ...

# model
model = ...

# loss function
criterion = ...

# optimizer (the learning rate 0.01 is just an example value)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# mini-batch size
b = ...

# training: gradients are accumulated over b samples before each parameter update
optimizer.zero_grad()
for epoch in range(epochs):
    for i, (input, label) in enumerate(zip(inputs, labels)):
        # number of samples processed so far across all epochs
        ni = i + len(inputs) * epoch

        # compute the loss on a single sample
        output = model(input)
        loss = criterion(output, label)

        # accumulate gradients (scaling the loss by 1/b would match the averaged mini-batch gradient)
        loss.backward()

        if (ni + 1) % b == 0:
            # update parameters once every b samples, then clear the accumulated gradients
            optimizer.step()
            optimizer.zero_grad()
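The loop above emulates mini-batches by accumulating per-sample gradients and stepping once every b samples. In practice, mini-batch training in PyTorch is more commonly written with a DataLoader that yields batches of size b directly; a minimal sketch, assuming inputs and labels are tensors whose first dimensions match (these names and the shuffle setting are illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset

# wrap the tensors in a dataset and let the DataLoader split it into mini-batches
dataset = TensorDataset(inputs, labels)
loader = DataLoader(dataset, batch_size=b, shuffle=True)

for epoch in range(epochs):
    for batch_inputs, batch_labels in loader:
        # compute the loss on one mini-batch of b samples
        outputs = model(batch_inputs)
        loss = criterion(outputs, batch_labels)

        # compute gradients and update the parameters once per mini-batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()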

Adam (Adaptive Moment Estimation)

definition

Adam is a stochastic-gradient-based optimizer. Its basic idea is to adjust the model parameters using exponentially weighted moving averages of the gradient and of the squared gradient (the first-order and second-order momentum).

formula

The formulas are as follows:

g^{(t)} = \nabla_{\theta}J(\theta^{(t)})

m^{(t)} = \beta_{1}m^{(t-1)} + (1-\beta_{1})g^{(t)}

v^{(t)} = \beta_{2}v^{(t-1)} + (1-\beta_{2})(g^{(t)})^{2}

\hat{m}^{(t)} = \frac{m^{(t)}}{1-\beta_{1}^{t}}, \quad \hat{v}^{(t)} = \frac{v^{(t)}}{1-\beta_{2}^{t}}

\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \frac{\hat{m}^{(t)}}{\sqrt{\hat{v}^{(t)}}+\varepsilon}

where:

g^{(t)}: the gradient of the loss with respect to the model parameters at the t-th iteration,

m^{(t)} and v^{(t)}: the first-order and second-order momentum of the model parameters at iteration t,

\beta_{1} and \beta_{2}: hyperparameters (0.9 and 0.999 by default),

\beta_{1}^{t} and \beta_{2}^{t}: \beta_{1} and \beta_{2} raised to the power t,

\hat{m}^{(t)} and \hat{v}^{(t)}: the first-order and second-order momentum after bias correction (because the initial values of m and v are 0, the first- and second-order momentum are biased toward 0 in the early stage of training and therefore need to be corrected; after correction, m and v become larger in the early stage, while late in training the difference before and after correction is small),

"""
一阶动量及二阶动量初始化
"""

# Exponential moving average of gradient values
state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)

# Exponential moving average of squared gradient values
state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)

\alpha: the learning rate,

\varepsilon: a very small constant (1e-8 by default) that prevents division by zero.
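Putting the formulas together, a single Adam step can be sketched from scratch with NumPy (this is an illustration of the equations above, not PyTorch's actual implementation; the placeholder gradient values in the usage lines are made up):

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # first-order and second-order momentum (exponential moving averages)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # bias correction: compensates for m and v being initialized to zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # parameter update with a per-parameter adaptive step size
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# usage: m and v start at zero, matching the initialization above (t starts at 1)
theta = np.zeros(3)
m, v = np.zeros_like(theta), np.zeros_like(theta)
grad = np.array([0.1, -0.2, 0.3])   # placeholder gradient
theta, m, v = adam_step(theta, grad, m, v, t=1)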

features

Adam is computationally efficient and converges quickly; its hyperparameters may need tuning; and it adaptively adjusts the learning rate of each parameter, which improves the convergence speed and generalization ability of the model.

code example

import torch

# data
inputs = ...

# labels
labels = ...

# model
model = ...

# define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.999))

# loss function
criterion = ...

# train the model
for i in range(epochs):
    # forward pass
    outputs = model(inputs)
    # compute the loss
    loss = criterion(outputs, labels)

    # compute gradients
    optimizer.zero_grad()
    loss.backward()

    # update parameters
    optimizer.step()

3. Summary

SGD

  • SGD uses only the current gradient direction for each update, so it is easily affected by noise in the data and training can be unstable; a smaller learning rate may be needed to achieve good convergence
  • The biggest disadvantage of SGD is its slow descent: it may keep oscillating between the two sides of a ravine in the loss surface and get stuck at a local optimum
  • SGD is relatively simple to implement, has low computational overhead, and may be more efficient than Adam on some datasets and models

Adam

  • Adam takes into account moving averages of the historical gradients and squared gradients during gradient descent, so it can converge faster; however, Adam can be more prone to overfitting than SGD, because relying on these historical averages may make the parameter updates overconfident
  • Adam is well suited to large-scale data and models with many parameters
  • Adam's effective learning rate does not change monotonically, which may cause it to fluctuate in the later stage of training and prevent the model from converging; this can be mitigated by constraining the update so that the effective learning rate is non-increasing (for example, the AMSGrad-style constraint \hat{v}^{(t)} = \max(\hat{v}^{(t-1)}, v^{(t)}))

