PyTorch Tutorial Introduction Series 10----Introduction to Optimizers


Foreword

1. What is an optimizer?

An optimizer is an algorithm used to train a model and minimize its loss by continuously updating the model's parameters.
When choosing an optimizer, factors such as the model's structure, the amount of training data, and the objective function need to be considered.
Optimizers are essential for deep learning models, because these models usually have a large number of trainable parameters and require a lot of data and computation to optimize. By repeatedly updating the parameters, the optimizer fits the model to the training data so that it also performs well on new data.

2. Introduction to the types of optimizers

1. SGD (Stochastic Gradient Descent)

  • Core idea

SGD is a classic optimizer used to optimize the parameters of a model. Its basic idea is to repeatedly adjust the parameters in the direction of the negative gradient so as to minimize the loss function. SGD is simple to implement and computationally cheap, but it converges slowly and can easily get stuck in local minima.

  • Mathematical expression

The model parameters are updated as follows:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \nabla_{\theta} J(\theta^{(t)})$$

where $\theta^{(t)}$ is the parameter value at iteration $t$, $\alpha$ is the learning rate, and $\nabla_{\theta} J(\theta^{(t)})$ is the gradient of the loss function $J(\theta)$ with respect to the parameters $\theta$.
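
To make the update rule concrete, here is a minimal hand-written sketch of a single-parameter SGD loop (an illustrative toy problem, not part of the tutorial's own code); torch.optim.SGD performs essentially this update for every parameter of the model, plus optional extras such as momentum and weight decay.

import torch

# Toy problem: find w that minimizes J(w) = (w * x - y)^2
w = torch.tensor([0.0], requires_grad=True)   # model parameter theta
x, y = torch.tensor([2.0]), torch.tensor([6.0])
lr = 0.1                                      # learning rate alpha

for t in range(20):
    loss = ((w * x - y) ** 2).mean()          # loss J(theta)
    loss.backward()                           # gradient of J with respect to w
    with torch.no_grad():
        w -= lr * w.grad                      # theta <- theta - alpha * gradient
    w.grad.zero_()                            # clear the gradient for the next step

print(w)                                      # converges toward 3.0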

  • Practical use

In PyTorch, SGD is available as the torch.optim.SGD class:

import torch

# Define the model
model = ...

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Train the model
for inputs, labels in dataset:
    # Compute the loss
    outputs = model(inputs)
    loss = ...

    # Clear old gradients, then backpropagate to compute new ones
    optimizer.zero_grad()
    loss.backward()

    # Update the parameters
    optimizer.step()

The model is defined first, then the SGD optimizer with a learning rate of 0.1. The training loop then iterates over the dataset, computes the loss, backpropagates to obtain the gradients, and updates the model parameters. This is the basic pattern for training a model with SGD in PyTorch.
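
For readers who want a version of the loop above that runs as-is, the following self-contained sketch fills in the placeholders with a tiny linear model, random data, and an MSE loss; these choices are illustrative assumptions, not part of the original example.

import torch
from torch import nn

# Illustrative data: y = 3x plus a little noise
x = torch.randn(64, 1)
y = 3 * x + 0.1 * torch.randn(64, 1)

# Define the model and the loss function
model = nn.Linear(1, 1)
criterion = nn.MSELoss()

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Train the model
for epoch in range(100):
    outputs = model(x)                 # forward pass
    loss = criterion(outputs, y)       # compute the loss

    optimizer.zero_grad()              # clear old gradients
    loss.backward()                    # backpropagate
    optimizer.step()                   # update the parameters

print(model.weight.item())             # should end up close to 3.0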

2. Adam

  • Core idea

Adam is an extension of stochastic gradient descent used to optimize the parameters of a model. Its basic idea is to maintain exponentially decaying estimates of the first moment of the gradient and of the second moment (the squared gradient) and use them to adapt each parameter's update. Adam is computationally efficient and converges quickly, but its hyperparameters need to be tuned.

  • Mathematical expression

The model parameters are updated as follows:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

where $m_t$ and $v_t$ are the first-moment and second-moment estimates of the gradient, $g_t$ is the gradient at iteration $t$, and $\beta_1$ and $\beta_2$ are hyperparameters.

$$\theta^{(t+1)} = \theta^{(t)} - \frac{\alpha}{\sqrt{v_t} + \epsilon} m_t$$

where $\theta^{(t)}$ is the parameter value at iteration $t$, $\alpha$ is the learning rate, $m_t$ and $v_t$ are the moment estimates defined above, and $\epsilon$ is a small constant that prevents division by zero.
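
Purely as a sanity check on these formulas (not how you would use Adam in practice), the update can be written out by hand for a single parameter tensor. Note that the full Adam algorithm also applies a bias correction to $m_t$ and $v_t$, which the simplified formulas above omit; the toy quadratic loss and the step count below are assumptions of this sketch.

import torch

# Assumed hyperparameters; torch.optim.Adam's defaults are lr=1e-3, betas=(0.9, 0.999), eps=1e-8
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

theta = torch.zeros(3)            # parameters
m = torch.zeros_like(theta)       # first-moment estimate m_t
v = torch.zeros_like(theta)       # second-moment estimate v_t

for t in range(100):
    g = 2 * (theta - 1.0)                            # gradient of J(theta) = ||theta - 1||^2
    m = beta1 * m + (1 - beta1) * g                  # update m_t
    v = beta2 * v + (1 - beta2) * g ** 2             # update v_t
    theta = theta - alpha / (v.sqrt() + eps) * m     # parameter update

print(theta)                      # moves from 0 toward the minimizer at 1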

  • Practical use

In PyTorch, Adam is available as the torch.optim.Adam class:

import torch

# Define the model
model = ...

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.999))

# Train the model
for inputs, labels in dataset:
    # Compute the loss
    outputs = model(inputs)
    loss = ...

    # Clear old gradients, then backpropagate to compute new ones
    optimizer.zero_grad()
    loss.backward()

    # Update the parameters
    optimizer.step()

In the above code, the model is defined first, then the Adam optimizer, with a learning rate of 0.1 and with $\beta_1$ and $\beta_2$ set to 0.9 and 0.999 respectively via the betas argument. The training loop then iterates over the dataset, computes the loss and the gradients, and updates the model parameters. This is how Adam is used to train a model in PyTorch.
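
If you are curious about the moment estimates Adam maintains, they live in the optimizer's state after the first call to step(). The snippet below is only an illustrative peek at that state using a throwaway linear model; in current PyTorch versions the buffers are named exp_avg and exp_avg_sq, corresponding to $m_t$ and $v_t$.

import torch
from torch import nn

model = nn.Linear(2, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.999))

# One training step on dummy data
loss = model(torch.randn(4, 2)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Inspect the per-parameter state kept by Adam
for p in model.parameters():
    state = optimizer.state[p]
    print(state["step"], state["exp_avg"], state["exp_avg_sq"])   # t, m_t, v_t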

3. RMSprop (Root Mean Square Propagation)

  • Core idea

RMSprop is a modified stochastic gradient descent optimizer used to optimize the parameters of a model. Its basic idea is to maintain an exponentially weighted moving average of the squared gradients and use it to scale each parameter's update. RMSprop converges quickly, but each step is somewhat more expensive and its hyperparameters need to be tuned.

  • Mathematical expression

Specifically, the RMSprop update rules are as follows:

$$v_{t+1} = \alpha v_t + (1 - \alpha) g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_{t+1}} + \epsilon} g_t$$

where $v_t$ is the exponentially weighted moving average of the squared gradients, $g_t$ is the gradient at iteration $t$, $\theta_t$ is the parameter value at iteration $t$, $\alpha$ is the exponential decay rate, $\eta$ is the learning rate, and $\epsilon$ is a small constant that prevents division by zero.
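
As with the Adam sketch above, the RMSprop update can be written out by hand purely for illustration; the decay rate of 0.99 mirrors the default alpha of torch.optim.RMSprop, and the toy quadratic loss is an assumption of this sketch.

import torch

eta, alpha, eps = 0.01, 0.99, 1e-8      # learning rate, decay rate, epsilon

theta = torch.zeros(3)                  # parameters
v = torch.zeros_like(theta)             # running average of squared gradients

for t in range(200):
    g = 2 * (theta - 1.0)                           # gradient of J(theta) = ||theta - 1||^2
    v = alpha * v + (1 - alpha) * g ** 2            # update the squared-gradient average
    theta = theta - eta / (v.sqrt() + eps) * g      # parameter update

print(theta)                            # moves from 0 toward the minimizer at 1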

  • Practical use

In PyTorch, RMSprop is available as the torch.optim.RMSprop class:

import torch

# Define the model (MyModel is a placeholder for your own nn.Module)
model = MyModel()

# Move the model to the GPU if one is available
if torch.cuda.is_available():
    model = model.cuda()

# Switch to training mode
model.train()

# Define the RMSprop optimizer
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)

# Training loop
for input, target in dataset:
    # Move input and target to the GPU if available
    if torch.cuda.is_available():
        input = input.cuda()
        target = target.cuda()

    # Forward pass: compute the predicted output from the input
    output = model(input)

    # Compute the loss
    loss = loss_fn(output, target)

    # Clear the gradients of all optimized variables
    optimizer.zero_grad()

    # Backward pass: compute the gradient of the loss with respect to the model parameters
    loss.backward()

    # Perform a single optimization step (parameter update)
    optimizer.step()

In the above code, the model is defined first and switched to training mode. Then the RMSprop optimizer is defined with the model parameters to be optimized and a learning rate of 0.01; the decay rate $\alpha$ is not passed explicitly, so it keeps PyTorch's default value of 0.99. The training loop then iterates over the dataset, computes the loss and the gradients, and updates the model parameters. This is how RMSprop is used to train a model in PyTorch.
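
If you want to control the decay rate $\alpha$ from the formula above, torch.optim.RMSprop exposes it as the alpha argument, alongside eps, momentum, and weight_decay. The values below are illustrative rather than recommendations, and model refers to the model defined in the example above.

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.01,        # learning rate (eta)
    alpha=0.9,      # decay rate of the squared-gradient average (default is 0.99)
    eps=1e-8,       # small constant for numerical stability
    momentum=0.9,   # optional classical momentum on top of RMSprop
)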


Summary

In addition to the three optimizers described above, PyTorch provides many others, such as Adadelta, Adagrad, AdamW, and SparseAdam. To use an optimizer, define the model and switch it to training mode, then create the optimizer, passing it the model parameters to optimize and a learning rate. In the training loop, compute the model's loss at each iteration and let the optimizer update the parameters. Which optimizer is appropriate depends on the model and the data at hand, and its hyperparameters usually need some tuning to obtain good results.
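
As a quick reference, the other optimizers mentioned here are created the same way as the three above (model again stands for the model being trained); the learning rates shown are the usual defaults and are only starting points.

optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# SparseAdam is intended for parameters that receive sparse gradients,
# e.g. a large nn.Embedding created with sparse=True
optimizer = torch.optim.SparseAdam(model.parameters(), lr=1e-3)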
