Optimizer: usage analysis of optimizer.param_groups
Date: July 25, 2022
PyTorch version: 1.11.0
Exploring param_groups
optimizer.param_groups: a list whose elements are dictionaries, one per parameter group.
optimizer.param_groups[0]: a dictionary with 7 keys in this PyTorch version: 'params', 'lr', 'betas', 'eps', 'weight_decay', 'amsgrad', 'maximize'.
The examples below use an Adam optimizer stored in a variable named optimizer:
>>> optimizer.param_groups[0].keys()
dict_keys(['params', 'lr', 'betas', 'eps', 'weight_decay', 'amsgrad', 'maximize'])
If you assign different learning rates to different sets of parameters, the list contains more than one element: one dictionary per group.
- params: a list that stores the parameter tensors.
>>> len(optimizer.param_groups[0]['params'])
48
>>> optimizer.param_groups[0]['params'][0]
Parameter containing:
tensor([[ 0.0212, -0.1151,  0.0499,  ..., -0.0807, -0.0572,  0.1166],
        [-0.0356, -0.0397, -0.0980,  ...,  0.0690, -0.1066, -0.0583],
        [ 0.0238,  0.0316, -0.0636,  ...,  0.0754, -0.0891,  0.0258],
        ...,
        [ 0.0603, -0.0173,  0.0627,  ...,  0.0152, -0.0215, -0.0730],
        [-0.1183, -0.0636,  0.0381,  ...,  0.0745, -0.0427, -0.0713],
- lr: the learning rate.
>>> optimizer.param_groups[0]['lr']
0.0005
- betas: a tuple of two coefficients used for the running averages of the gradient and of its square (the momentum-like terms in Adam).
>>> optimizer.param_groups[0]['betas']
(0.9, 0.999)
- eps: a small constant added to the denominator for numerical stability.
>>> optimizer.param_groups[0]['eps']
1e-08
- weight_decay: the weight decay (L2 penalty) coefficient; the default value is the int 0.
>>> optimizer.param_groups[0]['weight_decay']
0
- amsgrad: a bool that selects the AMSGrad variant of the algorithm.
>>> optimizer.param_groups[0]['amsgrad']
False
- maximize: a bool; when True, the objective is maximized instead of minimized.
>>> optimizer.param_groups[0]['maximize']
False
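To inspect all of these keys at once, the following is a minimal sketch of a helper that summarizes each group. The name summarize_param_groups and the small Linear stand-in model are assumptions made for illustration; they do not come from the original experiment.

```python
import torch
import torch.optim as optim

# Hypothetical stand-in model; any nn.Module would do.
model = torch.nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=0.0005)


def summarize_param_groups(opt):
    """Return one dict per group: every non-'params' option plus size counts."""
    summary = []
    for group in opt.param_groups:
        # Copy the hyperparameters, leaving out the (large) tensor list.
        info = {k: v for k, v in group.items() if k != "params"}
        info["num_tensors"] = len(group["params"])
        info["num_elements"] = sum(p.numel() for p in group["params"])
        summary.append(info)
    return summary


summary = summarize_param_groups(optimizer)
for i, info in enumerate(summary):
    print(i, info)
```

Printing the summary instead of the raw group avoids dumping whole tensors to the console, which is what makes the 'params' entry hard to read.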
Continuing with an example found online:
import torch
import torch.optim as optim
w1 = torch.randn(3, 3)
w1.requires_grad = True
w2 = torch.randn(3, 3)
w2.requires_grad = True
o = optim.Adam([w1])
print(o.param_groups)
# Output:
[{'params': [tensor([[-0.1002,  0.3526, -1.2212],
        [-0.4659,  0.0498, -0.2905],
        [ 1.1862, -0.6085,  0.4965]], requires_grad=True)],
  'lr': 0.001,
  'betas': (0.9, 0.999),
  'eps': 1e-08,
  'weight_decay': 0,
  'amsgrad': False,
  'maximize': False}]
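The tensors listed under 'params' are the very objects passed to the optimizer, not copies. The sketch below checks this and shows that step() updates the tensor in place. It swaps Adam for plain SGD (an assumption made only so the expected update is easy to verify by hand).

```python
import torch
import torch.optim as optim

w1 = torch.randn(3, 3, requires_grad=True)
o = optim.SGD([w1], lr=0.1)  # plain SGD: update is w <- w - lr * grad

# The optimizer stores a reference to w1 itself.
same_object = o.param_groups[0]['params'][0] is w1

before = w1.detach().clone()
w1.sum().backward()  # the gradient of sum() is a tensor of ones
o.step()

# Every entry should have decreased by lr * 1 = 0.1.
updated_in_place = torch.allclose(w1.detach(), before - 0.1)
print(same_object, updated_in_place)
```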
The following looks at one of the main methods of the Optimizer class: add_param_group
# Per the docs, the add_param_group method accepts a param_group parameter that is a dict. Example of use:
import torch
import torch.optim as optim
w1 = torch.randn(3, 3)
w1.requires_grad = True
w2 = torch.randn(3, 3)
w2.requires_grad = True
o = optim.Adam([w1])
print(o.param_groups)
# Output:
[{'params': [tensor([[-1.5916, -1.6110, -0.5739],
        [ 0.0589, -0.5848, -0.9199],
        [-0.4206, -2.3198, -0.2062]], requires_grad=True)], 'lr': 0.001, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'maximize': False}]
o.add_param_group({'params': w2})
print(o.param_groups)
# Output:
[{'params': [tensor([[-1.5916, -1.6110, -0.5739],
        [ 0.0589, -0.5848, -0.9199],
        [-0.4206, -2.3198, -0.2062]], requires_grad=True)], 'lr': 0.001, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'maximize': False},
 {'params': [tensor([[-0.5546, -1.2646,  1.6420],
        [ 0.0730, -0.0460, -0.0865],
        [ 0.3043,  0.4203, -0.3607]], requires_grad=True)], 'lr': 0.001, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'maximize': False}]
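In the output above the new group inherited every option from the optimizer's defaults. A group can also override individual options, and a tensor that is already managed cannot be added a second time. The following is a small sketch of both behaviors; the learning rate 1e-4 is an arbitrary value chosen for illustration.

```python
import torch
import torch.optim as optim

w1 = torch.randn(3, 3, requires_grad=True)
w2 = torch.randn(3, 3, requires_grad=True)

o = optim.Adam([w1], lr=0.001)

# Override only 'lr' for the new group; the remaining options
# (betas, eps, ...) are filled in from the optimizer's defaults.
o.add_param_group({'params': [w2], 'lr': 1e-4})

lrs = [g['lr'] for g in o.param_groups]
print(lrs)

# A parameter may belong to only one group, so adding w1 again is rejected.
try:
    o.add_param_group({'params': [w1]})
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```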
How to dynamically modify the learning rate in code (a routine operation):
for param_group in optimizer.param_groups:
    param_group["lr"] = lr
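As an illustration of where the loop above usually sits, here is a minimal sketch of a hand-written step decay. The helper set_lr and the values base_lr, gamma, and step_size are hypothetical names and numbers chosen for this example.

```python
import torch
import torch.optim as optim

w = torch.randn(3, 3, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)


def set_lr(opt, lr):
    """Write the same learning rate into every parameter group."""
    for param_group in opt.param_groups:
        param_group["lr"] = lr


base_lr, gamma, step_size = 0.1, 0.5, 2  # halve the rate every 2 epochs
history = []
for epoch in range(6):
    lr = base_lr * gamma ** (epoch // step_size)
    set_lr(optimizer, lr)
    history.append(optimizer.param_groups[0]["lr"])
    # ... the training code for this epoch would go here ...

print(history)
```

Note that this helper overwrites any per-group learning rates with a single value. PyTorch also ships ready-made schedulers in torch.optim.lr_scheduler (for example StepLR), which may be preferable to a manual loop.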
Supplement: summary of optimizers in PyTorch
Take the SGD optimizer as an example:
from torch import nn as nn
import torch as t
from torch import optim  # optimizers

# Define a LeNet network
class LeNet(t.nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.features = t.nn.Sequential(
            t.nn.Conv2d(3, 6, 5),
            t.nn.ReLU(),
            t.nn.MaxPool2d(2, 2),
            t.nn.Conv2d(6, 16, 5),
            t.nn.ReLU(),
            t.nn.MaxPool2d(2, 2)
        )
        # Reshaping is not an nn.Module layer, so the network is split
        # into separate sub-modules around this non-Module operation.
        self.classifiter = t.nn.Sequential(
            t.nn.Linear(16*5*5, 120),
            t.nn.ReLU(),
            t.nn.Linear(120, 84),
            t.nn.ReLU(),
            t.nn.Linear(84, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(-1, 16*5*5)
        x = self.classifiter(x)
        return x

net = LeNet()

# The usual optimization step
optimizer = optim.SGD(params=net.parameters(), lr=1)
optimizer.zero_grad()  # zero the gradients; here equivalent to net.zero_grad()
input = t.randn(1, 3, 32, 32)  # plain tensors suffice; the Variable wrapper is deprecated
output = net(input)
output.backward(output)  # use the output itself as the upstream gradient
optimizer.step()  # apply the parameter update
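The snippet above calls backward on the raw output to keep things short. In practice the gradient usually comes from a loss. The following sketch shows the same zero_grad / backward / step sequence with a loss function; the small Sequential model, the random data, and the choice of CrossEntropyLoss are all assumptions made for illustration, not part of the original example.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical small classifier standing in for LeNet.
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
optimizer = optim.SGD(net.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(4, 8)              # batch of 4 random samples
targets = torch.tensor([0, 2, 1, 0])    # made-up class labels

before = [p.detach().clone() for p in net.parameters()]

optimizer.zero_grad()                   # clear gradients from any earlier step
loss = criterion(net(inputs), targets)  # forward pass and scalar loss
loss.backward()                         # populate p.grad for every parameter
optimizer.step()                        # update parameters using p.grad

# Record which parameter tensors were modified by the step.
changed = [not torch.equal(b, p.detach())
           for b, p in zip(before, net.parameters())]
print(loss.item(), changed)
```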
Setting different learning rates for different sub-networks is common in fine-tuning: the classifier is given a higher learning rate so that it learns faster (in theory).
1. Set the learning rate per sub-module, using the modules defined when building the network:
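# Set different learning rates for different sub-networks; common in fine-tuning.
# A group that does not specify a learning rate uses the default one.
optimizer = optim.SGD(
    [{'params': net.features.parameters()},  # learning rate 1e-5 (the default)
     {'params': net.classifiter.parameters(), 'lr': 1e-2}],
    lr=1e-5
)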
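To confirm that each group received the intended rate, and to decay the rates later without losing their ratio, one possible sketch is shown below. TinyNet is a hypothetical stand-in with the same features / classifiter split as the network above, and the decay factor 0.1 is arbitrary.

```python
import torch
from torch import nn, optim


class TinyNet(nn.Module):
    """Hypothetical two-part network mirroring the features/classifiter split."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 6, 5), nn.ReLU())
        self.classifiter = nn.Sequential(nn.Linear(6, 10))

    def forward(self, x):
        x = self.features(x).mean(dim=(2, 3))  # global average pooling
        return self.classifiter(x)


net = TinyNet()
optimizer = optim.SGD(
    [{'params': net.features.parameters()},
     {'params': net.classifiter.parameters(), 'lr': 1e-2}],
    lr=1e-5,
)

group_lrs = [g['lr'] for g in optimizer.param_groups]
group_sizes = [len(g['params']) for g in optimizer.param_groups]
print(group_lrs, group_sizes)

# Scale each group's own rate, so the 1e-5 : 1e-2 ratio is preserved.
for g in optimizer.param_groups:
    g['lr'] *= 0.1
scaled_lrs = [g['lr'] for g in optimizer.param_groups]
print(scaled_lrs)
```

Multiplying each group's rate, rather than assigning one value to all groups, keeps the per-group setup intact.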
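2. Group by layer object and set the learning rate:
# Give only two fully connected layers a larger learning rate; the other layers use a smaller one.
# Learning rates are assigned layer by layer.
# Collect the chosen layer objects (classifiter[0] and classifiter[2] are Linear layers)
special_layers = nn.ModuleList([net.classifiter[0], net.classifiter[2]])
# Get the ids of the parameters of the chosen layers
special_layers_params = list(map(id, special_layers.parameters()))
# Get the parameters that do not belong to the chosen layers
base_params = filter(lambda p: id(p) not in special_layers_params, net.parameters())
optimizer = t.optim.SGD(
    [{'params': base_params},
     {'params': special_layers.parameters(), 'lr': 0.01}],
    lr=0.001)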
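When grouping by id like this, a quick sanity check is to count the tensors in each group: a layer without parameters (such as ReLU) contributes nothing, so picking the wrong index silently shrinks the special group. The sketch below runs that check on a hypothetical Sequential stand-in for the classifier part of the network.

```python
import torch
from torch import nn, optim

# Hypothetical stand-in for the classifier: Linear layers at indices 0, 2, 4.
net = nn.Sequential(
    nn.Linear(400, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)

special_layers = nn.ModuleList([net[0], net[2]])
special_ids = set(map(id, special_layers.parameters()))

# Materialize both groups as lists so they can be counted and reused.
base_params = [p for p in net.parameters() if id(p) not in special_ids]
special_params = list(special_layers.parameters())

optimizer = optim.SGD(
    [{'params': base_params},
     {'params': special_params, 'lr': 0.01}],
    lr=0.001,
)

n_total = len(list(net.parameters()))
n_base, n_special = (len(g['params']) for g in optimizer.param_groups)
print(n_base, n_special, n_total)
```

Each Linear layer holds a weight and a bias, so two special layers should yield four tensors in the second group, with the two groups together covering every parameter of the network.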
Reference:
https://blog.csdn.net/weixin_43593330/article/details/108490956
https://www.cnblogs.com/hellcat/p/8496727.html
https://www.yisu.com/zixun/456082.html