I recently read a paper in which a model is trained first, part of the trained model's structure is then carried over into a new model, and the new model is trained on new data while the parameters of the carried-over part are kept. At first I thought this was very similar to transfer learning, but since I had not studied transfer learning in detail I was not sure. So I first worked out how to implement the scheme described in the paper and recorded it here for future reference. Transfer learning itself is something to study later; it will certainly come in handy.
PyTorch saves some model parameters and loads them in a new model
- state_dict
- Save the model to the current path with the name test_state_dict.pth
- OrderedDict
- Save some model parameters and load them in a new model
- The difference between state_dict(), named_parameters(), model.parameters(), named_modules()
- Freeze certain layers / let only certain layers learn
- References
state_dict
Introduction to state_dict
state_dict is a Python dictionary object that maps each layer to its parameter tensors. Both models (torch.nn.Module) and optimizers (torch.optim) have one: the optimizer's state_dict holds the optimizer state and the hyperparameters in use. Note that only layers with learnable parameters (such as convolutional and linear layers) and registered buffers (such as BatchNorm's running_mean) have entries in the model's state_dict.
Here is an example to illustrate the use of state_dict:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
# Initialize the model
model = TheModelClass()
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Print the model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())
output:
Model's state_dict:
conv1.weight torch.Size([6, 3, 5, 5])
conv1.bias torch.Size([6])
conv2.weight torch.Size([16, 6, 5, 5])
conv2.bias torch.Size([16])
fc1.weight torch.Size([120, 400])
fc1.bias torch.Size([120])
fc2.weight torch.Size([84, 120])
fc2.bias torch.Size([84])
fc3.weight torch.Size([10, 84])
fc3.bias torch.Size([10])
# Print the optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])
output:
Optimizer's state_dict:
state {}
param_groups [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]
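The empty state entry above reflects that no optimization step has been run yet. As a small sketch (the tiny nn.Linear model and the random input are my own assumptions, chosen only to keep the example short), the following shows that SGD with momentum fills state with one momentum_buffer per parameter after the first step():

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical tiny model, used only to keep the example short.
tiny_model = nn.Linear(4, 2)
tiny_optimizer = optim.SGD(tiny_model.parameters(), lr=0.001, momentum=0.9)

# Before any step, the optimizer holds no per-parameter state.
state_before = len(tiny_optimizer.state_dict()["state"])

# One forward/backward/step cycle on random data.
loss = tiny_model(torch.randn(8, 4)).sum()
loss.backward()
tiny_optimizer.step()

# After the step, weight and bias each have an entry in "state".
state_after = tiny_optimizer.state_dict()["state"]
print(state_before, len(state_after))
print(list(state_after[0].keys()))
```

This is why the optimizer's state_dict is worth saving alongside the model when training is meant to be resumed later.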
Save and load state_dict
The state_dict of the model can be saved with torch.save(), which stores only the learned model parameters; the parameters can then be loaded and restored with load_state_dict(). The most common file extensions for saved PyTorch models are '.pt' and '.pth'.
Save the model to the current path with the name test_state_dict.pth
PATH = './test_state_dict.pth'
torch.save(model.state_dict(), PATH)
model = TheModelClass()  # first build the model structure from code
model.load_state_dict(torch.load(PATH))  # then load the model's state_dict
model.eval()
Note: load_state_dict() only accepts a dictionary object and cannot be given the model path directly, so you need to deserialize the saved state_dict with torch.load() first.
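Since a state_dict is just a dictionary, several of them can be bundled into one checkpoint file. The sketch below saves the model's and the optimizer's state_dict together and restores both. The key names "model_state_dict" and "optimizer_state_dict", the small nn.Linear model, and the temporary file path are assumptions made for this illustration:

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical small model standing in for TheModelClass.
net = nn.Linear(4, 2)
opt = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Bundle both state_dicts in one ordinary dictionary; the key names are arbitrary.
checkpoint = {
    "model_state_dict": net.state_dict(),
    "optimizer_state_dict": opt.state_dict(),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "checkpoint.pth")
    torch.save(checkpoint, ckpt_path)
    # torch.load() deserializes the file back into a dictionary ...
    loaded = torch.load(ckpt_path)

# ... and load_state_dict() then receives dictionary objects, not a path.
new_net = nn.Linear(4, 2)
new_opt = optim.SGD(new_net.parameters(), lr=0.001, momentum=0.9)
new_net.load_state_dict(loaded["model_state_dict"])
new_opt.load_state_dict(loaded["optimizer_state_dict"])

weights_match = torch.equal(new_net.weight, net.weight)
print(weights_match)
```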
Save and load full models
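# Save the full model
torch.save(model, PATH)
# Load the full model
# (on PyTorch 2.6 and later, torch.load() defaults to weights_only=True,
# so a pickled full model needs torch.load(PATH, weights_only=False))
model = torch.load(PATH)
model.eval()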
Although this approach looks more concise than the state_dict approach, it is less flexible. torch.save() serializes with Python's pickle module, and pickle does not save the model class itself; it saves the path to the file that contains the class, which is then used when the model is loaded. As a result, the code may break in unexpected ways when the model is used in other projects or after refactoring.
OrderedDict
If we print the data type of state_dict, we get the following output:
print(type(model.state_dict()))
output:
<class 'collections.OrderedDict'>
The collections module implements specialized container types that provide alternatives to Python's general-purpose built-in containers dict, list, set, and tuple.
class collections.OrderedDict([items])
OrderedDict is a subclass of dict. An ordered dictionary behaves just like a regular dictionary, but has some extra functionality for ordering operations.
It is worth mentioning that since Python 3.7 the built-in dict class is guaranteed to remember insertion order, so this container has become less important.
Some differences from dict:
- Regular dicts are designed to be very good at mapping operations. Tracking insertion order is secondary;
- OrderedDict is designed to be good at reordering operations. Space efficiency, iteration speed, and performance of update operations are secondary;
- Algorithmically, OrderedDict can handle frequent reordering operations better than dict. This makes it suitable for keeping track of recent accesses (e.g. in an LRU cache);
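To make the reordering point concrete, here is a minimal sketch of the two OrderedDict methods that a plain dict does not offer in this form, move_to_end() and popitem(last=False). The cache contents are made up for the example:

```python
from collections import OrderedDict

# Hypothetical access log: keys are layer names, values are dummy payloads.
cache = OrderedDict([("conv1", 1), ("conv2", 2), ("fc1", 3)])

# Mark "conv1" as most recently used by moving it to the end.
cache.move_to_end("conv1")
order_after_access = list(cache)

# Evict the least recently used entry (the first one).
evicted_key, _ = cache.popitem(last=False)
order_after_evict = list(cache)

print(order_after_access)  # ['conv2', 'fc1', 'conv1']
print(evicted_key)         # conv2
print(order_after_evict)   # ['fc1', 'conv1']
```

These two operations are the core of an LRU cache, which is the use case mentioned in the list above.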
Save some model parameters and load them in a new model
For the above model, the state dictionary of the model is:
Model's state_dict:
conv1.weight torch.Size([6, 3, 5, 5])
conv1.bias torch.Size([6])
conv2.weight torch.Size([16, 6, 5, 5])
conv2.bias torch.Size([16])
fc1.weight torch.Size([120, 400])
fc1.bias torch.Size([120])
fc2.weight torch.Size([84, 120])
fc2.bias torch.Size([84])
fc3.weight torch.Size([10, 84])
fc3.bias torch.Size([10])
If we only want to save the trained parameters of conv1 , we can do this:
save_state = {}
print("Model's state_dict:")
for param_tensor in model.state_dict():
    if 'conv1' in param_tensor:
        # For demonstration only: store all-ones tensors of the same shape
        save_state.update({param_tensor: torch.ones(model.state_dict()[param_tensor].size())})
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())
Here, to make the later demonstration easier to follow, the key statement stores all-ones tensors:
save_state.update({param_tensor: torch.ones(model.state_dict()[param_tensor].size())})
When actually saving, however, we should store the real parameters:
save_state.update({param_tensor: model.state_dict()[param_tensor]})
Then save the save_state dictionary:
PATH = './test_state_dict.pth'
torch.save(save_state, PATH)
Then load the new model and assign the saved parameters to the new model:
model = TheModelClass()  # first build the model structure from code
model.load_state_dict(torch.load(PATH), strict=False)  # then load the model's state_dict
output:
_IncompatibleKeys(missing_keys=['conv2.weight', 'conv2.bias', 'fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias', 'fc3.weight', 'fc3.bias'], unexpected_keys=[])
This is the warm-start mode: setting the strict parameter of load_state_dict() to False makes it ignore keys that do not match.
Let's look at the parameters of the new model again:
model.state_dict()['conv1.bias']
output:
tensor([1., 1., 1., 1., 1., 1.])
We can see that the previously saved parameters have been loaded into the new model.
Look at the other parameters in the model:
model.state_dict()['conv2.bias']
output:
tensor([ 0.0468, 0.0024, -0.0510, 0.0791, 0.0244, -0.0379, -0.0708, 0.0317,
-0.0410, -0.0238, 0.0071, 0.0193, -0.0562, -0.0336, 0.0109, -0.0323])
You can see that the other parameters keep their normal random initialization.
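The steps above can be wrapped into a small helper. This is only a sketch: SmallNet and select_by_prefix are hypothetical names introduced for the example, and an in-memory buffer is used instead of a file so that nothing is written to disk:

```python
import io

import torch
import torch.nn as nn


class SmallNet(nn.Module):
    """Hypothetical two-layer model, standing in for TheModelClass."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.fc1 = nn.Linear(6, 2)


def select_by_prefix(state_dict, prefix):
    """Return a new dict with only the entries whose key starts with prefix."""
    return {k: v for k, v in state_dict.items() if k.startswith(prefix)}


source = SmallNet()
partial_state = select_by_prefix(source.state_dict(), "conv1.")

# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
torch.save(partial_state, buffer)
buffer.seek(0)

target = SmallNet()
result = target.load_state_dict(torch.load(buffer), strict=False)

conv1_copied = torch.equal(target.conv1.weight, source.conv1.weight)
print(sorted(partial_state))  # ['conv1.bias', 'conv1.weight']
print(result.missing_keys)    # ['fc1.weight', 'fc1.bias']
print(conv1_copied)           # True
```

Matching on the prefix "conv1." rather than testing 'conv1' in key avoids accidentally picking up a layer named, say, conv10. The object returned by load_state_dict() can also be inspected to confirm that only the expected keys are missing.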
The difference between state_dict(), named_parameters(), model.parameters(), named_modules()
model.state_dict()
state_dict() stores layer_name and layer_param as key-value pairs in a dict. It contains the names and parameters of all layers, including buffers such as bn.running_mean that are not learnable. The requires_grad attribute of every tensor it returns is False, so the printed values do not show requires_grad. For this reason you cannot use model.state_dict() to obtain the parameters and set their requires_grad attribute when you want to fix (freeze) a layer.
import torch
import torch.nn as nn
import torch.optim as optim
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(1, 2, 3)
        self.bn = nn.BatchNorm2d(num_features=2)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(8, 4)
        self.softmax = nn.Softmax(dim=-1)
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn(x)
        x = self.act(x)
        x = self.pool(x)
        x = x.view(-1, 8)
        x = self.fc1(x)
        x = self.softmax(x)
        return x
# Initialize the model
model = TheModelClass()
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for param_tensor in model.state_dict():
    print(param_tensor, "\n", model.state_dict()[param_tensor])
output:
conv1.weight
tensor([[[[ 0.2438, -0.0467, 0.0486],
[-0.1932, -0.2083, 0.3239],
[ 0.1712, 0.0379, -0.2381]]],
[[[ 0.2853, 0.0961, 0.0809],
[ 0.2526, 0.3138, -0.2243],
[-0.1627, -0.2958, -0.1995]]]]) # no requires_grad shown
conv1.bias
tensor([-0.3287, -0.0686])
bn.weight
tensor([1., 1.])
bn.bias
tensor([0., 0.])
bn.running_mean
tensor([0., 0.])
bn.running_var
tensor([1., 1.])
bn.num_batches_tracked
tensor(0)
fc1.weight
tensor([[ 0.2246, -0.1272, 0.0163, -0.3089, 0.3511, -0.0189, 0.3025, 0.0770],
[ 0.2964, 0.2050, 0.2879, 0.0237, -0.3424, 0.0346, -0.0659, -0.0115],
[ 0.1960, -0.2104, -0.2839, 0.0977, -0.2857, -0.0610, -0.3029, 0.1230],
[-0.2176, 0.2868, -0.2258, 0.2992, -0.2619, 0.3286, 0.0410, 0.0152]])
fc1.bias
tensor([-0.0623, 0.1708, -0.1836, -0.1411])
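The output above contains entries such as bn.running_mean that are not learnable, and none of the tensors shows requires_grad. A short sketch makes both points checkable. The two-layer nn.Sequential model is an assumption for the example, and its keys are therefore "0.weight", "1.running_mean" and so on rather than the names used above:

```python
import torch
import torch.nn as nn

# Hypothetical minimal model with one conv layer and one BatchNorm layer.
demo = nn.Sequential(nn.Conv2d(1, 2, 3), nn.BatchNorm2d(2))

sd = demo.state_dict()
param_names = {name for name, _ in demo.named_parameters()}

# Entries present in state_dict() but absent from named_parameters() are buffers.
buffer_keys = [k for k in sd if k not in param_names]
print(buffer_keys)

# The returned tensors are detached, so requires_grad is False ...
detached_flag = sd["0.weight"].requires_grad
# ... but they still share storage with the live parameter.
shares_memory = sd["0.weight"].data_ptr() == demo[0].weight.data_ptr()
# The live parameter itself is unaffected and still requires grad.
live_flag = demo[0].weight.requires_grad
print(detached_flag, shares_memory, live_flag)
```

Because the state_dict tensors are detached views, changing their requires_grad flag would not freeze anything in the model; the flag has to be set on the parameters themselves.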
model.named_parameters()
named_parameters() returns an iterator that yields each layer_name and layer_param packed as a tuple.
It only covers parameters that can be learned and updated. By default, the requires_grad attribute of every parameter returned by model.named_parameters() is True. It is commonly used to control whether the parameters of a layer are trained: obtain the parameters through model.named_parameters() and set their requires_grad attribute.
import torch
import torch.nn as nn
import torch.optim as optim
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(1, 2, 3)
        self.bn = nn.BatchNorm2d(num_features=2)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(8, 4)
        self.softmax = nn.Softmax(dim=-1)
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn(x)
        x = self.act(x)
        x = self.pool(x)
        x = x.view(-1, 8)
        x = self.fc1(x)
        x = self.softmax(x)
        return x
# Initialize the model
model = TheModelClass()
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for layer_name, layer_param in model.named_parameters():
    print(layer_name, "\n", layer_param)
output:
conv1.weight
Parameter containing:
tensor([[[[ 0.2438, -0.0467, 0.0486],
[-0.1932, -0.2083, 0.3239],
[ 0.1712, 0.0379, -0.2381]]],
[[[ 0.2853, 0.0961, 0.0809],
[ 0.2526, 0.3138, -0.2243],
[-0.1627, -0.2958, -0.1995]]]], requires_grad=True) # requires_grad is True
conv1.bias
Parameter containing:
tensor([-0.3287, -0.0686], requires_grad=True)
bn.weight
Parameter containing:
tensor([1., 1.], requires_grad=True)
bn.bias
Parameter containing:
tensor([0., 0.], requires_grad=True)
fc1.weight
Parameter containing:
tensor([[ 0.2246, -0.1272, 0.0163, -0.3089, 0.3511, -0.0189, 0.3025, 0.0770],
[ 0.2964, 0.2050, 0.2879, 0.0237, -0.3424, 0.0346, -0.0659, -0.0115],
[ 0.1960, -0.2104, -0.2839, 0.0977, -0.2857, -0.0610, -0.3029, 0.1230],
[-0.2176, 0.2868, -0.2258, 0.2992, -0.2619, 0.3286, 0.0410, 0.0152]],
requires_grad=True)
fc1.bias
Parameter containing:
tensor([-0.0623, 0.1708, -0.1836, -0.1411], requires_grad=True)
model.parameters()
parameters() returns only the parameters, without layer_name. The returned parameters carry requires_grad, and it is True for all of them, because parameters are assumed to be learnable by default when the network is created.
import torch
import torch.nn as nn
import torch.optim as optim
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(1, 2, 3)
        self.bn = nn.BatchNorm2d(num_features=2)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(8, 4)
        self.softmax = nn.Softmax(dim=-1)
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn(x)
        x = self.act(x)
        x = self.pool(x)
        x = x.view(-1, 8)
        x = self.fc1(x)
        x = self.softmax(x)
        return x
# Initialize the model
model = TheModelClass()
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for layer_param in model.parameters():
    print(layer_param)
output:
Parameter containing:
tensor([[[[ 0.2438, -0.0467, 0.0486],
[-0.1932, -0.2083, 0.3239],
[ 0.1712, 0.0379, -0.2381]]],
[[[ 0.2853, 0.0961, 0.0809],
[ 0.2526, 0.3138, -0.2243],
[-0.1627, -0.2958, -0.1995]]]], requires_grad=True)
Parameter containing:
tensor([-0.3287, -0.0686], requires_grad=True)
Parameter containing:
tensor([1., 1.], requires_grad=True)
Parameter containing:
tensor([0., 0.], requires_grad=True)
Parameter containing:
tensor([[ 0.2246, -0.1272, 0.0163, -0.3089, 0.3511, -0.0189, 0.3025, 0.0770],
[ 0.2964, 0.2050, 0.2879, 0.0237, -0.3424, 0.0346, -0.0659, -0.0115],
[ 0.1960, -0.2104, -0.2839, 0.0977, -0.2857, -0.0610, -0.3029, 0.1230],
[-0.2176, 0.2868, -0.2258, 0.2992, -0.2619, 0.3286, 0.0410, 0.0152]],
requires_grad=True)
Parameter containing:
tensor([-0.0623, 0.1708, -0.1836, -0.1411], requires_grad=True)
model.named_modules()
named_modules() returns the name and structure of every module in the model, starting with the model itself (whose name is the empty string).
import torch
import torch.nn as nn
import torch.optim as optim
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(1, 2, 3)
        self.bn = nn.BatchNorm2d(num_features=2)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(8, 4)
        self.softmax = nn.Softmax(dim=-1)
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn(x)
        x = self.act(x)
        x = self.pool(x)
        x = x.view(-1, 8)
        x = self.fc1(x)
        x = self.softmax(x)
        return x
# Initialize the model
model = TheModelClass()
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for name, module in model.named_modules():
    print(name, '\n', module)
output:
TheModelClass(
(conv1): Conv2d(1, 2, kernel_size=(3, 3), stride=(1, 1))
(bn): BatchNorm2d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act): ReLU()
(pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(fc1): Linear(in_features=8, out_features=4, bias=True)
(softmax): Softmax(dim=-1)
)
conv1
Conv2d(1, 2, kernel_size=(3, 3), stride=(1, 1))
bn
BatchNorm2d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
act
ReLU()
pool
MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
fc1
Linear(in_features=8, out_features=4, bias=True)
softmax
Softmax(dim=-1)
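Since named_modules() yields module objects rather than tensors, it is handy for picking out layers by type. The following is a small sketch under that idea; the nn.Sequential model is an assumption for the example, so the module names are the indices '0' to '4' instead of conv1, bn and so on:

```python
import torch.nn as nn

# Hypothetical model mirroring the layer types used above.
net = nn.Sequential(
    nn.Conv2d(1, 2, 3),
    nn.BatchNorm2d(2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8, 4),
)

# named_modules() yields the root module first (with an empty name),
# followed by every submodule.
names = [name for name, _ in net.named_modules()]
print(names)  # ['', '0', '1', '2', '3', '4']

# Pick out modules by type, here the convolutional and linear layers.
weighted = [
    name
    for name, module in net.named_modules()
    if isinstance(module, (nn.Conv2d, nn.Linear))
]
print(weighted)  # ['0', '4']
```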
Freeze certain layers / let only certain layers learn
Here is the requirement I currently have: I trained a network on a large amount of data and now need to use the trained network to test its accuracy on new subjects. Keeping the network architecture unchanged, I want to borrow the idea of transfer learning: take the trained parameters and use only a very small amount of data from the new subjects to train the classification head of the model, while the other layers are not trained. To do this I need to freeze some layers of the model, that is, train only some of its layers.
Following the analysis above, I use state_dict() to read the parameters of the model and save them:
model_dict = model.state_dict()
The model parameters obtained with state_dict() do not show a requires_grad attribute; as noted above, the requires_grad attribute of every tensor stored in state_dict() is False.
Next, we delete the parameters of the layer to be trained from the saved parameters. That way, when we create a new model object and load the previous model's parameters, the parameters of the layer to be trained are not overwritten by those of the previous model.
model_dict.pop('fc1.weight', None)
output:
tensor([[ 0.2246, -0.1272, 0.0163, -0.3089, 0.3511, -0.0189, 0.3025, 0.0770],
[ 0.2964, 0.2050, 0.2879, 0.0237, -0.3424, 0.0346, -0.0659, -0.0115],
[ 0.1960, -0.2104, -0.2839, 0.0977, -0.2857, -0.0610, -0.3029, 0.1230],
[-0.2176, 0.2868, -0.2258, 0.2992, -0.2619, 0.3286, 0.0410, 0.0152]])
model_dict.pop('fc1.bias', None)
output:
tensor([-0.0623, 0.1708, -0.1836, -0.1411])
Then we print the parameters of the saved model:
for param_tensor in model_dict:
print(param_tensor, "\n", model_dict[param_tensor])
output:
conv1.weight
tensor([[[[ 0.2438, -0.0467, 0.0486],
[-0.1932, -0.2083, 0.3239],
[ 0.1712, 0.0379, -0.2381]]],
[[[ 0.2853, 0.0961, 0.0809],
[ 0.2526, 0.3138, -0.2243],
[-0.1627, -0.2958, -0.1995]]]])
conv1.bias
tensor([-0.3287, -0.0686])
bn.weight
tensor([1., 1.])
bn.bias
tensor([0., 0.])
bn.running_mean
tensor([0., 0.])
bn.running_var
tensor([1., 1.])
bn.num_batches_tracked
tensor(0)
We can see that the parameters of the deleted layer are gone.
Then we create a new model object and load the parameters of the previously saved model into the new model object:
model_ = TheModelClass()
model_.load_state_dict(model_dict, strict=False)
output:
_IncompatibleKeys(missing_keys=['fc1.weight', 'fc1.bias'], unexpected_keys=[])
Then let's see what the requires_grad attribute of the new model object's parameters looks like:
model_dict_ = model_.named_parameters()
for layer_name, layer_param in model_dict_:
    print(layer_name, "\n", layer_param)
output:
conv1.weight
Parameter containing:
tensor([[[[ 0.2438, -0.0467, 0.0486],
[-0.1932, -0.2083, 0.3239],
[ 0.1712, 0.0379, -0.2381]]],
[[[ 0.2853, 0.0961, 0.0809],
[ 0.2526, 0.3138, -0.2243],
[-0.1627, -0.2958, -0.1995]]]], requires_grad=True)
conv1.bias
Parameter containing:
tensor([-0.3287, -0.0686], requires_grad=True)
bn.weight
Parameter containing:
tensor([1., 1.], requires_grad=True)
bn.bias
Parameter containing:
tensor([0., 0.], requires_grad=True)
fc1.weight
Parameter containing:
tensor([[-0.2306, -0.3159, -0.3105, -0.3051, 0.2721, -0.0691, 0.2208, -0.1724],
[-0.0238, -0.1555, 0.2341, -0.2668, 0.3143, 0.1433, 0.3140, -0.2014],
[ 0.0696, -0.0250, 0.0316, -0.1065, 0.2260, -0.1009, -0.1990, -0.1758],
[-0.1782, -0.2045, -0.3030, 0.2643, 0.1951, -0.2213, -0.0040, 0.1542]],
requires_grad=True)
fc1.bias
Parameter containing:
tensor([-0.0472, -0.0569, -0.1912, -0.2139], requires_grad=True)
We can see that the parameters of the previous model have been loaded into the new model object, but the requires_grad attribute of all the new parameters is True, which is not what we want.
From the analysis above, reading the model's parameters with state_dict(), saving them, and loading them into a new model object is not enough to achieve the effect we want. Some additional steps are needed to reach the goal.
We can solve the above problem in two ways:
requires_grad=False
We can set the requires_grad attribute of the parameters of the layers that do not need to learn to False:
model_dict_ = model_.named_parameters()
for layer_name, layer_param in model_dict_:
    if 'fc1' in layer_name:
        continue
    else:
        layer_param.requires_grad = False
Then we look at the parameters of the model:
for layer_param in model_.parameters():
    print(layer_param)
output:
Parameter containing:
tensor([[[[ 0.2438, -0.0467, 0.0486],
[-0.1932, -0.2083, 0.3239],
[ 0.1712, 0.0379, -0.2381]]],
[[[ 0.2853, 0.0961, 0.0809],
[ 0.2526, 0.3138, -0.2243],
[-0.1627, -0.2958, -0.1995]]]])
Parameter containing:
tensor([-0.3287, -0.0686])
Parameter containing:
tensor([1., 1.])
Parameter containing:
tensor([0., 0.])
Parameter containing:
tensor([[ 0.0182, 0.1294, 0.0250, -0.1819, -0.2250, -0.2540, -0.2728, 0.2732],
[ 0.0167, -0.0969, 0.1498, -0.1844, 0.1387, 0.2436, 0.1278, -0.1875],
[-0.0408, 0.0786, 0.2352, 0.0277, 0.2571, 0.2782, 0.2505, -0.2454],
[ 0.3369, -0.0804, 0.2677, 0.0927, 0.0433, 0.1716, -0.1870, -0.1738]],
requires_grad=True)
Parameter containing:
tensor([0.1084, 0.3018, 0.1211, 0.1081], requires_grad=True)
We can see that the requires_grad attribute of the parameters of the layers that do not need to learn has changed to False.
These parameters can then be passed to the optimizer:
optimizer = optim.SGD(model_.parameters(), lr=0.001, momentum=0.9)
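To convince myself that frozen layers really stay put, here is a sketch that freezes one layer, runs a single training step, and compares the weights before and after. The two-layer model, the random data, and the learning rate are assumptions for the example; passing only the parameters with requires_grad=True to the optimizer is a variation on the call above, not a requirement:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical model: a "backbone" layer to freeze and a "head" to train.
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
backbone, head = net[0], net[2]

for param in backbone.parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in net.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.1, momentum=0.9)

backbone_before = backbone.weight.clone()
head_before = head.weight.clone()

# One training step on random data (assumed: batch of 16, 4 classes).
inputs = torch.randn(16, 8)
targets = torch.randint(0, 4, (16,))
loss = nn.functional.cross_entropy(net(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

backbone_unchanged = torch.equal(backbone.weight, backbone_before)
head_changed = not torch.equal(head.weight, head_before)
print(backbone_unchanged, head_changed)
```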
Set optimizer update parameters
If you do not want to update a certain network layer, the simpler way is not to put the parameters of the network layer in the optimizer:
optimizer = optim.SGD(model_.fc1.parameters(), lr=0.001, momentum=0.9)
Note: with this approach alone, gradients are still computed for the frozen parameters during backpropagation; the parameters are simply not updated.
This approach reduces memory usage, since the optimizer keeps no state for the excluded parameters. If requires_grad=False is also set beforehand, the model skips the gradient computation for those parameters, which improves speed, so the two methods can be used together.
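The difference between the two methods shows up in the .grad attribute after a backward pass. The sketch below compares leaving a layer out of the optimizer with additionally setting requires_grad=False; make_net and the random input are assumptions for the example:

```python
import torch
import torch.nn as nn
import torch.optim as optim


def make_net():
    # Hypothetical two-layer model: index 0 is "frozen", index 1 is trained.
    return nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))


x = torch.randn(16, 8)

# Optimizer method alone: the first layer is merely left out of the optimizer.
net_a = make_net()
opt_a = optim.SGD(net_a[1].parameters(), lr=0.001, momentum=0.9)
net_a(x).sum().backward()
grad_computed_a = net_a[0].weight.grad is not None

# Both methods together: requires_grad=False and left out of the optimizer.
net_b = make_net()
for param in net_b[0].parameters():
    param.requires_grad = False
opt_b = optim.SGD(net_b[1].parameters(), lr=0.001, momentum=0.9)
net_b(x).sum().backward()
grad_computed_b = net_b[0].weight.grad is not None

print(grad_computed_a)  # True
print(grad_computed_b)  # False
```

In the first case the gradient of the frozen layer is computed and then ignored; in the second case it is never computed at all.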
References
PyTorch study notes: use state_dict to save and load models