PyTorch: save some model parameters and load them into a new model

I recently read a paper whose approach was to train a model first, take part of the trained model's structure into a new model, and then train the new model on new data while keeping the parameters of that part. At first this looked very similar to transfer learning, but since I hadn't studied transfer learning in detail I wasn't sure, so I first worked out how to implement what the paper described and recorded it here for future reference. As for transfer learning itself, I'll study it later; it should come in handy!

state_dict

Introduction to state_dict

state_dict is a Python dictionary object that can be used to save model parameters, hyperparameters, and the state information of the optimizer (torch.optim). Note that only layers with learnable parameters (such as convolutional layers and linear layers) have entries in the state_dict.

Here is an example illustrating the use of state_dict:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
 
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
 
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
 
# Initialize the model
model = TheModelClass()
 
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Print the model's state_dict
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

output:

Model's state_dict:
conv1.weight 	 torch.Size([6, 3, 5, 5])
conv1.bias 	 torch.Size([6])
conv2.weight 	 torch.Size([16, 6, 5, 5])
conv2.bias 	 torch.Size([16])
fc1.weight 	 torch.Size([120, 400])
fc1.bias 	 torch.Size([120])
fc2.weight 	 torch.Size([84, 120])
fc2.bias 	 torch.Size([84])
fc3.weight 	 torch.Size([10, 84])
fc3.bias 	 torch.Size([10])
# Print the optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])

output:

Optimizer's state_dict:
state 	 {}
param_groups 	 [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]

Save and load state_dict

A model's state_dict can be saved with torch.save(); that is, only the learned model parameters are saved, and they can be loaded and restored with load_state_dict(). The most common file extensions for saved PyTorch models are '.pt' and '.pth'.

Save the model to the current directory under the name test_state_dict.pth:

PATH = './test_state_dict.pth'
torch.save(model.state_dict(), PATH)
 
model = TheModelClass()    # First build the model structure in code
model.load_state_dict(torch.load(PATH))   # Then load the model's state_dict
model.eval()

Note: the load_state_dict() function only accepts dictionary objects; you cannot pass in a model path directly, so you first need to deserialize the saved state_dict with torch.load().

Save and load full models

# Save the full model
torch.save(model, PATH)
 
# Load the full model
model = torch.load(PATH)
model.eval()

Although this approach looks more concise than the state_dict approach, it is less flexible. torch.save() uses Python's pickle module for serialization, and pickle does not save the model class itself; it saves a path to the file containing the class, which is needed again when the model is loaded. So when the model is refactored or reused in other projects, unexpected errors may appear.

OrderedDict

If we print the data type of the state_dict, we get the following output:

print(type(model.state_dict()))

output:

<class 'collections.OrderedDict'>

The collections module implements specialized container datatypes that provide alternatives to Python's general-purpose built-in containers dict, list, set, and tuple.

class collections.OrderedDict([items])

OrderedDict is a subclass of dict: an ordered dictionary behaves just like a regular dictionary, but has some extra functionality related to ordering operations.

It is worth mentioning that since Python 3.7 the built-in dict class also remembers insertion order, so this container has become less important.

Some differences from dict:

  1. Regular dicts are designed to be very good at mapping operations. Tracking insertion order is secondary;
  2. OrderedDict is designed to be good at reordering operations. Space efficiency, iteration speed, and performance of update operations are secondary;
  3. Algorithmically, OrderedDict can handle frequent reordering operations better than dict. This makes it suitable for keeping track of recent accesses (e.g. in an LRU cache; see the sketch after this list);
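
As a quick illustration (a small sketch, not from the original article), move_to_end() and popitem() make an LRU-style access pattern straightforward:

from collections import OrderedDict

cache = OrderedDict()
cache['a'] = 1
cache['b'] = 2
cache['c'] = 3

# Mark 'a' as most recently used by moving it to the end
cache.move_to_end('a')
print(list(cache.keys()))   # ['b', 'c', 'a']

# Evict the least recently used entry (the first one)
cache.popitem(last=False)
print(list(cache.keys()))   # ['c', 'a']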

Save some model parameters and load them into a new model

For the above model, the state dictionary of the model is:

Model's state_dict:
conv1.weight 	 torch.Size([6, 3, 5, 5])
conv1.bias 	 torch.Size([6])
conv2.weight 	 torch.Size([16, 6, 5, 5])
conv2.bias 	 torch.Size([16])
fc1.weight 	 torch.Size([120, 400])
fc1.bias 	 torch.Size([120])
fc2.weight 	 torch.Size([84, 120])
fc2.bias 	 torch.Size([84])
fc3.weight 	 torch.Size([10, 84])
fc3.bias 	 torch.Size([10])

If we only want to save the trained parameters of conv1, we can do this:

save_state = {}
print("Model's state_dict:")
for param_tensor in model.state_dict():
    if 'conv1' in param_tensor:
        save_state.update({param_tensor: torch.ones(model.state_dict()[param_tensor].size())})
        print(param_tensor, "\t", model.state_dict()[param_tensor].size())

Here, for the convenience of the demonstration that follows, the key line is written like this:

save_state.update({param_tensor: torch.ones(model.state_dict()[param_tensor].size())})

But when actually saving, we should write like this:

save_state.update({param_tensor: model.state_dict()[param_tensor]})
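
Equivalently (a small sketch, not from the original article), the conv1 entries can be collected in one expression with a dict comprehension over state_dict().items():

# Keep only the parameters whose names start with 'conv1'
save_state = {k: v for k, v in model.state_dict().items() if k.startswith('conv1')}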

Then save the save_state dictionary:

PATH = './test_state_dict.pth'
torch.save(save_state, PATH)

Then load the new model and assign the saved parameters to the new model:

model = TheModelClass()    # First build the model structure in code
model.load_state_dict(torch.load(PATH), strict=False)   # Then load the saved state_dict

output:

_IncompatibleKeys(missing_keys=['conv2.weight', 'conv2.bias', 'fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias', 'fc3.weight', 'fc3.bias'], unexpected_keys=[])

This is the "warm start" pattern: setting the strict parameter to False in load_state_dict() tells it to ignore the keys that do not match.
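
The value returned by load_state_dict() can also be inspected programmatically; a minimal sketch (the variable name is just illustrative):

result = model.load_state_dict(torch.load(PATH), strict=False)
print(result.missing_keys)      # keys the model has but the checkpoint lacks
print(result.unexpected_keys)   # keys the checkpoint has but the model lacks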

Let's look at the parameters of the new model again:

model.state_dict()['conv1.bias']

output:

tensor([1., 1., 1., 1., 1., 1.])

We find that the saved parameters have indeed been loaded into the new model.

Look at the other parameters in the model:

model.state_dict()['conv2.bias']

output:

tensor([ 0.0468,  0.0024, -0.0510,  0.0791,  0.0244, -0.0379, -0.0708,  0.0317,
        -0.0410, -0.0238,  0.0071,  0.0193, -0.0562, -0.0336,  0.0109, -0.0323])

You can see that the other parameters still have their normal (randomly initialized) values.

The difference between state_dict(), named_parameters(), parameters(), and named_modules()

model.state_dict()

state_dict() stores layer_name and layer_param pairs in a dict. It contains the names and parameters of all layers (as well as buffers), and the requires_grad attribute of the stored parameter tensors is always False. Since the returned tensors do not carry requires_grad, you cannot use model.state_dict() to obtain parameters and set their requires_grad attribute when freezing a layer.

import torch
import torch.nn as nn
import torch.optim as optim
 
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(1, 2, 3)
        self.bn = nn.BatchNorm2d(num_features=2)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(8, 4)
        self.softmax = nn.Softmax(dim=-1)
 
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn(x)
        x = self.act(x)
        x = self.pool(x)
        x = x.view(-1, 8)
        x = self.fc1(x)
        x = self.softmax(x)
        return x
 
# Initialize the model
model = TheModelClass()
 
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for param_tensor in model.state_dict():
    print(param_tensor, "\n", model.state_dict()[param_tensor])

output:

conv1.weight 
 tensor([[[[ 0.2438, -0.0467,  0.0486],
          [-0.1932, -0.2083,  0.3239],
          [ 0.1712,  0.0379, -0.2381]]],


        [[[ 0.2853,  0.0961,  0.0809],
          [ 0.2526,  0.3138, -0.2243],
          [-0.1627, -0.2958, -0.1995]]]])    # no requires_grad
conv1.bias 
 tensor([-0.3287, -0.0686])
bn.weight 
 tensor([1., 1.])
bn.bias 
 tensor([0., 0.])
bn.running_mean 
 tensor([0., 0.])
bn.running_var 
 tensor([1., 1.])
bn.num_batches_tracked 
 tensor(0)
fc1.weight 
 tensor([[ 0.2246, -0.1272,  0.0163, -0.3089,  0.3511, -0.0189,  0.3025,  0.0770],
        [ 0.2964,  0.2050,  0.2879,  0.0237, -0.3424,  0.0346, -0.0659, -0.0115],
        [ 0.1960, -0.2104, -0.2839,  0.0977, -0.2857, -0.0610, -0.3029,  0.1230],
        [-0.2176,  0.2868, -0.2258,  0.2992, -0.2619,  0.3286,  0.0410,  0.0152]])
fc1.bias 
 tensor([-0.0623,  0.1708, -0.1836, -0.1411])
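
A quick check (a small sketch, not in the original) confirms that the tensors returned by state_dict() do not require gradients:

print(model.state_dict()['conv1.weight'].requires_grad)   # False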

model.named_parameters()

named_parameters() yields layer_name and layer_param packed into tuples (which can be collected into a list).
It only contains the parameters that can be learned and updated, and the requires_grad attribute of the tensors returned by model.named_parameters() is True by default. It is therefore the usual way to control whether a given layer's parameters are trained: obtain the parameters through model.named_parameters() and set their requires_grad attribute.

import torch
import torch.nn as nn
import torch.optim as optim
 
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(1, 2, 3)
        self.bn = nn.BatchNorm2d(num_features=2)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(8, 4)
        self.softmax = nn.Softmax(dim=-1)
 
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn(x)
        x = self.act(x)
        x = self.pool(x)
        x = x.view(-1, 8)
        x = self.fc1(x)
        x = self.softmax(x)
        return x
 
# Initialize the model
model = TheModelClass()
 
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for layer_name, layer_param in model.named_parameters():
    print(layer_name, "\n", layer_param)

output:

conv1.weight 
 Parameter containing:
tensor([[[[ 0.2438, -0.0467,  0.0486],
          [-0.1932, -0.2083,  0.3239],
          [ 0.1712,  0.0379, -0.2381]]],


        [[[ 0.2853,  0.0961,  0.0809],
          [ 0.2526,  0.3138, -0.2243],
          [-0.1627, -0.2958, -0.1995]]]], requires_grad=True)    # requires_grad is True
conv1.bias 
 Parameter containing:
tensor([-0.3287, -0.0686], requires_grad=True)
bn.weight 
 Parameter containing:
tensor([1., 1.], requires_grad=True)
bn.bias 
 Parameter containing:
tensor([0., 0.], requires_grad=True)
fc1.weight 
 Parameter containing:
tensor([[ 0.2246, -0.1272,  0.0163, -0.3089,  0.3511, -0.0189,  0.3025,  0.0770],
        [ 0.2964,  0.2050,  0.2879,  0.0237, -0.3424,  0.0346, -0.0659, -0.0115],
        [ 0.1960, -0.2104, -0.2839,  0.0977, -0.2857, -0.0610, -0.3029,  0.1230],
        [-0.2176,  0.2868, -0.2258,  0.2992, -0.2619,  0.3286,  0.0410,  0.0152]],
       requires_grad=True)
fc1.bias 
 Parameter containing:
tensor([-0.0623,  0.1708, -0.1836, -0.1411], requires_grad=True)

model.parameters()

parameters() returns only the parameters, without the layer_name. The returned tensors carry requires_grad, and it is True for all of them, because by default the parameters created when the network is built are meant to be learned, i.e. requires_grad is True.

import torch
import torch.nn as nn
import torch.optim as optim
 
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(1, 2, 3)
        self.bn = nn.BatchNorm2d(num_features=2)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(8, 4)
        self.softmax = nn.Softmax(dim=-1)
 
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn(x)
        x = self.act(x)
        x = self.pool(x)
        x = x.view(-1, 8)
        x = self.fc1(x)
        x = self.softmax(x)
        return x
 
# Initialize the model
model = TheModelClass()
 
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for layer_param in model.parameters():
    print(layer_param)

output:

Parameter containing:
tensor([[[[ 0.2438, -0.0467,  0.0486],
          [-0.1932, -0.2083,  0.3239],
          [ 0.1712,  0.0379, -0.2381]]],


        [[[ 0.2853,  0.0961,  0.0809],
          [ 0.2526,  0.3138, -0.2243],
          [-0.1627, -0.2958, -0.1995]]]], requires_grad=True)
Parameter containing:
tensor([-0.3287, -0.0686], requires_grad=True)
Parameter containing:
tensor([1., 1.], requires_grad=True)
Parameter containing:
tensor([0., 0.], requires_grad=True)
Parameter containing:
tensor([[ 0.2246, -0.1272,  0.0163, -0.3089,  0.3511, -0.0189,  0.3025,  0.0770],
        [ 0.2964,  0.2050,  0.2879,  0.0237, -0.3424,  0.0346, -0.0659, -0.0115],
        [ 0.1960, -0.2104, -0.2839,  0.0977, -0.2857, -0.0610, -0.3029,  0.1230],
        [-0.2176,  0.2868, -0.2258,  0.2992, -0.2619,  0.3286,  0.0410,  0.0152]],
       requires_grad=True)
Parameter containing:
tensor([-0.0623,  0.1708, -0.1836, -0.1411], requires_grad=True)
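
Because parameters() yields exactly the learnable tensors, it is handy for things like counting trainable parameters; a minimal sketch (not from the original article):

# Total number of learnable scalar values in the model
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(num_trainable)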

model.named_modules()

named_modules() returns the name and structure of every module in the model, including the top-level module itself.

import torch
import torch.nn as nn
import torch.optim as optim
 
# Define the model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(1, 2, 3)
        self.bn = nn.BatchNorm2d(num_features=2)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(8, 4)
        self.softmax = nn.Softmax(dim=-1)
 
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn(x)
        x = self.act(x)
        x = self.pool(x)
        x = x.view(-1, 8)
        x = self.fc1(x)
        x = self.softmax(x)
        return x
 
# Initialize the model
model = TheModelClass()
 
# Initialize the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
for name, module in model.named_modules():
    print(name,'\n', module)

output:

 TheModelClass(
  (conv1): Conv2d(1, 2, kernel_size=(3, 3), stride=(1, 1))
  (bn): BatchNorm2d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (act): ReLU()
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=8, out_features=4, bias=True)
  (softmax): Softmax(dim=-1)
)
conv1 
 Conv2d(1, 2, kernel_size=(3, 3), stride=(1, 1))
bn 
 BatchNorm2d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
act 
 ReLU()
pool 
 MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
fc1 
 Linear(in_features=8, out_features=4, bias=True)
softmax 
 Softmax(dim=-1)
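
named_modules() is also convenient for selecting layers by type; for example (a small sketch, not from the original article), collecting the names of all convolutional layers:

conv_layers = [name for name, m in model.named_modules() if isinstance(m, nn.Conv2d)]
print(conv_layers)   # ['conv1']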

Freeze certain layers / let only certain layers learn

At this point I had the following requirement: I had trained a network on a large amount of data and then needed to test its accuracy on new subjects. With the network architecture unchanged, I wanted to borrow the idea of transfer learning: keep the trained parameters and use only a very small amount of data from the new subjects to train the model's classification head, while the other layers are not trained. In other words, I need to freeze some layers of the model, that is, train only certain layers.

Following the analysis above, I use state_dict() to read the model's parameters and save them:

model_dict = model.state_dict()

This is because the model parameters obtained with state_dict() carry no usable requires_grad attribute; as noted earlier, the requires_grad attribute of the parameter tensors stored by state_dict() is always False.

Next we remove the parameters of the layer we want to train from the saved model parameters. Only after deleting them will the to-be-trained layer keep its own fresh initialization, instead of being overwritten by the old model's parameters, when we create a new model object and load the previous model's parameters into it.

model_dict.pop('fc1.weight', None)

output:

tensor([[ 0.2246, -0.1272,  0.0163, -0.3089,  0.3511, -0.0189,  0.3025,  0.0770],
        [ 0.2964,  0.2050,  0.2879,  0.0237, -0.3424,  0.0346, -0.0659, -0.0115],
        [ 0.1960, -0.2104, -0.2839,  0.0977, -0.2857, -0.0610, -0.3029,  0.1230],
        [-0.2176,  0.2868, -0.2258,  0.2992, -0.2619,  0.3286,  0.0410,  0.0152]])

model_dict.pop('fc1.bias', None)

output:

tensor([-0.0623,  0.1708, -0.1836, -0.1411])

Then we print the parameters of the saved model:

for param_tensor in model_dict:
    print(param_tensor, "\n", model_dict[param_tensor])

output:

conv1.weight 
 tensor([[[[ 0.2438, -0.0467,  0.0486],
          [-0.1932, -0.2083,  0.3239],
          [ 0.1712,  0.0379, -0.2381]]],


        [[[ 0.2853,  0.0961,  0.0809],
          [ 0.2526,  0.3138, -0.2243],
          [-0.1627, -0.2958, -0.1995]]]])
conv1.bias 
 tensor([-0.3287, -0.0686])
bn.weight 
 tensor([1., 1.])
bn.bias 
 tensor([0., 0.])
bn.running_mean 
 tensor([0., 0.])
bn.running_var 
 tensor([1., 1.])
bn.num_batches_tracked 
 tensor(0)

We see that the parameters of the deleted layer are no longer in the state dict.
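
Equivalently (a small sketch, not part of the original flow), the keys can be filtered in one expression instead of calling pop():

# Drop every fc1 entry from the saved state dict
model_dict = {k: v for k, v in model_dict.items() if 'fc1' not in k}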

Then we create a new model object and load the parameters of the previously saved model into the new model object:

model_ = TheModelClass()
model_.load_state_dict(model_dict, strict=False)

output:

_IncompatibleKeys(missing_keys=['fc1.weight', 'fc1.bias'], unexpected_keys=[])

Now let's see what the requires_grad attributes of the new model object's parameters look like:

model_dict_ = model_.named_parameters()
for layer_name, layer_param in model_dict_:
    print(layer_name, "\n", layer_param)

output:

conv1.weight 
 Parameter containing:
tensor([[[[ 0.2438, -0.0467,  0.0486],
          [-0.1932, -0.2083,  0.3239],
          [ 0.1712,  0.0379, -0.2381]]],


        [[[ 0.2853,  0.0961,  0.0809],
          [ 0.2526,  0.3138, -0.2243],
          [-0.1627, -0.2958, -0.1995]]]], requires_grad=True)
conv1.bias 
 Parameter containing:
tensor([-0.3287, -0.0686], requires_grad=True)
bn.weight 
 Parameter containing:
tensor([1., 1.], requires_grad=True)
bn.bias 
 Parameter containing:
tensor([0., 0.], requires_grad=True)
fc1.weight 
 Parameter containing:
tensor([[-0.2306, -0.3159, -0.3105, -0.3051,  0.2721, -0.0691,  0.2208, -0.1724],
        [-0.0238, -0.1555,  0.2341, -0.2668,  0.3143,  0.1433,  0.3140, -0.2014],
        [ 0.0696, -0.0250,  0.0316, -0.1065,  0.2260, -0.1009, -0.1990, -0.1758],
        [-0.1782, -0.2045, -0.3030,  0.2643,  0.1951, -0.2213, -0.0040,  0.1542]],
       requires_grad=True)
fc1.bias 
 Parameter containing:
tensor([-0.0472, -0.0569, -0.1912, -0.2139], requires_grad=True)

We can see that the previous model's parameters have been loaded into the new model object, but the requires_grad attributes of the new model's parameters are all True, which is not what we want.

From the analysis above we can see that reading the model's parameters with state_dict(), saving them, and loading them into a new model object does not by itself achieve the effect we want; some additional steps are needed.

We can solve the above problem in two ways:

requires_grad=False

We can set the requires_grad attribute of the parameters of the layers that do not need to be trained to False:

model_dict_ = model_.named_parameters()
for layer_name, layer_param in model_dict_:
    if 'fc1' in layer_name:
        continue                              # keep the classification head (fc1) trainable
    else:
        layer_param.requires_grad = False     # freeze every other layer

Then we look at the parameters of the model:

for layer_param in model_.parameters():
    print(layer_param)

output:

Parameter containing:
tensor([[[[ 0.2438, -0.0467,  0.0486],
          [-0.1932, -0.2083,  0.3239],
          [ 0.1712,  0.0379, -0.2381]]],


        [[[ 0.2853,  0.0961,  0.0809],
          [ 0.2526,  0.3138, -0.2243],
          [-0.1627, -0.2958, -0.1995]]]])
Parameter containing:
tensor([-0.3287, -0.0686])
Parameter containing:
tensor([1., 1.])
Parameter containing:
tensor([0., 0.])
Parameter containing:
tensor([[ 0.0182,  0.1294,  0.0250, -0.1819, -0.2250, -0.2540, -0.2728,  0.2732],
        [ 0.0167, -0.0969,  0.1498, -0.1844,  0.1387,  0.2436,  0.1278, -0.1875],
        [-0.0408,  0.0786,  0.2352,  0.0277,  0.2571,  0.2782,  0.2505, -0.2454],
        [ 0.3369, -0.0804,  0.2677,  0.0927,  0.0433,  0.1716, -0.1870, -0.1738]],
       requires_grad=True)
Parameter containing:
tensor([0.1084, 0.3018, 0.1211, 0.1081], requires_grad=True)

We can see that the requires_grad attributes of the parameters of the layers that do not need to be trained have all changed to False.

These parameters can then be passed to the optimizer:

optimizer = optim.SGD(model_.parameters(), lr=0.001, momentum=0.9)
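
A common variant (a sketch, not from the original article) is to hand the optimizer only the parameters that still require gradients, so frozen parameters never appear in it at all:

# Only parameters with requires_grad=True are given to the optimizer
optimizer = optim.SGD(filter(lambda p: p.requires_grad, model_.parameters()),
                      lr=0.001, momentum=0.9)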

Only pass the parameters to be updated to the optimizer

If you do not want a certain network layer to be updated, the simpler way is not to put that layer's parameters into the optimizer at all:

optimizer = optim.SGD(model_.fc1.parameters(), lr=0.001, momentum=0.9)

Note: with this approach, gradients are still computed for the frozen parameters during backpropagation; they are simply never updated.

If this approach is adopted, memory usage can be reduced, because the optimizer stores no state for the frozen parameters. If requires_grad=False is also set in advance, the model skips gradient computation for the parameters that do not need it and runs faster, so the two methods can be used together.
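
Putting the two together, a minimal sketch (not from the original article) might look like this:

# Freeze everything except the classification head
for layer_name, layer_param in model_.named_parameters():
    if 'fc1' not in layer_name:
        layer_param.requires_grad = False

# Only hand the head's parameters to the optimizer
optimizer = optim.SGD(model_.fc1.parameters(), lr=0.001, momentum=0.9)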

References

PyTorch study notes: use state_dict to save and load models

Python advanced container collections – OrderedDict

Pytorch pre-training model loading, modifying the network structure and fixing a certain layer of parameter training, different layers use different learning rates

Original article: blog.csdn.net/qq_41990294/article/details/128942601