Training deep learning models (a big summary)


Foreword

When training models we often rely on a handful of practical techniques: saving and loading models, saving and resuming from checkpoints, freezing and warm-starting models, using pre-trained models, and single-GPU versus multi-GPU training. These situations come up constantly when training a network, and this post summarizes them.


1. Save and load the model

Method 1: Save the model structure and parameters together

# Save (vgg16 is the model; "vgg16_method1.pth" is the file name to save to)
torch.save(vgg16, "vgg16_method1.pth")
# Load
model = torch.load("vgg16_method1.pth")

Method 2: Save only the parameters (smaller files, recommended!)

# Save
torch.save(vgg16.state_dict(), "vgg16_method2.pth")
# Load
model2 = vgg16(classes=2019)                   # initialize the network structure
vgg16_dict = torch.load("vgg16_method2.pth")   # load the parameters
model2.load_state_dict(vgg16_dict)             # copy the parameters into the network

Note: when loading model weights, we must first instantiate the model class, because that class defines the structure of the network.

2. Saving and loading checkpoints

If a model takes a long time to train and a small accident interrupts training partway through, we want to resume from where we stopped instead of starting over. To do that, certain information must be saved during training so the model can continue from that point after a restart; recording this checkpoint information is therefore very important. Model training involves five parts: data, loss function, model, optimizer, and the iterative training loop. Of these, the data and the loss function do not change, but the model's learnable parameters and some buffers inside the optimizer do change during training, so they must be saved, along with the current epoch count and the learning rate (scheduler state).

# During training, save checkpoint information every 5 epochs
if (epoch + 1) % 5 == 0:
    checkpoint = {
        "net": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "lr_schedule": lr_schedule.state_dict()}
    path_checkpoint = "./checkpoint.pkl"
    torch.save(checkpoint, path_checkpoint)
    
# Load the checkpoint
path_checkpoint = "./checkpoint_4_epoch.pkl"
checkpoint = torch.load(path_checkpoint)                # load the checkpoint file
net.load_state_dict(checkpoint['net'])                  # load the model's learnable parameters
optimizer.load_state_dict(checkpoint['optimizer'])      # load the optimizer state
start_epoch = checkpoint['epoch']                       # set the starting epoch
lr_schedule.load_state_dict(checkpoint['lr_schedule'])  # load the lr scheduler state
lr_schedule.last_epoch = start_epoch

Before running inference, model.eval() must be called to set the dropout and batch normalization layers into evaluation mode. Failure to do so can produce inconsistent inference results. If you want to resume training, call model.train() to ensure that the layers are in training mode.
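For example, a minimal sketch of switching between the two modes (it assumes the net from the checkpoint code above and a DataLoader named val_loader, which is not defined here):

net.eval()                            # put dropout and batchnorm layers into evaluation mode
with torch.no_grad():                 # gradients are not needed for inference
    for imgs, targets in val_loader:
        preds = net(imgs)

net.train()                           # switch back to training mode before resuming training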

3. Using pre-trained models

PyTorch ships with a number of advanced, complex model architectures in torchvision, and there are two ways to use them.
Method 1: call a function from torchvision.models; for example, torchvision.models.densenet169(pretrained=True) loads the pre-trained DenseNet-169. For the full list of models this module covers, see https://pytorch.org/vision/stable/models.html. Take resnet50 as an example:

from torchvision.models import resnet50, ResNet50_Weights

# Use the best available pre-trained weights
model = resnet50(weights=ResNet50_Weights.DEFAULT)

# Strings are also supported
model = resnet50(weights="IMAGENET1K_V2")

# No weights - random initialization
model = resnet50(weights=None)

Method 2: Download the trained parameter file yourself, then load it directly into the network.
The download links are on the same page linked above; scroll down to find them. They are not limited to classification models: weights for semantic segmentation, quantization, object detection, instance segmentation, keypoint detection and more are all available there.
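A minimal sketch of this approach (the file name below is only a placeholder for whatever weight file you downloaded):

import torch
from torchvision.models import resnet50

model = resnet50(weights=None)                        # build the architecture with random weights
state_dict = torch.load("./resnet50_downloaded.pth")  # the manually downloaded parameter file (placeholder name)
model.load_state_dict(state_dict)                     # copy the downloaded parameters into the network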

4. Freezing the model

Loading part of a model is a common situation in transfer learning or when training a new complex model. Reusing trained parameters warm-starts the training process and usually helps the model converge faster than training from scratch.

Method 1: Set requires_grad to False

The effect of this method: a frozen layer still takes part in the forward and backward passes, but its own parameters are not updated, while the parameters of the other, unfrozen layers are updated normally. Note that the optimizer should be given a filter (other articles say the filter is required, but in my own tests the code also seems to work without it):

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, perception.parameters()), lr=learning_rate)

Also note: the optimizer object should be constructed after the freezing operation.
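A minimal sketch of this ordering (the backbone attribute here is only an illustrative submodule name, not something defined in this post):

# 1) freeze first
for p in model.backbone.parameters():
    p.requires_grad = False
# 2) then build the optimizer over only the still-trainable parameters
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)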

There are two ways to set requires_grad to False:
①: x.requires_grad_(False)
②: x.requires_grad = False
The two are equivalent.
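A tiny illustration of the two forms on a throwaway layer (the nn.Linear layer is only for demonstration):

import torch
from torch import nn

layer = nn.Linear(4, 4)
layer.weight.requires_grad_(False)   # method ①: call the in-place method
layer.bias.requires_grad = False     # method ②: assign the attribute directly
print(layer.weight.requires_grad, layer.bias.requires_grad)   # False False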

Here is a full example:

import torch
import numpy as np
from torch import nn
from torch.optim import lr_scheduler

# Check whether a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")


# First build a convolutional network block
class Bulk(nn.Module):
    def __init__(self,in_channels,out_channels):
        super(Bulk, self).__init__()
        self.bulk_6 = nn.Sequential(
            nn.Conv2d(in_channels,in_channels*2, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(in_channels*2),
            nn.Conv2d(in_channels*2, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(out_channels)
        )

    def forward(self, x):
        logits = self.bulk_6(x)
        return logits


class module(nn.Module):
    def __init__(self,in_C, hid1, hid2, out_C):
        super(module, self).__init__()
        self.bulk1 = Bulk(in_C,hid1)
        self.bulk2 = Bulk(hid1, hid2)
        self.bulk3 = Bulk(hid2, out_C)

    def forward(self, x):
        x1 = self.bulk1(x)
        x2 = self.bulk2(x1)
        y = self.bulk3(x2)
        return y


def save_breakpoint(model, lr_schedule):
    # Save checkpoint information every 5 epochs (epoch and optimizer are globals here)
    if (epoch + 1) % 5 == 0:
        checkpoint = {
            "net": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
            "lr_schedule": lr_schedule.state_dict()}
        path_checkpoint = "./checkpoint.pkl"
        torch.save(checkpoint, path_checkpoint)


# Instantiate the model
module_1 = module(1, 2, 2, 1)

# Create input/output data
X = np.random.normal(1, 0.5, (32, 32))
Y = X*3 + np.random.normal(0, 0.1, (32, 32))
x = torch.tensor(X, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
y = torch.tensor(Y, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
loss_fn = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, module_1.parameters()), lr=1e-2)

epochs=2

for epoch in range(epochs):
    # Freeze/unfreeze the model
    if epoch == 0:
        for k, v in module_1.named_parameters():
            if any(x in k.split('.') for x in ['0']):
                print('unfreezing %s' % k)
                v.requires_grad_(True)
    else:
        for k, v in module_1.named_parameters():
            if any(x in k.split('.') for x in ['0']):
                print('freezing %s' % k)
                v.requires_grad_(False)
    # Inspect the parameters before the update
    for k, v in module_1.bulk1.bulk_6.named_parameters():
        if any(x in k.split('.') for x in ['0']):
            print(k, v)
    y_pred = module_1(x)                     # forward pass
    loss = loss_fn(y_pred, y)                # compute the loss
    optimizer.zero_grad()                    # zero the gradients
    loss.backward()                          # backpropagate to compute gradients
    # Rebuild the optimizer so it only holds the currently trainable parameters
    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, module_1.parameters()), lr=1e-2)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.995)
    optimizer.step()                         # update the parameters
    scheduler.step()                         # step the scheduler once per epoch to update the learning rate
    save_breakpoint(module_1, scheduler)     # save a checkpoint
    # print('lr={0}, Loss={1}'.format(optimizer.state_dict()['param_groups'][0]['lr'], loss.item()))
    for k, v in module_1.bulk1.bulk_6.named_parameters():  # inspect the parameters after the update
        if any(x in k.split('.') for x in ['0']):
            print(k, v)

torch.save(module_1.state_dict(),"freezing.pth")
  • When gradients are updated normally, all layers take part in training. (Figure: the bulk_6.0 layer's parameters before and after training.)
  • When its gradient updates are stopped, the bulk_6.0 layer does not take part in training. (Figure: the bulk_6.0 layer's parameters before and after training.)

Method 2: Use with torch.no_grad()

The nature of this method: simply place the layers that need to be frozen under with torch.no_grad() inside the forward method of the network definition.
Layers placed inside with torch.no_grad() can still propagate forward, but backpropagation through them is blocked: the parameters of that layer (e.g. self.layer2) and of all earlier layers that only feed into it (e.g. self.layer1) are frozen and will not be updated. However, if an earlier layer is also connected to other layers besides self.layer2, the part connected to those other layers is still updated normally.
Again, look directly at the example:

# First build a convolutional network block
class Bulk(nn.Module):
    def __init__(self,in_channels,out_channels):
        super(Bulk, self).__init__()
        self.bulk_6 = nn.Sequential(
            nn.Conv2d(in_channels,in_channels*2, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(in_channels*2),
            nn.Conv2d(in_channels*2, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(out_channels)
        )

    def forward(self, x):
        logits = self.bulk_6(x)
        return logits


class module(nn.Module):
    def __init__(self,in_C, hid1, hid2, out_C):
        super(module, self).__init__()
        self.bulk1 = Bulk(in_C,hid1)
        self.bulk2 = Bulk(hid1, hid2)
        self.bulk3 = Bulk(hid2, out_C)

    def forward(self, x):
        with torch.no_grad():
            x1 = self.bulk1(x)
        x2 = self.bulk2(x1)
        y = self.bulk3(x2)
        return y

# Instantiate the model
module_1 = module(1, 2, 2, 1)

# Create input/output data
X = np.random.normal(1, 0.5, (32, 32))
Y = X*3 + np.random.normal(0, 0.1, (32, 32))
x = torch.tensor(X, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
y = torch.tensor(Y, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
loss_fn = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, module_1.parameters()), lr=1e-2)

epochs=2

for epoch in range(epochs):

    # Inspect the parameters before the update
    for k, v in module_1.bulk1.bulk_6.named_parameters():
        if any(x in k.split('.') for x in ['0']):
            print(k, v)
    y_pred = module_1(x)                     # forward pass
    loss = loss_fn(y_pred, y)                # compute the loss
    optimizer.zero_grad()                    # zero the gradients
    loss.backward()                          # backpropagate to compute gradients
    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, module_1.parameters()), lr=1e-2)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.995)
    optimizer.step()                         # update the parameters
    scheduler.step()                         # step the scheduler once per epoch to update the learning rate
    # print('lr={0}, Loss={1}'.format(optimizer.state_dict()['param_groups'][0]['lr'], loss.item()))
    for k, v in module_1.bulk1.bulk_6.named_parameters():  # inspect the parameters after the update
        if any(x in k.split('.') for x in ['0']):
            print(k, v)
  • The model in this example consists of three network blocks; bulk1 is wrapped in with torch.no_grad() and does not take part in parameter updates. The experiment confirms this: the weights of the layers inside bulk1 are unchanged after running an epoch. (Figure: bulk1 parameters before and after training.)

Summary

  • Method 1 is more flexible, but more cumbersome to write.
  • Method 2 cannot freeze a layer for only certain epochs, but it is simple and convenient to use, and in most cases it is sufficient.

I also recommend another article on this topic: "[pytorch] How to filter and freeze some network layer parameters while setting parameter groups at the same time?"

5. Special loading methods and tricks

Example 1: Load a pre-trained model and drop the layers that need to be retrained

Note: the layers that need to be retrained should be given names different from the ones in the pre-trained model.

model = resnet()   # your own model; resnet is used as an example here
model_dict = model.state_dict()
pretrained_dict = torch.load('xxx.pkl')
# keep only the pre-trained entries whose names also exist in the new model
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)

Example 2: Freeze some parameters

# k is the name of a trainable parameter; v is the parameter tensor itself.
# You can print(k) first to find the layers you want to adjust,
# then add those layer names to the if statement:
for k, v in model.named_parameters():
    if k != 'xxx.weight' and k != 'xxx.bias':
        v.requires_grad = False   # freeze the parameter

Example 3: Train only some parameters

# put only the parameters that should be trained into the optimizer
optimizer2 = torch.optim.Adam(params=[model.xxx.weight, model.xxx.bias], lr=learning_rate, betas=(0.9, 0.999), weight_decay=1e-5)

Example 4: Check whether some parameters are fixed

for k, v in model.named_parameters():
    if k != 'xxx.weight' and k != 'xxx.bias':
        print(v.requires_grad)   # ideally, every printed value is False

Note that if something has gone wrong, the loss will barely change and stay near its initial value; this is usually a sign that all of the parameters have been frozen.
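As a quick sanity check, you can count how many parameters are still trainable; a small sketch (assuming model is the network in question):

num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
num_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {num_trainable} / {num_total}")   # should not be 0 if you still intend to train something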

6. Single GPU training and multi-GPU training

A GPU can process large-scale matrix data 50 to 100 times faster than a CPU, so it is well worth running training on a GPU.

Training with a single GPU in PyTorch

Training on a GPU only requires changing a few places in the original code: move the model and data that need to run on the GPU, and everything else stays the same as the CPU program. There are two ways to implement this.

Method 1: .cuda()
  • We can train on the GPU by calling .cuda() on three things: the network model, the data, and the loss function.
# Instantiate the network model and the loss function
model = Model()
loss_fn = nn.CrossEntropyLoss()
# Move the network model and loss function to the GPU
if torch.cuda.is_available():    # check whether a GPU is available
    model = model.cuda()
    loss_fn = loss_fn.cuda()

# Move the data to the GPU
for data in dataloader:
    imgs, targets = data
    if torch.cuda.is_available():
        imgs = imgs.cuda()
        targets = targets.cuda()
Method 2: .to(device)
  • The approach is similar to the one above, so I won't repeat the explanation; here is the code.
# Specify the device to train on
device = torch.device("cpu")      # train on the CPU
device = torch.device("cuda")     # train on the GPU
device = torch.device("cuda:0")   # with multiple GPUs, use the first one
device = torch.device("cuda:1")   # with multiple GPUs, use the second one

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Train on the GPU
model = model.to(device)
loss_fn = loss_fn.to(device)
for data in train_dataloader:
    imgs, targets = data
    imgs = imgs.to(device)
    targets = targets.to(device)

Single-machine multi-GPU and multi-machine multi-GPU

I cannot demonstrate this part due to hardware limitations, so I will not cover it here. Instead, I recommend an article on the topic that I think is well written; readers who need it can jump over and read it:
"Super Full Pytorch Multi-GPU Training": https://blog.csdn.net/Ema1997/article/details/106284407


Origin: blog.csdn.net/Gw2092330995/article/details/126770170