ch07-PyTorch training skills

0 Preface

1. Model saving and loading

This section mainly introduces serialization and deserialization, as well as the two ways of saving and loading the model in PyTorch, and the model's breakpoint continued training.

1.1. Serialization and deserialization

In memory, a model exists as an object with a logical structure; on the hard disk, it is stored as a binary stream.

  • Serialization refers to saving data in memory to the hard disk as a binary sequence. Saving a model in PyTorch is serialization.
  • Deserialization refers to loading a binary sequence from the hard disk back into memory to obtain the model object. Loading a model in PyTorch is deserialization.


1.2. Model saving and loading in PyTorch


(1) Model saving torch.save

  • torch.save(obj, f, pickle_module, pickle_protocol=2, _use_new_zipfile_serialization=False)
  • The main parameters:
    • obj: the object to save, which can be a model or a dict. Usually when saving a model you save not only the model itself but also the optimizer state and the current epoch, so they can be wrapped in a dict.
    • f: output path

There are two ways to save the model:

  • Save the entire Module: this method is more time-consuming and produces a larger file: torch.save(net, path)
  • Save only the model parameters: this method is recommended because it is faster and produces a smaller file:
    state_dict = net.state_dict()
    torch.save(state_dict, path)
    

Below is an example of saving LeNet. In the network's initialize() method the weights are set to 2020, and then the model is saved.

import torch
import numpy as np
import torch.nn as nn
from common_tools import set_seed

class LeNet2(nn.Module):
    def __init__(self, classes):
        super(LeNet2, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, 5),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Linear(16*5*5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size()[0], -1)
        x = self.classifier(x)
        return x

    def initialize(self):
        for p in self.parameters():
            p.data.fill_(2020)

net = LeNet2(classes=2019)
# "training"
print("Before training: ", net.features[0].weight[0, ...])
net.initialize()
print("After training: ", net.features[0].weight[0, ...])

path_model = "./model.pkl"
path_state_dict = "./model_state_dict.pkl"
# save the entire model
torch.save(net, path_model)
# save only the model parameters
net_state_dict = net.state_dict()
torch.save(net_state_dict, path_state_dict)

After running, model.pkl and model_state_dict.pkl are generated in the folder, saving the entire network and the network parameters respectively.

(2) Model loading torch.load

  • torch.load(f, map_location=None, pickle_module, **pickle_load_args)
  • The main parameters:
    • f: file path
    • map_location: specifies the device (CPU or GPU) to which the loaded tensors are mapped.

There are also two ways to load models:

  • Load the entire Module

If the entire model was saved, then the entire model is loaded. This method does not require creating a model object in advance (although the class definition must still be importable), nor do you need to rebuild the model structure yourself. The code is as follows:

path_model = "./model.pkl"
net_load = torch.load(path_model)

print(net_load)

The output is as follows:

LeNet2(
  (features): Sequential(
    (0): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classifier): Sequential(
    (0): Linear(in_features=400, out_features=120, bias=True)
    (1): ReLU()
    (2): Linear(in_features=120, out_features=84, bias=True)
    (3): ReLU()
    (4): Linear(in_features=84, out_features=2019, bias=True)
  )
)
  • Only load model parameters

If only the model's parameters were saved, then only the parameters are loaded. This method requires creating a model object in advance and then loading the parameters into it with the model's load_state_dict() method. The code is as follows:

path_state_dict = "./model_state_dict.pkl"
state_dict_load = torch.load(path_state_dict)
net_new = LeNet2(classes=2019)

print("加载前: ", net_new.features[0].weight[0, ...])
net_new.load_state_dict(state_dict_load)
print("加载后: ", net_new.features[0].weight[0, ...])

1.3. Resuming training from a checkpoint

During training, the process may be terminated unexpectedly (for example, a crash or power failure), and training then has to be restarted. Checkpoint resumption saves the model parameters and optimizer state every certain number of epochs, so that if training is interrupted, the latest model and optimizer states can be reloaded and training can continue from that point.

In the code below, a checkpoint is saved every 5 epochs. What is saved is a dict containing the model parameters, the optimizer state, and the epoch. Then, when the epoch is greater than 5, a break simulates an unexpected termination of training. The key code is as follows:

    if (epoch+1) % checkpoint_interval == 0:
        checkpoint = {"model_state_dict": net.state_dict(),
                      "optimizer_state_dict": optimizer.state_dict(),
                      "epoch": epoch}
        path_checkpoint = "./checkpoint_{}_epoch.pkl".format(epoch)
        torch.save(checkpoint, path_checkpoint)

When epoch is greater than 5, break simulates an unexpected termination of training.

    if epoch > 5:
        print("Training interrupted unexpectedly...")
        break

The recovery code for resume training from a breakpoint is as follows:

path_checkpoint = "./checkpoint_4_epoch.pkl"
checkpoint = torch.load(path_checkpoint)

net.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
scheduler.last_epoch = start_epoch

Note that scheduler.last_epoch must also be set to the saved epoch, and the starting epoch of the training loop must be changed so that training resumes from the saved epoch, as sketched below.
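A minimal sketch of how the resumed loop might start; MAX_EPOCH, criterion, train_loader, net, optimizer and scheduler are assumed to exist as in the original training script:

# hedged sketch: resume training from the epoch stored in the checkpoint
for epoch in range(start_epoch + 1, MAX_EPOCH):
    for i, data in enumerate(train_loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # the scheduler continues from scheduler.last_epoch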

2. Model Finetune

  • The main task of this section: understand transfer learning and model finetune.

  • In detail: learn the method of model fine-tuning (Finetune) and understand the relationship between Transfer Learning and Model Finetune.

2.1. Transfer Learning & Model Finetune

Transfer learning: A branch of machine learning that studies how knowledge from the source domain is applied to the target domain.


Model fine-tuning is in fact transfer learning applied to a model. In deep learning, the weights in the convolutional layers are updated through repeated iterations. These weights can be regarded as knowledge, and this knowledge can be transferred. The main purpose of applying it to a new model is to reduce the overfitting caused by insufficient data and to speed up training.


For example, for face recognition, ImageNet can be regarded as the source domain and the face dataset as the target domain. Generally, the source domain is much larger than the target domain, and a network trained on ImageNet can be reused for face recognition.

Specifically, for a convolutional neural network, the earlier convolutional and pooling layers can be regarded as a feature extractor: a very generic part that produces a series of feature maps. The fully connected layers that follow can be called the classifier, which is tied to the specific task. Finetune changes the output size of the last fully connected layer to match the target task and trains the weights of this classifier. Usually the target-domain data is small and not enough to train all parameters, which easily leads to overfitting, so the weights of the feature extractor are left unchanged.


Finetune steps are as follows (a condensed sketch follows this list):

  • 1. Obtain the parameters of the pre-trained model
  • 2. Load the parameters into the model with load_state_dict()
  • 3. Modify the output layer
  • 4. Fix the parameters of the feature extractor. There are usually two approaches:
    • 1. Freeze the pre-trained parameters of the convolutional layers: set requires_grad=False or lr=0
    • 2. Set a smaller learning rate for the feature extractor through a parameter group (params_group)
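Putting steps 1 to 3 (and optionally 4) together, a minimal sketch might look like the following. The weight file path and the number of classes are assumptions for illustration; the experiments below construct the model in the same spirit.

import torch
import torch.nn as nn
from torchvision import models

num_classes = 2  # ants vs. bees (assumed for this example)

# 1. obtain the parameters of the pre-trained model (file path is an assumption)
state_dict_load = torch.load("./resnet18-5c106cde.pth")

# 2. load the parameters into the model
resnet18_ft = models.resnet18()
resnet18_ft.load_state_dict(state_dict_load)

# 3. replace the output layer to match the target task
num_ftrs = resnet18_ft.fc.in_features
resnet18_ft.fc = nn.Linear(num_ftrs, num_classes)

# 4. (optional) freeze the feature extractor, leaving only the new fc layer trainable
for name, param in resnet18_ft.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False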

Next, ResNet-18 is fine-tuned for binary classification of bee and ant images. The training set contains 120 images per class, and the validation set contains 70 images per class.


The ResNet-18 model structure is shown in the figure below:

The first four layers (conv1, bn1, relu, maxpool) perform the initial feature extraction, the next four stages (layer1~layer4) are residual blocks, followed by the avgpool pooling layer, and finally the fc layer for classification (the original model has 1000 classes, trained on ImageNet).

[Figure: ResNet-18 structure]

(1) Not using Finetune

For the first comparison we do not use Finetune and instead train the model from scratch. In this case we only need to replace the fully connected layer:

# first get the number of input features of the fc layer
num_ftrs = resnet18_ft.fc.in_features
# then build a new fc layer to replace the original one
resnet18_ft.fc = nn.Linear(num_ftrs, classes)

The output is as follows:

use device :cpu
Training:Epoch[000/025] Iteration[010/016] Loss: 0.7192 Acc:47.50%
Valid:  Epoch[000/025] Iteration[010/010] Loss: 0.6885 Acc:51.63%
...
Valid:  Epoch[024/025] Iteration[010/010] Loss: 0.5923 Acc:70.59%

The accuracy after training for 25 epochs: 70.59%.

The training loss curve is as follows:

[Figure: training loss curve (training from scratch)]

The loss value stays around 0.6, and the accuracy reaches only about 70%.

(2) Use Finetune

Next, we load the downloaded pre-trained parameters into the model:

path_pretrained_model = enviroments.resnet18_path
state_dict_load = torch.load(path_pretrained_model)
resnet18_ft.load_state_dict(state_dict_load)

Do not freeze convolutional layers

At this time, we do not freeze the convolutional layer, all layers use the same learning rate, and the output is as follows:

use device :cpu
Training:Epoch[000/025] Iteration[010/016] Loss: 0.6299 Acc:65.62%
...
Valid:  Epoch[024/025] Iteration[010/010] Loss: 0.1808 Acc:96.08%

The accuracy after training for 25 epochs: 96.08%.

The training loss curve is as follows:

It can be seen that the loss value finally converges to around 0.2, and the accuracy reaches 90% by the second epoch.

[Figure: training loss and accuracy curves with Finetune]

2.2. Finetune in PyTorch


  • Freeze the convolutional layers

  • Method 1: set requires_grad=False

Here, all parameters are frozen first, and then the fully connected layer is replaced with a new one (whose parameters are trainable by default), which is equivalent to freezing only the convolutional layers:

for param in resnet18_ft.parameters():
    param.requires_grad = False

# first get the number of input features of the fc layer
num_ftrs = resnet18_ft.fc.in_features
# then build a new fc layer to replace the original one (its parameters are trainable again)
resnet18_ft.fc = nn.Linear(num_ftrs, classes)

Experimental results are not provided here.


  • Set learning rate to 0

Here, the learning rate of the convolutional layers is set to 0, which requires per-group learning rates in the optimizer. First obtain the memory addresses of the fully connected layer parameters, then use filter to drop those parameters, keeping only the convolutional layer parameters; then set the optimizer's group learning rates by passing in a list of 2 dictionaries, one per parameter group. The learning rate of the convolutional layers is set to 0.

# first get the ids (memory addresses) of the fc layer parameters
fc_params_id = list(map(id, resnet18_ft.fc.parameters()))
# then use filter to drop the parameters whose ids belong to the fc layer, i.e. keep the conv layer parameters
base_params = filter(lambda p: id(p) not in fc_params_id, resnet18_ft.parameters())
# set per-group learning rates in the optimizer: a list of 2 dicts, one per parameter group
optimizer = optim.SGD([{'params': base_params, 'lr': 0},
                       {'params': resnet18_ft.fc.parameters(), 'lr': LR}], momentum=0.9)

Experimental results are not provided here.


  • Use group learning rate

Here, the convolutional layers are not frozen; instead a smaller learning rate is used for them and a larger one for the fully connected layer. As above, per-group learning rates are set in the optimizer: get the addresses of the fully connected layer parameters, filter them out to keep the convolutional layer parameters, and pass two parameter groups to the optimizer. The learning rate of the convolutional layers is set to 0.1 times that of the fully connected layer.

# first get the ids (memory addresses) of the fc layer parameters
fc_params_id = list(map(id, resnet18_ft.fc.parameters()))
# then use filter to drop the parameters whose ids belong to the fc layer, i.e. keep the conv layer parameters
base_params = filter(lambda p: id(p) not in fc_params_id, resnet18_ft.parameters())
# set per-group learning rates: conv layers use 0.1 * LR, the fc layer uses LR
optimizer = optim.SGD([{'params': base_params, 'lr': LR*0.1},
                       {'params': resnet18_ft.fc.parameters(), 'lr': LR}], momentum=0.9)

Experimental results are not provided here.


  • Tips for using GPU

Using the GPU with a PyTorch model can be divided into 3 steps (combined in the sketch after this list):

  • First get the device: device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  • Move the model to the device: model.to(device)
  • In the data-loading loop, move the data and labels of each mini-batch to the device: inputs, labels = inputs.to(device), labels.to(device)
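A minimal sketch combining these three steps; model, train_loader and criterion are assumed to already exist:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # move the model parameters to the chosen device

for inputs, labels in train_loader:
    # move each mini-batch to the same device as the model
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)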

3. Use GPU to train the model

This section mainly introduces the use of GPU.

3.1.CPU to GPU

  • CPU (Central Processing Unit): mainly consists of the controller and arithmetic units
  • GPU (Graphics Processing Unit): handles uniform, mutually independent, large-scale data operations

The structure diagrams of the two are shown below. You can see that the green parts (computing units) are clearly more numerous in the GPU than in the CPU.

[Figure: CPU vs. GPU architecture]

The GPU has more ALUs (arithmetic logic units) than the CPU, while the CPU has more cache, which is used to speed up program execution. The two are suited to different tasks: computationally intensive and easily parallelizable programs are usually run on the GPU.

3.2. Data migration to GPU

When two tensors are involved in an operation, they must be on the same device: either both on the CPU or both on the GPU. Likewise, the data and the model must be on the same device. Data and models can be moved from one device to another with the to() method, and the to() method of a tensor can also convert its data type.


  • From CPU to GPU
    device = torch.device("cuda")
    tensor = tensor.to(device)
    module.to(device)
    
  • From GPU to CPU
    device = torch.device("cpu")
    tensor = tensor.to("cpu")
    module.to("cpu")
    

  • The .to() function: converts data type or device
x = torch.ones((3, 3))        # define a tensor
x = x.to(torch.float64)       # convert the default float32 to float64
x = torch.ones(3, 3)          # define a tensor
x = x.to("cuda")              # move it to the GPU

linear = nn.Linear(2, 2)      # define a module
linear.to(torch.double)       # convert all parameters in the module from the default float32 to float64 (double is float64)

gpu1 = torch.device("cuda")   # define the device
linear.to(gpu1)               # move the module to the GPU
  • Note that in the two examples above, the tensor must be reassigned with the equals sign, whereas the module can simply call the to() function directly.

  • The difference between the to() methods of tensor and module is that tensor.to() does not perform an inplace operation, so assignment is required; module.to() performs an inplace operation.

  • tensor.to() and module.to()

First import the library to obtain the GPU device

import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
  • The following code executes the to() method of a tensor:
x_cpu = torch.ones((3, 3))
print("x_cpu:\ndevice: {} is_cuda: {} id: {}".format(x_cpu.device, x_cpu.is_cuda, id(x_cpu)))

x_gpu = x_cpu.to(device)
print("x_gpu:\ndevice: {} is_cuda: {} id: {}".format(x_gpu.device, x_gpu.is_cuda, id(x_gpu)))

The output is as follows:

x_cpu:
device: cpu is_cuda: False id: 1415020820304
x_gpu:
device: cuda:0 is_cuda: True id: 2700061800153

You can see that Tensor's to() method is not an inplace operation, and the memory addresses of x_cpu and x_gpu are different.

  • The following code executes the to() method of Module
net = nn.Sequential(nn.Linear(3, 3))

print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

net.to(device)
print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

The output is as follows:

id:2325748158192 is_cuda: False
id:2325748158192 is_cuda: True

You can see that the to() method of Module is an inplace operation, and the memory address is the same.


torch.cuda common methods

  • torch.cuda.device_count(): Returns the number of currently visible and available GPUs

  • torch.cuda.get_device_name(): Get the GPU name

  • torch.cuda.manual_seed(): Set a random seed for the current GPU

  • torch.cuda.manual_seed_all(): Set random seeds for all visible GPUs

  • torch.cuda.set_device(): Set which physical GPU the main GPU is. This method is not recommended.

  • os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3"): set the visible GPUs
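A short sketch of how these might be used (the GPU ids are assumed examples; the output depends on the machine):

import os
import torch

# restrict the visible GPUs before any CUDA call is made
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")

if torch.cuda.is_available():
    print(torch.cuda.device_count())      # number of visible GPUs
    print(torch.cuda.get_device_name(0))  # name of logical GPU 0
    torch.cuda.manual_seed_all(1)         # seed every visible GPU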

In PyTorch, there are physical GPUs and logical GPUs, and the correspondence between them can be set.

[Figure: correspondence between physical and logical GPUs]

In the figure above, if os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3") is executed, only 2 GPUs are visible. The correspondence is as follows:

[Figure: GPU mapping with CUDA_VISIBLE_DEVICES="2,3"]

If os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,3,2") is executed, only 3 GPUs are visible. The correspondence is as follows:

[Figure: GPU mapping with CUDA_VISIBLE_DEVICES="0,3,2"]

The reason for this setting is that many users and tasks may be sharing the GPUs in a system; setting the visible GPU ids allows GPUs to be allocated sensibly. Usually gpu0 is the main GPU by default. The concept of a main GPU is related to the multi-GPU distributed parallel mechanism.

3.3. Multi-GPU distributed parallelism

Generally speaking, multi-GPU parallel computation has three steps: distribution → parallel computation → result collection

  • Distribution: the main GPU distributes the data to each GPU
  • Parallel computation: each GPU performs its computation separately
  • Result collection: each GPU sends its results back to the main GPU

PyTorch implementation:

  • torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)

  • Function: wraps a model to implement the distributed parallel mechanism. The data is split evenly across the GPUs, so the amount of data each GPU actually processes per step is batch_size / (number of GPUs), which achieves parallel computation.

  • The main parameters:

    • module: model that needs to be packaged and distributed
    • device_ids: Distributable GPUs, distributed to all visible and available GPUs by default
    • output_device: result output device

Note that when using DataParallel, the device must specify one GPU as the main GPU, otherwise the following error is reported:

RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:2

This is because using multiple GPUs requires a main GPU to distribute each batch of data to the GPUs and to collect the computed results from them. If the main GPU is not specified, the data is scattered directly, leaving some data on one GPU and other data on other GPUs, which causes computation errors.

The following code makes two GPUs visible and sets batch_size to 16, so each GPU receives 8 samples per batch; the per-GPU batch size is printed in the model's forward pass.

# make 2 GPUs visible
gpu_list = [0, 1]
gpu_list_str = ','.join(map(str, gpu_list))
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
# Note: a GPU must be specified here as the main GPU.
# Otherwise the error is raised:
# module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found one of them on device: cuda:2
# Reference: https://stackoverflow.com/questions/59249563/runtimeerror-module-must-have-its-parameters-and-buffers-on-device-cuda1-devi
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

batch_size = 16

# data
inputs = torch.randn(batch_size, 3)
labels = torch.randn(batch_size, 3)
inputs, labels = inputs.to(device), labels.to(device)

# model
net = FooNet(neural_num=3, layers=3)
net = nn.DataParallel(net)
net.to(device)

# training
for epoch in range(1):
    outputs = net(inputs)
    print("model outputs.size: {}".format(outputs.size()))

print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
print("device_count :{}".format(torch.cuda.device_count()))

The output is as follows:

batch size in forward: 8
model outputs.size: torch.Size([16, 3])
CUDA_VISIBLE_DEVICES :0,1
device_count :2
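FooNet is defined elsewhere in the original script; a minimal sketch consistent with the printed "batch size in forward" line might look like this (an assumption, not the author's exact code):

import torch
import torch.nn as nn

class FooNet(nn.Module):
    """Toy fully connected net that prints the per-GPU batch size."""
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList(
            [nn.Linear(neural_num, neural_num, bias=False) for _ in range(layers)])

    def forward(self, x):
        # under nn.DataParallel each replica only sees batch_size / n_gpus samples
        print("batch size in forward: {}".format(x.size()[0]))
        for linear in self.linears:
            x = torch.relu(linear(x))
        return x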

The code below sorts based on GPU remaining memory.

# assumes: import os, numpy as np, torch (as earlier in this chapter)
def get_gpu_memory():
    import platform
    if 'Windows' != platform.system():
        import os
        os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
        memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
        os.system('rm tmp.txt')
    else:
        memory_gpu = False
        print("Querying GPU memory is not supported on Windows for now")
    return memory_gpu


gpu_memory = get_gpu_memory()
if gpu_memory:
    print("\ngpu free memory: {}".format(gpu_memory))
    gpu_list = np.argsort(gpu_memory)[::-1]

    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Here, nvidia-smi -q -d Memory queries the memory information of all GPUs: -q means query, and -d specifies what to query.

nvidia-smi -q -d Memory | grep -A4 GPU keeps every line containing "GPU" together with the 4 lines after it, as follows:

Attached GPUs                       : 2
GPU 00000000:1A:00.0
    FB Memory Usage
        Total                       : 24220 MiB
        Used                        : 845 MiB
        Free                        : 23375 MiB
--
GPU 00000000:68:00.0
    FB Memory Usage
        Total                       : 24217 MiB
        Used                        : 50 MiB
        Free                        : 24167 MiB

nvidia-smi -q -d Memory | grep -A4 GPU | grep Free then extracts the lines containing "Free", i.e. the remaining memory, as follows:

        Free                        : 23375 MiB
        Free                        : 24167 MiB

nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt saves the remaining-memory information to tmp.txt.

[int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()] processes each line with a list comprehension.

Assuming x=" Free : 23375 MiB", then x.split() splits by spaces by default, and the result is:

['Free', ':', '23375', 'MiB']

The result of x.split()[2] is 23375.

Assume gpu_memory=['5','9','3']. Then np.argsort(gpu_memory) returns array([2, 0, 1], dtype=int64), the indices sorted from smallest to largest. np.argsort(gpu_memory)[::-1] returns array([1, 0, 2], dtype=int64), i.e. the order of the elements reversed.

In Python, list[<start>:<stop>:<step>] takes the elements from start to stop with stride step. step=-1 means taking the elements from stop back to start. start defaults to the first element and stop defaults to the last.

The result of ','.join(map(str, gpu_list)) is '1,0,2'.

Finally, os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str) sets the mapping in order of remaining GPU memory from largest to smallest, so that the GPU with the most free memory becomes the main GPU by default.


Increase GPU utilization

Use the nvidia-smi command to view GPU utilization, as shown in the figure below.

[Figure: nvidia-smi output]

In the screenshot above there are two graphics cards (GPUs). The upper part shows information about the cards, and the lower part shows the processes running on each card. You can see that GPU 0 is running the process with PID 14383. Memory Usage indicates the usage of video memory: GPU 0 uses 16555 MB of video memory, a usage of roughly 70%. Volatile GPU-Util indicates the actual utilization of the GPU's computing power: GPU 0 is only at 27%.

Although using the GPU can accelerate model training, if the Memory Usage and Volatile GPU-Util of the GPU are too low, it means that the GPU is not fully utilized.

Therefore, when using a GPU to train a model, you need to maximize the two indicators of GPU Memory Usage and Volatile GPU-Util, which can further speed up your training process.

Let’s talk about how to improve these two indicators.

Memory Usage

This indicator is determined mainly by the model size and the amount of data in each batch.

The model size is determined by the parameters and network structure of the network. The larger the model, the slower the training.

What we mainly adjust is the size of the training data for each batch, which is batch_size.

When the model structure is fixed, try to set the batch size as large as possible to make full use of the GPU memory.


Volatile GPU-Util

Setting a relatively large batch size increases GPU memory usage, but it does not necessarily increase the utilization of the GPU's computing units.

As seen earlier, data is first read on the CPU and, during the training loop, moved from the CPU to the GPU with the tensor.to() method, as in the following code.

# iterate over train_loader to fetch data
for i, data in enumerate(train_loader):
    inputs, labels = data
    inputs = inputs.to(device)  # move the data from CPU to GPU
    labels = labels.to(device)  # move the labels from CPU to GPU
    ...

If the batch size is relatively large, the CPU takes a long time to prepare each batch in Dataset and DataLoader. In that case you will see the Volatile GPU-Util value keep fluctuating, e.g. 0%, 20%, 70%, 95%, 0%.

The nvidia-smi command can check the GPU utilization, but it cannot dynamically refresh the display. If you want to refresh the display of GPU information every second, you can use watch -n 1 nvidia-smi.

In fact, this is because the GPU processes data very quickly while the CPU prepares it slowly. Each time the GPU receives a batch, its utilization jumps up, then gradually drops back down once the batch is processed, until the CPU delivers the next batch.

The solution is to set two DataLoader parameters (a usage sketch follows the list):

  • num_workers: by default only one CPU process reads and preprocesses data. It can be set to 4, 8, 16, etc., but more workers is not always better: with multiple workers the data must be distributed to and collected from each process, which also takes time. If num_workers is set too large, this overhead reduces efficiency.
  • pin_memory: if host memory is plentiful, it is recommended to set it to True.
    • Set to True: batches are placed in pinned (page-locked) host memory, which speeds up the transfer to the GPU.
    • Set to False: batches go through ordinary pageable memory before being transferred to the GPU.
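A sketch of how these two parameters might be passed; train_data is assumed to be an existing Dataset and the values are illustrative:

from torch.utils.data import DataLoader

train_loader = DataLoader(train_data,
                          batch_size=64,    # as large as GPU memory allows
                          shuffle=True,
                          num_workers=4,    # several CPU workers prepare batches in parallel
                          pin_memory=True)  # page-locked memory speeds up CPU-to-GPU copies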

Error reports and solutions when loading GPU models

  • Error 1:

If the model was saved on a GPU and torch.load(path_state_dict) is used to load it on a device without a GPU, the following error appears:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Possible reason: a model saved after training on the GPU cannot be loaded directly on a device without a GPU. The solution is to set map_location="cpu": torch.load(path_state_dict, map_location="cpu")

  • Error 2:

If the model is wrapped with net = nn.DataParallel(net), the names of all network layers are prefixed with module.. If such a model is saved and then loaded into a model that is not wrapped with nn.DataParallel(), loading fails because the parameter names in the state_dict do not match:

Missing key(s) in state_dict: xxxxxxxxxx

Unexpected key(s) in state_dict:xxxxxxxxxx

The solution is to iterate over the state_dict after loading it and, for every key that starts with module., remove the module. prefix. The code is as follows:


from collections import OrderedDict

new_state_dict = OrderedDict()
for k, v in state_dict.items():
    # strip the 'module.' prefix added by nn.DataParallel
    namekey = k[7:] if k.startswith('module.') else k
    new_state_dict[namekey] = v

Then load the parameters into the model.
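For example, assuming net is the model without the nn.DataParallel wrapper:

# load the cleaned parameter names into the un-wrapped model
net.load_state_dict(new_state_dict)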

4. Common errors in PyTorch


Origin blog.csdn.net/fb_941219/article/details/130642201