Writing PyTorch multi-GPU distributed training code

This article covers the basic usage of single-machine single-GPU and single-machine multi-GPU training in PyTorch.

Single-machine single-GPU

Single-machine single-GPU means the machine has only one GPU; it is the simplest training setup.

For single-machine single-GPU training, all we need to do is copy the model and the data onto that one GPU. If the GPU memory is not enough, an out-of-memory error will occur; in that case we can only reduce the amount of data sent to the GPU (for example, the batch size) or fall back to training on the CPU.
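As a minimal sketch of this device-fallback pattern (using a dummy nn.Linear model and random data purely for illustration):

import torch
from torch import nn

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)     # copy the model parameters to the chosen device
batch = torch.randn(4, 10).to(device)   # copy a batch of (dummy) input data to the same device
outputs = model(batch)                  # the forward pass now runs entirely on `device`
print(outputs.device)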

The main things to take care of in single-machine single-GPU training are:
1) Check that a GPU exists, i.e. that data can actually be sent to a GPU;

torch.cuda.is_available()

2) Copy the model to the GPU;

model.cuda()

3) Copy the data to the GPU;

data = data.cuda()  # note: for tensors, .cuda() is not in-place, so the result must be reassigned

4) Load and save the model;

# Loading a model
torch.load(path, map_location=torch.device("cuda:0"))  # can also be mapped onto the CPU
# Saving a model
torch.save(obj, path)  # obj can bundle the model, the optimizer, and any other variables
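A hedged sketch of point 4, saving a checkpoint that bundles the model, the optimizer, and the current epoch, then loading it back (assuming a GPU is available; the file name ckpt.pth and the dict keys are just illustrative choices):

import torch
from torch import nn

model = nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# Save: put everything you want to restore later into one dict.
torch.save({
    "model_state": model.state_dict(),
    "optim_state": optimizer.state_dict(),
    "epoch": 10,
}, "ckpt.pth")

# Load: map_location decides which device the tensors land on (a GPU here, but it can be the CPU).
checkpoint = torch.load("ckpt.pth", map_location=torch.device("cuda:0"))
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optim_state"])
start_epoch = checkpoint["epoch"]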

A simple classification example for single-machine single-GPU training:

from torch import nn
import torch
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader
from torchvision import models

if __name__ == '__main__':
    # Minimal transform so the PIL images become tensors
    transform = transforms.ToTensor()
    train_data = torchvision.datasets.CIFAR10('../data', train=True, transform=transform, download=True)
    test_data = torchvision.datasets.CIFAR10('../data', train=False, transform=transform, download=True)
    # Dataset sizes
    train_data_size = len(train_data)
    test_data_size = len(test_data)
    # Wrap the datasets in DataLoaders
    traindata = DataLoader(train_data, batch_size=128, pin_memory=True)
    testdata = DataLoader(test_data, batch_size=128, pin_memory=True)
    # Build the network
    model = models.resnet101()
    # Copy the model to the GPU
    model = model.cuda()
    # Loss function
    loss_fn = nn.CrossEntropyLoss()
    loss_fn = loss_fn.cuda()
    # Optimizer
    optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    # Training step counter
    total_train_steps = 0
    # Test step counter
    total_test_steps = 0
    epoch = 100
    for i in range(epoch):
        print('------ Epoch {} ------'.format(i + 1))

        # Training phase
        model.train()
        for data in traindata:
            imgs, targets = data
            # Copy the data to the GPU
            imgs = imgs.cuda()
            targets = targets.cuda()

            outputs = model(imgs)
            loss = loss_fn(outputs, targets)
            # Optimizer step
            optim.zero_grad()
            loss.backward()
            optim.step()

            total_train_steps = total_train_steps + 1
            if total_train_steps % 100 == 0:
                print('training steps: {}, loss: {}'.format(total_train_steps, loss.item()))

        # Evaluation phase
        model.eval()
        total_accuracy = 0
        total_test_loss = 0
        with torch.no_grad():
            for data in testdata:
                imgs, targets = data
                # Copy the data to the GPU
                imgs = imgs.cuda()
                targets = targets.cuda()
                outputs = model(imgs)
                loss = loss_fn(outputs, targets)
                total_test_loss = total_test_loss + loss.item()
                accuracy = (outputs.argmax(1) == targets).sum()
                total_accuracy = total_accuracy + accuracy
        print('Loss on the whole test set: {}'.format(total_test_loss))
        print('Accuracy on the whole test set: {}'.format(total_accuracy / test_data_size))
        total_test_steps = total_test_steps + 1

Single-machine multi-GPU

There are two main approaches to single-machine multi-GPU training:

DP

1) Single-process data parallelism, torch.nn.DataParallel (commonly known as DP)
DP splits each input batch across multiple GPUs and computes on them separately. In DP mode there is only one process in total (so it is still subject to the GIL). The master GPU acts as a parameter server: it broadcasts its parameters to the other cards; after gradient backpropagation each card sends its gradients back to the master, which averages them, updates the parameters, and broadcasts the updated parameters to the other cards again. This update scheme concentrates both the compute and the communication traffic on the master GPU, which can cause congestion and slow down training.
Compared with single-machine single-GPU training, DP can be enabled by adding only one line of code: model = torch.nn.DataParallel(model.cuda(), device_ids=[0, 1, 2, 3]);

    # Build the network
    model = models.resnet101()
    # Wrap the model for DP
    model = torch.nn.DataParallel(model.cuda(), device_ids=[0, 1, 2, 3])  # note: the model (and later the data) must be moved onto the GPU first, otherwise the module inside DataParallel cannot handle it and an error is raised
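For context, a hedged sketch of how this single wrap fits into one training step (dummy data, two GPUs assumed; the rest of the loop is unchanged from the single-GPU example, because DP scatters each batch across the listed GPUs internally and is launched as a normal single-process script):

import torch
from torch import nn
from torchvision import models

model = models.resnet101()
model = torch.nn.DataParallel(model.cuda(), device_ids=[0, 1])  # assumes GPUs 0 and 1 are visible

loss_fn = nn.CrossEntropyLoss().cuda()
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

imgs = torch.randn(128, 3, 32, 32).cuda()      # a dummy batch; DP splits it across the GPUs
targets = torch.randint(0, 10, (128,)).cuda()  # dummy labels
outputs = model(imgs)                          # per-GPU outputs are gathered back onto GPU 0
loss = loss_fn(outputs, targets)
optim.zero_grad()
loss.backward()
optim.step()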

Because DP is relatively inefficient, it is rarely used nowadays; the officially recommended DDP is generally used instead.

DDP

2) Multi-process data parallelism, torch.nn.parallel.DistributedDataParallel (commonly known as DDP)
In DDP, each process controls one GPU, unlike DataParallel where a single process controls multiple GPUs. DDP launches N processes, each of which loads the model onto its own GPU, so with N cards the model is replicated N times; this sidesteps the GIL limitation. During training, each DDP process exchanges gradients with the others using a ring all-reduce, so every process only communicates with its upstream and downstream neighbours, which largely avoids the communication bottleneck of the parameter-server scheme.
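Conceptually, the gradient synchronization that DDP performs after backward is equivalent to averaging every gradient tensor across all processes with an all-reduce. A hedged sketch of that primitive, shown only for illustration (it assumes the process group is already initialized; DDP does this for you automatically):

import torch
import torch.distributed as dist

def average_gradients(model):
    """Manually average gradients across all processes (what DDP automates)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor over every process...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide to get the average.
            param.grad /= world_size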

This article introduces a common way to start DDP: using the torch.distributed.launch launcher that PyTorch provides to run the Python file from the command line. The script needs to accept the corresponding arguments:

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--gpu_id', type=str, default='0,1,2,3,4,5', help='GPU ids to use')
args = parser.parse_args()
os.environ["CUDA_DEVICE_ORDER"] = 'PCI_BUS_ID'
os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu_id

Then initialize the process group:

torch.distributed.init_process_group("nccl", world_size=n_gpu, rank=args.local_rank)  # "nccl" is the GPU communication backend, world_size is the number of GPUs on this machine, and rank identifies the current process (one process per GPU)

Bind this process to its GPU:

torch.cuda.set_device(args.local_rank)

Wrap the model:

model = torch.nn.parallel.DistributedDataParallel(model.cuda(args.local_rank), device_ids=[args.local_rank])  # device_ids only needs this one card: with DDP it is one process per GPU, so each process drives a single card

Distribute data to different GPUs:

train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)  # train_dataset is a Dataset

Pass train_sampler into the DataLoader. There is no need to pass shuffle=True, because shuffle and sampler are mutually exclusive: data_dataloader = DataLoader(..., sampler=train_sampler), as in the sketch below.
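A short sketch of this step, assuming train_data is the CIFAR10 dataset used in the full example below and that the process group has already been initialized:

import torch
from torch.utils.data import DataLoader

# The DistributedSampler hands each process its own shard of the dataset's indices.
train_sampler = torch.utils.data.distributed.DistributedSampler(train_data, shuffle=True)

# No shuffle=True here: shuffling is done by the sampler (per epoch, via set_epoch).
data_dataloader = DataLoader(train_data, batch_size=128, pin_memory=True, sampler=train_sampler)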

Full code example:

from torch import nn
import torch
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader
from torchvision import models
import argparse
import os
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--gpu_id', type=str, default='0,1', help='GPU ids to use')
opt = parser.parse_args()
os.environ["CUDA_DEVICE_ORDER"] = 'PCI_BUS_ID'
os.environ["CUDA_VISIBLE_DEVICES"] = opt.gpu_id

if __name__ == '__main__':
    # Initialization
    local_rank = opt.local_rank
    dist.init_process_group(backend="nccl", world_size=len(opt.gpu_id.split(',')), rank=local_rank)
    rank = int(os.environ["RANK"])  # global rank set by the launcher (unused below)
    torch.cuda.set_device(local_rank)

    # Minimal transform so the PIL images become tensors
    transform = transforms.ToTensor()
    train_data = torchvision.datasets.CIFAR10('../data', train=True, transform=transform, download=True)
    test_data = torchvision.datasets.CIFAR10('../data', train=False, transform=transform, download=True)

    train_sampler = torch.utils.data.distributed.DistributedSampler(train_data, shuffle=True)
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_data, shuffle=True)
    # Dataset sizes
    train_data_size = len(train_data)
    test_data_size = len(test_data)
    # Wrap the datasets in DataLoaders; the sampler replaces shuffle=True
    traindata = DataLoader(train_data, batch_size=128, pin_memory=True, sampler=train_sampler)
    testdata = DataLoader(test_data, batch_size=128, pin_memory=True, sampler=test_sampler)
    # Build the network
    model = models.resnet101()
    # Copy the model to this process's GPU, then wrap it with DDP
    model = model.cuda(local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], find_unused_parameters=True)

    # Loss function
    loss_fn = nn.CrossEntropyLoss()
    loss_fn = loss_fn.cuda()
    # Optimizer
    optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    # Training step counter
    total_train_steps = 0
    # Test step counter
    total_test_steps = 0
    epoch = 100

    for i in range(epoch):
        print('------ Epoch {} ------'.format(i + 1))
        train_sampler.set_epoch(i)  # reshuffle the shards with the current epoch index
        # Training phase
        model.train()
        for data in traindata:
            imgs, targets = data
            # Copy the data to the GPU
            imgs = imgs.cuda()
            targets = targets.cuda()
            outputs = model(imgs)
            loss = loss_fn(outputs, targets)
            # Optimizer step
            optim.zero_grad()
            loss.backward()
            optim.step()
            total_train_steps = total_train_steps + 1
            if total_train_steps % 100 == 0:
                print('training steps: {}, loss: {}'.format(total_train_steps, loss.item()))

        # Evaluation phase
        model.eval()
        total_accuracy = 0
        total_test_loss = 0
        with torch.no_grad():
            for data in testdata:
                imgs, targets = data
                # Copy the data to the GPU
                imgs = imgs.cuda()
                targets = targets.cuda()
                outputs = model(imgs)
                loss = loss_fn(outputs, targets)
                total_test_loss = total_test_loss + loss.item()
                accuracy = (outputs.argmax(1) == targets).sum()
                total_accuracy = total_accuracy + accuracy
        print('Loss on the whole test set: {}'.format(total_test_loss))
        # Note: each process only evaluates its own shard of the test set, so this is an approximation;
        # an exact global accuracy would require all-reducing the counts across processes.
        print('Accuracy on the whole test set: {}'.format(total_accuracy / test_data_size))
        total_test_steps = total_test_steps + 1

Notes
1. At the beginning of each epoch, call train_sampler.set_epoch() with the current epoch index so that the data is fully reshuffled; otherwise every epoch returns the samples in the same order.
2. The script must be run through the -m torch.distributed.launch module, where nproc_per_node is the number of processes to start (one per GPU):
python -m torch.distributed.launch --nproc_per_node=n_gpus --master_port 22225 xx.py
3. The launcher passes --local_rank to the script (e.g. train.py); local_rank is an index from 0 to n_gpus - 1, and the script must accept it as an argument.
4. Note that batch_size is the batch size per GPU, not the global batch size.
5. MASTER_PORT is the port of the master node. On a shared server the default port may already be in use by someone else, in which case launch fails with an error, so generate or pick a free port when starting.
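For note 5, one hedged approach is a tiny wrapper that picks a random high port and hands it to the launcher (the port range, the two-GPU setting, and the script name xx.py are illustrative only):

import random
import subprocess

# Pick a random high port so several users can launch jobs on the same server without clashing.
master_port = random.randint(20000, 29999)
subprocess.run([
    "python", "-m", "torch.distributed.launch",
    "--nproc_per_node=2",
    "--master_port", str(master_port),
    "xx.py",
])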

Origin blog.csdn.net/qq_50027359/article/details/126779341