PyTorch multi-GPU training on a single compute node - all you need

Outline

PyTorch multi-GPU training is essentially data parallelism: each GPU holds a full copy of the model parameters, a batch of data is split into N parts, each GPU processes one part, the gradients computed on the GPUs are aggregated into the gradient of the whole batch, and the parameters on every GPU are updated with that aggregated gradient, completing one iteration.

There are two schemes for multi-GPU training. The first uses nn.DataParallel; it was introduced into PyTorch earlier, is easy to use, and does not involve multiple processes. The second combines torch.nn.parallel.DistributedDataParallel and torch.utils.data.distributed.DistributedSampler with multiple processes; it is more efficient but a little harder to implement, and it also supports multi-node distributed training. Option two is more efficient than option one even on a single compute node; quoting the PyTorch docs:

In the single-machine synchronous case, torch.distributed or the torch.nn.parallel.DistributedDataParallel() wrapper may still have advantages over other approaches to data parallelism, including torch.nn.DataParallel():

This article describes in detail how to implement both methods on a single machine; multi-node distributed training is more complex and will be covered in the next article.

Option One

Steps

  • Wrap the model with nn.DataParallel.
model = nn.DataParallel(model)
  • Use os.environ["CUDA_VISIBLE_DEVICES"] to specify which GPU devices the current program may use; if it is not set, all available devices will be used.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"  # use 3 GPUs
  • Move the model and the data onto the GPU with model.cuda() or model.to("cuda"), and data.cuda() or data.to("cuda").

The training loop is the same as with a single GPU. With this method, PyTorch automatically splits each batch into N parts (N being the number of GPUs specified via os.environ), runs forward and backward on each GPU separately, then gathers the gradients onto the primary GPU, updates the parameters there, and finally broadcasts the updated parameters to the other GPUs, completing one iteration.
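To make the full iteration concrete, here is a minimal sketch of one training step with nn.DataParallel. The loss function, optimizer, loader and label names are illustrative (the test code below leaves loss/backward/update empty), and it assumes model and device have been set up as in the steps above:

import torch
import torch.nn as nn

criterion = nn.MSELoss()                              # illustrative loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, labels in train_loader:                   # hypothetical loader yielding (input, label)
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    outputs = model(inputs)                           # batch is split across the visible GPUs
    loss = criterion(outputs, labels)                 # outputs are gathered on the primary GPU
    loss.backward()                                   # gradients are accumulated on the primary GPU
    optimizer.step()                                  # update once; replicas are re-broadcast next forward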

Test

Code:



import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import os
import time

# dataset
class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

# model define
class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output

if __name__=="__main__":
    # Parameters
    input_size = 5
    output_size = 2

    batch_size = 30
    data_size = 100

    dataset = RandomDataset(input_size, data_size)
    # dataloader define
    rand_loader = DataLoader(dataset=dataset,
                            batch_size=batch_size, shuffle=True)

    # model init
    model = Model(input_size, output_size)

    # cuda devices
    os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
        model = nn.DataParallel(model)

    model.to(device)

    for data in rand_loader:
        input = data.to(device)
        output = model(input)
        # loss

        # backward

        #update
        
        time.sleep(1)  # simulate a relatively long batch
        print("Outside: input size", input.size(),
            "output_size", output.size())

    # unwrap nn.DataParallel (if it was used) before saving so the checkpoint loads without it
    state_dict = model.module.state_dict() if isinstance(model, nn.DataParallel) else model.state_dict()
    torch.save(state_dict, "model.pth")
  • If run with one GPU, the test output is as follows; you can see that the input and output sizes inside the model are the same as those outside.
        In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
  • If run with two GPUs, the output is as follows; you can see that each batch is split automatically between the two GPUs.
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
        In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

Notes

  • With multiple GPUs there is a primary/secondary distinction: the primary GPU aggregates the gradients and updates the parameters, then passes the updated parameters to the other GPUs to complete the iteration, so both parameters and gradients are transferred between GPUs during this process.
  • About Batch Norm: because a large batch is split into several mini-batches, normalization statistics are computed only over each mini-batch, and the Normalization-layer parameters used at test time are those of the primary GPU. If you want normalization synchronized across GPUs you need a sync-norm implementation (see the sketch after this list). The same normalization issue also exists in distributed training.
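For synchronized normalization, PyTorch ships torch.nn.SyncBatchNorm, which only works with DistributedDataParallel (option two below), not with nn.DataParallel. A minimal sketch, assuming the process group, rank, and device setup from option two:

import torch

model = Model(input_size, output_size)
# replace every BatchNorm layer with SyncBatchNorm so statistics are
# computed across all processes instead of per mini-batch
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = model.to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank)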

Option Two

Option two is implemented with multiple processes. "Multi-process" here really means "distributed": the processes can live on multiple machines and coordinate with each other over network communication. Multi-node distributed training will be described in detail in the next article; here we stick to a single machine and the differences between option two and option one. First, each process runs its own training loop; after each backward pass the processes share and aggregate their gradients, then each updates its parameters independently. Parameters are not passed around during an iteration (they are synchronized across processes at initialization). Second, communication between processes uses NCCL, which PyTorch already supports internally, so normally you do not need to worry about it. For distributed details refer to the next article; only the simplest implementation is given here.

Steps

  • Initialize the process group. Here the default (environment-variable) initialization is used, which is the most convenient way on a single node. The purpose of initialization is to let all processes establish contact with each other, i.e. learn each other's position, status, and other information.
  • Prepare the dataset, adding torch.utils.data.distributed.DistributedSampler. See the code in the test section, and the sampler sketch after this list.
  • Prepare the model, wrapping it with torch.nn.parallel.DistributedDataParallel. See the code in the test section for concrete usage.
  • The training loop is the same as in option one; just imagine multiple processes running the same training code at the same time.
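One detail worth knowing about DistributedSampler: it shuffles deterministically per epoch, so when training for several epochs you should call set_epoch() each epoch, otherwise every epoch sees the same ordering. A minimal sketch, assuming the dataset, batch_size, and world_size variables from the test code below; num_epochs is illustrative:

from torch.utils.data import DataLoader
import torch.utils.data.distributed

train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
rand_loader = DataLoader(dataset, batch_size=batch_size // world_size, sampler=train_sampler)

for epoch in range(num_epochs):          # num_epochs is a hypothetical name
    train_sampler.set_epoch(epoch)       # reseed the shuffle for this epoch
    for data in rand_loader:
        ...                              # forward / backward / update as in the test code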

Test

The code is similar to option one, except that it must initialize a process group, which is what makes this program a piece of distributed training. Multiple processes are created by launching the script with python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1: nnodes is 1 because there is a single compute node, and nproc_per_node=2 means two training processes are created. Each process is then assigned a rank; the rank uniquely identifies a process, rank 0 is the master and the others are slaves. You generally need two GPUs as well; the test program picks the GPU according to the rank, i.e. the rank 0 process uses GPU 0 and the rank 1 process uses GPU 1. We also need to create a distributed sampler for the dataset and pass it to the DataLoader at construction time; see the code for how the model is wrapped for distributed training.
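Note that torch.distributed.launch also passes a --local_rank argument to every process it spawns. The test code below ignores it and derives the device from dist.get_rank() instead, which coincides with the local rank on a single node; a minimal optional sketch of parsing it explicitly:

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)                    # bind this process to its own GPU
device = torch.device("cuda", args.local_rank)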
Code:



import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import os
import torch.distributed as dist
import torch.utils.data.distributed
import sys
import time


# dataset
class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

# model define
class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        # print("\tIn Model: input size", input.size(),
        #       "output size", output.size())

        return output

if __name__=="__main__":
    # Parameters
    input_size = 5
    output_size = 2

    batch_size = 30
    data_size = 100

    # check the nccl backend
    if not dist.is_nccl_available():
        print("Error: nccl backend not available.")
        sys.exit(1)

    # init group
    dist.init_process_group(backend="nccl", init_method="env://")

    # get the process rank and the world size
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # prepare the dataset
    dataset = RandomDataset(input_size, data_size)
    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    

    rand_loader = DataLoader(dataset, batch_size=batch_size//world_size, 
                              shuffle=(train_sampler is None), 
                              sampler=train_sampler)

    # dataloader define
    # rand_loader = DataLoader(dataset=dataset,
    #                         batch_size=batch_size, shuffle=True)

    # model init
    model = Model(input_size, output_size)

    # cuda devices
    # os.environ["CUDA_VISIBLE_DEVICES"]="0"
    # device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # if torch.cuda.device_count() > 1:
    #     print("Let's use", torch.cuda.device_count(), "GPUs!")
    #     # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    #     model = nn.DataParallel(model)
    # model.to(device)

    # distribute model define
    device = torch.device('cuda', rank)
    model = model.to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank)
    print("From rank %d: start training, time:%s"%(rank, time.strftime("%Y-%m-%d %H:%M:%S")))
    for data in rand_loader:
        input = data.to(device)
        output = model(input)
        # loss

        # backward

        #update
        
        time.sleep(1)  # simulate a relatively long batch
        print("From rank %d: Outside: input size %s, output size %s"%(rank, str(input.size()), str(output.size())),flush=True)
    torch.save(model.module.state_dict(), "model_%d.pth"%rank)
    print("From rank %d: end training, time: %s"%(rank, time.strftime("%Y-%m-%d %H:%M:%S")))
  • Run command
    python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 simple_test.py
  • Result
From rank 0: start training, time:2019-09-26 13:20:13
From rank 1: start training, time:2019-09-26 13:20:13
From rank 0: Outside: input size torch.Size([15, 5]), output size torch.Size([15, 2])
From rank 1: Outside: input size torch.Size([15, 5]), output size torch.Size([15, 2])
From rank 0: Outside: input size torch.Size([15, 5]), output size torch.Size([15, 2])
From rank 1: Outside: input size torch.Size([15, 5]), output size torch.Size([15, 2])
From rank 1: Outside: input size torch.Size([15, 5]), output size torch.Size([15, 2])From rank 0: Outside: input size torch.Size([15, 5]), output size torch.Size([15, 2])

From rank 0: Outside: input size torch.Size([5, 5]), output size torch.Size([5, 2])
From rank 0: end training, time: 2019-09-26 13:20:17
From rank 1: Outside: input size torch.Size([5, 5]), output size torch.Size([5, 2])
From rank 1: end training, time: 2019-09-26 13:20:17
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************

I pasted the test output directly; you can see it is a bit chaotic, which is simply caused by the multiple processes running in parallel. Looking carefully, there are two processes training in parallel, each handling half of every batch. The OMP_NUM_THREADS message at the end is printed by the PyTorch launch utility; roughly, it says that because I did not specify the number of OpenMP threads, it helpfully set it to 1 per process to avoid overloading the system, and suggests tuning it for best performance.

One more Thing

Saving and loading the model differ slightly from the single-GPU case. Here all parameters are first loaded into CPU memory (map_location="cpu"); if they need to go onto a GPU afterwards, move them there explicitly. The reason is that parameters saved while sitting on a GPU record in the .pth file which GPU they belong to, and on loading they are put back onto that same GPU, which raises an error when the loading machine does not have enough GPUs, like this:

RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.

Saving the model works the same way, but remember that option two runs several processes at the same time, so several model files get written; if you are using shared storage, watch out for file-name collisions. In general it is enough to save the parameters only in the rank 0 process, because the model parameters of all processes are kept synchronized.

torch.save(model.module.state_dict(), "model.pth")

Load Model:

param=torch.load("model.pth",map_location="cpu")
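Putting the two together, a minimal sketch of saving only on rank 0 and reloading elsewhere, reusing the Model, input_size, and output_size names from the test code above:

import torch
import torch.distributed as dist

# save: parameters are identical on every process, so rank 0 is enough
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "model.pth")

# load: map onto CPU first, then move to whatever device actually exists here
state_dict = torch.load("model.pth", map_location="cpu")
new_model = Model(input_size, output_size)   # same architecture as when saving
new_model.load_state_dict(state_dict)
new_model.to("cuda:0")                        # optional: move to an available GPU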

Well, that is all for today; I have not written a blog this carefully for a while. Of course some parts are still imperfect, for example testing whether the model parameters really stay synchronized across processes. If you have any questions, or think something here is wrong, please leave a comment. Thanks ^=^.


Origin www.cnblogs.com/walter-xh/p/11586507.html