5 PyTorch parallel training methods every graduate student should master (single machine, multiple GPUs)

Author丨Zongheng@zhihu

Source丨https://zhuanlan.zhihu.com/p/98535650

Edit丨Gokushi Platform

It's Friday, a perfect day for slacking off. The machines are learning while their human is bored. Looking at the half-idle graphics cards, I decided to write something to feed them before opening Bilibili for some "studying". So I grabbed 4 cards each on V100-PICE / V100 / K80 nodes and tried out which distributed training library is the fastest! Now the spare GPU memory is finally being used up, and I am once again my advisor's diligent student (I'm really a clever little rascal)!

Take-Away

The author has written PyTorch examples showing how to use different acceleration libraries on ImageNet (single machine, multiple GPUs). Students who need them can use them as a quickstart and copy the required parts into their own projects (code is on GitHub; links below):

1. Simple and convenient nn.DataParallel

https://github.com/tczhangzhi/pytorch-distributed/blob/master/dataparallel.py

2. Use torch.distributed to speed up parallel training

https://github.com/tczhangzhi/pytorch-distributed/blob/master/distributed.py

3. Use torch.multiprocessing to replace the launcher

https://github.com/tczhangzhi/pytorch-distributed/blob/master/multiprocessing_distributed.py

4. Use apex for further acceleration

https://github.com/tczhangzhi/pytorch-distributed/blob/master/apex_distributed.py

5. Elegant implementation of horovod 

https://github.com/tczhangzhi/pytorch-distributed/blob/master/horovod_distributed.py

Here the author benchmarked training time on ImageNet using 4 Tesla V100-PICE cards. In the tests, Apex gives the best speedup, but the gap to Horovod / torch.distributed is small, so the built-in torch.distributed can be used directly. DataParallel is noticeably slower and is not recommended. (Results on V100/K80 will be added later; other experiments are running in between, so it may take a while.)

[Figure 1: Training-time comparison on ImageNet with 4 V100-PICE GPUs]

Below is a brief record of the distributed training approach for each library, doubling as the README of the code (I'm really a clever little rascal)~

Simple and convenient nn.DataParallel

DataParallel helps us load the model and data onto multiple GPUs, control how data flows between the GPUs, and coordinate the model replicas on the different GPUs for parallel training, all from a single controlling process (the fine-grained primitives involved include scatter, gather, and so on).

DataParallel is very convenient to use: we only need to wrap the model with DataParallel and set a few parameters. The parameters to define are which GPUs take part in training, device_ids=gpus, and which GPU the outputs are gathered on, output_device=gpus[0]. DataParallel then automatically splits the data and loads it onto the corresponding GPUs, replicates the model onto each GPU, runs the forward pass, computes the gradients, and gathers the results:

model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])

Note that both the model and the data must already be on a GPU before the DataParallel-wrapped module can process them; otherwise an error is raised:

# model.cuda() is required here
model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])

for epoch in range(100):
   for batch_idx, (images, target) in enumerate(train_loader):
      # images.cuda() / target.cuda() are required here
      images = images.cuda(non_blocking=True)
      target = target.cuda(non_blocking=True)
      ...
      output = model(images)
      loss = criterion(output, target)
      ...
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

To summarize, the DataParallel parallel training part is mainly related to the following code snippets:

# main.py
import torch
import torch.nn as nn
from torch import optim

gpus = [0, 1, 2, 3]
torch.cuda.set_device('cuda:{}'.format(gpus[0]))

train_dataset = ...

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=...)

model = ...
model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])

optimizer = optim.SGD(model.parameters())

for epoch in range(100):
   for batch_idx, (images, target) in enumerate(train_loader):
      images = images.cuda(non_blocking=True)
      target = target.cuda(non_blocking=True)
      ...
      output = model(images)
      loss = criterion(output, target)
      ...
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

To run it, just execute the script with python:

python main.py

The full training code on ImageNet is available on Github.

Speed up parallel training with torch.distributed

Since PyTorch 1.0, the official torch.distributed package finally wraps the common distributed communication primitives, supporting all-reduce, broadcast, send/receive and so on, with CPU communication implemented via MPI and GPU communication via NCCL. The official docs also recommend replacing the slow, load-imbalanced DataParallel with DistributedDataParallel, which is quite mature by now~
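To make these primitives concrete, here is a minimal all-reduce sketch (my addition, not from the original code), assuming it runs inside a process whose process group has already been initialized:

import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called in this process
t = torch.ones(1).cuda()
dist.all_reduce(t, op=dist.ReduceOp.SUM)   # in-place sum across all processes
print(dist.get_rank(), t.item())           # every rank now holds the value world_size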

Unlike DataParallel, which controls multiple GPUs from a single process, with torch.distributed we only need to write one copy of the code and torch automatically assigns it to the processes, one per GPU.

At the API level, PyTorch provides torch.distributed.launch, a launcher for running the python script in a distributed way from the command line. During execution, the launcher passes the index of the current process (in practice, the GPU index) to the script as an argument, and we can retrieve it like this:

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()
print(args.local_rank)

Next, use init_process_group to set the backend and port used for communication between GPUs:

dist.init_process_group(backend='nccl')
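For reference, torch.distributed.launch exports MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE as environment variables, so the call above is equivalent to the explicit env:// form (a sketch of the default behavior):

# the env:// init method reads the environment variables set by the launcher
dist.init_process_group(backend='nccl', init_method='env://')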

After that, use DistributedSampler to partition the dataset. As introduced above, it splits the dataset so that in the current process we only need to load the partition corresponding to our own rank for training:

train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)
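One detail the snippet above leaves out: when shuffling is enabled, DistributedSampler needs set_epoch called at the start of every epoch so that each epoch sees a different shuffle order. A minimal sketch:

for epoch in range(100):
   train_sampler.set_epoch(epoch)   # re-seed the sampler so the shuffle differs across epochs
   for batch_idx, (images, target) in enumerate(train_loader):
      ...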

Then wrap the model with DistributedDataParallel, which all-reduces the gradients computed on the different GPUs (i.e., sums the gradients from all GPUs and synchronizes the result). After the all-reduce, the gradient on every GPU is the mean of the per-GPU gradients before the all-reduce:

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

Finally, load the data and model onto the GPU used by the current process and run the usual forward and backward passes:

torch.cuda.set_device(args.local_rank)

model.cuda()

for epoch in range(100):
   for batch_idx, (images, target) in enumerate(train_loader):
      images = images.cuda(non_blocking=True)
      target = target.cuda(non_blocking=True)
      ...
      output = model(images)
      loss = criterion(output, target)
      ...
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
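Since every GPU holds identical parameters after the all-reduce, a common companion pattern (my addition, not part of the original code) is to save checkpoints from a single rank only, so the processes do not write over each other:

if args.local_rank == 0:
   # unwrap the DDP wrapper (.module) so the checkpoint loads without DDP
   torch.save(model.module.state_dict(), 'checkpoint.pth.tar')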

To sum up, the parallel training part of torch.distributed is mainly related to the following code segments:

# main.py
import torch
import argparse
import torch.distributed as dist
from torch import optim

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()

dist.init_process_group(backend='nccl')
torch.cuda.set_device(args.local_rank)

train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

model = ...
model.cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

optimizer = optim.SGD(model.parameters())

for epoch in range(100):
   for batch_idx, (images, target) in enumerate(train_loader):
      images = images.cuda(non_blocking=True)
      target = target.cuda(non_blocking=True)
      ...
      output = model(images)
      loss = criterion(output, target)
      ...
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

To run it, start the script with the torch.distributed.launch launcher:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py

The full training code on ImageNet is available on Github.

Use torch.multiprocessing to replace the launcher

Some students may already be familiar with torch.multiprocessing and prefer to control the multiple processes themselves with it. Doing so bypasses torch.distributed.launch and sidesteps some of the minor annoyances of how the launcher starts and shuts down processes~

To use it, just call torch.multiprocessing.spawn and torch.multiprocessing will create the processes for us. As shown in the code below, spawn starts nprocs=4 processes; each process runs main_worker and receives the local_rank (the current process index) plus args (here 4 and myargs) as its arguments:

import torch.multiprocessing as mp

mp.spawn(main_worker, nprocs=4, args=(4, myargs))
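Note that main_worker must be defined before spawn is called, and spawn should sit under the usual __main__ guard. A minimal entry-point sketch (myargs is only an illustrative placeholder for whatever arguments you want to forward):

import torch.multiprocessing as mp

def main_worker(proc, ngpus_per_node, args):
   ...   # per-process training code, shown below

if __name__ == '__main__':
   myargs = ...   # e.g. an argparse.Namespace built in the parent process
   mp.spawn(main_worker, nprocs=4, args=(4, myargs))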

Here we wrap everything that torch.distributed.launch used to manage into the main_worker function, where proc corresponds to local_rank (the current process index), ngpus_per_node corresponds to 4, and args corresponds to myargs:

def main_worker(proc, ngpus_per_node, args):

   dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=proc)
   torch.cuda.set_device(proc)

   train_dataset = ...
   train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

   train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

   model = ...
   model.cuda()
   model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[proc])

   optimizer = optim.SGD(model.parameters())

   for epoch in range(100):
      for batch_idx, (images, target) in enumerate(train_loader):
          images = images.cuda(non_blocking=True)
          target = target.cuda(non_blocking=True)
          ...
          output = model(images)
          loss = criterion(output, target)
          ...
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()

Note that, because the environment variables that torch.distributed.launch would normally set are no longer available, we have to pass the configuration to init_process_group manually:

dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=proc)
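On a single machine the values are easy to fill in: world_size is just the number of GPUs in use, and rank is the process index that spawn hands to main_worker. A sketch of deriving them (assuming one process per visible GPU):

ngpus_per_node = torch.cuda.device_count()   # e.g. 4 when CUDA_VISIBLE_DEVICES lists 4 cards
dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                        world_size=ngpus_per_node, rank=proc)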

To sum up, the parallel training part after adding multiprocessing is mainly related to the following code segments:

# main.py
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import optim

def main_worker(proc, ngpus_per_node, args):

   dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=4, rank=proc)
   torch.cuda.set_device(proc)

   train_dataset = ...
   train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

   train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

   model = ...
   model.cuda()
   model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[proc])

   optimizer = optim.SGD(model.parameters())

   for epoch in range(100):
      for batch_idx, (images, target) in enumerate(train_loader):
          images = images.cuda(non_blocking=True)
          target = target.cuda(non_blocking=True)
          ...
          output = model(images)
          loss = criterion(output, target)
          ...
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()

if __name__ == '__main__':
   myargs = ...   # arguments forwarded to every worker, e.g. an argparse.Namespace
   mp.spawn(main_worker, nprocs=4, args=(4, myargs))

When using it, you can run it directly with python:

python main.py

The full training code on ImageNet is available on Github.

Use Apex for further acceleration

Apex is NVIDIA's open-source library for mixed-precision and distributed training. It wraps the whole mixed-precision workflow, so mixed-precision training only takes two or three extra lines of configuration, greatly reducing memory usage and saving compute time. In addition, Apex provides its own wrapper for distributed training, optimized for NVIDIA's NCCL communication library.

Apex wraps mixed-precision training elegantly. Simply wrap the model and optimizer with amp.initialize and apex will manage the precision of the model parameters and optimizer for us; additional configuration arguments can be passed in depending on the precision requirements:

from apex import amp

model, optimizer = amp.initialize(model, optimizer)
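For example, the opt_level argument selects the mixed-precision mode; 'O1' is the setting commonly recommended in the apex documentation (a sketch):

# 'O0' pure FP32, 'O1' conservative mixed precision, 'O2' almost-FP16 mixed precision, 'O3' pure FP16
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')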

For distributed training, Apex changes little at the glue layer; it mainly optimizes the NCCL communication, so most of the code stays the same as with torch.distributed. To use it, simply wrap the model with apex.parallel.DistributedDataParallel instead of torch.nn.parallel.DistributedDataParallel. At the API level it manages some parameters for us automatically (we can pass a few less):

from apex.parallel import DistributedDataParallel

model = DistributedDataParallel(model)
# # torch.distributed
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank)

After computing the loss in the forward pass, Apex requires wrapping the backward pass with amp.scale_loss, which automatically scales the loss so that the gradients stay representable at reduced precision:

with amp.scale_loss(loss, optimizer) as scaled_loss:
   scaled_loss.backward()

To summarize, the parallel training part of Apex is mainly related to the following code snippets:

# main.py
import torch
import argparse
import torch.distributed as dist
from torch import optim

from apex import amp
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()

dist.init_process_group(backend='nccl')
torch.cuda.set_device(args.local_rank)

train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

model = ...
model.cuda()

optimizer = optim.SGD(model.parameters())

model, optimizer = amp.initialize(model, optimizer)
model = DistributedDataParallel(model)

for epoch in range(100):
   for batch_idx, (images, target) in enumerate(train_loader):
      images = images.cuda(non_blocking=True)
      target = target.cuda(non_blocking=True)
      ...
      output = model(images)
      loss = criterion(output, target)
      optimizer.zero_grad()
      with amp.scale_loss(loss, optimizer) as scaled_loss:
         scaled_loss.backward()
      optimizer.step()

To run it, start the script with the torch.distributed.launch launcher:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py

The full training code on ImageNet is available on Github.

An elegant implementation of Horovod

Horovod is Uber's open-source deep learning tool. Its development draws on Facebook's "Training ImageNet In 1 Hour" and Baidu's "Ring Allreduce", and it can be combined painlessly with deep learning frameworks such as PyTorch/TensorFlow for parallel training.

At the API level, Horovod is very similar to torch.distributed. On top of mpirun, Horovod provides its own wrapped horovodrun as the launcher.

As with torch.distributed.launch, we only write one copy of the code and the horovodrun launcher automatically assigns it to the processes, one per GPU. During execution, the launcher injects the index of the current process (in practice, the GPU index) into hvd, and we can read it like this:

import horovod.torch as hvd

hvd.local_rank()
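Besides local_rank, Horovod also exposes the global rank and the total number of processes; a short sketch of how they are typically used (on a single machine, rank and local_rank coincide):

print(hvd.rank())                         # global index of the current process
print(hvd.size())                         # total number of processes
torch.cuda.set_device(hvd.local_rank())   # pin the current process to its own GPU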

Similar to init_process_group, Horovod uses init to set the backend and ports used for communication between GPUs:

hvd.init()

Next, use DistributedSampler to partition the dataset. As introduced above, it splits the dataset so that in the current process we only need to load the partition corresponding to our own rank for training (with Horovod, the number of replicas and the rank are passed in explicitly):

train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

After that, use broadcast_parameters to copy the model parameters from the GPU with index root_rank to all the other GPUs:

hvd.broadcast_parameters(model.state_dict(), root_rank=0)

Then wrap the optimizer with DistributedOptimizer, which all-reduces the gradients computed on the different GPUs (i.e., sums the gradients from all GPUs and synchronizes the result). After the all-reduce, the gradient on every GPU is the mean of the per-GPU gradients before the all-reduce:

optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters(), compression=hvd.Compression.fp16)
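Horovod also provides a matching broadcast for the optimizer state, usually called right after broadcast_parameters (not shown in the original snippet; a sketch):

# keep the optimizer state (e.g. momentum buffers) consistent across workers as well
hvd.broadcast_optimizer_state(optimizer, root_rank=0)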

Finally, load the data into the current GPU. When writing the code, we just need to focus on doing forward and back propagation normally:

torch.cuda.set_device(hvd.local_rank())

for epoch in range(100):
   for batch_idx, (images, target) in enumerate(train_loader):
      images = images.cuda(non_blocking=True)
      target = target.cuda(non_blocking=True)
      ...
      output = model(images)
      loss = criterion(output, target)
      ...
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

To summarize, the parallel training part of Horovod is mainly related to the following code snippets:

# main.py
import torch
import horovod.torch as hvd
from torch import optim

hvd.init()
torch.cuda.set_device(hvd.local_rank())

train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

model = ...
model.cuda()

optimizer = optim.SGD(model.parameters())

optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for epoch in range(100):
   for batch_idx, (images, target) in enumerate(train_loader):
       images = images.cuda(non_blocking=True)
       target = target.cuda(non_blocking=True)
       ...
       output = model(images)
       loss = criterion(output, target)
       ...
       optimizer.zero_grad()
       loss.backward()
       optimizer.step()

To run it, start the script with the horovodrun launcher:

CUDA_VISIBLE_DEVICES=0,1,2,3 horovodrun -np 4 -H localhost:4 --verbose python main.py

The full training code on ImageNet is available on Github.

Postscript

Configuration of the V100-PICE (first 4 GPUs) used in this article:

[Figure 2: Configuration details]

Configuration of the V100 (first 4 GPUs) used in this article:

[Figure 3: Configuration details]

Configuration of the K80 (first 4 GPUs) used in this article:

[Figure 4: Configuration details]

The author is a CV graduate student who dug into this on a whim while slacking off today and will keep improving it over time~ Folks working in industry surely have their own best practices, so feel free to open a PR or leave a comment~

This article is for academic sharing only; if there is any infringement, please get in touch and it will be removed.
