Distributed data-parallel training with PyTorch
Introduction: using DistributedDataParallel in PyTorch for multi-GPU distributed model training
Motivation
The easiest way to speed up neural network training is to use GPUs, which are much faster than CPUs at the computations common in neural networks (matrix multiplication and addition). As models or datasets get larger, one GPU quickly becomes insufficient. For example, large language models like BERT and GPT-2 are trained on hundreds of GPUs. To perform multi-GPU training, we need a way to split the model and data among the GPUs and to coordinate the training.
Why distributed data parallel?
Many people prefer to implement their deep learning models in PyTorch because it strikes the best balance between control and ease of use among neural network frameworks. PyTorch has two ways to split models and data across multiple GPUs: nn.DataParallel and nn.DistributedDataParallel.
nn.DataParallel is easier to use (just wrap the model and run the training script). However, because it uses one process to compute the model weights and then distributes them to each GPU for every batch, communication quickly becomes a bottleneck and GPU utilization is often low. Also, nn.DataParallel requires all GPUs to be on the same node, and it cannot be used with Apex for mixed-precision training.
The main differences between nn.DataParallel and nn.DistributedDataParallel can therefore be summarized as follows:
1. DistributedDataParallel can be combined with model parallelism, while DataParallel cannot, so if the model is too large for the memory of a single GPU, only the former can be used.
2. DataParallel is single-process and multi-threaded and works only on a single machine, while DistributedDataParallel is multi-process, works on one machine or on many, and is genuinely distributed training.
3. Training with DistributedDataParallel is more efficient: each process has its own Python interpreter, which avoids GIL contention, and the communication cost is low, so training is faster. In practice DataParallel has largely been superseded.
4. Each process in DistributedDataParallel has its own optimizer and performs its own update step, but because the same gradients are delivered to every process, every process applies the same update.
Shortcomings of the available material
Overall, the PyTorch documentation is complete and clear. However, when trying to figure out how to use DistributedDataParallel, all the examples and tutorials I found were some combination of inaccessible, incomplete, or overloaded with irrelevant features.
PyTorch provides a tutorial on distributed training with AWS, which does a good job of showing how to set things up on the AWS side. However, the rest of it is a bit of a mess: for some reason it spends a lot of time showing how to calculate metrics before going back to showing how to wrap the model and launch the processes. It also does not describe what nn.DistributedDataParallel does, which makes the related code blocks hard to follow.
The tutorial on writing distributed applications with PyTorch has much more detail than is needed for a first pass, and it is inaccessible to someone without a background in Python multiprocessing. It spends a lot of time replicating the functionality of nn.DistributedDataParallel. However, it neither gives a high-level overview of what that functionality does nor shows how to use it (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
There is also a PyTorch tutorial on getting started with distributed data parallel. This one shows how to do some setup but does not explain what the setup is for, and then shows some code to split the model between GPUs and do one optimization step. Unfortunately, I'm pretty sure the code as written won't run (the function names don't match), and it doesn't tell you how to run the code. Like the previous tutorials, it doesn't give a high-level overview of how distributed training works.
The closest thing PyTorch provides to a minimal working example (MWE) is the ImageNet training example. Unfortunately, that example also demonstrates almost every other feature of PyTorch, so it is difficult to pick out what is relevant to distributed multi-GPU training.
Apex provides its own version of the PyTorch ImageNet example. Its documentation says that its version of nn.DistributedDataParallel is a drop-in replacement for PyTorch's, which is only helpful after you have learned how to use PyTorch's.
Another tutorial does a good job of describing what is going on under the hood and how it differs from nn.DataParallel. However, it has no code samples showing how to use nn.DataParallel.
Outline
This tutorial is aimed at those who are already familiar with training neural network models in PyTorch. It starts by outlining the overall idea. Then it shows a minimal working example of training on MNIST with a GPU. That example is then modified to train on multiple GPUs, possibly across multiple nodes, and the changes are explained line by line. Importantly, it also explains how to run the code. As a bonus, it demonstrates how to use Apex for simple mixed-precision distributed training.
Overall framework
Multiprocessing with DistributedDataParallel replicates the model across multiple GPUs, each controlled by a single process. (A process is an instance of Python running on a computer; by having multiple processes running in parallel, we can take advantage of processors that have multiple CPU cores. You could have each process control multiple GPUs if you wanted, but that would be slower than having one GPU per process. It is also possible to have multiple worker processes fetching data for each GPU, but this is omitted for simplicity.) The GPUs can all be on the same node or distributed across multiple nodes. (A node is one "computer", including all of its CPUs and GPUs. If you use AWS, a node is an EC2 instance.) Each process performs the same task, and each process communicates with all the others. Only gradients are passed between processes/GPUs, so that network communication does not become a bottleneck.
During training, each process loads its own mini-batches from disk and passes them to its GPU. Each GPU runs its own forward pass, and then the gradients are all-reduced across GPUs. The gradients of a layer do not depend on the layers before it, so the gradient all-reduce runs concurrently with the backward pass to further reduce the network bottleneck. At the end of the backward pass, every node has the averaged gradients, which ensures that the model weights stay in sync.
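The synchronization described above can be illustrated without any GPU. The following is a minimal pure-Python sketch, not PyTorch's implementation: the helper names all_reduce_mean and sgd_step and the numbers are made up for illustration. It shows that if every replica starts from the same weights and applies the same averaged gradient, the replicas remain identical after the update.

```python
from typing import List


def all_reduce_mean(per_process_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across processes (hypothetical helper)."""
    world_size = len(per_process_grads)
    return [sum(values) / world_size for values in zip(*per_process_grads)]


def sgd_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD update, applied independently inside each process."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Two replicas start from identical weights (same seed in every process).
replicas = [[1.0, -2.0], [1.0, -2.0]]
# Each process computes different gradients from its own mini-batch.
local_grads = [[0.5, 1.0], [1.5, 3.0]]

# After the all-reduce, every process holds the same averaged gradient.
mean_grad = all_reduce_mean(local_grads)
replicas = [sgd_step(w, mean_grad, lr=0.1) for w in replicas]

print(mean_grad)                   # [1.0, 2.0]
print(replicas[0] == replicas[1])  # True
```

Because each process applies the same update to the same starting weights, the weights themselves never need to be sent over the network.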
All of this requires multiple processes (possibly on multiple nodes) to synchronize and communicate. PyTorch does this through its distributed.init_process_group function. This function needs to know where to find process 0, so that all processes can sync up, and the total number of processes to expect. Each individual process also needs to know the total number of processes, its rank among them, and which GPU to use. The total number of processes is commonly called the world size. Finally, each process needs to know which slice of the data to work on so that batches do not overlap. PyTorch provides torch.utils.data.distributed.DistributedSampler for this; it splits the data among the processes so that their training data does not overlap.
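To make the idea of non-overlapping slices concrete, here is a simplified pure-Python imitation of how a distributed sampler can divide sample indices among processes. It is a sketch under assumptions, not the library's code: shard_indices is a hypothetical helper, and it uses Python's random module where the real sampler uses PyTorch's own generator.

```python
import math
import random
from typing import List


def shard_indices(dataset_len: int, world_size: int, rank: int,
                  epoch: int = 0, seed: int = 0) -> List[int]:
    """Return the sample indices one process would see in one epoch."""
    indices = list(range(dataset_len))
    # Every process shuffles with the same seed, so all agree on the order.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices so each process gets the same count.
    per_rank = math.ceil(dataset_len / world_size)
    padding = per_rank * world_size - dataset_len
    indices += indices[:padding]
    # Process `rank` takes every world_size-th index, starting at `rank`.
    return indices[rank::world_size]


# 12 samples split among 4 processes: 3 indices each, no overlap.
shards = [shard_indices(12, world_size=4, rank=r) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

Passing a different epoch value changes the shared shuffle, so each process sees a different slice in each epoch while the slices still do not overlap.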
For a more detailed description of the internals of DDP, see the official documentation: DISTRIBUTED DATA PARALLEL
Minimal demo example with explanation
To demonstrate how to do this, we will create an example that trains on MNIST, then modify it to run on multiple GPUs across multiple nodes, and finally also allow for mixed-precision training.
Without multiprocessing
First, import the required dependencies:
import os
import argparse
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
from datetime import datetime
from apex.parallel import DistributedDataParallel as DDP
from apex import amp
We define a very simple convolutional model to predict MNIST.
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.fc = nn.Linear(7 * 7 * 32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out
The following is the training process:
def train(gpu, args):
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    # model = nn.DataParallel(model, device_ids=device_ids)
    # model = model.cuda(device=gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(),
                                               download=True)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True,
                                               num_workers=0, pin_memory=True)
    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))
The main() function will take some parameters and run the training function.
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--nodes", default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=1, type=int, help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int, help='ranking within the nodes')
    parser.add_argument('--epochs', default=2, type=int, metavar='N', help='number of total epochs to run')
    args = parser.parse_args()
    train(0, args)
Finally, make sure the main() function is called.
if __name__ == '__main__':
    main()
You can run this code by opening a terminal and typing python src/mnist.py -n 1 -g 1 -nr 0, and it will train on a single GPU on a single node.
Enable multiprocessing
To do this with multiprocessing, we need a script that starts one process for each GPU. Each process needs to know which GPU to use and its rank among all running processes. The script needs to be run on every node.
Take a look at the changes to each function. The new code is set off by lines of # characters so that it is easy to find:
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--nodes", default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=1, type=int, help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int, help='ranking within the nodes')
    parser.add_argument('--epochs', default=2, type=int, metavar='N', help='number of total epochs to run')
    args = parser.parse_args()
    #########################################################
    args.world_size = args.gpus * args.nodes
    os.environ['MASTER_ADDR'] = '172.20.109.105'
    os.environ['MASTER_PORT'] = '8888'
    mp.spawn(train, nprocs=args.gpus, args=(args,))
    #########################################################
    # train(0, args)
Here:
- args.nodes is the total number of nodes.
- args.gpus is the number of GPUs on each node (every node has the same number of GPUs).
- args.nr is the index of the current node among all nodes.
From the total number of nodes and the number of GPUs per node we can compute world_size, the total number of processes to run. All processes need to know the IP address and port of process 0 so that they can all synchronize at the start. Process 0 is generally called the master process; for example, we print information or save the model from process 0.
PyTorch provides mp.spawn to start all of a node's processes on that node. Each process runs train(i, args), where i goes from 0 to args.gpus - 1. Remember to run the main() function on every node, so that there are args.nodes * args.gpus = args.world_size processes in total.
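As a small worked example of how the node index and the per-node process index combine, the sketch below enumerates the global ranks for a hypothetical cluster of 2 nodes with 4 GPUs each. The helper global_rank is illustrative only; it uses the formula rank = args.nr * args.gpus + gpu that appears in the modified training function.

```python
def global_rank(node_rank: int, gpus_per_node: int, local_gpu: int) -> int:
    """Global rank of the process driving `local_gpu` on node `node_rank`."""
    return node_rank * gpus_per_node + local_gpu


# Hypothetical cluster: 2 nodes with 4 GPUs each.
nodes, gpus_per_node = 2, 4
world_size = nodes * gpus_per_node

all_ranks = [global_rank(n, gpus_per_node, g)
             for n in range(nodes) for g in range(gpus_per_node)]
print(world_size)  # 8
print(all_ranks)   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Every process gets a unique rank between 0 and world_size - 1, and the process with rank 0 is the first GPU on node 0.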
Next, we modify the training function:
def train(gpu, args):
    ############################################################
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
    ############################################################
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    ############################################################
    # Wrap the model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    ############################################################
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(),
                                               download=True)
    ############################################################
    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset=train_dataset, num_replicas=args.world_size,
                                                                    rank=rank)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False,
                                               num_workers=0, pin_memory=True, sampler=train_sampler)
    ############################################################
    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        # Reshuffle so that each epoch divides the data differently
        train_sampler.set_epoch(epoch)
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))
Here we first compute the global rank of the current process, rank = args.nr * args.gpus + gpu, and then initialize the distributed environment with dist.init_process_group, where:
- The backend parameter specifies the communication backend; the options include mpi, gloo, and nccl, and nccl is chosen here. NCCL is NVIDIA's official multi-GPU communication library and is relatively efficient. MPI is a communication protocol commonly used in high-performance computing, but you need to install an MPI implementation such as OpenMPI yourself. Gloo is a built-in communication backend, but it is less efficient.
- The init_method parameter specifies how initialization is done so that the processes can synchronize at the start. The env:// value used here means initialization from environment variables. Four values are needed: MASTER_PORT, MASTER_ADDR, WORLD_SIZE, and RANK. We have already set the first two in the environment; the last two can instead be passed through the world_size and rank parameters of dist.init_process_group. Other initialization methods include a shared file system and TCP. For example, to initialize over TCP, pass init_method='tcp://10.1.1.20:23456', which must give the IP address and port of the master. Note that this call is blocking: it waits for all processes to synchronize and fails if any process fails.
- On the model side, you only need to wrap the original model in DistributedDataParallel after copying it to the GPU; the gradient all-reduce is then handled behind the scenes.
- On the data side, use torch.utils.data.distributed.DistributedSampler to split the data among the processes; you only need to pass it to the DataLoader as its sampler. Note that train_sampler.set_epoch(epoch) should be called at the start of every epoch in the training loop (mainly so that the data is divided differently in each epoch). The rest of the training code is unchanged.
Finally, the code can be run. For example, with 4 nodes and 8 GPUs per node, we run the following in a terminal on each of the 4 nodes, where i is the index of that node:
python src/mnist-distributed.py -n 4 -g 8 -nr i
For example, on node 0:
python src/mnist-distributed.py -n 4 -g 8 -nr 0
In other words, running this script on each node tells it to start args.gpus processes that sync with each other before training begins.
Note that the effective batch_size is now batch_size_per_gpu * world_size. For models with batch normalization (BN), synchronized BN can also be used to obtain better results:
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
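To make the effective batch size concrete, the following arithmetic sketch uses the numbers from the example above: 4 nodes with 8 GPUs each and a per-process batch_size of 100. It assumes the 60,000-image MNIST training set and that each process receives an equal share of the data, rounded up, with the last partial batch kept.

```python
import math

dataset_len = 60_000          # size of the MNIST training set
batch_size_per_gpu = 100      # batch_size in the training function
nodes, gpus_per_node = 4, 8   # the cluster used in the example command

world_size = nodes * gpus_per_node
effective_batch_size = batch_size_per_gpu * world_size
# Each process sees an equal share of the dataset (rounded up).
samples_per_process = math.ceil(dataset_len / world_size)
# Number of optimizer steps each process takes per epoch.
steps_per_epoch = math.ceil(samples_per_process / batch_size_per_gpu)

print(world_size, effective_batch_size, samples_per_process, steps_per_epoch)
# 32 3200 1875 19
```

With 32 processes, each optimizer step averages gradients over 3,200 images, and one epoch takes only 19 steps instead of the 600 needed on a single GPU.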
The above describes distributed training, but the same approach also applies to evaluation or testing. For example, we can split the data among the processes for prediction, which speeds up inference. The code is exactly the same as above, but if we want to compute a metric, we need to all-reduce the statistics from each process, because each process only sees part of the data. For example, to compute classification accuracy, we count the number of samples (total) and the number of correct predictions (count) in each process and then aggregate them.
One thing to mention here is that dist.init_process_group actually creates a default distributed process group when it initializes the distributed environment, and this group also initializes PyTorch's torch.distributed package. As a result, we can call the torch.distributed API directly to perform basic distributed operations. The implementation is as follows:
# define tensor on GPU, count and total is the result at each GPU
t = torch.tensor([count, total], dtype=torch.float64, device='cuda')
dist.barrier() # synchronizes all processes
dist.all_reduce(t, op=torch.distributed.ReduceOp.SUM,) # Reduces the tensor data across all machines in such a way that all get the final result.
t = t.tolist()
all_count = int(t[0])
all_total = int(t[1])
acc = all_count / all_total
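The reason for summing count and total before dividing, rather than averaging per-process accuracies, can be seen in a small pure-Python sketch. The per-process numbers below are made up and deliberately unequal to exaggerate the effect; the sums play the role of what all_reduce with ReduceOp.SUM leaves on every process.

```python
# (correct, total) as counted by each of three processes on its own shard.
# These values are invented for illustration.
per_process = [(90, 100), (45, 50), (1, 10)]

# Element-wise totals, as a SUM all-reduce would produce on every process.
all_count = sum(c for c, _ in per_process)
all_total = sum(t for _, t in per_process)
acc_from_sums = all_count / all_total

# Averaging per-process accuracies weights every shard equally instead.
acc_from_means = sum(c / t for c, t in per_process) / len(per_process)

print(all_count, all_total)      # 136 160
print(round(acc_from_sums, 4))   # 0.85
print(round(acc_from_means, 4))  # 0.6333
```

When every process handles the same number of samples the two results coincide, but summing the raw counts is correct in both cases.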
How to launch distributed training
The process above uses PyTorch's torch.multiprocessing package (Multiprocessing package - torch.multiprocessing) to launch distributed training. The official ImageNet training example currently does it this way, and so does the detectron2 library: https://github.com/facebookresearch/detectron2/blob/main/detectron2/engine/launch.py.
If you launch with torch.multiprocessing.spawn, note that the training function you pass in must have the form fn(i, *args), where the first parameter i is the index of the process on the current node. This parameter serves as the local_rank, that is, the index of the training process within the current node; the rank mentioned above is the global process index. local_rank is very important, because the device used by each process must be chosen according to it. Usually local_rank is used directly as the GPU index, and it is set as follows:
torch.cuda.set_device(args.local_rank)  # before your code runs
# set up DDP
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank)
# or
with torch.cuda.device(args.local_rank):
    # your code to run
    ...
Besides mp.spawn, you can also launch the program with torch.distributed.launch (Distributed communication package - torch.distributed), which is the more common way. For example, for single-machine multi-GPU training the launch command is:
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
    YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)
Here NUM_GPUS_YOU_HAVE is the total number of GPUs and YOUR_TRAINING_SCRIPT.py is the training script. The script is basically the same as the one above, except that torch.distributed.launch automatically sets some environment variables when it starts (https://github.com/pytorch/pytorch/blob/master/torch/distributed/run.py#L211). For example, the RANK and WORLD_SIZE we need can be read directly from the environment:
rank = int(os.environ["RANK"])
world_size = int(os.environ['WORLD_SIZE'])
There are two ways to obtain local_rank:
1) One is to add a command-line parameter to the training script, which is filled in automatically when the program is launched:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
local_rank = args.local_rank
2) The other is to pass the --use_env flag to torch.distributed.launch. In this case the LOCAL_RANK environment variable is set, and local_rank can be read from it:
"""
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --use_env
    YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)
"""
import os
local_rank = int(os.environ["LOCAL_RANK"])
For multi-machine multi-GPU training, for example with two nodes, the launch commands are as follows:
# Node 1: (IP: 192.168.1.1, and has a free port: 1234)
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
    --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)
# Node 2
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
    --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)
Here:
- --nnodes is the number of nodes.
- --node_rank is the index of the current node.
- world_size = nnodes * nproc_per_node.
However, recent versions of PyTorch have introduced torchrun to replace torch.distributed.launch. torchrun is used in basically the same way as torch.distributed.launch, but the --use_env option has been dropped and local_rank is set directly in the LOCAL_RANK environment variable. Recent versions of torchvision are launched with torchrun; see vision/references/classification at main · pytorch/vision for details.
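Since the launcher exposes RANK, LOCAL_RANK, and WORLD_SIZE as environment variables, a training script can collect them in one place. The following is a minimal sketch; the DistEnv type and read_dist_env helper are hypothetical names, and falling back to a single process when the variables are absent is a design choice of this sketch rather than launcher behavior.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int         # global process index
    local_rank: int   # process index within the current node
    world_size: int   # total number of processes


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    """Read launcher-provided variables, defaulting to a single process."""
    return DistEnv(
        rank=int(environ.get("RANK", 0)),
        local_rank=int(environ.get("LOCAL_RANK", 0)),
        world_size=int(environ.get("WORLD_SIZE", 1)),
    )


# Made-up values for the second process on the second of two 4-GPU nodes.
env = read_dist_env({"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})
print(env)                # DistEnv(rank=5, local_rank=1, world_size=8)
print(read_dist_env({}))  # DistEnv(rank=0, local_rank=0, world_size=1)
```

With the defaults in place, the same script can also be started with plain python for single-GPU debugging.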
Mixed precision training (using apex)
Install Apex:
git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
Apex official documentation: Apex (A PyTorch Extension)
Mixed-precision training (training with a combination of single-precision (FP32) and half-precision (FP16) floating point) allows us to use larger batch sizes and to take advantage of NVIDIA Tensor Cores for faster computation. AWS p3 instances use NVIDIA Tesla V100 GPUs, which have Tensor Cores. Mixed-precision training with NVIDIA's Apex is very simple; only part of the code needs to change:
def train(gpu, args):
    ############################################################
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
    ############################################################
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    ############################################################
    # Wrap the model
    model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
    model = DDP(model)
    ############################################################
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(),
                                               download=True)
    ############################################################
    # train_sampler = torch.utils.data.distributed.DistributedSampler(dataset=train_dataset, num_replicas=args.world_size,
    #                                                                 rank=rank)
    # train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False,
    #                                            num_workers=0, pin_memory=True, sampler=train_sampler)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True,
                                               num_workers=0, pin_memory=True)
    ############################################################
    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            # Backward and optimize
            optimizer.zero_grad()
            ############################################################
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            ############################################################
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))
There are actually two changes:
- The first is wrapping model and optimizer with amp.initialize so that they support mixed-precision training. opt_level is the optimization level. O0 (FP32 throughout) and O3 (half precision throughout) are not true mixed precision, but they are useful as baselines for model quality and speed; O1 and O2 are the two mixed-precision settings, and you can pick either for mixed-precision training. Details can be found in the Apex documentation.
- The second is that, before the parameters are updated from the gradients, the loss must be scaled through amp.scale_loss to prevent gradient underflow. You can also replace nn.DistributedDataParallel with apex.parallel.DistributedDataParallel.
In all of these codes, the first character is a capital letter "O" and the second is a digit. If you type a zero in place of the letter, you get a puzzling error message.
apex.parallel.DistributedDataParallel is a drop-in replacement for nn.DistributedDataParallel. It is no longer necessary to specify the GPU, since Apex only allows one GPU per process. It also assumes that the script calls torch.cuda.set_device(local_rank) before moving the model to the GPU.
Mixed-precision training requires scaling the loss to prevent gradient underflow. Apex will do this automatically.
This script is run in the same way as the distributed training script:
python without_multiprocessing.py -n 1 -g 4 -nr 0
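To see why the loss has to be scaled, the sketch below reproduces gradient underflow with Python's standard library alone, using the struct module's half-precision format. The helper to_fp16, the gradient value, and the scale factor 2**16 are chosen for illustration and are not taken from Apex. Multiplying the loss by a constant multiplies every gradient by the same constant, so the gradients can be divided by it again once they are held in higher precision.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8           # a small but meaningful gradient value
loss_scale = 2.0**16  # illustrative scale factor, chosen for this sketch

unscaled = to_fp16(grad)             # too small for FP16, becomes zero
scaled = to_fp16(grad * loss_scale)  # large enough to survive in FP16
recovered = scaled / loss_scale      # unscale in higher precision

print(unscaled)   # 0.0
print(recovered)  # approximately 1e-08
```

Without scaling, this gradient would be lost entirely; with scaling, it is recovered to within FP16 rounding error.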
In addition, recent versions of PyTorch have built-in mixed-precision training; see AUTOMATIC MIXED PRECISION PACKAGE - TORCH.AMP for details. PyTorch's official distributed implementation is also fairly complete now, and both its performance and its results are good. An alternative is Horovod, which supports not only PyTorch but also TensorFlow and MXNet. It is relatively easy to use and is also fast.