[Distributed training] Pytorch-based distributed data parallel training


Introduction: Using DistributedDataParallel in PyTorch for multi-GPU distributed model training

Motivation

The easiest way to speed up neural network training is to use GPUs, which offer large speedups over CPUs on the kinds of computation common in neural networks (matrix multiplication and addition). As models or datasets grow, a single GPU quickly becomes insufficient; large language models such as BERT and GPT-2, for example, are trained on hundreds of GPUs. To perform multi-GPU training, we need a way to split the model and data across the different GPUs and to coordinate the training.

Why distribute data in parallel?

Many people prefer to implement their own deep learning models in PyTorch because it strikes a good balance between control and ease of use. PyTorch has two ways to split models and data across multiple GPUs: nn.DataParallel and nn.DistributedDataParallel.

nn.DataParallel is easier to use (just wrap the model and run the training script). However, because it uses a single process to compute the model weights and then scatter them to every GPU on each batch, networking quickly becomes a bottleneck and GPU utilization is often low. Furthermore, nn.DataParallel requires all GPUs to be on the same node and cannot be combined with Apex for mixed-precision training.

Therefore, the main differences between nn.DataParallel and nn.DistributedDataParallel can be summarized as follows:
1. DistributedDataParallel supports model parallelism while DataParallel does not, so if the model is too large to fit on a single card, only the former can be used;
2. DataParallel is single-process, multi-threaded and works only on a single machine, while DistributedDataParallel is multi-process and works for both single-machine and multi-machine setups, i.e. truly distributed training;
3. DistributedDataParallel training is more efficient: each process is an independent Python interpreter, which avoids the GIL, and the communication cost is lower, so training is faster; DataParallel is essentially deprecated;
4. Note that in DistributedDataParallel each process has its own optimizer and performs its own update step, but the gradients are passed between processes and are identical everywhere, so every process applies the same update.
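
To make the difference concrete, here is a minimal sketch of how the two wrappers are applied. The DistributedDataParallel call assumes the process group has already been initialized, which is covered later in this tutorial:

import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda(0)

# nn.DataParallel: one process, one thread per GPU, single machine only
dp_model = nn.DataParallel(model, device_ids=[0, 1])

# nn.DistributedDataParallel: one process per GPU; requires that
# torch.distributed.init_process_group(...) has already been called
ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[0])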

Shortcomings of the available documentation

Overall, the PyTorch documentation is thorough and clear. However, when trying to figure out how to use DistributedDataParallel, I found that all of the examples and tutorials were some combination of inaccessible, incomplete, or overloaded with irrelevant functionality.

PyTorch provides a tutorial on distributed training with AWS that does a good job of showing the AWS-side setup. However, the rest of it is a bit of a mess: for some reason it spends a lot of time showing how to compute metrics before going back to showing how to wrap the model and start the processes. It also never describes what nn.DistributedDataParallel actually does, which makes the related code blocks hard to follow.

The tutorial on writing distributed applications with PyTorch is far more detailed than needed for a first pass and is inaccessible to someone without a background in Python multiprocessing. It spends a lot of time re-implementing functionality that nn.DistributedDataParallel already provides, yet it gives neither a high-level overview of what DistributedDataParallel does nor guidance on how to use it (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).

There is also a PyTorch getting-started tutorial on distributed data parallelism. It shows some setup without explaining what the setup is for, then shows some code for splitting a model between GPUs and doing an optimization step. Unfortunately, I am fairly sure the code as written will not run (the function names do not match), and it does not tell you how to launch the code. Like the previous tutorials, it also gives no high-level overview of how distributed training works.

The closest thing PyTorch provides to a minimal working example is its ImageNet training example. Unfortunately, that example also demonstrates almost every other PyTorch feature, so it is hard to pick out what is relevant to distributed multi-GPU training.

Apex provides its own version of the PyTorch ImageNet example. Its nn.DistributedDataParallel is a drop-in replacement for PyTorch's, but it is only helpful once you already know how to use the PyTorch version.

This tutorial does a good job of describing what happens under the hood and how nn.DistributedDataParallel differs from nn.DataParallel, but it has no code samples showing how to use nn.DistributedDataParallel.

Outline

This tutorial is aimed at those who are already familiar with training neural network models in PyTorch. It starts by outlining the overall idea, then shows a minimal working example of training MNIST on a single GPU. That example is then modified to train on multiple GPUs, possibly across multiple nodes, and the changes are explained line by line. Importantly, it also explains how to launch the code. As a bonus, it demonstrates simple mixed-precision distributed training with Apex.

Overall framework

DistributedDataParallel uses multiprocessing to replicate the model across multiple GPUs, each controlled by a single process. (A process is an instance of Python running on the computer; by running multiple processes in parallel, we can take advantage of multi-core CPUs. You could have each process control several GPUs, but that is clearly slower than one GPU per process. It is also possible to have multiple worker processes fetching data for each GPU, but that is omitted here for simplicity.) The GPUs can all be on the same node or spread across multiple nodes. (A node is one "computer", including all of its CPUs and GPUs; on AWS, a node is an EC2 instance.) Every process performs the same task, and each process communicates with all the others. Only gradients are passed between processes/GPUs, so network communication does not become a bottleneck.
(Figure: multiprocessing, one process per GPU)
During training, each process loads its own mini-batches from disk and passes them to its GPU. Each GPU runs its own forward pass, and then the gradients are all-reduced across GPUs. Once a layer's gradients have been computed they do not change as backpropagation continues through earlier layers, so the gradient all-reduce can overlap with the backward pass, further alleviating the network bottleneck. At the end of the backward pass, every node has the averaged gradients, which keeps the model weights in sync across processes.
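
Conceptually, the gradient synchronization that DistributedDataParallel performs amounts to the manual all-reduce below (a simplified sketch; the real implementation buckets gradients and overlaps communication with the backward pass):

import torch.distributed as dist

def average_gradients(model):
    # What DDP automates: sum each parameter's gradient across all
    # processes and divide by the world size, so every replica applies
    # the same averaged gradient.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size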

All of this requires the processes (possibly on multiple nodes) to synchronize and communicate. PyTorch accomplishes this with its distributed.init_process_group function. This function needs to know where to find process 0, so that all processes can synchronize, as well as the expected total number of processes. Each individual process also needs to know the total number of processes, its own rank among them, and which GPU to use. The total number of processes is commonly referred to as the world size. Finally, each process needs to know which slice of the data to work on so that batches do not overlap. PyTorch provides torch.utils.data.distributed.DistributedSampler for this, which partitions the dataset among the processes so that the training data does not overlap.

For more detail on DDP's internal mechanism, see the official documentation: DISTRIBUTED DATA PARALLEL

Minimal demo example with explanation

To demonstrate how to do this, we will create an example that trains on MNIST, then modify it to run on multiple GPUs across multiple nodes, and finally also allow for mixed-precision training.

Without multiprocessing

First, import the required dependencies:

import os
import argparse
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
from datetime import datetime
# the apex imports are only needed for the mixed-precision section at the end
from apex.parallel import DistributedDataParallel as DDP
from apex import amp

We define a very simple convolutional model for classifying MNIST digits.

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.fc = nn.Linear(7 * 7 * 32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

The following is the training process:

def train(gpu, args):
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    # model = nn.DataParallel(model, device_ids=device_ids)
    # model = model.cuda(device=gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(),
                                               download=True)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True,
                                               num_workers=0, pin_memory=True)

    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))

The main() function will take some parameters and run the training function.

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--nodes", default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=1, type=int, help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int, help='ranking within the nodes')
    parser.add_argument('--epochs', default=2, type=int, metavar='N', help='number of total epochs to run')
    args = parser.parse_args()
    train(0, args)

Finally, make sure the main() function is called.

if __name__ == '__main__':
    main()

You can run this code by opening a terminal and typing python src/mnist.py -n 1 -g 1 -nr 0, and it will train on a single GPU of a single node.
(Figure: training output on a single GPU, single node)

Enable multiprocessing

To do this with multiprocessing, we need a script that starts one process per GPU. Each process needs to know which GPU to use and its rank among all the running processes. The script must be run on every node.

Let's look at the changes to each function in turn; the new code is set off with comment dividers so it is easy to spot:

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--nodes", default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=1, type=int, help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int, help='ranking within the nodes')
    parser.add_argument('--epochs', default=2, type=int, metavar='N', help='number of total epochs to run')
    args = parser.parse_args()
    #########################################################
    args.world_size = args.gpus * args.nodes
    os.environ['MASTER_ADDR'] = '172.20.109.105'
    os.environ['MASTER_PORT'] = '8888'
    mp.spawn(train, nprocs=args.gpus, args=(args,))
    #########################################################
    # train(0, args)

where:

  • args.nodes is the total number of nodes;
  • args.gpus is the number of GPUs on each node (assumed to be the same on every node);
  • args.nr is the rank of the current node among all the nodes.

From the number of nodes and the number of GPUs per node we can compute world_size, the total number of processes to run. All processes need to know the IP address and port of process 0 so that they can synchronize at startup. Process 0 is generally called the master process; for example, we print logs and save the model only from process 0.

PyTorch provides mp.spawn to start all of a node's processes on that node; each process runs train(i, args), where i goes from 0 to args.gpus - 1. Remember to run main() on every node, so that there are args.nodes * args.gpus = args.world_size processes in total.

Next, we modify the training function accordingly:

def train(gpu, args):
    ############################################################
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
    ############################################################
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    ############################################################
    # Wrap the model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    ############################################################

    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(),
                                               download=True)
    ############################################################
    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset=train_dataset, num_replicas=args.world_size,
                                                                    rank=rank)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False,
                                               num_workers=0, pin_memory=True, sampler=train_sampler)
    ############################################################

    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        # make sure each epoch sees a different shuffling of the data (see the note on DistributedSampler below)
        train_sampler.set_epoch(epoch)
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))

Here we first compute the global rank of the current process, rank = args.nr * args.gpus + gpu, and then call dist.init_process_group to initialize the distributed environment, where:

  • backend specifies the communication backend; the options include mpi, gloo, and nccl. We choose nccl here: it is NVIDIA's official multi-GPU communication library and is quite efficient. mpi is a common communication protocol in high-performance computing, but you need to install an MPI implementation such as OpenMPI yourself. gloo is built into PyTorch, but it is less efficient for GPU training.
  • init_method specifies how the processes find each other and synchronize at startup. Here we use env://, the environment-variable initialization method, which relies on four variables: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK. We have already set the first two; the latter two can also be supplied to dist.init_process_group through its world_size and rank arguments.
    Other initialization methods include a shared file system and TCP. For example, to initialize over TCP with init_method='tcp://10.1.1.20:23456', you must supply the master's IP address and port (a sketch is shown after this list). Note that this call is blocking: it waits until all processes have joined, and it fails if any process fails.
  • On the model side, you only need to wrap the original model with DistributedDataParallel after moving it to the GPU; it then performs the gradient all-reduce behind the scenes.
  • On the data side, use torch.utils.data.distributed.DistributedSampler to partition the data among the processes; you only need to pass it as the sampler of the DataLoader. Note that train_sampler.set_epoch(epoch) should be called at the start of each epoch in the training loop (as done above), mainly so that the data is shuffled differently in each epoch; the rest of the training code stays unchanged.
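
For reference, here is a minimal sketch of the TCP initialization mentioned above; the IP address and port are the placeholders from the bullet and should be replaced with the master node's actual address and a free port:

import torch.distributed as dist

def init_distributed_tcp(rank, world_size):
    # TCP initialization: every process connects to the master's address.
    # 10.1.1.20:23456 is the placeholder address from the example above.
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://10.1.1.20:23456',
        world_size=world_size,  # total number of processes
        rank=rank,              # global rank of this process
    )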

Finally, we can run the code. For example, if we have 4 nodes with 8 GPUs each, we run the script in a terminal on each of the 4 nodes, replacing i with that node's rank:

python src/mnist-distributed.py -n 4 -g 8 -nr i

For example, execute on node 0:

python src/mnist-distributed.py -n 4 -g 8 -nr 0

In other words, running this script on a node starts args.gpus processes on it, and all processes synchronize with each other before training begins.

Note that the effective batch_size is now batch_size_per_gpu * world_size. For models that contain BatchNorm layers, synchronized BatchNorm can be used to obtain better results:

model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
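
According to the PyTorch documentation, the conversion should be applied before wrapping the model with DistributedDataParallel. A minimal sketch of the ordering, reusing the model and gpu variables from the training function above:

# model has already been built and moved to the GPU, as in the training function above
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # convert BatchNorm layers first...
model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])  # ...then wrap with DDP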

The distributed training procedure described above also applies to evaluation or testing: if we split the data across processes for prediction, the prediction stage is sped up as well. The implementation is exactly the same as above, except that if we want to compute some metric, we need to all-reduce the statistics from each process, because each process only sees part of the data. For example, to compute classification accuracy we can have each process count its total number of samples (total) and its number of correct predictions (count), and then aggregate them.

One thing worth mentioning is that dist.init_process_group actually creates a default distributed process group and initializes the torch.distributed package, so we can call the torch.distributed collective APIs directly for basic distributed operations. A concrete implementation:

# define tensor on GPU, count and total is the result at each GPU
t = torch.tensor([count, total], dtype=torch.float64, device='cuda')
dist.barrier()  # synchronizes all processes
dist.all_reduce(t, op=torch.distributed.ReduceOp.SUM,)  # Reduces the tensor data across all machines in such a way that all get the final result.
t = t.tolist()
all_count = int(t[0])
all_total = int(t[1])
acc = all_count / all_total

How to launch distributed training

In the procedure above, PyTorch's torch.multiprocessing package (Multiprocessing package - torch.multiprocessing) is used to launch distributed training. The official ImageNet training example launches this way, and so does the detectron2 library: https://github.com/facebookresearch/detectron2/blob/main/detectron2/engine/launch.py.

If you launch with torch.multiprocessing.spawn, note that the training function passed in must have the form fn(i, *args), where the first argument i is the index of the process within the current node. This argument effectively plays the role of local_rank, i.e. the rank of the training process on its own node, whereas the rank computed earlier is the global rank across all nodes. local_rank matters because each process must choose its device based on it; usually it is used directly as the GPU index, for example:

torch.cuda.set_device(args.local_rank)  # before your code runs

# set up DDP on the GPU chosen via local_rank
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                  output_device=args.local_rank)

# or
with torch.cuda.device(args.local_rank):
    # your code to run
    ...

Besides mp.spawn, you can also launch the program with torch.distributed.launch (Distributed communication package - torch.distributed), which is the more common approach. For example, for single-machine multi-GPU training the launch command is:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
           YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)

Here NUM_GPUS_YOU_HAVE is the number of GPUs on the node and YOUR_TRAINING_SCRIPT.py is your training script, which is basically the same as before. The difference is that torch.distributed.launch automatically sets certain environment variables when it starts (https://github.com/pytorch/pytorch/blob/master/torch/distributed/run.py#L211). For example, the RANK and WORLD_SIZE we need can be read directly from the environment:

rank = int(os.environ["RANK"])
world_size = int(os.environ['WORLD_SIZE'])

There are two ways to obtain local_rank:
1) One is to add a command-line argument to the training script, which the launcher fills in automatically when the program starts:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()

local_rank = args.local_rank

2) The other is to pass --use_env to torch.distributed.launch. In that case the launcher sets the LOCAL_RANK environment variable instead, and local_rank can be read from it:

"""
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --use_env
           YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)
"""
import os
local_rank = int(os.environ["LOCAL_RANK"]) 
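
Putting these pieces together, a minimal sketch of a training entry point compatible with torch.distributed.launch might look like this (the structure is illustrative; the training body is elided):

import os
import argparse
import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    # filled in automatically by torch.distributed.launch (when --use_env is not passed)
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # RANK and WORLD_SIZE are set by the launcher
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend='nccl', init_method='env://',
                            world_size=world_size, rank=rank)
    # ... build the model, wrap it with DistributedDataParallel, and train ...

if __name__ == '__main__':
    main()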

For multi-machine, multi-GPU training, for example with two nodes, the launch commands are as follows:

# Node 1: (IP: 192.168.1.1, and has a free port: 1234)
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
           --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)

# Node 2
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
           --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)

where:

  • --nnodes is the total number of nodes;
  • --node_rank is the rank of the current node;
  • world_size = nnodes * nproc_per_node.

However, recent versions of PyTorch have introduced torchrun as a replacement for torch.distributed.launch. torchrun is used in essentially the same way, except that the --use_env flag has been dropped and LOCAL_RANK is always set in the environment. The latest torchvision reference scripts launch with torchrun; see vision/references/classification at main · pytorch/vision for details.
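
For example, assuming the same two-node setup as above, the equivalent torchrun commands would look roughly like this:

# Node 1: (IP: 192.168.1.1, and has a free port: 1234)
torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
         --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)

# Node 2
torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" \
         --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)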

Mixed precision training (using apex)

Install Apex :

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Apex official documentation: Apex (A PyTorch Extension)

Mixed-precision training (combining single-precision (FP32) and half-precision (FP16) computation) lets us use larger batch sizes and take advantage of NVIDIA Tensor Cores for faster computation. AWS p3 instances use NVIDIA Tesla V100 GPUs, which have Tensor Cores. Using NVIDIA's Apex for mixed-precision training is straightforward and only requires modifying part of the code:

def train(gpu, args):
    ############################################################
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
    ############################################################
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    ############################################################
    # Wrap the model
    model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
    model = DDP(model)
    ############################################################

    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(),
                                               download=True)
    ############################################################
    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset=train_dataset, num_replicas=args.world_size,
                                                                    rank=rank)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False,
                                               num_workers=0, pin_memory=True, sampler=train_sampler)
    ############################################################

    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        # shuffle differently each epoch, as in the distributed example above
        train_sampler.set_epoch(epoch)
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            ############################################################
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            ############################################################
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))

Actually there are two changes:

  • The first is to use amp.initialize to wrap the model and the optimizer so that they support mixed-precision training. opt_level is the optimization level: O0 (all floats) and O3 (all half precision) are not really mixed precision, but they are useful for establishing accuracy and speed baselines, while O1 and O2 are two different mixed-precision settings; pick one of them for mixed-precision training. Details can be found in the Apex documentation.
  • The other is that before the parameters are updated from the gradients, the loss must be scaled through amp.scale_loss to prevent the gradients from underflowing. In addition, you can replace nn.DistributedDataParallel with apex.parallel.DistributedDataParallel.

Yes, in all of these opt_level values the first character is a capital "O" and the second is a digit. And yes, if you type a zero instead, you will get a puzzling error message.

apex.parallel.DistributedDataParallel is a drop-in replacement for nn.DistributedDataParallel. It is no longer necessary to specify the GPUs, since Apex allows only one GPU per process. It also assumes that the script calls torch.cuda.set_device(local_rank) before moving the model to the GPU.

Mixed-precision training requires scaling the loss to prevent gradient underflow. Apex will do this automatically.

This script is launched in the same way as the distributed training script above.

python without_multiprocessing.py -n 1 -g 4 -nr 0

(Figure: training output)
In addition, newer versions of PyTorch have mixed-precision training built in; see AUTOMATIC MIXED PRECISION PACKAGE - TORCH.AMP for details. PyTorch's official distributed implementation is now fairly complete, and its performance and results are good. An alternative is Horovod, which supports not only PyTorch but also TensorFlow and MXNet; it is relatively easy to use and is also fast.
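
For reference, here is a minimal sketch of the built-in torch.cuda.amp API applied to the same training loop structure as above (model, criterion, optimizer and train_loader are assumed to be set up as in the earlier examples); it works together with nn.DistributedDataParallel and requires no extra installation:

import torch

scaler = torch.cuda.amp.GradScaler()

for images, labels in train_loader:
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)

    optimizer.zero_grad()
    # run the forward pass and loss computation in mixed precision
    with torch.cuda.amp.autocast():
        outputs = model(images)
        loss = criterion(outputs, labels)

    # scale the loss to avoid gradient underflow, then step and update the scale factor
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()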

References

  1. Distributed data parallel training in Pytorch
  2. Using DistributedDataParallel in PyTorch for multi-GPU distributed model training
  3. PyTorch Distributed Training Concise Course (2022 Update)
  4. Distributed training framework Horovod
