[Distributed training] Multi-GPU distributed model training based on PyTorch (supplement)


Introduction: Using DistributedDataParallel in PyTorch for multi-GPU distributed model training.
Original link: https://towardsdatascience.com/distributed-model-training-in-pytorch-using-distributeddataparallel-d3d3864dc2a7

With the continuous emergence of large models represented by ChatGPT, how to train large models within a reasonable time has gradually become an important research topic. To solve this problem, more and more practitioners are turning to distributed training. Distributed training is a technique for training deep learning models using multiple GPUs and/or multiple machines. Distributed training jobs can overcome single-GPU memory bottlenecks and train larger, more powerful models by utilizing multiple GPUs simultaneously.

This article presents an introduction to distributed training in pure PyTorch using the torch.nn.parallel.DistributedDataParallel API. The main contents are:

  • Discuss distributed training approaches in general, and data parallelism in particular
  • Introduce the relevant functions of torch.distributed and DistributedDataParallel, and illustrate how to use them with examples
  • Benchmark real training scripts to show the time savings

Background knowledge

Before delving into distributed training and data parallelism, some background knowledge about distributed training is needed. There are basically two different forms of distributed training in common use today: data parallelization and model parallelization.

  1. In data parallelization, the model training job is split over the data. Each GPU in the job receives its own independent batch slice of the data and uses it to compute its gradient updates independently. For example, with two GPUs and a batch size of 32, one GPU handles the forward and backward pass for the first 16 records and the second GPU handles the next 16 (see the short sketch after this list). These gradient updates are then synchronized across the GPUs, averaged together, and finally applied to the model (the synchronization step is technically optional, but theoretically faster asynchronous update strategies are still an active area of research).

  2. In model parallelization, the model training job is split over the model. Each GPU in the job receives a slice of the model, i.e. a subset of its layers. For example, one GPU is responsible for the output head, another for the input layers, and another for the hidden layers in between.
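
As a tiny sketch of the batch-slicing idea from item 1 (the shapes here are purely illustrative and not from the original article):

import torch

# a toy batch of 32 records with 10 features each
batch = torch.randn(32, 10)

# split it into two non-overlapping slices of 16 records, one per GPU
slice_for_gpu0, slice_for_gpu1 = batch.chunk(2, dim=0)
print(slice_for_gpu0.shape, slice_for_gpu1.shape)  # torch.Size([16, 10]) twice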

While both techniques have their pros and cons, data parallelization is the easier of the two to implement (it does not require knowledge of the underlying network architecture), so this strategy is usually tried first (the two techniques can also be combined, e.g. using model and data parallelization at the same time, but this is an advanced topic not covered here).

Note: This article is an introduction to the DistributedDataParallel API, so the details of model parallelization will not be discussed further.

How Data Parallel Works

The first widely adopted data parallel technique was the parameter server strategy in TensorFlow. This technique actually predates the first release of TensorFlow: it was already implemented in DistBelief, Google's internal predecessor to TensorFlow, as early as 2012. The strategy is illustrated in the figure below:
[Figure: the parameter server strategy]
In the parameter server strategy, the number of worker and parameter-server processes is variable, and each worker process maintains its own independent copy of the model in GPU memory. Gradient updates are computed as follows:

  1. After receiving the start signal, each worker process accumulates gradients for its specific batch slice.
  2. These workers send updates to the parameter server in a fan-out fashion.
  3. The parameter servers wait until they have all worker updates, then average the gradients for the portion of the parameter space they are responsible for.
  4. The averaged gradient updates are distributed back to the workers, which sum them up and apply them to their in-memory copies of the model weights (thus keeping the worker models in sync).
  5. Once each worker has the updates applied, a new batch of training can begin.

While easy to implement, this strategy has some major limitations. The most important is that each additional parameter server requires n_workers additional network calls at each synchronization step, an O(n) complexity cost. The overall speed of the computation depends on the slowest connection, so large parameter-server-based model training jobs tend to be very inefficient in practice, pushing GPU utilization down to 50% or below.

The more modern distributed training strategy does away with the parameter server. In the DistributedDataParallel strategy, every process is a worker process. Each process still maintains a full copy of the model weights in memory, but batch-slice gradient updates are now synchronized and averaged directly on the worker processes themselves. This is achieved using a technique borrowed from the field of high-performance computing: the all-reduce algorithm.
[Figure: the ring all-reduce algorithm]
The figure shows a specific implementation of the all-reduce algorithm, ring all-reduce. This algorithm provides an elegant way to synchronize the state of a set of variables (in this case tensors) across a set of processes. The vectors are passed around in a sequence of direct worker-to-worker connections. This removes the network bottleneck created by the worker-to-parameter-server connections, greatly improving performance. In this scheme, gradient updates are computed as follows:

  1. Each worker maintains its own copy of the model weights and its own copy of the dataset.
  2. After receiving the start signal, each worker process draws a disjoint batch slice from its dataset and computes a gradient for that batch.
  3. The workers synchronize their individual gradients using the all-reduce algorithm, so that every node locally computes the same averaged gradient.
  4. Each worker applies gradient updates to its local copy of the model.
  5. The next batch of training begins.

This all-reduce strategy was brought to prominence for deep learning by the 2017 Baidu paper "Bringing HPC Techniques to Deep Learning". It is significant because it is based on well-known HPC techniques with long-standing open-source implementations. All-reduce is part of the Message Passing Interface (MPI) standard, which is why PyTorch offers no fewer than three different backend implementations: Open MPI, NVIDIA NCCL, and Facebook Gloo (NVIDIA NCCL is generally recommended).
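
As a minimal illustration of the all-reduce primitive in PyTorch (a sketch only; DistributedDataParallel performs this step for you automatically), each process can hand its local gradient tensors to dist.all_reduce and then divide by the world size, so every worker ends up with the same averaged gradient:

import torch.distributed as dist


def average_gradients(model, world_size):
    # assumes dist.init_process_group(...) has already been called in this process
    for param in model.parameters():
        if param.grad is not None:
            # sum this gradient across all workers, then divide by the number of
            # workers; every process ends up with the identical averaged gradient
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size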

Data distribution process

1. Process initialization

It is important to know that modifying a training script to use the DistributedDataParallel strategy is not a simple one-line change. For details, please refer to another blog: PyTorch-based distributed data parallel training.

So the first and most complicated new thing to handle is process initialization. A normal PyTorch training script executes a single copy of its code in a single process. With a data-parallel model, the situation is more complicated: there are now as many simultaneous copies of the training script as there are GPUs in the training cluster, each running in a different process. A sample script is as follows:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def init_process(rank, size, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)


def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    print(f"Rank {rank + 1}/{world_size} process initialized.\n")
    # rest of the training script goes here!


NUM_EPOCHS = 10  # example value; the original snippet leaves this undefined
WORLD_SIZE = torch.cuda.device_count()

if __name__ == "__main__":
    mp.spawn(train, args=(NUM_EPOCHS, WORLD_SIZE), nprocs=WORLD_SIZE, join=True)

where WORLD_SIZE is the number of processes being orchestrated and the (global) rank is the position of the current process within that group. If this script is executed on a machine with 4 GPUs, WORLD_SIZE should be 4 (because torch.cuda.device_count() == 4), so mp.spawn will spawn 4 different processes, with ranks 0, 1, 2 and 3 respectively. The process with rank 0 is given some additional responsibilities and is therefore called the master process.

The rank of the current process is passed as the first argument to the spawned entry point (the train function in this case). Before it can perform any training work, it first needs to establish peer-to-peer connections. This is the job of dist.init_process_group. When run in the master process, this method sets up a socket listener on MASTER_ADDR:MASTER_PORT and starts handling connections from the other processes. Once all processes are connected, the method establishes the peer connections that allow the processes to communicate.

Note that this code only works for training on a single multi-GPU machine! The same machine is used to launch every process in the job, so training can only utilize the GPUs attached to that specific machine. This makes things easy: setting up IPC is as simple as finding a free port on localhost, which is immediately visible to all processes on that machine. IPC across machines is more complicated, because it requires configuring an external IP address that is visible to all machines.
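
Purely for illustration, here is a hedged sketch of what cross-machine initialization could look like; NODE_RANK and the helper name init_process_multi_node are assumed names, not part of the original script. By default, dist.init_process_group reads MASTER_ADDR and MASTER_PORT from the environment.

import os

import torch
import torch.distributed as dist


def init_process_multi_node(local_rank, backend='nccl'):
    # assumes MASTER_ADDR, MASTER_PORT, WORLD_SIZE and NODE_RANK are exported on
    # every machine (e.g. by a cluster scheduler); NODE_RANK is an assumed name
    gpus_per_node = torch.cuda.device_count()
    global_rank = int(os.environ['NODE_RANK']) * gpus_per_node + local_rank
    dist.init_process_group(backend,
                            rank=global_rank,
                            world_size=int(os.environ['WORLD_SIZE']))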

2. Process synchronization

Having understood the initialization process, we can now start filling in the body of the train method that every worker runs.

def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    for epoch in range(num_epochs):
        print(f"Rank {rank + 1}/{world_size} process initialized.\n")

Each of the four training processes runs this function to completion and exits when it finishes. If you run this code now (via python multi_init.py), you will see something like this on the console:

$ python multi_init.py
Rank 4/4 process initialized.
Rank 2/4 process initialized.
Rank 4/4 process initialized.
Rank 2/4 process initialized.
Rank 3/4 process initialized.
Rank 4/4 process initialized.
Rank 2/4 process initialized.
Rank 3/4 process initialized.
Rank 4/4 process initialized.
Rank 2/4 process initialized.
Rank 4/4 process initialized.
Rank 3/4 process initialized.
Rank 2/4 process initialized.
Rank 3/4 process initialized.
Rank 3/4 process initialized.
Rank 1/4 process initialized.
Rank 1/4 process initialized.
Rank 1/4 process initialized.
Rank 1/4 process initialized.
Rank 1/4 process initialized.

These processes run independently of each other, and there is no guarantee about which point in the training loop any one process has reached at any given time. This requires some careful changes to the initialization process.
(1) Any method that downloads data should be isolated to the master process.
Otherwise the download would be duplicated across all processes, with four processes writing to the same file at the same time, which would risk corrupting the data.

def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    if rank == 0:
        # only the master process touches the network
        downloading_dataset()
        downloading_model_weights()
    # every process waits here until the master has finished downloading
    dist.barrier()
    print(f"Rank {rank + 1}/{world_size} training process passed data download barrier.\n")

In the example, dist.barrier blocks the calling process until the master process (rank == 0) has finished downloading_dataset and downloading_model_weights. This isolates all of the network I/O to one process and prevents the other processes from running ahead until it has finished.
(2) The data loader needs to use DistributedSampler.

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def get_dataloader(rank, world_size):
    dataset = PascalVOCSegmentationDataset()
    sampler = DistributedSampler(dataset, rank=rank, num_replicas=world_size, shuffle=True)
    dataloader = DataLoader(dataset, batch_size=8, sampler=sampler)
    return dataloader

DistributedSampler uses the rank and world_size to figure out how to split the dataset into non-overlapping slices across the processes. At each training step, a worker process retrieves batch_size observations from its local slice of the dataset. In the example case of four GPUs, this means an effective batch size of 8 * 4 = 32.
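
One detail the snippet above does not show: DistributedSampler shuffles deterministically based on the current epoch, so the usual convention is to call its set_epoch method at the top of each epoch so that every epoch uses a different shuffle. A minimal sketch, reusing get_dataloader from above:

dataloader = get_dataloader(rank, world_size)
for epoch in range(num_epochs):
    # re-seed the sampler each epoch so the shuffle differs between epochs
    dataloader.sampler.set_epoch(epoch)
    for batch, segmap in dataloader:
        pass  # training step goes here
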
(3) Load tensors onto the correct device.
To do this, parameterize the .cuda() calls with the rank of the GPU the process is managing, for example:

batch = batch.cuda(rank)
segmap = segmap.cuda(rank)
model = model.cuda(rank)
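
Alternatively (a common convention, not something the original script requires), you can pin the process's default CUDA device once with torch.cuda.set_device, after which plain .cuda() calls land on the correct GPU:

torch.cuda.set_device(rank)  # pin this process to its own GPU
batch = batch.cuda()
segmap = segmap.cuda()
model = model.cuda()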

(4) Any randomness in model initialization must be disabled.
It is very important that the model replicas start out identical and stay in sync throughout training; otherwise the gradients will be incorrect and the model will fail to converge. A random initialization method such as torch.nn.init.kaiming_normal_ can be made deterministic with the following code:

import numpy as np
import torch

torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(0)

(5) Any method that performs file I/O should be isolated to the master process.
Otherwise, concurrent writes to the same file are inefficient and can corrupt the data. Again, this is easy to do with simple conditional logic:

if rank == 0:
    if not os.path.exists('/spell/checkpoints/'):
        os.mkdir('/spell/checkpoints/')
    # save once per epoch, from the master process only
    torch.save(model.state_dict(), f'/spell/checkpoints/model_{epoch}.pth')
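
When resuming from such a checkpoint, each process should map the saved tensors onto its own GPU. A minimal sketch, assuming the same checkpoint path as above; note that if the model was wrapped in DistributedDataParallel before saving, its keys carry a module. prefix, and saving model.module.state_dict() instead avoids that.

# every process loads the same checkpoint but maps it onto its own device
state_dict = torch.load(f'/spell/checkpoints/model_{epoch}.pth',
                        map_location=f'cuda:{rank}')
model.load_state_dict(state_dict)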

(6) Wrap the model in DistributedDataParallel.

model = DistributedDataParallel(model, device_ids=[rank])

This final step is necessary; after it, the model can be trained in distributed data-parallel mode.
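
Putting the pieces together, here is a condensed sketch of what a worker's train function could look like after all of these changes. The model class, loss, and optimizer settings are placeholders for illustration, and the snippet assumes the earlier imports plus from torch.nn.parallel import DistributedDataParallel.

def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    torch.manual_seed(0)  # keep model initialization identical across workers

    dataloader = get_dataloader(rank, world_size)
    model = SomeSegmentationModel()          # placeholder model class
    model = model.cuda(rank)
    model = DistributedDataParallel(model, device_ids=[rank])

    criterion = torch.nn.CrossEntropyLoss()  # placeholder loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(num_epochs):
        dataloader.sampler.set_epoch(epoch)
        for batch, segmap in dataloader:
            batch, segmap = batch.cuda(rank), segmap.cuda(rank)
            optimizer.zero_grad()
            loss = criterion(model(batch), segmap)
            loss.backward()   # gradients are all-reduced across workers here
            optimizer.step()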

An alternative: DataParallel

There is another data parallelization strategy in PyTorch, namely torch.nn.DataParallel. Its API is easy to use: all you have to do is wrap the model initialization like so:

model = nn.DataParallel(model)

All the changes amount to just one line! Why not use it, then? Because behind the scenes, DataParallel uses multithreading instead of multiprocessing to manage its GPU workers. This greatly simplifies the implementation: since the workers are all threads of the same process, they all have access to the same shared state without any additional synchronization steps. However, multithreading performs poorly for computational jobs in Python due to the Global Interpreter Lock (GIL). As the benchmarks in the next section show, models parallelized with DataParallel are significantly slower than models parallelized with DistributedDataParallel. Nevertheless, if you don't want to spend extra time and effort on multi-GPU training, DataParallel is an option to consider.

Benchmarks

The benchmarks show that DistributedDataParallel is significantly more efficient than DataParallel, but it is far from perfect. Going from V100x1 to V100x4 gives four times the raw GPU compute but only about three times the model training speed. Doubling the compute again by moving up to V100x8 increases training speed by only about another 30%, at which point DataParallel almost catches up with DistributedDataParallel in efficiency.

Relevant information

  1. Distributed TensorFlow Getting Started Tutorial
  2. Distributed Training with Horovod Estimators and PyTorch
  3. Distributed Parallel Training: Data Parallelism and Model Parallelism
  4. Distributed Parallel Training — Model Parallel Training
  5. torch.nn.parallel.DistributedDataParallel: quick tips
  6. Getting Started with Distributed Data Parallel
  7. Distributed Communication Package (PyTorch documentation)
  8. randomness
