[PyTorch Tutorial] How to use the PyTorch distributed parallel module DistributedDataParallel (DDP) for multi-GPU training



  • The focus of this chapter is learning how to use PyTorch's DistributedDataParallel (DDP) module for efficient distributed parallel training, which improves the training speed of the model.

1. Import the core library

  • The libraries that need to be imported for DDP multi-GPU training are:

    | Library | Purpose |
    | --- | --- |
    | torch.multiprocessing as mp | Wrapper around Python's native multiprocessing library |
    | from torch.utils.data.distributed import DistributedSampler | The DistributedSampler introduced in the previous section, which splits the input data across the GPUs |
    | from torch.nn.parallel import DistributedDataParallel as DDP | The protagonist and core of this chapter: the DDP module |
    | from torch.distributed import init_process_group, destroy_process_group | Two functions: the former initializes the distributed process group, the latter destroys it |

2. Initialize the distributed process group

  • Distributed process group: it contains all processes on all GPUs. Because DDP is based on multi-process parallel computing and each GPU corresponds to one process, a process group must be created first so that the processes can discover and communicate with each other.

  • First write a function ddp_setup():

    import torch
    import os
    from torch.utils.data import Dataset, DataLoader
    
    # Core libraries required for distributed DDP training
    import torch.multiprocessing as mp
    from torch.utils.data.distributed import DistributedSampler
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed import init_process_group, destroy_process_group
    
    
    # Initialize the DDP process group
    def ddp_setup(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        init_process_group(backend="nccl", rank=rank, world_size=world_size)
    
  • It contains two input parameters:

    | Parameter | Meaning |
    | --- | --- |
    | rank | The unique ID of each process in the process group, in the range [0, world_size - 1] |
    | world_size | The total number of processes in the process group |
  • In the function, we first set the environment variables:

    | Environment variable | Meaning |
    | --- | --- |
    | MASTER_ADDR | The IP address of the machine running the rank 0 process. For single-machine training, just use "localhost" |
    | MASTER_PORT | A free port on that machine that does not conflict with system ports |

    It is called the master because it is responsible for coordinating communication among all processes.

  • Finally, we call the init_process_group() function to initialize the default distributed process group. It takes the following input parameters:

    | Parameter | Meaning |
    | --- | --- |
    | backend | The communication backend, usually nccl. NCCL is the NVIDIA Collective Communications Library, used for distributed communication between CUDA GPUs |
    | rank | The unique ID of each process in the process group, in the range [0, world_size - 1] |
    | world_size | The total number of processes in the process group |
  • In this way, the initialization function of the process group is ready.

【Notice】

  • If your neural network model contains BatchNorm layers, you need to convert them to SyncBatchNorm layers so that the running statistics of the BatchNorm layers are synchronized across the multiple model replicas. (You can call torch.nn.SyncBatchNorm.convert_sync_batchnorm(model: torch.nn.Module) to convert all BatchNorm layers in the model to SyncBatchNorm layers in one call.)
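
  • As a minimal sketch of that conversion (the small Sequential model below is only an illustrative stand-in, not part of this tutorial's training code), it is a single call made before wrapping the model with DDP:

    import torch

    # Hypothetical model that contains a BatchNorm layer
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, kernel_size=3),
        torch.nn.BatchNorm2d(16),
        torch.nn.ReLU()
    )

    # Convert every BatchNorm layer in the model to SyncBatchNorm
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)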

3. Wrap the model

  • One thing to note when writing the trainer: before using the model, we need to wrap it with DDP:

    self.model = DDP(self.model, device_ids=[gpu_id])
    
  • In addition to the model, DDP also needs the device_ids: List[int] or torch.device argument, which is usually a list containing the single GPU ID that the current process's model copy lives on.
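
  • A minimal sketch of this step, assuming model is your network and gpu_id is the integer index of the GPU assigned to the current process (the process group must already be initialized for DDP() to work):

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    gpu_id = 0                                   # assumed: GPU index of the current process
    model = torch.nn.Linear(20, 1)               # assumed: your model
    model = model.to(gpu_id)                     # the parameters must live on the target GPU first
    model = DDP(model, device_ids=[gpu_id])      # then wrap the model with DDP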


4. Distribute input data

  • DistributedSampler chunks the input data across all distributed processes, ensuring that the processes do not receive overlapping samples.

  • Each process receives a batch of the specified batch_size. For example, if you set batch_size to 32 and you have 4 GPUs, the effective batch size is:
    32 × 4 = 128

    train_loader = torch.utils.data.DataLoader(
        dataset=train_set,
        batch_size=32,
        shuffle=False,	# shuffling must be disabled here
        sampler=DistributedSampler(train_set)	# use the distributed sampler instead
    )
    
  • Then, call the DistributedSampler's set_epoch(epoch: int) method at the beginning of each epoch, so that shuffling works correctly across multiple epochs; otherwise, the same sample order would be used in every epoch.

    def _run_epoch(self, epoch: int):
        b_sz = len(next(iter(self.train_loader))[0])
        self.train_loader.sampler.set_epoch(epoch)	# call set_epoch(epoch) to reshuffle for this epoch
        for x, y in self.train_loader:
            ...
            self._run_batch(x, y)
    

5. Save model parameters

  • Since we wrapped the model with DDP(model) earlier, self.model now points to the DDP-wrapped object instead of the model object itself. If we want to access the underlying model's parameters, we need to go through model.module.

  • Since the neural network model parameters are the same in all GPU processes, we only need to save the model parameters from one of the GPU processes.

    ckp = self.model.module.state_dict()	# note the .module needed to reach the wrapped model
    ...
    ...
    if self.gpu_id == 0 and epoch % self.save_step == 0:	# save one copy of the parameters from the gpu:0 process
        self._save_checkpoint(epoch)
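
  • As a rough sketch of loading this checkpoint later (assuming the same model architecture and the ./checkpoint.pth path used above), the saved state_dict can be loaded directly into a plain, unwrapped model because it came from model.module:

    import torch

    model = torch.nn.Linear(20, 1)	# assumed: same architecture as the trained model
    state_dict = torch.load('./checkpoint.pth', map_location='cpu')
    model.load_state_dict(state_dict)	# keys have no .module prefix, since we saved from model.module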
    

6. Run distributed training

  • The main function takes two new input parameters: rank (which replaces device) and world_size.

  • When mp.spawn is called, the rank parameter will be assigned automatically.

  • world_size is the total number of processes in the entire training job. For GPU training, it is the number of available GPUs, and each GPU runs exactly one process.

    def main(rank: int, world_size: int, total_epochs: int, save_step: int):
        ddp_setup(rank, world_size)	# initialize the distributed process group
        train_set, model, optimizer = load_train_objs()
        train_loader = prepare_dataloader(train_set, batch_size=32)
        trainer = Trainer(
            model=model,
            train_loader=train_loader,
            optimizer=optimizer,
            gpu_id=rank,	# changed: the process rank is used as the GPU ID
            save_step=save_step
        )
        trainer.train(total_epochs)
        destroy_process_group()	# destroy the process group at the end
        
    if __name__ == "__main__":
        import sys
        total_epochs = int(sys.argv[1])
        save_step = int(sys.argv[2])
        world_size = torch.cuda.device_count()
        mp.spawn(main, args=(world_size, total_epochs, save_step), nprocs=world_size)
    
  • The spawn() function from torch.multiprocessing is called here. Its main purpose is to execute the given function in multiple processes, each running in an independent Python interpreter. This avoids the limits that the Python Global Interpreter Lock (GIL) places on multi-threaded concurrency. In distributed training, each GPU or compute node usually runs its own process, and model parameter synchronization and gradient aggregation are achieved through inter-process communication.

  • Note that when calling spawn(), rank is not included in the args tuple. This is because spawn() assigns it automatically and passes it to fn as the first argument; see the table below for details, and the toy example after the table.

    | Parameter | Meaning |
    | --- | --- |
    | fn: function | The function to execute in each process. It is called as fn(i, *args), where i is the process index assigned automatically and args is the tuple of arguments passed to the function |
    | args: tuple | The arguments to pass to fn |
    | nprocs: int | The number of processes to start |
    | join: bool | Whether to wait for all processes to finish before the main process continues (default True) |
    | daemon: bool | Whether the spawned child processes are daemons (default False) |
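
  • As a toy illustration of this calling convention (the worker function below is hypothetical and unrelated to the training code), the first positional argument received by fn is the automatically assigned process index:

    import torch.multiprocessing as mp

    def worker(rank: int, message: str):
        # rank is injected by mp.spawn(); only message comes from args
        print(f'process {rank} says: {message}')

    if __name__ == "__main__":
        mp.spawn(worker, args=("hello",), nprocs=2)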

7. Complete DDP training code

First, a Trainer class is created.

import torch
import os
from torch.utils.data import Dataset, DataLoader

# Core libraries required for distributed DDP training
import torch.multiprocessing as mp
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group


# Initialize the DDP process group
def ddp_setup(rank: int, world_size: int):
    """
    Args:
    	rank: Unique identifier of each process.
    	world_size: Total number of processes.
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    
 
class Trainer:
    def __init__(
        self,
        model: torch.nn.Module,
        train_loader: DataLoader,
        optimizer: torch.optim.Optimizer,
        gpu_id: int,
        save_step: int	# checkpoint interval (in epochs)
    ) -> None:
        self.gpu_id = gpu_id
        self.model = model.to(gpu_id)	# move the model to this process's GPU
        self.model = DDP(self.model, device_ids=[gpu_id])	# wrap the model with DDP
        self.train_loader = train_loader
        self.optimizer = optimizer
        self.save_step = save_step

    def _run_batch(self, x: torch.Tensor, y: torch.Tensor):
        self.optimizer.zero_grad()
        output = self.model(x)
        loss = torch.nn.CrossEntropyLoss()(output, y)
        loss.backward()
        self.optimizer.step()

    def _run_epoch(self, epoch: int):
        b_sz = len(next(iter(self.train_loader))[0])
        self.train_loader.sampler.set_epoch(epoch)	# call set_epoch(epoch) to reshuffle each epoch
        print(f'[GPU{self.gpu_id}] Epoch {epoch} | Batchsize: {b_sz} | Steps: {len(self.train_loader)}')
        for x, y in self.train_loader:
            x = x.to(self.gpu_id)
            y = y.to(self.gpu_id)
            self._run_batch(x, y)

    def _save_checkpoint(self, epoch: int):
        ckp = self.model.module.state_dict()	# note the .module
        torch.save(ckp, './checkpoint.pth')
        print(f'Epoch {epoch} | Training checkpoint saved at ./checkpoint.pth')

    def train(self, max_epochs: int):
        for epoch in range(max_epochs):
            self._run_epoch(epoch)
            if self.gpu_id == 0 and epoch % self.save_step == 0:	# save one copy from the gpu:0 process
                self._save_checkpoint(epoch)

Then, build your own datasets, data loaders, neural network models, and optimizers.

def load_train_objs():
    train_set = MyTrainDataset(2048)
    model = torch.nn.Linear(20, 1)	# load your model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    return train_set, model, optimizer

def prepare_dataloader(dataset: Dataset, batch_size: int):
    return DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        shuffle=False,	# must be disabled when using DistributedSampler
        pin_memory=True,
        sampler=DistributedSampler(dataset=dataset)	# use the DistributedSampler
    )

Finally, define the main function.

def main(rank: int, world_size: int, total_epochs: int, save_step: int):
    ddp_setup(rank, world_size)	# initialize the distributed process group
    train_set, model, optimizer = load_train_objs()
    train_loader = prepare_dataloader(train_set, batch_size=32)
    trainer = Trainer(
        model=model,
        train_loader=train_loader,
        optimizer=optimizer,
        gpu_id=rank,	# changed: the process rank is used as the GPU ID
        save_step=save_step
    )
    trainer.train(total_epochs)
    destroy_process_group()	# destroy the process group at the end
    
if __name__ == "__main__":
    import sys
    total_epochs = int(sys.argv[1])
    save_step = int(sys.argv[2])
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size, total_epochs, save_step), nprocs=world_size)
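
For example, if the complete script above is saved as ddp_train.py (a filename chosen here just for illustration), running python ddp_train.py 50 10 would train for 50 epochs and save a checkpoint every 10 epochs, with mp.spawn launching one process per available GPU.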

At this point, you have successfully mastered the core usage of DDP distributed training.
