In this installment
- This chapter focuses on using PyTorch's Distributed Data Parallel (DDP) library for efficient distributed parallel training, in order to speed up model training.
1. Import the core library
-
The libraries that need to be imported for DDP multi-card training are:
| Library | Purpose |
| --- | --- |
| import torch.multiprocessing as mp | Wrapper around Python's native multiprocessing library |
| from torch.utils.data.distributed import DistributedSampler | The DistributedSampler mentioned in the previous section; it splits the input data across the GPUs |
| from torch.nn.parallel import DistributedDataParallel as DDP | The protagonist: the core DDP module |
| from torch.distributed import init_process_group, destroy_process_group | Two functions: the former initializes the distributed process group, and the latter destroys it |
2. Initialize the distributed process group
-
A distributed process group contains all processes on all GPUs. Because DDP is based on multi-process parallel computing, with one process per GPU, a process group must be created and defined first so that the processes can discover and communicate with each other.
-
First, write a function ddp_setup():
import torch
import os
from torch.utils.data import Dataset, DataLoader
# Core libraries required for distributed DDP training
import torch.multiprocessing as mp
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group

# Initialize the DDP process group
def ddp_setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
-
It takes two input parameters:
| Parameter | Meaning |
| --- | --- |
| rank | The unique ID of each process in the process group, in the range [0, world_size - 1] |
| world_size | The total number of processes in the process group |
-
In the function, we first set the environment variables:
| Environment variable | Meaning |
| --- | --- |
| MASTER_ADDR | The IP address of the host running the rank 0 process. For single-machine training, just write "localhost" |
| MASTER_PORT | A free port on that host that does not conflict with system ports |
It is called the master host because it is responsible for coordinating communication between all processes.
-
Finally, we call the init_process_group() function to initialize the default distributed process group. It takes the following input parameters:
| Parameter | Meaning |
| --- | --- |
| backend | The communication backend, usually nccl. NCCL is the NVIDIA Collective Communications Library, which is used for distributed communication between CUDA GPUs |
| rank | The unique ID of each process in the process group, in the range [0, world_size - 1] |
| world_size | The total number of processes in the process group |
-
In this way, the initialization function of the process group is ready.
[Note]
- If your neural network model contains BatchNorm layers, you should convert them to SyncBatchNorm layers so that the running statistics of the BatchNorm layers are synchronized across the multiple model replicas. (You can call torch.nn.SyncBatchNorm.convert_sync_batchnorm(model: torch.nn.Module) to convert all BatchNorm layers in the network into SyncBatchNorm layers in one step.)
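As a minimal sketch of the conversion described in the note, the snippet below converts a small toy model. The model itself is a hypothetical example, not one from this article. The conversion runs on CPU, but actually training with SyncBatchNorm layers still requires an initialized process group, and the conversion is normally done before the model is wrapped in DDP.

```python
import torch

# Toy model containing one BatchNorm layer (hypothetical, for illustration only).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
)

# Replace every BatchNorm layer in the module tree with a SyncBatchNorm layer.
sync_model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# List the layer types to confirm that the BatchNorm2d layer was swapped out.
layer_types = [type(m).__name__ for m in sync_model.children()]
print(layer_types)
```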
3. Wrap the model
-
There is one thing to note when writing the trainer. Before starting to use the model, we need to use DDP to wrap our model:
self.model = DDP(self.model, device_ids=[gpu_id])
-
In addition to model, you also need to pass in device_ids: List[int] or torch.device, which is usually a list containing the ID of the GPU on which the model is located. The model should already have been moved to that GPU before it is wrapped.
4. Distribute input data
-
DistributedSampler splits the input data into chunks across all distributed processes, ensuring that the processes do not receive overlapping samples.
-
Each process receives input batches of the specified batch_size. For example, if you set batch_size to 32 and you have 4 GPUs, the effective batch size is 32 × 4 = 128.
train_loader = torch.utils.data.DataLoader(
    dataset=train_set,
    batch_size=32,
    shuffle=False,  # shuffling must be turned off here
    sampler=DistributedSampler(train_set)  # use the distributed sampler
)
-
Then, the set_epoch(epoch: int) method of DistributedSampler is called at the beginning of each epoch so that shuffling works properly across multiple epochs, which avoids using the same sample order in every epoch.
def _run_epoch(self, epoch: int):
    b_sz = len(next(iter(self.train_loader))[0])
    self.train_loader.sampler.set_epoch(epoch)  # call set_epoch before iterating
    for x, y in self.train_loader:
        ...
        self._run_batch(x, y)
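The partitioning and the batch-size arithmetic above can be checked without any GPUs. The following is a sketch under stated assumptions: the toy dataset, the helper indices_for, and the explicit num_replicas and rank arguments are illustrative only. In real DDP training the sampler reads both values from the process group.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

world_size = 4   # assumed number of GPUs / processes
batch_size = 32  # per-process batch size
dataset = TensorDataset(torch.arange(2048))  # toy dataset with 2048 samples

def indices_for(rank: int, epoch: int) -> list:
    # Hypothetical helper: passing num_replicas and rank explicitly lets us
    # inspect the split without initializing a process group.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=0
    )
    sampler.set_epoch(epoch)  # the epoch number changes the shuffle order
    return list(sampler)

# One shard of sample indices per rank for epoch 0.
shards = [indices_for(rank, epoch=0) for rank in range(world_size)]

samples_per_rank = len(shards[0])
steps_per_epoch = -(-samples_per_rank // batch_size)  # ceiling division
effective_batch = batch_size * world_size

print(samples_per_rank, steps_per_epoch, effective_batch)  # prints: 512 16 128
```

Each rank sees a quarter of the dataset, so one optimizer step consumes 128 samples in total while every process still loads only 32.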
5. Save model parameters
-
Since we wrapped the model with DDP(model) earlier, self.model now points to the DDP-wrapped object instead of the model object itself. If we want to read the underlying model parameters at this point, we need to access model.module.
-
Since the neural network model parameters are the same in all GPU processes, we only need to save the model parameters from one of the GPU processes.
ckp = self.model.module.state_dict()  # note: .module is required
...
...
if self.gpu_id == 0 and epoch % self.save_step == 0:  # save one copy of the model parameters from the GPU 0 process
    self._save_checkpoint(epoch)
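To see why .module matters, the sketch below compares state_dict keys with and without the wrapper. FakeDDP is a hypothetical stand-in that only mimics how DDP stores the model under the attribute module. It is used here so the example runs on CPU without a process group, and it is not the real DDP class.

```python
import torch

class FakeDDP(torch.nn.Module):
    """Hypothetical stand-in: keeps the wrapped model under .module, as DDP does."""

    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module

model = torch.nn.Linear(20, 1)
wrapped = FakeDDP(model)

# Keys taken from the wrapper carry a "module." prefix.
wrapped_keys = list(wrapped.state_dict().keys())
# Keys taken from .module are those of the underlying model.
plain_keys = list(wrapped.module.state_dict().keys())

# A checkpoint taken from .module loads directly into an unwrapped model.
fresh = torch.nn.Linear(20, 1)
fresh.load_state_dict(wrapped.module.state_dict())

print(wrapped_keys)  # prints: ['module.weight', 'module.bias']
print(plain_keys)    # prints: ['weight', 'bias']
```

Saving from .module keeps the checkpoint usable later for single-GPU inference, where the model is not wrapped.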
6. Run distributed training
-
The main function takes 2 new parameters: rank (which replaces device) and world_size.
-
When mp.spawn is called, the rank parameter is assigned automatically.
-
world_size is the total number of processes used for the whole training run. For GPU training, it is the number of available GPUs, with each GPU running exactly 1 process.
def main(rank: int, world_size: int, total_epochs: int, save_step: int):
    ddp_setup(rank, world_size)  # initialize the distributed process group
    train_set, model, optimizer = load_train_objs()
    train_loader = prepare_dataloader(train_set, batch_size=32)
    trainer = Trainer(
        model=model,
        train_loader=train_loader,
        optimizer=optimizer,
        gpu_id=rank,  # changed: the GPU ID is now the rank
        save_step=save_step
    )
    trainer.train(total_epochs)
    destroy_process_group()  # finally, destroy the process group

if __name__ == "__main__":
    import sys
    total_epochs = int(sys.argv[1])
    save_step = int(sys.argv[2])
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size, total_epochs, save_step), nprocs=world_size)
-
The spawn() function of torch.multiprocessing is called here. Its main job is to execute the specified function in multiple processes, each running in an independent Python interpreter. This avoids the limits that the Python Global Interpreter Lock (GIL) places on multi-threaded concurrency. In distributed training, each GPU or compute node usually runs an independent process, and model parameter synchronization and gradient aggregation are achieved through communication between the processes.
-
You can see that when calling the spawn() function, rank is not included in the args parameter. This is because it is assigned automatically; see fn in the table below for details.
| Parameter | Meaning |
| --- | --- |
| fn: function | The function to be executed in each process. It is called as fn(i, *args), where i is the unique process ID automatically assigned by the system and args is the argument tuple passed to the function |
| args: tuple | The arguments to be passed to the function fn |
| nprocs: int | The number of processes to start |
| join: bool | Whether to wait for all processes to complete before continuing execution of the main process (default True) |
| daemon: bool | Whether to set all spawned child processes as daemons (default False) |
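The fn(i, *args) calling convention can be illustrated with a CPU-only sketch. The names worker, launch, and out_dir are hypothetical and exist only for this example. Each process simply writes its rank to a file so that the automatically assigned first argument is visible.

```python
import os
import torch.multiprocessing as mp

def worker(rank: int, world_size: int, out_dir: str) -> None:
    # rank is supplied by mp.spawn; world_size and out_dir come from args.
    path = os.path.join(out_dir, f"rank_{rank}.txt")
    with open(path, "w") as f:
        f.write(f"rank {rank} of {world_size}")

def launch(world_size: int, out_dir: str) -> None:
    # args deliberately omits rank: spawn calls worker(i, world_size, out_dir)
    # once per process, with i running from 0 to nprocs - 1.
    mp.spawn(worker, args=(world_size, out_dir), nprocs=world_size, join=True)
```

In a script, launch would be called from inside the `if __name__ == "__main__":` block, for example launch(2, some_existing_directory), which should leave one file per rank in that directory.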
7. DDP complete training code
First, create a Trainer class.
import torch
import os
from torch.utils.data import Dataset, DataLoader
# Core libraries required for distributed DDP training
import torch.multiprocessing as mp
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group

# Initialize the DDP process group
def ddp_setup(rank: int, world_size: int):
    """
    Args:
        rank: Unique identifier of each process.
        world_size: Total number of processes.
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)

class Trainer:
    def __init__(
        self,
        model: torch.nn.Module,
        train_loader: DataLoader,
        optimizer: torch.optim.Optimizer,
        gpu_id: int,
        save_step: int  # checkpoint interval, in epochs
    ) -> None:
        # No trailing commas here: they would turn each attribute into a tuple.
        self.gpu_id = gpu_id
        # Move the model to this process's GPU, then wrap it in DDP.
        self.model = DDP(model.to(gpu_id), device_ids=[gpu_id])
        self.train_loader = train_loader
        self.optimizer = optimizer
        self.save_step = save_step

    def _run_batch(self, x: torch.Tensor, y: torch.Tensor):
        self.optimizer.zero_grad()
        output = self.model(x)
        loss = torch.nn.CrossEntropyLoss()(output, y)
        loss.backward()
        self.optimizer.step()

    def _run_epoch(self, epoch: int):
        b_sz = len(next(iter(self.train_loader))[0])
        self.train_loader.sampler.set_epoch(epoch)  # call set_epoch(epoch) so each epoch is shuffled differently
        print(f'[GPU{self.gpu_id}] Epoch {epoch} | Batchsize: {b_sz} | Steps: {len(self.train_loader)}')
        for x, y in self.train_loader:
            x = x.to(self.gpu_id)
            y = y.to(self.gpu_id)
            self._run_batch(x, y)

    def _save_checkpoint(self, epoch: int):
        ckp = self.model.module.state_dict()  # .module gives the underlying model
        torch.save(ckp, './checkpoint.pth')
        print(f'Epoch {epoch} | Training checkpoint saved at ./checkpoint.pth')

    def train(self, max_epochs: int):
        for epoch in range(max_epochs):
            self._run_epoch(epoch)
            # Only the GPU 0 process saves, since all replicas hold the same parameters.
            if self.gpu_id == 0 and epoch % self.save_step == 0:
                self._save_checkpoint(epoch)
Then, build your own datasets, data loaders, neural network models, and optimizers.
def load_train_objs():
    # MyTrainDataset is not defined in this article; it stands for your own Dataset class.
    train_set = MyTrainDataset(2048)
    model = torch.nn.Linear(20, 1)  # load your model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    return train_set, model, optimizer

def prepare_dataloader(dataset: Dataset, batch_size: int):
    return DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        shuffle=False,  # must be turned off when a sampler is given
        pin_memory=True,
        sampler=DistributedSampler(dataset)  # use the DistributedSampler
    )
Finally, define the main function.
def main(rank: int, world_size: int, total_epochs: int, save_step: int):
    ddp_setup(rank, world_size)  # initialize the distributed process group
    train_set, model, optimizer = load_train_objs()
    train_loader = prepare_dataloader(train_set, batch_size=32)
    trainer = Trainer(
        model=model,
        train_loader=train_loader,
        optimizer=optimizer,
        gpu_id=rank,  # changed: the GPU ID is now the rank
        save_step=save_step
    )
    trainer.train(total_epochs)
    destroy_process_group()  # finally, destroy the process group

if __name__ == "__main__":
    import sys
    total_epochs = int(sys.argv[1])
    save_step = int(sys.argv[2])
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size, total_epochs, save_step), nprocs=world_size)
At this point, you have successfully mastered the core usage of DDP distributed training.