[Deep Learning Framework] PyTorch Distributed Data Parallel (DDP)


1. Introduction

DistributedDataParallel (DDP) is PyTorch's mechanism for multi-machine, multi-GPU distributed training. It improves communication efficiency by exchanging gradients with a Ring-AllReduce, and it sidesteps the Python GIL by launching multiple processes, which speeds up training. Concretely, the data is split across multiple processes (usually one process per GPU); each process builds its own replica of the model and trains on its own shard of the data, and after every backward pass the gradients are exchanged and averaged via Ring-AllReduce, so all replicas stay synchronized while the work runs in parallel.
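To make the gradient-exchange step concrete, here is a minimal hand-rolled sketch of what the averaging amounts to, written with torch.distributed primitives. DDP does this for you automatically (and overlaps it with the backward pass), so this is illustration only and assumes an already initialized process group:

import torch.distributed as dist

def average_gradients(model):
    # Illustration of DDP's post-backward step: all-reduce (sum) each parameter's
    # gradient across all processes, then divide by the world size to get the average.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size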

2. Quick Start

This is the normal single-card code:

## main.py
import torch
import torch.nn as nn
import torch.optim as optim

# Build the model
model = nn.Linear(10, 10).cuda()

# Forward pass
outputs = model(torch.randn(20, 10).cuda())
labels = torch.randn(20, 10).cuda()
loss_fn = nn.MSELoss()
loss = loss_fn(outputs, labels)

# Backward pass and parameter update
loss.backward()
optimizer = optim.SGD(model.parameters(), lr=0.001)
optimizer.step()

## Run with bash
python main.py

Here is the same code with DDP added for multi-card training:

## main.py
import torch
import torch.nn as nn
import torch.optim as optim
# New:
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# New: read the local_rank argument passed in from outside
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", default=-1, type=int)
args = parser.parse_args()
local_rank = args.local_rank

# New: DDP backend initialization
torch.cuda.set_device(local_rank)
dist.init_process_group(backend='nccl')  # nccl is the fastest and most recommended backend on GPUs

# Build the model
device = torch.device("cuda", local_rank)
model = nn.Linear(10, 10).to(device)
# New: wrap it into a DDP model
model = DDP(model, device_ids=[local_rank], output_device=local_rank)

# Forward pass
outputs = model(torch.randn(20, 10).to(device))
labels = torch.randn(20, 10).to(device)
loss_fn = nn.MSELoss()
loss = loss_fn(outputs, labels)

# Backward pass and parameter update
loss.backward()
optimizer = optim.SGD(model.parameters(), lr=0.001)
optimizer.step()


## Run with bash
# Change: start in DDP mode with torch.distributed.launch,
#   which passes a --local_rank argument to main.py. This is why we needed the
#   "read the local_rank argument from outside" step above.
python -m torch.distributed.launch --nproc_per_node 4 main.py
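As a quick sanity check (an addition here, not part of the original snippet), each of the four processes can print its rank, the world size, and the GPU it was pinned to, right after dist.init_process_group:

# illustrative only: place after dist.init_process_group(backend='nccl')
print(f"rank={dist.get_rank()}, "
      f"world_size={dist.get_world_size()}, "
      f"device=cuda:{torch.cuda.current_device()}")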

3. Basic concepts

Suppose there are two machines with 8 GPUs each, 16 GPUs in total, so the degree of parallelism is 16; DDP will start 16 processes at the same time. The main distributed concepts are introduced below.

group: the process group. By default, there is only one group.

world size: the total number of parallel processes globally, here 2 × 8 = 16.
# Get the world size; it is the same in every process, and returns 16 here
torch.distributed.get_world_size()

rank: the index of the current process, used for inter-process communication. With a world size of 16, the ranks are 0, 1, 2, ..., 15, and the process with rank=0 is the master process.
# Get the rank; every process has its own index, all different
torch.distributed.get_rank()

local_rank: the index of the process on its own machine. Machine 1 has local ranks 0, 1, 2, 3, 4, 5, 6, 7, and machine 2 also has 0, 1, 2, 3, 4, 5, 6, 7.
# local_rank is provided by the launcher rather than by a torch.distributed call:
# read it from the --local_rank argument passed by torch.distributed.launch
# (or from the LOCAL_RANK environment variable)
local_rank = args.local_rank
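As a worked example (an illustration added here, not from the original): with --nnodes=2 and --nproc_per_node=8, the launcher assigns ranks contiguously per node, so the global rank can be computed from node_rank and local_rank:

# illustrative arithmetic for the 2-machine x 8-GPU setup
nproc_per_node = 8
node_rank = 1    # this process runs on machine 2
local_rank = 3   # fourth GPU on that machine
rank = node_rank * nproc_per_node + local_rank   # = 11, unique among the 16 processes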

4. DDP usage process

4.1 Starting with launch

## main.py
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import argparse

# Addition 1: dependencies
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Addition 2: read the local_rank argument from outside; it is supplied when the script is started with launch
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", default=-1, type=int)
FLAGS = parser.parse_args()
local_rank = FLAGS.local_rank

# Addition 3: DDP backend initialization
#   a. use local_rank to select which GPU this process uses
torch.cuda.set_device(local_rank)
#   b. initialize DDP; the default nccl backend is fine for GPUs. For CPU-only models, choose another backend (e.g. gloo).
dist.init_process_group(backend='nccl')

# Addition 4: define the model and move it to this process's GPU, before calling `model = DDP(model)`
device = torch.device("cuda", local_rank)
model = nn.Linear(10, 10).to(device)

# Addition 5: only then wrap it into a DDP model
model = DDP(model, device_ids=[local_rank], output_device=local_rank)


## Dataset
# (download=True and ToTensor() added so the dataset yields tensors)
my_trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True,
                                           transform=torchvision.transforms.ToTensor())
# Addition 1: use DistributedSampler so each process samples its own shard of the data
train_sampler = torch.utils.data.distributed.DistributedSampler(my_trainset)
# Note: batch_size here is the batch size *per process*; the total batch size is
# this batch_size multiplied by the number of parallel processes (world_size).
batch_size, num_epochs = 64, 10
trainloader = torch.utils.data.DataLoader(my_trainset, batch_size=batch_size, sampler=train_sampler)

# Build the optimizer and loss function once, before the training loop
optimizer = optim.SGD(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    # Addition 2: set the sampler's epoch. DistributedSampler needs this to keep the
    # shuffling seed consistent across processes (and to reshuffle every epoch).
    trainloader.sampler.set_epoch(epoch)

    for data, label in trainloader:
        data = data.to(local_rank)
        label = label.to(local_rank)
        optimizer.zero_grad()
        prediction = model(data)
        loss = loss_fn(prediction, label)
        loss.backward()
        optimizer.step()


# 1. When saving the model, as in DP mode, note that what gets saved is model.module, not model,
#    because model is actually the DDP wrapper created by `model = DDP(model)`.
# 2. Saving only once, on the rank-0 process, is enough and avoids writing duplicate copies.
if dist.get_rank() == 0:
    torch.save(model.module, "saved_model.ckpt")
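For completeness, a hedged sketch of loading that checkpoint back (not in the original): because the file contains the unwrapped module, each process maps it onto its own GPU and can then re-wrap it in DDP.

# illustrative only: load the unwrapped module saved above onto this process's GPU
loaded = torch.load("saved_model.ckpt", map_location=torch.device("cuda", local_rank))
loaded = DDP(loaded, device_ids=[local_rank], output_device=local_rank)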

How to launch it on a single machine:

CUDA_VISIBLE_DEVICES="0,1,2,3" python -m torch.distributed.launch --nproc_per_node 4 --master_port 53453 main.py

On multiple machines:

## Run with bash
# Suppose we run on 2 machines, each with 8 usable GPUs
#    Machine 1:
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node 8 \
  --master_addr $my_address --master_port $my_port main.py
#    Machine 2:
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node 8 \
  --master_addr $my_address --master_port $my_port main.py

4.2 Starting with spawn

import argparse
import os

import numpy as np
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(local_rank, args):
    # local_rank is the process index passed in by mp.spawn (0 .. nprocs-1)
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl', init_method=args.dist_url,
                            rank=local_rank, world_size=args.world_size)
    # training code...

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpus", "-G", default='0', type=str)
    args = parser.parse_args()

    os.environ["CUDA_VISIBLE_DEVICES"] = args.gpus

    args.world_size = len(args.gpus.split(','))
    port_id = 10000 + np.random.randint(0, 1000)
    args.dist_url = 'tcp://127.0.0.1:' + str(port_id)
    # spawn one process per GPU; each process receives its index as the first argument of run()
    mp.spawn(run, args=(args,), nprocs=args.world_size, join=True)

if __name__ == "__main__":
    main()

How to launch it:

python main.py --gpus 0,1,2,3

5. A bug not directly related to DDP

When training the model with mp.spawn, the following error was raised:
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).

After debugging, it turned out that the dataset construction imported a pre-trained model. The original intention was to use only one of its functions, but the import produced tensors that require gradients and are not leaf tensors, and such tensors cannot be serialized across process boundaries, hence the error. The fix was to extract only the required function and remove the import of the pre-trained model.
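As the error message suggests, any tensor that has to cross a process boundary (for example, something precomputed from a model inside the dataset) should be detached first. A minimal hedged sketch of the pattern, with made-up names:

# illustrative only: `pretrained_feature` stands for any tensor produced by a model with requires_grad
safe_to_share = pretrained_feature.detach().cpu()   # now a leaf tensor with no autograd history
# safe_to_share can be stored in the dataset or passed to spawned worker processes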

References

https://zhuanlan.zhihu.com/p/178402798
