Notes on using DistributedDataParallel in PyTorch

1. Basic concepts

There are a few concepts that must be grasped before using DistributedDataParallel.

Multiple machines, multiple GPUs
world_size: how many machines (servers) take part in the job. Strictly speaking it is the total number of processes, which equals the number of machines when each machine runs one process.
rank: the index of the current machine, i.e. which server this process runs on.
local_rank: the index of the GPU inside the current machine.

Single machine, multiple GPUs
world_size: how many GPUs the machine has in total (one process per GPU).
rank: the index of the current GPU.
local_rank: the index of the current GPU; here it is the same as rank.

The short sketch below shows how these values can be inspected at runtime.
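As a concrete illustration, here is a minimal sketch, assuming the script is started with torchrun, which sets these environment variables for every process it launches:

import os

world_size = int(os.environ["WORLD_SIZE"])  # total number of processes in the job
rank = int(os.environ["RANK"])              # global index of this process
local_rank = int(os.environ["LOCAL_RANK"])  # index of this process on its own machine
print(f"world_size={world_size}, rank={rank}, local_rank={local_rank}")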

2. How to use

2.1. Modify the main function

When the script is started with the distributed launcher, a local_rank value is handed to your program (the older torch.distributed.launch passes it as a --local_rank command-line argument, while torchrun exposes it through the LOCAL_RANK environment variable), so you need to read it in your code, for example:

parser.add_argument("--local_rank", type=int, default=-1, help="local rank of the process, set by the distributed launcher (-1 if not set)")
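If the script may also be started with torchrun, which does not pass --local_rank by default, a robust pattern is to fall back to the LOCAL_RANK environment variable. A minimal sketch of that pattern (the argument name follows the snippet above; this is not from the original post):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local rank set by torch.distributed.launch; -1 if not set")
args = parser.parse_args()

# prefer the command-line value, otherwise fall back to the torchrun environment variable
local_rank = args.local_rank if args.local_rank >= 0 else int(os.environ.get("LOCAL_RANK", 0))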

2.2. Initialization

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"  # list as many GPU indices as the machine has; set this before anything touches CUDA

torch.distributed.init_process_group(backend="nccl")

2.3. Set device

local_rank = torch.distributed.get_rank()   # on a single machine this equals the GPU index
torch.cuda.set_device(local_rank)           # bind this process to its own GPU
global device
device = torch.device("cuda", local_rank)   # used later when moving the model and the data

I did not use args.local_rank here but defined a new local_rank variable, because I trust torch.distributed.get_rank() more. I wrapped it in torch.device and declared device as global because the device is used later when moving the model and the data; this works without errors. Note that get_rank() returns the global rank, which equals the local rank only when training on a single machine.

2.4. Loading models to multiple GPUs

model.to(device)  # this line is required; it is better not to use model.cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank, find_unused_parameters=True)  # wrap the model so it trains on multiple GPUs
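After wrapping, a training step looks the same as in single-GPU code; DDP averages the gradients across all processes during backward(). A minimal sketch with a dummy batch (the optimizer, the input shape and the 10 output classes are placeholders, not part of the original post):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # hypothetical optimizer for the sketch

inputs = torch.randn(8, 3, 224, 224, device=device)       # dummy input batch
targets = torch.randint(0, 10, (8,), device=device)       # dummy labels, 10 classes assumed

optimizer.zero_grad()
outputs = model(inputs)                                    # forward pass on this process's GPU
loss = torch.nn.functional.cross_entropy(outputs, targets)
loss.backward()                                            # DDP averages gradients across processes here
optimizer.step()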

2.5. Loading data onto the GPU

data = data.to(device)  # tensors are not moved in place, so assign the result back
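In practice the "data" here is a batch produced by a DataLoader, and each process should read a different shard of the dataset. A minimal sketch, assuming a train_dataset object exists and using DistributedSampler (not shown in the original post):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True)  # train_dataset is assumed to exist
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(10):                                    # 10 epochs chosen arbitrarily
    sampler.set_epoch(epoch)                               # so each epoch is shuffled differently
    for data, labels in loader:
        data = data.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        # ... forward / backward / step as in the sketch above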

2.6. Startup

torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:12345 train_cylinder_asym.py

Here --nproc_per_node should equal the number of GPUs to be used on this machine; torchrun starts one process per GPU and sets WORLD_SIZE, RANK and LOCAL_RANK for each of them.

