Converting single-GPU training to DistributedDataParallel (DDP) training

1. Two distributed training methods

1. DataParallel (DP): simple to implement, requires little code, and starts quickly, but it is slower and suffers from load imbalance. It is single-process, multi-threaded: the main GPU uses far more memory than the other cards, and mixed-precision training with Apex is not supported. DP is PyTorch's older official solution and is constrained by the Python GIL. It works by splitting each batch of input data evenly across the GPUs for computation (note that the batch size must be at least the number of GPUs it is split across).
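The DP pattern described above is a one-line wrap. A minimal sketch, with a hypothetical toy model standing in for a real network:

```python
import torch
import torch.nn as nn

# Hypothetical toy model, just to show the wrapping pattern.
model = nn.Linear(16, 4)

# DataParallel splits each input batch across all visible GPUs in forward()
# and gathers outputs and gradients back on the main card (single process,
# multi-threaded). With zero or one GPU it simply runs on that device.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(8, 16, device=device)  # batch size must be >= number of GPUs
out = model(x)
print(out.shape)  # torch.Size([8, 4])
```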


2. DistributedDataParallel (DDP): based on All-Reduce, originally designed for distributed training (multiple machines, multiple GPUs) but equally usable on a single machine with multiple GPUs. Configuration is slightly more complex, it is multi-process, and the data distribution is well balanced. DDP is the newer generation of multi-GPU training, built on the torch.distributed library. That library provides distributed support for both GPU and CPU training and exposes an MPI-like interface for exchanging tensor data across multi-machine networks, with several backends and initialization methods. DDP raises communication efficiency through Ring-Reduce data exchange and sidesteps the Python GIL limitation by launching one process per GPU, which increases training speed.
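The all-reduce primitive DDP is built on can be shown in isolation. A minimal single-process sketch (gloo backend, world_size=1, so the result equals the input); in real training every rank would hold different gradients and receive their sum:

```python
import os
import torch
import torch.distributed as dist

# Stand-alone one-process group just to demonstrate the primitive;
# the address/port values here are arbitrary local defaults.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

grad = torch.tensor([1.0, 2.0, 3.0])
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # in-place sum across all ranks
# With a single rank the tensor is unchanged: [1.0, 2.0, 3.0]

dist.destroy_process_group()
```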

2. DDP implementation steps

1. Packages that need to be imported

Here, dist is responsible for inter-process communication and DDP is responsible for wrapping the model.
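The original code block was not preserved; the imports this step refers to are typically:

```python
import torch
import torch.distributed as dist                              # inter-process communication
from torch.nn.parallel import DistributedDataParallel as DDP  # model wrapper
from torch.utils.data.distributed import DistributedSampler   # per-rank data sharding
```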

2. Communication process initialization

Here local_rank defaults to -1, which indicates non-distributed (single-card) mode; when started with torch.distributed.launch, each spawned process receives its own local_rank value.
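A sketch of this initialization step, assuming the torch.distributed.launch launcher shown at the end of the article (which passes --local_rank to each process):

```python
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank=<n> to every process it
# spawns; the default -1 means "not running under the launcher".
parser.add_argument("--local_rank", type=int, default=-1)
args, _ = parser.parse_known_args()

if args.local_rank != -1:
    torch.cuda.set_device(args.local_rank)   # bind this process to its GPU
    dist.init_process_group(backend="nccl")  # nccl for GPUs, gloo for CPU
```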

3. Use DistributedSampler to encapsulate data 
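A sketch of this step with a hypothetical in-memory dataset. The num_replicas and rank arguments are normally inferred from the initialized process group; they are given explicitly here only so the snippet runs outside a launcher:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical dataset: 100 samples of 16 features, 4 classes.
dataset = TensorDataset(torch.randn(100, 16), torch.randint(0, 4, (100,)))

# Each rank sees its own non-overlapping shard of the dataset.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)  # no shuffle= when a sampler is set

# Call set_epoch each epoch so the shuffle order differs between epochs.
sampler.set_epoch(0)
```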

4. Move the model to CUDA and wrap it with DDP
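A sketch of the wrapping step. The single-process gloo fallback exists only so the snippet runs standalone; under torch.distributed.launch the process group is already initialized in step 2, and the model name is hypothetical:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Standalone fallback: one-process gloo group (skipped under a real launcher).
if not dist.is_initialized():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(16, 4)  # hypothetical model
if torch.cuda.is_available():
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by the launcher
    model = model.cuda(local_rank)
    # device_ids/output_device pin this replica to its own GPU.
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
else:
    model = DDP(model)  # CPU path: no device_ids

out = model(torch.randn(2, 16).to(next(model.parameters()).device))
dist.destroy_process_group()
```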

5. During training, move the data onto the device
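The steps above combine into a training loop like the following. This is a self-contained single-process sketch (gloo backend, world_size=1, toy model and data); under the real launcher the rank and world size come from the environment:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Standalone one-process group so the sketch runs without a launcher.
if not dist.is_initialized():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29502")
    dist.init_process_group("gloo", rank=0, world_size=1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
sampler = DistributedSampler(dataset)  # replicas/rank inferred from the group
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

model = DDP(nn.Linear(16, 4).to(device))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2):
    sampler.set_epoch(epoch)               # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.to(device), y.to(device)  # put the data on the device
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()                    # DDP all-reduces gradients here
        optimizer.step()

dist.destroy_process_group()
```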

Run the launch command:

python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=60055 train.py --GPUS 4,5,6,7

Origin blog.csdn.net/m0_62278731/article/details/134185975