The basics here are fairly simple; for a more detailed walkthrough see this article: https://blog.csdn.net/qq_36276587/article/details/123913384
A brief summary of single-machine multi-GPU distributed training with PyTorch: the key APIs involved and the overall training procedure. The code below works with PyTorch 1.2.0.
Initialize the GPU communication backend (NCCL)
import torch
import torch.distributed as dist

# local_rank tells this process which GPU it should use
torch.cuda.set_device(FLAGS.local_rank)
dist.init_process_group(backend='nccl')
device = torch.device("cuda", FLAGS.local_rank)  # set this according to your own local_rank
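FLAGS.local_rank is assumed to come from the command line: torch.distributed.launch passes a --local_rank argument to every process it spawns. A minimal sketch of how it might be parsed (the FLAGS name simply matches the snippets in this post):

import argparse

parser = argparse.ArgumentParser()
# torch.distributed.launch injects --local_rank=<gpu index> into each spawned process
parser.add_argument('--local_rank', type=int, default=0)
FLAGS = parser.parse_args()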
Distributed Data Loading
# DistributedSampler splits the dataset across processes so each GPU trains on a different shard
train_sampler = torch.utils.data.distributed.DistributedSampler(traindataset)
train_loader = torch.utils.data.DataLoader(
    traindataset, batch_size=batchSize,
    sampler=train_sampler,
    num_workers=4, pin_memory=True,  # drop_last=False,
    collate_fn=alignCollate(imgH=imgH, imgW=imgW, keep_ratio=FLAGS.keep_ratio))
# collate_fn formats the training samples and labels into DataLoader batches
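One detail that is easy to miss: DistributedSampler uses the epoch number as its shuffle seed, so it should be told the current epoch at the start of every epoch, otherwise each GPU sees the same data order every time. Assuming a standard epoch loop (num_epochs is just a placeholder here):

for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)  # reseed the shuffle so the shards differ between epochs
    # ... iterate over train_loader as usual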
Wrap the model for distributed training
# wrap the already-built model for distributed training
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # synchronize BatchNorm statistics across GPUs
model = model.to(device)  # move the model to this process's GPU before wrapping
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[FLAGS.local_rank],
                                                  output_device=FLAGS.local_rank)
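For completeness, a rough sketch of the per-batch training step with the wrapped model; criterion, optimizer and the batch unpacking are placeholders for whatever the actual task uses, and dist.get_rank() is checked so that only one process writes checkpoints:

for images, targets in train_loader:
    images, targets = images.to(device), targets.to(device)
    preds = model(images)            # forward pass through the DDP-wrapped model
    loss = criterion(preds, targets)
    optimizer.zero_grad()
    loss.backward()                  # gradients are all-reduced across the GPUs here
    optimizer.step()

# save from a single process to avoid writing one copy per GPU
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), 'model.pth')  # model.module is the unwrapped model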
Start training
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train_distributed.py
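torch.distributed.launch starts nproc_per_node copies of train_distributed.py (one per GPU) and passes each one its own --local_rank, which is what FLAGS.local_rank above refers to. CUDA_VISIBLE_DEVICES restricts which physical GPUs the processes can see, so to train on two cards you would, for example, set CUDA_VISIBLE_DEVICES=0,1 and --nproc_per_node=2.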