PyTorch training acceleration skills

    Recent projects have had fairly demanding speed requirements and needed results quickly, so I took the time to learn mixed-precision computation and parallel training. Since many articles already cover the underlying principles, this post only describes how to apply mixed-precision computation, data parallelism, and distributed training in PyTorch, without going into the theory.


Mixed precision

    Automatic mixed precision (AMP) training can greatly reduce training cost and increase training speed. Before it was available, automatic mixed-precision computation relied on the Apex tool developed by NVIDIA. Starting from PyTorch 1.6.0, PyTorch ships with its own AMP module, so the following briefly introduces how to use it.

## Import the AMP toolkit
from torch.cuda.amp import autocast, GradScaler

model.train()

## Scale the loss/gradients to speed up model convergence,
## because float16 gradients are prone to underflow (gradients too small)
scaler = GradScaler()

batch_size = train_loader.batch_size
num_batches = len(train_loader)
end = time.time()
for i, (images, target) in tqdm.tqdm(
    enumerate(train_loader), ascii=True, total=len(train_loader)
):
    # measure data loading time
    data_time.update(time.time() - end)
    optimizer.zero_grad()
    if args.gpu is not None:
        images = images.cuda(args.gpu, non_blocking=True)

    target = target.cuda(args.gpu, non_blocking=True)
    # autocast automatically selects the precision for each GPU op,
    # improving training performance without reducing model accuracy
    with autocast():
        # compute output
        output = model(images)

        loss = criterion(output, target)

    scaler.scale(loss).backward()
    # scaler.step() is used instead of the usual optimizer.step()
    scaler.step(optimizer)
    scaler.update()
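
    One optional extension that is not in the original snippet: if gradient clipping is needed, the scaled gradients should be unscaled first, otherwise the clipping threshold would be applied to the scaled values. A minimal sketch (the max_norm value is just an example):

scaler.scale(loss).backward()
# Unscale the optimizer's gradients in place before clipping
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)   # skips the step if inf/nan gradients were found
scaler.update()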
 

Data parallel

    When a server is a single machine with multiple GPUs, multi-GPU training can be used to accelerate the model (or simply because one GPU is not enough). To achieve this, we need a way to distribute one model across multiple GPUs for training.

    In PyTorch, nn.DataParallel provides a simple interface for parallelizing a model. We only need to wrap the model with nn.DataParallel and set a few parameters to easily run it on multiple GPUs.

# multigpu lists the ids of the GPUs to use
args.multigpu = [0, 1, 2, 3, 4, 5, 6, 7]
# Set the main GPU, which gathers the model's losses, computes the
# gradients, and applies the parameter updates
torch.cuda.set_device(args.multigpu[0])
# The gradients from all GPUs are gathered onto gpu[0]
model = torch.nn.DataParallel(model, device_ids=args.multigpu).cuda(
    args.multigpu[0]
)
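
    One practical detail that is not shown above: nn.DataParallel wraps the model, so the original model lives under the .module attribute. When saving weights it is usually convenient to save the unwrapped state dict so the checkpoint can later be loaded without the wrapper (the file name below is just an example):

# Save the underlying model's weights, not the DataParallel wrapper's
torch.save(model.module.state_dict(), "checkpoint.pth")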

Using mixed precision with nn.DataParallel

    nn.DataParallel requires some special handling to run the model with mixed precision; otherwise the parallelization will not work as intended.
    autocast is designed to be thread-local, so enabling the autocast region only in the main thread is not enough. Borrowing from the references, here is an example of the wrong way to do it:

model = MyModel() 
dp_model = nn.DataParallel(model)

# dp_model's internal threads won't autocast:
# the main thread's autocast state has no effect on them.
with autocast():
    output = dp_model(input)     # loss_fn still autocasts, but it's too late...
    loss = loss_fn(output)

    There are two solutions, which are introduced below:
1. Add the autocast decorator to the forward function of the model module

class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

2. Alternatively, open the autocast region inside forward:

class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...

    After modifying the forward function in either of these ways, autocast can be used in the main thread as usual:

model = MyModel()
dp_model = nn.DataParallel(model)

with autocast():
   output = dp_model(input)
   loss = loss_fn(output)

Disadvantages of nn.DataParallel

    In every training batch, the nn.DataParallel module gathers all the outputs back onto gpu[0], where the loss is computed. Several gigabytes of data transfer plus the loss computation have to be handled by one card, which easily leads to an uneven load: gpu[0]'s load is often visibly higher than that of the other GPUs. In addition, the data transfer speed between cards becomes a major bottleneck for training speed, which is clearly unreasonable.
    So next we introduce nn.DistributedDataParallel; for the underlying principles, see "Single machine multi-card operation (distributed DataParallel, mixed precision, Horovod)" in the references.


Distributed computing

nn.DistributedDataParallel: multiple processes, each controlling one GPU, train the model together.

Advantages

    Each process controls one GPU, which keeps the model's computation from being blocked by communication between GPUs and keeps the load on each card fairly even. Compared with a single GPU or with single-machine multi-GPU via nn.DataParallel, however, there are a few extra issues to solve:

1. The model parameters, and in particular the BatchNormalization layers, have to be synchronized across the different GPUs
2. Each process must be told its position, i.e. which GPU to use, which is specified with the args.local_rank parameter
3. When loading data, each process must be guaranteed to receive different data (DistributedSampler)

Usage introduction

Startup procedure
    Since I have so far only practiced single-machine multi-GPU training, this section mainly covers that case. Unlike simply running a Python program, we need to start the program with the launcher torch.distributed.launch that ships with PyTorch.

# CUDA_VISIBLE_DEVICES selects which of the machine's GPUs are visible
# nproc_per_node is the number of processes to launch
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
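
    As a side note of my own (not from the referenced posts): newer PyTorch releases (roughly 1.10 and later) provide the torchrun launcher, which replaces torch.distributed.launch and passes the local rank through the LOCAL_RANK environment variable instead of a command-line argument. An equivalent launch would look like:

# Equivalent launch with torchrun; the local rank is read from the
# LOCAL_RANK environment variable inside main.py
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 main.py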

Configure the main program

# Configure the local_rank argument, which tells each process its own
# position, i.e. which GPU it should use
parser.add_argument('--local_rank', type=int, default=0,
                    help='node rank for distributed training')

Initialize the GPU communication backend and how parameters are obtained

# Pin this process to its GPU
torch.cuda.set_device(args.local_rank)
# Initialize the GPU communication backend (NCCL) and the way parameters
# are obtained; 'env://' means they are read from environment variables.
# PyTorch's distributed training communicates between GPUs through NCCL.
torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    rank=args.local_rank
)
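
    After initialization it can be useful to confirm that each process sees the expected rank and world size. A minimal sanity check of my own (not part of the original code):

import torch.distributed as dist

rank = dist.get_rank()              # global rank of this process
world_size = dist.get_world_size()  # total number of launched processes
print(f"rank {rank}/{world_size} is using GPU {args.local_rank}")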

Reconfigure DataLoader


from torch.utils.data.distributed import DistributedSampler

kwargs = {"num_workers": args.workers, "pin_memory": True} if use_cuda else {}

train_sampler = DistributedSampler(train_dataset)
self.train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=train_sampler,
    **kwargs
)

# Note: since a sampler is used, shuffle must no longer be passed to the
# DataLoader (and a custom batch_sampler excludes batch_size, shuffle,
# sampler and drop_last entirely; see the PyTorch source below).
'''
PyTorch dataloader.py, lines 192-197:
        if batch_sampler is not None:
            # auto_collation with custom batch_sampler
            if batch_size != 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')
'''
    pin_memory refers to page-locked (pinned) host memory. Setting pin_memory=True when creating the DataLoader means the produced tensors are first placed in pinned memory, which makes copying them to GPU memory faster.
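
    One detail the DataLoader code above does not show: DistributedSampler shuffles based on its epoch counter, so set_epoch should be called at the start of every epoch, otherwise every epoch reuses the same ordering. A short sketch (args.epochs is assumed to exist in the training script):

for epoch in range(args.epochs):
    # tell the sampler which epoch this is so shuffling differs per epoch
    train_sampler.set_epoch(epoch)
    for images, target in train_loader:
        ...  # normal training step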

Initialization of the model

torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)
model.to(device)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
        find_unused_parameters=True,
        )
torch.backends.cudnn.benchmark = True
# This spends a little extra time at startup searching for the fastest
# convolution algorithm for each conv layer in the network, which then
# speeds up the whole network.
# DistributedDataParallel aggregates the gradients computed on the
# different GPUs to update the model copy on each GPU.

    DistributedDataParallel aggregates the gradients computed on the different GPUs and uses them to update the model copy on every GPU.
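
    A related practical point that the code above does not cover: with one process per GPU, checkpoints and logs are usually written by a single process only (typically rank 0) to avoid every process writing the same file. A minimal sketch (the file name is just an example):

if torch.distributed.get_rank() == 0:
    # model.module is the original model inside the DDP wrapper
    torch.save(model.module.state_dict(), "checkpoint.pth")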

Synchronizing the BatchNormalization layers

    For training tasks that consume a lot of GPU memory, the per-card batch size is often too small, which hurts the model's convergence. Cross-card synchronized Batch Normalization normalizes over the global samples, which effectively "increases" the batch size, so the training result is no longer affected by the number of GPUs used. See "Single machine multi-card operation (distributed DataParallel, mixed precision, Horovod)" in the references.
    Fortunately, recent PyTorch versions natively support synchronizing the BatchNormalization layers.

  • torch.nn.SyncBatchNorm
  • torch.nn.SyncBatchNorm.convert_sync_batchnorm: automatically converts the BatchNormalization layers in a model to torch.nn.SyncBatchNorm, so that they are synchronized across different GPUs

For the concrete usage, see the model initialization code above:

model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

Synchronizing the random seed used for model initialization

    I have not yet tried using different random seeds in different processes. To be safe, it is recommended to use the same random seed when initializing the model in every process, so that the model copies on all GPU processes start out identical.
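
    A simple way to do this (my own sketch; the seed value is arbitrary, it only has to be identical in every process) is to set all the relevant seeds at the top of the main program, before the model is built:

import random
import numpy as np
import torch

seed = 42  # same value in every process
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)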


Summary

    Standing on the shoulders of giants, I spent some time recently speeding up my model training, stepped into quite a few pitfalls, and finally got the acceleration working. I have summarized the concrete code here, drawing on many other blog posts; I hope it is of some help to everyone.


References (in no particular order)

  1. PyTorch 21. Single machine multi-card operation (distributed DataParallel, mixed precision, Horovod)
  2. PyTorch source code interpretation of torch.cuda.amp: detailed explanation of automatic mixing precision
  3. PyTorch's automatic mixed precision (AMP)
  4. Training speed up 60%! With only 5 lines of code, PyTorch 1.6 will natively support automatic mixed precision training
  5. torch.backends.cudnn.benchmark ?!
  6. The eight ways of writing Python decorators have been hacked, you can ask whatever you want~
