PyTorch Multi-GPU Training: Principles and Implementation


PyTorch Multi-GPU Training

1. Principles of multi-GPU training

The multi-GPU training process generally goes as follows:

  1. Designate a host (master) device
  2. The host splits a batch of data evenly across the devices
  3. The model is copied from the host to each device
  4. Each device runs forward propagation
  5. Each device computes its loss
  6. The host gathers all the loss results and updates the parameters
  7. The updated model parameters are copied back to each device


2. Single-machine multi-GPU training

Use the torch.nn.DataParallel(module, device_ids) wrapper, where module is the model and device_ids is the list of GPU ids to parallelize over.

Usage: wrap the model with this interface and then call the wrapped model as usual:

model = torch.nn.DataParallel(model)

Example: assume the model input is (32, input_dim), where 32 is the batch_size, the model output is (32, output_dim), and training uses 4 GPUs. nn.DataParallel splits the 32 samples into 4 chunks, sends one chunk to each GPU for a forward pass, producing 4 outputs of size (8, output_dim), then gathers these 4 outputs on cuda:0 and merges them into (32, output_dim).
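A minimal sketch of this behavior (the linear layer, the dimensions and the 4-GPU device list below are assumptions for illustration):

import torch
import torch.nn as nn

input_dim, output_dim, batch_size = 16, 4, 32      # assumed sizes for illustration
model = nn.Linear(input_dim, output_dim).cuda()    # stand-in for a real model, placed on cuda:0

# Wrap the model; each forward pass scatters the batch and gathers the outputs on cuda:0
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

x = torch.randn(batch_size, input_dim).cuda()
y = model(x)        # each GPU runs forward on an (8, input_dim) slice
print(y.shape)      # torch.Size([32, 4]), gathered on cuda:0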

As you can see, nn.DataParallel does not change the model's input or output, so the rest of the code needs no changes, which is very convenient. The downside is that the subsequent loss calculation happens only on cuda:0 and cannot be parallelized, which leads to an unbalanced load across GPUs.

This load imbalance can be alleviated by computing the loss inside the model and averaging the per-GPU losses at the end:

import torch.nn as nn

class Net(nn.Module):
    def __init__(self, input_dim=16, output_dim=4):  # example dimensions
        super().__init__()
        self.fc = nn.Linear(input_dim, output_dim)   # example layer standing in for the real network
        self.loss_fct = nn.CrossEntropyLoss()        # example loss function

    def forward(self, inputs, labels=None):
        outputs = self.fc(inputs)
        if labels is not None:
            # Pass labels into the model during training so the loss is computed
            # inside forward(), i.e. in parallel on each GPU
            loss = self.loss_fct(outputs, labels)
            return loss
        return outputs

Following the data-parallel logic described above, a loss is computed on each GPU, and these losses are gathered on cuda:0 and merged into a tensor of length 4. Before calling backward, this loss tensor must be reduced to a scalar, usually simply by taking the mean. The official PyTorch documentation for nn.DataParallel mentions this:

When module returns a scalar (i.e., 0-dimensional tensor) in forward(), this wrapper will return a vector of length equal to number of devices used in data parallelism, containing the result from each device.
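In code, the training step therefore reduces this gathered loss vector to a scalar before backpropagation; a minimal sketch (model is the Net above wrapped in nn.DataParallel, and inputs/labels are one training batch):

loss = model(inputs, labels)   # with 4 GPUs: a tensor of shape (4,), one loss per device
loss = loss.mean()             # reduce to a scalar
loss.backward()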

3. Multi-machine multi-GPU training

This approach can also be used for single-machine multi-GPU training.

It is implemented with torch.nn.parallel.DistributedDataParallel and torch.utils.data.distributed.DistributedSampler, combined with multiple processes.

  1. Multiple processes are launched from the very start (the number of processes is at most the number of GPUs), each process owns one GPU exclusively, and each process runs the code independently. This means every process initializes the model and trains it on its own; during each iteration the processes share and aggregate gradients through inter-process communication, and then each updates its parameters independently.

  2. Every process initializes its own copy of the training dataset, but each uses different records of the dataset for training, so the same model is fed different data, i.e. data parallelism. This is implemented by torch.utils.data.distributed.DistributedSampler, and the idea is straightforward: partition the data and give each process a different partition. There is a simple official demo; if you are interested, see the code in Distributed Training.

  3. Each process identifies itself through the local_rank variable: the process with local_rank 0 is the master and the others are workers. This variable is created for us by the torch.distributed package and is used as follows:

import argparse  # the argparse package must be imported
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

The code must be launched like this:

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py

This way, torch.distributed.launch injects the args.local_rank variable into each process as a command-line argument, and each process receives a different value. With 4 GPUs, for example, the 4 processes get args.local_rank values of 0, 1, 2 and 3 respectively.

In the command above, nproc_per_node is the number of processes to create on each node (create as many as the number of GPUs you use), and nnodes is the number of nodes; set it to 1 for single-machine multi-GPU training.
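For an actual multi-machine run, every node launches its own copy of the script with the same rendezvous settings. A sketch for two nodes with 2 GPUs each (the master address, port and script name are placeholders for illustration):

# on node 0 (the master, whose IP is used as master_addr)
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=10.1.1.20 --master_port=23456 train.py
# on node 1
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=10.1.1.20 --master_port=23456 train.py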

  4. Because every process initializes its own copy of the model, a random seed must be set so that the random weights generated during initialization are identical across processes, for example:
import random
import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

Putting everything together, the usage is as follows:

import argparse
import random

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data.distributed import DistributedSampler  # builds the distributed sampler, i.e. the data partitioning mentioned above

# Create the args.local_rank variable and receive the value injected by torch.distributed.launch
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# Each process selects the GPU it should use according to its local_rank
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)

# Initialize the distributed environment, mainly to enable inter-process communication
torch.distributed.init_process_group(backend='nccl')

# Fix the random seed
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Initialize the model (Net is the class defined earlier; train_dataset and batch_size are assumed to be defined elsewhere)
model = Net()
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Only the master process writes logs, otherwise the output becomes a mess
if args.local_rank == 0:
    tb_writer = SummaryWriter(comment='ddp-training')

# Distributed dataset
train_sampler = DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)  # note: batch_size here is the per-GPU batch size

# Distributed model
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
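What usually follows is the per-epoch training loop; a minimal sketch (num_epochs is an assumed hyperparameter, and calling set_epoch on the sampler makes the shuffling differ from epoch to epoch):

for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)    # reshuffle the partition differently in every epoch
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()               # DDP all-reduces the gradients across processes here
        optimizer.step()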

torch.distributed.init_process_group() has four commonly used parameters:

  • backend: the backend, essentially the protocol the machines use to exchange data
  • init_method: the machines need a master node through which to exchange data; this parameter specifies that master node
  • world_size: the total number of processes taking part in the job; with one process per machine this equals the number of machines, e.g. two machines training together gives world_size = 2
  • rank: distinguishes the master node from the others; the master is 0 and the rest are 1 to N-1, where N is world_size

Backend initialization

PyTorch provides the following commonly used backends: gloo (CPU training and a few GPU collectives), mpi (requires building PyTorch from source with MPI support), and nccl (GPU only, recommended for GPU training).

Initializing init_method

  1. TCP initialization
import torch.distributed as dist

dist.init_process_group(backend, init_method='tcp://10.1.1.20:23456',
                        rank=rank, world_size=world_size)

Note the format tcp://ip:port. The IP address is that of your master node, i.e. the host whose rank is 0; then pick a free port, and init_method is ready.

  2. Shared file system initialization
import torch.distributed as dist

dist.init_process_group(backend, init_method='file:///mnt/nfs/sharedfile',
                        rank=rank, world_size=world_size)

Initializing rank and world_size

This part is not hard, but you must make sure that different machines use different rank values, the master host's rank is 0, and the IP used in init_method belongs to the rank-0 host. Also, world_size is the total number of participating processes (equal to the number of hosts when each host runs a single process) and cannot be set arbitrarily: if the number of participants does not reach the configured world_size, initialization blocks and the code never starts running.
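As a concrete illustration of how these values relate to the launch command (assuming the two-node, 2-GPUs-per-node launch sketched earlier), torch.distributed.launch derives them as follows:

# world_size = nnodes * nproc_per_node = 2 * 2 = 4
# rank       = node_rank * nproc_per_node + local_rank
#   node 0 -> ranks 0 and 1 (rank 0 is the master process)
#   node 1 -> ranks 2 and 3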

4. Saving the model

Saving and loading the model differs from the single-GPU case. Here all parameters are moved to the CPU before being saved, because if parameters are saved while living on a GPU, the .pth file records which GPU each parameter belongs to, and loading will try to place them back on the same GPUs; this raises an error at load time if you do not have enough GPUs.

Alternatively, control which process saves the model and save only in the main process. The model is saved the same way, but in distributed training multiple processes run at the same time, so multiple copies would be written to storage; if you use shared storage, be careful with file names. In practice the parameters are usually saved only on the rank-0 process, which is enough because the model parameters of all processes stay synchronized.

torch.save(model.module.cpu().state_dict(), "model.pth")

Parameter loading:

param = torch.load("model.pth")
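A hedged sketch of restoring these parameters into a fresh model, where map_location keeps everything on the CPU even on a machine with fewer (or no) GPUs, and Net is the model class from above:

param = torch.load("model.pth", map_location="cpu")   # load the saved state dict onto the CPU
model = Net()                                         # same architecture as used for training
model.load_state_dict(param)                          # copy the parameters into the model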

For reference, the following is the model-saving code used in the huggingface/transformers examples:

if torch.distributed.get_rank() == 0:
    model_to_save = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    model_to_save.save_pretrained(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)

Reference link

pytorch multi-gpu parallel training

PyTorch single-machine multi-GPU training method and principle


Source: juejin.im/post/7082591377581670431