The start and join functions in Python multiprocessing and the init_process_group function in torch.distributed

background

When learning the data-parallel training that ships with PyTorch, there are two modules: torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel. The first is multi-threaded: a single process in which one thread drives each GPU. The second is multi-process: one process controls one GPU.
For one process to control one GPU, we use the torch.multiprocessing library to spawn multiple processes and bind each process to a GPU, i.e., one process drives one GPU. Establishing this binding requires torch.distributed.init_process_group().
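
For orientation, a minimal sketch of how the two wrappers are typically applied (assuming a model and, for the DDP case, a rank variable and an already-initialized process group); note that the full example later in this post averages gradients manually instead of using DistributedDataParallel:

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(10, 10)

# torch.nn.DataParallel: one process, one thread per GPU.
dp_model = nn.DataParallel(model.cuda())

# torch.nn.parallel.DistributedDataParallel: one process per GPU;
# torch.distributed.init_process_group() must already have been called
# in this process.
# ddp_model = DDP(model.to(rank), device_ids=[rank])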

Multiprocessing

torch.multiprocessing is similar to Python's multiprocessing library, with some added support for tensors. A simple example of creating multiple processes is as follows:

import torch.multiprocessing as mp

def func(name):
    print('hello, ', name)

if __name__ == '__main__':
    names = ['bob', 'amy', 'sam']
    for name in names:
        p = mp.Process(target=func, args=(name,))
        p.start()

result:

hello,  bob
hello,  amy
hello,  sam

The above creates three processes; each process executes the function func, but each is passed a different name argument.
Note that this code must be placed in a .py file; it does not work in an .ipynb notebook.

start

The start() function starts the process's execution.

join

The join function can be understood as follows: when the parent process calls join() on a child process, the parent waits until that child process has finished executing.
For example, the above code can be locally modified as:

if __name__ == '__main__':
    names = ['bob', 'amy', 'sam']
    for name in names:
        p = mp.Process(target=func, args=(name,))
        p.start()
    print('it\'s over')

The result is:

hello,  bob
hello,  amy
it's over
hello,  sam

If you modify the code to:

if __name__ == '__main__':
    processes = []
    names = ['bob', 'amy', 'sam']
    for name in names:
        p = mp.Process(target=func, args=(name,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    print('it\'s over')

The result is:

hello,  bob
hello,  amy
hello,  sam
it's over

it's over is printed strictly after all the child processes have finished executing.

How to create a child process

On Unix-like platforms, there are three ways to create child processes: spawn, fork, and forkserver; Windows only has spawn. The start method can be changed with multiprocessing.set_start_method(...). For details, please refer to: multiprocessing — Process-based parallelism.
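
A minimal sketch of switching the start method (the value 'spawn' here is just one valid choice; set_start_method should be called at most once, inside the __main__ guard and before any process is created):

import torch.multiprocessing as mp

def func(name):
    print('hello, ', name)

if __name__ == '__main__':
    mp.set_start_method('spawn')  # or 'fork' / 'forkserver' on Unix-like systems
    p = mp.Process(target=func, args=('bob',))
    p.start()
    p.join()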

torch.distributed.init_process_group

The function prototype is torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(seconds=1800), world_size=-1, rank=-1, store=None, group_name='', pg_options=None).

  • backend: the backend for GPU communication; there are three options: nccl, gloo, and mpi, of which nccl is generally the best for GPUs.
  • init_method: how to initialize the process group and associate processes with GPUs; discussed later.
  • world_size: the total number of processes we use; for example, with 4 GPUs this can be set to 4.
  • rank: the index of the current process, between 0 and world_size-1.
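
Putting these together, a minimal sketch of a per-process call using environment-variable initialization (the backend, rank, and world_size values are placeholders; each of the world_size processes must make this call with its own rank):

import os
import torch.distributed as dist

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
# rank normally comes from whatever spawned this process; 0 is a placeholder.
dist.init_process_group(backend='gloo', rank=0, world_size=4)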

A slightly more complex example is shown below, adapted from: WRITING DISTRIBUTED APPLICATIONS WITH PYTORCH

### Distributed application example:
#### https://pytorch.org/tutorials/intermediate/dist_tuto.html
import random
import torch
import torch.nn as nn
from torchvision import datasets, transforms
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.optim.lr_scheduler import StepLR
import torch.nn.functional as F
import torch.optim as optim
import math
import os


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


""" 数据集分割 """
class Partition(object):

    def __init__(self, data, index):
        self.data = data
        self.index = index
    
    def __len__(self):
        return len(self.index)

    def __getitem__(self, index):
        data_idx = self.index[index]
        return self.data[data_idx]
# Given a data object and an index array, this extracts the subset
# of data at those indices.


class DataPartitioner(object):

    def __init__(self, data, sizes=[0.7, 0.2, 0.1], seed=1234):
        self.data = data
        self.partitions = []
        rng = random.Random()
        rng.seed(seed)
        data_len = len(data)
        indexes = [x for x in range(0, data_len)]
        rng.shuffle(indexes)  # randomly shuffle the data indices

        for frac in sizes:
            part_len = int(frac * data_len)
            self.partitions.append(indexes[0:part_len])
            indexes = indexes[part_len:]
    # According to sizes, data is split into several parts; the index
    # array of each part is stored in self.partitions


    # use() returns the sub-dataset for the given process index
    def use(self, partition):
        return Partition(self.data, self.partitions[partition])


""" 将 MNIST 数据集分割 """
# size是GPU数目
def partition_dataset(rank, size):
    dataset = datasets.MNIST('./data', train=True, download=False,
                             transform=transforms.Compose([
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.1307,), (0.3081,))
                            ]))
    # size = dist.get_world_size()  # number of processes, i.e., number of GPUs
    bsz = math.ceil(128 / float(size))
    # split 1.0 evenly according to the number of GPUs/processes
    partition_size = [1.0 / size for _ in range(size)]
    partition = DataPartitioner(dataset, partition_size)
    partition = partition.use(rank)
    # partition is the sub-dataset for the current process
    train_set = torch.utils.data.DataLoader(partition, batch_size=bsz, shuffle=True)

    return train_set, bsz
    # return this process's sub-dataset DataLoader and its batch_size

# Although training is distributed, the total batch_size is still 128


""" Gradient averaging. """
def average_gradients(model):
    size = float(dist.get_world_size())  # number of processes, i.e., number of GPUs
    for param in model.parameters():
        # param.grad.data is the gradient of each parameter
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)  # first sum across processes
        param.grad.data /= size  # then divide, giving the average


""" Distributed Synchronous SGD Example """
# rank is the index of the current process
def run(rank, size):
    torch.manual_seed(1234)
    train_set, bsz = partition_dataset(rank, size)

    # use the GPU
    device = torch.device("cuda:{}".format(rank))
    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    num_batches = math.ceil(len(train_set.dataset) / float(bsz))
    # total number of batches for this process
    # since the dataset and batch_size are both split evenly, the number of batches is the same as in single-process training
    for epoch in range(10):
        epoch_loss = 0.0
        # num = 0
        for data, target in train_set:
            data, target = data.to(device), target.to(device)
            # num += 1
            # print('Rank', rank, 'is dealing no.', num)
            optimizer.zero_grad()
            # zero the gradients first
            output = model(data)
            loss = F.nll_loss(output, target)
            epoch_loss += loss.item()
            loss.backward()   # backpropagate to compute the gradients
            average_gradients(model)
            # this is the part that differs in distributed training!!!!!
            optimizer.step()   # update the parameters
        print('Rank', rank, ', epoch', epoch, ': ', epoch_loss/num_batches)


# rank is this process's index (0 to size-1); size is the total number of processes sharing the work
def init_processes(rank, size, fn, backend='gloo'):
    """初始化分布式环境"""
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 4  # total number of GPUs; we cannot use get_world_size() here because init_process_group has not been called yet!!!!!
    processes = []
    mp.set_start_method("fork")
    for rank in range(size):
        p = mp.Process(target=init_processes, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

The result of running it is as follows:

Rank 1 , epoch 0 :  0.5456569447859264
Rank 3 , epoch 0 :  0.5368310891425432
Rank 2 , epoch 0 :  0.519815386723735
Rank 0 , epoch 0 :  0.5490584967614237
Rank 3 , epoch 1 :  0.2695276010860957
Rank 2 , epoch 1 :  0.25867786008253024
Rank 0 , epoch 1 :  0.26852701605160606
Rank 1 , epoch 1 :  0.2737363768117959

As you can see, all 4 GPUs run successfully.

We focus only on the init_processes function and the __main__ block.

In the __main__ block, we first set the number of GPUs via size; set it to however many GPUs are available. Then mp.Process() creates one process per rank, and each process first executes the init_processes() function. That function sets os.environ['MASTER_ADDR'] = '127.0.0.1' and os.environ['MASTER_PORT'] = '29500', and then calls dist.init_process_group(). We can think of this as associating each process with a GPU and enabling the GPUs to communicate with each other. The process then executes run(rank, size), which performs data-parallel training using the two arguments passed in (the index of the GPU and the total number of GPUs, respectively). What happens after dist.init_process_group() is ordinary training code and is not our concern here.

Three initialization (GPU communication) methods after creating the processes:

Here's how we associate a process with the GPU once we've created it. Reference: DISTRIBUTED COMMUNICATION PACKAGE - TORCH.DISTRIBUTED

  • Environment variable initialization :
    This is the method used in the code above: setting environment variables such as os.environ['MASTER_ADDR'].
    • MASTER_ADDR: the IP address of the machine where the rank 0 process runs. With multiple GPUs one must always act as the head; by default this is the rank 0 GPU, so we set the IP address of the machine it runs on. For multi-machine training this is used for socket communication, though I have not tried it; on a single machine, localhost does the trick.
    • MASTER_PORT: a free port on the machine where the rank 0 node is located.
    • WORLD_SIZE: the total number of GPUs (processes). It can be set here or in the init_process_group function; the example above sets it by passing it as an argument to init_process_group.
    • RANK: the index of the current GPU (process). It can be set here or in the init_process_group function; the example above sets it by passing it as an argument to init_process_group.
  • TCP initialization : Similar to the previous method. We can modify the init_processes function as follows, and the code still works.
def init_processes(rank, size, fn, backend='nccl'):
    """初始化分布式环境"""
    dist.init_process_group(backend, init_method='tcp://localhost:29500', rank=rank, world_size=size)
    fn(rank, size)
  • Shared file-system initialization : Use a shared file for initialization. The file must be visible to all machines in the group. In the init_process_group function, set init_method='file://xxxx'. The file must not already exist, but its parent folder must exist. The shared file is not deleted automatically after execution, so we need to delete it manually. I tried it without success; perhaps some extra setup is required, so I will not go into it further. A sketch of what the call might look like is shown below.
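
For reference, a minimal, untested sketch of shared file-system initialization, following the documented init_method format (the path /mnt/nfs/sharedfile is just a placeholder for a location every machine can reach):

def init_processes(rank, size, fn, backend='gloo'):
    """Initialize the distributed environment via a shared file."""
    # The file must not exist yet; its parent directory must exist and be
    # shared (e.g., over NFS) by every participating machine.
    dist.init_process_group(backend,
                            init_method='file:///mnt/nfs/sharedfile',
                            rank=rank, world_size=size)
    fn(rank, size)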


Origin blog.csdn.net/qq_43219379/article/details/123561012