Background
PyTorch ships two tools for data-parallel training: torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel. The first is multi-threaded: a single process uses one thread per GPU. The second is multi-process: each process controls one GPU.
To have one process per GPU, we use the torch.multiprocessing library to create multiple processes and bind each process to one GPU. The processes are connected to one another with torch.distributed.init_process_group().
References
- PyTorch Distributed Overview
- Getting Started with Distributed Data Parallel
- Writing Distributed Applications with PyTorch
- Distributed communication package - torch.distributed
- multiprocessing — Process-based parallelism
Multiprocessing
torch.multiprocessing is a wrapper around Python's multiprocessing library that adds support for sharing tensors between processes. A simple example that creates several processes:
import torch.multiprocessing as mp

def func(name):
    print('hello, ', name)

if __name__ == '__main__':
    names = ['bob', 'amy', 'sam']
    for name in names:
        p = mp.Process(target=func, args=(name,))
        p.start()
result:
hello, bob
hello, amy
hello, sam
The code above creates three processes. Each process executes the function func, but with a different value for the name argument.
Note that this should be run as a .py script rather than in an .ipynb notebook, where the child processes often fail to start or their output does not appear.
start
The start() function starts the execution of a process.
join
The join function works as follows: when the parent process calls join() on a child process, the parent blocks until that child has finished executing.
For example, suppose the main block of the code above is changed to:
if __name__ == '__main__':
    names = ['bob', 'amy', 'sam']
    for name in names:
        p = mp.Process(target=func, args=(name,))
        p.start()
    print('it\'s over')
The result is:
hello, bob
hello, amy
it's over
hello, sam
If you modify the code to:
if __name__ == '__main__':
    processes = []
    names = ['bob', 'amy', 'sam']
    for name in names:
        p = mp.Process(target=func, args=(name,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    print('it\'s over')
The result is:
hello, bob
hello, amy
hello, sam
it's over
Now "it's over" is printed only after all of the child processes have finished executing.
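As a small extension of the join example, the sketch below checks the state of each child once join() has returned. It uses Python's standard multiprocessing module, whose Process API torch.multiprocessing mirrors. The helper names greet, child, and run_children are illustrative and do not come from the original code.

```python
import multiprocessing as mp


def greet(name):
    """Build the greeting that each child process prints."""
    return f"hello, {name}"


def child(name):
    print(greet(name))


def run_children(names):
    """Start one process per name, wait for all of them, return their exit codes."""
    processes = [mp.Process(target=child, args=(name,)) for name in names]
    for p in processes:
        p.start()  # the child begins executing child(name)
    for p in processes:
        p.join()   # the parent blocks here until this child has exited
    # Once join() has returned for every child, none of them is still alive.
    assert not any(p.is_alive() for p in processes)
    return [p.exitcode for p in processes]


if __name__ == "__main__":
    print(run_children(["bob", "amy", "sam"]))
    print("it's over")
```

An exit code of 0 means the child ended normally, so inspecting exitcode after join() is a simple way to notice a worker that crashed.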
How to create a child process
On Unix-like platforms there are three ways to start child processes: spawn, fork, and forkserver. Windows supports only spawn. The method can be changed with multiprocessing.set_start_method(xxx). For details, see the Python multiprocessing documentation listed in the references.
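To make the start methods concrete, here is a minimal sketch using the standard multiprocessing module. It lists the methods available on the current platform and starts one child with an explicitly chosen method. The helper names available_start_methods and hello are illustrative.

```python
import multiprocessing as mp


def available_start_methods():
    """Return the start methods supported on the current platform."""
    return mp.get_all_start_methods()


def hello(name):
    print("hello,", name)


if __name__ == "__main__":
    print(available_start_methods())
    # get_context() returns an object with the same API as the module,
    # bound to one start method, without changing the global default.
    ctx = mp.get_context("spawn")  # spawn is available on every platform
    p = ctx.Process(target=hello, args=("bob",))
    p.start()
    p.join()
```

Using get_context() instead of set_start_method() is a design choice for this sketch: set_start_method() changes the default for the whole program and should be called at most once, inside the main guard.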
torch.distributed.init_process_group
The function prototype is torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(seconds=1800), world_size=-1, rank=-1, store=None, group_name='', pg_options=None).
- backend: The communication backend. There are three types: nccl, gloo, and mpi. For GPU training, nccl is the better choice.
- init_method: Specifies how the processes find each other when the process group is initialized. This is discussed later.
- world_size: The total number of processes in use. For example, with 4 GPUs it can be set to 4.
- rank: The index of the current process, between 0 and world_size-1.
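As an illustration of the backend parameter, the sketch below queries which backends the installed PyTorch build supports and picks one. The selection rule (nccl when CUDA and NCCL are both available, otherwise gloo) and the helper name pick_backend are assumptions made for this example, not something the original code does.

```python
import torch
import torch.distributed as dist


def pick_backend():
    """Prefer nccl when CUDA GPUs and NCCL are both available, else fall back to gloo."""
    if not dist.is_available():
        raise RuntimeError("this PyTorch build has no torch.distributed support")
    if torch.cuda.is_available() and dist.is_nccl_available():
        return "nccl"
    return "gloo"


if __name__ == "__main__":
    print("nccl available:", dist.is_nccl_available())
    print("gloo available:", dist.is_gloo_available())
    print("mpi available: ", dist.is_mpi_available())
    print("chosen backend:", pick_backend())
```

The returned string can be passed directly as the backend argument of init_process_group.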
A slightly more complex example, adapted from Writing Distributed Applications with PyTorch, is shown below:
### Distributed application example:
#### https://pytorch.org/tutorials/intermediate/dist_tuto.html
import random
import torch
import torch.nn as nn
from torchvision import datasets, transforms
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.optim.lr_scheduler import StepLR
import torch.nn.functional as F
import torch.optim as optim
import math
import os

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

""" Dataset partitioning """
class Partition(object):
    def __init__(self, data, index):
        self.data = data
        self.index = index

    def __len__(self):
        return len(self.index)

    def __getitem__(self, index):
        data_idx = self.index[index]
        return self.data[data_idx]
    # Given the data and an index array, this class exposes only
    # the part of the data selected by the index array.

class DataPartitioner(object):
    def __init__(self, data, sizes=[0.7, 0.2, 0.1], seed=1234):
        self.data = data
        self.partitions = []
        rng = random.Random()
        rng.seed(seed)
        data_len = len(data)
        indexes = [x for x in range(0, data_len)]
        rng.shuffle(indexes)  # shuffle the data indices
        for frac in sizes:
            part_len = int(frac * data_len)
            self.partitions.append(indexes[0:part_len])
            indexes = indexes[part_len:]
        # The data is now split into parts according to sizes; the indices
        # of each part are stored as one list in self.partitions.

    # use() returns the sub-dataset for the given process
    def use(self, partition):
        return Partition(self.data, self.partitions[partition])

""" Partition the MNIST dataset """
# size is the number of GPUs
def partition_dataset(rank, size):
    dataset = datasets.MNIST('./data', train=True, download=False,
                             transform=transforms.Compose([
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.1307,), (0.3081,))
                             ]))
    # size = dist.get_world_size()  # number of processes, i.e. number of GPUs
    bsz = math.ceil(128 / float(size))
    # split 1.0 evenly across the GPUs (processes)
    partition_size = [1.0 / size for _ in range(size)]
    partition = DataPartitioner(dataset, partition_size)
    partition = partition.use(rank)
    # partition is the sub-dataset of the current process
    train_set = torch.utils.data.DataLoader(partition, batch_size=bsz, shuffle=True)
    return train_set, bsz
    # Returns the DataLoader for the current process's sub-dataset and its batch size.
    # Although training is distributed, the total batch size is still 128.

""" Gradient averaging. """
def average_gradients(model):
    size = float(dist.get_world_size())  # number of processes, i.e. number of GPUs
    for param in model.parameters():
        # param.grad.data is the gradient of each parameter
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)  # sum first
        param.grad.data /= size  # then divide, which gives the average

""" Distributed Synchronous SGD Example """
# rank is the index of the current process
def run(rank, size):
    torch.manual_seed(1234)
    train_set, bsz = partition_dataset(rank, size)
    # use the GPU
    device = torch.device("cuda:{}".format(rank))
    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
    num_batches = math.ceil(len(train_set.dataset) / float(bsz))
    # The number of batches for this process.
    # The dataset and the batch size are both split evenly, so the number
    # of batches is the same as in the single-process case.
    for epoch in range(10):
        epoch_loss = 0.0
        # num = 0
        for data, target in train_set:
            data, target = data.to(device), target.to(device)
            # num += 1
            # print('Rank', rank, 'is dealing no.', num)
            optimizer.zero_grad()
            # zero the gradients first
            output = model(data)
            loss = F.nll_loss(output, target)
            epoch_loss += loss.item()
            loss.backward()  # backpropagate to compute the gradients
            average_gradients(model)
            # this is the step that differs in distributed training!
            optimizer.step()  # update the parameters
        print('Rank', rank, ', epoch', epoch, ': ', epoch_loss/num_batches)

# rank is the index of this process (0 to size-1); size is the total number of processes sharing the work
def init_processes(rank, size, fn, backend='gloo'):
    """Initialize the distributed environment."""
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 4  # total number of GPUs; get_world_size cannot be used here because init_process_group has not been called yet
    processes = []
    mp.set_start_method("fork")
    for rank in range(size):
        p = mp.Process(target=init_processes, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
The result of the operation is as follows:
Rank 1 , epoch 0 : 0.5456569447859264
Rank 3 , epoch 0 : 0.5368310891425432
Rank 2 , epoch 0 : 0.519815386723735
Rank 0 , epoch 0 : 0.5490584967614237
Rank 3 , epoch 1 : 0.2695276010860957
Rank 2 , epoch 1 : 0.25867786008253024
Rank 0 , epoch 1 : 0.26852701605160606
Rank 1 , epoch 1 : 0.2737363768117959
As you can see, all 4 GPUs ran successfully.
Here we focus only on the init_processes function and the main block.
In the main block, we first set size, the number of GPUs; set it to however many GPUs you have. Then mp.Process() creates the processes, and each process first executes init_processes(). That function first sets two environment variables, os.environ['MASTER_ADDR'] = '127.0.0.1' and os.environ['MASTER_PORT'] = '29500', and then calls dist.init_process_group(). We can think of this call as pairing each process with a GPU and allowing the GPUs to communicate with each other. The function then executes run(rank, size), which performs data-parallel training using the two arguments passed in (the index of the GPU and the total number of GPUs). Once dist.init_process_group() has been called, we do not need to think about the setup any further.
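The same flow can be tried in miniature without GPUs. The sketch below is an illustration under stated assumptions, not the original training code: it uses the gloo backend on CPU, two processes started with torch.multiprocessing.spawn, and a one-element tensor standing in for a gradient. The names worker and find_free_port are hypothetical helpers; the sum-then-divide step follows the same pattern as average_gradients above.

```python
import os
import socket

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def find_free_port():
    """Ask the OS for an unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def worker(rank, world_size, port):
    # 1. Rendezvous: every process points at the same address and port.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # 2. Each process holds a different value, standing in for a local gradient.
    grad = torch.tensor([float(rank + 1)])

    # 3. Sum across processes, then divide by the number of processes.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= world_size
    print(f"rank {rank}: averaged value {grad.item()}")

    dist.destroy_process_group()
    return grad.item()


if __name__ == "__main__":
    world_size = 2
    port = find_free_port()
    # mp.spawn calls worker(i, world_size, port) for i in range(nprocs)
    # and, with join=True, waits for all of them to finish.
    mp.spawn(worker, args=(world_size, port), nprocs=world_size, join=True)
```

With two processes holding 1.0 and 2.0, both ranks end up with the same averaged value, 1.5, which is the property that keeps the model replicas in sync after each optimizer step.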
Three initialization (GPU communication) methods after creating a process:
This section covers how the processes, and with them the GPUs, are connected to one another once they have been created. Reference: Distributed communication package - torch.distributed
- Environment variable initialization: This is the method used in the code above. It sets environment variables through os.environ, for example os.environ['MASTER_ADDR']. The variables are:
  - MASTER_ADDR: the IP address of the rank 0 process. Among the GPUs, one always has to take the lead, and by default that is the rank 0 GPU, so we set the IP address of the machine it runs on. This matters for multi-machine training, which communicates over sockets, but I have not tried that. On a single machine, localhost does the trick.
  - MASTER_PORT: a free port on the machine that hosts the rank 0 process.
  - WORLD_SIZE: the total number of GPUs. It can be set here or in the call to torch.distributed.init_process_group. The example above passes it as an argument to that function.
  - RANK: the index of the current GPU (process). It can be set here or in the call to torch.distributed.init_process_group. The example above passes it as an argument to that function.
- TCP initialization: Similar to the previous method. We can modify the init_processes function as follows, and the code still works.
def init_processes(rank, size, fn, backend='nccl'):
    """Initialize the distributed environment."""
    dist.init_process_group(backend, init_method='tcp://localhost:29500', rank=rank, world_size=size)
    fn(rank, size)
- Shared file-system initialization: Uses a shared file for initialization. The file must be accessible to all machines in the group. Pass init_method='file://xxxx' to init_process_group; the file must not exist beforehand, but its parent folder must exist. After execution, the shared file is not deleted automatically, so we need to delete it manually. I tried this method without success; it may need some additional setup, so I will not go into it here.
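The three methods can be compared side by side in a single process. The sketch below is a hedged illustration, not a multi-GPU setup: it assumes the gloo backend on CPU, a group of exactly one process (rank 0, world size 1), and a POSIX-style temporary path for the file method. The names init_once and find_free_port are hypothetical helpers.

```python
import os
import shutil
import socket
import tempfile

import torch.distributed as dist


def find_free_port():
    """Ask the OS for an unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def init_once(method):
    """Initialize a one-process gloo group with the given method, then tear it down."""
    rank, world_size = 0, 1
    tmp_dir = None
    if method == "env":
        # Everything is read from environment variables.
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = str(find_free_port())
        os.environ["RANK"] = str(rank)
        os.environ["WORLD_SIZE"] = str(world_size)
        dist.init_process_group("gloo", init_method="env://")
    elif method == "tcp":
        # Address and port go in the URL; rank and world size are arguments.
        url = f"tcp://127.0.0.1:{find_free_port()}"
        dist.init_process_group("gloo", init_method=url,
                                rank=rank, world_size=world_size)
    elif method == "file":
        # The file must not exist yet, but its parent directory must.
        tmp_dir = tempfile.mkdtemp()
        path = os.path.join(tmp_dir, "shared_init_file")
        dist.init_process_group("gloo", init_method=f"file://{path}",
                                rank=rank, world_size=world_size)
    else:
        raise ValueError(f"unknown method: {method}")

    result = (dist.get_rank(), dist.get_world_size())
    dist.destroy_process_group()
    if tmp_dir is not None:
        # Remove the rendezvous file and its directory if they are still there.
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return result


if __name__ == "__main__":
    for method in ("env", "tcp", "file"):
        print(method, init_once(method))
```

For real multi-process use, every process would call init_process_group with its own rank and the shared world size, and the file path would have to live on storage that all participating machines can reach.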