PyTorch multi-GPU training

Code repository: mnist

Contents

Ordinary single-machine single-GPU training

DDP distributed training

Horovod way

Ordinary single-machine single-GPU training

Taking MNIST as an example, the training script mainly consists of data loading, model construction, the optimizer, and the training loop.

import argparse
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from datetime import datetime
from tqdm import tqdm

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

def train(gpu, args):
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=True)

    start = datetime.now()
    for epoch in range(args.epochs):
        if gpu == 0:
            print("Epoch: {}/{}".format(epoch+1, args.epochs))
        pbar = tqdm(train_loader)
        for images, labels in pbar:
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if gpu == 0:
                msg = 'Loss: {:.4f}'.format(loss.item())
                pbar.set_description(msg)
    
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=2, type=int, help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int, help='ranking within the nodes')
    parser.add_argument('--epochs', default=10, type=int, metavar='N', help='number of total epochs to run')
    args = parser.parse_args()
    train(0, args)

if __name__ == '__main__':
    main()

Training 2 epochs on a single 2080 Ti takes 1 minute and 12 seconds.

DDP distributed training

Required changes

0. Import the necessary libraries

import os
import torch.multiprocessing as mp
import torch.distributed as dist

1. Modify the main function: when more than one process is needed, mp.spawn launches args.gpus processes and passes each process's index (the gpu argument) as the first argument to train

def main():
    parser = argparse.ArgumentParser()
    ...
    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes 
    if args.world_size > 1:
        os.environ['MASTER_ADDR'] = '127.0.0.1'                 # IP address of the master node (here: localhost)
        os.environ['MASTER_PORT'] = '8889'                      # a free port on the master node
        mp.spawn(train, nprocs=args.gpus, args=(args,))         # launch one process per GPU on this node
    else:
        train(0, args)

2. Initialize the communication library

def train(gpu, args):
    if args.world_size > 1:
        rank = args.nr * args.gpus + gpu
        dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)

3. The data fed to each process needs to be partitioned and shuffled differently, so use a DistributedSampler and wrap the model with DistributedDataParallel

    if args.world_size > 1:
        model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=args.world_size, rank=rank)
        shuffle = False
    else:
        train_sampler = None
        shuffle = True
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=shuffle, num_workers=0, pin_memory=True, sampler=train_sampler)
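
One common addition, not in the original code but recommended when shuffling is delegated to DistributedSampler, is to call set_epoch at the start of every epoch so that each epoch reshuffles the per-process partition:

    for epoch in range(args.epochs):
        if train_sampler is not None:
            # re-seed the sampler so the data partition is reshuffled every epoch
            train_sampler.set_epoch(epoch)
        ...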

Here we first compute the global rank of the current process: rank = args.nr * args.gpus + gpu, and then call dist.init_process_group to initialize the distributed environment. The backend parameter selects the communication backend, which can be mpi, gloo, or nccl. We choose nccl here: it is NVIDIA's multi-GPU communication framework and the most efficient option for GPU training. mpi is a common communication protocol in high-performance computing, but it requires installing an MPI implementation such as OpenMPI yourself. gloo is a built-in backend, but it is not as efficient. init_method specifies how the processes synchronize at startup; the env:// value used here means environment-variable initialization, which requires four environment variables: MASTER_PORT, MASTER_ADDR, WORLD_SIZE, and RANK. The first two are set in main, and the last two can instead be supplied through the world_size and rank arguments of dist.init_process_group. Other initialization methods include the shared file system ( https://pytorch.org/docs/stable/distributed.html#shared-file-system-initialization ) and TCP ( https://pytorch.org/docs/stable/distributed.html#tcp-initialization ); for example, TCP initialization uses init_method='tcp://10.1.1.20:23456', which likewise requires the master's IP address and port. Note that this call blocks until all processes have joined, and it fails if any process fails.
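
For reference, here is a minimal sketch of the TCP initialization variant; the IP 10.1.1.20 and port 23456 are just the example values mentioned above and would be replaced by the master node's real address:

    # TCP initialization: every process connects directly to the master's IP and port,
    # so the MASTER_ADDR / MASTER_PORT environment variables are not needed
    dist.init_process_group(backend='nccl',
                            init_method='tcp://10.1.1.20:23456',
                            world_size=args.world_size,
                            rank=rank)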

Complete code; search for "add" to jump directly to the modified places.

import argparse
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from datetime import datetime
from tqdm import tqdm

# add 0
import os
import torch.multiprocessing as mp
import torch.distributed as dist

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

def train(gpu, args):
    # add 2
    if args.world_size > 1:
        rank = args.nr * args.gpus + gpu
        dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
    # add 3
    if args.world_size > 1:
        model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=args.world_size, rank=rank)
        shuffle = False
    else:
        train_sampler = None
        shuffle = True
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=shuffle, num_workers=0, pin_memory=True, sampler=train_sampler)

    start = datetime.now()
    for epoch in range(args.epochs):
        if gpu == 0:
            print("Epoch: {}/{}".format(epoch+1, args.epochs))
        pbar = tqdm(train_loader)
        for i, (images, labels) in enumerate(pbar):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if gpu == 0:
                msg = 'Loss: {:.4f}'.format(loss.item())
                pbar.set_description(msg)
    
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=2, type=int, help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int, help='ranking within the nodes')
    parser.add_argument('--epochs', default=10, type=int, metavar='N', help='number of total epochs to run')
    args = parser.parse_args()
    # add 1
    args.world_size = args.gpus * args.nodes 
    if args.world_size > 1:
        os.environ['MASTER_ADDR'] = '127.0.0.1'                 # IP address of the master node (here: localhost)
        os.environ['MASTER_PORT'] = '8889'                      # a free port on the master node
        mp.spawn(train, nprocs=args.gpus, args=(args,))         # launch one process per GPU on this node
    else:
        train(0, args)

if __name__ == '__main__':
    main()
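
Assuming the complete script above is saved as train_ddp.py (the file name is an assumption here), single-node two-GPU training can be launched directly, since mp.spawn creates the worker processes itself:

python train_ddp.py -n 1 -g 2 --epochs 2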

The training time drops to 48 seconds, a significant improvement.

Horovod way

Horovod is a framework focused on distributed deep learning training. It provides distributed training capabilities for TensorFlow, Keras, PyTorch and MXNet. Using Horovod brings two main benefits:

  • Ease of use: with only a few lines of changes, single-machine TensorFlow, Keras, PyTorch or MXNet code becomes training code that supports single-machine single-GPU, single-machine multi-GPU and multi-machine multi-GPU setups.
  • High performance: this mainly shows up as scalability. With typical training frameworks, per-worker performance gradually degrades as the number of workers (especially nodes) increases, largely because of the relatively slow network communication between nodes; Horovod reports about 90% scaling efficiency for both Inception V3 and ResNet-101 on 128 machines with 4 P100 GPUs each over a 25 Gbit/s RoCE network.

Installation is simple: pip install horovod

Writing distributed training code with Horovod generally involves 6 steps:

  • Call hvd.init() to initialize Horovod;
  • Assign a GPU to each worker: generally one worker process per GPU, mapped by local rank, e.g. torch.cuda.set_device(hvd.local_rank()) in PyTorch;
  • Since the effective batch size grows with world_size, scale the learning rate accordingly, typically by multiplying the original lr by world_size (see the sketch after this list);
  • Wrap the framework's optimizer with hvd.DistributedOptimizer;
  • Broadcast the initial parameters from rank 0 to all workers: hvd.broadcast_parameters(model.state_dict(), root_rank=0);
  • Save checkpoints only on worker 0 (also shown in the sketch below).
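
A minimal sketch of steps 3 and 6, neither of which appears in the complete code below; the base learning rate and the checkpoint file name are placeholders:

    # step 3: linear learning-rate scaling with the number of workers
    base_lr = 1e-4
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * hvd.size())

    # step 6: only worker 0 writes checkpoints, so workers do not overwrite each other's files
    if hvd.rank() == 0:
        torch.save(model.state_dict(), 'checkpoint.pth')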

Complete code; again, search for "add" to jump to the modified places.

import argparse
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from datetime import datetime
from tqdm import tqdm
# add 0
import horovod.torch as hvd

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

def train(gpu, args):
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
    # add 2
    if hvd.size() > 1:
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
        shuffle = False
        optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters(), op=hvd.Average)
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)
        hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    else:
        train_sampler = None
        shuffle = True
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=shuffle, num_workers=0, sampler=train_sampler, pin_memory=True)

    start = datetime.now()
    for epoch in range(args.epochs):
        if gpu == 0:
            print("Epoch: {}/{}".format(epoch+1, args.epochs))
        if gpu == 0:
            pbar = tqdm(train_loader)
        else:
            pbar = train_loader
        for images, labels in pbar:
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if gpu == 0:
                msg = 'Loss: {:.4f}'.format(loss.item())
                pbar.set_description(msg)
    
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=2, type=int, help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int, help='ranking within the nodes')
    parser.add_argument('--epochs', default=10, type=int, metavar='N', help='number of total epochs to run')
    args = parser.parse_args()
    # add 1
    hvd.init()
    train(hvd.local_rank(), args)

if __name__ == '__main__':
    main()

Launch commands:

# Horovod with 1 process (single GPU)
#horovodrun -np 1 -H localhost:1 python train_horovod.py
# Horovod with 2 processes (2 GPUs)
horovodrun -np 2 -H localhost:2 python train_horovod.py
# Horovod with 4 processes (4 GPUs)
#horovodrun -np 4 -H localhost:4 python train_horovod.py

With this setup, two GPUs take only 41 seconds.

Each individual GPU is no faster, but with more workers processing the data the overall throughput increases: four GPUs take only 20 seconds.


Related reading:

Distributed Training Tour
PyTorch Distributed Training Concise Course

PyTorch Distributed Training Basics--DDP Use

The main changes include:
1. The startup method introduces a multi-process mechanism;
2. Several environment variables are introduced;
3. The DataLoader gains a sampler parameter;
4. The network is wrapped with DistributedDataParallel(net);
5. The way checkpoints are saved changes (see the sketch below).
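
A minimal sketch of point 5, assuming a model wrapped in DistributedDataParallel (the file name is a placeholder): only rank 0 writes the file, and model.module.state_dict() is saved so the checkpoint keys do not carry the DDP wrapper's "module." prefix.

    if rank == 0:
        # model.module is the original network inside the DistributedDataParallel wrapper
        torch.save(model.module.state_dict(), 'ddp_checkpoint.pth')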

Pytorch - distributed communication primitives (with source code) 

Pytorch - Multi-machine multi-card minimalist implementation (with source code)

Pytorch - DDP implementation analysis

Pytorch - Distributed training with Horovod

Pytorch-Horovod Distributed Training Source Code Analysis 
