DDP: PyTorch multi-card (multi-GPU) distributed training

https://www.bilibili.com/video/BV1mv4y1s7DE/?spm_id_from=333.788.recommend_more_video.2&vd_source=569ef4f891360f2119ace98abae09f3f

Single machine with multiple cards in parallel

Multi-GPU training implementation [hands-on deep learning v2]

5 Distributed training [hands-on deep learning v2]

python -m torch.distributed.launch --nproc_per_node=2 main.py

This is the launch command for PyTorch distributed training. It uses PyTorch's built-in launcher to start multiple training processes and connect them for parallel training. The parameters are explained as follows:

  • python -m torch.distributed.launch: Use PyTorch's built-in distributed training tool to start the training process.
  • --nproc_per_node=2: The number of processes to launch on each node, typically one per GPU. In this example, 2 GPUs (processes) are used per node.
  • main.py: The Python script to run.

When you run distributed training on multiple machines, this command must be run on each node so that the training processes can be started and connected for parallel training. When launching, additional parameters specify information such as the master node's IP address and port and each node's rank, so that the processes can communicate with each other and share the training workload.
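For example, a two-node launch might look like the following sketch (the master address 192.168.1.1 and port 29500 are placeholders; --nnodes, --node_rank, --master_addr and --master_port are the standard torch.distributed.launch / torchrun flags):

# on node 0 (the master)
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 \
       --master_addr=192.168.1.1 --master_port=29500 main.py

# on node 1
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 \
       --master_addr=192.168.1.1 --master_port=29500 main.py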

What is DDP?

DDP (DistributedDataParallel) is a distributed training strategy provided by PyTorch that aims to accelerate model training and improve its efficiency and scalability. DDP uses data parallelism: the model is replicated and the data is distributed across multiple GPUs or machines, and gradient synchronization keeps the model parameters consistent across all replicas.

In DDP, each process has its own copy of the model and its own subset of the data, and computes the loss and gradients in every iteration. The gradients of all processes are then summed and synchronized with an all-reduce operation, and the averaged gradient is used to update the model parameters. In this way, DDP can significantly improve training speed and allows training to scale to multiple GPUs or machines.
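The following is only a conceptual sketch of that synchronization step (average_gradients is an illustrative helper, not a PyTorch API, and it assumes the process group has already been initialized); the real DDP implementation buckets the gradients and overlaps the all-reduce with the backward pass, but the result is the same averaged gradient in every process:

import torch.distributed as dist

def average_gradients(model):
    # Sum each parameter's gradient across all processes, then divide by the
    # number of processes to obtain the same averaged gradient everywhere.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size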

DDP also provides a variety of optimization strategies, such as synchronous BN (Batch Normalization) and random seed fixation, to solve common problems in a distributed environment, such as BN out-of-sync and inconsistent randomness.

In addition to providing data parallel methods to accelerate model training and improve scalability, DDP also provides the following functions and optimization strategies:

  1. synchronous BN

When using BN (Batch Normalization) under data parallelism, each GPU normally computes the mean and standard deviation only over its own local mini-batch, so the BN statistics can drift out of sync across GPUs and hurt model performance. To solve this, synchronized BN keeps the BN statistics consistent across GPUs: replace the standard BatchNorm layers with torch.nn.SyncBatchNorm (for example via torch.nn.SyncBatchNorm.convert_sync_batchnorm) before wrapping the model with DDP (see the sketch after this list).

  2. Fixed random seed

In distributed training, each process has its own random number generator, so the randomness can become inconsistent across processes and affect reproducibility and convergence. To solve this, fix the random seed so that every process uses the same seed, for example with torch.manual_seed (see the sketch after this list).

  3. Adaptive optimizers

Adaptive optimizers that adjust the learning rate and momentum per parameter based on gradient statistics (such as torch.optim.Adam or torch.optim.AdamW) work directly with DDP: each process runs the same optimizer on its local replica, and because the gradients are averaged before every optimizer step, the optimizer state stays consistent across replicas.
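A minimal sketch of the first two points, using the real PyTorch APIs torch.nn.SyncBatchNorm.convert_sync_batchnorm and torch.manual_seed (the toy model below is only for illustration):

import torch
import torch.nn as nn

# Synchronous BN: convert every BatchNorm layer in an existing model to
# SyncBatchNorm before wrapping the model with DDP.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Fixed random seed: give every process the same seed so that weight
# initialization and other random operations stay consistent.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)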

In conclusion, DDP is an excellent strategy for distributed training, which can significantly improve the speed and efficiency of model training, and also provides a variety of functions and optimization strategies to solve common problems that arise in distributed environments.

Principle of DDP

DDP (Distributed Data Parallel) is a distributed training strategy in PyTorch, which can distribute and synchronize data among multiple GPUs or multiple machines, thereby accelerating the training process.

The basic principles of DDP are as follows:

  1. Replicate the model to each GPU or machine and partition the training data among them.

  2. Each GPU or machine does forward and backpropagation using local data and computes gradients.

  3. The gradients are aggregated into a global gradient using an All-Reduce operation, and the model parameters are updated using the global gradient.

  4. Repeat steps 2-3 until the training is over.

The advantage of DDP is that it can compute and summarize gradients in parallel across multiple GPUs or machines, thereby speeding up the training process. In addition, DDP can also automatically handle details such as data partitioning and gradient synchronization, making distributed training more convenient and easy to use.

In PyTorch, DDP is implemented by the torch.nn.parallel.DistributedDataParallel class. When using DDP, each process must initialize the distributed environment, specifying parameters such as the process's unique rank and the communication backend, and the model and data must be distributed to each process. The model is then wrapped with the DistributedDataParallel class and multiple processes are started with torch.distributed.launch (or torchrun). During training, DDP automatically handles details such as gradient synchronization and parameter updates, and the results can be saved to a specified directory.
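Putting these pieces together, a minimal illustrative skeleton might look like the following (it assumes the script is launched with torchrun or torch.distributed.launch --use_env so that the LOCAL_RANK environment variable is set; the tiny linear model and random batch are placeholders):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun / launch --use_env
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)     # toy model for illustration
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(32, 10).cuda(local_rank)       # placeholder batch
    loss = model(x).sum()
    loss.backward()                                # DDP all-reduces the gradients here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()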

Can multiple GPUs on one machine use DDP?

Yes, DDP can run on multiple GPUs in one machine. In this case, each GPU will be assigned to a different process, and use local data for forward and backward propagation, and then use the All-Reduce operation to summarize and synchronize the gradients. In this way, the computing resources of multiple GPUs can be utilized and the training process of the model can be accelerated.

When using DDP in PyTorch, each process must use its own GPU; the set of visible GPUs can be controlled with the CUDA_VISIBLE_DEVICES environment variable. For example, if 4 GPUs are available, DDP can be started with 4 processes using the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3  python -m torch.distributed.launch --nproc_per_node=4 your_training_script.py

This enables distributed training using DDP on 4 GPUs on one machine. During the training process, each process will use the designated GPU for calculation, and automatically handle details such as gradient synchronization and update.

Learn DDP from the perspective of use

Handwriting AI

# 1. Imports needed for DDP
# model side
from torch.nn.parallel import DistributedDataParallel as DDP
# data side
from torch.utils.data.distributed import DistributedSampler
# DDP's own machinery (process group, collectives)
import torch.distributed as dist

# 2. Communication backend and the GPU index (rank) of this process
if DDP_ON:
    init_process_group(backend="nccl")
    LOCAL_RANK = device_id = int(os.environ["LOCAL_RANK"])
    WORLD_SIZE = torch.cuda.device_count()

    device = torch.device('cuda', device_id)  # device_id is an int, device is a torch.device object
    print(f"Start running basic DDP on rank {LOCAL_RANK}.")
    logging.info(f'Using device {device_id}')

# 3. DDP model
net = DDP(net, device_ids = [device_id], output_device=device_id)


# 4. Feed data to multiple GPUs
loader_args = dict(batch_size=batch_size, num_workers=WORLD_SIZE*4, pin_memory=True) # batch_size is per process (per GPU)
if DDP_ON:
    train_sampler = DistributedSampler(train_set)
    train_loader = DataLoader(train_set, sampler=train_sampler, **loader_args)
else:
    train_loader = DataLoader(train_set, shuffle=True, **loader_args)
    
# no need for distributed sampler for val
val_loader = DataLoader(val_set, shuffle=False, drop_last=True, **loader_args)


# 5. set_epoch: reshuffle each epoch so the sampler does not repeat the same data order
# ref: https://blog.csdn.net/weixin_41978699/article/details/121742647
for epoch in range(start, start+epochs):
    if LOCAL_RANK == 0:
        print('lr: ', optimizer.param_groups[0]['lr']) 

    net.train()
    epoch_loss = 0

    # To avoid duplicated data sent to multi-gpu
    train_loader.sampler.set_epoch(epoch)
torchrun --nproc_per_node=4 \
          multigpu_torchrun.py \
          --batch_size 4 \
          --lr 1e-3
python -m torch.distributed.launch \
      --nproc_per_node=4 \
      train.py \
      --batch_size 4
import argparse
import logging
import sys
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
import wandb
from torch import optim
from torch.utils.data import DataLoader, random_split
from tqdm import tqdm

from utils.data_loading import BasicDataset, CarvanaDataset
from utils.dice_score import dice_loss
from evaluate import evaluate
from unet import UNet
import os
import torch.distributed as dist

# for reproducibility
import random
import numpy as np
import torch.backends.cudnn as cudnn

# ABOUT DDP
# for model loading in ddp mode
from torch.nn.parallel import DistributedDataParallel as DDP
# for data loading in ddp mode
from torch.utils.data.distributed import DistributedSampler

import torch.multiprocessing as mp
from torch.distributed import init_process_group, destroy_process_group



def init_seeds(seed=0, cuda_deterministic=True):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Speed-reproducibility tradeoff https://pytorch.org/docs/stable/notes/randomness.html
    if cuda_deterministic:  # slower, more reproducible
        cudnn.deterministic = True
        cudnn.benchmark = False
    else:  # faster, less reproducible
        cudnn.deterministic = False
        cudnn.benchmark = True

def train_net(net,
              device,
              start: int = 0,
              epochs: int = 5,
              batch_size: int = 1,
              learning_rate: float = 1e-5,
              val_percent: float = 0.1,
              save_checkpoint: bool = True,
              img_scale: float = 0.5,
              amp: bool = False):
    

    if DDP_ON: # modify the net's attributes when using ddp
        net.n_channels = net.module.n_channels
        net.n_classes  = net.module.n_classes

    # 1. Create dataset
    try:
        dataset = CarvanaDataset(dir_img, dir_mask, img_scale)
    except (AssertionError, RuntimeError):
        dataset = BasicDataset(dir_img, dir_mask, img_scale)

    # 2. Split into train / validation partitions
    n_val = int(len(dataset) * val_percent)
    n_train = len(dataset) - n_val
    train_set, val_set = random_split(dataset, [n_train, n_val], generator=torch.Generator().manual_seed(0))

    # 3. Create data loaders
    loader_args = dict(batch_size=batch_size, num_workers=WORLD_SIZE*4, pin_memory=True) # batchsize is for a single process(GPU)

    if DDP_ON:
        train_sampler = DistributedSampler(train_set)
        train_loader = DataLoader(train_set, sampler=train_sampler, **loader_args)
    else:
        train_loader = DataLoader(train_set, shuffle=True, **loader_args)
    
    
    # no need for distributed sampler for val
    val_loader = DataLoader(val_set, shuffle=False, drop_last=True, **loader_args)
    
    # (Initialize logging)
    if LOCAL_RANK == 0:
        experiment = wandb.init(project='U-Net-DDP', resume='allow', anonymous='must')
        experiment.config.update(dict(epochs=epochs, batch_size=batch_size, learning_rate=learning_rate,
                                  val_percent=val_percent, save_checkpoint=save_checkpoint, img_scale=img_scale,
                                  amp=amp))
            
        logging.info(f'''Starting training:
                Epochs:          {epochs}
                Start from:      {start}
                Batch size:      {batch_size}
                Learning rate:   {learning_rate}
                Training size:   {n_train}
                Validation size: {n_val}
                Checkpoints:     {save_checkpoint}
                Device:          {device.type}
                Images scaling:  {img_scale}
                Mixed Precision: {amp}
            ''')

    # 4. Set up the optimizer, the loss, the learning rate scheduler and the loss scaling for AMP
    criterion = nn.CrossEntropyLoss() 
    
    optimizer = optim.AdamW(net.parameters(), lr=learning_rate, weight_decay=1e-8)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-7)
    grad_scaler = torch.cuda.amp.GradScaler(enabled=amp)
    global_step = 0

    # 5. Begin training
    for epoch in range(start, start+epochs):
        if LOCAL_RANK == 0:
            print('lr: ', optimizer.param_groups[0]['lr']) 
        
        net.train()
        epoch_loss = 0

        # To avoid duplicated data sent to multi-gpu
        train_loader.sampler.set_epoch(epoch)

        disable = False if LOCAL_RANK == 0 else True

        with tqdm(total=n_train, desc=f'Epoch {epoch}/{epochs+start}', unit='img', disable=disable) as pbar:
            for batch in train_loader:
                images = batch['image']
                true_masks = batch['mask']
                    
                assert images.shape[1] == net.n_channels, \
                    f'Network has been defined with {net.n_channels} input channels, ' \
                    f'but loaded images have {images.shape[1]} channels. Please check that ' \
                    'the images are loaded correctly.'

                images = images.to(device=device, dtype=torch.float32)
                true_masks = true_masks.to(device=device, dtype=torch.long)

                with torch.cuda.amp.autocast(enabled=amp):
                    masks_pred = net(images)
                    loss = criterion(masks_pred, true_masks) \
                           + dice_loss(F.softmax(masks_pred, dim=1).float(),
                                       F.one_hot(true_masks, net.n_classes).permute(0, 3, 1, 2).float(),
                                       multiclass=True)

                optimizer.zero_grad(set_to_none=True)
                grad_scaler.scale(loss).backward()
                grad_scaler.step(optimizer)
                grad_scaler.update()

                pbar.update(images.shape[0])
                global_step += 1
                epoch_loss += loss.item()

                if LOCAL_RANK == 0:
                    experiment.log({
                        'train loss': loss.item(),
                        'step': global_step,
                        'epoch': epoch
                    })
                pbar.set_postfix(**{'loss (batch)': loss.item()})

                # Evaluation round
                division_step = (n_train // (5 * batch_size))
                if division_step > 0:
                    if global_step % division_step == 0:
                        histograms = {}
                        for tag, value in net.named_parameters():
                            tag = tag.replace('/', '.')
                            if not torch.isinf(value).any():
                                histograms['Weights/' + tag] = wandb.Histogram(value.data.cpu())
                            if not torch.isinf(value.grad).any():
                                histograms['Gradients/' + tag] = wandb.Histogram(value.grad.data.cpu())

                        val_score = evaluate(net, val_loader, device, disable_log = disable)

                        if LOCAL_RANK == 0:
                            logging.info('Validation Dice score: {}'.format(val_score))
                            experiment.log({
                                'learning rate': optimizer.param_groups[0]['lr'],
                                'validation Dice': val_score,
                                'images': wandb.Image(images[0].cpu()),
                                'masks': {
                                    'true': wandb.Image(true_masks[0].float().cpu()),
                                    'pred': wandb.Image(masks_pred.argmax(dim=1)[0].float().cpu()),
                                },
                                'step': global_step,
                                'epoch': epoch,
                                **histograms
                            })
        scheduler.step()
        if save_checkpoint and LOCAL_RANK == 0 and (epoch % args.save_every == 0):
            Path(dir_checkpoint).mkdir(parents=True, exist_ok=True)
            torch.save(net.module.state_dict(), str(dir_checkpoint / 'DDP_checkpoint_epoch{}.pth'.format(epoch)))
            
            logging.info(f'Checkpoint {epoch} saved!')


##################################### arguments ###########################################
parser = argparse.ArgumentParser(description='Train the UNet on images and target masks')
parser.add_argument('--epochs', '-e', metavar='E', type=int, default=5, help='Number of epochs')
parser.add_argument('--batch-size', '-b', dest='batch_size', metavar='B', type=int, default=1, help='Batch size')
parser.add_argument('--learning-rate', '-l', metavar='LR', type=float, default=1e-5,
                    help='Learning rate', dest='lr')
parser.add_argument('--load', '-f', type=str, default=False, help='Load model from a .pth file')
parser.add_argument('--scale', '-s', type=float, default=0.5, help='Downscaling factor of the images')
parser.add_argument('--validation', '-v', dest='val', type=float, default=10.0,
                    help='Percent of the data that is used as validation (0-100)')
parser.add_argument('--amp', action='store_true', default=False, help='Use mixed precision')
parser.add_argument('--bilinear', action='store_true', default=False, help='Use bilinear upsampling')
parser.add_argument('--classes', '-c', type=int, default=2, help='Number of classes')
parser.add_argument('--exp_name', type=str, default='hgb_exp')
parser.add_argument('--ddp_mode', action='store_true')
parser.add_argument('--save_every', type=int, default=5)
parser.add_argument('--start_from', type=int, default=0)




args = parser.parse_args()

dir_img = Path('./data/imgs/')
dir_mask = Path('./data/masks/')
dir_checkpoint = Path('./checkpoints/')

DDP_ON = True if args.ddp_mode else False

#########################################################################################

if DDP_ON:
    init_process_group(backend="nccl")
    LOCAL_RANK = device_id = int(os.environ["LOCAL_RANK"])
    WORLD_SIZE = torch.cuda.device_count()

    device = torch.device('cuda', device_id)  # device_id is an int, device is a torch.device object
    print(f"Start running basic DDP on rank {LOCAL_RANK}.")
    logging.info(f'Using device {device_id}')


if __name__ == '__main__':
    # Highly recommended references: PyTorch official DDP tutorials
    # 1. https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html#multi-gpu-training-with-ddp
    # 2. https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html
    
    init_seeds(0)
    # Change here to adapt to your data
    # n_channels=3 for RGB images
    # n_classes is the number of probabilities you want to get per pixel
    net = UNet(n_channels=3, n_classes=args.classes, bilinear=args.bilinear)
    
    if LOCAL_RANK == 0:
        print(f'Network:\n'
              f'\t{net.n_channels} input channels\n'
              f'\t{net.n_classes} output channels (classes)\n'
              f'\t{"Bilinear" if net.bilinear else "Transposed conv"} upscaling')

    if args.load:
        # ref: https://blog.csdn.net/hustwayne/article/details/120324639  use method 2 with module
        # net.load_state_dict(torch.load(args.load, map_location=device))
        net.load_state_dict({k.replace('module.', ''): v for k, v in
                             torch.load(args.load, map_location=device).items()})

        logging.info(f'Model loaded from {args.load}')


    torch.cuda.set_device(LOCAL_RANK)
    net.to(device=device)
    # wrap our model with ddp
    net = DDP(net, device_ids = [device_id], output_device=device_id)

    try:
        train_net(net=net,
                  start=args.start_from,
                  epochs=args.epochs,
                  batch_size=args.batch_size,
                  learning_rate=args.lr,
                  device=device,
                  img_scale=args.scale,
                  val_percent=args.val / 100,
                  amp=args.amp)
    except KeyboardInterrupt:
        torch.save(net.module.state_dict(), 'INTERRUPTED_DDP.pth')
        logging.info('Saved interrupt')
        raise
    destroy_process_group()

How to use DDP?

Using DDP for distributed training can speed up the training process of deep learning models and reduce training time. Here are the general steps to implement DDP using PyTorch:

  1. Initialize the distributed environment

Before DDP training, the distributed environment needs to be initialized with the torch.distributed.init_process_group function, specifying the distributed backend, the init method (host name and port), the process rank, the total number of processes, and so on. For example, the following code initializes a distributed environment of 4 processes (shown here for the process with rank 0):

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="tcp://localhost:23456", rank=0, world_size=4)
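Note that the rank must be different in every process. When the processes are started with torchrun or torch.distributed.launch, the rank, world size and master address are exported as environment variables, so a sketch like the following (assuming the env:// init method) avoids hard-coding them:

import torch.distributed as dist

# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set by torchrun /
# torch.distributed.launch, and init_method="env://" reads them automatically.
dist.init_process_group(backend="nccl", init_method="env://")
print(dist.get_rank(), dist.get_world_size())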
  2. Load the data and model

After initializing the distributed environment, the training data and model need to be loaded. The training data can be loaded with PyTorch's data loader (torch.utils.data.DataLoader), and the model is wrapped with the torch.nn.parallel.DistributedDataParallel class so that data and gradients are distributed and synchronized across multiple GPUs or machines. For example, the following code loads the training data and model and wraps the model with DDP:

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.parallel

# Load the training data
train_dataset = ...

# Define the model
model = ...

# Wrap the model with DDP
model = nn.parallel.DistributedDataParallel(model, device_ids=[torch.cuda.current_device()])
  3. Define the optimizer and loss function

After loading the data and model, the optimizer and loss function need to be defined. An optimizer can be created with PyTorch's optimizers (such as torch.optim.SGD) and a loss function with PyTorch's loss functions (such as torch.nn.CrossEntropyLoss). For example, the following code defines an SGD optimizer and a cross-entropy loss function:

import torch.optim as optim
import torch.nn as nn

# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Define the loss function
criterion = nn.CrossEntropyLoss()
  4. Train the model

After defining the optimizer and loss function, training can start. Use an ordinary training loop (a for loop) to iterate over the training dataset; in each iteration compute the loss and gradients and update the model parameters with the optimizer. For example, the following code shows a simple training loop:

for epoch in range(num_epochs):
    for inputs, targets in train_dataset:
        # Move the inputs and targets to the GPU
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Compute the model output
        outputs = model(inputs)

        # Compute the loss
        loss = criterion(outputs, targets)

        # Compute the gradients and update the model parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
  5. Clean up the distributed environment

After training is complete, the distributed environment needs to be cleaned up with torch.distributed.destroy_process_group, for example:

dist.destroy_process_group()

Note that when using DDP for distributed training, each process must use a different GPU to avoid duplicated computation and synchronization problems. The CUDA_VISIBLE_DEVICES environment variable can be used to control which GPU each process sees. For example, the following commands start 4 processes, each using a different GPU:

CUDA_VISIBLE_DEVICES=0 python train.py --rank 0 --world-size 4
CUDA_VISIBLE_DEVICES=1 python train.py --rank 1 --world-size 4
CUDA_VISIBLE_DEVICES=2 python train.py --rank 2 --world-size 4
CUDA_VISIBLE_DEVICES=3 python train.py --rank 3 --world-size 4

Here the --rank parameter specifies the index of the current process and the --world-size parameter specifies the total number of processes.
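For this launch style, train.py itself has to parse these arguments and forward them to init_process_group. A minimal sketch, reusing the tcp://localhost:23456 init method from the earlier example (for a real multi-node run, replace it with an address every node can reach):

import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--rank', type=int, required=True)        # index of this process
parser.add_argument('--world-size', type=int, required=True)  # total number of processes
args = parser.parse_args()

# Every process must point at the same master address and port.
dist.init_process_group(backend='nccl',
                        init_method='tcp://localhost:23456',
                        rank=args.rank,
                        world_size=args.world_size)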

mmseg-ddp

mmseg supports distributed training using DDP. The following are the general steps for DDP training using mmseg:

  1. Initialize the distributed environment

Before DDP training, the distributed environment needs to be initialized. Multiple processes can be started with torch.distributed.launch, with the --nproc_per_node parameter specifying the number of processes (GPUs) to use per node. For example, the following command starts 4 processes, each using 1 GPU:

python -m torch.distributed.launch --nproc_per_node=4 train.py --launcher pytorch

In the train.py script, the mmcv.runner.init_dist function needs to be called to initialize the distributed environment. For example:

import mmcv.runner

mmcv.runner.init_dist('pytorch')
  2. Load the data and model

After initializing the distributed environment, the training data and model need to be loaded. The training data can be loaded with mmseg's dataset builder (mmseg.datasets.build_dataset), and the segmentation model can be built with mmseg.models.build_segmentor. For example, the following code loads the training data and model:

from mmseg.datasets import build_dataset
from mmseg.models import build_segmentor

# Load the training data
train_dataset = build_dataset(cfg.data.train)

# Build the segmentation model
model = build_segmentor(cfg.model)
  3. Define the optimizer and learning rate scheduler

After loading the data and model, the optimizer and learning rate scheduler need to be defined. An optimizer can be built with mmcv's build_optimizer, and a learning rate scheduler with build_lr_scheduler. For example, the following code defines an SGD optimizer and a cosine annealing learning rate scheduler:

from mmcv.runner import build_optimizer, build_lr_scheduler

# Build the optimizer
optimizer = build_optimizer(model, cfg.optimizer)

# Build the learning rate scheduler
lr_scheduler = build_lr_scheduler(
    optimizer, cfg.lr_scheduler, total_iters_per_epoch=len(train_dataset))
  4. Build the DDP model

After defining the optimizer and learning rate scheduler, the model needs to be wrapped with a DistributedDataParallel class (mmcv provides mmcv.parallel.MMDistributedDataParallel) so that data and gradients are distributed and synchronized among multiple GPUs or machines. For example, the following code builds the DDP model:

from mmcv.parallel import MMDistributedDataParallel
from mmcv.runner import DistSamplerSeedHook, Runner

# Build the DDP model
model = MMDistributedDataParallel(
    model.cuda(),
    device_ids=[torch.cuda.current_device()],
    broadcast_buffers=False)

# Random seed for the distributed sampler
dist_sampler_seed = cfg.get('dist_sampler_seed', None)

# Build the Runner object
runner = Runner(
    model,
    batch_processor,
    optimizer,
    work_dir=cfg.work_dir,
    logger=logger,
    meta=cfg.get('meta', {}),
    max_iters=num_iterations,
    dist_sampler_seed=dist_sampler_seed)

# Register the distributed sampler seed hook
if dist_sampler_seed is not None:
    runner.register_hook(DistSamplerSeedHook(dist_sampler_seed))
  5. Train the model

After building the DDP model, mmseg's runner (such as mmcv.runner.IterBasedRunner) can be used for distributed training: iterate over the training dataset, compute the loss and gradients in each iteration, and update the model parameters with the optimizer. For example, the following code trains the segmentation model:

from mmcv.runner import IterBasedRunner

# Build the training runner
runner = IterBasedRunner(
    model,
    batch_processor,
    optimizer,
    work_dir=cfg.work_dir,
    logger=logger,
    meta=cfg.get('meta', {}),
    max_iters=num_iterations,
    iter_based=True)

# Start training
runner.run(train_loader, valid_loader=valid_loader, lr_scheduler=lr_scheduler)

References

pytorch official

  1. https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html#multi-gpu-training-with-ddp

  2. https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html

Using multi-GPU training can significantly improve the training speed and efficiency of deep learning models. In PyTorch, tools such as DataParallel or DDP can be used to implement multi-GPU training. The following is sample code for multi-GPU training using DataParallel:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

# Define the model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.functional.relu(x)
        x = self.conv2(x)
        x = nn.functional.relu(x)
        x = nn.functional.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = nn.functional.log_softmax(x, dim=1)
        return output

# Load the data
train_data = MNIST(root='data', train=True, transform=ToTensor(), download=True)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Define the model, loss function and optimizer
model = Net()
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

# Multi-GPU training
device_ids = [0, 1, 2, 3]  # GPU device ids to use
model = nn.DataParallel(model, device_ids=device_ids)  # wrap the model with DataParallel
model.to(device_ids[0])  # put the model on the first GPU; DataParallel scatters inputs from there

num_epochs = 10
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device_ids[0]), target.to(device_ids[0])
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print('Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

In the above code, we first define a convolutional neural network with two convolutional layers, two fully connected layers and dropout, and load the MNIST dataset. We then wrap the model with DataParallel, specifying the GPU device ids to use, and move the model to the first GPU. During training, each batch is moved to the first GPU; DataParallel scatters it across the GPUs, runs the forward pass on every replica, gathers the outputs, and the gradients are accumulated on the first GPU during backpropagation before the optimizer updates the parameters.

Both DataParallel and DDP are tools used in PyTorch to implement multi-card training, but they have some differences.

Differences:

  1. Different execution models

DataParallel is single-process, multi-thread data parallelism. The input batch is split into chunks, the model is replicated onto each GPU every iteration, each replica processes its chunk, and the outputs are gathered back on the default GPU. The gradients end up on the model copy on the first GPU, which is then updated.

DDP is multi-process data parallelism: one process per GPU (or per machine), each holding a complete replica of the model and working on its own shard of the data. After the backward pass the gradients are averaged across processes, so every replica applies the same update and the parameters stay identical throughout training.

  2. Different communication methods

DataParallel communicates inside a single process: inputs are scattered and outputs gathered between GPUs every iteration, and gradients flow back to the copy of the model on the default GPU, which easily becomes a memory and communication bottleneck (and the Python GIL limits the multi-threaded execution).

DDP uses collective communication (all-reduce, typically via NCCL) to synchronize gradients. The all-reduce is overlapped with the backward pass, and the processes implicitly synchronize at every step: each replica waits until the averaged gradients are available and then applies the same parameter update.

Similarities:

  1. Both can realize parallel calculation and parameter update of the model on multiple GPU devices.

  2. Both need to divide and synchronize the data and model during the training process.

  3. Both can significantly improve the speed and efficiency of model training.

Which tool to choose depends on different application scenarios and hardware conditions. If you have multiple GPU devices, and each device has enough memory to store the model and data, then consider using DataParallel. If your model is very large and needs to be trained on multiple nodes, then consider using DDP.

It is impossible to say in general which multi-GPU training method is better; the choice depends on the specific scenario and requirements.

The advantage of DataParallel is that it is simple to implement, easy to use, can be trained on multiple GPU devices in a single node, and is suitable for small or medium-sized models. However, the disadvantage of DataParallel is that it needs to copy the entire model to each GPU device, so for large models and datasets, it may cause insufficient memory and slow down the training speed.

The advantage of DDP is that it can train across multiple nodes and scales well to large models and datasets. Each process holds its own model replica and works on its own shard of the data, and gradient synchronization uses efficient all-reduce communication that overlaps with the backward pass, which avoids DataParallel's single-GPU bottleneck and reduces the impact of network latency and bandwidth limits. However, DDP is more complicated to set up and requires some understanding of the distributed environment and synchronization mechanism.

Therefore, which multi-GPU training method to choose depends on the application scenario. If you only have a single node with multiple GPUs and a relatively small model, DataParallel is the simplest option; if the model is large, training needs to span multiple nodes, or you want the best scaling, choose DDP.

Origin blog.csdn.net/qq_44089890/article/details/130380933