PyTorch multi-card training

The previous blog post implemented LeNet-5 by hand in PyTorch, but only one of the two graphics cards on my machine was used during training, so I wanted to train the network on both cards at the same time. Of course, a network as shallow as LeNet, trained on a fairly small dataset, does not need two cards; this is purely an exercise in how to use them.

Existing Methods #

I searched online for multi-card training methods, and they boil down to three approaches:

  • nn.DataParallel
  • pytorch-encoding
  • DistributedDataParallel

The first is the multi-GPU method built into PyTorch, but as the name suggests it is not fully parallel: the data is split and computed in parallel on the two cards, while the model replica management, output gathering, and loss computation are all concentrated on one of the cards. This is also why the memory usage of the two cards is uneven with this method.
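To make this concrete, here is a tiny sketch (using a toy nn.Linear model purely for illustration, and assuming at least two visible GPUs) showing that the outputs of a DataParallel forward pass end up gathered on the first card in device_ids, which is also where the loss would then be computed:

import torch
import torch.nn as nn

# Toy model for illustration only
model = nn.DataParallel(nn.Linear(10, 2), device_ids=[0, 1]).to("cuda:0")

x = torch.randn(8, 10, device="cuda:0")   # the batch is split across the two cards
y = model(x)                              # each half runs on its own card
print(y.device)                           # cuda:0 -- outputs are gathered on the first card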

The second method is a third-party package, pytorch-encoding, developed by others. It solves the problem that the loss computation is not parallel, and it also contains many other useful utilities; interested readers can find it on GitHub.

The third method is the most complicated of the three. Each GPU computes gradients on the data allocated to it and then exchanges the results with the other GPUs. This is unlike DataParallel, where the data is gathered onto a single GPU so that the loss computation and parameter update happen there.
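For reference, a minimal single-machine sketch of DistributedDataParallel (assuming two GPUs and that the script is launched with torchrun, which sets LOCAL_RANK and related variables for each process) could look like this; it only shows the shape of the API, not a full training script:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 2).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])  # gradients are averaged across processes

    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)
    x = torch.randn(8, 10, device=f"cuda:{local_rank}")
    loss = ddp_model(x).sum()
    loss.backward()                                  # gradient all-reduce happens here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

It would be launched with something like torchrun --nproc_per_node=2 train.py.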

Here I first chose the first method, nn.DataParallel, for parallel computing.

Parallel computing related code #

First, you need to check whether the machine has multiple graphics cards:

import os
import torch

USE_MULTI_GPU = True

# Check whether the machine has more than one graphics card
if USE_MULTI_GPU and torch.cuda.device_count() > 1:
    MULTI_GPU = True
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    device_ids = [0, 1]
else:
    MULTI_GPU = False
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Here, os.environ["CUDA_VISIBLE_DEVICES"] = "0,1" selects which of the machine's GPUs are visible to the process; the visible cards are then numbered starting from 0.
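As a quick sanity check (a hypothetical snippet, not part of the original code), you can print the devices PyTorch actually sees after these variables are set:

# List the GPUs visible to this process; indices always start at 0
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))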

The next step is to build the model:

net = LeNet()
if MULTI_GPU:
    net = nn.DataParallel(net, device_ids=device_ids)
net.to(device)

The only difference from the single-card case is the extra nn.DataParallel wrapping step.

Next comes the definition of the optimizer and the scheduler:

optimizer = optim.Adam(net.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=100, gamma=0.1)
if MULTI_GPU:
    optimizer = nn.DataParallel(optimizer, device_ids=device_ids)
    scheduler = nn.DataParallel(scheduler, device_ids=device_ids)

Because the optimizer and scheduler are now wrapped, the way they are called later changes as well.

For example, the code that reads the current learning rate,

optimizer.state_dict()['param_groups'][0]['lr']

now becomes

optimizer.module.state_dict()['param_groups'][0]['lr']
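Putting it together, a training step with the wrapped objects might look like the sketch below (MULTI_GPU is assumed to be True, and train_loader and criterion are hypothetical names for the data loader and the loss function):

for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.module.zero_grad()      # the real optimizer sits behind .module
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.module.step()
scheduler.module.step()               # step the LR schedule, e.g. once per epoch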

The detailed code can be seen in my GitHub repository

Origin: blog.csdn.net/zenglongjian/article/details/129973636