[PyTorch] Multi-GPU training

While training with the PyTorch framework, I found that our lab server has multiple GPUs, so the network parameters need to be placed on all of them during training.


The relevant API is torch.nn.DataParallel; the official documentation recommends using nn.DataParallel rather than multiprocessing. See the official docs for nn.DataParallel and the corresponding tutorial.

torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)

This class implements data parallelism at the module level. Note that the batch size should be greater than the number of GPUs used.

Parameters:

module: the network model to be trained on multiple GPUs

device_ids: GPU ids to use (defaults to all GPUs)

output_device: device on which outputs are gathered (defaults to device_ids[0])

dim: the dimension along which input tensors are scattered (default 0)
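To see why the batch size should exceed the number of GPUs, here is a rough pure-Python sketch of the near-even dim-0 split (mirroring torch.chunk, which the scatter step uses); scatter_sizes is an illustrative helper, not a torch API, and no GPUs are needed to run it:

```python
import math

def scatter_sizes(batch_size, num_devices):
    """Approximate per-device chunk sizes when a batch is split
    along dim 0 into near-even chunks, as torch.chunk does."""
    chunk = math.ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

print(scatter_sizes(10, 3))  # → [4, 4, 2]
print(scatter_sizes(2, 3))   # → [1, 1]  only two chunks, so one GPU sits idle
```

The second call shows the problem case: with batch size 2 on 3 GPUs, only two chunks are produced and one device receives no data.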

The usage shown in the documentation is:

net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
out = net(input)

In practice, it is better to add a few extra lines around this.

Key points:

My code and explanation are as follows:

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

This defines the device. Note that "cuda:0" means the starting device_id is 0; a bare "cuda" also defaults to device 0. The starting device can be changed as needed, e.g. "cuda:1".

model = Model()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=[0, 1, 2])
model.to(device)

Note that on a single GPU, model.to(device) alone is enough for training; with more than one GPU, wrap the model with nn.DataParallel first and then call .to(device).

The first entry in device_ids must match the "cuda:0" in the device defined above, otherwise an error is raised.

If device_ids is not given, e.g. model = nn.DataParallel(model), all GPUs are used by default. To use specific GPUs, pass device_ids, but keep it consistent with the device defined at the beginning.

With the code above, the network can be trained on multiple GPUs.

The following are other issues encountered during training:

1. If you want to call a submodule of the network directly, note the following:

On a single GPU, the following works:

model = Net()
out = model.fc(input)

But with DataParallel, the model is wrapped first, for example:

model = models.resnet34(True)
model.fc = nn.Linear(model.fc.in_features, 2)
model.cuda()
model = nn.DataParallel(model, device_ids=[0,1])

In fact, if you print the wrapped network, you will see that every layer name is prefixed with "module" — note that it is module, not model. Submodules must therefore be accessed through it, e.g. model.module.fc rather than model.fc.
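The reason is simply that DataParallel stores the original network as an attribute named module. A minimal pure-Python sketch of this wrapping (Net and Wrapper here are hypothetical stand-ins, not the real torch classes):

```python
class Net:
    """Stand-in for a model with an fc submodule."""
    def __init__(self):
        self.fc = lambda x: x * 2  # dummy layer

class Wrapper:
    """Mimics how nn.DataParallel keeps the wrapped model."""
    def __init__(self, module):
        self.module = module  # original network lives here

model = Wrapper(Net())
# model.fc would raise AttributeError; go through .module instead:
out = model.module.fc(3)
print(out)  # → 6
```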

2. When using matplotlib on a server, an error is raised because there is no graphical interface (no screenshot of the exact message). If you train on a remote server without a display, add the following before plotting:

import matplotlib.pyplot as plt
plt.switch_backend('agg')

Also, do not call plt.show() during training; save the figure with plt.savefig() and download it later.


Origin blog.csdn.net/u013066730/article/details/104773627