PyTorch multi-GPU training
The previous blog post implemented LeNet-5 by hand in PyTorch, but only one of the two graphics cards in my machine was used during training, so I wanted to train the network on both cards at once. Of course, a shallow network like LeNet on a small dataset has no real need for two cards; this is just an exercise in how to use them.
Existing Methods #
Searching online for multi-GPU training methods, I found they boil down to three approaches:
- nn.DataParallel
- pytorch-encoding
- DistributedDataParallel
The first is the multi-GPU method built into PyTorch, but as the name suggests it is not fully parallel: only the data is processed in parallel on the two cards, while the master copy of the model and the loss computation are concentrated on one of them. This is why the two cards show unbalanced memory usage with this method.
The second is a third-party package developed by others. It fixes the non-parallel loss computation and also bundles many other useful utilities; interested readers can find it on GitHub.
The third is the most involved of the three. With this method, each GPU computes gradients on the slice of data assigned to it and then exchanges the results with the other GPUs. This differs from DataParallel, which gathers the data onto a single GPU for the backward pass, the loss computation, and the parameter update.
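As a taste of the third approach, here is a minimal `DistributedDataParallel` sketch. To keep it runnable anywhere, it creates a one-process group on the CPU `gloo` backend purely to illustrate the API calls; real training launches one process per GPU (e.g. with `torch.multiprocessing.spawn` or `torchrun`) and uses the `nccl` backend:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# A single-process "group" on the CPU "gloo" backend, just to show the
# calls; real training sets rank/world_size to one process per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(10, 2))    # each rank holds a full model replica
out = model(torch.randn(4, 10))  # forward runs locally on this rank's data
out.sum().backward()             # gradients are all-reduced across ranks

dist.destroy_process_group()
```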
For now, I chose the first method for parallel computing.
Parallel computing related code #
First, check whether the machine has multiple graphics cards:
```python
import os
import torch

USE_MULTI_GPU = True

# Check whether the machine has more than one GPU
if USE_MULTI_GPU and torch.cuda.device_count() > 1:
    MULTI_GPU = True
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
    device_ids = [0, 1]
else:
    MULTI_GPU = False

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```
Here `os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"` selects which physical GPUs the process can see; PyTorch then renumbers them as logical devices 0 and 1.
The next step is to build the model:
```python
net = LeNet()
if MULTI_GPU:
    net = nn.DataParallel(net, device_ids=device_ids)
net.to(device)
```
The only difference from single-card training is the extra `nn.DataParallel` wrapping step.
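To see what the wrapper does, here is a small sketch, with `nn.Linear` standing in for the LeNet defined in the previous post: calling the wrapped net splits the batch along dimension 0 across the listed cards and gathers the outputs back on the first one (on a machine without GPUs, `DataParallel` simply falls through to the underlying module):

```python
import torch
import torch.nn as nn

net = nn.DataParallel(nn.Linear(10, 2))  # stand-in for the LeNet above
x = torch.randn(8, 10)  # a batch of 8 is split into per-card chunks
y = net(x)              # per-card outputs are gathered on the first device
print(y.shape)          # torch.Size([8, 2])
```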
Next come the definitions of the optimizer and scheduler:
```python
optimizer = optim.Adam(net.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=100, gamma=0.1)
if MULTI_GPU:
    optimizer = nn.DataParallel(optimizer, device_ids=device_ids)
    scheduler = nn.DataParallel(scheduler, device_ids=device_ids)
```
Because the optimizer and scheduler are now wrapped, the way they are accessed later also changes. For example, the code that reads the current learning rate:

```python
optimizer.state_dict()['param_groups'][0]['lr']
```

becomes

```python
optimizer.module.state_dict()['param_groups'][0]['lr']
```
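The same `.module` indirection applies to the model itself. A sketch (again with `nn.Linear` standing in for LeNet) of saving a `DataParallel`-wrapped network so the checkpoint also loads into an unwrapped, single-card model:

```python
import os
import tempfile
import torch
import torch.nn as nn

net = nn.DataParallel(nn.Linear(10, 2))  # stand-in for the wrapped LeNet

# Save the underlying network, not the wrapper, so the state_dict keys
# carry no "module." prefix and load cleanly into a plain model.
path = os.path.join(tempfile.gettempdir(), "lenet.pth")
torch.save(net.module.state_dict(), path)

plain = nn.Linear(10, 2)
plain.load_state_dict(torch.load(path))
```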
The full code is available in my GitHub repository.