PyTorch multi-GPU training summary (using DataParallel)

Disclaimer: this is an original article by CSDN blogger "lllily", released under the CC 4.0 BY-SA copyright agreement; please attach the original source link and this statement when reposting.
Original link: https://blog.csdn.net/weixin_40087578/article/details/87186613
This post records the many pitfalls I stepped on while training with multiple GPUs in PyTorch. It covers only single-machine multi-GPU data parallelism, not multi-machine distributed training.

First, the official approach to wrapping the model


This is the official PyTorch schematic of how DataParallel works; the code should be modified according to this official diagram. For reference:

https://blog.csdn.net/qq_19598705/article/details/80396325

The post linked above also wraps the optimizer with DataParallel, following the second row of the official schematic: the gradients are scattered to each GPU, each model replica is updated with its own gradient (third step of the second row), and the updated parameters are then merged back into the model on the main GPU (last step of the second row).

This is completely unnecessary, because the model is broadcast to every GPU before each forward pass anyway. During backpropagation there is no need to scatter the loss to each GPU, compute the gradients separately, and merge the models back together. You can simply update the model on the main GPU from the gradient of the total loss and leave the models on the other GPUs unsynchronized, since the model will be broadcast again before the next forward pass.

So, unlike the linked post, there is no need to wrap the optimizer with DataParallel.
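A minimal sketch of this setup (the nn.Linear model and the learning rate are placeholders, not from the original post): wrap only the model with DataParallel and build an ordinary optimizer over its parameters.

import torch
import torch.nn as nn

model = nn.Linear(10, 2).cuda()            # any model, moved to the main GPU first
model = nn.DataParallel(model)             # wrap only the model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # a plain optimizer, not wrapped
# optimizer.step() updates the master parameters on the main GPU; DataParallel
# broadcasts them to the other GPUs again at the start of every forward pass,
# so the replicas never need to be synchronized by hand.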

With DataParallel, only the forward pass is computed in parallel.

Summary of the steps:

import os
import torch

args.gpu_id = "2,7"                # specify the GPU ids
args.cuda = not args.no_cuda and torch.cuda.is_available()   # decide whether to use the GPU or the CPU
# the environment can also be set temporarily at runtime: CUDA_VISIBLE_DEVICES='2,7' python train.py
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id   # must be assigned a string; a list raises an error
device_ids = range(torch.cuda.device_count())      # torch.cuda.device_count() == 2
# device_ids = [0, 1]; here 0 refers to GPU 2 and 1 refers to GPU 7,
# and the data and model are distributed from the main GPU

if args.cuda:
    model = model.cuda()           # copy the model to the GPU; defaults to cuda:0, i.e. the first GPU (2)
if len(device_ids) > 1:
    model = torch.nn.DataParallel(model)   # requires that the model has already been moved with .cuda()

# the data for the forward pass must also be .cuda(), i.e. copied to the main GPU
for batch_idx, (data, label) in pBar:
    if args.cuda:
        data, label = data.cuda(), label.cuda()
    data_v = Variable(data)
    target_var = Variable(label)
    prediction = model(data_v, target_var, args)
    # prediction here is the concatenation of the results from the two GPUs;
    # only the forward pass is computed in parallel
    # in the forward pass each GPU processes batch_size / len(device_ids) samples,
    # and after the forward pass the results are gathered on the main GPU
    # len(prediction) == batch_size

    criterion = nn.CrossEntropyLoss()
    loss = criterion(prediction, target_var)   # compute the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
After wrapping, functions inherited from nn.Module can still be called directly on the model, e.g. model.state_dict(), model.load_state_dict(torch.load(model_path)), etc.; these are unaffected. Functions you wrote yourself, however, must be called through .module, e.g. model.module.forward_getfeature(x). A function called this way does not run in parallel; it runs only on the main GPU, because DataParallel parallelizes only the forward pass. Alternatively, you can change your approach and fold the logic into forward (or have forward call it), passing in a few extra parameters and returning a few extra values, instead of having to return feature and predict from separate methods.
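A minimal sketch of the difference (the Net class and its forward_getfeature method are hypothetical, used only for illustration):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(512, 10)

    def forward(self, x):                      # only forward is run in parallel by DataParallel
        return self.fc(self.forward_getfeature(x))

    def forward_getfeature(self, x):           # a custom, non-inherited method
        return x * 2

model = nn.DataParallel(Net().cuda())
x = torch.randn(8, 512).cuda()

state = model.state_dict()                     # inherited nn.Module methods work on the wrapper directly
out = model(x)                                 # this call is split across the visible GPUs
feat = model.module.forward_getfeature(x)      # custom methods need .module and run on the main GPU only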

Second, solving the multi-GPU load-imbalance problem
In my experience the main GPU's memory would blow up while the other GPU used only about a fifth of its memory. This imbalance was intolerable and made it impossible to increase batch_size any further.

Reference: https://discuss.pytorch.org/t/dataparallel-imbalanced-memory-usage/22551/20 (all in English; it took me half a day to read...)

The cause of the imbalance is that loss = criterion(prediction, target_var) is computed on the main GPU and takes a lot of memory. The problem is solved by letting each GPU compute its own loss and return it, i.e. prediction, loss = model(data, target) (if prediction is not needed afterwards, don't return it at all and return only the loss). Each GPU returns one loss, and what arrives on the main GPU is a list; take loss.mean() or loss.sum(), with mean recommended.

Following this idea, anything that can be computed independently on each GPU can be moved into forward and returned from there. Note, however, how the results are merged: tensors are concatenated along the batch dimension (e.g. len(prediction) == batch_size), while scalars such as the loss are merged into a list, [loss1, loss2].

Effect: the main GPU still uses slightly more memory, a few hundred MB, but the load is now basically balanced.

Example:

# the forward function of the model
def forward(self, x, target_var, args):
    feature512 = self.forward_GetFeature(x)
    if target_var is None:
        return feature512
    classifyResult = self.classifier(feature512)
    # with DataParallel, the feature returned by forward is the merged result
    # from the multiple GPUs;
    # each GPU returns batch_size / n samples, where n is the number of GPUs

    # compute the loss
    center_loss = self.get_center_loss(feature512, target_var, args)
    criterion = nn.CrossEntropyLoss()
    cross_entropy_loss = criterion(classifyResult, target_var)
    # CrossEntropyLoss applies softmax itself, so the classification result
    # can be fed in directly

    loss = args.center_loss_weight * center_loss + cross_entropy_loss
    prec = accuracy(classifyResult.data, target_var, topk=(1,))

    # a scalar returned here arrives on the main GPU as a list,
    # one result per GPU;
    # call loss.mean() before loss.backward()
    return prec[0], loss
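A minimal sketch of the calling side, reusing the names from the training loop in the first section (with n GPUs, the model call returns n per-GPU losses that have to be reduced before backward):

prec, loss = model(data_v, target_var, args)
loss = loss.mean()          # average the per-GPU losses gathered on the main GPU
# prec is gathered the same way, one value per GPU
optimizer.zero_grad()
loss.backward()
optimizer.step()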
 