Problems encountered when training an RNN on multiple GPUs on a single machine with PyTorch

These are some of the problems I ran into when training with DataParallel.

1. The model cannot find its custom attributes.

The problem shows up as an error of the form AttributeError: 'DataParallel' object has no attribute 'xxx'.

Cause: After net = torch.nn.DataParallel(net), the original net is wrapped inside a new DataParallel module and stored as that module's attribute net.module.

Solution: After the call net = torch.nn.DataParallel(net), every attribute access other than initialization and the forward call must use net.module in place of net.
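A minimal sketch of the error and the fix. The class Net and its attribute custom_name are hypothetical names used only for illustration; on a machine without GPUs, DataParallel simply calls the wrapped module, so the snippet should run either way.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Hypothetical model with one custom attribute."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.custom_name = "demo"  # hypothetical custom attribute

    def forward(self, x):
        return self.fc(x)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = nn.DataParallel(Net().to(device))

# The wrapper does not expose the wrapped model's custom attributes.
try:
    net.custom_name
    found_on_wrapper = True
except AttributeError:
    found_on_wrapper = False

# The original model lives in net.module, so the attribute is reachable there.
name = net.module.custom_name

# The forward call still goes through the wrapper itself.
out = net(torch.randn(3, 4, device=device))
```

The same applies to custom methods and to saving weights: net.module.state_dict() gives keys without the "module." prefix that the wrapper adds.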

2. The hidden state is not split across the GPUs.

This error often appears when training RNNs and their variants, because such networks are usually defined with two nn.Module classes: one for the RNN and one for the RNN cell.

The error takes the form of a tensor dimension mismatch during the RNN's forward pass. Specifically, the input data is split along the batch dimension across the GPUs in use, while the hidden state is not. Normally each GPU processes batchsize / (number of GPUs) samples. The mismatch arises because the RNNcell-style class stores the cell state and the hidden state as attributes of the class on self, so they keep the full batch size.

Cause: When using DataParallel, PyTorch splits the arguments of the forward function along the batch dimension and sends the pieces to different GPUs, but it does not split data stored as class attributes of the form self.xx. If you check the device of such an attribute, you will find that it stays on cuda:0, where it was initialized, and its device index never changes.
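The arithmetic behind the mismatch can be sketched on the CPU. This is only a simulation: torch.chunk along dimension 0 stands in for the split that DataParallel applies to forward arguments, and batch_size, hidden_size and n_gpus are hypothetical values.

```python
import torch

batch_size, hidden_size, n_gpus = 8, 16, 2  # hypothetical values

# Input that would be passed to forward(): it gets split along dim 0.
x = torch.randn(batch_size, 10)

# Hidden state stored as a class attribute: it is never split.
hidden_attr = torch.zeros(batch_size, hidden_size)

# Stand-in for the per-GPU split of the forward arguments.
chunks = torch.chunk(x, n_gpus, dim=0)
per_replica_batch = [c.size(0) for c in chunks]

print(per_replica_batch, hidden_attr.size(0))  # [4, 4] 8
```

Each replica sees 4 input rows but a hidden state with 8 rows, which is the dimension error described above.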

Solution: Every quantity that needs to be split must be passed into forward as an argument and handed back as a return value. For RNNs there are a few things to watch out for. For example, to work around this problem the hidden state is often initialized inside each forward pass. In that case the hidden state can no longer be initialized once at the start of every epoch, and it can no longer be handled together with the optimizer. Some changes to the details of the network are unavoidable.
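A minimal sketch of this pattern, assuming a cell-based RNN whose hidden state has the batch as its first dimension. The class CellRNN and all sizes are hypothetical; the built-in nn.RNN and nn.LSTM use a hidden state of shape (num_layers, batch, hidden_size), which would need its batch dimension moved to the front before being passed through DataParallel.

```python
import torch
import torch.nn as nn


class CellRNN(nn.Module):
    """Hypothetical RNN that receives and returns its hidden state."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)

    def forward(self, x, h):
        # x: (batch, seq_len, input_size), h: (batch, hidden_size)
        # Both are forward arguments, so both are split along dim 0.
        outputs = []
        for t in range(x.size(1)):
            h = self.cell(x[:, t], h)
            outputs.append(h)
        # Returned tensors are gathered along dim 0 again.
        return torch.stack(outputs, dim=1), h


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = nn.DataParallel(CellRNN(8, 16).to(device))

x = torch.randn(4, 5, 8, device=device)
h0 = torch.zeros(4, 16, device=device)  # created by the caller, not stored on self
out, h_n = net(x, h0)
```

Because the caller owns h0, the training loop decides when to reset it, for example at the start of every batch or every epoch.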


 

 

Origin blog.csdn.net/srplus/article/details/104382399