Pitfalls of Inplace ABNSync and PyTorch multi-GPU parallelism

A few words up front

As we all know, torch.nn.DataParallel(module, device_ids=None) is the function PyTorch provides for multi-GPU parallelism: wrap a module with it and that module runs in parallel across several cards. Inplace ABN (In-Place Activated BatchNorm) is a drop-in replacement for BatchNorm that performs better than PyTorch's built-in BatchNorm and also saves GPU memory. Since I had never used PyTorch's parallel mode before, I ran into many pitfalls when using this repo.
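As a minimal sketch of the usual pattern (the toy model here is a placeholder, and it assumes at least two GPUs are visible):

import torch
import torch.nn as nn

# A toy model standing in for whatever network is actually being trained.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU())

# Wrap the module so each forward pass is split across the listed GPUs;
# device_ids=None would mean "use all visible GPUs".
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

x = torch.randn(32, 10).cuda()
out = model(x)  # the batch is scattered across the GPUs and the outputs gathered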

1. .cuda()

The .cuda() method moves a tensor or module so that it can run on the GPU. It actually takes a parameter specifying which GPU to use; if no argument is given, GPU 0 is used by default, i.e. the object is placed on GPU 0. This can cause a problem: with multiple cards in parallel there is one main device and several worker devices, and the main device is responsible for the gather/scatter operations and for synchronization. If you parallelize across GPUs 1, 2, 3 but call .cuda() with no argument, GPU 0 becomes the main device by default even though it takes no part in the computation, which causes problems. The fix is to call .cuda(device_ids[0]).
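Concretely, with device_ids = [1, 2, 3], move both the model and the inputs to device_ids[0] so that the main device is one of the GPUs doing the work (a sketch assuming four GPUs are installed):

import torch
import torch.nn as nn

device_ids = [1, 2, 3]  # the GPUs actually taking part in the computation

model = nn.Sequential(nn.Linear(10, 10))
# Place the master copy on device_ids[0] instead of the default GPU 0,
# so gather/scatter happens on a GPU that participates in the computation.
model = nn.DataParallel(model, device_ids=device_ids).cuda(device_ids[0])

x = torch.randn(32, 10).cuda(device_ids[0])
out = model(x)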

2. Problems with ninja

Inplace ABN requires ninja, a build tool along the lines of make (it drives the compiler rather than being a compiler like gcc). It can be installed through many channels, commonly conda install ninja. Even after installation, Inplace ABN may report that the ninja command cannot be found when it tries to call it; in that case, move (or symlink) the ninja executable into /usr/sbin.
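Since PyTorch's extension builder invokes ninja as a subprocess, a quick diagnostic (just a sketch, not part of Inplace ABN itself) is to check from Python whether ninja is actually resolvable:

import shutil
import subprocess

# Where, if anywhere, does the current PATH resolve ninja?
print(shutil.which("ninja"))

# Roughly the call PyTorch's extension loader makes; a FileNotFoundError
# here reproduces the "ninja command cannot be found" symptom.
subprocess.check_call(["ninja", "--version"])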

3. libcudart.so.9.1 cannot be found

Add the directory that contains libcudart.so.9.1 to LD_LIBRARY_PATH in ~/.bashrc, for example

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
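After adding the line, reload the shell configuration (e.g. source ~/.bashrc) and verify from Python that the CUDA runtime is now found:

import torch  # this import itself fails if libcudart still cannot be loaded

print(torch.cuda.is_available())   # True once the CUDA runtime is found
print(torch.cuda.device_count())   # number of visible GPUs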

4. Setting os.environ["CUDA_VISIBLE_DEVICES"] has no effect

Unsolved; it seems that in PyTorch this alone does not select which GPU IDs are used.
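One general CUDA fact worth checking (an assumption about the cause, not something verified in the original post): the variable only takes effect if it is set before CUDA is initialized, so it has to be assigned before the first CUDA call, and in practice is safest before import torch:

import os

# Must be set before CUDA is initialized, i.e. before the first .cuda() call;
# assigning it after torch has already touched the GPUs has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

import torch
print(torch.cuda.device_count())  # 3: only the listed GPUs are visible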

5. Compilation problems when using Inplace ABNSync

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
ImportError: /tmp/torch_extensions/inplace_abn/inplace_abn.so: undefined symbol: _ZN2at5ErrorC1ENS_14SourceLocationESs
This might be a PyTorch version issue; see issue32 and issue71 of the repo on GitHub.
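A hedged guess consistent with those issues: an undefined-symbol error like the one above usually means the extension cached under /tmp/torch_extensions was built against a different PyTorch version or compiler ABI, so it is worth checking the version and clearing the cache before rebuilding:

import shutil
import torch

print(torch.__version__)    # compare against the versions discussed in the issues
print(torch.version.cuda)   # the CUDA version this PyTorch build expects

# Remove the stale cached build so the extension is recompiled from scratch.
shutil.rmtree("/tmp/torch_extensions/inplace_abn", ignore_errors=True)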

6. Inplace ABNSync hangs during synchronization

Inplace ABNSync uses GPU 0 for synchronization by default, but the class also takes a parameter for specifying the GPU IDs; setting those IDs resolves the hang, as in the sketch below.
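A minimal sketch of what this looks like; the import path and the devices parameter name are taken from older releases of the inplace_abn repo and may differ in other versions, so treat both as assumptions:

from modules import InPlaceABNSync  # import path as in the repo's examples (assumption)

device_ids = [1, 2, 3]

# Tell the synchronized BN which GPUs take part instead of letting it default
# to GPU 0; "devices" is the parameter name in older releases (assumption).
bn = InPlaceABNSync(256, devices=device_ids)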

7. Summary

Make good use of GitHub's search function: many closed issues can be very enlightening, and to a certain extent it is easier to solve these problems by searching GitHub than Google.

Reprinted from: blog.csdn.net/pku_Coder/article/details/85111082