PyTorch in Docker: declaring the Gloo and NCCL socket interfaces

1. Problem description

       When I ran multi-GPU training of a pedestrian ReID model inside a Docker container, the following error was reported:

       RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: my_username

       In other words, Gloo could not find a network address for my_username.

2. Solution

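      Before training the model, set the following environment variable to declare which network interface Gloo should use (here the loopback interface, lo). Note that the shell does not allow spaces around the = sign:

      export GLOO_SOCKET_IFNAME=lo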
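
      If exporting the variable in the shell is inconvenient, the same declaration can probably be made from inside the training script. The following is only a minimal sketch: the helper name set_gloo_interface is hypothetical, and it assumes the variable is set before the process group is initialized (for example before torch.distributed.init_process_group is called), since the setting is read when the backend starts up.

```python
import os


def set_gloo_interface(ifname="lo"):
    """Hypothetical helper: declare the network interface Gloo should use.

    Equivalent in effect to running `export GLOO_SOCKET_IFNAME=<ifname>`
    in the shell before launching the training script.
    """
    os.environ["GLOO_SOCKET_IFNAME"] = ifname
    return os.environ["GLOO_SOCKET_IFNAME"]


# Call this at the very top of the training script, before any
# distributed initialization takes place.
set_gloo_interface("lo")
```

      Child processes started afterwards inherit the variable, so it only needs to be set once in the launching process.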

3. Supplement

     If multi-GPU training hangs, the same approach applies to NCCL. Before starting distributed training, run the following in the terminal:

     export NCCL_SOCKET_IFNAME=lo
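
     The interface names visible inside a container can differ from those on the host, so it may be worth confirming that the chosen name exists before exporting it. The sketch below uses Python's standard socket.if_nameindex() for this; the helper names available_interfaces and interface_exists are hypothetical, and the check only covers the machine (or container) it runs on. Since lo is the loopback interface, this setting is presumably only suitable when all training processes run on the same machine.

```python
import socket


def available_interfaces():
    """Return the names of the network interfaces visible to this process."""
    # socket.if_nameindex() yields (index, name) pairs.
    return [name for _index, name in socket.if_nameindex()]


def interface_exists(ifname):
    """Hypothetical helper: True if `ifname` is a visible network interface."""
    return ifname in available_interfaces()


if __name__ == "__main__":
    print("Interfaces:", available_interfaces())
    print("lo present:", interface_exists("lo"))
```

     If lo is not in the printed list, one of the listed names can be used for NCCL_SOCKET_IFNAME (and GLOO_SOCKET_IFNAME) instead.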

      

Origin blog.csdn.net/Guo_Python/article/details/112358458