1. Problem description
When I ran multi-GPU training for a pedestrian ReID model in a Docker environment, the following error was reported:
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:127] rp != nullptr. Unable to find address for: my_username
The message says that no network address could be found for my_username.
2. Solution
Before training the model, set an environment variable that tells Gloo which network interface to use:
export GLOO_SOCKET_IFNAME=lo
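A minimal sketch of this step, assuming a Linux container where network interfaces are listed under /sys/class/net. The existence check and the messages are illustrative additions, not part of the original fix. Since lo is the loopback interface, this setting is only suitable when all processes run on the same machine.

```shell
# Interface to bind Gloo to; "lo" is the loopback interface.
IFACE=lo

# Assumption: on Linux, each network interface has a directory under /sys/class/net.
if [ ! -d "/sys/class/net/$IFACE" ]; then
    echo "warning: interface $IFACE not found in /sys/class/net" >&2
fi

# No spaces around "=", otherwise the shell rejects the assignment.
export GLOO_SOCKET_IFNAME="$IFACE"
echo "GLOO_SOCKET_IFNAME=$GLOO_SOCKET_IFNAME"
```

The variable must be exported in the same shell session that later starts the training process, so that the process inherits it.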
3. Supplement
If multi-GPU training freezes (hangs), the following may solve it:
Before starting distributed training, run this in the terminal: export NCCL_SOCKET_IFNAME=lo
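Both variables can be set together at the top of a launch script. The sketch below assumes a single-machine run. The launch command in the final comment is a hypothetical placeholder: train.py and the process count are not from the original.

```shell
# Pin both communication backends to the loopback interface (single-machine training only).
export GLOO_SOCKET_IFNAME=lo
export NCCL_SOCKET_IFNAME=lo

# Show the values that child processes will inherit.
env | grep -E '^(GLOO|NCCL)_SOCKET_IFNAME='

# Hypothetical launch command; replace with your own training entry point:
# torchrun --nproc_per_node=2 train.py
```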