PyTorch distributed training error: RuntimeError: Socket Timeout

Error background: due to the nature of the task, my training process uses a multi-GPU training, single-GPU testing strategy. Because the dataset is large and the test-time metrics are expensive to compute, the test phase takes a long time.
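For context, this is roughly the shape of that strategy (a minimal sketch; train_one_epoch, run_test, and the loaders are hypothetical placeholders, not the original code): only rank 0 evaluates, while the other ranks wait at a barrier, so a slow test keeps them blocked in a pending collective.

import torch.distributed as dist

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)  # all ranks train with DDP
    if dist.get_rank() == 0:
        run_test(model, test_loader)      # only rank 0 runs the slow test
    dist.barrier()                        # other ranks block here until the test finishes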

Error message:

 File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 940, in __init__
    self._reset(loader, first_iter=True)
  File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 971, in _reset
    self._try_put_index()
  File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1205, in _try_put_index
    index = self._next_index()
  File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 508, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
    for idx in self.sampler:
  File "/home/anys/GRALF/AlignTransReID/TransReID/datasets/sampler_ddp.py", line 148, in __iter__
    seed = shared_random_seed()
  File "/home/anys/GRALF/AlignTransReID/TransReID/datasets/sampler_ddp.py", line 108, in shared_random_seed
    all_ints = all_gather(ints)
  File "/home/anys/GRALF/AlignTransReID/TransReID/datasets/sampler_ddp.py", line 77, in all_gather
    group = _get_global_gloo_group()
  File "/home/anys/GRALF/AlignTransReID/TransReID/datasets/sampler_ddp.py", line 18, in _get_global_gloo_group
    return dist.new_group(backend="gloo")
  File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2503, in new_group
    pg = _new_process_group_helper(group_world_size,
  File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: Socket Timeout

From the error message, we can see that the timeout happens when the DataLoader's sampler creates a new gloo process group (dist.new_group). Process groups use a default timeout of 30 minutes, so if the single-GPU test runs longer than that, the pending operation on the other ranks times out. The solution is to increase the process group's timeout:

import datetime
torch.distributed.new_group(backend="gloo", timeout=datetime.timedelta(days=1))

When a timeout error like this occurs, first check every place where a process group is created (both init_process_group and new_group) and increase the timeout there, as in the sketch below.
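A minimal sketch of that, assuming the common setup where the main process group is created with init_process_group at startup and the sampler later creates a secondary gloo group (the nccl backend and the day-long timeout here are assumptions, not values from the original post):

import datetime
import torch.distributed as dist

timeout = datetime.timedelta(days=1)  # well above the expected test duration

# Main process group, created once at startup; rank and world size are
# assumed to come from the environment (e.g. launched with torchrun).
dist.init_process_group(backend="nccl", timeout=timeout)

# Any additional groups, such as the gloo group created in sampler_ddp.py,
# should be given the same generous timeout.
group = dist.new_group(backend="gloo", timeout=timeout)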

Origin: blog.csdn.net/qq_41509251/article/details/130573702