Error background: Because of the nature of my task, I use a multi-GPU training / single-GPU testing strategy. Since the test set is large and the test-time metrics are computationally expensive, evaluation takes a long time.
Error message:
File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 940, in __init__
self._reset(loader, first_iter=True)
File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 971, in _reset
self._try_put_index()
File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1205, in _try_put_index
index = self._next_index()
File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 508, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
for idx in self.sampler:
File "/home/anys/GRALF/AlignTransReID/TransReID/datasets/sampler_ddp.py", line 148, in __iter__
seed = shared_random_seed()
File "/home/anys/GRALF/AlignTransReID/TransReID/datasets/sampler_ddp.py", line 108, in shared_random_seed
all_ints = all_gather(ints)
File "/home/anys/GRALF/AlignTransReID/TransReID/datasets/sampler_ddp.py", line 77, in all_gather
group = _get_global_gloo_group()
File "/home/anys/GRALF/AlignTransReID/TransReID/datasets/sampler_ddp.py", line 18, in _get_global_gloo_group
return dist.new_group(backend="gloo")
File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2503, in new_group
pg = _new_process_group_helper(group_world_size,
File "/home/anys/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 588, in _new_process_group_helper
pg = ProcessGroupGloo(
RuntimeError: Socket Timeout
From the traceback, the timeout occurs while a new process group is being created during data loading: the gloo group created by the DDP sampler uses the default timeout (30 minutes for the gloo backend), and the long evaluation exceeds it, so the other ranks' collective calls time out. The fix is to pass a longer timeout when creating the group (remember to import datetime):
torch.distributed.new_group(backend="gloo", timeout=datetime.timedelta(days=1))
When this timeout error occurs, check every place where a process group is created and increase the timeout there as well.
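The fix above can be sketched end to end. This is a minimal single-process example (the same `timeout` argument applies unchanged to multi-process runs); the MASTER_ADDR/MASTER_PORT values are local placeholders for the rendezvous, not part of the original code:

```python
import datetime
import os

import torch
import torch.distributed as dist

# Local rendezvous placeholders for a single-process demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Raise the timeout on the default group as well; the gloo default is 30 minutes.
dist.init_process_group(
    backend="gloo",
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(days=1),
)

# Any extra group created later (e.g. inside a DDP sampler such as
# sampler_ddp.py above) must also receive the longer timeout,
# otherwise it silently falls back to the default.
group = dist.new_group(backend="gloo", timeout=datetime.timedelta(days=1))

# Collectives on this group now tolerate long gaps (e.g. a slow
# single-GPU evaluation on rank 0) between the ranks' calls.
t = torch.tensor([dist.get_rank()])
gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, t, group=group)
print(gathered)

dist.destroy_process_group()
```

With `world_size=1` this just gathers the tensor from rank 0; in a real multi-card run, each rank contributes its own tensor and the long timeout keeps the idle ranks from raising `Socket Timeout` while rank 0 finishes testing.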