PyTorch DataParallel error resolution
Error display
Error name:
StopIteration: Caught StopIteration in replica 0 on device 0.
Package versions:
pytorch-pretrained-bert 0.6.2
torch 1.6.0
Problem cause
Training works fine on a single GPU, but the error above is raised as soon as multiple GPUs are used: the pre-trained BERT model cannot run under DataParallel. The cause appears to be the torch version. According to [2], torch 1.5 has this problem, and I hit it on torch 1.6 as well; according to [3], downgrading to torch 1.4 avoids it.
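The mechanism can be sketched without a GPU: from torch 1.5 onward, the replica modules that DataParallel creates no longer expose their parameters through `parameters()`, so the `next(self.parameters())` call in `BertModel.forward` hits an empty generator. A minimal stand-in (the `replica_params` function below is hypothetical, not a PyTorch API, and only imitates a replica's empty generator):

```python
# Hypothetical stand-in for a DataParallel replica on torch >= 1.5:
# its parameters() generator yields nothing.
def replica_params():
    return iter([])

# modeling.py line 727 calls next(self.parameters()).dtype; on a replica
# with no parameters, next() raises StopIteration, which DataParallel
# surfaces as "Caught StopIteration in replica 0 on device 0".
try:
    next(replica_params())
    raised = False
except StopIteration:
    raised = True

print(raised)  # → True
```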
Solution
A quick-and-dirty fix is as follows.
Note the line flagged in the traceback:
File "/miniconda/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py", line 727, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
Open the file at this path under your site-packages directory:
/miniconda/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py
Go to line 727 of modeling.py and change
next(self.parameters()).dtype
to
torch.float32