PyTorch error: StopIteration: Caught StopIteration in replica 0 on device 0.

Resolving a PyTorch DataParallel error

The error

Error name:

StopIteration: Caught StopIteration in replica 0 on device 0.

Package versions:

pytorch-pretrained-bert 0.6.2
torch                   1.6.0

(Screenshots of the full error traceback: error figures 1 and 2.)

Cause of the problem

Training runs normally on a single GPU, but the error appears as soon as multiple GPUs are used. The problem arises during multi-GPU model training: the pre-trained BERT model cannot run its forward pass under DataParallel. It appears to be a torch version issue. According to reference 2, torch 1.5 has this problem, and I hit the same thing on torch 1.6; according to reference 3, downgrading to torch 1.4 resolves it.
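The underlying issue is that on torch >= 1.5, the replicas created by nn.DataParallel may expose an empty parameters() iterator, so any call to next(self.parameters()) inside forward() raises StopIteration. A minimal sketch of the safe pattern (assumed names SafeDtypeModule etc. are illustrative, not from pytorch-pretrained-bert): derive the dtype from an input tensor instead of from the parameters iterator.

```python
import torch
import torch.nn as nn

class SafeDtypeModule(nn.Module):
    """Sketch: avoid calling next(self.parameters()) inside forward().

    Under nn.DataParallel on torch >= 1.5, a replica's parameters()
    iterator can be empty, so next() raises StopIteration. Reading the
    dtype from an input tensor sidesteps that.
    """
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x, attention_mask):
        # Extend the mask the same way BERT does, but take the dtype
        # from the input tensor, not next(self.parameters()).dtype.
        extended = attention_mask.unsqueeze(1).unsqueeze(2)
        extended = extended.to(dtype=x.dtype)
        extended = (1.0 - extended) * -10000.0
        return self.linear(x), extended

model = SafeDtypeModule()
out, ext = model(torch.randn(2, 4), torch.ones(2, 3))
print(ext.shape)  # torch.Size([2, 1, 1, 3])
```

This works on both single- and multi-GPU setups, because input tensors are always present in each replica's forward call.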

Solution

A relatively quick-and-dirty fix is as follows. First, note this line in the traceback:

  File "/miniconda/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py", line 727, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility

Open modeling.py under the site-packages directory, i.e. /miniconda/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py, go to line 727, and replace next(self.parameters()).dtype with torch.float32.
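The edit leaves the surrounding attention-mask logic unchanged; only the dtype lookup is hard-coded. A sketch of what the patched step computes (the mask-extension lines mirror the pytorch_pretrained_bert code around line 727; note that hard-coding float32 bypasses the original fp16-compatibility behavior):

```python
import torch

# Before (line 727, raises StopIteration under DataParallel on torch >= 1.5):
#   extended_attention_mask = extended_attention_mask.to(
#       dtype=next(self.parameters()).dtype)  # fp16 compatibility
#
# After: hard-code torch.float32.
attention_mask = torch.ones(2, 5)  # (batch, seq_len), 1 = attend
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
extended_attention_mask = extended_attention_mask.to(dtype=torch.float32)
# Masked positions become a large negative bias added to attention scores.
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
print(extended_attention_mask.dtype)  # torch.float32
```

If you train with fp16, hard-coding float32 here may cost the mixed-precision benefit on this tensor; downgrading torch (or upgrading to the maintained transformers library) avoids that trade-off.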


Origin blog.csdn.net/weixin_44152453/article/details/109290978