[CUDA error record] One possible cause of CUDA error: device-side assert triggered

Problem description and analysis

During a PyTorch run, the tensor-creation statement frame_volume = torch.zeros((bs, ch, self.ch_num, h, w)).to(x) triggered the following error:

RuntimeError: CUDA error: device-side assert triggered

accompanied by many repeated CUDA assertion messages like:

.../ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [xxx,0,0], thread: [xxx,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
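This assertion comes from the gather/scatter kernels checking every index against the size of the indexed dimension. A minimal sketch of the same failure on CPU, where PyTorch raises an ordinary Python RuntimeError instead of a device-side assert:

```python
import torch

src = torch.arange(6.0).reshape(2, 3)

# Valid gather along dim 1: out[i][j] = src[i][idx[i][j]]
good_idx = torch.tensor([[0, 2, 1], [1, 0, 2]])
print(torch.gather(src, 1, good_idx))  # tensor([[0., 2., 1.], [4., 3., 5.]])

# An index equal to src.size(1) violates
# `idx_dim >= 0 && idx_dim < index_size`
bad_idx = torch.tensor([[0, 3, 1], [1, 0, 2]])
try:
    torch.gather(src, 1, bad_idx)
except RuntimeError as err:
    print("out-of-bounds gather:", err)
```

On the GPU the same call trips the device-side assert shown above, and because kernels run asynchronously the error may only be reported by a later, unrelated operation.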

Searching online, I found that most reported causes of CUDA error: device-side assert triggered are out-of-bounds class indices in classification problems, but the tensor-creation statement above clearly involves no indexing. Further narrowing showed the error surfaces at the .to(x) call that moves the new tensor onto the GPU. Because CUDA kernels launch asynchronously, an assert raised by an earlier kernel is often only reported at a later synchronization point, so I suspected the real problem was not in this line but in an earlier operation.
The CUDA error message mentions ScatterGatherKernel, which suggested a problem with torch.gather somewhere in the program. Single-stepping in the debugger, I found that after the gather in the snippet below, the debugger could no longer display the contents of the variables used_volume and gather_index, indicating that this step is where things go wrong.

for out_idx in range(4, -1, -1):
    out_volume = get_volume(out_idx)
    if out_idx != 0:
        add_volume = self._bisample(out_volume, img_shape)
    used_volume = torch.cat([used_volume, add_volume], dim=1)
    used_volume = torch.gather(used_volume, 1, gather_index)

Before the used_volume = torch.gather(used_volume, 1, gather_index) operation, the variable contents display normally in the debugger; after it, they can no longer be viewed. (Debugger screenshots from the original post omitted.)
With that located, the cause in my program was obvious: on the final iteration (out_idx == 0), add_volume is not regenerated from out_volume (I simply forgot to write an else branch), so the tensor concatenated is the stale add_volume left over from the previous iteration. Because the dim-1 length of out_volume grows with each iteration, gather_index, built for the intended larger tensor, indexes out of bounds on the smaller one.
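A sketch of the repaired loop, using the same kind of hypothetical shapes and stand-ins for the original get_volume and self._bisample: adding the forgotten else branch keeps add_volume in sync with out_volume, so gather_index stays in bounds:

```python
import torch

def get_volume(out_idx):
    # Stand-in: the dim-1 length grows as out_idx counts down
    return torch.zeros(1, 5 - out_idx, 4, 4)

def bisample(vol):
    # Stand-in for self._bisample(out_volume, img_shape)
    return vol

used_volume = torch.zeros(1, 1, 4, 4)
for out_idx in range(4, -1, -1):
    out_volume = get_volume(out_idx)
    if out_idx != 0:
        add_volume = bisample(out_volume)
    else:
        add_volume = out_volume  # the forgotten branch: refresh add_volume
    used_volume = torch.cat([used_volume, add_volume], dim=1)
    gather_index = torch.full((1, 1, 4, 4), used_volume.size(1) - 1)
    used_volume = torch.gather(used_volume, 1, gather_index)

print(used_volume.shape)  # torch.Size([1, 1, 4, 4]) -- no assert
```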

Summary

For errors of the CUDA error: device-side assert triggered type, the problem often does not lie at the line that reports the error but at an earlier operation; the accompanying CUDA assertion message usually names the offending kernel. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the error is reported at the line that actually causes it. If single-step debugging is possible, step through the suspicious code; when a variable's contents can no longer be displayed after a certain step, the problem is likely in that step, and you can investigate its cause from there.

Origin blog.csdn.net/weixin_41625823/article/details/116203380