Question
An error occurred while training on the GPU. The error output is as follows:
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [16,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [17,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [18,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [19,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [22,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [23,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [24,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [25,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [28,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [29,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:247: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "/home/work/ner/Msra/train.py", line 92, in <module>
train()
File "/home/work/ner/Msra/train.py", line 83, in train
train_data.map(start_train,
File "/home/anaconda3/envs/bitter/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2376, in map
return self._map_single(
File "/home/anaconda3/envs/bitter/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 551, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/anaconda3/envs/bitter/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/anaconda3/envs/bitter/lib/python3.9/site-packages/datasets/fingerprint.py", line 458, in wrapper
out = func(self, *args, **kwargs)
File "/home/anaconda3/envs/bitter/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2764, in _map_single
batch = apply_function_on_filtered_inputs(
File "/home/anaconda3/envs/bitter/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2644, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/home/anaconda3/envs/bitter/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2336, in decorated
result = f(decorated_item, *args, **kwargs)
File "/home/work/ner/Msra/train.py", line 67, in start_train
loss.backward()
File "/home/anaconda3/envs/bitter/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/anaconda3/envs/bitter/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Solution
1. Carefully checked the feature dimensions; they are all correct.
2. There is no shortage of GPU memory.
3. Reduced batch_size to a smaller value: no effect.
4. num_classes is correct.
5. The CUDA and PyTorch versions are compatible.
6. Added a Sigmoid at the end of the network: no effect.
7. No classification label appears to be out of bounds.
8. Deleted the hidden Jupyter Lab checkpoint files: no effect.
ls -a
rm .ipynb_checkpoints/ -r
9. Forced synchronous CUDA launches:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
Error reported:
RuntimeError: CUDA error: device-side assert triggered
10. Switched to running on the CPU:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('cpu')
Error reported:
IndexError: Target -1 is out of bounds.
11. Changed the loss function: no effect.
I tried every method I could find online, and none of them solved the problem. I was at my wit's end.
The ultimate solution!!!
transformers had been installed without pinning a version, which pulled in 4.20.1; downgrading it to 4.16.2 fixed the problem.
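The version pin the author landed on can be expressed as follows (a sketch; the exact command depends on your environment and package manager):

```shell
# Installing without a pin had pulled in transformers 4.20.1;
# pin the version that worked instead.
pip install transformers==4.16.2
```

Pinning the version in a requirements file also prevents a later unpinned install from silently reintroducing the incompatible release.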