failed to query event: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered

Error message:
Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2022-03-24 23:32:13.170887: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

Description of the situation:
I wrote a custom loss function:

        def amp_loss(y_true, y_pred):  # essentially a loss on the amplitude-frequency response
            # tf.squeeze first removes the axis=1 dimension, because tf.signal.rfft computes the
            # 1-dimensional discrete Fourier transform of a real-valued signal over the
            # inner-most dimension of its input
            # tf.signal.rfft performs the DFT
            # tf.math.abs takes the magnitude
            # tf.expand_dims restores the original axis=1 dimension
            amplitude_true = tf.expand_dims(tf.math.abs(tf.signal.rfft(tf.squeeze(y_true))), -1)
            amplitude_pred = tf.expand_dims(tf.math.abs(tf.signal.rfft(tf.squeeze(y_pred))), -1)

            amplitude_loss = tf.math.reduce_mean(tf.math.square(amplitude_true - amplitude_pred))

            return amplitude_loss

The error is raised when calling model.fit. My suspicion is that some of the operations in the loss function are not supported by my CUDA setup.

1. Check whether the cudnn version is the problem.
My notebook, for example, runs the GPU build of tensorflow 2.2, and its cudnn appears to be 7.6.5.

I also tried running the code in AWS SageMaker Studio Lab.
That environment has tensorflow-gpu 2.6.2 with cudnn 8.2.1.
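To compare environments like the two above, it can help to print the versions TensorFlow was actually built against. The following is only a sketch: the helper name tf_build_versions is made up for illustration, and tf.sysconfig.get_build_info() may be missing on older TensorFlow releases, so the code falls back to an empty result in that case.

```python
def tf_build_versions():
    """Return TensorFlow, CUDA and cuDNN versions, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # get_build_info() may not exist on older TensorFlow releases,
    # so look it up defensively and fall back to an empty dict.
    get_info = getattr(tf.sysconfig, "get_build_info", None)
    build = get_info() if get_info else {}
    return {
        "tensorflow": tf.__version__,
        # These keys are only present on CUDA-enabled builds; otherwise None.
        "cuda": build.get("cuda_version"),
        "cudnn": build.get("cudnn_version"),
    }


print(tf_build_versions())
```

Running this in both the notebook and the cloud environment gives a quick side-by-side of what differs.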

2. Try running on the CPU. Add the following
before import tensorflow as tf:

import os

# Run on the CPU
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
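A quick way to confirm that the GPU is really hidden is to ask TensorFlow which GPUs it can see. This is a minimal sketch, assuming the environment variables are set before TensorFlow is first imported in the process; the helper name visible_gpus is made up for illustration.

```python
import os

# These must be set before TensorFlow is imported for the first time.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


def visible_gpus():
    """Return the GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.config.list_physical_devices("GPU")


# With the GPU hidden, this list should be empty.
print(visible_gpus())
```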

and also try wrapping the model code in

with tf.device('/cpu:0'):

This forces the model to run on the CPU, and in my tests the error no longer occurs.
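If the FFT in the loss really is the culprit, one option that might keep the rest of the model on the GPU is to pin only the loss computation to the CPU with tf.device. This is an untested sketch, not something verified on the failing machine. The name amp_loss_cpu is made up, the inputs are assumed to have shape (batch, length, 1), and the trailing tf.expand_dims is dropped because it does not change the mean.

```python
import tensorflow as tf


def amp_loss_cpu(y_true, y_pred):
    # Hypothetical variant of amp_loss: same computation, but pinned to the CPU.
    with tf.device('/cpu:0'):
        # Assumption: inputs are (batch, length, 1), so only the last axis is squeezed.
        # This also keeps the batch axis intact when the batch size is 1.
        spec_true = tf.math.abs(tf.signal.rfft(tf.squeeze(y_true, axis=-1)))
        spec_pred = tf.math.abs(tf.signal.rfft(tf.squeeze(y_pred, axis=-1)))
        # Mean squared difference between the two amplitude spectra.
        return tf.math.reduce_mean(tf.math.square(spec_true - spec_pred))
```

It would be passed to model.compile in place of the original loss, for example model.compile(optimizer='adam', loss=amp_loss_cpu).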

Origin blog.csdn.net/aa2962985/article/details/123720909