Recently, while running a model, I found that the code suddenly stopped running without throwing an exception or exiting. After restarting the server, I found that
the nvidia-smi command reported an error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Solution:
ls /usr/src | grep nvidia
Here my driver version is nvidia-510.73.05.
sudo apt-get install dkms
sudo dkms install -m nvidia -v 510.73.05
Everything returns to normal once the installation completes. You can also enable
persistence mode to make nvidia-smi respond faster:
nvidia-smi -pm 1
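
The version lookup and the dkms step above could be combined into a small script. This is only a sketch: find_nvidia_version is a hypothetical helper name, it assumes the driver source sits in a directory named nvidia-<version> under /usr/src as shown above, and picking the highest version when several directories exist is my own assumption. It prints the dkms command instead of running it, so you can check the version first.

```shell
#!/bin/sh
# Sketch: derive the driver version from the nvidia-<version> directory
# under /usr/src, then print the matching dkms command.
# find_nvidia_version is a hypothetical helper, not part of any NVIDIA tool.

find_nvidia_version() {
    src_dir="${1:-/usr/src}"
    # Keep only entries shaped like nvidia-<digits and dots>, strip the
    # prefix, and take the highest version if several are present
    # (assumption: the newest source tree is the one to rebuild).
    ls "$src_dir" 2>/dev/null \
        | sed -n 's/^nvidia-\([0-9][0-9.]*\)$/\1/p' \
        | sort -V \
        | tail -n 1
}

version="$(find_nvidia_version /usr/src)"
if [ -n "$version" ]; then
    # Printed rather than executed, so the command can be reviewed first.
    echo "sudo dkms install -m nvidia -v $version"
else
    echo "no nvidia-<version> directory found under /usr/src" >&2
fi
```

If the printed version matches the one you expect, run the printed command as in the steps above.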