PyTorch 2.0 training errors "Could not load library libcudnn_cnn_train.so.8" and "Unable to register cuDNN factory": how to fix them

1. Main problem:

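My server recently went down, so I had to rebuild the deep learning environment. PyTorch has since moved past 2.0, whereas I had been using 1.9. After installing the new version I ran into this problem and could not train any model.
Error messages: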
Could not load library libcudnn_cnn_train.so.8. Error: /data/Anaconda3/envs/torch2.1/bin/…/lib/libcudnn_cnn_train.so.8: symbol _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERNS1_12OperationSetERP12cudnnContextmb, version libcudnn_cnn_infer.so.8 not defined in file libcudnn_cnn_infer.so.8 with link time reference

: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered

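When this happens, most of us first run nvidia-smi and nvcc -V, and both look fine. Then we start python inside the environment, import torch, and check torch.cuda.is_available(). That also looks fine and returns True, which is baffling.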
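torch.cuda.is_available() only confirms that a driver and a GPU are visible; as far as I can tell it does not exercise the cuDNN training library at all. A minimal sketch of a check that goes one step further, by running a single convolution forward and backward pass on the GPU, might look like this. The function name cudnn_smoke_test is my own, and depending on the failure the process may print the library error or abort instead of raising a Python exception.

```python
import importlib.util


def cudnn_smoke_test() -> str:
    """Run one conv forward/backward pass on the GPU and report what happened."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    try:
        import torch
    except (ImportError, OSError) as exc:
        return f"torch failed to import: {exc}"

    if not torch.cuda.is_available():
        return "torch imported, but CUDA is not available"

    info = (
        f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"cuDNN {torch.backends.cudnn.version()}"
    )
    try:
        # A convolution backward pass is the kind of work that needs
        # the cuDNN training kernels, unlike is_available().
        conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
        x = torch.randn(2, 3, 32, 32, device="cuda")
        conv(x).sum().backward()
        torch.cuda.synchronize()
    except (RuntimeError, OSError) as exc:
        return f"{info}: conv backward failed: {exc}"
    return f"{info}: conv forward/backward OK"


if __name__ == "__main__":
    print(cudnn_smoke_test())
```

If the last part of the output reports a failure while is_available() is True, the problem is more likely in the cuDNN libraries than in the driver or the CUDA toolkit.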

2. Version information

GPU: RTX 4090
CUDA version: 11.8
cuDNN version: 8.9.0
Driver version: 535.129.03
For other GPUs and installation setups, consult the NVIDIA website. The one thing to watch is that the versions match each other.

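3. Cause of the problem

After installing Anaconda I created a virtual environment and ran pip install torch. The installer picked torch 2.1 and pulled in a number of packages with names like nvidia-cudnn along with it. These can be understood as the CUDA and cuDNN runtime libraries matched to that torch version; from 2.0 onward, a default pip install brings them in automatically.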

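This is where the trouble starts. Many of us download CUDA and cuDNN from the official site ourselves. In my case, CUDA 11.8 and cuDNN 8.9.0 were already installed, with the environment variables set. torch 2.0 and above tries to save us that step: installing torch downloads a second copy of cuDNN into the virtual environment, and that copy conflicts with the one configured system-wide. torch versions below 2.0 do not download these NVIDIA libraries, and they train normally right after installation.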
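To see whether two copies of cuDNN are really present, something like the following sketch can list them. It assumes a Linux system, that the system-wide copy is reachable through LD_LIBRARY_PATH, and that pip places the bundled libraries under site-packages/nvidia/; the function names are my own.

```python
import os
import sysconfig
from pathlib import Path


def find_cudnn_libs(directories, recursive=False):
    """Return sorted paths of libcudnn shared libraries found in the directories."""
    found = set()
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue
        pattern = "libcudnn*.so*"
        matches = root.rglob(pattern) if recursive else root.glob(pattern)
        found.update(str(path) for path in matches)
    return sorted(found)


def cudnn_locations():
    """Collect cuDNN libraries from LD_LIBRARY_PATH and from the pip-bundled folder."""
    # Directories configured through the environment variable.
    ld_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    # Assumption: the nvidia-* pip packages unpack under site-packages/nvidia/.
    bundled_dir = Path(sysconfig.get_paths()["purelib"]) / "nvidia"
    return {
        "LD_LIBRARY_PATH": find_cudnn_libs(ld_dirs),
        "bundled with pip": find_cudnn_libs([bundled_dir], recursive=True),
    }


if __name__ == "__main__":
    for source, libs in cudnn_locations().items():
        print(f"{source}: {len(libs)} file(s)")
        for lib in libs:
            print("  ", lib)
```

If both groups are non-empty, the environment holds two cuDNN installs, which is the situation described above.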

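4. Solutions

Here are three solutions:
(1) Do not install CUDA and cuDNN system-wide, that is, do not set the environment variables at all. This takes some getting used to, and every new environment has to install its own copy of the libraries, which is redundant. Not really recommended.
(2) Install a torch version below 2.0, e.g. pip install torch==1.9.0. This is the quickest fix, and not all code needs 2.0 or above, so decide based on your own situation. If 2.1 is already installed, uninstall it and also uninstall every cuDNN library that came with it. It is best to go to the lib directory of the virtual environment, find these libraries, and delete them. If you are worried about leftovers, delete the whole environment and start from a clean one; otherwise the error persists even after installing 1.9, which I have tested.
(3) If you insist on torch 2.0 or above, download a torch whl that does not bundle the NVIDIA libraries from the PyTorch wheel download page. The difference is in the file name, so it is easy to spot.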
Wheels that bundle the NVIDIA libraries have cudnn in the file name, so take care not to download one of those by mistake.

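After downloading the file manually, install it with pip install *.whl. Then download torchvision and torchaudio from the same page; both are small, so it is best to fetch them together. Installing them with a plain pip install also works, but it is harder to get matching versions that way. The matching versions are as follows: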
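torchaudio versions matching each torch release: see the compatibility table in [2].
torchvision versions matching each torch release: see the compatibility matrix at https://github.com/pytorch/vision#installation.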
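For solution (2) above, where the bundled NVIDIA packages have to be removed along with torch, a small helper can at least list what pip installed. This is only a sketch: it assumes the bundled packages all have distribution names starting with nvidia-, and it prints the uninstall command instead of running it so the list can be reviewed first. Both function names are my own.

```python
import shlex
from importlib import metadata


def nvidia_distributions():
    """Names of installed pip distributions whose name starts with 'nvidia-'."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith("nvidia-"):
            names.add(name)
    return sorted(names)


def uninstall_command(packages):
    """Build a pip command that removes the given packages, or '' if there are none."""
    if not packages:
        return ""
    return "pip uninstall -y " + " ".join(shlex.quote(p) for p in packages)


if __name__ == "__main__":
    command = uninstall_command(nvidia_distributions())
    print(command or "no nvidia-* packages found in this environment")
```

Run it with the Python interpreter of the affected virtual environment, since it only sees packages installed there.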

If the versions do not match, other errors appear:

RuntimeError: Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvision version with torchvision.__version__ and verify if they are compatible, and if not 
please reinstall torchvision so that it matches your PyTorch install.

RuntimeError: GET was unable to find an engine to execute this computation

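These all come down to the same problem: torch 2.0 or above not matching the versions of its companion libraries. There were a few other environment errors that I did not save in time. Feel free to leave a comment, and good luck to us all.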
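Before comparing against the compatibility tables, it helps to print what is actually installed. The sketch below reads the versions from the package metadata, so it works even when importing torchvision itself fails; the function name installed_versions is my own.

```python
from importlib import metadata


def installed_versions(packages=("torch", "torchvision", "torchaudio")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

The printed triple can then be checked by hand against the official tables for torchvision and torchaudio.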

5. References

[1] https://blog.csdn.net/shiwanghualuo/article/details/122860521
[2] https://pytorch.org/audio/main/installation.html
[3] https://blog.csdn.net/wangmou211/article/details/134595135