Deep learning environment setup - NVIDIA driver and CUDA installation


Foreword

To be honest, although the author has been working for many years since graduation, configuring a development environment still throws up problems that cause confusion. Fortunately, the author has always kept notes, previously stored in a private cloud, and is now sorting them out to share. Besides listing the steps, this article explains the reasons behind them wherever possible, so that readers can follow the logic.

Note: This article applies to Linux systems.


1. Environment configuration and files

The environment configuration used in this article is:

  • Graphics driver: nvidia-430
    • File name: NVIDIA-Linux-x86_64-430.14
  • cuda: cuda-10.0
    • File name: cuda_10.0.130_410.48_linux
  • cudnn: cudnn 7.5
    • File name: cudnn-10.0-linux-x64-v7.5.0.56

Note that the graphics driver used here is version 430, which does not match the 410 driver bundled with cuda10.0. This is not a problem: the 430 driver is fully compatible with cuda-10. However, if the installed driver differs greatly in version from the one bundled with cuda, it is not clear whether incompatibilities will appear.
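The version comparison described above can be sketched in shell. This is only a rough sanity check under stated assumptions: the helper name version_ge is made up for illustration, and the 410.48 threshold is simply the bundled driver version that appears in the cuda installer's file name (cuda_10.0.130_410.48_linux). It assumes that a driver at least as new as the bundled one is acceptable, which matches the author's experience with 430 but is not an official compatibility test.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A >= B (hypothetical helper).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0   # missing fields count as 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0                               # all fields equal
    }'
}

# Driver version bundled with the cuda 10.0 installer, per its file name.
CUDA10_BUNDLED_DRIVER=410.48

if version_ge 430.14 "$CUDA10_BUNDLED_DRIVER"; then
    echo "driver 430.14 is at least as new as the bundled $CUDA10_BUNDLED_DRIVER"
fi
```

Each field is compared as a number, so 410.104 is treated as newer than 410.48, which a plain string comparison would get wrong.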

Choosing the cuda and cudnn versions also matters, because the major deep learning frameworks may not yet support the latest release well. For example, the precompiled tensorflow package only supports cuda10.0, and other versions have to be compiled manually; when cuda10 first came out, pytorch's libtorch only supported cuda9. Choose according to your needs.

2. Installation steps

2.1 Install related dependencies

sudo apt-get install build-essential  # build environment, including make, gcc, g++, etc.

On the author's computer the graphics driver installed normally once the build environment was in place, but other write-ups list different dependencies. The following is a partial summary. If the driver still fails to install with only the build environment present, try installing the following packages:

sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler
sudo apt-get install --no-install-recommends libboost-all-dev
sudo apt-get install libopenblas-dev liblapack-dev libatlas-base-dev
sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev
sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev 

2.2 Install the official driver

Go to NVIDIA's official website and download the driver that matches your graphics card: https://www.nvidia.com/Download/index.aspx?lang=cn

2.3 Disable nouveau and close the graphical interface (Xserver)

Nouveau is an open source 3D driver for NVIDIA graphics cards developed by a third party; it is not endorsed or supported by NVIDIA. Although Nouveau Gallium3D is far from matching NVIDIA's official proprietary driver in gaming performance, it lets Linux cope with a wide range of NVIDIA hardware, so users get a working desktop with a decent display right after installing the system. For this reason many Linux distributions ship Nouveau and load it by default when an NVIDIA card is detected. This is especially true of enterprise Linux: almost every enterprise distribution that supports a graphical interface includes Nouveau.

For personal desktop users, however, the still-maturing Nouveau falls short. Unlike enterprise users, individuals often want 3D effects on top of a working desktop, and Nouveau usually cannot deliver them. Nouveau also gets in the way when installing NVIDIA's official proprietary driver: unless Nouveau is disabled first, the installer reports an error.

Xserver is the display server behind the Linux graphical interface.

Edit the /etc/modprobe.d/blacklist.conf file and add the following lines at the end:

blacklist nouveau
options nouveau modeset=0

[Figure: blacklist.conf after the modification]
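If you prefer not to edit the file by hand, the same change can be scripted. The following is a minimal sketch: ensure_line is a hypothetical helper, not part of any NVIDIA tooling, and it has to be run with root privileges when the target is under /etc. It appends a line only if an identical line is not already present, so running it twice does no harm.

```shell
#!/bin/sh
# ensure_line FILE LINE: append LINE to FILE unless an identical line
# already exists (hypothetical helper; needs root for files under /etc).
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Intended usage, as root:
#   ensure_line /etc/modprobe.d/blacklist.conf 'blacklist nouveau'
#   ensure_line /etc/modprobe.d/blacklist.conf 'options nouveau modeset=0'
```

Note that the option is written as modeset=0 without spaces around the equals sign.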

Then update the initramfs so that the change takes effect at boot, and reboot:

sudo update-initramfs -u
sudo reboot

After rebooting, check whether nouveau is disabled by running:

lsmod | grep nouveau

If there is no output, nouveau is disabled.
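For use inside a script, the check above can be turned into a yes/no test. This is a small sketch with a hypothetical helper name, nouveau_loaded; it reads lsmod-style output on standard input and matches only a module named exactly nouveau in the first column.

```shell
#!/bin/sh
# nouveau_loaded: read lsmod-style output on stdin; succeed if a module
# named exactly "nouveau" is listed (hypothetical helper).
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Intended usage on the real machine:
#   if lsmod | nouveau_loaded; then echo "nouveau is still loaded"; fi
```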
Finally, stop the graphical interface:

sudo service lightdm stop

Note that your computer may not have the lightdm display manager installed and may be using gdm3 instead. In that case run:

sudo service gdm3 stop

Alternatively, install lightdm first and then stop it (this is recommended because, from experience, you will probably end up installing it later anyway). The installation command is:

sudo apt install lightdm
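If you are not sure which display manager your system uses, the following sketch may help. It assumes a Debian or Ubuntu system, where the file /etc/X11/default-display-manager holds the path of the display manager's binary, and it assumes that the service name equals the binary name (true for lightdm and gdm3). The helper name display_manager is made up for illustration.

```shell
#!/bin/sh
# display_manager [FILE]: print the name of the default display manager
# recorded in FILE (hypothetical helper). FILE defaults to the location
# used on Debian and Ubuntu.
display_manager() {
    file=${1:-/etc/X11/default-display-manager}
    [ -r "$file" ] || return 1
    basename "$(head -n 1 "$file")"
}

# Intended usage:
#   dm=$(display_manager) && sudo service "$dm" stop
```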

2.4 Install the driver

Enter the directory where the graphics card driver installation file is located and execute the following command to install:

sudo ./NVIDIA-Linux-x86_64-430.14.run --no-opengl-files

Since the driver version you downloaded may differ from the author's, use your own file name. The --no-opengl-files parameter tells the installer not to install the OpenGL files, which avoids the problem of being unable to enter the graphical interface after installation. During installation, just choose accept or continue all the way. After the installation is complete, run the following command to check whether it succeeded:

nvidia-smi

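If it succeeded, nvidia-smi prints a table with the driver version and the installed graphics cards.
[Figure: nvidia-smi output after a successful installation]
You will notice that the cuda version reported by the driver is 10.2. This is the highest cuda version the driver supports, so it does not conflict with the cuda10.0 we are about to install.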
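To pick the two version numbers out of the nvidia-smi banner in a script, something like the following can be used. This is a sketch under an assumption: the banner line looks like "Driver Version: 430.14  CUDA Version: 10.2", as it does for the driver used in this article; other driver versions may print a different layout. The helper name smi_field is hypothetical.

```shell
#!/bin/sh
# smi_field NAME: read nvidia-smi's banner on stdin and print the number
# that follows "NAME:" (hypothetical helper; assumes the banner layout
# "Driver Version: X  CUDA Version: Y").
smi_field() {
    sed -n "s/.*$1: *\([0-9][0-9.]*\).*/\1/p" | head -n 1
}

# Intended usage on the real machine:
#   nvidia-smi | smi_field 'Driver Version'
#   nvidia-smi | smi_field 'CUDA Version'
```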
Then start Xserver again and reboot to check for problems:

sudo service lightdm start
sudo reboot

If Xserver does not start normally, or you are stuck at the login screen, it is very likely that Xserver was not stopped correctly in the steps above or that the --no-opengl-files parameter was omitted when installing the driver. Please check both; the fixes are collected in the bug handling chapter below.

2.5 Install cuda

After installing the graphics driver, install cuda. Again, you need to download the installation package from NVIDIA's cuda page.

cuda download page: https://developer.nvidia.com/cuda-downloads

The download page offers version 10.1 by default, but we need 10.0, so click Legacy Releases to find the earlier version. Select the options that match your own system (the author uses ubuntu18). Downloading the first Base Installer is enough for normal use; Patch 1 is a patch and can also be downloaded if needed.

Next, start the installation. Note that you also need to close the Xserver first, and then execute the cuda10 installation file:

sudo service lightdm stop
sudo ./cuda_10.0.130_410.48_linux.run

At this time, there will be many prompts for you to confirm. Since the graphics card driver has been successfully installed above, there is no need to install it again here.

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 XXX.XX ?

Choose no here; answer yes or accept to everything else.

After the installation succeeds, add the environment variables. You can add them either to the system-wide profile or to the .bashrc file in your own home directory:

vim ~/.bashrc  # open the configuration file
Add the following variables:
export PATH=/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-10.0

These variables let python and other programs find the cuda libraries and header files, avoiding "not found" errors.
Run the source command so that the changes to .bashrc take effect:

source ~/.bashrc
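To confirm that the variables really took effect in the current shell, a small check such as the following can be run. It is a sketch with a hypothetical helper name, path_has, which tests whether a directory is an entry of a colon-separated list; the cuda-10.0 paths are the ones set above.

```shell
#!/bin/sh
# path_has DIR [LIST]: succeed if DIR is an entry of the colon-separated
# LIST, which defaults to $PATH (hypothetical helper).
path_has() {
    case ":${2-$PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Intended usage after `source ~/.bashrc`:
#   path_has /usr/local/cuda-10.0/bin && nvcc --version
#   path_has /usr/local/cuda-10.0/lib64 "$LD_LIBRARY_PATH" || echo "lib64 missing"
```

Wrapping the list in colons makes the match exact, so /usr/local/cuda-10.0/bin does not accidentally match a longer entry that merely starts with the same text.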

Finally, verify that cuda was installed successfully. This requires the samples that can be installed together with cuda; if you followed this tutorial, they are already installed:

cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery

If the output lists your graphics card and ends with Result = PASS, the installation is successful.

2.6 cudnn installation

After installing cuda, we also need to install cudnn: https://developer.nvidia.com/cudnn
Note that you have to register before you can download cudnn. The author uses cudnn version 7.5, so here too you need to pick an earlier version from the archive.
Install cudnn:
cudnn is easy to install. Just extract the archive and copy its contents into the cuda root directory:

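tar -zxvf cudnn-10.0-linux-x64-v7.5.0.56.tgz  # extract the archive
cd cudnn-10.0-linux-x64-v7.5.0.56  # enter the cudnn folder (skip this if the archive unpacked directly into ./cuda)
sudo cp -r cuda/* /usr/local/cuda-10.0/  # copy everything under cuda/ into the cuda10 directory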
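A quick way to confirm that the copy worked is to read the version macros from the installed header. The sketch below assumes the cudnn 7.x layout, in which cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL (later releases keep them in a separate header). The helper name cudnn_version is hypothetical.

```shell
#!/bin/sh
# cudnn_version HEADER: print MAJOR.MINOR.PATCH from the CUDNN_MAJOR,
# CUDNN_MINOR and CUDNN_PATCHLEVEL macros in a cudnn.h file
# (hypothetical helper; assumes the cudnn 7.x header layout).
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END {
             if (major == "") exit 1    # macros not found
             print major "." minor "." patch
         }' "$1"
}

# Intended usage after the copy step:
#   cudnn_version /usr/local/cuda-10.0/include/cudnn.h
```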

At this point, you're done and all the installations are successful.

3. Bug handling

Here are the bugs the author has run into with the graphics driver and cudnn, together with their solutions:

3.1 Login loop or unable to enter Xserver

The main cause of a login loop, or of failing to enter the graphical interface, is presumably an OpenGL error. In practice this means that Xserver or nouveau was not disabled during installation (although in theory the installer refuses to run and reports an error while nouveau is active). The solution is to uninstall the current graphics driver and install it again following this tutorial. The commands to uninstall the driver are:

sudo service lightdm stop  # stop the Xserver service
sudo /usr/bin/nvidia-uninstall  # uninstaller shipped with the nvidia driver
sudo apt-get autoremove --purge 'nvidia*'  # uninstall through apt

Both uninstall commands above can be run, one after the other, to make sure nothing is left behind.

3.2 The graphics driver is missing and nvidia-smi reports an error

If cuda fails to start normally and running nvidia-smi reports the following error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

There are two possibilities. The first: if the error appears right after installing the driver, the nvidia driver has not been loaded by the kernel. Run the following commands to check whether the kernel module files exist:

cd /lib/modules
find . -name "*.ko" | grep -i nvidia

Under normal circumstances the output lists the nvidia.ko (or possibly nvidia_xxx.ko) files; check whether yours are missing.
If they are missing, you need to install the kernel source and headers:

sudo apt-get install linux-source
sudo apt-get install linux-headers-4.18.0-25-generic

Here 4.18.0-25-generic comes from the output of the command uname -r, so use the value printed on your own machine.
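Rather than typing the kernel release by hand, the package name can be built from uname -r directly. A minimal sketch, assuming the Ubuntu naming scheme linux-headers-RELEASE shown above; the helper name headers_pkg is made up for illustration.

```shell
#!/bin/sh
# headers_pkg [RELEASE]: print the kernel-headers package name for a kernel
# release string, defaulting to the running kernel (hypothetical helper;
# assumes Ubuntu's linux-headers-RELEASE naming).
headers_pkg() {
    printf 'linux-headers-%s\n' "${1:-$(uname -r)}"
}

# Intended usage:
#   sudo apt-get install "$(headers_pkg)"
```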
The second case: the driver worked for some time after a normal installation and then one day suddenly reported this error. Here the cuda environment was probably changed recently, which damaged the graphics driver.
In this case the simplest, brute-force fix is to uninstall the graphics driver and cuda completely and reinstall both:

sudo service lightdm stop  # stop the Xserver service
sudo /usr/bin/nvidia-uninstall  # uninstaller shipped with the nvidia driver
sudo apt-get autoremove --purge 'nvidia*'  # uninstall through apt
sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl  # uninstall cuda

After the uninstall is complete, reinstall it again.

Origin blog.csdn.net/TchaikovskyBear/article/details/129144438