Installing TensorFlow-GPU (1.x) with CUDA and cuDNN on Linux: a guide to the pitfalls

Installation environment: CentOS 7 virtual machine (the principle is the same on Ubuntu, but the commands differ)

TensorFlow-GPU versions: 1.14.0 and 1.15.0 (both tested)

Graphics card: Tesla T4

Straight into the first pitfall

I installed TensorFlow several times, and every time I ran import tensorflow as tf, Python died with 'illegal instruction'. The cause turned out to be the instruction sets supported by the CPU, and tracking it down took a lot of time...

First pitfall

First check the CPU flags:

cat /proc/cpuinfo

processor       : 0
vendor_id       : HygonGenuine
cpu family      : 24
model           : 1
model name      : Hygon C86 7285 32-core Processor
stepping        : 1
microcode       : 0x1000065
cpu MHz         : 2000.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core retpoline_amd ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr virt_ssbd arat npt nrip_save arch_capabilities

If the flags contain only sse and sse2, importing TensorFlow fails with 'illegal instruction'. The flags need to look like the ones shown above, including sse, sse2, sse4_1 and sse4_2; the official TensorFlow wheels since version 1.6 also use avx. How to get the virtual machine to expose these instruction sets is a separate topic and is not covered here.
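The flag check described above can be automated. The following is a minimal sketch, not the author's procedure: the function names and the REQUIRED_FLAGS tuple are hypothetical, and the list of flags is an assumption based on the text (sse4_1, sse4_2) plus avx.

```python
from pathlib import Path

# Assumption: instruction sets to look for. sse4_1 and sse4_2 come from the
# text; avx is included because official TensorFlow wheels since 1.6 use it.
REQUIRED_FLAGS = ("sse4_1", "sse4_2", "avx")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags that the CPU does not report, in order."""
    present = parse_cpu_flags(cpuinfo_text)
    return [flag for flag in required if flag not in present]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        missing = missing_flags(cpuinfo.read_text())
        print("missing flags:", ", ".join(missing) if missing else "none")
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

Running it before installing TensorFlow shows whether the 'illegal instruction' crash is likely, without having to read the long flags line by eye.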

CentOS 7 ships with Python 2.7.5 by default. The usual approach is to install conda, create a dedicated conda environment, and install TensorFlow inside it. This time, however, TensorFlow is installed directly into the Linux system, which is equivalent to installing it directly on a physical machine.

Because this Linux system is a virtual machine, a broken install is contained to some degree. On a real physical machine this approach is not recommended, and pip itself warns against it:

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Next, install the NVIDIA graphics driver. This setup uses a Tesla T4; install the driver that matches your own card.

Second pitfall

After the NVIDIA driver is installed, nvidia-smi displays the corresponding information:

# nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:06:10.0 Off |                    0 |
| N/A   55C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The output shows the NVIDIA-SMI version and the driver version; the two are the same.

It also shows CUDA Version: 11.2, but the CUDA toolkit is not actually installed at this point. This value is reported by the driver and is the highest CUDA version the driver supports; the toolkit itself still has to be installed separately.

Choose the CUDA version from the TensorFlow/CUDA version compatibility table. This installation uses CUDA 10.0. (If your version is not in the table, you need to look up the matching version yourself.) The download address is https://developer.nvidia.com/cuda-toolkit-archive
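For the two TensorFlow versions tested in this article, the lookup can be written down as a small table in code. This is only a sketch: the dictionary name and function are hypothetical, and the values are an assumption taken from TensorFlow's published tested build configurations for Linux GPU builds, so check them against the current table before relying on them.

```python
# Assumption: values copied from TensorFlow's published "tested build
# configurations" for Linux GPU builds; verify them against the current table.
TESTED_GPU_BUILDS = {
    "1.14.0": {"cuda": "10.0", "cudnn": "7.4"},
    "1.15.0": {"cuda": "10.0", "cudnn": "7.4"},
}


def required_toolkit(tf_version):
    """Return the CUDA/cuDNN pair for a TensorFlow-GPU version, or None."""
    return TESTED_GPU_BUILDS.get(tf_version)


if __name__ == "__main__":
    for version in ("1.14.0", "1.15.0"):
        print(version, "->", required_toolkit(version))
```

A version that is missing from the dictionary returns None, which matches the advice above: look up the matching version yourself.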

For the specific installation steps, see https://blog.csdn.net/weixin_48185819/article/details/107953955

After the installation is complete, configure the environment variables, then run nvcc -V to verify the installation:

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

As the output shows, the installed toolkit is CUDA 10.0.

nvidia-smi will still display CUDA 11.2. Ignore it; the version you installed yourself is the one that counts.
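Since the toolkit version comes from nvcc and not from nvidia-smi, a script can read it the same way. The sketch below is illustrative only; the function names are hypothetical, and it assumes nvcc prints a line containing "release X.Y" as in the output shown above.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the toolkit release (e.g. '10.0') from `nvcc -V` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run `nvcc -V` if nvcc is on PATH; return the release string or None."""
    if shutil.which("nvcc") is None:
        return None  # toolkit missing, or its bin directory is not on PATH
    result = subprocess.run(
        ["nvcc", "-V"], stdout=subprocess.PIPE, universal_newlines=True
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("CUDA toolkit release:", installed_cuda_release())
```

A result of None usually means the environment variables from the previous step are not set in the current shell.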

Next, install cuDNN.

For the specific installation steps, see https://blog.csdn.net/weixin_48185819/article/details/107953955

The cuDNN version also has to match (if it is not in the table, you need to look up the matching version yourself).

In practice it is just a few copy operations, much like getting a game to run when I was young: copy a few files into the right directory and it works.
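The copy operations can be sketched as follows. This is not taken from the linked article: the function name and both paths are hypothetical, and the layout (an unpacked archive with include/ and lib64/ directories, copied into the CUDA root) is an assumption about how the cuDNN archive is organized.

```python
import glob
import os
import shutil


def copy_cudnn(extracted_dir, cuda_home):
    """Copy cuDNN's headers and libraries into an existing CUDA install.

    extracted_dir: hypothetical path of the unpacked cuDNN archive
                   (assumed to contain include/ and lib64/).
    cuda_home:     hypothetical CUDA root, e.g. /usr/local/cuda-10.0.
    Returns the list of destination paths written.
    """
    copied = []
    patterns = [("include", "cudnn*.h"), ("lib64", "libcudnn*")]
    for subdir, pattern in patterns:
        dest_dir = os.path.join(cuda_home, subdir)
        os.makedirs(dest_dir, exist_ok=True)
        for src in sorted(glob.glob(os.path.join(extracted_dir, subdir, pattern))):
            dest = os.path.join(dest_dir, os.path.basename(src))
            if os.path.lexists(dest):
                os.remove(dest)  # replace files left by an earlier attempt
            # follow_symlinks=False copies symbolic links as links
            shutil.copy2(src, dest, follow_symlinks=False)
            copied.append(dest)
    return copied
```

Copying into a system directory such as /usr/local normally requires root privileges, so a call like copy_cudnn("cuda", "/usr/local/cuda-10.0") would have to run as root.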

CUDA and cuDNN are now installed.

Next, set up the environment that TensorFlow needs.

The system Python is 2.7.5, so a newer version of Python has to be installed.

Python 3.6.8 is installed here. For the specific steps, see https://blog.csdn.net/weixin_48185819/article/details/122586200?spm=1001.2014.3001.5501

Third pitfall

The system now has two Python versions, python2 and python3.

Pay attention to the symbolic links. Remember the installation path of python3 when you install it, because by default the python command points to python2. The fix is to delete the original python symbolic link, which points to python2, and create a new python symbolic link that points to python3. The python2 command keeps pointing to Python 2.

which python    # locate python
/usr/bin/python

rm /usr/bin/python
rm: remove symbolic link '/usr/bin/python'?


ln -s /usr/local/python3/bin/python3.6 /usr/bin/python

First locate python, then delete the symbolic link. Note that the prompt says 'remove symbolic link', not remove a regular file; check this before confirming. Then create the new symbolic link right away, otherwise the python command will no longer be found. Be aware that on CentOS 7, yum's scripts start with #!/usr/bin/python and expect Python 2, so after this change they have to be pointed at python2 explicitly.

Here python3 is installed at /usr/local/python3/bin/python3.6, and /usr/bin/python is linked to it. From now on, python starts Python 3 and python2 starts Python 2, so both versions coexist.
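To confirm where each command really points after relinking, the symbolic links can be followed programmatically. This is a small sketch under the assumption that the commands are on PATH; the function name is hypothetical.

```python
import os
import shutil


def resolve_command(name, path=None):
    """Return (location found on PATH, final target after following symlinks)."""
    found = shutil.which(name, path=path)
    if found is None:
        return None, None
    return found, os.path.realpath(found)


if __name__ == "__main__":
    for name in ("python", "python2", "python3", "pip"):
        found, target = resolve_command(name)
        print(name, "->", found, "->", target)
```

If python and python2 resolve to the same target, the new symbolic link was not created correctly.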

python

Python 3.6.8 (default, Jan 20 2022, 17:26:16) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> 
>>> 



python2

Python 2.7.5 (default, Oct 14 2020, 14:45:30) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> 

For details on setting up the symbolic links so that both versions coexist, see https://blog.csdn.net/weixin_48185819/article/details/122586200?spm=1001.2014.3001.5501

Fourth pitfall

pip install tensorflow-gpu==1.14.0  -i https://pypi.douban.com/simple/

Installing TensorFlow with pip is faster through the Douban mirror.

At this point the pip command may still be the pip of python2. With the experience above, you will recognize this as another symbolic link problem: delete the current pip symbolic link and create one that points to the pip of python3.

whereis python3

python3: /usr/lib/python3.6 /usr/lib64/python3.6 /usr/local/lib/python3.6 /usr/include/python3.6m /usr/local/python3 /usr/share/man/man1/python3.1.gz

If you have forgotten the installation location, whereis can find it. Locate the pip3 that belongs to python3, then create the symbolic link:

ln -s /usr/bin/pip3  /usr/local/bin/pip

Your paths may differ from mine; adjust them to your actual setup.

Now pip runs pip3 when installing packages.
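Whether pip really belongs to Python 3 can be read from the output of pip --version, which ends with the interpreter version in parentheses. The following sketch parses that output; the function names are hypothetical, and the sample string in the usage note is illustrative, not captured from this machine.

```python
import re
import shutil
import subprocess


def pip_python_version(pip_version_output):
    """Return the Python version (e.g. '3.6') named in `pip --version` output."""
    match = re.search(r"\(python (\d+\.\d+)\)", pip_version_output)
    return match.group(1) if match else None


def current_pip_python():
    """Run `pip --version` if pip is on PATH; return its Python version or None."""
    if shutil.which("pip") is None:
        return None
    result = subprocess.run(
        ["pip", "--version"], stdout=subprocess.PIPE, universal_newlines=True
    )
    return pip_python_version(result.stdout)


if __name__ == "__main__":
    print("pip installs into Python:", current_pip_python())
```

For example, a line such as "pip 9.0.3 from /usr/lib/python3.6/site-packages (python 3.6)" yields '3.6'. Running python -m pip install instead of pip install sidesteps the question entirely, because the package is then installed for whichever interpreter python starts.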

Next, use pip to install TensorFlow:

pip install tensorflow-serving-api==1.15.0 -i https://pypi.douban.com/simple/

After the installation is complete, test the GPU. True means the GPU can be used normally. If the result is False, troubleshoot according to the related issues; see https://blog.csdn.net/weixin_48185819/article/details/107953955

import tensorflow as tf
tf.test.is_gpu_available()


2021-01-05 10:09:11.372576: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2021-01-05 10:09:11.374501: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2021-01-05 10:09:11.376425: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2021-01-05 10:09:11.376776: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
 
2021-01-05 10:09:11.408957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Graphics Device, pci bus id: 0000:b1:00.0, compute capability: 7.0)
True
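The same test can be wrapped in a small script that also reports a missing installation, so it can be run non-interactively after each install attempt. This is a sketch with a hypothetical function name; it uses tf.test.is_gpu_available() and tf.test.gpu_device_name() from the TensorFlow 1.x API. An 'illegal instruction' crash kills the process outright and cannot be caught here.

```python
def gpu_status():
    """Return a one-line description of TensorFlow's view of the GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow is not installed for this interpreter"
    if tf.test.is_gpu_available():
        return "GPU available: " + tf.test.gpu_device_name()
    return "TensorFlow imported, but no usable GPU was found"


if __name__ == "__main__":
    print(gpu_status())
```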

Various other problems can still come up. Most are easy to look up, so their solutions are not covered here. Additions are welcome.

Origin blog.csdn.net/weixin_48185819/article/details/122622101