Installation environment: CentOS 7 virtual machine (the procedure on Ubuntu is the same in principle; only the commands differ)
TensorFlow GPU version: 1.14.0 and 1.15.0, both tested
Graphics card: Tesla T4
Straight into the pitfalls
I have installed TensorFlow several times, and every time I ran import tensorflow as tf it failed with "Illegal instruction". It finally turned out to be a problem at the CPU instruction-set level, and tracking it down cost a lot of time...
First pit
First, check the CPU's flags:
cat /proc/cpuinfo
processor : 0
vendor_id : HygonGenuine
cpu family : 24
model : 1
model name : Hygon C86 7285 32-core Processor
stepping : 1
microcode : 0x1000065
cpu MHz : 2000.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core retpoline_amd ssbd ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr virt_ssbd arat npt nrip_save arch_capabilities
If the flags line contains only sse and sse2, importing TensorFlow will fail with "Illegal instruction": the prebuilt binaries also require sse4_1 and sse4_2, as in the output above. The background on CPU instruction sets is not expanded on here.
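The check above can be scripted. A minimal sketch that greps /proc/cpuinfo for the two SSE4 flags the prebuilt wheels need:

```shell
#!/bin/sh
# Prebuilt TensorFlow wheels are compiled with SSE4 instructions; if the CPU
# does not advertise them, importing tensorflow dies with "Illegal instruction".
if grep -q sse4_1 /proc/cpuinfo && grep -q sse4_2 /proc/cpuinfo; then
    echo "sse4_1/sse4_2 present: prebuilt TensorFlow wheels should run"
else
    echo "sse4_1/sse4_2 missing: expect Illegal instruction with prebuilt wheels"
fi
```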
By default, this Linux system ships with Python 2.7.5. The usual approach is to install conda, use it to create a virtual TensorFlow environment, and install TensorFlow inside that environment. This time, however, we install TensorFlow directly on the Linux system, which is equivalent to installing directly on a physical machine.
Because this Linux system is virtualized, there is a certain safety margin. On a real physical machine this is not recommended, and the system itself will even warn you:
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
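If you do want to heed that warning, a virtual environment without conda is only a couple of commands (the directory name tf-env here is my arbitrary choice for this sketch):

```shell
# Create an isolated environment so pip never touches system packages,
# then activate it; the pip/python inside it shadow the system ones.
python3 -m venv tf-env
. tf-env/bin/activate
which pip    # should now resolve inside tf-env
```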
Next we install the NVIDIA graphics card driver. This machine uses a Tesla T4; install the driver that matches your own card.
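For reference, installing the driver from NVIDIA's .run package usually looks like the sketch below. The exact filename is an assumption; download the package matching your card and kernel from nvidia.com, and stop any running X server first.

```shell
# Make the downloaded installer executable and run it; the version in the
# filename must match the package you actually downloaded.
chmod +x NVIDIA-Linux-x86_64-460.106.00.run
sh NVIDIA-Linux-x86_64-460.106.00.run
```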
Second pit
After the NVIDIA driver is installed, nvidia-smi displays the card's information:
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:06:10.0 Off | 0 |
| N/A 55C P0 27W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
It shows the NVIDIA-SMI version and the driver version, which are the same.
It also shows CUDA Version: 11.2. But CUDA is not actually installed at this point; this is just the CUDA version the driver supports. We still need to install CUDA ourselves.
Select the CUDA version from the compatibility table; this time CUDA 10.0 is installed. (If your configuration is not in that table, you will need to find the corresponding table yourself.) The download address is https://developer.nvidia.com/cuda-toolkit-archive
For specific installation steps, please refer to https://blog.csdn.net/weixin_48185819/article/details/107953955
After the installation completes, configure the environment variables, then run nvcc -V to verify the installation:
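The environment variables in question are typically set in ~/.bashrc. A sketch, assuming the installer's default CUDA 10.0 location:

```shell
# Append to ~/.bashrc, then run: source ~/.bashrc
export PATH=/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
```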
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
As you can see, this is CUDA 10.0.
nvidia-smi will still display CUDA 11.2; ignore it and go by the version you installed yourself.
Next, install cuDNN
For specific installation steps, please refer to https://blog.csdn.net/weixin_48185819/article/details/107953955
You also need to find the cuDNN version matching your CUDA version (if it is not in this table, you need to find the corresponding table yourself)
In practice it is just a few copy operations, like copying game files to the right directory to play as a kid: copy several files into the specified directories and you are done.
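Those "few copy operations" are roughly the following. The tarball name is an assumption; use the archive you actually downloaded for your CUDA version:

```shell
# Unpack the cuDNN archive and copy its header and libraries into the CUDA tree.
tar -xzvf cudnn-10.0-linux-x64-v7.6.5.32.tgz
cp cuda/include/cudnn.h /usr/local/cuda/include/
cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
```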
With that, CUDA and cuDNN are installed.
Next comes the environment TensorFlow itself needs.
As mentioned, the system default is Python 2.7.5, so we need to install a newer Python.
We install Python 3.6.8. For the specific steps, refer to https://blog.csdn.net/weixin_48185819/article/details/122586200?spm=1001.2014.3001.5501
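If you prefer not to follow the link, a source build along these lines matches the /usr/local/python3 prefix used in the symlink commands later in this post (the tarball name is an assumption; fetch 3.6.8 from python.org):

```shell
# Build Python 3.6.8 from source, installing under /usr/local/python3.
tar -xzf Python-3.6.8.tgz
cd Python-3.6.8
./configure --prefix=/usr/local/python3
make && make install
```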
Third pit
The system will now contain two Python versions, python2 and python3.
Pay attention to the symlink problem. When installing python3, remember its installation path, because at this point the python command still points to python2. We delete the original python symlink (python2 itself remains reachable via the python2 command), then create a new symlink so that python points to python3:
which python    # locate python
/usr/bin/python
rm /usr/bin/python
rm: remove symbolic link '/usr/bin/python'?
ln -s /usr/local/python3/bin/python3.6 /usr/bin/python
First locate python, then delete the symlink. Note that the prompt says "remove symbolic link", not remove a file; make sure that is what you are deleting. Then be sure to create the new symlink, or you will not be able to find python afterwards.
Our python3 is installed at /usr/local/python3/bin/python3.6, and we point /usr/bin/python at that location. Now typing python starts python3 and typing python2 starts python2; the two versions coexist:
python
Python 3.6.8 (default, Jan 20 2022, 17:26:16)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>>
>>>
python2
Python 2.7.5 (default, Oct 14 2020, 14:45:30)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>>
For building the symlinks that let the two versions coexist, refer to https://blog.csdn.net/weixin_48185819/article/details/122586200?spm=1001.2014.3001.5501
Fourth pit
pip install tensorflow-gpu==1.14.0 -i https://pypi.douban.com/simple/
When installing TensorFlow with pip, the Douban mirror is noticeably faster.
At this point the pip command may still be python2's pip. After the experience above, you should recognize this as another symlink problem, so we delete the current pip symlink and create one that points to python3's pip:
whereis python3
python3: /usr/lib/python3.6 /usr/lib64/python3.6 /usr/local/lib/python3.6 /usr/include/python3.6m /usr/local/python3 /usr/share/man/man1/python3.1.gz
If you forget the installation location, whereis will find it. Locate the pip3 that belongs to python3, then create the symlink:
ln -s /usr/bin/pip3 /usr/local/bin/pip
Your paths may not match mine; adjust according to your actual setup.
From now on, installing software with the pip command uses pip3.
Below we use pip to install the TensorFlow Serving API as well:
pip install tensorflow-serving-api==1.15.0 -i https://pypi.douban.com/simple/
After the installation completes, test the GPU. True means the GPU can be used normally; if you get False, troubleshoot the related issues. You can refer to https://blog.csdn.net/weixin_48185819/article/details/107953955
import tensorflow as tf
tf.test.is_gpu_available()
2021-01-05 10:09:11.372576: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2021-01-05 10:09:11.374501: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2021-01-05 10:09:11.376425: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2021-01-05 10:09:11.376776: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2021-01-05 10:09:11.408957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 30555 MB memory) -> physical GPU (device: 1, name: Graphics Device, pci bus id: 0000:b1:00.0, compute capability: 7.0)
True
There are still various other problems that are easy to run into; their solutions are not covered here. Additions are welcome.