Horovod needs MPI for communication, and NCCL and CUDA for compilation, so you need to install these dependencies before installing Horovod.
Environment
- centos7 x64
- openmpi 4.0.2
- nccl nccl-repo-rhel7-2.6.4-ga-cuda10.0-1-1
- cuda cuda_10.0.130_410.48_linux
- cudnn 7.6.5
- tensorflow-gpu 1.14.0
- torch 1.1.0
- Keras 2.2.4
- horovod 0.19.1
Open MPI installation
Reference link: How do I build Open MPI?
Download address: download
1. Download Open MPI
You can download the installation package from the download address and upload it to the server, or download it directly on the server with wget. Here, wget is used to download Open MPI 4.0.2.
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.2.tar.gz
2. Extract, compile, install
tar -zxvf openmpi-4.0.2.tar.gz
cd openmpi-4.0.2
mkdir -p /usr/local/openmpi-4.0.2
./configure --prefix=/usr/local/openmpi-4.0.2
make all install
3. Configure environment variables
Add the following to /etc/profile:
# openmpi
export OPENMPI_HOME=/usr/local/openmpi-4.0.2
export PATH=$PATH:$OPENMPI_HOME/bin
Then run source /etc/profile to make it take effect.
4. Test
Enter the examples directory of the Open MPI source package (note: this is the source package directory /opt/openmpi-4.0.2, not the installation directory /usr/local/openmpi-4.0.2):
make
./hello_c
If the following content appears, it means ok:
Hello, world, I am 0 of 1, (Open MPI v4.0.2, package: Open MPI root@dp-master Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 109)
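Besides running hello_c, it can be handy to confirm that the mpirun found on PATH is really the 4.0.2 build installed above. The following is only a minimal sketch: the function names are hypothetical, and it assumes that `mpirun --version` prints a line of the form `mpirun (Open MPI) 4.0.2`.

```python
import re
import shutil
import subprocess


def parse_open_mpi_version(text):
    """Return (major, minor, patch) parsed from `mpirun --version` output, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_open_mpi_version():
    """Run the mpirun found on PATH and parse its version; None if mpirun is missing."""
    mpirun = shutil.which("mpirun")
    if mpirun is None:
        return None
    result = subprocess.run(
        [mpirun, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_open_mpi_version(result.stdout)


if __name__ == "__main__":
    print("Open MPI version on PATH:", installed_open_mpi_version())
```

If this prints None, the most likely cause is that /etc/profile has not been sourced in the current shell.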
NCCL installation
Reference address: nccl-install-guide
Download address: NVIDIA Collective Communications Library (NCCL) Download Page
1. Install NCCL
After downloading the corresponding version of NCCL, upload it to the server. Note that the installation files and commands differ between systems (the 90 and 25x machines run different systems):
centos:
sudo rpm -ivh nccl-repo-rhel7-2.6.4-ga-cuda10.0-1-1.x86_64.rpm --force --nodeps
sudo yum update
sudo yum install -y libnccl libnccl-devel libnccl-static
ubuntu:
sudo dpkg -i nccl-repo-ubuntu1804-2.6.4-ga-cuda10.0_1-1_amd64.deb
sudo apt update
sudo apt install libnccl2 libnccl-dev
CUDA installation
Reference documents:
- Centos 7 installation Cuda10 process record
- Summarize the experience of building the Tensorflow-gpu environment of CUDA10+cudnn7 on CentOS7
1. Detect the graphics card
lspci | grep -i vga
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
If output like the above appears, the system has detected an NVIDIA graphics card. If the graphics card driver is not installed yet, you can refer to "two ways of installing nvidia in centos" to install it.
2. Disable the Nouveau driver
CentOS uses the Nouveau driver by default. If the following command prints any output, Nouveau has not been disabled (otherwise skip this step):
lsmod | grep nouveau
Disable it by adding the following to /etc/modprobe.d/blacklist.conf and then rebooting the machine:
blacklist nouveau
options nouveau modeset=0
Reboot:
reboot
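After the reboot, the Nouveau check can also be scripted. The sketch below is an illustration only (the function name is hypothetical). It matches the module name in the first column of lsmod-style output, so a line for another module that merely lists nouveau in its "Used by" column does not count as a hit.

```python
import os


def nouveau_loaded(module_listing):
    """Return True if a module named exactly 'nouveau' appears in the first column."""
    for line in module_listing.splitlines():
        fields = line.split()
        if fields and fields[0] == "nouveau":
            return True
    return False


if __name__ == "__main__":
    # /proc/modules is the kernel's list of loaded modules; the module name
    # is the first field on each line, the same position as in lsmod output.
    if os.path.exists("/proc/modules"):
        with open("/proc/modules") as handle:
            print("nouveau loaded:", nouveau_loaded(handle.read()))
```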
3. CUDA installation
Go to the CUDA Toolkit Archive to find a suitable version and download the installation package. Version 10.0 is used here (TensorFlow 1.14 has not been tested with 10.1):
wget -O cuda_10.0.130_410.48_linux.run https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
Then execute the following commands to install:
init 3
# If you want the default options, you can add --silent
sudo sh cuda_10.0.130_410.48_linux.run
init 3 must be executed before the installation, otherwise the following error is reported:
The file '/tmp/.X0-lock' exists and appears to contain the process ID '3031' of a runnning X server.
The solution reference document is: How to install NVIDIA.run?
After the installation is complete, add the following configuration to /etc/profile:
export PATH=$PATH:/usr/local/cuda-10.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64
Run source /etc/profile to make the change take effect.
Run nvidia-smi to test.
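To confirm that the two exports above are visible in the current session, a small check like the following can be used. It is a sketch with a hypothetical helper name. The directories are the ones configured in /etc/profile above.

```python
import os


def missing_entries(path_value, required):
    """Return the required directories that are absent from a colon-separated path string."""
    present = {entry.rstrip("/") for entry in path_value.split(":") if entry}
    return [directory for directory in required if directory.rstrip("/") not in present]


if __name__ == "__main__":
    print("PATH is missing:",
          missing_entries(os.environ.get("PATH", ""), ["/usr/local/cuda-10.0/bin"]))
    print("LD_LIBRARY_PATH is missing:",
          missing_entries(os.environ.get("LD_LIBRARY_PATH", ""), ["/usr/local/cuda-10.0/lib64"]))
```

An empty list for both variables means the configuration has taken effect.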
cudnn installation
After installing the graphics card driver and CUDA, you need to install cuDNN before deep learning algorithms can use the GPU. Use the following commands to install the CUDA toolkit and cuDNN with conda:
conda install -y cudatoolkit=10.0
conda install -y cudnn
If the version cannot be found, you can specify the channel to download from:
conda install -y cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
gcc, g++ upgrade
Compiling Horovod requires gcc >= 4.8.5 and g++ >= 4.9. The version that comes with CentOS is 4.8.5, so it needs to be upgraded to a higher version. Reference document: centos7 uses yum to upgrade gcc
sudo yum install centos-release-scl
sudo yum install devtoolset-7-gcc*
# Starts a new shell with devtoolset-7 enabled (applies to that shell session only)
scl enable devtoolset-7 bash
which gcc
gcc --version
g++ --version
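The version requirement can be checked programmatically from the first line printed by gcc --version or g++ --version. This is a minimal sketch with hypothetical function names. The sample line in the comment is only an assumed example of the output format, not output captured from this machine.

```python
import re


def parse_compiler_version(first_line):
    """Extract a version tuple such as (7, 3, 1) from the first line of `gcc --version`."""
    # Drop parenthesised vendor text first, e.g. "(GCC)" or "(Red Hat 7.3.1-5)".
    cleaned = re.sub(r"\([^)]*\)", " ", first_line)
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", cleaned)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def meets_minimum(version, minimum):
    """Compare version tuples, e.g. (7, 3, 1) against the g++ minimum (4, 9)."""
    return version >= minimum


if __name__ == "__main__":
    # Assumed example line; paste the real first line of `g++ --version` instead.
    line = "g++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)"
    version = parse_compiler_version(line)
    print(version, "meets g++ >= 4.9:", meets_minimum(version, (4, 9)))
```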
Horovod installation
Reference address 1: Install
Reference address 2: Horovod Installation Guide
1. Install horovod
Compiling Horovod appears to need both the CPU version and the GPU version of TensorFlow. Make sure both are installed in the environment, otherwise the build triggers a download of the latest TensorFlow. (I'm not sure why, but when I installed it myself without the CPU version, it automatically downloaded TensorFlow 2.0, so I installed both and then compiled Horovod. If you do not run into this, just compile it directly.)
Horovod is installed through pip, and Python on the server is managed by conda, so first activate the corresponding conda environment and then execute the installation commands:
conda activate ai
# Compiling horovod needs the CPU version (not sure why); if it is missing, the latest tensorflow and pytorch are downloaded
pip install tensorflow==1.14 tensorflow-gpu==1.14
# Compile
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir 'horovod[tensorflow,keras,pytorch]'
# After compiling, uninstall all tensorflow packages and reinstall tensorflow-gpu; otherwise, strangely, only the CPU can be used
pip uninstall -y tensorflow tensorflow-gpu tensorflow-estimator tensorboard
pip install tensorflow-gpu==1.14
# Check that the frameworks supported by horovod include tensorflow and pytorch
horovodrun -cb
# --Test whether the GPU is available (run the following in Python)--
# List the available devices
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
# Test whether the GPU is available
import tensorflow as tf
print(tf.test.is_gpu_available())
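Beyond horovodrun -cb, a quick smoke test is to initialize Horovod from Python and print the rank and size. The sketch below assumes the TensorFlow binding built above. hvd.init(), hvd.rank() and hvd.size() are Horovod calls, while the function name horovod_smoke_test is hypothetical.

```python
def horovod_smoke_test():
    """Initialize Horovod and return (rank, size), or None if it cannot be imported."""
    try:
        import horovod.tensorflow as hvd
    except ImportError:
        # Horovod (or its TensorFlow binding) is not installed in this environment.
        return None
    hvd.init()
    return hvd.rank(), hvd.size()


if __name__ == "__main__":
    print("Horovod (rank, size):", horovod_smoke_test())
```

Saved under a hypothetical name such as hvd_check.py, it can be started with, for example, horovodrun -np 2 -H localhost:2 python hvd_check.py. In that case each process should report its own rank and a size of 2.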
2. Test
For testing, refer to my other article: horovod test
3. Troubleshooting:
Question 1:
Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Solution
Reference (pay attention to whether the installed graphics card driver version matches the system):
- Could not load dynamic library 'libnvinfer_plugin.so.6' #35968
- Could not load dynamic library 'libnvinfer.so.6'
# Make sure /usr/local/cuda-10.1/lib64 is included in the LD_LIBRARY_PATH environment variable
echo $LD_LIBRARY_PATH
# If it is not, add the following to /etc/profile
export PATH=$PATH:/usr/local/cuda-10.1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/lib64
# Source the file to make it take effect
source /etc/profile
Question 2:
Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory;
Solution
Reference:
- Solve the problem of Could not load dynamic library 'libcudart.so.10.0'
# This is actually caused by tensorflow not supporting cuda10.1; uninstall cuda10.1 and reinstall cuda10.0
cd /usr/local/cuda-10.1/bin/
# Select everything to uninstall the existing cuda
sudo ./cuda-uninstaller
# After uploading the downloaded cuda10.0 to the server, go to the corresponding directory
sudo init 3
# Do not use --silent; the graphics card driver does not need to be installed again when reinstalling, otherwise another reboot is required
sudo sh ./cuda_10.0.130_410.48_linux.run
# After installation, update the configuration in /etc/profile to the following
export PATH=$PATH:/usr/local/cuda-10.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64
# Make the change take effect
source /etc/profile
# Reinstall cuda-toolkit
conda uninstall -y cudatoolkit
conda install -y cudatoolkit=10.0
conda install -y cudnn
# Reinstall tensorflow
pip uninstall -y tensorflow tensorflow-gpu tensorflow-estimator tensorboard
pip install tensorflow-gpu==1.14
Execute the following code to test. If no error is reported, the fix is complete:
# List the available devices
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
# Test whether the GPU is available
import tensorflow as tf
print(tf.test.is_gpu_available())
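Errors like the ones in Questions 1 and 2 come from the dynamic loader not finding a shared library. Before starting TensorFlow, you can ask the loader directly whether a given library resolves in the current environment. This is a sketch using the standard ctypes module. The helper name can_load is hypothetical, and the library names are the ones from the error messages above.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    for name in ("libcudart.so.10.0", "libcusparse.so.10.0"):
        print(name, "loadable:", can_load(name))
```

If a library is reported as not loadable, check that the matching CUDA lib64 directory is in LD_LIBRARY_PATH and that the installed CUDA version is the one TensorFlow expects.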
Question 3:
horovod.run.common.util.network.NoValidAddressesFound: Unable to connect to the horovodrun task service #1 on any of the addresses
This problem occurs during multi-machine multi-card training. Horovod's multi-node distributed training requires each node to have a port that the other nodes can connect to, and this error can have different causes. In my case, a warning indicated that machine 253 was expected to connect through the network card enp4s0 but was using enp3s0 (I forgot to keep the error message). This happened because the name of the network card changed after a system exception. Refer to "CentOS7 network configuration and modify the name of the network card and common service management commands" to change the name of the network card to enp4s0.
sudo nmtui
sudo reboot
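To compare network card names across nodes before launching a multi-node job, the interface names can be listed from Python on each machine. This is a minimal sketch with hypothetical helper names. It relies on socket.if_nameindex() from the standard library, which is available on Linux.

```python
import socket


def interface_names():
    """Return the names of the network interfaces known to the operating system."""
    return [name for _index, name in socket.if_nameindex()]


def has_interface(name, names=None):
    """Return True if the given interface name exists (in `names`, or on this host)."""
    if names is None:
        names = interface_names()
    return name in names


if __name__ == "__main__":
    print("Interfaces:", interface_names())
    print("enp4s0 present:", has_interface("enp4s0"))
```

Running it on every node shows quickly whether one machine is using a different interface name than the others.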
Question 4:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
This is most likely a problem with the graphics card driver; reinstall the NVIDIA driver.