Horovod installation

Horovod relies on MPI for communication and needs NCCL and CUDA at compile time, so install these dependencies before installing Horovod.

Environment

  • centos7 x64
  • openmpi 4.0.2
  • nccl nccl-repo-rhel7-2.6.4-ga-cuda10.0-1-1
  • cuda cuda_10.0.130_410.48_linux
  • cudnn 7.6.5
  • tensorflow-gpu 1.14.0
  • torch 1.1.0
  • keras 2.2.4
  • horovod 0.19.1

Open MPI installation

Reference link:  How do I build Open MPI?
Download address: download

1. Download openmpi

You can download the installation package from the download address and upload it to the server, or download it directly on the server with wget. Here, wget is used to download openmpi 4.0.2.

wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.2.tar.gz

2. Unzip, compile, install

tar -zxvf openmpi-4.0.2.tar.gz
cd openmpi-4.0.2
mkdir /usr/local/openmpi-4.0.2
./configure --prefix=/usr/local/openmpi-4.0.2
make all install

3. Configure environment variables

Add the following to /etc/profile:

# openmpi
export OPENMPI_HOME=/usr/local/openmpi-4.0.2
export PATH=$PATH:$OPENMPI_HOME/bin

Then run source /etc/profile to make it take effect.
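To confirm that the new entry really ended up on PATH after sourcing the profile, a small check like the one below can help. It is only a sketch: the helper name path_contains is my own, and the directory is the install prefix used above.

```python
import os


def path_contains(path_value: str, directory: str) -> bool:
    """Return True if `directory` is one of the entries of a PATH-style string."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


if __name__ == "__main__":
    # Install prefix from the configure step above, plus /bin
    openmpi_bin = "/usr/local/openmpi-4.0.2/bin"
    found = path_contains(os.environ.get("PATH", ""), openmpi_bin)
    print("openmpi bin directory on PATH:", found)
```

If this prints False in a new shell, the export lines were probably not picked up and source /etc/profile needs to be run again.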

4. Test

Enter the examples directory inside the openmpi installation package directory (note that this is the extracted package directory /opt/openmpi-4.0.2, not the installation directory /usr/local/openmpi-4.0.2):

make
./hello_c

If the following content appears, it means ok:

Hello, world, I am 0 of 1, (Open MPI v4.0.2, package: Open MPI root@dp-master Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 109)

NCCL installation

Reference address: nccl-install-guide
Download address: NVIDIA Collective Communications Library (NCCL) Download Page

1. Install NCCL

After downloading the corresponding version of NCCL, upload it to the server. Note that the installation files and commands differ between systems (the 90 and 25x machines run different systems):

centos:

sudo rpm -ivh nccl-repo-rhel7-2.6.4-ga-cuda10.0-1-1.x86_64.rpm --force --nodeps
sudo yum update
sudo yum install -y libnccl libnccl-devel libnccl-static

ubuntu:

sudo dpkg -i nccl-repo-ubuntu1804-2.6.4-ga-cuda10.0_1-1_amd64.deb
sudo apt update
sudo apt install libnccl2 libnccl-dev

CUDA installation

Reference documents:

  1. Centos 7 installation Cuda10 process record
  2. Summarize the experience of building the Tensorflow-gpu environment of CUDA10+cudnn7 on CentOS7

1. Detect graphics driver

lspci  | grep -i vga

01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)

If output like the above appears, the graphics card is detected. If the driver is not installed yet, refer to "Two ways of installing nvidia in centos" to install it.

2. Disable Nouveau driver

CentOS uses the Nouveau driver by default. If the following command prints any output, Nouveau has not been disabled yet (otherwise skip this step):

lsmod | grep nouveau

To disable it, add the following Nouveau entries to /etc/modprobe.d/blacklist.conf and then reboot the machine:

blacklist nouveau
options nouveau modeset=0

Then run reboot to restart.
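After the reboot, the same check can be scripted. The sketch below reads /proc/modules, which is where lsmod gets its data on Linux. The function name module_loaded is my own, and unlike the grep above it only matches the module name itself, not other modules that merely list nouveau as a user.

```python
def module_loaded(modules_text: str, name: str) -> bool:
    """Return True if `name` is the module name (first field) of any line."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


if __name__ == "__main__":
    # /proc/modules exists on Linux only; this part is meant for the CentOS server
    with open("/proc/modules") as handle:
        loaded = module_loaded(handle.read(), "nouveau")
    print("nouveau still loaded:", loaded)
```

If it still reports the module as loaded, the blacklist entries were probably not applied.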

3. cuda installation

Go to the CUDA Toolkit Archive to find a suitable version and download the installation package. Version 10.0 is used here (tensorflow 1.14 has not been tested with 10.1). The -O option saves the file under the .run name used by the install command:

wget -O cuda_10.0.130_410.48_linux.run https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux

 

Then execute the following command to install:

init 3
# If the defaults are fine, you can add --silent
sudo sh cuda_10.0.130_410.48_linux.run

init 3 must be run before the installation; otherwise the following error is reported:

The file '/tmp/.X0-lock' exists and appears to contain the process ID '3031' of a runnning X server.

The solution reference document is: How to install NVIDIA.run?

After the installation is complete, add the following configuration to /etc/profile:

export PATH=$PATH:/usr/local/cuda-10.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64

Run source /etc/profile to make the change take effect.

Run nvidia-smi to test.

cudnn installation

After installing the graphics card driver and CUDA, you also need to install cudnn before algorithms can use the GPU. Use the following commands to install it:

conda install -y cudatoolkit=10.0
conda install -y cudnn

If the version cannot be found, specify the channel to download from:

conda install -y cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/

gcc, g++ upgrade

Compiling and installing horovod requires gcc >= 4.8.5 and g++ >= 4.9. The version that comes with centos is 4.8.5, so it needs to be upgraded to a higher version. Reference document: centos7 uses yum to upgrade gcc

sudo yum install centos-release-scl
sudo yum install devtoolset-7-gcc*
scl enable devtoolset-7 bash
which gcc
gcc --version
g++ --version
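Instead of reading the version output by eye, the comparison against the g++ >= 4.9 requirement can be sketched in Python. The helper names here are my own, and the parsing is a heuristic: it assumes the first dotted number in the output is the compiler version, which holds for the Red Hat style banner but may not for every distribution.

```python
import re
import subprocess


def parse_version(text: str):
    """Extract the first dotted version number (e.g. 7.3.1) as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found")
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(version, minimum) -> bool:
    """Compare version tuples element by element, e.g. (7, 3, 1) >= (4, 9, 0)."""
    return tuple(version) >= tuple(minimum)


def compiler_version(command: str):
    """Run `<command> --version` and parse its output (the command must be on PATH)."""
    result = subprocess.run([command, "--version"],
                            capture_output=True, text=True, check=True)
    return parse_version(result.stdout)
```

Inside the devtoolset-7 shell, meets_minimum(compiler_version("g++"), (4, 9, 0)) should then return True.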

Horovod installation

Reference address 1: Install
Reference address 2: Horovod Installation Guide

1. Install horovod

Compiling horovod needed both the CPU version and the GPU version of tensorflow in my environment. Make sure both are installed, otherwise the build triggers a download of the latest tensorflow. (I am not sure why, but when I installed it myself without the CPU version, pip automatically downloaded tensorflow 2.0, so I installed both and then compiled horovod. If you do not run into this, just compile directly.)

horovod is installed through pip, and Python on the server is managed by conda, so first activate the corresponding conda environment and then run the installation commands:

conda activate ai
# Compiling horovod needs the CPU version too (not sure why); if it is missing, the latest tensorflow and pytorch are downloaded
pip install tensorflow==1.14 tensorflow-gpu==1.14

# Compile
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod[tensorflow,keras,pytorch]

# After compiling, uninstall all tensorflow packages and reinstall tensorflow-gpu; otherwise only the CPU can be used, which is odd
pip uninstall -y tensorflow tensorflow-gpu tensorflow-estimator tensorboard
pip install tensorflow-gpu==1.14

# Check that the frameworks supported by horovod include tensorflow and pytorch
horovodrun -cb

# --Test whether the GPU is available (run the following lines in Python)--
# List the available devices
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

# Test whether the GPU is available
import tensorflow as tf
print(tf.test.is_gpu_available())
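The output of horovodrun -cb can also be checked by a script. The sketch below assumes the output is laid out as section headings ending in a colon, followed by indented "[X] Name" or "[ ] Name" lines. That layout is an assumption on my part, and parse_check_build is a hypothetical helper, not part of horovod.

```python
def parse_check_build(output: str) -> dict:
    """Group '[X] Name' / '[ ] Name' lines by the preceding 'Section:' heading."""
    result = {}
    section = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.endswith(":") and not line.startswith("["):
            # A heading such as "Available Frameworks:" starts a new section
            section = line[:-1]
            result[section] = {}
        elif section is not None and len(line) > 3 and line[0] == "[" and line[2] == "]":
            # The character inside the brackets marks whether the item was built in
            result[section][line[3:].strip()] = line[1].upper() == "X"
    return result


if __name__ == "__main__":
    # Assumed sample of the layout, not captured from a real run
    sample = "Available Frameworks:\n    [X] TensorFlow\n    [ ] MXNet\n"
    print(parse_check_build(sample))
```

With the real output captured through subprocess, the same dictionary makes it easy to fail fast when TensorFlow or PyTorch is missing from the build.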

2. Test

The test refers to my other article: horovod test

3. Problem solving:

Question 1:

Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory

W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

Solution

Reference (you need to pay attention to whether the installed version of the graphics card driver is consistent with the system):

  1. Could not load dynamic library 'libnvinfer_plugin.so.6' #35968

  2. Could not load dynamic library 'libnvinfer.so.6'

     # Make sure /usr/local/cuda-10.1/lib64 is in the LD_LIBRARY_PATH environment variable
     echo $LD_LIBRARY_PATH
    
     # If it is not, add the following to /etc/profile
     export PATH=$PATH:/usr/local/cuda-10.1/bin
     export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/lib64
    
     # Run source to make it take effect
     source /etc/profile
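When a "Could not load dynamic library" message shows up, it helps to know whether the file exists in any LD_LIBRARY_PATH directory at all. The following is a rough sketch with a helper name of my own. It only looks at the directories in the given search path, while the real dynamic loader also consults other locations such as the ldconfig cache.

```python
import os


def find_library(filename: str, search_path: str):
    """Return the full path of `filename` in the first directory that has it, else None."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, filename)
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Library name taken from the error message above
    wanted = "libnvinfer_plugin.so.6"
    location = find_library(wanted, os.environ.get("LD_LIBRARY_PATH", ""))
    print(wanted, "->", location)
```

A result of None means the library is either not installed or lives in a directory that LD_LIBRARY_PATH does not cover.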

Question 2:

Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory;

Solution

Reference:

  1. Tensorflow checks whether it is using cpu or gpu

  2. Solve the problem of Could not load dynamic library 'libcudart.so.10.0'

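     # This is actually because this version of tensorflow does not support cuda10.1; uninstall cuda10.1 and reinstall cuda10.0
     cd /usr/local/cuda-10.1/bin/
     # Select everything to uninstall the existing cuda
     sudo ./cuda-uninstaller
     
     # After uploading the downloaded cuda10.0 to the server, enter the corresponding directory
     sudo init 3
     # Do not use --silent; when reinstalling, the graphics card driver does not need to be installed again, otherwise another reboot is needed
     sudo sh ./cuda_10.0.130_410.48_linux.run
     
     # After installation, update the configuration in /etc/profile to the following
     export PATH=$PATH:/usr/local/cuda-10.0/bin
     export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64
     # Make the change take effect
     source /etc/profile
     # Reinstall cuda-toolkit
     conda uninstall -y cudatoolkit
     conda install -y cudatoolkit=10.0
     conda install -y cudnn
     # Reinstall tensorflow
     pip uninstall -y tensorflow tensorflow-gpu tensorflow-estimator tensorboard
     pip install tensorflow-gpu==1.14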

    Run the following code to test; the fix is complete if no error is reported:

     # List the available devices
     from tensorflow.python.client import device_lib
     print(device_lib.list_local_devices())
     
     # Test whether the GPU is available
     import tensorflow as tf
     print(tf.test.is_gpu_available())

Question 3:

horovod.run.common.util.network.NoValidAddressesFound: Unable to connect to the horovodrun task service #1 on any of the addresses

This problem occurs during multi-machine, multi-card training. Horovod's multi-node distributed training requires each node to have a reachable port, and the error can have several causes. In my case, a warning indicated that the 253 machine was expected to connect through the network card enp4s0 but enp3s0 was used instead (I forgot to save the error message). The network card name had changed after a system exception. Refer to "CentOS7 network configuration, modifying the network card name, and common service management commands" to change the network card name back to enp4s0.

sudo nmtui
sudo reboot
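After the reboot, a quick way to confirm that every node exposes the expected network card name is to list the interfaces from Python. This is a minimal sketch using the standard library's socket.if_nameindex(). The helper names are my own, and enp4s0 is simply the name from my setup.

```python
import socket


def interface_names():
    """Return the names of the network interfaces known to the operating system."""
    return [name for _index, name in socket.if_nameindex()]


def has_interface(name: str) -> bool:
    """Return True if an interface with exactly this name exists on this machine."""
    return name in interface_names()


if __name__ == "__main__":
    print("interfaces:", interface_names())
    print("enp4s0 present:", has_interface("enp4s0"))
```

Running it on each node makes a renamed card easy to spot before starting a multi-machine job.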

Question 4:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This is most likely a problem with the graphics card driver; reinstalling the NVIDIA driver should fix it.

 

Origin blog.csdn.net/youshowkm/article/details/130611288