Installing Horovod on Ubuntu: a work log

This article records the commands I used while configuring the environment, kept here for reference. Adapt them to your own setup as needed.

1. Switch gcc and g++ to version 4.9 on Ubuntu

# First, edit the apt sources
vi /etc/apt/sources.list
# Add the following two lines
deb http://dk.archive.ubuntu.com/ubuntu/ xenial main
deb http://dk.archive.ubuntu.com/ubuntu/ xenial universe
# Press ESC, then type :wq to save and exit
apt update
apt -y install gcc-4.9 g++-4.9
# Register version 4.9 as an alternative for gcc and g++
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 20
# Check the version
gcc -v  # gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)
g++ -v
# To undo: update-alternatives --remove gcc /usr/bin/gcc-4.9
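
Editing sources.list by hand in vi makes it easy to add the same entry twice. As a minimal sketch, the helper below appends the two xenial entries to a file only when they are missing. The function name add_xenial_sources and the idea of using a separate file under /etc/apt/sources.list.d/ are my own assumptions, not part of the original steps.

```shell
# Hypothetical helper: append each xenial entry to the given sources file
# unless an identical line is already present.
add_xenial_sources() {
    target=$1
    touch "$target" || return 1
    for component in main universe; do
        line="deb http://dk.archive.ubuntu.com/ubuntu/ xenial $component"
        # -x matches whole lines only, -F treats the entry as a fixed string
        grep -qxF "$line" "$target" || printf '%s\n' "$line" >> "$target"
    done
}

# Example (run as root), followed by the usual apt update:
#   add_xenial_sources /etc/apt/sources.list.d/xenial.list
#   apt update
```

Calling the helper a second time leaves the file unchanged, so it is safe to keep in a setup script.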

2. Create a Python 3.6 virtual environment with Conda on Ubuntu

cd /root/download
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
bash Anaconda3-2021.11-Linux-x86_64.sh
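# When prompted, change the install directory to /root/anaconda
# After installation the installer offers to run the initialization; answer yes, then
cd && source ~/.bashrc
# conda activates the base environment by default; disable that with
conda config --set auto_activate_base false
# Create the virtual environment
conda create -n horovod python=3.6
# Activate the virtual environment
conda activate horovod
# Leave the virtual environment
conda deactivate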
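
The later steps assume the horovod environment is active, and installing into the wrong environment is an easy mistake. A small guard, sketched below, relies on the CONDA_DEFAULT_ENV variable that conda sets on activation; the function name in_conda_env is hypothetical.

```shell
# Return 0 only when the named conda environment is the active one.
# conda sets CONDA_DEFAULT_ENV to the name of the active environment.
in_conda_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

# Example guard before installing packages:
if in_conda_env horovod; then
    echo "horovod environment is active"
else
    echo "run 'conda activate horovod' first" >&2
fi
```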

3. Install a specific CUDA version on Ubuntu

# Install CUDA
cd /root/download
wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
sh cuda_10.1.243_418.87.00_linux.run --silent --toolkit
# Configure the environment
cd
vi ~/.bashrc
# Add the following lines
export CUDA_HOME=/usr/local/cuda-10.1
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
alias ll='ls -alF'
# Reload the configuration
source ~/.bashrc
# Check the version
nvcc --version
# Uninstall CUDA (the path shown belongs to a CUDA 11.3 installation)
/usr/local/cuda-11.3/bin/cuda-uninstaller
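
Because the export lines append to PATH and LD_LIBRARY_PATH unconditionally, every extra `source ~/.bashrc` adds a duplicate entry. The sketch below shows one way to make the same exports idempotent; the helper name path_has is my own, and the default CUDA_HOME simply repeats the path used above.

```shell
# Return 0 if directory $2 is already an entry of the colon-separated list $1.
path_has() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

CUDA_HOME=${CUDA_HOME:-/usr/local/cuda-10.1}

# Append the CUDA directories only when they are not listed yet.
if ! path_has "$PATH" "$CUDA_HOME/bin"; then
    PATH="$PATH:$CUDA_HOME/bin"
fi
if ! path_has "${LD_LIBRARY_PATH:-}" "$CUDA_HOME/lib64"; then
    # ${VAR:+...} avoids a leading colon when LD_LIBRARY_PATH starts out empty
    LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$CUDA_HOME/lib64"
fi
export CUDA_HOME PATH LD_LIBRARY_PATH
```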

4. Install a specific PyTorch version on Ubuntu

# The install command from the official site fails
conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch
# The error is:
# PackagesNotFoundError: The following packages are not available from current channels:
#   - torchvision==0.5.0
# Add the Tsinghua (TUNA) mirror channels
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes
# Drop -c pytorch, because it tells conda to download from the original channel
conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1
# This still fails, so the only option left is to install torchvision with pip and PyTorch with conda
pip install torchvision==0.5.0
conda install pytorch==1.4.0 cudatoolkit=10.1
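
Since torchvision comes from pip and PyTorch from conda here, it is worth confirming that both import and report the expected versions before building Horovod. The function below is a rough sketch with a hypothetical name, check_torch. It prints the two versions, the CUDA version PyTorch was built against, and whether a GPU is visible.

```shell
# Hypothetical helper: report torch / torchvision versions and CUDA status
# using the given Python interpreter (defaults to "python").
check_torch() {
    py=${1:-python}
    "$py" -c 'import torch, torchvision
print("torch", torch.__version__)
print("torchvision", torchvision.__version__)
print("CUDA build", torch.version.cuda)
print("CUDA available", torch.cuda.is_available())' 2>/dev/null || {
        echo "torch or torchvision is not importable with $py" >&2
        return 1
    }
}

# Example, inside the activated horovod environment:
#   check_torch
```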

5. Install a specific Horovod version on Ubuntu

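# NCCL (step 6) must already be installed for a build with HOROVOD_GPU_OPERATIONS=NCCL
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_PYTORCH=1 pip install "horovod[pytorch]==0.19"
horovodrun --check-build  # optionally with --verbose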

The build check prints output similar to the following:

Horovod v0.19.0:

Available Frameworks:
    [ ] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [ ] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo    
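
The check-build listing is easy to misread, and in the sample above NCCL is unchecked even though the build requested it. A small filter, sketched below, turns the listing into a pass or fail result for one feature. The function name feature_enabled is hypothetical, and it assumes the `[X] Name` line format shown above.

```shell
# Read `horovodrun --check-build` output on stdin and succeed only if the
# given name appears on a line marked with [X].
feature_enabled() {
    grep -Eq "^[[:space:]]*\[X\][[:space:]]+$1[[:space:]]*\$"
}

# Example:
#   horovodrun --check-build | feature_enabled NCCL || echo "NCCL is not enabled" >&2
```

Run against the sample output above, the check succeeds for MPI and Gloo and fails for NCCL.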

6. Install NCCL on Ubuntu

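git clone https://gitee.com/occamrazor/nccl-2.4.7-1.git
cd nccl-2.4.7-1
# Build the library against CUDA 10.1
make src.build CUDA_HOME=/usr/local/cuda-10.1
# Install the Debian packaging tools, then build the .deb packages
apt -y install build-essential devscripts debhelper fakeroot
make pkg.debian.build
# List and install the generated packages
ls build/pkg/deb/
dpkg -i build/pkg/deb/*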
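
To confirm which NCCL version ended up installed, one option is to read the version macros from the installed header. The sketch below assumes that nccl.h defines NCCL_MAJOR, NCCL_MINOR and NCCL_PATCH, and that the packages place the header at /usr/include/nccl.h. Both the function name nccl_header_version and that path are assumptions to adjust for your system.

```shell
# Hypothetical helper: print MAJOR.MINOR.PATCH from an nccl.h header.
nccl_header_version() {
    awk '
        $1 == "#define" && $2 == "NCCL_MAJOR" { major = $3 }
        $1 == "#define" && $2 == "NCCL_MINOR" { minor = $3 }
        $1 == "#define" && $2 == "NCCL_PATCH" { patch = $3 }
        END {
            # Fail if any of the three macros was not found
            if (major == "" || minor == "" || patch == "") exit 1
            print major "." minor "." patch
        }
    ' "$1"
}

# Example (the header path is an assumption):
#   nccl_header_version /usr/include/nccl.h
```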

7. Install Open MPI on Ubuntu

apt -y install openmpi-bin openmpi-common libopenmpi-dev
# Check the version
mpiexec --version
ompi_info --version

Origin blog.csdn.net/eternal963/article/details/130754734