Horovod: Deploying Uber's Distributed Deep Learning Framework in Practice

Horovod is another deep learning tool open-sourced by Uber. It draws on the strengths of Facebook's one-hour ImageNet training paper and Baidu's ring-allreduce work, and it helps users run distributed training.

Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed Deep Learning fast and easy to use.

References:

Introduction to distributed training modes: TensorFlow and Horovod
Horovod explained: Uber's open-source TensorFlow distributed deep learning framework
Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Horovod GitHub homepage
Horovod example code

Deployment practice (Horovod in Docker):

Test environment:

	    # Ubuntu 18.04 LTS
	    # NVIDIA driver 410.93
	    # Docker 18.09.2
	    # CUDA 9.0
	    # Python 3.6
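Once the installation steps below are complete, these versions can be confirmed from a shell. A minimal check (nvcc is only present if the CUDA toolkit is installed on the host itself rather than only inside containers):

$ lsb_release -a 	# Ubuntu release
$ nvidia-smi 	# NVIDIA driver version
$ docker -v 	# Docker version
$ nvcc --version 	# CUDA toolkit version (if installed on the host)
$ python3 --version 	# Python version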

Ubuntu installation (BIOS settings):

		# Set Secure Boot to Disabled
		# Set the boot mode to UEFI only
		# Set the USB installer as the first boot device

GPU driver:

		# Download the appropriate driver (.run installer) from the NVIDIA website
		# Disable the nouveau driver
		# Edit /etc/modprobe.d/blacklist-nouveau.conf and add the following:
			blacklist nouveau
			blacklist lbm-nouveau
			options nouveau modeset=0
			alias nouveau off
			alias lbm-nouveau off
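		# The same file can also be written non-interactively (a sketch, assuming a sudo-capable shell):
	$ sudo tee /etc/modprobe.d/blacklist-nouveau.conf > /dev/null <<-'EOF'
	blacklist nouveau
	blacklist lbm-nouveau
	options nouveau modeset=0
	alias nouveau off
	alias lbm-nouveau off
	EOF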
		# Save the file, then disable nouveau kernel mode setting:
	$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
		# Rebuild the initramfs and reboot:
	$ sudo update-initramfs -u
	$ sudo reboot
		# Install the kernel source and headers (very important):
	$ sudo apt-get install linux-source
	$ sudo apt-get install linux-headers-$(uname -r)
		# Press Ctrl + Alt + F1 to switch to a console, then stop the current graphical session:
	$ sudo service lightdm stop
		# Install the NVIDIA driver:
	$ chmod +x NVIDIA-Linux-x86_64-xxx.xx.run
	$ sudo ./NVIDIA-Linux-x86_64-xxx.xx.run
		# Load the NVIDIA kernel module:
	$ modprobe nvidia
		# Check that the driver installed successfully:
	$ nvidia-smi

System version support and base package setup:

	# Reference: https://nvidia.github.io/nvidia-docker/
$ sudo passwd
$ apt-get install -y vim
$ apt-get install -y curl
$ apt-get install -y net-tools
$ apt-get install -y gcc
$ apt-get install -y make

Set up the Docker environment:

	# Docker must be the pinned version docker-ce=5:18.09.2~3-0~ubuntu-bionic; stop and remove any existing Docker installation before installing it
	# Reference: https://blog.csdn.net/bingzhongdehuoyan/article/details/79411479
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
	# List the Docker versions available in the repository
$ apt-cache madison docker-ce  
	# Remove any existing Docker installation
$ docker -v
$ sudo apt-get remove docker docker-engine docker-ce docker.io
	# Install the pinned version
$ sudo apt-get install docker-ce=5:18.09.2~3-0~ubuntu-bionic
$ docker -v
	# Common Docker commands
$ docker load < ./image_name.tar.gz 	# load a saved image
$ docker save image_name > ./image_name.tar.gz 	# save an image to a tarball
$ docker stop $(docker ps -q) 	# stop all running containers
$ docker rm $(docker ps -a -q) 	# remove all containers
$ docker ps -a 	# list all containers (running and stopped)
$ nvidia-docker run -it 	# run a container with GPU access (append the image name)

Download the Dockerfile and build the Horovod environment:

	# Before building, the software versions in the Dockerfile (CUDA, TensorFlow, PyTorch, Python) can be adjusted; see the sketch after the build command below
$ mkdir horovod-docker
$ wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/uber/horovod/master/Dockerfile
$ docker build -t horovod:latest horovod-docker
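As a sketch of the kind of edit the comment above refers to (done between the wget and the docker build steps): the downloaded Dockerfile typically pins framework versions in ENV lines near the top, but the exact variable names depend on the Horovod revision, so inspect the file before editing:

$ grep -n "VERSION" horovod-docker/Dockerfile 	# see which version variables this Dockerfile defines
$ sed -i 's/^ENV TENSORFLOW_VERSION=.*/ENV TENSORFLOW_VERSION=1.12.0/' horovod-docker/Dockerfile 	# hypothetical variable name; substitute whatever grep shows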

Installing nvidia-docker 2.0:

	# Reference: https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-2.0)#prerequisites
	# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
$ docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
$ sudo apt-get purge -y nvidia-docker
	# Add the package repositories
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update
	# Install nvidia-docker2 and reload the Docker daemon configuration
$ sudo apt-get install -y nvidia-docker2
$ sudo pkill -SIGHUP dockerd
	# Test nvidia-smi with the latest official CUDA image
$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

Multi-machine root SSH configuration:

	# Passwordless root SSH reference: https://www.cnblogs.com/toughlife/p/5633510.html
	# Edit the sshd configuration to allow passwordless root SSH login:
$ sudo vi /etc/ssh/sshd_config
		# Set PermitRootLogin to yes
		# Uncomment the PermitEmptyPasswords line and set it to yes
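		# Non-interactive equivalent of the edits above (a sketch; these settings weaken security, so keep them on a trusted network only):
$ sudo sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
$ sudo sed -i 's/^#\?PermitEmptyPasswords.*/PermitEmptyPasswords yes/' /etc/ssh/sshd_config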
$ service sshd restart  
	# Generate the key pair for passwordless login
	# Key setup reference: http://www.linuxproblem.org/art_9.html
	# su to root first
		a@A:~> ssh-keygen -t rsa
		a@A:~> ssh b@B mkdir -p .ssh
		a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
	# Finally, copy id_rsa and authorized_keys to /mnt/share/ssh
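Before moving on, it is worth checking that passwordless root login actually works between the hosts (a quick check, using the host names from the multi-machine section below):

	# should print the remote hostname without asking for a password
$ ssh root@host2 hostname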

Horovod test:

Running on a single machine:

	# su to root first
$ nvidia-docker run -it horovod:latest 	# start the Horovod container
root@c278c88dd552:/examples# mpirun \	# run mpirun inside the container
		-np 1 \	# launch 1 process (one process per GPU)
		-H localhost:1 \	# localhost with 1 slot (max processes allowed on this host)
		python keras_mnist_advanced.py 	# the training script to run
	# Full command: mpirun -np 1 -H localhost:1 python keras_mnist_advanced.py
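Since one process is assigned per GPU, the same pattern extends to a machine with several GPUs by raising -np and the slot count together; for example, with four local GPUs (a sketch):

$ mpirun -np 4 -H localhost:4 python keras_mnist_advanced.py 	# 4 processes, one per GPU, all on this host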

Running on multiple machines:

	# su to root first
	# Primary worker (host1):--------------------------------------------------------------------------------------------------------------------
host1$ nvidia-docker run -it \ 	
		--network=host \ 	# share the host's network stack
		-v /mnt/share/ssh:/root/.ssh \ 	# mount the host's passwordless SSH keys into the container
		horovod:latest 	# start the Horovod container
	# Full command: nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# mpirun \ 	# run mpirun inside the container
		-np 3 \ 	# launch 3 processes in total
		-H host1:1,host2:1,host3:1 \ 	# one slot (max processes) per host
		-mca plm_rsh_args "-p 12345" \ 	# make the ssh launcher connect on port 12345
		python keras_mnist_advanced.py 	# the training script to run
	# Full command: mpirun -np 3 -H host1:1,host2:1,host3:1 -mca plm_rsh_args "-p 12345" python keras_mnist_advanced.py
	# Secondary workers:------------------------------------------------------------------------------------------------------------
host2$ nvidia-docker run -it \ 	# start the worker container on host2
		--network=host \ 	# share the host's network stack
		-v /mnt/share/ssh:/root/.ssh \ 	# mount the host's passwordless SSH keys into the container
		horovod:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity" 	# run sshd on port 12345 and keep the container alive
	# Full command: nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest bash -c "/usr/sbin/sshd -p 12345; sleep infinity" 	# host3 is configured the same way as host2

Validation:

The setup was verified on three machines, each with one GPU, using the multi-machine commands above.
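One way to confirm that all three GPUs are actually busy while the job runs (a sketch, assuming the root SSH setup above and the same host names):

$ for h in host1 host2 host3; do ssh root@$h nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv; done 	# GPU load and memory on every worker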


Reposted from blog.csdn.net/weixin_38340975/article/details/87971642