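Setting Up a TensorFlow Computing Environment on CentOS 7
Note: this guide was written from memory after the installation, so omissions are likely. Treat it as a reference only; it will be tested later.
0. Hardware and software preparation
- One server (500 GB disk, 100 GB RAM, one Tesla K40c GPU; choose memory, disk and GPU as needed)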
- CentOS-7-x86_64-DVD-1804.iso
- Anaconda3-2019.03-Linux-x86_64.sh
- cuda_9.0.176_384.81_linux.run
- cudnn-9.0-linux-x64-v7.6.5.32.solitairetheme8
- kernel-devel-3.10.0-862.3.2.el7.x86_64.rpm
- kernel-headers-3.10.0-862.el7.x86_64.rpm
- NVIDIA-Linux-x86_64-384.183.run
- tensorflow_gpu-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl
- torch-1.3.1-cp37-cp37m-manylinux1_x86_64.whl
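1. Install CentOS 7
- When creating the virtual machine, add the K40c GPU as a PCI device
- When partitioning, make the root partition large: 350 GB or more
- After a normal installation, set a static IP address and gateway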
vi /etc/sysconfig/network-scripts/ifcfg-ens192
# change the file contents to:
TYPE=Ethernet
BOOTPROTO=static
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
# keep the original NAME
NAME=ens192
# keep the original UUID
UUID=ae0965e7-22b9-45aa-8ec9-3f0a20a85d11
ONBOOT=yes
# set according to your network
IPADDR0=192.168.1.30
PREFIX0=24
GATEWAY0=192.168.1.1
DNS1=8.8.8.8
DNS2=8.8.4.4
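Because ifcfg files are plain KEY=VALUE lines, a typo in a key name (for example the letter `O` in place of the digit `0`) is easy to miss and silently ignored. The Python sketch below parses such a file and checks that the static address, prefix and gateway are consistent. `parse_ifcfg` and `check_static_ip` are hypothetical helpers. They handle only simple KEY=VALUE lines with whole-line comments, not the full ifcfg syntax, and they assume the indexed keys `IPADDR0`/`PREFIX0`/`GATEWAY0` used above.

```python
import ipaddress


def parse_ifcfg(text):
    """Parse simple KEY=VALUE lines of an ifcfg file into a dict.

    Blank lines and whole-line '#' comments are skipped; quotes around a
    value are stripped. Inline comments are NOT handled.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def check_static_ip(settings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if settings.get("BOOTPROTO") != "static":
        problems.append("BOOTPROTO is not 'static'")
    try:
        iface = ipaddress.ip_interface(f"{settings['IPADDR0']}/{settings['PREFIX0']}")
    except (KeyError, ValueError) as exc:
        problems.append(f"bad or missing IPADDR0/PREFIX0: {exc}")
        return problems
    gateway = settings.get("GATEWAY0")
    if gateway is None:
        problems.append("GATEWAY0 is missing")
    elif ipaddress.ip_address(gateway) not in iface.network:
        problems.append(f"GATEWAY0 {gateway} is outside {iface.network}")
    return problems


if __name__ == "__main__":
    sample = """
BOOTPROTO=static
IPADDR0=192.168.1.30
PREFIX0=24
GATEWAY0=192.168.1.1
"""
    print(check_static_ip(parse_ifcfg(sample)))  # prints []
```

On the server, pass it the real file, for example `check_static_ip(parse_ifcfg(open('/etc/sysconfig/network-scripts/ifcfg-ens192').read()))`. A key spelled `PREFIXO0` shows up as a missing `PREFIX0`.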
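- Some preparation before installing
- Connect to the VM from the LAN with a tool such as Xshell (optional)
- Copy the required software from another machine to the target machine
scp -r software/ root@192.168.1.30:/root/
- Check the GPU
lspci
- Download the driver for the GPU (already prepared)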
https://www.nvidia.cn/Download/index.aspx
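In a VM, `lspci` lists many emulated devices, so it helps to filter the output for the passed-through card. The shell one-liner is `lspci | grep -i nvidia`. The Python sketch below does the same filtering and assumes lspci's default one-line-per-device format. The function names are hypothetical, and the sample lines are illustrative, not captured from this server.

```python
import re
import subprocess

# One line of default `lspci` output: "<slot> <device class>: <description>".
LSPCI_LINE = re.compile(r"^(?P<slot>\S+)\s+(?P<cls>[^:]+):\s+(?P<desc>.+)$")


def find_nvidia_devices(lspci_output):
    """Return (slot, device class, description) tuples for NVIDIA devices."""
    found = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match and "nvidia" in match.group("desc").lower():
            found.append((match.group("slot"), match.group("cls"), match.group("desc")))
    return found


def nvidia_devices_on_this_host():
    """Run lspci (from pciutils) and filter its output."""
    out = subprocess.run(["lspci"], capture_output=True, text=True, check=True).stdout
    return find_nvidia_devices(out)


if __name__ == "__main__":
    # Illustrative lines only.
    sample = (
        "00:0f.0 VGA compatible controller: VMware SVGA II Adapter\n"
        "0b:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1)\n"
    )
    for slot, cls, desc in find_nvidia_devices(sample):
        print(slot, cls, desc, sep=" | ")
```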
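2. Install Anaconda3
- Install the dependency bzip2 (provides bunzip2)
yum install -y bzip2
- Install Anaconda3
bash Anaconda3-2019.03-Linux-x86_64.sh
- Edit and apply the .bashrc file
vi /root/.bashrc
# add: export PATH=/root/anaconda3/bin:$PATH
source ~/.bashrc
python --version  # check the Python version
- Configure remote access to Jupyter Notebook
- Generate the config file
jupyter notebook --generate-config
- Open IPython and create a hashed password, using 123456 as an example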
ipython
from notebook.auth import passwd
passwd()
Enter password: 123456
Verify password: 123456
'sha1:e00ee9ab9a42:22e8c0dc771612348eeee698cde8ec77fba42e7f'
exit()
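- Copy the generated hash ('sha1:xx…') and edit the default config file
vi ~/.jupyter/jupyter_notebook_config.py
# set ip to '*', which allows access from any IP
c.NotebookApp.ip = '*'
# the password is the hash generated above
c.NotebookApp.password = 'sha1:xx...'
# the server has no browser for Jupyter to open
c.NotebookApp.open_browser = False
# listen on port 8888 or any other port you prefer
c.NotebookApp.port = 8888
# allow remote access
c.NotebookApp.allow_remote_access = True
# notebook directory
c.NotebookApp.notebook_dir = u'/root/notebooks/'
- Start Jupyter (JupyterLab, in the background):
nohup jupyter lab --allow-root &
- Log in to JupyterLab
http://<server IP>:8888/lab
- Forward the port on the router (external to internal) to reach JupyterLab from outside the LAN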
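The `'sha1:…'` string stored in `c.NotebookApp.password` has the form `algorithm:salt:hexdigest`. As far as I know, the classic Notebook (5.x/6.x) `passwd()` hashes the UTF-8 passphrase followed by a 12-hex-character random salt. The Python sketch below reproduces that format for illustration only. `make_hash` and `check_hash` are hypothetical names; for real use, keep calling `notebook.auth.passwd()` as shown above.

```python
import hashlib
import hmac
import secrets

SALT_LEN = 12  # hex characters of salt (assumed to match notebook's passwd())


def make_hash(passphrase, algorithm="sha1"):
    """Build an 'algorithm:salt:hexdigest' string for a passphrase."""
    salt = secrets.token_hex(SALT_LEN // 2)  # 6 random bytes -> 12 hex chars
    digest = hashlib.new(algorithm, passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, digest.hexdigest()))


def check_hash(hashed, passphrase):
    """Return True if passphrase matches an 'algorithm:salt:hexdigest' string."""
    try:
        algorithm, salt, expected = hashed.split(":", 2)
    except ValueError:
        return False  # not in the three-part format
    digest = hashlib.new(algorithm, passphrase.encode("utf-8") + salt.encode("ascii"))
    # Constant-time comparison of the hex digests.
    return hmac.compare_digest(digest.hexdigest(), expected)


if __name__ == "__main__":
    hashed = make_hash("123456")
    print(hashed)                        # e.g. sha1:<12 hex chars>:<40 hex chars>
    print(check_hash(hashed, "123456"))  # True
```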
3. Install the GPU driver
- Check the kernel version:
[root@host8 ~]# uname -r
3.10.0-862.el7.x86_64
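- Download the dependency packages matching the kernel version
- kernel-devel-3.10.0-862.3.2.el7.x86_64.rpm
- kernel-headers-3.10.0-862.el7.x86_64.rpm
Note: el7 stands for CentOS 7; the 3.10.0-862 part must match the running kernel version
rpm -ivh kernel-devel-3.10.0-862.3.2.el7.x86_64.rpm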
rpm -ivh kernel-headers-3.10.0-862.el7.x86_64.rpm
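The note above says the package version has to match the running kernel. This check can be scripted by comparing the version-release.arch part of the RPM file name with `uname -r`. The Python sketch below is a minimal version; the helper names are hypothetical, and it only recognizes `kernel-devel-…rpm` / `kernel-headers-…rpm` file names.

```python
import platform
import re

# "kernel-devel-<version>-<release>.<arch>.rpm" or the kernel-headers equivalent.
KERNEL_RPM = re.compile(r"^kernel-(?:devel|headers)-(?P<release>.+)\.rpm$")


def rpm_kernel_release(filename):
    """Extract '3.10.0-862.el7.x86_64' from 'kernel-headers-3.10.0-862.el7.x86_64.rpm'."""
    match = KERNEL_RPM.match(filename)
    if match is None:
        raise ValueError(f"not a kernel-devel/kernel-headers rpm: {filename}")
    return match.group("release")


def matches_running_kernel(filename, running=None):
    """True if the rpm's version-release.arch equals the running kernel."""
    if running is None:
        running = platform.release()  # same string as `uname -r` on Linux
    return rpm_kernel_release(filename) == running


if __name__ == "__main__":
    for rpm in ("kernel-devel-3.10.0-862.3.2.el7.x86_64.rpm",
                "kernel-headers-3.10.0-862.el7.x86_64.rpm"):
        print(rpm, matches_running_kernel(rpm, running="3.10.0-862.el7.x86_64"))
```

With the file names from section 0, kernel-headers matches but kernel-devel does not (3.10.0-862.3.2 vs 3.10.0-862). Before building the driver, confirm that the installed kernel-devel, and therefore the directory under /usr/src/kernels/, really corresponds to `uname -r`.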
- Prevent the nouveau module from loading
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
- Rebuild the initramfs image
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
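The blacklist and the rebuilt initramfs only take effect after a reboot. Afterwards, `lsmod | grep nouveau` should print nothing. The Python sketch below performs the same check by reading /proc/modules, where the first field of each line is the module name. The helper names are hypothetical, and the sample lines are illustrative.

```python
def loaded_modules(proc_modules_text):
    """Module names from /proc/modules text (the first field of each line)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def is_module_loaded(name, path="/proc/modules"):
    """Similar to `lsmod | grep <name>`, but reads /proc/modules directly."""
    with open(path) as fh:
        return name in loaded_modules(fh.read())


if __name__ == "__main__":
    # Illustrative /proc/modules lines: name size refcount deps state address.
    sample = (
        "nouveau 1662531 0 - Live 0x0000000000000000\n"
        "xfs 997681 3 - Live 0x0000000000000000\n"
    )
    print("nouveau" in loaded_modules(sample))  # True: nouveau is still loaded here
```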
- Run the installer
chmod u+x NVIDIA-Linux-x86_64-384.183.run
./NVIDIA-Linux-x86_64-384.183.run --kernel-source-path=/usr/src/kernels/3.10.0-862.el7.x86_64
- Check the installation result
nvidia-smi
Sun Nov 24 21:25:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.183 Driver Version: 384.183 CUDA Version: 9.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40c Off | 00000000:0B:00.0 Off | 0 |
| 31% 68C P0 135W / 235W | 10378MiB / 11439MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9632 C python 10365MiB |
+-----------------------------------------------------------------------------+
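nvidia-smi can also print selected fields as CSV with `--query-gpu=…` and `--format=csv,noheader,nounits`. This output is easier to check from a script than the table above. The Python sketch below parses it. `parse_gpu_query` and `query_gpus` are hypothetical helpers, and the sample row is transcribed from the table above. With `nounits`, the memory figures are in MiB.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; this order is the CSV column order.
QUERY_FIELDS = ["index", "name", "driver_version", "memory.used", "memory.total"]


def parse_gpu_query(csv_text):
    """Parse output of --query-gpu=<QUERY_FIELDS> --format=csv,noheader,nounits."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        record = dict(zip(QUERY_FIELDS, row))
        # With "nounits", memory values are plain MiB integers.
        record["memory.used"] = int(record["memory.used"])
        record["memory.total"] = int(record["memory.total"])
        gpus.append(record)
    return gpus


def query_gpus():
    """Run nvidia-smi on this host (needs the driver installed above)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_query(result.stdout)


if __name__ == "__main__":
    sample = "0, Tesla K40c, 384.183, 10378, 11439\n"
    for gpu in parse_gpu_query(sample):
        print(f"GPU {gpu['index']}: {gpu['name']}, driver {gpu['driver_version']}, "
              f"{gpu['memory.used']}/{gpu['memory.total']} MiB used")
```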
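4. Install CUDA
- Install CUDA
cd %filepath  # directory containing the installer
sh cuda_9.0.176_384.81_linux.run
- Answer the prompts as follows
# pressing q jumps straight to the end of the license agreement
# Install the NVIDIA driver? no (it was already installed in section 3)
# answer yes or accept the default for everything else
- Add the environment variables
vi ~/.bashrc  # edit
Append to the end of the file: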
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}  # save and exit
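The `${VAR:+:${VAR}}` form in these two lines inserts the separating colon only when the variable already has a value. This matters because an empty entry in PATH or LD_LIBRARY_PATH is interpreted as the current directory. The Python sketch below is a minimal model of that expansion. The function names are hypothetical.

```python
import os


def prepend_path(entry, current):
    """Python equivalent of the shell idiom  entry${VAR:+:${VAR}}.

    ${VAR:+:${VAR}} expands to ':<old value>' only when VAR is set and
    non-empty, so an empty variable never leaves a trailing ':'.
    """
    return entry + (":" + current if current else "")


def naive_prepend(entry, current):
    """What  entry:$VAR  would give: a trailing ':' when VAR is empty."""
    return entry + ":" + current


if __name__ == "__main__":
    cuda_lib = "/usr/local/cuda-9.0/lib64"
    print(prepend_path(cuda_lib, os.environ.get("LD_LIBRARY_PATH", "")))
    print(naive_prepend(cuda_lib, ""))  # /usr/local/cuda-9.0/lib64:
```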
- Apply the environment variables
source ~/.bashrc
- Verify the CUDA installation
cuda-install-samples-9.0.sh ~
cd ~/NVIDIA_CUDA-9.0_Samples/5_Simulations/nbody
make
./nbody
- Check the driver and CUDA versions:
cat /proc/driver/nvidia/version
nvcc -V
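`nvcc -V` reports the toolkit release, and /proc/driver/nvidia/version reports the loaded driver. The two have to be compatible. The Python sketch below parses both and compares them. The `MIN_DRIVER` table holds only two entries, which I believe match NVIDIA's CUDA release notes: CUDA 9.0 needs driver 384.81 or newer (the number embedded in the runfile name), and CUDA 10.0 needs 410.48. Check the release notes for other versions. The helper names are hypothetical, and the sample strings are shortened.

```python
import re

# Minimum Linux driver per CUDA toolkit release (assumed from NVIDIA's release notes).
MIN_DRIVER = {
    (9, 0): (384, 81),
    (10, 0): (410, 48),
}


def cuda_release(nvcc_output):
    """(major, minor) from the 'release X.Y' part of `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' in nvcc output")
    return int(match.group(1)), int(match.group(2))


def driver_version(proc_version_text):
    """(major, minor) from /proc/driver/nvidia/version ('Kernel Module  384.183')."""
    match = re.search(r"Kernel Module\s+(\d+)\.(\d+)", proc_version_text)
    if match is None:
        raise ValueError("no 'Kernel Module X.Y' in driver version text")
    return int(match.group(1)), int(match.group(2))


def driver_supports(cuda, driver):
    """True/False if the table knows this CUDA release, else None."""
    needed = MIN_DRIVER.get(cuda)
    return None if needed is None else driver >= needed


if __name__ == "__main__":
    nvcc = "Cuda compilation tools, release 9.0, V9.0.176"
    proc = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.183"
    print(driver_supports(cuda_release(nvcc), driver_version(proc)))  # True
```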
5. Install cuDNN
- Download cuDNN (already prepared)
- Install cuDNN
cd %filepath
mv cudnn-9.0-linux-x64-v7.6.5.32.solitairetheme8 cudnn-9.0-linux-x64-v7.6.5.32.tgz
tar -xzvf cudnn-9.0-linux-x64-v7.6.5.32.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
- Verify cuDNN
... (it seems this step was never verified; I can no longer find it)
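A simple way to verify cuDNN is to read the version macros from the copied header. In the shell this is `grep CUDNN_MAJOR -A 2 /usr/local/cuda/include/cudnn.h`. The Python sketch below does the same. It assumes cuDNN 7.x, which defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL in cudnn.h (later releases moved them to cudnn_version.h). For the v7.6.5 archive used here it should report (7, 6, 5). `cudnn_version` is a hypothetical helper.

```python
import re


def cudnn_version(header_text):
    """Read CUDNN_MAJOR/MINOR/PATCHLEVEL #defines from cudnn.h text."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"#define\s+{name}\s+(\d+)", header_text)
        if match is None:
            raise ValueError(f"{name} not found")
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    try:
        with open("/usr/local/cuda/include/cudnn.h") as fh:
            print(cudnn_version(fh.read()))
    except OSError:
        print("cudnn.h not found under /usr/local/cuda/include")
```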
6. Install TensorFlow and PyTorch
- Prepare the installation packages
- tensorflow_gpu-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl
- torch-1.3.1-cp37-cp37m-manylinux1_x86_64.whl
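- Install TensorFlow (note: per TensorFlow's tested build configurations, tensorflow_gpu 2.0.0 is built against CUDA 10.0 and cuDNN 7.4, and CUDA 10.0 needs driver 410.48 or newer, so this wheel does not match the CUDA 9.0 / driver 384 stack installed above)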
pip install tensorflow_gpu-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl
- Verify TensorFlow
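# Open a new terminal if not done yet (so ~/.bashrc is reloaded); the wheels are installed into the base Anaconda environment, so there is no separate environment to activate
python
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> print(hello.numpy())  # TF 2.x executes eagerly; tf.Session only survives as tf.compat.v1.Session
>>> print(tf.config.experimental.list_physical_devices('GPU'))  # lists the K40c if TensorFlow can use it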
- Install PyTorch
pip install torch-1.3.1-cp37-cp37m-manylinux1_x86_64.whl
- Verify PyTorch
python
>>> import torch
>>> torch.__version__
>>> torch.cuda.is_available()  # True if PyTorch can use the GPU
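Each tensorflow_gpu wheel is built against one specific CUDA/cuDNN pair, so the wheel has to match what was installed in sections 4 and 5. The Python sketch below encodes a few Linux GPU rows from TensorFlow's "tested build configurations" table, as I understand them, and checks them against the installed toolkit. Verify the rows against that page before relying on them. `TESTED` and `cuda_matches` are hypothetical names.

```python
# A few Linux GPU rows from TensorFlow's "tested build configurations" table,
# transcribed for illustration; verify against the official page.
TESTED = {
    "tensorflow_gpu-1.12.0": {"cuda": "9", "cudnn": "7"},
    "tensorflow_gpu-1.13.1": {"cuda": "10.0", "cudnn": "7.4"},
    "tensorflow_gpu-2.0.0": {"cuda": "10.0", "cudnn": "7.4"},
}


def cuda_matches(required, installed):
    """Compare only as many version components as `required` lists.

    "9" accepts any 9.x, while "10.0" needs 10.0 exactly.
    """
    wanted = required.split(".")
    return installed.split(".")[: len(wanted)] == wanted


if __name__ == "__main__":
    installed_cuda = "9.0"  # the toolkit installed in section 4
    for wheel, req in TESTED.items():
        verdict = "OK" if cuda_matches(req["cuda"], installed_cuda) else "needs CUDA " + req["cuda"]
        print(wheel, verdict)
```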
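Appendix 1: Resizing LVM partitions on CentOS 7 (resize2fs: Bad magic number in super-block while trying to open ...)
Approach:
① Confirm that the partitions are LVM
② /home has a lot of unused space, so give that space to /
Unmount /home >> remove /home >> add its space to "/" >> re-create /home >> format /home >> done
- Commands used:
df -h  # show disk usage
lsblk  # show block devices
fdisk -l  # show partition details
lvremove, lvcreate  # remove / create a logical volume
lvdisplay, vgdisplay, pvdisplay  # show logical volumes / volume groups / physical volumes
xfs_growfs  # grow an XFS filesystem (resize2fs only handles ext2/3/4, which is why it fails on XFS with the error in the title)
3. Procedure:
- Analysis: lsblk shows that sda2 is an LVM physical volume holding the root, swap and home logical volumes, so space can be moved from home to the root volume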
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0 2:0 1 4K 0 disk
sda 8:0 0 300G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 299.5G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 9.8G 0 lvm [SWAP]
└─centos-home 253:2 0 239.6G 0 lvm /home
sr0 11:0 1 1024M 0 rom
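To confirm from a script which volumes are LVM logical volumes (step ① above), `lsblk -P -o NAME,TYPE,MOUNTPOINT` prints one line of KEY="value" pairs per device. This format is easier to parse than the tree view. The Python sketch below parses it. The helper names are hypothetical, and the sample mirrors the tree above.

```python
import shlex
import subprocess


def parse_lsblk_pairs(text):
    """Parse `lsblk -P` output (KEY="value" pairs per line) into dicts."""
    devices = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = {}
        for token in shlex.split(line):  # shlex strips the quotes around values
            key, _, value = token.partition("=")
            fields[key] = value
        devices.append(fields)
    return devices


def lvm_volumes(devices):
    """(name, mountpoint) for every device whose TYPE is lvm."""
    return [(d["NAME"], d.get("MOUNTPOINT", "")) for d in devices if d.get("TYPE") == "lvm"]


def lsblk_devices():
    """Run lsblk on this host."""
    cmd = ["lsblk", "-P", "-o", "NAME,TYPE,MOUNTPOINT"]
    return parse_lsblk_pairs(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)


if __name__ == "__main__":
    sample = (
        'NAME="sda" TYPE="disk" MOUNTPOINT=""\n'
        'NAME="sda2" TYPE="part" MOUNTPOINT=""\n'
        'NAME="centos-root" TYPE="lvm" MOUNTPOINT="/"\n'
        'NAME="centos-home" TYPE="lvm" MOUNTPOINT="/home"\n'
    )
    print(lvm_volumes(parse_lsblk_pairs(sample)))
```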
- Back up /home
# mkdir /tmp/home
# cp -a /home/. /tmp/home/  # -a keeps ownership and permissions and also copies hidden files
- Unmount /home
# umount /home
umount: /home: target is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
# if it reports busy, use fuser to kill the processes using /home
# fuser -m -v -i -k /home
- Remove the home logical volume (LV) to return its space to the volume group (VG)
# lvremove /dev/mapper/centos-home
Do you really want to remove active logical volume home? [y/n]: y
Logical volume "home" successfully removed
- Resize /
# lvextend -L 250G /dev/mapper/centos-root  # grow to 250G
Size of logical volume centos/root changed from 50.00 GiB (12800 extents) to 250.00 GiB (64000 extents).
Logical volume root successfully resized.
- Grow the filesystem with xfs_growfs
# xfs_growfs /dev/mapper/centos-root
meta-data=/dev/mapper/centos-root isize=256 agcount=4, agsize=3276800 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=13107200, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal bsize=4096 blocks=6400, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
data blocks changed from 13107200 to 65536000
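The numbers printed by lvextend and xfs_growfs can be cross-checked. LVM counts sizes in physical extents, and XFS counts them in filesystem blocks. The Python sketch below assumes a 4 MiB extent (the LVM default; `vgdisplay` shows the actual PE Size) and the 4096-byte block size (bsize=4096) reported by xfs_growfs. Under those assumptions, both tools agree on 50 GiB before the resize and 250 GiB after. The function names are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def extents_to_gib(extents, extent_size=4 * MIB):
    """Logical-volume size in GiB from an extent count (4 MiB extents assumed)."""
    return extents * extent_size / GIB


def xfs_blocks_to_gib(blocks, block_size=4096):
    """XFS data-section size in GiB from a block count (bsize=4096 as shown above)."""
    return blocks * block_size / GIB


if __name__ == "__main__":
    print(extents_to_gib(12800), extents_to_gib(64000))              # 50.0 250.0
    print(xfs_blocks_to_gib(13107200), xfs_blocks_to_gib(65536000))  # 50.0 250.0
```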
- Allocate the remaining space to a new home LV
# lvcreate -l 100%FREE -n home centos  # -n sets the LV name; centos is the VG name
Logical volume "home" created.
- Don't forget to format the new LV
# mkfs.xfs /dev/centos/home
meta-data=/dev/centos/home isize=256 agcount=4, agsize=2601472 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=10405888, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=5081, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
- Remount and check
# mount /dev/mapper/centos-home /home
- Done
(base) [root@localhost ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 400G 114G 287G 29% /
devtmpfs 50G 0 50G 0% /dev
tmpfs 50G 0 50G 0% /dev/shm
tmpfs 50G 9.0M 50G 1% /run
tmpfs 50G 0 50G 0% /sys/fs/cgroup
/dev/sda1 950M 164M 787M 18% /boot
tmpfs 9.9G 36K 9.9G 1% /run/user/0
/dev/mapper/centos-home 20G 33M 20G 1% /home
- Restore the /home backup
# cp -a /tmp/home/. /home/
Appendix 2:
- Download TensorFlow: https://pypi.org
- Download cuDNN: https://developer.nvidia.com/cudnn
- Download the CUDA Toolkit: https://developer.nvidia.com/cuda-toolkit
...