How to build a PyTorch training environment on the Ascend platform

Abstract: To run PyTorch workloads on the Ascend platform, you need to set up the software development environment for CANN, the heterogeneous computing architecture, and install the PyTorch framework so that training scripts can be migrated, developed and debugged.

This article is shared from the Huawei Cloud Community post "How to build a PyTorch training environment on the Ascend platform", author: Ascend CANN.

PyTorch is a popular deep learning framework used to develop deep learning training scripts, and it runs on CPU/GPU by default. To run PyTorch workloads on the Ascend AI processor, you need to set up the software development environment for the heterogeneous computing architecture CANN (Compute Architecture for Neural Networks) and install the PyTorch framework so that training scripts can be migrated, developed and debugged.

The following shows how to quickly install the driver, firmware, CANN software and PyTorch framework on the Ascend platform.

Check the environment

Before installing the driver and firmware on the Ascend platform, first check that the NPUs are in place in the installation environment, and confirm that the operating system version and kernel version meet the compatibility requirements of the software version you are installing.

Take the Atlas 800 training server (model 9010, with the Ascend 910 AI processor) as an example. To check whether the NPUs are in place, run the lspci | grep d801 command. If the server has N NPUs, output of N lines containing the "d801" field indicates that the NPUs are properly in place.
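The check above can be scripted so that the number of matching lines is compared with the number of NPUs you expect. The following is a minimal sketch, not part of the original article: the function names count_npus and check_npus are hypothetical, and the expected count is whatever your server model provides.

```shell
# Count the lines on stdin that contain the "d801" device ID.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the function successful.
count_npus() {
    grep -c 'd801' || true
}

# Hypothetical helper: compare the detected count with the expected count.
# Usage on a real server: lspci | check_npus 8
check_npus() {
    expected="$1"
    found="$(count_npus)"
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found NPU(s) in place"
    else
        echo "MISMATCH: expected $expected NPU(s), found $found"
        return 1
    fi
}
```

Reading from stdin keeps the helper independent of the hardware, so it can be tried against saved lspci output as well.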

Install drivers and firmware

1. Create the user HwHiAiUser that the driver runs as.

groupadd -g 1000 HwHiAiUser 
useradd -g HwHiAiUser -u 1000 -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash

2. Install the driver and firmware.

Download the firmware and driver packages for your product from the "Firmware and Driver" download page of the Ascend community and upload them to any directory on the server, then install them with the following commands. Note that you must install them as the root user.

a. Add executable permissions to the software packages.

chmod +x Ascend-hdk-910-npu-driver_23.0.rc1_linux-x86-64.run
chmod +x Ascend-hdk-910-npu-firmware_6.3.0.1.241.run

b. Install the driver.

./Ascend-hdk-910-npu-driver_23.0.rc1_linux-x86-64.run --full --install-for-all

The default installation path is "/usr/local/Ascend". The following output indicates that the installation succeeded.

Driver package installed successfully!

You can also run the npu-smi info command. If it prints device information for the NPUs, the driver has loaded successfully.

c. Install the firmware.

./Ascend-hdk-910-npu-firmware_6.3.0.1.241.run --full

The following output indicates that the installation succeeded.

Firmware package installed successfully! Reboot now or after driver installation for the installation/upgrade to take effect

3. After the driver and firmware are installed, restart the system.

reboot
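After the reboot, the npu-smi info check from step b can be wrapped in a small guard that first confirms the tool is on PATH. This is a sketch under the assumption that a missing npu-smi means the driver is not installed or PATH has not been updated; the function names have_cmd and check_driver are hypothetical.

```shell
# Return success if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical post-install check: run npu-smi info only if the tool is present.
check_driver() {
    if have_cmd npu-smi; then
        npu-smi info
    else
        echo "npu-smi not found: driver not installed or PATH not updated" >&2
        return 1
    fi
}
```

Calling check_driver on the server prints the device table when the driver is loaded and returns a non-zero status otherwise.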

Install CANN software dependencies

Installing the CANN software requires downloading dependencies, so make sure the installation environment has network access and the package sources are configured. The following steps are performed as the root user.

1. Install third-party dependencies

Ubuntu (Debian, UOS20 and other such systems work the same way):

apt-get install -y gcc g++ make cmake zlib1g zlib1g-dev openssl libsqlite3-dev libssl-dev libffi-dev unzip pciutils net-tools libblas-dev gfortran libblas3

openEuler (EulerOS, CentOS, BCLinux and other such systems work the same way):

yum install -y gcc gcc-c++ make cmake unzip zlib-devel libffi-devel openssl-devel pciutils net-tools sqlite-devel lapack-devel gcc-gfortran

2. Install Python and its dependencies

Take installing Python 3.7.5 as an example.

1) Download the Python 3.7.5 source package with wget.

wget https://www.python.org/ftp/python/3.7.5/Python-3.7.5.tgz

2) Extract the source package.

tar -zxvf Python-3.7.5.tgz

3) Compile and install Python from source.

cd Python-3.7.5
./configure --prefix=/usr/local/python3.7.5 --enable-loadable-sqlite-extensions --enable-shared
make
make install

The --prefix=/usr/local/python3.7.5 path is used here as an example. After the configure, build and install commands finish, Python is installed under /usr/local/python3.7.5.

4) Set the Python 3.7.5 environment variables.

# Set the Python 3.7.5 library path
export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
# If multiple python3 versions exist in the environment, use Python 3.7.5
export PATH=/usr/local/python3.7.5/bin:$PATH

5) Check whether the installation succeeded.

python3 --version
pip3 --version

If relevant version information is returned, the installation is successful.

6) Install pip dependencies.

pip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py
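The export commands in step 4 only affect the current shell session. One way to keep them across logins is to append them to a shell profile, taking care not to add the same line twice. The sketch below assumes a bash profile at ~/.bashrc; the helper name append_once is hypothetical and not part of the original article.

```shell
# Append a line to a file only if the exact line is not already present.
append_once() {
    file="$1"
    line="$2"
    touch "$file"
    # -x matches whole lines, -F treats the text as a fixed string.
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Hypothetical usage: persist the two exports from step 4 in ~/.bashrc.
# append_once "$HOME/.bashrc" 'export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH'
# append_once "$HOME/.bashrc" 'export PATH=/usr/local/python3.7.5/bin:$PATH'
```

The single quotes in the usage lines keep $PATH and $LD_LIBRARY_PATH unexpanded, so the variables are evaluated each time the profile is read.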

Install the CANN development kit package

1. From the "CANN" product page of the Ascend community, download the CANN development kit package that matches your operating system architecture, for example "Ascend-cann-toolkit_6.3.RC1_linux-x86_64.run", and upload it to any directory in the installation environment.

2. Install the CANN development kit package.

# Add executable permissions
chmod +x Ascend-cann-toolkit_6.3.RC1_linux-x86_64.run
# Verify the consistency and integrity of the software package
./Ascend-cann-toolkit_6.3.RC1_linux-x86_64.run --check
# Run the installation
./Ascend-cann-toolkit_6.3.RC1_linux-x86_64.run --install --install-for-all

After the installation completes, the following message indicates that the software was installed successfully:

[INFO] xxx install success

Here xxx stands for the name of the package that was actually installed.
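The article does not show it, but the CANN toolkit normally ships an environment script that has to be sourced before the toolkit is used. The sketch below assumes a root installation in the default location, which would place the script at /usr/local/Ascend/ascend-toolkit/set_env.sh; treat that path as an assumption and adjust it to your installation. The helper name source_if_present is hypothetical.

```shell
# Source an environment script if it exists; report an error otherwise.
source_if_present() {
    script="$1"
    if [ -f "$script" ]; then
        . "$script"
    else
        echo "not found: $script" >&2
        return 1
    fi
}

# Assumed default location for a root installation; adjust to your install path.
# source_if_present /usr/local/Ascend/ascend-toolkit/set_env.sh
```

Checking for the file first gives a clear message when the toolkit was installed to a different path, instead of a generic shell error.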

Install PyTorch

After the CANN package is installed, you can install PyTorch. You can choose PyTorch 1.8.1 or PyTorch 1.11.0, and install the APEX mixed precision module once PyTorch is installed successfully. Before installing PyTorch, install the following dependencies.

pip3 install wheel
pip3 install typing_extensions

Install PyTorch 1.8.1

1) Install the official torch package.

x86_64 architecture

wget https://download.pytorch.org/whl/cpu/torch-1.8.1%2Bcpu-cp37-cp37m-linux_x86_64.whl
pip3 install torch-1.8.1+cpu-cp37-cp37m-linux_x86_64.whl

aarch64 architecture

wget https://repo.huaweicloud.com/kunpeng/archive/Ascend/PyTorch/torch-1.8.1-cp37-cp37m-linux_aarch64.whl
pip3 install torch-1.8.1-cp37-cp37m-linux_aarch64.whl

2) Install torch_npu, the PyTorch adaptation plug-in provided by Ascend.

x86_64 architecture

wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc1-pytorch1.8.1/torch_npu-1.8.1.post1-cp37-cp37m-linux_x86_64.whl
pip3 install torch_npu-1.8.1.post1-cp37-cp37m-linux_x86_64.whl

aarch64 architecture

wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc1-pytorch1.8.1/torch_npu-1.8.1.post1-cp37-cp37m-linux_aarch64.whl
pip3 install torch_npu-1.8.1.post1-cp37-cp37m-linux_aarch64.whl

Version 5.0.rc1 is used here as an example. In practice, install the torch_npu plug-in version that matches your CANN version.

3) Install torchvision corresponding to the framework version.

pip3 install torchvision==0.9.1

4) Verify whether the installation is successful.

python3 -c "import torch;import torch_npu; a = torch.ones(3, 4).npu(); print(a + a);"

If the output contains the following key information, it means that PyTorch is installed successfully.

[[2., 2., 2., 2.],
  [2., 2., 2., 2.],
  [2., 2., 2., 2.]]
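If the verification in step 4 runs as part of a setup script, its output can be checked automatically instead of by eye. The following is a minimal sketch that only looks for the three rows shown above; the function name verify_output is hypothetical, and the check says nothing about which device produced the result.

```shell
# Read the verification output on stdin and succeed only if exactly
# three lines contain the row "[2., 2., 2., 2.]".
verify_output() {
    rows="$(grep -cF '[2., 2., 2., 2.]' || true)"
    [ "$rows" -eq 3 ]
}

# Hypothetical usage: pipe the output of the step 4 command into the check.
# <verification command from step 4> | verify_output && echo "PyTorch on NPU looks OK"
```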

Install PyTorch 1.11.0

1) Install the official torch package.

x86_64 architecture

wget https://download.pytorch.org/whl/cpu/torch-1.11.0%2Bcpu-cp37-cp37m-linux_x86_64.whl
pip3 install torch-1.11.0+cpu-cp37-cp37m-linux_x86_64.whl

aarch64 architecture

wget https://repo.huaweicloud.com/kunpeng/archive/Ascend/PyTorch/torch-1.11.0-cp37-cp37m-linux_aarch64.whl
pip3 install torch-1.11.0-cp37-cp37m-linux_aarch64.whl

2) Install torch_npu, the PyTorch adaptation plug-in provided by Ascend.

x86_64 architecture

wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc1-pytorch1.11.0/torch_npu-1.11.0-cp37-cp37m-linux_x86_64.whl
pip3 install torch_npu-1.11.0-cp37-cp37m-linux_x86_64.whl

aarch64 architecture

wget https://gitee.com/ascend/pytorch/releases/download/v5.0.rc1-pytorch1.11.0/torch_npu-1.11.0-cp37-cp37m-linux_aarch64.whl
pip3 install torch_npu-1.11.0-cp37-cp37m-linux_aarch64.whl

3) Install torchvision corresponding to the framework version.

pip3 install torchvision==0.12.0

4) Verify that PyTorch is installed successfully.

python3 -c "import torch;import torch_npu; a = torch.ones(3, 4).npu(); print(a + a);"

If the output contains the following key information, it means that PyTorch is installed successfully.

[[2., 2., 2., 2.],
  [2., 2., 2., 2.],
  [2., 2., 2., 2.]]
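Both PyTorch versions above use the same wheel naming pattern and differ only in the plug-in version and the CPU architecture. A script can derive the torch_npu file name from uname -m rather than hard-coding it. This is a sketch that assumes the cp37 naming shown in this article; the function name torch_npu_wheel is hypothetical.

```shell
# Build the torch_npu wheel file name for a given plug-in version and architecture.
# The architecture argument defaults to the output of uname -m.
torch_npu_wheel() {
    version="$1"
    arch="${2:-$(uname -m)}"
    case "$arch" in
        x86_64|aarch64)
            echo "torch_npu-${version}-cp37-cp37m-linux_${arch}.whl"
            ;;
        *)
            echo "unsupported architecture: $arch" >&2
            return 1
            ;;
    esac
}

# Hypothetical usage: pip3 install "$(torch_npu_wheel 1.11.0)"
```

Rejecting unknown architectures early avoids a confusing download error for a wheel that does not exist.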

Install the APEX Mixed Precision Module

The APEX mixed precision module is an optimization library that addresses both performance and precision convergence, and provides mixed precision training support for different scenarios.

1. Obtain the Ascend-adapted APEX source code and the native APEX code.

# Get the Ascend-adapted APEX source code
git clone -b master https://gitee.com/ascend/apex.git
# Get the native APEX code inside the apex directory
cd apex
git clone https://github.com/NVIDIA/apex.git

2. Switch the native APEX code to the corresponding commit.

cd apex
git checkout 4ef930c1c884fdca5f472ab2ce7cb9b505d26c1a
cd ..

3. In the scripts directory of the Ascend-adapted APEX source tree, generate the complete Ascend-adapted code.

cd scripts
bash gen.sh

4. Compile and generate the Ascend-adapted APEX binary installation package.

cd ../apex
python3 setup.py --cpp_ext --npu_float_status bdist_wheel

5. Install APEX.

x86_64 architecture

cd dist
pip3 install apex-0.1_ascend-cp37-cp37m-linux_x86_64.whl

aarch64 architecture

cd dist
pip3 install apex-0.1_ascend-cp37-cp37m-linux_aarch64.whl
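Because the wheel name produced in step 4 depends on the architecture, a script can locate the built file in the dist directory instead of spelling the name out. The sketch below assumes the build leaves exactly one apex wheel in that directory; the function name find_apex_wheel is hypothetical.

```shell
# Print the path of the single apex wheel in a directory.
# Fails if there is no match or more than one match.
find_apex_wheel() {
    dir="$1"
    # Expand the glob into the function's positional parameters.
    set -- "$dir"/apex-*.whl
    if [ "$#" -eq 1 ] && [ -f "$1" ]; then
        echo "$1"
    else
        echo "expected exactly one apex wheel in $dir" >&2
        return 1
    fi
}

# Hypothetical usage, run from the directory that contains dist:
# pip3 install "$(find_apex_wheel dist)"
```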

At this point, the PyTorch training environment is set up. Developers can migrate PyTorch network scripts to the Ascend platform for training and take advantage of its computing power.

For more documentation, see the Ascend Documentation Center [1]. Video courses are available in the Ascend Community Online Course [2] section, and any questions that come up while learning can be discussed in the Ascend Forum [3].

References

[1] Ascend Documentation Center

[2] Ascend Community Online Course

[3] Ascend Forum


Origin my.oschina.net/u/4526289/blog/9096825