The complete process of multi-GPU training on Paratera Parallel Supercomputing Cloud (from environment configuration to job submission)

Table of contents

Preface

1. Download the client from the official website and log in to your account

2. Remotely connect to the node server

3. Install the environment

4. Upload the code and dataset to the server

5. Submit the training job to a compute node for model training

6. Download the trained files

Epilogue


Preface

Overview of this article: The workflow of Paratera Parallel Supercomputing Cloud differs considerably from Huawei Cloud's ModelArts or elastic cloud servers, and many newcomers do not understand the specific steps when using it for the first time. The author therefore presents a complete tutorial on multi-GPU model training with the parallel supercomputing cloud.

About the author: The author is an AI "alchemist" whose lab's main research direction is currently generative models, with some knowledge of other areas as well. He hopes to exchange ideas, share, and make progress together with others interested in artificial intelligence on the CSDN platform. Thank you everyone~~~

If you find this article helpful, please like, bookmark, or leave a comment; it is an acknowledgment of and encouragement for the author's work.

1. Download the client from the official website and log in to your account

Official website link: https://cloud.paratera.com/

Click to download the client and choose the version that suits you to download.

Then log in to your account

As shown in the figure, the supercomputing services we mainly use are as follows:

Quick transfer: mainly used to upload local files to the server, or to download files from the server to your local machine.

Putty: connects to the server remotely. The difference from the built-in SSH is that Putty runs locally on your computer, while SSH runs inside the Paratera client interface; personally I find the built-in SSH less convenient than Putty.

Console: used to check your usage and billing status, and to open the help manual.

The detailed usage and introduction will be mentioned below.

2. Remotely connect to the node server

Open Putty

 

Select the supercomputing node you purchased and connect

 

The above is a brief manual for parallel supercomputing cloud.

Note:

The node we connect to through Putty is the login node. It is used mainly for installing environments and decompressing uploaded archives; you cannot run Python training programs on it.

3. Install the environment

First run module load anaconda/2021.11 to load Anaconda so that conda can be used to configure environments on the login node.

Then run source activate [your conda environment name] to enter your conda environment. PS: libraries such as PyTorch, TensorFlow, and mmcv on the parallel supercomputing cloud need to be compiled and installed manually. If you are not familiar with this, contact the Paratera engineers directly, give them your account and the Python, PyTorch, CUDA, and other versions you want, and they will create the conda environment for you.
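If you do prefer to build the environment yourself rather than ask the engineers, the login-node commands might look roughly like the sketch below. This is an illustration only: the environment name myenv and the Python version are placeholder assumptions, and GPU libraries may still need a manual, cluster-specific install.

```shell
# Sketch only: "myenv" and the version numbers are placeholders, not values from the cluster.
module load anaconda/2021.11          # make conda available on the login node
conda create -n myenv python=3.9 -y   # create a new environment
source activate myenv                 # enter the new environment
conda list                            # confirm which libraries are installed
# GPU libraries such as PyTorch may need to be compiled/installed manually here,
# matched to the CUDA module the cluster provides (e.g. cuda/11.3).
```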

 

Use the conda list command to check what libraries have been installed in your current conda environment.

If you need to install a new library, use the following command to install it

pip install <package-name> -i https://pypi.tuna.tsinghua.edu.cn/simple

4. Upload the code and dataset to the server

Turn on quick transfer

 Connect to the corresponding supercomputing account

 

Drag and drop a local ZIP archive into a blank area or a folder to upload it automatically. Then run the following command in Putty to extract the archive to a target path:

unzip /path/to/file.zip -d /path/to/destination

After uploading and unzipping our code and data set, we need to write a shell script to start model training.

#!/bin/bash
# the script starts with this shebang line

module load anaconda/2021.11   # load Anaconda
module load cuda/11.3          # load CUDA
source activate python         # enter your conda environment; replace "python" with your own environment name

export WORLD_SIZE=4            # for single-node multi-GPU, set this to the number of GPUs used

# --nproc_per_node is likewise set to the number of GPUs; set the path to your own training script
python -m torch.distributed.launch --nproc_per_node=4 /home/bingxing2/home/xxx/zjd/zijiandu/models/train.py

If running the script produces an error such as line 11: $'\r': command not found or syntax error near unexpected token `$'{\r'', the shell script has Windows-style (CRLF) line endings. Use the dos2unix command to convert the script's format, then rerun it.

dos2unix /home/xxx/avc/sss.sh   # path to your shell script
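As a concrete illustration of the problem and the fix, the snippet below fabricates a script with CRLF line endings and then repairs it. The file name demo.sh is purely for this example, and sed is shown as a fallback in case dos2unix is not installed.

```shell
# Create a script with Windows (CRLF) line endings to reproduce the $'\r' error.
printf '#!/bin/bash\r\necho ok\r\n' > demo.sh
# Strip the trailing carriage returns (equivalent to: dos2unix demo.sh).
sed -i 's/\r$//' demo.sh
bash demo.sh   # now runs cleanly and prints "ok"
```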

5. Submit the training job to a compute node for model training

We first use the cd command to enter the directory containing the shell script written earlier.

 

Then use sbatch --gpus=4 ./[shell script name] to submit the script to a compute node for training. We demonstrate here with 4 A100s.

 

This means the submission succeeded. 60373 is the job ID; you can use it later to cancel the job.

 

Then we use the parajobs command to view the running status of each graphics card in the computing node, as shown in the figure below

 

When you want to stop a job, just run scancel [job ID] to cancel it.
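Putting this section's job-control commands together, a typical submit/monitor/cancel cycle might look like the following sketch. The directory path, script name train.sh, and job ID 60373 are illustrative placeholders, not values from your cluster.

```shell
cd /path/to/script/dir        # directory containing your shell script
sbatch --gpus=4 ./train.sh    # submit; prints the job ID, e.g. "Submitted batch job 60373"
parajobs                      # check the running status of jobs and GPUs
scancel 60373                 # cancel the job by its ID when needed
```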

 

This is the log file; error messages and printed output all end up in it.

6. Download the trained files

Files inside a server folder cannot be downloaded directly. First, double-click to enter the local disk panel on the left.

 

Then enter the path where you want to save the downloaded file

 

Then you can right-click the file at this time and download it to the corresponding local path. 

 

That covers the basic usage workflow and training procedure for the entire parallel supercomputing cloud computing center.

 

If you have any problems later, you can open the console 

 

Select the help documentation

 

Select the user manual for the corresponding partition, which contains tutorials on how to use various commands on the server. 

Epilogue

If you find this article helpful, please like and bookmark it. Your likes are an acknowledgment of and encouragement for the author's work, and they really matter to the author. If you have any questions or suggestions about the content, feel free to comment below and I will reply as soon as possible.

Origin blog.csdn.net/qq_35768355/article/details/132875292