[OpenMMLab series: single-machine multi-GPU training in DP mode] The command line on Windows and the .sh file on Linux, covered in one article

Table of contents

Foreword

Command-line usage in DP mode and analysis of the environment variables

Analysis of the original dist_train.sh file:

Analysis of related environment variables:

Config file pre-configuration:

Windows DP start command:

Linux DP startup command (using the .sh file):

Reference



Foreword

Ordinary single-machine, single-GPU training is often too slow for larger models. To address this, the OpenMMLab codebases ship .sh launch scripts for DP and DDP training.

Among them, dist_train.sh corresponds to single-machine multi-GPU training (the DP mode discussed here); slurm_train.sh corresponds to multi-machine multi-GPU training in DDP mode on a Slurm cluster.

Note: this article only covers the single-machine multi-GPU training mode.


Command-line usage in DP mode and analysis of the environment variables

Analysis of the original dist_train.sh file:

CONFIG=$1  # first positional argument: path to the config file
GPUS=$2  # second positional argument: number of GPUs to use
NNODES=${NNODES:-1}  # total number of nodes (defaults to 1, i.e. a single machine)
NODE_RANK=${NODE_RANK:-0}  # index of this node (numbering starts at 0; with only one node it is 0)
PORT=${PORT:-29500}  # port used by the master process (defaults to 29500)
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}  # address of the master node (defaults to localhost for single-machine training)

# Put the repository root on PYTHONPATH so that train.py can import the local package.
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py \
    $CONFIG \
    --launcher pytorch ${@:3}  # use the pytorch launcher; any arguments after the first two are forwarded to train.py

Analysis of related environment variables:

RANK: the global index of a process, used for inter-process communication; one process normally drives one GPU.

LOCAL_RANK: the process index on a single machine (numbering restarts from 0 on every machine).

NODE: a node, i.e. one machine/server.

NNODES: the total number of nodes.

NODE_RANK: the index assigned to each node, e.g. 0, 1 or 2.

NPROC_PER_NODE: the number of processes started on each node.

WORLD_SIZE: the total number of processes across all nodes = NPROC_PER_NODE x NNODES (for example, 1 node with 3 processes gives WORLD_SIZE = 3).
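To make these variables concrete, here is a minimal sketch (it is not MMSegmentation's actual tools/train.py) of how a script started by python -m torch.distributed.launch can read them. Depending on the PyTorch version, the local rank arrives either as a --local_rank command-line argument or as the LOCAL_RANK environment variable, so the sketch accepts both:

# Minimal sketch, NOT MMSegmentation's train.py: reading the variables that
# torch.distributed.launch sets for each worker process. Run it under the
# launcher, e.g. python -m torch.distributed.launch --nproc_per_node 3 demo.py
import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# Older launch versions pass --local_rank as an argument; newer ones also
# export the LOCAL_RANK environment variable.
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # bind this process to one GPU
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are read from the environment.
dist.init_process_group(backend="nccl")

print(f"rank {dist.get_rank()} of world_size {dist.get_world_size()}, "
      f"local_rank {args.local_rank}")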

Config file pre-configuration:

Only two settings need attention; everything else can be left as it is:

1. The batch size in the config (samples_per_gpu) is set per GPU, not for all GPUs together, so it does not need to be a multiple of the number of GPUs.

2. Adjust the number of iterations: with multiple GPUs the effective batch size grows, so max_iters should be reduced relative to single-GPU training (see the sketch below).
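For reference, an illustrative excerpt in MMSegmentation 0.x config style showing the two fields involved; the field names (samples_per_gpu, workers_per_gpu, max_iters) are the standard ones, but the values below are examples and are not taken from the config used in this article:

# Illustrative config excerpt; values are examples only.
data = dict(
    samples_per_gpu=4,   # batch size PER GPU; effective batch = 4 x number of GPUs
    workers_per_gpu=4,   # dataloader workers per GPU
)

# With 3 GPUs the effective batch size triples, so the iteration budget can be
# reduced roughly in proportion to keep the amount of data seen comparable.
runner = dict(type='IterBasedRunner', max_iters=53000)  # ~160000 / 3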

Windows DP start command:

Here is my startup command: 

python3 -m torch.distributed.launch --nproc_per_node 3 --node_rank 0 --nnodes 1 tools/train.py test_120k_3gpu/deeplabv3_r50-d8_512x512_4x4_160k_coco-stuff164k.py --gpu-id 2 --launcher pytorch

Differences from single-GPU training:

You must specify the --nproc_per_node, --node_rank, --nnodes and --launcher parameters, and their positions matter: the first three must come before tools/train.py (they are arguments to torch.distributed.launch), while --launcher pytorch must come after it (it is an argument to train.py).

Note:

If this error is reported: subprocess.CalledProcessError: Command '[xxx, xxx, xxx]' returned non-zero exit status 1.

This is not the root cause; it only means one of the worker processes exited with an error. Scroll further up in the log to find the actual error message.

Linux DP startup command (using the .sh file):

Note:

1. Before running the .sh file, you need to make it executable; otherwise it reports: -bash: ./tools/dist_train.sh: Permission denied

chmod 777 ./tools/dist_train.sh

2. If this error is reported: tools/dist_train.sh: line 8: python: command not found

This is because the script invokes python by default; change it to the interpreter available in your environment, such as python2 or python3.

Here is my startup command: 

sh tools/dist_train.sh test_120k_3gpu/deeplabv3_r50-d8_512x512_4x4_160k_coco-stuff164k.py 3 --work-dir test_120k_3gpu/
# Only the config path and the number of GPUs need to be passed; the remaining arguments are optional and are forwarded to train.py

Reference

The basic concepts of PyTorch multi-GPU / distributed DDP training (node, rank, local_rank, nnodes, node_rank, nproc_per_node, world_size) - CSDN blog

Train a model — MMSegmentation 0.27.0 documentation: https://mmsegmentation.readthedocs.io/zh_CN/latest/train.html

Basic sh syntax - tealex's blog, CSDN: https://blog.csdn.net/tealex/article/details/69397776

-bash: ./tools/dist_train.sh: Permission denied - taxuewuhenxiaoer's blog, CSDN: https://blog.csdn.net/weixin_44790486/article/details/121889408


Origin blog.csdn.net/m0_61139217/article/details/126294575