PyTorch distributed training and resuming training from checkpoints

1. PyTorch distributed training

PyTorch supports distributed training across multiple machines and multiple GPUs. Each machine participating in distributed training is represented by a Node (a Node is not limited to a physical machine; it can also be a container such as Docker. In short, one Node corresponds to one machine). Nodes are divided into a Master Node and Slave Nodes: there is exactly one Master Node, and there can be multiple Slave Nodes. Assume two machines participate in distributed training and each machine has 4 GPUs. Execute the following commands on the two machines respectively (taking YOLOv5 training as an example):

Master Node executes the following commands:

python -m torch.distributed.launch \
       --nnodes 2 \
       --nproc_per_node 4 \
       --use_env \
       --node_rank 0 \
       --master_addr "192.168.1.2" \
       --master_port 1234 \
       train.py \
       --batch 64 \
       --data coco.yaml \
       --cfg yolov5s.yaml \
       --weights 'yolov5s.pt'

Slave Node executes the following commands:

python -m torch.distributed.launch \
       --nnodes 2 \
       --nproc_per_node 4 \
       --use_env \
       --node_rank 1 \
       --master_addr "192.168.1.2" \
       --master_port 1234 \
       train.py \
       --batch 64 \
       --data coco.yaml \
       --cfg yolov5s.yaml \
       --weights 'yolov5s.pt'

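Launched this way, torch.distributed.launch starts 4 processes per node (8 in total), and because of --use_env each process receives its LOCAL_RANK, RANK and WORLD_SIZE through environment variables instead of a --local_rank argument. The sketch below is not YOLOv5's actual train.py; it is a minimal, assumed example (placeholder model and random data) of the DDP setup such a script performs, so the role of those variables is visible:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torch.distributed.launch --use_env exports these variables for each process
    local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node: 0..3
    global_rank = int(os.environ["RANK"])        # unique rank across all nodes: 0..7
    world_size = int(os.environ["WORLD_SIZE"])   # total processes: 2 nodes x 4 GPUs = 8

    # NCCL backend for multi-GPU training; rendezvous uses master_addr/master_port from the launch command
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model and data just to illustrate the wrapping; a real script builds its own
    model = torch.nn.Linear(10, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)        # shards the data across the 8 processes
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                      # DDP all-reduces gradients across processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Each node runs the same script through its own torch.distributed.launch command as shown above; DDP averages the gradients across all 8 processes after every backward pass, and DistributedSampler ensures each process sees a distinct shard of the dataset.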