Practical tutorial | Pytorch - Simple implementation of elastic training (with source code)

Author丨Yan Tingshuai@zhihu (authorized)

Source丨https://zhuanlan.zhihu.com/p/489892744

Edit丨Gokushi Platform

Due to work needs, I have recently been filling in gaps in my knowledge of distributed training. After some theoretical study I still felt something was missing: many points could not be pinned down precisely (for example, what distributed primitives such as scatter and all-reduce look like at the code level, how the ring all-reduce algorithm is used in gradient synchronization, and how a parameter server partially updates its parameters).

"What I cannot create, I do not understand." was written on the blackboard in the office of the famous physicist and Nobel laureate Richard Feynman. There is also a slogan of "show me the code" in the programmer world. Therefore, I plan to write a series of articles on distributed training, present the abstract concept of distributed training in the form of code, and ensure that each code is executable, verifiable, and reproducible, and contribute source code to let Everyone communicates with each other.

After some research I found that pytorch has a clean abstraction and a complete interface for distributed training, so this series uses pytorch as its main framework. Many of the examples in these articles come from the pytorch documentation, with additional debugging and extension.

Finally, since there are already many theoretical introductions to distributed training online, theory will not be the focus of this series; I will concentrate on the code level.

Pytorch - Distributed training minimalist experience: https://zhuanlan.zhihu.com/p/477073906

Pytorch - Distributed Communication Primitives (with source code): https://zhuanlan.zhihu.com/p/478953028

Pytorch - Handwritten allreduce distributed training (with source code): https://zhuanlan.zhihu.com/p/482557067

Pytorch - Parallel and minimalist implementation between operators (with source code): https://zhuanlan.zhihu.com/p/483640235

Pytorch - Multi-machine multi-card minimalist implementation (with source code): https://zhuanlan.zhihu.com/p/486130584

1. Introduction

Pytorch introduced torchrun in 1.9.0 as a replacement for torch.distributed.launch, which was used prior to 1.9.0. Compared with torch.distributed.launch, torchrun mainly adds two capabilities:

  • Failover: when a worker fails during training, all workers are automatically restarted and training continues;

  • Elastic: nodes can be dynamically added or removed. This article uses an example to illustrate how elastic training should be used.

In this example, a worker group with 4 GPUs is first started on Node0. After training for a while, 4 more GPU workers are started on Node1; together with the workers on Node0 they form a new worker group, so the job eventually becomes a 2-node, 8-GPU distributed training task.


2. Model building

The model is a simple fully connected neural network:

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))
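As a quick sanity check (not part of the original script; it assumes the ToyModel class above and a plain torch import), the model maps a batch of 10-dimensional inputs to 5-dimensional outputs:

import torch

model = ToyModel()
x = torch.randn(20, 10)   # a batch of 20 samples with 10 features each
print(model(x).shape)     # expected: torch.Size([20, 5])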

3. Checkpoint processing

Every time a node is added or removed, all workers are killed and then restarted to continue training. The training code therefore has to save the training state, so that after a restart training can resume from the last saved state.

The information that needs to be saved generally includes the following:

  • model: the parameter information of the model

  • optimizer: the parameter information of the optimizer

  • epoch: the number of epochs that have been completed so far

The save and load code is shown below.

  • torch.save: serializes a Python object with pickle and saves it to a local file;

  • torch.load: deserializes a file produced by torch.save and loads it into memory;

  • model.state_dict(): holds each layer of the model and its corresponding parameters;

  • optimizer.state_dict(): holds the parameter information of the optimizer.

def save_checkpoint(epoch, model, optimizer, path):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimize_state_dict": optimizer.state_dict(),
    }, path)

def load_checkpoint(path):
    checkpoint = torch.load(path)
    return checkpoint
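In this example every rank writes to the same checkpoint.pt. A common variant, shown here only as a minimal sketch (it is not part of the original article and assumes the process group has already been initialized), is to let only rank 0 save and keep the other ranks in step with a barrier:

import torch
import torch.distributed as dist

def save_checkpoint_rank0(epoch, model, optimizer, path):
    # Only rank 0 writes the file, so several workers do not write it concurrently.
    if dist.get_rank() == 0:
        torch.save({
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimize_state_dict": optimizer.state_dict(),
        }, path)
    dist.barrier()  # all ranks wait here until the checkpoint has been written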

4. Training code

The initialization logic is as follows:

  • Lines 1~3: print the key environment variables of the current worker, to be used later when presenting the results;

  • Lines 5~8: create the model, optimizer, and loss function;

  • Lines 10~12: initialize the parameter information;

  • Lines 14~19: if a checkpoint exists, load it and restore model, optimizer, and first_epoch from it.

local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
print(f"[{os.getpid()}] (rank = {rank}, local_rank = {local_rank}) train worker starting...")

model = ToyModel().cuda(local_rank)
ddp_model = DDP(model, [local_rank])
loss_fn = nn.MSELoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
optimizer.zero_grad()
max_epoch = 100
first_epoch = 0
ckp_path = "checkpoint.pt"

if os.path.exists(ckp_path):
    print(f"load checkpoint from {ckp_path}")
    checkpoint = load_checkpoint(ckp_path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimize_state_dict"])
    first_epoch = checkpoint["epoch"]
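torch.load by default restores tensors to the device they were saved from. If that device may differ from the current worker's GPU, one possible adjustment (an assumption on my part, not something the original code does) is to pass map_location so each worker loads the checkpoint onto its own device:

def load_checkpoint_on_device(path, local_rank):
    # Map all saved tensors onto this worker's own GPU.
    return torch.load(path, map_location=f"cuda:{local_rank}")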

Training logic:

  • Line 1: epochs run from first_epoch to max_epoch, so that after a worker restart training continues from the epoch it had reached;

  • Line 2: to show the effect of dynamically adding a node, a sleep call is added here to slow training down;

  • Lines 3~9: the model training step (clear the gradients, run the forward and backward pass, and update the parameters);

  • Line 10: for simplicity, a checkpoint is saved every epoch; the current epoch, model, and optimizer are written to the checkpoint.

for i in range(first_epoch, max_epoch):
    time.sleep(1)  # slow training down so the effect of dynamically adding a node is visible
    optimizer.zero_grad()  # clear the gradients accumulated in the previous iteration
    outputs = ddp_model(torch.randn(20, 10).to(local_rank))
    labels = torch.randn(20, 5).to(local_rank)
    loss = loss_fn(outputs, labels)
    loss.backward()
    print(f"[{os.getpid()}] epoch {i} (rank = {rank}, local_rank = {local_rank}) loss = {loss.item()}\n")
    optimizer.step()
    save_checkpoint(i, model, optimizer, ckp_path)
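The loop above trains on freshly generated random tensors. With a real dataset, each epoch would typically iterate over a DataLoader whose DistributedSampler shards the data across ranks. The following is only a rough sketch of that pattern; it reuses ddp_model, optimizer, loss_fn, save_checkpoint and the other names defined above, and the dataset shapes are invented for illustration:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 5))  # toy data
sampler = DistributedSampler(dataset)              # shards indices across ranks
loader = DataLoader(dataset, batch_size=20, sampler=sampler)

for epoch in range(first_epoch, max_epoch):
    sampler.set_epoch(epoch)                       # reshuffle differently each epoch
    for inputs, labels in loader:
        optimizer.zero_grad()
        outputs = ddp_model(inputs.to(local_rank))
        loss = loss_fn(outputs, labels.to(local_rank))
        loss.backward()
        optimizer.step()
    save_checkpoint(epoch, model, optimizer, ckp_path)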

5. How to start

Since torchrun is used to start the multi-node, multi-GPU job, there is no need to start multiple processes via the spawn interface (torchrun is responsible for launching our Python script as the worker processes). We simply call the train function written above and add the process group initialization and teardown around it.

The following code shows how the train function above is invoked.

def run():
    env_dict = {
        key: os.environ[key]
        for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "LOCAL_WORLD_SIZE")
    }
    print(f"[{os.getpid()}] Initializing process group with: {env_dict}")
    dist.init_process_group(backend="nccl")
    train()
    dist.destroy_process_group()


if __name__ == "__main__":
    run()
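For reference, the imports that the snippets above rely on look roughly like this (the exact list can be found in the full script linked in section 6):

import os
import time

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP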

In this example, torchrun is used to run the multi-node, multi-GPU distributed training task (note: torch.distributed.launch has been deprecated by pytorch, so try not to use it anymore). The startup command is described below (note: both node0 and node1 are started with this command).

  • --nnodes=1:3: the training job accepts a minimum of 1 node and a maximum of 3 nodes;

  • --nproc_per_node=4: each node runs 4 worker processes;

  • --max_restarts=3: the maximum number of restarts of the worker group; note that node failure, node scale-down, and node scale-up all cause a restart;

  • --rdzv_id=1: a unique job id; all nodes use the same job id;

  • --rdzv_backend: the backend implementation of rendezvous; c10d and etcd are supported out of the box; rendezvous is used for communication and coordination between the nodes;

  • --rdzv_endpoint: the rendezvous address, given as the host IP and port of one of the nodes;

torchrun \
    --nnodes=1:3 \
    --nproc_per_node=4 \
    --max_restarts=3 \
    --rdzv_id=1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="192.0.0.1:1234" \
    train_elastic.py
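When debugging restarts, it can also help to print a few of the per-worker environment variables documented for torchrun; a small optional sketch:

import os

for key in ("GROUP_RANK", "ROLE_RANK", "TORCHELASTIC_RESTART_COUNT", "TORCHELASTIC_MAX_RESTARTS"):
    print(key, "=", os.environ.get(key))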

6. Analysis of results

Code: BetterDL - train_elastic.py: https://github.com/tingshua-yts/BetterDL/blob/master/test/pytorch/DDP/train_elastic.py

Environment: 2 machines, each with 4 V100 GPUs

image: pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime

gpu: V100

First, execute the startup script on node0:

torchrun \
    --nnodes=1:3 \
    --nproc_per_node=4 \
    --max_restarts=3 \
    --rdzv_id=1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="192.0.0.1:1234" \
    train_elastic.py

We get the following output:

  • Lines 2~5: the job starts as a single-node, 4-GPU training task, so WORLD_SIZE is 4 and LOCAL_WORLD_SIZE is also 4;

  • Lines 6~9: a total of 4 ranks participate in distributed training, rank0~rank3;

  • Lines 10~18: rank0~rank3 all start training from epoch 0.

/workspace/DDP# sh run_elastic.sh
[4031] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '44901', 'WORLD_SIZE': '4', 'LOCAL_WORLD_SIZE': '4'}
[4029] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '44901', 'WORLD_SIZE': '4', 'LOCAL_WORLD_SIZE': '4'}
[4030] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '44901', 'WORLD_SIZE': '4', 'LOCAL_WORLD_SIZE': '4'}
[4032] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '44901', 'WORLD_SIZE': '4', 'LOCAL_WORLD_SIZE': '4'}
[4029] (rank = 0, local_rank = 0) train worker starting...
[4030] (rank = 1, local_rank = 1) train worker starting...
[4032] (rank = 3, local_rank = 3) train worker starting...
[4031] (rank = 2, local_rank = 2) train worker starting...
[4101] epoch 0 (rank = 1, local_rank = 1) loss = 0.9288564920425415
[4103] epoch 0 (rank = 3, local_rank = 3) loss = 0.9711472988128662
[4102] epoch 0 (rank = 2, local_rank = 2) loss = 1.0727070569992065
[4100] epoch 0 (rank = 0, local_rank = 0) loss = 0.9402943253517151
[4100] epoch 1 (rank = 0, local_rank = 0) loss = 1.0327017307281494
[4101] epoch 1 (rank = 1, local_rank = 1) loss = 1.4485043287277222
[4103] epoch 1 (rank = 3, local_rank = 3) loss = 1.0959293842315674
[4102] epoch 1 (rank = 2, local_rank = 2) loss = 1.0669530630111694
...

Then execute the same script on node1:

torchrun \
    --nnodes=1:3 \
    --nproc_per_node=4 \
    --max_restarts=3 \
    --rdzv_id=1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="192.0.0.1:1234" \
    train_elastic.py

The result on node1 is as follows:

  • Lines 2~5: because node1 has joined, the job is now a 2-node, 8-GPU distributed training task, so WORLD_SIZE=8 and LOCAL_WORLD_SIZE=4;

  • Lines 6~9: the workers on node1 get rank4~rank7;

  • Lines 13~20: since node1 joined when the workers on node0 had reached epoch 35, training resumes from epoch 35.

/workspace/DDP# sh run_elastic.sh
[696] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '42913', 'WORLD_SIZE': '8', 'LOCAL_WORLD_SIZE': '4'}
[697] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '42913', 'WORLD_SIZE': '8', 'LOCAL_WORLD_SIZE': '4'}
[695] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '42913', 'WORLD_SIZE': '8', 'LOCAL_WORLD_SIZE': '4'}
[694] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '42913', 'WORLD_SIZE': '8', 'LOCAL_WORLD_SIZE': '4'}
[697] (rank = 7, local_rank = 3) train worker starting...
[695] (rank = 5, local_rank = 1) train worker starting...
[694] (rank = 4, local_rank = 0) train worker starting...
[696] (rank = 6, local_rank = 2) train worker starting...
load checkpoint from checkpoint.ptload checkpoint from checkpoint.pt
load checkpoint from checkpoint.pt
load checkpoint from checkpoint.pt
[697] epoch 35 (rank = 7, local_rank = 3) loss = 1.1888569593429565
[694] epoch 35 (rank = 4, local_rank = 0) loss = 0.8916441202163696
[695] epoch 35 (rank = 5, local_rank = 1) loss = 1.5685604810714722
[696] epoch 35 (rank = 6, local_rank = 2) loss = 1.11683189868927
[696] epoch 36 (rank = 6, local_rank = 2) loss = 1.3724170923233032
[694] epoch 36 (rank = 4, local_rank = 0) loss = 1.061527967453003
[695] epoch 36 (rank = 5, local_rank = 1) loss = 0.96876460313797
[697] epoch 36 (rank = 7, local_rank = 3) loss = 0.8060566782951355
...

The result on node0 is as follows:

  • Lines 6~9: when the workers on node0 reach epoch 35, the training script is started on node1 and requests to join the training job; the existing workers are sent SIGTERM;

  • Lines 10~13: all workers are restarted; since node1 has joined, the job is now a 2-node, 8-GPU distributed training task, so WORLD_SIZE=8 and LOCAL_WORLD_SIZE=4;

  • Lines 14~17: the workers on node0 keep rank0~rank3;

  • Lines 18~21: the checkpoint is loaded;

  • Lines 22~30: training continues from the model, optimizer, and epoch restored from the checkpoint.

...
[4100] epoch 35 (rank = 0, local_rank = 0) loss = 1.0746158361434937
[4101] epoch 35 (rank = 1, local_rank = 1) loss = 1.1712706089019775
[4103] epoch 35 (rank = 3, local_rank = 3) loss = 1.1774182319641113
[4102] epoch 35 (rank = 2, local_rank = 2) loss = 1.0898035764694214
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4100 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4101 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4102 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4103 closing signal SIGTERM
[4164] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '42913', 'WORLD_SIZE': '8', 'LOCAL_WORLD_SIZE': '4'}
[4165] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '42913', 'WORLD_SIZE': '8', 'LOCAL_WORLD_SIZE': '4'}
[4162] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '42913', 'WORLD_SIZE': '8', 'LOCAL_WORLD_SIZE': '4'}
[4163] Initializing process group with: {'MASTER_ADDR': '192.0.0.1', 'MASTER_PORT': '42913', 'WORLD_SIZE': '8', 'LOCAL_WORLD_SIZE': '4'}
[4162] (rank = 0, local_rank = 0) train worker starting...
[4163] (rank = 1, local_rank = 1) train worker starting...
[4164] (rank = 2, local_rank = 2) train worker starting...
[4165] (rank = 3, local_rank = 3) train worker starting...
load checkpoint from checkpoint.pt
load checkpoint from checkpoint.pt
load checkpoint from checkpoint.pt
load checkpoint from checkpoint.pt
[4165] epoch 35 (rank = 3, local_rank = 3) loss = 1.3437936305999756
[4162] epoch 35 (rank = 0, local_rank = 0) loss = 1.5693414211273193
[4163] epoch 35 (rank = 1, local_rank = 1) loss = 1.199862003326416
[4164] epoch 35 (rank = 2, local_rank = 2) loss = 1.0465545654296875
[4163] epoch 36 (rank = 1, local_rank = 1) loss = 0.9741991758346558
[4162] epoch 36 (rank = 0, local_rank = 0) loss = 1.3609280586242676
[4164] epoch 36 (rank = 2, local_rank = 2) loss = 0.9585908055305481
[4165] epoch 36 (rank = 3, local_rank = 3) loss = 0.9169824123382568
...
