DLRover: Ant's Open-Source Large-Scale Intelligent Distributed Training System


Text | Sha Jian

Senior Technical Expert, Ant Group

Focused on distributed deep learning

Responsible for the design and development of Ant's large-scale distributed training engine

This article is about 4,491 words and takes roughly 12 minutes to read.

This article introduces the motivation behind the DLRover project and its core capabilities as a whole. In the future, we will publish a series of articles introducing DLRover from multiple perspectives, such as synchronous/asynchronous elastic training, optimization of the policy service, integration with various clusters and training frameworks, and custom policy development. Stay tuned for more details.

01

Technical Background

In June 2022, Ant Group decided to fully adopt the ESG framework and established a four-in-one sustainable development strategy of "digital inclusion", "green and low carbon", "scientific and technological innovation", and "open ecology". Under "green and low carbon", four sub-topics were set up: green operation, technology-assisted industrial carbon neutrality, ecological protection and restoration, and green low-carbon life.

In this context, green AI has become an important direction for the Ant AI Infra team. As an important part of green AI, the engineering efficiency improvement project is committed to building a high-performance offline AI engineering system. By improving computing efficiency and resource utilization, it ultimately saves resources and reduces carbon emissions.

Currently, users submit distributed training jobs with tools such as Yarn or KubeFlow/Training-Operator. When submitting a job, the user must specify the job's resources, including the number of nodes for each role and their resource specifications (CPU cores, memory, GPU, etc.).

After the training job is submitted, it may run into the following problems:

  • The cluster does not have enough resources to start all of the job's nodes, so the job can only wait.

  • A node in the training job may fail, for example because it is preempted by a higher-priority task or because of machine or I/O failures, causing the job to fail.

When these problems occur, users can only modify the job's resources and resubmit it.

To address these two problems, Ant Group earlier open-sourced the Kubernetes-based ElasticDL project to support elastic fault tolerance for TF 2.x distributed training on K8s. During the rollout of that project, we found the following problems:

  • Resources provisioned by users may be too low, causing OOM errors and poor training performance.

  • To ensure the success rate and speed of their jobs, users usually over-provision resources, resulting in low utilization.

  • More and more users develop and train models with PyTorch or other frameworks besides TF.

  • More and more distributed clusters, such as Ray and Spark clusters, are beginning to support AI workloads. Can we adapt to any computing cluster?

  • With the increasing adoption of online learning, how can a single system also support offline training in a compatible way?

The first two problems keep the cluster's CPU utilization at only about 20%, and algorithm developers have to invest a lot of manual operation and maintenance effort. To improve resource efficiency on the training side, we need a system that supports multiple offline training modes on different clusters and automatically finds the optimal resource configuration for distributed training jobs built on different frameworks.

Building on ElasticDL's elastic fault-tolerance ideas, the Ant AI Infra team upgraded and extended it into DLRover, whose goal is to make distributed model training more intelligent. Today, many companies run their training jobs in mixed clusters with complex operating environments. As its name suggests, DLRover aims to be the "Land Rover" of distributed training, handling even the roughest terrain with ease.


02

Overall Approach

DLRover proposes the concept of "ML for System" to make distributed training more intelligent. What capabilities should such a system have?

We think this is mainly reflected in the following aspects:

  • Decoupling: Do not couple with the underlying training framework; rely only on interface abstractions and follow the dependency inversion principle. (i.e. Elastic Runtime)

  • Resource scheduling: Manage and control resources from a global, "God's-eye" perspective, with decision-making based on accurate job profiling.

  • Data-driven: Collect both cluster resource data and training job data, and let this data drive intelligent decisions.

  • Job interaction: Based on an understanding of training jobs and white-boxing of the model, dynamically optimize and adjust jobs according to the actual situation, going beyond simple mechanical elastic fault tolerance.

  • Intelligence: Combine the collected cluster and job information with algorithmic models and fixed strategies to produce accurate job optimization decisions.

We hope to design and implement a system that frees users completely from the burden of resource configuration and lets them focus on model training itself. Even without any resource configuration input, DLRover can still provide the optimal resource configuration for each training job. Considering that users may run their training jobs in different ways, in addition to the Cluster Mode for unified job management on a training platform, DLRover also provides a Single-Job Mode, so that independent algorithm developers can also enjoy basic features such as elastic fault tolerance.

03

System Architecture

DLRover consists of four main components: ElasticJob, Elastic Trainer, Brain Service, and Cluster Monitor.

[Figure: DLRover system architecture for managing training jobs on a K8s cluster]

The figure above shows how DLRover manages deep learning training jobs on a K8s cluster. DLRover submits jobs to the cluster as ElasticJob CRDs. After receiving the CRD, the ElasticJob Operator starts a Master Pod that acts as the Elastic Trainer. The Elastic Trainer obtains an initial resource plan from the Brain service, uses it to create a Scale CRD, and applies that Scale CRD to tell the ElasticJob Controller to start the required Pods; each Pod then launches an Elastic Agent.
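
To make the flow concrete, the sketch below submits a hypothetical ElasticJob with the official Kubernetes Python client. The API group, field names, and image are illustrative assumptions rather than the exact DLRover CRD schema; the point is that the job carries no node counts or resource requests, because the Brain derives the initial plan.

```python
# Minimal sketch: submitting a hypothetical ElasticJob custom resource.
# The API group/version, field names, and image are illustrative assumptions,
# not the exact DLRover CRD schema.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod
api = client.CustomObjectsApi()

pod_template = {  # Pod template with no CPU/memory requests
    "spec": {
        "containers": [
            {
                "name": "main",
                "image": "my-training-image:latest",            # hypothetical image
                "command": ["python", "-m", "my_model.train"],  # hypothetical entrypoint
            }
        ]
    }
}

elastic_job = {
    "apiVersion": "elastic.iml.github.io/v1alpha1",  # assumed group/version
    "kind": "ElasticJob",
    "metadata": {"name": "demo-job", "namespace": "dlrover"},
    "spec": {
        # No replica counts or resource requests: the Brain service derives an
        # initial resource plan and the Elastic Trainer applies it via a Scale CR.
        "distributionStrategy": "ParameterServerStrategy",
        "replicaSpecs": {
            "ps": {"template": pod_template},
            "worker": {"template": pod_template},
        },
    },
}

api.create_namespaced_custom_object(
    group="elastic.iml.github.io",   # assumed group
    version="v1alpha1",
    namespace="dlrover",
    plural="elasticjobs",
    body=elastic_job,
)
```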

During training, the Elastic Trainer's Training Master distributes data shards to the Workers. Meanwhile, the Cluster Monitor tracks the running status of each job (i.e. the workload of each node) and the cluster status (i.e. resource utilization levels). These data are reported to the Brain periodically, and the Brain persists them in a database.

The DLRover Brain then selects an appropriate algorithm to generate a new resource plan based on the job's running status and notifies the Elastic Trainer to start resource adjustment.

Overall, DLRover lets distributed training jobs run in the cluster automatically, which can be regarded as autopilot for distributed jobs: model developers only need to focus on the algorithm design of the model. The current open-source version of DLRover provides users with the following capabilities:

  • Automatic resource derivation: Helps users automatically initialize training resources to improve resource utilization and job stability.

  • Dynamic training data sharding: Mitigates the bucket effect caused by performance differences between Workers by assigning training data according to each node's actual consumption speed; works with failover to record consumption positions so that no data is lost.

  • Single-point fault tolerance: Recovers from single-node failures without restarting the entire job.

  • Resource elasticity: Supports runtime elastic scaling at both the Pod level and the CPU/memory level, with dynamic global optimization decisions.

04

What DLRover Can Bring

1. Zero resource configuration for jobs

Users do not need to provide any resource information when submitting a distributed job. DLRover automatically profiles the job, derives the optimal resource configuration, and adjusts resources automatically. The two submission scripts are compared below:

[Figure: job submission configuration with and without manual resource settings]

2. Single-point fault tolerance improves job stability and recovery efficiency

DLRover supports single-point recovery when a Parameter Server or Worker exits due to failure, without restarting the entire job, and can restart nodes transparently to the user for errors that are not caused by user code or data. For example, a very common type of cluster error is an OOM during training caused by insufficient user-configured memory. With DLRover, we can automatically launch a node with an optimized configuration to replace the failed one. In a real environment, training jobs managed by DLRover increased the training success rate from 84% (the baseline Kubeflow TF-Operator jobs) to more than 95%.
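
As a rough illustration of this recovery loop, the master-side logic could look like the sketch below. The `job` and `brain` helper objects and their methods are hypothetical, used only to outline the control flow; this is not DLRover's actual implementation.

```python
# Minimal sketch of single-point fault tolerance, with hypothetical helpers.
import time


def watch_and_recover(job, brain, poll_interval=30):
    """Relaunch failed nodes with an updated resource plan instead of
    restarting the whole job."""
    while not job.finished():
        for node in job.list_failed_nodes():
            if node.exit_reason == "OOMKilled":
                # Ask the Brain for a larger memory allocation based on the
                # node's observed workload, then relaunch only this node.
                plan = brain.optimize_node(job.id, node.id, reason="oom")
                job.relaunch_node(node.id, resources=plan)
            elif node.exit_reason in ("Preempted", "NodeFailure"):
                # Non-user errors: relaunch with the same resources; the data
                # shard the node was consuming is returned to the master and
                # handed to another worker, so no data is lost.
                job.relaunch_node(node.id)
            # User code or data errors are not retried here: they need a fix
            # from the user rather than a new node.
        time.sleep(poll_interval)
```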

[Figure: training success rate of DLRover-managed jobs vs. baseline Kubeflow TF-Operator jobs]

3. Automatic scaling improves training performance

DLRover supports automatic adjustment of training resources at the Parameter Server and Worker level during training to improve performance. By monitoring the workload of the job's nodes, DLRover can identify resource bottlenecks. Common bottlenecks include node preemption, unbalanced workloads, insufficient CPU limiting compute, and an insufficient number of nodes. DLRover continuously optimizes training performance through dynamic resource hot updates.
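
The sketch below illustrates the kind of decision such monitoring enables. The thresholds and the `job_metrics` object are illustrative assumptions, not DLRover's actual tuning algorithm.

```python
# Minimal sketch of a scaling decision driven by node workload metrics.
def propose_scale_plan(job_metrics, cpu_high=0.9, mem_high=0.9):
    plan = {"add_workers": 0, "add_ps": 0, "hot_update": {}}

    # All workers saturating their CPU: computing power is the bottleneck,
    # so add workers (Pod-level elasticity).
    workers = job_metrics.workers
    if workers and all(w.cpu_util > cpu_high for w in workers):
        plan["add_workers"] = max(1, len(workers) // 2)

    # Hot parameter servers: every worker waits on them (unbalanced
    # workload), so add PS nodes or rebalance parameters.
    plan["add_ps"] = sum(1 for p in job_metrics.ps if p.cpu_util > cpu_high)

    # Nodes close to their memory limit get a CPU/memory-level hot update
    # instead of waiting for an OOM and a restart.
    for node in list(workers) + list(job_metrics.ps):
        if node.mem_used / node.mem_limit > mem_high:
            plan["hot_update"][node.id] = {"memory": int(node.mem_limit * 1.5)}
    return plan
```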

[Figure: training performance with automatic scaling]

4. Automatic scaling improves resource utilization

Different model training jobs usually require different resource configurations, but users tend to over-provision resources to ensure their jobs succeed, which wastes a lot of resources. DLRover's automatic scaling can configure resources according to the job's real needs and achieve the best training performance with the fewest resources, thereby reducing waste. The graph below compares resource utilization curves for automatic versus manual resource configuration:

[Figure: resource utilization with automatic vs. manual resource configuration]

5. Dynamic data distribution solves the slow-node problem

In mixed clusters, resources can be oversold or preempted, so some nodes consume data slowly and fast nodes have to wait for them, which reduces training speed. DLRover can dynamically distribute less data to slow nodes to reduce waiting. In addition, DLRover must ensure that training tasks consume data as closely as possible to the user's configuration, avoiding repeated consumption or loss of data, which would introduce uncertainty into training and affect model performance.

When scaling out or in, a global coordinator needs to know exactly which data each node has consumed. When a node fails and restarts, the global coordinator needs to know which data that node has and has not consumed. If this logic were handled by the training nodes themselves, they would have to interact with each other, increasing the complexity of their logic. With DLRover, the DLRover Master acts as this global coordinator.

In short, in our view, dynamic data distribution simplifies the logic of the training nodes: a training node only needs to fetch shards from the DLRover Master and read the data, without handling any other coordination logic.
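
A minimal sketch of this coordinator idea follows; the class and method names are illustrative, not DLRover's actual API.

```python
# Minimal sketch of the global shard coordinator described above.
from collections import deque


class ShardCoordinator:
    """Runs in the DLRover Master: hands out data shards on demand and
    re-queues shards whose worker failed before finishing them."""

    def __init__(self, dataset_size, shard_size):
        self.todo = deque(
            (start, min(start + shard_size, dataset_size))
            for start in range(0, dataset_size, shard_size)
        )
        self.doing = {}  # shard -> worker_id

    def get_shard(self, worker_id):
        """Faster workers call this more often, so they naturally receive
        more data than slow workers."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[shard] = worker_id
        return shard

    def report_done(self, shard):
        """Called by a worker after it has trained on the shard."""
        self.doing.pop(shard, None)

    def recover_worker(self, worker_id):
        """Return a failed worker's in-flight shards to the queue so the
        data is neither lost nor consumed twice."""
        for shard, owner in list(self.doing.items()):
            if owner == worker_id:
                del self.doing[shard]
                self.todo.appendleft(shard)
```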

6. Unifying offline and online learning paradigms

The dynamic data sharding feature described above actually helps us decouple the data source from the training job. On this basis, DLRover can support both offline training and online learning jobs that consume real-time sample streams. (It can connect to the sample stream directly through dlrover.trainer, or be used as the training sink node of a stream computing engine.)
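
The worker side of this decoupling might look like the sketch below. `master_client` and `read_records` are hypothetical names: the former fetches shard metadata from the DLRover Master, the latter reads the shard from an offline file or an online sample stream, and the training loop never sees the difference.

```python
# Minimal sketch of a source-agnostic training node.
def shard_iterator(master_client, read_records):
    """Yield training records shard by shard; the training loop never
    needs to know which data source backs the shards."""
    while True:
        shard = master_client.get_shard()  # e.g. (start, end) offsets, or None
        if shard is None:                  # a bounded (offline) dataset is finished
            break
        start, end = shard
        for record in read_records(start, end):
            yield record
        master_client.report_done(shard)   # lets the master checkpoint progress
```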

In Ant's practice, DLRover has proved to be an ideal component for building an end-to-end online learning system. It addresses a series of practical issues, such as recording and recovering the data source consumption position, ensuring the stability and performance of long-running online learning jobs, and guaranteeing resource utilization. Our open-source repository includes simple examples, and we will open-source more peripheral components in the future.

7. Support for asynchronous and synchronous training modes

Training jobs of different natures from different business domains run in the training cluster every day: large sparse models for recommender systems usually run in a PS/Worker architecture with asynchronous parameter updates, mostly on CPU, while dense models in the CV/NLP fields are mostly trained synchronously on GPU servers in a data-parallel manner, where there is only the Worker role.

DLRover is designed to support both synchronous and asynchronous update modes, so as to unify these training paradigms.

8. Decoupling from the DL training framework

DLRover lets users bring their own training framework. The training code only needs to interact with the underlying distributed runtime through an agreed API interface to achieve automatic elastic scaling. Once deployment in the cluster is complete, algorithm engineers can adopt it almost transparently.
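
As an illustration of what such an agreed interface could look like, a framework adapter might implement something like the sketch below. The class and method names are assumptions for this sketch, not DLRover's published API.

```python
# Minimal sketch of an agreed elastic interface between the Elastic Agent
# and a concrete training framework (TF, PyTorch, ...), so DLRover never
# depends on the framework itself.
from abc import ABC, abstractmethod


class ElasticAdapter(ABC):
    """Bridge between the Elastic Agent and a training framework."""

    @abstractmethod
    def save_checkpoint(self, path: str) -> None:
        """Persist model and optimizer state before a scale event."""

    @abstractmethod
    def rebuild(self, cluster_spec: dict) -> None:
        """Re-create the distributed context (PS addresses, world size, ...)
        after nodes are added or removed, then restore from the checkpoint."""

    @abstractmethod
    def train_step(self, batch) -> float:
        """Run one framework-specific training step and return the loss."""
```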

05

Summary & Future Plans

DLRover has been deployed at scale inside Ant, where cluster resource utilization has increased by more than 15% over the baseline while keeping jobs stable. It has also effectively solved the problem of training throughput falling below expectations due to unreasonable resource allocation. We hope that open-sourcing DLRover helps more peers promote the concept of low-carbon, green AI, while also reducing the operation and maintenance cost of model development and freeing up more productivity to solve business problems.

DLRover's current tuning algorithms and its resource and job profiling strategies are mainly optimized for Ant's internal technology stack. Considering the diversity of technology stacks across organizations, DLRover's design provides unified interface abstractions at the API layer, so that concrete tuning algorithms and job profiling strategies can be flexibly customized. We welcome developers from different organizations to build the DLRover project with us according to their own needs and make it stronger.

Learn more...

Give DLRover a Star ✨:
https://github.com/intelligent-machine-learning
