Koordinator Heterogeneous Resource/Task Scheduling Practice

Foreword

Koordinator is a next-generation scheduling system that Alibaba Cloud has open sourced, built on the technology and practical experience accumulated from the unified scheduling system we built in the past. Koordinator supports colocated scheduling of various workloads on Kubernetes. Its goal is to improve the runtime efficiency and reliability of workloads, including both latency-sensitive workloads and batch jobs. Koordinator is not only good at colocation scenarios, but also supports task scheduling scenarios such as big data and AI training. This article shares our practical experience of using Koordinator to support heterogeneous resource management and task scheduling scenarios.

AI/LLMs Bring New Opportunities and New Challenges

From the release of ChatGPT in November 2022 to the present, the attention and impact it has attracted may have surpassed almost every other hotspot in the history of information technology, and it has won over many industry experts. For example, Zhang Yong, CEO of Alibaba Cloud, said: "All industries, applications, software, and services are worth redoing based on the capabilities of large models." NVIDIA CEO Jensen Huang said that it has brought the iPhone moment of AI. ChatGPT has ushered in a new era, and companies and research institutions at home and abroad have followed: almost every week one or more new models are launched, covering natural language processing, computer vision, AI-driven scientific research, generative AI, and more. Applications are blooming, and large models have become the key to improving business efficiency and unlocking the next point of growth. The corresponding demands on cloud computing, infrastructure, and distributed systems follow.

To support training of large models with tens or hundreds of billions of parameters, cloud computing and infrastructure need to provide more powerful and scalable computing and storage resources. One of the core technologies that large model training relies on is distributed training, which needs to transfer a large amount of data between computing nodes and therefore requires a high-performance network with higher bandwidth and lower latency. To maximize the performance of computing, storage, and network resources and ensure training efficiency, the scheduling and resource management systems need to adopt more reasonable strategies. On top of this, the infrastructure needs continuously strengthened reliability, with node failure recovery and fault tolerance capabilities, to keep training jobs running continuously.

Large model training is inseparable from heterogeneous computing devices, typically the well-known GPU. In the GPU field, NVIDIA still holds a dominant position, while other manufacturers such as AMD and domestic chip makers are striving to catch up. Taking NVIDIA as an example, its strong product design capabilities, solid technical strength, and flexible market strategy allow it to launch better chips quickly, but the architectures of its products differ considerably: the system architectures of the NVIDIA A100 and NVIDIA H100, for instance, differ markedly, and there are many details to pay attention to in how they are used, which brings many challenges to the upper-layer scheduling and resource management systems.

The powerful combination of Koordinator+KubeDL

In the large model training scenarios supported by Alibaba Cloud, we use Koordinator to solve basic task scheduling and heterogeneous device resource management requirements, and we use KubeDL to manage the training job lifecycle and the queuing and scheduling of training jobs.

Koordinator is not only good at colocation scheduling scenarios, but also provides general task scheduling capabilities for big data and AI model training scenarios, including elastic quota scheduling and gang scheduling. In addition, it has fine-grained resource scheduling and management capabilities: it not only supports allocating GPU devices, but can also allocate resources with awareness of the hardware system topology, and supports joint allocation and device sharing for GPU & RDMA.

We chose KubeDL to manage the training job lifecycle because it already supports a large number of internal AI scenarios, and thanks to its excellent design and implementation, its operability, reliability, and extensibility are all excellent. It is a unified controller that supports a variety of training workloads, such as TensorFlow, PyTorch, and Mars. In addition, it can adapt to the gang scheduling capabilities provided by different schedulers, which helps existing users of the KubeDL project switch smoothly to Koordinator. KubeDL also has a built-in general job queuing mechanism, which effectively addresses the scheduling needs of the jobs themselves.

The combination of Koordinator and KubeDL solves the scheduling needs of large model training well.

Job scheduling

A Job is a higher-level abstraction, usually representing a specific computational task or operation. It can be split into multiple subtasks that run in parallel or that cooperate to complete the work. A Job usually does not depend on other workloads and can run independently. Moreover, Jobs are more flexible and have fewer constraints in the time, space, and resource dimensions.

Job queuing

Jobs also need to be scheduled by the scheduler, which means they also need to be queued. So why is queuing needed? In other words, what problems does queuing solve?

It is because the resources in the system are limited, and so is our budget, while the number of jobs and their computing demands are often unlimited. If jobs are not queued and scheduled, those with high computing demands or long execution times will occupy a large amount of resources, leaving other jobs unable to obtain enough resources to run, and possibly even causing the cluster to crash.

Therefore, in order to ensure that each job can obtain resources fairly and avoid resource contention and conflicts, it is necessary to queue and schedule jobs.

We use the general job queuing and scheduling mechanism provided by KubeDL to solve this problem. Because KubeDL itself supports a variety of training workloads, it naturally supports scheduling at job granularity. It also has a fairness guarantee mechanism across tenants, which reduces resource contention and conflicts between jobs. During queuing and scheduling, KubeDL evaluates and admits jobs based on factors such as their computing requirements, priorities, and resource requests, ensuring that each job gets appropriate resources. KubeDL also supports a variety of extension plugins, such as Filter and Score plugins, which can further extend its behavior to meet the needs of different scenarios.
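To make this concrete, below is a minimal sketch of what submitting a training job through KubeDL might look like. It is illustrative only: the TFJob fields follow the KubeDL/Kubeflow-style schema, and the queue annotation key is an assumption rather than the documented API, so verify both against the KubeDL version you run.

```yaml
# Minimal KubeDL TFJob sketch (illustrative; verify fields against your KubeDL release).
apiVersion: training.kubedl.io/v1alpha1
kind: TFJob
metadata:
  name: mnist-train
  annotations:
    # Hypothetical queue annotation: the real key used by KubeDL's job queuing
    # feature may differ; check the KubeDL documentation.
    kubedl.io/queue-name: "team-a"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        spec:
          schedulerName: koord-scheduler   # hand Pods over to the Koordinator scheduler
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.11.0
              command: ["python", "/workspace/train.py"]   # placeholder training script
    Worker:
      replicas: 4
      template:
        spec:
          schedulerName: koord-scheduler
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.11.0
              command: ["python", "/workspace/train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```

KubeDL's controller watches such jobs, places them in its queue, and creates the PS and Worker Pods only once the job is admitted.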

Elastic Quota

One of the core problems to be solved by job queuing is the fairness of resource supply, which is generally solved through the elastic quota mechanism in the scheduling system.

The elastic quota mechanism has several core issues to solve. First, it must guarantee fairness: the resource demand of some tasks must not be so high that other tasks are starved, and most tasks should be able to obtain resources whenever possible. Second, there must be a certain degree of elasticity: idle quota can be temporarily shared with tasks that currently need more resources, and the shared resources can be reclaimed when they are needed again, which means flexible policies are required to meet the needs of different scenarios.

Koordinator implements elastic quota scheduling, which guarantees fairness among tenants. From the very beginning of the design, we considered compatibility with the ElasticQuota CRD defined in the community scheduler-plugins project, so that existing clusters and users can transition smoothly to Koordinator.

In addition, we are not only compatible with ElasticQuota's original way of managing quota per Namespace, but also support managing quotas in a tree structure that can span Namespaces. This works well for the quota management needs of complex organizations. For example, a company may have multiple product lines, each with a different budget and usage; each can be mapped to a quota, and with elastic quotas, idle resources that are temporarily unused can be shared with other departments in the form of quota.
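As a rough sketch of what this looks like in practice, the example below defines a parent quota for a product line and a child quota for one of its teams, using the scheduler-plugins-compatible ElasticQuota CRD. The Koordinator-specific label keys for building the tree are written from memory and should be checked against the documentation of the Koordinator version you deploy.

```yaml
# Parent quota for a product line (sketch; label keys are assumptions).
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: product-line-a
  namespace: kube-system
  labels:
    quota.scheduling.koordinator.sh/is-parent: "true"    # assumed label key
spec:
  min:                      # guaranteed resources
    cpu: "200"
    memory: 800Gi
  max:                      # upper bound; may borrow idle quota up to this amount
    cpu: "400"
    memory: 1600Gi
---
# Child quota for one team under that product line.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: team-foo
  namespace: kube-system
  labels:
    quota.scheduling.koordinator.sh/is-parent: "false"   # assumed label key
    quota.scheduling.koordinator.sh/parent: "product-line-a"
spec:
  min:
    cpu: "50"
    memory: 200Gi
  max:
    cpu: "200"
    memory: 800Gi
```

Pods are then associated with a quota (for example through a quota label on the Pod), and the scheduler enforces min and max while reclaiming borrowed resources when the lending quota needs them back.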

Coscheduling

When a Job is dequeued and scheduled, the Job Controller creates a batch of subtasks, which in Kubernetes corresponds to a batch of Pods. These Pods usually need to start and run in a coordinated way, which requires the scheduler to allocate resources for the group of Pods as a whole: either every Pod in the group obtains resources, or, as soon as one Pod cannot obtain resources, the whole group is considered a scheduling failure. This is the All-or-Nothing scheduling semantics that the scheduler needs to provide.

If we do not schedule in groups this way, competition among multiple jobs at the resource scheduling level can cause a deadlock in the resource dimension: at least two jobs fail to obtain resources, even when the idle resources would have been enough for one of them to run.

For example, in the figure below, Job A and Job B create a batch of Pods at the same time. If they are scheduled randomly instead of being ordered in the scheduling queue, the Pods of Job A and Job B end up each holding part of the resources on some nodes. If cluster resources are tight at this time, it is quite likely that neither Job A nor Job B can obtain enough resources. But if we sort them and let the Pods of one of the Jobs attempt to allocate resources first, then at least one Job is guaranteed to run.

When the set of Pods created for a Job is very large and the resources in the cluster are insufficient, or the quota is not large, such a set of Pods can be divided into smaller subgroups, each sized so that the task can still run. Assuming a Job requires a minimum granularity of 3 Pods per group, this minimum granularity is generally called min available in the scheduling domain.

Specifically in AI model training, some special jobs such as TFJob have subtasks with two roles (Parameter Server and Worker). In production, these roles often need different min available values, and such role-aware scenarios may require that the min available of every role be satisfied before the job is considered to meet the All-or-Nothing semantics.

Koordinator has a built-in Coscheduling capability that is compatible with the PodGroup CRD defined by the community scheduler-plugins/coscheduling project, and it also supports joint scheduling of multiple PodGroups, so that min available can be set per role. Koordinator implements a KubeDL Gang Scheduler plugin so that it can be integrated with KubeDL to support such scheduling scenarios.
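The sketch below shows the community-compatible shape of gang scheduling: a PodGroup with a minMember threshold, and a Worker Pod that joins it via the pod-group label. Treat it as an illustration; the exact label and annotation keys accepted by a given Koordinator release should be confirmed in its documentation.

```yaml
# Gang for the Worker role: scheduling succeeds only if at least minMember Pods can be placed together.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: tfjob-worker
  namespace: training
spec:
  minMember: 3                    # the "min available" of this role
  scheduleTimeoutSeconds: 600
---
# Each Worker Pod declares its gang membership through the community label.
apiVersion: v1
kind: Pod
metadata:
  name: tfjob-worker-0
  namespace: training
  labels:
    pod-group.scheduling.sigs.k8s.io: tfjob-worker
spec:
  schedulerName: koord-scheduler
  containers:
    - name: worker
      image: tensorflow/tensorflow:2.11.0
      resources:
        limits:
          nvidia.com/gpu: 1
```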

Fine-grained device management

Limitations of K8s device management

Kubernetes manages and allocates devices through the kubelet, which interacts with device plugins to implement the whole mechanism. This mechanism was sufficient in the early days of Kubernetes, but it shows several limitations in today's scenarios.

Kubelet and device plugin collaboration process

First, Kubernetes only allows devices to be allocated by the kubelet, which makes a globally optimal resource arrangement impossible and fundamentally limits resource efficiency. For example, suppose a cluster has two nodes with the same device model and the same number of allocatable devices remaining. In practice, the hardware topology of the devices on the two nodes can cause a large difference in the runtime performance of a Pod, and without scheduler intervention this difference may not be compensated for.

The second limitation is that it does not support joint allocation of GPU and RDMA. Large model training relies on high-performance networks, and inter-node communication over such networks uses the RDMA protocol and RDMA-capable network devices, which are closely tied to the GPUs in the node's system topology. For example, the figure below shows the hardware topology of NVIDIA's A100 machines: each PCIe Switch connects a GPU and a high-performance NIC, and these two devices need to be allocated together to achieve low-latency communication between nodes. What is more interesting is that when multiple GPUs are needed and multiple PCIe Switches are involved, multiple NICs also need to be allocated. This runs into another limitation of Kubernetes: the declared resource protocol is quantitative and fixed at submission time. The user does not actually know how many RDMA-capable NICs the Pod needs; the user only knows how many GPUs are needed and expects the RDMA NICs to be allocated alongside them.

Moreover, the kubelet does not support device initialization and cleanup, nor does it support device sharing. The latter is generally not needed in training scenarios, but it is useful for online inference services, which have obvious peak-and-valley characteristics and often do not need to occupy a full GPU.

To some extent, the device management capabilities of Kubernetes nodes have fallen behind the times. Although the latest versions support the DRA allocation mechanism (similar to the existing PVC scheduling mechanism), it is only available in the newest Kubernetes releases, while in reality a large number of existing clusters are still in use and upgrading them to the latest Kubernetes version is not trivial, so we had to find another way.

Koordinator fine-grained device management mechanism

We propose a solution in Koordinator, which can solve these problems and achieve fine-grained resource scheduling.

Koordinator fine-grained device management mechanism

As shown in the figure above, when a user creates a Pod, the koord-scheduler allocates devices according to the Device CRD reported by koordlet and writes the result into the Pod annotations. The kubelet then pulls up the Sandbox and containers, issuing CRI requests to containerd/docker along the way. In the Koordinator solution, the CRI request is intercepted by koord-runtime-proxy and forwarded to the GPU plugin in koordlet, which reads the device allocation result from the Pod annotations, generates the necessary device environment variables and other information, and returns them to koord-runtime-proxy. Finally, the modified CRI request is forwarded to containerd/docker and the response is returned to the kubelet. In this way, we can seamlessly hook into the lifecycle of the container and implement custom logic.

The Koordinator Device CRD describes the device information of a node, including device topology information, which guides the scheduler in implementing fine-grained allocation logic.

Koordinator Device object
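For reference, here is a simplified sketch of such a Device object as reported by koordlet for one node. The field names follow the general shape of the Koordinator Device CRD but are written from memory, so treat them as illustrative rather than as the authoritative schema.

```yaml
# Simplified Device object for one node (field names illustrative).
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
metadata:
  name: node-1                      # one Device object per node, named after the node
spec:
  devices:
    - type: gpu
      minor: 0
      health: true
      uuid: GPU-00000000-0000-0000-0000-000000000000   # placeholder UUID
      resources:
        koordinator.sh/gpu-core: "100"
        koordinator.sh/gpu-memory: 80Gi
      topology:                     # consumed by the scheduler for topology-aware allocation
        socketID: 0
        nodeID: 0                   # NUMA node
        pcieID: 0
        busID: "0000:10:00.0"
    - type: rdma
      minor: 0
      health: true
      resources:
        koordinator.sh/rdma: "100"
      topology:
        socketID: 0
        nodeID: 0
        pcieID: 0
        busID: "0000:1f:00.0"
```

The topology block is what lets koord-scheduler prefer devices that sit under the same PCIe Switch or NUMA node, as discussed in the GPU & RDMA section below.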

Future: NRI mode

As mentioned earlier, the Koordinator single-node side relies on koord-runtime-proxy to complete device information injection. We also realize that koord-runtime-proxy is not easy to deploy in every cluster, because it requires modifying the kubelet startup parameters.

Therefore, the Koordinator community will introduce mechanisms such as NRI/CDI to solve the problem in this scenario. This work is being co-developed with the relevant Intel teams.

NRI/CDI is a plugin mechanism supported by containerd. Its deployment is somewhat similar to the familiar CNI, and it provides hooks to modify parameters or run custom logic before and after a Sandbox/Container is started. It is roughly equivalent to a runtime-proxy mechanism built into containerd.

Joint allocation of GPU & RDMA according to the hardware topology

As mentioned earlier, large model training not only uses GPUs but also relies on RDMA network devices. The latency between the GPU and the RDMA NIC must be kept as low as possible; otherwise the device-level latency is amplified across the whole distributed training network and slows down overall training efficiency.

This requires being aware of the hardware topology when allocating GPU and RDMA devices, and allocating them as close to each other as possible. Allocation is attempted in the order of the same PCIe Switch, the same NUMA node, the same NUMA socket, and finally across NUMA, with latency increasing at each step.
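For a sense of what a workload requesting joint allocation might look like, here is a hedged sketch of a Pod asking for GPUs plus an RDMA NIC via Koordinator extended resources. The resource names, units, and the joint-allocation annotation are assumptions based on our recollection of the Koordinator API and should be verified against the version you deploy.

```yaml
# Sketch of a training Pod requesting GPU and RDMA together (names/units are assumptions).
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  annotations:
    # Hypothetical hint asking the scheduler to co-allocate GPU and RDMA devices
    # under the same PCIe Switch / NUMA node where possible; the actual annotation
    # key and value format may differ between Koordinator versions.
    scheduling.koordinator.sh/device-joint-allocate: '{"deviceTypes": ["gpu", "rdma"]}'
spec:
  schedulerName: koord-scheduler
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.05-py3
      resources:
        limits:
          koordinator.sh/gpu: 200    # two whole GPUs, assuming 100 units per card
          koordinator.sh/rdma: 100   # one RDMA NIC, assuming percentage-style units
```

With the Device topology information above, the scheduler can prefer GPU/NIC pairs that hang off the same PCIe Switch before falling back to the same NUMA node, the same socket, or cross-NUMA placements.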

Moreover, we also found that different GPU models from the same hardware manufacturer have different hardware system topologies, which requires the scheduler to be aware of these differences. For example, the figures below are simplified device connection diagrams of the NVIDIA A100 and NVIDIA H100 system topologies.

NVIDIA A100 System Topology

The NVLink communication between GPUs on the NVIDIA A100 differs from that on the NVIDIA H100, and the number of NVSwitches also differs. These differences lead to large differences in how the machines are used.

NVIDIA H100

Differences in NVIDIA systems under multi-tenant mode

What is special about the NVIDIA H100 GPU in multi-tenant VM scenarios is that communication between GPUs must be realized by operating the NVSwitch.

In multi-tenant scenarios, NVIDIA manages the isolation state of NVLink through the NVSwitch to ensure security, and requires that only trusted software operate the NVSwitch. This trusted software can be customized.

NVIDIA supports multiple modes. One is the Full Passthrough mode, in which the GPUs and NVSwitches are passed directly through to the VM's Guest OS; its drawback is that NVLink bandwidth is reduced for two-GPU and four-GPU VMs (original wording: "Reduced NVLink bandwidth for two and four GPU VMs").

The other is called the Shared NVSwitch multi-tenant mode, in which only the GPUs are passed through to the Guest OS, while the NVSwitches are managed through a special VM called the Service VM, which calls NVIDIA Fabric Manager to activate the NVSwitches and enable communication between GPUs. This mode avoids the drawbacks of the Full Passthrough mode, but it is clearly more complicated to use. This special hardware architecture and usage model also leads to additional requirements when allocating GPUs: NVIDIA defines which GPU device instances may be combined and allocated together. For example, if a user requests 4 GPUs, they must be allocated as GPUs 1, 2, 3 and 4, or as GPUs 5, 6, 7 and 8, as prescribed; otherwise the Pods cannot run.

We do not know the reason behind this special allocation rule, but analyzing the constraints shows that the combinations prescribed by the manufacturer happen to match the hardware system topology, that is, they yield the allocation results expected by the GPU & RDMA joint allocation described above.

NVIDIA H100 System Topology

Author: Li Tao (Lv Feng)


This article is the original content of Alibaba Cloud and may not be reproduced without permission.
