Thinking and Practice of Complex Workload Hybrid Scheduling under Cloud Native Architecture

Author: laboratory Chen / Big Data Laboratory

On October 25th, the first China Cloud Computing Infrastructure Developers Conference was held in Changsha. Transwarp, together with many domestic and international vendors, held in-depth exchanges and discussions on cloud computing topics such as "cloud native", "security and fault tolerance", and "management and optimization". Transwarp's container cloud R&D engineers shared a talk titled "Thinking and Practice of a Complex Workload Hybrid Scheduler Based on Kubernetes", and this article is an edited write-up of that talk.

In recent years, the concept of cloud native has swept across the entire cloud computing field. The changes brought about by cloud native technology, represented by Kubernetes, have prompted enterprises to think deeply: more and more enterprises are gradually migrating their infrastructure to a cloud native architecture, and their business applications are being developed and deployed in line with the cloud native twelve-factor methodology. This shift in how technology is delivered has also accelerated the digital and intelligent transformation of enterprises.
In its early stage, cloud native technology was a natural fit for microservice architectures. With the rapid development of cloud native technology and the continuous consolidation of cloud native infrastructure, enterprises have gradually begun to "move" traditional big data analytics and computing applications onto the cloud native architecture as well. By now, cloud native infrastructure serving as the unified infrastructure within the enterprise has become an inevitable trend. However, using cloud native infrastructure as the unified foundation inevitably raises compatibility issues once the underlying platforms are consolidated, for example: how traditional big data tasks are orchestrated and scheduled under the cloud native architecture, and how the data locality advocated in big data systems can be properly achieved on a cloud native platform. So although a unified cloud native infrastructure is the general trend, there is still a long way to go.

Transwarp is an early practitioner of cloud native technology and has explored many directions to advance a unified cloud native infrastructure. Its data cloud platform product TDC is the result of years of accumulation and practice on this unified foundation. TDC covers the three capabilities of the analytics cloud, the data cloud, and the application cloud, satisfying on a single platform the enterprise requirements for building all three types of cloud platforms, and it hosts complex workload scenarios including data warehouses, streaming engines, analysis tools, DevOps, and other applications. To support this, Transwarp's underlying cloud platform has done a great deal of work over the years. Next, let's share our thinking and practice on a complex workload hybrid scheduler under a unified cloud native infrastructure.

—Unified Cloud Native Infrastructure—

After the concept of a unified cloud native infrastructure emerged, how to orchestrate and schedule multiple types of workloads became an urgent problem, including but not limited to MicroService, BigData, AI, and HPC workloads. Microservices are naturally supported by the cloud native architecture, so the pressing question is how to satisfy the orchestration and scheduling needs of the other workload types, typically community-representative computing tasks such as Spark and TensorFlow, and big data storage services such as HDFS and HBase. The open source community centered on Kubernetes has made corresponding attempts to address these needs. For example, Spark Operator and TensorFlow Operator solve the orchestration problem for such tasks, but the scheduling-related capabilities are still lacking. In addition, there are enterprise-level features of the big data ecosystem that the native Kubernetes scheduling capabilities do not support. To this end, we researched and thought through how to make up for the missing scheduling capabilities of Kubernetes against the background of a unified infrastructure.


—Big Data/AI Ecosystem Schedulers—

Let us first review the characteristics of the relevant schedulers in the big data/AI ecosystem. The main objects of study are Mesos and YARN.

  • Mesos

Mesos was born at the University of California, Berkeley, and after being open sourced it was used at Twitter. Its technical prototype was designed and implemented with reference to Google's internal scheduler. Mesos is a framework with a two-level scheduling architecture: Mesos itself focuses on resource allocation based on the DRF (Dominant Resource Fairness) algorithm, while how the resources of a specific task are managed and allocated is implemented by each Framework. Under such a flexible architecture, developers have very broad room to build on it. However, because Mesos itself does not provide many functional features, it never established a corresponding ecosystem, which made more and more users give up on it and turn to other projects. The features of Mesos are summarized as follows:
1. Two-level scheduling architecture, which is more flexible
2. Focuses on resource allocation based on the DRF algorithm (see the sketch after this list)
3. Customizable Frameworks implement resource scheduling and management for specific tasks
4. Supports scheduling of online, offline, and HPC tasks
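
To make the DRF idea concrete, here is a minimal sketch (illustrative, not Mesos code) of how a DRF-style allocator picks the next framework to offer resources to: each framework's dominant share is its largest fractional usage of any single resource, and the framework with the smallest dominant share goes next.

```go
package main

import "fmt"

// Resources maps a resource name (e.g. "cpu", "mem") to an amount.
type Resources map[string]float64

// dominantShare returns the framework's largest fractional usage of any resource.
func dominantShare(used, capacity Resources) float64 {
	max := 0.0
	for name, c := range capacity {
		if c == 0 {
			continue
		}
		if share := used[name] / c; share > max {
			max = share
		}
	}
	return max
}

// nextFramework picks the framework with the smallest dominant share,
// i.e. the one DRF would offer resources to next.
func nextFramework(usage map[string]Resources, capacity Resources) string {
	best, bestShare := "", 2.0 // dominant shares are always <= 1
	for fw, used := range usage {
		if s := dominantShare(used, capacity); s < bestShare {
			best, bestShare = fw, s
		}
	}
	return best
}

func main() {
	capacity := Resources{"cpu": 90, "mem": 180}
	usage := map[string]Resources{
		"spark": {"cpu": 30, "mem": 30}, // dominant share: 30/90 = 0.33 (cpu)
		"web":   {"cpu": 10, "mem": 90}, // dominant share: 90/180 = 0.50 (mem)
	}
	fmt.Println(nextFramework(usage, capacity)) // spark
}
```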

  • YARN

YARN is the native resource management and scheduling framework released with Hadoop 2.0. With the release of YARN, Hadoop completely established its unshakable position in the big data field: nearly all big data components and services can be scheduled and managed by YARN. Although its architecture is not as flexible as that of Mesos, YARN has the powerful Hadoop ecosystem behind it, so its development has been quite smooth, and its features and capabilities are valued by enterprises because they solve many problems in enterprise resource scheduling and management. Its characteristics are summarized as follows:
1. Single-level scheduling architecture, which is less flexible
2. Supports hierarchical resource queues that can map to multi-tenancy and the enterprise organizational structure (see the sketch after this list)
3. Supports resource sharing, flexible scheduling, and fair scheduling
4. Supports orchestration and scheduling of multiple kinds of big data tasks
5. Supports scheduling of online, offline, and HPC tasks
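
To make the notion of hierarchical queues concrete, the following sketch (illustrative only, not YARN's implementation) distributes a cluster's capacity down a weighted queue tree, which is how an organizational hierarchy can be mapped onto resource shares:

```go
package main

import "fmt"

// Queue is one node in a hierarchical resource queue tree.
type Queue struct {
	Name     string
	Weight   float64
	Children []*Queue
}

// fairShares splits `total` among a queue's descendants in proportion to
// their weights and records the resulting share of every leaf queue.
func fairShares(q *Queue, total float64, out map[string]float64) {
	if len(q.Children) == 0 {
		out[q.Name] = total
		return
	}
	var sum float64
	for _, c := range q.Children {
		sum += c.Weight
	}
	for _, c := range q.Children {
		fairShares(c, total*c.Weight/sum, out)
	}
}

func main() {
	root := &Queue{Name: "root", Children: []*Queue{
		{Name: "bi", Weight: 1, Children: []*Queue{
			{Name: "bi.reports", Weight: 3},
			{Name: "bi.adhoc", Weight: 1},
		}},
		{Name: "ml", Weight: 3},
	}}
	shares := map[string]float64{}
	fairShares(root, 400, shares) // e.g. 400 vcores for the whole cluster
	fmt.Println(shares)           // map[bi.adhoc:25 bi.reports:75 ml:300]
}
```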


—Kubernetes Native Scheduler—

Compared with the big data/AI ecosystem schedulers, the Kubernetes native scheduler has unique advantages in areas such as microservices and stateless applications. However, in the context of a unified cloud native infrastructure, the shortcomings of the Kubernetes native scheduler are constantly being magnified. Here are some of them:
1. Does not support resource scheduling under the multi-tenant model
2. Does not support the scheduling of big data and AI tasks
3. Does not support resource queues
4. Does not support resource sharing and flexible scheduling
5. Does not support fine-grained resource management and control
6. Does not support application-aware scheduling
7. Single scheduling and sorting algorithm


—Kubernetes Ecosystem Schedulers—

  • Volcano

Volcano ( https://volcano.sh/zh/ ) is a Kubernetes-native batch processing system open sourced by Huawei Cloud. It supports the scheduling of batch processing tasks and fills the gap left by the Kubernetes native scheduler in this respect. The architecture of Volcano is as follows:

[Figure: Volcano architecture]
Its main features include but are not limited to the following:
1. Supports scheduling of batch processing tasks, MPI tasks, and AI tasks
2. Supports a unified workload definition, orchestrating and scheduling different workloads through an added Job CRD
3. Supports heterogeneous Pod templates within a single Job, breaking the constraints of native Kubernetes resources
4. Supports resource queues, resource sharing, and flexible scheduling
5. Supports gang scheduling, fair scheduling, and other scheduling strategies

Although the Volcano project itself is excellent and provides many features that the Kubernetes native scheduler lacks, there may still be some limitations in the context of a unified cloud native infrastructure, for example: 1. If it is deployed in a multi-scheduler form (coexisting with the Kubernetes native scheduler), resource scheduling conflicts with the native scheduler may occur, so it is more suitable for deployment in a dedicated cluster; 2. The current version does not support multi-level hierarchical resource queues, so it cannot map well onto enterprise multi-tenant scenarios.

  • YuniKorn

The YuniKorn ( https://yunikorn.apache.org/ ) project is an open source project initiated by Cloudera. It is positioned as a cross-platform general-purpose scheduler, and its three-layer architecture is designed to adapt to multiple underlying platforms: Kubernetes is currently supported, while YARN support is still under development. The YuniKorn project was created so that batch processing tasks, long-running services, and stateful services can all be scheduled by one unified scheduler. The architecture of YuniKorn is as follows:
[Figure: YuniKorn architecture]
Its main features include but are not limited to the following:
1. Flexible, cross-platform architecture design
2. Supports scheduling of batch processing tasks, long-running services, and stateful services
3. Supports hierarchical resource pool/queue definitions
4. Supports fair scheduling of resources between queues
5. Supports resource preemption across queues based on the fairness policy
6. Supports GPU resource scheduling

When the YuniKorn project was created, it was designed after studying projects such as kube-batch (the implementation of Volcano's core functionality), so its design reflects more considerations than kube-batch and lays a further foundation for a unified scheduler. However, because its shim layer must constantly catch up with the capabilities of each underlying platform in order to keep pace with the community, it cannot be regarded as fully compatible with the Kubernetes native scheduler.

  • Scheduling Framework v2

While the Kubernetes ecosystem is booming, the community has not stood still. A new Scheduling Framework has been introduced since Kubernetes v1.16, which further unlocks the extensibility of the scheduler. Its core idea is to make every step of the Pod scheduling process pluggable as far as possible, and to rewrite all the original scheduling algorithms/policies as plugins that fit the new Scheduling Framework. The extension points are shown in the figure below:
[Figure: Scheduling Framework extension points]
Based on this extensibility, the community interest group also launched the Scheduler-Plugins ( https://github.com/kubernetes-sigs/scheduler-plugins ) project to demonstrate capabilities that the Kubernetes native scheduler does not have but that can be implemented on top of Scheduling Framework v2. Plugins such as GangScheduling, ElasticQuota, CapacityScheduling, and LoadAwareScheduling have already been implemented. Developers can compile the scheduler directly from that project, or import its plugins into a custom scheduler and compile that, thereby remaining fully compatible with the complete capabilities of the Kubernetes native scheduler while also enjoying the benefits of the extension plugins.
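
As an example of the second approach, a custom scheduler binary that stays fully compatible with the native scheduler while importing plugins from scheduler-plugins can be assembled roughly as follows (a sketch; the exact module paths, factory signatures, and available plugins depend on the Kubernetes and scheduler-plugins versions being used):

```go
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	// Plugins from the scheduler-plugins project; the package layout
	// depends on the release you vendor.
	"sigs.k8s.io/scheduler-plugins/pkg/capacityscheduling"
	"sigs.k8s.io/scheduler-plugins/pkg/coscheduling"
)

func main() {
	// Reuse the upstream kube-scheduler command so all native plugins,
	// flags, and configuration remain available, then register the extras.
	cmd := app.NewSchedulerCommand(
		app.WithPlugin(coscheduling.Name, coscheduling.New),
		app.WithPlugin(capacityscheduling.Name, capacityscheduling.New),
	)
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```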


—Thinking and Practice in TDC—

In the context of a unified cloud native infrastructure, TDC also faces the problem of how to schedule a mix of multiple workloads. Based on our investigation of the related open source projects and our thinking about TDC's own pain points, we put forward the following requirements for the scheduler:

1. Globally unique scheduler to prevent resource scheduling conflicts
2. Support resource scheduling in multi-tenant scenarios
3. Support reasonable scheduling of multiple workloads
4. Support resource sharing and flexible scheduling
5. Support fine-grained resource management and control
6. Support multiple scheduling algorithms
7. Support application-aware scheduling
8. Support multiple scheduling strategies

Combining the development and current state of the Kubernetes ecosystem schedulers described above, we designed a scheduler suited to TDC's scenarios, the Transwarp Scheduler, based on the community's native extension mechanism, Scheduling Framework v2.



—Transwarp Scheduler Design—

Drawing on the design ideas of excellent open source projects in the community, the Transwarp Scheduler has two cores: first, it is implemented entirely as extensions of the Scheduling Framework, to ensure full compatibility with the community's native capabilities; second, on that basis it abstracts and encapsulates the definition of resource queues to meet TDC's functional requirements at the resource scheduling level. To reduce the migration and learning cost for users of the Transwarp Scheduler, it introduces no new workload-related CRDs.

  • Resource queue

The resource queue is a CRD, which we named Queue. It has the following characteristics (an illustrative sketch of a possible spec follows the list):

1. Supports hierarchical definition
2. Supports weighted resource sharing between queues
3. Supports resource borrowing and reclaiming between queues
4. Supports fair scheduling between queues
5. Supports fine-grained resource management and control within a queue
6. Supports multiple queue sorting algorithms
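
The concrete API of the Queue CRD is internal to TDC, so the following Go types are a purely hypothetical sketch of what such a resource might express; all field names here are our illustration, not the actual Transwarp definition:

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// QueueSpec is a hypothetical spec for a hierarchical resource queue.
type QueueSpec struct {
	// Parent is the name of the parent queue; empty for the root queue.
	Parent string `json:"parent,omitempty"`
	// Weight controls the share of the parent's resources this queue
	// receives relative to its sibling queues.
	Weight int32 `json:"weight,omitempty"`
	// Capacity is the amount of resources guaranteed to this queue.
	Capacity corev1.ResourceList `json:"capacity,omitempty"`
	// MaxCapacity caps what this queue may borrow from idle siblings;
	// borrowed resources can be reclaimed when the siblings need them back.
	MaxCapacity corev1.ResourceList `json:"maxCapacity,omitempty"`
	// SortPolicy selects the pod sorting algorithm inside the queue,
	// e.g. "fifo" or "fair".
	SortPolicy string `json:"sortPolicy,omitempty"`
}

// Queue is one node of the hierarchical queue tree.
type Queue struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              QueueSpec `json:"spec"`
}
```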

Through such a resource queue definition, its hierarchical structure can be used to model resource quota management in enterprise multi-tenant scenarios and to implement shared, flexible quotas, breaking through the hard quota limits of the native Kubernetes ResourceQuota and achieving finer-grained resource management and control. At the same time, a specific sorting algorithm can be designated inside each queue to meet the particular needs of different organizations and departments. On top of the native Kubernetes scheduler capabilities, this continuously fills in the resource queue scheduling and management capabilities usually required in big data/AI scenarios. To associate Pod scheduling with resource queues during the scheduling process, we extend the Scheduling Framework with plugins. The main plugins are as follows:

1. QueueSort plugin: implements the QueueSort extension point, sorting Pods according to the sorting algorithm of the Queue they belong to; by default, fairness between different queues is based on the HDRF algorithm (a simplified sketch follows this list).

2. QueueCapacityCheck plugin: implements the PreFilter extension point, checking and pre-processing the resource usage of the Queue.

3. QueueCapacityReserve plugin: implements the Reserve extension point to lock the Queue resources that have been chosen for use, and the Unreserve extension point to release Queue resources that were locked but whose scheduling/binding failed.

4. QueuePreemption plugin: implements the PostFilter extension point to provide preemption when resources are reclaimed.
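
As an illustration of the first plugin above, a QueueSort plugin ultimately reduces to a single Less function. The sketch below is simplified and written against the Scheduling Framework interfaces of roughly the Kubernetes v1.20 era (signatures and import paths differ slightly between versions); the queue label and the ordering between queues are stand-ins for the real HDRF-based policy:

```go
package queuesort

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	// In older releases this package lives under pkg/scheduler/framework/v1alpha1.
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "ExampleQueueSort"

// queueOf reads the queue a pod was submitted to from a label.
// The label key here is purely illustrative.
func queueOf(p *v1.Pod) string {
	return p.Labels["scheduling.example.io/queue"]
}

// Sort is a minimal QueueSort plugin: pods from different queues are
// ordered by queue name (a stand-in for a real fairness policy such as
// HDRF), pods from the same queue are ordered by creation time (FIFO).
type Sort struct{}

var _ framework.QueueSortPlugin = &Sort{}

func (s *Sort) Name() string { return Name }

func (s *Sort) Less(a, b *framework.QueuedPodInfo) bool {
	qa, qb := queueOf(a.Pod), queueOf(b.Pod)
	if qa != qb {
		return qa < qb // placeholder ordering between queues
	}
	return a.Pod.CreationTimestamp.Before(&b.Pod.CreationTimestamp)
}

// New is the plugin factory registered with the scheduler command.
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &Sort{}, nil
}
```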

  • Resource queue binding

Besides the resource queue CRD, we also added a CRD that binds to the resource queue, named QueueBinding. The reason for adding QueueBinding is to let the definition of the resource queue focus only on resource-scheduling-level concerns, without having to care about its association with Kubernetes resources themselves, such as which namespaces are bound to the queue and how many Pods may be submitted to it. Such constraints are not the concern of the resource queue itself; if they were coupled into the resource queue definition, the queue's controller code would have to handle the corresponding changes as well. With the QueueBinding CRD, the resource queue is decoupled from its association with Kubernetes resources, and this constraint-checking logic is handled by the QueueBinding controller instead. The relationship between Queue, QueueBinding, and Kubernetes resources is as follows:
[Figure: Relationship between Queue, QueueBinding, and Kubernetes resources]
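
Again purely as an illustration of the decoupling described above (not the actual TDC definition), a QueueBinding could carry fields along these lines:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// QueueBindingSpec is a hypothetical spec tying a Queue to Kubernetes
// resources, keeping such concerns out of the Queue definition itself.
type QueueBindingSpec struct {
	// Queue is the name of the Queue this binding points at.
	Queue string `json:"queue"`
	// Namespaces whose pods are submitted to the queue.
	Namespaces []string `json:"namespaces,omitempty"`
	// MaxPods limits how many pods may be submitted through this binding.
	MaxPods *int32 `json:"maxPods,omitempty"`
}

// QueueBinding associates a Queue with namespaces and submission limits;
// its own controller, not the Queue controller, enforces these checks.
type QueueBinding struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              QueueBindingSpec `json:"spec"`
}
```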

  • Big data/AI scheduling capability expansion

With the resource queues introduced above, we achieve finer and more complete control at the resource level. However, resource management and control alone is not enough; scheduling strategies for specific workloads are also needed, especially in the big data/AI field where the native Kubernetes scheduler is not specialized. In the following sections we take the workloads of Spark and TensorFlow, the mainstream computing frameworks in the big data/AI field, as references and briefly explain how the corresponding scheduling strategies are implemented in the Transwarp Scheduler.

1. TensorFlow job scheduling

The tf-operator in the open source project Kubeflow solves the problem of how TensorFlow jobs are orchestrated on Kubernetes, allowing users to easily and quickly run stand-alone or distributed TensorFlow jobs in Kubernetes. However, at the Pod scheduling level it is still possible that, due to insufficient resources, some TensorFlow worker Pods are scheduled while the rest stay in the Pending state waiting for resources. Because the TensorFlow framework needs all workers up before training can proceed, the whole training task cannot execute, and the resources already occupied are wasted. Problems like this stem from the lack of a gang scheduling mechanism in Kubernetes: it cannot guarantee that all Pods of a job are either scheduled together or not scheduled at all, which would leave the resources to jobs that actually can be scheduled.

To make up for this missing capability, projects such as kube-batch, Volcano, and YuniKorn have implemented the gang scheduling strategy and adapted TensorFlow's workload definition on Kubernetes so that the corresponding scheduling policy takes effect. Similarly, GangScheduling/CoScheduling functionality is also implemented in the scheduler-plugins project. In the Transwarp Scheduler, referring to the characteristics of these implementations, the gang scheduling capability is provided by extending the QueueSort, PreFilter, Permit, PostBind, and other plugins to meet the scheduling requirements of TensorFlow-type tasks.
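
The heart of such a gang scheduling policy can be expressed at the Permit extension point: each pod of a group is held in the waiting state until enough of its siblings have also reached Permit, and then the whole group is released together (or times out). The sketch below is heavily simplified and is not the Transwarp or scheduler-plugins implementation; the group labels, the in-memory counter, and the fixed timeout are all illustrative, and real implementations also handle failures and un-reservation:

```go
package gang

import (
	"context"
	"strconv"
	"sync"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	Name = "ExampleGang"
	// Illustrative labels carrying the gang name and its required size.
	groupLabel = "scheduling.example.io/group"
	sizeLabel  = "scheduling.example.io/min-members"
)

type Gang struct {
	handle  framework.Handle
	mu      sync.Mutex
	arrived map[string]int // group name -> pods that have reached Permit
}

var _ framework.PermitPlugin = &Gang{}

func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &Gang{handle: h, arrived: map[string]int{}}, nil
}

func (g *Gang) Name() string { return Name }

// Permit holds each member of a group in "waiting" until the group's
// minimum size has been reached, then releases all waiting members.
func (g *Gang) Permit(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, _ string) (*framework.Status, time.Duration) {
	group := pod.Labels[groupLabel]
	if group == "" {
		return framework.NewStatus(framework.Success, ""), 0 // not a gang pod
	}
	min, err := strconv.Atoi(pod.Labels[sizeLabel])
	if err != nil {
		return framework.NewStatus(framework.Error, "bad min-members label"), 0
	}

	g.mu.Lock()
	g.arrived[group]++
	ready := g.arrived[group] >= min
	g.mu.Unlock()

	if !ready {
		// Wait for the rest of the gang; time out after 30s.
		return framework.NewStatus(framework.Wait, ""), 30 * time.Second
	}
	// Last member arrived: release every sibling that is still waiting.
	g.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if wp.GetPod().Labels[groupLabel] == group {
			wp.Allow(Name)
		}
	})
	return framework.NewStatus(framework.Success, ""), 0
}
```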

2. Spark job scheduling

The Spark project likewise has an open source spark-operator to solve its orchestration problem on Kubernetes. Spark can run on Kubernetes because the Spark community has supported Kubernetes as a resource manager since version 2.3. However, whether native Spark connects to Kubernetes directly or Spark jobs are deployed through spark-operator, there are resource-waiting problems similar to TensorFlow's that lead to resource deadlock or waste. For example, when multiple Spark jobs are submitted at the same time, the Driver Pods that start simultaneously may exhaust the resources, so that none of the Spark jobs can proceed normally, causing a resource deadlock.

The solution to this problem is similar to TensorFlow's gang scheduling strategy, which requires an all-or-nothing condition. The difference is that a Spark job does not need all of its Executors to be started before computation can begin, so it is only necessary to ensure that at least a certain number of Executor Pods can be scheduled; otherwise the Driver Pod should not be scheduled either. This achieves effective and efficient scheduling. In the Transwarp Scheduler, a configurable threshold is therefore added to the gang scheduling implementation to satisfy Spark job scheduling.
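
Concretely, the relaxed gate differs from strict gang scheduling only in its threshold: the Driver is admitted only if at least a minimum number of Executors can be placed alongside it, rather than all of them. A small illustrative predicate (names and the CPU-only model are ours, not Transwarp's):

```go
package main

import "fmt"

// App captures the per-pod resource demand of an illustrative Spark job.
type App struct {
	DriverCPU, ExecutorCPU float64
	DesiredExecutors       int
	MinExecutors           int // below this, starting the driver is pointless
}

// CanAdmit is the relaxed all-or-nothing check described above: admit the
// driver only if it fits together with at least MinExecutors executors;
// otherwise keep everything pending so a lone driver cannot hog resources.
func CanAdmit(a App, freeCPU float64) bool {
	if a.DriverCPU > freeCPU {
		return false
	}
	executorsThatFit := int((freeCPU - a.DriverCPU) / a.ExecutorCPU)
	if executorsThatFit > a.DesiredExecutors {
		executorsThatFit = a.DesiredExecutors
	}
	return executorsThatFit >= a.MinExecutors
}

func main() {
	job := App{DriverCPU: 1, ExecutorCPU: 4, DesiredExecutors: 10, MinExecutors: 3}
	fmt.Println(CanAdmit(job, 16)) // true: driver plus 3 executors fit
	fmt.Println(CanAdmit(job, 8))  // false: only 1 executor would fit
}
```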



—Transwarp Scheduler Architecture—

According to the design overview above, the architecture and internal interactions of the Transwarp Scheduler are as follows:

[Figure: Transwarp Scheduler architecture]

The entire Transwarp Scheduler consists of three parts:

  1. Scheduler: implements the core functionality of the scheduler based on the Scheduling Framework; the scheduling policy plugin extensions are compiled into the Scheduler.

  2. Controller Manager: adds the controllers for the Queue CRD and the QueueBinding CRD.

  3. Webhook: adds the admission webhook extension points for the Queue CRD and the QueueBinding CRD.



—Future Outlook—

Based on the design and implementation described above, the Transwarp Scheduler can meet many needs of the TDC product and resolve pain points that the native Kubernetes scheduler cannot address; it will be released together with subsequent TDC versions. In addition, the Transwarp Scheduler will continue to explore higher-level scheduling strategies, such as application-aware and load-aware scheduling, and will actively adopt and absorb feedback from the community while contributing some of the general designs and implementations back to the community.

The concept of cloud native was proposed many years ago, and with the rapid development of its ecosystem the concept keeps being redefined. Transwarp's data cloud platform product TDC continues to explore and advance in the wave of cloud native, moving steadily toward building a world-class data cloud platform product.
