The architectural evolution of cluster scheduling frameworks

Cluster schedulers are an important component of the modern data center, and their architecture has evolved considerably in recent years, shifting from monolithic designs to far more flexible, decentralized, and distributed designs. However, many modern open-source implementations are still monolithic, or lack many features that would be genuinely useful to real-world users.

This blog is the first in a series on task scheduling for large fleets. Resource scheduling is already well established at companies such as Amazon, Google, Facebook, Microsoft, and Yahoo, and demand is growing elsewhere. Scheduling matters because it directly affects the cost of running a cluster: a bad scheduler results in low utilization, wasting expensive hardware. High utilization is not achievable by the scheduler alone, either: workloads that contend for the same resources must be carefully configured and placed.
Architecture Evolution

This blog discusses how scheduling architectures have evolved in recent years, and why. Figure 1 illustrates the different approaches: a gray box represents a machine, circles of different colors represent tasks, and a box labeled "S" represents a scheduler. Arrows indicate scheduler placement decisions; the three colors correspond to different workloads (e.g., web serving, batch analytics, and machine learning).



Figure 1: Different scheduling architectures. Gray boxes represent cluster machines, circles represent tasks, and boxes labeled Si represent schedulers.

(a) Monolithic scheduling; (b) two-level scheduling; (c) shared-state scheduling; (d) fully distributed scheduling; (e) hybrid scheduling.

Many cluster schedulers, such as most high-performance computing (HPC) schedulers, the Borg scheduler, various early Hadoop schedulers, and the Kubernetes scheduler, are monolithic.

Monolithic scheduling

A single scheduler process runs on one machine (e.g., the JobTracker in Hadoop v1, kube-scheduler in Kubernetes) and assigns tasks to the other machines in the cluster. All workloads are handled by one scheduler, and all tasks run through the same scheduling logic (see Figure 1a). This is the simplest and most uniform form of the architecture, and many sophisticated schedulers have been built on top of it; examples include the Paragon and Quasar schedulers, which use machine-learning techniques to avoid resource contention between different workloads.
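As a deliberately simplistic illustration, a monolithic scheduler is essentially one loop over one view of the cluster. The sketch below is purely illustrative; the `Machine`, `Task`, and `schedule` names are invented for this example and do not correspond to any real system's API:

```python
# Toy monolithic scheduler: one process, one view of the cluster,
# one scheduling loop for every workload. Illustrative names only.
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    free_cpus: int

@dataclass
class Task:
    name: str
    cpus: int

def schedule(tasks, machines):
    """Assign each task to the first machine with enough free CPUs."""
    placements = {}
    for task in tasks:
        for m in machines:
            if m.free_cpus >= task.cpus:
                m.free_cpus -= task.cpus
                placements[task.name] = m.name
                break
    return placements
```

Every task, regardless of workload type, flows through this single `schedule` function, which is exactly why heterogeneous workloads put pressure on the design.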

Clusters today run many different types of applications (in contrast to the early days, when a handful of MapReduce jobs dominated), but using a single scheduler to handle such complex, heterogeneous workloads is tricky for several reasons:
1. The scheduler must treat long-running service jobs and batch analytics jobs differently, even though both sets of requirements are reasonable.
2. Since different applications have different needs, ever more features get added to the scheduler, increasing both its business logic and its deployment complexity.
3. The order in which the scheduler processes tasks becomes an issue: unless the scheduler is carefully designed, queuing effects (such as head-of-line blocking) and backlog can become a problem.

All in all, this sounds like an engineering nightmare, and the endless feature requests that scheduler maintainers receive confirm it.

Two-level scheduling

Two-level scheduling addresses this problem by separating resource allocation from task placement, which allows the task-placement logic to be tailored to each application's needs while preserving the ability to share cluster resources. Although with different emphases, both Mesos and YARN take this approach: in Mesos, resources are offered to the application-level schedulers, while in YARN the application layer requests resources (and then accepts the resulting allocations). Figure 1b illustrates the concept: workload-specific schedulers (S0–S2) interact with a resource manager, which carves out dynamic resource partitions for each workload. This gives users the flexibility to choose their own job-scheduling policies.
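The offer-based interaction can be sketched as follows. This is a toy reconstruction of the idea only; the `ResourceManager` and `FrameworkScheduler` names are invented and do not reflect the actual Mesos API:

```python
# Toy two-level scheduling: a resource manager offers free resources to
# per-framework schedulers, which decide placement within the offer.
# Illustrative names only, not the real Mesos interfaces.

class ResourceManager:
    def __init__(self, free_cpus_per_node):
        self.free = dict(free_cpus_per_node)  # node -> free CPUs

    def make_offer(self):
        # Offer everything that is currently free.
        return {n, c} if False else {n: c for n, c in self.free.items() if c > 0}

    def commit(self, node, cpus):
        self.free[node] -= cpus

class FrameworkScheduler:
    """Application-level scheduler: it sees only what it is offered."""
    def __init__(self, task_cpus):
        self.pending = list(task_cpus)

    def on_offer(self, offer, rm):
        launched = []
        for node, free in offer.items():
            # Launch pending tasks (FIFO) while the offered node has room.
            while self.pending and self.pending[0] <= free:
                cpus = self.pending.pop(0)
                rm.commit(node, cpus)
                free -= cpus
                launched.append((node, cpus))
        return launched
```

Note how `FrameworkScheduler` never reads `rm.free` directly: its view of the cluster is limited to the offer, which is precisely the visibility problem discussed next.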

However, two-level scheduling also introduces a problem: the application-level schedulers lose the global view of cluster resources, i.e., they can no longer see every possible placement option. Instead, they only see the resources that the resource manager actively offers them (offers, in Mesos) or grants in response to requests (request/allocate, in YARN). This causes several issues:
1. Priority preemption (i.e., higher-priority tasks evicting lower-priority ones) becomes difficult to implement. In an offer-based model, the resources occupied by running tasks are invisible to the upper-level schedulers; in a request-based model, the low-level resource manager must understand the preemption policy, which is application-specific.
2. Schedulers cannot react to running workloads that degrade resource efficiency (e.g., "noisy neighbors" saturating I/O bandwidth), because they cannot see those workloads at all.
3. Application-specific schedulers care about many nuances of the underlying resources, yet their only means of choosing among them is the offer/request interface provided by the resource manager, which can easily become complicated.

Shared-state architecture

The shared-state architecture addresses this by adopting a semi-distributed model, in which multiple application-level schedulers each independently update their own copy of the full cluster state, as shown in Figure 1c. Once a scheduler makes a local change, it issues an optimistically concurrent transaction to update the shared cluster state. A transaction may fail when another scheduler has concurrently committed a conflicting change.

The most prominent examples of the shared-state architecture are Google's Omega, Microsoft's Apollo, and Hashicorp's Nomad container scheduler. All of them materialize the shared cluster state in a single component: the "cell state" in Omega, the "resource monitor" in Apollo, and the "plan queue" in Nomad. Apollo differs from the other two in that its shared state is read-only and scheduling transactions are submitted directly to the cluster machines; the machines themselves check for conflicts and decide whether to accept or reject each update. This lets Apollo continue scheduling even when the shared state is temporarily unavailable.

Logically, a shared-state design does not require the full state to be materialized in a single place. Instead (somewhat as in Apollo), each machine can maintain its own state and send updates to interested agents such as schedulers, machine-health monitors, and resource-monitoring systems. Each machine's local state then becomes a "shard" of the globally shared state.
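The optimistic-concurrency core of this design can be sketched in a few lines. This is an illustrative toy under invented names, not Omega's actual interfaces:

```python
# Toy Omega-style shared state: each scheduler works on its own snapshot
# and commits a transaction that succeeds only if the resources it claims
# are still free. Illustrative names only.

class CellState:
    def __init__(self, free_cpus_per_node):
        self.free = dict(free_cpus_per_node)

    def snapshot(self):
        # A scheduler's private (and possibly stale) copy of the state.
        return dict(self.free)

    def try_commit(self, claims):
        """claims: node -> CPUs. Apply atomically, or fail on conflict."""
        if any(self.free[n] < c for n, c in claims.items()):
            return False  # another scheduler got there first
        for n, c in claims.items():
            self.free[n] -= c
        return True
```

Two schedulers may take snapshots showing the same free CPUs; whichever commits first wins, and the loser must retry against fresher state, which is the contention cost mentioned below.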

However, the shared-state architecture also has drawbacks: schedulers must act on stale information (unlike a centralized scheduler), and performance may degrade under high contention (although this also applies to other architectures).

Fully distributed architectures

Fully distributed architectures take decentralization even further: there is no coordination between schedulers at all, and many independent schedulers serve the incoming workloads, as shown in Figure 1d. Each scheduler acts purely on its own local (partial and often stale) view of the cluster. Typically, jobs can be submitted to any scheduler, and each scheduler can place tasks on any machine in the cluster. Unlike in two-level schedulers, there are no partitions for which individual schedulers are responsible; the global schedule and resource partitioning emerge from statistical multiplexing and random placement, somewhat like in the shared-state architecture, but without any central control.

Although the underlying concept (decentralized randomized choice) dates back to 1996, distributed scheduling in the modern sense arguably began with the Sparrow paper. It appeared amid a discussion of the advantages of fine-grained tasks, and the key assumption of the Sparrow paper is that task durations on clusters are becoming very short. The authors then argue that a huge number of short tasks means the scheduler must sustain a very high decision throughput, which a single scheduler cannot support (think millions of tasks per second), so Sparrow spreads this load across many schedulers.

This makes sense: less coordination theoretically means less arbitration overhead, and it suits certain kinds of workloads very well, which we will discuss later in the series. For now, suffice it to say that because distributed schedulers are uncoordinated, their logic must be much simpler than that of monolithic, two-level, or shared-state schedulers. For example:
1. Distributed schedulers are based on a simple "slot" concept, dividing each machine into n standard slots that run n parallel tasks; this simplification ignores the fact that tasks have different resource requirements.
2. They use worker-side queues with simple service disciplines (e.g., FIFO in Sparrow), which limits scheduling flexibility: the scheduler only decides on which machine to enqueue a task.
3. Because there is no central control, distributed schedulers have difficulty enforcing global invariants (e.g., fairness policies or strict priority precedence).
4. Because they are designed to make quick decisions based on minimal knowledge, distributed schedulers cannot support or afford complex, application-specific scheduling policies, so even avoiding interference between tasks is difficult for fully distributed scheduling.
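The placement idea behind such schedulers can be sketched as the classic "power of two choices": probe a few random workers and enqueue the task at the least-loaded one found. This is an illustrative reconstruction only; the real Sparrow refines the idea with batch sampling and late binding:

```python
# Toy power-of-two-choices placement: each task probes a few random
# workers and joins the shortest queue among them. Illustrative only.
import random

def place_task(queue_lengths, probes=2, rng=random):
    """Sample `probes` distinct workers; return the least-loaded one's index."""
    candidates = rng.sample(range(len(queue_lengths)), probes)
    return min(candidates, key=lambda w: queue_lengths[w])
```

With only two probes per task, no scheduler ever needs a global view, which is exactly why such designs scale so well, and why they cannot enforce global policies.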

Hybrid architectures

Hybrid architectures are a recent (and so far mostly academic) response to the drawbacks of fully distributed architectures, combining them with monolithic or shared-state designs. Examples such as Tarcil, Mercury, and Hawk generally have two scheduling paths: a distributed path for part of the workload (e.g., very short tasks or low-priority batch jobs) and a centralized path that handles the rest, as shown in Figure 1e. Each scheduler in a hybrid architecture behaves, for its portion of the workload, just like the corresponding architecture described above. In practice, however, as far as I know, no hybrid architecture has yet been deployed in a production system.
Practical significance

The relative merits of different scheduling architectures are discussed in many research papers, but the discussion is not confined to academia. For an in-depth, industry-perspective discussion of the Borg, Mesos, and Omega papers, see Andrew Wang's excellent blog. Moreover, many of the systems discussed above are deployed in large enterprise production environments (e.g., Microsoft's Apollo, Google's Borg, Apple's use of Mesos), and these systems have in turn inspired other open-source projects.

Today, many clusters run containerized workloads, so a series of container-focused "orchestration frameworks" has appeared, similar to what Google and others call "cluster management systems". However, there is little discussion of these frameworks' schedulers and their design principles; discussions focus more on the user-facing scheduling APIs (for example, Armand Grillet's report comparing Docker Swarm, Mesos/Marathon, and Kubernetes' default scheduler). Moreover, many users understand neither the differences between scheduling architectures nor which one best suits their applications.

Figure 2 shows a selection of open-source orchestration frameworks, their architectures, and the features their schedulers support. At the bottom, the table also includes Google's and Microsoft's closed-source systems for comparison. The resource-granularity column indicates whether the scheduler assigns tasks to fixed-size slots, or allocates resources along multiple dimensions (e.g., CPU, memory, disk I/O, network bandwidth).



Figure 2: Classification and function comparison of common open source orchestration frameworks, and comparison with closed source systems.

A key factor in choosing an appropriate scheduling architecture is whether your cluster runs a heterogeneous (i.e., mixed) workload, for example a mix of front-end services (say, load-balanced web servers and memcached) and batch data analytics (say, MapReduce or Spark). Mixing them makes sense to improve utilization, but the different applications need different scheduling approaches. In a mixed setting, a monolithic scheduler will likely produce sub-optimal task placements, since its single scheduling logic cannot be specialized per application; a two-level or shared-state scheduler may be a better fit.

Resources reserved for user-facing service workloads are generally sized for each container's peak demand, but in practice these reservations are under-used. In this setting, the ability to opportunistically reclaim over-allocated resources for low-priority workloads is key to an efficient cluster. Mesos is currently the only open-source system that supports such over-subscription, although Kubernetes has a fairly mature proposal for adding it. There is likely more to come here, since according to the Google Borg data, the utilization of many clusters is still below 60-70%. In subsequent posts we will discuss resource estimation, over-subscription, and efficient machine utilization.
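The basic reclamation arithmetic can be sketched as follows. The names, numbers, and safety margin are purely illustrative assumptions, not any real system's policy:

```python
# Toy over-subscription calculation: reclaim the gap between what
# high-priority services reserved and what they actually use, and lend
# it to low-priority (revocable) tasks. Illustrative policy only.

def reclaimable_cpus(reservations, usage, safety_margin=0.1):
    """Return the CPUs that could be offered as revocable resources.

    reservations/usage map job name -> CPUs reserved / currently used.
    A fraction of each reservation is kept back as head-room for spikes.
    """
    reclaimable = 0.0
    for job, reserved in reservations.items():
        used = usage.get(job, 0.0)
        slack = reserved - used
        reclaimable += max(0.0, slack - safety_margin * reserved)
    return reclaimable
```

The interesting systems problems, which later posts touch on, are estimating `usage` reliably and revoking the lent-out resources fast enough when the service's demand spikes.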

Finally, specialized analytics and OLAP-style applications (e.g., Dremel or SparkSQL queries) are a great fit for fully distributed scheduling. However, fully distributed schedulers (such as Sparrow) have fairly restricted built-in feature sets, and thus work best when the workload is homogeneous (i.e., all tasks run for roughly the same time), set-up times are low (i.e., tasks are scheduled onto long-running workers, like MapReduce application-level tasks in YARN), and task churn is high (i.e., many scheduling decisions must be made in a short time). We will discuss these conditions in detail in the next post, along with why fully distributed schedulers (and the distributed components of hybrid architectures) only make sense for such use cases.

For now, it suffices to say that distributed schedulers are simpler than the other architectures, but at the cost of not supporting multiple resource dimensions, over-subscription, or re-scheduling.

In conclusion, the table in Figure 2 shows that open-source frameworks still have a lot of room to catch up with the advanced but closed-source systems. Action is warranted on missing features, poor utilization, unpredictable task performance, noisy neighbors reducing efficiency, and schedulers that need heavy fine-tuning to support specific user needs.

However, there is plenty of good news: while many clusters still use monolithic schedulers today, many are starting to migrate to more flexible architectures. Kubernetes already supports pluggable schedulers (kube-scheduler can be replaced by another API-compatible scheduler pod), and from version 1.2 it supports "extenders" for customizing scheduling policies. Docker Swarm, as I understand it, will also gain pluggable scheduler support in the future.

Next steps



The next blog post will discuss whether a fully distributed architecture is a key technical requirement for scalable cluster scheduling (spoiler: not necessarily). After that, we will discuss resource-fitting strategies (important for utilization), and finally how our Firmament scheduling platform combines the qualities of shared-state architectures and monolithic scheduling with the performance of fully distributed schedulers.

Original article: The evolution of cluster scheduler architectures (translated by Yang Feng, via Dockerone)
