Three Papers on Operating System Schedulers

[Introduction] I have not forgotten that I am still working on the DingOS operating system, and I am not disoriented by the rise of LLMs. LLMs will become infrastructure and will empower the operating system, but the value of the operating system exists objectively, and will continue to exist unless computer architecture changes beyond recognition.

The scheduling problem of deciding where and when computations run is perhaps the most fundamental issue in any system that multiplexes resources. However, like many other important problems in computing (such as query optimization in databases), scheduler research swings like a pendulum between activity and dormancy, periodically dismissed as a "solved" problem.

Scheduling has always been one of the most fundamental operations in systems and networks. It involves assigning tasks to CPUs and switching between them, decisions that are critical to both application performance and system efficiency. Operating system (OS) scheduling has long focused on fairness.

However, two developments in recent years have led to a revival of OS scheduling research. First, the emergence of cloud computing introduced new, difficult-to-optimize metrics, such as tail latency at microsecond (µs) scales, that traditional schedulers were never designed for. Second, the end of Moore's Law has made specialization of the operating system stack (including scheduling) a necessity for continued performance gains.

Three papers from recent years represent breakthroughs in performance, scalability, and policy selection. The first challenges the assumed trade-off between low latency (typically achieved by provisioning dedicated cores) and high utilization (which requires reallocating cores), by making allocation decisions at single-microsecond granularity. The second decouples policy from mechanism: userspace agents fully handle policy creation and manipulation, while fixed kernel mechanisms deliver events to the agents and apply their scheduling decisions to applications. The third, building on the ability to make microsecond-scale allocation and load-balancing decisions under flexible policies, tackles the question of which policy each application should choose.

1. Microsecond-level core reallocation

The first paper, by Ousterhout et al., answers the fundamental question of how quickly core allocation can occur in an operating system and whether this reallocation benefits application performance. The system described in the paper, called Shenango, challenges the widely held notion that distributing cores across applications at the microsecond level is infeasible because of high overhead and potential cache pollution.

In this paper, the authors describe the design and implementation of Shenango, including how it achieves fast core reallocation and how it avoids the performance degradation that reallocation can cause. They also validate the system through extensive experiments, showing that fast core reallocation is indeed possible and that it yields significant performance advantages.


Shenango achieves microsecond-level core reallocation by relying on a dedicated scheduling core, which makes a CPU core allocation decision every 5 microseconds. To decide when to grant cores to, or reclaim cores from, applications, Shenango monitors the length of each application's thread run queue and network packet queue and uses their derivatives as congestion signals, which lets it detect congestion before queues grow long. The allocation algorithm runs entirely on the same dedicated core that also steers incoming network packets to their target applications, so no application core is burdened with scheduling work.
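The derivative-based congestion check can be approximated in a few lines. This is a hypothetical sketch, not the authors' code: an application is flagged as congested when items seen in its queues at the previous 5 µs check are still queued now, i.e. work has waited at least one full interval.

```python
class AppQueues:
    """Toy model of one application's queues as seen by the allocator."""

    def __init__(self):
        self.runq = []        # queued thread ids
        self.packets = []     # queued packet ids
        self._last_seen = set()

    def congested(self):
        """True if any item queued at the previous check is still queued."""
        current = set(self.runq) | set(self.packets)
        stuck = bool(current & self._last_seen)
        self._last_seen = current
        return stuck

app = AppQueues()
app.runq = ["t1"]
print(app.congested())   # False: first check, nothing was seen before
print(app.congested())   # True: t1 has now waited a full interval
app.runq = []
print(app.congested())   # False: the queue drained
```

Tracking set membership across checks is one simple way to detect "queue not draining" without measuring time per item; the real system works at much finer granularity.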

The authors demonstrate the effectiveness of this approach by showing how fine-grained core reallocation improves performance when latency-sensitive and batch applications coexist on the same system. By allocating cores based on the instantaneous incoming packet rate, Shenango with a 5-microsecond reallocation interval delivers lower latency for the latency-sensitive application, and more than 6x higher throughput for the batch application, than it does with a 100-microsecond interval. Subsequent research has shown that Shenango's microsecond-level scheduler can also help mitigate interference on other shared resources, such as cache and memory bandwidth, and can provide fine-grained feedback to the network to prevent overload.

2. Deploying an OS scheduling framework in Linux

Building an efficient scheduler like Shenango is an interesting lab exercise, but a production environment imposes many more requirements: compatibility with existing applications and operating systems (such as Linux), support for diverse workloads, and high scalability and reliability. To address these problems, engineers at Google built ghOSt, a framework for implementing different scheduling policies and deploying them into the Linux kernel.


The key rationale behind ghOSt's design is to increase the flexibility of the operating system. Taking inspiration from microkernels, ghOSt delegates OS scheduling to userspace agents, either globally or per CPU. The advantage is clear: agents can implement different scheduling policies for different needs and scenarios rather than being limited to rules baked into kernel code, so developers enjoy the convenience of userspace development without kernel constraints and long deployment cycles.

To enable seamless communication between userspace agents and the kernel, ghOSt uses shared memory to pass scheduling information, allowing agents to make informed decisions at low overhead. A minimal kernel scheduling class is the other key component of the design: it turns kernel scheduling events into messages the agents can consume, and applies the agents' decisions back inside the kernel.
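The kernel-to-agent channel can be pictured as a message queue. This is a minimal sketch with illustrative names (the real ghOSt API and message types differ): the kernel appends thread events to a shared queue, and the userspace agent drains it to build its run queue.

```python
from collections import deque

events = deque()   # stands in for the shared-memory message queue

def kernel_post(event_type, tid):
    """Kernel side: publish a thread event for the agent."""
    events.append({"type": event_type, "tid": tid})

def agent_drain(run_queue):
    """Agent side: a trivial policy that FIFO-queues every woken thread."""
    while events:
        ev = events.popleft()
        if ev["type"] == "TASK_WAKEUP":
            run_queue.append(ev["tid"])

runq = []
kernel_post("TASK_WAKEUP", 7)
kernel_post("TASK_WAKEUP", 9)
agent_drain(runq)
print(runq)   # [7, 9]
```

The point of the split is that `agent_drain` (the policy) can be rewritten freely in userspace, while `kernel_post` (the mechanism) stays fixed in the kernel.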

Overall, ghOSt's design makes the operating system more flexible and efficient, better able to meet the needs of different users. It gives developers the freedom to realize their own scheduling ideas, and gives users a more responsive system in return.

The biggest challenge ghOSt faces is the communication latency between the kernel and userspace agents, which can reach roughly 5 microseconds. This may result in:

(1) race conditions, e.g., a userspace agent schedules a thread onto a CPU that has since been removed from the thread's CPU mask;

(2) low utilization, as a CPU may sit idle waiting for the agent's scheduling decision.

ghOSt avoids race conditions with a transactional API on shared memory that lets agents commit scheduling decisions atomically. To mitigate the second problem, the authors use a custom eBPF program that runs locally on each core and schedules tasks temporarily until the agent's decision arrives. The same techniques apply when offloading other operating system functions to userspace, such as memory management.
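The transactional commit can be sketched as optimistic concurrency with a sequence number (names here are illustrative, not the real ghOSt interface): the agent snapshots a per-CPU sequence number when it decides, the kernel bumps the number on any intervening change, and a commit whose snapshot is stale is rejected instead of racing.

```python
class Cpu:
    """Toy per-CPU state guarded by a sequence number."""

    def __init__(self):
        self.seqnum = 0
        self.running = None

    def change_state(self):
        """Kernel-side change, e.g. the CPU left a thread's mask."""
        self.seqnum += 1

    def commit(self, tid, expected_seqnum):
        """Agent's atomic commit: apply only if nothing changed meanwhile."""
        if expected_seqnum != self.seqnum:
            return False          # stale decision: the agent must retry
        self.running = tid
        self.seqnum += 1
        return True

cpu = Cpu()
snap = cpu.seqnum
print(cpu.commit(42, snap))   # True: nothing changed since the snapshot

snap = cpu.seqnum
cpu.change_state()            # concurrent kernel-side change
print(cpu.commit(43, snap))   # False: commit rejected, no race
```

A rejected commit costs the agent a retry, but it can never place a thread on a CPU whose state it observed incorrectly.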

3. Choosing the best scheduling policy

With ghOSt, custom scheduling policies can be developed and deployed easily. The remaining question is which policy each application should use. To address it, McClure et al. conducted a comprehensive analysis and make the following recommendations:

First, consider the application's nature and requirements. Some applications must remain highly available and serve requests at all times, so they need highly tolerant policies; others scale frequently and therefore need policies that scale well. Understanding the application is key to choosing a scheduling policy.

Second, consider resource utilization in the data center. Applications running there typically share physical resources such as CPU, memory, and network bandwidth, so policies should be chosen to maximize use of those resources. Load-balancing policies, for example, spread load evenly across nodes and thereby raise utilization across the whole data center.

Finally, consider the cost of operation and administration. Some policies increase these costs, so there is a trade-off against performance; choose policies that meet the application's needs while keeping operational and management overhead low.

The authors divide scheduling into two distinct decisions: allocating cores among applications, and balancing tasks across the CPUs within each application. Surprisingly, they found that the second decision is relatively simple: a single load-balancing policy, work stealing, performs best in both latency and efficiency regardless of the task service-time distribution, the number of cores, the core allocation policy, and the load-balancing overhead.
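Work stealing, where an idle core takes tasks from a busy core's queue, is simple enough to show in miniature. This is an illustrative toy, not the paper's simulator: each core owns a deque, pops local work from the front, and steals from the back of a victim's queue when idle.

```python
from collections import deque
import random

def next_task(queues, me, rng=random):
    """Return the next task for core `me`, stealing if its queue is empty."""
    q = queues[me]
    if q:
        return q.popleft()                        # local work first
    victims = [i for i, v in enumerate(queues) if i != me and v]
    if victims:
        return queues[rng.choice(victims)].pop()  # steal from the other end
    return None                                   # truly idle

queues = [deque([1, 2]), deque()]
print(next_task(queues, 0))   # 1: core 0 pops its own queue
print(next_task(queues, 1))   # 2: core 1 is idle and steals from core 0
print(next_task(queues, 1))   # None: no work anywhere
```

Owners and thieves operating on opposite ends of the deque is the classic trick that keeps contention low: they only collide when a queue is nearly empty.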

Core allocation, in contrast, is much more complex. For example, contrary to past work, the authors found that for small tasks it is better to proactively reclaim an application's cores based on average latency or utilization than to wait for a CPU to become idle. They also found that when dealing with small tasks, statically assigning each application a fixed number of CPUs can beat dynamic allocation.
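As a toy illustration of proactive, utilization-based reclamation (the formula and the 0.8 target are assumptions for the sketch, not the paper's policy): rather than waiting for a core to go idle, reclaim cores as soon as average utilization implies the application can run near a target utilization on fewer of them.

```python
import math

def cores_to_reclaim(allocated, avg_utilization, target=0.8):
    """How many cores to take back so the remainder run near `target` busy."""
    needed = math.ceil(allocated * avg_utilization / target)
    needed = min(max(needed, 1), allocated)   # keep at least one core
    return allocated - needed

print(cores_to_reclaim(8, 0.5))   # 3: 8 half-busy cores fit on 5 at 80% busy
print(cores_to_reclaim(4, 0.9))   # 0: already running near the target
```

The appeal for small tasks is that reclamation triggers on a smoothed signal instead of on idleness, which for microsecond tasks flickers too quickly to act on.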


This analysis opens up new research directions, such as hardware support for scalable global queues, which in simulation perform even better than work stealing. In addition, the study does not consider preemption, so further research is needed on how preemption policies affect these scheduling decisions.

4. Summary

These three papers explore how to bring modern methods to operating system schedulers. The first focuses on building as fast a scheduler as possible; the second aims to simplify implementation and remain compatible with existing applications and operating systems; the third explores optimal scheduling policies for different types of applications. Together they make useful contributions toward better scheduling for modern computing systems: they highlight the need for more efficient and flexible OS schedulers, open up new research areas, and demonstrate the importance of continued innovation in scheduling policy.



Origin blog.csdn.net/wireless_com/article/details/131118418