Optimizing kernel locks to improve operating system performance

Performance is king: improving system performance is the pursuit of every engineer. Current performance optimization focuses on removing inefficiencies from the system software stack or bypassing high-overhead system operations. For example, kernel bypass achieves this by moving several operations into user space, and by refactoring the underlying operating system for certain classes of applications.

In many areas, specialization seems to be the answer to better performance, both in the application and in the kernel, and even between different kernel subsystems. In particular, specialization can supply the context in which an application requests certain functionality from the system. While application specialization and kernel bypass target storage, networking, and accelerators, concurrency control inside the kernel can be just as critical to overall performance.


1. Operating system performance: kernel locks

Kernel locks are a mechanism for controlling process access to shared resources. In the Linux kernel, when a process needs to access a shared resource, the kernel checks whether the corresponding lock is already held by another process. If it is, the process is added to the lock's wait queue and waits for the holder to release the lock.
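
As a concrete illustration, here is a minimal sketch of how kernel code typically protects a shared resource with a spinlock. The spinlock API shown is standard Linux kernel code; the counter and function names are invented for the example:

```c
#include <linux/spinlock.h>

/* Illustrative shared resource protected by a spinlock. */
static DEFINE_SPINLOCK(resource_lock);
static unsigned long shared_counter;

void update_shared_counter(void)
{
        spin_lock(&resource_lock);      /* contenders queue up until the holder releases */
        shared_counter++;               /* critical section: touch the shared resource */
        spin_unlock(&resource_lock);    /* hand the lock to the next waiter */
}
```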

Kernel locks are critical to achieving good application performance and scalability. However, kernel synchronization primitives are usually invisible to, and out of reach of, application developers. Designing locking algorithms and verifying their correctness is already challenging, and increasing hardware heterogeneity makes it harder still. Developers' lack of awareness of the environment in which a lock operates, with issues such as priority inversion and lock-holder preemption, is essentially a lack of context.

Is there a way for userspace applications to tune concurrency control in the kernel?

For example, a user could prioritize specific tasks or system calls that hold a set of locks. Users could enforce hardware-specific policies, such as asymmetric-multiprocessing-aware locking, or prioritize reads over writes for a given workload. If developers were allowed to tune the various locks in the kernel, change their parameters and behavior, and even switch between different lock implementations, it may be possible to further improve system performance.

Software stack specialization is a new way to improve application performance: it proposes pushing code into the kernel for performance, and improving application scalability by avoiding bottlenecks as core counts increase. Over time, even a monolithic kernel like Linux has begun to allow userspace applications to customize kernel behavior. Developers can use eBPF to customize the kernel for tracing, security, and even performance purposes.

In addition to eBPF, Linux developers are also using io_uring, a shared memory ring buffer between userspace and the kernel, to speed up asynchronous IO operations. Also, today's applications can handle on-demand paging entirely in user space.
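
For reference, here is a minimal liburing sketch of that shared-ring flow, assuming liburing is installed; error handling is elided for brevity:

```c
#include <fcntl.h>
#include <liburing.h>
#include <unistd.h>

int read_with_uring(const char *path, char *buf, unsigned len)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int fd = open(path, O_RDONLY);
        int res;

        io_uring_queue_init(8, &ring, 0);         /* set up SQ/CQ rings shared with the kernel */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, len, 0); /* describe the asynchronous read */
        io_uring_submit(&ring);                   /* hand it off through the shared ring */
        io_uring_wait_cqe(&ring, &cqe);           /* reap the completion */
        res = cqe->res;
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        close(fd);
        return res;
}
```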

Letting applications control the concurrency mechanisms of the underlying kernel opens up various opportunities for both lock designers and application developers.


2. Locks: past, present and future

Hardware is a major factor in determining the scalability of locks, and thereby of applications. For example, queue-based locks reduce the coherence traffic generated when multiple threads try to acquire a lock at the same time, while hierarchical locks use batching to minimize cache-line thrashing across sockets.
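
To make the queue-based idea concrete, below is a minimal userspace MCS-style queue lock in C11 atomics. It is an illustrative sketch of the principle (each waiter spins on its own cache line), not the kernel's actual qspinlock:

```c
#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;
};

struct mcs_lock {
        _Atomic(struct mcs_node *) tail;
};

void mcs_acquire(struct mcs_lock *l, struct mcs_node *me)
{
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        /* Join the queue atomically; each waiter then spins on its OWN
         * node, so contention does not hammer one global cache line. */
        struct mcs_node *prev = atomic_exchange(&l->tail, me);
        if (prev) {
                atomic_store(&prev->next, me);
                while (atomic_load(&me->locked))
                        ;                        /* local spinning only */
        }
}

void mcs_release(struct mcs_lock *l, struct mcs_node *me)
{
        struct mcs_node *succ = atomic_load(&me->next);
        if (!succ) {
                struct mcs_node *expected = me;
                if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
                        return;                  /* no waiter behind us */
                while (!(succ = atomic_load(&me->next)))
                        ;                        /* successor still enqueueing */
        }
        atomic_store(&succ->locked, false);      /* hand off to the next waiter */
}
```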

SHFLLock proposes a new approach to designing lock algorithms: decoupling lock policy from the lock implementation, which reduces kernel memory overhead and avoids performance degradation. Its key concept is a shuffler, which reorders the wait queue or modifies the state of waiting threads. While SHFLLock provides a way to enforce a policy, it focuses on common policies behind a simple set of lock acquisition/release APIs. To cater to application needs, application developers should be able to analyze the specific kernel locks affecting a given workload, define their own policies in a controlled and safe manner, dynamically update lock acquisition policies, and use the shuffler to enforce them.

3. Typical Scenario: Scheduling Threads Waiting for Locks

Threads waiting for a lock can be scheduled in two different ways: acquisition-aware scheduling, based on the order in which locks are acquired, and occupancy-aware scheduling, based on the time a thread spends inside the critical section.

3.1 Acquisition-aware scheduling

Lock switching enables developers to switch between different lock algorithms. Three situations stand out:

  1. Switching from a neutral reader-writer lock design to a per-CPU or NUMA-aware reader design for read-intensive workloads, for example page faults or enumerating files in a directory. The reverse case is switching from a neutral reader-writer lock to a pure write lock; an example is creating many files in a directory.

  2. Switching from a NUMA-based lock design to a combining approach with NUMA awareness, in which the lock holder performs operations on behalf of waiting threads. This performs better because it removes at least one cache-line transfer.

  3. Switching between blocking and non-blocking locks, and vice versa; for example, turning off the park/wake strategy of SHFLLock's shuffler converts a blocking read-write lock (rwlock) into a non-blocking one.

This approach brings two benefits. First, developers can remove makeshift synchronization, such as implementing park/wake strategies with non-blocking locks and wait events, a pattern common in the Btrfs file system. Second, it allows developers to unify lock designs by dynamically multiplexing multiple policies.
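
One plausible way to realize such switching is an indirection layer in which each lock carries a swappable operations table. The names below are hypothetical, not an existing kernel interface; a real implementation would also need a quiescent period to drain waiters before swapping algorithms:

```c
#include <stdatomic.h>

/* Hypothetical dispatch layer for switching lock algorithms at runtime. */
struct lock_ops {
        void (*lock)(void *impl);
        void (*unlock)(void *impl);
};

struct switchable_lock {
        _Atomic(const struct lock_ops *) ops;   /* currently active algorithm */
        void *impl;                             /* algorithm-specific state */
};

static inline void sw_lock(struct switchable_lock *l)
{
        atomic_load(&l->ops)->lock(l->impl);    /* dispatch to the active algorithm */
}

static inline void sw_unlock(struct switchable_lock *l)
{
        atomic_load(&l->ops)->unlock(l->impl);
}

/* Swap in a different algorithm, e.g. rwlock -> per-CPU reader lock. */
static inline void sw_switch(struct switchable_lock *l,
                             const struct lock_ops *new_ops, void *new_impl)
{
        l->impl = new_impl;                     /* assumes the lock is quiesced */
        atomic_store(&l->ops, new_ops);
}
```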


3.1.1 Lock inheritance

A process may acquire multiple locks to perform an operation. For example, a process in Linux can acquire up to 12 locks (during a rename operation, for instance), or on average 4 locks to perform memory or file-metadata management operations.

Unfortunately, this locking pattern exposes a problem with queue-based locks: some threads must wait longer to acquire a top-level lock that is held by another thread which is itself waiting for a different lock. For example, suppose thread t1 wants to acquire two locks, L1 and then L2, as one operation, and t2 only wants to acquire L1. Since these locking protocols are FIFO-based, t1 may end up at the tail of L2's queue while t2 is still waiting for t1 to release L1. The developer can provide more context to the kernel: either t1 acquires all the locks together, or t1 declares the locks it already holds, which can give it higher priority for acquiring the next lock L2.
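
Below is a hedged sketch of what such context passing could look like; `ctx_lock`/`ctx_unlock` and `struct lock_ctx` are hypothetical names invented for illustration, not a real kernel API:

```c
/* Hypothetical API: a thread declares the locks it already holds so the
 * shuffler can prioritize it in the next lock's wait queue. */
struct lock_ctx {
        int nheld;                  /* how many locks this thread holds */
        struct lock *held[8];       /* which locks those are */
};

void two_lock_op(struct lock *L1, struct lock *L2)
{
        struct lock_ctx ctx = { 0 };

        ctx_lock(L1, &ctx);         /* acquire L1 and record it in ctx */
        ctx_lock(L2, &ctx);         /* the shuffler sees we hold L1 and can
                                     * move us toward the head of L2's queue */
        /* ... operation spanning both locks (e.g. a rename) ... */
        ctx_unlock(L2, &ctx);
        ctx_unlock(L1, &ctx);
}
```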

An application may wish to prioritize a system call path or a set of tasks for better performance. Developers can do this by encoding the task-priority context and passing this information to the affected locks. For system calls, developers can share the set of locks and the prioritized threads on the critical path. The shuffler then prioritizes these threads over other threads waiting on the application's locks.

3.1.2 Exposing the Semantics of the Scheduler

In general, oversubscribing hardware resources such as CPU or memory yields better resource utilization, both for userspace runtime systems and for virtual machines. But while oversubscription improves hardware utilization, it also introduces the double-scheduling problem: the hypervisor may deschedule a vCPU that currently holds a lock or is next in line to acquire one inside the VM. The hypervisor can expose vCPU scheduling information to the shuffler, which can then prioritize vCPUs based on their runtime quotas.

3.1.3 Adaptive park/wake strategy

All blocking locks follow a spin-then-park strategy: a waiter spins for a certain period of time and then parks itself. The spin duration is mostly ad hoc: the waiter either spins for a fixed time slice or keeps spinning if there is no other task to run. Application developers can instead expose temporal context: after profiling critical-section lengths, they can tune spin times to minimize energy consumption, and expose wakeup information so that the next waiter is scheduled just in time, minimizing wakeup latency. Developers can further encode sleep information to wake waiters shortly before the lock becomes available, reducing long wake-up delays. This approach also works with paravirtualized spinlocks to avoid convoy effects.
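
Here is a minimal userspace sketch of the spin-then-park pattern, with the spin budget as a caller-supplied parameter, the "temporal context" an application could derive from profiled critical-section lengths. It uses the real futex(2) system call; the function itself is illustrative:

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wait until *state leaves the `busy` value: spin for spin_budget
 * iterations first, then park in the kernel via futex. */
static void spin_then_park(atomic_int *state, int busy, long spin_budget)
{
        for (long i = 0; i < spin_budget; i++) {
                if (atomic_load(state) != busy)
                        return;             /* freed while we were spinning */
        }
        while (atomic_load(state) == busy)  /* park; the releaser does FUTEX_WAKE */
                syscall(SYS_futex, state, FUTEX_WAIT, busy, NULL, NULL, 0);
}
```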

3.2 Occupancy-aware scheduling

3.2.1 Priority Inheritance

Priority inversion occurs when a normal-priority task ends up waiting on a lock held by a low-priority task. The problem shows up in the Linux IO stack: when submitting an IO request, a normal task that wants to acquire a lock can be blocked behind a lower-priority background task holding the same lock. The normal task's progress is then dictated by the scheduling of the background task, which degrades IO performance.
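
An existing kernel mechanism along these lines is the rt_mutex, whose priority-inheritance behavior boosts a low-priority holder when a higher-priority task blocks on it. A minimal sketch using the real rt_mutex API (the surrounding function is invented for the example):

```c
#include <linux/rtmutex.h>

static DEFINE_RT_MUTEX(io_path_lock);

void submit_io_request(void)
{
        /* If a higher-priority task blocks here while a low-priority
         * background task holds the lock, the holder is temporarily
         * boosted to the waiter's priority until it releases the lock. */
        rt_mutex_lock(&io_path_lock);
        /* ... issue the IO request ... */
        rt_mutex_unlock(&io_path_lock);
}
```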

3.2.2 Cooperative Scheduling for Task Fairness

Locks introduce a class of problems known as scheduler subversion: when two tasks hold a lock for very different lengths of time, the tasks that hold it for a long time subvert the operating system's scheduling goals. The operating system can address this by tracking critical-section lengths and penalizing long-running tasks. Although this solves the problem, it enforces scheduling fairness even for applications that may not benefit from it.

3.2.3 Task-Fair Locking on Asymmetric Multiprocessor (AMP) Machines

With cores of different compute capability in one processor, the basic locking primitives used on this architecture suffer from a scheduler subversion problem: application throughput can collapse because the weaker cores make slower progress. For faster progress, developers can either run lock holders on the faster cores or reorder the queue of threads waiting to acquire the lock, improving overall throughput.

3.2.4 Real-time scheduling

Similar to scheduling in real-time systems, application developers can create locking policies that always schedule threads so as to guarantee SLOs. Here, the lock can be designed as a phase-fair algorithm. This approach also makes it possible to eliminate jitter and to guarantee an upper bound on tail latency for latency-critical applications.

3.3 Dynamic lock profiling

Application developers can profile any kernel lock. Selecting which locks to profile lets developers work at different levels of granularity: they can profile all spinlocks running in the kernel, locks in specific functions, code paths, or namespaces, or even individual lock instances. This lets application developers better understand the underlying synchronization by analyzing only the parts they care about.

Developers can also reason about performance contracts that affect application performance, based on the guarantees provided by individual shuffler policies or even sets of policies.

4. An optimization framework for kernel locks

The framework redefines the decisions and behaviors used by kernel locks and exposes them as APIs. User-defined code replaces these exposed hooks, so users can customize locking behavior to their needs. For example, whether to spin before joining the wait queue could be exposed as an API, letting the user make that decision. The user first writes code that modifies the lock protocol in the kernel for their use case, and the operating system then replaces the annotated lock functions inside the kernel. The workflow is as follows:

[Figure: workflow for deploying a user-defined lock policy, steps (1) through (6) described below]

The user specifies a lock policy (1); after compilation, the eBPF verifier checks it, taking into account eBPF restrictions and mutual-exclusion safety properties (2, 3). The verifier then reports the verification result to the user (4) and, on success, the compiled eBPF code is stored in the file system (5). Finally, a live-patching module replaces the annotated functions of the specified lock (6).
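
As a hedged illustration, the spin-before-queue decision mentioned above might look like the hook below; the struct fields and function name are hypothetical, and a real policy would be compiled as an eBPF program:

```c
/* Hypothetical context handed to a user-defined lock policy. */
struct lock_policy_ctx {
        unsigned int nr_waiters;        /* current wait-queue length */
        unsigned int cs_len_ns;         /* profiled critical-section length */
};

/* User decision: nonzero = spin before joining the wait queue. */
int should_spin_before_queue(const struct lock_policy_ctx *ctx)
{
        /* Short critical sections and short queues favor spinning. */
        return ctx->nr_waiters < 4 && ctx->cs_len_ns < 2000;
}
```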

4.1 API

Various APIs support flexible implementation of lock policies while ensuring safety. The underlying implementation relies on eBPF to modify kernel locks: using eBPF and the lock API, a desired policy is applied to a set of lock instances in the kernel. A user can encode multiple policies, which are compiled to native code and checked for safety by the eBPF verifier. The verifier performs symbolic execution before the native code is loaded into the kernel, enforcing checks such as memory access control and allowing only whitelisted helper functions.


4.2 Security

In addition to the eBPF verifier, SHFLLock separates the lock acquisition phase from the wait-queue reordering phase. The user relies on an API function that compares the current node with the shuffler's node and decides whether to reorder the current node. For instance, one can design a scheduler-cooperative lock that prioritizes nodes with shorter critical sections, lowering the priority of long-running ones.
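
A hedged sketch of such a comparison callback follows; the node layout and names are invented for illustration:

```c
/* Hypothetical wait-queue node as seen by the shuffler. */
struct qnode {
        unsigned long cs_len;   /* profiled critical-section length */
        int prio;               /* encoded task/application priority */
};

/* Return nonzero if `curr` should be moved ahead of the shuffler's
 * position. The lock itself still enforces mutual exclusion; the
 * callback only reorders waiters. */
int shuffle_before(const struct qnode *shuffler, const struct qnode *curr)
{
        if (curr->prio != shuffler->prio)
                return curr->prio > shuffler->prio;  /* prioritized tasks first */
        return curr->cs_len < shuffler->cs_len;      /* shorter sections first */
}
```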

Although an incorrect user implementation may break a policy's fairness guarantees, the mutual exclusion property can be checked and ensured at runtime. The kernel also avoids deadlock issues because the API does not modify the lock's behavior; it only returns the decision of whether to move a node. Developers can configure their locks in a fine-grained manner by implementing the desired behavior for each callback. And while the lock function's behavior is unchanged, a heavyweight profiling policy may lengthen the critical section and degrade performance.

Additionally, eBPF allows chaining multiple eBPF programs, which users can exploit when writing policies. Finally, live-patching of data structures can be used to modify the data structures backing the locking primitives; for example, the queue-node data structure of a queue-based lock can be extended with additional fields that encode information for specific use cases. In the worst case, even without executing any user-space code, dynamically modifying the lock algorithm can incur up to 20% overhead.

4.3 Composing policies

By tuning the kernel's concurrency control, applications gain more control over the software stack. Application developers provide a set of policies for the locks their application needs. Composing multiple policies is a difficult task, especially when some policies conflict. Leveraging program composition to automate this process may make it possible to move safety properties entirely to verification in userspace, and also to provide a safe way to compose conflicting policies.

Users cannot add too many policies, because policy execution may fall on the critical path. Allowing a single privileged user to modify kernel locks is a model that works only when one user owns the entire system; handling multi-tenancy in a cloud environment requires a tenant-aware policy writer that does not violate isolation between users. Policies can be synthesized in user space to avoid such conflicts, with runtime checks added to the locking algorithms only where a policy can affect certain behaviors.

In addition to locks, other synchronization mechanisms are heavily used in the kernel, such as RCU, seqlocks, and wait events; extending this approach to them would further allow applications to improve their performance. Userspace applications also have their own locks, which are generic in nature. Existing techniques such as library interposition, by contrast, only allow switching lock implementations once, when the application starts executing.

5. Summary

Kernel lock synchronization primitives have a huge impact on the performance and scalability of some applications, yet controlling them remains out of application developers' reach. Contextual concurrency control allows userspace applications to fine-tune the kernel's concurrency primitives. This is a way of thinking about software stack specialization that may, in turn, accelerate innovation in kernel synchronization.

