Crimson: a new generation of Ceph OSD with high performance and high scalability

Background

As physical hardware continues to evolve, so does the hardware that storage software relies on.

On the one hand, memory and IO technologies have been developing rapidly, and hardware performance is increasing quickly. When Ceph was originally designed, it was typically deployed on mechanical hard disks, which could deliver hundreds of IOPS and tens of gigabytes of capacity. The latest NVMe devices, however, can deliver millions of IOPS and terabytes of capacity. DRAM capacity has increased by a factor of about 128 over roughly 20 years. For network IO, NICs are now capable of over 400Gbps, up from 10Gbps a few years ago.


On the other hand, CPU frequency and single-threaded performance per core have plateaued for about a decade, with only marginal gains. In contrast, the number of logical cores has grown rapidly as transistors have continued to shrink.


Keeping Ceph's performance up to speed with hardware has been challenging because Ceph's architecture is ten years old: its reliance on single-core CPU performance prevents it from taking full advantage of ever faster IO. In particular, because the Ceph Object Storage Daemon (OSD) relies on thread pools to handle different IOs, communication across CPU cores incurs significant latency overhead. Reducing or eliminating these overheads is a core goal of the Crimson project.

The Crimson project rewrites the Ceph OSD with a shared-nothing design and a run-to-completion model to meet the scaling demands of modern hardware and software systems, while remaining compatible with existing clients and components.

To understand how Crimson OSD is redesigned for CPU scaling, we compare the architectures of the traditional OSD and Crimson OSD and explain how and why the design differs. We then discuss why Crimson is built on top of the Seastar framework and how each core component can scale.

Finally, we share an update on the current status of this effort, along with a preliminary performance result and the results we ultimately hope to achieve.

Crimson and traditional OSD architecture

Ceph OSDs are part of a Ceph cluster and are mainly responsible for serving object access over the network, maintaining data redundancy and high availability, and persisting objects to local storage devices. As a rewrite of the traditional OSD, Crimson OSD is compatible with the existing RADOS protocol from the perspective of clients and other OSDs, providing the same interfaces and functionality. The responsibilities of the Ceph OSD modules such as Messenger, OSD Service, and ObjectStore have not changed much, but the way components interact and manage internal resources has been heavily refactored to adopt a shared-nothing design and bottom-up user-space task scheduling.

In the traditional OSD architecture, each component has its own thread pool, and processing tasks through shared queues is inefficient on machines with many CPU cores. As a simple example, a PG operation must first be handled by a messenger worker thread, which assembles or decodes the raw data stream into a message and puts it into the message queue for scheduling. A PG worker thread then picks up the message and, after the necessary processing, hands the request to the ObjectStore as a transaction.

After the transaction is committed, the PG completes the operation and sends the reply, again through the send queue and a messenger worker thread. Although the workload can be spread across multiple CPUs by adding more threads to the pools, these threads share resources by default and therefore need locks, which introduces contention.
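
To make the contention point concrete, here is a minimal sketch (not Ceph's actual code; the class and function names are invented) of the shared-queue thread-pool pattern described above. Every enqueue and dequeue has to take the same lock, which is exactly where contention grows as cores and task rates increase:

```cpp
// Minimal sketch: a shared task queue guarded by a mutex, the pattern the
// traditional OSD's thread pools follow (greatly simplified).
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SharedQueuePool {
  std::queue<std::function<void()>> queue_;
  std::mutex mutex_;                    // every producer and consumer serializes here
  std::condition_variable cond_;
  std::vector<std::thread> workers_;
  bool stopping_ = false;

public:
  explicit SharedQueuePool(unsigned nthreads) {
    for (unsigned i = 0; i < nthreads; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock lock(mutex_);          // lock point #1: dequeue
            cond_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (stopping_ && queue_.empty()) return;
            task = std::move(queue_.front());
            queue_.pop();
          }
          task();   // the task itself may take more locks (PG lock, store locks, ...)
        }
      });
    }
  }

  void submit(std::function<void()> task) {
    {
      std::lock_guard lock(mutex_);                 // lock point #2: enqueue
      queue_.push(std::move(task));
    }
    cond_.notify_one();                             // may force a context switch
  }

  ~SharedQueuePool() {
    {
      std::lock_guard lock(mutex_);
      stopping_ = true;
    }
    cond_.notify_all();
    for (auto& w : workers_) w.join();
  }
};
```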


A major challenge of the traditional architecture is that lock contention overhead grows quickly as the number of tasks and CPU cores increases; each lock point can become a scaling bottleneck. In addition, these locks and queues incur latency overhead even when there is no contention. Over the years, the community has done a lot of work on finer-grained resource management and fast paths that skip the queues, but the returns from this kind of optimization are diminishing, and scalability appears to have hit a bottleneck under the current architecture. There are other challenges too: with thread pools and task queues, latency problems are exacerbated as tasks are distributed among worker threads, and locks can force context switches, which makes things even worse.

The Crimson project addresses the CPU scalability problem with a shared-nothing design and a run-to-completion model. The point of this design is to pin a fixed thread to each core or CPU and distribute non-blocking tasks in user space. Because a request and its resources can be assigned to an individual core, it can be processed on that core until it completes. Ideally, all locks and context switches disappear, since each running non-blocking task owns the CPU until it finishes and no other thread can preempt it. If there is no need to communicate with other shards in the data path, performance will ideally scale linearly with the number of cores until the IO devices reach their limits. This design is a natural fit for Ceph OSD, because at the OSD level all IO is already sharded by PG.
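
As a rough illustration of this model (the names and structure are hypothetical, not Crimson's actual classes), the Seastar sketch below routes a request to its home shard once and then handles it there to completion, with no locks and no further hand-offs:

```cpp
// Sketch only: one reactor thread per core; a request is routed to its home
// shard and processed there to completion. Build against Seastar.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <iostream>

struct Request { unsigned id; };

// Runs entirely on the shard it was submitted to: no locks, no handoff.
seastar::future<> handle_request(Request r) {
  std::cout << "request " << r.id
            << " handled start-to-finish on shard "
            << seastar::this_shard_id() << "\n";
  return seastar::make_ready_future<>();
}

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    Request r{42};
    unsigned home_shard = r.id % seastar::smp::count;   // pick the owning core
    // The only cross-core step: hand the request to its home shard.
    return seastar::smp::submit_to(home_shard, [r] {
      return handle_request(r);
    });
  });
}
```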


While cross-core communication cannot be eliminated entirely, it is typically needed to maintain the OSD's global state rather than in the data path. A major challenge is that the most significant changes touch fundamental requirements of OSD operation: a considerable portion of the existing lock-based, multi-threaded code cannot be reused and has to be redesigned while maintaining backward compatibility.

A redesign requires a holistic understanding of the code and its caveats. Implementing the low-level one-thread-per-core, shared-nothing architecture and user-space scheduling is another challenge.

Crimson redesigns the OSD on top of Seastar, an asynchronous programming framework with all of the desirable properties needed to meet the goals above.

Seastar framework

Seastar is an ideal choice for the Crimson project because it not only implements a one-thread-per-core, shared-nothing architecture in C++, but also provides a comprehensive set of features and models that have proven effective for performance and scaling in other applications. Resources are not shared between shards by default, and Seastar implements its own memory allocator for lock-free allocation. The allocator is also NUMA-aware, so memory is allocated close to the shard that uses it. Some cross-core resource sharing and communication is unavoidable, and Seastar requires it to be handled explicitly: if a shard owns resources from another core, it must reference them through foreign pointers, and if a shard needs to communicate with other shards, it must submit or forward tasks to them. This forces programs to limit their cross-core activity and narrows the scope of analysis for CPU scalability issues. Seastar also implements high-performance, non-blocking mechanisms for cross-core communication.
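
The following sketch (illustrative only, assuming a recent Seastar version; the Counter type is invented) shows how cross-shard sharing is made explicit: an object owned by shard 0 is referenced elsewhere through a foreign_ptr, and any mutation is submitted back to the owning shard instead of being protected by a lock:

```cpp
// Sketch only: cross-shard resource access in Seastar must be explicit.
// A Counter owned by shard 0 is referenced elsewhere through a foreign_ptr,
// and all mutation is submitted back to the owning shard instead of locking.
#include <seastar/core/app-template.hh>
#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>   // foreign_ptr, make_foreign
#include <seastar/core/smp.hh>
#include <iostream>
#include <memory>

struct Counter { long value = 0; };

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    // app.run's callback executes on shard 0, which will own the counter.
    auto fp = seastar::make_foreign(std::make_unique<Counter>());
    unsigned other = seastar::smp::count > 1 ? 1u : 0u;
    return seastar::do_with(std::move(fp), [other](auto& owned) {
      // Hand only a reference to another shard; ownership stays on shard 0.
      return seastar::smp::submit_to(other,
          [raw = &*owned, owner = owned.get_owner_shard()] {
        // We are now on `other`. To change the counter we go back to the
        // owner shard rather than taking a lock on shared state.
        return seastar::smp::submit_to(owner, [raw] {
          ++raw->value;
          std::cout << "counter = " << raw->value << " (updated on shard "
                    << seastar::this_shard_id() << ")\n";
        });
      });
    });
  });
}
```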


Traditional programs built on asynchronous events and callbacks are very difficult to implement, understand, and debug, yet non-blocking task scheduling in user space requires pervasive asynchrony. Seastar uses futures, promises, and continuations (f/p/c) as building blocks to organize logic. Futures and promises make code easier to write and read by grouping logically connected asynchronous steps together rather than scattering them across plain callbacks. Seastar also provides higher-level facilities for loops, timers, and future-based control of lifetimes and even CPU shares. To further simplify applications, Seastar encapsulates network and disk access behind the same shared-nothing, f/p/c-based design patterns, so the complexity and fine-grained control of the underlying I/O stacks (epoll, linux-aio, io-uring, DPDK, etc.) stay transparent to application code.
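
A small example of the f/p/c style, with hypothetical stages standing in for real OSD work: the decode, process, and reply steps are chained as continuations, so the code reads in the order the logic runs:

```cpp
// Sketch: chaining asynchronous stages with futures/continuations.
// Each stage is non-blocking; the chain reads top-to-bottom like the
// logical flow (decode -> process -> reply) instead of nested callbacks.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>
#include <string>

using namespace std::chrono_literals;

seastar::future<std::string> decode_message() {
  // Pretend to wait for the rest of the bytes to arrive.
  return seastar::sleep(1ms).then([] { return std::string("osd_op"); });
}

seastar::future<int> process(std::string op) {
  std::cout << "processing " << op << "\n";
  return seastar::make_ready_future<int>(0);   // result code
}

seastar::future<> send_reply(int result) {
  std::cout << "reply sent, result = " << result << "\n";
  return seastar::make_ready_future<>();
}

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    // The whole request pipeline as one readable continuation chain.
    return decode_message()
        .then([](std::string op) { return process(std::move(op)); })
        .then([](int result) { return send_reply(result); });
  });
}
```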


Run-to-completion performance

The Crimson team has implemented most of the key OSD features needed for RBD client read and write workloads. Completed work includes re-implementing messenger V2 (msgr2), heartbeat, PG peering, backfill, recovery, object-classes, watch-notify, and more, and work continues on adding CI testing components. Crimson has reached the milestone where the run-to-completion design can be verified in a single shard with sufficient stability.

To reflect real-world conditions, single-shard run-to-completion is verified by comparing the traditional OSD and Crimson OSD with the BlueStore backend under the same random 4KB RBD workload without replication. Both OSDs are allocated 2 CPUs. Crimson OSD is a special case because Seastar requires an exclusive CPU core to run the single-shard OSD logic, which means the BlueStore threads must be pinned to the other core, and AlienStore is introduced to bridge the boundary between the Seastar threads and the BlueStore threads and to submit IO tasks across it. The traditional OSD has no such restriction on how its 2 CPUs are used.


The performance results show that with BlueStore, Crimson OSD improves random read performance by about 25% and delivers about 24% higher random write IOPS than the traditional OSD. Further analysis shows that in the random write case CPU utilization is low, with about 20% of the CPU spent on frequent polling, which suggests that Crimson OSD itself is not the current bottleneck.


Crimson OSD also incurs extra overhead for submitting and completing IO tasks and for synchronizing between the Seastar and BlueStore threads. We therefore repeated the same experiments with the memory-based MemStore backend, with 1 CPU assigned to each OSD. As shown in the figure below, Crimson OSD delivers about 70% higher IOPS for random reads and about 25% higher for random writes than the traditional OSD, consistent with the earlier conclusion that Crimson OSD can do better.

[Figure: random 4KB read/write IOPS of Crimson OSD vs. the traditional OSD with the memory-based backend]

Although the above scenarios only cover the experimental single-shard case, the results show the performance benefits of the Seastar framework: eliminating locks, removing context switches through user-space task scheduling, and allocating memory close to the CPU. It is also worth reiterating that the goal of the run-to-completion model is to scale better across CPUs and to remove the software bottlenecks that hold back high-performance hardware.

Multi-shard Implementation

The path to multi-shard operation is clear. Since IO is already logically sharded by PG, the IO path itself does not change much. The main challenge is to identify the unavoidable cross-core communication and design new solutions that minimize its impact on the IO path, which has to be analyzed case by case. In general, when an IO operation is received from the Messenger, it is directed to an OSD shard according to the PG-core mapping and runs in the context of that same shard/CPU until completion. Note that at this stage the RADOS protocol is deliberately left unmodified for the sake of simplicity.
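
The sketch below illustrates this dispatch step under simplified assumptions (pg_to_shard and PGOp are invented names; the real mapping in Crimson is more involved): the PG id determines the owning shard, the messenger side makes the single cross-core hop, and the op then runs to completion on that shard:

```cpp
// Sketch: map a PG to its owning shard and forward an op in one hop.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <cstdint>
#include <iostream>

struct PGOp {
  uint64_t pg_id;
  uint64_t object;
};

// Static PG -> core mapping: every op for the same PG lands on the same shard.
inline unsigned pg_to_shard(uint64_t pg_id) {
  return pg_id % seastar::smp::count;
}

seastar::future<> do_pg_op(PGOp op) {
  // From here on, all PG state for op.pg_id lives on this shard,
  // so the op runs to completion without further cross-core hops.
  std::cout << "pg " << op.pg_id << " op on object " << op.object
            << " handled on shard " << seastar::this_shard_id() << "\n";
  return seastar::make_ready_future<>();
}

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    PGOp op{/*pg_id=*/7, /*object=*/1234};
    // The messenger shard decodes the message, then makes the single
    // cross-core hop to the PG's home shard.
    return seastar::smp::submit_to(pg_to_shard(op.pg_id), [op] {
      return do_pg_op(op);
    });
  });
}
```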


Messenger

The Messenger plays an important role in making the solution scalable, and there are some limitations that need careful consideration. One comes from the RADOS protocol, which defines only one connection per client or peer OSD. A connection has to live on a particular core so that messages can be encoded and decoded efficiently, without locks, based on the connection state. Sharing connections to peer OSDs means that cross-core message delivery to multiple PG shards is unavoidable at this stage, unless the protocol is adjusted to allow an exclusive connection per shard.

Another limitation comes from the Seastar framework, which does not allow a Seastar socket to be moved to another core after it has been accept()ed or connect()ed. This is a challenge for lossless connections (msgr2), because it affects the interaction between the Messenger and OSD services: a connection may need to move to another core when it is re-established after a network failure.

Much of the work in scaling the Messenger is about spreading the messaging workload (encoding, decoding, compression, encryption, buffer management, and so on) across cores optimally, while keeping delivery in the IO path to at most one cross-core hop per message send or receive under the above constraints.

OSD

The OSD is responsible for maintaining the global state and activities shared between PG shards, including heartbeat, authentication, client management, the osdmap, PG maintenance, and access to the Messenger and ObjectStore.

A simple principle for the multi-core Crimson OSD is to keep all processing of shared state on dedicated cores. If an IO operation needs a shared resource, it either accesses the dedicated core sequentially or accesses an exclusive copy of the shared information that is kept synchronized.

There are two main steps toward this goal. The first is to let IO operations run on multiple OSD shards according to the PG sharding strategy while all global information, including PG state, is maintained on the first shard. This step enables sharding in the OSD, but all decisions about IO scheduling still have to be made on the first shard. Even though the Messenger can run on multiple cores at this point, every message still has to be delivered to the first shard for preparation (such as PG peering) and to determine the correct PG shard before being submitted to that shard. This causes extra overhead and unbalanced CPU usage (high load on the first OSD shard, low load on the others). The next step is therefore to extend the PG-core mapping to all OSD shards.
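
As one possible shape of the "synchronized exclusive copy" approach mentioned above (OSDState and its epoch field are illustrative, not Crimson's actual types), each shard keeps its own copy of read-mostly global state, and updates are broadcast to every shard:

```cpp
// Sketch: keep a per-shard copy of read-mostly shared state (e.g. the osdmap
// epoch) and synchronize it by broadcasting updates to every shard, instead
// of protecting one shared copy with a lock.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <cstdint>
#include <iostream>

class OSDState {
  uint64_t osdmap_epoch_ = 0;          // this shard's private copy
public:
  seastar::future<> stop() { return seastar::make_ready_future<>(); }
  void set_epoch(uint64_t e) { osdmap_epoch_ = e; }
  uint64_t epoch() const { return osdmap_epoch_; }
};

int main(int argc, char** argv) {
  seastar::app_template app;
  seastar::sharded<OSDState> state;    // one instance per shard
  return app.run(argc, argv, [&state] {
    return state.start().then([&state] {
      // A new osdmap arrived: push the new epoch to every shard's copy.
      return state.invoke_on_all([](OSDState& s) { s.set_epoch(42); });
    }).then([&state] {
      // Any shard can now read its local copy without cross-core traffic.
      std::cout << "local epoch = " << state.local().epoch() << "\n";
      return state.stop();
    });
  });
}
```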

ObjectStore

Crimson supports three ObjectStore backends: AlienStore, CyanStore, and SeaStore. AlienStore provides backward compatibility with BlueStore. CyanStore is a test backend implemented in volatile memory. SeaStore is a new object store designed for Crimson OSD with a shared-nothing design. Depending on the specific goals of each backend, the path to multi-shard support differs.

1. AlienStore

AlienStore is a thin proxy running in Seastar threads that talks to BlueStore, which runs in POSIX threads. No special work is needed for multiple OSD shards, since the cross-thread IO task communication is already synchronized. Nothing else in BlueStore is customized for Crimson: BlueStore cannot realistically be extended to a shared-nothing design because it depends on the third-party RocksDB project, which is still thread-based. However, until Crimson has a sufficiently optimized and stable native storage backend (SeaStore), accepting reasonable overhead in exchange for a mature, full-featured storage backend is acceptable.

2. CyanStore

CyanStore in Crimson OSD corresponds to MemStore in the traditional OSD. The only change for multi-shard support is to create a separate CyanStore instance per shard. One goal is to ensure that in-memory IO operations complete on the same core, which helps identify any scalability issues at the OSD level. Another is to allow direct performance comparisons with the traditional OSD at the OSD level, without the complexity of a real ObjectStore getting in the way.

3. SeaStore

SeaStore is Crimson OSD's native ObjectStore solution, developed with the Seastar framework and following the same design principles.

Although challenging, Crimson has to build a new local storage engine for several reasons. First, the storage backend is the main consumer of CPU resources; if the Crimson OSD's storage backend does not change, the OSD as a whole cannot truly scale with the number of cores. Our experiments also show that Crimson OSD is not the bottleneck in the random write scenario.

Second, the CPU-intensive, transactional metadata management in BlueStore is essentially provided by RocksDB, which cannot run in native Seastar threads without being reimplemented. Rather than reimplement a general-purpose transactional key-value store for BlueStore, it is better to rethink and customize the architecture at a higher level, at the ObjectStore layer. Problems are easier to solve in a native solution than in a third-party project, because a third-party project has to cater to general-purpose use cases.

The third consideration is native support for heterogeneous storage devices and hardware accelerators, allowing users to balance cost and performance according to their needs. If Crimson has better control over the entire storage stack, it has more flexibility to simplify solutions for deploying different hardware combinations.

SeaStore already works for single-shard reads and writes, although work remains on stability and performance; current efforts still focus on the architecture rather than corner-case optimizations. It is explicitly designed for multi-shard OSDs. As with CyanStore, the first step is to create a separate SeaStore instance per OSD shard, each running on a static partition of the storage device. The second step is to implement a shared disk-space balancer that dynamically adjusts the partitions; it should be able to run asynchronously in the background, because PGs already distribute user IO pseudo-randomly. The number of SeaStore instances need not equal the number of OSD shards, and tuning this ratio based on performance analysis is the third step, left for later work.
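
A back-of-the-envelope sketch of the first step, static partitioning (the numbers, alignment, and names are illustrative; SeaStore's real on-disk layout is more sophisticated): each shard gets an exclusive, aligned slice of the device that its SeaStore instance manages alone:

```cpp
// Sketch: statically partitioning one block device among per-shard
// SeaStore-like instances. Illustrative only.
#include <cstdint>
#include <cstdio>

struct ShardPartition {
  uint64_t offset;   // byte offset of this shard's region on the device
  uint64_t length;   // size of the region
};

ShardPartition partition_for_shard(uint64_t device_size, unsigned nshards,
                                   unsigned shard, uint64_t align = 1 << 20) {
  // Split the device evenly and round each region down to the alignment,
  // so every shard gets an exclusive, aligned slice it can manage alone.
  uint64_t per_shard = (device_size / nshards) / align * align;
  return ShardPartition{shard * per_shard, per_shard};
}

int main() {
  const uint64_t device_size = 1ull << 40;   // e.g. a 1 TiB NVMe device
  const unsigned nshards = 4;
  for (unsigned s = 0; s < nshards; ++s) {
    auto p = partition_for_shard(device_size, nshards, s);
    std::printf("shard %u: offset=%llu length=%llu\n", s,
                (unsigned long long)p.offset, (unsigned long long)p.length);
  }
}
```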

Summary and test configuration

In this post we describe why and how the Ceph OSD is being refactored to keep up with hardware developments, present the design work done so far along with a simple performance result, and cover most of the factors that must be considered for Crimson OSD to achieve true multi-core scalability.

Test results may vary with different commits and software and hardware configurations. To ensure our tests are repeatable and reproducible and can serve as a reference in the future, we list all the settings and considerations that may have an impact.

We deployed a local Ceph cluster for both Crimson and the traditional OSD and ran FIO tests using CBT. Crimson still has issues with tcmalloc, so to be fair we configured both OSDs to use the libc memory allocator. We use BlueStore and disable the RBD cache. The number of BlueStore threads is set to 4 for better results. When deploying Crimson, ceph-osd_cmd (crimson-osd) needs to be specified. CPU binding is specified via crimson_cpusets in the CBT configuration file, and the BlueStore threads are configured via crimson_alien_thread_cpu_cores and crimson_alien_op_num_threads. For the traditional OSD, numactl is used to control CPU binding. The rest of the deployment process follows the CBT repository unchanged.

Testing scenarios:

  • Client: 4 FIO clients

  • IO mode: random write and then random read

  • Block size: 4KB

  • Time: 300s X 5 times to get the average results

  • IO-depth: 32 X 4 clients

  • Create 1 pool using 1 replica

  • 1 RBD image X 4 clients

  • The size of each image is 256GB

Test environment:

  • Ceph version (SHA1): 7803eb186d02bb852b95efd1a1f61f32618761d9

  • Ubuntu 20.04

  • GCC-12

  • 1TB NVMe SSD as BlueStore block device

  • 50GB RAM for MemStore and CyanStore

