A Survey of Ceph Distributed Storage System Architecture Research

The Ceph project was developed in 2006 by Sage Weil at the University of California, Santa Cruz. At the time, he found that metadata lookup and maintenance seriously limited the performance and scalability of distributed file systems such as Lustre, so he designed a method called CRUSH that uses an algorithm to determine the mapping between data and storage nodes. The Linux kernel 2.6.34, released in May 2010, began to support Ceph. Weil also founded Inktank to focus on Ceph development, and in May 2014 the company was acquired by Red Hat. Because Ceph supports three storage access interfaces at the same time, it is widely used in open-source private cloud platforms to provide virtual machine storage and object access capabilities.

After Ceph was open-sourced, a large number of companies and developers devoted effort to developing new features and functions. Table 1 lists the important Ceph releases since the project was open-sourced; the major release versions are named in alphabetical order.

[Table 1: Major Ceph releases since the project was open-sourced]

The design goal of the Ceph storage system is to provide high-performance, highly scalable, and highly available distributed storage services. It uses RADOS to provide a stable, scalable, high-performance single logical object store, a system capable of self-adaptation and self-management on dynamically changing, heterogeneous clusters of storage devices. Data placement uses the CRUSH algorithm: the client computes an object's location with the algorithm and accesses the storage node directly, without consulting a metadata server, which gives CRUSH better scalability and performance. This article introduces Ceph's cluster architecture, data placement method, and data read and write paths, and on that basis analyzes its performance characteristics and bottlenecks.

Cluster Architecture

RADOS provides highly reliable, high-performance, fully distributed object storage services. Objects can be distributed according to the real-time status of each node in the cluster, and failure domains can be customized to adjust the data distribution. Both block devices and files are abstracted and packaged as objects, and an object is an abstract data type with both safety and strong-consistency semantics. RADOS can therefore achieve dynamic data and load balancing in large-scale heterogeneous storage clusters.

The object storage device (OSD) is the basic storage unit of a RADOS cluster. Its main functions are to store, replicate, and recover data, and to perform load balancing and heartbeat checks with other OSDs. A hard disk usually corresponds to one OSD, which manages the storage on that disk, although a partition can also serve as an OSD. Each OSD provides complete, local object storage with strong consistency semantics. The MDS is the metadata server; it handles the metadata requests issued by CephFS and converts client requests for files into requests for objects. A RADOS cluster can run multiple MDS daemons to share the metadata query workload.

[Figure: RADOS cluster architecture]

Data Placement Algorithm

The key to RADOS's high scalability is that it completely abandons the central metadata node of traditional storage systems and replaces it with CRUSH (Controlled Replication Under Scalable Hashing), a controlled replica distribution algorithm based on scalable hashing. Using the CRUSH algorithm, a client can compute which OSDs hold the object it wants to access.

Compared with previous methods, CRUSH's data management mechanism is better: it distributes the placement work to all clients and OSDs in the cluster, so it scales very well. CRUSH uses intelligent data replication to ensure resilience, making it well suited to very large-scale storage. As shown in the figure below, files are logically mapped to objects and then to placement groups (PGs); the mapping from PGs to OSDs uses the CRUSH algorithm, which ensures the corresponding data can still be located when cluster nodes are added or removed.

[Figure: Mapping from files to objects, placement groups (PGs), and OSDs]
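To make the two-stage mapping concrete, here is a toy Python sketch of the idea (not Ceph's actual code; the PG count, OSD names, and hash choices are invented for illustration): the object name is hashed to a placement group, and the PG id is then fed into a deterministic, CRUSH-like pseudo-random selection that any client can compute locally.

```python
import hashlib

PG_NUM = 128                                 # placement groups in a hypothetical pool
OSDS = [f"osd.{i}" for i in range(12)]       # made-up cluster of 12 OSDs
REPLICAS = 3

def object_to_pg(obj_name: str) -> int:
    """Stage 1: hash the object name into a placement group (stable and stateless)."""
    h = int.from_bytes(hashlib.md5(obj_name.encode()).digest()[:4], "little")
    return h % PG_NUM

def pg_to_osds(pg_id: int) -> list:
    """Stage 2: CRUSH-like deterministic pseudo-random choice of REPLICAS OSDs.
    Real CRUSH walks a weighted hierarchy (hosts, racks, ...); this toy version only
    shows that the mapping is computed, not looked up in a central table."""
    chosen, r = [], 0
    while len(chosen) < REPLICAS:
        h = hashlib.md5(f"{pg_id}.{r}".encode()).digest()
        candidate = OSDS[h[0] % len(OSDS)]
        if candidate not in chosen:          # retry on collision, as CRUSH does
            chosen.append(candidate)
        r += 1
    return chosen

pg = object_to_pg("rbd_data.1234.0000000000000000")
print(pg, pg_to_osds(pg))   # any client computes the same answer with no metadata server
```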

According to Weil's research, the CRUSH algorithm scales quite well and can still maintain good load balancing with thousands of OSDs. But this remains largely a theoretical result; no researchers have published test results from a production environment of several petabytes.

The CRUSH algorithm is one of Ceph's two original innovations and the cornerstone of RADOS. Building on consistent hashing, CRUSH carefully considers the isolation of failure domains and can implement replica placement rules for various workloads, such as placement across machine rooms and rack awareness. The CRUSH algorithm supports two data redundancy schemes, multi-replica and erasure coding, and provides four different bucket types (Uniform, List, Tree, Straw), fully taking into account the way hardware is deployed and replaced iteratively in real production environments.
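The following simplified sketch captures the idea behind the Straw family of buckets (the ln() scaling shown is the refinement used by the later Straw2 variant; the host names and weights are made up): every candidate draws a hash-derived "straw", the longest straw wins, and a device with twice the weight receives roughly twice the data.

```python
import hashlib
import math

def straw2_select(pg_id: int, items: dict) -> str:
    """Pick one item per draw: each item computes a hash-derived 'straw' and the
    longest straw wins. Dividing ln(u) by the weight (Straw2's trick) makes the
    win probability proportional to the weight, so heavier devices hold more data."""
    best_item, best_straw = None, -math.inf
    for name, weight in items.items():
        h = int.from_bytes(hashlib.md5(f"{pg_id}:{name}".encode()).digest()[:8], "big")
        u = (h + 1) / 2**64                  # uniform value in (0, 1]
        straw = math.log(u) / weight         # ln(u) <= 0, so a larger weight gives a longer straw
        if straw > best_straw:
            best_item, best_straw = name, straw
    return best_item

hosts = {"host-a": 1.0, "host-b": 1.0, "host-c": 2.0}   # made-up failure-domain weights
counts = {h: 0 for h in hosts}
for pg in range(10000):
    counts[straw2_select(pg, hosts)] += 1
print(counts)   # host-c ends up with roughly half of the PGs, matching its weight
```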

Although CRUSH provides a fast way to locate data, it also has certain defects. First, weight imbalance affects OSD selection: a low-weight OSD that is actually usable may differ markedly from the other replica candidates and force a secondary hash. Second, adding or removing OSDs causes additional data migration. Finally, relying solely on the randomness of the hash can lead to uneven capacity utilization across OSDs, with differences of more than 40% observed in real environments. Therefore, starting from the Luminous release in 2017, Ceph provides a new mechanism called upmap, which allows the placement of individual PGs to be specified manually so that data can be rebalanced.
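The imbalance caused by purely hash-based placement is easy to reproduce with a quick simulation (a toy sketch with arbitrary PG and OSD counts): even with identical weights, the busiest OSD ends up with noticeably more PGs than the least busy one, which is exactly the gap that upmap-style explicit overrides close by pinning individual PGs to chosen OSDs.

```python
import hashlib
from collections import Counter

OSD_COUNT, PG_COUNT = 20, 512        # arbitrary sizes for the simulation

def place(pg: int) -> int:
    """Purely hash-based placement of a PG onto one of the equal-weight OSDs."""
    h = int.from_bytes(hashlib.md5(str(pg).encode()).digest()[:4], "big")
    return h % OSD_COUNT

load = Counter(place(pg) for pg in range(PG_COUNT))
print("PGs per OSD  min:", min(load.values()), " max:", max(load.values()))

# Upmap-style idea: an explicit override table pins chosen PGs to named OSDs,
# correcting individual outliers without changing the hash-based default.
overrides = {7: 3}                   # hypothetical: force PG 7 onto osd.3

def place_with_override(pg: int) -> int:
    return overrides.get(pg, place(pg))
```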

Unified Access Interface 

RADOS provides distributed object storage and extends block and file storage capabilities on top of it. The size of a single RADOS object is set in the configuration file (usually 4 MB). The libraries provided by librados can access the contents of arbitrary objects. RGW provides a bucket-based object storage service compatible with AWS (Amazon Web Services) S3 and OpenStack Swift. An RGW object can be larger than 4 MB; when it exceeds that size, RGW splits it into a head object and multiple data objects.
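As an illustration, an application can talk to RADOS directly through the librados Python bindings. The sketch below is a minimal example that assumes a reachable cluster, a valid /etc/ceph/ceph.conf and keyring, and an existing pool named data; it only exercises basic object reads, writes, and attributes.

```python
import rados

# Connect using the cluster configuration and default credentials (assumed to exist).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

ioctx = cluster.open_ioctx("data")               # pool name is an assumption
ioctx.write_full("greeting", b"hello rados")     # store a whole object
print(ioctx.read("greeting"))                    # read it back: b'hello rados'
ioctx.set_xattr("greeting", "owner", b"demo")    # object attributes live next to the data

ioctx.close()
cluster.shutdown()
```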

[Figure: Ceph storage access interfaces]

The block storage interface provides disk-like storage of contiguous byte sequences, which is the most widely used form of data storage. Disk arrays, storage area networks, and iSCSI can all provide block storage. Ceph's block storage builds on RADOS to support replication, snapshots, consistency, and high availability. Block devices are thin-provisioned and resizable. Ceph's RBD can use a kernel module or librbd to interact with OSDs. By default, RBD devices live in a pool named rbd, and each image is identified by the name assigned when it is created.
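A minimal sketch using the rbd Python bindings (assuming the same cluster configuration as above and the default rbd pool) shows the thin-provisioned creation, byte-level access, and online resizing described here; the image name and sizes are arbitrary.

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")                    # default RBD pool mentioned in the text

rbd.RBD().create(ioctx, "demo-image", 4 * 1024**3)   # 4 GiB image, thin-provisioned
with rbd.Image(ioctx, "demo-image") as image:
    image.write(b"bootsector", 0)                    # write at byte offset 0
    print(image.read(0, 10))                         # b'bootsector'
    image.resize(8 * 1024**3)                        # images can be resized later

ioctx.close()
cluster.shutdown()
```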

The file system interface is implemented by CephFS. Both data (file contents) and metadata (directories and files) in a CephFS file system are stored on the OSDs as objects. Clients can mount the file system in user space with ceph-fuse or in kernel mode with the kernel CephFS client (mount -t ceph). Both modes communicate with the MDS to obtain the directory structure of the file system and then access the corresponding OSDs. CephFS metadata is likewise stored in objects, whose names carry an mds_ prefix; the stored content includes the file system's inodes, metadata journal, and snapshots. When an MDS starts, it reads the file system metadata objects and caches them in memory, and clients must communicate with the MDS to query or update metadata.
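Besides the two mount modes, the same data and metadata paths can be exercised programmatically through the CephFS Python bindings (libcephfs). The following rough sketch assumes the bindings are installed, the cluster and an active MDS are reachable, and the default ceph.conf is valid; the paths and contents are invented.

```python
import cephfs

fs = cephfs.LibCephFS(conffile="/etc/ceph/ceph.conf")
fs.mount()                                # talks to the MDS, like the kernel and FUSE clients do

fs.mkdir("/reports", 0o755)               # directory operations go through MDS metadata
fd = fs.open("/reports/daily.txt", 'w', 0o644)
fs.write(fd, b"file data is striped over RADOS objects", 0)
fs.close(fd)

print(fs.stat("/reports/daily.txt").st_size)   # metadata answered by the MDS
fs.unmount()
fs.shutdown()
```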

Ceph is a general-purpose distributed storage system suitable for many different scenarios. Optimizing its internal mechanisms improves performance in all of them, but such optimization is also the most difficult and complex.

In a distributed storage system, data is spread across a large number of storage servers, and most distributed storage systems use local file systems directly to store the data, as HDFS and Lustre do. A high-performance, highly reliable distributed storage system cannot do without an efficient, consistent, stable, and reliable local file system. Local file system code has been tested and performance-tuned over a long period, and mature solutions exist for data persistence and space management. The file system exposes a POSIX interface, through which a distributed file system can switch between different local file systems.

Early versions of Ceph stored objects in a storage backend built on the local file system, called FileStore. FileStore places objects and object attributes on a local file system such as XFS, ext4, or Btrfs through the POSIX interface. Initially, object attributes were stored in POSIX extended attributes (xattrs), but when they later exceeded the size or count limits of xattrs, FileStore stored them in LevelDB. The main reasons why a local file system cannot satisfy Ceph's object storage requirements are as follows.

1) The separation of data and metadata is incomplete, making object addressing slow. FileStore places objects in different directories according to their name prefixes, so accessing an object requires multiple lookups, and files within the same directory are not sorted.

2) The local file system does not support transactional object operations. To provide write transactions, FileStore uses a write-ahead journal to guarantee atomicity. This causes the "double write" problem and wastes half of the disk's write performance (see the sketch after this list).

3) The local file system maintains its own journal to ensure consistency, which further amplifies writes.
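A stripped-down sketch of the journaled write path from item 2 (the file names and record format are purely illustrative, not FileStore's real on-disk layout) makes the double write visible: every payload byte reaches the disk once in the journal and once again in the object file.

```python
import os

JOURNAL = "filestore.journal"

def transactional_write(obj_path: str, data: bytes) -> None:
    """FileStore-style write: journal first for atomicity, then apply to the object file.
    The payload is written twice, which is the 'double write' cost described above."""
    # 1) Append the whole transaction to the write-ahead journal and make it durable.
    with open(JOURNAL, "ab") as j:
        j.write(len(data).to_bytes(8, "little") + obj_path.encode() + b"\0" + data)
        j.flush()
        os.fsync(j.fileno())
    # 2) Only now apply the change to the object file in the local file system.
    with open(obj_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

transactional_write("object_4MB.bin", b"x" * (4 * 1024 * 1024))  # 4 MB written, ~8 MB of disk traffic
```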

The drawbacks of using a local file system as the backend lead to poor FileStore performance. Our test results show that, with three replicas, the write performance of the block storage service does not even reach one third of the raw performance of the disks themselves.

In response to FileStore's defects, the Ceph community began developing BlueStore in 2015. The logical structures of the two storage backends are shown in the figures below. BlueStore shortens the I/O path by managing raw devices directly. The community designed a simplified file system, BlueFS, which bypasses the local file system layer and avoids inefficient traversal of a file system hierarchy. BlueStore uses a key-value index for metadata, strictly separating metadata from data and improving indexing efficiency. These improvements eliminate the journal's "double write" problem and bring large gains in read and write performance.

[Figures: Logical structures of the FileStore and BlueStore backends]
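The division of labour in BlueStore can be sketched in a few lines (a conceptual toy, not BlueStore's actual format: a plain file stands in for the raw device, a Python dict stands in for the RocksDB key-value store, and the allocator is a trivial bump pointer).

```python
# Conceptual sketch of the BlueStore split: data goes straight to the raw device,
# while small metadata records (object -> extent) live in a key-value store.
kv_store = {}                        # stand-in for RocksDB (which runs over BlueFS)
device = open("raw.device", "wb+")   # stand-in for the raw block device
next_free = 0                        # trivial allocator: bump pointer

def write_object(name: str, data: bytes) -> None:
    global next_free
    offset = next_free
    device.seek(offset)
    device.write(data)               # data path: one write, no file-system journal
    next_free += len(data)
    kv_store[f"onode/{name}"] = (offset, len(data))   # metadata path: tiny KV record

def read_object(name: str) -> bytes:
    offset, length = kv_store[f"onode/{name}"]
    device.seek(offset)
    return device.read(length)

write_object("pg1.obj42", b"payload")
print(read_object("pg1.obj42"))
```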

With three replicas, BlueStore delivers more than twice the performance of FileStore; with erasure coding the improvement is largest, reaching up to three times the original.

[Figure: Performance comparison of FileStore and BlueStore]

Although BlueStore was designed with SSDs and NVMe SSDs in mind, it does not support newer hardware or hybrid storage well. There are also problems in BlueStore's own design: data and metadata are stored in different locations, the metadata structures and I/O logic are complex, small I/Os can still suffer double writes, and its caches consume a large amount of memory.

A flash device must erase a block before it can be rewritten. Existing NVMe devices do not know which addresses can be erased before new writes arrive, which makes garbage collection inside the device inefficient. In principle, asynchronous garbage collection by the host can improve write efficiency; without changing the disk layout, however, the collection granularity is small, and in practice this works poorly for both the device and the intermediate layers. Targeting the characteristics of flash, the Ceph community proposed a new disk layout, SeaStore, which performs garbage collection in the higher-level driver. The basic idea is to divide the device space into segments of roughly 100 MB to 10 GB each and to write all data sequentially into the current segment. Deleting data only marks it as dead and triggers no immediate garbage collection; when the live data in a segment falls below a utilization threshold, the remaining data is moved to another segment.

Cleanup work is mixed with normal writes to avoid spikes in write latency. Once a segment has been fully cleaned, the whole segment is discarded and the device can erase and reclaim it. SeaStore mainly targets NVMe devices. It is built on the Seastar framework, whose future-based programming model implements a run-to-completion design, and when combined with SPDK and DPDK it can achieve zero-copy (or minimal-copy) I/O. The storage engine is still under development, and the extent of its performance improvement is not yet clear.
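The segment lifecycle described above can be summarized in a short sketch (the segment size and threshold are assumptions within the ranges mentioned in the text; real SeaStore tracks far more state): data is appended into the current segment, deletion only marks space dead, and a mostly-dead segment is rewritten elsewhere and discarded so the device can erase it as one large unit.

```python
SEGMENT_SIZE = 256 * 1024 * 1024     # assumed segment size within the 100 MB - 10 GB range
GC_THRESHOLD = 0.25                  # rewrite a segment when < 25% of it is still live

class Segment:
    def __init__(self):
        self.used = 0                # bytes appended so far
        self.live = 0                # bytes still referenced

    def append(self, size: int) -> bool:
        if self.used + size > SEGMENT_SIZE:
            return False             # segment full: the writer opens a new one
        self.used += size
        self.live += size
        return True

    def delete(self, size: int) -> None:
        self.live -= size            # deletion only marks space dead, no erase yet

    def needs_gc(self) -> bool:
        return self.used == SEGMENT_SIZE and self.live / self.used < GC_THRESHOLD

def collect(old: Segment, open_segment: Segment) -> None:
    """Move the surviving data into the currently open segment, then discard the old
    segment wholesale so the device can erase it as one large unit."""
    open_segment.append(old.live)
    old.used = old.live = 0
```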

Sangfor, a Chinese vendor, has designed PFStore, a user-space local storage engine based on SPDK, to meet the needs of high-performance distributed storage. Data is written in an append-only manner, metadata modification increments are written to a log, and the accumulated metadata is periodically flushed into RocksDB. In addition, the engine starts multiple instances that separately manage different partitions of an SSD and runs multiple OSDs to improve performance.

In recent years, as new hardware such as Open-Channel SSDs, 3D XPoint, non-volatile memory, and SMR drives has matured and reached the market, storage backend optimization techniques targeting specific hardware have emerged. We will introduce these hardware-oriented techniques in detail in the following discussion of optimization technologies.

Authors: Zhang Xiao, Zhang Simeng, Shi Jia, Dong Cong, Li Zhanhuai

