"Ceph Analysis" Series (3)-Ceph Structure

This article will analyze Ceph from the perspective of logical structure.

4.1 Hierarchy of the Ceph system

        The logical hierarchy of the Ceph storage system is shown in the following figure [1].

        [Figure: the logical hierarchy of Ceph]


        From bottom to top, the Ceph system can be divided into four levels:

        (1) The base storage system RADOS (Reliable, Autonomic, Distributed Object Store)

        As the name implies, this layer is itself a complete object storage system; in fact, all user data stored in the Ceph system is ultimately stored by this layer. Ceph's high reliability, high scalability, high performance, high automation, and other characteristics are all essentially provided by this layer. Understanding RADOS is therefore the foundation and the key to understanding Ceph.

        Physically, RADOS consists of a large number of storage device nodes. Each node has its own hardware resources (CPU, memory, hard disk, network) and runs an operating system and file system. Sections 4.2 and 4.3 introduce RADOS in more detail.

        (2) The basic library librados

        The function of this layer is to abstract and encapsulate RADOS and to provide an API to the upper layers, so that applications can be developed directly on top of RADOS (rather than on the whole of Ceph). Note that RADOS is an object storage system, so the API implemented by librados covers only object storage functions.

        RADOS is developed in C++, and the native librados API is provided in both C and C++; see [2] for the documentation. Physically, librados and the application developed on top of it reside on the same machine, which is why it is also called a local API. The application calls the librados API on the local machine, and librados then communicates with nodes in the RADOS cluster over sockets to complete the various operations.
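To make the shape of this local API concrete, here is a toy in-memory stand-in whose method names are modeled on the real python-rados bindings (an I/O context bound to a pool, with whole-object write/read/remove operations). It does not talk to any cluster; it is only a sketch of the call sequence an application would make through librados.

```python
# Toy in-memory stand-in for a librados I/O context. Method names follow
# the python-rados bindings (write_full, read, remove_object), but this
# is an illustration only -- no sockets, no RADOS cluster behind it.

class FakeIoctx:
    """Mimics an I/O context bound to a single pool."""
    def __init__(self):
        self._objects = {}

    def write_full(self, name, data):
        # Replace the entire object with the given bytes.
        self._objects[name] = bytes(data)

    def read(self, name, length=8192, offset=0):
        # Read up to `length` bytes starting at `offset`.
        return self._objects[name][offset:offset + length]

    def remove_object(self, name):
        del self._objects[name]

# The usual application call sequence: open a context, write, read back.
ioctx = FakeIoctx()
ioctx.write_full("greeting", b"hello rados")
assert ioctx.read("greeting") == b"hello rados"
```

With the real bindings, the only extra steps are connecting to the cluster (from a ceph.conf) and opening the I/O context for a named pool; the object operations themselves have this same flat, object-oriented shape.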

        (3) High-level application interface

        This layer includes three parts: RADOS GW (RADOS Gateway), RBD (RADOS Block Device), and Ceph FS (Ceph File System). Its role is to provide higher-level abstractions on top of the librados library, as upper-layer interfaces that are more convenient for applications or clients to use.

        Among them, RADOS GW is a gateway that provides RESTful APIs compatible with Amazon S3 and Swift, for developing the corresponding object storage applications. RADOS GW offers a higher level of abstraction, but its functionality is less powerful than that of librados; developers should therefore choose between them according to their needs.

        RBD provides a standard block device interface and is often used in virtualization scenarios to create volumes for virtual machines. Red Hat has integrated the RBD driver into KVM/QEMU to improve virtual machine access performance.
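The idea behind RBD is that a block device image is striped across many fixed-size RADOS objects (4 MiB by default), so a byte offset on the virtual disk maps deterministically to one object plus an offset inside it. The sketch below illustrates that mapping; the `rbd_data.<image-id>.<index>` naming pattern follows the modern RBD image format, but treat the exact name layout here as illustrative rather than authoritative.

```python
# Sketch: mapping a block-device byte offset onto fixed-size RADOS
# objects, the way RBD stripes an image. The object-naming pattern
# ("rbd_data.<id>.<16-hex-digit index>") is an assumption modeled on
# the modern RBD format -- illustrative only.

OBJECT_SIZE = 4 * 1024 * 1024  # RBD's default object size: 4 MiB

def rbd_object_for(image_id, byte_offset):
    index = byte_offset // OBJECT_SIZE           # which stripe object
    offset_in_object = byte_offset % OBJECT_SIZE  # position inside it
    name = "rbd_data.%s.%016x" % (image_id, index)
    return name, offset_in_object

# 9 MiB into the image falls in the third 4 MiB object (index 2),
# 1 MiB past that object's start:
name, off = rbd_object_for("ab12cd", 9 * 1024 * 1024)
assert name == "rbd_data.ab12cd.0000000000000002"
assert off == 1024 * 1024
```

Because the mapping is pure arithmetic, a virtual machine's reads and writes can be fanned out to many OSDs in parallel, which is where RBD's performance comes from.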

        Ceph FS is a POSIX-compatible distributed file system. Since it is still under development, the official Ceph website does not recommend using it in production environments.

        (4) Application layer

        This layer consists of the various ways Ceph's interfaces are applied in different scenarios: object storage applications developed directly on librados, object storage applications developed on RADOS GW, cloud disk services built on RBD, and so on.

        In the introduction above, one point may easily cause confusion: since RADOS is itself an object storage system and already provides the librados API, why develop a separate RADOS GW at all?

        Understanding this question actually helps in understanding the essence of RADOS, so it is worth analyzing here. At first glance, the difference between librados and RADOS GW is that librados provides a local API while RADOS GW provides a RESTful API, so the two differ in programming model and actual performance. Looking deeper, the difference relates to the target application scenarios of these two levels of abstraction. In other words, although RADOS, S3, and Swift all belong to the category of distributed object storage, the functions RADOS provides are both more basic and richer, as a comparison makes clear.

        Since the API functions supported by Swift and S3 are similar, Swift is taken as the example here. The API functions provided by Swift mainly include:

  • User management operations: user authentication, obtaining account information, listing containers, etc.;
  • Container management operations: creating/deleting containers, reading container information, listing the objects in a container, etc.;
  • Object management operations: writing, reading, copying, updating, and deleting objects, setting access permissions, reading or updating metadata, etc.

        It can be seen that the API provided by Swift (and S3) operates on only three kinds of "objects": user accounts, the containers in which users store data objects, and the data objects themselves. Moreover, none of the operations involve the underlying hardware or system information of the storage system. It is not hard to see that this API design is aimed entirely at object storage application developers and users: it assumes they care mainly about account and data management, are not interested in the details of the underlying storage system, and have even less need for in-depth optimization of efficiency and performance.
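The three kinds of "objects" map directly onto the Swift API's URL hierarchy: account-level, container-level, and object-level requests. The sketch below shows the request shapes; the paths follow the Swift v1 convention, and the account name `AUTH_demo` is made up for illustration.

```python
# Sketch of the Swift REST request shapes. The account/container/object
# hierarchy is visible directly in the URL path. "AUTH_demo" is an
# invented account name; paths follow the Swift v1 convention.

BASE = "/v1/AUTH_demo"

def list_containers():                  # account-level operation
    return ("GET", BASE)

def create_container(container):        # container-level operation
    return ("PUT", "%s/%s" % (BASE, container))

def put_object(container, obj):         # object-level operation
    return ("PUT", "%s/%s/%s" % (BASE, container, obj))

assert list_containers() == ("GET", "/v1/AUTH_demo")
assert create_container("photos") == ("PUT", "/v1/AUTH_demo/photos")
assert put_object("photos", "cat.jpg") == ("PUT", "/v1/AUTH_demo/photos/cat.jpg")
```

Nothing in these paths names an OSD, a placement policy, or any other piece of the underlying system, which is exactly the point of the comparison that follows.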

        The design philosophy of the librados API is entirely different. On the one hand, librados has no high-level concepts such as accounts or containers. On the other hand, the librados API exposes a great deal of RADOS state information and configuration parameters to developers, allowing them to observe the state of the RADOS system and the objects stored in it, and to exert strong control over the system's storage strategy. In other words, through the librados API an application can not only operate on data objects but also manage and configure the RADOS system itself. For the RESTful API designs of S3 and Swift this would be unimaginable, and it is also unnecessary.

        From the analysis and comparison above, it is not hard to see that librados suits advanced users who understand the system deeply and who need functional customization, extension, and deep performance optimization. Development on librados may be most appropriate for building dedicated applications on a private Ceph system, or for building back-end data management and processing applications for a Ceph-based public storage system. RADOS GW, by contrast, is better suited to developing ordinary web-based object storage applications, such as object storage services on a public cloud.

4.2 The logical structure of RADOS

        The logical structure of the RADOS system is shown in the following figure [3]:

        [Figure: the logical structure of RADOS]

        As shown in the figure, a RADOS cluster consists mainly of two types of nodes: a large number of OSDs (Object Storage Devices) responsible for storing and maintaining data, and a number of monitors responsible for detecting and maintaining system state. OSDs and monitors exchange node status information to derive the overall working state of the system and form a global record of that state, the so-called cluster map. This data structure, combined with specific algorithms provided by RADOS, realizes Ceph's core mechanism of "no table lookups, just compute" and several of its excellent features.

        When using the RADOS system, a large number of client programs first obtain the cluster map by interacting with OSDs or monitors, then compute locally to determine an object's storage location, and finally communicate directly with the corresponding OSD to complete the data operation. Clearly, as long as the cluster map does not change frequently, clients can complete data access without relying on any metadata server or performing any table lookups. During RADOS operation, updates to the cluster map depend entirely on changes in system state, and there are only two common events that cause such changes: an OSD failure, or an expansion of the RADOS cluster. In normal application scenarios, the frequency of these events is obviously far lower than the frequency of client data access.
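The "compute, don't look up" idea can be sketched in a few lines: every client holding the same cluster map derives the same OSD for a given object by pure local computation, with no metadata server in the path. Real Ceph hashes objects into placement groups and runs the CRUSH algorithm over the map; the deterministic hash below is a toy stand-in for that machinery, not CRUSH itself.

```python
# Toy stand-in for CRUSH-style placement: a deterministic function of
# (cluster map, object name). Any client with the same map computes the
# same location locally -- no metadata server, no lookup table.
import hashlib

def locate(obj_name, cluster_map):
    """cluster_map is (epoch, list of up OSD ids). Same map -> same answer."""
    epoch, osds = cluster_map
    digest = hashlib.sha256(("%d/%s" % (epoch, obj_name)).encode()).hexdigest()
    return osds[int(digest, 16) % len(osds)]

cmap = (42, ["osd.0", "osd.1", "osd.2", "osd.3"])

# Two independent "clients" agree without talking to each other:
assert locate("foo", cmap) == locate("foo", cmap)
# The answer is always some OSD from the current map:
assert locate("foo", cmap) in cmap[1]
```

The placement only changes when the map itself changes (its epoch bumps after an OSD failure or a cluster expansion), which is exactly why infrequent map updates make the scheme efficient. CRUSH improves on a plain hash by moving a minimal amount of data when the map changes and by respecting failure domains, but the client-side-computation principle is the same.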

4.3 Logical structure of OSD

        By definition, an OSD can be abstracted into two parts: the system part and the daemon (OSD daemon) part.

        The system part of an OSD is essentially a computer with an operating system and file system installed. Its hardware includes at least a single-core processor, a certain amount of memory, a hard disk, and a network card.

        Because such a small-scale x86 server is impractical (and in fact rarely seen), in practice multiple OSDs are usually deployed on a single larger server. When choosing the system configuration, ensure that each OSD gets a share of the computing power, a certain amount of memory, and a hard disk of its own, and that the server has sufficient network bandwidth. For specific hardware configuration choices, see [4].

        On this system platform, each OSD runs its own OSD daemon. The daemon is responsible for all of the OSD's logical functions, including communicating with the monitors and with other OSDs (in fact, with the daemons of other OSDs) to maintain and update system state, cooperating with other OSDs to store and maintain data, communicating with clients to perform the various operations on data objects, and so on.

        This concludes the introduction to the logical structure of the Ceph system. The next article will focus on the working principles and operational flows of Ceph (mainly RADOS).


Origin: blog.csdn.net/pansaky/article/details/102454359