"Ceph Analysis" series (4)-Ceph working principle and process

 This article briefly introduces the working principles of Ceph and several of its key workflows. As mentioned earlier, Ceph's functionality ultimately rests on RADOS, so the discussion here is really a discussion of RADOS. As for the upper layers, especially RADOS GW and RBD, the existing documentation (including Sage's papers) does not describe them in much detail, so some points in this article may remain unclear; the reader's understanding is appreciated.

        This article first introduces the core, computation-based object addressing mechanism in RADOS, then explains the workflow of object access, then describes how a RADOS cluster is maintained, and finally reviews Ceph's technical advantages in light of the structure and principles introduced earlier in this series.

5.1 Addressing process

        The addressing process in the Ceph system is shown in the following figure [1].

        
        The concepts on the left side of the above diagram are explained as follows:

        File: the file that a user needs to store or access. For an object-storage application built on Ceph, this file corresponds to the "object" in the application, that is, the "object" directly manipulated by the user.

        Object: the "object" as seen by RADOS. The difference between an object and the file above is that an object's maximum size is limited by RADOS (typically 2 MB or 4 MB) so that the underlying storage can be organized and managed. When an upper-layer application stores a large file in RADOS, the file must therefore be divided into a series of objects of uniform size (the last one may be smaller) before storage. To avoid confusion, the Chinese rendering of "object" is avoided in this article as much as possible; "file" or "object" is used directly.

        PG (Placement Group): as the name suggests, a PG organizes and maps the storage of objects. Specifically, a PG is responsible for organizing a number of objects (possibly thousands or more), while an object can be mapped to only one PG; that is, PG and object have a "one-to-many" relationship. At the same time, a PG is mapped to n OSDs, and each OSD carries a large number of PGs, so PG and OSD have a "many-to-many" relationship. In practice n is at least 2, and at least 3 in a production environment. A single OSD may carry hundreds of PGs. In fact, the choice of PG count affects the uniformity of the data distribution, a point expanded on below.

        OSD: the object storage device, already introduced in detail above, so it is not expanded on here. The only point worth adding is that the number of OSDs is related to the uniformity of data distribution in the system, so it should not be too small; in practice it should be at least in the tens to hundreds for Ceph's design to play out its intended advantages.

 Failure domain: this concept is not defined in the paper. Fortunately, readers with some background in distributed storage systems should be able to grasp the general idea.

        Based on the definitions above, the addressing process can now be explained. Specifically, addressing in Ceph goes through at least the following three mappings:

        (1) File -> object mapping

        The purpose of this mapping is to turn the file the user wants to operate on into objects that RADOS can handle. The mapping is very simple: in essence, the file is split according to the maximum object size, much like the striping process in RAID. The split has two advantages: first, it turns files of unbounded size into objects of a uniform maximum size that RADOS can manage efficiently; second, it turns the serial processing of a single file into the parallel processing of multiple objects.

        Each object produced by the split gets a unique oid (object id). The generation method is a simple linear mapping. In the figure, ino is the metadata of the file being operated on, which can be understood simply as the file's unique id; ono is the serial number of an object produced by splitting the file. The oid is obtained by concatenating the serial number onto the file id. For example, if a file whose id is filename is divided into three objects with serial numbers 0, 1 and 2, the resulting oids are filename0, filename1 and filename2, in that order.
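
        As a rough illustration, the Python sketch below splits a file into fixed-size objects and generates their oids. The 4 MB object size and the "ino + ono" oid format used here are illustrative assumptions, not Ceph's exact encoding.

```python
# Minimal sketch of the file -> object mapping.
# The 4 MB object size and the "<ino><ono>" oid format are illustrative
# assumptions; RADOS' actual encoding differs.
OBJECT_SIZE = 4 * 1024 * 1024  # assumed maximum object size

def file_to_objects(ino: str, data: bytes):
    """Split a file (identified by ino) into fixed-size objects with oids."""
    objects = []
    num_objects = max(1, (len(data) + OBJECT_SIZE - 1) // OBJECT_SIZE)
    for ono in range(num_objects):
        chunk = data[ono * OBJECT_SIZE:(ono + 1) * OBJECT_SIZE]
        oid = f"{ino}{ono}"           # oid = file id + object serial number
        objects.append((oid, chunk))  # the last chunk may be smaller
    return objects
```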

        The implicit requirement here is that the ino must be unique; otherwise the subsequent mappings cannot be performed correctly.

        (2) Object -> PG mapping

        After the file is mapped to one or more objects, each object needs to be mapped to a PG independently. This mapping is also very simple; as shown in the figure, the formula is:

        hash(oid) & mask -> pgid

        The calculation consists of two steps. First, a static hash function specified by the Ceph system is applied to the oid, mapping it to a pseudo-random value with an approximately uniform distribution. Then this pseudo-random value is bitwise ANDed with a mask to obtain the final PG serial number (pgid). By RADOS' design, if the total number of PGs is m (m should be an integer power of 2), the mask is m-1. The overall effect of the hash followed by the bitwise AND is therefore to pick one of the m PGs approximately uniformly at random. With a large number of objects and a large number of PGs, RADOS can thus guarantee an approximately uniform mapping between objects and PGs. And because objects are split from files, most objects are the same size, so this mapping also keeps the total amount of data stored in each PG approximately uniform.
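
        A minimal sketch of this step, assuming a power-of-two PG count; the md5 hash is only a stand-in for Ceph's own static hash function.

```python
import hashlib

# Sketch of the object -> PG mapping: hash(oid) & mask -> pgid.
# md5 is a stand-in for illustration; Ceph uses its own static hash.
def object_to_pg(oid: str, pg_num: int) -> int:
    assert (pg_num & (pg_num - 1)) == 0, "pg_num is assumed to be a power of 2"
    mask = pg_num - 1
    h = int.from_bytes(hashlib.md5(oid.encode()).digest()[:8], "little")
    return h & mask  # pgid in [0, pg_num)
```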

        It is not hard to see that "large numbers" have been emphasized repeatedly here. Only when the numbers of objects and PGs are large does the approximate uniformity of this pseudo-random mapping hold, and only then is the uniformity of Ceph's data placement guaranteed. To keep those numbers large, on the one hand the maximum object size should be configured reasonably so that the same set of files is split into more objects; on the other hand, Ceph recommends that the total number of PGs be roughly one hundred times the total number of OSDs, so that there are enough PGs available for the mapping.

        (3) PG -> OSD mapping

        The third mapping maps a PG, the logical organizational unit of objects, onto OSDs, the actual storage units. As shown in the figure, RADOS feeds the pgid into an algorithm called CRUSH and obtains a set of n OSDs. These n OSDs are jointly responsible for storing and maintaining all the objects in the PG. As mentioned earlier, n can be configured according to the reliability requirements of the deployment and is usually 3 in production. On each of these OSDs, the OSD daemon running there is responsible for storing and accessing the objects mapped to it, and for maintaining their metadata, in the local file system.
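
        The following is only a conceptual stand-in for this step, not the CRUSH algorithm itself (see [2] for that). It merely illustrates the interface: given a pgid, a list of OSDs and a replica count n, return n OSD ids; the cluster hierarchy and placement rules discussed below are ignored.

```python
import hashlib

# Conceptual stand-in for the PG -> OSD mapping. This is NOT CRUSH;
# it only shows the interface: pgid + cluster map + n -> n OSD ids.
def pg_to_osds(pgid: int, osd_ids: list, n: int = 3) -> list:
    # Rank all OSDs by a pseudo-random score derived from (pgid, osd)
    # and take the top n. Real CRUSH additionally honours the cluster
    # hierarchy and placement rules, which this sketch ignores.
    def score(osd_id):
        return hashlib.md5(f"{pgid}:{osd_id}".encode()).digest()
    return sorted(osd_ids, key=score)[:n]
```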

        Unlike the hash used in the object -> PG mapping, the result of the CRUSH computation is not absolutely fixed; it is affected by other factors. There are two main ones:

The first is the current system state, that is, the cluster map mentioned in "Ceph Analysis" series (4): logical structure. When the state or number of OSDs in the system changes, the cluster map may change, and that change affects the mapping between PGs and OSDs.

        The second is the storage policy configuration. The policy here mainly concerns data safety. Through policy configuration, a system administrator can require the three OSDs carrying the same PG to sit on different servers, or even on different racks in the data center, further improving storage reliability.

        Therefore, only when both the system state (the cluster map) and the storage policy remain unchanged is the mapping between a PG and its OSDs fixed. In practice, the policy rarely changes once configured, while the system state changes because of device failures or cluster expansion. Fortunately, Ceph provides automated support for such changes, so even if the PG-to-OSD mapping changes, applications are not affected. In fact, Ceph deliberately exploits this dynamic mapping: it is precisely the dynamic nature of CRUSH that lets Ceph migrate a PG to a different set of OSDs on demand, thereby automatically achieving high reliability, data distribution re-balancing and other features.

        There are two reasons for using CRUSH in this mapping rather than an ordinary hash. One is the configurability just described: the administrator's configuration parameters determine the policy for mapping OSDs to physical locations. The other is CRUSH's special "stability": when a new OSD joins and the system grows, most PG-to-OSD mappings do not change; only a small number of PGs change and trigger data migration. Ordinary hash algorithms provide neither of these properties. The design of CRUSH is thus one of the core elements of Ceph; see [2] for a detailed introduction.

        At this point, Ceph has completed the entire mapping from file to objects, PGs and OSDs through these three mappings. Throughout the process, no global table lookup is needed. As for the only global data structure, the cluster map, it is introduced later; for now it suffices to say that its maintenance and use are lightweight and do not adversely affect the system's scalability or performance.

        One possible point of confusion: why are both the second and third mappings needed? Isn't that redundant? Sage does not say much about this in his paper; the author's own analysis is as follows:

        Imagine the opposite: what would happen without the PG layer? In that case, some algorithm would have to map objects directly to a set of OSDs. If this algorithm were a fixed hash mapping, an object would be pinned to a fixed group of OSDs. When one or more of those OSDs failed, the object could not be automatically migrated to other OSDs (the mapping function does not allow it); when an OSD was added to expand the system, objects could not be re-balanced onto the new OSD (for the same reason). These restrictions run counter to Ceph's original design goals of high reliability and high automation.

 

        If, instead, a dynamic algorithm (such as CRUSH itself) were used for this direct mapping, the problems of a static mapping would seemingly be avoided. However, the amount of local metadata each OSD would then have to handle would explode, and the resulting computation and maintenance workload would be unbearable.

        For example, under Ceph's existing mechanism, an OSD regularly exchanges information with the other OSDs that share its PGs to determine whether they are working properly and whether maintenance is needed. An OSD carries roughly hundreds of PGs, and each PG usually involves three OSDs; therefore, over a given period, an OSD needs on the order of hundreds to thousands of such exchanges.

        Without PGs, however, an OSD would have to exchange information with every other OSD that shares an object with it. Since each OSD may carry millions of objects, over the same period the number of exchanges an OSD needs would skyrocket to millions or even tens of millions. The cost of maintaining this state would clearly be too high.

        In summary, the author believes the benefits of introducing PGs are at least twofold: on the one hand, PGs realize a dynamic mapping between objects and OSDs, leaving room for Ceph's reliability, automation and other features; on the other hand, they effectively simplify the organization of stored data and greatly reduce the system's maintenance and management overhead. Understanding this is important for a thorough grasp of Ceph's object addressing mechanism.

5.2 Data operation process

        Here the file write process is taken as an example to explain the data operation flow.

        To simplify the description and aid understanding, two assumptions are made. First, the file to be written is small, does not need to be split, and is mapped to a single object. Second, each PG in the system is mapped to three OSDs.

        Under these assumptions, the file write process can be represented by the following figure [3]:

         

        As shown in the figure, when a client needs to write a file into the Ceph cluster, it first carries out the addressing process described in Section 5.1 locally, turning the file into an object and finding the group of three OSDs that will store it. The three OSDs are ordered by serial number: the first is the Primary OSD of the group, and the next two are the Secondary OSD and the Tertiary OSD respectively.

        Having found the three OSDs, the client communicates directly with the Primary OSD and initiates the write (step 1). On receiving the request, the Primary OSD forwards the write to the Secondary OSD and the Tertiary OSD (steps 2 and 3). When the Secondary and Tertiary OSDs complete their writes, they send confirmations back to the Primary OSD (steps 4 and 5). Once the Primary OSD has confirmed that the other two OSDs have written the data and has completed its own write, it confirms to the client that the object write is complete (step 6).
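
        A minimal sketch of this fan-out, with the Primary OSD forwarding the write to its replicas and acknowledging the client only after all writes complete. The class and method names are illustrative, not Ceph's actual code.

```python
# Sketch of the replicated write path in steps 1-6 (illustrative names).
class OSD:
    def __init__(self, osd_id, replicas=()):
        self.osd_id = osd_id
        self.replicas = replicas  # Secondary/Tertiary OSDs when acting as Primary

    def write_local(self, oid, data):
        pass  # persist the object in the local object store (stubbed out)

    def handle_client_write(self, oid, data):
        """Primary OSD: fan the write out, wait for the acks, then ack the client."""
        for replica in self.replicas:       # steps 2 and 3
            replica.write_local(oid, data)  # returning here stands for the acks (4, 5)
        self.write_local(oid, data)         # the Primary's own write
        return "ack_to_client"              # step 6
```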

 

        The reason for adopting this write procedure is essentially to guarantee the reliability of the write and avoid data loss as far as possible. At the same time, because the client only has to send data to the Primary OSD, the external network bandwidth and overall access latency in Internet usage scenarios are optimized to a certain extent.

        Of course, this reliability mechanism inevitably adds latency. In particular, if every OSD had to write the data to disk before confirming to the client, the overall latency might be unbearable. Ceph can therefore acknowledge the client twice. Once each OSD has written the data into its memory buffer, a first confirmation is sent to the client, which can then carry on with its work. Once each OSD has committed the data to disk, a final confirmation is sent to the client, which may then discard its local copy of the data as needed.
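
        A sketch of how a client might treat the two acknowledgements; the "ack"/"commit" message names follow the description above, and the queue standing in for the network is an illustrative assumption.

```python
import queue

# Sketch of client-side handling of the two confirmations (illustrative).
def client_write(send_to_primary, replies: queue.Queue, oid: str, data: bytes):
    local_copy = data
    send_to_primary(oid, data)         # step 1: the write goes to the Primary OSD
    assert replies.get() == "ack"      # all replicas hold the data in memory buffers
    # ... the client is now free to continue with other work ...
    assert replies.get() == "commit"   # all replicas have committed the data to disk
    local_copy = None                  # safe to drop the local copy if desired
    return local_copy
```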

        Analysing this flow shows that, under normal circumstances, a client can complete OSD addressing on its own, without relying on any other system module. A large number of clients can therefore operate in parallel against a large number of OSDs. Likewise, if a file is split into several objects, those objects can be sent to multiple OSDs in parallel.

        From the OSDs' perspective, since the same OSD plays different roles in different PGs, the workload is spread as evenly as possible, preventing any single OSD from becoming a performance bottleneck.

        To read data, the client only needs to perform the same addressing process and contact the Primary OSD directly. In the current Ceph design, reads are served only by the Primary OSD, though there have been discussions about spreading the read load to improve performance.

5.3 Cluster maintenance

        As mentioned in the earlier introduction, a small number of monitors are jointly responsible for discovering and recording the state of all OSDs in the Ceph cluster, forming the master copy of the cluster map and diffusing it to all OSDs and clients. OSDs use the cluster map for data maintenance, while clients use it for data addressing.

        Within the cluster, the monitors all perform essentially the same function, and their relationship can be simply understood as master-slave backup. The following discussion therefore does not distinguish between individual monitors.

        Somewhat unexpectedly, the monitors do not actively poll the OSDs for their current state. On the contrary, OSDs report state information to the monitors. There are two common kinds of report: a new OSD joining the cluster, and an OSD discovering that it or another OSD has failed. After receiving a report, the monitor updates the cluster map and diffuses it. The details are described below.

        The actual content of the cluster map includes the following (a rough data-structure sketch follows the list):

        (1) Epoch, the version number. The epoch of a cluster map is a monotonically increasing sequence: the larger the epoch, the newer the cluster map. OSDs or clients holding different versions can therefore decide whose copy is authoritative simply by comparing epochs. The monitor always holds the cluster map with the largest, i.e. newest, epoch. Whenever two parties find during communication that their epochs differ, they first synchronize to the cluster map of the party with the higher version and only then carry out subsequent operations.

        (2) The network address of each OSD.

        (3) The state of each OSD. OSD state is described along two dimensions: up or down (whether the OSD is working normally) and in or out (whether the OSD carries at least one PG). Any OSD is therefore in one of four states:

        —— Up and in: the OSD is running normally and already carries the data of at least one PG. This is the standard working state of an OSD;

        —— Up and out: the OSD is running normally but does not yet carry any PG and holds no data. A newly added OSD is in this state right after joining the Ceph cluster, and a repaired OSD is also in this state when it rejoins;

        —— Down and in: the OSD is not working normally but still carries at least one PG, in which data is still stored. An OSD in this state has just been found to be faulty; it may return to normal, or it may never work again;

        —— Down and out: the OSD has failed for good and no longer carries any PG.

        (4) The CRUSH algorithm's configuration parameters, describing the physical hierarchy of the Ceph cluster (cluster hierarchy) and the placement rules.
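
        Putting items (1) to (4) together, the cluster map can be pictured roughly as follows. The field names and types are illustrative; Ceph's actual OSDMap carries considerably more detail.

```python
# Rough sketch of the cluster map contents, following items (1)-(4).
from dataclasses import dataclass, field

@dataclass
class OSDState:
    up: bool    # working normally?
    in_: bool   # carrying at least one PG?

@dataclass
class ClusterMap:
    epoch: int                                        # (1) monotonically increasing version
    osd_addrs: dict = field(default_factory=dict)     # (2) osd id -> network address
    osd_states: dict = field(default_factory=dict)    # (3) osd id -> OSDState
    crush_params: dict = field(default_factory=dict)  # (4) hierarchy + placement rules

def newer(a: ClusterMap, b: ClusterMap) -> ClusterMap:
    """Two communicating parties adopt whichever map has the larger epoch."""
    return a if a.epoch >= b.epoch else b
```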

        From this definition of the cluster map it can be seen that a version change is normally triggered only by changes to items (3) and (4), and of the two, (3) changes far more often. This can be seen from the following description of how OSD working states change.

 

        When a new OSD comes online, it first communicates with a monitor according to its configuration. The monitor adds it to the cluster map, sets its state to up and out, and sends the latest version of the cluster map to the new OSD.

        On receiving the cluster map from the monitor, the new OSD computes which PG it should carry (to simplify the discussion, assume it carries only one) and which other OSDs carry the same PG. The new OSD then gets in touch with those OSDs.

If the PG is currently in a degraded state (that is, the number of OSDs carrying it is below normal, for example 2 or 1 where 3 would be normal, usually because of an OSD failure), the other OSDs copy all the objects and metadata of this PG to the new OSD. Once the copy completes, the new OSD is set to up and in, and the cluster map is updated accordingly. This is, in effect, an automated failure recovery process. Of course, even if no new OSD joins, a degraded PG will compute a replacement OSD to achieve failure recovery.

        If the PG is healthy, the new OSD replaces one of the existing OSDs (the Primary OSD of the PG is re-selected) and takes over its data. After the copy completes, the new OSD is set to up and in, and the replaced OSD exits the PG (though its state usually remains up and in, since it still carries other PGs). The cluster map is updated accordingly. This is, in effect, an automated data re-balancing process.

        If an OSD finds that another OSD with which it shares a PG cannot be contacted, it reports this to a monitor. In addition, if an OSD daemon finds its own working state abnormal, it also reports the problem to a monitor. In these cases the monitor sets the state of the problematic OSD to down and in. If the OSD cannot return to normal within a preset grace period, its state is then set to down and out; conversely, if it recovers, its state returns to up and in. After each such state change, the monitor updates the cluster map and diffuses it. This is, in effect, an automated failure detection process.
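
        A sketch of these monitor-side state transitions, under assumed message handling and field names; the grace period value is arbitrary and only for illustration.

```python
import time

# Illustrative monitor-side handling of failure reports; not Ceph's actual code.
GRACE_PERIOD = 600.0  # seconds an OSD may stay "down and in" before being marked out

def handle_failure_report(cluster_map: dict, osd_id: int, now: float = None):
    """Called when an OSD reports osd_id (or itself) as unreachable/abnormal."""
    state = cluster_map["osd_states"][osd_id]
    state["up"] = False                        # down and in: it may still recover
    state["down_since"] = now or time.time()
    cluster_map["epoch"] += 1                  # every change produces a new version

def expire_down_osds(cluster_map: dict, now: float = None):
    """Periodic check: OSDs down for longer than the grace period go out."""
    now = now or time.time()
    for state in cluster_map["osd_states"].values():
        down_for = now - state.get("down_since", now)
        if not state["up"] and state["in"] and down_for > GRACE_PERIOD:
            state["in"] = False                # down and out: its PGs get re-mapped
            cluster_map["epoch"] += 1
```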

        From the description above it is clear that, even for a Ceph cluster with thousands of OSDs or more, the cluster map data structure is not particularly large, and its state does not change very frequently. Even so, Ceph optimizes the way cluster map information is diffused in order to relieve the related computation and communication pressure.

        First, cluster map information is diffused incrementally. If two communicating parties find that their epochs differ, the party with the newer version sends only the difference between the two cluster maps to the other.

        Second, cluster map information is diffused asynchronously and lazily. That is, the monitor does not broadcast each new version to all OSDs after every update; instead, it replies with the update when an OSD reports information to it. Similarly, when OSDs communicate with each other, each sends the update to any peer whose version is older than its own.
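
        A sketch of this lazy, incremental diffusion, modelling the incremental map simply as per-epoch diffs (the real OSDMap incrementals are more involved).

```python
# Sketch of lazy, incremental cluster map diffusion (illustrative model).
def on_message(local_epoch: int, peer_epoch: int, incrementals: dict) -> list:
    """Piggyback map updates on an existing exchange instead of broadcasting.

    incrementals: epoch -> diff that upgrades a map from epoch-1 to epoch.
    Returns the list of diffs to send to the peer (empty if it is up to date).
    """
    if peer_epoch >= local_epoch:
        return []  # the peer is as new or newer; nothing to send
    return [incrementals[e] for e in range(peer_epoch + 1, local_epoch + 1)]
```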

        Based on these mechanisms, Ceph avoids the broadcast storm that cluster map updates might otherwise cause. Although the mechanism is asynchronous and lazy, according to the conclusion in Sage's paper, for a Ceph cluster of n OSDs, any version update can propagate to any OSD in the cluster within O(log(n)) time.

        A question that may be asked: since diffusion is asynchronous and lazy, during propagation different OSDs will inevitably see inconsistent cluster maps. Does this cause problems? The answer is no. In fact, as long as a client and the OSDs of the PG it wants to access share the same cluster map state, the access proceeds correctly. If the client or an OSD in that PG is inconsistent with the others, then, by Ceph's design, the parties first synchronize their cluster maps to the latest state and perform any necessary data re-balancing, after which access continues normally.

        Through the above introduction we can see, in brief, how Ceph uses the cluster map mechanism to let monitors, OSDs and clients cooperate in maintaining cluster state and performing data access. In particular, on top of this mechanism, automated data backup, data re-balancing, failure detection and failure recovery all follow naturally, without complicated special-purpose design. This is truly impressive.

 

        At this point, this series has systematically introduced Ceph's design ideas, logical architecture, working principles and main operating procedures; the most technical part is now over. Two more articles should follow, covering the story of Ceph and OpenStack, as well as some personal thoughts on Ceph.
