【Ceph】OSD, OSDMap and PG, PGMap

Ceph is committed to providing PB-scale cluster storage with automatic failure recovery and convenient expansion and contraction. In a typical distributed storage system these capabilities require a Metadata Server, because a fully distributed design makes data migration and capacity expansion very painful; on the other hand, the Metadata Server must not become a single point of failure or a data bottleneck. Ceph wants to offer flexible and powerful automatic failure handling and recovery, which makes a Metadata Server indispensable, so the key question for avoiding a Metadata Server bottleneck becomes which metadata it should maintain. The Monitor, as Ceph's Metadata Server, maintains the cluster information. It holds six maps: MONMap, OSDMap, PGMap, LogMap, AuthMap, and MDSMap. Among them, PGMap and OSDMap are the two most important, and they are the focus of this article.

OSDMap

OSDMap describes all OSDs in a Ceph cluster. Any OSD change, such as a process exit, a node joining or leaving, or a weight change, is reflected in this map. The map is not held only by the Monitors; OSDs and clients also fetch it from the Monitors, so in practice every "client" of the map (OSDs, Monitors, and clients proper) may hold a different version of the OSDMap. When the authoritative OSDMap on the Monitor changes, the Monitor does not broadcast it to all "clients"; only those affected by the change are notified. For example, when a new OSD joins and causes some PGs to migrate, the OSDs carrying those PGs are told about the new map, and the Monitor also picks a few OSDs at random to send the OSDMap to.

So how does the OSDMap spread from there? Suppose OSD.a and OSD.b have received the new map, and OSD.c and OSD.d share some PGs with them. Their messages carry the OSDMap epoch: if OSD.c or OSD.d see a higher epoch than their own, they actively pull the newer OSDMap from the Monitor, and in some cases OSD.a and OSD.b push their own (newer) OSDMap to OSD.c and OSD.d directly. The OSDMap therefore propagates gradually across the nodes over time. When the cluster is idle, it may take longer for every node to receive the new map, but this does not break state consistency between OSDs: an OSD that has not yet received the new map does not need to know about the changes it contains.
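
The epoch comparison described above can be illustrated with a short, self-contained sketch. All class and method names here (`Monitor`, `OSD`, `send_to`, `receive`) are made up for illustration; this is only a model of the pull-when-behind / push-when-ahead behaviour, not Ceph's actual code.

```python
# Minimal model of lazy OSDMap propagation via epochs (assumed names, not Ceph code).

class Monitor:
    """Holds the authoritative OSDMap epoch."""
    def __init__(self, epoch: int):
        self.epoch = epoch

class OSD:
    def __init__(self, name: str, epoch: int, monitor: Monitor):
        self.name = name
        self.epoch = epoch        # epoch of the OSDMap this OSD currently holds
        self.monitor = monitor

    def send_to(self, peer: "OSD"):
        """Every message between OSDs carries the sender's OSDMap epoch."""
        peer.receive(self, self.epoch)

    def receive(self, peer: "OSD", msg_epoch: int):
        if msg_epoch > self.epoch:
            # Peer has a newer map: pull the latest OSDMap from the Monitor
            # (a peer may also push the map directly along with the message).
            self.epoch = self.monitor.epoch
        elif msg_epoch < self.epoch:
            # Peer is behind: push our newer epoch so it catches up.
            peer.receive(self, self.epoch)
        # Equal epochs: both sides agree, nothing to do.

# Example: OSD.a already has the new map (epoch 6), OSD.c is still on epoch 5.
mon = Monitor(epoch=6)
a, c = OSD("osd.a", 6, mon), OSD("osd.c", 5, mon)
a.send_to(c)
assert c.epoch == 6
```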

By managing multiple versions of the OSDMap, Ceph avoids synchronizing cluster state across all nodes at once, so it can tolerate frequent changes even in clusters with thousands of OSDs without the map updates themselves becoming a synchronization bottleneck.

Next, let's briefly look at how the OSDMap changes when an OSD starts up. When a new OSD starts, the latest OSDMap held by the Monitor does not contain it yet, so the OSD applies to the Monitor to join. After verifying its information, the Monitor adds it to the OSDMap, marks it as IN, and puts the change into the pending proposal for the Monitors' next round of "discussion". When the OSD receives the Monitor's reply and finds it is still not in the published OSDMap, it keeps applying to join; the Monitor then initiates a proposal to add the OSD to the OSDMap and mark it as UP. Following the Paxos flow of proposal -> accept -> commit, the Monitors reach agreement and the OSD finally joins the OSDMap. When the new OSD fetches the latest OSDMap and finds itself in it, it truly starts up: it establishes connections with the other OSDs, and the Monitor begins assigning PGs to it.
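
The boot handshake can be pictured as a retry loop. The following self-contained sketch uses entirely made-up names (`FakeMonitor`, `request_join`, `osd_is_up`) and compresses the Paxos round into a single step; it models the "keep applying until you see yourself UP in the map" behaviour, not Ceph's actual API.

```python
# Toy model of the OSD boot loop (assumed names; not Ceph's actual code).
import itertools

class FakeMonitor:
    """Pretends the Paxos commit needs one extra round before the OSD shows up as UP."""
    def __init__(self):
        self.pending = set()   # OSDs marked IN, waiting for the UP proposal
        self.up = set()        # OSDs committed as UP in the published OSDMap

    def request_join(self, osd_id: str):
        if osd_id in self.pending:
            # Second request: the proposal has gone through proposal -> accept -> commit.
            self.up.add(osd_id)
        else:
            self.pending.add(osd_id)   # first request: marked IN, proposal pending

    def osd_is_up(self, osd_id: str) -> bool:
        return osd_id in self.up

def osd_boot(osd_id: str, mon: FakeMonitor) -> int:
    for attempt in itertools.count(1):
        if mon.osd_is_up(osd_id):
            # Only now does the OSD connect to its peers and receive PG assignments.
            return attempt
        mon.request_join(osd_id)

print(osd_boot("osd.7", FakeMonitor()))   # -> 3 (two join requests, then UP)
```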

When an OSD crashes unexpectedly, the OSDs that exchange heartbeats with it notice that it can no longer be reached and report this to the Monitor. The Monitor then marks the OSD as down (and, if it stays unreachable, eventually out), and for every PG for which that OSD was Primary, the Primary role is handed over to one of the other OSDs holding the PG (explained below).
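
The Monitor-side reaction can be sketched as a simple remapping. The names and the "first survivor becomes Primary" convention below are simplifications for illustration only; Ceph's real failure handling involves CRUSH and the PG state machine.

```python
# Toy sketch: after heartbeat reports mark an OSD as failed, hand the Primary
# role of every PG it led to a surviving replica (assumed names, not Ceph code).

def reassign_primaries(pg_to_osds: dict[str, list[str]], failed_osd: str) -> dict[str, list[str]]:
    new_map = {}
    for pg, osds in pg_to_osds.items():
        survivors = [o for o in osds if o != failed_osd]
        new_map[pg] = survivors          # by convention here, survivors[0] acts as Primary
    return new_map

pgs = {"1.a": ["osd.0", "osd.1", "osd.2"],    # osd.0 is Primary for PG 1.a
       "1.b": ["osd.1", "osd.0", "osd.2"]}    # osd.1 is Primary for PG 1.b
print(reassign_primaries(pgs, "osd.0"))
# {'1.a': ['osd.1', 'osd.2'], '1.b': ['osd.1', 'osd.2']}
```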

PG and PGMap

PG (Placement Group) is a very important concept in Ceph. It can be regarded as a virtual node in a consistent-hashing scheme: each PG owns a portion of the data and is the smallest unit of data migration and change, so it plays a central role in Ceph. A pool contains a certain number of PGs (which can be dynamically increased or decreased), and these PGs are distributed across multiple OSDs according to rules defined by the CRUSH rule. The Monitor maintains the information of all PGs in every pool. For example, with a replica count of three, a PG is distributed across three OSDs: one holds the Primary role and the other two hold the Replicated role. The Primary PG handles the object writes for that PG, while reads can also be served from the Replicated PGs. The OSD is merely the carrier of PGs: each OSD holds the Primary role for some of its PGs and the Replicated role for the others. When an OSD fails (it crashes unexpectedly or its storage device is damaged), the Monitor promotes, for every PG whose Primary was on that OSD, one of the OSDs holding the Replicated role to Primary, and all PGs on the failed OSD enter the Degraded state. The cluster then waits for the administrator's next decision: if the original OSD cannot be brought back, it is kicked out of the cluster and the Monitor reassigns its PGs to other OSDs.
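
The two-step mapping object -> PG -> OSDs can be illustrated with a simplified sketch. Real Ceph uses its own rjenkins hash and the CRUSH algorithm; here a plain hash and a toy placement function stand in for them, purely to show the shape of the mapping.

```python
# Simplified object -> PG -> OSDs mapping (stand-ins for Ceph's hash and CRUSH).
import hashlib

def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return f"{pool_id}.{h % pg_num:x}"          # e.g. "1.2f"

def pg_to_osds(pg: str, osds: list[str], size: int = 3) -> list[str]:
    # Stand-in for CRUSH: pick `size` distinct OSDs deterministically from the
    # PG id. The first OSD in the list acts as Primary, the rest as replicas.
    h = int(hashlib.md5(pg.encode()).hexdigest(), 16)
    start = h % len(osds)
    return [osds[(start + i) % len(osds)] for i in range(size)]

osds = [f"osd.{i}" for i in range(6)]
pg = object_to_pg(pool_id=1, object_name="rbd_data.1234", pg_num=128)
print(pg, "->", pg_to_osds(pg, osds))
```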

In Ceph, a PG has a state machine with more than ten states and dozens of events to handle the anomalies it may face. Each PG is like a family: the data it holds is the family's wealth, and an OSD is merely a castle. Each castle shelters many families, but to keep the wealth safe, every family keeps residences in several castles. The OSD, as the castle, only provides the PG with a communication address (IP:Port) and some infrastructure (such as the OSDMap and the messaging mechanism). When an accident happens to a castle, all the families living there promptly update the status of their residences in the other castles and choose a new castle to live in. And when the castle recovers from the accident, every family in it contacts its residences in the other castles to learn how its wealth changed during the accident. The point of this analogy is that an Object (i.e., user data) follows its PG, not the OSD.
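
To give a feel for that state machine, here is a rough, deliberately incomplete illustration. The states and events below are a small subset chosen for the example and the transition table is invented; it is not an exact copy of Ceph's PG state machine.

```python
# Rough illustration of PG-style states and events (not Ceph's real state machine).
from enum import Enum, auto

class PGState(Enum):
    PEERING = auto()       # replicas agreeing on the PG's history
    ACTIVE_CLEAN = auto()  # all replicas present and consistent
    DEGRADED = auto()      # fewer replicas than desired, still serving I/O
    RECOVERING = auto()    # copying missing or outdated objects to replicas

def on_event(state: PGState, event: str) -> PGState:
    transitions = {
        (PGState.ACTIVE_CLEAN, "osd_failed"): PGState.DEGRADED,
        (PGState.DEGRADED, "new_replica_assigned"): PGState.RECOVERING,
        (PGState.RECOVERING, "recovery_done"): PGState.ACTIVE_CLEAN,
        (PGState.ACTIVE_CLEAN, "acting_set_changed"): PGState.PEERING,
        (PGState.PEERING, "peering_done"): PGState.ACTIVE_CLEAN,
    }
    return transitions.get((state, event), state)   # unknown events leave state unchanged

s = PGState.ACTIVE_CLEAN
for e in ["osd_failed", "new_replica_assigned", "recovery_done"]:
    s = on_event(s, e)
print(s)   # PGState.ACTIVE_CLEAN
```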

From the description above we can see that the Monitor holds the OSD state and PG state of the entire cluster, while each PG owns a portion of the Objects and is responsible for maintaining the information about those Objects; the Monitor does not track Object-level information. Each PG therefore has to maintain its own state in order to guarantee the consistency of its Objects, and the data of each PG, together with the records needed for recovery and migration, is kept by the PG itself, i.e., it lives on the OSDs where that PG resides.

PGMap is the state of all PGs, as maintained by the Monitor; each OSD also knows the state of the PGs it holds. PG migration requires the Monitor to make a decision, reflect it in the PGMap, and notify the relevant OSDs to change their PG state. After a new OSD starts and joins the OSDMap, the Monitor tells it which PGs it needs to create and maintain. With multiple replicas, the Primary OSD of a PG actively contacts the OSDs holding the Replicated role and exchanges the PG's state, including its recent history. In general, a brand-new OSD receives all the data of the PG from the others and gradually catches up; if the OSD already has information about the PG, the Primary compares the PG's history with it and they converge on a single view of the PG. This process is called Peering: a "discussion" initiated by the OSD holding the Primary PG, in which the OSDs that share the PG compare their PG information and history with each other until they reach consensus.
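
The core idea of Peering, comparing histories and agreeing on the most complete one, can be sketched very simply. The representation of a PG history as a `(last_epoch, last_version)` pair and the function name below are simplifications for illustration; Ceph's real peering exchanges full PG logs and past intervals.

```python
# Highly simplified sketch of Peering (assumed representation, not Ceph's algorithm):
# the Primary collects each replica's view of the PG's history and the replicas
# agree on the most complete one as authoritative.

def peering(pg_infos: dict[str, tuple[int, int]]) -> tuple[str, tuple[int, int]]:
    """pg_infos maps an OSD name to the (last_epoch, last_version) it knows for the PG.

    Returns the OSD holding the most complete history and that history; the
    other replicas would then recover up to it.
    """
    authoritative_osd = max(pg_infos, key=lambda osd: pg_infos[osd])
    return authoritative_osd, pg_infos[authoritative_osd]

views = {"osd.1": (42, 118), "osd.4": (42, 120), "osd.7": (40, 97)}
print(peering(views))   # ('osd.4', (42, 120)) -- the most up-to-date replica
```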
