Analysis of Ceph OSDMap mechanism

The OSDMap mechanism is a very important part of the Ceph architecture. The distribution and monitoring of PGs across OSDs is driven by the OSDMap mechanism; together with the CRUSH algorithm, it forms the cornerstone of Ceph's distributed architecture.

The OSDMap mechanism mainly includes the following three aspects:

1. The Monitor maintains the OSDMap data, including the set of Pools, the number of replicas, the number of PGs, the set of OSDs, and the state of each OSD.

2. Each OSD reports its own status to the Monitor, and also monitors and reports the status of its peer OSDs.

3. Each OSD manages the PGs assigned to it, including creating new PGs, migrating PGs, and deleting PGs.
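The OSDMap contents listed in point 1 can be modeled with a minimal sketch. This is a simplified illustration, not Ceph's actual data structure: a real OSDMap also carries the CRUSH map, flags, addresses, and much more, and the field names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class PoolInfo:
    size: int      # number of replicas
    pg_num: int    # number of placement groups in the pool

@dataclass
class OSDMap:
    epoch: int = 1
    pools: dict = field(default_factory=dict)      # pool_id -> PoolInfo
    osd_state: dict = field(default_factory=dict)  # osd_id -> "up" | "down"

    def mark_down(self, osd_id):
        # Every change to the map produces a new epoch; OSDs compare
        # epochs to decide whether their copy of the map is stale.
        self.osd_state[osd_id] = "down"
        self.epoch += 1

m = OSDMap()
m.pools[1] = PoolInfo(size=3, pg_num=128)
m.osd_state = {0: "up", 1: "up", 2: "up"}
m.mark_down(0)
```

The key design point mirrored here is that the map is versioned: consumers never merge partial updates, they simply move to a newer epoch.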

In the entire OSDMap mechanism, the OSD fully trusts the Monitor and treats the OSDMap data the Monitor maintains as authoritative. All actions the OSD takes on its PGs are based on OSDMap data; in other words, the Monitor directs how PGs are distributed across OSDs.

In the OSDMap data, the Pool set, the number of replicas, the number of PGs, and the OSD set are specified by operations staff. Although OSD state can also be changed by operations staff, in the actual running of a Ceph cluster OSD failures can occur at any time, while operators intervene only a small fraction of the time. OSD failure (that is, OSD state) is therefore the main target of Monitor monitoring.

OSD fault monitoring is performed by the Monitor and the OSDs together. On the Monitor side, a PaxosService thread called OSDMonitor processes the report data sent by OSDs in real time (it also handles the operations that operations staff perform on the OSDMap data). On the OSD side, a Tick thread periodically reports the OSD's own status to the Monitor; in addition, each OSD performs heartbeat monitoring of its peer OSDs and promptly reports any peer OSD failure it detects to the Monitor. The specific details of OSD fault monitoring are not analyzed in this article.
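The peer-heartbeat path described above can be sketched as a simple timeout check. The grace period and function names here are illustrative (the real OSD uses configurable options such as a heartbeat grace interval, and reports failures via messages to the Monitor):

```python
HEARTBEAT_GRACE = 20.0  # seconds without a reply before a peer is suspect (illustrative)

def check_peers(now, last_reply):
    """Return peers whose last heartbeat reply is older than the grace
    period; the OSD would report these peers to the Monitor as failed."""
    return [osd for osd, t in last_reply.items() if now - t > HEARTBEAT_GRACE]

# OSD0's view of its heartbeat peers: osd_id -> timestamp of last reply
last_reply = {1: 100.0, 2: 85.0}
suspects = check_peers(now=110.0, last_reply=last_reply)
```

At `now=110.0`, OSD1 replied 10 seconds ago (healthy) while OSD2 replied 25 seconds ago, exceeding the grace period, so only OSD2 is reported.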

The first and second points of the OSDMap mechanism are relatively easy to understand; the rest of this article analyzes the third point in detail.

[Figure: a 3-OSD cluster in which the Monitor detects an OSD failure and pushes the latest OSDMap to the surviving OSDs]

As shown in the figure above, in a Ceph cluster of 3 OSDs where the Pool replica count is 3 and the Primary OSD of a given PG is OSD0, when the Monitor detects that any of the 3 OSDs has failed, it sends the latest OSDMap data to the remaining 2 OSDs and notifies them to take corresponding action.

[Figure: OSD-side handling of an incoming MOSDMap message]

As shown in the figure above, after an OSD receives an MOSDMap message, it handles it in three main steps:

ObjectStore::Transaction::write(coll_t::meta()) writes the OSDMap to disk, saving it under /var/lib/ceph/osd/ceph-<id>/current/meta/, so that the OSDMap data is persisted and can play a role similar to a log.

OSD::consume_map() processes the PGs, including deleting PGs whose Pool no longer exists; updating each PG's epoch (the OSDMap epoch) to disk (LevelDB); and generating AdvMap and ActMap events, which drive the PG state machine to update PG state.

OSD::activate_map() decides, as needed, whether to start the recovery_tp thread pool for PG recovery.
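The consume_map() step can be illustrated with a short sketch that drops PGs whose Pool has been deleted and records the new epoch each surviving PG has processed. The function shape and PG identifiers here are hypothetical simplifications; the real code additionally feeds AdvMap/ActMap events into each PG's state machine:

```python
def consume_map(pgs, osdmap_pools, new_epoch):
    """pgs: dict mapping (pool_id, pg_seed) -> PG metadata dict.
    osdmap_pools: set of pool ids present in the new OSDMap."""
    # A PG whose pool no longer exists in the map must be removed.
    removed = [pgid for pgid in pgs if pgid[0] not in osdmap_pools]
    for pgid in removed:
        del pgs[pgid]
    # Surviving PGs persist the epoch of the map they have consumed.
    for meta in pgs.values():
        meta["epoch"] = new_epoch
    return removed

pgs = {(1, 0): {"epoch": 5}, (2, 0): {"epoch": 5}}
removed = consume_map(pgs, osdmap_pools={1}, new_epoch=6)
```

Here pool 2 has been deleted from the map, so PG (2, 0) is removed, while PG (1, 0) advances its recorded epoch to 6.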

On the OSD side, PGs are responsible for I/O processing, so PG state directly affects I/O. The PG state machine is the control mechanism for PG state; its internal state transitions are very complex and are not analyzed in detail here.

The following sections analyze the creation, deletion, and migration of PGs.

PG creation is triggered by operations staff: the number of PGs is specified when a new Pool is created, or the PG count of an existing Pool is increased. OSDMonitor then observes the OSDMap change and sends the latest MOSDMap to all OSDs.

On the group of OSDs to which a PG maps, the OSD::handle_pg_create() function creates the PG directory on disk, writes the PG metadata, updates the heartbeat peers, and performs other related operations.
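The on-disk side of PG creation can be sketched as follows. The directory layout and metadata format below are purely illustrative (Ceph's FileStore uses its own collection layout and omap metadata, not JSON files):

```python
import json
import tempfile
from pathlib import Path

def handle_pg_create(base, pgid, epoch):
    """Create the PG's on-disk directory and write minimal metadata.
    base stands in for an OSD data directory such as the ceph-<id> dir."""
    pg_dir = Path(base) / "current" / f"{pgid}_head"
    pg_dir.mkdir(parents=True, exist_ok=True)
    # Persist which PG this is and which map epoch created it.
    (pg_dir / "pg_meta").write_text(json.dumps({"pgid": pgid, "epoch": epoch}))
    return pg_dir

base = tempfile.mkdtemp()          # throwaway directory for the sketch
pg_dir = handle_pg_create(base, "1.2a", epoch=7)
```

The point of the sketch is the ordering: the directory and metadata must exist on disk before the PG can accept I/O or appear in heartbeat peer sets.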

PG deletion is also triggered by operations staff. OSDMonitor sends an MOSDMap to the OSDs, and on the group of OSDs to which the PG maps, the OSD::handle_pg_remove() function deletes the PG's directory from disk, removes the PG from the PGMap, deletes the PG metadata, and performs other cleanup.

PG migration is more complicated and involves coordinated processing between OSDs and the Monitor. For example, adding OSD3 to a cluster of 3 existing OSDs causes CRUSH to redistribute PGs; one PG's allocation might change as [0, 1, 2] -> [3, 1, 2]. Of course, CRUSH's placement is pseudo-random: for different PGs, OSD3 may become either the Primary OSD or a Replica OSD. Here we take the case where OSD3 becomes the Primary OSD as an example.

The newly added OSD3 replaces OSD0 as the Primary OSD. Since the PG has not yet been created on OSD3 and OSD3 holds no data, I/O on the PG cannot proceed. The PG Temp mechanism is therefore introduced: OSD3 sends an MOSDPGTemp message to the Monitor, designating OSD1 as the temporary Primary OSD because OSD1 holds the PG's data, and client requests to the PG are handled by OSD1. Meanwhile, OSD1 sends the PG's data to OSD3. Once the copy is complete, OSD1 returns the Primary OSD role to OSD3, and client I/O requests go directly to OSD3, completing the PG migration. The whole process is shown in the figure below.
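The routing effect of PG Temp can be captured in a small sketch: the acting set comes from the pg_temp table while a temporary mapping exists, and falls back to the CRUSH-computed up set once the entry is removed. The function name and table shape are illustrative, not Ceph's actual interfaces:

```python
def primary_for(pg, up, pg_temp):
    """Choose the OSD that serves client I/O for a PG: if a pg_temp
    entry exists, its first member acts as Primary; otherwise up[0]."""
    acting = pg_temp.get(pg, up)
    return acting[0]

up = [3, 1, 2]                 # CRUSH result after OSD3 joins the cluster
pg_temp = {"1.0": [1, 2, 3]}   # OSD3's request: keep OSD1 as Primary for now

before = primary_for("1.0", up, pg_temp)  # during backfill, I/O goes to OSD1
pg_temp.pop("1.0")                        # backfill done: pg_temp entry cleared
after = primary_for("1.0", up, pg_temp)   # I/O now goes directly to OSD3
```

The design choice this models is that clients never need special-case logic: they always compute the Primary from the current map, and the pg_temp table transparently redirects them while OSD3 is still empty.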

[Figure: the PG Temp workflow during PG migration to OSD3]

In the other PG migration scenario, where OSD3 acts as a Replica OSD, migrating the PG data from the Primary OSD to OSD3 is simpler than the process above and is not described in detail here.

This article has explained the basic principles of the OSDMap mechanism from the perspective of PGs and described the relationship between the Monitor, OSDs, and PGs. In day-to-day operations we are often confused by PG state changes caused by changes in OSD state and count; hopefully this article offers some help in diagnosing PG state problems.



Author: Lucien_168
Link: https://www.jianshu.com/p/8ecd6028f5ff
Source: Jianshu
Copyright belongs to the author. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please indicate the source.



Origin: blog.csdn.net/majianting/article/details/102990025