Ceph Monitor: Principles and Code Flow

Monitor Introduction

The Monitor plays the role of manager in a Ceph cluster and maintains the state of the entire cluster. That state is abstracted into several map objects, including the monmap, osdmap, mdsmap, authmap, logmap, etc., so that the components of the cluster share a consistent view of it. Among these, the osdmap is updated with a mechanism similar to an incremental (grayscale) rollout, so at a given moment the osdmap versions held by the OSDs and clients in the cluster may not all be identical.

In one sentence: the Monitor is responsible for collecting cluster information, updating it, and publishing it.

If there were only one Monitor, things would be much simpler: all additions, deletions, modifications, and queries of cluster information would go through that single Monitor. But that creates a single point of failure and a performance hotspot. As a distributed system, Ceph deploys multiple Monitors to avoid the single point of failure, which in turn raises new questions: how are multiple Monitor nodes managed? Who updates the data? How is data synchronized between Monitors? How is consistency maintained? In short, what a Monitor does can be summarized in two points: 1) manage itself well: how to update data, how to synchronize it, and so on; 2) manage the cluster well: what data there is, where it lives, how its consistency is kept, and so on.

The basic structure of Monitor

Ceph's Monitors maintain the master copy of the cluster maps; a cluster usually runs a set of Monitors from which clients fetch these maps. Internally, a Monitor consists of a K/V store, Paxos, and the PaxosServices. The K/V store persists the Monitor's data; Paxos provides consistency for the data the PaxosServices access; and each PaxosService represents one piece of cluster state, written through the Paxos layer in key/value form.
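The three-layer structure above can be sketched as follows. This is a minimal illustration, not Ceph code: the class and key names are invented for the example, and a real propose would run a Paxos round across the quorum before anything reaches the store.

```python
# Illustrative sketch of the Monitor layering: PaxosService -> Paxos -> K/V store.

class KVStore:
    """Persistence layer; in a real Monitor this is the MonitorDBStore."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

class Paxos:
    """Consistency layer: every accepted write gets a monotonically
    increasing version before it reaches the store."""
    def __init__(self, store):
        self.store = store
        self.version = 0

    def propose(self, key, value):
        # In Ceph this would first be agreed on by the quorum.
        self.version += 1
        self.store.put(key, value)
        self.store.put("paxos:last_committed", self.version)
        return self.version

class PaxosService:
    """Service layer: one instance per cluster map (osdmap, monmap, ...)."""
    def __init__(self, paxos, name):
        self.paxos = paxos
        self.name = name

    def update(self, epoch, blob):
        return self.paxos.propose(f"{self.name}:{epoch}", blob)

store = KVStore()
paxos = Paxos(store)
osdmap_svc = PaxosService(paxos, "osdmap")
v = osdmap_svc.update(1, b"full osdmap e1")
print(v)   # 1: the first committed Paxos version
```

Each service writes only through the shared Paxos instance, which is why all cluster maps advance under a single, totally ordered commit history.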

Monitor initialization main process:

When a Monitor starts or restarts, it connects to the other Monitors listed in the monmap. On first startup, the Monitor builds a monmap from the configuration file and stores it in the MonitorDBStore database; on subsequent startups it reads the monmap back from MonitorDBStore. The Messenger is the network thread module; the Monitor initializes it and registers the callbacks that handle incoming requests. Bootstrap is called many times and is central to the Monitor's entire life cycle: after bootstrap the Monitor is in the STATE_PROBING state, in which it communicates with the other Monitors and synchronizes information, after which the cluster starts an election to determine each Monitor's role.

Monitor state transition:

  • STATE_PROBING: during bootstrap, nodes probe each other and discover data gaps;
  • STATE_SYNCHRONIZING: when the data gap is too large to be filled by catch-up, a full synchronization is performed;
  • STATE_ELECTING: the Monitor is electing a leader;
  • STATE_LEADER: the current Monitor has become the Leader;
  • STATE_PEON: the current Monitor is a non-Leader node.
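The transitions between these states can be sketched as a small table. The state names mirror Ceph's constants, but the transition set here is a simplification for illustration:

```python
# Hedged sketch of the Monitor state machine described above.

PROBING, SYNCHRONIZING, ELECTING, LEADER, PEON = (
    "STATE_PROBING", "STATE_SYNCHRONIZING", "STATE_ELECTING",
    "STATE_LEADER", "STATE_PEON")

TRANSITIONS = {
    PROBING:       {SYNCHRONIZING, ELECTING},  # small gap -> elect; large -> full sync
    SYNCHRONIZING: {PROBING},                  # re-bootstrap after a full sync
    ELECTING:      {LEADER, PEON},             # election outcome
    LEADER:        {ELECTING, PROBING},        # new election or re-bootstrap
    PEON:          {ELECTING, PROBING},
}

def step(current, nxt):
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt

state = PROBING
state = step(state, ELECTING)
state = step(state, PEON)
print(state)   # STATE_PEON
```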

Data Consistency in Distributed Systems

For a distributed system, data consistency is critical. Among Monitor nodes there are two roles, Leader and Peon. Both the Leader and the Peons can handle client read operations, while write operations are sent to the Leader, which then distributes them to the Peon nodes. The Paxos algorithm guarantees that only one value can be approved for each modification, ensuring the data consistency of the distributed system.
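The read/write split can be sketched as follows. This is illustrative only (no real Paxos round is run): any node answers reads locally, while a write arriving at a Peon is forwarded to the Leader, which distributes it to the Peons.

```python
# Sketch of the Leader/Peon read and write paths described above.

class Mon:
    def __init__(self, rank, is_leader=False):
        self.rank = rank
        self.is_leader = is_leader
        self.kv = {}
        self.peons = []

    def read(self, key):
        return self.kv.get(key)                   # served locally on Leader or Peon

    def write(self, key, value, leader):
        if not self.is_leader:
            return leader.write(key, value, leader)  # forward to the Leader
        self.kv[key] = value
        for peon in self.peons:                   # Leader distributes to Peons
            peon.kv[key] = value
        return True

leader = Mon(0, is_leader=True)
peon = Mon(1)
leader.peons = [peon]
peon.write("osdmap:1", "e1", leader)   # a write via a Peon still lands everywhere
print(peon.read("osdmap:1"), leader.read("osdmap:1"))   # e1 e1
```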

Communication Model Between Nodes

There are usually two kinds of communication model: shared memory and message passing. Paxos is based on message passing.

Paxos conversion timing:

#1. Paxos is initialized when the Monitor starts;

#2. Paxos is restarted when the Monitor enters bootstrap;

#3. Based on the election result, Paxos is initialized as the corresponding Leader or Peon;

#4. After a Monitor failure, Paxos enters the recovery phase;

#5. While the Monitor is running, Paxos carries out resolutions (proposals);

Concept Explanation

Epoch value

Each time a new Leader is elected, a new Epoch is generated; if no election occurs, the Epoch does not change. Every message the Leader sends carries this Epoch, so if a network partition or similar event triggers a new election, the changed Epoch reveals that the Leader has changed. When there is no Leader, no Epoch is needed.
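How an Epoch exposes a deposed Leader can be sketched in a few lines. This is illustrative, not Ceph's actual message handling: each election bumps the epoch, messages carry it, and receivers reject anything older.

```python
# Sketch of stale-Leader detection via the Epoch described above.

class EpochTracker:
    def __init__(self):
        self.epoch = 0

    def new_election(self):
        self.epoch += 1
        return self.epoch

    def accept(self, msg_epoch):
        # A message from a Leader deposed by a newer election is rejected.
        return msg_epoch >= self.epoch

mon = EpochTracker()
e1 = mon.new_election()   # first Leader elected, epoch 1
e2 = mon.new_election()   # partition heals, re-election, epoch 2
print(mon.accept(e1), mon.accept(e2))   # False True
```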

Rank value

The rank value can be understood as an ID: it is the node's position in the monmap and is derived from its IP address. If a host is not in the monmap, its rank is -1. The rank value is used in the election: the quorum member with the smallest rank becomes the Leader.
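The rank rule above can be sketched directly. The monitor names are illustrative; the point is that rank is simply position in the monmap (-1 if absent) and the smallest rank in the quorum wins:

```python
# Sketch of rank assignment and the smallest-rank-wins election rule.

def rank_of(monmap, name):
    return monmap.index(name) if name in monmap else -1

def elect_leader(monmap, quorum):
    # The quorum member with the smallest rank becomes Leader.
    return min(quorum, key=lambda name: rank_of(monmap, name))

monmap = ["mon.a", "mon.b", "mon.c"]
print(rank_of(monmap, "mon.b"))                  # 1
print(rank_of(monmap, "mon.x"))                  # -1: not in the monmap
print(elect_leader(monmap, ["mon.c", "mon.b"]))  # mon.b
```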

PN(Proposal Number)

After a Leader is elected, it first runs a Phase 1 round to settle on a PN. While it remains Leader, all Phase 2 operations reuse this PN, eliminating a large number of Phase 1 rounds; this is how Ceph's Paxos reduces network overhead. A PN is always required, whether or not there is a Leader.
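The PN optimization can be sketched as follows. This is illustrative only: the PN scheme here (derived from rank) is an assumption standing in for whatever unique numbering a real implementation uses, and no network round trips are modeled.

```python
# Sketch of running Phase 1 once and reusing the PN for every Phase 2.

class LeaderPaxos:
    def __init__(self, rank):
        self.rank = rank
        self.pn = None
        self.phase1_rounds = 0

    def collect(self):
        # Phase 1: fix a PN unique to this Leader (any unique scheme works).
        self.phase1_rounds += 1
        self.pn = 100 + self.rank
        return self.pn

    def begin(self, value):
        # Phase 2: reuse the already-settled PN for each new proposal.
        if self.pn is None:
            self.collect()
        return (self.pn, value)

leader = LeaderPaxos(rank=0)
leader.collect()                  # once, right after the election
for v in ["osdmap e2", "osdmap e3", "osdmap e4"]:
    leader.begin(v)
print(leader.phase1_rounds)       # 1: three proposals, one Phase 1 round
```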

Version

It can be understood as the instance ID of a Paxos round.

Values stored under an "uncommitted" key exist only in abnormal cases. If every proposal commits normally (for example, on a clean shutdown), nothing uncommitted is stored; only proposals that were proposed but never committed are kept there.

Several states of Paxos:

1) Recovering: entered after the Leader election finishes; its purpose is to synchronize state among the quorum members.

2) Active: the idle state; no proposal is being approved.

3) Updating: a proposal is being approved.

4) Updating Previous: a leftover proposal is being approved, i.e. one proposed by the old Leader before the election but not yet approved.

5) Writing: the proposal's data is being committed.

6) Writing Previous: the leftover unapproved proposal's data is being committed.

7) Refresh: the proposal has been committed.
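The states above chain together in two characteristic paths: the normal proposal path once things have settled, and the "previous" path right after an election, when a leftover proposal must be finished first. A sketch, with the transition table simplified for illustration:

```python
# Hedged sketch of the Paxos state paths described above.

PAXOS_TRANSITIONS = {
    "recovering":        {"active", "updating_previous"},
    "active":            {"updating"},
    "updating":          {"writing"},
    "updating_previous": {"writing_previous"},
    "writing":           {"refresh"},
    "writing_previous":  {"refresh"},
    "refresh":           {"active"},
}

def walk(path):
    for a, b in zip(path, path[1:]):
        assert b in PAXOS_TRANSITIONS[a], f"illegal: {a} -> {b}"
    return path[-1]

# A normal proposal:
end1 = walk(["active", "updating", "writing", "refresh", "active"])
# Right after an election, a leftover proposal is finished first:
end2 = walk(["recovering", "updating_previous", "writing_previous",
             "refresh", "active"])
print(end1, end2)   # active active
```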

Monitor message distribution

The Monitor process creates only one Messenger, which means it has only one dispatch_queue and one dispatcher thread, so all requests are queued and processed serially. The Monitor also initializes a timer, which creates a thread to process all timed events, including probe, propose, lease, and other messages, so the handling of these messages is serial as well.
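The single-queue, single-thread dispatch model can be sketched as below. This is a toy: Ceph's Messenger is far more elaborate, but the ordering property is the same — one dispatcher thread drains one queue, so requests are handled strictly in arrival order.

```python
# Sketch of serial dispatch: one queue, one dispatcher thread.
import queue
import threading

dispatch_queue = queue.Queue()
handled = []

def dispatcher():
    while True:
        msg = dispatch_queue.get()
        if msg is None:           # shutdown sentinel
            break
        handled.append(msg)       # every handler runs on this one thread

t = threading.Thread(target=dispatcher)
t.start()
for msg in ["probe", "propose", "lease"]:
    dispatch_queue.put(msg)
dispatch_queue.put(None)
t.join()
print(handled)   # arrival order is preserved
```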

How does Monitor handle Client requests?

When a client sends a request to the Monitor, the Monitor first dispatches it to the corresponding PaxosService. The PaxosService invokes the appropriate method depending on whether the operation is a read or a write, and decides whether to trigger a propose.
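That dispatch decision can be sketched in a few lines. The method and key names here are illustrative, not Ceph's actual API: reads are answered from local state with no quorum round, while only writes trigger a propose.

```python
# Sketch of a PaxosService answering reads locally and proposing on writes.

class SimplePaxosService:
    def __init__(self):
        self.state = {}
        self.proposals = 0

    def dispatch(self, op, key, value=None):
        if op == "read":
            return self.state.get(key)   # no propose needed for a read
        elif op == "write":
            self.proposals += 1          # a write triggers a propose
            self.state[key] = value      # (committed through Paxos in Ceph)
            return "ok"

svc = SimplePaxosService()
svc.dispatch("write", "osd.0", "up")
print(svc.dispatch("read", "osd.0"), svc.proposals)   # up 1
```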


Origin blog.csdn.net/weixin_43778179/article/details/132692039