Elasticsearch principle analysis-cluster startup process

Let's start with the startup process: first, look at how the whole cluster starts up at a macro level and how the cluster status changes from Red to Green, without touching any code, and then analyze the flows of the other modules.

In this book, the cluster startup process refers to a complete restart of the cluster. It goes through important stages such as electing the master node, allocating shards (both primary and replica), and recovering index data. Understanding the principles and details of these stages is important for solving or avoiding problems that may arise during cluster maintenance, such as split brain, shards without an owner, slow recovery, and data loss.

The overall process of cluster startup is shown in the following figure:

[Figure: overall flow of the cluster startup process]

1. Election of the master node

Assume several nodes are just starting up. The first thing the cluster does is elect one node from the list of known active nodes as the master; the steps that follow are all driven by the elected master.

ES's master election algorithm is an improvement on the Bully algorithm. The main idea is to sort the node IDs and take the node with the largest ID as the master; every node runs this procedure. Sounds simple, right? The purpose of the election is only to determine a unique master node. Beginners may assume that the elected master must hold the latest metadata, but in the implementation this problem is split into two steps: first determine a unique master node that everyone recognizes, then find a way to copy the latest cluster metadata to that elected master.

This simple election algorithm based on sorting node IDs comes with three additional agreed-upon conditions:

  1. The number of participants must be more than half; once this quorum (majority) is reached, a temporary master is elected.

    Why temporary? Each node runs the take-the-maximum-of-the-sorted-IDs algorithm, but the results are not necessarily the same. For example, a cluster has 5 hosts with node IDs 1, 2, 3, 4, and 5. When a network partition occurs or the nodes start up at very different speeds, node 1 may see the node list 1, 2, 3, 4 and elect 4, while node 2 sees the node list 2, 3, 4, 5 and elects 5. The results are inconsistent, which leads to the second restriction below.

  2. The elected master must be confirmed by more than half of the nodes (a majority of votes).

    When a node is elected master, it must wait until more than half of the nodes have joined it before it can confirm itself as master. This solves the first problem.

  3. When a node-leaving event is detected, the master must check whether the current number of nodes is still more than half.

    If quorum is no longer reached, it gives up its Master status and rejoins the cluster. If it did not, imagine the following situation: a cluster of 5 machines is split by a network partition into group A with 2 machines and group B with 3 machines. Before the partition, the master was one of the two nodes in A. After the partition, the three nodes in B will successfully elect a new Master, resulting in two masters, commonly known as split brain.

The cluster does not know how many nodes it should have; the quorum value is read from the configuration. We need to set the configuration item:

discovery.zen.minimum_master_nodes: 3
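To make the sorted-ID election and the two quorum checks above concrete, here is a minimal, hypothetical Java sketch. It is not the actual ZenDiscovery code; the class, method names, and node IDs are invented for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// A minimal sketch of the sorted-ID master election described above.
// Class and method names are invented; this is not ES source code.
public class MasterElectionSketch {

    // Elect a temporary master: only proceed if a quorum of master-eligible
    // nodes is visible, then pick the largest node ID.
    static Optional<String> electTemporaryMaster(List<String> visibleNodeIds,
                                                 int minimumMasterNodes) {
        if (visibleNodeIds.size() < minimumMasterNodes) {
            return Optional.empty();          // not enough participants yet
        }
        return Optional.of(Collections.max(visibleNodeIds));
    }

    // The elected node only acts as the real master after more than half of
    // the master-eligible nodes have joined it (the "votes" check).
    static boolean confirmedAsMaster(int joinedNodes, int minimumMasterNodes) {
        return joinedNodes >= minimumMasterNodes;
    }

    public static void main(String[] args) {
        // Node 1 sees only nodes 1..4 because node 5 is slow to start.
        List<String> seenByNode1 = List.of("node-1", "node-2", "node-3", "node-4");
        // Node 2 sees nodes 2..5.
        List<String> seenByNode2 = List.of("node-2", "node-3", "node-4", "node-5");

        int minimumMasterNodes = 3;  // quorum for a 5-node cluster
        System.out.println(electTemporaryMaster(seenByNode1, minimumMasterNodes)); // node-4
        System.out.println(electTemporaryMaster(seenByNode2, minimumMasterNodes)); // node-5
        // The two temporary results differ, which is why the join/quorum
        // confirmation step is still required before acting as master.
        System.out.println(confirmedAsMaster(2, minimumMasterNodes)); // false: keep waiting
    }
}
```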

2. Electing the cluster meta-information

The elected Master does not necessarily hold the newest cluster meta-information. Its first task is therefore to elect the meta-information: it asks each node to send the meta-information it has stored, determines the latest copy by comparing version numbers, and then broadcasts it, so that every node in the cluster holds the latest meta-information.

The election of cluster meta-information covers two levels: cluster level and index level. It does not include the information about which node each shard is stored on; that information is derived from what is stored on each node's disk and has to be reported by the nodes. Why? Because reads and writes do not go through the Master, the Master does not directly know the data differences between the copies of each shard. HDFS has a similar mechanism: block information depends on reports from the DataNodes.

For cluster consistency, the number of nodes participating in the meta-information election must be more than half, and the Master's rule for publishing the cluster state is likewise that publication succeeds only after more than half of the nodes have acknowledged it.

While this election is in progress, join requests from new nodes are not accepted.

After the cluster meta-information election is complete, the Master publishes the first cluster state and then starts to elect the shard-level meta-information.
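The "highest version wins, but only with a quorum of reports" rule can be pictured with the small hypothetical sketch below; the MetaInfo record, the node names, and the version numbers are all invented, and the real comparison happens inside ES's cluster-state handling.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch of electing cluster meta-information by version number.
// The MetaInfo type and the sample versions are hypothetical.
public class MetaElectionSketch {

    record MetaInfo(String reportedBy, long version) {}

    // Pick the meta-information with the highest version, but only if more
    // than half of the master-eligible nodes have responded.
    static Optional<MetaInfo> electMeta(List<MetaInfo> reports, int totalMasterEligibleNodes) {
        int quorum = totalMasterEligibleNodes / 2 + 1;
        if (reports.size() < quorum) {
            return Optional.empty();  // not enough reports, keep waiting
        }
        return reports.stream().max(Comparator.comparingLong(MetaInfo::version));
    }

    public static void main(String[] args) {
        List<MetaInfo> reports = List.of(
                new MetaInfo("node-1", 41),
                new MetaInfo("node-2", 45),   // newest copy
                new MetaInfo("node-3", 44));
        // 5 master-eligible nodes in total, 3 reports -> quorum reached.
        System.out.println(electMeta(reports, 5));  // prints the report from node-2 (version 45)
    }
}
```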

3. The shard allocation process

The election of shard-level meta-information and the construction of the content routing table are done in the allocation module. In the initial stage, all shards are in the **UNASSIGNED** state. Through the allocation process, ES determines which shard lives on which node and rebuilds the content routing table. The first thing to do is to allocate the primary shards.

3.1 Selecting the primary shard

Let's see how a particular primary shard, [website][0], gets allocated. All allocation work is done by the Master. At this point the Master does not know where the primary shard is, so it asks every node in the cluster: "Send me the meta-information you have for shard [website][0]." The Master then waits for all the responses; normally it now has the information about that shard, and it picks one copy as the primary according to a certain strategy. Does this sound inefficient? The number of such queries equals the number of shards times the number of nodes, so we had better keep the total number of shards from growing too large.

There are now multiple pieces of information for shard [website][0]; the exact number depends on how many copies were configured. Which copy should become the primary? Versions before ES 5.x decided this by comparing the version numbers in the shard-level meta-information. The problem with multiple copies is that if only one copy reports its information, it will certainly be chosen as the primary even though its data may not be the latest; the copy with a higher version number may simply not have started yet. To solve this, ES 5.x introduced a new strategy: give each shard copy a UUID and record in the cluster-level meta-information which copies are the latest. Because ES writes to the primary shard first and the primary's node then forwards the request to the replica shards, the node holding the primary must have the latest data; if forwarding to a replica fails, the primary asks the Master to remove that copy from the list. Therefore, starting from ES 5.x, the primary shard election determines the primary through the "latest shard copy list" recorded in the cluster meta-information: the chosen copy must appear both in the reported information and in that list.
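A hypothetical sketch of that rule: pick as primary a copy that both reported back and appears in the "latest copies" list kept in the cluster meta-information (in real ES this corresponds to the in-sync allocation IDs). All identifiers below are made up.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Sketch of primary shard selection for one shard, e.g. [website][0].
// The allocation IDs are invented; real ES keeps a similar "in-sync"
// set of copy IDs in the cluster-level meta-information.
public class PrimarySelectionSketch {

    record ShardCopy(String nodeId, String allocationId) {}

    static Optional<ShardCopy> selectPrimary(List<ShardCopy> reportedCopies,
                                             Set<String> latestCopyIds) {
        // A copy is eligible only if it was reported by a live node AND
        // it is recorded as one of the latest copies in the cluster meta.
        return reportedCopies.stream()
                .filter(copy -> latestCopyIds.contains(copy.allocationId()))
                .findFirst();
    }

    public static void main(String[] args) {
        List<ShardCopy> reported = List.of(
                new ShardCopy("node-2", "alloc-b"),   // stale copy
                new ShardCopy("node-3", "alloc-c"));  // up-to-date copy
        Set<String> latest = Set.of("alloc-a", "alloc-c");  // alloc-a's node is still down

        System.out.println(selectPrimary(reported, latest));  // the copy on node-3 is chosen
    }
}
```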

If the cluster has the following setting:

cluster.routing.allocation.enable: none

then shard allocation is disabled. Even so, the cluster still forces the allocation of primary shards, which is why, with this option set, the cluster status after a restart is Yellow rather than Red.

3.2 Selecting the replica shards

After the primary shard is elected, a replica shard is chosen from the shard information gathered in the previous step. If no suitable copy exists in that information, a brand-new replica is allocated, subject to the delay configuration item:

index.unassigned.node_left.delayed_timeout: 100000

The largest cluster in our production environment has more than 100 nodes. Losing a node is not uncommon, and often it cannot be dealt with immediately, so this delay is usually configured on the order of days.
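The effect of index.unassigned.node_left.delayed_timeout can be pictured with the small hypothetical check below: a brand-new replica is allocated elsewhere only after the configured delay has passed since the node left. The method and field names are invented; the real decision is made by ES's allocation deciders.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the delayed replica allocation decision controlled by
// index.unassigned.node_left.delayed_timeout. Names are invented.
public class DelayedAllocationSketch {

    // Returns true when the delay has expired and a brand-new replica
    // may be allocated on another node.
    static boolean shouldAllocateNewReplica(Instant nodeLeftAt,
                                            Duration delayedTimeout,
                                            Instant now) {
        return Duration.between(nodeLeftAt, now).compareTo(delayedTimeout) >= 0;
    }

    public static void main(String[] args) {
        Instant nodeLeftAt = Instant.parse("2020-01-01T00:00:00Z");
        Duration delay = Duration.ofDays(1);   // delay configured on the order of days

        System.out.println(shouldAllocateNewReplica(
                nodeLeftAt, delay, Instant.parse("2020-01-01T12:00:00Z"))); // false: wait for the node to return
        System.out.println(shouldAllocateNewReplica(
                nodeLeftAt, delay, Instant.parse("2020-01-02T06:00:00Z"))); // true: give up and rebuild the replica
    }
}
```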

Finally, nodes that start up later are allowed to join the cluster while allocation is in progress.

4. Index recovery

After shard allocation succeeds, the process enters the recovery stage. Recovery of a primary shard does not wait for its replica shards to be allocated; they are independent processes. However, recovery of a replica shard does not start until its primary shard has finished recovering.

Why is recovery needed? For the primary shard, some data may not have been flushed yet; for the replica shards, either data has not been flushed, or the primary has been written to while the replica has not yet caught up, leaving the primary and replica shards inconsistent.

4.1 Primary shard recovery

Every write operation records an entry in the transaction log (translog), which stores the operation and its associated data. Therefore, replaying the translog written since the last commit (a Lucene commit is an fsync-to-disk process) and rebuilding the Lucene index from it completes the recovery of the primary shard.
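As a rough, hypothetical sketch of what "replay the translog since the last commit" means: the operation type and the in-memory document map below are invented; the real logic lives in ES's engine and recovery code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of primary shard recovery: replay translog operations recorded
// after the last Lucene commit. Types are invented for illustration.
public class PrimaryRecoverySketch {

    record TranslogOp(long seqNo, String docId, String source) {}

    // Replays every operation whose sequence number is greater than the one
    // covered by the last commit, rebuilding the latest index state.
    static Map<String, String> recover(Map<String, String> committedDocs,
                                       List<TranslogOp> translog,
                                       long lastCommittedSeqNo) {
        Map<String, String> docs = new TreeMap<>(committedDocs);
        for (TranslogOp op : translog) {
            if (op.seqNo() > lastCommittedSeqNo) {
                docs.put(op.docId(), op.source());   // re-apply the uncommitted write
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<String, String> committed = Map.of("doc-1", "v1");
        List<TranslogOp> translog = List.of(
                new TranslogOp(1, "doc-1", "v1"),    // already covered by the commit
                new TranslogOp(2, "doc-2", "v1"),    // not yet committed
                new TranslogOp(3, "doc-1", "v2"));   // not yet committed
        System.out.println(recover(committed, translog, 1));  // {doc-1=v2, doc-2=v1}
    }
}
```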

4.2 Replica shard recovery

Recovery of replica shards is more complicated, and the replica recovery strategy has been adjusted considerably over ES's version history.

A replica shard must end up consistent with its primary shard, while new index operations are still allowed during recovery. In the current 6.0 version, recovery is performed in two phases.

  • phase1: On the node holding the primary shard, acquire a translog retention lock. From that point on, the translog is retained and is not affected by flushes. Then take a snapshot of the shard through the Lucene interface, which covers the shard data already flushed to disk, and copy this data to the replica node. Before phase1 completes, the replica node is told to start its engine, so the replica shard can already handle write requests normally before phase2 begins.

  • phase2: Take a snapshot of the translog. This snapshot contains the new operations from the start of phase1 up to the moment the translog snapshot is taken. Send these translog operations to the node holding the replica shard and replay them there.
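To tie the two phases together, here is a hypothetical outline of the recovery driver on the primary side; every type and method is invented, and the real implementation lives in ES's peer recovery code.

```java
import java.util.List;

// Hypothetical outline of the two-phase replica recovery described above.
// All types and methods are invented for illustration.
public class TwoPhaseRecoverySketch {

    interface ReplicaTarget {
        void receiveFiles(List<String> segmentFiles);    // phase1: copy flushed segment data
        void startEngine();                              // replica can now accept live writes
        void replayTranslog(List<String> operations);    // phase2: replay newer operations
    }

    static void recover(ReplicaTarget replica,
                        List<String> luceneSnapshotFiles,
                        List<String> translogSinceSnapshot) {
        // phase1: send the point-in-time Lucene snapshot (data already flushed to disk)
        replica.receiveFiles(luceneSnapshotFiles);
        // the replica starts its engine before phase2, so live writes are accepted from here on
        replica.startEngine();
        // phase2: send and replay the translog operations accumulated since the snapshot
        replica.replayTranslog(translogSinceSnapshot);
    }

    public static void main(String[] args) {
        ReplicaTarget printingReplica = new ReplicaTarget() {
            public void receiveFiles(List<String> f)     { System.out.println("phase1 files: " + f); }
            public void startEngine()                    { System.out.println("engine started"); }
            public void replayTranslog(List<String> ops) { System.out.println("phase2 replay: " + ops); }
        };
        recover(printingReplica,
                List.of("_0.si", "_0.cfs"),              // invented segment file names
                List.of("index doc-9", "delete doc-3")); // invented translog operations
    }
}
```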

Because new writes must be supported during recovery (to keep ES available), the following issues need special attention in these two phases:

  1. Shard data integrity: how do we ensure the replica shard loses no data? The translog snapshot in phase2 includes all operations newly written during phase1. But what if a Lucene commit happens during phase1 (flushing the data in the filesystem write buffer to disk and clearing the translog), so the translog is cleared? Before ES 2.0, the flush operation was blocked so that the entire translog was retained. Starting from version 2.0, to avoid this approach producing an overly large translog, the concept of translog.view was introduced: creating a view makes all subsequent operations retrievable. Starting from version 6.0, translog.view was removed and the TranslogDeletionPolicy concept was introduced, which takes a snapshot of the translog to keep it from being cleaned up. This allows Lucene commits during phase1.
  2. Data consistency: Before ES 2.0, replica shard recovery had three phases; the third phase blocked new index operations and transmitted the translog newly accumulated during phase2, which took only a short time. Since version 2.0, the third phase has been removed, and there is no write blocking during recovery. Instead, on the replica node, the out-of-order and conflicting cases between the writes arriving between phase1 and phase2 and the operations replayed in phase2 are handled in the write path by comparing version numbers and filtering out stale operations.

In this way, out-of-order operations are ignored: for any given doc, only the latest operation takes effect, which keeps the primary and replica shards consistent.
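A hypothetical sketch of "compare version numbers and drop stale operations" on the replica: a live write arriving between phase1 and phase2 and a later translog replay may touch the same doc, and the replica keeps whichever operation carries the higher version. The types and numbers below are invented.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of filtering out-of-order operations on a replica shard by
// version number, so replayed translog entries cannot overwrite newer
// live writes. Types and versions are invented for illustration.
public class ReplicaVersionFilterSketch {

    record IndexedDoc(String source, long version) {}

    private final Map<String, IndexedDoc> docs = new HashMap<>();

    // Applies an operation only if it is newer than what is already stored.
    void apply(String docId, String source, long version) {
        IndexedDoc current = docs.get(docId);
        if (current == null || version > current.version()) {
            docs.put(docId, new IndexedDoc(source, version));
        }
        // else: a stale operation (e.g. an old translog entry replayed in phase2) is ignored
    }

    public static void main(String[] args) {
        ReplicaVersionFilterSketch replica = new ReplicaVersionFilterSketch();
        replica.apply("doc-1", "live write between phase1 and phase2", 3);
        replica.apply("doc-1", "older op replayed from the translog snapshot", 2); // ignored
        System.out.println(replica.docs);  // keeps the version-3 write
    }
}
```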

Phase1 is especially time-consuming because it pulls the full shard data from the primary. In ES 6.x, phase1 is optimized further by tagging each operation: in the normal write path, every successful write is assigned a sequence number, so the range of missing operations can be computed by comparing sequence numbers. In the implementation, a global checkpoint and a local checkpoint were added. The global checkpoint, maintained by the primary shard, means that all shard copies have been written up to that sequence number; the local checkpoint is the highest sequence number that the current shard copy has successfully written. During recovery, the missing range is computed by comparing the two checkpoints and is then replayed from the translog, which is now retained for a longer time.

Therefore, there are two opportunities to skip phase1 of replica shard recovery (see the sketch after this list):

  • based on sequence numbers, recover the missing operations from the translog on the node holding the primary shard;
  • the primary and replica shards have the same sync id and the same doc count, so phase1 can be skipped entirely.
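The sketch below uses invented numbers to illustrate how the checkpoints, sequence numbers, and sync ids described above could decide whether phase1 is needed and which operation range must be replayed; real ES makes this decision in its peer-recovery code, and all names here are hypothetical.

```java
// Sketch of deciding whether phase1 (full file copy) can be skipped during
// replica recovery, based on sequence numbers / checkpoints and sync ids.
// All names and numbers are invented for illustration.
public class RecoveryPlanSketch {

    // Phase1 can be skipped if the primary's translog still holds every
    // operation the replica is missing, or if both copies carry the same
    // sync id and doc count.
    static boolean canSkipPhase1(long replicaLocalCheckpoint,
                                 long oldestSeqNoInPrimaryTranslog,
                                 String primarySyncId, long primaryDocCount,
                                 String replicaSyncId, long replicaDocCount) {
        boolean sameSyncId = primarySyncId != null
                && primarySyncId.equals(replicaSyncId)
                && primaryDocCount == replicaDocCount;
        boolean translogCoversGap = replicaLocalCheckpoint + 1 >= oldestSeqNoInPrimaryTranslog;
        return sameSyncId || translogCoversGap;
    }

    public static void main(String[] args) {
        long replicaLocalCheckpoint = 120;      // replica has ops up to seq no 120
        long oldestRetainedSeqNo = 100;         // primary's translog still starts at 100
        long primaryMaxSeqNo = 150;             // primary has written up to 150

        if (canSkipPhase1(replicaLocalCheckpoint, oldestRetainedSeqNo,
                          "sync-abc", 10_000, "sync-xyz", 10_000)) {
            // Only the missing range needs to be replayed from the translog.
            System.out.println("replay ops " + (replicaLocalCheckpoint + 1)
                    + ".." + primaryMaxSeqNo);
        } else {
            System.out.println("fall back to phase1: copy Lucene segment files");
        }
    }
}
```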

5. Summary

Once the primary shard of an index is successfully allocated, writes to that shard are allowed. When all primary shards of an index are allocated, the index becomes Yellow; when the primary shards of all indexes are allocated, the whole cluster becomes Yellow. When all shards (primaries and replicas) of an index are allocated, the index becomes Green; when all shards of all indexes are allocated, the whole cluster becomes Green.
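A small hypothetical helper mirroring the Red/Yellow/Green rules just summarized; the enum and the shard counters are made up, and real ES computes health in its cluster health API.

```java
// Sketch of the Red/Yellow/Green rules summarized above.
// The enum and the shard counters are invented for illustration.
public class ClusterHealthSketch {

    enum Health { RED, YELLOW, GREEN }

    static Health indexHealth(int primaries, int activePrimaries,
                              int replicas, int activeReplicas) {
        if (activePrimaries < primaries) {
            return Health.RED;        // some primary shard is still unassigned
        }
        if (activeReplicas < replicas) {
            return Health.YELLOW;     // all primaries active, some replicas missing
        }
        return Health.GREEN;          // every shard copy is active
    }

    public static void main(String[] args) {
        System.out.println(indexHealth(5, 3, 5, 0));  // RED
        System.out.println(indexHealth(5, 5, 5, 2));  // YELLOW
        System.out.println(indexHealth(5, 5, 5, 5));  // GREEN
        // Cluster health is simply the worst health across all indexes.
    }
}
```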

Index data recovery is the longest part of the whole process. When the total number of shards reaches the hundred-thousand level, clusters running versions before 6.x may take hours to go from Red to Green. The ability of replicas in ES 6.x to recover from the translog is a major improvement, because it avoids pulling the full data from the node holding the primary shard and saves a great deal of recovery time.

Origin: blog.csdn.net/dwjf321/article/details/104003852