"Elasticsearch Source Code Analysis and Optimization in Practice" Chapter 5: Master Election Process

1. Introduction

The Discovery module is responsible for discovering the nodes in the cluster and electing the master node. ES supports several discovery types: the built-in implementation is called Zen Discovery, and others target public cloud platforms such as Amazon EC2 and Google GCE.

This chapter discusses the built-in Zen Discovery implementation. Zen Discovery encapsulates node discovery (Ping), master election, and related processes. We discuss the master election process first and introduce the Discovery module as a whole in later chapters.

2. Design approach

All distributed systems need to deal with consistency in some way. In general, strategies fall into two groups:

  • try to avoid inconsistencies;

  • define how to reconcile inconsistencies after they occur.

The latter is very powerful in the scenarios where it applies, but it places relatively strict restrictions on the data model. So here we look at the former, and at how to deal with network failures.

3. Why use the master-slave mode

Besides the master-slave (Leader/Follower) mode, another option is a distributed hash table (DHT), which can support thousands of nodes leaving and joining per hour and can work in heterogeneous networks with no knowledge of the underlying network topology. Query response time is about 4 to 10 hops; Cassandra, for example, uses this scheme. In a relatively stable peer-to-peer network, however, the master-slave mode works better.

Another simplification in typical ES scenarios is that a cluster does not have that many nodes. The number of nodes is usually far smaller than the number of connections a single node can maintain, and the network environment does not have to handle nodes joining and leaving frequently. This is why the master-slave mode suits ES better.

4. Election Algorithm

When choosing a master election algorithm, the basic principle is not to reinvent the wheel: it is better to implement a well-known algorithm whose advantages and disadvantages are understood. ES mainly considered the following two algorithms.

4.1. Bully Algorithm

The Bully algorithm is one of the basic algorithms for Leader election. It assumes that every node has a unique ID and uses that ID to order the nodes. At any time, the current Leader is the node with the highest ID participating in the cluster. The advantage of this algorithm is that it is easy to implement. However, it has problems when the node with the largest ID is unstable. For example, the Master becomes overloaded and hangs, and the node with the second-largest ID is elected as the new Master. The original Master then recovers, is elected Master again, hangs again, and so on.

ES solves this problem by postponing the election until the current Master fails: as long as the current Master has not gone down, no re-election takes place. However, this easily produces split-brain (dual Masters), so split-brain is in turn prevented by requiring a quorum, that is, votes from more than half of the master-eligible nodes.
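The deferred election described above can be sketched in a few lines of Java. This is only an illustration, not ES code: the class DeferredBullyElection, its method onLiveNodes, and the string node IDs are hypothetical names introduced here. The sketch keeps the current Master while it is reachable and elects the highest live ID (as in the classic Bully algorithm) only when the Master is gone and a quorum of nodes is visible.

```java
import java.util.Collections;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch, not Elasticsearch code: a Bully-style election that is
// postponed until the current master fails and that requires a quorum.
final class DeferredBullyElection {
    private final int quorum;
    private String currentMaster; // null while there is no master

    DeferredBullyElection(int quorum) {
        if (quorum < 1) {
            throw new IllegalArgumentException("quorum must be at least 1");
        }
        this.quorum = quorum;
    }

    // Re-evaluates the master for the given set of reachable node IDs.
    Optional<String> onLiveNodes(Set<String> liveNodes) {
        // Without a quorum of reachable nodes there can be no master at all.
        if (liveNodes.size() < quorum) {
            currentMaster = null;
            return Optional.empty();
        }
        // Deferred election: keep the current master while it is reachable,
        // even if a node with a higher ID has come back.
        if (currentMaster != null && liveNodes.contains(currentMaster)) {
            return Optional.of(currentMaster);
        }
        // The master is gone (or never existed): highest ID wins, as in Bully.
        currentMaster = Collections.max(liveNodes);
        return Optional.of(currentMaster);
    }
}
```

With a quorum of 2 and nodes n1, n2, n3, the first call elects n3. If n3 hangs, n2 takes over, and when n3 recovers n2 stays Master, which is exactly the flapping that plain Bully would not avoid.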

4.2. Paxos Algorithm

Paxos is very powerful. In particular, its flexibility in when and how elections are conducted is a great advantage over the simple Bully algorithm, because in real life there are more failure modes than network connection anomalies. But Paxos is very complicated to implement.

5. Related configuration

Important configuration items related to the master election process include the following; the list is not exhaustive.

discovery.zen.minimum_master_nodes: the minimum number of master-eligible nodes. This is an extremely important parameter for preventing split-brain and data loss, and its actual effect goes beyond what the name suggests. Besides determining the "majority" during master election, it is used in several other important decisions, including at least the following:

  • Triggering the master election: before entering the election process, the number of participating nodes must reach the quorum.

  • Confirming the Master: after a temporary Master is selected, it must see that the number of nodes that have joined it reaches the quorum before the election is confirmed as successful.

  • Gateway metadata election: requests are sent to the master-eligible nodes to obtain metadata, and the number of responses, that is, the number of nodes participating in the metadata election, must reach the quorum.

  • Master publishing the cluster state: the number of nodes to which the publish succeeds must be a majority.

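To avoid split-brain, its value should be a majority (quorum) of the master-eligible nodes: (master_eligible_nodes / 2) + 1

For example, if there are 3 master-eligible nodes, this value should be set to at least (3 / 2) + 1 = 2 (integer division).

This parameter can be set dynamically:
PUT _cluster/settings
{
    "persistent" : {
        "discovery.zen.minimum_master_nodes" : 2
    }
}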
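To see why a majority is the right value, the arithmetic can be sketched in Java. This is a minimal illustration, not ES code: the class QuorumSketch and both of its methods are hypothetical names introduced here. It computes n / 2 + 1 with integer division and checks, for a two-way network partition, whether both sides could reach the configured minimum at the same time.

```java
// Hypothetical helper, not part of Elasticsearch: computes the recommended
// value of discovery.zen.minimum_master_nodes and checks it against partitions.
final class QuorumSketch {
    private QuorumSketch() {}

    // Majority of the master-eligible nodes, using integer division.
    static int quorum(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1;
    }

    // True if some two-way partition exists in which both sides reach
    // minimumMasterNodes, i.e. if split-brain is possible with this setting.
    static boolean splitBrainPossible(int masterEligibleNodes, int minimumMasterNodes) {
        for (int sideA = 0; sideA <= masterEligibleNodes; sideA++) {
            int sideB = masterEligibleNodes - sideA;
            if (sideA >= minimumMasterNodes && sideB >= minimumMasterNodes) {
                return true;
            }
        }
        return false;
    }
}
```

For 4 master-eligible nodes, a setting of 2 would allow a 2/2 partition in which both halves elect a Master, while the quorum of 3 rules that out.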

discovery.zen.ping.unicast.hosts: the seed node list of the cluster. When building a cluster, the local node tries to connect to the nodes in this list, and then learns from those hosts which other hosts are in the cluster. The list can contain some or all of the cluster nodes. It can be specified like this:

discovery.zen.ping.unicast.hosts:
    - 192.168.1.10:9300
    - 192.168.1.11
    - seeds.mydomain.com

Port 9300 is used by default. To use a different port, specify it after the IP address. A domain name can also be given; it may resolve to multiple IP addresses, and ES will try to connect to all of them.
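The default-port rule can be illustrated with a small parser. This is a sketch under stated assumptions, not the ES implementation: the class SeedHostParser and its nested HostPort type are hypothetical, IPv6 literals are deliberately not handled, and DNS resolution of names such as seeds.mydomain.com is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Elasticsearch code: turns entries of the form
// "host" or "host:port" into host/port pairs, defaulting to port 9300.
final class SeedHostParser {
    static final int DEFAULT_PORT = 9300;

    static final class HostPort {
        final String host;
        final int port;

        HostPort(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    // Parses one entry; an entry without ":port" gets DEFAULT_PORT.
    static HostPort parse(String entry) {
        String trimmed = entry.trim();
        int colon = trimmed.lastIndexOf(':');
        if (colon < 0) {
            return new HostPort(trimmed, DEFAULT_PORT);
        }
        String host = trimmed.substring(0, colon);
        int port = Integer.parseInt(trimmed.substring(colon + 1));
        if (host.isEmpty() || port < 1 || port > 65535) {
            throw new IllegalArgumentException("invalid seed host entry: " + entry);
        }
        return new HostPort(host, port);
    }

    // Parses a whole seed list, preserving its order.
    static List<HostPort> parseAll(List<String> entries) {
        List<HostPort> result = new ArrayList<>();
        for (String entry : entries) {
            result.add(parse(entry));
        }
        return result;
    }
}
```

Applied to the list above, the first entry keeps its explicit port and the other two fall back to 9300.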

  • discovery.zen.ping.unicast.hosts.resolve_timeout: DNS resolution timeout; the default is 5 seconds.

  • discovery.zen.join_timeout: the timeout for a node joining an existing cluster; the default is 20 times ping_timeout.

  • discovery.zen.join_retry_attempts: the number of retries after join_timeout expires; the default is 3.

  • discovery.zen.join_retry_delay: the delay before retrying after join_timeout expires; the default is 100 milliseconds.

  • discovery.zen.master_election.ignore_non_master_pings: when set to true, the master election phase ignores pings from nodes that are not master-eligible (node.master: false); the default is false.

  • discovery.zen.fd.ping_interval: the interval between fault detection pings; the default is 1 second.

  • discovery.zen.fd.ping_timeout: the timeout of a fault detection request; the default is 30 seconds.

  • discovery.zen.fd.ping_retries: the number of retries after a fault detection request times out; the default is 3.

6. Process analysis

6.0. Process overview

6.0.1. ZenDiscovery's master selection process is as follows:

  • Each node computes the smallest node ID it knows of; that node is the temporary Master. The node sends its leadership vote to it.

  • If a node receives enough votes and has also voted for itself, it assumes the role of leader and starts publishing the cluster state.

  • All nodes take part in the election and vote, but only votes from master-eligible nodes (node.master is true) count.

The number of votes needed to win an election is the quorum. In ES, the quorum size is configurable through discovery.zen.minimum_master_nodes. To avoid split-brain, its minimum value should be n/2+1, where n is the number of master-eligible nodes.
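The three voting rules above can be sketched as a simple tally. This is an illustration only, not ES code: VoteTallySketch, its Vote type, and the winner method are hypothetical, and the sketch assumes each voter casts exactly one vote. A candidate wins when the votes it received from master-eligible voters reach the quorum and it has also voted for itself.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch, not Elasticsearch code: counts leadership votes.
final class VoteTallySketch {
    static final class Vote {
        final String voterId;
        final boolean voterMasterEligible;
        final String candidateId;

        Vote(String voterId, boolean voterMasterEligible, String candidateId) {
            this.voterId = voterId;
            this.voterMasterEligible = voterMasterEligible;
            this.candidateId = candidateId;
        }
    }

    // Returns the winning candidate, or empty if nobody reaches the quorum.
    // Assumes one vote per voter.
    static Optional<String> winner(List<Vote> votes, int quorum) {
        Map<String, Integer> validVotes = new TreeMap<>();
        Set<String> votedForSelf = new HashSet<>();
        for (Vote vote : votes) {
            if (vote.voterId.equals(vote.candidateId)) {
                votedForSelf.add(vote.voterId);
            }
            // Every node may vote, but only master-eligible voters are counted.
            if (vote.voterMasterEligible) {
                validVotes.merge(vote.candidateId, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> entry : validVotes.entrySet()) {
            boolean enoughVotes = entry.getValue() >= quorum;
            if (enoughVotes && votedForSelf.contains(entry.getKey())) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

If n1 and n2 are master-eligible, n3 is not, and all three vote for n1, then n1 wins with a quorum of 2 but not with a quorum of 3, because the vote from n3 is not counted.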

6.0.2. The overall process can be summarized as:

  • Elect a temporary Master.

  • Vote to confirm the Master: if the local node is elected, wait for it to be confirmed as Master; if another node is elected, try to join the cluster. Then start the node failure detector.

  • Detect failed nodes.

The details are shown in the figure below.

Thread pool for executing this process: generic.

Below we analyze the implementation of each step in detail.

6.1. Election of temporary Master

The election is implemented in ZenDiscovery#findMaster. This function looks up the active Master of the current cluster, or selects a new Master from the candidates. It returns the selected Master if the selection succeeds, and null otherwise.

Why is it only a temporary Master? Because it must still wait for the next step: the node is confirmed as the real Master only once it has enough votes.

The temporary Master is elected as follows:

  • "Ping" all nodes and collect the responses into the list fullPingResponses. The ping results do not include the local node, so the local node is added to fullPingResponses separately.

  • Build two lists:

    • activeMasters list: stores the currently active Masters of the cluster.

    • masterCandidates list: stores the master candidates.

activeMasters list: traverse all the nodes obtained in the first step and add the node that each one considers to be the current Master to activeMasters (excluding the local node). During the traversal, if discovery.zen.master_election.ignore_non_master_pings is true (the default is false) and a node is not master-eligible, that node is skipped.

The specific process is shown in the figure below.

[Figure: building the activeMasters list]

This step adds the Masters that currently exist in the cluster to the activeMasters list; normally there is only one. If the cluster already has a Master, every node records which node it is. Under abnormal conditions, however, different nodes may see different current Masters. While building the activeMasters list, the ignore_non_master_pings option can be used to ignore the Master reported by nodes that are not master-eligible.

masterCandidates list: traverse the list obtained in the first step, drop the nodes that are not master-eligible, and add the rest to this list. If activeMasters is empty, the Master is elected from masterCandidates, and that election may succeed or fail. If activeMasters is not empty, the most suitable node in it is chosen as Master.
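The two lists and the final choice can be put together in a simplified model. This is a sketch, not the actual ZenDiscovery#findMaster code: FindMasterSketch and its PingResponse type are hypothetical, nodes are reduced to string IDs, and cluster state versions are ignored, so "most suitable" simply means the smallest ID here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model of the steps described above.
final class FindMasterSketch {
    static final class PingResponse {
        final String nodeId;
        final boolean masterEligible;
        final String knownMasterId; // master this node currently sees, or null

        PingResponse(String nodeId, boolean masterEligible, String knownMasterId) {
            this.nodeId = nodeId;
            this.masterEligible = masterEligible;
            this.knownMasterId = knownMasterId;
        }
    }

    // fullPingResponses is expected to contain the local node as well.
    // Returns the ID of the temporary master, or null if the election fails.
    static String findMaster(String localNodeId, List<PingResponse> fullPingResponses,
                             boolean ignoreNonMasterPings, int minimumMasterNodes) {
        List<String> activeMasters = new ArrayList<>();
        List<String> masterCandidates = new ArrayList<>();
        for (PingResponse response : fullPingResponses) {
            if (ignoreNonMasterPings && !response.masterEligible) {
                continue; // skip nodes without Master qualification entirely
            }
            // Record the master this node believes in, excluding the local node.
            if (response.knownMasterId != null && !response.knownMasterId.equals(localNodeId)) {
                activeMasters.add(response.knownMasterId);
            }
            if (response.masterEligible) {
                masterCandidates.add(response.nodeId);
            }
        }
        if (!activeMasters.isEmpty()) {
            return Collections.min(activeMasters); // pick among known masters
        }
        if (masterCandidates.isEmpty() || masterCandidates.size() < minimumMasterNodes) {
            return null; // not enough candidates: the election fails
        }
        return Collections.min(masterCandidates);
    }
}
```

When no node reports a Master, the result comes from masterCandidates and may be null; as soon as any response names a Master other than the local node, that Master is preferred.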

The overall process is shown in the figure below.

[Figure: overall flow of electing the temporary Master]

6.1.1. Choosing the master from masterCandidates

The details of master selection are encapsulated in the ElectMasterService class, for example checking whether there are enough candidates and selecting a specific node as Master.

When selecting a Master from masterCandidates, the first step is to check whether the number of candidates has reached the quorum; if not, the election fails.

public boolean hasEnoughCandidates(Collection<MasterCandidate> candidates) {
    // No candidates: the election fails
    if (candidates.isEmpty()) {
        return false;
    }
    // The default value is -1, which lets a single-node cluster elect a master
    if (minimumMasterNodes < 1) {
        return true;
    }
    return candidates.size() >= minimumMasterNodes;
}

When the number of candidates reaches the quorum, one of the candidates is selected as Master:

public MasterCandidate electMaster(Collection<MasterCandidate> candidates) {
    List<MasterCandidate> sortedCandidates = new ArrayList<>(candidates);
    // Sort the candidates in ascending order using the custom comparison function
    sortedCandidates.sort(MasterCandidate::compare);
    // Return the first element: the candidate with the newest cluster state
    return sortedCandidates.get(0);
}

As you can see, the nodes are simply sorted and the smallest one is chosen as Master. The sort, however, uses the custom comparison function MasterCandidate::compare. Earlier versions sorted by node ID only; now nodes with a higher cluster state version are ranked first.

With the default comparison function, the result is sorted in ascending order. For reference, the comparison function of the Long type is implemented as:

public static int compare(long x, long y) {
    return (x < y) ? -1 : ((x == y) ? 0 : 1);
}

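Implementation of the custom comparison function:
public static int compare(MasterCandidate c1, MasterCandidate c2) {
    // Compare cluster state versions first. Note that c2 comes before c1 here,
    // so the candidate with the higher version sorts first
    int ret = Long.compare(c2.clusterStateVersion, c1.clusterStateVersion);
    // If the versions are equal, compare the node IDs
    if (ret == 0) {
        ret = compareNodes(c1.getNode(), c2.getNode());
    }
    return ret;
}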

Implementation of the node comparison function compareNodes. Its effect on the sort order is:

  • If one of the two nodes is master-eligible and the other is not, the master-eligible node is ranked first.

  • If neither or both are master-eligible, the node IDs are compared.

The nodes in the masterCandidates list are all master-eligible, however. The two if checks in compareNodes exist because other callers may pass node lists that contain nodes without Master qualification. So in this case only node IDs are compared.

private static int compareNodes(DiscoveryNode o1, DiscoveryNode o2) {
    // The two ifs handle the case where one node is master-eligible and the other is not
    if (o1.isMasterNode() && !o2.isMasterNode()) {
        return -1;
    }
    if (!o1.isMasterNode() && o2.isMasterNode()) {
        return 1;
    }
    // Otherwise order by node ID
    return o1.getId().compareTo(o2.getId());
}
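To make the resulting order concrete, the two comparison functions can be reproduced on a stand-in type and run. This is a self-contained sketch: CandidateOrderSketch and its Candidate class are hypothetical and merge MasterCandidate and DiscoveryNode into one simplified type, while the comparison logic mirrors the excerpts above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the ordering from the excerpts above on a simplified type.
final class CandidateOrderSketch {
    static final class Candidate {
        final String nodeId;
        final boolean masterEligible;
        final long clusterStateVersion;

        Candidate(String nodeId, boolean masterEligible, long clusterStateVersion) {
            this.nodeId = nodeId;
            this.masterEligible = masterEligible;
            this.clusterStateVersion = clusterStateVersion;
        }
    }

    // Master-eligible nodes first, then ascending node ID.
    static int compareNodes(Candidate o1, Candidate o2) {
        if (o1.masterEligible && !o2.masterEligible) {
            return -1;
        }
        if (!o1.masterEligible && o2.masterEligible) {
            return 1;
        }
        return o1.nodeId.compareTo(o2.nodeId);
    }

    // Higher cluster state version first (arguments swapped), then compareNodes.
    static int compare(Candidate c1, Candidate c2) {
        int ret = Long.compare(c2.clusterStateVersion, c1.clusterStateVersion);
        if (ret == 0) {
            ret = compareNodes(c1, c2);
        }
        return ret;
    }

    // Sorts a copy of the (non-empty) list and returns its first element.
    static Candidate elect(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(CandidateOrderSketch::compare);
        return sorted.get(0);
    }
}
```

Given candidates a (version 5), b (version 7), and c (version 7), the elected node is b: the two nodes with version 7 beat a, and b wins the tie on node ID.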

Selecting from the activeMasters list: this list stores the currently active Masters in the cluster, and one of these known Master nodes is chosen as the election result. The selection is very simple: take the minimum of the list. The comparison is again done by compareNodes, and the nodes in the activeMasters list are in theory all master-eligible.

public DiscoveryNode tieBreakActiveMasters(Collection<DiscoveryNode> activeMasters) {
    return activeMasters.stream().min(ElectMasterService::compareNodes).get();
}

6.1.2. Implementation of casting and counting votes

In ES, casting a vote means sending a JoinRequest. The number of votes a node has is the number of join requests it has received. Votes are collected and counted in the ZenDiscovery#handleJoinRequest method, and the received join requests are stored in ElectionContext#joinRequestAccumulator. When a node checks whether it has enough votes, it checks whether enough nodes have asked to join it; votes from nodes that are not master-eligible are excluded.

public synchronized int getPendingMasterJoinsCount() {
    int pendingMasterJoins = 0;
    // Iterate over the join requests received so far
    for (DiscoveryNode node : joinRequestAccumulator.keySet()) {
        // Count only master-eligible nodes
        if (node.isMasterNode()) {
            pendingMasterJoins++;
        }
    }
    return pendingMasterJoins;
}

6.1.3. Confirm the Master or join the cluster

The elected temporary Master is either the local node or another node, and the two cases are handled separately. The node is now ready to send its vote to the temporary Master.

6.1.3.1. If the temporary Master is the local node:

  • Wait for enough master-eligible nodes to join this node (that is, for the votes to reach the quorum) to complete the election. If the required number of join requests has not arrived when the timeout expires (30 seconds by default, configurable), the election fails and a new round is needed.

  • On success, publish a new clusterState.
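Waiting for enough joins with a timeout can be sketched with a plain monitor. This is an illustration, not the ElectionContext implementation: JoinAccumulatorSketch and its methods are hypothetical, and the number of required joins is passed in rather than derived from the ES settings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Elasticsearch code: collects join requests and
// lets the temporary master wait until enough master-eligible nodes joined.
final class JoinAccumulatorSketch {
    private final int requiredJoins;
    private final Map<String, Boolean> joins = new HashMap<>(); // nodeId -> master-eligible?

    JoinAccumulatorSketch(int requiredJoins) {
        this.requiredJoins = requiredJoins;
    }

    // Called for every incoming join request (one "vote").
    synchronized void onJoinRequest(String nodeId, boolean masterEligible) {
        joins.put(nodeId, masterEligible);
        notifyAll(); // wake up a waiting awaitQuorum call
    }

    // Counts only the joins from master-eligible nodes.
    synchronized int pendingMasterJoins() {
        int count = 0;
        for (boolean eligible : joins.values()) {
            if (eligible) {
                count++;
            }
        }
        return count;
    }

    // Returns true once enough valid joins arrived, false on timeout or interrupt.
    synchronized boolean awaitQuorum(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        try {
            while (pendingMasterJoins() < requiredJoins) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false; // election failed, a new round is needed
                }
                TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller that gets false back would start a new election round; on true it would go on to publish the new cluster state.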

6.1.3.2. If another node is selected as Master:

  • Stop accepting join requests from other nodes.

  • Send a join request to the Master and wait for the reply. The timeout defaults to 1 minute (configurable), and on an exception the request is retried 3 times by default (configurable). This step is implemented in the joinElectedMaster method.

  • The finally elected Master publishes the cluster state before acknowledging the join request. So when joinElectedMaster returns, the join request has been acknowledged and the cluster state has been received. At this point, if the Master node in the received cluster state is empty, or the elected Master is not the node selected earlier, the election is run again.
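The retry behaviour of the join step can be sketched as a bounded loop. This is not the joinElectedMaster code: JoinRetrySketch is hypothetical, the join request itself is abstracted into a BooleanSupplier, and treating the configured value as the total number of attempts is an assumption of this sketch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not Elasticsearch code: bounded retries with a delay.
final class JoinRetrySketch {
    private JoinRetrySketch() {}

    // sendJoinRequest returns true when the master acknowledged the join.
    // maxAttempts is the total number of attempts (an assumption of this sketch).
    static boolean joinWithRetry(BooleanSupplier sendJoinRequest, int maxAttempts, long retryDelayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sendJoinRequest.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(retryDelayMillis); // wait before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false; // give up; the caller starts a new election round
    }
}
```

With the defaults quoted earlier this would be called as joinWithRetry(request, 3, 100), where request stands for whatever sends the join request and waits for the reply.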

6.2. Node failure detection

At this point the master election is complete: the Master's identity is confirmed and the non-Master nodes have joined the cluster. Node failure detection monitors whether nodes have gone offline and handles the resulting exceptions. It is an indispensable step after master election; without it, split-brain (two or more Masters) may occur. Two failure detectors are started:

  • On the Master node, start NodesFaultDetection (NodesFD for short), which periodically checks whether the nodes that have joined the cluster are alive.

  • On non-Master nodes, start MasterFaultDetection (MasterFD for short), which periodically checks whether the Master node is alive.

Both NodesFaultDetection and MasterFaultDetection check a node's health with periodic ping requests (every 1 second by default). The node-leave event is handled when the number of failures reaches a threshold (3 by default), or when the underlying connection module reports that the node has gone offline.
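The two triggers for the node-leave event can be sketched as a small state machine. This is an illustration, not the NodesFaultDetection or MasterFaultDetection code: FaultDetectorSketch is hypothetical, and counting only consecutive failures (resetting on a successful ping) is an assumption of this sketch.

```java
// Hypothetical sketch, not Elasticsearch code: decides when a pinged node
// should be treated as failed.
final class FaultDetectorSketch {
    private final int pingRetries;
    private int consecutiveFailures;
    private boolean nodeFailed;

    FaultDetectorSketch(int pingRetries) {
        this.pingRetries = pingRetries;
    }

    // Records the result of one periodic ping.
    // Returns true once the node has to be treated as failed.
    boolean onPingResult(boolean success) {
        if (nodeFailed) {
            return true;
        }
        if (success) {
            consecutiveFailures = 0; // assumption: a success resets the counter
            return false;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= pingRetries) {
            nodeFailed = true;
        }
        return nodeFailed;
    }

    // The connection layer reported a disconnect: fail without waiting for pings.
    boolean onDisconnected() {
        nodeFailed = true;
        return true;
    }
}
```

On the Master, a true result would lead to the node removal handling of section 6.2.1; on a non-Master node, it would lead to the rejoin of section 6.2.2.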

6.2.1. NodesFaultDetection event handling

Check whether the total number of nodes in the current cluster still reaches the quorum (more than half). If not, the node gives up its Master role and rejoins the cluster. Why? Consider the scenario shown in the figure below.

[Figure: a 5-node cluster partitioned into a group of 2 and a group of 3]

Assume a network partition splits a 5-node cluster into a group of 2 machines and a group of 3. Before the partition, the Master is Node1, which ends up in the group of 2. The group of 3 nodes now runs a new election and successfully selects Node3 as Master. Would there then be two Masters? NodesFaultDetection exists to avoid dual Masters in exactly this scenario.

The corresponding event handling is implemented in ZenDiscovery#handleNodeFailure and executed in NodeRemovalClusterStateTaskExecutor#execute.

public ClusterTasksResult<Task> execute(final ClusterState currentState, final List<Task> tasks) throws Exception {
    // (excerpt: the code that builds remainingNodesClusterState and resultBuilder is not shown)
    // Check whether the remaining nodes still reach the quorum
    if (electMasterService.hasEnoughMasterNodes(remainingNodesClusterState.nodes()) == false) {
        final int masterNodes = electMasterService.countMasterNodes(remainingNodesClusterState.nodes());
        rejoin.accept(LoggerMessageFormat.format("not enough master nodes (has [{}], but needed [{}])", masterNodes, electMasterService.minimumMasterNodes()));
        return resultBuilder.build(currentState);
    } else {
        return resultBuilder.build(allocationService.deassociateDeadNodes(remainingNodesClusterState, true, describeTasks(tasks)));
    }
}

When the Master detects that a node has gone offline and finds that the number of nodes left in the cluster is below the quorum, it gives up its Master role to avoid dual Masters.
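The step-down decision in the 5-node scenario can be replayed with a small sketch. It is not the NodeRemovalClusterStateTaskExecutor code: StepDownSketch, its Action enum, and the string node IDs are hypothetical, and every node is assumed to be master-eligible.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not Elasticsearch code: what the master does after
// some nodes were detected as failed.
final class StepDownSketch {
    enum Action { REMOVE_FAILED_NODES, REJOIN }

    private StepDownSketch() {}

    // masterEligibleNodes: nodes in the cluster before the failure (including the master).
    static Action onNodesFailed(Set<String> masterEligibleNodes, Set<String> failedNodes,
                                int minimumMasterNodes) {
        Set<String> remaining = new HashSet<>(masterEligibleNodes);
        remaining.removeAll(failedNodes);
        if (remaining.size() < minimumMasterNodes) {
            return Action.REJOIN; // below the quorum: give up the Master role
        }
        return Action.REMOVE_FAILED_NODES; // still a quorum: just drop the dead nodes
    }
}
```

With 5 nodes and a quorum of 3, Node1 loses sight of three nodes after the partition, is left with 2 nodes, and steps down, so only the Master elected by the group of 3 remains. Had it lost a single node, it would simply have removed that node from the cluster.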

6.2.2. MasterFaultDetection event handling

Handling a detected Master failure is very simple: rejoin the cluster. In essence, the node runs the master election process again. The corresponding event handling is implemented in ZenDiscovery#handleMasterGone:

private void handleMasterGone(final DiscoveryNode masterNode, final Throwable cause, final String reason) {
    synchronized (stateMutex) {
        if (localNodeMaster() == false && masterNode.equals(committedState.get().nodes().getMasterNode())) {
            pendingStatesQueue.failAllStatesAndClear(new ElasticsearchException("master left [{}]", reason));
            // Rejoin the cluster
            rejoin("master left (reason = " + reason + ")");
        }
    }
}

7. Summary

The master election process runs when the cluster starts up, taking the cluster from a state with no Master to one with a newly elected Master. It also runs during normal operation, when the Master detects that a node has left or a non-Master node detects that the Master has left.

Origin blog.csdn.net/qq_32907195/article/details/131980402