Elasticsearch internals: analysis of the master election process


The Discovery module is responsible for discovering the nodes in a cluster and electing the master node. ES supports several Discovery types. The built-in implementation is called Zen Discovery; others include Amazon EC2, Google's GCE, and so on. This chapter discusses the built-in Zen Discovery implementation, which encapsulates node discovery (Ping), master election, and related processes. We discuss the master election process first and introduce the Discovery module as a whole in later chapters.

1. Design Ideas

All distributed systems need to deal with consistency issues in some way. In general, the strategies fall into two groups: those that try to avoid inconsistencies, and those that define how to reconcile inconsistencies after they occur. The latter is very powerful in the scenarios where it applies, but it places relatively strict restrictions on the data model. So here we study the former and how it deals with network failures.

2. Why use master-slave mode

In addition to the Leader/Follower (master-slave) mode, another option is a distributed hash table (DHT), which can support thousands of nodes leaving and joining per hour and can work in heterogeneous networks without knowledge of the underlying network topology; the query response time is about 4 to 10 hops. Cassandra, for example, uses this approach. But in a relatively stable peer-to-peer network, the master-slave mode is the better choice.

Another simplification in typical ES scenarios is that a cluster does not have that many nodes. Generally, the number of nodes is much smaller than the number of connections a single node can maintain, and the network environment does not have to handle nodes joining and leaving frequently. This is why the master-slave mode is more suitable for ES.

3. Election Algorithm

When choosing the master election algorithm, the basic principle is not to reinvent the wheel. It is best to implement a well-known algorithm, because its strengths and weaknesses are already understood. ES mainly considered the following two election algorithms.

  1. Bully algorithm

    One of the basic Leader election algorithms. It assumes that every node has a unique ID and uses that ID to order the nodes. At any time, the current Leader is the node with the highest ID among those participating in the cluster. The advantage of this algorithm is that it is easy to implement. However, problems arise when the node with the largest ID is unstable. For example, if the Master is overloaded and becomes unresponsive, the node with the second-largest ID is elected as the new Master. The original Master then recovers and is elected Master again, then becomes unresponsive again, and so on.

    ES solves this problem by postponing the election until the current Master fails: as long as the current Master has not failed, no re-election takes place. However, this easily produces split brain (two Masters). ES addresses split brain by requiring a quorum of more than half of the votes.

  2. Paxos algorithm

    Paxos is very powerful. Its flexibility regarding when and how elections are held gives it great advantages over the simple Bully algorithm, because in real life there are more failure modes than network connection failures. But Paxos is very complicated to implement.
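
To make the Bully discussion in item 1 more concrete, here is a minimal sketch that contrasts the plain rule (the highest live ID always leads) with a "sticky" variant that keeps the current Master as long as it is alive, which is the idea ES is described as using. The class and method names are hypothetical and are not Elasticsearch code; node IDs are plain integers for simplicity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Elasticsearch code: node IDs are plain integers.
final class BullySketch {

    // Plain Bully rule: the leader is always the highest ID among the live nodes.
    static int plainBully(List<Integer> liveNodes) {
        return Collections.max(liveNodes);
    }

    // Sticky variant: keep the current master while it is alive,
    // and fall back to the Bully rule only when it has failed.
    static int stickyBully(List<Integer> liveNodes, int currentMaster) {
        if (liveNodes.contains(currentMaster)) {
            return currentMaster;
        }
        return plainBully(liveNodes);
    }

    public static void main(String[] args) {
        // Node 3 fails: both rules elect node 2.
        List<Integer> afterFailure = Arrays.asList(1, 2);
        int master = stickyBully(afterFailure, 3);
        System.out.println("after failure: " + master);

        // Node 3 recovers: plain Bully switches back to 3, the sticky variant keeps 2.
        List<Integer> afterRecovery = Arrays.asList(1, 2, 3);
        System.out.println("plain:  " + plainBully(afterRecovery));
        System.out.println("sticky: " + stickyBully(afterRecovery, master));
    }
}
```

The sticky variant avoids the flapping problem, but on its own it does nothing to stop two partitions from each keeping a Master, which is why the quorum requirement is still needed.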

4. Related configuration

The important configuration items related to the master election process are listed below; this is not the complete set.

  1. discovery.zen.minimum_master_nodes : The minimum number of master-eligible nodes. This is an extremely important parameter for preventing split brain and data loss. Its actual effect goes well beyond what its name suggests: in addition to defining the "majority" during master election, it is used in many other important checks, including at least the following:

    • Triggering the election : Before the master election process starts, the number of nodes participating in the election must reach the quorum.
    • Confirming the Master : After a temporary Master is elected, it must see that the number of nodes that have joined it reaches the quorum before the election is confirmed as successful.
    • Gateway election of meta-information : A request is sent to the master-eligible nodes to obtain metadata. The number of responses must reach the quorum, that is, the number of nodes participating in the meta-information election.
    • Master publishing the cluster state : The number of nodes to which the state is successfully published must be a majority.

    To avoid split brain, its value should be more than half of the master-eligible nodes (a quorum):

    (master_eligible_nodes / 2) + 1

    For example, with 3 master-eligible nodes this value should be at least (3/2) + 1 = 2, where the division is integer division.

    This parameter can be set dynamically:

    PUT /_cluster/settings
    {
        "persistent": {
            "discovery.zen.minimum_master_nodes": 2
        }
    }
    
  2. discovery.zen.ping.unicast.hosts : The list of seed nodes for the cluster. When forming a cluster, a node tries to connect to the hosts in this list and learns from them which hosts make up the whole cluster. The list can contain some or all of the cluster nodes. It can be specified as follows:

    discovery.zen.ping.unicast.hosts: 192.168.137.14:9300,192.168.137.6:9300,192.168.137.122:9300
    

    The default port is 9300. To use a different port, specify it after the IP address. You can also configure a domain name that resolves to multiple IP addresses; ES will try to connect to every address it resolves to.

  3. discovery.zen.ping.unicast.hosts.resolve_timeout : DNS resolution timeout, the default is 5 seconds.

  4. discovery.zen.join_timeout : The timeout for a node joining an existing cluster. The default is 20 times ping_timeout.

  5. discovery.zen.join_retry_attempts : The number of retries after a join times out; the default is 3.

  6. discovery.zen.join_retry_delay : The delay before retrying after join_timeout expires; the default is 100 milliseconds.

  7. discovery.zen.master_election.ignore_non_master_pings : When set to true, the master election phase ignores pings from nodes that are not master-eligible (node.master: false). The default is false.

  8. discovery.zen.fd.ping_interval : The interval between fault detection pings; the default is 1 second.

  9. discovery.zen.fd.ping_timeout : The timeout of a fault detection request; the default is 30 seconds.

  10. discovery.zen.fd.ping_retries : The number of retries after a fault detection request fails or times out; the default is 3.

5. Process Overview

The master election process of ZenDiscovery is as follows:

  1. Each node computes the smallest node ID it knows of; that node is the temporary Master. The node then sends its leader vote to that node.
  2. If a node receives enough votes and has also voted for itself, it takes on the role of leader and begins publishing the cluster state.

All nodes take part in the election and vote. However, only the votes of master-eligible nodes ( node.master: true ) are counted.

The number of votes needed to win the election is the so-called quorum. In ES, the quorum size is a configurable parameter. The configuration item is:

discovery.zen.minimum_master_nodes

To avoid split brain, its minimum value should be n/2 + 1, where n is the number of master-eligible nodes.
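
As a small worked example of the n/2 + 1 rule, the sketch below computes the quorum with integer division and checks whether a given minimum_master_nodes value would still allow two disjoint groups to each elect a Master. The class and method names are hypothetical helpers for illustration, not part of Elasticsearch.

```java
// Hypothetical helper for illustration; not part of Elasticsearch.
final class QuorumSketch {

    // Majority of the master-eligible nodes, using integer division: n/2 + 1.
    static int quorum(int masterEligibleNodes) {
        return masterEligibleNodes / 2 + 1;
    }

    // True if two disjoint groups could both satisfy the configured minimum,
    // which is the precondition for split brain.
    static boolean splitBrainPossible(int masterEligibleNodes, int minimumMasterNodes) {
        return 2 * minimumMasterNodes <= masterEligibleNodes;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " master-eligible nodes -> quorum " + quorum(n)
                    + ", split brain possible: " + splitBrainPossible(n, quorum(n)));
        }
    }
}
```

Note that an even number of master-eligible nodes raises the quorum without tolerating more failures: 4 nodes need 3 votes, just as 5 nodes do.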

6. Process analysis

The overall process can be summarized as follows: elect a temporary Master; if the local node is elected, wait for it to be confirmed as Master; if another node is elected, try to join the cluster; then start the node failure detector. The details are shown in the figure below:

[Figure: flowchart of the overall master election process]

This process runs on the generic thread pool.

Below we analyze the implementation of each step in detail.

6.1 Election of Temporary Master

The election process is implemented in ZenDiscovery#findMaster . This function finds the active Master in the current cluster, or elects a new Master from the candidates. If the election succeeds, it returns the elected Master; otherwise it returns null.

Why is it a temporary Master? Because a further step is required: the node is confirmed as the real Master only once it has received enough votes.

The election process of the temporary Master is as follows:

  1. Ping all nodes to obtain the node list fullPingResponses . The ping results do not include the local node, so the local node is added to fullPingResponses separately.

  2. Build two lists.

    activeMasters list : Stores the currently active Masters in the cluster. Traverse all the nodes obtained in the first step and add the node that each one considers the current Master to the activeMasters list (excluding the local node). During the traversal, if discovery.zen.master_election.ignore_non_master_pings is set to true (the default is false) and a node is not master-eligible, that node is skipped.

    The specific process is shown in the figure below:
    [Figure: flowchart of building the activeMasters list]

    This process adds the Masters that currently exist in the cluster to the activeMasters list. Normally there is only one. If a Master already exists in the cluster, every node records which node is the current Master. In abnormal situations, different nodes may consider different nodes to be the current Master. While building the activeMasters list, if a node is not master-eligible, the ignore_non_master_pings option can be used to ignore the Master it reports.

    masterCandidates list : Stores the master candidates. Traverse the list obtained in the first step, drop the nodes that are not master-eligible, and add the rest to this list.

  3. If activeMasters is empty, elect from masterCandidates; the election may succeed or fail. If it is not empty, choose the most suitable node from activeMasters as the Master.

    The overall process is shown in the following figure:
    [Figure: flowchart of choosing the Master from activeMasters or masterCandidates]

    Electing the Master from masterCandidates

    The details of choosing the Master, including the election from masterCandidates, are encapsulated in the ElectMasterService class, for example determining whether there are enough candidates and choosing a specific node as the Master.

    To elect a Master from masterCandidates , we first check whether the current number of candidates reaches the quorum; if not, the election fails.

    public boolean hasEnoughCandidates(Collection<MasterCandidate> candidates) {
        // No candidates: the election fails
        if (candidates.isEmpty()) {
            return false;
        }
        // The default value is -1, which lets a single-node cluster elect a master normally
        if (minimumMasterNodes < 1) {
            return true;
        }
        return candidates.size() >= minimumMasterNodes;
    }
    

    When the number of candidates reaches a quorum, select one of the candidates to become the Master:

    public MasterCandidate electMaster(Collection<MasterCandidate> candidates) {
        List<MasterCandidate> sortedCandidates = new ArrayList<>(candidates);
        // Sort the candidates in ascending order using the custom comparison function
        sortedCandidates.sort(MasterCandidate::compare);
        // Return the smallest one as the Master
        return sortedCandidates.get(0);
    }
    

    It can be seen that the nodes are sorted and the smallest one is chosen as the Master. The sort, however, uses the custom comparison function MasterCandidate::compare . Early versions sorted by node ID only; now the node with the higher cluster state version is placed first.

    With the default comparison function, the result is sorted in ascending order. For reference, the comparison function of the Long type is implemented as follows:

    public static int compare(long x, long y) {
        return (x < y) ? -1 : ((x == y) ? 0 : 1);
    }
    

    Implementation of the custom comparison function:

    public static int compare(MasterCandidate c1, MasterCandidate c2) {
        // Compare cluster state versions first; note that c2 comes before c1 here,
        // so the higher version sorts first
        int ret = Long.compare(c2.clusterStateVersion, c1.clusterStateVersion);
        // If the versions are equal, compare the node IDs
        if (ret == 0) {
            ret = compareNodes(c1.getNode(), c2.getNode());
        }
        return ret;
    }
    

    Implementation of the node comparison function compareNodes : For sorting purposes, if one of the two nodes passed in is master-eligible and the other is not, the master-eligible node is ranked first. If neither or both are master-eligible, the node IDs are compared.

    The nodes in the masterCandidates list, however, are all master-eligible. The two if checks in compareNodes exist because other callers may pass node lists that contain nodes that are not master-eligible. Here, therefore, only the node IDs end up being compared.

    private static int compareNodes(DiscoveryNode o1, DiscoveryNode o2) {
        // The two ifs handle the case where one of the two nodes is master-eligible and the other is not
        if (o1.isMasterNode() && !o2.isMasterNode()) {
            return -1;
        }
        if (!o1.isMasterNode() && o2.isMasterNode()) {
            return 1;
        }
        // Otherwise order by node ID
        return o1.getId().compareTo(o2.getId());
    }
    

    Electing from the activeMasters list

    The list stores the cluster's currently active Masters, and one of these known Masters is chosen as the election result. The selection is very simple: take the minimum of the list, again using compareNodes as the comparison function. In theory, all nodes in the activeMasters list are master-eligible.

    public DiscoveryNode tieBreakActiveMasters(Collection<DiscoveryNode> activeMasters) {
        return activeMasters.stream().min(ElectMasterService::compareNodes).get();
    }
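
Putting the steps of this section together, the following is a simplified, self-contained sketch of the temporary-Master election: build activeMasters and masterCandidates from the ping responses, prefer an existing active Master, and otherwise sort the candidates by cluster state version (higher first) and then by node ID. The Node and PingResponse classes and the method names are hypothetical stand-ins, not the real ZenDiscovery types, and the tie-break among active Masters is reduced to a plain ID comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch of the temporary-Master election. All types are hypothetical stand-ins.
final class FindMasterSketch {

    static final class Node {
        final String id;
        final boolean masterEligible;

        Node(String id, boolean masterEligible) {
            this.id = id;
            this.masterEligible = masterEligible;
        }
    }

    // One ping result: the responding node, the Master it currently follows (or null),
    // and the cluster state version it holds.
    static final class PingResponse {
        final Node node;
        final Node master;
        final long clusterStateVersion;

        PingResponse(Node node, Node master, long clusterStateVersion) {
            this.node = node;
            this.master = master;
            this.clusterStateVersion = clusterStateVersion;
        }
    }

    // Higher cluster state version sorts first; ties are broken by the smaller node ID.
    static int compareCandidates(PingResponse a, PingResponse b) {
        int ret = Long.compare(b.clusterStateVersion, a.clusterStateVersion);
        return ret != 0 ? ret : a.node.id.compareTo(b.node.id);
    }

    // Returns the temporary Master, or null if the election fails.
    static Node findMaster(List<PingResponse> fullPingResponses, Node localNode, int minimumMasterNodes) {
        List<Node> activeMasters = new ArrayList<>();
        List<PingResponse> masterCandidates = new ArrayList<>();
        for (PingResponse response : fullPingResponses) {
            // Collect the Master each node reports, excluding the local node.
            if (response.master != null && !response.master.id.equals(localNode.id)) {
                activeMasters.add(response.master);
            }
            // Only master-eligible nodes can be candidates.
            if (response.node.masterEligible) {
                masterCandidates.add(response);
            }
        }
        if (!activeMasters.isEmpty()) {
            // Simplification: tie-break the active Masters by node ID only.
            Node best = activeMasters.get(0);
            for (Node n : activeMasters) {
                if (n.id.compareTo(best.id) < 0) {
                    best = n;
                }
            }
            return best;
        }
        boolean enough = !masterCandidates.isEmpty()
                && (minimumMasterNodes < 1 || masterCandidates.size() >= minimumMasterNodes);
        if (!enough) {
            return null;
        }
        masterCandidates.sort(FindMasterSketch::compareCandidates);
        return masterCandidates.get(0).node;
    }

    public static void main(String[] args) {
        Node n1 = new Node("n1", true);
        Node n2 = new Node("n2", true);
        Node n3 = new Node("n3", true);
        List<PingResponse> responses = Arrays.asList(
                new PingResponse(n1, null, 5),
                new PingResponse(n2, null, 7),
                new PingResponse(n3, null, 7));
        // n2 and n3 hold the newest cluster state; n2 has the smaller ID.
        System.out.println(findMaster(responses, n1, 2).id);
    }
}
```

Sorting by cluster state version before node ID means that a node holding stale cluster state is not preferred merely because its ID happens to be small.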
    

6.2 Implementation of voting and vote counting

In ES, casting a vote means sending a join request ( JoinRequest ). The number of votes a node has received is the number of join requests sent to it.

Vote collection and counting are implemented in the ZenDiscovery#handleJoinRequest method. When a node checks whether it has received enough votes, it checks whether enough nodes have joined it; the votes of nodes that are not master-eligible are excluded.

public synchronized int getPendingMasterJoinsCount() {
    int pendingMasterJoins = 0;
    // Iterate over the join requests received so far
    for (DiscoveryNode node : joinRequestAccumulator.keySet()) {
        // Skip nodes that are not master-eligible
        if (node.isMasterNode()) {
            pendingMasterJoins++;
        }
    }
    return pendingMasterJoins;
}
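
The counting rule above can be tried out in isolation with the sketch below. It treats each join request as one vote, ignores duplicates and nodes that are not master-eligible, and then compares the count with the quorum. The class, its Node type, and hasEnoughVotes are hypothetical names. The sketch also assumes that the local node counts as one vote for itself, so it waits for minimum_master_nodes - 1 joins; treat that detail as an assumption of the example rather than something stated in this article.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of vote counting; names are illustrative, not the Elasticsearch API.
final class JoinVotesSketch {

    static final class Node {
        final String id;
        final boolean masterEligible;

        Node(String id, boolean masterEligible) {
            this.id = id;
            this.masterEligible = masterEligible;
        }
    }

    private final Set<String> seenIds = new HashSet<>();
    private int pendingMasterJoins = 0;

    // A join request is a vote; repeated requests from the same node count once,
    // and nodes that are not master-eligible do not count at all.
    synchronized void handleJoinRequest(Node node) {
        if (seenIds.add(node.id) && node.masterEligible) {
            pendingMasterJoins++;
        }
    }

    synchronized int getPendingMasterJoinsCount() {
        return pendingMasterJoins;
    }

    // Assumption of this sketch: the local node counts as one vote for itself,
    // so it needs minimumMasterNodes - 1 joins from other master-eligible nodes.
    synchronized boolean hasEnoughVotes(int minimumMasterNodes) {
        return pendingMasterJoins >= Math.max(0, minimumMasterNodes - 1);
    }
}
```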

6.3 Establish Master or join the cluster

There are two cases for the elected temporary Master: it is either the local node or another node, and the two cases are handled separately. At this point the node is ready to cast its vote for the temporary Master.

  1. If the temporary Master is the local node:
    1. Wait for enough master-eligible nodes to join the local node (that is, for the votes to reach the quorum) to complete the election.
    2. If the required number of join requests has not arrived when the timeout expires (30 seconds by default, configurable), the election fails and a new round is required.
    3. On success, publish a new clusterState.
  2. If another node is elected as Master:
    1. Stop accepting join requests from other nodes.
    2. Send a join request to the Master and wait for the reply. The default timeout is 1 minute (configurable). If an exception occurs, retry 3 times by default (configurable). This step is implemented in the joinElectedMaster method.
    3. The finally elected Master publishes the cluster state first and only then acknowledges the join request. Therefore, when joinElectedMaster returns, the join request has been acknowledged and the cluster state has been received. At this step, if the master node in the received cluster state is empty, or the elected Master is not the node chosen earlier, the election is run again.
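
The first case (the local node is the temporary Master and waits for joins until a timeout) can be illustrated with a small sketch built on java.util.concurrent.CountDownLatch. This is only one possible way to express "wait for N joins or give up after a timeout"; the class and method names are hypothetical, and it is not a description of how ZenDiscovery implements the wait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: wait for a required number of joins, or fail after a timeout.
final class WaitForJoinsSketch {

    private final CountDownLatch remainingJoins;

    WaitForJoinsSketch(int requiredJoins) {
        this.remainingJoins = new CountDownLatch(requiredJoins);
    }

    // Called whenever a join request from a master-eligible node arrives.
    void onMasterEligibleJoin() {
        remainingJoins.countDown();
    }

    // Blocks until enough joins have arrived or the timeout expires.
    // Returns true if the election succeeded, false if a new round is needed.
    boolean awaitElection(long timeout, TimeUnit unit) {
        try {
            return remainingJoins.await(timeout, unit);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and treat the election as failed.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller would create the object with the number of joins it still needs, pass every incoming join to onMasterEligibleJoin, and start a new election round whenever awaitElection returns false.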

7. Node failure check

So far, the master selection process has been executed, the Master identity has been determined, and the non-Master nodes have joined the cluster.

The node failure check monitors whether nodes go offline and handles the resulting exceptions. Failure detection is an indispensable step after the master election process; skipping it may lead to split brain (two or more Masters). Two failure detectors need to be started:

  • On the Master node, start NodesFaultDetection (NodesFD for short), which periodically checks whether the nodes that have joined the cluster are alive.
  • On each non-Master node, start MasterFaultDetection (MasterFD for short), which periodically checks whether the Master node is alive.

Both NodesFaultDetection and MasterFaultDetection send ping requests periodically (every 1 second by default) to check whether the target node is healthy. When the number of failures reaches a threshold (3 by default), or when the underlying connection module reports that the node has gone offline, the node-left event is processed.
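
The retry rule described above can be sketched as a small counter: every failed ping increments it, and the node is reported as failed once the counter reaches the configured number of retries. The class below is a hypothetical illustration rather than the real NodesFaultDetection or MasterFaultDetection code, and it assumes that a successful ping resets the counter.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the retry rule of a fault detector; not the Elasticsearch implementation.
final class FaultDetectorSketch {

    private final int pingRetries;
    private int consecutiveFailures = 0;

    FaultDetectorSketch(int pingRetries) {
        this.pingRetries = pingRetries;
    }

    // Runs one ping round. Returns true when the node should be treated as failed.
    boolean pingOnce(BooleanSupplier ping) {
        if (ping.getAsBoolean()) {
            // Assumption of this sketch: a successful ping resets the counter.
            consecutiveFailures = 0;
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= pingRetries;
    }
}
```

In a real detector, pingOnce would be scheduled once per ping interval, and each ping would carry its own timeout.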

7.1 NodesFaultDetection event processing

Check whether the total number of nodes in the cluster still reaches the quorum (more than half). If not, the node gives up its Master role and rejoins the cluster. Why is this done? Consider the following scenario, shown in the figure below.
[Figure: a five-node cluster split by a network partition into groups of 2 and 3 nodes]
Suppose a cluster of 5 machines suffers a network partition: 2 machines form one group and the other 3 form another. Before the partition, the Master was Node1. The group of 3 nodes, having lost contact with Node1, now holds a new election and successfully elects Node3 as Master. Will there be two Masters?

NodesFaultDetection exists to avoid two Masters in the scenario above. The corresponding event handling is implemented mainly as follows:

ZenDiscovery#handleNodeFailure executes NodeRemovalClusterStateTaskExecutor#execute .

public ClusterTasksResult<Task> execute(final ClusterState currentState, final List<Task> tasks) throws Exception {
    // (excerpt: the code that builds remainingNodesClusterState and resultBuilder is omitted)
    // Check whether the remaining nodes still reach the quorum
    if (electMasterService.hasEnoughMasterNodes(remainingNodesClusterState.nodes()) == false) {
        final int masterNodes = electMasterService.countMasterNodes(remainingNodesClusterState.nodes());
        rejoin.accept(
            LoggerMessageFormat.format("not enough master nodes (has [{}], but needed [{}])",
                masterNodes, electMasterService.minimumMasterNodes()));
        return resultBuilder.build(currentState);
    } else {
        return resultBuilder.build(
            allocationService.deassociateDeadNodes(remainingNodesClusterState, true, describeTasks(tasks)));
    }
}

When the Master handles a node-offline event and the number of remaining nodes is below the quorum, it gives up its Master role to avoid two Masters.
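
The five-node partition scenario can be checked numerically with the short sketch below. It assumes that all five nodes are master-eligible and that minimum_master_nodes is set to the quorum 5/2 + 1 = 3; the class and method names are hypothetical.

```java
// Hypothetical sketch of the five-node partition scenario; assumes all five nodes are master-eligible.
final class PartitionSketch {

    // A group can have (or keep) a Master only if it contains a quorum of master-eligible nodes.
    static boolean hasEnoughMasterNodes(int masterEligibleInGroup, int minimumMasterNodes) {
        return masterEligibleInGroup > 0
                && (minimumMasterNodes < 1 || masterEligibleInGroup >= minimumMasterNodes);
    }

    public static void main(String[] args) {
        int minimumMasterNodes = 5 / 2 + 1; // quorum for 5 master-eligible nodes
        System.out.println("group of 2 keeps a Master: " + hasEnoughMasterNodes(2, minimumMasterNodes));
        System.out.println("group of 3 keeps a Master: " + hasEnoughMasterNodes(3, minimumMasterNodes));
    }
}
```

Under these assumptions only the group of 3 passes the check, so the old Master in the group of 2 steps down and the cluster ends up with a single Master.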

7.2 MasterFaultDetection event handling

Handling a detected Master failure is simple: rejoin the cluster. In essence, the node runs the master election process again.

The corresponding event handling is implemented mainly as follows:

ZenDiscovery#handleMasterGone

private void handleMasterGone(final DiscoveryNode masterNode, final Throwable cause, final String reason) {
    synchronized (stateMutex) {
        if (localNodeMaster() == false && masterNode.equals(committedState.get().nodes().getMasterNode())) {
            // Fail and clear any pending cluster states from the old master
            pendingStatesQueue.failAllStatesAndClear(new ElasticsearchException("master left [{}]", reason));
            // Rejoin the cluster
            rejoin("master left (reason = " + reason + ")");
        }
    }
}

8. Summary

The master election process runs when the cluster starts up, taking it from a state with no Master to one with a newly elected Master. It also runs during normal operation, when the Master detects that a node has left or a non-Master node detects that the Master has left.

Origin blog.csdn.net/dwjf321/article/details/104701340