[Elasticsearch] Cluster discovery mechanism, sharding & replica mechanism, load mechanism, fault tolerance mechanism, expansion mechanism, shard routing principle

Cluster Discovery Mechanism

  Elasticsearch adopts the master-slave mode. ES will select a node in the cluster to become the master node. Only the Master node is eligible to maintain the global cluster state. When a node joins or exits the cluster, it will redistribute the fragments and The latest status of the cluster is sent to other nodes in the cluster, and the master node will periodically ping to verify whether other nodes are alive.

Before 7.x, Elasticsearch's master election algorithm was based on the Bully algorithm; from 7.x onward, ES adopted a new master election algorithm based on Raft.

Election timing

  • Cluster initialization

  • When the Master of the cluster crashes

  • When any node finds that the Master node in the current cluster is not approved by n/2 + 1 nodes, it triggers an election;

Fundamentals of Election

ES elects the master from among all master-eligible nodes in the current cluster. To avoid split brain, ES adopts the quorum (majority) idea common in distributed systems: only a node that obtains more than half of the votes can become master. The quorum is configured in ES with the discovery.zen.minimum_master_nodes setting, which is generally set to eligibleNodesNum / 2 + 1.
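
As a rough illustration only (not Elasticsearch source code), here is a minimal sketch of the quorum rule, using a hypothetical helper class:

// Minimal sketch of the quorum ("more than half") rule used to avoid split brain.
// Illustrative helper only, not Elasticsearch source code.
public final class QuorumMath {

    // discovery.zen.minimum_master_nodes is generally set to eligibleNodes / 2 + 1
    static int minimumMasterNodes(int eligibleNodes) {
        return eligibleNodes / 2 + 1;
    }

    // A candidate may become master only once it has collected a majority of votes.
    static boolean hasQuorum(int votesReceived, int eligibleNodes) {
        return votesReceived >= minimumMasterNodes(eligibleNodes);
    }

    public static void main(String[] args) {
        System.out.println(minimumMasterNodes(3)); // 2
        System.out.println(hasQuorum(1, 3));       // false -> cannot become master
        System.out.println(hasQuorum(2, 3));       // true  -> majority reached
    }
}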

The election process is described as follows

  1. A node sends an election message to every node with an id larger than its own (the message type is Election).
  2. If the node receives no reply (a reply is an Alive message), it becomes the master and announces itself as master to all other nodes (a Victory message).
  3. If the node receives any reply, it cannot be the master; it then waits for a Victory message, and re-initiates the election if the wait for Victory times out.

Bully Algorithm

      One of the basic algorithms for Leader election.

In the bully algorithm, each node has a number, and only the surviving node with the largest number can become the master node.

Discovery module: responsible for discovering the nodes in the cluster and selecting the master node. ES supports a variety of different Discovery types, and the built-in implementation is called Zen Discovery.

Zen Discovery encapsulates the implementation process of node discovery (Ping) and master election.

The algorithm assumes that every node has a unique id that ranks the nodes, and the current Leader at any time is the participating node with the highest id. The advantage of this algorithm is that it is easy to implement, but problems arise when the node with the largest id is unstable. For example: the master is overloaded and appears dead; the node with the second-largest id is elected as the new master; the original master then recovers and is elected master again, only to hang once more...
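
As a rough illustration of the core Bully rule (the surviving node with the highest id wins), here is a minimal sketch; it is not the ES implementation, and the Node type below is a stand-in:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative sketch of the core Bully rule: among the surviving nodes,
// the one with the highest id becomes the Leader. Not Elasticsearch source code.
final class BullySketch {
    record Node(String id, boolean alive) {}

    static Optional<Node> electLeader(List<Node> nodes) {
        return nodes.stream()
                .filter(Node::alive)
                .max(Comparator.comparing(Node::id)); // the highest surviving id wins
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("node-1", true),
                new Node("node-2", true),
                new Node("node-3", false)); // node-3 is down
        System.out.println(electLeader(cluster).map(Node::id).orElse("none")); // node-2
    }
}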

Elasticsearch avoids this problem by postponing the election until the current Master actually fails, but this approach is prone to split brain; split brain is then solved by requiring a quorum of more than half of the votes.

In ES, casting a vote means sending a request to join the cluster. Votes are counted in handleJoinRequest, and the received join requests are stored in pendingJoinRequests. checkPendingJoinsAndElectIfNeeded then checks whether there are enough votes, filtering out votes from nodes that are not master-eligible.
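
A much simplified sketch of that idea follows: join requests are treated as votes, stored until there are enough of them, and votes from non-master-eligible nodes are ignored. The method names mirror the ES source mentioned above, but the body is illustrative only:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: join requests are treated as votes, stored until there are
// enough of them, and votes from nodes without master eligibility are filtered out.
final class JoinVoteSketch {
    record JoinRequest(String nodeId, boolean masterEligible) {}

    private final List<JoinRequest> pendingJoinRequests = new ArrayList<>();
    private final int requiredJoins; // e.g. minimum_master_nodes - 1, since the local node counts itself

    JoinVoteSketch(int requiredJoins) {
        this.requiredJoins = requiredJoins;
    }

    void handleJoinRequest(JoinRequest request) {
        pendingJoinRequests.add(request);   // store the received "vote"
        checkPendingJoinsAndElectIfNeeded();
    }

    private void checkPendingJoinsAndElectIfNeeded() {
        long votes = pendingJoinRequests.stream()
                .filter(JoinRequest::masterEligible) // ignore votes from non-eligible nodes
                .count();
        if (votes >= requiredJoins) {
            System.out.println("enough joins received, the local node becomes master");
        }
    }
}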

Code implementation logic:

1. Filter activeMasters list

Ping all nodes and get PingResponse;

  1. Filter out the nodes that are eligible to become Master;
  2. Build three lists from the ping results.

    Among them, joinedOnceActiveNodes.size <= activeNodes.size; the only difference is whether the list contains the local node, the rest of the contents are identical and all come from the ping results.

The master is elected from either the activeMasters list or the masterCandidates list, so before the election ES first needs to build these two lists. An Elasticsearch node first sends Ping requests to all members of the cluster, waits for discovery.zen.ping_timeout by default, and then filters the responses it received to obtain the activeMasters list: the nodes that the other members currently consider to be the cluster's master.
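
A simplified sketch of how the activeMasters list might be derived from the ping responses; the record types below are illustrative stand-ins, not the real ZenPing/PingResponse API:

import java.util.List;
import java.util.Objects;

// Illustrative sketch only: deriving the activeMasters list from ping responses.
// The record types below are simplified stand-ins for the real ZenPing structures.
final class ActiveMastersSketch {
    record Node(String id, boolean masterEligible) {}
    // Each response says who answered and which node it currently believes is master (may be null).
    record PingResponse(Node respondingNode, Node currentMaster) {}

    // activeMasters: the nodes that other members currently point at as master,
    // excluding the local node's own opinion of itself.
    static List<Node> activeMasters(List<PingResponse> responses, Node localNode) {
        return responses.stream()
                .map(PingResponse::currentMaster)
                .filter(Objects::nonNull)
                .filter(master -> !master.equals(localNode))
                .distinct()
                .toList();
    }
}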

 2. Filter the list of masterCandidates

The masterCandidates list contains the nodes in the current cluster that are eligible to become Master. Any Elasticsearch node can set the node.master and node.data properties; if a node is configured as follows in elasticsearch.yml, it is not eligible to become a Master node and will not be included in the masterCandidates list:

  node.master: false    # this node is not eligible to become master
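
A minimal illustrative filter for the masterCandidates list (only master-eligible nodes pass); the Node type below is a stand-in, not ES source:

import java.util.List;

// Illustrative sketch only: masterCandidates keeps just the master-eligible nodes.
// A node started with node.master: false never enters this list.
final class MasterCandidatesSketch {
    record Node(String id, boolean masterEligible) {}

    static List<Node> masterCandidates(List<Node> pingedNodes) {
        return pingedNodes.stream()
                .filter(Node::masterEligible)
                .toList();
    }

    public static void main(String[] args) {
        List<Node> pinged = List.of(
                new Node("node-1", true),
                new Node("node-2", false), // node.master: false -> filtered out
                new Node("node-3", true));
        System.out.println(masterCandidates(pinged)); // only node-1 and node-3 remain
    }
}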

3. Elect the Master node from the activeMasters list

The activeMasters list contains the nodes that the other members currently consider to be the cluster's master. If this list is not empty, Elasticsearch elects from it first (the blue box in the flow chart). The election algorithm is the Bully algorithm described earlier, which involves a priority comparison: when comparing priorities within the activeMasters list, a node that is master-eligible has higher priority, and if several nodes in the list are master-eligible, the one with the smallest id is chosen.

The code is as follows:

// Master-eligible nodes sort before non-eligible ones; ties are broken by node id,
// so the eligible node with the smallest id wins the comparison.
private static int compareNodes(DiscoveryNode o1, DiscoveryNode o2) {
    if (o1.isMasterNode() && !o2.isMasterNode()) {
        return -1;
    }
    if (!o1.isMasterNode() && o2.isMasterNode()) {
        return 1;
    }
    return o1.getId().compareTo(o2.getId());
}

// Pick the "smallest" node from the activeMasters list according to compareNodes.
public DiscoveryNode tieBreakActiveMasters(Collection<DiscoveryNode> activeMasters) {
    return activeMasters.stream().min(ElectMasterService::compareNodes).get();
}

4. Elect the Master node from the list of masterCandidates

This step corresponds to the red part of the flow chart. If the activeMasters list is empty, the election takes place among the masterCandidates. This election also involves a priority comparison, but it differs from the one used for the activeMasters list: it first checks whether the number of members in masterCandidates has reached the minimum discovery.zen.minimum_master_nodes. If it has, priorities are compared by first comparing the cluster state version held by each node and then comparing node ids. The purpose of this process is to make the node with the most recent cluster state become the master.

// Candidates with a newer (higher) cluster state version sort first; ties fall back
// to the node comparison above, so the node with the freshest cluster state wins.
public static int compare(MasterCandidate c1, MasterCandidate c2) {
    int ret = Long.compare(c2.clusterStateVersion, c1.clusterStateVersion);
    if (ret == 0) {
        ret = compareNodes(c1.getNode(), c2.getNode());
    }
    return ret;
}

5. The local node is the master

The election above produces a quasi-master, which then waits for votes from other nodes. If discovery.zen.minimum_master_nodes - 1 other nodes vote for the current node as master, the election succeeds; the quasi-master waits up to discovery.zen.master_election.wait_for_joins_timeout and fails if this timeout is exceeded. In the code, the quasi-master registers a callback and relies on concurrency constructs such as AtomicReference and CountDownLatch.

if (clusterService.localNode().equals(masterNode)) {
    // the local node is the quasi-master: wait for minimum_master_nodes - 1 join votes
    final int requiredJoins = Math.max(0, electMaster.minimumMasterNodes() - 1);
    nodeJoinController.waitToBeElectedAsMaster(requiredJoins, masterElectionWaitForJoinsTimeout,
            new NodeJoinController.ElectionCallback() {
                @Override
                public void onElectedAsMaster(ClusterState state) {
                    joinThreadControl.markThreadAsDone(currentThread);
                    nodesFD.updateNodesAndPing(state); // start the nodes fault detection
                }
                @Override
                public void onFailure(Throwable t) {
                    logger.trace("failed while waiting for nodes to join, rejoining", t);
                    joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                }
            }
    );
}
When the local node is the Master, it enables fault detection (the NodesFaultDetection mechanism): it periodically scans all members of the cluster, removes unresponsive members from the cluster, and publishes the latest cluster state to the cluster. After receiving the latest cluster state, cluster members make the corresponding adjustments, such as re-selecting primary shards and replicating data.
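
A highly simplified sketch of that idea (the master periodically pings every member and drops members that stop responding); the real implementation is far more involved, and everything below is illustrative:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Highly simplified sketch: the elected master periodically pings every member,
// drops members that stop responding, then publishes the new cluster state.
// Illustrative only, not the Elasticsearch implementation.
final class NodesFaultDetectionSketch {
    private final Set<String> members = ConcurrentHashMap.newKeySet();
    private final Predicate<String> pingSucceeds; // stand-in for a real transport-level ping
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    NodesFaultDetectionSketch(Set<String> initialMembers, Predicate<String> pingSucceeds) {
        this.members.addAll(initialMembers);
        this.pingSucceeds = pingSucceeds;
    }

    void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            for (String member : members) {
                if (!pingSucceeds.test(member)) {
                    members.remove(member); // remove the unresponsive member...
                    publishClusterState();  // ...and publish the updated cluster state
                }
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    private void publishClusterState() {
        System.out.println("publishing cluster state with members: " + members);
    }
}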

6. The local node is not the master

If the current node determines that it cannot become the master under the current cluster state, it first refuses join requests from other nodes and then votes for the quasi-master. At the same time, it monitors the cluster state published by the master (the MasterFaultDetection mechanism); if the master shown in the published cluster state is not the node that the current node considers to be master, the current node re-initiates an election.

Non-master nodes also run fault detection against the Master node. If a member node finds that it cannot reach the master, it rejoins via a new Master node; if it finds that many nodes in the current cluster cannot reach the master, an election is re-initiated.

Raft Algorithm Election Process

As a distributed consensus protocol, Raft describes not only the election process but also log synchronization and safety-related behaviors.

Its design principles are as follows:

  • easy to understand
  • Reduce the number of states and eliminate uncertainty as much as possible

In Raft, there are three possible states of a node, and the transition relationship is as follows:

The three states are Leader, Follower, and Candidate. Under normal circumstances there is only one Leader in the cluster and all other nodes are Followers; Followers receive requests passively and never initiate requests on their own. Candidate is an intermediate state on the way from Follower to Leader.

Raft introduces the concept of a term, and there is at most one Leader in each term. The term acts as a logical clock in the Raft algorithm and is carried in every message exchanged between servers. If a node finds that the term in a message is smaller than its own, it rejects the message; if the term is larger than its own, it updates its own term. If a Candidate or Leader finds that its term has expired, it immediately reverts to the Follower state.
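
A minimal sketch of the term rule just described (reject smaller terms, adopt larger ones and step down); illustrative only:

// Minimal sketch of the Raft term rule described above; illustrative only.
final class RaftTermSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    long currentTerm = 0;
    Role role = Role.FOLLOWER;

    // Returns false if the incoming message must be rejected because its term is stale.
    boolean onMessage(long messageTerm) {
        if (messageTerm < currentTerm) {
            return false;              // stale term: reject the message
        }
        if (messageTerm > currentTerm) {
            currentTerm = messageTerm; // adopt the newer term...
            role = Role.FOLLOWER;      // ...and step down if we were Candidate or Leader
        }
        return true;
    }
}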

The Raft election process is:

  • Increment the local currentTerm of the current node and switch to the Candidate state;
  • The current node votes for itself and sends RequestVote RPCs to the other nodes in parallel (asking them to vote for it);

It then waits for responses from the other nodes; there are three possible outcomes (a minimal sketch follows the list):

  • If it receives votes from a majority of servers, it becomes the Leader. After becoming Leader it sends heartbeat messages to the other nodes to assert its status and prevent new elections.
  • If it receives a vote request from another node whose term is larger than its own, the Candidate reverts to Follower;
  • If the election times out, it initiates another round of election;
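
Putting these steps together, here is a minimal single-round sketch of the election (illustrative only; real Raft is asynchronous, retries after a randomized timeout, and also steps down when it hears from a newly elected Leader):

import java.util.List;
import java.util.function.Predicate;

// Minimal single-round sketch of the Raft election steps listed above.
// Illustrative only, not a full Raft implementation.
final class RaftElectionSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    long currentTerm = 0;
    Role role = Role.FOLLOWER;

    // requestVote stands in for the RequestVote RPC: given a peer id, it returns
    // true if that peer grants its vote for currentTerm.
    Role runElection(List<String> peers, Predicate<String> requestVote) {
        currentTerm++;                 // 1. increment the local term
        role = Role.CANDIDATE;         //    and switch to Candidate
        int votes = 1;                 // 2. vote for ourselves
        for (String peer : peers) {    //    and ask every peer for a vote
            if (requestVote.test(peer)) {
                votes++;
            }
        }
        int clusterSize = peers.size() + 1;
        if (votes > clusterSize / 2) { // 3a. majority reached -> become Leader
            role = Role.LEADER;
        } else {
            role = Role.FOLLOWER;      // 3b/3c. otherwise fall back and retry later
        }
        return role;
    }
}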

How ES implements the main flow of the Raft election

In the ES implementation, a candidate does not vote for itself first; it directly sends RequestVote requests in parallel, which means a candidate may end up voting for another candidate. The advantage is that, to some extent, this avoids the situation where three nodes become candidates at the same time, all vote for themselves, and no leader can be elected.

ES does not restrict each node to a single vote per term; a node may cast multiple votes, which can result in more than one master being elected:

  • Node2 is selected as the master, and the votes received are: Node2, Node3;
  • Node3 is selected as the master, and the votes received are: Node3, Node1;

In this case, ES lets the last elected Leader win: if a sitting Leader receives a RequestVote request, it unconditionally exits the Leader state. In this example, Node2 is elected Leader first; when it then receives the RequestVote request from Node3, it exits the Leader state, switches to Candidate, and agrees to vote for the candidate that sent the RequestVote. As a result, Node3 is ultimately elected Leader.
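
A minimal sketch of this ES-specific rule (a sitting Leader that receives a RequestVote unconditionally steps down and grants the vote); illustrative only:

// Illustrative sketch of the ES-specific rule described above: a sitting Leader that
// receives a RequestVote unconditionally steps down and grants the vote, so the last
// successfully elected node ends up as the Leader. Not the ES implementation.
final class LastElectedWinsSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    Role role = Role.LEADER;

    boolean onRequestVote(String candidateId) {
        if (role == Role.LEADER) {
            role = Role.CANDIDATE; // unconditionally give up leadership...
        }
        return true;               // ...and grant the vote to the requesting candidate
    }
}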

Dynamically maintain the candidate node list

So far we have assumed that the number of cluster nodes stays the same; now consider how to handle cluster expansion, shrinking, and nodes going offline temporarily or permanently. In versions before 7.x, users had to manually configure minimum_master_nodes to tell the cluster explicitly what a majority of nodes is, and adjust it when the cluster expanded or shrank. Now the cluster can maintain this itself.

After the discovery.zen.minimum_master_nodes setting was removed, the current approach no longer records a concrete quorum value. Instead it records a node list, the VotingConfiguration, which holds all master-eligible nodes and is persisted in the cluster state. (This is not always every master-eligible node: for example, if the cluster originally has only one node and grows to two, the list stays unchanged, because with two voters the loss of either node would make it impossible to elect a master. If a third node is added, the list is updated to three nodes.)

As nodes join or leave the cluster, Elasticsearch automatically adjusts the VotingConfiguration to keep the cluster as resilient as possible. It is important to wait for this adjustment to complete before removing more nodes from the cluster, and you cannot stop half or more of the nodes at once (this matters most when shrinking the cluster by a large amount: shrink it part by part). By default, ES maintains the VotingConfiguration automatically; a newly joining node is easy to handle, but when a node leaves it may be a temporary restart or a permanent removal. You can also maintain the VotingConfiguration manually via the cluster.auto_shrink_voting_configuration setting; with manual maintenance, when some nodes go offline permanently you need to use the voting exclusions API to exclude them. Even with the default automatic maintenance you can still use the voting exclusions API to exclude nodes, for example when taking more than half of the nodes offline at once.

If the number of nodes in the VotingConfiguration would be even, ES excludes one of them so that the VotingConfiguration stays odd. With an even number, a network partition could split the cluster into two halves of equal size, and then neither sub-cluster would be able to reach a majority.
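
A small sketch of this "keep it odd" rule; in the illustration below which node gets excluded is arbitrary, whereas ES makes that choice itself:

import java.util.List;

// Illustrative sketch of the "keep the voting configuration odd" rule: with an even
// number of master-eligible nodes, one of them is left out so that a network partition
// cannot split the voters into two equal halves. Which node is dropped is arbitrary here.
final class VotingConfigurationSketch {
    static List<String> votingConfiguration(List<String> masterEligibleNodes) {
        if (masterEligibleNodes.size() > 1 && masterEligibleNodes.size() % 2 == 0) {
            return masterEligibleNodes.subList(0, masterEligibleNodes.size() - 1);
        }
        return masterEligibleNodes;
    }

    public static void main(String[] args) {
        System.out.println(votingConfiguration(List.of("n1", "n2", "n3")));       // [n1, n2, n3]
        System.out.println(votingConfiguration(List.of("n1", "n2", "n3", "n4"))); // [n1, n2, n3]
    }
}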

Sharding & Replica Mechanism

Sharding: An Elasticsearch cluster allows the system to store more data than a single machine can hold, which is achieved through shards. Within an index, data (documents) is distributed across multiple shards; that is, each shard holds part of the overall data.

A shard is an instance of Lucene, which is a complete search engine in itself. Documents are stored into shards, but applications interact directly with the index rather than with the shards.

Replica: To solve the problem that a single machine cannot handle all requests when access pressure is too high, the Elasticsearch cluster introduces the replica strategy, which creates redundant copies of each shard in the index.

Replicas serve two purposes:

1. Improve system fault tolerance

When the machine where the shard is located goes down, Elasticsearch can use its copy to recover, thereby avoiding data loss.

2. Improve ES query efficiency

When processing queries, ES will treat replica shards and primary shards fairly, and load balance query requests to replica shards and primary shards.

However, more replica shards are not always better, for the following two reasons:

(1) Multiple replicas can improve the throughput and performance of search operations, but adding more replica shards to a cluster with the same number of nodes does not improve performance, because each shard then gets a smaller share of each node's resources; at that point you need to add more hardware to improve throughput.

(2) More replica shards increase data redundancy and help guarantee data integrity. However, following the interaction between primary and replica shards described above, data synchronization between shards consumes a certain amount of network bandwidth and affects efficiency, so a larger number of shards and replicas for an index is not always better.

Sharding and Replication Mechanism

  1. An index contains multiple shards;
  2. Each shard is a minimal unit of work that holds part of the data; each shard is a Lucene instance with the complete ability to build indexes and handle requests;
  3. When nodes are added or removed, shards automatically rebalance across the nodes;
  4. A document exists in exactly one primary shard and its corresponding replica shards; it cannot exist in multiple primary shards;
  5. A replica shard is a copy of its primary shard and is responsible for fault tolerance and serving read requests;
  6. The number of primary shards is fixed when the index is created, while the number of replica shards can be changed at any time;
  7. The default number of primary shards is 5 and the default number of replicas is 1 (so by default there are 10 shards in total: 5 primary and 5 replica);
  8. A primary shard cannot be placed on the same node as its own replica shard (otherwise, if that node goes down, both the primary shard and its copy are lost and there is no fault tolerance), but it can be placed on the same node as the replica shards of other primary shards.

Load Mechanism

Fault Tolerance Mechanism

Expansion Mechanism

1. Vertical expansion
Buy more powerful servers. But the cost is unbounded, and the bottleneck remains. For example, suppose you have 10T of data now and the disk is already full; if the total business data can reach 100T, do you just buy a 100T disk? And what do you do when that 100T fills up too?

2. Horizontal expansion
The solution commonly adopted in the industry is to purchase more ordinary 10T servers. Each one has average performance, but many such servers organized together provide powerful storage capacity. (Recommended: cost-effective, and it does not become a bottleneck.)
 

Shard routing principle

When Elasticsearch indexes a piece of data (a document), how does it know which shard a piece of data should be stored in?

Determined by this formula:

shard = hash(routing) % number_of_primary_shards

routing is a variable value; by default it is the document's _id, but it can also be set to a custom value. The routing value is passed through a hash function to produce a number, which is then divided by number_of_primary_shards (the number of primary shards) to obtain a remainder. This remainder is between 0 and number_of_primary_shards - 1 and is the number of the shard where the document lives.
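
A minimal sketch of this routing computation; note that Elasticsearch actually hashes the routing value with murmur3, so String.hashCode() below is only a placeholder to keep the example self-contained:

// Minimal sketch of shard = hash(routing) % number_of_primary_shards.
// Elasticsearch actually uses a murmur3 hash of the routing value; String.hashCode()
// below is only a placeholder to keep the example self-contained.
final class ShardRoutingSketch {
    static int shardFor(String routing, int numberOfPrimaryShards) {
        int hash = routing.hashCode();                     // placeholder hash function
        return Math.floorMod(hash, numberOfPrimaryShards); // always in [0, numberOfPrimaryShards - 1]
    }

    public static void main(String[] args) {
        // routing defaults to the document _id
        System.out.println(shardFor("doc-42", 5)); // prints a shard number between 0 and 4
    }
}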

That is also why the number of primary shards cannot be changed after the index is created: if the number changed, all previously computed routing values would become invalid and documents could no longer be found.

In an ES cluster, every node can use the formula above to work out where any document in the cluster is stored, so every node is able to handle read and write requests. When a write request is sent to a node, that node becomes the coordinating node: it uses the routing formula to calculate which shard the document should be written to, and then forwards the request to the node holding that shard's primary.

To improve ES's write throughput, replication is performed concurrently. To resolve data conflicts that can arise from concurrent writes, ES uses optimistic locking: every document has a _version number that is incremented each time the document is modified. Once all replica shards report success, the primary reports success to the coordinating node, and the coordinating node reports success to the client.
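
A highly simplified sketch of this write path (bump the version, replicate concurrently, acknowledge only when all replicas succeed); the classes below are illustrative stand-ins, not the ES implementation:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

// Highly simplified sketch of the write path: the primary bumps a version counter
// (standing in for the per-document _version used for optimistic locking), replicates
// to all replicas concurrently, and acknowledges only after every replica succeeds.
// Illustrative only, not the Elasticsearch implementation.
final class WriteReplicationSketch {
    static final class ReplicaShard {
        void replicate(String docId, String source, long version) {
            // a real replica would apply the write only if version is newer than its local copy
        }
    }

    static final class PrimaryShard {
        private final AtomicLong version = new AtomicLong(0);
        private final List<ReplicaShard> replicas;

        PrimaryShard(List<ReplicaShard> replicas) {
            this.replicas = replicas;
        }

        long index(String docId, String source) {
            long newVersion = version.incrementAndGet(); // write locally, incrementing the version
            List<CompletableFuture<Void>> acks = replicas.stream()
                    .map(r -> CompletableFuture.runAsync(() -> r.replicate(docId, source, newVersion)))
                    .toList();
            CompletableFuture.allOf(acks.toArray(new CompletableFuture[0])).join(); // wait for all replicas
            return newVersion; // only now report success back to the coordinating node
        }
    }

    public static void main(String[] args) {
        PrimaryShard primary = new PrimaryShard(List.of(new ReplicaShard(), new ReplicaShard()));
        System.out.println(primary.index("doc-42", "{\"field\":\"value\"}")); // 1
    }
}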

Write data example


As shown above, each node has 5 shards, and each shard has a replica on another node. 

1. The client sends a write request to node 1 (the coordinating node); the routing formula yields 1, so the data should be written to primary shard 1.
2. Node 1 forwards the request to the node holding primary shard 1 (node 2); node 2 accepts the request and writes the data to disk.
3. The data is concurrently replicated to the replica shard (replica 1), with data conflicts controlled through optimistic concurrency.
4. Once all replica shards report success, node 2 reports success to the coordinating node (node 1).
5. The coordinating node (node 1) reports success to the client.
 

Updating...


Origin blog.csdn.net/zy_jun/article/details/131290052