ZooKeeper core knowledge points

ZooKeeper core concepts

Node roles in a ZooKeeper cluster

Apache ZooKeeper is a reliable and scalable coordination service for distributed systems. It typically serves as a unified naming service, unified configuration management, a registry (for distributed cluster management), a distributed lock service, a leader election service, and so on. Many distributed systems rely on a ZooKeeper cluster for coordination and scheduling, such as Dubbo, HDFS 2.x, HBase, and Kafka; ZooKeeper has become a standard component of modern distributed systems.

ZooKeeper itself is also a distributed application. The following figure shows the core architecture of the ZooKeeper cluster.

  • Client node: from the business perspective, this is a node of the distributed application. It maintains a long-lived connection to one server in the ZooKeeper cluster through ZkClient or another ZooKeeper client and sends heartbeats periodically. From the ZooKeeper cluster's perspective it is simply a client: it can query or modify data in the cluster, and it can also register watches on ZooKeeper nodes (ZNodes). When a watched ZNode changes, for example it is deleted, a child node is added, or its data is modified, the cluster immediately notifies the Client over that long-lived connection (see the connection sketch after this list).
  • Leader node: the master node of the ZooKeeper cluster. It handles all write operations for the cluster to guarantee the ordering of transaction processing, and it is also responsible for synchronizing data to all Follower and Observer nodes in the cluster.
  • Follower node: a slave node in the ZooKeeper cluster. It can serve client read requests and return results directly, but it does not process write requests; it forwards them to the Leader node instead. Follower nodes also take part in Leader election.
  • Observer node: a special slave node that does not take part in Leader election; in every other respect it behaves like a Follower. Observers exist to raise the read throughput of the cluster. Simply adding Follower nodes to scale reads has a serious side effect: write capacity drops sharply, because every write must be synchronized by the Leader to more than half of the Follower nodes. Observers do not participate in that quorum, so they let the cluster increase read throughput without reducing write capacity.
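To make the Client role concrete, here is a minimal Java sketch (using the standard org.apache.zookeeper.ZooKeeper client) of a client node establishing a session with the cluster. The connect string zk1:2181,zk2:2181,zk3:2181 and the 15-second session timeout are made-up values for illustration.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkClientDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // The connect string lists several servers; the client picks one and keeps a
        // long-lived session with it (heartbeats are handled by the client library).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();          // session established
                }
            }
        });

        connected.await();
        System.out.println("connected, session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
```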

Types of ZNodes stored in ZooKeeper

ZooKeeper stores data logically as a tree, and the nodes of that tree are called ZNodes. Each ZNode is identified by its name, that is, the path from the root of the tree to the node (with components separated by "/"). Every node in the ZooKeeper tree can have child nodes, much like a directory tree in a file system.

There are four ZNode types:

  • Persistent node: once created, a persistent node exists until it is explicitly deleted; it is not removed when the session of the Client that created it ends.
  • Persistent sequential node: the basic behavior is the same as a persistent node, except that during creation ZooKeeper automatically appends a monotonically increasing numeric suffix to the requested name to form the final node name.
  • Ephemeral (temporary) node: when the session of the ZooKeeper Client that created it ends, the node is deleted automatically by the ZooKeeper cluster. Another difference from a persistent node is that an ephemeral node cannot have child nodes.
  • Ephemeral sequential node: the basic behavior is the same as an ephemeral node, with a monotonically increasing numeric suffix appended to the requested name during creation (see the creation sketch after this list).
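Below is a minimal sketch of creating each ZNode type with the standard Java client, where CreateMode maps directly onto the four types. The paths /config, /queue, /workers, and /locks are illustrative, and their parent nodes are assumed to exist already.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZNodeTypesDemo {

    static void createExamples(ZooKeeper zk) throws Exception {
        // Persistent node: survives the session that created it.
        zk.create("/config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Persistent sequential node: the server appends an increasing suffix,
        // e.g. the actual path comes back as /queue/task-0000000003.
        String task = zk.create("/queue/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

        // Ephemeral (temporary) node: removed automatically when this session ends;
        // it cannot have child nodes.
        zk.create("/workers/worker-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Ephemeral sequential node: a common building block for distributed locks.
        String lock = zk.create("/locks/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        System.out.println("created " + task + " and " + lock);
    }
}
```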

A stat structure is maintained in each ZNode, recording the ZNode's metadata, such as version numbers, the ACL version, timestamps, and the data length. Its main fields are:

  • czxid / mzxid / pzxid: the zxid of the transaction that created the node, last modified its data, and last modified its children, respectively
  • ctime / mtime: the creation time and the time of the last data modification
  • version / cversion / aversion: the version numbers of the node's data, its child list, and its ACL
  • ephemeralOwner: the session id of the owning session if the node is ephemeral, otherwise 0
  • dataLength: the length of the data stored in the node, in bytes
  • numChildren: the number of child nodes
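A small sketch of reading this metadata from the Java client: getData fills a Stat object as a side effect, and the data version can then be used for a conditional (optimistic) update. The path passed in is whatever node you are inspecting.

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatDemo {

    static void printStat(ZooKeeper zk, String path) throws Exception {
        Stat stat = new Stat();
        byte[] data = zk.getData(path, false, stat);   // getData fills the Stat in place

        System.out.println("created  zxid : " + stat.getCzxid());
        System.out.println("modified zxid : " + stat.getMzxid());
        System.out.println("data version  : " + stat.getVersion());
        System.out.println("acl  version  : " + stat.getAversion());
        System.out.println("data length   : " + stat.getDataLength());
        System.out.println("children      : " + stat.getNumChildren());

        // Optimistic update: succeeds only if nobody changed the node since we read it.
        zk.setData(path, data, stat.getVersion());
    }
}
```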

In addition to the basic create, read, update, and delete operations on ZNodes through a ZooKeeper Client, we can also register Watchers to monitor changes to a ZNode, its data, and its children. Once a change is observed, the corresponding Watcher is triggered and the ZooKeeper Client that registered it is notified immediately. Watchers have the following properties:

  • Push-based. When a Watcher is triggered, the ZooKeeper cluster actively pushes the notification to the client; the client does not need to poll.
  • One-shot. A Watcher fires only once for a change. If the client wants to be notified of later updates as well, it must register the Watcher again after it has fired (as in the sketch after this list).
  • Visibility. If a client attaches a Watcher to a read and that Watcher is later triggered, the client can never observe the updated data before it has received the Watcher notification; in other words, the change notification is delivered before the client can see the result of the change.
  • Ordering. If multiple updates trigger multiple Watchers, the Watchers fire in the same order as the updates.
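The one-shot property is the one that most often trips people up, so here is a hedged Java sketch of watching a node's data and re-registering the watch inside the callback. The class name DataWatch is invented for this example.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class DataWatch implements Watcher {

    private final ZooKeeper zk;
    private final String path;

    DataWatch(ZooKeeper zk, String path) throws Exception {
        this.zk = zk;
        this.path = path;
        zk.getData(path, this, new Stat());            // register the first watch
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                // The watch has already fired and is gone; re-register it while
                // reading the new value, so later changes are not missed.
                byte[] data = zk.getData(path, this, new Stat());
                System.out.println(path + " changed to: " + new String(data));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```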

Overview of the message broadcast process

Nodes in all three roles of a ZooKeeper cluster (Leader, Follower, and Observer) can process client read requests, because every node holds the same copy of the data and can read it directly and return it to the client.

For write requests, if the Client is connected to a Follower node (or an Observer node), that node forwards the write request to the Leader. The core flow of the Leader processing a write request is as follows (a schematic sketch of the quorum rule follows the list):

  1. After the Leader node receives the write request, it assigns it a globally unique zxid (a 64-bit, monotonically increasing id); comparing zxids gives the sequential consistency of write operations.
  2. The Leader distributes the message, together with its zxid, as a proposal to all Follower nodes through first-in-first-out queues (one queue per Follower node, to preserve send order).
  3. When a Follower node receives the proposal, it first writes the proposal to its local transaction log, and returns an ACK to the Leader only after the transaction has been written successfully.
  4. When the Leader node has received ACKs from more than half of the Followers, it sends a COMMIT command to all Follower nodes and commits the transaction locally as well.
  5. When a Follower receives the COMMIT command for the message, it commits the operation too; the write operation is then complete.
  6. Finally, the Follower node returns the corresponding response for the write request to the Client.
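The following is not ZooKeeper's actual implementation, but a small Java sketch of the "more than half" rule from steps 4 and 5: the Leader counts itself plus every Follower that has ACKed, and broadcasts COMMIT once the count exceeds half of the voting members. The names ProposalTracker and onAck are invented for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a tiny model of the leader-side quorum check, not ZooKeeper's real code.
class ProposalTracker {

    private final int votingMembers;                   // leader + followers (observers excluded)
    private final Set<Integer> ackedServers = new HashSet<>();
    private boolean committed = false;

    ProposalTracker(int votingMembers, int leaderId) {
        this.votingMembers = votingMembers;
        ackedServers.add(leaderId);                    // the leader has logged the proposal itself
    }

    /** Called when a follower reports that the proposal is safely in its transaction log. */
    synchronized boolean onAck(int followerId) {
        ackedServers.add(followerId);
        if (!committed && ackedServers.size() > votingMembers / 2) {
            committed = true;                          // quorum reached: broadcast COMMIT now
            return true;
        }
        return false;                                  // already committed, or still waiting
    }
}
```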

The following figure shows the core flow of the write operation:

Crash recovery

In the request processing flow above, if the Leader node crashes, the ZooKeeper cluster may be left in one of the following two states:

The Leader received ACKs from more than half of the Follower nodes, broadcast the COMMIT command to the Followers, committed locally, and responded to the connected client, but crashed before every Follower received the COMMIT command. In that case the remaining servers have logged the message but have not yet committed (executed) it.

The Leader crashed right after generating a proposal, before the other Followers received it (or after only a small minority of Follower nodes received it); in that case the write operation fails.

Leader election

After the Leader crashes, ZooKeeper enters crash-recovery mode and re-elects a Leader node.

ZooKeeper has the following two requirements for the new Leader:

  1. Proposals that the original Leader had already committed must be broadcast and committed by the new Leader as well; this is why the node with the largest zxid must be chosen as the new Leader.
  2. Proposals that the original Leader never broadcast, or only partially broadcast, must be discarded: the new Leader tells the original Leader (once it rejoins) and any Followers that received them to delete those proposals, which keeps the cluster's data consistent.

An example of the election process follows.

ZooKeeper's leader election uses the ZAB protocol. A full treatment would take a lot of space, so here we only walk through the general process with an example.

Suppose the current cluster has 5 ZooKeeper nodes, with sids 1, 2, 3, 4, 5 and zxids 10, 10, 9, 9, 8 respectively, and the node with sid 1 is currently the Leader. A zxid actually consists of two parts: an epoch (the upper 32 bits) and an incrementing counter (the lower 32 bits). The epoch identifies the current Leader's term and is incremented at each election; this prevents an old Leader from a previous term, reconnecting after a network partition, from causing unnecessary re-elections. In this example we assume every node has the same epoch.
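The epoch/counter split can be written directly as bit operations; here is a tiny Java sketch (the concrete values are invented):

```java
public class ZxidDemo {

    // zxid layout: upper 32 bits = epoch (leader term), lower 32 bits = counter within the term.
    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xffffffffL; }

    public static void main(String[] args) {
        long zxid = (5L << 32) | 42L;                  // epoch 5, 42nd transaction of that epoch
        System.out.println("epoch   = " + epochOf(zxid));    // prints 5
        System.out.println("counter = " + counterOf(zxid));  // prints 42
    }
}
```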

How a node decides which vote wins during an election (a comparator sketch follows this list):

  • First compare epochs: the larger epoch wins.
  • If the epochs are equal, the larger zxid wins.
  • If both epoch and zxid are equal, the larger server id wins (the sid, i.e. the server's myid).
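These three criteria amount to an ordering over (epoch, zxid, sid) triples. A hedged Java sketch of that ordering, using a hypothetical Vote record, is below; in the walkthrough that follows, a node holding (3,9) switches to (2,10) precisely because this comparison prefers the candidate.

```java
import java.util.Comparator;

// Hypothetical vote holder, only to illustrate the ordering described above.
record Vote(long epoch, long zxid, long sid) {}

class VoteOrder {

    // Larger epoch wins; if equal, larger zxid wins; if still equal, larger sid (myid) wins.
    static final Comparator<Vote> PREFERENCE =
            Comparator.comparingLong(Vote::epoch)
                      .thenComparingLong(Vote::zxid)
                      .thenComparingLong(Vote::sid);

    /** True if a node holding `current` should switch its vote to `candidate`. */
    static boolean shouldSwitch(Vote current, Vote candidate) {
        return PREFERENCE.compare(candidate, current) > 0;
    }
}
```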

At some point the server of node 1 crashes and the ZooKeeper cluster begins an election. Since a node cannot see the voting status of the other nodes (all are in the LOOKING state), each node initially votes for itself. So the nodes with sid 2, 3, 4, 5 cast the votes (2,10), (3,9), (4,9), (5,8) respectively, and each node also receives the votes of the other nodes (here a vote is written as the pair (sid, zxid)).

  • Node 2 receives the votes (3,9), (4,9), (5,8); after comparing, its own zxid is the largest, so it does not need to change its vote.
  • Node 3 receives the votes (2,10), (4,9), (5,8); since node 2's zxid is larger than its own, it changes its vote to (2,10) and sends the updated vote to the other nodes.
  • Node 4 receives the votes (2,10), (3,9), (5,8); since node 2's zxid is larger than its own, it likewise changes its vote to (2,10) and sends the updated vote to the other nodes.
  • The same holds for node 5, which eventually also changes its vote to (2,10).

After the second round of voting, each node in the cluster again receives the votes of the other machines and starts counting. If more than half of the nodes vote for the same node, that node becomes the new Leader; here node 2 clearly wins and becomes the new Leader node.

The new Leader then increments the epoch and distributes the new epoch to every Follower node. Each Follower acknowledges the new epoch with an ACK that carries its largest zxid and its transaction-log history. The Leader selects the largest zxid and updates its own transaction log accordingly (node 2 in this example needs no update), and then synchronizes the latest transaction log to all Follower nodes in the cluster. Only when more than half of the Followers have synchronized successfully does this quasi-Leader become the official Leader node and start working.
