Distributed Systems: ZooKeeper

1.1 What is ZooKeeper

ZooKeeper is a top-level project of Apache. ZooKeeper provides efficient and reliable distributed coordination services for distributed applications, and provides basic distributed services such as unified naming services, configuration management, and distributed locks. In terms of solving distributed data consistency, ZooKeeper does not directly use the Paxos algorithm, but uses a consistency protocol called ZAB.

ZooKeeper is mainly used to solve the consistency problem of application systems in distributed clusters. It can provide data storage based on a directory node tree similar to a file system. However, ZooKeeper is not used specifically to store data. Its main role is to maintain and monitor the status changes of stored data. By monitoring changes in the status of these data, data-based cluster management can be achieved.

Many well-known frameworks are based on ZooKeeper to achieve distributed high availability, such as Dubbo, Kafka, etc.

1.2 Features of ZooKeeper

ZooKeeper has the following features:

  • **Sequential consistency:** Transaction requests initiated from the same client are eventually applied to ZooKeeper in strictly the order in which they were initiated. For the implementation, see the atomic broadcast section below.
  • **Atomicity:** The result of processing a transaction request is applied consistently on all machines in the cluster, i.e. the cluster either applies a transaction on all machines or does not apply it at all. For the implementation, see the transactions section below.
  • **Single view:** No matter which ZooKeeper server a client connects to, the server data model it sees is consistent.
  • **High performance:** ZooKeeper keeps all data in memory, so its performance is very high. Note that since every update and deletion in ZooKeeper is transaction-based, ZooKeeper performs best in read-heavy, write-light workloads; if writes are frequent, performance degrades significantly.
  • **High availability:** ZooKeeper's high availability is based on its replica mechanism. In addition, ZooKeeper supports fault recovery; see the Leader election section below.

1.3 Design goals of ZooKeeper

  • A simple data model
  • Can be deployed as a cluster
  • Sequential access
  • High performance

2. ZooKeeper core concepts
2.1 Data model

ZooKeeper's data model is a tree-structured file system.

The nodes in the tree are called znodes; the root node is /, and each node stores its own data and node metadata. A znode can be used to store data and has an ACL associated with it (see the ACL section for details). ZooKeeper is designed to be a coordination service, not a general-purpose file store, so the amount of data stored in a znode is limited to 1 MB.

**ZooKeeper's data access is atomic.** A read or write operation either succeeds completely or fails completely.

A znode is referenced by its path, and the path must be an absolute path.

There are two types of znodes:

  • **Ephemeral (EPHEMERAL):** ZooKeeper deletes an ephemeral znode automatically when the client session that created it ends.
  • **Persistent (PERSISTENT):** ZooKeeper does not delete persistent znodes unless the client explicitly deletes them.

2.2 Node information

Znodes also support a sequential flag (SEQUENTIAL). If the sequential flag is set when a znode is created, ZooKeeper appends a monotonically increasing counter to the znode's name. Separately, every update is stamped with a transaction id (zxid), which ZooKeeper uses to implement strict sequential access control.
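As a hedged client-side sketch (connection string and paths are illustrative, not from the original article), creating znodes with the different CreateMode flags looks roughly like this:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CreateModesDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and session timeout; assumes the paths below do not exist yet.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        // Persistent znode: survives until explicitly deleted.
        zk.create("/app", "root".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this session ends.
        zk.create("/app/worker-1", "alive".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: ZooKeeper appends a monotonically increasing counter,
        // so the returned path looks like /app/task-0000000003.
        String seqPath = zk.create("/app/task-", "payload".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + seqPath);

        zk.close();
    }
}
```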

In addition to its data, each znode maintains a data structure called Stat, which holds all of the node's status information (metadata).

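For reference, a small client-side sketch (the node path and class name are illustrative) that reads a node's Stat via getData and prints its main fields:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatDemo {
    // Assumes an already-connected ZooKeeper handle and an existing node at "path".
    static void printStat(ZooKeeper zk, String path) throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        zk.getData(path, false, stat);
        System.out.println("czxid          = " + stat.getCzxid());          // zxid of the create
        System.out.println("mzxid          = " + stat.getMzxid());          // zxid of the last update
        System.out.println("ctime / mtime  = " + stat.getCtime() + " / " + stat.getMtime());
        System.out.println("version        = " + stat.getVersion());        // data version
        System.out.println("cversion       = " + stat.getCversion());       // child list version
        System.out.println("aversion       = " + stat.getAversion());       // ACL version
        System.out.println("ephemeralOwner = " + stat.getEphemeralOwner()); // owning session id, 0 if persistent
        System.out.println("dataLength     = " + stat.getDataLength());
        System.out.println("numChildren    = " + stat.getNumChildren());
    }
}
```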

2.3 Cluster roles

A ZooKeeper cluster is a high-availability cluster based on master-slave replication. Each server takes on one of the following three roles.

  • **Leader:** It is responsible for initiating and maintaining heartbeats with each Follower and Observer. All write operations must be completed through the Leader and then the Leader broadcasts the write operations to other servers. A Zookeeper cluster will only have one actual working Leader at a time.
  • **Follower:** It will respond to the Leader's heartbeat. The Follower can directly process and return the client's read request, while forwarding the write request to the Leader for processing, and is responsible for voting on the request when the Leader processes the write request. A Zookeeper cluster may have multiple Followers at the same time.
  • **Observer:** Its role is similar to a Follower's, but it has no voting rights.

2.4 ACL

ZooKeeper uses ACL (Access Control Lists) policies for permission control.

Each znode is created with an ACL list that determines who can perform what operations on it.

ACL relies on ZooKeeper's client authentication mechanism. ZooKeeper provides the following authentication methods:

  • **digest:** identifies the client by username and password
  • **sasl:** identifies the client via Kerberos
  • **ip:** identifies the client by IP address

ZooKeeper defines the following five permissions:

  • **CREATE:** allows creating child nodes;
  • **READ:** allows reading the node's data and listing its child nodes;
  • **WRITE:** allows setting the node's data;
  • **DELETE:** allows deleting child nodes;
  • **ADMIN:** allows setting the node's ACL (permissions).
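A hedged sketch of the digest scheme (node path and credentials are made up for illustration); CREATOR_ALL_ACL grants all five permissions to the authenticated creator only:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AclDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        // Authenticate this session with the digest scheme ("user:password").
        zk.addAuthInfo("digest", "alice:secret".getBytes());

        // Only the authenticated creator gets full permissions on this node.
        zk.create("/secure-config", "v1".getBytes(),
                ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);

        zk.close();
    }
}
```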
3. How ZooKeeper works
3.1 Read operation

A Leader, Follower, or Observer can all serve read requests directly: the data is simply read from local memory and returned to the client.

Since serving read requests requires no interaction between servers, the more Followers/Observers there are, the greater the overall read throughput of the system, i.e. the better the read performance.

3.2 Write operation

All write requests are actually handed over to the Leader for processing. The Leader sends the write request to all Followers in the form of a transaction and waits for ACK. Once it receives ACKs from more than half of the Followers, the write operation is considered successful.

3.2.1 Writing Leader

Writing operations through Leader are mainly divided into five steps:

  1. The client initiates a write request to the Leader.
  2. The Leader sends the write request to all Followers in the form of transaction proposal and waits for ACK.
  3. Follower returns ACK after receiving Leader's transaction proposal.
  4. After the Leader gets more than half of the ACKs (the Leader has one ACK for itself by default), it sends Commit to all Followers and Observers.
  5. Leader returns the processing results to the client.

Note

  • The Leader does not need ACKs from Observers, i.e. Observers have no voting rights.
  • The Leader does not need ACKs from all Followers, only from more than half of the voting servers, and the Leader's own ACK counts toward that majority. For example, with 4 Followers, only two of them need to return an ACK, because (2 + 1) / (4 + 1) > 1/2.
  • Although Observers have no voting rights, they still synchronize the Leader's data so that they can return data that is as up to date as possible when serving read requests.
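A tiny sketch of the majority check described above (class and method names are made up for illustration):

```java
public class QuorumCheck {
    // votingServers = Leader + Followers (Observers are excluded from voting).
    static boolean hasQuorum(int ackCount, int votingServers) {
        return ackCount > votingServers / 2;
    }

    public static void main(String[] args) {
        // 1 Leader + 4 Followers = 5 voting servers; the Leader's own ACK counts.
        System.out.println(hasQuorum(2 + 1, 5)); // true:  3 > 5/2
        System.out.println(hasQuorum(1 + 1, 5)); // false: 2 > 5/2 does not hold
    }
}
```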
3.2.2 Writing via Follower/Observer


Both Followers and Observers can accept write requests, but they cannot process them directly; instead, they forward the write request to the Leader for processing.

Apart from the extra request-forwarding step, the process is no different from writing through the Leader directly.

3.3 Transactions

ZooKeeper has strict sequential access control capabilities for each update request from the client.

In order to ensure the sequential consistency of transactions, ZooKeeper uses an increasing transaction ID number (zxid) to identify transactions.

**The Leader allocates a separate queue to each Follower server, puts transaction Proposals into those queues in turn, and sends the messages according to a FIFO (first in, first out) policy.** After receiving a Proposal, a Follower writes it to its local disk as a transaction log and, once the write succeeds, returns an Ack to the Leader. **When the Leader has received Acks from more than half of the Followers, it broadcasts a Commit message to all Followers to notify them to commit the transaction.** After that, the Leader commits the transaction itself, and each Follower commits the transaction after receiving the Commit message.

Every Proposal is stamped with a zxid when it is issued. The zxid is a 64-bit number: the high 32 bits are the epoch, which is used to identify whether the Leader has changed; every time a new Leader is elected, a new epoch is generated, marking that Leader's current reign. The low 32 bits are a counter that increases for each transaction within the epoch.
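A small worked example (class and method names are made up) of splitting a zxid into the two parts described above:

```java
public class ZxidLayout {
    static long epoch(long zxid)   { return zxid >>> 32; }          // high 32 bits
    static long counter(long zxid) { return zxid & 0xFFFFFFFFL; }   // low 32 bits

    public static void main(String[] args) {
        long zxid = (5L << 32) | 42L;  // epoch 5, 42nd transaction within that epoch
        System.out.println("zxid=0x" + Long.toHexString(zxid)
                + " epoch=" + epoch(zxid) + " counter=" + counter(zxid));
    }
}
```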

The detailed process is as follows:

  • The Leader waits for server connections;
  • A Follower connects to the Leader and sends its largest zxid to the Leader;
  • The Leader determines the synchronization point based on the Follower's zxid;
  • After synchronization completes, the Leader notifies the Follower that it is now in the UPTODATE state;
  • After receiving the UPTODATE message, the Follower can start accepting client requests again.

3.4 Observation

The client registers to monitor the znode it cares about. When the znode status changes (data changes, child nodes increase or decrease), the ZooKeeper service will notify the client.

There are generally two forms of maintaining a connection between the client and the server:

  • The client continuously polls the server
  • Server pushes status to client

Zookeeper's choice is to actively push status from the server, which is the observation mechanism (Watch).

ZooKeeper's observation mechanism allows users to register listeners for events of interest on specified nodes. When an event occurs, the listener will be triggered and event information will be pushed to the client.
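A minimal client-side sketch (node path and handling are illustrative) of registering a watch and reacting to the one-shot notification:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        Watcher dataWatcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Fires once when /app/config changes; re-register to keep watching.
                System.out.println("event: " + event.getType() + " on " + event.getPath());
            }
        };

        // getData registers the watcher on /app/config; the server pushes the next change.
        byte[] data = zk.getData("/app/config", dataWatcher, null);
        System.out.println("current value: " + new String(data));

        Thread.sleep(60_000); // keep the session alive long enough to receive the notification
        zk.close();
    }
}
```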

When the client calls an interface such as getData to obtain a znode's state, it can pass in a callback for handling node changes, and the server will then actively push node changes to the client.

The Watcher object passed to such a method implements the corresponding process method. Whenever the state of the corresponding node changes, the server-side WatchManager invokes the Watcher roughly as follows (simplified):

```java
Set<Watcher> triggerWatch(String path, EventType type, Set<Watcher> supress) {
    WatchedEvent e = new WatchedEvent(type, KeeperState.SyncConnected, path);
    Set<Watcher> watchers;
    synchronized (this) {
        // Watches are one-shot: they are removed as they are triggered.
        watchers = watchTable.remove(path);
        if (watchers == null || watchers.isEmpty()) {
            return null;
        }
    }
    for (Watcher w : watchers) {
        if (supress != null && supress.contains(w)) {
            continue; // skip watchers the caller asked to suppress
        }
        w.process(e); // push the event to this watcher's client connection
    }
    return watchers;
}
```

All data in ZooKeeper is managed by a data structure called DataTree; write requests ultimately change the contents of this tree. A read request may register a Watcher callback, and a later write request may trigger that callback, at which point the WatchManager notifies the client of the data change.

The implementation of the notification mechanism is actually quite simple: a read request sets a Watcher to listen for an event, and when a write request triggers that event, the notification is sent to the corresponding client.

3.5 Session

The ZooKeeper client connects to the ZooKeeper ensemble over a long-lived TCP connection. A session is established from the first connection and is kept valid through a heartbeat mechanism. Over this connection, the client can send requests, receive responses, and receive Watch event notifications.

Each ZooKeeper client is configured with a list of the servers in the cluster. On startup, the client walks through the list trying to establish a connection; if an attempt fails, it tries the next server, and so on.

Once a client establishes a connection with a server, the server creates a new session for the client. **Each session has a timeout period; if the server does not receive any request from the client within that period, the session is considered expired.** Once a session expires it cannot be reopened, and any ephemeral znodes associated with it are deleted.

Generally speaking, sessions should be long-lived, and this needs to be guaranteed by the client. The client can keep the session from expiring through heartbeat mode (ping).
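A hedged sketch of establishing a session (ensemble addresses and timeout are illustrative) and waiting for it to connect:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class SessionDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // The client walks through this server list until a connection succeeds.
        ZooKeeper zk = new ZooKeeper(
                "zk1:2181,zk2:2181,zk3:2181",  // placeholder ensemble addresses
                15000,                          // requested session timeout (ms)
                event -> {
                    if (event.getState() == KeeperState.SyncConnected) {
                        connected.countDown();  // session established
                    }
                });

        connected.await();
        // The negotiated timeout may be adjusted by the server within its own limits.
        System.out.println("sessionId=" + Long.toHexString(zk.getSessionId())
                + " negotiatedTimeout=" + zk.getSessionTimeout());
        zk.close();
    }
}
```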


ZooKeeper sessions have four properties:

  • **sessionID:** the session ID, which uniquely identifies a session. Every time a client creates a new session, ZooKeeper assigns it a globally unique sessionID.
  • **TimeOut:** the session timeout. When the client constructs a ZooKeeper instance, it passes a sessionTimeout parameter specifying the desired timeout; the client sends this value to the server, and the server adjusts it within its own limits to determine the final session timeout.
  • **TickTime:** the next session timeout point. To implement its "bucketing strategy" for session management, and to check and clean up expired sessions efficiently, ZooKeeper marks each session with a next timeout point whose value is roughly the current time plus TimeOut.
  • **isClosing:** marks whether a session has been closed. When the server detects that a session has timed out, it marks the session's isClosing flag as "closed", which ensures that no new requests from that session are processed.

ZooKeeper's session management is handled mainly by the SessionTracker, which adopts a bucketing strategy (placing sessions with similar timeout points into the same bucket), so that sessions in different buckets are isolated and sessions in the same bucket can be processed uniformly.

4. ZAB Protocol

ZooKeeper does not use the Paxos algorithm directly; instead it uses a consensus protocol called ZAB. ZAB is similar to Paxos, but they are not the same protocol and their operation differs.

The ZAB protocol is an atomic broadcast protocol specially designed by Zookeeper to support crash recovery.

The ZAB protocol is ZooKeeper's data consistency and high availability solution.

The ZAB protocol defines two processes that can loop infinitely :

  • **Leader election:** Used for fault recovery to ensure high availability.
  • **Atomic broadcast:** Used for master-slave synchronization to ensure data consistency.
4.1 Elect Leader

ZooKeeper failure recovery

The ZooKeeper cluster adopts a single-master (Leader), multiple-slave (Follower) model; the master and slave nodes keep data consistent through the replica mechanism.

  • If a Follower node goes down: each node in the ZooKeeper cluster maintains its own state in memory and keeps communicating with the other nodes. As long as more than half of the machines in the cluster are working normally, the cluster as a whole can continue to provide service.

  • If the Leader node goes down: the cluster cannot work normally. In this case, fault recovery must be performed through the Leader election mechanism of the ZAB protocol.

Simply put, the ZAB protocol's Leader election works like this: a new Leader is produced by a majority election, after which the other machines synchronize their state from the new Leader. When more than half of the machines have completed state synchronization, the cluster exits Leader election mode and enters atomic broadcast mode.

4.1.1 Terminology

**myid:** Each ZooKeeper server creates a file named myid under its data directory, which contains that server's unique ID (an integer) within the ZooKeeper cluster.

**zxid:** Similar to a transaction ID in an RDBMS, it identifies the Proposal ID of an update operation. To guarantee ordering, the zxid must increase monotonically. ZooKeeper therefore represents it with a 64-bit number: the upper 32 bits are the Leader's epoch, starting from 1 and increased by one each time a new Leader is elected; the lower 32 bits are a sequence number within that epoch, which is reset each time the epoch changes. This guarantees that zxid is globally increasing.

4.1.2 Server status

  • **LOOKING:** uncertain of the Leader. A server in this state believes the cluster currently has no Leader and will initiate a Leader election.
  • **FOLLOWING:** Follower state. The current server's role is Follower and it knows who the Leader is.
  • **LEADING:** Leader state. The current server's role is Leader and it maintains heartbeats with the Followers.
  • **OBSERVING:** Observer state. The current server's role is Observer; the only difference from a Follower is that it does not participate in elections and does not vote during cluster write operations.

4.1.3 Ballot data structure

When each server conducts leadership election, it will send the following key information:

  • **logicClock:** Each server will maintain a self-increasing integer named logicClock, which indicates how many rounds of voting this server has initiated.
  • **state:** The current status of the server.
  • **self_id:** The myid of the current server.
  • **self_zxid:** The maximum zxid of the data saved on the current server.
  • **vote_id:** The myid of the recommended server.
  • **vote_zxid:** The maximum zxid of the data stored on the recommended server.

4.1.4 Voting process

(1) Self-increasing election rounds

Zookeeper stipulates that all valid votes must be in the same round. When each server starts a new round of voting, it will first increment the logicClock it maintains.

(2) Initialize votes

Each server clears its own ballot box before broadcasting its vote. The ballot box records the votes it has received. For example: Server 2 votes for Server 3 and Server 3 votes for Server 1; then Server 1's ballot box is (2, 3), (3, 1), (1, 1). Only each voter's latest vote is kept in the ballot box: if a voter updates its own vote, the other servers update that server's entry in their own ballot boxes after receiving the new vote.

(3) Send initialization votes

Each server initially votes for itself through broadcast.

(4) Receive external votes

Each server tries to obtain votes from the other servers and puts them into its own ballot box. If it cannot obtain any external votes, it checks whether it still maintains valid connections with the other servers in the cluster: if so, it sends its own vote again; if not, it immediately establishes the connection.

(5) Determine the election round

After receiving an external vote, the server first handles it differently according to the logicClock contained in the vote:

  • The external vote's logicClock is greater than its own logicClock. This means this server's election round lags behind the other servers'. It immediately clears its own ballot box, updates its own logicClock to the received logicClock, then compares its previous vote with the received vote to decide whether it needs to change its vote, and finally broadcasts its vote again.
  • The external vote's logicClock is smaller than its own logicClock. The current server simply ignores this vote and continues processing the next one.
  • The external vote's logicClock is equal to its own. In this case a ballot PK (comparison) is performed.

(6) Ballot PK

The ballot PK is based on comparing (self_id, self_zxid) with (vote_id, vote_zxid):

  • If the external vote's logicClock is greater than its own logicClock, the server changes its own logicClock and the logicClock of its own vote to the received logicClock.
  • If the logicClocks are equal, the vote_zxid of the two votes is compared. If the external vote's vote_zxid is larger, the server updates the vote_zxid and vote_myid in its own vote to those of the received vote and broadcasts it; in addition, it puts both the received vote and its own updated vote into its own ballot box. If a vote from the same voter (self_myid, self_zxid) already exists in the ballot box, it is simply overwritten.
  • If the vote_zxids are also equal, the vote_myid of the two votes is compared. If the external vote's vote_myid is larger, the server updates the vote_myid in its own vote to that of the received vote and broadcasts it; likewise, it puts both the received vote and its own updated vote into its own ballot box.
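A hedged sketch of the comparison order for the equal-logicClock case (field and class names follow this article's terminology, not ZooKeeper's actual classes):

```java
public class BallotPk {
    // Returns true if the external vote wins the PK and should be adopted.
    static boolean externalVoteWins(long selfVoteZxid, long selfVoteMyid,
                                    long extVoteZxid, long extVoteMyid) {
        if (extVoteZxid != selfVoteZxid) {
            return extVoteZxid > selfVoteZxid;   // prefer the vote with the larger zxid
        }
        return extVoteMyid > selfVoteMyid;       // tie-break on the larger myid
    }

    public static void main(String[] args) {
        System.out.println(externalVoteWins(0x500000010L, 1, 0x500000012L, 3)); // true (larger zxid)
        System.out.println(externalVoteWins(0x500000012L, 2, 0x500000012L, 3)); // true (same zxid, larger myid)
    }
}
```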

(7) Count votes

If the server determines that more than half of the servers have recognized its vote (possibly an updated vote), it terminates the voting. Otherwise, it continues to receive votes from other servers.

(8) Update server status

After voting terminates, the server starts updating its own state. If more than half of the votes are for itself, it updates its state to LEADING; otherwise, it updates its state to FOLLOWING.

From the process above it is easy to see that, for a Leader to obtain the support of the majority of servers, a ZooKeeper cluster should have an odd number of nodes (2N + 1), and the number of surviving nodes must not be less than N + 1.

The above process is repeated every time a Server starts. In recovery mode, if a server has just recovered from a crash or has just started, its data and session information are restored from disk snapshots. ZooKeeper records transaction logs and takes snapshots periodically to facilitate state restoration during recovery.

4.2 Atomic Broadcast

ZooKeeper achieves high availability through a replica mechanism.

So, how does ZooKeeper implement the replica mechanism? The answer is: atomic broadcast of the ZAB protocol.


Atomic broadcast requirements of the ZAB protocol:

**All write requests are forwarded to the Leader, which notifies the Followers via atomic broadcast. When more than half of the Followers have updated and persisted the state, the Leader commits the update, and the client then receives a response that the update succeeded.** This is somewhat similar to a two-phase commit protocol in databases.

During the entire message broadcast process, the Leader server will generate a corresponding Proposal for each transaction request, assign it a globally unique incremental transaction ID (ZXID), and then broadcast it.

5. ZooKeeper application

ZooKeeper can be used for functions such as data publish/subscribe, load balancing, naming services, distributed coordination/notification, cluster management, Master election, distributed locks, and distributed queues.

5.1 Naming service

In a distributed system, a globally unique name is usually required, such as generating a globally unique order number, etc. ZooKeeper can generate a globally unique ID through the characteristics of sequential nodes, thereby providing naming services for the distributed system.
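A minimal sketch (parent path and class name are illustrative) of generating unique, ordered IDs with sequential nodes:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class UniqueIdDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        // Assumes the parent node /orders already exists. Each call returns a distinct
        // path such as /orders/order-0000000007; the suffix is a monotonically
        // increasing counter maintained by the parent node.
        String path = zk.create("/orders/order-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        String orderId = path.substring(path.lastIndexOf('-') + 1);
        System.out.println("generated order id: " + orderId);

        zk.close();
    }
}
```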

5.2 Configuration management

Thanks to its watch (observation) mechanism, ZooKeeper can be used as a highly available configuration store, allowing the participants of a distributed application to retrieve and update configuration data.

5.3 Distributed lock

Distributed locks can be implemented through ZooKeeper's temporary nodes and Watcher mechanism.

For example, there is a distributed system with three nodes A, B, and C trying to obtain distributed locks through ZooKeeper.

(1) Each client goes to /locks (the directory path is decided by the application itself) and creates a temporary sequential node (EPHEMERAL_SEQUENTIAL).

(2) When trying to acquire the lock, each client gets all the child nodes under /locks (id_0000, id_0001, id_0002, ...) and checks whether the node it created is the smallest one.

  • If so, it has acquired the lock.

    Releasing the lock: after finishing the operation, it deletes the node it created.

  • If not, it watches the node whose sequence number is immediately smaller than its own.

(3) Releasing the lock means deleting the node you created.

For example, Node A deletes the node id_0000 that it created; Node B detects the change, finds that its own node is now the smallest, and therefore obtains the lock.
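A hedged sketch of this recipe with the raw ZooKeeper client (paths and class name are illustrative; production code would typically use a tested recipe such as Apache Curator's InterProcessMutex):

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleZkLock {
    private final ZooKeeper zk;
    private String myNode;  // full path of the node this client created

    SimpleZkLock(ZooKeeper zk) { this.zk = zk; }

    void lock() throws Exception {
        // (1) Create an ephemeral sequential node under /locks (the parent must already exist).
        myNode = zk.create("/locks/id_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        while (true) {
            // (2) List all children and sort them by their sequence suffix.
            List<String> children = zk.getChildren("/locks", false);
            Collections.sort(children);

            String myName = myNode.substring("/locks/".length());
            int myIndex = children.indexOf(myName);
            if (myIndex == 0) {
                return;  // smallest node: lock acquired
            }

            // Watch only the node immediately before ours to avoid a herd effect.
            String previous = "/locks/" + children.get(myIndex - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(previous, event -> {
                    if (event.getType() == EventType.NodeDeleted) {
                        gone.countDown();
                    }
                }) == null) {
                continue;  // predecessor already gone; re-check immediately
            }
            gone.await();  // predecessor released the lock (or its session died)
        }
    }

    void unlock() throws Exception {
        zk.delete(myNode, -1);  // (3) releasing the lock = deleting our own node
    }
}
```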

5.4 Cluster management

ZooKeeper can also help solve many common problems in distributed systems:

  • For example, you can create a temporary node to establish a heartbeat detection mechanism. If a service node in the distributed system goes down, the session it holds will time out, the temporary node will be deleted, and the corresponding listening event will be triggered.

  • Each service node of the distributed system can also write its own node status to the temporary node to complete status reports or node work progress reports.

  • Through data subscription and publishing functions, ZooKeeper can also decouple modules and schedule tasks in distributed systems.

  • Through the monitoring mechanism, the service nodes of the distributed system can also be dynamically brought online and offline, thereby achieving dynamic expansion of the service.

5.5 Elect Leader node

An important pattern in distributed systems is the master-slave pattern (Master/Slaves). ZooKeeper can be used for Master election in this pattern: all service nodes are allowed to compete to create the same znode, and since ZooKeeper cannot have two znodes with the same path, only one service node will succeed in creating it; that node becomes the Master.
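A hedged sketch of this competitive-creation pattern (the election path and class name are illustrative):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MasterElection {
    // Returns true if this node won the election.
    static boolean tryBecomeMaster(ZooKeeper zk, String myId) throws Exception {
        try {
            // Everyone races to create the same ephemeral znode; only one create succeeds.
            zk.create("/master", myId.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            // Someone else is already Master; watch /master to re-elect if it disappears.
            return false;
        }
    }
}
```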

5.6 Queue management

ZooKeeper can handle two types of queues:

  • When all the members of a queue are gathered, the queue is available. Otherwise, it will wait for all members to arrive. This is a synchronous queue.
  • The queue performs enqueue and dequeue operations in a FIFO manner, such as implementing the producer and consumer models.

The idea for implementing a synchronous queue with ZooKeeper is as follows:

Create a parent directory /synchronizing, and have each member watch (set a Watch on) whether the flag node /synchronizing/start exists. Each member then joins the queue by creating a temporary node /synchronizing/member_i. Next, each member gets all the child nodes of /synchronizing and checks whether the number of members has reached the required count. If it is still less than the required count, the member waits for /synchronizing/start to appear; if it is already equal, the member creates /synchronizing/start.
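A hedged sketch of this barrier (paths, class name, and the member count are illustrative):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SyncQueueBarrier {
    // Blocks until "total" members have joined under /synchronizing (parent must already exist).
    static void joinAndWait(ZooKeeper zk, String memberName, int total) throws Exception {
        CountDownLatch started = new CountDownLatch(1);

        // Watch for the start flag first, so the notification cannot be missed.
        if (zk.exists("/synchronizing/start", event -> {
                if (event.getType() == EventType.NodeCreated) {
                    started.countDown();
                }
            }) != null) {
            return; // barrier already released
        }

        // Join the queue with an ephemeral member node (member_i in the text).
        zk.create("/synchronizing/" + memberName, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Count the members that have arrived, excluding the start flag itself.
        long members = zk.getChildren("/synchronizing", false).stream()
                .filter(name -> !name.equals("start"))
                .count();

        if (members >= total) {
            try {
                // The last member to arrive creates the flag and releases everyone.
                zk.create("/synchronizing/start", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) {
                // another member created the flag concurrently; that's fine
            }
        } else {
            started.await(); // wait until /synchronizing/start appears
        }
    }
}
```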
