Learn more about ZooKeeper

ZooKeeper is a distributed coordination service, maintained by Apache.

ZooKeeper can be regarded as a highly available file system.

ZooKeeper can be used for publish/subscribe, load balancing, naming services, distributed coordination/notification, cluster management, Master election, distributed locks, and distributed queues.

1. Introduction to ZooKeeper

1.1 What is ZooKeeper

ZooKeeper is a top-level Apache project. It provides efficient and reliable distributed coordination services for distributed applications, including basic services such as unified naming, configuration management, and distributed locks. To solve the problem of distributed data consistency, ZooKeeper does not adopt the Paxos algorithm directly, but uses a consistency protocol called ZAB.

ZooKeeper is mainly used to solve consistency problems of application systems in distributed clusters. It provides data storage based on a directory-tree structure similar to a file system. However, ZooKeeper is not intended as a general-purpose data store; its main role is to maintain and monitor changes in the state of the data it stores. By monitoring these state changes, data-based cluster management can be achieved.

Many well-known frameworks rely on ZooKeeper for distributed coordination and high availability, such as Dubbo and Kafka.

1.2 Features of ZooKeeper

ZooKeeper has the following features:

  • Sequential consistency: transaction requests initiated from the same client are applied to ZooKeeper strictly in the order in which they were initiated. For the implementation, see Atomic Broadcast below.

  • Atomicity: the result of a transaction request is applied consistently on all machines in the cluster; either the entire cluster successfully applies a transaction or none of it does. For the implementation, see Transactions below.

  • Single view: no matter which ZooKeeper server the client connects to, it sees the same server data model.

  • High performance: ZooKeeper keeps all data in memory, so its performance is high. Note that since all updates and deletes in ZooKeeper are transactional, ZooKeeper performs best in read-heavy workloads; if writes are frequent, performance drops considerably.

  • High availability: ZooKeeper's high availability is based on its replication mechanism. In addition, ZooKeeper supports failure recovery; see Leader Election below.

1.3 Design goals of ZooKeeper

  • Simple data model

  • Can build a cluster

  • Sequential access

  • High performance

2. ZooKeeper Core Concepts

2.1 Data model

The data model of ZooKeeper is a tree structured file system.

The nodes in the tree are called znodes; the root node is /, and each node stores its own data and node information. A znode can be used to store data and has an ACL associated with it (see ACL below). ZooKeeper is designed as a coordination service, not as a file store, so the data stored in a znode is limited to less than 1 MB.


ZooKeeper's data access is atomic: a read or write operation either succeeds completely or fails completely.

A znode is referenced by its path, which must be an absolute path.

There are two types of znodes:

  • Temporary (EPHEMERAL): When the client session ends, ZooKeeper will delete the temporary znode.

  • Persistent (PERSISTENT): Unless the client actively performs the delete operation, ZooKeeper will not delete the persistent znode.
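
For illustration, here is a minimal sketch of creating both node types with the standard Java client; the connect string, paths, and timeout are made-up values:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypesDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative connect string and session timeout; adjust for your cluster.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Persistent znode: survives until it is explicitly deleted.
        zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: deleted automatically when this session ends.
        zk.create("/app-config/worker-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        zk.close(); // closing the session removes /app-config/worker-1
    }
}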

2.2 Node information

A znode can also carry a sequence flag (SEQUENTIAL). If this flag is set when the znode is created, ZooKeeper appends a monotonically increasing counter to the znode's name. Separately, every change to the ZooKeeper state is stamped with a transaction id (zxid), which ZooKeeper uses to provide strict ordering of updates.

In addition to its data, each znode maintains a data structure called Stat, which records the node's status information (creation and modification zxids, timestamps, version numbers, the number of children, and so on).

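As a sketch, the Stat of a znode can be read through the Java client like this (the path and connect string are illustrative):

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        Stat stat = new Stat();
        byte[] data = zk.getData("/app-config", false, stat);

        System.out.println("czxid    = " + stat.getCzxid());   // zxid of the transaction that created the node
        System.out.println("mzxid    = " + stat.getMzxid());   // zxid of the last update
        System.out.println("version  = " + stat.getVersion()); // data version, bumped on every setData
        System.out.println("children = " + stat.getNumChildren());
        zk.close();
    }
}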

2.3 Cluster role

The Zookeeper cluster is a highly available cluster based on master-slave replication. Each server assumes one of the following three roles.

  • Leader: responsible for initiating and maintaining heartbeats with each Follower and Observer. All write operations must be handled by the Leader, which then broadcasts them to the other servers. A ZooKeeper cluster has only one working Leader at any time.

  • Follower: responds to the Leader's heartbeat. A Follower can process and answer a client's read request directly, forwards write requests to the Leader, and votes on write requests while the Leader is processing them. A ZooKeeper cluster may have multiple Followers at the same time.

  • Observer: The role is similar to that of Follower, but without voting rights.

2.4 ACL

ZooKeeper uses ACL (Access Control Lists) policies to control permissions.

Each znode is created with an ACL list, which is used to determine who can perform what operations on it.

ACL relies on ZooKeeper's client authentication mechanism. ZooKeeper provides the following authentication methods:

  • digest: identifies the client by username and password.

  • sasl: identifies the client through Kerberos.

  • ip: identifies the client by IP address.

ZooKeeper defines the following five permissions:

  • CREATE: allows creating child nodes;

  • READ: allows reading the node's data and listing its child nodes;

  • WRITE: allows setting the node's data;

  • DELETE: allows deleting child nodes;

  • ADMIN: allows setting permissions (ACLs) on the node.
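
As an illustrative sketch (the user:secret credential is made up), a znode can be protected with the digest scheme so that only the creating identity holds all five permissions:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AclDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Authenticate this session with the digest scheme (username:password).
        zk.addAuthInfo("digest", "user:secret".getBytes());

        // CREATOR_ALL_ACL grants all five permissions (CREATE, READ, WRITE,
        // DELETE, ADMIN) only to the identity that created the node.
        zk.create("/secure-config", "data".getBytes(),
                ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);

        zk.close();
    }
}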

3. How ZooKeeper Works

3.1 Read operation

Leader/Follower/Observer can directly process read requests, read data from local memory and return it to the client.

Since processing a read request requires no interaction between servers, the more Followers/Observers there are, the higher the overall read throughput of the system, i.e., the better the read performance.


3.2 Write operation

All write requests are actually handed over to the Leader. The Leader sends each write request to all Followers as a transaction Proposal and waits for ACKs. Once more than half of the Followers have ACKed, the write is considered successful.

3.2.1 Writing via the Leader


A write handled by the Leader involves five main steps:

  1. The client initiates a write request to the Leader.

  2. Leader sends the write request to all followers in the form of transaction proposal and waits for ACK.

  3. Follower will return ACK after receiving Leader's transaction Proposal.

  4. After the Leader has received more than half of the ACKs (the Leader counts an ACK from itself by default), it sends a Commit to all Followers and Observers.

  5. The Leader returns the processing result to the client.

Note:

  • Leader does not need to get ACK from Observer, that is, Observer has no voting rights.

  • The Leader does not need ACKs from all Followers, only from more than half, and the Leader counts an ACK from itself. For example, with 4 Followers, only two of them need to return an ACK, because $$(2+1) / (4+1) > 1/2$$.

  • Although the Observer has no voting rights, it still needs to synchronize Leader's data so that it can return as new data as possible when processing read requests.

3.2.2 Writing via a Follower/Observer


A Follower/Observer can accept write requests but cannot process them directly; instead, it forwards them to the Leader.

Apart from this extra forwarding step, the process is the same as writing via the Leader.

3.3 Transactions

ZooKeeper enforces strict ordering on every update request from a client.

To guarantee the sequential consistency of transactions, ZooKeeper identifies each transaction with an increasing transaction id (zxid).

The Leader allocates a separate queue for each Follower, places transaction Proposals into the queue in turn, and sends them according to a FIFO (first-in, first-out) strategy. After a Follower receives a Proposal, it writes it to its local disk as a transaction log and, once the write succeeds, sends an ACK back to the Leader. When the Leader has received ACKs from more than half of the Followers, it broadcasts a Commit message to all Followers and then commits the transaction itself. Each Follower commits the transaction when it receives the Commit message.

Every Proposal is stamped with a zxid when it is issued. The zxid is a 64-bit number: the high 32 bits are the epoch, used to detect whether the Leader has changed (each newly elected Leader gets a new epoch that identifies its reign), and the low 32 bits are a counter that increases within the epoch.

The detailed process is as follows:

  • Leader waits for Server connection;

  • Follower connects to Leader and sends the largest zxid to Leader;

  • Leader determines the synchronization point according to follower's zxid;

  • After synchronization is completed, notify the follower that it has become uptodate;

  • After the Follower receives the uptodate message, it can start accepting client requests again.

3.4 Watch mechanism

The client registers a watch on the znodes it cares about. When the state of a znode changes (its data changes, or children are added or removed), the ZooKeeper service notifies the client.

There are generally two forms of keeping the client and server connected:

  • The client continuously polls the server

  • The server pushes the status to the client

ZooKeeper chooses the latter: the server actively pushes state changes to the client. This is the watch mechanism (Watch).

ZooKeeper's observation mechanism allows users to register listeners for events of interest on designated nodes. When an event occurs, the listener will be triggered and the event information will be pushed to the client.

When the client calls an interface such as getData to read a znode's state and passes in a callback for handling node changes, the server will actively push node changes to the client:

The Watcher object passed to such a method implements a process method. Whenever the state of the corresponding node changes, the server-side WatchManager invokes the registered Watchers roughly as follows:

Set<Watcher> triggerWatch(String path, EventType type, Set<Watcher> supress) {
    WatchedEvent e = new WatchedEvent(type, KeeperState.SyncConnected, path);
    Set<Watcher> watchers;
    synchronized (this) {
        // Watches are one-shot: they are removed from the table as they fire.
        watchers = watchTable.remove(path);
        if (watchers == null || watchers.isEmpty()) {
            return null;
        }
    }
    for (Watcher w : watchers) {
        // Skip watchers the caller asked to suppress, notify the rest.
        if (supress != null && supress.contains(w)) {
            continue;
        }
        w.process(e);
    }
    return watchers;
}

All data in ZooKeeper is actually managed by a data structure called DataTree; every read and write request ultimately operates on this tree. A read request may register a Watcher callback, and a later write request may trigger the corresponding callback, at which point the WatchManager notifies the client of the data change.

The notification mechanism itself is fairly simple: a read request sets a Watcher to monitor an event, and when a write request triggers that event, a notification is sent to the registered client.
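
On the client side, such a watch can be registered roughly as follows; this is a sketch with an illustrative path, showing in particular the one-shot nature of watches:

import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Register a one-shot watch: the callback fires once on the next change.
        byte[] data = zk.getData("/app-config", event -> {
            if (event.getType() == EventType.NodeDataChanged) {
                System.out.println("znode changed: " + event.getPath());
                // Watches are one-shot, so a real client would re-register here.
            }
        }, null);

        Thread.sleep(60_000); // keep the session alive long enough to observe a change
        zk.close();
    }
}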

3.5 Session

A ZooKeeper client connects to the ZooKeeper cluster over a long-lived TCP connection. A session is established on the first connection and is then kept alive through a heartbeat mechanism. Through this connection, the client can send requests, receive responses, and receive Watch event notifications.

Each ZooKeeper client is configured with a list of servers in the cluster. At startup, the client walks this list and tries to establish a connection; if a server fails, it tries the next one, and so on.

Once a client establishes a connection with a server, the server creates a new session for the client. Each session has a timeout period; if the server receives no request within that period, the session is considered expired. An expired session cannot be reopened, and any ephemeral znodes associated with it are deleted.

Generally speaking, a session should live for a long time, and it is the client's responsibility to keep it alive. The client uses heartbeats (pings) to keep the session from expiring.
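
A minimal sketch of establishing a session against a server list and reacting to session state changes (the server addresses and timeout are illustrative):

import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class SessionDemo {
    public static void main(String[] args) throws Exception {
        // The client walks this server list until one connection succeeds;
        // 15000 ms is the session timeout requested by the client.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15_000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                System.out.println("connected, session is alive");
            } else if (event.getState() == KeeperState.Expired) {
                // An expired session cannot be reopened; a new client must be created,
                // and any ephemeral znodes of the old session are gone.
                System.out.println("session expired");
            }
        });

        Thread.sleep(1000); // give the connection a moment to complete (sketch only)
        System.out.println("sessionId = 0x" + Long.toHexString(zk.getSessionId())
                + ", negotiated timeout = " + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}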


The ZooKeeper session has four attributes:

  • sessionID: Session ID, which uniquely identifies a session. Every time a client creates a new session, Zookeeper will assign it a globally unique sessionID.

  • TimeOut: the session timeout. When the client constructs a ZooKeeper instance, it passes a sessionTimeout parameter specifying the desired timeout. The client sends this value to the server, and the server determines the session's final timeout within the limits it allows.

  • TickTime: the next point in time at which the session will time out. To make session timeout checking and cleanup efficient with the "bucket strategy", ZooKeeper marks each session with its next timeout point, whose value is roughly the current time plus TimeOut.

  • isClosing: marks whether a session has been closed. When the server detects that a session has expired, it marks the session's isClosing flag as "closed" to ensure that no new requests from that session are processed.

ZooKeeper's session management is handled mainly by the SessionTracker, which uses a bucket strategy: sessions with similar timeout points are placed in the same bucket, so that ZooKeeper can isolate buckets from one another and process all sessions within a bucket uniformly.

4. The ZAB Protocol

ZooKeeper does not use the Paxos algorithm directly; instead it uses a consensus protocol called ZAB. ZAB is not Paxos, although the two are similar in spirit; they differ in how they operate.

The ZAB protocol is an atomic broadcast protocol specially designed by Zookeeper to support crash recovery.

The ZAB protocol is ZooKeeper's data consistency and high availability solution.

The ZAB protocol defines two modes that alternate indefinitely:

  • Leader election: used for failure recovery to ensure high availability.

  • Atomic broadcast: used for master-slave synchronization to ensure data consistency.

4.1 Leader Election

ZooKeeper failure recovery

The ZooKeeper cluster adopts a master (called Leader) and multiple slaves (called Follower) mode. The master and slave nodes ensure data consistency through a copy mechanism.

  • If a Follower goes down: each node in the ZooKeeper cluster maintains its own state in memory, and the nodes keep communicating with each other. As long as more than half of the machines in the cluster are working, the whole cluster can continue to provide service normally.

  • If the Leader goes down: the system cannot work properly, and the Leader election mechanism of the ZAB protocol is needed to perform failure recovery.

The Leader election mechanism of the ZAB protocol, in short, elects a new Leader by a majority (more than half) of the servers; the other machines then synchronize state from the new Leader. Once more than half of the machines have completed state synchronization, the cluster exits Leader-election mode and enters atomic-broadcast mode.

4.1.1 Terminology

myid: each ZooKeeper server needs a file named myid in its data directory. This file contains an integer that uniquely identifies the server within the entire ZooKeeper cluster.

zxid: similar to a transaction ID in an RDBMS; it is the Proposal ID used to identify an update operation. To guarantee ordering, the zxid must increase monotonically. ZooKeeper therefore uses a 64-bit number: the upper 32 bits are the Leader's epoch, starting from 1 and incremented by one each time a new Leader is elected; the lower 32 bits are the sequence number within that epoch, reset whenever the epoch changes. This guarantees that the zxid is globally increasing.
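
Because the zxid is just a 64-bit number, the epoch and the counter can be recovered with simple bit operations, for example:

public class ZxidDemo {
    public static void main(String[] args) {
        long zxid = 0x0000000200000007L;    // example value: epoch 2, 7th transaction

        long epoch   = zxid >>> 32;         // high 32 bits: Leader epoch
        long counter = zxid & 0xffffffffL;  // low 32 bits: counter within the epoch

        System.out.println("epoch=" + epoch + ", counter=" + counter); // epoch=2, counter=7
    }
}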

4.1.2 Server status

  • LOOKING: the Leader is unknown. A server in this state believes the cluster currently has no Leader and initiates a Leader election.

  • FOLLOWING: Follower status. Indicates that the current server role is Follower and it knows who the Leader is.

  • LEADING: Leader status. Indicates that the current server role is Leader, and it will maintain a heartbeat with Follower.

  • OBSERVING: Observer status. Indicates that the current server role is Observer. The only difference from Follower is that it does not participate in elections, nor does it participate in voting during cluster write operations.

4.1.3 Ballot data structure

Each server sends the following key information when conducting leadership elections:

  • logicClock: each server maintains an increasing integer called logicClock, which indicates which round of voting the server is in.

  • state: The state of the current server.

  • self_id: the myid of the current server.

  • self_zxid: The maximum zxid of the data saved on the current server.

  • vote_id: the myid of the voted server.

  • vote_zxid: The maximum zxid of the data saved on the recommended server.

4.1.4 Voting process

(1) Increment the election round

Zookeeper stipulates that all valid votes must be in the same round. When each server starts a new round of voting, it will automatically increase the logicClock maintained by itself.

(2) Initialize ballot

Each server empties its ballot box before broadcasting its own vote. The ballot box records the votes received. Example: server 2 votes for server 3 and server 3 votes for server 1; then server 1's ballot box is (2, 3), (3, 1), (1, 1). Only the latest vote from each voter is kept: if a voter updates its vote, the other servers update that server's entry in their own ballot boxes when they receive the new vote.

(3) Send initial ballot

Initially, each server broadcasts a vote for itself.

(4) Receive external votes

Each server tries to receive votes from the other servers and puts them into its ballot box. If it cannot obtain any external votes, it checks whether it still has valid connections to the other servers in the cluster: if so, it sends its own vote again; if not, it establishes a connection immediately.

(5) Judging the election round

After receiving an external vote, the server first handles it differently according to the logicClock contained in the vote:

  • The external vote's logicClock is greater than its own: this means the server's election round lags behind the others. It immediately empties its ballot box, updates its own logicClock to the received one, compares its previous vote with the received vote to decide whether to change its own vote, and finally broadcasts its vote again.

  • The external vote's logicClock is smaller than its own: the server simply ignores the vote and continues with the next one.

  • The external vote's logicClock equals its own: vote PK (a head-to-head comparison) is performed.

(6) Voting PK

The vote PK compares (self_id, self_zxid) with (vote_id, vote_zxid); a comparison sketch follows this list:

  • If the external vote's logicClock is greater than its own, the server changes its own logicClock and the logicClock of its vote to the received value.

  • If the logicClocks are equal, compare the vote_zxid of the two. If the external vote's vote_zxid is larger, update the vote_zxid and vote_myid in your own vote to those in the received vote and broadcast it; also put both the received vote and your updated vote into your ballot box. If a vote from the same server (self_myid, self_zxid) already exists in the ballot box, it is simply overwritten.

  • If the vote_zxid values are equal, compare the vote_myid of the two. If the external vote's vote_myid is larger, update the vote_myid in your own vote to the received one and broadcast it, and put both the received vote and your updated vote into your ballot box.
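
The comparison rule behind vote PK can be sketched as follows; the class and field names are made up for illustration and networking is ignored entirely:

public class Vote {
    long logicClock; // election round
    long voteZxid;   // max zxid of the recommended server
    long voteMyid;   // myid of the recommended server

    Vote(long logicClock, long voteZxid, long voteMyid) {
        this.logicClock = logicClock;
        this.voteZxid = voteZxid;
        this.voteMyid = voteMyid;
    }

    /** Returns true if the external vote should replace our current vote. */
    static boolean externalWins(Vote mine, Vote external) {
        if (external.logicClock != mine.logicClock) {
            return external.logicClock > mine.logicClock; // newer round wins
        }
        if (external.voteZxid != mine.voteZxid) {
            return external.voteZxid > mine.voteZxid;      // most up-to-date data wins
        }
        return external.voteMyid > mine.voteMyid;          // largest myid breaks the tie
    }

    public static void main(String[] args) {
        Vote mine = new Vote(1, 100, 1);
        Vote external = new Vote(1, 100, 3);
        System.out.println(externalWins(mine, external)); // true: same round, same zxid, larger myid
    }
}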

(7) Count votes

If a server determines that more than half of the servers have accepted its vote (possibly after updates), it terminates the voting. Otherwise, it continues to receive votes from other servers.

(8) Update server status

After voting ends, each server updates its own state: if more than half of the votes are for itself, it changes its state to LEADING; otherwise, it changes its state to FOLLOWING.

From the process above it is easy to see that, for a Leader to obtain the support of a majority of servers, a ZooKeeper cluster should have an odd number of nodes, 2N + 1, and at least N + 1 of them must be alive.

The above process is repeated whenever a server starts. In recovery mode, a server that has just recovered from a crash or just started also restores data and session information from its disk snapshots; ZooKeeper records transaction logs and takes periodic snapshots precisely to make this state recovery possible.

4.2 Atomic Broadcast

ZooKeeper uses a replication mechanism to achieve high availability.

So how does ZooKeeper implement replication? The answer is the atomic broadcast of the ZAB protocol.


The atomic broadcast of the ZAB protocol works as follows:

All write requests are forwarded to the Leader, and the Leader notifies the Followers through atomic broadcast. When more than half of the Followers have applied the update and persisted their state, the Leader commits the update, and the client then receives a successful response. This is somewhat similar to the two-phase commit protocol used in databases.

During the entire message broadcasting process, the Leader server will generate a corresponding Proposal for each transaction request, and assign a globally unique incremental transaction ID (ZXID) to it, and then broadcast it.

5. ZooKeeper Applications

ZooKeeper can be used for publish/subscribe, load balancing, naming services, distributed coordination/notification, cluster management, Master election, distributed locks, distributed queues, and more.

5.1 Naming Service

In a distributed system, a globally unique name is often required, for example a globally unique order number. ZooKeeper can generate globally unique IDs using sequential nodes and thereby provide a naming service for distributed systems.
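
A minimal sketch of generating unique IDs with persistent sequential nodes (the /orders path and connect string are illustrative, and the parent node is assumed to already exist):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NamingDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Each call returns a path with a monotonically increasing suffix,
        // e.g. /orders/order-0000000001, /orders/order-0000000002, ...
        String path = zk.create("/orders/order-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

        String orderId = path.substring(path.lastIndexOf('-') + 1);
        System.out.println("generated order id: " + orderId);
        zk.close();
    }
}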


5.2 Configuration Management

Using ZooKeeper's watch mechanism, ZooKeeper can serve as a highly available configuration store, allowing the participants of a distributed application to retrieve and update configuration files.

5.3 Distributed lock

Distributed locks can be implemented with ZooKeeper's ephemeral nodes and the Watcher mechanism.

For example, there is a distributed system with three nodes A, B, and C, trying to acquire distributed locks through ZooKeeper.

(1) Access /locks (the directory path is chosen by the application itself) and create an ephemeral node with a sequence number (EPHEMERAL_SEQUENTIAL).


(2) When a node tries to acquire the lock, it fetches all child nodes under /locks (id_0000, id_0001, id_0002) and checks whether the node it created is the smallest.

  • If so, get the lock.

    Release the lock: After performing the operation, delete the created node.

  • If not, it watches the node immediately smaller than its own (its predecessor) for changes.


(3) Release the lock, that is, delete the node you created.


For example, when NodeA releases the lock by deleting the node id_0000 it created, NodeB is notified of the change, finds that its own node is now the smallest, and acquires the lock.
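
The recipe above can be sketched in Java roughly as follows; this is an illustrative, simplified version without error handling, not a production lock:

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLock {
    private final ZooKeeper zk;
    private String myNode; // e.g. /locks/id_0000000003

    SimpleLock(ZooKeeper zk) { this.zk = zk; }

    void lock() throws Exception {
        // Step 1: create an ephemeral sequential node under /locks.
        myNode = zk.create("/locks/id_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        while (true) {
            // Step 2: fetch and sort all children; the smallest one holds the lock.
            List<String> children = zk.getChildren("/locks", false);
            Collections.sort(children);
            String smallest = "/locks/" + children.get(0);
            if (smallest.equals(myNode)) {
                return; // we hold the lock
            }

            // Otherwise watch the node immediately before ours and wait for it to go away.
            int myIndex = children.indexOf(myNode.substring("/locks/".length()));
            String previous = "/locks/" + children.get(myIndex - 1);
            CountDownLatch latch = new CountDownLatch(1);
            if (zk.exists(previous, event -> latch.countDown()) != null) {
                latch.await();
            }
        }
    }

    void unlock() throws Exception {
        // Step 3: release the lock by deleting our node (-1 ignores the version).
        zk.delete(myNode, -1);
    }
}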

5.4 Cluster Management

ZooKeeper can also help solve many common problems in distributed systems:

  • For example, a heartbeat detection mechanism can be built with ephemeral nodes: if a service node of the distributed system goes down, its session times out, the ephemeral node is deleted, and the corresponding watch event is triggered.

  • Each service node of the distributed system can also write its own node status to the temporary node, thereby completing status report or node work progress report.

  • Through data subscription and publishing functions, ZooKeeper can also decouple modules and schedule tasks for distributed systems.

  • Through the monitoring mechanism, the service nodes of the distributed system can also be dynamically online and offline, so as to realize the dynamic expansion of the service.

5.5 Election of Leader nodes

An important pattern in distributed systems is the master-slave mode (Master/Slaves), and ZooKeeper can be used for Master election in this mode: let all service nodes competitively create the same znode. Since ZooKeeper does not allow two znodes with the same path, only one service node can succeed in creating it, and that node becomes the Master.
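
A minimal sketch of this competitive-create pattern (the /master path is illustrative):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException.NodeExistsException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MasterElectionDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});
        String myId = "node-" + System.nanoTime();

        try {
            // Only one client can create /master; success means this node is Master.
            // EPHEMERAL means the node disappears if the Master's session dies,
            // letting the remaining nodes compete again.
            zk.create("/master", myId.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println(myId + " is now the Master");
        } catch (NodeExistsException e) {
            System.out.println(myId + " is a Slave; a Master already exists");
        }
        zk.close();
    }
}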

5.6 Queue management

ZooKeeper can handle two types of queues:

  • A queue becomes available only after all of its members have gathered; until then it keeps waiting for members to arrive. This is a synchronization queue.

  • Queues are enqueued and dequeued in FIFO mode, such as implementing producer and consumer models.

The idea for implementing a synchronization queue with ZooKeeper is as follows:

Create a parent directory /synchronizing, and have every member watch for the existence of /synchronizing/start. Each member joins the queue by creating an ephemeral node /synchronizing/member_i, then fetches all children of /synchronizing and checks whether their number has reached the expected member count. If it is still smaller, the member waits for /synchronizing/start to appear; if it is equal, the member creates /synchronizing/start.
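
Under the assumption that the expected member count is known in advance, this barrier can be sketched as follows (paths follow the description above; races are handled only superficially):

import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException.NodeExistsException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SyncQueueDemo {
    public static void main(String[] args) throws Exception {
        int expectedMembers = 3; // illustrative
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Watch for /synchronizing/start before joining.
        CountDownLatch started = new CountDownLatch(1);
        if (zk.exists("/synchronizing/start", event -> started.countDown()) != null) {
            started.countDown(); // the barrier has already been released
        }

        // Join the queue with an ephemeral, sequential member node.
        zk.create("/synchronizing/member_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // If we are the last expected member, release everyone by creating /start.
        List<String> children = zk.getChildren("/synchronizing", false);
        long members = children.stream().filter(n -> n.startsWith("member_")).count();
        if (members >= expectedMembers) {
            try {
                zk.create("/synchronizing/start", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (NodeExistsException e) {
                // another member released the barrier first; nothing to do
            }
        }

        started.await(); // everyone proceeds once /synchronizing/start exists
        System.out.println("all members arrived, proceeding");
        zk.close();
    }
}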


Author: ZhangPeng
