What exactly is ZooKeeper? In-depth explanation of ZooKeeper

ZooKeeper is a distributed coordination service maintained by Apache.

ZooKeeper can be regarded as a highly available file system.

ZooKeeper can be used for functions such as publish/subscribe, load balancing, naming services, distributed coordination/notification, cluster management, Master election, distributed locks, and distributed queues.

Table of contents


1. Basic introduction to Zookeeper

1.1 What is ZooKeeper

1.2 Features of ZooKeeper

1.3 Design goals of ZooKeeper

2. The core concept of ZooKeeper

2.1 Data Model

2.2 Node information

2.3 Cluster roles

2.4 ACL

3. Working principle of ZooKeeper

3.1 Read operation

3.2 Write operation

3.3 Transactions

3.4 Observation

3.5 Session

4. ZAB Protocol

4.1 Leader Election

4.2 Atomic Broadcast

5. ZooKeeper application

5.1 Naming service

5.2 Configuration Management

5.3 Distributed locks

5.4 Cluster Management

5.5 Election of Leader nodes

5.6 Queue management



1. Basic introduction to Zookeeper

1.1 What is ZooKeeper

ZooKeeper is an Apache top-level project. ZooKeeper provides efficient and reliable distributed coordination services for distributed applications, and provides distributed basic services such as unified naming services, configuration management, and distributed locks. In terms of solving distributed data consistency, ZooKeeper does not directly use the Paxos algorithm, but uses a consensus protocol called ZAB.

ZooKeeper is mainly used to solve the consistency problem of application systems in distributed clusters, and it can provide data storage based on a directory node tree similar to a file system. But ZooKeeper is not used to store data specially, its main function is to maintain and monitor the state changes of stored data. By monitoring the changes of these data states, data-based cluster management can be achieved.

Many famous frameworks are based on ZooKeeper to achieve distributed high availability, such as: Dubbo, Kafka, etc.

1.2 Features of ZooKeeper

ZooKeeper has the following features:

  • Sequential consistency: transaction requests initiated from a single client are applied to ZooKeeper strictly in the order in which they were initiated. The specific implementation can be seen below: Atomic Broadcast.

  • Atomicity: The processing results of all transaction requests are applied consistently on all machines in the entire cluster, that is, the entire cluster either successfully applies a certain transaction or does not apply it at all. The implementation can be seen below: transaction.

  • Single view: No matter which Zookeeper server the client connects to, the server-side data model it sees is consistent.

  • High performance: ZooKeeper stores all data in memory, so its performance is very high. Note that since all updates and deletions are processed as transactions, ZooKeeper performs best in read-heavy, write-light scenarios; if write operations are frequent, performance degrades significantly.

  • High availability: ZooKeeper's high availability is implemented based on the replica mechanism. In addition, ZooKeeper supports fault recovery; see below: Leader Election.

1.3 Design goals of ZooKeeper

  • Simple data model

  • Clusters can be built

  • Sequential access

  • High performance

2. The core concept of ZooKeeper

2.1 Data Model

The data model of ZooKeeper is a tree-structured file system.

The nodes in the tree are called znodes, where the root node is /, and each node saves its own data and node information. A znode can be used to store data and has an ACL associated with it (see ACL for details). The design goal of ZooKeeper is to implement a coordination service, not to serve as general file storage, so the data stored in a single znode is limited to 1 MB.

Data access in ZooKeeper is atomic. Its read and write operations are either all successful or all fail.

znodes are referenced by path. The znode node path must be an absolute path.

There are two types of znodes:

  • Ephemeral ( EPHEMERAL ): ZooKeeper deletes the ephemeral znode when the client session ends.

  • Persistent (PERSISTENT): ZooKeeper will not delete a persistent znode unless the client actively performs the delete operation.
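
As a quick illustration, below is a minimal sketch using the official Java client to create a persistent and an ephemeral znode. The connection string, paths, and data are placeholders.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeTypesDemo {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Placeholder connect string; replace with your ensemble's addresses.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, (WatchedEvent e) -> {
                if (e.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Persistent znode: survives until explicitly deleted.
            zk.create("/app-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Ephemeral znode: removed automatically when this session ends.
            zk.create("/app-config/worker-1", "alive".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            zk.close(); // closing the session deletes /app-config/worker-1
        }
    }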

2.2 Node information

A znode can also be created with a sequential flag (SEQUENTIAL). If the sequential flag is set when creating a znode, ZooKeeper appends a monotonically increasing counter value to the znode name. Separately, every change to the ZooKeeper state is stamped with a transaction id, the zxid, and ZooKeeper uses the zxid to achieve strict ordering of updates.

While storing data, each znode also maintains a data structure called Stat, which records all of the node's state information, such as the zxid that created it (czxid), the zxid of its last modification (mzxid), its data version, and the number of children.
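
For reference, a minimal sketch of reading a node's Stat with the Java client (the path is a placeholder, and waiting for the connection to be established is omitted for brevity):

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class StatDemo {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

            // exists() returns the Stat of the node, or null if it does not exist.
            Stat stat = zk.exists("/app-config", false);
            if (stat != null) {
                System.out.println("czxid    = " + stat.getCzxid());   // zxid of the create
                System.out.println("mzxid    = " + stat.getMzxid());   // zxid of the last update
                System.out.println("version  = " + stat.getVersion()); // data version
                System.out.println("children = " + stat.getNumChildren());
                System.out.println("ephemeral owner = " + stat.getEphemeralOwner());
            }
            zk.close();
        }
    }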

2.3 Cluster roles

The Zookeeper cluster is a high-availability cluster based on master-slave replication, and each server assumes one of the following three roles.

  • Leader: It is responsible for initiating and maintaining heartbeats with each Follower and Observer. All write operations must be completed by the Leader, which then broadcasts the write operations to the other servers. A ZooKeeper cluster has only one actual working Leader at a time.

  • Follower: It will respond to the Leader's heartbeat. The Follower can directly process and return the client's read request, and at the same time forward the write request to the Leader for processing, and is responsible for voting on the request when the Leader processes the write request. A Zookeeper cluster may have multiple Followers at the same time.

  • Observer: The role is similar to Follower, but has no voting rights.

2.4 ACL

ZooKeeper uses ACL (Access Control Lists) policies for permission control.

Each znode is created with a list of ACLs that determine who can perform what operations on it.

ACLs rely on ZooKeeper's client authentication mechanism. ZooKeeper provides the following authentication methods:

  • digest: identify the client by username and password

  • sasl: identify the client through Kerberos

  • ip: identify the client by IP address

ZooKeeper defines the following five permissions:

  • CREATE: Allows the creation of child nodes;

  • READ: Allows to get data from a node and list its child nodes;

  • WRITE:  Allows setting data for a node;

  • DELETE: Allows deletion of child nodes;

  • ADMIN:  Allows setting permissions for nodes.
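
As an illustration, here is a small sketch (with placeholder credentials) that adds digest authentication to the session and creates a node whose ACL grants full permissions only to the authenticated user:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class AclDemo {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

            // Authenticate this session with the digest scheme (user:password are placeholders).
            zk.addAuthInfo("digest", "alice:secret".getBytes());

            // CREATOR_ALL_ACL grants CREATE/READ/WRITE/DELETE/ADMIN only to the
            // identity that authenticated the creating session.
            zk.create("/secure-config", "data".getBytes(),
                    ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);

            zk.close();
        }
    }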

3. Working principle of ZooKeeper


3.1 Read operation

Leader/Follower/Observer can directly process read requests, just read data from local memory and return it to the client.

Since processing read requests does not require interaction between servers, the more Followers/Observers there are, the greater the overall read throughput of the system, that is, the better the read performance.

3.2 Write operation

All write requests are actually handed over to the Leader for processing. The Leader sends the write request to all Followers in the form of a transaction and waits for the ACK. Once more than half of the Followers' ACKs are received, the write operation is considered successful.

3.2.1 Write to the Leader

A write operation through the Leader is mainly divided into five steps:

  1. The client initiates a write request to the Leader.

  2. The Leader sends the write request to all Followers in the form of transaction Proposal and waits for ACK.

  3. Follower returns ACK after receiving Leader's transaction Proposal.

  4. Leader sends Commit to all Followers and Observers after getting more than half of the ACKs (Leader has an ACK for itself by default).

  5. The leader returns the processing result to the client.

Notice

  • Leader does not need to get ACK from Observer, that is, Observer has no voting rights.

  • The Leader does not need ACKs from all Followers; it only needs more than half, and the Leader counts an ACK from itself. For example, in a cluster with one Leader and four Followers, only two Followers need to return an ACK, because (2 + 1) / (4 + 1) > 1/2.

  • Although the Observer has no voting rights, it still has to synchronize the Leader's data so that it can return data that is as up to date as possible when processing read requests.

3.2.2 Write to Follower/Observer

Followers and Observers can both accept write requests, but they cannot process them directly; they must forward the write requests to the Leader for processing.

Except for one more step of request forwarding, the other processes are no different from directly writing to the Leader.

3.3 Transactions

For each update request from the client, ZooKeeper has strict sequential access control capabilities.

In order to ensure the sequential consistency of transactions, ZooKeeper uses an increasing transaction id number (zxid) to identify transactions.

The Leader allocates a separate queue for each Follower server, puts the transaction Proposals into the queues in order, and sends them according to a FIFO (first in, first out) strategy. After a Follower receives a Proposal, it writes it to the local disk as a transaction log and, once the write succeeds, sends an Ack back to the Leader. When the Leader receives Acks from more than half of the Followers, it broadcasts a Commit message to all Followers to notify them to commit the transaction, and commits the transaction itself; each Follower commits the transaction after receiving the Commit message.

Every Proposal is stamped with a zxid when it is proposed. The zxid is a 64-bit number: its upper 32 bits are the epoch, which identifies whether the Leader has changed (each time a new Leader is elected it gets a new epoch, marking that Leader's period of rule), and its lower 32 bits are a monotonically increasing counter within that epoch.

The detailed process is as follows:

  • Leader waits for Server connection;

  • Follower connects to Leader and sends the largest zxid to Leader;

  • Leader determines the synchronization point according to Follower's zxid;

  • After synchronization is completed, the Leader notifies the Follower that it is up to date (UPTODATE);

  • After the Follower receives the UPTODATE message, it can again accept client requests and provide service.

3.4 Observation

The client registers to monitor the znode it cares about. When the state of the znode changes (data change, child node increase or decrease), the ZooKeeper service will notify the client.

There are generally two forms of keeping the client and server connected:

  • The client keeps polling the server

  • The server pushes the status to the client

ZooKeeper chooses the second: the server actively pushes state changes to the client. This is the watch mechanism (Watch).

ZooKeeper's observation mechanism allows users to register listeners for interested events on specified nodes. When an event occurs, the listener will be triggered and the event information will be pushed to the client.

When the client uses an interface such as getData to read a znode, it passes in a callback (Watcher) for handling node changes, and the server will then actively push node changes to the client.
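
For illustration, a minimal sketch (with a placeholder path /app-config) of registering a Watcher via getData and handling the resulting notification:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class WatchDemo {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

            Stat stat = new Stat();
            // The Watcher passed here is triggered the next time /app-config changes.
            byte[] data = zk.getData("/app-config", new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    // Called by the client library when the server pushes the change notification.
                    System.out.println("event " + event.getType() + " on " + event.getPath());
                    // Watches are one-shot: re-register (e.g. call getData again) to keep watching.
                }
            }, stat);

            System.out.println("current value: " + new String(data));
            Thread.sleep(60_000); // keep the session alive so the notification can arrive
            zk.close();
        }
    }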

The Watcher object passed into this method implements the corresponding process method. Whenever the state of the watched node changes, the server-side WatchManager invokes the registered Watchers roughly as follows:

    Set<Watcher> triggerWatch(String path, EventType type, Set<Watcher> supress) {
        WatchedEvent e = new WatchedEvent(type, KeeperState.SyncConnected, path);
        Set<Watcher> watchers;
        synchronized (this) {
            // remove and collect the watchers registered on this path (watches are one-shot)
            watchers = watchTable.remove(path);
        }
        for (Watcher w : watchers) {
            w.process(e);
        }
        return watchers;
    }

All data in ZooKeeper is actually managed by a data structure called DataTree, and all read and write requests eventually modify the contents of this tree. A read request may pass in a Watcher to register a callback; a later write request that touches the same node triggers the callback, and the WatchManager notifies the client of the data change.

The notification mechanism itself is quite simple: a read request sets a Watcher to listen for an event, and a write request that triggers the event causes a notification to be sent to the corresponding client.

3.5 Session

The ZooKeeper client connects to the ZooKeeper cluster through a long-lived TCP connection. A session is established from the first connection and is then kept valid through a heartbeat mechanism. Through this connection the client can send requests and receive responses, and can also receive Watch event notifications.

A list of the servers in the ZooKeeper cluster is configured in each client. On startup, the client walks through the list trying to establish a connection; if one server fails, it tries the next, and so on.

Once a client establishes a connection with a server, the server creates a new session for the client. Each session has a timeout period. If the server does not receive any request within the timeout period, the corresponding session is considered expired. Once a session expires, it cannot be reopened, and any temporary znodes associated with the session are deleted.

In general, sessions should be long-lived, and this needs to be guaranteed by the client. The client can keep the session from expiring by means of heartbeat (ping).
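
A minimal sketch of establishing a session with an explicit timeout and reacting to session state changes (server list and timeout are placeholders):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionDemo {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);

            // The client walks this list until a connection succeeds; 10s is the requested
            // session timeout, which the server may adjust within its own limits.
            ZooKeeper zk = new ZooKeeper(
                    "zk1:2181,zk2:2181,zk3:2181", 10_000,
                    (WatchedEvent event) -> {
                        if (event.getState() == KeeperState.SyncConnected) {
                            connected.countDown();   // session established (or re-established)
                        } else if (event.getState() == KeeperState.Expired) {
                            // An expired session cannot be revived; a new ZooKeeper instance is needed.
                            System.out.println("session expired");
                        }
                    });

            connected.await();
            System.out.println("sessionId = 0x" + Long.toHexString(zk.getSessionId())
                    + ", negotiated timeout = " + zk.getSessionTimeout() + " ms");
            zk.close();
        }
    }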

A ZooKeeper session has four attributes:

  • sessionID: session ID, which uniquely identifies a session. Every time a client creates a new session, Zookeeper will assign it a globally unique sessionID.

  • TimeOut: the session timeout. When the client constructs a ZooKeeper instance, it passes a sessionTimeout parameter specifying the desired timeout. The client sends this value to the server, and the server determines the final session timeout within its own allowed limits.

  • TickTime: the next point in time at which the session will expire. To let ZooKeeper manage sessions with a "bucket strategy" and perform timeout checks and cleanup efficiently and cheaply, ZooKeeper marks each session with its next timeout point, whose value is roughly the current time plus TimeOut.

  • isClosing: Mark whether a session has been closed. When the server detects that the session has timed out, it will mark the session's isClosing as "closed", so as to ensure that no new requests from the session will be processed.

ZooKeeper's session management is mainly done by the SessionTracker, which uses a bucketing strategy (sessions with similar timeout points are managed in the same bucket), so that sessions in different buckets are isolated from each other and sessions in the same bucket are processed together.


4. ZAB Protocol

ZooKeeper does not use the Paxos algorithm directly, but a consensus protocol called ZAB. The ZAB protocol is not the Paxos algorithm; it is similar in spirit, but the two differ in how they operate.

The ZAB protocol is an atomic broadcast protocol that supports crash recovery, designed specifically for ZooKeeper.

The ZAB protocol is ZooKeeper's data consistency and high availability solution.

The ZAB protocol defines two modes that the cluster alternates between indefinitely:

  • Leader election: used for fault recovery to ensure high availability.

  • Atomic broadcast: used for master-slave synchronization to ensure data consistency.

4.1 Leader Election

ZooKeeper failure recovery

The ZooKeeper cluster adopts a one-master (Leader), multiple-slave (Follower) mode, and the master and slave nodes ensure data consistency through the replica mechanism.

  • If a Follower node goes down - each node in the ZooKeeper cluster maintains its own state in memory independently, and the nodes communicate with each other; as long as more than half of the machines in the cluster are working normally, the whole cluster can still provide service.

  • If the Leader node goes down - the cluster cannot work properly, and fault recovery must be carried out through the Leader election mechanism of the ZAB protocol.

The Leader election mechanism of the ZAB protocol is simple: a new Leader is produced by majority election, and the other machines then synchronize their state from the new Leader. When more than half of the machines have completed state synchronization, the cluster exits Leader election mode and enters atomic broadcast mode.

4.1.1 Terminology

myid: Each ZooKeeper server creates a file named myid under its data directory, which contains that server's unique ID (an integer) within the cluster.

zxid: Similar to a transaction ID in an RDBMS, it identifies the Proposal ID of an update operation. To guarantee ordering, the zxid must be monotonically increasing. ZooKeeper therefore uses a 64-bit number: the upper 32 bits are the Leader's epoch, starting from 1 and incremented each time a new Leader is elected; the lower 32 bits are the sequence number within that epoch, reset whenever the epoch changes. This ensures that zxids increase globally.
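
As a small illustration of this layout, a sketch that splits a zxid into its epoch and counter parts (the sample value is made up):

    public class ZxidDemo {
        public static void main(String[] args) {
            long zxid = 0x0000000300000007L; // hypothetical zxid: epoch 3, counter 7

            long epoch   = zxid >>> 32;          // upper 32 bits: the Leader's epoch
            long counter = zxid & 0xFFFFFFFFL;   // lower 32 bits: sequence number within the epoch

            System.out.println("epoch = " + epoch + ", counter = " + counter);
        }
    }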

4.1.2 Server Status

  • LOOKING: Unsure of the Leader status. The server in this state thinks that there is no Leader in the current cluster, and will initiate a Leader election.

  • FOLLOWING: follower status. Indicates that the current server role is Follower, and it knows who the Leader is.

  • LEADING: leader status. Indicates that the current server role is Leader, and it will maintain heartbeat with Follower.

  • OBSERVING: observer state. Indicates that the current server role is Observer; the only difference from a Follower is that it does not participate in elections or vote on cluster write operations.

4.1.3 Ballot data structure

When each server conducts leader election, it will send the following key information:

  • logicClock: Each server maintains a self-increasing integer called logicClock, which indicates the number of rounds of voting initiated by the server.

  • state: the current state of the server.

  • self_id: myid of the current server.

  • self_zxid: the maximum zxid of the data saved on the current server.

  • vote_id: the myid of the server being voted for.

  • vote_zxid: the maximum zxid of the data saved on the server being voted for.

4.1.4 Voting process

(1) Increment the election round

Zookeeper stipulates that all valid votes must be in the same round. When each server starts a new round of voting, it will first perform an auto-increment operation on the logicClock it maintains.

(2) Initialize the ballot

Each server empties its own ballot box before broadcasting its vote. The ballot box records the ballots received. For example, if Server 2 votes for Server 3 and Server 3 votes for Server 1, then Server 1's ballot box contains (2, 3), (3, 1), (1, 1). Only each voter's most recent vote is kept in the ballot box: if a voter updates its own vote, the other servers replace that server's ballot in their own ballot boxes after receiving the new vote.

(3) Send initialization ballot

Each server initially votes for itself by broadcasting.

(4) Receive external votes

The server tries to obtain votes from the other servers and puts them into its own ballot box. If it cannot obtain any external vote, it checks whether it still holds valid connections to the other servers in the cluster; if so, it sends its own vote again, and if not, it establishes the connection immediately.

(5) Judging the election round

After receiving an external vote, it will first perform different processing according to the logicClock contained in the voting information:

  • The external vote's logicClock is greater than its own logicClock. This means that this server's election round lags behind those of the other servers. The server immediately empties its ballot box, updates its own logicClock to the received logicClock, then compares its previous vote with the received vote to decide whether to change its own vote, and finally broadcasts its vote again.

  • The external vote's logicClock is smaller than its own. The current server simply ignores the vote and continues processing the next one.

  • The external vote's logicClock is equal to its own. A ballot PK (comparison) is then performed.

(6) Ballot PK

The vote PK is based on the comparison between (self_id, self_zxid) and (vote_id, vote_zxid):

  • If the external vote's logicClock is greater than its own, the server changes its own logicClock and the logicClock of its own ballot to the received logicClock.

  • If the logicClocks are equal, compare the vote_zxid of the two. If the external vote's vote_zxid is larger, update the vote_zxid and vote_myid in your own ballot to those in the received ballot and broadcast it; in addition, put both the received ballot and your own updated ballot into your own ballot box. If the same ballot (self_myid, self_zxid) already exists in the ballot box, it is simply overwritten.

  • If the vote_zxid of the two is the same, compare the vote_myid of the two. If the external vote's vote_myid is larger, update the vote_myid in your own ballot to the one in the received ballot and broadcast it; in addition, put both the received ballot and your own updated ballot into your own ballot box.
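
To make the comparison order concrete, here is a hedged sketch of the PK rule as a standalone method (not ZooKeeper's actual implementation; the field names follow the ballot structure above):

    public class BallotPk {
        /**
         * Returns true if the proposed vote (newZxid, newMyid) should replace the
         * current vote (curZxid, curMyid), assuming both belong to the same logicClock round.
         * Rule: compare zxid first, then myid as the tie-breaker.
         */
        static boolean shouldUpdateVote(long newZxid, long newMyid, long curZxid, long curMyid) {
            if (newZxid != curZxid) {
                return newZxid > curZxid;   // prefer the candidate with more up-to-date data
            }
            return newMyid > curMyid;       // otherwise prefer the larger server id
        }

        public static void main(String[] args) {
            // Example: an external ballot with a larger zxid wins the PK.
            System.out.println(shouldUpdateVote(0x200000005L, 1, 0x200000003L, 3)); // true
            // Equal zxid: the larger myid wins.
            System.out.println(shouldUpdateVote(0x200000003L, 3, 0x200000003L, 1)); // true
        }
    }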

(7) Counting votes

If the server determines that more than half of the servers have accepted its vote (possibly an updated vote), voting terminates; otherwise, it continues to receive votes from other servers.

(8) Update server status

After voting terminates, the server updates its own state: if more than half of the votes are for itself, it sets its state to LEADING; otherwise it sets its state to FOLLOWING.

From the process above, it is easy to see why a ZooKeeper cluster is usually deployed with an odd number of nodes (2N + 1): the Leader must obtain the support of a majority of servers, so at least N + 1 nodes must be alive for the cluster to work.

This process is repeated whenever a server starts. In recovery mode, if a server has just recovered from a crash or has just started, it restores data and session information from its disk snapshots; ZooKeeper records transaction logs and takes snapshots periodically to make this state recovery possible.

4.2 Atomic Broadcast

ZooKeeper achieves high availability through a replica mechanism.

So, how does ZooKeeper implement the replication mechanism? The answer is: atomic broadcast of the ZAB protocol.

The atomic broadcast requirements of the ZAB protocol:

All write requests will be forwarded to the Leader, and the Leader will notify the Follower in an atomic broadcast. When more than half of the Followers have updated the status and persisted, the Leader will submit the update, and then the client will receive a response that the update is successful. This is somewhat similar to the two-phase commit protocol in databases.

During the broadcasting process of the entire message, the Leader server generates a corresponding Proposal for each transaction request, and assigns it a globally unique incremental transaction ID (ZXID), and then broadcasts it.


5. ZooKeeper application

ZooKeeper can be used for functions such as publish/subscribe, load balancing, naming services, distributed coordination/notification, cluster management, Master election, distributed locks, and distributed queues.

5.1 Naming service

In a distributed system, a globally unique name is usually required, such as generating a globally unique order number, etc. ZooKeeper can generate a globally unique ID through the characteristics of sequential nodes, thereby providing naming services for distributed systems.
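
A minimal sketch of generating globally unique, ordered IDs with persistent sequential nodes (the /ids parent path is a placeholder and is assumed to already exist):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class NamingServiceDemo {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });

            // Each create() under the same parent appends a monotonically increasing
            // 10-digit sequence number, e.g. /ids/order-0000000042.
            String path = zk.create("/ids/order-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

            // Strip the prefix to obtain the unique order id.
            String orderId = path.substring("/ids/order-".length());
            System.out.println("generated order id: " + orderId);

            zk.close();
        }
    }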

5.2 Configuration Management

Using ZooKeeper's observation mechanism, it can be used as a highly available configuration store, allowing participants in distributed applications to retrieve and update configuration files.

5.3 Distributed locks

Distributed locks can be implemented through ZooKeeper's temporary nodes and Watcher mechanism.

For example, there is a distributed system with three nodes A, B, and C trying to acquire distributed locks through ZooKeeper.

(1) Each node visits /locks (this directory path is determined by the program itself) and creates a temporary sequential node (EPHEMERAL_SEQUENTIAL) under it.

(2) When each node tries to acquire a lock, it gets all the child nodes (id_0000, id_0001, id_0002) under the /locks node, and judges whether the node created by itself is the smallest.

  • If yes, get the lock.

    Release lock: After performing the operation, delete the created node.

  • If not, watch the node whose sequence number is immediately smaller than its own and wait for it to be deleted.

(3) Release the lock, that is, delete the node created by yourself.

For example, when NodeA releases the lock by deleting its node id_0000, NodeB is notified of the change, finds that its own node is now the smallest, and therefore acquires the lock.
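
Below is a simplified, hedged sketch of this scheme with the Java client; it omits retries, connection handling, and edge cases such as a predecessor disappearing between getChildren and exists, and assumes the /locks parent node already exists.

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SimpleDistributedLock {
        private final ZooKeeper zk;
        private String myNode;   // e.g. /locks/id_0000000003

        SimpleDistributedLock(ZooKeeper zk) { this.zk = zk; }

        void lock() throws Exception {
            // Step 1: create an ephemeral sequential node under /locks.
            myNode = zk.create("/locks/id_", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            while (true) {
                List<String> children = zk.getChildren("/locks", false);
                Collections.sort(children);
                String myName = myNode.substring("/locks/".length());

                // Step 2: the smallest node holds the lock.
                int myIndex = children.indexOf(myName);
                if (myIndex == 0) {
                    return; // lock acquired
                }

                // Otherwise watch the node immediately smaller than ours and wait for its deletion.
                String previous = "/locks/" + children.get(myIndex - 1);
                CountDownLatch deleted = new CountDownLatch(1);
                Stat stat = zk.exists(previous, event -> deleted.countDown());
                if (stat != null) {
                    deleted.await();   // woken when the predecessor changes or is deleted
                }
                // Loop and re-check; this node may now be the smallest.
            }
        }

        void unlock() throws Exception {
            // Step 3: release the lock by deleting our own node.
            zk.delete(myNode, -1);
        }
    }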

5.4 Cluster Management

ZooKeeper can also be used to solve many common cluster-management problems in distributed systems:

  • For example, a heartbeat detection mechanism can be established by creating a temporary node. If a service node of the distributed system goes down, the session held by it will time out. At this time, the temporary node will be deleted, and the corresponding monitoring event will be triggered.

  • Each service node of the distributed system can also write its own node status to the temporary node, so as to complete the status report or node work progress report.

  • Through the data subscription and publishing functions, ZooKeeper can also decouple modules and schedule tasks for distributed systems.

  • Through the monitoring mechanism, the service nodes of the distributed system can also be dynamically online and offline, so as to realize the dynamic expansion of services.

5.5 Election of Leader nodes

An important pattern in distributed systems is the master-slave (Master/Slaves) pattern, and ZooKeeper can be used for Master election in this pattern. All service nodes competitively create the same znode; since ZooKeeper cannot have two znodes with the same path, only one service node can succeed in creating it, and that node becomes the Master.
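
A minimal sketch of this competitive creation (the /master path and payload are placeholders; a real implementation would also watch /master so the election can be re-run when the current Master's session ends):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class MasterElectionDemo {
        static boolean tryBecomeMaster(ZooKeeper zk, String myId) throws Exception {
            try {
                // Only one node can create /master; the ephemeral node disappears
                // automatically if the Master's session ends, allowing re-election.
                zk.create("/master", myId.getBytes(),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;    // this node is now the Master
            } catch (KeeperException.NodeExistsException e) {
                return false;   // another node won the election
            }
        }
    }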

5.6 Queue management

ZooKeeper can handle two types of queues:

  • When all members of a queue have gathered, the queue becomes available; otherwise it keeps waiting for all members to arrive. This is a synchronous queue.

  • The queue performs enqueue and dequeue operations according to the FIFO method, such as implementing the producer and consumer models.

The idea for implementing a synchronous queue with ZooKeeper is as follows:

Create a parent directory /synchronizing. Each member watches (sets a Watch on) the flag node /synchronizing/start and then joins the queue by creating a temporary sequential node /synchronizing/member_i. Each member then fetches all children of the /synchronizing directory and checks whether the number of members has reached the required count. If it is still smaller, the member waits for /synchronizing/start to appear; if it is already equal, the member creates /synchronizing/start.
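
A hedged sketch of this synchronous-queue idea (it assumes /synchronizing already exists and ignores races around counting and watch re-registration):

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SyncQueueDemo {
        // Blocks until `requiredMembers` members have joined the queue.
        static void joinAndWait(ZooKeeper zk, int requiredMembers) throws Exception {
            CountDownLatch started = new CountDownLatch(1);

            // Watch the start flag; the watch fires when /synchronizing/start is created.
            if (zk.exists("/synchronizing/start", event -> started.countDown()) != null) {
                return; // the queue is already complete
            }

            // Join the queue with an ephemeral sequential member node.
            zk.create("/synchronizing/member_", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            // Count the current members (ignore the start flag if it exists).
            List<String> children = zk.getChildren("/synchronizing", false);
            long members = children.stream().filter(c -> c.startsWith("member_")).count();

            if (members >= requiredMembers) {
                try {
                    zk.create("/synchronizing/start", new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                } catch (KeeperException.NodeExistsException ignored) {
                    // another member created the flag first
                }
            } else {
                started.await(); // wait until some member creates the start flag
            }
        }
    }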

