ZooKeeper is so awesome, do you know the basic principles?

Author: Lu Avatar

Source: http://www.cnblogs.com/luxiaoxun/

Introduction to ZooKeeper

ZooKeeper is an open source coordination service for distributed applications. It exposes a simple set of primitives on which distributed applications can build synchronization, configuration maintenance, and naming services.


ZooKeeper design purpose

1. Eventual consistency: Whichever server a client connects to, it is presented with the same view of the data. This is ZooKeeper's most important property.

2. Reliability: The service is simple and robust and performs well. If a message m is accepted by one server, it will eventually be accepted by all servers.

3. Timeliness: ZooKeeper guarantees that a client learns of server updates, or of a server failure, within a bounded time interval.

However, because of network delays and similar factors, ZooKeeper cannot guarantee that two clients see newly updated data at the same moment. A client that needs the latest data should call sync() before reading.

4. Wait-free: A slow or failed client must not interfere with the requests of a fast client, so no client's requests are blocked waiting on another client.

5. Atomicity: An update either succeeds or fails; there is no intermediate state.

6. Ordering: This covers total order and partial order. Total order means that if message a is published before message b on one server, then a is published before b on every server. Partial order means that if message b is published after message a by the same sender, then a is ordered before b.
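
The advice to call sync() before reading can be illustrated with a toy model. This is only a sketch: the classes LeaderLog and LaggingFollower are hypothetical names invented for the example and are not part of the ZooKeeper client API. It assumes a follower that serves reads from a local copy which may lag behind the leader.

```python
class LeaderLog:
    """Holds the authoritative, ordered list of committed updates."""

    def __init__(self):
        self.log = []  # (key, value) pairs in commit order

    def write(self, key, value):
        self.log.append((key, value))


class LaggingFollower:
    """Serves reads from a local copy that may be behind the leader."""

    def __init__(self, leader):
        self.leader = leader
        self.applied = 0  # number of leader log entries applied locally
        self.data = {}

    def sync(self):
        # Catch up with everything the leader has committed so far.
        for key, value in self.leader.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.leader.log)

    def read(self, key):
        # A plain read never contacts the leader, so it may be stale.
        return self.data.get(key)
```

A read issued right after another client's write can return the old value; calling sync() first brings the local copy up to date before the read is served.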

ZooKeeper data model

ZooKeeper maintains a hierarchical data structure that closely resembles a standard file system, as shown in the figure:

ZooKeeper's data structure has the following characteristics:

1) Each directory entry, such as NameService, is called a znode, and a znode is uniquely identified by its path. For example, the znode Server1 is identified as /NameService/Server1.

2) A znode can have child nodes, and every znode can store data. Note that EPHEMERAL (temporary) nodes cannot have child nodes.

3) Each znode has a version. The version number is incremented automatically every time the znode's data is updated, and it lets clients make conditional updates that succeed only if the version they expect is still current. ZooKeeper stores only the latest data for each path, not a history of earlier versions.

4) Type of znode:

  • Persistent node. Once created, it is not lost by accident; it still exists even if all servers are restarted. A persistent node can hold data and child nodes.

  • Ephemeral node. It is deleted automatically when the session of the client that created it ends, either because the client closes the session or because the server side declares it expired. Note that a server restart alone does not necessarily end a session: a client that reconnects to the cluster within the session timeout keeps its session and its ephemeral nodes.

  • Non-sequence node. When several clients try to create the same non-sequence node at the same time, exactly one succeeds and all the others fail. The created node has exactly the name given at creation.

  • Sequence node. The created node's name is the given name followed by a 10-digit decimal number. When several clients create nodes with the same name, they all succeed, each with a different sequence number.

5) A znode can be watched, including changes to the data stored in it and changes to its children. When such a change happens, the clients that set the watch are notified. This is ZooKeeper's core feature, and many of its functions are built on it.

6) ZXID: Every change to ZooKeeper's state produces a zxid (ZooKeeper Transaction Id). Zxids are globally ordered: if zxid1 is less than zxid2, then zxid1 happened before zxid2.
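
The characteristics above (path identity, versions, ephemeral nodes without children, sequence suffixes) can be made concrete with a small in-memory sketch. ZNodeStore and ZNodeError are hypothetical names for this example, not ZooKeeper classes. As a simplification, the sequence counter here is kept per parent and only advances on sequential creates. Passing -1 as the expected version means "match any version", as in the real API.

```python
class ZNodeError(Exception):
    """Raised for any rejected operation in this toy store."""


class ZNodeStore:
    def __init__(self):
        # path -> node record; the root "/" always exists.
        self.nodes = {"/": self._new_node(b"", False)}

    @staticmethod
    def _new_node(data, ephemeral):
        return {"data": data, "version": 0, "ephemeral": ephemeral, "seq": 0}

    @staticmethod
    def _parent(path):
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, data=b"", ephemeral=False, sequential=False):
        parent = self.nodes.get(self._parent(path))
        if parent is None:
            raise ZNodeError("parent does not exist")
        if parent["ephemeral"]:
            raise ZNodeError("ephemeral nodes cannot have children")
        if sequential:
            # Append a 10-digit, zero-padded counter kept by the parent.
            path = f"{path}{parent['seq']:010d}"
            parent["seq"] += 1
        if path in self.nodes:
            raise ZNodeError("node already exists")
        self.nodes[path] = self._new_node(data, ephemeral)
        return path

    def set_data(self, path, data, expected_version):
        node = self.nodes[path]
        # Conditional update: reject if someone else changed the node first.
        if expected_version != -1 and expected_version != node["version"]:
            raise ZNodeError("version mismatch")
        node["data"] = data
        node["version"] += 1
        return node["version"]
```

Creating "/NameService/job-" twice with sequential=True yields two distinct nodes, while a second set_data call that still expects version 0 is rejected because the first update already moved the version to 1.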

ZooKeeper Session

When a client establishes a connection with the ZooKeeper cluster, the session moves through the states shown in the figure:

If the client loses its connection to a ZooKeeper server because of a timeout, it enters the CONNECTING state and automatically tries to connect to a server again. If it reconnects to any server within the session's validity period, it returns to the CONNECTED state.

Note: If the client loses contact with the server because of a bad network, it stays in its current state and keeps trying to reconnect to a ZooKeeper server. The client cannot declare its own session expired; session expiry is decided by the ZooKeeper servers. The client can, however, choose to close the session itself.
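
The session behaviour described above can be sketched as a small state machine. The state names CONNECTING, CONNECTED, and CLOSED come from the text; the event names ("connected", "disconnected", "session_expired", "close") are assumptions made for this illustration and are not the client library's own identifiers.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("CONNECTING", "connected"): "CONNECTED",
    ("CONNECTED", "disconnected"): "CONNECTING",
    # Expiry is decided by the servers; the client only learns of it
    # when it manages to talk to a server again.
    ("CONNECTING", "session_expired"): "CLOSED",
    ("CONNECTING", "close"): "CLOSED",
    ("CONNECTED", "close"): "CLOSED",
}


def next_state(state, event):
    """Return the next state, or stay put if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="CONNECTING"):
    """Feed a sequence of events through the state machine."""
    for event in events:
        state = next_state(state, event)
    return state
```

A connection loss followed by a successful reconnect ends in CONNECTED again, whereas a reconnect that arrives after the servers expired the session ends in CLOSED, from which there is no way back.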

ZooKeeper Watch

A ZooKeeper watch is a notification mechanism. All ZooKeeper read operations, getData(), getChildren(), and exists(), can set a watch, and a watch event can be understood as a one-time trigger.

The official definition is as follows:

A watch event is a one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes.

Three key points of Watch:

One-time trigger

When the watched data changes, a watch event is sent to the client.

For example, if the client calls getData("/znode1", true) and the data on /znode1 is later changed or deleted, the client receives a watch event for the change to /znode1.

If /znode1 changes again, the client receives no further notification unless it has set a watch on /znode1 again.
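
The one-time nature of a watch can be shown with a minimal sketch. WatchedNode is a hypothetical class for this example; in the real client the watch is registered by passing true or a watcher to getData(). The event name NodeDataChanged is the one ZooKeeper uses for data changes.

```python
class WatchedNode:
    def __init__(self, data):
        self.data = data
        self.watchers = []  # callbacks waiting for the next change

    def get_data(self, watcher=None):
        # Reading with a watcher registers it for exactly one future event.
        if watcher is not None:
            self.watchers.append(watcher)
        return self.data

    def set_data(self, data):
        self.data = data
        # Clear the list before notifying: each watch fires at most once.
        fired, self.watchers = self.watchers, []
        for notify in fired:
            notify("NodeDataChanged")
```

After the first change the watcher list is empty, so a second change goes unnoticed until the client reads again with a watcher attached.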

Sent to the client

ZooKeeper clients and servers communicate over sockets. Because of network failures, a watch event may fail to reach the client. Watch events are delivered to watchers asynchronously.

ZooKeeper itself provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event.

Network delay and other factors may cause different clients to observe a watch event at different times, but everything the different clients see has a consistent order.

The data for which the watch was set

This refers to the different ways a znode can change. You can think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches, and getChildren() sets child watches.

Alternatively, you can think of the watches as differing by the kind of data returned: getData() and exists() return information about the znode itself, while getChildren() returns a list of child nodes.

Therefore, a successful setData() triggers the data watches set on that znode. A successful create() triggers a data watch on the znode being created and a child watch on its parent.

A successful delete() triggers both the data watch and the child watch of the znode being deleted, and also the child watch of its parent.
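
The trigger rules above fit in one small function. This is a sketch for illustration only: triggered_watches is a hypothetical helper, and each result is written as a (watch kind, path) pair.

```python
def triggered_watches(op, path):
    """Return the (watch_kind, path) pairs fired by a successful operation."""
    parent = path.rsplit("/", 1)[0] or "/"
    if op == "setData":
        return {("data", path)}
    if op == "create":
        # The data watch on a not-yet-existing node is one set via exists().
        return {("data", path), ("child", parent)}
    if op == "delete":
        return {("data", path), ("child", path), ("child", parent)}
    raise ValueError(f"unknown operation: {op}")
```

For example, deleting /app/worker1 fires the data and child watches on /app/worker1 itself plus the child watch on /app, while a setData on the same path fires only its data watch.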

Watches in ZooKeeper are lightweight, so they are cheap to set, maintain, and dispatch. While a client is disconnected from the ZooKeeper server, it receives no watch events. When the client reconnects, any previously registered watches are re-registered and triggered if needed. This is usually transparent to developers.

There is only one situation in which a watch event can be lost: a watch on a znode was set through exists(), and the client is disconnected from the ZooKeeper server for the whole interval in which that znode is created and then deleted. In that case the client gets no event notification even after it reconnects.

Consistency Guarantees

ZooKeeper is an efficient and scalable service. Both reads and writes are designed to be fast, with reads faster than writes.

Sequential Consistency: Updates from a client are applied in the order in which they were sent.

Atomicity: An update either succeeds or fails. There is no partial success.

Single System Image: Whichever server a client connects to, it sees the same image of the system.

Reliability: Once an update has been applied, it persists until it is overwritten.

Timeliness: The system state seen by each client is guaranteed to be up to date within a certain time bound.

How ZooKeeper works

In a ZooKeeper cluster, each server takes one of the following 3 roles and is in one of 4 states:

  • Role: leader, follower, observer

  • State: leading, following, observing, looking

The core of ZooKeeper is atomic broadcast, the mechanism that keeps the servers synchronized. The protocol that implements it is called Zab (ZooKeeper Atomic Broadcast). Zab has two modes: recovery mode (leader election) and broadcast mode (state synchronization).

When the service starts, or after the leader crashes, Zab enters recovery mode. Recovery mode ends once a leader has been elected and a majority of servers have synchronized their state with the leader. State synchronization ensures that the leader and the other servers hold the same system state.

To keep transactions in a consistent order, ZooKeeper identifies each transaction with an increasing transaction id (zxid). Every proposal is stamped with a zxid when it is issued.

In the implementation, a zxid is a 64-bit number. The high 32 bits hold the epoch, which identifies a leader's term: each newly elected leader gets a new epoch. The low 32 bits hold a counter that is incremented for each transaction.
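
The 64-bit layout can be expressed with a few bit operations. The helper names make_zxid, epoch_of, and counter_of are invented for this sketch.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch, counter):
    """Pack an epoch (high 32 bits) and a counter (low 32 bits)."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def epoch_of(zxid):
    return zxid >> COUNTER_BITS


def counter_of(zxid):
    return zxid & COUNTER_MASK
```

Because the epoch occupies the high bits, any zxid issued under a newer leader compares greater than every zxid issued under an older one, whatever the counter values are.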

Each Server has 4 states during its work:

LOOKING: The current server does not know who the leader is and is searching.

LEADING: The current Server is the elected leader.

FOLLOWING: The leader has been elected, and the current server is synchronized with it.

OBSERVING: In most respects an observer behaves exactly like a follower, except that it does not take part in elections or voting; it only receives (observes) their results.


Leader Election

When the leader crashes or loses contact with a majority of its followers, ZooKeeper enters recovery mode, in which a new leader must be elected and all servers restored to a correct state.

ZooKeeper has two election algorithms: one based on basic Paxos and one based on fast Paxos.

The default election algorithm is the fast Paxos one (implemented in the FastLeaderElection class). The basic Paxos process is introduced first:

1. The election thread is the thread on the current server that initiates the election. Its main job is to tally the voting results and select the recommended server;

2. The election thread first sends a query to all servers (including itself);

3. When the election thread receives a reply, it verifies that the reply belongs to a query it initiated (by checking that the zxid matches), records the responder's id (myid) in the list of queried servers, and finally extracts the leader information proposed by the responder (id, zxid) and stores it in the voting record table of the current election;

4. After replies from all servers have arrived, it finds the server with the largest zxid and sets that server as the one to vote for in the next round;

5. The thread sets the server with the largest zxid as the leader recommended by the current server. If the winning server has at least n/2 + 1 votes at this point (n being the number of servers), it is confirmed as leader and the current server sets its own state according to the winner's information. Otherwise the process repeats until a leader is elected.
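
Steps 4 and 5 can be sketched as two small functions. This is a simplified illustration, not ZooKeeper's election code: pick_candidate and tally are hypothetical names, and breaking zxid ties by the larger server id is an assumption made for the example.

```python
from collections import Counter


def pick_candidate(replies):
    """replies: list of (server_id, zxid). Pick the largest zxid.

    Ties are broken by the larger server id (an assumption of this sketch).
    """
    return max(replies, key=lambda reply: (reply[1], reply[0]))[0]


def tally(votes, cluster_size):
    """votes: voter_id -> candidate_id. Return the winner, or None."""
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    # A candidate wins only with at least n/2 + 1 votes (integer division).
    return candidate if count >= cluster_size // 2 + 1 else None
```

When tally returns None, no candidate has a majority yet and another round of queries is needed.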

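From this process it follows that, for a leader to win the support of a majority, at least n+1 servers out of a total of 2n+1 must be alive. This is why clusters are normally sized with an odd number of servers: an even-sized cluster also works, but adding one server to an odd-sized cluster does not increase the number of failures it can tolerate.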
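
The majority arithmetic is easy to check numerically. The two helpers below are illustrative names, not part of ZooKeeper.

```python
def quorum(size):
    """Smallest number of servers that forms a strict majority."""
    return size // 2 + 1


def tolerated_failures(size):
    """How many servers may fail while a majority is still alive."""
    return size - quorum(size)
```

A 5-server cluster needs 3 live servers and tolerates 2 failures. A 6-server cluster needs 4 live servers and still tolerates only 2 failures, so the sixth server buys no extra fault tolerance.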

Every server runs the above process after it starts. In recovery mode, a server that has just started or has just recovered from a crash restores its data and session information from snapshots on disk. ZooKeeper writes a transaction log and takes periodic snapshots so that state can be restored during recovery.

In the fast Paxos process, a server first proposes itself as leader to all servers. When another server receives the proposal, it resolves any conflict by comparing epoch and zxid, accepts the better proposal, and replies with a message saying it has accepted. This is repeated until a leader is elected.
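
The conflict resolution can be sketched by ordering proposals as tuples. This is a toy model under stated assumptions: Vote and elect are hypothetical names, every server can hear every other server, and the final tie-break on server id is added here so that the comparison is total.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    # Field order defines the comparison: epoch, then zxid, then server id.
    epoch: int
    zxid: int
    server_id: int


def elect(servers):
    """Each server first proposes itself, then adopts any better proposal."""
    votes = {s.server_id: s for s in servers}
    changed = True
    while changed:
        changed = False
        for receiver in votes:
            for proposal in list(votes.values()):
                if proposal > votes[receiver]:
                    votes[receiver] = proposal
                    changed = True
    return votes
```

Once the loop stops changing anything, every server holds the same vote, and the server named in it is the elected leader.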

Leader workflow

The leader has three main functions:

  1. Recover data;

  2. Maintain heartbeats with followers, receive follower requests, and determine the message type of each request;

  3. Process each follower message according to its type. The main message types are PING, REQUEST, ACK, and REVALIDATE.

Description:

The PING message carries the follower's heartbeat. The REQUEST message carries a proposal sent by the follower, including write requests and synchronization requests. The ACK message is the follower's reply to a proposal; once more than half of the followers have acknowledged it, the proposal is committed. The REVALIDATE message is used to extend a session's validity period.


Follower workflow

A follower has four main functions:

  1. Send requests to the leader (PING, REQUEST, ACK, and REVALIDATE messages);

  2. Receive messages from the leader and process them;

  3. Receive client requests, and forward write requests to the leader for voting;

  4. Return the result to the client.

The follower's message loop handles the following types of messages from the leader:

  1. PING message: heartbeat message

  2. PROPOSAL message: a proposal initiated by the leader, on which followers are asked to vote

  3. COMMIT message: information about the latest proposal on the server side, which is to be committed

  4. UPTODATE message: indicates that synchronization is complete

  5. REVALIDATE message: depending on the leader's REVALIDATE result, either close the session being revalidated or allow it to keep receiving messages

  6. SYNC message: returns the SYNC result to the client. This message is originally initiated by the client to force an update to the latest state.

Zab: Broadcasting State Updates

When a ZooKeeper server receives a write request and is a follower, it forwards the request to the leader. The leader executes the request and broadcasts the result as a transaction.

How does the ZooKeeper cluster decide whether a transaction is committed? Through a two-phase commit:

  • The leader sends a PROPOSAL message to all followers.

  • A follower that receives the PROPOSAL writes it to disk and sends an ACK to the leader to confirm receipt.

  • When the leader has received ACKs from a quorum of followers, it sends a COMMIT message telling them to apply the transaction.
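
The three steps can be simulated in a few lines. This is a sketch under stated assumptions, not the Zab implementation: ZabFollower and broadcast are hypothetical names, the reachable argument models which followers the PROPOSAL actually reaches, and the leader is assumed to count its own acknowledgement toward the quorum.

```python
class ZabFollower:
    def __init__(self):
        self.log = []        # proposals written to "disk"
        self.committed = []  # transactions that have been applied

    def on_proposal(self, zxid, txn):
        self.log.append((zxid, txn))
        return "ACK"

    def on_commit(self, zxid):
        for logged_zxid, txn in self.log:
            if logged_zxid == zxid:
                self.committed.append(txn)


def broadcast(followers, reachable, zxid, txn):
    """Propose, count ACKs, and commit only if a quorum acknowledged."""
    ensemble = len(followers) + 1  # the followers plus the leader
    acks = 1                       # assumption: the leader acks itself
    acked = []
    for index, follower in enumerate(followers):
        if index in reachable and follower.on_proposal(zxid, txn) == "ACK":
            acks += 1
            acked.append(follower)
    if acks < ensemble // 2 + 1:
        return False               # no quorum: the proposal stays uncommitted
    for follower in acked:
        follower.on_commit(zxid)
    return True
```

With four followers and a leader, a proposal that reaches two followers gathers three acknowledgements out of five and is committed. A proposal that reaches only one follower is logged there but never applied.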

The Zab protocol guarantees:

  • If the leader broadcasts T1 and then T2, every server must apply T1 before it applies T2.

  • If any server commits T1 and then T2, all other servers must also commit them in the order T1, T2.

The biggest problem with this two-phase commit protocol is that if the leader crashes or temporarily loses its connection after sending a PROPOSAL message, the whole cluster is left in an indeterminate state: a follower does not know whether to abandon the proposal or commit it.

In that case ZooKeeper elects a new leader, and request processing moves to the new leader. Different leaders are identified by different epochs. A leader switch has to solve the following two problems:

1. Never forget delivered messages

The leader crashes before its COMMIT reaches any follower, so only the leader itself has committed the transaction. The new leader must make sure that this transaction is committed as well.

2. Let go of messages that are skipped

The leader generates a proposal, but crashes before any follower has seen it. When that server recovers, the proposal must be discarded.
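
Both rules can be seen in a toy recovery step. This is a heavily simplified sketch: recover and last_zxid are hypothetical names, zxids are plain integers without epochs, and the new leader is assumed to be the surviving server with the largest last zxid, whose history every server then adopts.

```python
def last_zxid(log):
    return log[-1] if log else -1


def recover(logs, alive):
    """logs: server_id -> ascending list of logged zxids.

    alive: ids of servers taking part in the election.
    Returns (new_leader_id, logs after every server has synchronized).
    """
    leader = max(alive, key=lambda sid: (last_zxid(logs[sid]), sid))
    history = list(logs[leader])
    # Every server, including one that rejoins later, adopts this history.
    return leader, {sid: list(history) for sid in logs}
```

Suppose server 1 was the leader with log [1, 2, 3, 4], where transaction 3 had been logged by server 2 and proposal 4 had reached nobody. After server 1 crashes, server 2 wins, transaction 3 survives on every server, and proposal 4 is dropped from server 1's log when it rejoins.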

ZooKeeper tries to ensure that there are never two active leaders at the same time, because two leaders would leave the cluster in an inconsistent state. The Zab protocol therefore also guarantees:

  • Before a new leader broadcasts any transaction, the transactions committed by the previous leader are applied first.

  • At no time are there two servers that each have the support of a quorum.
    A quorum here means more than half of the servers, or more precisely, of the servers with voting rights (observers are not counted).

Summary

This article gave a brief introduction to the basic principles of ZooKeeper: its data model, sessions, the watch mechanism, consistency guarantees, leader election, the leader and follower workflows, and the Zab protocol.

References

  • "ZooKeeper: Distributed Process Coordination" by Flavio Junqueira and Benjamin Reed

  • http://zookeeper.apache.org/doc/trunk/zookeeperOver.html

  • http://www.ibm.com/developerworks/cn/opensource/os-cn-zookeeper/index.html

  • "Appreciation of ZooKeeper's Consensus Algorithm" https://my.oschina.net/pingpangkuangmo/blog/778927

The copyright of this article belongs to the author and Cnblogs (the Blog Garden). Reprinting is welcome, but unless the author agrees otherwise, this statement must be retained and a link to the original must be given in a prominent place on the article page; otherwise the right to pursue legal liability is reserved.
