ZooKeeper Learning Summary: Basic Principles of ZooKeeper

ZooKeeper Fundamentals

Introduction to ZooKeeper

ZooKeeper is an open-source coordination service for distributed applications. It exposes a simple set of primitives on which distributed applications can build synchronization, configuration maintenance, and naming services.

 

ZooKeeper Design Goals

1. Eventual consistency: no matter which server a client connects to, it is shown the same view of the data. This is ZooKeeper's most important characteristic.

2. Reliability: ZooKeeper is simple, robust, and performs well. If a message m is accepted by one server, it will eventually be accepted by all servers.

3. Real-time (timeliness): ZooKeeper guarantees that within a bounded time interval a client either obtains the updated state of the server or learns that the server has failed. However, because of network delays and similar factors, ZooKeeper cannot guarantee that two clients see a newly updated value at the same moment. A client that needs the very latest data should call the sync() interface before reading (see the sketch after this list).

4. Wait-free: a slow or faulty client must not interfere with the requests of fast clients, so that every client's requests are served effectively.

5. Atomicity: an update either succeeds or fails; there are no intermediate states.

6. Ordering: this includes global order and partial order. Global order means that if message a is published before message b on one server, then a is published before b on every server. Partial order means that if message b is published by the same sender after message a, then a is ordered before b.
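As a rough illustration of the sync() call mentioned in point 3, here is a minimal Java sketch. The connection string localhost:2181 and the znode /config are placeholders assumed for this example, not anything mandated by ZooKeeper:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncThenRead {
    public static void main(String[] args) throws Exception {
        // Connection string and znode path are placeholders for this sketch.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        CountDownLatch synced = new CountDownLatch(1);
        // sync() is asynchronous: the callback fires once the connected server
        // has caught up with the leader for this path.
        zk.sync("/config", (rc, path, ctx) -> synced.countDown(), null);
        synced.await();

        // After sync() completes, this read reflects all updates that were
        // committed before the sync was issued.
        Stat stat = new Stat();
        byte[] data = zk.getData("/config", false, stat);
        System.out.println("version=" + stat.getVersion() + " data=" + new String(data));

        zk.close();
    }
}
```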

ZooKeeper data model

ZooKeeper maintains a hierarchical data structure that closely resembles a standard file system.

The data structure of Zookeeper has the following characteristics:

1) Each subdirectory item such as NameService is called a znode, and this znode is uniquely identified by the path where it is located. For example, the znode of Server1 is identified as /NameService/Server1.

2) znodes can have sub-node directories, and each znode can store data. Note that EPHEMERAL (temporary) type directory nodes cannot have sub-node directories.

3) Znodes are versioned. Every update to the data stored in a znode automatically increments its version number, which clients can use for conditional updates.

4) Types of znode (see the Java sketch after this list):

  • Persistent nodes, once created, are not lost accidentally, even if the whole ensemble is restarted. A persistent node can hold data as well as child nodes.
  • Ephemeral nodes are deleted automatically when the session between the server and the client that created them ends. A server restart ends the session, so ephemeral znodes are removed at that point as well.
  • Non-sequence nodes: when several clients try to create the same non-sequence node at the same time, only one of them succeeds and the others fail, and the created node's name is exactly the name specified at creation time.
  • Sequence nodes: the created node's name is the specified name followed by a 10-digit decimal counter. When several clients create nodes with the same name, all of them succeed, each with a different sequence number.

5) A znode can be watched; watches cover changes to the data stored in the node as well as changes to its list of children. Once a change occurs, the client that set the watch is notified. This is the core feature of ZooKeeper, and many of its other capabilities are built on it.

6) ZXID: A zxid (ZooKeeper Transaction Id) is generated each time the state of Zookeeper is changed. zxid is globally ordered. If zxid1 is less than zxid2, then zxid1 occurs before zxid2.
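To make the node types in point 4 concrete, the following Java sketch (connection string and paths are illustrative assumptions) creates one node of each flavour; note how the sequential create returns a name with a 10-digit counter appended:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypes {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        // Persistent node: survives client disconnects and server restarts.
        zk.create("/app", "cfg".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral node: deleted automatically when this session ends.
        // Ephemeral nodes cannot have children.
        zk.create("/app/worker-1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential node: the server appends a monotonically increasing
        // 10-digit counter, e.g. /app/lock-0000000003.
        String seq = zk.create("/app/lock-", new byte[0],
                               ZooDefs.Ids.OPEN_ACL_UNSAFE,
                               CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + seq);

        zk.close();
    }
}
```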

ZooKeeper Session

A client establishes a connection with the ZooKeeper ensemble, and over the lifetime of that connection the session moves between states (such as CONNECTING and CONNECTED).

If the client loses its connection to the ZooKeeper server because of a timeout, it enters the CONNECTING state and automatically tries to connect to a server again. If it reconnects to some server within the session's validity period, it returns to the CONNECTED state.

Note: if the client loses contact with the server because of poor network conditions, it remains in its current state and keeps actively trying to reconnect to a ZooKeeper server. The client cannot declare its own session expired; session expiry is decided by the ZooKeeper server. The client may, however, choose to close the session itself.
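As a hedged illustration, a client can observe these session-state transitions through its default watcher. The sketch below assumes a local ensemble at localhost:2181 and simply prints the transitions:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionStates {
    public static void main(String[] args) throws Exception {
        Watcher sessionWatcher = (WatchedEvent event) -> {
            // Session-level events carry no path, only a KeeperState.
            switch (event.getState()) {
                case Disconnected:  // lost contact; the client keeps retrying
                    System.out.println("CONNECTING: retrying servers...");
                    break;
                case SyncConnected: // (re)connected within the session timeout
                    System.out.println("CONNECTED");
                    break;
                case Expired:       // the server, not the client, expired the session
                    System.out.println("Session expired; a new client must be created");
                    break;
                default:
                    break;
            }
        };
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, sessionWatcher);
        Thread.sleep(60000); // keep the process alive to observe transitions
        zk.close();
    }
}
```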

ZooKeeper Watch

ZooKeeper's watch is a monitoring and notification mechanism. All ZooKeeper read operations, getData(), getChildren(), and exists(), can set a watch. A watch event is best understood as a one-time trigger. The official definition reads: a watch event is a one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes. There are three key points about watches:

* One-time trigger

The watch event is sent to the client when the watched data changes. For example, if a client calls getData("/znode1", true) and later the data on /znode1 is changed or the node is deleted, the client receives a watch event for /znode1. If /znode1 changes again after that, the client receives no further notification unless it has set a new watch on /znode1.
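The practical consequence is a re-registration pattern: the handler must set the watch again if it wants to keep monitoring. A minimal Java sketch, assuming an already connected ZooKeeper handle and an illustrative znode path:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class ReRegisteringWatch {
    private final ZooKeeper zk;

    public ReRegisteringWatch(ZooKeeper zk) { this.zk = zk; }

    public void watchData(String path) throws KeeperException, InterruptedException {
        // Passing a Watcher registers a one-time data watch on the znode.
        byte[] data = zk.getData(path, event -> {
            if (event.getType() == EventType.NodeDataChanged) {
                try {
                    // The watch has already fired and is gone; re-register it
                    // by reading again, otherwise later changes go unnoticed.
                    watchData(path);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, null);
        System.out.println("current value: " + new String(data));
    }
}
```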

* Sent to the client

The ZooKeeper client and server communicate over sockets, so a watch event may fail to reach the client because of a network failure. Watch events are delivered to watchers asynchronously. ZooKeeper itself provides an ordering guarantee: a client will never see a change to data on which it has set a watch before it first sees the watch event. Network latency or other factors may cause different clients to perceive a watch event at different times, but all clients see everything in a consistent order.

* The data for which the watch was set

This refers to the different ways a znode itself can change. You can think of ZooKeeper as maintaining two watch lists: data watches and child watches. getData() and exists() set data watches, while getChildren() sets child watches. Alternatively, think in terms of the data each call returns: getData() and exists() return information about the znode itself, whereas getChildren() returns the list of child nodes. Therefore, setData() triggers the data watches set on a node (assuming the write succeeds); a successful create() triggers the data watch on the newly created node and the child watch of its parent; and a successful delete() triggers both the data watch and the child watch of the deleted node, as well as the child watch of its parent.
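For instance, a child watch registered with getChildren() fires with NodeChildrenChanged when a child is created or deleted under the watched node, while changing the node's own data does not touch it. A small hedged sketch; the connection string and the /tasks path are assumptions for illustration, and /tasks is assumed to already exist:

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ChildWatchDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        // getChildren() with a Watcher registers a one-time child watch on /tasks.
        List<String> children = zk.getChildren("/tasks", event -> {
            if (event.getType() == EventType.NodeChildrenChanged) {
                System.out.println("child list of " + event.getPath() + " changed");
            }
        });
        System.out.println("current children: " + children);

        // Creating a child under /tasks triggers the child watch above;
        // changing the data stored in /tasks itself would not.
        zk.create("/tasks/task-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

        Thread.sleep(1000); // give the watch event time to arrive
        zk.close();
    }
}
```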

Watches in ZooKeeper are lightweight, and therefore easy to set, maintain, and dispatch. When a client loses contact with the ZooKeeper server, it does not receive watch events. Only after the client reconnects are previously registered watches re-registered and, if necessary, triggered; for developers this is usually transparent. There is one situation in which watch events can be lost: if a watch on a znode was set with exists(), and the client is disconnected from the server during the interval in which that znode is created and then deleted, the client will not be notified of those events even after it reconnects.

Consistency Guarantees

Zookeeper is an efficient and scalable service. Both read and write operations are designed to be fast, and read operations are faster than write operations.

Sequential Consistency: Update requests from a client are executed sequentially.

Atomicity: Updates either succeed or fail, with no partial success.

Single System Image: no matter which server a client connects to, it sees the same view of the system.

Reliability: once an update has taken effect, it remains in effect until it is overwritten by a later update.

Timeliness: the system state seen by each client is guaranteed to be up to date within a bounded period of time.

How ZooKeeper Works

In the zookeeper cluster, each node has the following 3 roles and 4 states:

  • Roles: leader, follower, observer
  • States: leading, following, observing, looking

The core of ZooKeeper is atomic broadcast, which keeps the servers in sync. The protocol that implements this mechanism is called the Zab protocol (ZooKeeper Atomic Broadcast protocol). Zab has two modes: recovery mode (used for leader election) and broadcast mode (used for normal synchronization). When the service starts or after the leader crashes, Zab enters recovery mode. Recovery mode ends when a leader has been elected and a majority of the servers have finished synchronizing their state with it. State synchronization ensures that the leader and the servers hold the same system state.

To guarantee the sequential consistency of transactions, ZooKeeper identifies each transaction with an increasing transaction id (zxid); every proposal carries a zxid. In the implementation, the zxid is a 64-bit number: the upper 32 bits are the epoch, which identifies whether the leadership has changed (each newly elected leader gets a new epoch marking its reign), and the lower 32 bits are a counter that increases with each transaction within that epoch.
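As an illustration of this layout, the epoch and the counter can be recovered from a 64-bit zxid with simple bit arithmetic:

```java
public class ZxidParts {
    public static void main(String[] args) {
        long zxid = 0x0000000500000007L;       // example value: epoch 5, counter 7
        long epoch = zxid >>> 32;              // upper 32 bits: leader epoch
        long counter = zxid & 0xFFFFFFFFL;     // lower 32 bits: per-epoch counter
        System.out.println("epoch=" + epoch + " counter=" + counter);
        // Output: epoch=5 counter=7
    }
}
```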

Each Server has 4 states in the working process:

LOOKING: The current server does not know who the leader is and is searching.

LEADING: The current server is the elected leader.

FOLLOWING: The leader has been elected, and the current server is synchronized with it.

OBSERVING: an observer behaves like a follower in most respects, but it does not participate in elections or in voting on proposals; it only accepts (observes) the results of elections and votes.

Leader Election

When the leader crashes or loses a majority of its followers, ZooKeeper enters recovery mode, which elects a new leader and restores all servers to a correct state. ZooKeeper has two election algorithms: one based on basic paxos and one based on fast paxos; fast paxos is the default. First, the basic paxos flow:

1. The election thread is the thread with which the current server initiates an election; its main job is to tally the votes and pick the recommended server.

2. The election thread first sends a query to all servers (including itself).

3. When the election thread receives a reply, it verifies that the reply corresponds to the query it issued (by checking that the zxid is consistent), obtains the responder's id (myid) and stores it in the list of servers queried in this round, and finally obtains the leader proposed by the responder (id, zxid) and records that information in the vote table for the current election.

4. After receiving replies from all servers, it determines the server with the largest zxid and sets that server's information as the one to vote for in the next round.

5. The thread sets the server with the largest zxid as the leader it currently recommends. If that winning server has received votes from n/2 + 1 servers, the currently recommended leader is taken as the winner and the servers set their own state from the winner's information; otherwise the process continues until a leader is elected.

From this analysis we can conclude that, for a leader to gain the support of a majority of servers, the total number of servers should be an odd number 2n+1, and the number of live servers must be no fewer than n+1.
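Below is a hedged sketch of the comparison a server could apply when deciding whether a received vote supersedes its own recommendation. ZooKeeper's real implementation also takes the election epoch into account; this simplified version orders only by zxid and then by server id, matching the description above:

```java
public class Vote {
    final long proposedLeaderId; // myid of the server being recommended
    final long proposedZxid;     // last zxid seen by that server

    Vote(long proposedLeaderId, long proposedZxid) {
        this.proposedLeaderId = proposedLeaderId;
        this.proposedZxid = proposedZxid;
    }

    /** Returns true if the received vote should replace our current vote. */
    static boolean supersedes(Vote received, Vote current) {
        // Prefer the server holding the most up-to-date data (largest zxid);
        // break ties with the larger server id.
        if (received.proposedZxid != current.proposedZxid) {
            return received.proposedZxid > current.proposedZxid;
        }
        return received.proposedLeaderId > current.proposedLeaderId;
    }
}
```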

Every server repeats this process after it starts. In recovery mode, a server that has just recovered from a crash or has just started restores its data and session information from a disk snapshot; ZooKeeper records a transaction log and takes snapshots periodically precisely to make this state recovery possible.

In the fast paxos flow, a server first proposes to all other servers that it become the leader. When the other servers receive the proposal, they resolve any conflicts between epoch and zxid, accept the proposal, and reply with an acceptance message. This process repeats until a leader is elected.

Leader Workflow

The leader has three main functions:

1. Restore data;

2. Maintain heartbeats with the followers, receive follower requests, and determine the type of each follower message;

3. Follower messages mainly comprise PING, REQUEST, ACK, and REVALIDATE messages, and each type is handled differently.

The PING message carries the follower's heartbeat; the REQUEST message carries proposals sent by the follower, including write requests and synchronization requests;

The ACK message is the follower's reply to a proposal: once more than half of the followers have acknowledged it, the proposal is committed;

The REVALIDATE message is used to extend a session's validity period.

Follower Workflow

Follower has four main functions:

1. Send a request to the Leader (PING message, REQUEST message, ACK message, REVALIDATE message);

2. Receive the Leader message and process it;

3. Receive client requests; if a request is a write, forward it to the Leader for voting;

4. Return results to the client.

The Follower's message loop processes the following messages from the Leader:

1. PING message: heartbeat message

2. PROPOSAL message: a proposal initiated by the Leader, on which Followers are asked to vote

3. COMMIT message: notification of the latest committed proposal on the server side

4. UPTODATE message: indicates that synchronization is complete

5. REVALIDATE message: depending on the Leader's REVALIDATE result, either close the session being revalidated or allow it to continue accepting requests

6. SYNC message: returns the SYNC result to the client; this message is originally initiated by the client, to force it to see the latest updates

Zab: Broadcasting State Updates

A ZooKeeper server receives a request; if the server is a follower, it forwards the request to the leader. The leader executes the request and broadcasts the result as a transaction. How does the ZooKeeper cluster decide whether a transaction is committed? Through a simplified two-phase commit (a quorum-check sketch follows the steps below):

  1. The Leader sends a PROPOSAL message to all followers.
  2. A follower receives the PROPOSAL message, writes it to disk, and sends an ACK message back to the leader to acknowledge receipt.
  3. When the Leader has received ACKs from a quorum of followers, it sends a COMMIT message so the transaction is applied.
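A hedged sketch of the quorum bookkeeping behind step 3 follows; this is illustrative code, not ZooKeeper's actual Leader class. A proposal is considered committed once acknowledgements from more than half of the voting servers have been recorded (the leader's own implicit ACK can simply be added like any other):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ProposalTracker {
    private final int votingServers;                        // ensemble size, excluding observers
    private final Map<Long, Set<Long>> acksByZxid = new HashMap<>();

    public ProposalTracker(int votingServers) { this.votingServers = votingServers; }

    /** Record an ACK for a proposal; return true when it reaches quorum. */
    public boolean ack(long zxid, long serverId) {
        Set<Long> acks = acksByZxid.computeIfAbsent(zxid, z -> new HashSet<>());
        acks.add(serverId);
        // Quorum = strictly more than half of the voting servers.
        return acks.size() > votingServers / 2;
    }
}
```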

Zab protocol guarantees:

  • 1) If the leader broadcasts T1 before T2, then every server must apply T1 first and then T2.
  • 2) If any server commits T1 and then T2, then every other server must also commit T1 before T2.

The biggest problem with this two-phase commit protocol is that if the leader crashes or temporarily loses its connection after sending the PROPOSAL message, the cluster is left in an uncertain state (followers do not know whether to abandon the commit or to apply it). ZooKeeper then elects a new leader, and request processing moves to that new leader. Different leaders are distinguished by different epochs. When switching leaders, the following two problems need to be solved:

1. Never forget delivered messages

If the leader commits a transaction itself but crashes before the COMMIT reaches any follower, the new leader must ensure that this transaction is still committed.

2. Discard messages that should be skipped

If the leader generates a proposal but crashes before any follower has seen it, that proposal must be discarded when the server recovers.

Zookeeper will try to ensure that there are no two active leaders at the same time, because two different leaders will cause the cluster to be in an inconsistent state, so the Zab protocol also guarantees:

  • 1) Before a new leader broadcasts new transactions, the transactions committed by the previous leader are applied first.
  • 2) At any time, no two servers have a quorum of supporters at the same time.

A quorum here means more than half of the servers; to be precise, more than half of the servers with voting rights (Observers excluded).

 

Summary: A brief introduction to the basic principles of Zookeeper, data model, Session, Watch mechanism, consistency guarantee, Leader Election, Leader and Follower workflow and Zab protocol.

 

Author: Avalo
The copyright of this article belongs to the author and the blog garden (cnblogs). You are welcome to reprint it without the author's prior consent, provided that this statement is retained and that a link to the original article is placed in a prominent position on the page; otherwise the author reserves the right to pursue legal responsibility.

 
