Detailed explanation of the working principle of Zookeeper (reproduced)

1. The role of Zookeeper

» Leader: initiates and resolves votes, and updates the system state
» Learner: covers followers and observers. A follower accepts client requests, returns results to the client, and votes during leader election
» Observer: accepts client connections and forwards write requests to the leader, but does not vote; it only synchronizes the leader's state. Observers exist to scale the system out and improve read speed
» Client: the initiator of requests


• The core of Zookeeper is atomic broadcast, a mechanism that keeps the servers synchronized. The protocol that implements it is called the Zab protocol. Zab has two modes: recovery mode (leader election) and broadcast mode (synchronization). Zab enters recovery mode when the service starts or after the leader crashes. Recovery mode ends once a leader has been elected and a majority of servers have synchronized with the leader's state. State synchronization ensures that the leader and the other servers hold the same system state.

• To guarantee sequential consistency of transactions, Zookeeper identifies each transaction with an increasing transaction id (zxid). Every proposal is stamped with a zxid when it is made. In the implementation, zxid is a 64-bit number. The high 32 bits hold the epoch, which identifies whether leadership has changed: every newly elected leader gets a new epoch that marks its term. The low 32 bits are an incrementing counter.
• Each server is in one of three states while running:
　　LOOKING: the server does not know who the leader is and is searching for one
　　LEADING: the server is the elected leader
　　FOLLOWING: a leader has been elected and the server is synchronizing with it

Other documents: http://www.cnblogs.com/lpshou/archive/2013/06/14/3136738.html

2. Zookeeper's read and write mechanism

» Zookeeper is a cluster composed of multiple servers
» One leader, multiple followers
» Each server keeps a copy of the data
» Data is globally consistent
» Reads and writes are distributed across the servers
» Update requests are forwarded to the leader, which carries them out

3. Guarantee of Zookeeper

» Sequential execution: update requests are applied in order, and updates from the same client are executed in the order in which they were sent
» Atomicity: a data update either succeeds or fails, with no partial result
» Single data view: no matter which server a client connects to, it sees the same view of the data
» Timeliness: within a bounded period of time, a client can read the latest data

4. Data operation process of Zookeeper node


Note: 1. The Client sends a write request to a Follower

2. The Follower forwards the request to the Leader

3. The Leader receives it, initiates a vote, and notifies the Followers to vote

4. Each Follower sends its vote to the Leader

5. The Leader tallies the votes; if the write is approved, it performs the write, notifies the Followers to write as well, and then commits

6. The Follower returns the result of the request to the Client
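The six steps above can be sketched as a toy, single-process simulation. This is only an illustration under simplifying assumptions, not ZooKeeper's implementation: the class names `Server`, `Leader`, and `Follower` and the methods `handle_write` and `client_write` are hypothetical, a follower's vote is modeled simply as "reachable or not", and there is no networking, logging, or recovery.

```python
class Server:
    """One ensemble member holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True  # set to False to simulate a crashed server


class Leader(Server):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def handle_write(self, key, value):
        # Steps 3-4: propose the write and collect votes; the leader counts itself.
        acks = 1 + sum(1 for f in self.followers if f.up)
        ensemble_size = 1 + len(self.followers)
        if acks * 2 <= ensemble_size:
            return False  # no majority, so nothing is written
        # Step 5: apply locally, then tell the reachable followers to commit.
        self.data[key] = value
        for f in self.followers:
            if f.up:
                f.data[key] = value
        return True


class Follower(Server):
    def __init__(self, name):
        super().__init__(name)
        self.leader = None

    def client_write(self, key, value):
        # Steps 1-2: accept the client's write and forward it to the leader.
        ok = self.leader.handle_write(key, value)
        # Step 6: return the outcome to the client.
        return ok


def build_ensemble(follower_count):
    """Hypothetical helper: wire up one leader and its followers."""
    followers = [Follower(f"f{i}") for i in range(follower_count)]
    leader = Leader("leader", followers)
    for f in followers:
        f.leader = leader
    return leader, followers
```

With `leader, followers = build_ensemble(2)`, calling `followers[0].client_write("/config", "v1")` succeeds and the value appears on every reachable server; if too many followers are marked as down, the same call returns `False` and no server is changed.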

• A Follower has four main functions:
　　　　1. Send requests to the Leader (PING, REQUEST, ACK, and REVALIDATE messages);
　　　　2. Receive messages from the Leader and process them;
　　　　3. Receive Client requests and, if a request is a write, send it to the Leader for a vote;
　　　　4. Return results to the Client.
• The Follower's message loop processes the following types of messages from the Leader:
　　　　1. PING message: heartbeat;
　　　　2. PROPOSAL message: a proposal initiated by the Leader that the Follower must vote on;
　　　　3. COMMIT message: information about the latest proposal committed on the server side;
　　　　4. UPTODATE message: indicates that synchronization is complete;
　　　　5. REVALIDATE message: depending on the Leader's REVALIDATE result, either close the session being revalidated or allow it to accept messages;
　　　　6. SYNC message: return the SYNC result to the client; this message is originally initiated by the client to force the latest updates to be fetched.

5. Zookeeper leader election

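• A vote passes with more than half of the machines:
　　　　- 3 machines, 1 down: 2 remain, and 2 > 3/2, so the cluster keeps working
　　　　- 4 machines, 2 down: 2 remain, and 2 is not > 4/2, so the cluster cannot work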
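The majority rule in the two cases above comes down to one comparison. A minimal sketch follows; the function name `has_quorum` is an illustrative choice, not a ZooKeeper API.

```python
def has_quorum(alive: int, ensemble_size: int) -> bool:
    """Return True when strictly more than half of the ensemble is alive."""
    # Integer form of alive > ensemble_size / 2, avoiding float division.
    return alive * 2 > ensemble_size


print(has_quorum(2, 3))  # 3 machines, 1 down
print(has_quorum(2, 4))  # 4 machines, 2 down
```

The first call reports a working majority and the second does not, matching the two cases in the list.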

• A proposes: I want to elect myself. B, do you agree? C, do you agree? B says: I agree to elect A. C says: I agree to elect A. (More than half have already agreed at this point. In the real world the election would be over, but the computer world is strict, and to understand the algorithm we continue the simulation.)
• Next, B proposes: I want to elect myself. A, do you agree? A says: more than half have already agreed to elect me, so your proposal is invalid. C says: more than half have already agreed to elect A, so B's proposal is invalid.
• Then C proposes: I want to elect myself. A, do you agree? A says: more than half have already agreed to elect me, so your proposal is invalid. B says: more than half have already agreed to elect A, so C's proposal is invalid.
• Once the leader has been elected, all the followers can only follow the leader's orders. One small detail matters here: who starts first and therefore gets to propose first.

6. zxid

• The state information of a znode contains czxid, so what is a zxid?
• Every change to ZooKeeper's state corresponds to an increasing transaction id, called the zxid. Because zxids only increase, if zxid1 is less than zxid2, then zxid1 happened before zxid2.

Creating a node, updating a node's data, or deleting a node changes ZooKeeper's state and therefore increases the zxid.

7. Working principle of Zookeeper

» The core of Zookeeper is atomic broadcast, a mechanism that keeps the servers synchronized. The protocol that implements it is called the Zab protocol, and it has two modes: recovery mode and broadcast mode.

Zab enters recovery mode when the service starts or after the leader crashes. Recovery mode ends once a leader has been elected and a majority of servers have synchronized with the leader's state.

State synchronization ensures that the leader and the other servers hold the same system state.

» Once the leader has synchronized state with a majority of followers, it can start broadcasting messages, that is, it enters broadcast mode. A server that joins the Zookeeper service at this point starts in recovery mode, discovers the leader, and synchronizes state with it. When synchronization finishes, the new server also takes part in message broadcast. The Zookeeper service stays in broadcast mode until the leader crashes or loses the support of a majority of followers.

» Broadcast mode must ensure that proposals are processed in order, so zk uses an increasing transaction id (zxid) to guarantee this. Every proposal is stamped with a zxid when it is made.

In the implementation, zxid is a 64-bit number. The high 32 bits hold the epoch, which identifies whether leadership has changed: every newly elected leader gets a new epoch. The low 32 bits are an incrementing counter.
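The 64-bit layout described here can be illustrated with plain bit operations. This is a sketch of the layout only; the helper names `make_zxid`, `epoch_of`, and `counter_of` are hypothetical and are not part of ZooKeeper's API.

```python
MASK_32 = 0xFFFFFFFF


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into one zxid."""
    if not (0 <= epoch <= MASK_32 and 0 <= counter <= MASK_32):
        raise ValueError("epoch and counter must each fit in 32 bits")
    return (epoch << 32) | counter


def epoch_of(zxid: int) -> int:
    """Extract the leader epoch from the high 32 bits."""
    return zxid >> 32


def counter_of(zxid: int) -> int:
    """Extract the per-epoch counter from the low 32 bits."""
    return zxid & MASK_32
```

Because the epoch sits in the high bits, any zxid issued under a newer leader compares greater than every zxid from an older epoch, whatever the counter values are. That is why a plain integer comparison is enough to order transactions across leader changes.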

» When the leader crashes or loses the support of a majority of followers, zk enters recovery mode, in which a new leader is elected and all servers are restored to a correct state.

» After each server starts, it asks the other servers whom they intend to vote for.
» In reply to such an inquiry, each server returns, based on its own state, the id of the leader it recommends and the zxid of the last transaction it processed (at system startup every server recommends itself)
» After receiving all the replies, the server works out which server has the largest zxid and sets that server as the one it will vote for in the next round.
» The server that receives the most votes in this process is the winner. If the winner has more than half of the votes, it is elected leader. Otherwise the process repeats until a leader is elected
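The voting rounds described above can be sketched as a small simulation. It rests on several assumptions that the text does not state: every server sees every reply, ties on zxid are broken by the larger server id, and all servers update their votes in lock step. The function name `elect_leader` and its parameters are hypothetical.

```python
from collections import Counter


def elect_leader(last_zxid, max_rounds=10):
    """Simulate the rounds; last_zxid maps server id -> last processed zxid."""
    if not last_zxid:
        return None
    # At startup every server recommends itself.
    votes = {sid: sid for sid in last_zxid}
    for _ in range(max_rounds):
        tally = Counter(votes.values())
        winner, count = tally.most_common(1)[0]
        # The winner needs more than half of all votes.
        if count * 2 > len(last_zxid):
            return winner
        # Each server switches to the recommended candidate with the largest
        # zxid; the server-id tie-break is an assumption of this sketch.
        best = max(set(votes.values()), key=lambda sid: (last_zxid[sid], sid))
        votes = {sid: best for sid in last_zxid}
    return None
```

For example, `elect_leader({1: 100, 2: 105, 3: 105})` finds no majority in the first round because every server voted for itself, then converges on server 3 in the second round.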

» The leader starts waiting for servers to connect
» Each follower connects to the leader and sends it its largest zxid
» The leader determines the synchronization point from the follower's zxid
» When synchronization is complete, the leader notifies the follower that it is now up to date
» After receiving the UPTODATE message, the follower can again accept client requests and serve them
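One way to picture "determining the synchronization point" is a lookup in the leader's ordered transaction log. The sketch below covers only the simplest case, where the follower is merely behind and needs the entries it is missing; the log format and the name `entries_to_send` are assumptions made for illustration, and cases such as a follower that is ahead of the leader are not handled.

```python
from bisect import bisect_right


def entries_to_send(leader_log, follower_last_zxid):
    """Return the log entries a follower is missing.

    leader_log is a list of (zxid, operation) tuples sorted by zxid.
    """
    zxids = [zxid for zxid, _ in leader_log]
    # The synchronization point is the first entry newer than the follower's zxid.
    start = bisect_right(zxids, follower_last_zxid)
    return leader_log[start:]
```

A follower that reports the leader's newest zxid gets an empty list back, which corresponds to the leader immediately telling it that it is up to date.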

8. Data consistency and paxos algorithm

• The Paxos algorithm is said to be as well known for its difficulty as for its popularity, so we first look at how data consistency is maintained. The principle is:
• In a distributed database system, if every node starts from the same initial state and every node executes the same sequence of operations, then all nodes end up in the same state.
• What problem does the Paxos algorithm solve? It ensures that every node executes the same sequence of operations. That sounds simple: let a master maintain a global write queue and give every write operation a number in that queue. Then, however many nodes we write to, consistency is guaranteed as long as writes are applied in number order. That works, but what if the master goes down?
• The Paxos algorithm assigns global numbers to write operations by voting. Only one write operation is approved at a time, and concurrent writes compete for votes. Only a write that gets more than half of the votes is approved (so only one write is approved per round); the writes that lose must start another round of voting. Through round after round of voting, all write operations are strictly numbered and ordered. The numbers are strictly increasing: when a node has accepted a write numbered 100 and then receives a write numbered 99 (because of network delay or other unforeseen causes), it immediately realizes that its data is inconsistent, stops serving external requests, and restarts the synchronization process. The failure of any single node does not affect the data consistency of the whole cluster (2n+1 machines in total, unless more than n of them are down).
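The principle above (same initial state plus same operation sequence gives the same final state) and the reaction to an out-of-order write number can be sketched with a toy replica. This does not implement Paxos voting; it assumes the numbering has already been agreed on, and the class name `Replica` and its fields are hypothetical.

```python
class Replica:
    """A node that applies globally numbered writes in strictly increasing order."""

    def __init__(self):
        self.state = {}
        self.last_number = 0
        self.serving = True  # False once the replica knows it is inconsistent

    def apply(self, number, key, value):
        if not self.serving:
            return False
        if number <= self.last_number:
            # A write numbered at or below one already applied means something
            # was missed: stop serving so the replica can resynchronize.
            self.serving = False
            return False
        self.state[key] = value
        self.last_number = number
        return True


writes = [(1, "x", "a"), (2, "y", "b"), (3, "x", "c")]
r1, r2 = Replica(), Replica()
for w in writes:
    r1.apply(*w)
    r2.apply(*w)
print(r1.state == r2.state)  # both replicas applied the same sequence
```

A replica that applies write 100 and then receives write 99 rejects the late write and sets `serving` to `False`, which mirrors the "stop external service and resynchronize" behavior in the text.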
　 Summary
　　• As a sub-project of the Hadoop project, Zookeeper is an indispensable module for Hadoop cluster management. It is mainly used to coordinate data in the cluster.

For example, it manages the NameNode in a Hadoop cluster, and in HBase it handles Master election and state synchronization between servers.

For more on the Paxos algorithm, see the article "Zookeeper full analysis: Paxos as the soul" at https://www.douban.com/note/208430424/

Recommended book: "From Paxos to Zookeeper Distributed Consistency Principle and Practice"

9. Observer

• Zookeeper must guarantee high availability and strong consistency;
• Supporting more clients requires adding more servers;
• More servers increase the latency of the voting phase, which hurts performance;
• Observers are introduced to balance scalability against throughput;
• An Observer does not take part in voting;
• An Observer accepts client connections and forwards write requests to the leader node;
• Adding more Observer nodes improves scalability without reducing throughput

10. Why is the number of zookeeper clusters generally odd?

• The leader election algorithm follows a Paxos-like majority protocol;
• Core idea: data counts as written once a majority of servers have written it successfully. With 3 servers, 2 must succeed; with 4 or 5 servers, 3 must succeed.
• The number of servers is generally odd (3, 5, 7). With 3 servers, at most 1 may go down; with 4 servers, still at most 1 may go down.

The fault tolerance of 3 servers and 4 servers is therefore the same, so to save server resources we generally deploy an odd number of servers.
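The arithmetic behind this conclusion can be checked directly. A small sketch follows; `quorum_size` and `tolerated_failures` are illustrative names, not ZooKeeper functions.

```python
def quorum_size(servers: int) -> int:
    """Smallest number of servers that is strictly more than half."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers may go down while a majority still remains."""
    return servers - quorum_size(servers)


for n in range(3, 8):
    print(n, quorum_size(n), tolerated_failures(n))
```

Each even cluster size tolerates exactly as many failures as the odd size just below it (4 behaves like 3, and 6 like 5), so the extra server buys no additional fault tolerance.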

11. Data model of Zookeeper

» Hierarchical directory structure; naming follows conventional file system rules
» Each node is called a znode in Zookeeper and is identified by a unique path
» A znode can hold data and child nodes, but EPHEMERAL znodes cannot have child nodes
» The data in a znode is versioned: every update increments the znode's version number, and a client can pass the version it expects when updating or deleting, so that the operation fails if the data has changed in the meantime
» Client applications can set watches on a node
» Partial reads and writes are not supported; a node's data is read and written in full each time

12. Zookeeper's node

» Znodes are either ephemeral or persistent
» A znode's type is fixed when it is created and cannot be changed afterwards
» When the client session that created an ephemeral znode ends, Zookeeper deletes that znode. An ephemeral znode cannot have child nodes
» A persistent znode does not depend on a client session; it is deleted only when a client explicitly deletes it
» Combined with sequential numbering, this gives four types of directory nodes:
» PERSISTENT (persistent directory node)
» EPHEMERAL (ephemeral directory node)
» PERSISTENT_SEQUENTIAL (persistent, sequentially numbered directory node)
» EPHEMERAL_SEQUENTIAL (ephemeral, sequentially numbered directory node)
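The rules above can be modeled with a small in-memory toy. This is not a ZooKeeper client: the class `ToyZk`, its methods, and the session ids are hypothetical. The 10-digit zero-padded suffix imitates how ZooKeeper names sequential nodes, while the simple per-parent counter is a simplification.

```python
class ToyZk:
    """In-memory model of persistent, ephemeral, and sequential znodes."""

    def __init__(self):
        self.nodes = {"/": None}  # path -> owning session (None = persistent)
        self.counters = {}        # parent path -> next sequence number

    def create(self, path, session=None, ephemeral=False, sequential=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if self.nodes[parent] is not None:
            raise ValueError("ephemeral znodes cannot have child nodes")
        if ephemeral and session is None:
            raise ValueError("an ephemeral znode needs an owning session")
        if sequential:
            # Append a zero-padded counter that only grows within the parent.
            seq = self.counters.get(parent, 0)
            self.counters[parent] = seq + 1
            path = f"{path}{seq:010d}"
        if path in self.nodes:
            raise ValueError(f"{path} already exists")
        self.nodes[path] = session if ephemeral else None
        return path

    def close_session(self, session):
        # Ending a session deletes every ephemeral znode it created.
        for path in [p for p, owner in self.nodes.items() if owner == session]:
            del self.nodes[path]
```

Creating `/locks` as a persistent node and then `/locks/lock-` twice with `ephemeral=True, sequential=True` yields two distinct numbered children. Closing the owning session removes them while `/locks` stays in place.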
