ZooKeeper and the ZAB protocol

Foreword

ZooKeeper provides a highly available, consistent, high-performance storage service with sequential access guarantees. ZAB (ZooKeeper Atomic Broadcast) is the atomic broadcast protocol designed specifically for ZooKeeper to support data consistency.

Demo environment

$ uname -a
Darwin 18.6.0 Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64 x86_64

Installation

brew cask install java
brew install zookeeper

Configuration

This demo deploys a pseudo-cluster of three ZooKeeper processes on the same machine.

$ cat /usr/local/etc/zookeeper/zoo1.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/local/var/run/zookeeper/data1
clientPort=2181
server.1=localhost:2888:3888
server.2=localhost:4888:5888
server.3=localhost:6888:7888
$ echo "1" >
/usr/local/var/run/zookeeper/data1/myid
  • tickTime: ZooKeeper's basic time unit, in milliseconds; the default is 2000. It is used to regulate heartbeats and timeouts.
  • initLimit: the default is 10, i.e. 10 times the tickTime value. It is the maximum time a follower is allowed to connect to and synchronize with the leader. If ZooKeeper manages a large amount of data, this value can be increased (see the worked example after this list).
  • syncLimit: the default is 5, i.e. 5 times the tickTime value. It is the maximum heartbeat delay between the leader and a follower. If a follower cannot communicate with the leader within this time, the follower is dropped.
  • dataDir: the directory where ZooKeeper stores in-memory database snapshots; unless another directory is specified, the transaction log of database updates is stored here as well. dataLogDir can be configured to store the ZooKeeper transaction logs in a separate directory.
  • clientPort: the port on which the server listens for client connections; the default is 2181.
  • server.id=host:port1:port2: in cluster mode, this is how instances discover each other; each line describes one ZooKeeper instance. id is the Server ID that identifies the instance within the cluster, and the same id must be written to the myid file in that instance's dataDir. port1 is used for data synchronization and port2 for leader election.
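
As a quick sanity check on the values above (plain arithmetic on the settings shown, nothing extra is configured): with tickTime=2000 ms, initLimit=10 allows a follower 10 × 2000 ms = 20 s to connect to and synchronize with the leader, and syncLimit=5 tolerates at most 5 × 2000 ms = 10 s of heartbeat silence before a follower is dropped.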

The configuration of the second cluster instance is:

$ cat /usr/local/etc/zookeeper/zoo2.cfg 
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/local/var/run/zookeeper/data2
clientPort=2182
server.1=localhost:2888:3888
server.2=localhost:4888:5888
server.3=localhost:6888:7888
$ cat /usr/local/var/run/zookeeper/data2/myid 
2

The configuration of the third cluster instance is:

$ cat /usr/local/etc/zookeeper/zoo3.cfg 
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/local/var/run/zookeeper/data3
clientPort=2183
server.1=localhost:2888:3888
server.2=localhost:4888:5888
server.3=localhost:6888:7888
$ cat /usr/local/var/run/zookeeper/data3/myid 
3

Start the cluster:

$ zkServer start /usr/local/etc/zookeeper/zoo1.cfg
ZooKeeper JMX enabled by default
Using config: /usr/local/etc/zookeeper/zoo1.cfg
Starting zookeeper ... STARTED
$ zkServer start /usr/local/etc/zookeeper/zoo2.cfg
ZooKeeper JMX enabled by default
Using config: /usr/local/etc/zookeeper/zoo2.cfg
Starting zookeeper ... STARTED
$ zkServer start /usr/local/etc/zookeeper/zoo3.cfg
ZooKeeper JMX enabled by default
Using config: /usr/local/etc/zookeeper/zoo3.cfg
Starting zookeeper ... STARTED
$ zkServer status /usr/local/etc/zookeeper/zoo1.cfg
ZooKeeper JMX enabled by default
Using config: /usr/local/etc/zookeeper/zoo1.cfg
Mode: follower
$ zkServer status /usr/local/etc/zookeeper/zoo2.cfg
ZooKeeper JMX enabled by default
Using config: /usr/local/etc/zookeeper/zoo2.cfg
Mode: leader
$ zkServer status /usr/local/etc/zookeeper/zoo3.cfg
ZooKeeper JMX enabled by default
Using config: /usr/local/etc/zookeeper/zoo3.cfg
Mode: follower

As the status checks above show, the second instance is the leader and the other two instances are followers.

Operation

Here I demonstrate reading and writing nodes in the cluster.

$ zkCli -server localhost:2182
Connecting to localhost:2182
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2182(CONNECTED) 2] create /test "i am test"
Created /test
[zk: localhost:2182(CONNECTED) 3] get /test
i am test
cZxid = 0x200000002
ctime = Tue Jul 02 16:35:15 CST 2019
mZxid = 0x200000002
mtime = Tue Jul 02 16:35:15 CST 2019
pZxid = 0x200000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 9
numChildren = 0

The /test node was created through instance 2 and can also be read from instance 3, which shows that the data in the ZooKeeper cluster is consistent.
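
To make the same check programmatically, here is a minimal sketch using the official ZooKeeper Java client (an assumption of this example: the zookeeper client library is on the classpath and the three-instance cluster from this demo is running). It connects to the third instance on port 2183 and reads the node created through the second instance:

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ReadFromFollower {
    public static void main(String[] args) throws Exception {
        // Connect to the third instance (a follower in this demo).
        ZooKeeper zk = new ZooKeeper("localhost:2183", 5000, event -> {});
        Stat stat = new Stat();
        byte[] data = zk.getData("/test", false, stat);
        // The node created through instance 2 is visible here as well.
        System.out.println(new String(data) + " (czxid=0x" + Long.toHexString(stat.getCzxid()) + ")");
        zk.close();
    }
}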

ZooKeeper semantic guarantees

ZooKeeper is simple and efficient, and it provides the following semantic guarantees, which let us build more complex services on top of it:

  • Sequential consistency: update requests from a client are applied to ZooKeeper in the order in which they were sent.
  • Atomicity: an update either succeeds or fails; there is no intermediate state.
  • Reliability: once an update has been accepted it will not be lost accidentally, unless it is overwritten by a later update.
  • Eventual consistency: a write eventually (but not necessarily immediately) becomes visible to clients (see the sketch after this list).
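
Because each server answers reads from its own copy of the data, a client that needs to observe the very latest write can call sync() before reading. Below is a minimal sketch under the same assumptions as the previous example (demo cluster running, Java client on the classpath):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class SyncThenRead {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2183", 5000, event -> {});
        CountDownLatch latch = new CountDownLatch(1);
        // sync() asks this server to catch up with the leader before we read,
        // trading a little latency for an up-to-date view of /test.
        zk.sync("/test", (rc, path, ctx) -> latch.countDown(), null);
        latch.await();
        byte[] data = zk.getData("/test", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}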

ZooKeeper Watch mechanism

Every ZooKeeper read operation can carry a Watch. Once the watched data changes, the Watch is triggered.

Watch has the following characteristics (a minimal usage sketch follows the list):

  • Active push: when a Watch is triggered, the ZooKeeper server actively pushes the notification to the client; the client does not need to poll.
  • One-shot: a Watch fires only once when the data changes. If the client wants to be notified of later changes as well, it must register a new Watch after the previous one has been triggered.
  • Ordering: if several updates trigger several Watches, the order in which the Watches fire matches the order of the updates.
  • Visibility: if a client registers a Watch in a read request and, after the Watch is triggered, reads the data again, the client can never see the updated data before it has received the Watch notification. In other words, the update notification precedes the visibility of the update result.
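
The one-shot behavior is easy to see with the Java client. This sketch (same assumptions as before: the demo cluster is running and /test exists) registers a watch on /test and, when it fires, re-registers before reading the new value so that further changes keep producing notifications:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});
        watchNode(zk, "/test");
        Thread.sleep(60_000);   // keep the session alive; change /test from zkCli to see notifications
        zk.close();
    }

    // Read the node and register a watch. Watches are one-shot, so the
    // callback registers a new watch (by calling watchNode again) before reading.
    static void watchNode(ZooKeeper zk, String path) throws Exception {
        Watcher watcher = (WatchedEvent event) -> {
            if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                try {
                    watchNode(zk, path);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
        byte[] data = zk.getData(path, watcher, null);
        System.out.println("current value: " + new String(data));
    }
}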

The ZAB protocol

To guarantee both consistency and availability for write operations, ZooKeeper designed, on the basis of Paxos, a consistency protocol that supports crash recovery, called ZooKeeper Atomic Broadcast (ZAB). Based on this protocol, ZooKeeper implements a leader/follower (master-slave) architecture that keeps the data replicas across the cluster consistent.

According to the ZAB protocol, all write operations must go through the leader: the leader writes to its local log and then replicates the change to all follower nodes. If a client sends a write request to a follower/observer, the follower/observer forwards the request to the leader; after the leader finishes processing it, the result is forwarded back to the follower/observer, which returns it to the client.

The ZAB protocol has two modes: broadcast mode and crash recovery mode.

The steps the leader takes to process a write request (broadcast mode) are:

1. The leader generates a unique transaction ID (ZXID) for the request; ZAB sorts and processes each transaction Proposal in ZXID order.

2. The leader sends the Proposal to the followers and waits for their ACK replies.

3. After receiving ACKs from more than half of the servers (the leader counts its own ACK by default), the leader sends a commit to all followers/observers and also commits the transaction itself.

4. The processing result is returned to the client.

The process above is ZooKeeper's two-phase commit; a simplified code sketch follows.
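
The sketch below only illustrates the quorum logic of the four steps above; it is not ZooKeeper's actual implementation, and names such as LeaderSketch, FollowerStub, propose and commit are invented for the example:

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of ZAB broadcast mode (two-phase commit), not real ZooKeeper code.
class LeaderSketch {
    private final AtomicLong zxid = new AtomicLong();       // low 32 bits: per-epoch counter
    private final List<FollowerStub> followers;

    LeaderSketch(List<FollowerStub> followers) { this.followers = followers; }

    String handleWrite(byte[] txn) {
        long id = zxid.incrementAndGet();                   // 1. assign a monotonically increasing ZXID
        int acks = 1;                                       //    the leader counts its own ACK
        for (FollowerStub f : followers) {                  // 2. send the Proposal and collect ACKs
            if (f.propose(id, txn)) acks++;
        }
        int quorum = (followers.size() + 1) / 2 + 1;        //    a quorum is more than half of all servers
        if (acks < quorum) {
            return "no quorum for zxid " + id;
        }
        for (FollowerStub f : followers) {
            f.commit(id);                                   // 3. broadcast COMMIT; the leader also commits locally
        }
        return "committed zxid " + id;                      // 4. the result is returned to the client
    }
}

interface FollowerStub {
    boolean propose(long zxid, byte[] txn);  // returns true when the follower ACKs the Proposal
    void commit(long zxid);
}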

Crash Recovery:

When the leader crashes, or loses contact with more than half of the followers because of network problems, the cluster enters the crash recovery phase.

The leader going down, or losing contact with more than half of the followers, triggers a new leader election (the election algorithm is explained later in this article). Immediately after the election, the cluster enters data recovery and synchronization to guarantee data consistency. Two things must be ensured: transactions that have already been committed are committed on every server, and transactions that were only proposed on the old leader are discarded. For this reason the newly elected leader must hold the highest ZXID in the cluster. Data synchronization starts once the new leader has been elected: the leader sends each follower the transactions it has not yet synchronized, as Proposal messages, each followed by a commit message indicating that the transaction has been committed. The follower then applies the synchronized transaction Proposals to its local database.

A ZXID is a 64-bit unsigned integer. The high 32 bits are the epoch, which identifies the leader's term; the low 32 bits are a counter that is reset after each round of elections. Every time the leader generates a new transaction, the low 32 bits of the ZXID are incremented by one; every time a leader election completes, the high 32 bits are incremented by one. This guarantees that the ZXIDs generated by a new leader are always larger than those generated by the old leader.
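
A small sketch of how a ZXID with this layout can be composed and decomposed (the bit layout follows the description above; the helper names are invented):

// Illustrative helpers for the ZXID layout: high 32 bits = epoch, low 32 bits = counter.
final class ZxidSketch {
    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }
    static long epoch(long zxid)   { return zxid >>> 32; }
    static long counter(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long zxid = make(2, 2);                        // e.g. the cZxid 0x200000002 in the zkCli output above
        System.out.println(Long.toHexString(zxid));    // 200000002
        // A new election increments the epoch, so every later ZXID compares greater.
        System.out.println(make(3, 0) > zxid);         // true
    }
}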

Leader election algorithm

Server states:

  • LOOKING: the leader is not yet determined. A server in this state does not know who the current leader is and will initiate a leader election.
  • FOLLOWING: follower state. The server's current role is Follower, and it knows which server is the Leader.
  • LEADING: leader state. The server's current role is Leader, and it maintains heartbeats with the Followers.
  • OBSERVING: observer state. The server's current role is Observer; unlike a Follower, it takes part neither in elections nor in voting on cluster write operations.

Ballot data structure:

During a leader election, each server sends a ballot containing the following key information:

  • logicClock: each server maintains an auto-incrementing integer called logicClock, which indicates the round of voting the server has initiated.
  • state: the current state of the server.
  • self_id: the myid of the current server.
  • self_zxid: the maximum zxid of the data stored on the current server.
  • vote_id: the myid of the server being voted for.
  • vote_zxid: the maximum zxid of the data stored on the server being voted for.

Ballot comparison (fast leader election algorithm):

A ballot PK compares the server's own (logicClock, self_zxid, self_id) with the received (vote_logicClock, vote_zxid, vote_id): first compare the logicClock; if equal, compare the zxid; if the zxid values are also equal, compare the myid. If the received ballot is larger than the server's own, the server changes its vote accordingly and broadcasts the new ballot.
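
The comparison maps directly onto a lexicographic compare over (logicClock, zxid, myid). Below is a minimal sketch of that rule only; it is not the actual FastLeaderElection source, and the class and field names are invented:

import java.util.Comparator;

// Illustrative ballot for the election PK: compare logicClock, then zxid, then myid.
class Ballot {
    final long logicClock, zxid, myid;

    Ballot(long logicClock, long zxid, long myid) {
        this.logicClock = logicClock; this.zxid = zxid; this.myid = myid;
    }

    // Lexicographic order over (logicClock, zxid, myid) -- the ballot PK described above.
    static final Comparator<Ballot> PK = Comparator
            .<Ballot>comparingLong(b -> b.logicClock)
            .thenComparingLong(b -> b.zxid)
            .thenComparingLong(b -> b.myid);

    public static void main(String[] args) {
        Ballot mine     = new Ballot(1, 0x200000002L, 1);  // this server's current vote
        Ballot received = new Ballot(1, 0x200000002L, 3);  // ballot received from another server
        if (PK.compare(received, mine) > 0) {
            mine = received;   // the received ballot wins: adopt it and broadcast the new vote
        }
        System.out.println("voting for server " + mine.myid);
    }
}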

Summary

The article began with ZooKeeper deployment and basic operations to give the reader an intuitive feel for the system, and then introduced the principles of the ZAB protocol and ZooKeeper's leader election.

References

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription

https://dbaplus.cn/news-141-1875-1.html

 
