1. Basic Architecture
2. The ZAB Protocol
ZooKeeper does not adopt the Paxos algorithm wholesale; instead it uses a protocol called ZooKeeper Atomic Broadcast (ZAB) as the core algorithm for data consistency.
2.1 Electing a Leader requires a majority of votes to succeed, and the Leader can begin serving only after more than half of the machines in the cluster have completed state synchronization (data synchronization) with it.
2.2 All transaction requests must be coordinated by a single, globally unique server, called the Leader; the remaining servers are called Followers. The Leader converts each client transaction request into a transaction Proposal and distributes it to all Followers in the cluster. The Leader then waits for the Followers' feedback; once more than half of the Followers have acknowledged correctly, the Leader distributes a Commit message to all Followers, instructing them to commit the preceding Proposal.
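The majority rule above can be sketched as follows. This is a minimal illustration, not ZooKeeper's actual classes: a proposal may be committed once more than half of the ensemble has ACKed it.

```java
import java.util.HashSet;
import java.util.Set;

public class QuorumSketch {
    private final int ensembleSize;
    private final Set<Long> acks = new HashSet<>(); // ids of servers that ACKed

    public QuorumSketch(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    // Record an ACK; return true once a strict majority has acknowledged,
    // at which point the leader may broadcast COMMIT.
    public boolean ack(long serverId) {
        acks.add(serverId);
        return acks.size() > ensembleSize / 2;
    }

    public static void main(String[] args) {
        QuorumSketch q = new QuorumSketch(5); // 5-node ensemble, majority = 3
        System.out.println(q.ack(1));         // false: 1 ACK
        System.out.println(q.ack(2));         // false: 2 ACKs
        System.out.println(q.ack(3));         // true: 3 > 5/2, COMMIT may be sent
    }
}
```

In the real implementation the leader also counts its own vote toward the quorum; the sketch treats it as just another server id.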
3. Leader and Follower Startup
4. Request Processing
4.1 Request Processing Chains
4.1.1 Leader request processing chain
4.1.2 Follower request processing chain
4.2 Processing Flow
Taking a create request handled by the leader as an example, the flow is as follows.
The difference between the FollowerZooKeeperServer and LeaderZooKeeperServer processing flows is that FollowerRequestProcessor forwards transaction requests to the leader, and SendAckRequestProcessor returns the acknowledgement of a transaction proposal to the leader; the rest of the chain is identical. The difference between SendAckRequestProcessor and AckRequestProcessor is that AckRequestProcessor is a local call on the leader. FollowerRequestProcessor handles transaction requests as follows:
- public void run() {
- try {
- while (!finished) {
- Request request = queuedRequests.take();
- if (LOG.isTraceEnabled()) {
- ZooTrace.logRequest(LOG, ZooTrace.CLIENT_REQUEST_TRACE_MASK,
- 'F', request, "");
- }
- if (request == Request.requestOfDeath) {
- break;
- }
- // We want to queue the request to be processed before we submit
- // the request to the leader so that we are ready to receive
- // the response
- nextProcessor.processRequest(request);
- // We now ship the request to the leader. As with all
- // other quorum operations, sync also follows this code
- // path, but different from others, we need to keep track
- // of the sync operations this follower has pending, so we
- // add it to pendingSyncs.
- switch (request.type) {
- case OpCode.sync:
- zks.pendingSyncs.add(request);
- zks.getFollower().request(request);
- break;
- case OpCode.create:
- case OpCode.delete:
- case OpCode.setData:
- case OpCode.setACL:
- case OpCode.createSession:
- case OpCode.closeSession:
- case OpCode.multi:
- zks.getFollower().request(request);
- break;
- }
- }
- } catch (Exception e) {
- LOG.error("Unexpected exception causing exit", e);
- }
- LOG.info("FollowerRequestProcessor exited loop!");
- }
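The processing chains in 4.1 follow the chain-of-responsibility pattern: each processor does its own work and hands the request to the next one. A hypothetical sketch of that pattern (the stage names are illustrative, not ZooKeeper's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

public class ChainSketch {
    interface Processor {
        void process(String request);
    }

    // Build one stage: record this stage's work, then forward down the chain.
    static Processor stage(List<String> log, String name, Processor next) {
        return req -> {
            log.add(name + ":" + req);
            if (next != null) {
                next.process(req);
            }
        };
    }

    // Wire a follower-style chain back to front and run one request through it.
    public static List<String> run(String req) {
        List<String> log = new ArrayList<>();
        Processor fin = stage(log, "Final", null);
        Processor commit = stage(log, "Commit", fin);
        Processor follower = stage(log, "Follower", commit);
        follower.process(req);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run("create /a"));
        // [Follower:create /a, Commit:create /a, Final:create /a]
    }
}
```

The real chains are wired the same way, each processor holding a reference to its `nextProcessor`, which is why FollowerRequestProcessor can both forward to the leader and pass the request onward locally.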
5. Data Synchronization
ZooKeeper cluster data synchronization falls into four categories: direct differential sync (DIFF), rollback followed by differential sync (TRUNC+DIFF), rollback sync (TRUNC), and full sync (SNAP). Before synchronizing, the leader initializes peerLastZxid (the last ZXID processed by the learner), minCommittedLog (the smallest ZXID in the leader's proposal cache queue committedLog), and maxCommittedLog (the largest ZXID in committedLog), then uses these three ZXID values to determine the sync type and perform the synchronization. See LearnerHandler's run method:
- .....
- long peerLastZxid;
- StateSummary ss = null;
- long zxid = qp.getZxid();
- long newEpoch = leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch);
- if (this.getVersion() < 0x10000) {
- // we are going to have to extrapolate the epoch information
- long epoch = ZxidUtils.getEpochFromZxid(zxid);
- ss = new StateSummary(epoch, zxid);
- // fake the message
- leader.waitForEpochAck(this.getSid(), ss);
- } else {
- byte ver[] = new byte[4];
- ByteBuffer.wrap(ver).putInt(0x10000);
- QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, ZxidUtils.makeZxid(newEpoch, 0), ver, null);
- oa.writeRecord(newEpochPacket, "packet");
- bufferedOutput.flush();
- QuorumPacket ackEpochPacket = new QuorumPacket();
- ia.readRecord(ackEpochPacket, "packet");
- if (ackEpochPacket.getType() != Leader.ACKEPOCH) {
- LOG.error(ackEpochPacket.toString()
- + " is not ACKEPOCH");
- return;
- }
- ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
- ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
- leader.waitForEpochAck(this.getSid(), ss);
- }
- peerLastZxid = ss.getLastZxid();
- /* the default to send to the follower */
- int packetToSend = Leader.SNAP;
- long zxidToSend = 0;
- long leaderLastZxid = 0;
- /** the packets that the follower needs to get updates from **/
- long updates = peerLastZxid;
- /* we are sending the diff check if we have proposals in memory to be able to
- * send a diff to the
- */
- ReentrantReadWriteLock lock = leader.zk.getZKDatabase().getLogLock();
- ReadLock rl = lock.readLock();
- try {
- rl.lock();
- final long maxCommittedLog = leader.zk.getZKDatabase().getmaxCommittedLog();
- final long minCommittedLog = leader.zk.getZKDatabase().getminCommittedLog();
- LOG.info("Synchronizing with Follower sid: " + sid
- +" maxCommittedLog=0x"+Long.toHexString(maxCommittedLog)
- +" minCommittedLog=0x"+Long.toHexString(minCommittedLog)
- +" peerLastZxid=0x"+Long.toHexString(peerLastZxid));
- LinkedList<Proposal> proposals = leader.zk.getZKDatabase().getCommittedLog();
- if (proposals.size() != 0) {
- LOG.debug("proposal size is {}", proposals.size());
- if ((maxCommittedLog >= peerLastZxid)
- && (minCommittedLog <= peerLastZxid)) {
- LOG.debug("Sending proposals to follower");
- // as we look through proposals, this variable keeps track of previous
- // proposal Id.
- long prevProposalZxid = minCommittedLog;
- // Keep track of whether we are about to send the first packet.
- // Before sending the first packet, we have to tell the learner
- // whether to expect a trunc or a diff
- boolean firstPacket=true;
- for (Proposal propose: proposals) {
- // skip the proposals the peer already has
- if (propose.packet.getZxid() <= peerLastZxid) {
- prevProposalZxid = propose.packet.getZxid();
- continue;
- } else {
- // If we are sending the first packet, figure out whether to trunc
- // in case the follower has some proposals that the leader doesn't
- if (firstPacket) {
- firstPacket = false;
- // Does the peer have some proposals that the leader hasn't seen yet
- if (prevProposalZxid < peerLastZxid) {
- // send a trunc message before sending the diff
- packetToSend = Leader.TRUNC;
- LOG.info("Sending TRUNC");
- zxidToSend = prevProposalZxid;
- updates = zxidToSend;
- }
- else {
- // Just send the diff
- packetToSend = Leader.DIFF;
- LOG.info("Sending diff");
- zxidToSend = maxCommittedLog;
- }
- }
- queuePacket(propose.packet);
- QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),
- null, null);
- queuePacket(qcommit);
- }
- }
- } else if (peerLastZxid > maxCommittedLog) {
- LOG.debug("Sending TRUNC to follower zxidToSend=0x{} updates=0x{}",
- Long.toHexString(maxCommittedLog),
- Long.toHexString(updates));
- packetToSend = Leader.TRUNC;
- zxidToSend = maxCommittedLog;
- updates = zxidToSend;
- } else {
- LOG.warn("Unhandled proposal scenario");
- }
- } else {
- // just let the state transfer happen
- LOG.debug("proposals is empty");
- }
- leaderLastZxid = leader.startForwarding(this, updates);
- if (peerLastZxid == leaderLastZxid) {
- LOG.debug("Leader and follower are in sync, sending empty diff. zxid=0x{}",
- Long.toHexString(leaderLastZxid));
- // We are in sync so we'll do an empty diff
- packetToSend = Leader.DIFF;
- zxidToSend = leaderLastZxid;
- }
- } finally {
- rl.unlock();
- }
- QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
- ZxidUtils.makeZxid(newEpoch, 0), null, null);
- if (getVersion() < 0x10000) {
- oa.writeRecord(newLeaderQP, "packet");
- } else {
- queuedPackets.add(newLeaderQP);
- }
- bufferedOutput.flush();
- //Need to set the zxidToSend to the latest zxid
- if (packetToSend == Leader.SNAP) {
- zxidToSend = leader.zk.getZKDatabase().getDataTreeLastProcessedZxid();
- }
- oa.writeRecord(new QuorumPacket(packetToSend, zxidToSend, null, null), "packet");
- bufferedOutput.flush();
- /* if we are not truncating or sending a diff just send a snapshot */
- if (packetToSend == Leader.SNAP) {
- LOG.info("Sending snapshot last zxid of peer is 0x"
- + Long.toHexString(peerLastZxid) + " "
- + " zxid of leader is 0x"
- + Long.toHexString(leaderLastZxid)
- + "sent zxid of db as 0x"
- + Long.toHexString(zxidToSend));
- // Dump data to peer
- leader.zk.getZKDatabase().serializeSnapshot(oa);
- oa.writeString("BenWasHere", "signature");
- }
- bufferedOutput.flush();
- // Start sending packets
- new Thread() {
- public void run() {
- Thread.currentThread().setName(
- "Sender-" + sock.getRemoteSocketAddress());
- try {
- sendPackets();
- } catch (InterruptedException e) {
- LOG.warn("Unexpected interruption",e);
- }
- }
- }.start();
- /*
- * Have to wait for the first ACK, wait until
- * the leader is ready, and only then we can
- * start processing messages.
- */
- qp = new QuorumPacket();
- ia.readRecord(qp, "packet");
- if(qp.getType() != Leader.ACK){
- LOG.error("Next packet was supposed to be an ACK");
- return;
- }
- LOG.info("Received NEWLEADER-ACK message from " + getSid());
- leader.waitForNewLeaderAck(getSid(), qp.getZxid(), getLearnerType());
- .....
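The branch logic above can be condensed into a small decision function. This is a simplification: it ignores the TRUNC+DIFF interleaving inside the proposal loop and the empty-DIFF case when leader and follower are already in sync.

```java
public class SyncDecision {
    public enum Type { DIFF, TRUNC, SNAP }

    // Pick the sync type from the three ZXIDs the leader initializes.
    public static Type decide(long peerLastZxid, long minCommittedLog, long maxCommittedLog) {
        if (peerLastZxid >= minCommittedLog && peerLastZxid <= maxCommittedLog) {
            return Type.DIFF;  // peer is inside the cached window: send missing proposals
        }
        if (peerLastZxid > maxCommittedLog) {
            return Type.TRUNC; // peer is ahead of the leader: roll it back
        }
        return Type.SNAP;      // peer is behind the cache window: full snapshot
    }

    public static void main(String[] args) {
        System.out.println(decide(0x105, 0x100, 0x110)); // DIFF
        System.out.println(decide(0x120, 0x100, 0x110)); // TRUNC
        System.out.println(decide(0x050, 0x100, 0x110)); // SNAP
    }
}
```

The TRUNC+DIFF case arises inside the DIFF branch of the real code: if the follower has proposals the leader never saw, a TRUNC to `prevProposalZxid` is sent before the diff packets.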
6. Watches
6.1 Server Side
At the end of the request processing chain, FinalRequestProcessor.processRequest() determines whether a watch needs to be handled.
Registering a watch calls DataTree.getData(), which registers the current ServerCnxn and path into dataWatches or childWatches. Taking getData as an example:
- case OpCode.getData: {
- ...
- byte b[] = zks.getZKDatabase().getData(getDataRequest.getPath(), stat,
- getDataRequest.getWatch() ? cnxn : null);
- rsp = new GetDataResponse(b, stat);
- break;
- }
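The bookkeeping behind this registration can be pictured as the two maps WatchManager keeps: path-to-watchers for triggering, and watcher-to-paths for cleanup when a connection closes. A standalone sketch, using plain strings in place of ServerCnxn:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WatchTableSketch {
    private final Map<String, Set<String>> watchTable = new HashMap<>();  // path -> watchers
    private final Map<String, Set<String>> watch2Paths = new HashMap<>(); // watcher -> paths

    // Register a watcher (here just a connection id) on a path, in both directions.
    public synchronized void addWatch(String path, String watcher) {
        watchTable.computeIfAbsent(path, k -> new HashSet<>()).add(watcher);
        watch2Paths.computeIfAbsent(watcher, k -> new HashSet<>()).add(path);
    }

    public synchronized int watcherCount(String path) {
        Set<String> w = watchTable.get(path);
        return w == null ? 0 : w.size();
    }

    public static void main(String[] args) {
        WatchTableSketch m = new WatchTableSketch();
        m.addWatch("/app", "cnxn-1");
        m.addWatch("/app", "cnxn-2");
        System.out.println(m.watcherCount("/app")); // 2
    }
}
```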
A watch is triggered when a transaction persists data: DataTree.processTxn calls WatchManager.triggerWatch(), which in turn invokes the process method of the registered ServerCnxn to send the notification back to the client. Taking create as an example:
- public String createNode(String path, byte data[], List<ACL> acl,
- long ephemeralOwner, int parentCVersion, long zxid, long time)
- throws KeeperException.NoNodeException,
- KeeperException.NodeExistsException {
- .....
- dataWatches.triggerWatch(path, Event.EventType.NodeCreated);
- childWatches.triggerWatch(parentName.equals("") ? "/" : parentName,
- Event.EventType.NodeChildrenChanged);
- return path;
- }
- public Set<Watcher> triggerWatch(String path, EventType type, Set<Watcher> supress) {
- WatchedEvent e = new WatchedEvent(type,
- KeeperState.SyncConnected, path);
- HashSet<Watcher> watchers;
- synchronized (this) {
- // a watch is removed once triggered, so server-side watches fire only once
- watchers = watchTable.remove(path);
- if (watchers == null || watchers.isEmpty()) {
- if (LOG.isTraceEnabled()) {
- ZooTrace.logTraceMessage(LOG,
- ZooTrace.EVENT_DELIVERY_TRACE_MASK,
- "No watchers for " + path);
- }
- return null;
- }
- for (Watcher w : watchers) {
- HashSet<String> paths = watch2Paths.get(w);
- if (paths != null) {
- paths.remove(path);
- }
- }
- }
- for (Watcher w : watchers) {
- if (supress != null && supress.contains(w)) {
- continue;
- }
- w.process(e);
- }
- return watchers;
- }
- @Override
- synchronized public void process(WatchedEvent event) {
- ReplyHeader h = new ReplyHeader(-1, -1L, 0);
- if (LOG.isTraceEnabled()) {
- ZooTrace.logTraceMessage(LOG, ZooTrace.EVENT_DELIVERY_TRACE_MASK,
- "Deliver event " + event + " to 0x"
- + Long.toHexString(this.sessionId)
- + " through " + this);
- }
- // Convert WatchedEvent to a type that can be sent over the wire
- WatcherEvent e = event.getWrapper();
- sendResponse(h, e, "notification");
- }
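Because triggerWatch() removes the whole watcher set for the path before notifying, a server-side watch fires at most once; a second event on the same path finds no watchers. A minimal sketch of that one-shot behavior:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class OneShotTrigger {
    private final Map<String, Set<String>> watchTable = new HashMap<>();

    public synchronized void addWatch(String path, String watcher) {
        watchTable.computeIfAbsent(path, k -> new HashSet<>()).add(watcher);
    }

    // Mirrors triggerWatch(): remove the watcher set for the path, then
    // hand it back so each watcher can be notified exactly once.
    public synchronized Set<String> trigger(String path) {
        Set<String> watchers = watchTable.remove(path); // one-shot removal
        return watchers == null ? Collections.emptySet() : watchers;
    }

    public static void main(String[] args) {
        OneShotTrigger t = new OneShotTrigger();
        t.addWatch("/node", "cnxn-1");
        System.out.println(t.trigger("/node").size()); // 1: fired
        System.out.println(t.trigger("/node").size()); // 0: already consumed
    }
}
```

This is why a client that wants continuous notifications must re-register its watch after each delivery.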
6.2 Client Side
When registering a watch, the client marks the current request as using a watch and wraps the path-to-watcher mapping in a DataWatchRegistration. Taking getData as an example:
- public byte[] getData(final String path, Watcher watcher, Stat stat)
- throws KeeperException, InterruptedException
- {
- .....
- // the watch contains the un-chroot path
- WatchRegistration wcb = null;
- if (watcher != null) {
- wcb = new DataWatchRegistration(watcher, clientPath);
- }
- ......
- RequestHeader h = new RequestHeader();
- h.setType(ZooDefs.OpCode.getData);
- GetDataRequest request = new GetDataRequest();
- request.setPath(serverPath);
- request.setWatch(watcher != null);
- GetDataResponse response = new GetDataResponse();
- ReplyHeader r = cnxn.submitRequest(h, request, response, wcb);
- .....
- }
- private void finishPacket(Packet p) {
- if (p.watchRegistration != null) {
- p.watchRegistration.register(p.replyHeader.getErr());
- }
- if (p.cb == null) {
- synchronized (p) {
- p.finished = true;
- p.notifyAll();
- }
- } else {
- p.finished = true;
- eventThread.queuePacket(p);
- }
- }
- public void createBB() {
- try {
- ByteArrayOutputStream baos = new ByteArrayOutputStream();
- BinaryOutputArchive boa = BinaryOutputArchive.getArchive(baos);
- boa.writeInt(-1, "len"); // We'll fill this in later
- if (requestHeader != null) {
- requestHeader.serialize(boa, "header");
- }
- if (request instanceof ConnectRequest) {
- request.serialize(boa, "connect");
- // append "am-I-allowed-to-be-readonly" flag
- boa.writeBool(readOnly, "readOnly");
- } else if (request != null) {
- request.serialize(boa, "request");
- }
- baos.close();
- this.bb = ByteBuffer.wrap(baos.toByteArray());
- this.bb.putInt(this.bb.capacity() - 4);
- this.bb.rewind();
- } catch (IOException e) {
- LOG.warn("Ignoring unexpected exception", e);
- }
- }
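createBB() uses a common framing trick: reserve four bytes for the length, serialize the payload, then backfill the real length before sending. A standalone sketch of the same pattern using only ByteBuffer:

```java
import java.nio.ByteBuffer;

public class LengthPrefixSketch {
    // Frame a payload with a 4-byte length prefix, filled in after the fact.
    public static ByteBuffer frame(byte[] payload) {
        ByteBuffer bb = ByteBuffer.allocate(4 + payload.length);
        bb.putInt(-1);                    // placeholder, like createBB's writeInt(-1, "len")
        bb.put(payload);
        bb.putInt(0, bb.capacity() - 4);  // backfill: length excludes the prefix itself
        bb.rewind();
        return bb;
    }

    public static void main(String[] args) {
        ByteBuffer bb = frame(new byte[]{1, 2, 3});
        System.out.println(bb.getInt()); // 3: the backfilled length
    }
}
```

Writing the placeholder first avoids a second serialization pass: the length is not known until the record has been fully serialized.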
- void readResponse(ByteBuffer incomingBuffer) throws IOException {
- .....
- if (replyHdr.getXid() == -1) {
- // -1 means notification
- if (LOG.isDebugEnabled()) {
- LOG.debug("Got notification sessionid:0x"
- + Long.toHexString(sessionId));
- }
- WatcherEvent event = new WatcherEvent();
- event.deserialize(bbia, "response");
- // convert from a server path to a client path
- if (chrootPath != null) {
- String serverPath = event.getPath();
- if(serverPath.compareTo(chrootPath)==0)
- event.setPath("/");
- else if (serverPath.length() > chrootPath.length())
- event.setPath(serverPath.substring(chrootPath.length()));
- else {
- LOG.warn("Got server path " + event.getPath()
- + " which is too short for chroot path "
- + chrootPath);
- }
- }
- WatchedEvent we = new WatchedEvent(event);
- if (LOG.isDebugEnabled()) {
- LOG.debug("Got " + we + " for sessionid 0x"
- + Long.toHexString(sessionId));
- }
- eventThread.queueEvent( we );
- return;
- }
- ...
- /*
- * Since requests are processed in order, we better get a response
- * to the first request!
- */
- try {
- ...
- } finally {
- finishPacket(packet);
- }
- }
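readResponse() also converts the server path back to a client path when a chroot is configured. That conversion logic, extracted into a standalone sketch:

```java
public class ChrootSketch {
    // Strip the chroot prefix from a server path, mirroring the branch in readResponse().
    public static String toClientPath(String serverPath, String chroot) {
        if (chroot == null) {
            return serverPath;
        }
        if (serverPath.equals(chroot)) {
            return "/";
        }
        if (serverPath.length() > chroot.length()) {
            return serverPath.substring(chroot.length());
        }
        return serverPath; // too short for the chroot: left unchanged (the real code logs a warning)
    }

    public static void main(String[] args) {
        System.out.println(toClientPath("/app/node", "/app")); // /node
        System.out.println(toClientPath("/app", "/app"));      // /
    }
}
```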
- public void queueEvent(WatchedEvent event) {
- if (event.getType() == EventType.None
- && sessionState == event.getState()) {
- return;
- }
- sessionState = event.getState();
- // materialize the watchers based on the event
- WatcherSetEventPair pair = new WatcherSetEventPair(
- watcher.materialize(event.getState(), event.getType(),
- event.getPath()),
- event);
- // queue the pair (watch set & event) for later processing
- waitingEvents.add(pair);
- }
- public Set<Watcher> materialize(Watcher.Event.KeeperState state,
- Watcher.Event.EventType type,
- String clientPath)
- {
- Set<Watcher> result = new HashSet<Watcher>();
- switch (type) {
- case None:
- result.add(defaultWatcher);
- boolean clear = ClientCnxn.getDisableAutoResetWatch() &&
- state != Watcher.Event.KeeperState.SyncConnected;
- synchronized(dataWatches) {
- for(Set<Watcher> ws: dataWatches.values()) {
- result.addAll(ws);
- }
- if (clear) {
- dataWatches.clear();
- }
- }
- synchronized(existWatches) {
- for(Set<Watcher> ws: existWatches.values()) {
- result.addAll(ws);
- }
- if (clear) {
- existWatches.clear();
- }
- }
- synchronized(childWatches) {
- for(Set<Watcher> ws: childWatches.values()) {
- result.addAll(ws);
- }
- if (clear) {
- childWatches.clear();
- }
- }
- return result;
- // the watch is fetched via remove() from dataWatches/existWatches/childWatches,
- // so client-side watches are likewise registered once and then removed
- case NodeDataChanged:
- case NodeCreated:
- synchronized (dataWatches) {
- addTo(dataWatches.remove(clientPath), result);
- }
- synchronized (existWatches) {
- addTo(existWatches.remove(clientPath), result);
- }
- break;
- case NodeChildrenChanged:
- synchronized (childWatches) {
- addTo(childWatches.remove(clientPath), result);
- }
- break;
- case NodeDeleted:
- synchronized (dataWatches) {
- addTo(dataWatches.remove(clientPath), result);
- }
- // XXX This shouldn't be needed, but just in case
- synchronized (existWatches) {
- Set<Watcher> list = existWatches.remove(clientPath);
- if (list != null) {
- addTo(existWatches.remove(clientPath), result);
- LOG.warn("We are triggering an exists watch for delete! Shouldn't happen!");
- }
- }
- synchronized (childWatches) {
- addTo(childWatches.remove(clientPath), result);
- }
- break;
- default:
- String msg = "Unhandled watch event type " + type
- + " with state " + state + " on path " + clientPath;
- LOG.error(msg);
- throw new RuntimeException(msg);
- }
- return result;
- }