ZooKeeper leader election

Election environment

QuorumCnxManager

QuorumCnxManager 
   QuorumCnxManager.Listener 
 QuorumCnxManager.SendWorker
        final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap;
 QuorumCnxManager.RecvWorker
        public final ArrayBlockingQueue<Message> recvQueue;

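QuorumCnxManager, Listener, SendWorker and RecvWorker have a clear division of labor. More precisely, the QuorumCnxManager class as a whole has one job:
listen on the election port, send messages and read messages. Within it:

  • Listener accepts connections that other servers open to this one. It also applies the check (sid < this.mySid), which took me a while to understand.
    It is a simple tie-breaking rule used during election: a connection opened by a server whose myid is smaller than mine is closed, and this server connects to that peer itself instead, so each pair of servers ends up with a single connection.
  • SendWorker uses the connection set up through Listener to send (vote) messages to the corresponding server.
  • RecvWorker reads (vote) messages from other servers and puts them on the receive queue.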
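
The sid rule described for Listener can be sketched as a single comparison. This is a minimal illustration, not ZooKeeper code: `ConnectionRule` and `keepConnection` are hypothetical names, and the sketch assumes the rule is "keep the connection only when the side that opened it has the larger sid".

```java
// Hypothetical helper, not part of ZooKeeper: it only models the sid tie-break.
final class ConnectionRule {
    private ConnectionRule() {}

    /**
     * Returns true when a connection opened by initiatorSid towards
     * acceptorSid is kept, i.e. when the initiator has the larger server id.
     */
    static boolean keepConnection(long initiatorSid, long acceptorSid) {
        return initiatorSid > acceptorSid;
    }

    public static void main(String[] args) {
        // Server 3 dials server 1: the connection is kept.
        System.out.println(keepConnection(3, 1));
        // Server 1 dials server 3: dropped, server 3 dials back instead.
        System.out.println(keepConnection(1, 3));
    }
}
```

With this rule only one direction survives for any pair of servers, so two servers that dial each other at the same time do not end up with two sockets.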

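Among QuorumCnxManager's inner classes, Message is the only one that carries data.
QuorumCnxManager only exchanges messages with other servers. It neither produces nor interprets them; that is left to the election algorithm.
ZooKeeper ships several election algorithms, but the older ones have been deprecated.
The default is FastLeaderElection, which corresponds to electionAlg=3 in the configuration file.
All message handling lives in the election algorithm, and the server's state is also changed in that class.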

QuorumPeer.createElectionAlgorithm
 protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;          
        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
        case 0:
            le = new LeaderElection(this);
            break;
        case 1:
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
            le = new AuthFastLeaderElection(this, true);
            break;
        case 3:
            qcm = createCnxnManager();
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                listener.start();
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }

FastLeaderElection

Message body definition
static public class ToSend {
    static enum mType {crequest, challenge, notification, ack}

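     ToSend(mType type,          // message type, one of the enum values above
            long leader,         // proposed leader, taken from QuorumPeer
            long zxid,           // zxid of the proposed leader, taken from QuorumPeer
            long electionEpoch,  // logical clock
            ServerState state,   // state of the sending server
            long sid,            // sid of the server the message is sent to
            long peerEpoch)      // epoch of the proposed leader

     Initial value of peerEpoch:
         public long getCurrentEpoch() throws IOException {
             if (currentEpoch == -1) {
                 currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);
             }
             return currentEpoch;
         }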
Message assembly
A message is 40 bytes in total
static ByteBuffer buildMsg(int state,
            long leader,
            long zxid,
            long electionEpoch,
            long epoch) {
        byte requestBytes[] = new byte[40];
        ByteBuffer requestBuffer = ByteBuffer.wrap(requestBytes);

        /*
         * Building notification packet to send 
         */

        requestBuffer.clear();
        requestBuffer.putInt(state);
        requestBuffer.putLong(leader);
        requestBuffer.putLong(zxid);
        requestBuffer.putLong(electionEpoch);
        requestBuffer.putLong(epoch);
        requestBuffer.putInt(Notification.CURRENTVERSION);
        
        return requestBuffer;
    }
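
To make the 40-byte layout concrete, here is a small round-trip sketch: 4 bytes of state, four 8-byte longs and a 4-byte version. `NotificationLayout` is a hypothetical class, and the version value 0x1 is an assumption standing in for Notification.CURRENTVERSION.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the notification layout, not ZooKeeper code.
final class NotificationLayout {
    // Assumed stand-in for Notification.CURRENTVERSION.
    static final int ASSUMED_VERSION = 0x1;

    /** Packs the fields in the same order as buildMsg and rewinds the buffer for reading. */
    static ByteBuffer encode(int state, long leader, long zxid,
                             long electionEpoch, long peerEpoch) {
        ByteBuffer buf = ByteBuffer.allocate(40);
        buf.putInt(state);            // bytes 0..3
        buf.putLong(leader);          // bytes 4..11
        buf.putLong(zxid);            // bytes 12..19
        buf.putLong(electionEpoch);   // bytes 20..27
        buf.putLong(peerEpoch);       // bytes 28..35
        buf.putInt(ASSUMED_VERSION);  // bytes 36..39
        buf.flip();
        return buf;
    }

    /** Reads the fields back as {state, leader, zxid, electionEpoch, peerEpoch, version}. */
    static long[] decode(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        view.clear(); // position = 0, limit = capacity; the content is untouched
        return new long[] {
            view.getInt(), view.getLong(), view.getLong(),
            view.getLong(), view.getLong(), view.getInt()
        };
    }
}
```

The first 28 bytes (state, leader, zxid, electionEpoch) are exactly the older message format, which is why a receiver can still accept 28-byte messages.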

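Two threads
  • WorkerSender takes messages from sendqueue and hands them to QuorumCnxManager, which puts each one on the queue for the target sid in queueSendMap for sending.
  • WorkerReceiver does light processing of received messages, inspects them and, where needed, replies to the sending server with this server's up-to-date vote.
    In this version a message is 40 bytes in total.

These two threads handle sending and collecting messages, both through QuorumCnxManager: outgoing messages go into its queueSendMap and incoming messages are taken from its recvQueue.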

Main logic walkthrough

1. response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS); ==>QuorumCnxManager.recvQueue.poll(timeout, unit);  
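   This takes a message from QuorumCnxManager's receive queue.
2. Check whether the sender's myid belongs to a voting member listed in the configuration. If it does not, reply to that server with the current vote.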
 .............................   
QuorumPeer
    public Map<Long,QuorumPeer.QuorumServer> getView() {
        return Collections.unmodifiableMap(this.quorumPeers);
    }

    /**
     * Observers are not contained in this view, only nodes with 
     * PeerType=PARTICIPANT.
     */
    public Map<Long,QuorumPeer.QuorumServer> getVotingView() {
        return QuorumPeer.viewToVotingView(getView());
    }

QuorumPeerMain
    quorumPeer.setQuorumPeers(config.getServers());   
.............................
    if(!self.getVotingView().containsKey(response.sid)){
          Vote current = self.getCurrentVote();
           ToSend notmsg = new ToSend(ToSend.mType.notification,
          current.getId(),
          current.getZxid(),
          logicalclock.get(),
          self.getPeerState(),
          response.sid,
          current.getPeerEpoch());
          sendqueue.offer(notmsg); 
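 3. If the sender is a voting member, continue with the logic below
    1. Validate the message. Older versions used 28-byte messages, so anything shorter than 28 bytes is dropped. Otherwise the buffer is reset for reading with buffer.clear(), which sets position=0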
   /*
    * We check for 28 bytes for backward compatibility
   */
    if (response.buffer.capacity() < 28) {
          LOG.error("Got a short response: "
            + response.buffer.capacity());
               continue;
          }
        boolean backCompatibility = (response.buffer.capacity() == 28);
       response.buffer.clear();
    2. Read the message out of the buffer
        Notification n = new Notification();
                            // State of peer that sent this message
         QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
        switch (response.buffer.getInt()) {
        case 0:
          ackstate = QuorumPeer.ServerState.LOOKING;
            break;
         case 1:
          ackstate = QuorumPeer.ServerState.FOLLOWING;
           break;
         case 2:
          ackstate = QuorumPeer.ServerState.LEADING;
             break;
         case 3:
          ackstate = QuorumPeer.ServerState.OBSERVING;
              break;
          default:
                 continue;
            }
                            
           n.leader = response.buffer.getLong();
           n.zxid = response.buffer.getLong();
           n.electionEpoch = response.buffer.getLong();
           n.state = ackstate;
           n.sid = response.sid;
           if(!backCompatibility){
                 n.peerEpoch = response.buffer.getLong();
                } else {
                if(LOG.isInfoEnabled()){
                     LOG.info("Backward compatibility mode, server id=" + n.sid);
                    }
                n.peerEpoch = ZxidUtils.getEpochFromZxid(n.zxid);
              }

              /*
                * Version added in 3.4.6
                */
           n.version = (response.buffer.remaining() >= 4) ? response.buffer.getInt() : 0x0;
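
In backward compatibility mode the peer epoch is derived from the zxid. The sketch below shows the usual split of a zxid into a high 32-bit epoch and a low 32-bit counter. `ZxidSketch` is a hypothetical class, and treating `epochOf` as equivalent to ZxidUtils.getEpochFromZxid is an assumption on my part.

```java
// Hypothetical sketch of the zxid layout: high 32 bits = epoch, low 32 bits = counter.
final class ZxidSketch {
    private ZxidSketch() {}

    /** Epoch stored in the upper 32 bits of the zxid. */
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    /** Per-epoch transaction counter stored in the lower 32 bits. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Combines an epoch and a counter into one zxid. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }
}
```

For example, epoch 2 with counter 5 gives the zxid 0x200000005, and `epochOf` on that value returns 2.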

     3. Handle the message according to state
          If this server is itself LOOKING

                          if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
                                recvqueue.offer(n);

                                /*
                                 * Send a notification back if the peer that sent this
                                 * message is also looking and its logical clock is
                                 * lagging behind.
                                 */
                                 // If the sender is also LOOKING and its logical clock is behind ours, send it a notification carrying the leader we currently vote for (not necessarily ourselves)
                                if((ackstate == QuorumPeer.ServerState.LOOKING)
                                        && (n.electionEpoch < logicalclock.get())){
                                    Vote v = getVote();
                                    ToSend notmsg = new ToSend(ToSend.mType.notification,
                                            v.getId(),
                                            v.getZxid(),
                                            logicalclock.get(),
                                            self.getPeerState(),
                                            response.sid,
                                            v.getPeerEpoch());
                                    sendqueue.offer(notmsg);
                                }
                            }
        If this server is not LOOKING
                   
                                /*
                                 * If this server is not looking, but the one that sent the ack
                                 * is looking, then send back what it believes to be the leader.
                                 */
                                Vote current = self.getCurrentVote();
                                // If the sender is LOOKING, send it our current vote
                                if(ackstate == QuorumPeer.ServerState.LOOKING){
                                    if(LOG.isDebugEnabled()){
                                        LOG.debug("Sending new notification. My id =  " +
                                                self.getId() + " recipient=" +
                                                response.sid + " zxid=0x" +
                                                Long.toHexString(current.getZxid()) +
                                                " leader=" + current.getId());
                                    }
                                    
                                    ToSend notmsg;
                                    if(n.version > 0x0) {
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                current.getId(),
                                                current.getZxid(),
                                                current.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                current.getPeerEpoch());
                                        
                                    } 
                                    else {
                                        Vote bcVote = self.getBCVote();
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                bcVote.getId(),
                                                bcVote.getZxid(),
                                                bcVote.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                bcVote.getPeerEpoch());
                                    }
                                    sendqueue.offer(notmsg);
                                }
                              
       

Election process

QuorumPeer.run()
{
           while (running) {
               switch (getPeerState()) {
               case LOOKING:
                   LOG.info("LOOKING");
                  ...
                  else {
                       try {
                           setBCVote(null);
                           setCurrentVote(makeLEStrategy().lookForLeader());
                       } catch (Exception e) {
                           LOG.warn("Unexpected exception", e);
                           setPeerState(ServerState.LOOKING);
                       }
                   }
                   break;
               case OBSERVING:
                   try {
                       LOG.info("OBSERVING");
                       setObserver(makeObserver(logFactory));
                       observer.observeLeader();
                   } catch (Exception e) {
                       LOG.warn("Unexpected exception",e );                        
                   } finally {
                       observer.shutdown();
                       setObserver(null);
                       setPeerState(ServerState.LOOKING);
                   }
                   break;
               case FOLLOWING:
                   try {
                       LOG.info("FOLLOWING");
                       setFollower(makeFollower(logFactory));
                       follower.followLeader();
                   } catch (Exception e) {
                       LOG.warn("Unexpected exception",e);
                   } finally {
                       follower.shutdown();
                       setFollower(null);
                       setPeerState(ServerState.LOOKING);
                   }
                   break;
               case LEADING:
                   LOG.info("LEADING");
                   try {
                       setLeader(makeLeader(logFactory));
                       leader.lead();
                       setLeader(null);
                   } catch (Exception e) {
                       LOG.warn("Unexpected exception",e);
                   } finally {
                       if (leader != null) {
                           leader.shutdown("Forcing shutdown");
                           setLeader(null);
                       }
                       setPeerState(ServerState.LOOKING);
                   }
                   break;
}

makeLEStrategy().lookForLeader() is where the election actually starts

Main logic walkthrough

 1. Initialize some state
    HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();  // votes received in this round
    HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
    int notTimeout = finalizeWait;  // poll timeout, 200 ms by default
    synchronized(this){
           logicalclock.incrementAndGet(); // advance the logical clock
           updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch()); // reset the current vote to ourselves
        }

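2. Send out our own vote (to every voting member, ourselves included)
    sendNotifications();  // at this point every field, including the proposed leader id, is our own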
       sendqueue.offer(notmsg);
  WorkerSender.run 
        public void run() {
                while (!stop) {
                    try {
                        ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                        if(m == null) continue;
                        process(m);
=============================================================
                      manager.toSend(m.sid, requestBuffer);   
      public void toSend(Long sid, ByteBuffer b) {
        /*
         * If sending message to myself, then simply enqueue it (loopback).
         */
        if (this.mySid == sid) {  // the target is this server itself: put the message straight on the receive queue
             b.position(0);
             addToRecvQueue(new Message(b.duplicate(), sid));
            /*
             * Otherwise send to the corresponding thread to send.
             */
        } else {
             /*
              * Start a new connection if doesn't have one already.
              */
             ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
             ArrayBlockingQueue<ByteBuffer> bqExisting = queueSendMap.putIfAbsent(sid, bq);
             if (bqExisting != null) {
                 addToSendQueue(bqExisting, b);
             } else {
                 addToSendQueue(bq, b);  // ====> queue.add(buffer); queueSendMap maps each sid to its outgoing message queue
             }
             connectOne(sid);
                
        }
    }
        
          ...
 }    
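
The per-sid queue handling in toSend can be sketched on its own. `SendQueues` and `enqueue` are hypothetical names. The capacity of 1 and the "drop the oldest message when the queue is full" behavior are assumptions about SEND_CAPACITY and addToSendQueue, not something the snippet above shows.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a "one bounded queue per sid" structure.
final class SendQueues {
    // Assumption: only the most recent message per peer needs to be kept.
    static final int ASSUMED_SEND_CAPACITY = 1;

    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();

    /** Puts msg on the queue for sid, creating the queue on first use. */
    void enqueue(long sid, ByteBuffer msg) {
        ArrayBlockingQueue<ByteBuffer> fresh =
                new ArrayBlockingQueue<>(ASSUMED_SEND_CAPACITY);
        // putIfAbsent returns the queue already registered, or null if ours was stored.
        ArrayBlockingQueue<ByteBuffer> existing = queueSendMap.putIfAbsent(sid, fresh);
        ArrayBlockingQueue<ByteBuffer> queue = (existing != null) ? existing : fresh;
        // When the queue is full, discard the oldest entry and retry.
        while (!queue.offer(msg)) {
            queue.poll();
        }
    }
}
```

Keeping only the latest vote per peer is reasonable for election traffic, because a newer notification supersedes an older one that has not been sent yet.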

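 3. This step takes notifications from recvqueue (which WorkerReceiver fills from QuorumCnxManager's recvQueue) and, when necessary,
     asks QuorumCnxManager to resend to, or reconnect to, the other servers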
  while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
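
The timeout handling above is a capped exponential backoff. A minimal sketch follows, with `NotificationBackoff` as a hypothetical class. The starting value of 200 ms matches finalizeWait as described earlier, while the 60000 ms cap is an assumed value for maxNotificationInterval.

```java
// Hypothetical sketch of the capped doubling applied to notTimeout.
final class NotificationBackoff {
    static final int FINALIZE_WAIT_MS = 200;
    // Assumed cap; check maxNotificationInterval in the source you are reading.
    static final int ASSUMED_MAX_INTERVAL_MS = 60000;

    private NotificationBackoff() {}

    /** Doubles the timeout, but never returns more than max. */
    static int next(int current, int max) {
        int doubled = current * 2;
        return (doubled < max) ? doubled : max;
    }

    public static void main(String[] args) {
        int timeout = FINALIZE_WAIT_MS;
        // Prints the successive poll timeouts used while no notification arrives.
        for (int i = 0; i < 10; i++) {
            System.out.println(timeout);
            timeout = next(timeout, ASSUMED_MAX_INTERVAL_MS);
        }
    }
}
```

Under these assumptions the timeout grows 200, 400, 800 and so on, and stays at the cap once doubling would exceed it.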

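 4. Handle the messages returned by other servers in the ensemble. A message from a server that is not a configured voting member is skipped with a warning log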
                     
                    if(self.getVotingView().containsKey(n.sid)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break; 

                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                           
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }

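    4.1   First compare the notification's election epoch with our own logical clock
                       If our logical clock is smaller: adopt the sender's epoch, clear the votes already collected in recvset, update our proposal and send notifications
                       If our logical clock is larger: ignore the message
                       If the clocks are equal: compare the two votes and, if the received one wins, adopt it and send notifications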
                           if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }
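
The body of totalOrderPredicate is not shown in this walkthrough. As I understand it, the comparison is lexicographic on (epoch, zxid, sid). The sketch below uses a hypothetical `VoteOrder.wins` and leaves out anything else the real method may check, such as quorum weights.

```java
// Hypothetical sketch of the vote ordering: epoch first, then zxid, then server id.
final class VoteOrder {
    private VoteOrder() {}

    /** Returns true when the new vote should replace the current one. */
    static boolean wins(long newId, long newZxid, long newEpoch,
                        long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // a higher epoch always wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // same epoch: the more up-to-date log wins
        }
        return newId > curId;           // full tie on data: the larger sid wins
    }
}
```

Because the order is strict, two identical votes never replace each other, which keeps servers from re-sending notifications forever.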
   4.2 Store the received vote in the recvset map, sid->vote
        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

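   5.  Check whether the votes received so far are enough to end this round. There are two quorum strategies, but a simple majority is what is normally used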
         termPredicate =>self.getQuorumVerifier().containsQuorum(set);
         ==> 
          public boolean containsQuorum(HashSet<Long> set){
            return (set.size() > half);
         }
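
A simplified model of the majority check: gather the sids whose vote matches the current proposal, then test whether that set is larger than half of the ensemble. `MajoritySketch` is hypothetical, and a vote is reduced here to the proposed leader id only, whereas the real termPredicate compares whole Vote objects.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a majority-based termination check.
final class MajoritySketch {
    private MajoritySketch() {}

    /** True when the supporters form a strict majority of the ensemble. */
    static boolean containsQuorum(Set<Long> supporters, int ensembleSize) {
        int half = ensembleSize / 2;
        return supporters.size() > half;
    }

    /**
     * votes maps a voter's sid to the leader id it proposes (a simplification).
     * Returns true when a majority of the ensemble proposes proposedLeader.
     */
    static boolean hasQuorumFor(Map<Long, Long> votes, long proposedLeader,
                                int ensembleSize) {
        Set<Long> supporters = new HashSet<>();
        for (Map.Entry<Long, Long> e : votes.entrySet()) {
            if (e.getValue() == proposedLeader) { // unboxed numeric comparison
                supporters.add(e.getKey());
            }
        }
        return containsQuorum(supporters, ensembleSize);
    }
}
```

In a five-server ensemble, half is 2, so three matching votes are needed before the round can end.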

        if (termPredicate(recvset,
                 new Vote(proposedLeader, proposedZxid,
                     logicalclock.get(), proposedEpoch))) {
                            
                            // Once a majority agrees, wait a little longer to see whether the vote changes
                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */

                            // Now change this server's state.
                            // With votes from more than half of the servers, it can normally decide which role it will take
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;  // return the final vote
                            }
                        }
                        break;   


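    6. 
     FOLLOWING and LEADING
       are handled by the same branch
          If the notification belongs to our election round (same election epoch), it is added to recvset and checked right away
          Otherwise, for example when this server is joining an ensemble that already has a leader, it is added to
          outofelection and verified there. In both cases, once a quorum is confirmed, the server updates its state and returns the final vote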
                     case FOLLOWING:
                     case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                           
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;                      

Reposted from blog.csdn.net/zhaoyu_nb/article/details/88663599