ZooKeeper Fast Leader Election Explained in Detail

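Before walking through the flow, here are the roles involved in the election, along with the key classes and variables (source code reference version: 3.4.9):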

Roles (server states): 1. LOOKING: campaigning (searching for a leader)

           2. OBSERVING: observer

           3. FOLLOWING: follower

           4. LEADING: leader

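Vote information:

           1. logicalclock (electionEpoch): the local election round; incremented each time this server starts a new round of election (each call to lookForLeader())

           2. epoch (peerEpoch): the epoch of the proposed leader; it advances each time an election settles on a new leader, and it forms the high 32 bits of a zxid

           3. zxid: the transaction ID; it increases with every data change. A zxid is 64 bits in total: the high 32 bits are the epoch and the low 32 bits are a counter

           4. sid: the serverId of the server that sent this vote

           5. leader: the proposed leader (the serverId, i.e. sid, of the proposed server)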
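The 64-bit zxid layout described above can be sketched in a few lines of Java. This is only an illustration: the class ZxidLayoutSketch and its method names are hypothetical and are not taken from the ZooKeeper source.

```java
// Hypothetical helper for illustration only; not part of ZooKeeper.
final class ZxidLayoutSketch {
    private ZxidLayoutSketch() {}

    /** High 32 bits of the zxid: the epoch. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Low 32 bits of the zxid: the per-epoch counter. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Packs an epoch and a counter into one 64-bit zxid. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 42);
        // Prints: zxid=0x50000002a epoch=5 counter=42
        System.out.println("zxid=0x" + Long.toHexString(zxid)
                + " epoch=" + epochOf(zxid)
                + " counter=" + counterOf(zxid));
    }
}
```

Because the epoch sits in the high bits, comparing two zxids as plain long values already orders them by epoch first and by counter second.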

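Vote comparison rules:

          1. The vote with the larger epoch wins; if the epochs are equal, go to step 2

          2. The vote with the larger zxid wins; if the zxids are equal, go to step 3

          3. The vote with the larger sid wins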

The source code for the comparison rules is as follows:

    /**
     * Check if a pair (server id, zxid) succeeds our
     * current vote.
     *
     * @param id    Server identifier
     * @param zxid  Last zxid observed by the issuer of this vote
     */
    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }
        
        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */
        
        return ((newEpoch > curEpoch) || 
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }
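To see the three rules in action, here is a minimal standalone sketch of the same ordering. The class VoteOrderSketch and the method wins are hypothetical names chosen for this example, and the quorum-weight check from the original method is deliberately left out.

```java
// Hypothetical standalone class for illustration; not part of ZooKeeper.
final class VoteOrderSketch {
    private VoteOrderSketch() {}

    /**
     * Returns true when the new vote (newId, newZxid, newEpoch) beats the
     * current vote (curId, curZxid, curEpoch): epoch first, then zxid,
     * then sid. The quorum-weight check of the original is omitted.
     */
    static boolean wins(long newId, long newZxid, long newEpoch,
                        long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    public static void main(String[] args) {
        // Higher epoch wins even with a smaller zxid and sid: true
        System.out.println(wins(1, 0x100, 2, 3, 0x200, 1));
        // Same epoch, higher zxid wins despite the smaller sid: true
        System.out.println(wins(1, 0x200, 1, 3, 0x100, 1));
        // Same epoch and zxid, higher sid wins: true
        System.out.println(wins(3, 0x100, 1, 1, 0x100, 1));
        // Identical votes, the new vote does not win: false
        System.out.println(wins(1, 0x100, 1, 1, 0x100, 1));
    }
}
```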

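The following first outlines the election flow. For now, ignore how vote data is exchanged between servers and simply treat it as available; how votes are exchanged during the election is covered later.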

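1. First, increment logicalclock, propose itself as leader, and broadcast that proposal

2. Enter the loop for this election round

3. Take one vote notification from the recvqueue queue. If none arrives before the timeout, either resend its own vote (when the earlier messages have all been delivered) or reconnect to the other servers, and double the timeout up to a maximum; otherwise go to step 4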

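4. Check the election state carried in the notification:

        LOOKING state: 1. If the sender's logicalclock is greater than the local logicalclock, update the local logicalclock, clear the local vote tally recvset, compare the vote in the notification with this server's own initial candidacy, adopt the larger of the two as the new vote, and broadcast it; then continue with step 4

                                    2. If the sender's logicalclock is less than the local logicalclock, ignore the notification and go back to the top of the loop

                                    3. If the two logicalclock values are equal, compare the locally proposed leader with the leader in the notification; if the notification's vote is larger, adopt it as the new vote and broadcast it; then continue with step 4

                                     4. Save the sender's vote in the local tally recvset and check whether the currently proposed leader holds a majority of the votes (more than half of the voting servers). If so, keep polling recvqueue, waiting up to finalizeWait for each notification, to see whether anyone proposes a better leader. If someone does, put that notification back into recvqueue and start the next loop iteration; otherwise settle the role and end the election

        OBSERVING state: observers have no vote, so the notification is ignored and the loop moves on to the next iteration

        FOLLOWING/LEADING: 1. If the sender's logicalclock equals the local logicalclock, save the sender's vote in recvset and check whether that vote holds a majority in recvset and whether the leader it names is confirmed as the actual leader (the checkLeader test inside ooePredicate). If so, settle the role and end the election; otherwise (including when the logicalclock values differ) go to step 2

                                                   2. Put the sender's vote into outofelection, the local tally of servers that are not taking part in the election, and check whether that vote holds a majority in outofelection and whether the leader it names is confirmed as the actual leader. If so, update logicalclock, settle the role, and end the election; otherwise move on to the next loop iteration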
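The majority check used in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions: it assumes plain majority quorums (no weights or groups), and the names MajorityTallySketch, SimpleVote, and hasMajority are hypothetical rather than ZooKeeper's own Vote class or termPredicate method.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical classes for illustration; not ZooKeeper's Vote or termPredicate.
final class MajorityTallySketch {
    private MajorityTallySketch() {}

    /** Simplified vote holding only the fields compared in this sketch. */
    static final class SimpleVote {
        final long leader;
        final long zxid;
        final long peerEpoch;

        SimpleVote(long leader, long zxid, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.peerEpoch = peerEpoch;
        }

        boolean sameAs(SimpleVote other) {
            return leader == other.leader
                    && zxid == other.zxid
                    && peerEpoch == other.peerEpoch;
        }
    }

    /**
     * Counts the votes in recvset (keyed by sender sid) that match the
     * proposal and reports whether they exceed half of the voting servers.
     */
    static boolean hasMajority(Map<Long, SimpleVote> recvset,
                               SimpleVote proposal, int votingServers) {
        int agreeing = 0;
        for (SimpleVote vote : recvset.values()) {
            if (vote.sameAs(proposal)) {
                agreeing++;
            }
        }
        return agreeing > votingServers / 2;
    }

    public static void main(String[] args) {
        SimpleVote proposal = new SimpleVote(3, 0x100, 1);
        Map<Long, SimpleVote> recvset = new HashMap<>();
        recvset.put(1L, new SimpleVote(3, 0x100, 1));
        System.out.println(hasMajority(recvset, proposal, 3)); // 1 of 3: false
        recvset.put(3L, new SimpleVote(3, 0x100, 1));
        System.out.println(hasMajority(recvset, proposal, 3)); // 2 of 3: true
    }
}
```

Keying the tally by sender sid means a server that changes its mind overwrites its earlier vote instead of being counted twice.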

The source code for the election flow is as follows:

    /**
     * Starts a new round of leader election. Whenever our QuorumPeer
     * changes its state to LOOKING, this method is invoked, and it
     * sends notifications to all other peers.
     *
     * Annotation: the notifications sent to the other peers carry this
     * peer's current leader proposal.
     *
     */
    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = System.currentTimeMillis();
        }
        try {
            // Votes tallied locally for this election round, keyed by sender sid
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            // Votes from peers in FOLLOWING or LEADING state (i.e. not LOOKING)
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            // Propose ourselves as leader
            synchronized(this){
                logicalclock++;
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                else if(self.getVotingView().containsKey(n.sid)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock) {
                            logicalclock = n.electionEpoch;
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        // Cache the sender's vote for the final tally
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock, proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock,
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                           
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock = n.electionEpoch;
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    LOG.warn("Ignoring notification from non-cluster member " + n.sid);
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
        }
    }
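The timeout handling in the loop above (double notTimeout after every empty poll, capped at maxNotificationInterval) can be isolated into a small sketch. The class and method names here are hypothetical, and the two constants (200 ms and 60000 ms) are what I believe 3.4.9 uses for finalizeWait and maxNotificationInterval; treat them as assumptions and check them against the source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration; the constants are assumed values.
final class NotificationBackoffSketch {
    private NotificationBackoffSketch() {}

    static final int FINALIZE_WAIT_MS = 200;               // assumed finalizeWait
    static final int MAX_NOTIFICATION_INTERVAL_MS = 60000; // assumed cap

    /** Doubles the timeout, capped at the maximum notification interval. */
    static int nextTimeout(int current) {
        int doubled = current * 2;
        return doubled < MAX_NOTIFICATION_INTERVAL_MS
                ? doubled : MAX_NOTIFICATION_INTERVAL_MS;
    }

    /** Poll timeouts used for a run of consecutive empty polls. */
    static List<Integer> timeouts(int emptyPolls) {
        List<Integer> result = new ArrayList<>();
        int timeout = FINALIZE_WAIT_MS;
        for (int i = 0; i < emptyPolls; i++) {
            result.add(timeout);
            timeout = nextTimeout(timeout);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [200, 400, 800, 1600, 3200, 6400, 12800, 25600, 51200, 60000, 60000]
        System.out.println(timeouts(11));
    }
}
```

With these values, an isolated server backs off from polling every 200 ms to polling once a minute after nine empty polls.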

The election flow chart is as follows:

[Figure: fast leader election flow]

The sections above covered the fast election flow. So how is data exchanged during the election? The following explains this further:

In ZooKeeper's startup script zkServer.cmd, you can see the following lines:

set ZOOMAIN=org.apache.zookeeper.server.quorum.QuorumPeerMain
echo on
call %JAVA% "-Dzookeeper.log.dir=%ZOO_LOG_DIR%" "-Dzookeeper.root.logger=%ZOO_LOG4J_PROP%" -cp "%CLASSPATH%" %ZOOMAIN% "%ZOOCFG%" %*

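From this we learn that the startup class is org.apache.zookeeper.server.quorum.QuorumPeerMain. Tracing the code from there shows that the election runs in the lookForLeader() method of the FastLeaderElection class, while the network interaction actually happens in the QuorumCnxManager class. The class relationships are shown in the following two diagrams:

[Figure: network interaction class diagrams]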

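Details:
         The QuorumCnxManager class is where the network interaction actually happens; it is responsible for collecting and sending vote messages over the network. As the class diagram shows, it has an inner class named Listener, which accepts incoming connections. On that path each pair of servers is kept to a single connection, and two threads are started for it to send and receive vote messages: SendWorker and RecvWorker.
         The FastLeaderElection class also has two inner classes responsible for sending and receiving vote messages: WorkerSender and WorkerReceiver.
         Sending path: when the election method lookForLeader() sends a vote, it puts the vote into the sendqueue queue of the FastLeaderElection class. WorkerSender (FastLeaderElection) moves messages from sendqueue into queueSendMap in the QuorumCnxManager class, and SendWorker (QuorumCnxManager) sends the vote messages in queueSendMap out over the network.

         Receiving path: RecvWorker (QuorumCnxManager) receives vote messages from the network and puts them into the recvQueue queue of the QuorumCnxManager class. WorkerReceiver (FastLeaderElection) takes messages from recvQueue in QuorumCnxManager and puts them into the recvqueue queue of the FastLeaderElection class.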
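The two-stage hand-off on the sending path can be modelled with plain Java queues. This is a single-threaded sketch under stated assumptions: the class SendPathSketch, the Outgoing type, the wire queue standing in for the network, and the step methods are all hypothetical. In ZooKeeper each stage runs in a loop on its own thread, and the receiving path mirrors this in the opposite direction.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical model of the sending path; not ZooKeeper's actual classes.
final class SendPathSketch {
    /** A vote message addressed to one peer. */
    static final class Outgoing {
        final long sid;
        final String payload;

        Outgoing(long sid, String payload) {
            this.sid = sid;
            this.payload = payload;
        }
    }

    // Plays the role of FastLeaderElection.sendqueue.
    final Queue<Outgoing> sendqueue = new ConcurrentLinkedQueue<>();
    // Plays the role of QuorumCnxManager.queueSendMap: one queue per peer sid.
    final Map<Long, Queue<String>> queueSendMap = new ConcurrentHashMap<>();
    // Stands in for the network: whatever was "written to the socket".
    final Queue<String> wire = new ConcurrentLinkedQueue<>();

    private Queue<String> queueFor(long sid) {
        return queueSendMap.computeIfAbsent(sid, k -> new ConcurrentLinkedQueue<>());
    }

    /** Role of WorkerSender: move one message from sendqueue to its peer queue. */
    boolean workerSenderStep() {
        Outgoing message = sendqueue.poll();
        if (message == null) {
            return false;
        }
        queueFor(message.sid).add(message.payload);
        return true;
    }

    /** Role of SendWorker for one peer: move one message onto the wire. */
    boolean sendWorkerStep(long sid) {
        String payload = queueFor(sid).poll();
        if (payload == null) {
            return false;
        }
        wire.add("to " + sid + ": " + payload);
        return true;
    }

    public static void main(String[] args) {
        SendPathSketch path = new SendPathSketch();
        path.sendqueue.add(new Outgoing(2L, "vote(leader=3)"));
        path.workerSenderStep();
        path.sendWorkerStep(2L);
        System.out.println(path.wire.poll()); // to 2: vote(leader=3)
    }
}
```

Splitting the path into an election-level queue and per-peer queues lets the election logic enqueue a vote without waiting on any single slow or unreachable peer.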

I made a copy of the 3.4.9 source code and added some comments: https://github.com/learnertogether/zookeeper-3.4.9.git

Reposted from blog.csdn.net/long290046464/article/details/81408624