1. Overview

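In a zookeeper cluster, the zookeeper nodes elect exactly one Leader by voting. Once the Leader is determined, the other zk nodes that took part in the election become Followers.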

2. Election process analysis

At the very start, the current zk node votes for itself and decides which election algorithm to use.
QuorumPeer.startLeaderElection:

synchronized public void startLeaderElection() {
    	try {
            //Vote for ourselves
    		currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
    	} catch(IOException e) {
    		RuntimeException re = new RuntimeException(e.getMessage());
    		re.setStackTrace(e.getStackTrace());
    		throw re;
    	}
        for (QuorumServer p : getView().values()) {
            if (p.id == myid) {
                myQuorumAddr = p.addr;
                break;
            }
        }
        if (myQuorumAddr == null) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
        }
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        //Decide which election algorithm to use
        this.electionAlg = createElectionAlgorithm(electionType);
    }

The default election algorithm is FastLeaderElection (electionType 3):
QuorumPeer.createElectionAlgorithm:

protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;
                
        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
        case 0:
            le = new LeaderElection(this);
            break;
        case 1:
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
            le = new AuthFastLeaderElection(this, true);
            break;
        case 3:
            //Connection manager: manages the connections between the ZooKeeper instances taking part in the election
            qcm = createCnxnManager();
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                listener.start();
                //The default algorithm
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }

The QuorumPeer.run method behaves differently depending on the node's state. When the state is LOOKING, it runs an election:

while (running) {
                switch (getPeerState()) {
                case LOOKING:
                    LOG.info("LOOKING");
                    //Unrelated code omitted
                    try {
                         setBCVote(null); 
                         //State is LOOKING: run the election algorithm to obtain the current vote for Leader
                         setCurrentVote(makeLEStrategy().lookForLeader());
                     } catch (Exception e) {
                         LOG.warn("Unexpected exception", e);
                         setPeerState(ServerState.LOOKING);
                     }
                    break;

The election itself therefore happens mainly in the FastLeaderElection.lookForLeader() method:

    /**
     * Starts a new round of leader election. Whenever our QuorumPeer
     * changes its state to LOOKING, this method is invoked, and it
     * sends notifications to all other peers.
     */
    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = Time.currentElapsedTime();
        }
        try {
            //All votes received in this election round: sid -> the Leader vote proposed by that sid
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
            //Votes from peers that have already left the election (FOLLOWING or LEADING state)
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                //logicalclock is the election round number: it starts at 0 and is incremented by 1 for every election
                logicalclock.incrementAndGet();
                //Initially propose ourselves as Leader
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            //Notify every node in the cluster (including ourselves) of the Leader this node proposes for this round
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */
             //Keep running the election until a Leader is determined
            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                 //Wait for a notification from a cluster node
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                 //No notification received
                if(n == null){
                    //The messages were delivered but no response came back, so send them again
                    if(manager.haveDelivered()){    	
                        sendNotifications();
                    } else {
                        //The messages were not delivered, so the connections may be broken; reconnect
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    //Possibly a poor network: lengthen the wait
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                    //Then continue with the next iteration of the while loop
                }
                else if(validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view for a replica in the voting view.
                     */
                     //State of the node that sent the vote
                    switch (n.state) {
                    case LOOKING: //The voting node is in LOOKING state
                        // If notification > current, replace and send messages out
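                        //Node n's election round is greater than ours, i.e. this node's vote is out of date
                        if (n.electionEpoch > logicalclock.get()) {
                            //Update the election round
                            logicalclock.set(n.electionEpoch);
                            //Discard the votes received so far, because they belong to a stale logicalclock
                            recvset.clear();
                            //Compare the leader proposed by n.sid against ourselves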
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                //n.leader wins, vote for n.leader
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                //We win, vote for ourselves
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            //Notify all nodes of our proposal
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            //The notification is out of date, ignore it
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            //Skip this notification
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            //Same election round, and n.leader is the better candidate: switch our vote to n.leader
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            //Send notifications
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }
                        //Record the vote from n.sid
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        //Can the election end? By default this requires more than half of the cluster to have voted for the current proposal
                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            //Wait up to finalizeWait for a notification that would change the proposed leader
                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                //No new Notification arrived, so settle the state:
                                //if the proposed leader is ourselves, become Leader, otherwise become Follower
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                //Return the election result
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING: //The voting node is in FOLLOWING or LEADING state
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        //If it belongs to the same election round
                        if(n.electionEpoch == logicalclock.get()){
                            //Record the vote
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
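                            //Is the leader valid? This depends on whether it has more than half of the votes and on whether n.leader itself claims to be leader
                            //As long as n.leader has declared itself leader, the current node trusts n.leader
                            if(ooePredicate(recvset, outofelection, n)) {
                                //Adopt n.leader as Leader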
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        //Record the vote for the established Leader
                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
                        //More than half have elected the Leader, and the Leader is valid
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}",
                    manager.getConnectionThreadCount());
        }
    }
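Two small pieces of arithmetic carry most of the loop above: the comparison that decides whether a received vote beats the current proposal, and the doubling of the poll timeout when nothing arrives. The listing does not show the body of totalOrderPredicate, so the ordering below (higher epoch first, then higher zxid, then higher server id) is an assumption about what it does, and the class name, method names and the 200 ms / 60000 ms values are hypothetical, chosen only for illustration. A minimal sketch:

```java
// Sketch only: mirrors the comparison and back-off arithmetic described above.
// It is not the actual FastLeaderElection code.
final class ElectionSketch {

    /**
     * Returns true when the candidate (newId, newZxid, newEpoch) should replace
     * the current proposal. Assumed ordering: higher epoch wins, then higher
     * zxid, then higher server id.
     */
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    /** Doubles the poll timeout after a silent wait, capped at maxInterval. */
    static int nextTimeout(int current, int maxInterval) {
        int doubled = current * 2;
        return doubled < maxInterval ? doubled : maxInterval;
    }

    public static void main(String[] args) {
        // Same epoch and zxid: the larger server id wins.
        System.out.println(beats(2, 0x100, 1, 1, 0x100, 1));
        // A newer zxid outranks a larger server id.
        System.out.println(beats(1, 0x101, 1, 2, 0x100, 1));

        // Hypothetical values: start at 200 ms, never exceed 60000 ms.
        int timeout = 200;
        for (int i = 0; i < 10; i++) {
            timeout = nextTimeout(timeout, 60_000);
        }
        System.out.println(timeout);
    }
}
```

Because the comparison is a strict total order over (epoch, zxid, id), any two nodes that see the same pair of votes pick the same winner, which is what lets the proposals converge without a coordinator.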

3. Why does a zookeeper cluster need at least 3 nodes?

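As the election code shows, within one election round a node compares each vote it receives with its current proposal. If the received vote wins, the node switches its proposal to that candidate and broadcasts a notification; otherwise it keeps its proposal and sends no new notification.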

Suppose the cluster has only two nodes, A and B, and A's serverId is smaller than B's serverId.

A starts and votes for itself. At this point:
A – votes: (A, A), proposal: A
B starts and votes for itself. At this point:
A – votes: (A, A) (B, B), proposal: B
B – votes: (B, B), proposal: B
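At this point A proposes B as Leader. B can confirm that it has been elected only after it receives A's vote, because for two nodes "more than half" means both of them. If A's notification does not reach B, or if either node goes down, the remaining node can never collect a majority and stays in LOOKING, so a two-node cluster cannot survive a single failure. With three nodes a majority is two, and the cluster keeps working when one node is lost. A fault-tolerant zk cluster therefore needs at least 3 nodes.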
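The majority arithmetic behind the two-node example can be sketched as follows. This assumes the default majority rule, where a proposal wins once more than half of the cluster's voting members back it; the class and method names are hypothetical, and the vote map is simplified to sid -> proposed leader sid rather than full Vote objects.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a simplified majority check, not ZooKeeper's termPredicate.
final class QuorumSketch {

    /** Strict majority: more than half of the voting members. */
    static boolean hasQuorum(int votes, int clusterSize) {
        return votes > clusterSize / 2;
    }

    /** Number of nodes that may fail while a majority can still be formed. */
    static int tolerableFailures(int clusterSize) {
        return (clusterSize - 1) / 2;
    }

    /** Counts the entries in recvset that propose the given leader. */
    static int votesFor(Map<Long, Long> recvset, long leader) {
        int count = 0;
        for (long proposed : recvset.values()) {
            if (proposed == leader) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " node(s): tolerates "
                    + tolerableFailures(n) + " failure(s)");
        }

        // Two-node cluster: sid 1 stands for A, sid 2 for B.
        Map<Long, Long> recvset = new HashMap<>();
        recvset.put(2L, 2L); // B has only its own vote for B
        System.out.println(hasQuorum(votesFor(recvset, 2L), 2));
        recvset.put(1L, 2L); // A's vote for B has arrived
        System.out.println(hasQuorum(votesFor(recvset, 2L), 2));
    }
}
```

Under this rule a cluster of two tolerates no failures, the same as a single node, while a cluster of three tolerates one. An even-sized cluster also tolerates no more failures than the odd size just below it, which is why odd cluster sizes are the usual choice.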
