ZK Cluster Leader Election Code Reading

Foreword

ZooKeeper implements the Zab protocol with its own leader/standby model: a Leader plus learners (Observers and Followers). There are several scenarios in which a Leader must be elected:

  • Scenario 1: A Leader must be elected while the cluster is starting up
  • Scenario 2: After the cluster has started normally, the Leader crashes due to a failure and a new Leader must be elected
  • Scenario 3: The number of Followers no longer passes the majority check, so the Leader shuts itself down and a new Leader is elected
  • Scenario 4: The cluster is working normally and a new Follower joins

This post traces the source code through these four scenarios.

Program entry

Each node in the cluster corresponds to a QuorumPeer.java instance; its start() method completes the startup work of the current node. Source code follows:

    // todo we are now inside QuorumPeer (roughly "quorum member"); think of this class as one node in the cluster
    @Override
    public synchronized void start() {
        // todo load data from disk into memory
        loadDataBase();

        // todo start the connection factory; it is a thread class that accepts client requests
        cnxnFactory.start();

        // todo kick off the leader election work
        startLeaderElection();

        // todo determine the server's role; this starts the run() method of this very class (around line 900)
        super.start();
    }

The first call, loadDataBase();, restores the node's data from disk into memory.

The second call, cnxnFactory.start();, lets the current node accept connection requests sent by clients (Java code or the console).

The third call, startLeaderElection();, kicks off the leader election. In fact it only initializes a series of helper classes that assist the election; it does not perform the election itself.

The current class, QuorumPeer, extends ZooKeeperThread, which is itself a thread class, so super.start(); starts its run() method. That run() method contains a while loop; at startup every node's state defaults to LOOKING, so each node enters the LOOKING branch, where the genuine leader election takes place.
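
For reference, the four possible states come from the ServerState enum in QuorumPeer.java:

    public enum ServerState {
        LOOKING, FOLLOWING, LEADING, OBSERVING;
    }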

Summary

From this walkthrough of the program entry point, you can see that this article focuses on two questions: what does startLeaderElection(); actually do, and how does the LOOKING branch elect the leader?

Scenario 1: Electing the Leader while the cluster starts up

Step into the startLeaderElection(); method (source below); it mainly does two things:

  • Initializes the vote variable maintained by QuorumPeer.java (volatile private Vote currentVote;)
  • Calls createElectionAlgorithm() to create the leader election algorithm

    In fact, by now only one algorithm remains undeprecated: FastLeaderElection

    // TODO start the work of voting to elect the Leader
    synchronized public void startLeaderElection() {
        try {
            // todo create an object encapsulating the vote: myid, the largest zxid, and the current election epoch
            // todo vote for itself first
            // todo step into its constructor
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        } catch(IOException e) {
            RuntimeException re = new RuntimeException(e.getMessage());
            re.setStackTrace(e.getStackTrace());
            throw re;
        }
        for (QuorumServer p : getView().values()) {
            if (p.id == myid) {
                myQuorumAddr = p.addr;
                break;
            }
        }
        if (myQuorumAddr == null) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
        }
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        // todo create the leader election algorithm; only one implementation remains: fast leader election
        this.electionAlg = createElectionAlgorithm(electionType);
    }

Following into createElectionAlgorithm(electionType), this method does three things (code below):

  • Creates the QuorumCnxManager
  • Creates and starts the Listener
  • Creates the FastLeaderElection

protected Election createElectionAlgorithm(int electionAlgorithm){
    Election le=null;
            
    //TODO: use a factory rather than a switch
    switch (electionAlgorithm) {
    case 0:
        le = new LeaderElection(this);
        break;
    case 1:
        le = new AuthFastLeaderElection(this);
        break;
    case 2:
        le = new AuthFastLeaderElection(this, true);
        break;
    case 3:
        // todo create the CnxnManager, the connection context manager
        qcm = createCnxnManager();
        QuorumCnxManager.Listener listener = qcm.listener;
        if(listener != null){

            // todo start the listener here
            listener.start();
            // todo instantiate the leader election algorithm
            le = new FastLeaderElection(this, qcm);
        } else {
            LOG.error("Null listener when initializing cnx manager");
        }
        break;

Preparing the election environment

QuorumCnxManager

https://img2018.cnblogs.com/blog/1496926/201910/1496926-20191004181514119-1851057732.png

The figure above is a class diagram of QuorumCnxManager. You can see it has six inner classes; apart from Message, each of the others is a thread class that runs on its own.

This class plays a pivotal role: it is the communication helper that every node in the cluster has. What role, exactly? No guessing is needed: the leader is decided by a round of voting, the nodes have to send votes to one another, and so every two nodes in the cluster must establish a connection. QuorumCnxManager is what maintains this communication between the nodes of the cluster.

It maintains two queues (source below). The first is a ConcurrentHashMap keyed by a node's myid (i.e. serverId); its value can be understood as the queue of votes waiting to be sent to that server.

The second is the queue that stores the msgs sent over by the other servers.

// todo key = serverId (myid); value = the queue of messages the current server sends to that server
final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap;

// todo all received data ends up in this queue
public final ArrayBlockingQueue<Message> recvQueue;
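
To see how these two queues are used, here is a simplified sketch (not the literal source; computeIfAbsent stands in for the original's putIfAbsent dance) of how QuorumCnxManager.toSend() routes an outgoing vote. Note the shortcut: a node's vote for itself never touches the socket.

    // simplified sketch of QuorumCnxManager.toSend(sid, b)
    public void toSend(Long sid, ByteBuffer b) {
        if (this.mySid == sid) {
            // a vote addressed to ourselves skips the socket and goes straight into recvQueue
            b.position(0);
            addToRecvQueue(new Message(b.duplicate(), sid));
        } else {
            // otherwise lazily create this peer's queue in queueSendMap, enqueue the vote,
            // and make sure a connection to that peer exists
            ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.computeIfAbsent(
                    sid, k -> new ArrayBlockingQueue<ByteBuffer>(1));
            addToSendQueue(bq, b);
            connectOne(sid);
        }
    }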

https://img2018.cnblogs.com/blog/1496926/201910/1496926-20191004181513574-985979791.png

Pictured above is a hand-drawn diagram of the QuorumCnxManager.java machinery. The three thread classes inside it are the most obvious parts, so what exactly do the run() methods of those three thread classes do?

SendWorker's run(): as you can see, the current node fetches the send queue that corresponds to a given sid, then takes data out of that queue and sends it.


    public void run() {
            threadCnt.incrementAndGet();
            try {
                ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
                if (bq == null || isSendQueueEmpty(bq)) {
                   ByteBuffer b = lastMessageSent.get(sid);
                   if (b != null) {
                       LOG.debug("Attempting to send lastMessage to sid=" + sid);
                       send(b);
                   }
                }
            } catch (IOException e) {
                LOG.error("Failed to send last message. Shutting down thread.", e);
                this.finish();
            }
            
            try {
                while (running && !shutdown && sock != null) {

                    ByteBuffer b = null;
                    try {
                        // todo fetch the queue that holds this sid's pending messages
                        ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);

                        if (bq != null) {
                            // todo poll the next message from bq
                            b = pollSendQueue(bq, 1000, TimeUnit.MILLISECONDS);
                        } else {
                            LOG.error("No queue of incoming messages for " +
                                      "server " + sid);
                            break;
                        }

                        if(b != null){
                            lastMessageSent.put(sid, b);
                            // todo send it
                            send(b);
                        }
                    } catch (InterruptedException e) {
                        LOG.warn("Interrupted while waiting for message on queue",
                                e);
                    }
                }

RecvWorker's run(): it reads each incoming msg and puts it into the recvQueue.

        public void run() {
            threadCnt.incrementAndGet();
            try {
                while (running && !shutdown && sock != null) {
                    /**
                     * Reads the first int to determine the length of the
                     * message
                     */
                    int length = din.readInt();
                    if (length <= 0 || length > PACKETMAXSIZE) {
                        throw new IOException(
                                "Received packet with invalid packet: "
                                        + length);
                    }
                    /**
                     * Allocates a new ByteBuffer to receive the message
                     */
                    // todo read the message bytes into a new array
                    byte[] msgArray = new byte[length];
                    din.readFully(msgArray, 0, length);
                    // todo wrap the array in a ByteBuffer
                    ByteBuffer message = ByteBuffer.wrap(msgArray);
                    // todo add it to the recvQueue
                    addToRecvQueue(new Message(message.duplicate(), sid));
                }

https://img2018.cnblogs.com/blog/1496926/201910/1496926-20191004181512961-831337433.png

Listener's run(): it binds the election port we configured for the cluster in the configuration file (3888 in the example above) and uses it to establish connections between the nodes.

Notice also that ordinary (blocking) socket communication is used between the nodes of the cluster.

        InetSocketAddress addr;
            while((!shutdown) && (numRetries < 3)){
                try {
                    // todo create the ServerSocket
                    ss = new ServerSocket();
                    ss.setReuseAddress(true);
                    if (listenOnAllIPs) {
                        int port = view.get(QuorumCnxManager.this.mySid)
                            .electionAddr.getPort();
                        //todo the address it retrieves here is the one we added when configuring the cluster, e.g. port 3888...
                        addr = new InetSocketAddress(port);
                    } else {
                        addr = view.get(QuorumCnxManager.this.mySid)
                            .electionAddr;
                    }
                    LOG.info("My election bind port: " + addr.toString());
                    setName(view.get(QuorumCnxManager.this.mySid)
                            .electionAddr.toString());
                    // todo bind the port
                    ss.bind(addr);
                    while (!shutdown) {
                        // todo block, accepting connections initiated by other servers
                        Socket client = ss.accept();
                        setSockOpts(client);
                        LOG.info("Received connection request "
                                + client.getRemoteSocketAddress());
                        // todo  if quorum SASL authentication is enabled, receive and process connection requests asynchronously
                        // todo  this is needed because SASL server authentication can take several seconds, which would delay the next peer connection request
                        if (quorumSaslAuthEnabled) {
                            // todo accept a connection asynchronously
                            receiveConnectionAsync(client);
                        } else {
                            // todo follow this method
                            receiveConnection(client);
                        }
                        numRetries = 0;
                    }
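
One more detail worth knowing here, as a simplified sketch (the variable name remoteSid is illustrative; the real logic lives in receiveConnection()/handleConnection()): to avoid two sockets per pair of servers, only a connection initiated by the peer with the larger sid is kept.

    // simplified sketch of QuorumCnxManager's duplicate-connection rule
    if (remoteSid < this.mySid) {
        // the initiator has the smaller sid: drop its connection and dial back ourselves
        closeSocket(sock);
        connectOne(remoteSid);
    } else {
        // the initiator has the larger sid: keep the socket and wire up the worker pair
        SendWorker sw = new SendWorker(sock, remoteSid);
        RecvWorker rw = new RecvWorker(sock, remoteSid, sw);
        sw.setRecv(rw);
        sw.start();
        rw.start();
    }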

Following the source back to QuorumPeer.java's createElectionAlgorithm() (excerpted again below): once the QuorumCnxManager has been created, the Listener is started. A running Listener marks the point at which every two nodes in the cluster are able to communicate with each other. The Listener is itself a thread class, and its run() method is the code shown above.

FastLeaderElection

After starting the Listener, the leader election algorithm object is instantiated: new FastLeaderElection(this, qcm)

    ...
     break;
        case 3:
            // todo create the CnxnManager, the connection context manager
            qcm = createCnxnManager();
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                // todo start the listener here
                listener.start();
                // todo instantiate the leader election algorithm
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }

Below is a class diagram of FastLeaderElection

https://img2018.cnblogs.com/blog/1496926/201910/1496926-20191004181512482-258189385.png

At a glance you can see its three inner classes:

  • Messenger (which has two inner thread classes)
    • WorkerReceiver
      • Responsible for taking messages received by QuorumCnxManager and turning them into Notifications for the election
    • WorkerSender
  • Notification
    • A newly started node is in the LOOKING state and initiates a round of voting; after the other nodes receive its vote, they use a Notification to tell it which leader they themselves trust
  • ToSend
    • A message to be sent to the other nodes; it may be a notification, or an ack for a notification that was received

Mirroring the two queues maintained by QuorumCnxManager, FastLeaderElection likewise carefully maintains two queues of its own: sendqueue and recvqueue.

LinkedBlockingQueue<ToSend> sendqueue;
LinkedBlockingQueue<Notification> recvqueue;
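
For orientation, the essential fields of the two message types look roughly like this (a condensed sketch of the real inner classes, keeping only the fields discussed in this post):

    static public class Notification {
        long leader;                  // sid of the proposed leader
        long zxid;                    // zxid of the proposed leader
        long electionEpoch;           // the sender's election round (logical clock)
        long peerEpoch;               // epoch of the proposed leader
        long sid;                     // sid of the sender
        QuorumPeer.ServerState state; // sender's state: LOOKING / FOLLOWING / LEADING / OBSERVING
    }

    static public class ToSend {
        long leader, zxid, electionEpoch, peerEpoch;
        long sid;                     // sid of the destination server
        QuorumPeer.ServerState state;
    }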

How do they cooperate in practice? As shown below.

https://img2018.cnblogs.com/blog/1496926/201910/1496926-20191004181511814-142904818.png

When a node starts up, the votes it casts outward are first put into FastLeaderElection's sendqueue and then sent out by QuorumCnxManager's SendWorker. Receiving is the mirror image: a vote from another node is picked up by QuorumCnxManager's RecvWorker and placed into QuorumCnxManager's recvQueue; the msgs in that queue are then taken by FastLeaderElection's inner thread class WorkerReceiver and stored in FastLeaderElection's recvqueue.
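
A minimal sketch of those two bridging loops, with the caveat that buildMsg() and parseNotification() here stand in for the real serialization helpers:

    // WorkerSender: drains FastLeaderElection's sendqueue into QuorumCnxManager
    class WorkerSender implements Runnable {
        public void run() {
            try {
                while (!stop) {
                    ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                    if (m != null) {
                        manager.toSend(m.sid, buildMsg(m)); // hand off to the socket layer
                    }
                }
            } catch (InterruptedException e) { /* exit */ }
        }
    }

    // WorkerReceiver: drains QuorumCnxManager's recvQueue into FastLeaderElection's recvqueue
    class WorkerReceiver implements Runnable {
        public void run() {
            try {
                while (!stop) {
                    Message response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
                    if (response == null) continue;
                    if (self.getPeerState() == QuorumPeer.ServerState.LOOKING) {
                        recvqueue.offer(parseNotification(response)); // unpack into a Notification
                    }
                }
            } catch (InterruptedException e) { /* exit */ }
        }
    }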

Tracing the code also shows that Messenger's two inner threads are started as daemon threads:

Messenger(QuorumCnxManager manager) {
    // todo WorkerSender runs as a new thread
    this.ws = new WorkerSender(manager);

    Thread t = new Thread(this.ws,
            "WorkerSender[myid=" + self.getId() + "]");
    t.setDaemon(true);
    t.start();

    //todo------------------------------------
    // todo WorkerReceiver runs as a new thread
    this.wr = new WorkerReceiver(manager);

    t = new Thread(this.wr,
            "WorkerReceiver[myid=" + self.getId() + "]");
    t.setDaemon(true);
    t.start();
}

Summary

At this point in the code, the preparation for the leader election is actually complete: the startLeaderElection(); call inside QuorumPeer.java's start() method has set up the election environment, as illustrated in the diagrams above.


The real beginning of the election

Now look at what happens when the QuorumPeer.java thread starts. Below is an excerpt of its run() method; the part we care about is the lookForLeader() call.

while (running) {
switch (getPeerState()) {
    /**
     * todo four possible states; after the leader election, different servers have different roles
     * todo in other words, different servers move into the different branches below
     * LOOKING: leader election in progress
     * OBSERVING
     * FOLLOWING
     * LEADING
     */
case LOOKING:
    // todo in the LOOKING state, the node enters the leader election phase
    LOG.info("LOOKING");

    if (Boolean.getBoolean("readonlymode.enabled")) {
        LOG.info("Attempting to start ReadOnlyZooKeeperServer");

        // Create read-only server but don't start it immediately
        // todo create a read-only server, but don't rush to start it immediately
        final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
                logFactory, this,
                new ZooKeeperServer.BasicDataTreeBuilder(),
                this.zkDb);

        // Instead of starting roZk immediately, wait some grace period before we decide we're partitioned.
        // todo instead of starting roZk right away, wait a while before deciding we're partitioned
        // Thread is used here because otherwise it would require changes in each of election strategy classes which is
        // unnecessary code coupling.
        //todo  a new thread is started here, to avoid the code coupling that changing every election strategy class would cause
        Thread roZkMgr = new Thread() {
            public void run() {
                try {
                    // lower-bound grace period to 2 secs
                    sleep(Math.max(2000, tickTime));
                    if (ServerState.LOOKING.equals(getPeerState())) {
                        // todo start the read-only server created above
                        roZk.startup();
                    }
                } catch (InterruptedException e) {
                    LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                } catch (Exception e) {
                    LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                }
            }
        };
        try {
            roZkMgr.start();
            setBCVote(null);
            // todo the code above doesn't matter; look directly at the lookForLeader() method
            // todo clicking in lands on the interface; we look at its implementation class
            setCurrentVote(makeLEStrategy().lookForLeader());
        } catch (Exception e) {
            LOG.warn("Unexpected exception",e);
            setPeerState(ServerState.LOOKING);
        } finally {
            // If the thread is in the the grace period, interrupt
            // to come out of waiting.
            roZkMgr.interrupt();
            roZk.shutdown();
        }

Here is an interpretation of the lookForLeader() source code. Frankly, this method is quite long, but it really is important: the bits and pieces about leader election that you find summarized around the web can all be traced back to it.

First point: every node casts its first vote for itself. Put plainly, new Vote(myid, getLastLoggedZxid(), getCurrentEpoch()); packages together its own myid, its largest zxid, and the current epoch. One detail: although the vote is cast for itself, the node still sends this vote's information to the other nodes through the socket.

Votes from other nodes are accepted by QuorumCnxManager's RecvWorker thread, which adds them to the recvQueue; a vote for oneself does not take that route, but is instead added directly into the recvQueue.
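
The broadcast itself is done by sendNotifications(), which roughly builds one ToSend per voting member, including the node itself, and drops it onto the sendqueue:

    // roughly what FastLeaderElection.sendNotifications() does
    private void sendNotifications() {
        for (QuorumServer server : self.getVotingView().values()) {
            ToSend notmsg = new ToSend(ToSend.mType.notification,
                    proposedLeader,                 // who we currently vote for
                    proposedZxid,                   // that candidate's last zxid
                    logicalclock.get(),             // our election round
                    QuorumPeer.ServerState.LOOKING, // our own state
                    server.id,                      // destination sid
                    proposedEpoch);                 // the candidate's epoch
            sendqueue.offer(notmsg);
        }
    }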

There is a line in the code below, HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();. This map can be understood as a small ballot box that every node maintains. It may hold the node's own vote for itself, someone else's vote for this node, someone else's vote for a third node, or this node's vote for someone else. Counting the ballots in this box decides whether a given node can become leader, which is exactly what the source below does with its contents:

    // todo based on others' votes and our own, judge whether this round's proposed leader can become leader
    if (termPredicate(recvset,
            new Vote(proposedLeader, proposedZxid,
                    logicalclock.get(), proposedEpoch))) {
        // todo reaching here means the machine collecting these votes is already the presumptive leader

        // Verify if there is any change in the proposed leader
        // todo verify whether the proposed leader has changed
        while ((n = recvqueue.poll(finalizeWait,
                TimeUnit.MILLISECONDS)) != null) {
            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                    proposedLeader, proposedZxid, proposedEpoch)) {
                recvqueue.put(n);
                break;
            }
        }
        if (n == null) {
                // todo check whether this node itself is the leader; if so, change its state to LEADING, otherwise the configuration decides whether it becomes an Observer or a Follower
                // todo once the leader is elected, the while in QuorumPeer's run() loops again and servers with different roles enter different branches
                self.setPeerState((proposedLeader == self.getId()) ?
                        ServerState.LEADING : learningState());
        
                Vote endVote = new Vote(proposedLeader,
                        proposedZxid,
                        logicalclock.get(),
                        proposedEpoch);
                leaveInstance(endVote);
                return endVote;
            }
        }
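
termPredicate() itself is short; roughly, it counts how many ballots in the box agree with our current proposal and hands that set of sids to the quorum verifier:

    // roughly the logic of FastLeaderElection.termPredicate()
    protected boolean termPredicate(HashMap<Long, Vote> votes, Vote vote) {
        HashSet<Long> set = new HashSet<Long>();
        // collect the sid of every voter whose ballot matches the proposed vote
        for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())) {
                set.add(entry.getKey());
            }
        }
        // the proposal wins once the agreeing sids form a quorum
        return self.getQuorumVerifier().containsQuorum(set);
    }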

The key call inside termPredicate() is self.getQuorumVerifier().containsQuorum(set);, whose implementation follows. It is simply the majority check: when a node holds matching votes from more than half of the nodes in the cluster, it changes its own state to LEADING, and the other nodes change their states to FOLLOWING or OBSERVING as appropriate.

    public boolean containsQuorum(Set<Long> set){
        return (set.size() > half);
    }
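
With the default QuorumMaj verifier, half is simply n / 2 (integer division), so a quick worked example:

    // worked example of containsQuorum under QuorumMaj, where half = n / 2
    int n = 5;                    // ensemble size
    int half = n / 2;             // 2
    boolean elected = 3 > half;   // true:  3 agreeing servers out of 5 elect a leader
    boolean notYet  = 2 > half;   // false: 2 out of 5 are not enough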

Each node also maintains a clock, logicalclock, an AtomicLong that marks how many rounds of voting have occurred. What is it for? Its logic shows in the code below: when a received vote carries a clock newer than the node's current clock, the node may have missed some ballots for one reason or another, so it updates its own clock, clears its ballot box, and judges afresh whether to vote for itself or for the other candidate. Conversely, when a received vote carries a clock older than the node's current clock, that vote is meaningless and is discarded outright.

   if (n.electionEpoch > logicalclock.get()) {
                                // todo adjust our own clock to the newer epoch
                                logicalclock.set(n.electionEpoch);
                                // todo clear our own ballot box
                                recvset.clear();

So on what basis does a node judge whether to vote for itself or for someone else? By comparing the information encapsulated in each vote: epoch, zxid, and myid. Usually the node with the larger epoch has priority to become Leader; with equal epochs, the larger zxid has priority; and with equal zxids, the larger myid has priority.

If the check finds that another node is more suitable to be leader than itself, the node re-votes and elects the more suitable node.
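
In code this rule is totalOrderPredicate(); simplified (the real method also rejects voters whose quorum weight is 0), the comparison is:

    // simplified comparison used by FastLeaderElection.totalOrderPredicate()
    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch,
                                          long curId, long curZxid, long curEpoch) {
        // epoch first, then zxid, then myid; "new" wins only on a strictly better tuple
        return (newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid)
                        || ((newZxid == curZxid) && (newId > curId))));
    }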

Complete source code

// todo we are now in the implementation class, FastLeaderElection.java
public Vote lookForLeader() throws InterruptedException {
try {
    // todo create the JMX bean used for electing the Leader
    self.jmxLeaderElectionBean = new LeaderElectionBean();

    MBeanRegistry.getInstance().register(
            self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
    LOG.warn("Failed to register with JMX", e);
    self.jmxLeaderElectionBean = null;
}
if (self.start_fle == 0) {
    self.start_fle = Time.currentElapsedTime();
}
try {
    // todo each server's very own ballot box: a map storing the votes other servers cast to it
    // todo the Long key (sid) marks who cast the vote to this server; the Vote value is the vote itself
    HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

    HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

    int notTimeout = finalizeWait;

    synchronized (this) {
        //todo the AtomicLong clock
        logicalclock.incrementAndGet();
        //todo at initial startup the arguments are all the node's own values, which amounts to voting for itself
        updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
    }

    LOG.info("New election. My id =  " + self.getId() +
            ", proposed zxid=0x" + Long.toHexString(proposedZxid));
    // todo send the vote for itself out
    sendNotifications();

    /*
     * Loop in which we exchange notifications until we find a leader
     */
    // todo keep looping as long as this node stays in the LOOKING state
    while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
        /*
         * Remove next notification from queue, times out after 2 times
         * the termination time
         */
        //todo  try to obtain the other servers' vote information

        // todo take one msg out of the receive queue (right now that queue holds the vote this node cast for itself)
        // todo in QuorumCnxManager.java's send logic, a vote addressed to oneself goes straight into recvQueue without passing through the socket
        // todo so what it takes out here is its own vote
        Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

        /*
         * Sends more notifications if haven't received enough.
         * Otherwise processes new notification.
         */
        // todo in the first round of voting this is not null
        if (n == null) {
            // todo in the second round there are no more votes, n is null, and we enter this branch
            // todo now the check: if the cluster has three servers and only one has started, two servers are still down
            // todo there are 3 votes in total: 1 went straight into recvQueue, and whether the other two destined for the other machines were delivered is judged here
            // todo the check fails, because the two queues in queueSendMap are both non-empty
            if (manager.haveDelivered()) {
                sendNotifications();
            } else {
                // todo so we enter this logic
                manager.connectAll();
            }

            /*
             * Exponential backoff
             */
            int tmpTimeOut = notTimeout * 2;
            notTimeout = (tmpTimeOut < maxNotificationInterval ?
                    tmpTimeOut : maxNotificationInterval);
            LOG.info("Notification time out: " + notTimeout);
        } else if (validVoter(n.sid) && validVoter(n.leader)) {
            // todo after receiving another server's vote information, it is handled in the branches below
            /*
             * Only proceed if the vote comes from a replica in the
             * voting view for a replica in the voting view.
             * todo only proceed when the vote comes from a replica in the voting view
             */
            switch (n.state) {
                case LOOKING:
                    // todo the server that sent this vote is also in the LOOKING state

                    // If notification > current, replace and send messages out
                    // todo compare the received vote's epoch with the current clock

                    // todo the received vote's epoch > this server's clock
                    // todo this server may have missed a few rounds of voting due to a fault and needs to vote again
                    if (n.electionEpoch > logicalclock.get()) {
                        // todo adjust our own clock to the newer epoch
                        logicalclock.set(n.electionEpoch);
                        // todo clear our own ballot box
                        recvset.clear();
                        // todo compare the other node's info with our own and pick whoever is better suited to be leader; if we are still better, do nothing; if the other side is better, change the vote and vote for them
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            updateProposal(getInitId(),
                                    getInitLastLoggedZxid(),
                                    getPeerEpoch());
                        }
                        sendNotifications();

                        // todo the received vote's epoch < this server's clock
                        // todo meaning this vote is stale and can no longer be used
                    } else if (n.electionEpoch < logicalclock.get()) {
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                    + Long.toHexString(n.electionEpoch)
                                    + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;
                        // todo the other vote's clock is the same as mine
                        // todo once totalOrderPredicate is satisfied, the current vote is changed and re-cast
                        /**
                         *   totalOrderPredicate compares which of the two better satisfies:
                         *   ((newEpoch > curEpoch) ||
                         *   ((newEpoch == curEpoch) &&
                         *   ((newZxid > curZxid) ||
                         *   ((newZxid == curZxid) &&
                         *   (newId > curId)))));
                         */
                        // todo returning true means the other node is better suited to be leader
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                            proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding vote: from=" + n.sid +
                                ", proposed leader=" + n.leader +
                                ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }
                    // todo store this vote into the ballot box
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    // todo based on others' votes and our own, judge whether this round's proposed leader can become leader
                    if (termPredicate(recvset,
                            new Vote(proposedLeader, proposedZxid,
                                    logicalclock.get(), proposedEpoch))) {
                        // todo reaching here means the machine collecting these votes is already the presumptive leader

                        // Verify if there is any change in the proposed leader
                        // todo verify whether the proposed leader has changed
                        while ((n = recvqueue.poll(finalizeWait,
                                TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        if (n == null) {
                            // todo check whether this node itself is the leader; if so, change its state to LEADING, otherwise the configuration decides whether it becomes an Observer or a Follower
                            // todo once the leader is elected, the while in QuorumPeer's run() loops again and servers with different roles enter different branches
                            self.setPeerState((proposedLeader == self.getId()) ?
                                    ServerState.LEADING : learningState());

                            Vote endVote = new Vote(proposedLeader,
                                    proposedZxid,
                                    logicalclock.get(),
                                    proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    // todo Observers are forbidden from voting
                    LOG.debug("Notification from observer: " + n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    if (n.electionEpoch == logicalclock.get()) {
                        recvset.put(n.sid, new Vote(n.leader,
                                n.zxid,
                                n.electionEpoch,
                                n.peerEpoch));

                        if (ooePredicate(recvset, outofelection, n)) {
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING : learningState());

                            Vote endVote = new Vote(n.leader,
                                    n.zxid,
                                    n.electionEpoch,
                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify
                     * a majority is following the same leader.
                     */
                    outofelection.put(n.sid, new Vote(n.version,
                            n.leader,
                            n.zxid,
                            n.electionEpoch,
                            n.peerEpoch,
                            n.state));

                    if (ooePredicate(outofelection, outofelection, n)) {
                        synchronized (this) {
                            logicalclock.set(n.electionEpoch);
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING : learningState());
                        }
                        Vote endVote = new Vote(n.leader,
                                n.zxid,
                                n.electionEpoch,
                                n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                            n.state, n.sid);
                    break;
            }
        } else {
            if (!validVoter(n.leader)) {
                LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
            }
            if (!validVoter(n.sid)) {
                LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
            }
        }
    }
    return null;

After the judgments above, each node has elected itself into a role. Back in QuorumPeer.java's run() loop, the node no longer enters the case LOOKING: block; instead it performs the duties of its role and runs the corresponding initialization and startup.

Scenario 2: After the cluster starts normally, the Leader crashes due to a failure and a new Leader is elected

The second case: after the cluster has started normally, the leader crashes due to a failure and a new Leader must be elected.

What does the logic for this part look like?

Although the leader has crashed, the servers in the Follower role are still executing the infinite while loop in QuorumPeer.java's run() method. When follower.followLeader(); can no longer find the leader, an exception is thrown and the logic in the finally block ultimately runs; as you can see, it changes the node's own state back to LOOKING, after which the leader is re-elected.

   break;
        case FOLLOWING:
            // todo this server took the Follower role
            try {
                LOG.info("FOLLOWING");
                setFollower(makeFollower(logFactory));
                follower.followLeader();
            } catch (Exception e) {
                LOG.warn("Unexpected exception",e);
            } finally {
                follower.shutdown();
                setFollower(null);
                setPeerState(ServerState.LOOKING);
            }
            break;

Scenario 3: The number of Followers no longer passes the majority check, the Leader shuts itself down, and a new Leader is elected

Case 3: suppose the cluster has 2 Followers and 1 Leader. When Followers crash and the servers still in sync can no longer satisfy the majority check, the leader is re-elected.

Back to the source: the leader repeatedly enters case LEADING: and executes leader.lead();

 case LEADING:
            // todo this server was successfully elected leader
            LOG.info("LEADING");
            try {
                setLeader(makeLeader(logFactory));
                // todo follow lead()
                leader.lead();
                setLeader(null);
            } catch (Exception e) {
                LOG.warn("Unexpected exception",e);
            } finally {
                if (leader != null) {
                    leader.shutdown("Forcing shutdown");
                    setLeader(null);
                }
                setPeerState(ServerState.LOOKING);
            }
            break;

But on every tick, leader.lead(); makes the following judgment. Clearly, when the majority check is not met, the leader shuts itself down; in the end the state of every node in the cluster is changed back to LOOKING and a re-election takes place.


              if (!tickSkip && !self.getQuorumVerifier().containsQuorum(syncedSet)) {
                //if (!tickSkip && syncedCount < self.quorumPeers.size() / 2) {
                    // Lost quorum, shutdown
                    shutdown("Not sufficient followers synced, only synced with sids: [ "
                            + getSidSetString(syncedSet) + " ]");
                    // make sure the order is the same!
                    // the leader goes to looking
                    return;
              } 

Scenario 4: The cluster is working normally and a new Follower joins

A newly added Follower starts out in the LOOKING state, so it likewise tries to elect a leader, and likewise votes for itself first. But in a stable cluster every node's role has already been settled, so in this situation execution enters the following branch of FastLeaderElection.java's lookForLeader() method, which makes the newly added node directly acknowledge the existing Leader.

case OBSERVING:
        // todo Observers are forbidden from voting
        LOG.debug("Notification from observer: " + n.sid);
        break;
    case FOLLOWING:
    case LEADING:
        /*
            * Consider all notifications from the same epoch
            * together.
            */
        if (n.electionEpoch == logicalclock.get()) {
            recvset.put(n.sid, new Vote(n.leader,
                    n.zxid,
                    n.electionEpoch,
                    n.peerEpoch));

            if (ooePredicate(recvset, outofelection, n)) {
                self.setPeerState((n.leader == self.getId()) ?
                        ServerState.LEADING : learningState());

                Vote endVote = new Vote(n.leader,
                        n.zxid,
                        n.electionEpoch,
                        n.peerEpoch);
                leaveInstance(endVote);
                return endVote;
            }
        }

If you spot an error, please point it out; if this post helped you, your likes and support are welcome.

Origin www.cnblogs.com/ZhuChangwu/p/11622763.html