ZooKeeper (7) Source Code Analysis: Cluster Leader Election with FastLeaderElection

A. The Election Interface

The parent interface for all election algorithms is Election. It defines two methods: lookForLeader, which searches for (elects) a leader, and shutdown, which stops the election, for example by closing the connections between servers.
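To make the contract concrete, here is a minimal Java sketch of an interface with these two methods plus a trivial implementation. Vote is a hypothetical two-field stand-in for ZooKeeper's real Vote class, and StandaloneElection is a hypothetical single-server election that simply votes for itself; neither is ZooKeeper code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for ZooKeeper's Vote class.
final class Vote {
    final long id;   // sid of the elected (proposed) leader
    final long zxid; // last zxid of that leader

    Vote(long id, long zxid) {
        this.id = id;
        this.zxid = zxid;
    }
}

// The two operations described above.
interface Election {
    Vote lookForLeader() throws InterruptedException; // run an election, return the winning vote
    void shutdown();                                   // stop the election and release resources
}

// Hypothetical single-server election: always elects itself until shut down.
final class StandaloneElection implements Election {
    private final long myId;
    private final long lastZxid;
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    StandaloneElection(long myId, long lastZxid) {
        this.myId = myId;
        this.lastZxid = lastZxid;
    }

    @Override
    public Vote lookForLeader() throws InterruptedException {
        if (stopped.get()) {
            throw new InterruptedException("election has been shut down");
        }
        return new Vote(myId, lastZxid); // no peers, so this server wins
    }

    @Override
    public void shutdown() {
        stopped.set(true);
    }
}
```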


Implementation classes of the Election interface:


1. AuthFastLeaderElection: uses the same algorithm as FastLeaderElection but adds authentication information to its messages. Deprecated since version 3.4.0.
2. FastLeaderElection: an implementation of the standard Fast Paxos algorithm; elections run over TCP.
3. LeaderElection: deprecated since version 3.4.0.

B. FastLeaderElection Analysis

1. Important inner classes

1. Notification class

Notification represents election information received by this server, that is, a vote sent by another server. It contains the sid of the proposed leader, its zxid, the sender's election round (electionEpoch), the sender's sid, the sender's state, and the epoch of the proposed leader (peerEpoch).

```java
static public class Notification {
    /*
     * Format version, introduced in 3.4.6
     */
    public final static int CURRENTVERSION = 0x2;
    int version;

    /*
     * Proposed leader: sid of the proposed leader
     */
    long leader;

    /*
     * zxid of the proposed leader (its last transaction id)
     */
    long zxid;

    /*
     * Epoch: the sender's election round (logical clock)
     */
    long electionEpoch;

    /*
     * Current state of the sender, one of four values:
     * LOOKING    searching for a leader
     * FOLLOWING  follower
     * LEADING    leader
     * OBSERVING  observer; does not vote on proposals or in elections
     */
    QuorumPeer.ServerState state;

    /*
     * Address of sender: sid of the server that sent this vote
     */
    long sid;

    QuorumVerifier qv;

    /*
     * Epoch of the proposed leader
     */
    long peerEpoch;
}
```

2. ToSend class

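ToSend represents a vote to be sent to another server. Like Notification, it carries the proposed leader's sid and zxid, the election round, and the sender's state; here, however, sid is the recipient's sid, and peerEpoch is the proposed leader's epoch.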

```java
static public class ToSend {
    static enum mType {crequest, challenge, notification, ack}

    ToSend(mType type,
            long leader,
            long zxid,
            long electionEpoch,
            ServerState state,
            long sid,
            long peerEpoch,
            byte[] configData) {

        this.leader = leader;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
        this.state = state;
        this.sid = sid;
        this.peerEpoch = peerEpoch;
        this.configData = configData;
    }

    /*
     * Proposed leader in the case of notification: sid of the proposed leader
     */
    long leader;

    /*
     * id contains the tag for acks, and zxid for notifications
     * (for a notification: the proposed leader's last zxid)
     */
    long zxid;

    /*
     * Epoch: the sender's election round
     */
    long electionEpoch;

    /*
     * Current state of the sender
     */
    QuorumPeer.ServerState state;

    /*
     * Address of recipient: sid of the server this vote is sent to
     */
    long sid;

    /*
     * Used to send a QuorumVerifier (configuration info)
     */
    byte[] configData = dummyData;

    /*
     * Leader epoch: epoch of the proposed leader
     */
    long peerEpoch;
}
```

3. Messenger class

3.1 Inner classes

Messenger contains two inner classes, WorkerReceiver and WorkerSender.
1. WorkerReceiver extends ZooKeeperThread and is the vote receiver.
2. It repeatedly takes election messages (of type Message) sent by other servers from the recvQueue of QuorumCnxManager:

```java
WorkerReceiver(QuorumCnxManager manager) {
    super("WorkerReceiver");
    this.stop = false;
    this.manager = manager;
}
```

```java
// Take the next vote message from QuorumCnxManager's recvQueue (wait up to 3 s)
response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
if (response == null) continue;
```

Each message is converted into a Notification and stored in recvqueue. While receiving, if the external vote belongs to an earlier election round than this server's current round, that vote is not adopted; instead, the server immediately replies with its own current vote, wrapped in a ToSend and added to sendqueue:

```java
ToSend notmsg = new ToSend(ToSend.mType.notification,
        v.getId(),
        v.getZxid(),
        logicalclock.get(),
        self.getPeerState(),
        response.sid,
        v.getPeerEpoch(),
        qv.toString().getBytes());
sendqueue.offer(notmsg);
```
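The receive-side rule can be sketched in isolation. This is a simplified, hypothetical model (InMsg, OutMsg and ReceiverSketch are not ZooKeeper classes) that assumes the server is LOOKING and, following the description above, drops a stale vote instead of queueing it; the real WorkerReceiver also handles other server states and configuration data.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical incoming vote: who sent it, whom it proposes, and the sender's round.
final class InMsg {
    final long sid, leader, zxid, electionEpoch;

    InMsg(long sid, long leader, long zxid, long electionEpoch) {
        this.sid = sid;
        this.leader = leader;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
    }
}

// Hypothetical outgoing vote addressed to one recipient.
final class OutMsg {
    final long recipient, leader, zxid, electionEpoch;

    OutMsg(long recipient, long leader, long zxid, long electionEpoch) {
        this.recipient = recipient;
        this.leader = leader;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
    }
}

// Receive-side handling for a server that is LOOKING.
final class ReceiverSketch {
    final LinkedBlockingQueue<InMsg> recvqueue = new LinkedBlockingQueue<>();
    final LinkedBlockingQueue<OutMsg> sendqueue = new LinkedBlockingQueue<>();
    final long logicalclock;      // this server's current round
    final long myLeader, myZxid;  // this server's current proposal

    ReceiverSketch(long logicalclock, long myLeader, long myZxid) {
        this.logicalclock = logicalclock;
        this.myLeader = myLeader;
        this.myZxid = myZxid;
    }

    void onMessage(InMsg m) {
        if (m.electionEpoch < logicalclock) {
            // Sender is in an older round: ignore its vote and reply with ours.
            sendqueue.offer(new OutMsg(m.sid, myLeader, myZxid, logicalclock));
        } else {
            // Current or newer round: hand the vote to the election loop.
            recvqueue.offer(m);
        }
    }
}
```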

3. WorkerSender also extends ZooKeeperThread and is the vote sender. It repeatedly takes the next vote to be sent from sendqueue:

```java
ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
```

It then hands the vote down to the lower layer, QuorumCnxManager. The process method converts FastLeaderElection's ToSend into the byte buffer that QuorumCnxManager sends as a Message:

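```java
void process(ToSend m) {
    // Serialize the vote fields into a byte buffer
    ByteBuffer requestBuffer = buildMsg(m.state.ordinal(),
                                        m.leader,
                                        m.zxid,
                                        m.electionEpoch,
                                        m.peerEpoch,
                                        m.configData);

    // Let QuorumCnxManager deliver it to server m.sid
    manager.toSend(m.sid, requestBuffer);
}
```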
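buildMsg itself is not shown here. As an assumption, the sketch below writes the fields passed to it, in the same order, into a fixed-layout ByteBuffer, followed by a version number and the length-prefixed configuration bytes. It illustrates the idea of a flat wire format and is not guaranteed to match ZooKeeper's exact layout; VoteCodec, readHeader and readConfig are hypothetical names.

```java
import java.nio.ByteBuffer;

// Hypothetical encoder/decoder illustrating a flat, fixed-layout vote message.
final class VoteCodec {
    // Assumed layout: state(int) leader(long) zxid(long) electionEpoch(long)
    // peerEpoch(long) version(int) configLength(int) configData(bytes)
    static final int HEADER_BYTES = 4 + 4 * 8 + 4 + 4; // 44 bytes before the config data
    static final int VERSION = 0x2;

    static ByteBuffer buildMsg(int state, long leader, long zxid,
                               long electionEpoch, long peerEpoch, byte[] configData) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + configData.length);
        buf.putInt(state)
           .putLong(leader)
           .putLong(zxid)
           .putLong(electionEpoch)
           .putLong(peerEpoch)
           .putInt(VERSION)
           .putInt(configData.length)
           .put(configData);
        buf.flip(); // switch from writing to reading
        return buf;
    }

    // Reads {state, leader, zxid, electionEpoch, peerEpoch, version} without moving buf's position.
    static long[] readHeader(ByteBuffer buf) {
        ByteBuffer b = buf.duplicate();
        return new long[] {b.getInt(), b.getLong(), b.getLong(), b.getLong(), b.getLong(), b.getInt()};
    }

    // Reads the length-prefixed config bytes that follow the fixed header.
    static byte[] readConfig(ByteBuffer buf) {
        ByteBuffer b = buf.duplicate();
        b.position(HEADER_BYTES - 4); // start of the configLength field
        byte[] out = new byte[b.getInt()];
        b.get(out);
        return out;
    }
}
```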

3.2 Messenger constructor

```java
Messenger(QuorumCnxManager manager) {
    // Create the WorkerSender
    this.ws = new WorkerSender(manager);
    // Run it on a new thread
    this.wsThread = new Thread(this.ws,
            "WorkerSender[myid=" + self.getId() + "]");
    // Mark it as a daemon thread
    this.wsThread.setDaemon(true);
    // Create the WorkerReceiver
    this.wr = new WorkerReceiver(manager);
    // Run it on a new thread
    this.wrThread = new Thread(this.wr,
            "WorkerReceiver[myid=" + self.getId() + "]");
    // Mark it as a daemon thread
    this.wrThread.setDaemon(true);
}
```
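The constructor only creates the threads; they still have to be started. The hypothetical DaemonSender below models the WorkerSender side of this setup: a daemon thread that keeps polling a queue with a 3-second timeout, as in the poll snippet above, until it is stopped. The delivered list is an assumption that stands in for manager.toSend(...).

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a WorkerSender-style loop running on a daemon thread.
final class DaemonSender implements Runnable {
    final LinkedBlockingQueue<String> sendqueue = new LinkedBlockingQueue<>();
    final List<String> delivered = new CopyOnWriteArrayList<>(); // stands in for manager.toSend(...)
    final Thread thread;
    private volatile boolean stop = false;

    DaemonSender(long myid) {
        thread = new Thread(this, "WorkerSender[myid=" + myid + "]");
        thread.setDaemon(true); // will not keep the JVM alive on its own
    }

    void start() {
        thread.start();
    }

    @Override
    public void run() {
        while (!stop) {
            try {
                // Wait up to 3 s for a vote, then re-check the stop flag.
                String m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                if (m == null) continue;
                delivered.add(m);
            } catch (InterruptedException e) {
                break; // interrupted by halt()
            }
        }
    }

    void halt() throws InterruptedException {
        stop = true;
        thread.interrupt(); // wake the thread if it is blocked in poll()
        thread.join();
    }
}
```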

2. FastLeaderElection class attributes

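```java
// How long (ms) to wait once the election appears to be over, before finalizing it
final static int finalizeWait = 200;
// Upper bound (ms) on the time between two consecutive notification checks
final static int maxNotificationInterval = 60000;
// Manages the connections between servers
QuorumCnxManager manager;
// Send queue: votes waiting to be sent
LinkedBlockingQueue<ToSend> sendqueue;
// Receive queue: external votes that have been received
LinkedBlockingQueue<Notification> recvqueue;
// This server (the voter)
QuorumPeer self;
Messenger messenger;
// Logical clock: the current election round
AtomicLong logicalclock = new AtomicLong(); /* Election instance */
// sid of the proposed leader
long proposedLeader;
// zxid of the proposed leader
long proposedZxid;
// Epoch of the proposed leader
long proposedEpoch;
```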
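finalizeWait and maxNotificationInterval interact inside lookForLeader: when polling recvqueue times out without a notification, the wait is doubled, but never beyond maxNotificationInterval. The sketch below assumes the initial timeout equals finalizeWait; NotificationBackoff, next and schedule are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class NotificationBackoff {
    static final int finalizeWait = 200;
    static final int maxNotificationInterval = 60000;

    // Next timeout after a poll of recvqueue returned nothing: double it, capped.
    static int next(int notTimeout) {
        int tmp = notTimeout * 2;
        return tmp < maxNotificationInterval ? tmp : maxNotificationInterval;
    }

    // The first `count` timeouts a server would use if no votes ever arrived.
    static List<Integer> schedule(int count) {
        List<Integer> out = new ArrayList<>();
        int t = finalizeWait;
        for (int i = 0; i < count; i++) {
            out.add(t);
            t = next(t);
        }
        return out;
    }
}
```

With these constants, schedule(11) returns [200, 400, 800, 1600, 3200, 6400, 12800, 25600, 51200, 60000, 60000]: from the tenth wait onward the cap applies.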

3. FastLeaderElection core methods

1. Sending votes (sendNotifications)

sendNotifications iterates over the set of all voting participants and sends this server's vote to each of them. Sending is not synchronous: each vote is wrapped in a ToSend and placed in sendqueue, and WorkerSender transmits it later.

```java
private void sendNotifications() {
    for (long sid : self.getCurrentAndNextConfigVoters()) {
        QuorumVerifier qv = self.getQuorumVerifier();
        // Build a message carrying this server's current proposal
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch, qv.toString().getBytes());
        if (LOG.isDebugEnabled()) {
            LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x" +
                    Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get()) +
                    " (n.round), " + sid + " (recipient), " + self.getId() +
                    " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
        }
        // Queue the message; WorkerSender does the actual sending
        sendqueue.offer(notmsg);
    }
}
```
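A stripped-down model of this fan-out, with hypothetical types: one ToSend-like message per voter is queued and nothing touches the network. In ZooKeeper the voter set normally includes this server itself, so a server also receives its own vote; the sketch simply treats self like any other recipient.

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of sendNotifications(): queue one copy of the proposal per voter.
final class Broadcaster {
    // Simplified ToSend: recipient plus the proposal and round being advertised.
    static final class Outgoing {
        final long recipient, proposedLeader, proposedZxid, round;

        Outgoing(long recipient, long proposedLeader, long proposedZxid, long round) {
            this.recipient = recipient;
            this.proposedLeader = proposedLeader;
            this.proposedZxid = proposedZxid;
            this.round = round;
        }
    }

    final LinkedBlockingQueue<Outgoing> sendqueue = new LinkedBlockingQueue<>();

    // Only enqueues; a separate sender thread would drain sendqueue.
    void sendNotifications(List<Long> voters, long proposedLeader, long proposedZxid, long round) {
        for (long sid : voters) {
            sendqueue.offer(new Outgoing(sid, proposedLeader, proposedZxid, round));
        }
    }
}
```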

2. totalOrderPredicate

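This function compares a received vote against this server's current vote to decide whether the server proposed in the message is the better candidate. The comparison uses epoch first, then zxid, then server id. A candidate whose weight in the quorum configuration is 0 is never preferred.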

```java
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }

    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *  as current zxid, but server id is higher.
     */
    // 1. If the message's epoch is larger, the server named in the message wins.
    // 2. If the epochs are equal, compare zxids; the larger zxid wins.
    // 3. If both are equal, compare server ids; the larger id wins.
    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
            ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
```
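The same ordering, written as an early-return cascade without the weight check so that it can be tested on its own. VoteOrder is a hypothetical helper; its logic is equivalent to the boolean expression above.

```java
// Hypothetical standalone version of the (epoch, zxid, sid) ordering.
final class VoteOrder {
    // True if the candidate (newId, newZxid, newEpoch) should replace the
    // current proposal (curId, curZxid, curEpoch).
    static boolean totalOrderPredicate(long newId, long newZxid, long newEpoch,
                                       long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) return newEpoch > curEpoch; // epoch decides first
        if (newZxid != curZxid) return newZxid > curZxid;     // then the more up-to-date log
        return newId > curId;                                 // sid only breaks ties
    }
}
```

For example, a candidate with a higher epoch wins even if its zxid is lower, and an identical vote is never considered better.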

3. termPredicate

This function decides whether the leader election can end, that is, whether more than half of the servers have voted for the same leader. It compares every received vote with the given vote, collects the sids of the servers whose vote matches into one set, and then checks whether that set contains more than half of the voters. If a newer configuration has been seen, the set must form a quorum under both configurations.

```java
protected boolean termPredicate(Map<Long, Vote> votes, Vote vote) {
    SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
    voteSet.addQuorumVerifier(self.getQuorumVerifier());
    if (self.getLastSeenQuorumVerifier() != null
            && self.getLastSeenQuorumVerifier().getVersion() > self
                    .getQuorumVerifier().getVersion()) {
        voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
    }

    /*
     * First make the views consistent. Sometimes peers will have different
     * zxids for a server depending on timing.
     */
    for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
        if (vote.equals(entry.getValue())) {
            voteSet.addAck(entry.getKey());
        }
    }

    return voteSet.hasAllQuorums();
}
```
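Assuming a plain majority quorum (every voter has the same weight), the check reduces to counting matching votes. This hypothetical sketch compares only the proposed leader's sid, whereas ZooKeeper's Vote.equals compares more fields, and it ignores the second configuration handled by SyncedLearnerTracker.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical majority check: votes maps voter sid -> sid of the leader it proposes.
final class MajorityCheck {
    static boolean termPredicate(Map<Long, Long> votes, long candidate, int ensembleSize) {
        Set<Long> acks = new HashSet<>();
        for (Map.Entry<Long, Long> e : votes.entrySet()) {
            if (e.getValue() == candidate) { // same proposal: count this voter
                acks.add(e.getKey());
            }
        }
        return acks.size() > ensembleSize / 2; // strictly more than half
    }
}
```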

4. lookForLeader

1. This function starts a new round of leader election. It first increments the logical clock, then initializes this server's vote (proposing itself), and finally puts that vote into sendqueue to be sent to the other servers.

2. Each server then repeatedly takes external votes from recvqueue and processes them:

```java
Notification n = recvqueue.poll(notTimeout,
        TimeUnit.MILLISECONDS);
```

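3. Compare election rounds, compare votes, and update this server's vote. The snippet below handles a vote from a newer round: after adopting that round, the server compares the external vote with its own initial vote (getInitId, getInitLastLoggedZxid, getPeerEpoch), keeps the winner as its proposal, and broadcasts it again: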

```java
if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
        getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
    updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
    updateProposal(getInitId(),
            getInitLastLoggedZxid(),
            getPeerEpoch());
}
sendNotifications();
```

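4. Record the votes, count them, and return the final vote. The snippet below handles votes from servers that are already FOLLOWING or LEADING (collected in outofelection): if they form a quorum for the same leader and checkLeader confirms that leader, this server adopts the round, switches to LEADING or to a learner state, and returns the final vote: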

```java
if (termPredicate(outofelection, new Vote(n.version, n.leader,
        n.zxid, n.electionEpoch, n.peerEpoch, n.state))
        && checkLeader(outofelection, n.leader, n.electionEpoch)) {
    synchronized (this) {
        logicalclock.set(n.electionEpoch);
        self.setPeerState((n.leader == self.getId()) ?
                ServerState.LEADING : learningState());
    }
    // The final vote
    Vote endVote = new Vote(n.leader, n.zxid,
            n.electionEpoch, n.peerEpoch);
    // Clear the remaining votes from recvqueue
    leaveInstance(endVote);
    return endVote;
}
```
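The four steps can be tied together in a single-threaded simulation. Everything here is a simplification under stated assumptions: ElectionLoopSketch and Note are hypothetical, peerEpoch is ignored, the inbox is a plain deque instead of recvqueue, this server's own updated vote is written straight into recvset, the collected votes are assumed to be discarded when a newer round appears, the finalizeWait grace period and the FOLLOWING/LEADING branch are omitted, and a plain majority quorum is assumed.

```java
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, single-threaded model of lookForLeader() (peerEpoch ignored).
final class ElectionLoopSketch {
    // Simplified Notification: sender, proposed leader, its zxid, sender's round.
    static final class Note {
        final long sid, leader, zxid, round;

        Note(long sid, long leader, long zxid, long round) {
            this.sid = sid;
            this.leader = leader;
            this.zxid = zxid;
            this.round = round;
        }
    }

    final long myId, myZxid;
    final int ensembleSize;
    long logicalclock;
    long proposedLeader, proposedZxid;
    final Map<Long, long[]> recvset = new HashMap<>(); // voter sid -> {leader, zxid, round}

    ElectionLoopSketch(long myId, long myZxid, int ensembleSize, long startRound) {
        this.myId = myId;
        this.myZxid = myZxid;
        this.ensembleSize = ensembleSize;
        this.logicalclock = startRound;
    }

    // zxid first, then sid (the epoch comparison is left out of this sketch).
    static boolean better(long newId, long newZxid, long curId, long curZxid) {
        return newZxid > curZxid || (newZxid == curZxid && newId > curId);
    }

    // Returns the elected sid, or -1 if the inbox runs dry before a quorum agrees.
    long lookForLeader(Deque<Note> inbox) {
        logicalclock++;                          // step 1: new round,
        proposedLeader = myId;                   // initially vote for ourselves
        proposedZxid = myZxid;
        recvset.put(myId, new long[] {proposedLeader, proposedZxid, logicalclock});
        while (!inbox.isEmpty()) {
            Note n = inbox.poll();               // step 2: next external vote
            if (n.round > logicalclock) {        // step 3: sender is in a newer round
                logicalclock = n.round;
                recvset.clear();
                boolean theirs = better(n.leader, n.zxid, myId, myZxid);
                proposedLeader = theirs ? n.leader : myId;
                proposedZxid = theirs ? n.zxid : myZxid;
            } else if (n.round < logicalclock) {
                continue;                        // stale round: ignore
            } else if (better(n.leader, n.zxid, proposedLeader, proposedZxid)) {
                proposedLeader = n.leader;       // same round, better candidate
                proposedZxid = n.zxid;
            }
            // step 4: record both votes (ours may have changed), then count
            recvset.put(myId, new long[] {proposedLeader, proposedZxid, logicalclock});
            recvset.put(n.sid, new long[] {n.leader, n.zxid, n.round});
            int acks = 0;
            for (long[] v : recvset.values()) {
                if (v[0] == proposedLeader && v[1] == proposedZxid && v[2] == logicalclock) acks++;
            }
            if (acks > ensembleSize / 2) return proposedLeader;
        }
        return -1;
    }
}
```

In a three-server ensemble, a server with zxid 5 that hears from a peer proposing itself with zxid 7 switches its vote to that peer and returns as soon as two of the three votes match.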

Summary

The FastLeaderElection algorithm is a core part of ZooKeeper and fairly complex. This article only sorts out the overall flow; many details are not covered.


Source: blog.51cto.com/janephp/2452788