Zookeeper's Leader election source code analysis

Author: JD Logistics Liang Jichao

Zookeeper is a distributed service framework that solves data problems common in distributed applications, such as cluster management and state synchronization. To do so, zookeeper relies on Leader election to guarantee strong data consistency and stability. This article walks through the Leader election source code, starting from the cluster configuration, so that readers can see how the BIO communication mechanism and multi-threaded, multi-layer queues are combined into a high-performance architecture.

01 Leader election mechanism

The Leader election mechanism uses a majority election algorithm: a candidate must win more than half of the votes.

Each zookeeper server is called a node. Each node has the right to vote and casts its vote for a node that is eligible for election. When one node receives more than half of the votes, it becomes the Leader and the other nodes become Followers.
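
The "more than half" rule can be made concrete with a small sketch. The class and method names below are hypothetical and are not part of ZooKeeper; the sketch only assumes that the threshold is computed with integer division, as the half value shown later in this article suggests.

```java
// Hypothetical helper, not part of ZooKeeper: illustrates the "more than half" rule.
final class MajorityRule {
    private MajorityRule() {}

    /** Smallest number of identical votes that wins among votingMembers voters. */
    static int votesNeeded(int votingMembers) {
        return votingMembers / 2 + 1;
    }

    /** True when the number of identical votes is strictly more than half. */
    static boolean isMajority(int votes, int votingMembers) {
        return votes > votingMembers / 2;
    }

    public static void main(String[] args) {
        // Print the threshold for a few cluster sizes.
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " voters -> " + votesNeeded(n) + " votes needed");
        }
    }
}
```

With three voting nodes two matching votes are enough, while four voting nodes need three, which is why an even number of voters adds no extra fault tolerance.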

02 Leader election cluster configuration

  1. Copy the zoo_sample.cfg file to zoo1.cfg, zoo2.cfg, zoo3.cfg and zoo4.cfg

  2. Modify each of these files, setting the values as follows:

【plain】
Contents of zoo1.cfg:
dataDir=/export/data/zookeeper-1
clientPort=2181
server.1=127.0.0.1:2001:3001
server.2=127.0.0.1:2002:3002:participant
server.3=127.0.0.1:2003:3003:participant
server.4=127.0.0.1:2004:3004:observer


Contents of zoo2.cfg:
dataDir=/export/data/zookeeper-2
clientPort=2182
server.1=127.0.0.1:2001:3001
server.2=127.0.0.1:2002:3002:participant
server.3=127.0.0.1:2003:3003:participant
server.4=127.0.0.1:2004:3004:observer


Contents of zoo3.cfg:
dataDir=/export/data/zookeeper-3
clientPort=2183
server.1=127.0.0.1:2001:3001
server.2=127.0.0.1:2002:3002:participant
server.3=127.0.0.1:2003:3003:participant
server.4=127.0.0.1:2004:3004:observer


Contents of zoo4.cfg:
dataDir=/export/data/zookeeper-4
clientPort=2184
server.1=127.0.0.1:2001:3001
server.2=127.0.0.1:2002:3002:participant
server.3=127.0.0.1:2003:3003:participant
server.4=127.0.0.1:2004:3004:observer

  3. The entry format is server.<server number (the content of the myid file)>=<ip>:<data synchronization port>:<election port>:<election role>
  • participant takes part in the election and is the default, so it can be omitted. observer does not take part in the election.

  4. Create a myid file in each of the /export/data/zookeeper-1, /export/data/zookeeper-2, /export/data/zookeeper-3 and /export/data/zookeeper-4 directories, and write 1, 2, 3 and 4 into them respectively. This value assigns each node's sid (full name: Server ID).

  5. Start three zookeeper instances:
  • bin/zkServer.sh start conf/zoo1.cfg
  • bin/zkServer.sh start conf/zoo2.cfg
  • bin/zkServer.sh start conf/zoo3.cfg
  6. Every time an instance starts, it reads the configuration file passed as the startup parameter, so that the instance knows its own identity sid as a server and how many instances in the cluster participate in the election.
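
To illustrate how the server entries above determine the number of voters, here is a minimal sketch of a parser for lines of the form server.N=ip:port:port[:role]. It is a simplified, hypothetical stand-in: ZooKeeper's own parsing (QuorumPeerConfig and QuorumServer) handles more cases, and the class name ServerLine and its fields are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a simplified parser for "server.N=ip:port:port[:role]" lines.
// It does not handle IPv6 addresses or the other forms ZooKeeper itself accepts.
final class ServerLine {
    final long sid;
    final String host;
    final int syncPort;
    final int electionPort;
    final boolean participant;

    private ServerLine(long sid, String host, int syncPort, int electionPort, boolean participant) {
        this.sid = sid;
        this.host = host;
        this.syncPort = syncPort;
        this.electionPort = electionPort;
        this.participant = participant;
    }

    static ServerLine parse(String line) {
        String[] kv = line.trim().split("=", 2);
        if (kv.length != 2 || !kv[0].startsWith("server.")) {
            throw new IllegalArgumentException("not a server entry: " + line);
        }
        long sid = Long.parseLong(kv[0].substring("server.".length()));
        String[] parts = kv[1].split(":");
        if (parts.length < 3 || parts.length > 4) {
            throw new IllegalArgumentException("expected ip:port:port[:role]: " + line);
        }
        // The role is optional and defaults to participant.
        boolean participant = parts.length == 3 || parts[3].equalsIgnoreCase("participant");
        if (!participant && !parts[3].equalsIgnoreCase("observer")) {
            throw new IllegalArgumentException("unknown role: " + parts[3]);
        }
        return new ServerLine(sid, parts[0], Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]), participant);
    }

    /** Counts the entries that take part in the election. */
    static int countVoters(List<String> lines) {
        int voters = 0;
        for (String line : lines) {
            if (parse(line).participant) {
                voters++;
            }
        }
        return voters;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "server.1=127.0.0.1:2001:3001",
                "server.2=127.0.0.1:2002:3002:participant",
                "server.3=127.0.0.1:2003:3003:participant",
                "server.4=127.0.0.1:2004:3004:observer");
        System.out.println("voters = " + countVoters(lines));
    }
}
```

For the four entries used in this article the sketch counts three voters, because server.4 is an observer.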

03 Leader election process

Figure 1 Voting process from the first round to the second round

Premise:

The vote data format is vote(sid,zxid,epoch):

  • sid is the Server ID, the unique identifier of each server, and is the content of the myid file;
  • zxid is the data transaction id;
  • epoch is the election cycle. For ease of understanding it is assumed to be 1 (the first election) and is omitted from the votes in this section.

Start the sid=1 and sid=2 nodes in order.

First round of voting:

  1. sid=1 node: its initial vote is for itself, and it sends vote(1,0) to the sid=2 node;

  2. sid=2 node: its initial vote is for itself, and it sends vote(2,0) to the sid=1 node;

  3. sid=1 node: it receives vote(2,0) from the sid=2 node and compares it with its current vote(1,0). The zxid is compared first: a larger zxid means newer data, so the vote with the largest zxid wins. If the zxids are equal, the vote with the largest sid wins. The result is vote(2,0), so the sid=1 node changes its vote to vote(2,0);

  4. sid=2 node: it receives vote(1,0) from the sid=1 node and compares it with its current vote(2,0). By the same rule the result is vote(2,0), so the sid=2 node's vote is unchanged;

  5. The first round of voting is over.

Second round of voting:

  1. sid=1 node: its current vote is vote(2,0), which it sends to the sid=2 node;

  2. sid=2 node: its current vote is vote(2,0), which it sends to the sid=1 node;

  3. sid=1 node: it has received vote(2,0) from the sid=2 node and holds vote(2,0) itself. Three nodes are eligible to vote and two of them have chosen the same vote, which is more than half, so the sid=2 node is recommended as Leader and the sid=1 node becomes a Follower;

  4. sid=2 node: it has received vote(2,0) from the sid=1 node and holds vote(2,0) itself. By the same majority rule the sid=2 node is elected, and its own role becomes Leader.

When the sid=3 node starts afterwards, a Leader has already been elected in the cluster. The sid=1 and sid=2 nodes send their Leader votes back to the sid=3 node, and by the majority rule the sid=2 node remains the Leader.
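
The two rounds above can be replayed with a small simulation. This is a deliberately simplified sketch under stated assumptions: every started node sees every other started node's vote, the epoch is omitted as in the text, and the class TwoRoundElection and its methods are hypothetical names, not ZooKeeper code.

```java
// Hypothetical sketch of the two voting rounds described above; epoch is omitted.
final class TwoRoundElection {
    /** A vote (sid, zxid) as used in the walkthrough. */
    static final class Vote {
        final long sid;
        final long zxid;

        Vote(long sid, long zxid) {
            this.sid = sid;
            this.zxid = zxid;
        }
    }

    /** The larger zxid wins; if the zxids are equal, the larger sid wins. */
    static Vote better(Vote a, Vote b) {
        if (a.zxid != b.zxid) {
            return a.zxid > b.zxid ? a : b;
        }
        return a.sid >= b.sid ? a : b;
    }

    /** Returns the elected leader sid, or -1 if no vote has more than half of votingMembers. */
    static long elect(long[] startedSids, long[] zxids, int votingMembers) {
        if (startedSids.length == 0) {
            return -1;
        }
        // Round 1: every started node votes for itself and broadcasts that vote.
        Vote[] votes = new Vote[startedSids.length];
        for (int i = 0; i < votes.length; i++) {
            votes[i] = new Vote(startedSids[i], zxids[i]);
        }
        // Each node compares its own vote with every received vote and keeps the best one.
        Vote[] adopted = new Vote[votes.length];
        for (int i = 0; i < votes.length; i++) {
            Vote best = votes[i];
            for (Vote received : votes) {
                best = better(best, received);
            }
            adopted[i] = best;
        }
        // Round 2: the adopted votes are broadcast and the matching ones are counted.
        int agree = 0;
        for (Vote v : adopted) {
            if (v.sid == adopted[0].sid) {
                agree++;
            }
        }
        return agree > votingMembers / 2 ? adopted[0].sid : -1;
    }
}
```

Calling elect(new long[]{1, 2}, new long[]{0, 0}, 3) corresponds to the walkthrough: both nodes end up with vote(2,0), and two matching votes out of three voters are enough to elect sid=2.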

3.1 Leader election adopts multi-layer queue architecture

The zookeeper election is implemented in two layers: the election application layer and the message transmission queue layer. The first-layer (application) queues receive and send all votes, while the second-layer (transport) send queues are split into one queue per server sid, so that sending to one server does not affect sending to the others. For example, a failure to send to one machine does not block sending to the healthy servers.

Figure 2 Flow chart of interaction between upper and lower layers of multi-layer queues
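
The isolation between servers can be sketched as follows. This is a minimal illustration, not ZooKeeper's implementation: the class TwoLayerQueues, the Outgoing type, the String payloads, and the capacity of 1 are assumptions chosen for the example, and a full per-sid queue simply drops the new message here.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the two queue layers; names and capacity are illustrative.
final class TwoLayerQueues {
    /** An outgoing ballot addressed to one server. */
    static final class Outgoing {
        final long targetSid;
        final String payload;

        Outgoing(long targetSid, String payload) {
            this.targetSid = targetSid;
            this.payload = payload;
        }
    }

    private final long mySid;
    // Layer 1: a single application-level queue for all outgoing ballots.
    final BlockingQueue<Outgoing> sendqueue = new LinkedBlockingQueue<>();
    // Layer 2: one bounded transport queue per target sid, plus a shared receive queue.
    final Map<Long, BlockingQueue<String>> queueSendMap = new ConcurrentHashMap<>();
    final BlockingQueue<String> recvQueue = new LinkedBlockingQueue<>();

    TwoLayerQueues(long mySid) {
        this.mySid = mySid;
    }

    /** Moves everything in layer 1 into layer 2 and returns the number of dropped messages. */
    int dispatch() {
        int dropped = 0;
        Outgoing m;
        while ((m = sendqueue.poll()) != null) {
            if (m.targetSid == mySid) {
                // Loopback: a ballot for ourselves skips the network.
                recvQueue.offer(m.payload);
                continue;
            }
            BlockingQueue<String> q =
                    queueSendMap.computeIfAbsent(m.targetSid, sid -> new ArrayBlockingQueue<>(1));
            // offer() never blocks, so a full queue for one sid cannot stall the others.
            if (!q.offer(m.payload)) {
                dropped++;
            }
        }
        return dropped;
    }
}
```

Because each target sid has its own bounded queue and offer() does not block, a slow or unreachable server only affects its own queue.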

04 Analysis code entry class

Find the service startup class by viewing the contents of the zkServer.sh file:

org.apache.zookeeper.server.quorum.QuorumPeerMain

05 Election process code analysis

Figure 3 Flow chart of election code implementation

  1. Load the configuration file QuorumPeerConfig.parse(path);

The key configuration information for Leader election is as follows:

  • Read the myid file in the dataDir directory and set its content as the sid of the current application, i.e. the voter's identity. The myid variable that appears below is the sid of the current node itself.
    • Set peerType, which determines whether the current application participates in the election.
  • new QuorumMaj() parses the entries with the server. prefix to load the cluster members: allMembers holds all members, votingMembers the members that participate in the election, and observingMembers the observers. It also sets the half value to votingMembers.size()/2.
【Java】
public QuorumMaj(Properties props) throws ConfigException {
        for (Entry<Object, Object> entry : props.entrySet()) {
            String key = entry.getKey().toString();
            String value = entry.getValue().toString();
            // Read the application instance entries in the cluster configuration file that start with server.
            if (key.startsWith("server.")) {
                int dot = key.indexOf('.');
                long sid = Long.parseLong(key.substring(dot + 1));
                QuorumServer qs = new QuorumServer(sid, value);
                allMembers.put(Long.valueOf(sid), qs);
                if (qs.type == LearnerType.PARTICIPANT)
                    // An instance bound to the PARTICIPANT role takes part in the election
                    votingMembers.put(Long.valueOf(sid), qs);
                else {
                    // Observer member
                    observingMembers.put(Long.valueOf(sid), qs);
                }
            } else if (key.equals("version")) {
                version = Long.parseLong(value, 16);
            }
        }
        // Base value for the majority check
        half = votingMembers.size() / 2;
    }

  2. QuorumPeerMain.runFromConfig(config) starts the service;

  3. QuorumPeer.startLeaderElection() starts the election service;

  • Set the current vote new Vote(sid,zxid,epoch)
【Java】
synchronized public void startLeaderElection(){
try {
           if (getPeerState() == ServerState.LOOKING) {
               // First round: the current node votes for itself by default
               currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
           }
       } catch(IOException e) {
           RuntimeException re = new RuntimeException(e.getMessage());
           re.setStackTrace(e.getStackTrace());
           throw re;
       }
//........
}

  • Create the election management class QuorumCnxManager;
  • Initialize recvQueue<Message(sid,ByteBuffer)>, the queue for received votes (second-layer transport queue);
  • Initialize queueSendMap<sid,queue>, the per-sid queues for votes to send (second-layer transport queue);
  • Initialize senderWorkerMap<sid,SendWorker>, the container of vote-sending worker threads; an entry means the node with that sid is connected;
  • Initialize the election listening thread class QuorumCnxManager.Listener.
【Java】
//QuorumPeer.createCnxnManager()
public QuorumCnxManager(QuorumPeer self,
                        final long mySid,
                        Map<Long,QuorumPeer.QuorumServer> view,
                        QuorumAuthServer authServer,
                        QuorumAuthLearner authLearner,
                        int socketTimeout,
                        boolean listenOnAllIPs,
                        int quorumCnxnThreadsSize,
                        boolean quorumSaslAuthEnabled) {
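    // Queue for received votes (second-layer transport queue)
    this.recvQueue = new ArrayBlockingQueue<Message>(RECV_CAPACITY);
    // Per-sid queues for votes to send (second-layer transport queue)
    this.queueSendMap = new ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>>();
    // Container of vote-sending worker threads; an entry means the node with that sid is connected
    this.senderWorkerMap = new ConcurrentHashMap<Long, SendWorker>();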
    this.lastMessageSent = new ConcurrentHashMap<Long, ByteBuffer>();


    String cnxToValue = System.getProperty("zookeeper.cnxTimeout");
    if(cnxToValue != null){
        this.cnxTO = Integer.parseInt(cnxToValue);
    }


    this.self = self;


    this.mySid = mySid;
    this.socketTimeout = socketTimeout;
    this.view = view;
    this.listenOnAllIPs = listenOnAllIPs;


    initializeAuth(mySid, authServer, authLearner, quorumCnxnThreadsSize,
            quorumSaslAuthEnabled);
    // Starts listener thread that waits for connection requests 
    // Create the election listening thread, which receives election vote requests
    listener = new Listener();
    listener.setName("QuorumPeerListener");
}
//QuorumPeer.createElectionAlgorithm
protected Election createElectionAlgorithm(int electionAlgorithm){
    Election le=null;
    //TODO: use a factory rather than a switch
    switch (electionAlgorithm) {
    case 0:
        le = new LeaderElection(this);
        break;
    case 1:
        le = new AuthFastLeaderElection(this);
        break;
    case 2:
        le = new AuthFastLeaderElection(this, true);
        break;
    case 3:
        qcm = createCnxnManager();// new QuorumCnxManager(... new Listener())
        QuorumCnxManager.Listener listener = qcm.listener;
        if(listener != null){
            listener.start(); // start the election listening thread
            FastLeaderElection fle = new FastLeaderElection(this, qcm);
            fle.start();
            le = fle;
        } else {
            LOG.error("Null listener when initializing cnx manager");
        }
        break;
    default:
        assert false;
    }
    return le;
}

  4. Start the election listening thread QuorumCnxManager.Listener;
  • Create a ServerSocket and wait for connections from nodes whose sid is greater than its own, storing each connection in senderWorkerMap<sid,SendWorker>;
  • An incoming connection is kept only when sid > self.sid.
【Java】
// After listener.start() above executes, this method runs
public void run() {
    int numRetries = 0;
    InetSocketAddress addr;
    Socket client = null;
    while((!shutdown) && (numRetries < 3)){
        try {
            ss = new ServerSocket();
            ss.setReuseAddress(true);
            if (self.getQuorumListenOnAllIPs()) {
                int port = self.getElectionAddress().getPort();
                addr = new InetSocketAddress(port);
            } else {
                // Resolve hostname for this server in case the
                // underlying ip address has changed.
                self.recreateSocketAddresses(self.getId());
                addr = self.getElectionAddress();
            }
            LOG.info("My election bind port: " + addr.toString());
            setName(addr.toString());
            ss.bind(addr);
            while (!shutdown) {
                client = ss.accept();
                setSockOpts(client);
                LOG.info("Received connection request "
                        + client.getRemoteSocketAddress());
                // Receive and handle the connection request
                // asynchronously if the quorum sasl authentication is
                // enabled. This is required because sasl server
                // authentication process may take few seconds to finish,
                // this may delay next peer connection requests.
                if (quorumSaslAuthEnabled) {
                    receiveConnectionAsync(client);
                } else {
                    // Handle the incoming connection
                    receiveConnection(client);
                }
                numRetries = 0;
            }
        } catch (IOException e) {
            if (shutdown) {
                break;
            }
            LOG.error("Exception while listening", e);
            numRetries++;
            try {
                ss.close();
                Thread.sleep(1000);
            } catch (IOException ie) {
                LOG.error("Error closing server socket", ie);
            } catch (InterruptedException ie) {
                LOG.error("Interrupted while sleeping. " +
                    "Ignoring exception", ie);
            }
            closeSocket(client);
        }
    }
    LOG.info("Leaving listener");
    if (!shutdown) {
        LOG.error("As I'm leaving the listener thread, "
                + "I won't be able to participate in leader "
                + "election any longer: "
                + self.getElectionAddress());
    } else if (ss != null) {
        // Clean up for shutdown.
        try {
            ss.close();
        } catch (IOException ie) {
            // Don't log an error for shutdown.
            LOG.debug("Error closing server socket", ie);
        }
    }
}


// Code execution path: receiveConnection()->handleConnection(...)
private void handleConnection(Socket sock, DataInputStream din)
            throws IOException {
//...omitted
     if (sid < self.getId()) {
            /*
             * This replica might still believe that the connection to sid is
             * up, so we have to shut down the workers before trying to open a
             * new connection.
             */
            SendWorker sw = senderWorkerMap.get(sid);
            if (sw != null) {
                sw.finish();
            }


            /*
             * Now we start a new connection
             */
            LOG.debug("Create new connection to server: {}", sid);
            closeSocket(sock);


            if (electionAddr != null) {
                connectOne(sid, electionAddr);
            } else {
                connectOne(sid);
            }


        } else { // Otherwise start worker threads to receive data.
            SendWorker sw = new SendWorker(sock, sid);
            RecvWorker rw = new RecvWorker(sock, din, sid, sw);
            sw.setRecv(rw);


            SendWorker vsw = senderWorkerMap.get(sid);


            if (vsw != null) {
                vsw.finish();
            }
            // Store the connection information <sid,SendWorker>
            senderWorkerMap.put(sid, sw);


            queueSendMap.putIfAbsent(sid,
                    new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));


            sw.start();
            rw.start();
     }
}

  5. Create the FastLeaderElection fast election service;
  • Initialize the ballot sending queue sendqueue (first-layer queue)
  • Initialize the ballot receiving queue recvqueue (first-layer queue)
  • Create the WorkerSender thread
  • Create the WorkerReceiver thread
【Java】
//FastLeaderElection.starter
private void starter(QuorumPeer self, QuorumCnxManager manager) {
    this.self = self;
    proposedLeader = -1;
    proposedZxid = -1;
    // Sending queue sendqueue (first-layer queue)
    sendqueue = new LinkedBlockingQueue<ToSend>();
    // Receiving queue recvqueue (first-layer queue)
    recvqueue = new LinkedBlockingQueue<Notification>();
    this.messenger = new Messenger(manager);
}
//new Messenger(manager)
Messenger(QuorumCnxManager manager) {
    // Create the WorkerSender thread
    this.ws = new WorkerSender(manager);


    this.wsThread = new Thread(this.ws,
            "WorkerSender[myid=" + self.getId() + "]");
    this.wsThread.setDaemon(true);
    // Create the WorkerReceiver thread
    this.wr = new WorkerReceiver(manager);


    this.wrThread = new Thread(this.wr,
            "WorkerReceiver[myid=" + self.getId() + "]");
    this.wrThread.setDaemon(true);
}

  6. Start the WorkerSender and WorkerReceiver threads.

The WorkerSender thread loops, polling the first-layer sendqueue for elements:

  • Each sendqueue element carries the ballot information; see the ToSend class for details;
  • If the target sid of the ballot equals the node's own sid, the ballot is put directly into the recvQueue queue;
  • Otherwise the sendqueue element is moved into the queueSendMap<sid,queue> second-layer transport queue.
【Java】
//FastLeaderElection.Messenger.WorkerSender
class WorkerSender extends ZooKeeperThread{
//...
  public void run() {
    while (!stop) {
        try {
            ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
            if(m == null) continue;
            // Send the vote information out
            process(m);
        } catch (InterruptedException e) {
            break;
        }
    }
    LOG.info("WorkerSender is down");
  }
}
//QuorumCnxManager#toSend
public void toSend(Long sid, ByteBuffer b) {
    /*
     * If sending message to myself, then simply enqueue it (loopback).
     */
    if (this.mySid == sid) {
         b.position(0);
         addToRecvQueue(new Message(b.duplicate(), sid));
        /*
         * Otherwise send to the corresponding thread to send.
         */
    } else {
         /*
          * Start a new connection if doesn't have one already.
          */
         ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(
            SEND_CAPACITY);
         ArrayBlockingQueue<ByteBuffer> oldq = queueSendMap.putIfAbsent(sid, bq);
         // Move into the queueSendMap<sid,queue> second-layer transport queue
         if (oldq != null) {
             addToSendQueue(oldq, b);
         } else {
             addToSendQueue(bq, b);
         }
         connectOne(sid);     
    }
}

The WorkerReceiver thread loops, polling the second-layer recvQueue transport queue for elements and moving them into the first-layer recvqueue queue.

【Java】
//WorkerReceiver
public void run() {
    Message response;
    while (!stop) {
      // Sleeps on receive
      try {
          // Poll the second-layer recvQueue transport queue in a loop
          response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
          if(response == null) continue;
          // The current protocol and two previous generations all send at least 28 bytes
          if (response.buffer.capacity() < 28) {
              LOG.error("Got a short response: " + response.buffer.capacity());
              continue;
          }
          //...
  if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
         // Move the second-layer transport queue element into the first-layer recvqueue queue
         recvqueue.offer(n);
         //...
      }
    }
//...
}

06 Election core logic

  1. Start thread QuorumPeer

Leader election voting starts with makeLEStrategy().lookForLeader();

sendNotifications() sends the ballot to the other nodes by storing it in the sendqueue queue, which is consumed by the WorkerSender thread.

【Java】
//QuorumPeer.run
//...
try {
   reconfigFlagClear();
    if (shuttingDownLE) {
       shuttingDownLE = false;
       startLeaderElection();
       }
    // makeLEStrategy().lookForLeader() sends the votes
    setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
    LOG.warn("Unexpected exception", e);
    setPeerState(ServerState.LOOKING);
}  
//...
//FastLeaderElection.lookForLeader
public Vote lookForLeader() throws InterruptedException {
//...
  // Send votes to the other applications
sendNotifications();
//...
}


private void sendNotifications() {
    // Get the application nodes
    for (long sid : self.getCurrentAndNextConfigVoters()) {
        QuorumVerifier qv = self.getQuorumVerifier();
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch, qv.toString().getBytes());
        if(LOG.isDebugEnabled()){
            LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x"  +
                  Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get())  +
                  " (n.round), " + sid + " (recipient), " + self.getId() +
                  " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
        }
        // Store the vote information
        sendqueue.offer(notmsg);
    }
}


class WorkerSender extends ZooKeeperThread {
    //...
    public void run() {
    while (!stop) {
        try {
            // Take the stored vote information
            ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
            if(m == null) continue;


            process(m);
        } catch (InterruptedException e) {
            break;
        }
    }
    LOG.info("WorkerSender is down");
  }
//...
}

Poll the recvqueue queue in a loop to get the ballot information:

【Java】
public Vote lookForLeader() throws InterruptedException {
//...
/*
 * Loop in which we exchange notifications until we find a leader
 */
while ((self.getPeerState() == ServerState.LOOKING) &&
        (!stop)){
    /*
     * Remove next notification from queue, times out after 2 times
     * the termination time
     */
    // Take the ballot information that was delivered
    Notification n = recvqueue.poll(notTimeout,
            TimeUnit.MILLISECONDS);
/*
 * Sends more notifications if haven't received enough.
 * Otherwise processes new notification.
 */
if(n == null){
    if(manager.haveDelivered()){
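        // All connections succeeded and the previous round of votes was delivered, so send the votes again
        sendNotifications();
    } else {
        // If no ballot was received, manager.connectAll() connects to the other socket nodes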
        manager.connectAll();
    }
    /*
     * Exponential backoff
     */
    int tmpTimeOut = notTimeout*2;
    notTimeout = (tmpTimeOut < maxNotificationInterval?
            tmpTimeOut : maxNotificationInterval);
    LOG.info("Notification time out: " + notTimeout);
         }
     //....
    }
  //...
}
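
The timeout handling in the loop above is a capped exponential backoff. The sketch below isolates that arithmetic. The class NotificationBackoff is a hypothetical name, and the two constants (a 200 ms starting value and a 60000 ms cap) are assumptions that I believe match FastLeaderElection's defaults; they are not shown in the code above.

```java
// Hypothetical sketch of the timeout doubling in lookForLeader().
// The constants are assumed values, not taken from the code shown in this article.
final class NotificationBackoff {
    static final int FINALIZE_WAIT_MS = 200;
    static final int MAX_NOTIFICATION_INTERVAL_MS = 60000;

    private NotificationBackoff() {}

    /** Doubles the timeout, capped at the maximum notification interval. */
    static int next(int currentMs) {
        int doubled = currentMs * 2;
        return doubled < MAX_NOTIFICATION_INTERVAL_MS ? doubled : MAX_NOTIFICATION_INTERVAL_MS;
    }

    /** Timeouts used after each of emptyPolls consecutive empty polls. */
    static int[] schedule(int emptyPolls) {
        int[] out = new int[emptyPolls];
        int timeout = FINALIZE_WAIT_MS;
        for (int i = 0; i < emptyPolls; i++) {
            timeout = next(timeout);
            out[i] = timeout;
        }
        return out;
    }
}
```

Under these assumptions the wait grows as 400, 800, 1600 and so on until it reaches the cap, so an isolated node retries quickly at first and then settles at one attempt per minute.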

【Java】
//manager.connectAll()->connectOne(sid)->initiateConnection(...)->startConnection(...)


private boolean startConnection(Socket sock, Long sid)
        throws IOException {
    DataOutputStream dout = null;
    DataInputStream din = null;
    try {
        // Use BufferedOutputStream to reduce the number of IP packets. This is
        // important for x-DC scenarios.
        BufferedOutputStream buf = new BufferedOutputStream(sock.getOutputStream());
        dout = new DataOutputStream(buf);


        // Sending id and challenge
        // represents protocol version (in other words - message type)
        dout.writeLong(PROTOCOL_VERSION);
        dout.writeLong(self.getId());
        String addr = self.getElectionAddress().getHostString() + ":" + self.getElectionAddress().getPort();
        byte[] addr_bytes = addr.getBytes();
        dout.writeInt(addr_bytes.length);
        dout.write(addr_bytes);
        dout.flush();


        din = new DataInputStream(
                new BufferedInputStream(sock.getInputStream()));
    } catch (IOException e) {
        LOG.warn("Ignoring exception reading or writing challenge: ", e);
        closeSocket(sock);
        return false;
    }


    // authenticate learner
    QuorumPeer.QuorumServer qps = self.getVotingView().get(sid);
    if (qps != null) {
        // TODO - investigate why reconfig makes qps null.
        authLearner.authenticate(sock, qps.hostname);
    }


    // If lost the challenge, then drop the new connection
    // Ensure there is only one channel between each pair of nodes in the cluster
    if (sid > self.getId()) {
        LOG.info("Have smaller server identifier, so dropping the " +
                "connection: (" + sid + ", " + self.getId() + ")");
        closeSocket(sock);
        // Otherwise proceed with the connection
    } else {
        SendWorker sw = new SendWorker(sock, sid);
        RecvWorker rw = new RecvWorker(sock, din, sid, sw);
        sw.setRecv(rw);


        SendWorker vsw = senderWorkerMap.get(sid);


        if(vsw != null)
            vsw.finish();


        senderWorkerMap.put(sid, sw);
        queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(
                SEND_CAPACITY));


        sw.start();
        rw.start();


        return true;


    }
    return false;
}

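As shown in the above code, the initiating node keeps the Socket and creates the SendWorker and RecvWorker threads, storing them in senderWorkerMap<sid,SendWorker>, only when the target sid is smaller than its own; when sid > self.sid it closes the new connection. This mirrors the sid < self.sid branch of handleConnection(...) in the election listening thread, where the listener closes an incoming connection from a smaller sid and connects back itself. Together the two rules ensure that there is only one channel between each pair of nodes in the cluster, always opened from the larger sid to the smaller sid.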

Figure 4 Connections between nodes
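
The tie-break between the two sides can be checked with a small sketch. The class ChannelRule and its methods are hypothetical; they only restate the sid comparisons from startConnection(...) and handleConnection(...) shown above and ignore the reconnect that the listener triggers after rejecting a connection.

```java
// Hypothetical sketch of the connection tie-break; not ZooKeeper API.
final class ChannelRule {
    private ChannelRule() {}

    /** Initiator side (startConnection): drop the socket when the remote sid is larger. */
    static boolean initiatorKeeps(long selfSid, long remoteSid) {
        return !(remoteSid > selfSid);
    }

    /** Listener side (handleConnection): reject the socket when the remote sid is smaller. */
    static boolean listenerAccepts(long selfSid, long remoteSid) {
        return !(remoteSid < selfSid);
    }

    /** Counts the channels that survive if every node tries to connect to every other node. */
    static int survivingChannels(long[] sids) {
        int count = 0;
        for (long from : sids) {
            for (long to : sids) {
                if (from != to && initiatorKeeps(from, to) && listenerAccepts(to, from)) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

For sids 1, 2 and 3 only the attempts 2 to 1, 3 to 1 and 3 to 2 survive, which is one channel per pair of nodes.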

【Java】


public Vote lookForLeader() throws InterruptedException {
//...
    if (n.electionEpoch > logicalclock.get()) {
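        // The current election cycle is smaller than the ballot's cycle: reset the recvset ballot pool,
        // update the current ballot to the newer cycle, and send the vote again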
        logicalclock.set(n.electionEpoch);
        recvset.clear();
        if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
            updateProposal(n.leader, n.zxid, n.peerEpoch);
        } else {
            updateProposal(getInitId(),
                    getInitLastLoggedZxid(),
                    getPeerEpoch());
        }
        sendNotifications();
    } else if (n.electionEpoch < logicalclock.get()) {
        if(LOG.isDebugEnabled()){
            LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                    + Long.toHexString(n.electionEpoch)
                    + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
        }
        break;
    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
            proposedLeader, proposedZxid, proposedEpoch)) { // same election cycle
        // The received ballot beat the current ballot in the comparison (PK), so replace the current ballot
        updateProposal(n.leader, n.zxid, n.peerEpoch);
        sendNotifications();
    }
//...


}

In the above code, the loop takes ballot information from the recvqueue queue and starts the election:

  • Check whether the current election cycle equals the election cycle of the received ballot;
  • If the received cycle is greater than the current cycle, update the current ballot and send the vote again;
  • If the received cycle is smaller than the current cycle, the ballot is ignored;
  • If the cycles are equal, the current ballot and the received ballot are compared head to head (PK).
【Java】
// Compare (PK) the received ballot with the current ballot
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }


        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */
        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
  }

The logic of the totalOrderPredicate method in the above code is as follows:

  • It returns true if the received election cycle is greater than the current cycle;
  • It returns true if the cycles are equal and the received zxid is greater than the current zxid;
  • It returns true if the cycles and zxids are equal and the received sid is greater than the current sid;
  • When any of these conditions is true, the current ballot is replaced with the winning ballot and the new ballot is sent out again.
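
The three rules can be tried out with a standalone copy of the comparison. This is a sketch for experimentation only: the class VoteOrder and the method name beats are hypothetical, and the weight check and logging of the original method are left out.

```java
// Hypothetical standalone copy of the comparison, without the weight check or logging.
final class VoteOrder {
    private VoteOrder() {}

    /** True when the new ballot (id, zxid, epoch) should replace the current one. */
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
                || (newEpoch == curEpoch
                    && (newZxid > curZxid || (newZxid == curZxid && newId > curId)));
    }

    public static void main(String[] args) {
        // Same epoch and zxid: the larger sid wins, as in the first round of voting.
        System.out.println(beats(2, 0, 1, 1, 0, 1));
        // A larger zxid wins even against a larger sid.
        System.out.println(beats(1, 5, 1, 2, 0, 1));
    }
}
```

Note that an identical ballot does not beat the current one, so a node does not resend its vote when it receives a ballot equal to its own.
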
【Java】
public Vote lookForLeader() throws InterruptedException {
//...
    // Store the ballot of each node
    // key: sid of the ballot's sender  value: the ballot, i.e. the Leader sid it proposes
    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));


    // Majority check starts
    if (termPredicate(recvset,
            new Vote(proposedLeader, proposedZxid,
                    logicalclock.get(), proposedEpoch))) {
        // Verify if there is any change in the proposed leader
        while((n = recvqueue.poll(finalizeWait,
                TimeUnit.MILLISECONDS)) != null){
            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                    proposedLeader, proposedZxid, proposedEpoch)){
                recvqueue.put(n);
                break;
            }
        }
        /*
         * This predicate is true once we don't read any new
         * relevant message from the reception queue
         */
        if (n == null) {
            // A leader has been elected; update whether the current node is the leader
            self.setPeerState((proposedLeader == self.getId()) ?
                    ServerState.LEADING: learningState());


            Vote endVote = new Vote(proposedLeader,
                    proposedZxid, proposedEpoch);
            leaveInstance(endVote);
            return endVote;
        }
    }
//...
}
/**
     * Termination predicate. Given a set of votes, determines if have
     * sufficient to declare the end of the election round.
     *
     * @param votes
     *            Set of votes
     * @param vote
     *            Identifier of the vote received last (the ballot that won the comparison)
     */
private boolean termPredicate(HashMap<Long, Vote> votes, Vote vote) {
    SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
    voteSet.addQuorumVerifier(self.getQuorumVerifier());
    if (self.getLastSeenQuorumVerifier() != null
            && self.getLastSeenQuorumVerifier().getVersion() > self
                    .getQuorumVerifier().getVersion()) {
        voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
    }
    /*
     * First make the views consistent. Sometimes peers will have different
     * zxids for a server depending on timing.
     */
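    // votes comes from recvset and stores the ballot proposed by each node
    for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
        // If another node's ballot matches the elected ballot, record it in voteSet
        if (vote.equals(entry.getValue())) {
            // Record the sid of the node that cast the matching ballot
            voteSet.addAck(entry.getKey());
        }
    }
    // Check whether the matching ballots are more than half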
    return voteSet.hasAllQuorums();
}
//QuorumMaj#containsQuorum
public boolean containsQuorum(Set<Long> ackSet) {
    return (ackSet.size() > half);
   }

In the above code, recvset stores the vote cast by each sid.

First round: sid1: vote(1,0,1), sid2: vote(2,0,1);

Second round: sid1: vote(2,0,1), sid2: vote(2,0,1).

Finally, vote(2,0,1) is the proposed leader vote. Counting the votes in the recvset ballot pool that match it gives 2. Three nodes are eligible to vote, and both sid1 and sid2 elect sid2 as the leader, which is more than half of the votes, so sid2 is confirmed as the leader.

  • setPeerState updates the role of the current node;
  • If the sid in proposedLeader equals the node's own sid, the node is set as Leader;
  • Otherwise the node is set to Follower or Observer;
  • currentVote is updated to the Leader's vote, vote(2,0,1).
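
The tally and the role update can be sketched together. This is a simplified illustration under stated assumptions: the ballot pool maps a voter sid directly to the proposed Leader sid (ignoring zxid and epochs), observers are left out, and the class BallotPool with its Role enum is a hypothetical name rather than ZooKeeper code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the recvset tally and the role update; not ZooKeeper code.
final class BallotPool {
    enum Role { LOOKING, LEADING, FOLLOWING }

    // key: sid of the voter, value: sid of the Leader it proposes
    private final Map<Long, Long> recvset = new HashMap<>();
    private final int half;

    BallotPool(int votingMembers) {
        this.half = votingMembers / 2;
    }

    /** Records or replaces the vote of one voter. */
    void record(long voterSid, long proposedLeader) {
        recvset.put(voterSid, proposedLeader);
    }

    /** True when more than half of the voting members propose the given Leader. */
    boolean hasQuorumFor(long proposedLeader) {
        int acks = 0;
        for (long leader : recvset.values()) {
            if (leader == proposedLeader) {
                acks++;
            }
        }
        return acks > half;
    }

    /** Role of selfSid once the proposed Leader has a quorum; LOOKING otherwise. */
    Role roleOf(long selfSid, long proposedLeader) {
        if (!hasQuorumFor(proposedLeader)) {
            return Role.LOOKING;
        }
        return proposedLeader == selfSid ? Role.LEADING : Role.FOLLOWING;
    }
}
```

Replaying the example, the pool holds 1 -> 1 and 2 -> 2 after the first round, so no Leader has a quorum; once sid1 changes its vote to 2, sid2 has two of three votes and becomes the Leader while sid1 becomes a Follower.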

07 Summary

Through the analysis of the source code of Leader election, we can know:

  1. The application nodes use BIO network communication to vote for each other, and only one channel is kept between each pair of nodes to reduce the consumption of network resources. This shows that BIO remains technically important in the development of distributed middleware.

  2. On top of BIO, multi-threading and in-memory message queues are combined to build the multi-layer queue architecture. Each layer of queues is served by different threads, which improves the performance of fast election.

  3. It offers valuable practical experience in combining BIO with multi-threading.


Origin my.oschina.net/u/4090830/blog/8591775