ZooKeeper background needed to understand leader election, and the election principle explained from the source code

ZooKeeper cluster roles

  • leader: the cluster leader; handles transaction requests and queries
  • follower: a cluster follower; synchronizes data from the leader, forwards transaction requests to the leader, handles queries, and takes part in elections
  • observer: like a follower, except that it does not take part in voting

myid

The id attribute of a cluster node.

epoch

The epoch is incremented with every round of voting; it records the "dynasty" that a vote belongs to.

zxid (transaction id)

The leader node increments the zxid after every transaction. Follower nodes synchronize the leader's zxid and data so that their data stays consistent with the leader's.

Node status

  • looking: election in progress
  • following: follower node
  • observing: observer node; does not take part in elections
  • leading: leader node

Order in which ballots are compared during leader election

Every round of the election sends myid / zxid / epoch. Ballots are compared in this order:

  • the larger epoch wins
  • if the epochs are equal, the larger zxid wins
  • if the zxids are also equal, the larger myid wins
  • a node becomes leader once it holds votes from more than half of the nodes in the cluster
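The comparison order above can be sketched as a single method. This is a simplified illustration rather than the ZooKeeper source: the class name VoteOrder and the method name beats are hypothetical, and the real totalOrderPredicate in FastLeaderElection may perform additional checks that are left out here.

```java
/** Hypothetical helper: compares two ballots by epoch, then zxid, then myid. */
final class VoteOrder {
    private VoteOrder() {}

    /** Returns true when the new ballot should replace the current one. */
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // the larger epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // same epoch: the larger zxid wins
        }
        return newId > curId;             // same epoch and zxid: the larger myid wins
    }

    public static void main(String[] args) {
        // A newer epoch wins even against a much larger zxid.
        System.out.println(beats(1, 100, 2, 3, 900, 1));
        // With equal epoch and zxid, the larger myid wins.
        System.out.println(beats(3, 100, 1, 2, 100, 1));
    }
}
```

Two identical ballots do not beat each other, so a node keeps its current ballot when it receives an equal one.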

The leader election process

Suppose a cluster has three nodes. Each node starts in the looking state with an epoch of 1 and votes for itself as leader, so the ballots (myid / zxid / epoch) are 1 / zxid / 1, 2 / zxid / 1, and 3 / zxid / 1. Every node receives the ballots of the other nodes.

When a node receives a ballot, it compares the epoch first. If the epochs are equal it compares the zxid, and if the zxids are also equal (meaning every node holds the latest data) it compares the myid. The node changes its own ballot to the largest one, stores the ballots sent by the other nodes, and sends its own ballot out again. The epoch is incremented for every round of voting, and each node maintains its own epoch. If the epochs are not equal, the node updates its stored ballot against the received one, saves all the ballots it receives, and sends out its new ballot.

The node then tallies the ballots it has received. If more than half of them name the same leader as the ballot it has just updated, that node is elected leader; otherwise voting continues. A node that becomes a follower sets its state to following, connects to the leader, and then synchronizes data.

Each node has a queue that holds the ballots sent by the other nodes. Until a leader is elected, the node keeps looping over the queued ballots, updating its own ballot whenever a received ballot is larger than the one it currently stores, and continues this cycle until a leader is elected.
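The three-node walk-through can be sketched as a small simulation. It is a deliberately simplified model under stated assumptions: everything runs in one thread, every node sees every ballot in each round, and the per-round epoch increment and network behaviour are ignored. The names ElectionSimulation, Ballot, and elect are hypothetical and do not come from ZooKeeper.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-threaded simulation of the election described above. */
final class ElectionSimulation {

    /** A ballot: the proposed leader's myid plus its zxid and epoch. */
    static final class Ballot {
        final long leader;
        final long zxid;
        final long epoch;

        Ballot(long leader, long zxid, long epoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.epoch = epoch;
        }

        /** Compares epoch first, then zxid, then myid. */
        boolean beats(Ballot other) {
            if (epoch != other.epoch) return epoch > other.epoch;
            if (zxid != other.zxid) return zxid > other.zxid;
            return leader > other.leader;
        }
    }

    /** Repeats voting rounds until one leader is named by more than half of the nodes. */
    static long elect(Map<Long, Ballot> ballots) {
        if (ballots.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        Map<Long, Ballot> current = new HashMap<>(ballots);
        while (true) {
            // Tally how many nodes currently propose each leader.
            Map<Long, Integer> tally = new HashMap<>();
            for (Ballot b : current.values()) {
                tally.merge(b.leader, 1, Integer::sum);
            }
            for (Map.Entry<Long, Integer> e : tally.entrySet()) {
                if (e.getValue() > current.size() / 2) {
                    return e.getKey();
                }
            }
            // No majority yet: every node sees every ballot and adopts the largest.
            Map<Long, Ballot> next = new HashMap<>();
            for (Long node : current.keySet()) {
                Ballot best = current.get(node);
                for (Ballot b : current.values()) {
                    if (b.beats(best)) best = b;
                }
                next.put(node, best);
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        Map<Long, Ballot> ballots = new HashMap<>();
        ballots.put(1L, new Ballot(1, 100, 1));
        ballots.put(2L, new Ballot(2, 100, 1));
        ballots.put(3L, new Ballot(3, 100, 1));
        System.out.println("elected leader: " + elect(ballots));
    }
}
```

With equal epoch and zxid on all three nodes, the node with the largest myid is elected; giving one node a larger zxid makes that node win instead.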

Source review

This walk-through is based on version 3.5.5. Start at the entry point of the ZooKeeper server, the zkServer.sh script, whose key startup commands are:

# 1
    ZOOMAIN="org.apache.zookeeper.server.quorum.QuorumPeerMain"
# 2
    nohup "$JAVA" $ZOO_DATADIR_AUTOCREATE "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" \
    "-Dzookeeper.log.file=${ZOO_LOG_FILE}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
    -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -9 %p' \
    -cp "$CLASSPATH" $JVMFLAGS $ZOOMAIN "$ZOOCFG" > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null &

From this we can see that org.apache.zookeeper.server.quorum.QuorumPeerMain is the startup entry point:

@InterfaceAudience.Public
public class QuorumPeerMain {
    private static final Logger LOG = LoggerFactory.getLogger(QuorumPeerMain.class);
    private static final String USAGE = "Usage: QuorumPeerMain configfile";
    protected QuorumPeer quorumPeer;
    public static void main(String[] args) {
        QuorumPeerMain main = new QuorumPeerMain();
        try {
            main.initializeAndRun(args);
        } catch (IllegalArgumentException e) {
           //
      }
      // ... code omitted
    }

    protected void initializeAndRun(String[] args)
        throws ConfigException, IOException, AdminServerException
    {
        QuorumPeerConfig config = new QuorumPeerConfig();
        if (args.length == 1) {
            // Parse the configuration file
            config.parse(args[0]);
        }

        // ....
        if (args.length == 1 && config.isDistributed()) {
            runFromConfig(config);
        } else {
            LOG.warn("Either no config or no quorum defined in config, running "
                    + " in standalone mode");
            // there is only server in the quorum -- run as standalone
            ZooKeeperServerMain.main(args);
        }
    }

    public void runFromConfig(QuorumPeerConfig config)
            throws IOException, AdminServerException
    {
     // ... 
    }
}

The startup method QuorumPeerMain#runFromConfig:

      public void runFromConfig(QuorumPeerConfig config)
            throws IOException, AdminServerException
    {
      try {
          // Connection factories used for communication
          ServerCnxnFactory cnxnFactory = null;
          ServerCnxnFactory secureCnxnFactory = null;
          if (config.getClientPortAddress() != null) {
              // Defaults to NIOServerCnxnFactory when none is specified
              cnxnFactory = ServerCnxnFactory.createFactory();
              cnxnFactory.configure(config.getClientPortAddress(),
                      config.getMaxClientCnxns(),
                      false);
          }

          if (config.getSecureClientPortAddress() != null) {
              secureCnxnFactory = ServerCnxnFactory.createFactory();
              secureCnxnFactory.configure(config.getSecureClientPortAddress(),
                      config.getMaxClientCnxns(),
                      true);
          }
          // Configure the node's parameters
          quorumPeer = getQuorumPeer();
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                      config.getDataLogDir(),
                      config.getDataDir()));
          quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
          quorumPeer.enableLocalSessionsUpgrading(
              config.isLocalSessionsUpgradingEnabled());
          //quorumPeer.setQuorumPeers(config.getAllMembers());
          // Leader election type; defaults to 3, the other types are marked deprecated in the source
          quorumPeer.setElectionType(config.getElectionAlg());
          quorumPeer.setMyid(config.getServerId());
          quorumPeer.setTickTime(config.getTickTime());
          quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
          quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
          quorumPeer.setInitLimit(config.getInitLimit());
          quorumPeer.setSyncLimit(config.getSyncLimit());
          quorumPeer.setConfigFileName(config.getConfigFilename());
          // Data storage
          quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
          // Default implementation is QuorumMaj; holds the cluster membership (all members, voting members, observers)
          quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
          if (config.getLastSeenQuorumVerifier()!=null) {
              quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
          }
          quorumPeer.initConfigInZKDatabase();
          quorumPeer.setCnxnFactory(cnxnFactory);
          quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
          quorumPeer.setSslQuorum(config.isSslQuorum());
          quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
          // Node type
          quorumPeer.setLearnerType(config.getPeerType());
          quorumPeer.setSyncEnabled(config.getSyncEnabled());
          quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
          if (config.sslQuorumReloadCertFiles) {
              quorumPeer.getX509Util().enableCertFileReloading();
          }

          // sets quorum sasl authentication configurations
          quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
          if(quorumPeer.isQuorumSaslAuthEnabled()){
              quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
              quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
              quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
              quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
              quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
          }
          quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
          quorumPeer.initialize();
          // Start QuorumPeer; its run method evaluates the election result
          quorumPeer.start();
          quorumPeer.join();
      } catch (InterruptedException e) {
          // warn, but generally this is ok
          LOG.warn("Quorum Peer interrupted", e);
      }
    }

Next, look at the startup method quorumPeer.start(). Remember that the super.start() call ends up invoking QuorumPeer's run() method, which is discussed later.

@Override
    public synchronized void start() {
        if (!getView().containsKey(myid)) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
         }
        // Load data from disk and set the zxid and epoch
        loadDataBase();
        // Start the socket service
        startServerCnxnFactory();
        try {
            adminServer.start();
        } catch (AdminServerException e) {
            LOG.warn("Problem starting AdminServer", e);
            System.out.println(e);
        }
        // Begin leader election
        startLeaderElection();
        // QuorumPeer overrides run(), so this call ends up executing QuorumPeer#run
        super.start();
    }

    synchronized public void startLeaderElection() {
       try {
           if (getPeerState() == ServerState.LOOKING) {
               // Create this node's initial vote
               currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
           }
       } catch(IOException e) {
        //...
       }
        // Defaults to 3 (the other types are marked deprecated), which returns FastLeaderElection
        this.electionAlg = createElectionAlgorithm(electionType);
    }

Structure of the FastLeaderElection class:

public class FastLeaderElection{
   QuorumPeer self;
   // Communicates with the other nodes: sends and receives ballots
   QuorumCnxManager manager;
   // Produces and consumes ballots between threads inside the JVM
   Messenger messenger;
   // Queue of ballots to send
   LinkedBlockingQueue<ToSend> sendqueue;
   // Queue of ballots received
   LinkedBlockingQueue<Notification> recvqueue;
   public void start() {
        this.messenger.start();
    }
}
   protected class Messenger {
         // Sends vote messages
        WorkerSender ws;
        // Receives vote messages
        WorkerReceiver wr;
        // Thread wrapping ws
        Thread wsThread = null;
        // Thread wrapping wr
        Thread wrThread = null;
        void start(){
            this.wsThread.start();
            this.wrThread.start();
        }
}
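The pattern in this structure, a blocking queue drained by a dedicated worker thread, can be illustrated with a miniature stand-alone version. This is only a sketch: MiniMessenger and its methods are hypothetical names, ballots are plain strings, and adding to the delivered list stands in for handing the message to QuorumCnxManager.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical miniature of the Messenger pattern: one queue drained by one worker thread. */
final class MiniMessenger {
    private final LinkedBlockingQueue<String> sendQueue = new LinkedBlockingQueue<>();
    private final List<String> delivered = new CopyOnWriteArrayList<>();
    private volatile boolean stop = false;
    private final Thread sender = new Thread(this::drain, "MiniSender");

    void start() {
        sender.start();
    }

    /** Queues a ballot; the worker thread picks it up asynchronously. */
    void offer(String ballot) {
        sendQueue.offer(ballot);
    }

    /** Asks the worker to stop once the queue is empty and waits for it to finish. */
    void shutdown() {
        stop = true;
        try {
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> delivered() {
        return delivered;
    }

    private void drain() {
        // Keep polling until a stop was requested and nothing is left in the queue.
        while (!stop || !sendQueue.isEmpty()) {
            try {
                String m = sendQueue.poll(50, TimeUnit.MILLISECONDS);
                if (m == null) continue;   // timed out, check the stop flag again
                delivered.add(m);          // stand-in for the real network send
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }
}
```

Polling with a timeout instead of blocking forever lets the worker notice the stop flag, which is the same reason the workers in the source poll with a timeout.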

A look at WorkerReceiver and WorkerSender:

        class WorkerReceiver extends ZooKeeperThread  {
            volatile boolean stop;
            QuorumCnxManager manager;
            WorkerReceiver(QuorumCnxManager manager) {
                super("WorkerReceiver");
                this.stop = false;
                this.manager = manager;
            }

            public void run() {
                Message response;
                while (!stop) {
                    // Sleeps on receive
                    try {
                        // Fetch a received vote message
                        response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
                        if(response == null) continue;
                        // ...
                        // Instantiate Notification and set its attributes
                        Notification n = new Notification();
                        int rstate = response.buffer.getInt();
                        long rleader = response.buffer.getLong();
                        long rzxid = response.buffer.getLong();
                        long relectionEpoch = response.buffer.getLong();
                        long rpeerepoch;
                        int version = 0x0;
                        QuorumVerifier rqv = null;

                        // ... code omitted
                        // The code below inspects the ballot
                        if(!validVoter(response.sid)) {
                            Vote current = self.getCurrentVote();
                            QuorumVerifier qv = self.getQuorumVerifier();
                            ToSend notmsg = new ToSend(ToSend.mType.notification,
                                    current.getId(),
                                    current.getZxid(),
                                    logicalclock.get(),
                                    self.getPeerState(),
                                    response.sid,
                                    current.getPeerEpoch(),
                                    qv.toString().getBytes());
                            // The sender is not in the cluster's voter list; reply with the current vote
                            sendqueue.offer(notmsg);
                        } else {
                            QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
                            switch (rstate) {
                            case 0:
                                ackstate = QuorumPeer.ServerState.LOOKING;
                                break;
                            case 1:
                                ackstate = QuorumPeer.ServerState.FOLLOWING;
                                break;
                            case 2:
                                ackstate = QuorumPeer.ServerState.LEADING;
                                break;
                            case 3:
                                ackstate = QuorumPeer.ServerState.OBSERVING;
                                break;
                            default:
                                continue;
                            }

                            n.leader = rleader;
                            n.zxid = rzxid;
                            n.electionEpoch = relectionEpoch;
                            n.state = ackstate;
                            n.sid = response.sid;
                            n.peerEpoch = rpeerepoch;
                            n.version = version;
                            n.qv = rqv;
                            // While this node is still electing, queue the received notification
                            if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
                                recvqueue.offer(n);
                                // The sender is also electing but its epoch is older than ours, so send it our vote
                                if((ackstate == QuorumPeer.ServerState.LOOKING)
                                        && (n.electionEpoch < logicalclock.get())){
                                    Vote v = getVote();
                                    QuorumVerifier qv = self.getQuorumVerifier();
                                    ToSend notmsg = new ToSend(ToSend.mType.notification,
                                            v.getId(),
                                            v.getZxid(),
                                            logicalclock.get(),
                                            self.getPeerState(),
                                            response.sid,
                                            v.getPeerEpoch(),
                                            qv.toString().getBytes());
                                    sendqueue.offer(notmsg);
                                }
                            } else {
                                // This node has finished electing but the sender has not; reply with the current vote
                                Vote current = self.getCurrentVote();
                                if(ackstate == QuorumPeer.ServerState.LOOKING){
                                    QuorumVerifier qv = self.getQuorumVerifier();
                                    ToSend notmsg = new ToSend(
                                            ToSend.mType.notification,
                                            current.getId(),
                                            current.getZxid(),
                                            current.getElectionEpoch(),
                                            self.getPeerState(),
                                            response.sid,
                                            current.getPeerEpoch(),
                                            qv.toString().getBytes());
                                    sendqueue.offer(notmsg);
                                }
                            }
                        }
                    } catch (InterruptedException e) {
                        LOG.warn("Interrupted Exception while waiting for new message" +
                                e.toString());
                    }
                }
                LOG.info("WorkerReceiver is down");
            }
        }

       // Sender thread
        class WorkerSender extends ZooKeeperThread {
            volatile boolean stop;
            QuorumCnxManager manager;
            public void run() {
                while (!stop) {
                    try {
                        ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                        if(m == null) continue;
                        // Take a message from the send queue and send it
                        process(m);
                    } catch (InterruptedException e) {
                        break;
                    }
                }
            }
            void process(ToSend m) {
                ByteBuffer requestBuffer = buildMsg(m.state.ordinal(),
                                                    m.leader,
                                                    m.zxid,
                                                    m.electionEpoch,
                                                    m.peerEpoch,
                                                    m.configData);

                manager.toSend(m.sid, requestBuffer);
          }
        }
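WorkerReceiver reads the fields from the buffer in a fixed order, so the sender has to write them in that same order. The round trip can be sketched with a plain ByteBuffer. This is a reduced illustration, not the actual buildMsg: NotificationCodec is a hypothetical name, and the version and configuration data that appear in the source are left out.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec for the core notification fields read by WorkerReceiver. */
final class NotificationCodec {
    // One int (state) followed by four longs (leader, zxid, electionEpoch, peerEpoch).
    static final int SIZE = Integer.BYTES + 4 * Long.BYTES;

    private NotificationCodec() {}

    /** Writes the fields in the order the receiver expects them. */
    static ByteBuffer encode(int state, long leader, long zxid,
                             long electionEpoch, long peerEpoch) {
        ByteBuffer buf = ByteBuffer.allocate(SIZE);
        buf.putInt(state);
        buf.putLong(leader);
        buf.putLong(zxid);
        buf.putLong(electionEpoch);
        buf.putLong(peerEpoch);
        buf.flip();   // switch the buffer from writing to reading
        return buf;
    }

    /** Reads the fields back in the same order they were written. */
    static long[] decode(ByteBuffer buf) {
        int state = buf.getInt();
        long leader = buf.getLong();
        long zxid = buf.getLong();
        long electionEpoch = buf.getLong();
        long peerEpoch = buf.getLong();
        return new long[] {state, leader, zxid, electionEpoch, peerEpoch};
    }

    public static void main(String[] args) {
        long[] fields = decode(encode(0, 3, 100, 1, 1));
        System.out.println("leader=" + fields[1] + ", zxid=" + fields[2]);
    }
}
```

Because the format has no field names, reading in a different order than writing would silently produce wrong values, which is why the state integer maps to a ServerState through the switch shown in the source.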

That covers the logic for sending and receiving votes. Data passes through several queues along the way, so pay attention to how the queues relate to each other. Next, look at the logic that evaluates the election result, QuorumPeer#run:

   @Override
    public void run() {
        updateThreadName();
        // ...

        try {
            /*
             * Main loop
             */
            while (running) {
                switch (getPeerState()) {
                case LOOKING:
                    // Leader election is started via setCurrentVote(makeLEStrategy().lookForLeader())
                    if (Boolean.getBoolean("readonlymode.enabled")) {
                        // ...
                        try {
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        } finally {
                          // ...
                        }
                    } else {
                        try {
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        }                        
                    }
                    break;
                case OBSERVING:
                   // ...
                    break;
                case FOLLOWING:
                  // ...
                    break;
                case LEADING:
                  // ...
                    break;
                }
                start_fle = Time.currentElapsedTime();
            }
        } finally {
          // ...
        }
    }

makeLEStrategy().lookForLeader() actually calls FastLeaderElection#lookForLeader():

    public Vote lookForLeader() throws InterruptedException {
         // ...
        try {
            // Votes received so far, used to decide the result
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
            int notTimeout = finalizeWait;
            synchronized(this){
                logicalclock.incrementAndGet();
                // Set the current proposal to this node's own ballot
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            // Send notifications
            sendNotifications();

           // Loop until a leader is elected
            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
              // Take a notification from the receive queue
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }
                } 
                else if (validVoter(n.sid) && validVoter(n.leader)) {
                    switch (n.state) {
                    case LOOKING:
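                        // The received ballot's election epoch is newer than ours
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            // Discard the votes collected in older rounds
                            recvset.clear();
                            // Compare this node's own ballot with the received one and adopt the larger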
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        // The sender's election epoch is older than ours; ignore this vote
                        } else if (n.electionEpoch < logicalclock.get()) {
                            break;
                        // Same election epoch: if the received ballot is larger, adopt it and send the new ballot
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }
                        // Record the vote
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        // Check whether the current proposal has a quorum
                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {
                            // Drain the receive queue; if a ballot larger than the current proposal arrives, put it back and keep electing
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }
                          // No larger ballot arrived, so the election is complete; update the node state
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid, logicalclock.get(), 
                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        // ...
                        break;
                    case FOLLOWING:
                    case LEADING:
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                            if(termPredicate(recvset, new Vote(n.version, n.leader,
                                            n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                            && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, n.electionEpoch, n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        // The sender has already finished its election and no longer votes, so its epoch does not change, while other nodes are still voting
                        outofelection.put(n.sid, new Vote(n.version, n.leader, 
                                n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        if (termPredicate(outofelection, new Vote(n.version, n.leader,
                                n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader, n.zxid, 
                                    n.electionEpoch, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecoginized: " + n.state
                              + " (n.state), " + n.sid + " (n.sid)");
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}",
                    manager.getConnectionThreadCount());
        }
    }
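The LOOKING branch of lookForLeader has three cases that depend on how the sender's election epoch compares with the local logical clock. The sketch below isolates just that decision. It is a model under assumptions, not the real method: LookingState and onNotification are hypothetical names, recvset stores only the proposed leader per sender, and sending notifications and the quorum check are left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of how a LOOKING node reacts to a notification from a LOOKING sender. */
final class LookingState {
    private final long myId, myZxid, myPeerEpoch;      // this node's own ballot
    long logicalClock;                                 // current election epoch
    long leader, zxid, peerEpoch;                      // current proposal
    final Map<Long, Long> recvset = new HashMap<>();   // sender sid -> leader it proposed

    LookingState(long myId, long myZxid, long myPeerEpoch, long logicalClock) {
        this.myId = myId;
        this.myZxid = myZxid;
        this.myPeerEpoch = myPeerEpoch;
        this.logicalClock = logicalClock;
        propose(myId, myZxid, myPeerEpoch);
    }

    private void propose(long l, long z, long e) {
        leader = l;
        zxid = z;
        peerEpoch = e;
    }

    private static boolean beats(long id, long z, long e, long curId, long curZ, long curE) {
        if (e != curE) return e > curE;
        if (z != curZ) return z > curZ;
        return id > curId;
    }

    /** Returns false when the notification is stale and ignored, true when it is recorded. */
    boolean onNotification(long sid, long nLeader, long nZxid, long nPeerEpoch, long nElectionEpoch) {
        if (nElectionEpoch > logicalClock) {
            // Newer round: adopt its epoch, drop old votes, compare against our own ballot.
            logicalClock = nElectionEpoch;
            recvset.clear();
            if (beats(nLeader, nZxid, nPeerEpoch, myId, myZxid, myPeerEpoch)) {
                propose(nLeader, nZxid, nPeerEpoch);
            } else {
                propose(myId, myZxid, myPeerEpoch);
            }
        } else if (nElectionEpoch < logicalClock) {
            return false;   // older round: ignore the vote
        } else if (beats(nLeader, nZxid, nPeerEpoch, leader, zxid, peerEpoch)) {
            propose(nLeader, nZxid, nPeerEpoch);   // same round and a larger ballot: adopt it
        }
        recvset.put(sid, nLeader);
        return true;
    }
}
```

Note that in the newer-round case the received ballot is compared with the node's own initial ballot rather than with its current proposal, mirroring the getInitId() / getInitLastLoggedZxid() / getPeerEpoch() arguments in the source.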

The termPredicate method:

   protected boolean termPredicate(Map<Long, Vote> votes, Vote vote) {
        SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
        voteSet.addQuorumVerifier(self.getQuorumVerifier());
        if (self.getLastSeenQuorumVerifier() != null
                && self.getLastSeenQuorumVerifier().getVersion() > self
                        .getQuorumVerifier().getVersion()) {
            voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
        }
        // voteSet maintains a HashSet; add every node whose vote equals the given vote
        for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())) {
                voteSet.addAck(entry.getKey());
            }
        }
        // Returns true when more than half of the nodes voted for this vote, which ends the election
        return voteSet.hasAllQuorums();
    }
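Stripped of the quorum verifier objects, the check amounts to counting matching votes and comparing the count with half the cluster size. The sketch below shows that core idea only. MajorityCheck and hasMajority are hypothetical names, a vote is reduced to the sid of the proposed leader, and a simple majority is assumed; the real method compares whole Vote objects and delegates the decision to the configured QuorumVerifier.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical simplification of termPredicate using a plain majority. */
final class MajorityCheck {
    private MajorityCheck() {}

    /**
     * @param votes       maps each voter's sid to the sid it voted for
     * @param candidate   the proposed leader being checked
     * @param clusterSize number of voting nodes in the cluster
     */
    static boolean hasMajority(Map<Long, Long> votes, long candidate, int clusterSize) {
        int acks = 0;
        for (long votedFor : votes.values()) {
            if (votedFor == candidate) {
                acks++;
            }
        }
        // Strictly more than half, so two equal halves can never both win.
        return acks > clusterSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(1L, 3L);
        votes.put(2L, 3L);
        System.out.println(hasMajority(votes, 3L, 3));
    }
}
```

In a three-node cluster two matching votes are enough, which is why an election can finish before the third node has reported.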

The main classes with their attributes and methods:

[Figure: ZK-Leader election.jpg]

Note that in FastLeaderElection, the Messenger contains WorkerSender and WorkerReceiver; these two threads are responsible for updating the current node's vote information. In QuorumCnxManager, SendWorker and RecvWorker are responsible for delivering votes to and from the other nodes, which is done over socket connections. The flow of votes through the cluster:

[Figure: ZK-Leader election data stream.jpg]

Origin juejin.im/post/5da1a606f265da5b7c4529cc