3. SOFAJRaft source code analysis - How does the election work?

Introduction

The previous article explained that the init method of NodeImpl performs the node's initialization, and that the election is started from within this method. This article walks through the election process in detail, starting from init.

Since this article covers the implementation, please read up on the principles first: SOFAJRaft election mechanism analysis | SOFAJRaft implementation principle

This is a long article; it took me two weeks to write.

Election Process Analysis

Only the election-related code is listed here; the rest is omitted.
NodeImpl#init

public boolean init(final NodeOptions opts) {
        ....
    // Init timers
    // set up the vote timer
    this.voteTimer = new RepeatedTimer("JRaft-VoteTimer", this.options.getElectionTimeoutMs()) {

        @Override
        protected void onTrigger() {
            // handle the vote timeout
            handleVoteTimeout();
        }

        @Override
        protected int adjustTimeout(final int timeoutMs) {
            // return a randomized timeout within a certain range
            return randomTimeout(timeoutMs);
        }
    };
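    // set up the election (pre-vote) timer
    // When the leader has not communicated with a follower within a given period,
    // the follower may assume that the leader can no longer do its job and may try to take over the leader role.
    // This communication timeout is called the Election Timeout.
    // Before starting a real vote, the candidate first starts a pre-vote.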
    this.electionTimer = new RepeatedTimer("JRaft-ElectionTimer", this.options.getElectionTimeoutMs()) {

        @Override
        protected void onTrigger() {
            handleElectionTimeout();
        }

        @Override
        protected int adjustTimeout(final int timeoutMs) {
            // return a randomized timeout within a certain range,
            // so that nodes do not all start an election at the same time and fail
            return randomTimeout(timeoutMs);
        }
    };
    // leader step-down timer:
    // periodically checks whether a new leader election is needed
    this.stepDownTimer = new RepeatedTimer("JRaft-StepDownTimer", this.options.getElectionTimeoutMs() >> 1) {

        @Override
        protected void onTrigger() {
            handleStepDownTimeout();
        }
    };
        ....
    if (!this.conf.isEmpty()) {
        // a newly started node has to go through an election
        stepDown(this.currTerm, false, new Status());
    }
        ....
}

The init method initializes three election-related timers:

  • voteTimer: periodically checks the node; if the current state is candidate (STATE_CANDIDATE), it starts an election.
  • electionTimer: if the leader has not communicated with a follower within a certain period, the follower assumes that the leader can no longer perform its duties and starts an election. Before the real election it first starts a pre-vote; if it does not get responses from more than half of the nodes, the candidate gives up. This timer is responsible for the pre-vote.
  • stepDownTimer: periodically checks whether a new leader needs to be elected; if the current leader has not received responses from more than half of the followers, it should step down so that a new election can take place.
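
The adjustTimeout overrides exist to spread election attempts out in time. Below is a minimal sketch of such a randomization, assuming the timeout is drawn uniformly from [timeoutMs, timeoutMs + maxDelayMs). The class name and the maxDelayMs parameter are illustrative assumptions and are not taken from the code shown in this article.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: spreads timeouts so that nodes rarely time out together.
final class RandomizedTimeout {
    private RandomizedTimeout() {
    }

    // Returns a value in [timeoutMs, timeoutMs + maxDelayMs); maxDelayMs must be > 0.
    static int randomTimeout(final int timeoutMs, final int maxDelayMs) {
        return ThreadLocalRandom.current().nextInt(timeoutMs, timeoutMs + maxDelayMs);
    }
}
```

With a 1000 ms base and a 1000 ms maximum delay, each node would wait somewhere between 1000 and 1999 ms, so two followers that lose the leader at the same moment are unlikely to start their pre-vote simultaneously.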

I have already analyzed RepeatedTimer in: 2. SOFAJRaft source code analysis - How does JRaft's timed task scheduler work?

Let's continue down the init method. this.conf holds the information about the nodes of the whole cluster, so in general it is not empty and stepDown is called. We start with that method.

Leader step-down

private void stepDown(final long term, final boolean wakeupCandidate, final Status status) {
    LOG.debug("Node {} stepDown, term={}, newTerm={}, wakeupCandidate={}.", getNodeId(), this.currTerm, term,
        wakeupCandidate);
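    // check that the current node is active (not in an error state and not shutting down)
    if (!this.state.isActive()) {
        return;
    }
    // if the node is a candidate, stop the election
    if (this.state == State.STATE_CANDIDATE) {
        // calls the stop method of voteTimer
        stopVoteTimer();
        // otherwise, if the current state is leader or TRANSFERRING
    } else if (this.state.compareTo(State.STATE_TRANSFERRING) <= 0) {
        // stop the running stepDownTimer
        stopStepDownTimer();
        // clear the pending tasks in the ballot box
        this.ballotBox.clearPendingTasks();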
        // signal fsm leader stop immediately
        if (this.state == State.STATE_LEADER) {
            // notify the state machine that the leader has stopped
            onLeaderStop(status);
        }
    }
    // reset leader_id
    // clear the leader recorded by the current node
    resetLeaderId(PeerId.emptyPeer(), status);

    // soft state in memory
    this.state = State.STATE_FOLLOWER;
    // reset the configuration change context
    this.confCtx.reset();
    updateLastLeaderTimestamp(Utils.monotonicMs());
    if (this.snapshotExecutor != null) {
        // interrupt any snapshot download that is in progress
        this.snapshotExecutor.interruptDownloadingSnapshots(term);
    }

    // keep the larger of the two terms
    // meta state
    if (term > this.currTerm) {
        this.currTerm = term;
        this.votedId = PeerId.emptyPeer();
        // persist the updated term and vote to the meta storage file
        this.metaStorage.setTermAndVotedFor(term, this.votedId);
    }

    if (wakeupCandidate) {
        this.wakingCandidate = this.replicatorGroup.stopAllAndFindTheNextCandidate(this.conf);
        if (this.wakingCandidate != null) {
            Replicator.sendTimeoutNowAndStop(this.wakingCandidate, this.options.getElectionTimeoutMs());
        }
    } else {
        // mark all replicators in replicatorGroup as stopped
        this.replicatorGroup.stopAll();
    }
    // used during leadership transfer
    if (this.stopTransferArg != null) {
        if (this.transferTimer != null) {
            this.transferTimer.cancel(true);
        }
        // There is at most one StopTransferTimer at the same term, it's safe to
        // mark stopTransferArg to NULL
        this.stopTransferArg = null;
    }
    // start the election timer
    this.electionTimer.start();
}

Stepping down involves quite a lot of hand-over work:

  1. If the current node is a candidate (STATE_CANDIDATE), stop the vote timer so that it stops voting for now
  2. If the current node is a leader (STATE_LEADER) or is transferring leadership (STATE_TRANSFERRING), stop the node's stepDownTimer
  3. If the current node is the leader (STATE_LEADER), notify the state machine that the leader has stepped down, so that the state machine can handle the step-down
  4. Reset the leader of the current node, set the node state to follower, and reset the confCtx context
  5. Interrupt any snapshot download in progress, set the new term if it is higher, and stop all replicators
  6. Start electionTimer
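
The "meta state" part of stepDown follows a simple rule: the term only moves forward, and the recorded vote is cleared whenever the term advances. A minimal sketch of that rule follows; the class TermState and its fields are hypothetical stand-ins, with a plain string in place of PeerId and no persistence.

```java
// Hypothetical, simplified model of the "meta state" part of stepDown.
final class TermState {
    long currTerm;
    String votedFor; // null means "no vote cast in this term"

    TermState(final long currTerm, final String votedFor) {
        this.currTerm = currTerm;
        this.votedFor = votedFor;
    }

    // Adopts the new term only if it is higher. Returns true if the state changed,
    // which is the point where the real code persists term and vote.
    boolean stepDownTo(final long term) {
        if (term > this.currTerm) {
            this.currTerm = term;
            this.votedFor = null;
            return true;
        }
        return false;
    }
}
```

Calling stepDownTo with the node's own term, as init does through stepDown(this.currTerm, ...), leaves term and vote untouched in this model.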

stopVoteTimer and stopStepDownTimer mainly call the stop method of the corresponding RepeatedTimer. The stop method sets stopped to true and cancels the pending timeout, which adds that timeout to the cancelledTimeouts collection.
If you have read 2. SOFAJRaft source code analysis - How does JRaft's timed task scheduler work?, the following code should be easy to follow.

public void stop() {
    this.lock.lock();
    try {
        if (this.stopped) {
            return;
        }
        this.stopped = true;
        if (this.timeout != null) {
            this.timeout.cancel();
            this.running = false;
            this.timeout = null;
        }
    } finally {
        this.lock.unlock();
    }
}

The state machine handles the LEADER_STOP event

The onLeaderStop method of NodeImpl in turn calls onLeaderStop of FSMCallerImpl.
NodeImpl#onLeaderStop

private void onLeaderStop(final Status status) {
    this.replicatorGroup.clearFailureReplicators();
    this.fsmCaller.onLeaderStop(status);
}

FSMCallerImpl#onLeaderStop

public boolean onLeaderStop(final Status status) {
    return enqueueTask((task, sequence) -> {
        // set the type of the current task to LEADER_STOP
        task.type = TaskType.LEADER_STOP;
        task.status = new Status(status);
    });
}

private boolean enqueueTask(final EventTranslator<ApplyTask> tpl) {
    if (this.shutdownLatch != null) {
        // Shutting down
        LOG.warn("FSMCaller is stopped, can not apply new task.");
        return false;
    }
    // publish the event through the Disruptor
    this.taskQueue.publishEvent(tpl);
    return true;
}

This method publishes a LEADER_STOP event to the taskQueue queue. taskQueue is initialized in the init method of FSMCallerImpl:

public boolean init(final FSMCallerOptions opts) {
    .....
    this.disruptor = DisruptorBuilder.<ApplyTask>newInstance() //
            .setEventFactory(new ApplyTaskFactory()) //
            .setRingBufferSize(opts.getDisruptorBufferSize()) //
            .setThreadFactory(new NamedThreadFactory("JRaft-FSMCaller-Disruptor-", true)) //
            .setProducerType(ProducerType.MULTI) //
            .setWaitStrategy(new BlockingWaitStrategy()) //
            .build();
    this.disruptor.handleEventsWith(new ApplyTaskHandler());
    this.disruptor.setDefaultExceptionHandler(new LogExceptionHandler<Object>(getClass().getSimpleName()));
    this.taskQueue = this.disruptor.start();
    .....
}

After a task is published to taskQueue, it is handed to ApplyTaskHandler for processing.

ApplyTaskHandler

private class ApplyTaskHandler implements EventHandler<ApplyTask> {
    // max committed index in current batch, reset to -1 every batch
    private long maxCommittedIndex = -1;

    @Override
    public void onEvent(final ApplyTask event, final long sequence, final boolean endOfBatch) throws Exception {
        this.maxCommittedIndex = runApplyTask(event, this.maxCommittedIndex, endOfBatch);
    }
}

Whenever a task arrives in taskQueue, the onEvent method of ApplyTaskHandler is called to handle the event; the actual logic is carried out by runApplyTask.
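
The publish-then-handle flow can be pictured with a much simpler stand-in. The sketch below replaces the Disruptor ring buffer with a plain queue that is drained synchronously. All names are hypothetical, and it only illustrates how the task type selects the handler; it is not the Disruptor API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for the FSMCaller task queue (not the Disruptor API).
final class MiniTaskQueue {
    enum TaskType { LEADER_START, LEADER_STOP }

    static final class Task {
        final TaskType type;
        final String status;

        Task(final TaskType type, final String status) {
            this.type = type;
            this.status = status;
        }
    }

    private final Queue<Task> queue = new ArrayDeque<>();
    private final List<String> handled = new ArrayList<>();
    private boolean shutDown;

    // Like enqueueTask: refuse new tasks once shut down.
    boolean enqueue(final Task task) {
        if (this.shutDown) {
            return false;
        }
        return this.queue.offer(task);
    }

    void shutdown() {
        this.shutDown = true;
    }

    // Like runApplyTask: dispatch on the task type; returns everything handled so far.
    List<String> drain() {
        Task task;
        while ((task = this.queue.poll()) != null) {
            switch (task.type) {
                case LEADER_STOP:
                    this.handled.add("leader stopped: " + task.status);
                    break;
                case LEADER_START:
                    this.handled.add("leader started: " + task.status);
                    break;
                default:
                    break;
            }
        }
        return this.handled;
    }
}
```

The real implementation hands events to a dedicated consumer thread; the synchronous drain here only keeps the example deterministic.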

FSMCallerImpl#runApplyTask

private long runApplyTask(final ApplyTask task, long maxCommittedIndex, final boolean endOfBatch) {
    CountDownLatch shutdown = null;
    ...
     switch (task.type) {
         ...
        case LEADER_STOP:
            this.currTask = TaskType.LEADER_STOP;
            doLeaderStop(task.status);
            break;
        ...
    }
        ....
}

Many kinds of events are handled in runApplyTask; here we only look at how LEADER_STOP is handled.

The switch calls doLeaderStop, which in turn calls the onLeaderStop method of the StateMachine instance wrapped by FSMCallerImpl:

private void doLeaderStop(final Status status) {
    this.fsm.onLeaderStop(status);
}

Users can customize this callback to handle the leader stopping.

Resetting the leader

Next, resetLeaderId(PeerId.emptyPeer(), status) is called to reset the leader:

private void resetLeaderId(final PeerId newLeaderId, final Status status) {
    if (newLeaderId.isEmpty()) {
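        // true if the current node is a candidate or follower and already has a leader
        if (!this.leaderId.isEmpty() && this.state.compareTo(State.STATE_TRANSFERRING) > 0) {
            // publish a "stop following this leader" event to the state machine
            this.fsmCaller.onStopFollowing(new LeaderChangeContext(this.leaderId.copy(), this.currTerm, status));
        }
        // set the current leader to an empty value
        this.leaderId = PeerId.emptyPeer();
    } else {
        // if the current node has no leader
        if (this.leaderId == null || this.leaderId.isEmpty()) {
            // publish a "start following this leader" event
            this.fsmCaller.onStartFollowing(new LeaderChangeContext(newLeaderId, this.currTerm, status));
        }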
            this.fsmCaller.onStartFollowing(new LeaderChangeContext(newLeaderId, this.currTerm, status));
        }
        this.leaderId = newLeaderId.copy();
    }
}

This method has two roles. If the newLeaderId passed in is not empty, it sets the new leader and, if the node had no leader before, sends a START_FOLLOWING event to the state machine. If newLeaderId is empty, it sends a STOP_FOLLOWING event (when the node was following a leader) and clears the current leader.
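
The two branches can be modelled in a few lines. In this sketch strings stand in for PeerId, an event list stands in for the state machine callbacks, and the node-state check is left out; LeaderTracker is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of resetLeaderId (the node-state check is omitted).
final class LeaderTracker {
    private String leaderId = ""; // empty string means "no leader"
    final List<String> events = new ArrayList<>();

    void resetLeaderId(final String newLeaderId) {
        if (newLeaderId.isEmpty()) {
            // Only a node that was following someone has something to stop.
            if (!this.leaderId.isEmpty()) {
                this.events.add("STOP_FOLLOWING " + this.leaderId);
            }
            this.leaderId = "";
        } else {
            // Only announce a new leader when there was none before.
            if (this.leaderId.isEmpty()) {
                this.events.add("START_FOLLOWING " + newLeaderId);
            }
            this.leaderId = newLeaderId;
        }
    }

    String leaderId() {
        return this.leaderId;
    }
}
```

Clearing the leader on a node that has none produces no event, which is why calling stepDown on a freshly started node does not notify the state machine.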

Starting electionTimer and electing a leader

electionTimer is an implementation of RepeatedTimer; I won't say more about it here because the previous article already covered it.

Let's see how the onTrigger method of electionTimer handles the election event. onTrigger calls the handleElectionTimeout method of NodeImpl, so we go straight to that method:

NodeImpl#handleElectionTimeout

private void handleElectionTimeout() {
    boolean doUnlock = true;
    this.writeLock.lock();
    try {
        if (this.state != State.STATE_FOLLOWER) {
            return;
        }
        // if the leader has not timed out, it is still valid and no election is needed
        if (isCurrentLeaderValid()) {
            return;
        }
        resetLeaderId(PeerId.emptyPeer(), new Status(RaftError.ERAFTTIMEDOUT, "Lost connection from leader %s.",
            this.leaderId));
        doUnlock = false;
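        // pre-vote phase:
        // before starting a real vote, the candidate first starts a pre-vote;
        // if it does not get responses from more than half of the nodes, the candidate gives up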
        preVote();
    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
    }
}

This method first acquires the write lock, then runs its checks, and finally starts a pre-vote.

The checks verify that the current state is follower and whether the time since the last communication from the leader exceeds ElectionTimeoutMs. If it does not, the leader is alive and there is no need to start an election; if the communication has timed out, the leader is cleared and the pre-vote is started.

NodeImpl#isCurrentLeaderValid

private boolean isCurrentLeaderValid() {
    return Utils.monotonicMs() - this.lastLeaderTimestamp < this.options.getElectionTimeoutMs();
}

It subtracts the time of the last communication from the leader from the current time; if the difference is less than ElectionTimeoutMs (1 s by default), there is no timeout and the leader is still valid.
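
To make the boundary concrete, here is the same arithmetic with the clock passed in as a parameter. LeaderLease is a hypothetical helper and the numbers in the note below are only an example.

```java
// Hypothetical helper that mirrors the arithmetic of isCurrentLeaderValid,
// with the clock passed in so that the result is easy to reason about.
final class LeaderLease {
    private LeaderLease() {
    }

    static boolean isLeaderValid(final long nowMs, final long lastLeaderTimestampMs,
                                 final long electionTimeoutMs) {
        return nowMs - lastLeaderTimestampMs < electionTimeoutMs;
    }
}
```

With a 1000 ms timeout and the leader last heard from at t = 5000 ms, the leader still counts as valid at t = 5999 ms and no longer does at t = 6000 ms, because the comparison is strict.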

Pre-vote: preVote

The last thing handleElectionTimeout does is call preVote, so let's focus on that method next.

I split preVote into several parts:
NodeImpl#preVote part1

private void preVote() {
    long oldTerm;
    try {
        LOG.info("Node {} term {} start preVote.", getNodeId(), this.currTerm);
        if (this.snapshotExecutor != null && this.snapshotExecutor.isInstallingSnapshot()) {
            LOG.warn(
                "Node {} term {} doesn't do preVote when installing snapshot as the configuration may be out of date.",
                getNodeId(), this.currTerm);
            return;
        }
        // conf records the nodes of the cluster; if the current node is not part of it, something is wrong
        if (!this.conf.contains(this.serverId)) {
            LOG.warn("Node {} can't do preVote as it is not in conf <{}>.", getNodeId(), this.conf);
            return;
        }
        // remember the current term
        oldTerm = this.currTerm;
    } finally {
        this.writeLock.unlock();
    } 
      ....
}

This first part of preVote runs some checks on entry: a node that is installing a snapshot must not take part in the election, and the node must be contained in conf, which holds all nodes of the cluster. Finally it records the current term and releases the lock.

NodeImpl#preVote part2

private void preVote() {
      ....
    // get the id of the latest log entry
    final LogId lastLogId = this.logManager.getLastLogId(true);

    boolean doUnlock = true;
    this.writeLock.lock();
    try {
        // pre_vote need defense ABA after unlock&writeLock
        // currTerm may have been changed by another thread while the lock was released above, so check it here
        if (oldTerm != this.currTerm) {
            LOG.warn("Node {} raise term {} when get lastLogId.", getNodeId(), this.currTerm);
            return;
        }
        // initialize the pre-vote ballot
        this.prevVoteCtx.init(this.conf.getConf(), this.conf.isStable() ? null : this.conf.getOldConf());
        for (final PeerId peer : this.conf.listPeers()) {
            // skip the current node itself
            if (peer.equals(this.serverId)) {
                continue;
            }
            // also skip nodes that cannot be reached
            if (!this.rpcService.connect(peer.getEndpoint())) {
                LOG.warn("Node {} channel init failed, address={}.", getNodeId(), peer.getEndpoint());
                continue;
            }
            // create the callback
            final OnPreVoteRpcDone done = new OnPreVoteRpcDone(peer, this.currTerm);
            // send a pre-vote request to this peer
            done.request = RequestVoteRequest.newBuilder() //
                .setPreVote(true) // it's a pre-vote request.
                .setGroupId(this.groupId) //
                .setServerId(this.serverId.toString()) //
                .setPeerId(peer.toString()) //
                .setTerm(this.currTerm + 1) // next term; note that the term sent is incremented by one
                .setLastLogIndex(lastLogId.getIndex()) //
                .setLastLogTerm(lastLogId.getTerm()) //
                .build();
            this.rpcService.preVote(peer.getEndpoint(), done.request, done);
        }
        // the node also grants a vote to itself
        this.prevVoteCtx.grant(this.serverId);
        if (this.prevVoteCtx.isGranted()) {
            doUnlock = false;
            electSelf();
        }
    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
    }
}

This part of the code:

  1. First gets the latest log information, wrapped in a LogId, which has two parts: the index of the log entry and the term in which it was written
  2. Initializes the pre-vote ballot
  3. Iterates over all nodes in the cluster
  4. Skips the node if it is the current node, and also skips it if it cannot be connected, for example because it is down or was taken offline manually
  5. Sends a RequestVoteRequest to each remaining node, asking it to pre-vote for the current node
  6. Finally, since the current node is also a member of the cluster, it votes for itself
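
Between part1 and part2 the write lock is released while the last log id is read, and the term is compared again once the lock is re-acquired. A minimal sketch of that "remember, unlock, work, relock, re-check" pattern follows; TermGuard and its methods are hypothetical names, and the slow work is passed in as a Runnable.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of "remember term, unlock, do slow work, relock, re-check term".
final class TermGuard {
    private final ReentrantLock lock = new ReentrantLock();
    private long currTerm;

    TermGuard(final long initialTerm) {
        this.currTerm = initialTerm;
    }

    void bumpTerm() {
        this.lock.lock();
        try {
            this.currTerm++;
        } finally {
            this.lock.unlock();
        }
    }

    // Runs slowWork without holding the lock and returns true only if the term
    // is unchanged afterwards; on false the caller should give up this round.
    boolean runIfTermUnchanged(final Runnable slowWork) {
        final long oldTerm;
        this.lock.lock();
        try {
            oldTerm = this.currTerm;
        } finally {
            this.lock.unlock();
        }
        slowWork.run(); // e.g. reading the last log id
        this.lock.lock();
        try {
            return oldTerm == this.currTerm;
        } finally {
            this.lock.unlock();
        }
    }
}
```

Releasing the lock keeps other request handlers from being blocked by a potentially slow log read, at the price of having to detect that the term moved in the meantime.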

Initializing the pre-vote ballot

The pre-vote ballot is initialized by calling the init method of Ballot, passing in the new cluster configuration and the old cluster configuration:

public boolean init(Configuration conf, Configuration oldConf) {
    this.peers.clear();
    this.oldPeers.clear();
    quorum = oldQuorum = 0;
    int index = 0;
    // initialize the peers of the new configuration
    if (conf != null) {
        for (PeerId peer : conf) {
            this.peers.add(new UnfoundPeerId(peer, index++, false));
        }
    }
    // number of votes needed to become leader (a majority)
    this.quorum = this.peers.size() / 2 + 1;
    ....
    return true;
}

To keep the logic clear, I assume there is no oldConf here and omit the oldConf setup.
This method iterates over all peers, wraps each one in an UnfoundPeerId, and adds it to the peers collection. It then sets the quorum attribute, which is decremented by one for every vote obtained; when it drops to 0 or below, enough votes have been obtained and the pre-vote has succeeded.

Sending the pre-vote request

// create the callback
final OnPreVoteRpcDone done = new OnPreVoteRpcDone(peer, this.currTerm);
// send a pre-vote request to this peer
done.request = RequestVoteRequest.newBuilder() //
    .setPreVote(true) // it's a pre-vote request.
    .setGroupId(this.groupId) //
    .setServerId(this.serverId.toString()) //
    .setPeerId(peer.toString()) //
    .setTerm(this.currTerm + 1) // next term; note that the term sent is incremented by one
    .setLastLogIndex(lastLogId.getIndex()) //
    .setLastLogTerm(lastLogId.getTerm()) //
    .build();
this.rpcService.preVote(peer.getEndpoint(), done.request, done);

When the RequestVoteRequest is built, the PreVote property is set to true to mark the request as a pre-vote, ServerId is set to the current node, and the term sent to the peer is the current node's term plus one. Once the request has been sent and a response is received, the run method of the OnPreVoteRpcDone callback is invoked.

OnPreVoteRpcDone#run

public void run(final Status status) {
    NodeImpl.this.metrics.recordLatency("pre-vote", Utils.monotonicMs() - this.startMs);
    if (!status.isOk()) {
        LOG.warn("Node {} PreVote to {} error: {}.", getNodeId(), this.peer, status);
    } else {
        handlePreVoteResponse(this.peer, this.term, getResponse());
    }
}

If a normal response is received, handlePreVoteResponse is called to process it.

NodeImpl#handlePreVoteResponse

public void handlePreVoteResponse(final PeerId peerId, final long term, final RequestVoteResponse response) {
    boolean doUnlock = true;
    this.writeLock.lock();
    try {
        // only a follower may try to start an election
        if (this.state != State.STATE_FOLLOWER) {
            LOG.warn("Node {} received invalid PreVoteResponse from {}, state not in STATE_FOLLOWER but {}.",
                getNodeId(), peerId, this.state);
            return;
        }
        
        if (term != this.currTerm) {
            LOG.warn("Node {} received invalid PreVoteResponse from {}, term={}, currTerm={}.", getNodeId(),
                peerId, term, this.currTerm);
            return;
        }
        // if the term in the response is greater than the current term, the response is also invalid and the node steps down
        if (response.getTerm() > this.currTerm) {
            LOG.warn("Node {} received invalid PreVoteResponse from {}, term {}, expect={}.", getNodeId(), peerId,
                response.getTerm(), this.currTerm);
            stepDown(response.getTerm(), false, new Status(RaftError.EHIGHERTERMRESPONSE,
                "Raft node receives higher term pre_vote_response."));
            return;
        }
        LOG.info("Node {} received PreVoteResponse from {}, term={}, granted={}.", getNodeId(), peerId,
            response.getTerm(), response.getGranted());
        // check granted quorum?
        if (response.getGranted()) {
            this.prevVoteCtx.grant(peerId);
            // votes from more than half of the nodes have been received
            if (this.prevVoteCtx.isGranted()) {
                doUnlock = false;
                // start the election
                electSelf();
            }
        }
    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
    }
}

Three checks are made here; let's go through them:

  1. The first check verifies the current state: a node that is not a FOLLOWER cannot start an election. A leader does not run for election; it can only step down via stepDown, become a FOLLOWER, and then take part in a new election. A CANDIDATE only got there through a vote that it started as a FOLLOWER. So in effect only a FOLLOWER can start an election.
    By design, Raft elections can only be started by a FOLLOWER, so this is verified here.
  2. The second check verifies that the term recorded when the request was sent is the same as the current term; if it is not, the response belongs to an earlier round of the election and is ignored
  3. The third check verifies whether the term returned in the response is greater than the current term; if it is, the node steps down, which adopts the higher term and resets the current leader

Once the checks pass, the response is examined for whether the peer granted the vote; if it did, the grant method of Ballot is called to record a vote for the current node.
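
The order of these checks can be written as a pure function, which makes each outcome easy to see. This is only a sketch under simplifying assumptions: the enum names are hypothetical, and the real method also takes locks, logs, and calls stepDown and electSelf.

```java
// Hypothetical pure-function version of the checks in handlePreVoteResponse.
final class PreVoteResponseCheck {
    enum State { LEADER, FOLLOWER, CANDIDATE }

    enum Outcome { IGNORE_NOT_FOLLOWER, IGNORE_STALE_ROUND, STEP_DOWN, COUNT_VOTE, NOT_GRANTED }

    private PreVoteResponseCheck() {
    }

    static Outcome check(final State state, final long requestTerm, final long currTerm,
                         final long responseTerm, final boolean granted) {
        if (state != State.FOLLOWER) {
            return Outcome.IGNORE_NOT_FOLLOWER; // check 1: only followers run a pre-vote
        }
        if (requestTerm != currTerm) {
            return Outcome.IGNORE_STALE_ROUND; // check 2: response to an earlier round
        }
        if (responseTerm > currTerm) {
            return Outcome.STEP_DOWN; // check 3: the peer knows a higher term
        }
        return granted ? Outcome.COUNT_VOTE : Outcome.NOT_GRANTED;
    }
}
```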

Ballot#grant

public void grant(PeerId peerId) {
    this.grant(peerId, new PosHint());
}

public PosHint grant(PeerId peerId, PosHint hint) {
    UnfoundPeerId peer = findPeer(peerId, peers, hint.pos0);
    if (peer != null) {
        if (!peer.found) {
            peer.found = true;
            this.quorum--;
        }
        hint.pos0 = peer.index;
    } else {
        hint.pos0 = -1;
    }
    .... 
    return hint;
}

The grant method looks up the UnfoundPeerId instance in the peers collection by peerId. If the peer has not been recorded before, quorum is decremented by one to acknowledge the vote, and found is set to true so that the same peer is not counted twice.

UnfoundPeerId is set up in an interesting way.
First, this is how entries are added to the peers collection:

int index = 0;
for (PeerId peer : conf) {
    this.peers.add(new UnfoundPeerId(peer, index++, false));
}

conf is iterated and each entry is given an index, starting from zero.
Then, on lookup, peerId, the peers collection and posHint are passed in:

private UnfoundPeerId findPeer(PeerId peerId, List<UnfoundPeerId> peers, int posHint) {
    if (posHint < 0 || posHint >= peers.size() || !peers.get(posHint).peerId.equals(peerId)) {
        for (UnfoundPeerId ufp : peers) {
            if (ufp.peerId.equals(peerId)) {
                return ufp;
            }
        }
        return null;
    }

    return peers.get(posHint);
}

The default posHint passed in is -1, so on the first call the whole peers collection is traversed and the matching entry is returned.

After the call completes, pos0 of the PosHint instance is set to the index of the peer. If grant is then called again with that hint, findPeer can fetch the entry directly with get and does not have to traverse the whole collection.

The same technique can be used in everyday code.

After calling grant, isGranted of Ballot is called to check whether more than half of the nodes have responded.
Ballot#isGranted
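
As a stand-alone illustration of the position-hint technique, here is a minimal sketch over a list of strings. HintedLookup is a hypothetical helper, not part of JRaft.

```java
import java.util.List;

// Hypothetical helper: look up an element, trying a remembered position first.
final class HintedLookup {
    private HintedLookup() {
    }

    // Returns the index of key in items, or -1 if it is absent.
    static int find(final List<String> items, final String key, final int posHint) {
        if (posHint >= 0 && posHint < items.size() && items.get(posHint).equals(key)) {
            return posHint; // fast path: the hint is still correct
        }
        return items.indexOf(key); // slow path: linear scan
    }
}
```

The caller keeps the returned index and passes it back as the hint next time. A stale or missing hint only costs a linear scan, so the result is the same either way.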

public boolean isGranted() {
    return this.quorum <= 0 && oldQuorum <= 0;
}

That is, it checks whether the number of votes still needed has dropped to zero. If it returns true, electSelf is called to start the election.

We'll leave the election method for later and first look at how the pre-vote request is handled.

Responding to a RequestVoteRequest

The processor for RequestVoteRequest is registered in the addRaftRequestProcessors method of RaftRpcServerFactory; the concrete processor is RequestVoteRequestProcessor.

The actual handling is done by the processRequest0 method.
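
isGranted looks at two counters because, during a configuration change, a majority is needed in both the new and the old configuration. The sketch below assumes that oldQuorum is computed the same way as quorum when an old configuration is present and stays 0 otherwise; the article omits that part of init, so treat this as an assumption. JointBallot is a hypothetical class that uses strings instead of PeerId.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simplified ballot counting votes against a new and an optional old peer set.
final class JointBallot {
    private final Set<String> pending;
    private final Set<String> oldPending;
    private int quorum;
    private int oldQuorum;

    JointBallot(final Set<String> peers, final Set<String> oldPeers) {
        this.pending = new HashSet<>(peers);
        this.quorum = peers.size() / 2 + 1;
        this.oldPending = new HashSet<>(oldPeers);
        // Assumption: no old configuration means no extra majority is required.
        this.oldQuorum = oldPeers.isEmpty() ? 0 : oldPeers.size() / 2 + 1;
    }

    // Counts each peer at most once per configuration.
    void grant(final String peer) {
        if (this.pending.remove(peer)) {
            this.quorum--;
        }
        if (this.oldPending.remove(peer)) {
            this.oldQuorum--;
        }
    }

    boolean isGranted() {
        return this.quorum <= 0 && this.oldQuorum <= 0;
    }
}
```

For a stable three-node cluster the quorum is 3 / 2 + 1 = 2, so the node's own vote plus one granted response is enough.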

RequestVoteRequestProcessor#processRequest0

public Message processRequest0(RaftServerService service, RequestVoteRequest request, RpcRequestClosure done) {
    // if this is a pre-vote request
    if (request.getPreVote()) {
        return service.handlePreVoteRequest(request);
    } else {
        return service.handleRequestVoteRequest(request);
    }
}

Because this processor handles both pre-vote and vote requests, it needs the extra check. The pre-vote is implemented by the handlePreVoteRequest method of NodeImpl.

NodeImpl#handlePreVoteRequest

public Message handlePreVoteRequest(final RequestVoteRequest request) {
    boolean doUnlock = true;
    this.writeLock.lock();
    try {
        // check that this node is in an active state
        if (!this.state.isActive()) {
            LOG.warn("Node {} is not in active state, currTerm={}.", getNodeId(), this.currTerm);
            return RpcResponseFactory.newResponse(RaftError.EINVAL, "Node %s is not in active state, state %s.",
                getNodeId(), this.state.name());
        }
        final PeerId candidateId = new PeerId();
        // the ServerId carried by the request must be well-formed
        if (!candidateId.parse(request.getServerId())) {
            LOG.warn("Node {} received PreVoteRequest from {} serverId bad format.", getNodeId(),
                request.getServerId());
            return RpcResponseFactory.newResponse(RaftError.EINVAL, "Parse candidateId failed: %s.",
                request.getServerId());
        }
        boolean granted = false;
        // noinspection ConstantConditions
        do {
            // a valid leader already exists
            if (this.leaderId != null && !this.leaderId.isEmpty() && isCurrentLeaderValid()) {
                LOG.info(
                    "Node {} ignore PreVoteRequest from {}, term={}, currTerm={}, because the leader {}'s lease is still valid.",
                    getNodeId(), request.getServerId(), request.getTerm(), this.currTerm, this.leaderId);
                break;
            }
            // the request term is lower than the current term
            if (request.getTerm() < this.currTerm) {
                LOG.info("Node {} ignore PreVoteRequest from {}, term={}, currTerm={}.", getNodeId(),
                    request.getServerId(), request.getTerm(), this.currTerm);
                // A follower replicator may not be started when this node become leader, so we must check it.
                // this node may be the leader, so check whether the requester's replicator must be added back to replicatorGroup
                checkReplicator(candidateId);
                break;
            } else if (request.getTerm() == this.currTerm + 1) {
                // A follower replicator may not be started when this node become leader, so we must check it.
                // check replicator state
                // the request term is the current term plus one, so this node may be the leader;
                // check whether the requester's replicator must be added back to replicatorGroup
                checkReplicator(candidateId);
            }
            doUnlock = false;
            this.writeLock.unlock();
            // get the id of the latest local log entry
            final LogId lastLogId = this.logManager.getLastLogId(true);

            doUnlock = true;
            this.writeLock.lock();
            final LogId requestLastLogId = new LogId(request.getLastLogIndex(), request.getLastLogTerm());
            // compare how up to date the requester's log is with the local log
            granted = requestLastLogId.compareTo(lastLogId) >= 0;

            LOG.info(
                "Node {} received PreVoteRequest from {}, term={}, currTerm={}, granted={}, requestLastLogId={}, lastLogId={}.",
                getNodeId(), request.getServerId(), request.getTerm(), this.currTerm, granted, requestLastLogId,
                lastLogId);
        } while (false); // do-while (false) is used only so that break can leave the block early

        return RequestVoteResponse.newBuilder() //
            .setTerm(this.currTerm) //
            .setGranted(granted) //
            .build();
    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
    }
}

This method is long, but its logic is clear:

  1. First it calls isActive to check that the current node is in a normal state; if not, an error response is returned
  2. It parses the ServerId carried by the request into the candidateId instance
  3. If the current node has a leader and that leader is still valid, it breaks directly and returns granted as false
  4. If the current term is greater than the request term, it calls checkReplicator: if the current node is the leader, the requesting peer is removed from failureReplicators and added back to replicatorMap. Then it breaks
  5. If the request term equals the current term plus one (a pre-vote request carries currTerm + 1), checkReplicator is called as well, but without a break
  6. If the last log of the request is at least as up to date as the last log of the current node, granted is true and the pre-vote is granted
One detail is interesting: Java has no goto, and an unlabeled break can only be used inside a loop or a switch, so a do-while (false) loop that runs exactly once is used here so that break can leave the block early.
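
The grant decision and the do-while (false) idiom can be shown together in a small sketch. LogPos is a hypothetical stand-in for LogId, and the comparison order (term first, index as tie-breaker) is an assumption based on how Raft defines an up-to-date log, not something taken from the listing above.

```java
// Hypothetical sketch of the pre-vote grant decision; LogPos stands in for LogId.
final class PreVoteDecision {
    static final class LogPos implements Comparable<LogPos> {
        final long index;
        final long term;

        LogPos(final long index, final long term) {
            this.index = index;
            this.term = term;
        }

        @Override
        public int compareTo(final LogPos o) {
            // Assumption: a higher term wins; the index only breaks ties.
            final int c = Long.compare(this.term, o.term);
            return c != 0 ? c : Long.compare(this.index, o.index);
        }
    }

    private PreVoteDecision() {
    }

    static boolean shouldGrant(final boolean leaderValid, final long requestTerm, final long currTerm,
                               final LogPos requestLast, final LogPos localLast) {
        boolean granted = false;
        do {
            if (leaderValid) {
                break; // a valid leader exists: refuse
            }
            if (requestTerm < currTerm) {
                break; // stale candidate: refuse
            }
            granted = requestLast.compareTo(localLast) >= 0;
        } while (false); // runs once; exists only so that break can be used
        return granted;
    }
}
```

A labeled block would also allow an early exit; the single-pass loop is simply the style the original code chose.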

Now a brief look at checkReplicator:
NodeImpl#checkReplicator

private void checkReplicator(final PeerId candidateId) {
    if (this.state == State.STATE_LEADER) {
        this.replicatorGroup.checkReplicator(candidateId, false);
    }
}

It checks whether the current node is the leader, and if so calls checkReplicator of ReplicatorGroupImpl.

ReplicatorGroupImpl#checkReplicator

private final ConcurrentMap<PeerId, ThreadId> replicatorMap      = new ConcurrentHashMap<>();

private final Set<PeerId>                     failureReplicators = new ConcurrentHashSet<>();

public void checkReplicator(final PeerId peer, final boolean lockNode) {
    // look up the ThreadId for the given peer
    final ThreadId rid = this.replicatorMap.get(peer);
    // noinspection StatementWithEmptyBody
    if (rid == null) {
        // Create replicator if it's not found for leader.
        final NodeImpl node = this.commonOptions.getNode();
        if (lockNode) {
            node.writeLock.lock();
        }
        try {
            // if the current node is the leader and the peer is in failureReplicators, add it back to replicatorMap
            if (node.isLeader() && this.failureReplicators.contains(peer) && addReplicator(peer)) {
                this.failureReplicators.remove(peer);
            }
        } finally {
            if (lockNode) {
                node.writeLock.unlock();
            }
        }
    } else { // NOPMD
        // Unblock it right now.
        // Replicator.unBlockAndSendNow(rid);
    }
}

checkReplicator looks up the given peer in replicatorMap, which holds the replicators started for the other nodes of the cluster, and only acts if there is no entry. It then obtains the current Node instance from commonOptions of ReplicatorGroupImpl; if that node is the leader and the peer is in the failureReplicators set, the peer is added back to replicatorMap.

ReplicatorGroupImpl#addReplicator

public boolean addReplicator(final PeerId peer) {
    //校验当前的任期
    Requires.requireTrue(this.commonOptions.getTerm() != 0);
    //如果replicatorMap里面已经有这个节点了,那么将它从failureReplicators集合中移除
    if (this.replicatorMap.containsKey(peer)) {
        this.failureReplicators.remove(peer);
        return true;
    }
    // Create a new ReplicatorOptions (a copy of commonOptions when it is present)
    final ReplicatorOptions opts = this.commonOptions == null ? new ReplicatorOptions() : this.commonOptions.copy();
    // Set this PeerId on the new ReplicatorOptions
    opts.setPeerId(peer);
    final ThreadId rid = Replicator.start(opts, this.raftOptions);
    if (rid == null) {
        LOG.error("Fail to start replicator to peer={}.", peer);
        this.failureReplicators.add(peer);
        return false;
    }
    return this.replicatorMap.put(peer, rid) == null;
}

addReplicator mainly does two things: (1) it removes the peer being added from the failureReplicators set, and (2) it adds the peer to replicatorMap. If the replicator fails to start, the peer is recorded in failureReplicators instead.
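The interplay between checkReplicator and addReplicator can be sketched in isolation. This is only a simplified model, not the SOFAJRaft code: peers are plain strings, a replicator is just a numeric id, and the `canStart` predicate is a hypothetical stand-in for whether `Replicator.start` succeeds.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Predicate;

// Hypothetical, simplified model of ReplicatorGroupImpl's bookkeeping.
class ReplicatorGroupSketch {
    private final ConcurrentMap<String, Long> replicatorMap = new ConcurrentHashMap<>();
    private final Set<String> failureReplicators = ConcurrentHashMap.newKeySet();
    // Assumption: decides whether starting a replicator for a peer succeeds.
    private final Predicate<String> canStart;
    private long nextId = 1L;

    ReplicatorGroupSketch(Predicate<String> canStart) {
        this.canStart = canStart;
    }

    boolean addReplicator(String peer) {
        // Already replicating: just clear any stale failure record.
        if (replicatorMap.containsKey(peer)) {
            failureReplicators.remove(peer);
            return true;
        }
        // Start failed: remember the peer so a later check can retry it.
        if (!canStart.test(peer)) {
            failureReplicators.add(peer);
            return false;
        }
        return replicatorMap.put(peer, nextId++) == null;
    }

    void checkReplicator(String peer, boolean isLeader) {
        if (replicatorMap.containsKey(peer)) {
            return; // nothing to repair
        }
        // Only a leader retries peers that previously failed to start.
        if (isLeader && failureReplicators.contains(peer) && addReplicator(peer)) {
            failureReplicators.remove(peer);
        }
    }

    boolean isReplicating(String peer) {
        return replicatorMap.containsKey(peer);
    }

    boolean isFailed(String peer) {
        return failureReplicators.contains(peer);
    }
}
```

In this model a peer that could not be reached when the leader was elected stays in the failure set until a later checkReplicator call on the leader succeeds in starting it.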

Voting electSelf

private void electSelf() {
    long oldTerm;
    try {
        LOG.info("Node {} start vote and grant vote self, term={}.", getNodeId(), this.currTerm);
        //1. Do not start an election if the current node is not in the cluster
        if (!this.conf.contains(this.serverId)) {
            LOG.warn("Node {} can't do electSelf as it is not in {}.", getNodeId(), this.conf);
            return;
        }
        //2. The formal election is about to start, so stop the pre-vote (election) timer
        if (this.state == State.STATE_FOLLOWER) {
            LOG.debug("Node {} stop election timer, term={}.", getNodeId(), this.currTerm);
            this.electionTimer.stop();
        }
        //3. Clear the leader
        resetLeaderId(PeerId.emptyPeer(), new Status(RaftError.ERAFTTIMEDOUT,
            "A follower's leader_id is reset to NULL as it begins to request_vote."));
        this.state = State.STATE_CANDIDATE;
        this.currTerm++;
        this.votedId = this.serverId.copy();
        LOG.debug("Node {} start vote timer, term={} .", getNodeId(), this.currTerm);
        //4. Start the vote timer, because a failed vote may need to be retried
        this.voteTimer.start();
        //5. Initialize the ballot box
        this.voteCtx.init(this.conf.getConf(), this.conf.isStable() ? null : this.conf.getOldConf());
        oldTerm = this.currTerm;
    } finally {
        this.writeLock.unlock();
    }

    final LogId lastLogId = this.logManager.getLastLogId(true);

    this.writeLock.lock();
    try {
        // vote need defense ABA after unlock&writeLock
        if (oldTerm != this.currTerm) {
            LOG.warn("Node {} raise term {} when getLastLogId.", getNodeId(), this.currTerm);
            return;
        }
        //6. Iterate over all peers
        for (final PeerId peer : this.conf.listPeers()) {
            if (peer.equals(this.serverId)) {
                continue;
            }
            if (!this.rpcService.connect(peer.getEndpoint())) {
                LOG.warn("Node {} channel init failed, address={}.", getNodeId(), peer.getEndpoint());
                continue;
            }
            final OnRequestVoteRpcDone done = new OnRequestVoteRpcDone(peer, this.currTerm, this);
            done.request = RequestVoteRequest.newBuilder() //
                .setPreVote(false) // It's not a pre-vote request.
                .setGroupId(this.groupId) //
                .setServerId(this.serverId.toString()) //
                .setPeerId(peer.toString()) //
                .setTerm(this.currTerm) //
                .setLastLogIndex(lastLogId.getIndex()) //
                .setLastLogTerm(lastLogId.getTerm()) //
                .build();
            this.rpcService.requestVote(peer.getEndpoint(), done.request, done);
        }

        this.metaStorage.setTermAndVotedFor(this.currTerm, this.serverId);
        this.voteCtx.grant(this.serverId);
        if (this.voteCtx.isGranted()) {
            //7. The vote succeeded, so promote this node to leader
            becomeLeader();
        }
    } finally {
        this.writeLock.unlock();
    }
}

Don't be put off by the length of this method; much of it repeats the pre-vote method preVote covered earlier. Because the method is long, I numbered the steps in the code and explain them one by one:

  1. Check whether the current node is in the cluster; if it is not, no election is held.
  2. The election is started by a Follower. Since the formal election is about to begin, the pre-vote (election) timer is stopped.
  3. Clear the leader and begin the election: the state becomes CANDIDATE, the term is incremented, and votedId is set to the current node, meaning the node votes for itself.
  4. Start the vote timer, because a failed vote may need to be retried. When voteTimer fires and the node is still in the CANDIDATE state, it calls electSelf again.
  5. Call the init method to initialize the ballot box (voteCtx), in the same way as prevVoteCtx.
  6. Iterate over all peers and send a RequestVoteRequest to each of the other nodes in the cluster. As with preVote, the request is handled by RequestVoteRequestProcessor.
  7. If more than half of the nodes grant the vote, call becomeLeader to promote this node to leader.
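The steps above can be condensed into a small state machine. The following is a minimal sketch under simplifying assumptions, not the NodeImpl code: peers are strings, there are no timers, locks, or RPCs, and `onVoteGranted` is a hypothetical callback standing in for a granted RequestVoteResponse.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of a candidate running for election.
class CandidateSketch {
    enum State { FOLLOWER, CANDIDATE, LEADER }

    final String serverId;
    final List<String> conf; // all voting members, including this node
    State state = State.FOLLOWER;
    long currTerm;
    String votedId;
    private final Set<String> granted = new HashSet<>();

    CandidateSketch(String serverId, List<String> conf, long currTerm) {
        this.serverId = serverId;
        this.conf = conf;
        this.currTerm = currTerm;
    }

    void electSelf() {
        // Step 1: a node outside the configuration may not run.
        if (!conf.contains(serverId)) {
            return;
        }
        // Step 3: become candidate, bump the term, vote for ourselves.
        state = State.CANDIDATE;
        currTerm++;
        votedId = serverId;
        // Step 5: start from an empty ballot box.
        granted.clear();
        // Step 7: count our own vote; a single-node cluster wins at once.
        onVoteGranted(serverId, currTerm);
    }

    // Assumption: called when a peer grants its vote for the given term.
    void onVoteGranted(String peer, long term) {
        // Ignore responses from another term or from unknown peers.
        if (state != State.CANDIDATE || term != currTerm || !conf.contains(peer)) {
            return;
        }
        granted.add(peer);
        // A strict majority of the configuration is required.
        if (granted.size() >= conf.size() / 2 + 1) {
            state = State.LEADER;
        }
    }
}
```

Checking the term in `onVoteGranted` plays a role similar to the `oldTerm != this.currTerm` guard in electSelf: a vote that belongs to an earlier term must not count toward the current election.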

Let's look at how RequestVoteRequestProcessor handles the election request: its processRequest0 method calls NodeImpl's handleRequestVoteRequest to handle the actual logic.

Processing vote requests

NodeImpl#handleRequestVoteRequest

public Message handleRequestVoteRequest(final RequestVoteRequest request) {
    boolean doUnlock = true;
    this.writeLock.lock();
    try {
        // Check whether the node is active
        if (!this.state.isActive()) {
            LOG.warn("Node {} is not in active state, currTerm={}.", getNodeId(), this.currTerm);
            return RpcResponseFactory.newResponse(RaftError.EINVAL, "Node %s is not in active state, state %s.",
                getNodeId(), this.state.name());
        }
        final PeerId candidateId = new PeerId();
        if (!candidateId.parse(request.getServerId())) {
            LOG.warn("Node {} received RequestVoteRequest from {} serverId bad format.", getNodeId(),
                request.getServerId());
            return RpcResponseFactory.newResponse(RaftError.EINVAL, "Parse candidateId failed: %s.",
                request.getServerId());
        }

        // noinspection ConstantConditions
        do {
            // check term
            if (request.getTerm() >= this.currTerm) {
                LOG.info("Node {} received RequestVoteRequest from {}, term={}, currTerm={}.", getNodeId(),
                        request.getServerId(), request.getTerm(), this.currTerm);
                //1. If the request's term is greater than the current term
                // increase current term, change state to follower
                if (request.getTerm() > this.currTerm) {
                    stepDown(request.getTerm(), false, new Status(RaftError.EHIGHERTERMRESPONSE,
                        "Raft node receives higher term RequestVoteRequest."));
                }
            } else {
                // ignore older term
                LOG.info("Node {} ignore RequestVoteRequest from {}, term={}, currTerm={}.", getNodeId(),
                    request.getServerId(), request.getTerm(), this.currTerm);
                break;
            }
            doUnlock = false;
            this.writeLock.unlock();

            final LogId lastLogId = this.logManager.getLastLogId(true);

            doUnlock = true;
            this.writeLock.lock();
            // vote need ABA check after unlock&writeLock
            if (request.getTerm() != this.currTerm) {
                LOG.warn("Node {} raise term {} when get lastLogId.", getNodeId(), this.currTerm);
                break;
            }
            //2. Check that the candidate's log is at least as up to date as the local log
            final boolean logIsOk = new LogId(request.getLastLogIndex(), request.getLastLogTerm())
                .compareTo(lastLogId) >= 0;
            //3. Check whether the current node has already voted
            if (logIsOk && (this.votedId == null || this.votedId.isEmpty())) {
                stepDown(request.getTerm(), false, new Status(RaftError.EVOTEFORCANDIDATE,
                    "Raft node votes for some candidate, step down to restart election_timer."));
                this.votedId = candidateId.copy();
                this.metaStorage.setVotedFor(candidateId);
            }
        } while (false);

        return RequestVoteResponse.newBuilder() //
            .setTerm(this.currTerm) //
            //4. The vote is granted only if the request's term equals the current term and votedId has been set to the requesting node
            .setGranted(request.getTerm() == this.currTerm && candidateId.equals(this.votedId)) //
            .build();
    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
    }
}

This method is broadly similar to handlePreVoteRequest, so I only analyze the numbered steps here.

  1. If the current term is smaller than the request's term, call stepDown to adopt the request's term and set the current node's state to Follower.
  2. A leader's log must be at least as complete as the logs of the nodes that vote for it, so this step checks whether the requesting node's log is at least as up to date as the local log.
  3. If the log check passes and the current node has not yet voted for another candidate, votedId is set to the requesting node.
  4. In the response, the vote is granted only if the request's term equals the current term and votedId has been set to the requesting node.
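The four steps can be written as a small decision function. This is a sketch under assumptions rather than the NodeImpl code: candidates are strings, persistence and locking are left out, stepping down on a higher term is assumed to clear the previous vote (as in standard Raft), and logs are compared by term first and then by index, which is how I understand `LogId.compareTo` to behave.

```java
// Hypothetical, simplified model of how a node answers a vote request.
class VoteHandlerSketch {
    long currTerm;
    long lastLogIndex;
    long lastLogTerm;
    String votedId; // null means no vote cast in currTerm

    VoteHandlerSketch(long currTerm, long lastLogIndex, long lastLogTerm) {
        this.currTerm = currTerm;
        this.lastLogIndex = lastLogIndex;
        this.lastLogTerm = lastLogTerm;
    }

    boolean handleRequestVote(String candidateId, long term,
                              long candLastLogIndex, long candLastLogTerm) {
        // Requests from an older term are ignored.
        if (term < currTerm) {
            return false;
        }
        // Step 1: a higher term forces us to adopt it; the old vote no longer applies.
        if (term > currTerm) {
            currTerm = term;
            votedId = null;
        }
        // Step 2: the candidate's log must be at least as up to date as ours
        // (compare term first, then index).
        boolean logIsOk = candLastLogTerm > lastLogTerm
            || (candLastLogTerm == lastLogTerm && candLastLogIndex >= lastLogIndex);
        // Step 3: vote only if we have not voted in this term yet.
        if (logIsOk && votedId == null) {
            votedId = candidateId;
        }
        // Step 4: granted only if our vote for this term is for the requester.
        return candidateId.equals(votedId);
    }
}
```

Because the grant is derived from votedId, a candidate that retries its request in the same term receives the same answer, while a second candidate in that term is refused.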

Promotion to leader

Once voting completes, a node that has received votes from more than half of the nodes is promoted to leader by calling the becomeLeader method.

private void becomeLeader() {
    Requires.requireTrue(this.state == State.STATE_CANDIDATE, "Illegal state: " + this.state);
    LOG.info("Node {} become leader of group, term={}, conf={}, oldConf={}.", getNodeId(), this.currTerm,
        this.conf.getConf(), this.conf.getOldConf());
    // cancel candidate vote timer
    // Once promoted to leader, stop the vote timer
    stopVoteTimer();
    // Set the current state to leader
    this.state = State.STATE_LEADER;
    this.leaderId = this.serverId.copy();
    // Set the new term on the replicator group
    this.replicatorGroup.resetTerm(this.currTerm);
    // Iterate over all peers in the cluster
    for (final PeerId peer : this.conf.listPeers()) {
        if (peer.equals(this.serverId)) {
            continue;
        }
        LOG.debug("Node {} add replicator, term={}, peer={}.", getNodeId(), this.currTerm, peer);
        // As leader, this node must replicate its log to the other nodes
        if (!this.replicatorGroup.addReplicator(peer)) {
            LOG.error("Fail to add replicator, peer={}.", peer);
        }
    }
    // init commit manager
    this.ballotBox.resetPendingIndex(this.logManager.getLastLogIndex() + 1);
    // Register _conf_ctx to reject configuration changing before the first log
    // is committed.
    if (this.confCtx.isBusy()) {
        throw new IllegalStateException();
    }
    this.confCtx.flush(this.conf.getConf(), this.conf.getOldConf());
    // As leader, periodically check whether this node is still fit to lead
    this.stepDownTimer.start();
}

This method first stops the vote timer, sets the current state to leader, and sets the new term on the replicator group. It then iterates over all peers and adds a replicator for each of them, and finally starts stepDownTimer, which periodically checks whether more than half of the nodes are still responding to the current leader.
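The periodic check driven by stepDownTimer is not listed in this article, so the following is only a rough sketch of the idea, not the SOFAJRaft logic: the leader counts itself plus every peer it has heard from recently, and keeps leading only while that count is a majority. The class name, the `lastResponseMs` map, and the `leaseMs` window are all hypothetical.

```java
import java.util.Map;

// Hypothetical helper: decides whether a leader still has a live majority.
class LeaderLivenessSketch {
    /**
     * @param lastResponseMs last time (ms) each follower responded to the leader
     * @param nowMs          current time in ms
     * @param leaseMs        how long a response keeps a follower counted as alive
     */
    static boolean hasQuorum(Map<String, Long> lastResponseMs, long nowMs, long leaseMs) {
        int alive = 1; // the leader always counts itself
        for (long ts : lastResponseMs.values()) {
            if (nowMs - ts <= leaseMs) {
                alive++;
            }
        }
        int clusterSize = lastResponseMs.size() + 1;
        // Strict majority of the whole cluster, leader included.
        return alive >= clusterSize / 2 + 1;
    }
}
```

A leader for which `hasQuorum` returns false would be expected to step down, so that the remaining nodes can elect a leader that can actually reach a majority.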

Well, that's it for this article. I hope to see you again next time ~

Origin www.cnblogs.com/qq575654643/p/11743716.html