8. SOFAJRaft source code analysis - How does JRaft replicate logs?

Foreword

A few days ago I had dinner with a senior engineer from Tencent and we talked about my understanding of SOFAJRaft. I assumed I understood it well, until he asked me how logs are replicated between the nodes of a SOFAJRaft cluster.
I was speechless: I could not explain how it is implemented. So this article analyzes how SOFAJRaft performs log replication.

Leader sends a probe to obtain the Follower's LastLogIndex

After the Leader node establishes a Replicator connection to a Follower, it sends a Probe-type request. The goal is to learn the position of the Follower's log, so that subsequent log entries can be sent to the Follower.

The call flow is roughly as follows:

NodeImpl#becomeLeader->replicatorGroup#addReplicator->Replicator#start->Replicator#sendEmptyEntries

In the end, the Replicator's sendEmptyEntries method is called to send the probe and obtain the Follower's LastLogIndex.

Replicator#sendEmptyEntries

private void sendEmptyEntries(final boolean isHeartbeat,
                              final RpcResponseClosure<AppendEntriesResponse> heartBeatClosure) {
    final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
    // Set the cluster configuration on rb, e.g. Term, GroupId, ServerId
    if (!fillCommonFields(rb, this.nextIndex - 1, isHeartbeat)) {
        // id is unlock in installSnapshot
        installSnapshot();
        if (isHeartbeat && heartBeatClosure != null) {
            Utils.runClosureInThread(heartBeatClosure, new Status(RaftError.EAGAIN,
                "Fail to send heartbeat to peer %s", this.options.getPeerId()));
        }
        return;
    }
    try {
        final long monotonicSendTimeMs = Utils.monotonicMs();
        final AppendEntriesRequest request = rb.build();

        if (isHeartbeat) {
            ....// heartbeat code omitted
        } else {
            // I did not find where the statInfo class is actually used.
            // Sending a probe request.
            // The leader sends a probe to obtain the Follower's LastLogIndex
            this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
            // Set lastLogIndex to firstLogIndex - 1
            this.statInfo.firstLogIndex = this.nextIndex;
            this.statInfo.lastLogIndex = this.nextIndex - 1;
            this.appendEntriesCounter++;
            // Mark the current Replicator as sending a probe
            this.state = State.Probe;
            final int stateVersion = this.version;
            // Return reqSeq, then increment reqSeq by one
            final int seq = getAndIncrementReqSeq();
            final Future<Message> rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(),
                request, -1, new RpcResponseClosureAdapter<AppendEntriesResponse>() {

                    @Override
                    public void run(final Status status) {
                        onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request,
                            getResponse(), seq, stateVersion, monotonicSendTimeMs);
                    }

                });
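            // Inflight is an abstraction over a batch of logEntry that has been sent; it records which logEntry were packaged into a log replication request and sent out
            // Here the logEntry is wrapped in an Inflight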
            addInflight(RequestType.AppendEntries, this.nextIndex, 0, 0, seq, rpcFuture);
        }
        LOG.debug("Node {} send HeartbeatRequest to {} term {} lastCommittedIndex {}", this.options.getNode()
            .getNodeId(), this.options.getPeerId(), this.options.getTerm(), request.getCommittedIndex());
    } finally {
        this.id.unlock();
    }
}

Here sendEmptyEntries is called with isHeartbeat set to false and heartBeatClosure set to null, because the purpose of this call is to send a probe and obtain the Follower's log position.
The method first calls fillCommonFields, which sets term, groupId, serverId, peerId, prevLogIndex and related fields on rb, as follows:

private boolean fillCommonFields(final AppendEntriesRequest.Builder rb, long prevLogIndex, final boolean isHeartbeat) {
    final long prevLogTerm = this.options.getLogManager().getTerm(prevLogIndex);
    ....
    rb.setTerm(this.options.getTerm());
    rb.setGroupId(this.options.getGroupId());
    rb.setServerId(this.options.getServerId().toString());
    rb.setPeerId(this.options.getPeerId().toString());
    rb.setPrevLogIndex(prevLogIndex);
    rb.setPrevLogTerm(prevLogTerm);
    rb.setCommittedIndex(this.options.getBallotBox().getLastCommittedIndex());
    return true;
}

Note that prevLogIndex is nextIndex - 1, which represents the current index, i.e. the entry immediately before the next one to be sent.
Continuing down, a few properties of the statInfo instance are set, but I did not find anywhere that the statInfo object is actually used.
Then an AppendEntriesRequest is sent to the Follower, and onRpcReturned is responsible for handling the response.
After the request is sent, addInflight is called to create an Inflight instance and add it to the inflights collection, as follows:

private void addInflight(final RequestType reqType, final long startIndex, final int count, final int size,
                         final int seq, final Future<Message> rpcInfly) {
    this.rpcInFly = new Inflight(reqType, startIndex, count, size, seq, rpcInfly);
    this.inflights.add(this.rpcInFly);
    this.nodeMetrics.recordSize("replicate-inflights-count", this.inflights.size());
}

Inflight is an abstraction over a batch of logEntry that has been sent out. It records which logEntry have already been packaged into a log replication request and sent. Here the logEntry is wrapped in an Inflight.
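To make the role of this queue more concrete, here is a minimal sketch in plain Java. It is not JRaft's code: the class InflightQueueSketch and its methods are hypothetical and only mirror the startIndex, count and seq arguments passed to addInflight above. A probe is recorded with count 0. The sketch also simplifies the success path by returning startIndex + count as the next index to send.

```java
import java.util.ArrayDeque;

/** Hypothetical illustration of an in-flight request queue; not the JRaft implementation. */
final class InflightQueueSketch {
    /** One batch of entries that has been sent but not yet acknowledged. */
    static final class Inflight {
        final long startIndex; // index of the first entry in the batch
        final int count;       // number of entries, 0 for a probe
        final int seq;         // request sequence number

        Inflight(long startIndex, int count, int seq) {
            this.startIndex = startIndex;
            this.count = count;
            this.seq = seq;
        }
    }

    private final ArrayDeque<Inflight> inflights = new ArrayDeque<>();
    private int reqSeq = 0;

    /** Records a sent request and returns the sequence number assigned to it. */
    int add(long startIndex, int count) {
        final int seq = this.reqSeq++;
        this.inflights.add(new Inflight(startIndex, count, seq));
        return seq;
    }

    /**
     * Matches a response against the oldest in-flight request.
     * Returns the next index to send on success, or -1 after clearing the queue on a mismatch.
     */
    long onResponse(int seq) {
        final Inflight head = this.inflights.poll();
        if (head == null || head.seq != seq) {
            this.inflights.clear(); // order is broken: drop everything and probe again
            return -1;
        }
        return head.startIndex + head.count;
    }

    int size() {
        return this.inflights.size();
    }
}
```

Because every request is appended in send order, the head of the queue is always the request whose response should be handled next, which is what lets the Leader detect a broken order cheaply.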

Leader sends logs to the Follower in batches

Replicator#sendEntries

private boolean sendEntries(final long nextSendingIndex) {
    final AppendEntriesRequest.Builder rb = AppendEntriesRequest.newBuilder();
    // Fill the current Replicator's configuration into rb
    if (!fillCommonFields(rb, nextSendingIndex - 1, false)) {
        // unlock id in installSnapshot
        installSnapshot();
        return false;
    }

    ByteBufferCollector dataBuf = null;
    // Maximum number of entries per request, 1024 by default
    final int maxEntriesSize = this.raftOptions.getMaxEntriesSize();

    // An object-pool-like technique is used here to avoid creating objects repeatedly
    final RecyclableByteBufferList byteBufList = RecyclableByteBufferList.newInstance();
    try {
        // Loop over the logEntry and pack each one into byteBufList and emb
        for (int i = 0; i < maxEntriesSize; i++) {
            final RaftOutter.EntryMeta.Builder emb = RaftOutter.EntryMeta.newBuilder();
            // nextSendingIndex is the next index to send; i is the offset
            if (!prepareEntry(nextSendingIndex, i, emb, byteBufList)) {
                break;
            }
            rb.addEntries(emb.build());
        }
        // If EntriesCount is 0, LogManager has no new data for now
        if (rb.getEntriesCount() == 0) {
            if (nextSendingIndex < this.options.getLogManager().getFirstLogIndex()) {
                installSnapshot();
                return false;
            }
            // _id is unlock in _wait_more
            waitMoreEntries(nextSendingIndex);
            return false;
        }
        // Put the data in byteBufList into rb
        if (byteBufList.getCapacity() > 0) {
            dataBuf = ByteBufferCollector.allocateByRecyclers(byteBufList.getCapacity());
            for (final ByteBuffer b : byteBufList) {
                dataBuf.put(b);
            }
            final ByteBuffer buf = dataBuf.getBuffer();
            buf.flip();
            rb.setData(ZeroByteStringHelper.wrap(buf));
        }
    } finally {
        // Recycle byteBufList
        RecycleUtil.recycle(byteBufList);
    }

    final AppendEntriesRequest request = rb.build();
    if (LOG.isDebugEnabled()) {
        LOG.debug(
            "Node {} send AppendEntriesRequest to {} term {} lastCommittedIndex {} prevLogIndex {} prevLogTerm {} logIndex {} count {}",
            this.options.getNode().getNodeId(), this.options.getPeerId(), this.options.getTerm(),
            request.getCommittedIndex(), request.getPrevLogIndex(), request.getPrevLogTerm(), nextSendingIndex,
            request.getEntriesCount());
    }
    // I did not find where statInfo is used
    this.statInfo.runningState = RunningState.APPENDING_ENTRIES;
    this.statInfo.firstLogIndex = rb.getPrevLogIndex() + 1;
    this.statInfo.lastLogIndex = rb.getPrevLogIndex() + rb.getEntriesCount();

    final Recyclable recyclable = dataBuf;
    final int v = this.version;
    final long monotonicSendTimeMs = Utils.monotonicMs();
    final int seq = getAndIncrementReqSeq();
    final Future<Message> rpcFuture = this.rpcService.appendEntries(this.options.getPeerId().getEndpoint(),
        request, -1, new RpcResponseClosureAdapter<AppendEntriesResponse>() {

            @Override
            public void run(final Status status) {
                // Recycle resources
                RecycleUtil.recycle(recyclable);
                onRpcReturned(Replicator.this.id, RequestType.AppendEntries, status, request, getResponse(), seq,
                    v, monotonicSendTimeMs);
            }

        });
    // Add the Inflight
    addInflight(RequestType.AppendEntries, nextSendingIndex, request.getEntriesCount(), request.getData().size(),
        seq, rpcFuture);
    return true;

}
  1. First call fillCommonFields to fill the current Replicator's configuration information into rb;
  2. Call prepareEntry, which computes the log index from nextSendingIndex and the offset i, looks up the corresponding LogEntry in LogManager, sets the LogEntry's properties on emb, and adds the LogEntry's data to the RecyclableByteBufferList;
  3. If there is no new LogEntry, EntriesCount is 0 and the method returns, after either installing a snapshot or waiting for more entries;
  4. Traverse byteBufList and add its data to rb. The entries in rb then carry the metadata (term, type, data length and so on), while the data field of rb carries the real payload;
  5. Build the AppendEntriesRequest and send it;
  6. Add an Inflight to the queue. The Leader maintains a queue, and every time it sends a batch of logEntry it adds an Inflight representing that batch to the queue. When a batch fails to replicate, the Leader can rely on the Inflights in the queue to replicate that batch and all subsequent logEntry to the follower again. This ensures both that log replication completes and that the order of the replicated log does not change.
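Step 4 is easier to follow with a small round trip. The sketch below is my own illustration, not JRaft code: EntryBatchSketch, Meta and their methods are hypothetical names, and plain ByteBuffer stands in for RecyclableByteBufferList and the protobuf builders. The sender keeps one metadata record per entry and concatenates all payloads into a single buffer; the receiver cuts that buffer back into payloads using each dataLen, in the same way handleAppendEntriesRequest reads allData later in this article.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of splitting entries into per-entry metadata plus one shared data buffer. */
final class EntryBatchSketch {
    /** Metadata kept per entry; the payload itself lives in the shared buffer. */
    static final class Meta {
        final long term;
        final int dataLen;

        Meta(long term, int dataLen) {
            this.term = term;
            this.dataLen = dataLen;
        }
    }

    final List<Meta> metas = new ArrayList<>();
    private final List<ByteBuffer> bodies = new ArrayList<>();
    private int capacity = 0;

    /** Sender side: record the metadata and remember a slice of the payload. */
    void add(long term, ByteBuffer data) {
        this.metas.add(new Meta(term, data.remaining()));
        this.bodies.add(data.slice());
        this.capacity += data.remaining();
    }

    /** Sender side: concatenate all payloads into a single buffer that is ready for reading. */
    ByteBuffer buildData() {
        final ByteBuffer all = ByteBuffer.allocate(this.capacity);
        for (final ByteBuffer b : this.bodies) {
            all.put(b.duplicate()); // duplicate() leaves the stored slice untouched
        }
        all.flip();
        return all;
    }

    /** Receiver side: cut the shared buffer back into payloads using each dataLen. */
    static List<String> decode(List<Meta> metas, ByteBuffer allData) {
        final List<String> out = new ArrayList<>(metas.size());
        final ByteBuffer in = allData.duplicate();
        for (final Meta m : metas) {
            final byte[] bs = new byte[m.dataLen];
            in.get(bs, 0, bs.length);
            out.add(new String(bs, StandardCharsets.UTF_8));
        }
        return out;
    }
}
```

Keeping the payloads in one contiguous buffer means the receiver needs nothing except the ordered list of lengths to recover every entry, including entries whose data is empty.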

RecyclableByteBufferList is instantiated through object pooling. For more on the object pool, see: 7. SOFAJRaft source code analysis - how to implement a lightweight object pool?

Next we look at the specific methods called inside sendEntries.

prepareEntry fills in the emb properties

Replicator#prepareEntry

boolean prepareEntry(final long nextSendingIndex, final int offset, final RaftOutter.EntryMeta.Builder emb,
                     final RecyclableByteBufferList dateBuffer) {
    if (dateBuffer.getCapacity() >= this.raftOptions.getMaxBodySize()) {
        return false;
    }
    // Compute the index to send
    final long logIndex = nextSendingIndex + offset;
    // If this index can no longer be found in LogManager, return directly
    final LogEntry entry = this.options.getLogManager().getEntry(logIndex);
    if (entry == null) {
        return false;
    }
    // Copy the LogEntry properties into emb
    emb.setTerm(entry.getId().getTerm());
    if (entry.hasChecksum()) {
        emb.setChecksum(entry.getChecksum()); //since 1.2.6
    }
    emb.setType(entry.getType());
    if (entry.getPeers() != null) {
        Requires.requireTrue(!entry.getPeers().isEmpty(), "Empty peers at logIndex=%d", logIndex);
        for (final PeerId peer : entry.getPeers()) {
            emb.addPeers(peer.toString());
        }
        if (entry.getOldPeers() != null) {
            for (final PeerId peer : entry.getOldPeers()) {
                emb.addOldPeers(peer.toString());
            }
        }
    } else {
        Requires.requireTrue(entry.getType() != EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION,
            "Empty peers but is ENTRY_TYPE_CONFIGURATION type at logIndex=%d", logIndex);
    }
    final int remaining = entry.getData() != null ? entry.getData().remaining() : 0;
    emb.setDataLen(remaining);
    // Put the LogEntry data into dateBuffer
    if (entry.getData() != null) {
        // should slice entry data
        dateBuffer.add(entry.getData().slice());
    }
    return true;
}
  1. Check whether the capacity of the incoming dateBuffer has reached the maximum set by the system (512 * 1024); if so, return false
  2. Compute logIndex from the given starting index and the offset, then fetch the LogEntry at that index from LogManager. If it cannot be found, return false; the if check in the caller then breaks out of the outer loop
  3. Set the LogEntry's properties on the emb object, and finally add the LogEntry's data to dateBuffer. This is how the data is separated from the attributes
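One detail worth spelling out is how the two limits interact: the loop in sendEntries stops after maxEntriesSize entries, while prepareEntry stops once the accumulated body size has already reached getMaxBodySize. The sketch below is a hypothetical stand-alone model of that loop (BatchLimitSketch and countEntriesForBatch are my own names, and an int array of payload sizes stands in for LogManager). It assumes the accumulated capacity is the sum of the payload sizes added so far.

```java
/** Hypothetical model of the two batching limits; names and numbers are illustrative. */
final class BatchLimitSketch {
    /**
     * Returns how many entries go into one request.
     * entrySizes[i] is the payload size in bytes of the i-th available entry.
     */
    static int countEntriesForBatch(int[] entrySizes, int maxEntries, int maxBodySize) {
        int bodySize = 0;
        int count = 0;
        for (int i = 0; i < maxEntries; i++) {
            // Mirrors the first check in prepareEntry: stop once the body already reached the limit.
            if (bodySize >= maxBodySize) {
                break;
            }
            // Mirrors the "entry == null" case: there is nothing more in the log.
            if (i >= entrySizes.length) {
                break;
            }
            bodySize += entrySizes[i];
            count++;
        }
        return count;
    }
}
```

Because the size check runs before an entry is added, a batch in this model can end up larger than maxBodySize by up to one entry: with three 600-byte entries and a 1000-byte limit, two entries (1200 bytes) are sent.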

Follower handles the log replication request sent by the Leader

After the leader sends the AppendEntriesRequest, the request is handled on the Follower by AppendEntriesRequestProcessor.

The specific processing method is processRequest0:

public Message processRequest0(final RaftServerService service, final AppendEntriesRequest request,
                               final RpcRequestClosure done) {

    final Node node = (Node) service;

    // Pipeline is used by default
    if (node.getRaftOptions().isReplicatorPipeline()) {
        final String groupId = request.getGroupId();
        final String peerId = request.getPeerId();
        // Get the request sequence, keyed by groupId + peerId
        final int reqSequence = getAndIncrementSequence(groupId, peerId, done.getBizContext().getConnection());
        // The Follower handles the log request sent by the leader
        final Message response = service.handleAppendEntriesRequest(request, new SequenceRpcRequestClosure(done,
            reqSequence, groupId, peerId));
        // Normal handling returns null; only abnormal cases return a response
        if (response != null) {
            sendSequenceResponse(groupId, peerId, reqSequence, done.getAsyncContext(), done.getBizContext(),
                response);
        }
        return null;
    } else {
        return service.handleAppendEntriesRequest(request, done);
    }
}

The call to service.handleAppendEntriesRequest ends up in NodeImpl's handleAppendEntriesRequest method. That method returns a response directly only in abnormal cases or when the leader sent no log entries (a heartbeat or a probe). In the normal case it returns null and the response is returned later through the callback.

Handling the log replication request

NodeImpl#handleAppendEntriesRequest

public Message handleAppendEntriesRequest(final AppendEntriesRequest request, final RpcRequestClosure done) {
    boolean doUnlock = true;
    final long startMs = Utils.monotonicMs();
    this.writeLock.lock();
    // Number of log entries in the request
    final int entriesCount = request.getEntriesCount();
    try {
        // Check whether the current node is active
        if (!this.state.isActive()) {
            LOG.warn("Node {} is not in active state, currTerm={}.", getNodeId(), this.currTerm);
            return RpcResponseFactory.newResponse(RaftError.EINVAL, "Node %s is not in active state, state %s.",
                getNodeId(), this.state.name());
        }
        // Check whether the incoming serverId can be parsed
        final PeerId serverId = new PeerId();
        if (!serverId.parse(request.getServerId())) {
            LOG.warn("Node {} received AppendEntriesRequest from {} serverId bad format.", getNodeId(),
                request.getServerId());
            return RpcResponseFactory.newResponse(RaftError.EINVAL, "Parse serverId failed: %s.",
                request.getServerId());
        }
        // Check the term
        // Check stale term
        if (request.getTerm() < this.currTerm) {
            LOG.warn("Node {} ignore stale AppendEntriesRequest from {}, term={}, currTerm={}.", getNodeId(),
                request.getServerId(), request.getTerm(), this.currTerm);
            return AppendEntriesResponse.newBuilder() //
                .setSuccess(false) //
                .setTerm(this.currTerm) //
                .build();
        }

        // Check term and state to step down
        // If the current node is not a Follower, it must step down
        checkStepDown(request.getTerm(), serverId);
        // The requesting node is not the current node's leader
        if (!serverId.equals(this.leaderId)) {
            LOG.error("Another peer {} declares that it is the leader at term {} which was occupied by leader {}.",
                serverId, this.currTerm, this.leaderId);
            // Increase the term by 1 and make both leaders step down to minimize the
            // loss of split brain
            stepDown(request.getTerm() + 1, false, new Status(RaftError.ELEADERCONFLICT,
                "More than one leader in the same term."));
            return AppendEntriesResponse.newBuilder() //
                .setSuccess(false) //
                .setTerm(request.getTerm() + 1) //
                .build();
        }

        updateLastLeaderTimestamp(Utils.monotonicMs());

        // Check whether a snapshot is being installed
        if (entriesCount > 0 && this.snapshotExecutor != null && this.snapshotExecutor.isInstallingSnapshot()) {
            LOG.warn("Node {} received AppendEntriesRequest while installing snapshot.", getNodeId());
            return RpcResponseFactory.newResponse(RaftError.EBUSY, "Node %s:%s is installing snapshot.",
                this.groupId, this.serverId);
        }
        // This is nextIndex - 1 of the requesting node
        final long prevLogIndex = request.getPrevLogIndex();
        final long prevLogTerm = request.getPrevLogTerm();
        final long localPrevLogTerm = this.logManager.getTerm(prevLogIndex);
        // The term at prevLogIndex on the requesting node does not match the term at that index on the current node
        if (localPrevLogTerm != prevLogTerm) {
            final long lastLogIndex = this.logManager.getLastLogIndex();

            LOG.warn(
                "Node {} reject term_unmatched AppendEntriesRequest from {}, term={}, prevLogIndex={}, prevLogTerm={}, localPrevLogTerm={}, lastLogIndex={}, entriesSize={}.",
                getNodeId(), request.getServerId(), request.getTerm(), prevLogIndex, prevLogTerm, localPrevLogTerm,
                lastLogIndex, entriesCount);

            return AppendEntriesResponse.newBuilder() //
                .setSuccess(false) //
                .setTerm(this.currTerm) //
                .setLastLogIndex(lastLogIndex) //
                .build();
        }
        // Respond to a heartbeat or to a probe sent by sendEmptyEntries
        if (entriesCount == 0) {
            // heartbeat
            final AppendEntriesResponse.Builder respBuilder = AppendEntriesResponse.newBuilder() //
                .setSuccess(true) //
                .setTerm(this.currTerm)
                // Return the latest log index of the current node
                .setLastLogIndex(this.logManager.getLastLogIndex());
            doUnlock = false;
            this.writeLock.unlock();
            // see the comments at FollowerStableClosure#run()
            this.ballotBox.setLastCommittedIndex(Math.min(request.getCommittedIndex(), prevLogIndex));
            return respBuilder.build();
        }

        // Parse request
        long index = prevLogIndex;
        final List<LogEntry> entries = new ArrayList<>(entriesCount);
        ByteBuffer allData = null;
        if (request.hasData()) {
            allData = request.getData().asReadOnlyByteBuffer();
        }
        // Get all entries
        final List<RaftOutter.EntryMeta> entriesList = request.getEntriesList();
        for (int i = 0; i < entriesCount; i++) {
            final RaftOutter.EntryMeta entry = entriesList.get(i);
            index++;
            if (entry.getType() != EnumOutter.EntryType.ENTRY_TYPE_UNKNOWN) {
                // Set the logEntry properties
                final LogEntry logEntry = new LogEntry();
                logEntry.setId(new LogId(index, entry.getTerm()));
                logEntry.setType(entry.getType());
                if (entry.hasChecksum()) {
                    logEntry.setChecksum(entry.getChecksum()); // since 1.2.6
                }
                // Fill the data into logEntry
                final long dataLen = entry.getDataLen();
                if (dataLen > 0) {
                    final byte[] bs = new byte[(int) dataLen];
                    assert allData != null;
                    allData.get(bs, 0, bs.length);
                    logEntry.setData(ByteBuffer.wrap(bs));
                }

                if (entry.getPeersCount() > 0) {
                    // Only configuration entries carry peers
                    if (entry.getType() != EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION) {
                        throw new IllegalStateException(
                                "Invalid log entry that contains peers but is not ENTRY_TYPE_CONFIGURATION type: "
                                        + entry.getType());
                    }

                    final List<PeerId> peers = new ArrayList<>(entry.getPeersCount());
                    for (final String peerStr : entry.getPeersList()) {
                        final PeerId peer = new PeerId();
                        peer.parse(peerStr);
                        peers.add(peer);
                    }
                    logEntry.setPeers(peers);

                    if (entry.getOldPeersCount() > 0) {
                        final List<PeerId> oldPeers = new ArrayList<>(entry.getOldPeersCount());
                        for (final String peerStr : entry.getOldPeersList()) {
                            final PeerId peer = new PeerId();
                            peer.parse(peerStr);
                            oldPeers.add(peer);
                        }
                        logEntry.setOldPeers(oldPeers);
                    }
                } else if (entry.getType() == EnumOutter.EntryType.ENTRY_TYPE_CONFIGURATION) {
                    throw new IllegalStateException(
                            "Invalid log entry that contains zero peers but is ENTRY_TYPE_CONFIGURATION type");
                }

                // Validate checksum
                if (this.raftOptions.isEnableLogEntryChecksum() && logEntry.isCorrupted()) {
                    long realChecksum = logEntry.checksum();
                    LOG.error(
                            "Corrupted log entry received from leader, index={}, term={}, expectedChecksum={}, " +
                             "realChecksum={}",
                            logEntry.getId().getIndex(), logEntry.getId().getTerm(), logEntry.getChecksum(),
                            realChecksum);
                    return RpcResponseFactory.newResponse(RaftError.EINVAL,
                            "The log entry is corrupted, index=%d, term=%d, expectedChecksum=%d, realChecksum=%d",
                            logEntry.getId().getIndex(), logEntry.getId().getTerm(), logEntry.getChecksum(),
                            realChecksum);
                }

                entries.add(logEntry);
            }
        }
        // Store the log and return the response via callback
        final FollowerStableClosure closure = new FollowerStableClosure(request, AppendEntriesResponse.newBuilder()
            .setTerm(this.currTerm), this, done, this.currTerm);
        this.logManager.appendEntries(entries, closure);
        // update configuration after _log_manager updated its memory status
        this.conf = this.logManager.checkAndSetConfiguration(this.conf);
        return null;
    } finally {
        if (doUnlock) {
            this.writeLock.unlock();
        }
        this.metrics.recordLatency("handle-append-entries", Utils.monotonicMs() - startMs);
        this.metrics.recordSize("handle-append-entries-count", entriesCount);
    }
}

The handleAppendEntriesRequest method is very long, but most of it is validation; there is not much actual processing logic.

  1. Check whether the current node is still active; if not, return an error response directly
  2. Verify that the serverId in the request has a valid format; if not, return an error response
  3. Check whether the term in the request is smaller than the current term; if so, return an AppendEntriesResponse with success set to false
  4. Call checkStepDown to check the current node's term and state, for example whether it is a leader
  5. If the serverId in the request differs from the current node's leaderId, the request did not come from this node's leader; in that case step down and return an AppendEntriesResponse with success set to false
  6. Check whether a snapshot is being installed
  7. Get the term of the local LogEntry at the request's prevLogIndex and compare it with the request's prevLogTerm; if they differ, return an AppendEntriesResponse with success set to false
  8. If entriesCount is zero, the leader sent either a heartbeat or a probe from sendEmptyEntries; return an AppendEntriesResponse that carries the current term and the latest log index
  9. If the request data is not empty, iterate over all the entries
  10. Create a LogEntry instance for each entry, set its properties and data, and add it to the entries list
  11. Call LogManager to write the batch of log entries to RocksDB
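Steps 7 and 8 are the core of the probe: the Follower accepts the request only if its own entry at prevLogIndex has the same term as the Leader's. The sketch below models just that comparison with a HashMap instead of a real LogManager. FollowerCheckSketch and its methods are hypothetical names, and the sketch assumes that a missing entry (including index 0) has term 0. The last helper mirrors the Math.min(request.getCommittedIndex(), prevLogIndex) call in the heartbeat branch above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the Follower-side consistency check; not the NodeImpl code. */
final class FollowerCheckSketch {
    private final Map<Long, Long> termByIndex = new HashMap<>();
    private long lastLogIndex = 0;

    /** Appends one entry with the given term at the next index (indexes start at 1). */
    void append(long term) {
        this.lastLogIndex++;
        this.termByIndex.put(this.lastLogIndex, term);
    }

    /** Assumption of this sketch: a missing entry, including index 0, has term 0. */
    long getTerm(long index) {
        return this.termByIndex.getOrDefault(index, 0L);
    }

    /** True when the local entry at prevLogIndex has the term the Leader expects. */
    boolean prevEntryMatches(long prevLogIndex, long prevLogTerm) {
        return getTerm(prevLogIndex) == prevLogTerm;
    }

    /** On a heartbeat or probe, never commit past the entry that was just verified. */
    static long committedIndexOnHeartbeat(long leaderCommittedIndex, long prevLogIndex) {
        return Math.min(leaderCommittedIndex, prevLogIndex);
    }

    long getLastLogIndex() {
        return this.lastLogIndex;
    }
}
```

When the check fails, the real method also returns the Follower's lastLogIndex, which gives the Leader the information it needs to pick a new nextIndex for the next attempt.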

Sending the response to the leader

The final response is sent to the leader through the sendSequenceResponse method of AppendEntriesRequestProcessor:

void sendSequenceResponse(final String groupId, final String peerId, final int seq,
                          final AsyncContext asyncContext, final BizContext bizContext, final Message msg) {
    final Connection connection = bizContext.getConnection();
    // Get the context, keyed by groupId and peerId
    final PeerRequestContext ctx = getPeerRequestContext(groupId, peerId, connection);
    final PriorityQueue<SequenceMessage> respQueue = ctx.responseQueue;
    assert (respQueue != null);

    synchronized (Utils.withLockObject(respQueue)) {
        // Put the response into the priority queue
        respQueue.add(new SequenceMessage(asyncContext, msg, seq));
        // Check whether the queue holds more than 256 elements
        if (!ctx.hasTooManyPendingResponses()) {
            while (!respQueue.isEmpty()) {
                final SequenceMessage queuedPipelinedResponse = respQueue.peek();
                // If the sequence does not match, do not send the response
                if (queuedPipelinedResponse.sequence != getNextRequiredSequence(groupId, peerId, connection)) {
                    // sequence mismatch, waiting for next response.
                    break;
                }
                respQueue.remove();
                try {
                    // Send the response
                    queuedPipelinedResponse.sendResponse();
                } finally {
                    // Increment the sequence by one
                    getAndIncrementNextRequiredSequence(groupId, peerId, connection);
                }
            }
        } else {
            LOG.warn("Closed connection to peer {}/{}, because of too many pending responses, queued={}, max={}",
                ctx.groupId, peerId, respQueue.size(), ctx.maxPendingResponses);
            connection.close();
            // Close the connection if there are too many pending responses in queue.
            removePeerRequestContext(groupId, peerId);
        }
    }
}

This method pushes the response into a PriorityQueue, which orders responses by sequence number. It then takes the element with the smallest sequence number and compares it with nextRequiredSequence. If they are not equal, the responses are out of order, and nothing is sent until the missing response arrives.
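The ordering logic can be tried out in isolation. The following sketch is a simplified model, not the processor's code: OrderedResponseSketch is a hypothetical class, responses are plain strings, and the "too many pending responses" branch that closes the connection is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of releasing pipelined responses strictly in sequence order. */
final class OrderedResponseSketch {
    static final class Response {
        final int seq;
        final String body;

        Response(int seq, String body) {
            this.seq = seq;
            this.body = body;
        }
    }

    // The smallest sequence number is always at the head of the queue.
    private final PriorityQueue<Response> queue =
        new PriorityQueue<>(Comparator.comparingInt((Response r) -> r.seq));
    private int nextRequiredSeq = 0;

    /** Queues a response and returns the bodies that may be sent now, in order. */
    List<String> offer(int seq, String body) {
        this.queue.add(new Response(seq, body));
        final List<String> sendable = new ArrayList<>();
        while (!this.queue.isEmpty() && this.queue.peek().seq == this.nextRequiredSeq) {
            sendable.add(this.queue.remove().body);
            this.nextRequiredSeq++;
        }
        return sendable;
    }
}
```

If responses 1 and 2 are offered before response 0, the first two calls return an empty list and the third call returns all three bodies in order, so the Leader always sees responses in the order it sent the requests.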

Leader handles the log replication Response

After the Leader receives the Response sent by the Follower, the Replicator's onRpcReturned method is called:

static void onRpcReturned(final ThreadId id, final RequestType reqType, final Status status, final Message request,
                          final Message response, final int seq, final int stateVersion, final long rpcSendTime) {
    if (id == null) {
        return;
    }
    final long startTimeMs = Utils.nowMs();
    Replicator r;
    if ((r = (Replicator) id.lock()) == null) {
        return;
    }
    // Check the version: every resetInflights increments version, so stale responses are detected here
    if (stateVersion != r.version) {
        LOG.debug(
            "Replicator {} ignored old version response {}, current version is {}, request is {}\n, and response is {}\n, status is {}.",
            r, stateVersion, r.version, request, response, status);
        id.unlock();
        return;
    }
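    // A priority queue ordered by seq; the smallest comes first
    final PriorityQueue<RpcResponse> holdingQueue = r.pendingResponses;
    // A priority queue is used because responses are asynchronous: a smaller seq may return later than a larger one
    holdingQueue.add(new RpcResponse(reqType, seq, status, request, response, rpcSendTime));
    // By default holdingQueue may not hold more than 256 elements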
    if (holdingQueue.size() > r.raftOptions.getMaxReplicatorInflightMsgs()) {
        LOG.warn("Too many pending responses {} for replicator {}, maxReplicatorInflightMsgs={}",
            holdingQueue.size(), r.options.getPeerId(), r.raftOptions.getMaxReplicatorInflightMsgs());
        // Send the probe again
        // Clear the data
        r.resetInflights();
        r.state = State.Probe;
        r.sendEmptyEntries(false);
        return;
    }

    boolean continueSendEntries = false;

    final boolean isLogDebugEnabled = LOG.isDebugEnabled();
    StringBuilder sb = null;
    if (isLogDebugEnabled) {
        sb = new StringBuilder("Replicator ").append(r).append(" is processing RPC responses,");
    }
    try {
        int processed = 0;
        while (!holdingQueue.isEmpty()) {
            // Peek at the element with the smallest seq in holdingQueue
            final RpcResponse queuedPipelinedResponse = holdingQueue.peek();

            // If the Follower has not responded yet, the sequence will not match, so stop here
            //sequence mismatch, waiting for next response.
            if (queuedPipelinedResponse.seq != r.requiredNextSeq) {
                // If some responses were already processed, break out of the loop here
                if (processed > 0) {
                    if (isLogDebugEnabled) {
                        sb.append("has processed ").append(processed).append(" responses,");
                    }
                    break;
                } else {
                    //Do not processed any responses, UNLOCK id and return.
                    continueSendEntries = false;
                    id.unlock();
                    return;
                }
            }
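            // Reaching here means the seq matches, so remove the smallest-seq element from the priority queue
            holdingQueue.remove();
            processed++;
            // Get the first element of the inflights queue
            final Inflight inflight = r.pollInflight();
            // An inflight is put into the queue whenever a request is sent
            // If it is null, ignore the response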
            if (inflight == null) {
                // The previous in-flight requests were cleared.
                if (isLogDebugEnabled) {
                    sb.append("ignore response because request not found:").append(queuedPipelinedResponse)
                        .append(",\n");
                }
                continue;
            }
            // The seq does not match, so the order is broken; reset the state
            if (inflight.seq != queuedPipelinedResponse.seq) {
                // reset state
                LOG.warn(
                    "Replicator {} response sequence out of order, expect {}, but it is {}, reset state to try again.",
                    r, inflight.seq, queuedPipelinedResponse.seq);
                r.resetInflights();
                r.state = State.Probe;
                continueSendEntries = false;
                // Block the replicator and wait for a while, depending on the error type
                r.block(Utils.nowMs(), RaftError.EREQUEST.getNumber());
                return;
            }
            try {
                switch (queuedPipelinedResponse.requestType) {
                    case AppendEntries:
                        // Handle the log replication response
                        continueSendEntries = onAppendEntriesReturned(id, inflight, queuedPipelinedResponse.status,
                            (AppendEntriesRequest) queuedPipelinedResponse.request,
                            (AppendEntriesResponse) queuedPipelinedResponse.response, rpcSendTime, startTimeMs, r);
                        break;
                    case Snapshot:
                        // Handle the snapshot response
                        continueSendEntries = onInstallSnapshotReturned(id, r, queuedPipelinedResponse.status,
                            (InstallSnapshotRequest) queuedPipelinedResponse.request,
                            (InstallSnapshotResponse) queuedPipelinedResponse.response);
                        break;
                }
            } finally {
                if (continueSendEntries) {
                    // Success, increase the response sequence.
                    r.getAndIncrementRequiredNextSeq();
                } else {
                    // The id is already unlocked in onAppendEntriesReturned/onInstallSnapshotReturned, we SHOULD break out.
                    break;
                }
            }
        }
    } finally {
        if (isLogDebugEnabled) {
            sb.append(", after processed, continue to send entries: ").append(continueSendEntries);
            LOG.debug(sb.toString());
        }
        if (continueSendEntries) {
            // unlock in sendEntries.
            r.sendEntries();
        }
    }
}
  1. Check the version number. Every call to resetInflights increments the version, so this check detects responses that do not belong to the current batch of requests
  2. Get the Replicator's pendingResponses queue, wrap the current response data in an RpcResponse instance, and add it to the queue
  3. Check whether the queue holds more than 256 elements; if it does, clear the data and resynchronize
  4. Check whether the response with the smallest seq in holdingQueue has the same sequence number as the current requiredNextSeq; if they differ, exit the loop immediately (this can happen on the very first iteration)
  5. Take the first element of the inflights queue; if its seq does not match the response's seq, the responses are out of order, so reset the state
  6. Call onAppendEntriesReturned to handle the log replication response
  7. If processing succeeded, call sendEntries to continue replicating logs to the Follower
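
The in-order handling in steps 3 to 5 can be illustrated with a small standalone sketch. This is not JRaft's code: the class PipelinedResponseSketch, its Response type, and the maxPending parameter are hypothetical names made up for illustration, and resetting is reduced to clearing the queue. It only shows the idea of buffering responses in a priority queue and consuming them strictly in requiredNextSeq order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical, simplified model of in-order response handling; not JRaft's actual classes.
class PipelinedResponseSketch {
    // A response tagged with the sequence number of the request it answers.
    static final class Response {
        final int seq;
        final boolean ok;

        Response(int seq, boolean ok) {
            this.seq = seq;
            this.ok = ok;
        }
    }

    // Responses may arrive in any order; the queue keeps the smallest seq at the head.
    private final PriorityQueue<Response> pending =
        new PriorityQueue<>(Comparator.comparingInt((Response r) -> r.seq));
    private final int maxPending;
    private int requiredNextSeq = 0;

    PipelinedResponseSketch(int maxPending) {
        this.maxPending = maxPending;
    }

    // Buffers one response and returns the seqs that could be processed in order.
    List<Integer> onResponse(Response resp) {
        List<Integer> processed = new ArrayList<>();
        pending.add(resp);
        if (pending.size() > maxPending) {
            // Too many buffered responses: drop them all (stand-in for "reset and probe again").
            pending.clear();
            return processed;
        }
        while (!pending.isEmpty()) {
            Response head = pending.peek();
            if (head.seq != requiredNextSeq) {
                break; // gap in the sequence: wait for the missing response
            }
            pending.poll();
            if (!head.ok) {
                pending.clear(); // failed response: stop and discard what is buffered
                break;
            }
            processed.add(head.seq);
            requiredNextSeq++; // success: advance the expected sequence number
        }
        return processed;
    }

    int requiredNextSeq() {
        return requiredNextSeq;
    }

    public static void main(String[] args) {
        PipelinedResponseSketch sketch = new PipelinedResponseSketch(256);
        System.out.println(sketch.onResponse(new Response(1, true))); // seq 0 still missing
        System.out.println(sketch.onResponse(new Response(0, true))); // 0 and 1 are now processed
        System.out.println(sketch.onResponse(new Response(2, true)));
    }
}
```

A response that arrives early is simply parked in the queue, which is why a single late response can release several buffered ones at once.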

Replicator#onAppendEntriesReturned

private static boolean onAppendEntriesReturned(final ThreadId id, final Inflight inflight, final Status status,
                                               final AppendEntriesRequest request,
                                               final AppendEntriesResponse response, final long rpcSendTime,
                                               final long startTimeMs, final Replicator r) {
    // Verify that the in-flight request matches this response
    if (inflight.startIndex != request.getPrevLogIndex() + 1) {
        LOG.warn(
            "Replicator {} received invalid AppendEntriesResponse, in-flight startIndex={}, request prevLogIndex={}, reset the replicator state and probe again.",
            r, inflight.startIndex, request.getPrevLogIndex());
        r.resetInflights();
        r.state = State.Probe;
        // unlock id in sendEmptyEntries
        r.sendEmptyEntries(false);
        return false;
    }
    // record metrics
    if (request.getEntriesCount() > 0) {
        r.nodeMetrics.recordLatency("replicate-entries", Utils.monotonicMs() - rpcSendTime);
        r.nodeMetrics.recordSize("replicate-entries-count", request.getEntriesCount());
        r.nodeMetrics.recordSize("replicate-entries-bytes", request.getData() != null ? request.getData().size()
            : 0);
    }

    final boolean isLogDebugEnabled = LOG.isDebugEnabled();
    StringBuilder sb = null;
    if (isLogDebugEnabled) {
        sb = new StringBuilder("Node "). //
            append(r.options.getGroupId()).append(":").append(r.options.getServerId()). //
            append(" received AppendEntriesResponse from "). //
            append(r.options.getPeerId()). //
            append(" prevLogIndex=").append(request.getPrevLogIndex()). //
            append(" prevLogTerm=").append(request.getPrevLogTerm()). //
            append(" count=").append(request.getEntriesCount());
    }
    // If no successful response arrived (the follower crashed, the RPC failed, etc.),
    // block for a while before calling it again
    if (!status.isOk()) {
        // If the follower crashes, any RPC to the follower fails immediately,
        // so we need to block the follower for a while instead of looping until
        // it comes back or be removed
        // dummy_id is unlock in block
        if (isLogDebugEnabled) {
            sb.append(" fail, sleep.");
            LOG.debug(sb.toString());
        }
        // If Replicator status listeners are registered, notify all of them
        notifyReplicatorStatusListener(r, ReplicatorEvent.ERROR, status);
        if (++r.consecutiveErrorTimes % 10 == 0) {
            LOG.warn("Fail to issue RPC to {}, consecutiveErrorTimes={}, error={}", r.options.getPeerId(),
                r.consecutiveErrorTimes, status);
        }
        r.resetInflights();
        r.state = State.Probe;
        // unlock in block
        r.block(startTimeMs, status.getCode());
        return false;
    }
    r.consecutiveErrorTimes = 0;
    // The follower rejected the request
    if (!response.getSuccess()) {
        // A higher term means the Leader has changed, possibly after a network partition; follow the new Leader
        if (response.getTerm() > r.options.getTerm()) {
            if (isLogDebugEnabled) {
                sb.append(" fail, greater term ").append(response.getTerm()).append(" expect term ")
                    .append(r.options.getTerm());
                LOG.debug(sb.toString());
            }
            // Get the object representing the current node (NodeImpl)
            final NodeImpl node = r.options.getNode();
            r.notifyOnCaughtUp(RaftError.EPERM.getNumber(), true);
            r.destroy();
            // Raise this node's own term
            node.increaseTermTo(response.getTerm(), new Status(RaftError.EHIGHERTERMRESPONSE,
                "Leader receives higher term heartbeat_response from peer:%s", r.options.getPeerId()));
            return false;
        }
        if (isLogDebugEnabled) {
            sb.append(" fail, find nextIndex remote lastLogIndex ").append(response.getLastLogIndex())
                .append(" local nextIndex ").append(r.nextIndex);
            LOG.debug(sb.toString());
        }
        if (rpcSendTime > r.lastRpcSendTimestamp) {
            r.lastRpcSendTimestamp = rpcSendTime;
        }
        // Fail, reset the state to try again from nextIndex.
        r.resetInflights();
        // If the Follower's last log index is below the next index to send, restart from the index the Follower reported
        // prev_log_index and prev_log_term doesn't match
        if (response.getLastLogIndex() + 1 < r.nextIndex) {
            LOG.debug("LastLogIndex at peer={} is {}", r.options.getPeerId(), response.getLastLogIndex());
            // The peer contains less logs than leader
            r.nextIndex = response.getLastLogIndex() + 1;
        } else {
            // The peer contains logs from old term which should be truncated,
            // decrease _last_log_at_peer by one to test the right index to keep
            if (r.nextIndex > 1) {
                LOG.debug("logIndex={} dismatch", r.nextIndex);
                r.nextIndex--;
            } else {
                LOG.error("Peer={} declares that log at index=0 doesn't match, which is not supposed to happen",
                    r.options.getPeerId());
            }
        }
        // After a rejection, probe the Follower's log position again so that replication can resynchronize
        // dummy_id is unlock in _send_heartbeat
        r.sendEmptyEntries(false);
        return false;
    }
    if (isLogDebugEnabled) {
        sb.append(", success");
        LOG.debug(sb.toString());
    }
    // success
    // The request succeeded; verify the term
    if (response.getTerm() != r.options.getTerm()) {
        r.resetInflights();
        r.state = State.Probe;
        LOG.error("Fail, response term {} dismatch, expect term {}", response.getTerm(), r.options.getTerm());
        id.unlock();
        return false;
    }
    if (rpcSendTime > r.lastRpcSendTimestamp) {
        r.lastRpcSendTimestamp = rpcSendTime;
    }
    // Number of log entries carried by this request
    final int entriesSize = request.getEntriesCount();
    if (entriesSize > 0) {
        // Record this peer's acknowledgement of the entries in the ballot box
        r.options.getBallotBox().commitAt(r.nextIndex, r.nextIndex + entriesSize - 1, r.options.getPeerId());
        if (LOG.isDebugEnabled()) {
            LOG.debug("Replicated logs in [{}, {}] to peer {}", r.nextIndex, r.nextIndex + entriesSize - 1,
                r.options.getPeerId());
        }
    } else {
        // The request is probe request, change the state into Replicate.
        r.state = State.Replicate;
    }
    r.nextIndex += entriesSize;
    r.hasSucceeded = true;
    r.notifyOnCaughtUp(RaftError.SUCCESS.getNumber(), false);
    // dummy_id is unlock in _send_entries
    if (r.timeoutNowIndex > 0 && r.timeoutNowIndex < r.nextIndex) {
        r.sendTimeoutNow(false, false);
    }
    return true;
}

The onAppendEntriesReturned method is also long, but it is worth reading through patiently:

  1. Verify that the data sequence is correct: the in-flight startIndex must equal the request's prevLogIndex + 1
  2. Record metrics and build the debug log message
  3. If the returned status is not OK, notify the listeners, reset the in-flight state, and block for a while before sending again
  4. If the response's success flag is false, check the term first. A higher term means the Leader has changed, possibly after a network partition, so this node raises its own term and follows the new Leader. If the term is fine, reset the in-flight state and set nextIndex again according to the last log index the Follower returned
  5. If every check passes, confirm the commit of the replicated entries and advance nextIndex past them
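
The nextIndex handling in steps 4 and 5 is easier to follow when isolated from the locking and logging. The sketch below is an assumption-laden illustration, not JRaft's API: the class NextIndexSketch and both of its methods are hypothetical helpers that mirror only the index arithmetic visible in the listing above.

```java
// Hypothetical helpers that mirror only the index arithmetic of the listing; not part of JRaft.
class NextIndexSketch {
    // Returns the nextIndex to probe with after the Follower rejected an AppendEntries request.
    static long adjustAfterReject(long nextIndex, long followerLastLogIndex) {
        if (followerLastLogIndex + 1 < nextIndex) {
            // The Follower holds fewer logs than the Leader assumed: jump to the end of its log.
            return followerLastLogIndex + 1;
        }
        if (nextIndex > 1) {
            // The Follower holds conflicting entries: step back one index and test again.
            return nextIndex - 1;
        }
        // A mismatch at index 0 is not supposed to happen; leave nextIndex unchanged.
        return nextIndex;
    }

    // Returns the new nextIndex after a request carrying entriesCount entries succeeded.
    static long advanceAfterSuccess(long nextIndex, int entriesCount) {
        // The acknowledged range is [nextIndex, nextIndex + entriesCount - 1].
        return nextIndex + entriesCount;
    }

    public static void main(String[] args) {
        System.out.println(adjustAfterReject(100, 40));  // 41
        System.out.println(adjustAfterReject(100, 120)); // 99
        System.out.println(adjustAfterReject(1, 5));     // 1
        System.out.println(advanceAfterSuccess(41, 10)); // 51
    }
}
```

A probe request carries zero entries, so on success it leaves nextIndex where it was; its only effect in the listing is to move the Replicator into the Replicate state.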

Origin www.cnblogs.com/luozhiyun/p/12005975.html