5. SOFAJRaft Source Code Analysis: How Does RheaKV Store Data?

Outline

The previous article covered how RheaKV is initialized. Since RheaKV is primarily used as a KV store, and its read and write paths are fairly complex, covering everything in one article would be too long, so this article focuses on how RheaKV stores data.

Let's start the explanation with a client example:

public static void main(final String[] args) throws Exception {
    final Client client = new Client();
    client.init();
    //get(client.getRheaKVStore());
    RheaKVStore rheaKVStore = client.getRheaKVStore();
    final byte[] key = writeUtf8("hello");
    final byte[] value = writeUtf8("world");
    rheaKVStore.bPut(key, value);
    client.shutdown();
}

Our example starts from this main method, which calls rheaKVStore.bPut(key, value) to write data into RheaKV.

public class Client {

    private final RheaKVStore rheaKVStore = new DefaultRheaKVStore();

    public void init() {
        final List<RegionRouteTableOptions> regionRouteTableOptionsList = MultiRegionRouteTableOptionsConfigured
            .newConfigured() //
            .withInitialServerList(-1L /* default id */, Configs.ALL_NODE_ADDRESSES) //
            .config();
        final PlacementDriverOptions pdOpts = PlacementDriverOptionsConfigured.newConfigured() //
            .withFake(true) //
            .withRegionRouteTableOptionsList(regionRouteTableOptionsList) //
            .config();
        final RheaKVStoreOptions opts = RheaKVStoreOptionsConfigured.newConfigured() //
            .withClusterName(Configs.CLUSTER_NAME) //
            .withPlacementDriverOptions(pdOpts) //
            .config();
        System.out.println(opts);
        rheaKVStore.init(opts);
    }

    public void shutdown() {
        this.rheaKVStore.shutdown();
    }

    public RheaKVStore getRheaKVStore() {
        return rheaKVStore;
    }
}

public class Configs { 
    public static String ALL_NODE_ADDRESSES = "127.0.0.1:8181,127.0.0.1:8182,127.0.0.1:8183";

    public static String CLUSTER_NAME       = "rhea_example";
}

The Client's init method initializes rheaKVStore much like the server example discussed in the previous section; the differences are that no StoreEngineOptions is set up and that a regionRouteTableOptionsList covering multiple regions is configured.

Storing data with bPut

bPut calls into DefaultRheaKVStore to store the data:
DefaultRheaKVStore#bPut

public Boolean bPut(final byte[] key, final byte[] value) {
    return FutureHelper.get(put(key, value), this.futureTimeoutMillis);
}

Inside bPut, the actual work of storing the data is delegated to the put method, which returns a CompletableFuture; FutureHelper.get is then called on it with the timeout that bPut passes in. The timeout is initialized in the init method and defaults to 5 seconds.
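The synchronous bPut is therefore just a blocking wrapper over the asynchronous put. Below is a minimal sketch of this sync-over-async pattern; it assumes nothing about the real FutureHelper beyond what is described above and is only meant to illustrate the idea.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper illustrating the blocking wrapper used by bPut;
// the real FutureHelper may differ, e.g. in how it translates exceptions.
final class SyncOverAsync {

    static <T> T get(final CompletableFuture<T> future, final long timeoutMillis) {
        try {
            // block the caller until the asynchronous put completes or the timeout expires
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (final Exception e) {
            throw new RuntimeException("fail to get result within " + timeoutMillis + " ms", e);
        }
    }
}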

Next, let's follow the put process:
DefaultRheaKVStore#put

public CompletableFuture<Boolean> put(final byte[] key, final byte[] value) {
    Requires.requireNonNull(key, "key");
    Requires.requireNonNull(value, "value");
    // whether to attempt a batched put
    return put(key, value, new CompletableFuture<>(), true);
}

This calls an overloaded put method; the third parameter is a fresh, empty CompletableFuture, and the fourth parameter indicates that batching should be attempted.
DefaultRheaKVStore#put

private CompletableFuture<Boolean> put(final byte[] key, final byte[] value,
                                       final CompletableFuture<Boolean> future, final boolean tryBatching) {
    // check that init() has already been called
    checkState();
    if (tryBatching) {
        // the putBatching instance is initialized in the init method
        final PutBatching putBatching = this.putBatching;
        if (putBatching != null && putBatching.apply(new KVEntry(key, value), future)) {
            // we passed in an empty future, so simply return it here (it will be completed by the batching handler)
            return future;
        }
    }
    // store the data directly (no batching)
    internalPut(key, value, future, this.failoverRetries, null);
    return future;
}
The checkState method checks whether the started flag has been set; if DefaultRheaKVStore's init method has already been called, started will have been set to true. The putBatching instance used here is also initialized inside the init method, so let's take a look at what it does.

Batching puts with putBatching

When the putBatching instance is created in init, a PutBatchingHandler is passed in as its event handler:

this.putBatching = new PutBatching(KVEvent::new, "put_batching",
        new PutBatchingHandler("put"));

Let's look at the PutBatching constructor:

public PutBatching(EventFactory<KVEvent> factory, String name, PutBatchingHandler handler) {
    super(factory, batchingOpts.getBufSize(), name, handler);
}

PutBatching extends the abstract class Batching, so constructing it simply invokes the parent class constructor:

public Batching(EventFactory<T> factory, int bufSize, String name, EventHandler<T> handler) {
    this.name = name;
    this.disruptor = new Disruptor<>(factory, bufSize, new NamedThreadFactory(name, true));
    this.disruptor.handleEventsWith(handler);
    this.disruptor.setDefaultExceptionHandler(new LogExceptionHandler<Object>(name));
    this.ringBuffer = this.disruptor.start();
}

The Batching constructor initializes a Disruptor instance and registers the PutBatchingHandler we passed in as its event handler, so all data flowing through PutBatching is processed by PutBatchingHandler.
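For reference, the apply method that put calls on putBatching essentially just publishes a KVEvent onto this ring buffer. A rough sketch is shown below; the KVEvent field names (kvEntry, future) are assumptions taken from the onEvent code that follows, not verified against the actual source.

// Sketch of what PutBatching#apply likely does: publish a KVEvent onto the Disruptor ring buffer.
public boolean apply(final KVEntry message, final CompletableFuture<Boolean> future) {
    // tryPublishEvent returns false when the ring buffer is full
    return this.ringBuffer.tryPublishEvent((event, sequence) -> {
        event.reset();
        event.kvEntry = message;
        event.future = future;
    });
}

This also explains the fallback in the put method shown earlier: when apply returns false (the buffer is full), the data is written directly through internalPut.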

Now let's see how PutBatchingHandler processes the data:
PutBatchingHandler#onEvent

public void onEvent(final KVEvent event, final long sequence, final boolean endOfBatch) throws Exception {
    // 1. add the incoming event to the events list
    this.events.add(event);
    // add the combined length of key and value
    this.cachedBytes += event.kvEntry.length();
    final int size = this.events.size();
    // batchSize defaults to 100 and maxWriteBytes to 32768 bytes
    // 2. if this is not the last event and neither threshold has been reached, do not flush yet
    if (!endOfBatch && size < batchingOpts.getBatchSize() && this.cachedBytes < batchingOpts.getMaxWriteBytes()) {
        return;
    }
    // 3. if size is 1, call put again directly (tryBatching = false)
    if (size == 1) {
        // reset events and cachedBytes
        reset();
        final KVEntry kv = event.kvEntry;
        try {
            put(kv.getKey(), kv.getValue(), event.future, false);
        } catch (final Throwable t) {
            exceptionally(t, event.future);
        }
    // 4. if size is greater than 1, copy the entries into a list and process them as a batch
    } else {
        // initialize a list with capacity size
        final List<KVEntry> entries = Lists.newArrayListWithCapacity(size);
        final CompletableFuture<Boolean>[] futures = new CompletableFuture[size];
        for (int i = 0; i < size; i++) {
            final KVEvent e = this.events.get(i);
            entries.add(e.kvEntry);
            // use CompletableFuture to build the asynchronous flow
            futures[i] = e.future;
        }
        // after copying the events into entries, reset the buffer
        reset();
        try {
            // whenComplete runs after the batched put finishes
            put(entries).whenComplete((result, throwable) -> {
                // if no exception was thrown, complete all pending futures
                if (throwable == null) {
                    for (int i = 0; i < futures.length; i++) {
                        futures[i].complete(result);
                    }
                    return;
                }
                exceptionally(throwable, futures);
            });
        } catch (final Throwable t) {
            exceptionally(t, futures);
        }
    }
} 
  1. When this method is entered, the event is added to the events list, and the combined key/value length is added to cachedBytes.
  2. All events are published to the Disruptor and then dispatched to PutBatchingHandler, so the endOfBatch parameter tells us whether this event is the last one of the batch. If it is not the last one, the number of buffered events is below the default of 100, and cachedBytes is below 32768, the method returns and waits so the events can be processed together as a batch.
  3. Reaching this branch with size == 1 means only a single entry is buffered, so put is called again with tryBatching set to false, which goes straight to internalPut.
  4. If size is greater than 1, all buffered events are copied into a list and the batch overload of put is called; when it finishes, the whenComplete callback propagates the result (or the exception) to every pending future.

Putting values into RheaKV in bulk

Now let's look at the put(entries) call in PutBatchingHandler#onEvent and see how it handles a batch of data; it calls the list overload of put in DefaultRheaKVStore.

DefaultRheaKVStore#put

public CompletableFuture<Boolean> put(final List<KVEntry> entries) {
    // check state
    checkState();
    Requires.requireNonNull(entries, "entries");
    Requires.requireTrue(!entries.isEmpty(), "entries empty");
    // store the data
    final FutureGroup<Boolean> futureGroup = internalPut(entries, this.failoverRetries, null);
    // combine the returned futures into a single Boolean result
    return FutureHelper.joinBooleans(futureGroup);
}

This method delegates the actual write to internalPut.

DefaultRheaKVStore#internalPut

private FutureGroup<Boolean> internalPut(final List<KVEntry> entries, final int retriesLeft,
                                         final Throwable lastCause) {
    // build the mapping between Region and KVEntry
    final Map<Region, List<KVEntry>> regionMap = this.pdClient
            .findRegionsByKvEntries(entries, ApiExceptionHelper.isInvalidEpoch(lastCause));
    final List<CompletableFuture<Boolean>> futures = Lists.newArrayListWithCapacity(regionMap.size());
    final Errors lastError = lastCause == null ? null : Errors.forException(lastCause);
    for (final Map.Entry<Region, List<KVEntry>> entry : regionMap.entrySet()) {
        final Region region = entry.getKey();
        final List<KVEntry> subEntries = entry.getValue();
        // set the retry callback, decrementing the remaining retry count
        final RetryCallable<Boolean> retryCallable = retryCause -> internalPut(subEntries, retriesLeft - 1,
                retryCause);
        final BoolFailoverFuture future = new BoolFailoverFuture(retriesLeft, retryCallable);
        // store the data into this region
        internalRegionPut(region, subEntries, future, retriesLeft, lastError);
        futures.add(future);
    }
    return new FutureGroup<>(futures);
}

Because a Store may contain many Regions, this method first builds the Region-to-KVEntry mapping to determine which Region each KVEntry belongs to.
It then sets up a retry callback and calls internalRegionPut to store each subEntries list into its Region.

Building the mapping between Region and KVEntry

Let's see how the mapping is built.
pdClient is an instance of FakePlacementDriverClient, which extends AbstractPlacementDriverClient, so findRegionsByKvEntries is invoked on the parent class:
AbstractPlacementDriverClient#findRegionsByKvEntries

public Map<Region, List<KVEntry>> findRegionsByKvEntries(final List<KVEntry> kvEntries, final boolean forceRefresh) {
    if (forceRefresh) {
        refreshRouteTable();
    }
    // regionRouteTable holds the routing information for the regions
    return this.regionRouteTable.findRegionsByKvEntries(kvEntries);
}

Since we are using FakePlacementDriverClient here, refreshRouteTable is an empty method, so execution moves on to RegionRouteTable's findRegionsByKvEntries:
RegionRouteTable#findRegionsByKvEntries

public Map<Region, List<KVEntry>> findRegionsByKvEntries(final List<KVEntry> kvEntries) {
    Requires.requireNonNull(kvEntries, "kvEntries");
    // instantiate a map
    final Map<Region, List<KVEntry>> regionMap = Maps.newHashMap();
    final StampedLock stampedLock = this.stampedLock;
    final long stamp = stampedLock.readLock();
    try {
        for (final KVEntry kvEntry : kvEntries) {
            // find the region whose startKey is closest to (less than or equal to) the kvEntry's key
            final Region region = findRegionByKeyWithoutLock(kvEntry.getKey());
            // record the mapping between the region and the KVEntry
            regionMap.computeIfAbsent(region, k -> Lists.newArrayList()).add(kvEntry);
        }
        return regionMap;
    } finally {
        stampedLock.unlockRead(stamp);
    }
}

private Region findRegionByKeyWithoutLock(final byte[] key) {
    // return the greatest key less than or equal to the given key
    // rangeTable is keyed by region startKey, with regionId as the value
    // floorEntry returns the entry with the greatest key less than or equal to the given key
    final Map.Entry<byte[], Long> entry = this.rangeTable.floorEntry(key);
    if (entry == null) {
        reportFail(key);
        throw reject(key, "fail to find region by key");
    }
    // regionTable is keyed by regionId, with the Region as the value
    return this.regionTable.get(entry.getValue());
}

findRegionsByKvEntries iterates over the KVEntry list and calls findRegionByKeyWithoutLock to look up the appropriate Region in rangeTable. Because rangeTable is a TreeMap, floorEntry returns the entry whose start key is the greatest one less than or equal to the given key, i.e. the region the key falls into.
Each region is then put into regionMap, with the Region as the key and a list of KVEntry as the value.
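To make the floorEntry lookup concrete, here is a small self-contained example with made-up region boundaries; the real rangeTable is keyed by byte[] with a dedicated comparator, and Strings are used here only to keep the example short.

import java.util.Map;
import java.util.TreeMap;

public class RegionRoutingExample {

    public static void main(String[] args) {
        // rangeTable: region startKey -> regionId (hypothetical boundaries)
        final TreeMap<String, Long> rangeTable = new TreeMap<>();
        rangeTable.put("", 1L);  // region 1 starts at the very beginning
        rangeTable.put("g", 2L); // region 2 covers keys >= "g"
        rangeTable.put("t", 3L); // region 3 covers keys >= "t"

        // floorEntry returns the greatest startKey <= the given key,
        // i.e. the region whose range contains the key
        final Map.Entry<String, Long> entry = rangeTable.floorEntry("hello");
        System.out.println("key 'hello' belongs to region " + entry.getValue()); // prints region 2
    }
}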

The data in regionRouteTable is populated when DefaultRheaKVStore is initialized; for readers who do not remember, here is the call chain that initializes the routing table:

DefaultRheaKVStore#init->FakePlacementDriverClient#init->
AbstractPlacementDriverClient#init->AbstractPlacementDriverClient#initRouteTableByRegion->regionRouteTable#addOrUpdateRegion
Storing the data into the corresponding Region

Next we continue from DefaultRheaKVStore's internalPut down into internalRegionPut, which is where the data is actually stored:

DefaultRheaKVStore#internalRegionPut

private void internalRegionPut(final Region region, final List<KVEntry> subEntries,
                               final CompletableFuture<Boolean> future, final int retriesLeft,
                               final Errors lastCause) {
    // get the regionEngine
    final RegionEngine regionEngine = getRegionEngine(region.getId(), true);
    // retry function that calls back into this method
    final RetryRunner retryRunner = retryCause -> internalRegionPut(region, subEntries, future,
            retriesLeft - 1, retryCause);
    final FailoverClosure<Boolean> closure = new FailoverClosureImpl<>(future, false, retriesLeft,
            retryRunner);
    if (regionEngine != null) {
        if (ensureOnValidEpoch(region, regionEngine, closure)) {
            // get the MetricsRawKVStore
            final RawKVStore rawKVStore = getRawKVStore(regionEngine);
            // whether kvDispatcher is null is decided in init by the useParallelKVExecutor option
            if (this.kvDispatcher == null) {
                // call the RocksDB API to perform the insert
                rawKVStore.put(subEntries, closure);
            } else {
                // dispatch the put to kvDispatcher for asynchronous execution
                this.kvDispatcher.execute(() -> rawKVStore.put(subEntries, closure));
            }
        }
    } else {
        // if the current node is not the leader, the returned regionEngine is null,
        // so an RPC call is made to the leader node instead
        final BatchPutRequest request = new BatchPutRequest();
        request.setKvEntries(subEntries);
        request.setRegionId(region.getId());
        request.setRegionEpoch(region.getRegionEpoch());
        this.rheaKVRpcService.callAsyncWithRpc(request, closure, lastCause);
    }
}

This method first calls getRegionEngine to obtain the regionEngine. Since we are on a client node, no RegionEngine has been initialized, so the result is null and the request is sent via RPC, to be handled on the server by KVCommandProcessor.
If the current node were a server and the leader of the RegionEngine, it would instead call rawKVStore.put to insert the data into RocksDB.

Finally, let's look at how rheaKVRpcService sends the RPC request.

Sending the BatchPutRequest to the server

The put request is sent to the server by calling DefaultRheaKVRpcService's callAsyncWithRpc method:
DefaultRheaKVRpcService#callAsyncWithRpc

public <V> CompletableFuture<V> callAsyncWithRpc(final BaseRequest request, final FailoverClosure<V> closure,
                                                 final Errors lastCause) {
    return callAsyncWithRpc(request, closure, lastCause, true);
}

public <V> CompletableFuture<V> callAsyncWithRpc(final BaseRequest request, final FailoverClosure<V> closure,
                                                 final Errors lastCause, final boolean requireLeader) {
    final boolean forceRefresh = ErrorsHelper.isInvalidPeer(lastCause);
    // get the leader's endpoint
    final Endpoint endpoint = getRpcEndpoint(request.getRegionId(), forceRefresh, this.rpcTimeoutMillis,
            requireLeader);
    // issue the RPC call
    internalCallAsyncWithRpc(endpoint, request, closure);
    return closure.future();
}

This method calls getRpcEndpoint to obtain the server endpoint for the region, then issues an RPC request to that node. The RPC itself is made through SOFA's bolt framework, so here we focus on how the endpoint is obtained.

DefaultRheaKVRpcService#getRpcEndpoint

public Endpoint getRpcEndpoint(final long regionId, final boolean forceRefresh, final long timeoutMillis,
                               final boolean requireLeader) {
    if (requireLeader) {
        // get the leader
        return getLeader(regionId, forceRefresh, timeoutMillis);
    } else {
        // round-robin select a node other than this one
        return getLuckyPeer(regionId, forceRefresh, timeoutMillis);
    }
}

There are two branches: one obtains the leader node, the other picks a node by round-robin. Both methods are interesting, so we will walk through each of them below.

Getting the leader node by regionId

Obtaining the leader node by regionId goes through getLeader. When DefaultRheaKVStore's init method instantiates DefaultRheaKVRpcService, it overrides getLeader:
DefaultRheaKVStore#init

this.rheaKVRpcService = new DefaultRheaKVRpcService(this.pdClient, selfEndpoint) {

    @Override
    public Endpoint getLeader(final long regionId, final boolean forceRefresh, final long timeoutMillis) {
        final Endpoint leader = getLeaderByRegionEngine(regionId);
        if (leader != null) {
            return leader;
        }
        return super.getLeader(regionId, forceRefresh, timeoutMillis);
    }
};

The overridden getLeader first calls getLeaderByRegionEngine to look up the Endpoint for the region by regionId; if that returns nothing, it falls back to the parent class's getLeader.

DefaultRheaKVStore#getLeaderByRegionEngine

private Endpoint getLeaderByRegionEngine(final long regionId) {
    final RegionEngine regionEngine = getRegionEngine(regionId);
    if (regionEngine != null) {
        final PeerId leader = regionEngine.getLeaderId();
        if (leader != null) {
            final String raftGroupId = JRaftHelper.getJRaftGroupId(this.pdClient.getClusterName(), regionId);
            RouteTable.getInstance().updateLeader(raftGroupId, leader);
            return leader.getEndpoint();
        }
    }
    return null;
}

This method tries to get a RegionEngine, but since this is a client node, no RegionEngine has been initialized, so it returns null and the caller falls back to the parent class's getLeader.

DefaultRheaKVRpcService#getLeader

public Endpoint getLeader(final long regionId, final boolean forceRefresh, final long timeoutMillis) {
    return this.pdClient.getLeader(regionId, forceRefresh, timeoutMillis);
}

This calls pdClient's getLeader method; since pdClient is a FakePlacementDriverClient, which extends AbstractPlacementDriverClient, the parent class's getLeader is invoked.

AbstractPlacementDriverClient#getLeader

public Endpoint getLeader(final long regionId, final boolean forceRefresh, final long timeoutMillis) {
    // the raftGroupId is built by joining clusterName and regionId
    final String raftGroupId = JRaftHelper.getJRaftGroupId(this.clusterName, regionId);
    // look up the leader of this group in the route table
    PeerId leader = getLeader(raftGroupId, forceRefresh, timeoutMillis);
    if (leader == null && !forceRefresh) {
        // Could not found leader from cache, try again and force refresh cache
        // if not found on the first attempt, force a refresh and try again
        leader = getLeader(raftGroupId, true, timeoutMillis);
    }
    if (leader == null) {
        throw new RouteTableException("no leader in group: " + raftGroupId);
    }
    return leader.getEndpoint();
}

This method joins clusterName and regionId into a raftGroupId; for example, if the clusterName is demo and the regionId is 1, the resulting raftGroupId is demo--1.
It then calls getLeader to obtain the leader PeerId. The first call passes forceRefresh as false, meaning no refresh; if that returns null, a forced refresh is performed and the lookup is retried.

AbstractPlacementDriverClient#getLeader

protected PeerId getLeader(final String raftGroupId, final boolean forceRefresh, final long timeoutMillis) {
    final RouteTable routeTable = RouteTable.getInstance();
    // whether to force-refresh the route table
    if (forceRefresh) {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        final StringBuilder error = new StringBuilder();
        // A newly launched raft group may not have been successful in the election,
        // or in the 'leader-transfer' state, it needs to be re-tried
        Throwable lastCause = null;
        for (;;) {
            try {
                // refresh the leader in the route table
                final Status st = routeTable.refreshLeader(this.cliClientService, raftGroupId, 2000);
                if (st.isOk()) {
                    break;
                }
                error.append(st.toString());
            } catch (final InterruptedException e) {
                ThrowUtil.throwException(e);
            } catch (final Throwable t) {
                lastCause = t;
                error.append(t.getMessage());
            }
            // if the deadline has not been reached, sleep 10 ms and then refresh again
            if (System.currentTimeMillis() < deadline) {
                LOG.debug("Fail to find leader, retry again, {}.", error);
                error.append(", ");
                try {
                    Thread.sleep(10);
                } catch (final InterruptedException e) {
                    ThrowUtil.throwException(e);
                }
            // the deadline has been reached, so throw an exception
            } else {
                throw lastCause != null ? new RouteTableException(error.toString(), lastCause)
                    : new RouteTableException(error.toString());
            }
        }
    }
    // return the leader recorded in the route table
    return routeTable.selectLeader(raftGroupId);
}

When a forced refresh is requested, the method computes a deadline and enters a loop that keeps refreshing the route table; if the refresh has not yet succeeded and the deadline has not passed, it sleeps 10 milliseconds and tries again.

RouteTable#refreshLeader

public Status refreshLeader(final CliClientService cliClientService, final String groupId, final int timeoutMs)
                                                                                                               throws InterruptedException,
                                                                                                               TimeoutException {
    Requires.requireTrue(!StringUtils.isBlank(groupId), "Blank group id");
    Requires.requireTrue(timeoutMs > 0, "Invalid timeout: " + timeoutMs);
    // get the cluster's configuration by group id, including the IPs and ports of its nodes
    final Configuration conf = getConfiguration(groupId);
    if (conf == null) {
        return new Status(RaftError.ENOENT,
            "Group %s is not registered in RouteTable, forgot to call updateConfiguration?", groupId);
    }
    final Status st = Status.OK();
    final CliRequests.GetLeaderRequest.Builder rb = CliRequests.GetLeaderRequest.newBuilder();
    rb.setGroupId(groupId);
    // build the GetLeaderRequest that will be sent to fetch the leader
    final CliRequests.GetLeaderRequest request = rb.build();
    TimeoutException timeoutException = null;
    for (final PeerId peer : conf) {
        // if the peer cannot be connected, record an error status first, then continue
        if (!cliClientService.connect(peer.getEndpoint())) {
            if (st.isOk()) {
                st.setError(-1, "Fail to init channel to %s", peer);
            } else {
                final String savedMsg = st.getErrorMsg();
                st.setError(-1, "%s, Fail to init channel to %s", savedMsg, peer);
            }
            continue;
        }
        // send the GetLeaderRequest to this node to fetch the leader
        final Future<Message> result = cliClientService.getLeader(peer.getEndpoint(), request, null);
        try {
            final Message msg = result.get(timeoutMs, TimeUnit.MILLISECONDS);
            // handle the error response case
            if (msg instanceof RpcRequests.ErrorResponse) {
                if (st.isOk()) {
                    st.setError(-1, ((RpcRequests.ErrorResponse) msg).getErrorMsg());
                } else {
                    final String savedMsg = st.getErrorMsg();
                    st.setError(-1, "%s, %s", savedMsg, ((RpcRequests.ErrorResponse) msg).getErrorMsg());
                }
            } else {
                final CliRequests.GetLeaderResponse response = (CliRequests.GetLeaderResponse) msg;
                // update the cached leader
                updateLeader(groupId, response.getLeaderId());
                return Status.OK();
            }
        } catch (final TimeoutException e) {
            timeoutException = e;
        } catch (final ExecutionException e) {
            if (st.isOk()) {
                st.setError(-1, e.getMessage());
            } else {
                final String savedMsg = st.getErrorMsg();
                st.setError(-1, "%s, %s", savedMsg, e.getMessage());
            }
        }
    }
    if (timeoutException != null) {
        throw timeoutException;
    }

    return st;
}

Don't be put off by the length of this method; it is actually quite simple:

  1. Get the cluster's configuration for the groupId, which contains the IPs and ports of the other nodes.
  2. Iterate over the nodes in conf.
  3. Try to connect to each node; if the connection fails, move on to the next node with continue.
  4. Send a GetLeaderRequest to the node; if a normal response is returned within the timeout, call updateLeader to update the leader information (a usage sketch of these RouteTable APIs follows this list).
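The four steps above amount to the standard client-side usage of the jraft RouteTable. A minimal usage sketch, assuming the usual jraft client classes, with a made-up group id and address list:

import com.alipay.sofa.jraft.JRaftUtils;
import com.alipay.sofa.jraft.RouteTable;
import com.alipay.sofa.jraft.conf.Configuration;
import com.alipay.sofa.jraft.entity.PeerId;
import com.alipay.sofa.jraft.option.CliOptions;
import com.alipay.sofa.jraft.rpc.impl.cli.CliClientServiceImpl;

public class RouteTableUsage {

    public static void main(String[] args) throws Exception {
        final String groupId = "rhea_example-1"; // made-up raft group id
        final Configuration conf = JRaftUtils
            .getConfiguration("127.0.0.1:8181,127.0.0.1:8182,127.0.0.1:8183");

        // register the group's configuration first, otherwise refreshLeader returns ENOENT
        RouteTable.getInstance().updateConfiguration(groupId, conf);

        final CliClientServiceImpl cliClientService = new CliClientServiceImpl();
        cliClientService.init(new CliOptions());

        // ask the cluster who the leader is and cache it in the route table
        RouteTable.getInstance().refreshLeader(cliClientService, groupId, 2000);
        final PeerId leader = RouteTable.getInstance().selectLeader(groupId);
        System.out.println("leader is " + leader);
    }
}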

The updateLeader method is quite simple; it just updates the leader recorded for the group in the routing table. Now let's see how the server handles the GetLeaderRequest.

GetLeaderRequest is handled by the GetLeaderRequestProcessor:
GetLeaderRequestProcessor#processRequest

public Message processRequest(GetLeaderRequest request, RpcRequestClosure done) {
    List<Node> nodes = new ArrayList<>();
    String groupId = getGroupId(request);
    // if the request specifies a particular PeerId,
    // find the node corresponding to that peer in the cluster
    if (request.hasPeerId()) {
        String peerIdStr = getPeerId(request);
        PeerId peer = new PeerId();
        if (peer.parse(peerIdStr)) {
            Status st = new Status();
            nodes.add(getNode(groupId, peer, st));
            if (!st.isOk()) {
                return RpcResponseFactory.newResponse(st);
            }
        } else {
            return RpcResponseFactory.newResponse(RaftError.EINVAL, "Fail to parse peer id %", peerIdStr);
        }
    } else {
        // get all nodes in the cluster for this group
        nodes = NodeManager.getInstance().getNodesByGroupId(groupId);
    }
    if (nodes == null || nodes.isEmpty()) {
        return RpcResponseFactory.newResponse(RaftError.ENOENT, "No nodes in group %s", groupId);
    }
    // iterate over the cluster nodes to get the leaderId
    for (Node node : nodes) {
        PeerId leader = node.getLeaderId();
        if (leader != null && !leader.isEmpty()) {
            return GetLeaderResponse.newBuilder().setLeaderId(leader.toString()).build();
        }
    }
    return RpcResponseFactory.newResponse(RaftError.EAGAIN, "Unknown leader");
}

Since the request we send here does not carry a PeerId, the processor does not look up the node for a specific peer; instead it fetches all nodes in the cluster for the groupId and iterates over them to find the leaderId.

Getting a node by round-robin with getLuckyPeer

Having covered how getLeader works, let's now look at how getLuckyPeer operates.

public Endpoint getLuckyPeer(final long regionId, final boolean forceRefresh, final long timeoutMillis) {
    return this.pdClient.getLuckyPeer(regionId, forceRefresh, timeoutMillis, this.selfEndpoint);
}

As with getLeader, this delegates to the getLuckyPeer method in AbstractPlacementDriverClient:
AbstractPlacementDriverClient#getLuckyPeer

public Endpoint getLuckyPeer(final long regionId, final boolean forceRefresh, final long timeoutMillis,
                             final Endpoint unExpect) {
    final String raftGroupId = JRaftHelper.getJRaftGroupId(this.clusterName, regionId);
    final RouteTable routeTable = RouteTable.getInstance();
    // whether to force-refresh the latest cluster node information
    if (forceRefresh) {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        final StringBuilder error = new StringBuilder();
        // A newly launched raft group may not have been successful in the election,
        // or in the 'leader-transfer' state, it needs to be re-tried
        for (;;) {
            try {
                final Status st = routeTable.refreshConfiguration(this.cliClientService, raftGroupId, 5000);
                if (st.isOk()) {
                    break;
                }
                error.append(st.toString());
            } catch (final InterruptedException e) {
                ThrowUtil.throwException(e);
            } catch (final TimeoutException e) {
                error.append(e.getMessage());
            }
            if (System.currentTimeMillis() < deadline) {
                LOG.debug("Fail to get peers, retry again, {}.", error);
                error.append(", ");
                try {
                    Thread.sleep(5);
                } catch (final InterruptedException e) {
                    ThrowUtil.throwException(e);
                }
            } else {
                throw new RouteTableException(error.toString());
            }
        }
    }
    final Configuration configs = routeTable.getConfiguration(raftGroupId);
    if (configs == null) {
        throw new RouteTableException("empty configs in group: " + raftGroupId);
    }
    final List<PeerId> peerList = configs.getPeers();
    if (peerList == null || peerList.isEmpty()) {
        throw new RouteTableException("empty peers in group: " + raftGroupId);
    }
    // if there is only one node in the group, just return it
    final int size = peerList.size();
    if (size == 1) {
        return peerList.get(0).getEndpoint();
    }
    // get the load balancer; a round-robin strategy is used here
    final RoundRobinLoadBalancer balancer = RoundRobinLoadBalancer.getInstance(regionId);
    for (int i = 0; i < size; i++) {
        final PeerId candidate = balancer.select(peerList);
        final Endpoint luckyOne = candidate.getEndpoint();
        if (!luckyOne.equals(unExpect)) {
            return luckyOne;
        }
    }
    throw new RouteTableException("have no choice in group(peers): " + raftGroupId);
}

This method also decides whether to force a refresh, just like getLeader, so we won't repeat that. It then checks whether the group has more than one valid node; if so, it uses the round-robin strategy to select one. The round-robin selection is very simple: a shared index is incremented on each call and taken modulo the size of the peerList collection.
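The round-robin selection can be sketched in a few lines. This is an illustrative version only, not the actual RoundRobinLoadBalancer code:

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative round-robin selector: a shared index is incremented on every call
// and taken modulo the size of the peer list, matching the strategy described above.
final class RoundRobinExample<T> {

    private final AtomicInteger index = new AtomicInteger(0);

    T select(final List<T> elements) {
        if (elements == null || elements.isEmpty()) {
            return null;
        }
        // the mask keeps the index non-negative even after integer overflow
        final int i = (index.getAndIncrement() & Integer.MAX_VALUE) % elements.size();
        return elements.get(i);
    }
}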

That roughly concludes DefaultRheaKVRpcService's callAsyncWithRpc; the request is then sent to the server, where the BatchPutRequest is handled by KVCommandProcessor.

Server-side processing of BatchPutRequest

BatchPutRequest is processed in KVCommandProcessor:
KVCommandProcessor#handleRequest

public void handleRequest(final BizContext bizCtx, final AsyncContext asyncCtx, final T request) {
    Requires.requireNonNull(request, "request");
    final RequestProcessClosure<BaseRequest, BaseResponse<?>> closure = new RequestProcessClosure<>(request,
        bizCtx, asyncCtx);
    // find the RegionKVService corresponding to the given RegionId;
    // each RegionKVService maps to one Region and only handles requests within that Region's range
    final RegionKVService regionKVService = this.storeEngine.getRegionKVService(request.getRegionId());
    if (regionKVService == null) {
        // if no such region exists, return a NoRegionFound response
        final NoRegionFoundResponse noRegion = new NoRegionFoundResponse();
        noRegion.setRegionId(request.getRegionId());
        noRegion.setError(Errors.NO_REGION_FOUND);
        noRegion.setValue(false);
        closure.sendResponse(noRegion);
        return;
    }
    switch (request.magic()) {
        case BaseRequest.PUT:
            regionKVService.handlePutRequest((PutRequest) request, closure);
            break;
        case BaseRequest.BATCH_PUT:
            regionKVService.handleBatchPutRequest((BatchPutRequest) request, closure);
            break;
        .....
        default:
            throw new RheaRuntimeException("Unsupported request type: " + request.getClass().getName());
    }
}

handleRequest first looks up the RegionKVService by RegionId; RegionKVService instances are registered into regionKVServiceTable when their RegionEngine is initialized.
It then dispatches based on the request type. We omit the other request types here and only look at how BATCH_PUT is handled.

Before going further into the code, here is the call flow for reference:

KVCommandProcessor#handleRequest->DefaultRegionKVService#handleBatchPutRequest->MetricsRawKVStore#put->
RaftRawKVStore#put->RaftRawKVStore#applyOperation->NodeImpl#apply

BATCH_PUT corresponds to the handleBatchPutRequest method in DefaultRegionKVService; the single-key handlePutRequest shown below follows exactly the same pattern.
DefaultRegionKVService#handlePutRequest

public void handlePutRequest(final PutRequest request,
                             final RequestProcessClosure<BaseRequest, BaseResponse<?>> closure) {
    // create a response
    final PutResponse response = new PutResponse();
    response.setRegionId(getRegionId());
    response.setRegionEpoch(getRegionEpoch());
    try {
        KVParameterRequires.requireSameEpoch(request, getRegionEpoch());
        final byte[] key = KVParameterRequires.requireNonNull(request.getKey(), "put.key");
        final byte[] value = KVParameterRequires.requireNonNull(request.getValue(), "put.value");
        // this rawKVStore instance is a MetricsRawKVStore
        this.rawKVStore.put(key, value, new BaseKVStoreClosure() {

            // set the callback
            @Override
            public void run(final Status status) {
                if (status.isOk()) {
                    response.setValue((Boolean) getData());
                } else {
                    setFailure(request, response, status, getError());
                }
                closure.sendResponse(response);
            }
        });
    } catch (final Throwable t) {
        LOG.error("Failed to handle: {}, {}.", request, StackTraceUtil.stackTrace(t));
        response.setError(Errors.forException(t));
        closure.sendResponse(response);
    }
}

The handlePutRequest method is very simple: after extracting the key and value, it calls MetricsRawKVStore's put method, passing in the key and value along with a callback.

MetricsRawKVStore#put

public void put(final byte[] key, final byte[] value, final KVStoreClosure closure) {
    final KVStoreClosure c = metricsAdapter(closure, PUT, 1, value.length);
    // rawKVStore is an instance of RaftRawKVStore
    this.rawKVStore.put(key, value, c);
}

The put method then calls RaftRawKVStore's put:
RaftRawKVStore#put

public void put(final byte[] key, final byte[] value, final KVStoreClosure closure) {
    applyOperation(KVOperation.createPut(key, value), closure);
}

The put method calls the static KVOperation.createPut to create a KVOperation of type put, then calls applyOperation.

RaftRawKVStore#applyOperation

private void applyOperation(final KVOperation op, final KVStoreClosure closure) {
    // only the Leader node may submit apply tasks
    if (!isLeader()) {
        closure.setError(Errors.NOT_LEADER);
        closure.run(new Status(RaftError.EPERM, "Not leader"));
        return;
    }
    final Task task = new Task();
    // wrap the data
    task.setData(ByteBuffer.wrap(Serializers.getDefault().writeObject(op)));
    // wrap the callback
    task.setDone(new KVClosureAdapter(closure, op));
    // call NodeImpl's apply method
    this.node.apply(task);
}

applyOperation first checks whether this node is the leader; if it is not, the operation cannot submit an apply task. Otherwise it instantiates a Task, sets its data and the KVClosureAdapter callback, and calls NodeImpl's apply to publish the task.

NodeImpl#apply

public void apply(final Task task) {
    // check whether the Node has been shut down
    if (this.shutdownLatch != null) {
        Utils.runClosureInThread(task.getDone(), new Status(RaftError.ENODESHUTDOWN, "Node is shutting down."));
        throw new IllegalStateException("Node is shutting down");
    }
    // the task must not be null
    Requires.requireNonNull(task, "Null task");

    // put the task's data into a LogEntry
    final LogEntry entry = new LogEntry();
    entry.setData(task.getData());
    // retry count
    int retryTimes = 0;
    try {
        // create a Disruptor event translator
        final EventTranslator<LogEntryAndClosure> translator = (event, sequence) -> {
            event.reset();
            event.done = task.getDone();
            event.entry = entry;
            event.expectedTerm = task.getExpectedTerm();
        };
        while (true) {
            // once published, the event is handled by the LogEntryAndClosureHandler event processor
            if (this.applyQueue.tryPublishEvent(translator)) {
                break;
            } else {
                retryTimes++;
                // retry at most 3 times
                if (retryTimes > MAX_APPLY_RETRY_TIMES) {
                    // if publishing still fails, invoke the callback to report the failure status
                    Utils.runClosureInThread(task.getDone(),
                            new Status(RaftError.EBUSY, "Node is busy, has too many tasks."));
                    LOG.warn("Node {} applyQueue is overload.", getNodeId());
                    this.metrics.recordTimes("apply-task-overload-times", 1);
                    return;
                }
                ThreadHelper.onSpinWait();
            }
        }

    } catch (final Exception e) {
        Utils.runClosureInThread(task.getDone(), new Status(RaftError.EPERM, "Node is down."));
    }
}

Inside apply, the data is wrapped into a LogEntry instance, which is then packaged into an event and published to the Disruptor applyQueue. The applyQueue is initialized in NodeImpl's init method, where LogEntryAndClosureHandler is registered as its event handler.

LogEntryAndClosureHandler#onEvent

private final List<LogEntryAndClosure> tasks = new ArrayList<>(NodeImpl.this.raftOptions.getApplyBatch());

@Override
public void onEvent(final LogEntryAndClosure event, final long sequence, final boolean endOfBatch)
        throws Exception {
    // if a shutdown request has been received
    if (event.shutdownLatch != null) {
        // if the tasks list is not empty, process the queued data first
        if (!this.tasks.isEmpty()) {
            // process the tasks
            executeApplyingTasks(this.tasks);
        }
        final int num = GLOBAL_NUM_NODES.decrementAndGet();
        LOG.info("The number of active nodes decrement to {}.", num);
        event.shutdownLatch.countDown();
        return;
    }
    // add the new event to tasks
    this.tasks.add(event);
    // the batch size is set to 32, so if tasks has accumulated 32 entries or this is the last event,
    // execute the data collected in the tasks list
    if (this.tasks.size() >= NodeImpl.this.raftOptions.getApplyBatch() || endOfBatch) {
        executeApplyingTasks(this.tasks);
        this.tasks.clear();
    }
}

The onEvent method first checks whether the event is a shutdown request; if so, it finishes processing the tasks already collected and then returns. For a normal event, it checks whether the tasks list has reached 32 entries or whether this is the last event of the batch, and if so calls executeApplyingTasks to process the batch.

NodeImpl#executeApplyingTasks

private void executeApplyingTasks(final List<LogEntryAndClosure> tasks) {
    this.writeLock.lock();
    try {
        final int size = tasks.size();
        // if the current node is not the leader, do not proceed
        if (this.state != State.STATE_LEADER) {
            final Status st = new Status();

            if (this.state != State.STATE_TRANSFERRING) {
                st.setError(RaftError.EPERM, "Is not leader.");
            } else {
                st.setError(RaftError.EBUSY, "Is transferring leadership.");
            }
            LOG.debug("Node {} can't apply, status={}.", getNodeId(), st);
            // for every LogEntryAndClosure, invoke its callback with the failure status
            for (int i = 0; i < size; i++) {
                Utils.runClosureInThread(tasks.get(i).done, st);
            }
            return;
        }
        final List<LogEntry> entries = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            final LogEntryAndClosure task = tasks.get(i);
            // if the expected term does not match, invoke the callback with an error directly
            if (task.expectedTerm != -1 && task.expectedTerm != this.currTerm) {
                LOG.debug("Node {} can't apply task whose expectedTerm={} doesn't match currTerm={}.", getNodeId(),
                    task.expectedTerm, this.currTerm);
                if (task.done != null) {
                    final Status st = new Status(RaftError.EPERM, "expected_term=%d doesn't match current_term=%d",
                        task.expectedTerm, this.currTerm);
                    Utils.runClosureInThread(task.done, st);
                }
                continue;
            }
            // save the application context (ballot) for this task
            if (!this.ballotBox.appendPendingTask(this.conf.getConf(),
                this.conf.isStable() ? null : this.conf.getOldConf(), task.done)) {
                Utils.runClosureInThread(task.done, new Status(RaftError.EINTERNAL, "Fail to append task."));
                continue;
            }
            // set task entry info before adding to list.
            task.entry.getId().setTerm(this.currTerm);
            // set the entry type to ENTRY_TYPE_DATA
            task.entry.setType(EnumOutter.EntryType.ENTRY_TYPE_DATA);
            entries.add(task.entry);
        }
        // batch-append the submitted task logs, writing them to RocksDB
        this.logManager.appendEntries(entries, new LeaderStableClosure(entries));
        // update conf.first
        this.conf = this.logManager.checkAndSetConfiguration(this.conf);
    } finally {
        this.writeLock.unlock();
    }
}

executeApplyingTasks first checks whether the current node is the leader: in Raft, applying submitted tasks is restricted to a Node whose state is STATE_LEADER, so only the Leader may process apply tasks.
It then loops over the tasks and checks whether each task's expected term equals the current node's term, since logs submitted by the Leader within a single term must carry the same Term number. For tasks that pass the expected-term check, it calls the ballot box's BallotBox#appendPendingTask(conf, oldConf, done) to save the application context before log replication; a ballot built from the node's current and old configurations is added to the pendingMetaQueue double-ended queue.
Finally it calls the log manager LogManager#appendEntries(entries, ...), and the underlying log storage LogStorage batch-writes the submitted task logs to RocksDB.

The tasks submitted through Node#apply(task) are eventually replicated and applied to the Raft state machine on every node. RheaKV's state machine is represented by KVStoreStateMachine, which extends the StateMachineAdapter adapter.
The Raft state machine invokes KVStoreStateMachine's onApply(iterator) method, applying the task list to the state machine in commit order.
KVStoreStateMachine iterates over the state outputs, accumulates a list of KVState entries, and then calls RocksRawKVStore's batch(kvStates) method to apply the corresponding key operations to RocksDB in one batch.
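To illustrate that last step, a jraft state machine consumes committed tasks through an Iterator in its onApply method. Below is a simplified sketch of that pattern, not the actual KVStoreStateMachine (which additionally accumulates KVState entries and writes them to RocksDB in one batch); applyToLocalStore is a hypothetical helper.

import java.nio.ByteBuffer;

import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;

// Simplified sketch: decode each committed log entry and apply it to local storage.
public class SimpleKVStateMachine extends StateMachineAdapter {

    @Override
    public void onApply(final Iterator it) {
        while (it.hasNext()) {
            // getData() holds the serialized KVOperation that the leader wrapped via Task#setData
            final ByteBuffer data = it.getData();
            applyToLocalStore(data); // hypothetical helper that writes to the local store
            if (it.done() != null) {
                // on the leader, done() is the KVClosureAdapter, so the client's future completes here
                it.done().run(Status.OK());
            }
            it.next();
        }
    }

    private void applyToLocalStore(final ByteBuffer serializedOp) {
        // deserialize the KVOperation and perform the corresponding put on the local store (omitted)
    }
}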

Summary

This whole flow is fairly long and quite complex, and every part of the code is written with great care. We covered how the putBatching handler uses a Disruptor to process data in batches and thereby improve overall throughput, how the client obtains the server-side endpoint when initiating a request, how the server processes the BatchPutRequest, and how the code greatly improves throughput through the combination of batching and a fully asynchronous design.
