3.2.5 写主分片节点流程
代码入口:TransportReplicationAction.PrimaryOperationTransportHandler#messageReceived,然后进入AsyncPrimaryAction#doRun方法。
检查请求: 1.当前是否为主分片;2.allocationId是否是预期值;3.PrimaryTerm是否是预期值
if (shardRouting.primary() == false) {
.....
}
final String actualAllocationId = shardRouting.allocationId().getId();
if (actualAllocationId.equals(targetAllocationID) == false) {
......
}
final long actualTerm = indexShard.getPendingPrimaryTerm();
if (actualTerm != primaryTerm) {
......
}
查看主分片是否迁移:
如果已经迁移:1.将phase状态设为“primary_delegation”;2.关闭当前分片的primaryShardReference,及时释放资源;3.获取已经迁移到的目标节点,将请求转发到该节点,并等待执行结果;4.拿到结果后,将task状态更新为“finish”。
transportService.sendRequest(relocatingNode, transportPrimaryAction,
new ConcreteShardRequest<>(request, primary.allocationId().getRelocationId(), primaryTerm),
transportOptions,
new TransportChannelResponseHandler<Response>(logger, channel, "rerouting indexing to target primary " + primary,
reader) {
@Override
public void handleResponse(Response response) {
setPhase(replicationTask, "finished");
super.handleResponse(response);
}
@Override
public void handleException(TransportException exp) {
setPhase(replicationTask, "finished");
super.handleException(exp);
}
});
如果没有迁移:
1.将task状态更新为“primary”;2.主分片准备操作(主要部分);3.转发请求给副本分片
setPhase(replicationTask, "primary");
final ActionListener<Response> listener = createResponseListener(primaryShardReference);
createReplicatedOperation(request,
ActionListener.wrap(result -> result.respond(listener), listener::onFailure),
primaryShardReference).execute(); //入口
primary所在的node收到协调节点发过来的写入请求后,开始正式执行写入的逻辑,写入执行的入口是在ReplicationOperation类的execute方法,该方法中执行的两个关键步骤是,首先写主shard,如果主shard写入成功,再将写入请求发送到从shard所在的节点。
public void execute() throws Exception {
.......
//关键,这里开始执行写主分片
primaryResult = primary.perform(request);
.......
final ReplicaRequest replicaRequest = primaryResult.replicaRequest();
if (replicaRequest != null) {
........
markUnavailableShardsAsStale(replicaRequest, replicationGroup);
// 关键步骤,写完primary后这里转发请求到replicas
performOnReplicas(replicaRequest, globalCheckpoint, maxSeqNoOfUpdatesOrDeletes, replicationGroup);
}
successfulShards.incrementAndGet(); // mark primary as successful
decPendingAndFinishIfNeeded();
}
下面,我们来看写primary的关键代码,写primary入口函数为TransportShardBulkAction#shardOperationOnPrimary,最后走入到index(…) --> InternalEngine#index()的过程,这是写数据的主要过程。先是通过index获取对应的策略,即plan,通过plan执行对应操作,如要正常写入,则到了indexIntoLucene(…),然后写translog。如下所示:
public IndexResult index(Index index) throws IOException {
.......
final IndexResult indexResult;
if (plan.earlyResultOnPreFlightError.isPresent()) {
indexResult = plan.earlyResultOnPreFlightError.get();
assert indexResult.getResultType() == Result.Type.FAILURE : indexResult.getResultType();
} else if (plan.indexIntoLucene || plan.addStaleOpToLucene) {
// 将数据写入lucene,最终会调用lucene的文档写入接口
indexResult = indexIntoLucene(index, plan);
} else {
indexResult = new IndexResult(
plan.versionForIndexing, getPrimaryTerm(), plan.seqNoForIndexing, plan.currentNotFoundOrDeleted);
}
if (index.origin().isFromTranslog() == false) {
final Translog.Location location;
if (indexResult.getResultType() == Result.Type.SUCCESS) {
location = translog.add(new Translog.Index(index, indexResult)); //写translog
......
indexResult.setTranslogLocation(location);
}
.......
}
ES的写入操作是先写lucene,将数据写入到lucene内存后再写translog。ES之所以先写lucene后写log主要原因大概是写入Lucene时,Lucene会再对数据进行一些检查,有可能出现写入Lucene失败的情况。如果先写translog,那么就要处理写入translog成功但是写入Lucene一直失败的问题,所以ES采用了先写Lucene的方式。
在写完primary后,会继续写replicas,接下来需要将请求转发到从节点上,如果replica shard未分配,则直接忽略;如果replica shard正在搬迁数据到其他节点,则将请求转发到搬迁的目标shard上,否则,转发到replica shard。replicaRequest是在写入主分片后,从primaryResult中获取,并非原始Request。这块代码如下:
private void performOnReplicas(final ReplicaRequest replicaRequest, final long globalCheckpoint,
final long maxSeqNoOfUpdatesOrDeletes, final ReplicationGroup replicationGroup) {
totalShards.addAndGet(replicationGroup.getSkippedShards().size());
final ShardRouting primaryRouting = primary.routingEntry();
for (final ShardRouting shard : replicationGroup.getReplicationTargets()) {
if (shard.isSameAllocation(primaryRouting) == false) {
performOnReplica(shard, replicaRequest, globalCheckpoint, maxSeqNoOfUpdatesOrDeletes);
}
}
}
performOnReplica方法会将请求转发到目标节点,如果出现异常,如对端节点挂掉、shard写入失败等,对于这些异常,primary认为该replica shard发生故障不可用,将会向master汇报并移除该replica。这块的代码如下:
private void performOnReplica(final ShardRouting shard, final ReplicaRequest replicaRequest,
final long globalCheckpoint, final long maxSeqNoOfUpdatesOrDeletes) {
.....
totalShards.incrementAndGet();
pendingActions.incrementAndGet();
replicasProxy.performOn(shard, replicaRequest, globalCheckpoint, maxSeqNoOfUpdatesOrDeletes, new ActionListener<ReplicaResponse>() {
@Override
public void onResponse(ReplicaResponse response) {
successfulShards.incrementAndGet();
try {
primary.updateLocalCheckpointForShard(shard.allocationId().getId(), response.localCheckpoint());
primary.updateGlobalCheckpointForShard(shard.allocationId().getId(), response.globalCheckpoint());
}
......
decPendingAndFinishIfNeeded();
}
@Override
public void onFailure(Exception replicaException) {
if (TransportActions.isShardNotAvailableException(replicaException) == false) {
RestStatus restStatus = ExceptionsHelper.status(replicaException);
shardReplicaFailures.add(new ReplicationResponse.ShardInfo.Failure(
shard.shardId(), shard.currentNodeId(), replicaException, restStatus, false));
}
String message = String.format(Locale.ROOT, "failed to perform %s on replica %s", opType, shard);
replicasProxy.failShardIfNeeded(shard, message, replicaException,
ActionListener.wrap(r -> decPendingAndFinishIfNeeded(), ReplicationOperation.this::onNoLongerPrimary));
}
});
}
replica的写入逻辑和primary类似,这里不再具体介绍。
这里还涉及到Checkpoint的更新操作,包括写translog的详细流程,后续再补充。