【Elasticsearch源码】写入源码分析（三）

接上一篇：【Elasticsearch源码】写入源码分析（二）。

3.2.5 写主分片节点流程

代码入口：TransportReplicationAction.PrimaryOperationTransportHandler#messageReceived，然后进入AsyncPrimaryAction#doRun方法。
检查请求： 1.当前是否为主分片；2.allocationId是否是预期值；3.PrimaryTerm是否是预期值

            if (shardRouting.primary() == false) {
                .....
            }
            final String actualAllocationId = shardRouting.allocationId().getId();
            if (actualAllocationId.equals(targetAllocationID) == false) {
                ......
            }
            final long actualTerm = indexShard.getPendingPrimaryTerm();
            if (actualTerm != primaryTerm) {
              	......
            }

查看主分片是否迁移：
如果已经迁移：1.将phase状态设为“primary_delegation”；2.关闭当前分片的primaryShardReference，及时释放资源；3.获取已经迁移到的目标节点，将请求转发到该节点，并等待执行结果；4.拿到结果后，将task状态更新为“finish”。

                    transportService.sendRequest(relocatingNode, transportPrimaryAction,
                        new ConcreteShardRequest<>(request, primary.allocationId().getRelocationId(), primaryTerm),
                        transportOptions,
                        new TransportChannelResponseHandler<Response>(logger, channel, "rerouting indexing to target primary " + primary,
                            reader) {
                            @Override
                            public void handleResponse(Response response) {
                                setPhase(replicationTask, "finished");
                                super.handleResponse(response);
                            }
                            @Override
                            public void handleException(TransportException exp) {
                                setPhase(replicationTask, "finished");
                                super.handleException(exp);
                            }
                        });

如果没有迁移：
1.将task状态更新为“primary”；2.主分片准备操作(主要部分)；3.转发请求给副本分片

setPhase(replicationTask, "primary");
                    final ActionListener<Response> listener = createResponseListener(primaryShardReference);
                    createReplicatedOperation(request,
                        ActionListener.wrap(result -> result.respond(listener), listener::onFailure),
                        primaryShardReference).execute(); //入口

primary所在的node收到协调节点发过来的写入请求后，开始正式执行写入的逻辑，写入执行的入口是在ReplicationOperation类的execute方法，该方法中执行的两个关键步骤是，首先写主shard，如果主shard写入成功，再将写入请求发送到从shard所在的节点。

    public void execute() throws Exception {
        .......
        //关键，这里开始执行写主分片
        primaryResult = primary.perform(request);
		.......
        final ReplicaRequest replicaRequest = primaryResult.replicaRequest();
        if (replicaRequest != null) {
			........
            markUnavailableShardsAsStale(replicaRequest, replicationGroup);
            // 关键步骤，写完primary后这里转发请求到replicas
            performOnReplicas(replicaRequest, globalCheckpoint, maxSeqNoOfUpdatesOrDeletes, replicationGroup);
        }
        successfulShards.incrementAndGet();  // mark primary as successful
        decPendingAndFinishIfNeeded();
    }

下面，我们来看写primary的关键代码，写primary入口函数为TransportShardBulkAction#shardOperationOnPrimary，最后走入到index(…) --> InternalEngine#index()的过程，这是写数据的主要过程。先是通过index获取对应的策略，即plan，通过plan执行对应操作，如要正常写入，则到了indexIntoLucene(…)，然后写translog。如下所示：

    public IndexResult index(Index index) throws IOException {
    			.......
                final IndexResult indexResult;
                if (plan.earlyResultOnPreFlightError.isPresent()) {
                    indexResult = plan.earlyResultOnPreFlightError.get();
                    assert indexResult.getResultType() == Result.Type.FAILURE : indexResult.getResultType();
                } else if (plan.indexIntoLucene || plan.addStaleOpToLucene) {
                	// 将数据写入lucene，最终会调用lucene的文档写入接口
                    indexResult = indexIntoLucene(index, plan);
                } else {
                    indexResult = new IndexResult(
                        plan.versionForIndexing, getPrimaryTerm(), plan.seqNoForIndexing, plan.currentNotFoundOrDeleted);
                }
                if (index.origin().isFromTranslog() == false) {
                    final Translog.Location location;
                    if (indexResult.getResultType() == Result.Type.SUCCESS) {
                        location = translog.add(new Translog.Index(index, indexResult)); //写translog
                    ......
                    indexResult.setTranslogLocation(location);
                }
              .......
        }

ES的写入操作是先写lucene，将数据写入到lucene内存后再写translog。ES之所以先写lucene后写log主要原因大概是写入Lucene时，Lucene会再对数据进行一些检查，有可能出现写入Lucene失败的情况。如果先写translog，那么就要处理写入translog成功但是写入Lucene一直失败的问题，所以ES采用了先写Lucene的方式。

在写完primary后，会继续写replicas，接下来需要将请求转发到从节点上，如果replica shard未分配，则直接忽略；如果replica shard正在搬迁数据到其他节点，则将请求转发到搬迁的目标shard上，否则，转发到replica shard。replicaRequest是在写入主分片后，从primaryResult中获取，并非原始Request。这块代码如下：

    private void performOnReplicas(final ReplicaRequest replicaRequest, final long globalCheckpoint,
                                   final long maxSeqNoOfUpdatesOrDeletes, final ReplicationGroup replicationGroup) {
        totalShards.addAndGet(replicationGroup.getSkippedShards().size());
        final ShardRouting primaryRouting = primary.routingEntry();
        for (final ShardRouting shard : replicationGroup.getReplicationTargets()) {
            if (shard.isSameAllocation(primaryRouting) == false) {
                performOnReplica(shard, replicaRequest, globalCheckpoint, maxSeqNoOfUpdatesOrDeletes);
            }
        }
    }

performOnReplica方法会将请求转发到目标节点，如果出现异常，如对端节点挂掉、shard写入失败等，对于这些异常，primary认为该replica shard发生故障不可用，将会向master汇报并移除该replica。这块的代码如下：

    private void performOnReplica(final ShardRouting shard, final ReplicaRequest replicaRequest,
                                  final long globalCheckpoint, final long maxSeqNoOfUpdatesOrDeletes) {
        .....
        totalShards.incrementAndGet();
        pendingActions.incrementAndGet();
        replicasProxy.performOn(shard, replicaRequest, globalCheckpoint, maxSeqNoOfUpdatesOrDeletes, new ActionListener<ReplicaResponse>() {
            @Override
            public void onResponse(ReplicaResponse response) {
                successfulShards.incrementAndGet();
                try {
                    primary.updateLocalCheckpointForShard(shard.allocationId().getId(), response.localCheckpoint());
                    primary.updateGlobalCheckpointForShard(shard.allocationId().getId(), response.globalCheckpoint());
                }
                ......
                decPendingAndFinishIfNeeded();
            }
            @Override
            public void onFailure(Exception replicaException) {
                if (TransportActions.isShardNotAvailableException(replicaException) == false) {
                    RestStatus restStatus = ExceptionsHelper.status(replicaException);
                    shardReplicaFailures.add(new ReplicationResponse.ShardInfo.Failure(
                        shard.shardId(), shard.currentNodeId(), replicaException, restStatus, false));
                }
                String message = String.format(Locale.ROOT, "failed to perform %s on replica %s", opType, shard);
                replicasProxy.failShardIfNeeded(shard, message, replicaException,
                    ActionListener.wrap(r -> decPendingAndFinishIfNeeded(), ReplicationOperation.this::onNoLongerPrimary));
            }
        });
    }

replica的写入逻辑和primary类似，这里不再具体介绍。

这里还涉及到Checkpoint的更新操作，包括写translog的详细流程，后续再补充。

少加点香菜

发布了22 篇原创文章 · 获赞 46 · 访问量 1959

私信关注

【Elasticsearch源码】写入源码分析（三）

3.2.5 写主分片节点流程

猜你喜欢