[Source] Elasticsearch CCR source code analysis (part two)

Continued from part one: [Source] Elasticsearch CCR source code analysis (part one).

The sendShardChangesRequest method eventually lands in ShardChangesAction.TransportAction#shardOperation. Based on the read request described above, it fetches from the shard's Translog all Operations within the requested seq_no range and returns the latest Operations that the follower shard needs.

        protected Response shardOperation(Request request, ShardId shardId) throws IOException {
            .......
            // Fetch the Operations
            final Translog.Operation[] operations = getOperations(
                    indexShard,
                    seqNoStats.getGlobalCheckpoint(),
                    request.getFromSeqNo(),
                    request.getMaxOperationCount(),
                    request.getExpectedHistoryUUID(),
                    request.getMaxBatchSize());
            // After the snapshot operation completes, make sure maxSeqNoOfUpdatesOrDeletes, the index metadata, mappings and settings are up to date
            final long maxSeqNoOfUpdatesOrDeletes = indexShard.getMaxSeqNoOfUpdatesOrDeletes();
            final IndexMetaData indexMetaData = indexService.getMetaData();
            final long mappingVersion = indexMetaData.getMappingVersion();
            final long settingsVersion = indexMetaData.getSettingsVersion();
            return getResponse(......);
        }

Operations are obtained as follows: the parameters are validated first, then a Translog snapshot is created, and the Operations inside the snapshot are iterated and accumulated until the maximum batch size is exceeded.

    static Translog.Operation[] getOperations(....) throws IOException {
        ..... // validate parameters
        int seenBytes = 0;
        long toSeqNo = Math.min(globalCheckpoint, (fromSeqNo + maxOperationCount) - 1);
        final List<Translog.Operation> operations = new ArrayList<>();
        // Create a Translog snapshot and read the Operations from it
        try (Translog.Snapshot snapshot = indexShard.newChangesSnapshot("ccr", fromSeqNo, toSeqNo, true)) {
            Translog.Operation op;
            while ((op = snapshot.next()) != null) {
                operations.add(op);
                seenBytes += op.estimateSize();
                if (seenBytes > maxBatchSize.getBytes()) {
                    break;
                }
            }
        } catch (MissingHistoryOperationsException e) {
            ......
        }
        return operations.toArray(EMPTY_OPERATIONS_ARRAY);
    }
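
The range and size limits above can be illustrated with a small standalone sketch. It is not the Elasticsearch code: `Op` is a hypothetical stand-in for Translog.Operation, and a plain list plays the role of the Translog snapshot. It only shows how toSeqNo is capped by the global checkpoint and how the byte budget cuts a batch short.

```java
import java.util.ArrayList;
import java.util.List;

final class ReadBatchSketch {
    // Hypothetical stand-in for Translog.Operation: a sequence number plus an estimated size.
    static final class Op {
        final long seqNo;
        final long sizeInBytes;

        Op(long seqNo, long sizeInBytes) {
            this.seqNo = seqNo;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Upper bound of the read range: never beyond the global checkpoint, and never
    // more than maxOperationCount operations counted from fromSeqNo.
    static long toSeqNo(long globalCheckpoint, long fromSeqNo, int maxOperationCount) {
        return Math.min(globalCheckpoint, (fromSeqNo + maxOperationCount) - 1);
    }

    // Collects operations in [fromSeqNo, toSeqNo] and stops once the byte budget is exceeded.
    // The operation that crosses the budget is still part of the batch, as in the excerpt above.
    static List<Op> collect(List<Op> history, long fromSeqNo, long toSeqNo, long maxBatchBytes) {
        List<Op> batch = new ArrayList<>();
        long seenBytes = 0;
        for (Op op : history) {
            if (op.seqNo < fromSeqNo || op.seqNo > toSeqNo) {
                continue;
            }
            batch.add(op);
            seenBytes += op.sizeInBytes;
            if (seenBytes > maxBatchBytes) {
                break;
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        List<Op> history = new ArrayList<>();
        for (long seqNo = 0; seqNo < 10; seqNo++) {
            history.add(new Op(seqNo, 100));
        }
        long to = toSeqNo(7, 2, 100);                   // capped by the global checkpoint
        List<Op> batch = collect(history, 2, to, 250);  // capped by the byte budget
        System.out.println("toSeqNo=" + to + ", batchSize=" + batch.size());
    }
}
```

Because the size check runs after the add, a batch can exceed the limit by at most one Operation, which also guarantees that a single oversized Operation is still returned.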

The listener of sendShardChangesRequest processes the Response through the handleReadResponse method.

    void handleReadResponse(long from, long maxRequiredSeqNo, ShardChangesAction.Response response) {
        // Handle the read response
        Runnable handleResponseTask = () -> innerHandleReadResponse(from, maxRequiredSeqNo, response);
        // Update the follower index mappings
        Runnable updateMappingsTask = () -> maybeUpdateMapping(response.getMappingVersion(), handleResponseTask);
        // Update the follower index settings
        maybeUpdateSettings(response.getSettingsVersion(), updateMappingsTask);
    }
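
Note that the tasks are declared in the reverse of the order in which they run. A minimal sketch of this continuation-passing pattern, assuming each maybeUpdate step simply does its work and then runs the next task (the real methods are not shown in this article, so the bodies below are hypothetical and only record the order):

```java
import java.util.ArrayList;
import java.util.List;

final class CallbackChainSketch {
    // Records the order in which the steps actually execute.
    final List<String> log = new ArrayList<>();

    // Hypothetical stand-in: pretend to sync settings, then continue with the next task.
    void maybeUpdateSettings(long requiredVersion, Runnable next) {
        log.add("settings " + requiredVersion);
        next.run();
    }

    // Hypothetical stand-in: pretend to sync mappings, then continue with the next task.
    void maybeUpdateMapping(long requiredVersion, Runnable next) {
        log.add("mapping " + requiredVersion);
        next.run();
    }

    void handleReadResponse(long settingsVersion, long mappingVersion, int operationCount) {
        // Declared first, executed last.
        Runnable handleResponseTask = () -> log.add("operations " + operationCount);
        Runnable updateMappingsTask = () -> maybeUpdateMapping(mappingVersion, handleResponseTask);
        // Declared last, executed first.
        maybeUpdateSettings(settingsVersion, updateMappingsTask);
    }

    public static void main(String[] args) {
        CallbackChainSketch sketch = new CallbackChainSketch();
        sketch.handleReadResponse(3, 5, 2);
        System.out.println(sketch.log);
    }
}
```

The effect is that Operations are only applied after the follower's settings and mappings have been brought up to the versions reported in the response.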

The innerHandleReadResponse method processes the read response. If the response contains no Operations, the sendShardChangesRequest request is sent again; otherwise, all Operations in the response are added to the buffer and the write process begins.

    synchronized void innerHandleReadResponse(long from, long maxRequiredSeqNo, ShardChangesAction.Response response) {
        .......
        if (response.getOperations().length == 0) { 
            newFromSeqNo = from;
        } else {
            List<Translog.Operation> operations = Arrays.asList(response.getOperations());
            long operationsSize = operations.stream().mapToLong(Translog.Operation::estimateSize).sum();
            buffer.addAll(operations);
            bufferSizeInBytes += operationsSize;
            final long maxSeqNo = response.getOperations()[response.getOperations().length - 1].seqNo();
            newFromSeqNo = maxSeqNo + 1;
            lastRequestedSeqNo = Math.max(lastRequestedSeqNo, maxSeqNo);
            coordinateWrites(); // enter the write path
        }
        if (newFromSeqNo <= maxRequiredSeqNo && isStopped() == false) {
            int newSize = Math.toIntExact(maxRequiredSeqNo - newFromSeqNo + 1);
            sendShardChangesRequest(newFromSeqNo, newSize, maxRequiredSeqNo); // resend the request for the remaining range
        } else {
            numOutstandingReads--;
            coordinateReads(); // re-enter the read path
        }
    }
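
The follow-up arithmetic is easy to check with concrete numbers. The sketch below isolates it in a hypothetical helper, `nextRead`, which is not part of Elasticsearch and leaves out the isStopped() check and all task state:

```java
final class NextReadSketch {
    // Returns {newFromSeqNo, newSize} when part of the requested range is still missing,
    // or null when everything up to maxRequiredSeqNo has been received.
    static long[] nextRead(long from, long maxRequiredSeqNo, long[] receivedSeqNos) {
        long newFromSeqNo = receivedSeqNos.length == 0
            ? from                                           // empty response: retry from the same position
            : receivedSeqNos[receivedSeqNos.length - 1] + 1; // continue right after the last received operation
        if (newFromSeqNo <= maxRequiredSeqNo) {
            int newSize = Math.toIntExact(maxRequiredSeqNo - newFromSeqNo + 1);
            return new long[] {newFromSeqNo, newSize};
        }
        return null;
    }

    public static void main(String[] args) {
        // Requested seq_no 10..19, but the size limit cut the response after 13.
        long[] next = nextRead(10, 19, new long[] {10, 11, 12, 13});
        System.out.println("from=" + next[0] + ", size=" + next[1]);
    }
}
```

In this example the follow-up request starts at seq_no 14 and asks for the 6 remaining Operations (14 to 19).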

The write process likewise first checks whether there is write budget left. It then moves Operations from the buffer queue into the ops ArrayList and sends the request through sendBulkShardOperationsRequest.

    private synchronized void coordinateWrites() {
        ......
        while (hasWriteBudget() && buffer.isEmpty() == false) {
            long sumEstimatedSize = 0L;
            int length = Math.min(params.getMaxWriteRequestOperationCount(), buffer.size());
            List<Translog.Operation> ops = new ArrayList<>(length);
            for (int i = 0; i < length; i++) {
                Translog.Operation op = buffer.remove();
                ops.add(op);
                sumEstimatedSize += op.estimateSize();
                if (sumEstimatedSize > params.getMaxWriteRequestSize().getBytes()) {
                    break;
                }
            }
            bufferSizeInBytes -= sumEstimatedSize;
            numOutstandingWrites++;
            // Send the bulk write request
            sendBulkShardOperationsRequest(ops, leaderMaxSeqNoOfUpdatesOrDeletes, new AtomicInteger(0));
        }
    }
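
The batching loop can be tried out in isolation. In the sketch below the buffer only holds estimated sizes, and hasWriteBudget(), whose body is not shown in this article, is assumed to be a cap on the number of in-flight write requests; that assumption and all parameter names are illustrative only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

final class WriteBatchSketch {
    // Splits buffered operations (represented only by their estimated sizes) into write batches.
    // Each batch is limited by operation count and by bytes; the number of batches is limited by
    // maxOutstandingWrites, an assumed stand-in for hasWriteBudget().
    static List<List<Long>> drain(Queue<Long> bufferedSizes, int maxOpsPerWrite,
                                  long maxBytesPerWrite, int maxOutstandingWrites) {
        List<List<Long>> batches = new ArrayList<>();
        while (batches.size() < maxOutstandingWrites && !bufferedSizes.isEmpty()) {
            int length = Math.min(maxOpsPerWrite, bufferedSizes.size());
            List<Long> ops = new ArrayList<>(length);
            long sumEstimatedSize = 0L;
            for (int i = 0; i < length; i++) {
                long size = bufferedSizes.remove();
                ops.add(size);
                sumEstimatedSize += size;
                if (sumEstimatedSize > maxBytesPerWrite) {
                    break; // byte limit reached before the count limit
                }
            }
            batches.add(ops);
        }
        return batches;
    }

    public static void main(String[] args) {
        Queue<Long> buffer = new ArrayDeque<>(Collections.nCopies(7, 100L));
        List<List<Long>> batches = drain(buffer, 3, 1_000, 2);
        System.out.println(batches.size() + " batches sent, " + buffer.size() + " operations still buffered");
    }
}
```

Operations that do not fit in the current write budget stay in the buffer until a write response frees up budget again.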

Execution then enters the TransportBulkShardOperationsAction class, which writes to the primary shard and the replica shards. TransportBulkShardOperationsAction extends TransportWriteAction.
The write path is the same as for a normal bulk write, except that the shardOperationOnPrimary and shardOperationOnReplica methods are overridden. The primary is written by replaying the translog operations; after a successful write, the translog location to sync is updated and the replicaRequest is built. The replica write path then applies the replicaRequest built above directly.

    public static CcrWritePrimaryResult shardOperationOnPrimary(....) throws IOException {
        ......
        final List<Translog.Operation> appliedOperations = new ArrayList<>(sourceOperations.size());
        Translog.Location location = null;
        for (Translog.Operation sourceOp : sourceOperations) {
            // targetOp carries the operation type, the related metadata, and the source
            final Translog.Operation targetOp = rewriteOperationWithPrimaryTerm(sourceOp, primary.getOperationPrimaryTerm());
            // Replaying the translog operation ends up in the regular primary write logic
            final Engine.Result result = primary.applyTranslogOperation(targetOp, Engine.Operation.Origin.PRIMARY);
            if (result.getResultType() == Engine.Result.Type.SUCCESS) {
                appliedOperations.add(targetOp);
                // On success, update the translog location to sync
                location = locationToSync(location, result.getTranslogLocation());
            } else {
                ......
            }
        }
        // After the primary shard write succeeds, build the replicaRequest
        final BulkShardOperationsRequest replicaRequest = new BulkShardOperationsRequest(
            shardId, historyUUID, appliedOperations, maxSeqNoOfUpdatesOrDeletes);
        // Updates the checkpoint and seqNo
        return new CcrWritePrimaryResult(replicaRequest, location, primary, logger);
    }

    public static WriteReplicaResult<BulkShardOperationsRequest> shardOperationOnReplica(.....) throws IOException {
        Translog.Location location = null;
        for (final Translog.Operation operation : request.getOperations()) {
            // Enter the regular data write path
            final Engine.Result result = replica.applyTranslogOperation(operation, Engine.Operation.Origin.REPLICA);
            if (result.getResultType() != Engine.Result.Type.SUCCESS) {
              .....
            }
            // On success, update the translog location to sync
            location = locationToSync(location, result.getTranslogLocation());
        }
        return new WriteReplicaResult<>(request, location, null, replica, logger);
    }

handleWriteResponse is the listener method that processes the result of sendBulkShardOperationsRequest. Each successfully handled response decrements numOutstandingWrites by 1, until numOutstandingWrites reaches 0. If the buffer has spare capacity, reading continues.

    private synchronized void handleWriteResponse(final BulkShardOperationsResponse response) {
        this.followerGlobalCheckpoint = Math.max(this.followerGlobalCheckpoint, response.getGlobalCheckpoint());
        this.followerMaxSeqNo = Math.max(this.followerMaxSeqNo, response.getMaxSeqNo());
        numOutstandingWrites--;
        assert numOutstandingWrites >= 0;
        coordinateWrites();
        // Start reading again when the buffer has spare capacity
        coordinateReads();
    }

In general, the follower shard sends a read request for a range of seq_no. If new Operations are available on the leader shard, the leader responds with them, limited by the configured parameters, and the follower then writes the data. If no new Operations are available, the leader shard waits until a timeout: if new Operations arrive within the timeout period, it responds with them immediately; otherwise, once the timeout expires, it replies to the follower shard that there are no new Operations.
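
This wait-with-timeout behaviour is a long poll. The sketch below shows the idea with a plain lock and condition; it is not how Elasticsearch implements it, and the class and method names (LongPollSketch, advanceTo, awaitOperations) are made up for illustration.

```java
import java.util.OptionalLong;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

final class LongPollSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition advanced = lock.newCondition();
    private long globalCheckpoint = -1; // nothing readable yet

    // Leader side: new operations became readable up to the given checkpoint.
    void advanceTo(long checkpoint) {
        lock.lock();
        try {
            if (checkpoint > globalCheckpoint) {
                globalCheckpoint = checkpoint;
                advanced.signalAll(); // wake up pending reads
            }
        } finally {
            lock.unlock();
        }
    }

    // Read side: returns the current checkpoint as soon as fromSeqNo is readable,
    // or an empty result if nothing new arrived before the timeout.
    OptionalLong awaitOperations(long fromSeqNo, long timeoutMillis) {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (globalCheckpoint < fromSeqNo) {
                if (remainingNanos <= 0) {
                    return OptionalLong.empty(); // timed out: no new operations
                }
                remainingNanos = advanced.awaitNanos(remainingNanos);
            }
            return OptionalLong.of(globalCheckpoint);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return OptionalLong.empty();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        LongPollSketch leader = new LongPollSketch();
        leader.advanceTo(5);
        System.out.println(leader.awaitOperations(3, 50)); // already readable, returns at once
        System.out.println(leader.awaitOperations(6, 50)); // nothing new, empty after the timeout
    }
}
```

Compared with polling at a fixed interval, a long poll delivers new Operations as soon as they exist and costs only one request per timeout period while the leader is idle.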

Reads and writes are decoupled through a buffer: the buffer is a priority queue ordered by seqNo.

    private final Queue<Translog.Operation> buffer = new PriorityQueue<>(Comparator.comparing(Translog.Operation::seqNo));
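
A short sketch of why the ordering matters, assuming read responses can arrive out of order (for example when several reads are in flight at once). `Op` is again a hypothetical stand-in for Translog.Operation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

final class SeqNoBufferSketch {
    // Hypothetical stand-in for Translog.Operation.
    static final class Op {
        private final long seqNo;

        Op(long seqNo) {
            this.seqNo = seqNo;
        }

        long seqNo() {
            return seqNo;
        }
    }

    // Buffers the responses in arrival order, then drains the buffer the way a writer would.
    static List<Long> drainInOrder(long[]... responses) {
        Queue<Op> buffer = new PriorityQueue<>(Comparator.comparingLong(Op::seqNo));
        for (long[] response : responses) {
            for (long seqNo : response) {
                buffer.add(new Op(seqNo));
            }
        }
        List<Long> drained = new ArrayList<>();
        Op op;
        while ((op = buffer.poll()) != null) { // poll() always returns the lowest seqNo
            drained.add(op.seqNo());
        }
        return drained;
    }

    public static void main(String[] args) {
        // The response for seq_no 5..7 arrives before the one for 2..4.
        System.out.println(drainInOrder(new long[] {5, 6, 7}, new long[] {2, 3, 4}));
        // prints [2, 3, 4, 5, 6, 7]
    }
}
```

Only remove() and poll() honour the ordering; iterating over a PriorityQueue does not visit elements in sorted order, which is why the write loop takes Operations out with buffer.remove().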

4 Summary

  1. CCR is loaded and used as a plugin, so it does not make intrusive changes to the core;
  2. Full replication is done by restoring from a snapshot;
  3. Incremental replication fetches the Operations from the leader cluster's Translog transaction log and writes the data to the follower cluster;
  4. Replication happens at the shard level, so each shard has its own Follower Shard Task;
  5. Data consistency between the clusters is verified through seq_no and GlobalCheckpoint;
  6. When ES merges segment files, operations related to deleted or updated docs may be discarded, which changes the available seq_no history, so soft_deletes is used, with a default retention of 12 hours.

Origin blog.csdn.net/wudingmei1023/article/details/104064469