The Implementation Semantics of ContainerStateMachine in Ozone Datanode

Foreword


In a previous article, the author described how Ozone uses Apache Ratis internally to achieve consistency. Thanks to this underlying library, the Ozone layer only needs to call it and implement the methods of a custom StateMachine. In the Ozone Datanode, the custom ContainerStateMachine provides consistency control for Container operations across multiple replicas. This article looks at the internal implementation of the Datanode's ContainerStateMachine, to give a deeper understanding of how Container requests are processed.

How ContainerStateMachine implements StateMachine / RaftLog semantics


Because Ozone uses Apache Ratis internally, the following two concepts come up repeatedly:

  • RaftLog
  • StateMachine

So we first need to be clear about what these two concepts mean in Ozone.

First, the RaftLog. In Ozone, a RaftLog entry can be understood simply as one user request operation, represented as a TransactionContext. The process is as follows:

First, the Ozone layer wraps the user request into a TransactionContext object.

The startTransaction method of ContainerStateMachine:

  public TransactionContext startTransaction(RaftClientRequest request)
      throws IOException {
    long startTime = Time.monotonicNowNanos();
    final ContainerCommandRequestProto proto =
        message2ContainerCommandRequestProto(request.getMessage());
    Preconditions.checkArgument(request.getRaftGroupId().equals(gid));
    try {
      dispatcher.validateContainerCommand(proto);
    } catch (IOException ioe) {
      if (ioe instanceof ContainerNotOpenException) {
        metrics.incNumContainerNotOpenVerifyFailures();
      } else {
        metrics.incNumStartTransactionVerifyFailures();
        LOG.error("startTransaction validation failed on leader", ioe);
      }
      TransactionContext ctxt = TransactionContext.newBuilder()
          .setClientRequest(request)
          .setStateMachine(this)
          .setServerRole(RaftPeerRole.LEADER)
          .build();
      ctxt.setException(ioe);
      return ctxt;
    }
    ...
}

Then, when the Datanode writes the received TransactionContext to the RaftLog, it is converted into a log entry:

  @Override
  public LogEntryProto initLogEntry(long term, long index) {
    Preconditions.assertTrue(serverRole == RaftPeerRole.LEADER);
    Preconditions.assertNull(logEntry, "logEntry");
    Objects.requireNonNull(smLogEntryProto, "smLogEntryProto == null");
    return logEntry = ServerProtoUtils.toLogEntryProto(smLogEntryProto, term, index);
  }

The RaftServer inside the Datanode then writes the log entries converted from these TransactionContext objects into its local RaftLog.

When writing a log entry, two cases need to be distinguished.

The first case is a request that carries user data, that is, a data write request such as writeChunk. Here the data has to be written separately to the StateMachine, while the RaftLog retains only the transaction information itself. The StateMachine can be understood here as the current state of the Datanode's data and metadata.

Because user data writes are latency-sensitive, ContainerStateMachine keeps the user request data in an internal cache and then writes this chunk data out asynchronously. At this point the data is saved only as a temporary (tmp) file.
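
The split between log metadata and user data can be sketched as follows. This is a minimal illustration and not the Ozone or Ratis API: SplitWriteRequestSketch and its nested types are hypothetical names, and a plain byte array stands in for the chunk payload. It only shows the idea that the log entry keeps the transaction information while the user data travels separately.

```java
import java.util.Arrays;

/** Hypothetical sketch: split a write request into log metadata and state machine data. */
final class SplitWriteRequestSketch {

  /** What would stay in the Raft log entry: transaction information only, no user data. */
  static final class LogPayload {
    final String chunkName;
    final int dataLength;

    LogPayload(String chunkName, int dataLength) {
      this.chunkName = chunkName;
      this.dataLength = dataLength;
    }
  }

  /** Result of the split: a small log payload plus the bulk user data. */
  static final class Split {
    final LogPayload logPayload;
    final byte[] stateMachineData;

    Split(LogPayload logPayload, byte[] stateMachineData) {
      this.logPayload = logPayload;
      this.stateMachineData = stateMachineData;
    }
  }

  /** Separates the user data from the metadata describing it. */
  static Split split(String chunkName, byte[] userData) {
    // Defensive copy so later changes by the caller do not affect the stored data.
    byte[] copy = Arrays.copyOf(userData, userData.length);
    return new Split(new LogPayload(chunkName, copy.length), copy);
  }
}
```

For example, SplitWriteRequestSketch.split("chunk_1", data) yields a log payload that records only the chunk name and length, which mirrors the check in applyTransaction that a writeChunk log entry carries no user data.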

  /*
   * writeStateMachineData calls are not synchronized with each other
   * and also with applyTransaction.
   */
  @Override
  public CompletableFuture<Message> writeStateMachineData(LogEntryProto entry) {
    try {
      metrics.incNumWriteStateMachineOps();
      long writeStateMachineStartTime = Time.monotonicNowNanos();
      ContainerCommandRequestProto requestProto =
          getContainerCommandRequestProto(
              entry.getStateMachineLogEntry().getLogData());
      // 1) Build the write chunk request
      WriteChunkRequestProto writeChunk =
          WriteChunkRequestProto.newBuilder(requestProto.getWriteChunk())
              .setData(getStateMachineData(entry.getStateMachineLogEntry()))
              .build();
      requestProto = ContainerCommandRequestProto.newBuilder(requestProto)
          .setWriteChunk(writeChunk).build();
      Type cmdType = requestProto.getCmdType();

      // For only writeChunk, there will be writeStateMachineData call.
      // CreateContainer will happen as a part of writeChunk only.
      switch (cmdType) {
      case WriteChunk:
        return handleWriteChunk(requestProto, entry.getIndex(),
            entry.getTerm(), writeStateMachineStartTime);
      default:
        throw new IllegalStateException("Cmd Type:" + cmdType
            + " should not have state machine data");
      }
    } catch (IOException e) {
      metrics.incNumWriteStateMachineFails();
      return completeExceptionally(e);
    }
  }

  private CompletableFuture<Message> handleWriteChunk(
      ContainerCommandRequestProto requestProto, long entryIndex, long term,
      long startTime) {
    final WriteChunkRequestProto write = requestProto.getWriteChunk();
    RaftServer server = ratisServer.getServer();
    Preconditions.checkState(server instanceof RaftServerProxy);
    try {
      // 2) On the leader, put the chunk data into the cache. The leader later
      // reads the chunk data from this cache and wraps it into the Raft log
      // request that is sent to the followers.
      if (((RaftServerProxy) server).getImpl(gid).isLeader()) {
        stateMachineDataCache.put(entryIndex, write.getData());
      }
    } catch (IOException | InterruptedException ioe) {
      return completeExceptionally(ioe);
    }
    DispatcherContext context =
        new DispatcherContext.Builder()
            .setTerm(term)
            .setLogIndex(entryIndex)
            // Mark this stage as the write-data stage
            .setStage(DispatcherContext.WriteChunkStage.WRITE_DATA)
            .setContainer2BCSIDMap(container2BCSIDMap)
            .build();
    CompletableFuture<Message> raftFuture = new CompletableFuture<>();
    // ensure the write chunk happens asynchronously in writeChunkExecutor pool
    // thread.
    ...
    return raftFuture;
  }
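
The asynchronous hand-off mentioned in the comment above (the part elided in the excerpt) follows a common CompletableFuture pattern. The sketch below is an assumption about the general shape, not the actual Ozone code: AsyncChunkWriteSketch, submitWrite and the "written@" result string are hypothetical, and the Runnable stands in for the real chunk write.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Hypothetical sketch: run a chunk write on a pool thread and report through a future. */
final class AsyncChunkWriteSketch {

  /**
   * Submits the write to the given pool and returns immediately.
   * The returned future completes when the write finishes or fails.
   */
  static CompletableFuture<String> submitWrite(
      ExecutorService pool, long logIndex, Runnable write) {
    CompletableFuture<String> raftFuture = new CompletableFuture<>();
    CompletableFuture.runAsync(write, pool).whenComplete((ignored, error) -> {
      if (error != null) {
        // Propagate the failure to whoever is waiting on the future.
        raftFuture.completeExceptionally(error);
      } else {
        raftFuture.complete("written@" + logIndex);
      }
    });
    return raftFuture;
  }
}
```

The caller gets the future back without blocking, which is the property the comment asks for: the write itself happens on a thread of the executor pool.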

A Datanode in the Raft Leader role also needs to implement the readStateMachineData method. It reads the user data back from its own StateMachine in order to build the Raft log entries that are sent to the Raft Follower Datanodes.

The readStateMachineData method of ContainerStateMachine:

  /*
   * This api is used by the leader while appending logs to the follower
   * This allows the leader to read the state machine data from the
   * state machine implementation in case cached state machine data has been
   * evicted.
   */
  @Override
  public CompletableFuture<ByteString> readStateMachineData(
      LogEntryProto entry) {
    ...
    try {
      final ContainerCommandRequestProto requestProto =
          getContainerCommandRequestProto(
              entry.getStateMachineLogEntry().getLogData());
      // readStateMachineData should only be called for "write" to Ratis.
      Preconditions.checkArgument(!HddsUtils.isReadOnly(requestProto));
      // Currently readStateMachineData is only called for write chunk requests
      if (requestProto.getCmdType() == Type.WriteChunk) {
        final CompletableFuture<ByteString> future = new CompletableFuture<>();
        CompletableFuture.supplyAsync(() -> {
          try {
            future.complete(
                getCachedStateMachineData(entry.getIndex(), entry.getTerm(),
                    requestProto));
          } catch (IOException e) {
            metrics.incNumReadStateMachineFails();
            future.completeExceptionally(e);
          }
          return future;
        }, chunkExecutor);
        return future;
      } else {
        throw new IllegalStateException("Cmd type:" + requestProto.getCmdType()
            + " cannot have state machine data");
      }
    } catch (Exception e) {
      metrics.incNumReadStateMachineFails();
      LOG.error("{} unable to read stateMachineData:", gid, e);
      return completeExceptionally(e);
    }
  }

  /**
   * Reads the Entry from the Cache or loads it back by reading from disk.
   */
  private ByteString getCachedStateMachineData(Long logIndex, long term,
      ContainerCommandRequestProto requestProto)
      throws IOException {
    // Fast path: read from the local cache. If the entry is no longer cached,
    // read the data from the local tmp chunk file.
    ByteString data = stateMachineDataCache.get(logIndex);
    if (data == null) {
      data = readStateMachineData(requestProto, term, logIndex);
    }
    return data;
  }
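
The cache-then-disk lookup above is a read-through pattern. A minimal, self-contained sketch of it follows. ReadThroughCacheSketch is a hypothetical class, the LongFunction passed in stands in for reading the tmp chunk file, and the read counter exists only to make the fallback observable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongFunction;

/** Hypothetical sketch of a cache keyed by log index with a disk fallback. */
final class ReadThroughCacheSketch {
  private final Map<Long, byte[]> cache = new ConcurrentHashMap<>();
  private final LongFunction<byte[]> diskReader;
  private int diskReads = 0; // demo counter only, not thread-safe

  ReadThroughCacheSketch(LongFunction<byte[]> diskReader) {
    this.diskReader = diskReader;
  }

  /** Leader side: remember the data for this log index. */
  void put(long logIndex, byte[] data) {
    cache.put(logIndex, data);
  }

  /** Drop the entry, for example after the transaction has been applied. */
  void evict(long logIndex) {
    cache.remove(logIndex);
  }

  /** Returns cached data if present, otherwise loads it through the fallback reader. */
  byte[] read(long logIndex) {
    byte[] data = cache.get(logIndex);
    if (data == null) {
      diskReads++;
      data = diskReader.apply(logIndex);
    }
    return data;
  }

  int diskReads() {
    return diskReads;
  }
}
```

The fallback matters because the leader may still need to send an entry to a slow follower after the cached copy has been evicted.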

The final processing logic for this kind of request, in ChunkManagerImpl, is as follows:

  public void writeChunk(Container container, BlockID blockID, ChunkInfo info,
      ChunkBuffer data, DispatcherContext dispatcherContext)
      throws StorageContainerException {
    Preconditions.checkNotNull(dispatcherContext);
    DispatcherContext.WriteChunkStage stage = dispatcherContext.getStage();
    try {
      ...

      switch (stage) {
      case WRITE_DATA:
        //...
        if (tmpChunkFile.exists()) {
          // If the tmp chunk file already exists it means the raft log got
          // appended, but later on the log entry got truncated in Ratis leaving
          // behind garbage.
          // TODO: once the checksum support for data chunks gets plugged in,
          // instead of rewriting the chunk here, let's compare the checkSums
          LOG.warn(
              "tmpChunkFile already exists" + tmpChunkFile + "Overwriting it.");
        }
        // The data is written to a temporary tmp file. If a Raft log truncate
        // happens later, this tmp data can simply be overwritten.
        // This happens in the writeStateMachineData phase of ContainerStateMachine.
        ChunkUtils
            .writeData(tmpChunkFile, info, data, volumeIOStats, doSyncWrite);
        // No need to increment container stats here, as still data is not
        // committed here.
        break;
      case COMMIT_DATA:
        ...
        // Commit stage for the chunk data: rename the tmp chunk file to its
        // final file name.
        // This happens in the applyTransaction phase of ContainerStateMachine.
        commitChunk(tmpChunkFile, chunkFile);
        // Increment container stats here, as we commit the data.
        updateContainerWriteStats(container, info, isOverwrite);
        break;
      case COMBINED:
        // directly write to the chunk file
        ChunkUtils.writeData(chunkFile, info, data, volumeIOStats, doSyncWrite);
        updateContainerWriteStats(container, info, isOverwrite);
        break;
      default:
        throw new IOException("Can not identify write operation.");
      }
    } catch (StorageContainerException ex) {
      ...
    }
  }
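
The three stages can be reproduced in isolation with java.nio.file. The sketch below is a simplified illustration under stated assumptions, not the ChunkManagerImpl logic: TwoPhaseChunkWriteSketch is a hypothetical class, the ".tmp" suffix is an assumed naming scheme, and container statistics, sync writes and checksums are left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch: write to a tmp file first, then rename it on commit. */
final class TwoPhaseChunkWriteSketch {

  enum Stage { WRITE_DATA, COMMIT_DATA, COMBINED }

  static void writeChunk(Path chunkFile, byte[] data, Stage stage)
      throws IOException {
    // Assumed naming scheme: the tmp file sits next to the final chunk file.
    Path tmpChunkFile =
        chunkFile.resolveSibling(chunkFile.getFileName() + ".tmp");
    switch (stage) {
      case WRITE_DATA:
        // Files.write truncates an existing file by default, so leftovers
        // from a truncated log entry are overwritten.
        Files.write(tmpChunkFile, data);
        break;
      case COMMIT_DATA:
        // Commit: rename the tmp file to the final name. data is not used.
        Files.move(tmpChunkFile, chunkFile, StandardCopyOption.REPLACE_EXISTING);
        break;
      case COMBINED:
        // Write directly to the final chunk file.
        Files.write(chunkFile, data);
        break;
      default:
        throw new IOException("Cannot identify write operation.");
    }
  }
}
```

Calling writeChunk with WRITE_DATA and later with COMMIT_DATA corresponds to the writeStateMachineData and applyTransaction phases. Until the commit call, the final chunk file does not exist, so a reader cannot observe uncommitted data under the final name.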

Note that the writeStateMachineData operation described above only writes out the data carried by the user request. It does not mean the operation has finished. The operation is complete only once the RaftLog entry has been successfully applied to the StateMachine.

This is why a writeChunk request is marked with the COMMIT_DATA stage only in applyTransaction:

  /*
   * ApplyTransaction calls in Ratis are sequential.
   */
  @Override
  public CompletableFuture<Message> applyTransaction(TransactionContext trx) {
    long index = trx.getLogEntry().getIndex();
    // Since leader and one of the followers has written the data, it can
    // be removed from the stateMachineDataMap.
    stateMachineDataCache.remove(index);

    DispatcherContext.Builder builder =
        new DispatcherContext.Builder()
            .setTerm(trx.getLogEntry().getTerm())
            .setLogIndex(index);

    long applyTxnStartTime = Time.monotonicNowNanos();
    try {
      applyTransactionSemaphore.acquire();
      metrics.incNumApplyTransactionsOps();
      ContainerCommandRequestProto requestProto =
          getContainerCommandRequestProto(
              trx.getStateMachineLogEntry().getLogData());
      Type cmdType = requestProto.getCmdType();
      // Make sure that in write chunk, the user data is not set
      if (cmdType == Type.WriteChunk) {
        Preconditions
            .checkArgument(requestProto.getWriteChunk().getData().isEmpty());
        builder
            // The apply transaction phase is the commit chunk data stage
            .setStage(DispatcherContext.WriteChunkStage.COMMIT_DATA);
      }
      ...
}

The second case covers requests that carry no user data, such as pure Container metadata updates. These are handled when the RaftLog entry is applied to the StateMachine, that is, in the applyTransaction method of ContainerStateMachine shown above. Such requests are then passed on to HddsDispatcher for asynchronous processing.

Read requests are handled by the StateMachine's query method, as follows:

The query method of ContainerStateMachine:

  @Override
  public CompletableFuture<Message> query(Message request) {
    try {
      metrics.incNumQueryStateMachineOps();
      final ContainerCommandRequestProto requestProto =
          message2ContainerCommandRequestProto(request);
      // Run the runCommand method
      return CompletableFuture
          .completedFuture(runCommand(requestProto, null)::toByteString);
    } catch (IOException e) {
      metrics.incNumQueryStateMachineFails();
      return completeExceptionally(e);
    }
  }

Because the RaftLog on a Datanode acting as Raft Follower may become inconsistent, a truncate operation is sometimes needed. ContainerStateMachine must then also deal with the user data that was previously written to the StateMachine, as follows:

  @Override
  public CompletableFuture<Void> truncateStateMachineData(long index) {
    // Remove cached data at or after the given index. The tmp chunk data that
    // was already written out will be overwritten by later write operations.
    stateMachineDataCache.removeIf(k -> k >= index);
    return CompletableFuture.completedFuture(null);
  }
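
The effect of the truncate predicate can be checked with a small stand-alone sketch. TruncateCacheSketch is a hypothetical class that uses a ConcurrentHashMap in place of the real cache. It keeps the same `k >= index` condition, so the entry at the truncate index is removed as well.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: drop cached entries at or after a truncated log index. */
final class TruncateCacheSketch {
  private final Map<Long, String> cache = new ConcurrentHashMap<>();

  void put(long logIndex, String data) {
    cache.put(logIndex, data);
  }

  /** Removes every entry whose log index is greater than or equal to index. */
  void truncate(long index) {
    cache.keySet().removeIf(k -> k >= index);
  }

  boolean contains(long logIndex) {
    return cache.containsKey(logIndex);
  }

  int size() {
    return cache.size();
  }
}
```

With entries at indexes 1 to 4, truncate(3) leaves only indexes 1 and 2 in the cache.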

By implementing the StateMachine methods described above, ContainerStateMachine lets the Ozone Datanode process requests consistently under the Raft protocol. The extra work mainly concerns write requests that carry user data, which need the StateMachine data write and read paths. For most other transactional read and write requests, implementing query and applyTransaction in the StateMachine is enough. The underlying consistency mechanism is already provided by the Apache Ratis library. Readers interested in that part can read the author's article on the RaftLeader / RaftFollower consistency synchronization mechanism used inside Ozone.

A flowchart of this process accompanies the original article and can deepen understanding of everything described above. In it, solid lines represent the ordinary Raft log write phase and dotted lines represent the log commit phase.
[Figure: flowchart of the ContainerStateMachine request process]


Origin blog.csdn.net/Androidlushangderen/article/details/104456771