A preliminary exploration of Apache Hudi (5): integration with Spark

Background

At present, the integration between Hudi and Spark is still based on the Spark DataSource V1 API. This can be seen directly from Hudi's DefaultSource implementation:

class DefaultSource extends RelationProvider
  with SchemaRelationProvider
  with CreatableRelationProvider
  with DataSourceRegister
  with StreamSinkProvider
  with StreamSourceProvider
  with SparkAdapterSupport
  with Serializable {
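
Because DefaultSource also mixes in DataSourceRegister, Spark resolves the short format name "hudi" to this class. As a reminder of how this V1 entry point is exercised from a Spark job, here is a minimal usage sketch (the path, table name and field names are made up for illustration; df is assumed to be an existing DataFrame and spark an existing SparkSession):

import org.apache.spark.sql.SaveMode

val basePath = "/tmp/hudi_trips"                 // hypothetical table location

df.write.format("hudi")                          // resolved to DefaultSource via DataSourceRegister
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .mode(SaveMode.Append)
  .save(basePath)

val readDf = spark.read.format("hudi").load(basePath)   // reads go through the same V1 RelationProvider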

Discussion

Continuing from the code covered in the previous post, A preliminary exploration of Apache Hudi (4):

 // Methods in the HoodieDataSourceInternalBatchWrite class; the call chain of the methods involved is as follows:
 createBatchWriterFactory => dataWriter.write => dataWriter.commit/abort => dataWriter.close
     ||
     \/
 onDataWriterCommit
     ||
     \/
 commit/abort
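
To make this call chain concrete, the sketch below shows the shape of the Spark DataSource V2 batch-write contract that HoodieDataSourceInternalBatchWrite implements. This is not Hudi's code, only a minimal illustrative skeleton against the org.apache.spark.sql.connector.write interfaces: the first four calls run per task on the executors, while onDataWriterCommit and the final commit/abort run once on the driver.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{BatchWrite, DataWriter, DataWriterFactory, PhysicalWriteInfo, WriterCommitMessage}

// Per-task result sent back to the driver (payload is illustrative).
case class TaskCommitMessage(rowsWritten: Long) extends WriterCommitMessage

// Executor side: write => commit/abort => close for each task.
class SketchDataWriter(partitionId: Int, taskId: Long) extends DataWriter[InternalRow] {
  private var rows = 0L
  override def write(record: InternalRow): Unit = { rows += 1 }        // buffer / flush records
  override def commit(): WriterCommitMessage = TaskCommitMessage(rows) // task-level commit
  override def abort(): Unit = ()                                      // discard partial output
  override def close(): Unit = ()                                      // release resources
}

class SketchWriterFactory extends DataWriterFactory {
  override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
    new SketchDataWriter(partitionId, taskId)
}

// Driver side: createBatchWriterFactory, onDataWriterCommit per finished task, then one commit/abort.
class SketchBatchWrite extends BatchWrite {
  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    new SketchWriterFactory
  override def onDataWriterCommit(message: WriterCommitMessage): Unit = ()  // per-task callback
  override def commit(messages: Array[WriterCommitMessage]): Unit = ()      // job-level commit
  override def abort(messages: Array[WriterCommitMessage]): Unit = ()       // job-level rollback
}
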
  • Before explaining what commit does, note that DataSourceInternalWriterHelper already did one more thing during its construction phase: it called writeClient.preWrite:
    this.metaClient = HoodieTableMetaClient.builder().setConf(configuration).setBasePath(writeConfig.getBasePath()).build();
    // writeClient is a SparkRDDWriteClient instance
    writeClient.preWrite(instantTime, WriteOperationType.BULK_INSERT, metaClient);
  • metaClient builds a Hudi metadata client of type HoodieTableMetaClient.
    If hoodie.metastore.enable is set (disabled by default), a HoodieTableMetastoreClient instance is created; otherwise a plain HoodieTableMetaClient instance is created
  • writeClient.preWrite performs the preparation work before any data is written:
    • Based on hoodie.write.concurrency.mode (default single_writer; the other option is optimistic_concurrency_control), if OCC is used it looks up the last successfully completed transaction, otherwise this stays empty
    • AsyncCleanerService.startAsyncCleaningIfEnabled decides whether to start the asynchronous clean service, based on hoodie.clean.automatic (default true), hoodie.clean.async (default false) and hoodie.table.services.enabled (default true)
    • Likewise, AsyncArchiveService.startAsyncArchiveIfEnabled decides whether to start the asynchronous archive service, based on hoodie.archive.automatic (default true), hoodie.archive.async (default false) and hoodie.table.services.enabled (default true)
    • So by default, the clean and archive services do not run as asynchronous background services; a configuration sketch for these options follows the commit method below
  • Now let's look at what commit does; it eventually calls the dataSourceInternalWriterHelper.commit method:
public void commit(List<HoodieWriteStat> writeStatList) {
    try {
      writeClient.commitStats(instantTime, writeStatList, Option.of(extraMetadata),
          CommitUtils.getCommitActionType(operationType, metaClient.getTableType()));
    } catch (Exception ioe) {
      throw new HoodieException(ioe.getMessage(), ioe);
    } finally {
      writeClient.close();
    }
  }
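
As promised above, here is a hedged sketch of the write options that drive the clean and archive behaviour just described. The key names are standard Hudi write configs and the values shown are the defaults listed above; df, basePath and SaveMode are as in the earlier usage sketch, and the defaults should be verified against the Hudi version actually in use:

df.write.format("hudi")
  .option("hoodie.table.services.enabled", "true")   // master switch for table services
  .option("hoodie.clean.automatic", "true")          // run cleaning as part of the write
  .option("hoodie.clean.async", "false")             // true => clean in a background thread
  .option("hoodie.archive.automatic", "true")        // run archival as part of the write
  .option("hoodie.archive.async", "false")           // true => archive in a background thread
  .mode(SaveMode.Append)
  .save(basePath)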

The writeClient here is an instance of SparkRDDWriteClient, and the commitStats method it calls looks like this (the four-argument overload used above delegates to this one with an empty partitionToReplaceFileIds map):

public boolean commitStats(String instantTime, List<HoodieWriteStat> stats, Option<Map<String, String>> extraMetadata,
                             String commitActionType, Map<String, List<String>> partitionToReplaceFileIds) {
    // Skip the empty commit if not allowed
    if (!config.allowEmptyCommit() && stats.isEmpty()) {
      return true;
    }
    LOG.info("Committing " + instantTime + " action " + commitActionType);
    // Create a Hoodie table which encapsulated the commits and files visible
    HoodieTable table = createTable(config, hadoopConf);
    HoodieCommitMetadata metadata = CommitUtils.buildMetadata(stats, partitionToReplaceFileIds,
        extraMetadata, operationType, config.getWriteSchema(), commitActionType);
    HoodieInstant inflightInstant = new HoodieInstant(State.INFLIGHT, table.getMetaClient().getCommitActionType(), instantTime);
    HeartbeatUtils.abortIfHeartbeatExpired(instantTime, table, heartbeatClient, config);
    this.txnManager.beginTransaction(Option.of(inflightInstant),
        lastCompletedTxnAndMetadata.isPresent() ? Option.of(lastCompletedTxnAndMetadata.get().getLeft()) : Option.empty());
    try {
      preCommit(inflightInstant, metadata);
      commit(table, commitActionType, instantTime, metadata, stats);
      // already within lock, and so no lock requried for archival
      postCommit(table, metadata, instantTime, extraMetadata, false);
      LOG.info("Committed " + instantTime);
      releaseResources();
    } catch (IOException e) {
      throw new HoodieCommitException("Failed to complete commit " + config.getBasePath() + " at time " + instantTime, e);
    } finally {
      this.txnManager.endTransaction(Option.of(inflightInstant));
    }

    // We don't want to fail the commit if hoodie.fail.writes.on.inline.table.service.exception is false. We catch warn if false
    try {
      // do this outside of lock since compaction, clustering can be time taking and we don't need a lock for the entire execution period
      runTableServicesInline(table, metadata, extraMetadata);
    } catch (Exception e) {
      if (config.isFailOnInlineTableServiceExceptionEnabled()) {
        throw e;
      }
      LOG.warn("Inline compaction or clustering failed with exception: " + e.getMessage()
          + ". Moving further since \"hoodie.fail.writes.on.inline.table.service.exception\" is set to false.");
    }

    emitCommitMetrics(instantTime, metadata, commitActionType);

    // callback if needed.
    if (config.writeCommitCallbackOn()) {
      if (null == commitCallback) {
        commitCallback = HoodieCommitCallbackFactory.create(config);
      }
      commitCallback.call(new HoodieWriteCommitCallbackMessage(instantTime, config.getTableName(), config.getBasePath(), stats));
    }
    return true;
  }
  • If empty commits are not allowed and no data was written, the method returns immediately.
    By default hoodie.allow.empty.commit is true, i.e. empty commits are allowed, because even an empty commit may need to record metadata such as streaming offsets
  • createTable creates a new HoodieTable; in this case a table of type HoodieSparkMergeOnReadTable is created
  • CommitUtils.buildMetadata constructs the commit metadata.
    Here the incoming operationType is bulk_insert, schemaToStoreInCommit is the Avro schema set earlier, commitActionType is deltacommit and partitionToReplaceFileIds is Map.empty; this step only builds the HoodieCommitMetadata object and records the corresponding metadata in it
  • A new HoodieInstant instance is created, here in the INFLIGHT state
  • HeartbeatUtils.abortIfHeartbeatExpired checks whether the heartbeat has expired; if hoodie.cleaner.policy.failed.writes is LAZY and the heartbeat has expired, an exception is thrown
  • txnManager.beginTransaction starts the transaction, which mainly means acquiring the lock.
    The transaction is only actually started when hoodie.write.concurrency.mode is optimistic_concurrency_control, because only in that case can concurrent writes conflict (see the configuration sketch below)
    • lockManager.lock() acquires the lock via the provider configured by hoodie.write.lock.provider; the default ZookeeperBasedLockProvider is built on Curator's InterProcessMutex, and hoodie.metrics.lock.enable controls whether metrics on the lock-holding period are collected
    • reset(currentTxnOwnerInstant, ...) records the passed-in owner instant as the transaction manager's currentTxnOwnerInstant
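
Finally, the configuration sketch referred to in the transaction bullet above: a multi-writer setup that actually exercises this lock path. The option keys follow the Hudi concurrency-control documentation (ZooKeeper-based lock provider); the ZooKeeper endpoint and lock key values are placeholders, and df/basePath are as in the earlier sketches:

df.write.format("hudi")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")       // recommended for multi-writer setups
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
  .option("hoodie.write.lock.zookeeper.url", "zk-host")        // placeholder ZooKeeper endpoint
  .option("hoodie.write.lock.zookeeper.port", "2181")
  .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
  .option("hoodie.write.lock.zookeeper.lock_key", "trips")
  .option("hoodie.metrics.lock.enable", "false")               // metrics over the lock-holding period
  .mode(SaveMode.Append)
  .save(basePath)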


Original post: blog.csdn.net/monkeyboy_tech/article/details/130694627