A Preliminary Exploration of Apache Hudi (6) (integration with Spark)

background

At present, the integration of Hudi and Spark is still based on Spark DataSource V1, as you can see from Hudi's own DefaultSource implementation:

class DefaultSource extends RelationProvider
  with SchemaRelationProvider
  with CreatableRelationProvider
  with DataSourceRegister
  with StreamSinkProvider
  with StreamSourceProvider
  with SparkAdapterSupport
  with Serializable {

A follow-up based on DataSource V2 will be covered later.
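
For reference, here is a minimal sketch of a write that goes through this V1 entry point. The table name, base path, and field names are made-up placeholders, not anything from the Hudi source:

import org.apache.spark.sql.SaveMode

// df is assumed to be an existing DataFrame with columns id, ts, ...
// format("hudi") resolves to the DefaultSource above via DataSourceRegister.
df.write
  .format("hudi")
  .option("hoodie.table.name", "demo_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/demo_table")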

gossip

Continuing with the code from the previous post, A Preliminary Exploration of Apache Hudi (5):

  preCommit(inflightInstant, metadata);
  commit(table, commitActionType, instantTime, metadata, stats);
  // already within lock, and so no lock required for archival
  postCommit(table, metadata, instantTime, extraMetadata, false);
  LOG.info("Committed " + instantTime);
  releaseResources();
  this.txnManager.endTransaction(Option.of(inflightInstant));
  ...
  runTableServicesInline(table, metadata, extraMetadata);
  emitCommitMetrics(instantTime, metadata, commitActionType);

  // callback if needed.
  if (config.writeCommitCallbackOn()) {
    if (null == commitCallback) {
      commitCallback = HoodieCommitCallbackFactory.create(config);
    }
    commitCallback.call(new HoodieWriteCommitCallbackMessage(instantTime, config.getTableName(), config.getBasePath(), stats));
  }
  return true;
 

As mentioned before, only the thread that has acquired the distributed lock can proceed with these operations.
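
A minimal sketch of what acquiring that lock requires on the configuration side, assuming optimistic concurrency control with the ZooKeeper-based lock provider; the ZooKeeper address, port, paths, and table name are placeholders:

// Same write as before, with multi-writer concurrency control enabled.
df.write
  .format("hudi")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
  .option("hoodie.write.lock.zookeeper.url", "zk-host")
  .option("hoodie.write.lock.zookeeper.port", "2181")
  .option("hoodie.write.lock.zookeeper.lock_key", "demo_table")
  .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/demo_table")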

  • preCommit resolves potential file conflicts before metadata is written, because the current write may conflict with background Compaction/Clustering operations
  • commit actually writes metadata
    • Before writing metadata, you need to understand marker files. A marker file is created in the HoodieRowCreateHandle class:
      createMarkerFile(partitionPath, fileName, instantTime, table, writeConfig);

    Which marker mechanism is used here depends on hoodie.write.markers.type (the default is "TIMELINE_SERVER_BASED") and hoodie.embed.timeline.server (the default is true). With the default values, TimelineServerBasedWriteMarkers is used (when the table sits on HDFS, Hudi falls back to DirectWriteMarkers): it sends a request to create a marker file (with a suffix like *.parquet.marker.CREATE*) to the embedded timeline service, which is started in the constructor of BaseHoodieClient:

     startEmbeddedServerView()
    
    • When the write is finalized, reconcileAgainstMarkers is called to delete stray files produced by Spark speculative execution, which is exactly why marker files exist; metrics such as "duration" and "numFilesFinalized" are also updated
    • writeTableMetadata updates table metadata
    • activeTimeline.saveAsComplete creates a metadata file like 20230422183552014.deltacommit
  • postCommit does some post-commit cleanup
    • quietDeleteMarkerDir deletes the corresponding marker directory
    • autoCleanOnCommit: if hoodie.clean.automatic is true and hoodie.clean.async is false, a synchronous clean is performed, mainly to remove historical files (parquet data files or log files) that are no longer needed
    • autoArchiveOnCommit: if hoodie.archive.automatic is true and hoodie.archive.async is false, archiving is performed, mainly to archive old instants and reduce the pressure on the timeline
  • releaseResources mainly unpersists the RDDs that were persisted during the write
  • this.txnManager.endTransaction(Option.of(inflightInstant)) ends the transaction and releases the lock
  • runTableServicesInline runs outside the transaction because it is time-consuming; it covers Compaction and Clustering (the related flags are summarized in the sketch after this list)
    • If hoodie.metadata.enable is true (the default is true), the subsequent operations proceed
    • If hoodie.compact.inline is false (the default is false) and hoodie.compact.schedule.inline is true (the default is false), a compaction is only scheduled inline, with the execution left to a separate run
    • If hoodie.compact.inline is true, the compaction is both scheduled and executed inline
    • Clustering behaves the same way, with the corresponding configurations hoodie.clustering.inline and hoodie.clustering.schedule.inline
  • emitCommitMetrics collects the commit metrics
  • If hoodie.write.commit.callback.on is true (the default is false), the callback class configured by hoodie.write.commit.callback.class (the default is org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback) is also invoked, sending the commit information to the HTTP server configured by hoodie.write.commit.callback.http.url (a custom callback sketch follows below)
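
To put the flags from this walkthrough in one place, here is a sketch of the relevant writer options as a Scala map; the values mirror the defaults mentioned above except where noted:

// Writer options discussed above; values are the defaults from the text
// unless marked otherwise.
val tableServiceOpts = Map(
  "hoodie.write.markers.type"         -> "TIMELINE_SERVER_BASED",
  "hoodie.embed.timeline.server"      -> "true",
  "hoodie.clean.automatic"            -> "true",  // with hoodie.clean.async=false: synchronous clean in postCommit
  "hoodie.archive.automatic"          -> "true",  // with hoodie.archive.async=false: synchronous archive in postCommit
  "hoodie.compact.inline"             -> "true",  // not the default: schedule and execute compaction inline
  "hoodie.compact.schedule.inline"    -> "false",
  "hoodie.clustering.inline"          -> "false",
  "hoodie.clustering.schedule.inline" -> "false"
)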

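Since the commit callback is pluggable, here is a sketch of a custom implementation; the class name is hypothetical, and the exact package paths and constructor contract (the factory reflectively passing HoodieWriteConfig) may vary across Hudi versions:

import org.apache.hudi.callback.HoodieWriteCommitCallback
import org.apache.hudi.callback.common.HoodieWriteCommitCallbackMessage
import org.apache.hudi.config.HoodieWriteConfig

// Hypothetical callback that only logs the commit. Enable it with
// hoodie.write.commit.callback.on=true and point
// hoodie.write.commit.callback.class at this class.
class LoggingCommitCallback(config: HoodieWriteConfig) extends HoodieWriteCommitCallback {
  override def call(msg: HoodieWriteCommitCallbackMessage): Unit = {
    println(s"commit ${msg.getCommitTime} finished for table ${msg.getTableName}")
  }
}
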
Source: blog.csdn.net/monkeyboy_tech/article/details/130757902