A preliminary study of Apache Hudi (7): integration with Spark

Background

At present, the integration of Hudi and Spark is still based on Spark DataSource V1, which you can see from the declaration of Hudi's DefaultSource:

class DefaultSource extends RelationProvider
  with SchemaRelationProvider
  with CreatableRelationProvider
  with DataSourceRegister
  with StreamSinkProvider
  with StreamSourceProvider
  with SparkAdapterSupport
  with Serializable {
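As a quick illustration of the V1 path, the sketch below writes and then reads a table through the "hudi" short name that DataSourceRegister exposes; Spark dispatches both calls to the DefaultSource shown above. The path, table name, and column names (id, name, ts) are invented for the example, and it assumes a Spark session with the hudi-spark bundle on the classpath.

```scala
// Sketch only: assumes Spark with the hudi-spark bundle available;
// the path and column names are made up for the example.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-v1-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", 1000L), (2, "b", 1001L)).toDF("id", "name", "ts")

// Write path: dispatched to DefaultSource (CreatableRelationProvider)
df.write.format("hudi").
  option("hoodie.table.name", "demo_tbl").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode(SaveMode.Overwrite).
  save("/tmp/hudi/demo_tbl")

// Read path: dispatched to DefaultSource (RelationProvider)
spark.read.format("hudi").load("/tmp/hudi/demo_tbl").show()
```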

Discussion

Picking up where A preliminary study of Apache Hudi (2): integration with Spark left off, the next step in the write path is:

    val syncHiveSuccess = metaSync(sqlContext.sparkSession, writeConfig, basePath, df.schema)

This step synchronizes the newly written table to the Hive metastore. If hoodie.datasource.hive_sync.enable is true (default: false, not enabled), then hoodie.datasource.meta.sync.enable is also set to true (default: false, not enabled), and the HiveSyncTool class is added to the syncClientToolClassSet collection for later invocation. If hoodie.meta.sync.client.tool.class is set, the classes it names are added to the collection as well.

If hoodie.datasource.meta.sync.enable is true, the following properties are populated:

- hoodie.datasource.hive_sync.schema_string_length_thresh is set to spark.sql.sources.schemaStringLengthThreshold (default: 4000)
- hoodie.meta_sync.spark.version is set to the current Spark version
- hoodie.meta.sync.metadata_file_listing is set to the value of hoodie.metadata.enable (default: true)

The syncHoodieTable method of HiveSyncTool is then called to synchronize the metadata. For a MOR table there are two Hive tables: an rt table and an ro table, corresponding respectively to the snapshot (real-time) view and the read-optimized view. However, if hoodie.datasource.hive_sync.skip_ro_suffix is true (default: false), the read-optimized table is registered without the ro suffix. Finally, the table just created is refreshed in Spark, so the newly inserted Hudi table can be queried from Spark.



Origin blog.csdn.net/monkeyboy_tech/article/details/130798742