The data lake wave hits: EMR releases Apache Hudi

Hudi is a storage format for data lakes. It provides the ability to update and delete data on top of the Hadoop file system (HDFS), as well as the ability to consume changed data as a stream.

Application scenarios

  • Near real-time data ingestion

    Hudi supports insert, update, and delete operations. You can ingest log data from message queues such as Kafka or from Log Service (SLS) into Hudi in near real time, and you can also synchronize database change data (Binlog) in real time (a streaming-ingestion sketch follows this list).

    Hudi also optimizes the small files generated during data writing, so compared with traditional file formats it is friendlier to HDFS.

  • Near real-time data analysis

    Hudi supports multiple analysis engines, including Hive, Spark, Presto, and Impala. Because Hudi is a file format rather than a service, it does not depend on any additional server processes and is lightweight to use.

  • Incremental data processing

    Hudi supports the Incremental Query type: through Spark Streaming you can query only the data that has changed after a given COMMIT. Hudi therefore provides the ability to stream changes from data on HDFS, which can be used to simplify existing system architectures.
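
As a rough illustration of the near real-time ingestion scenario, the following is a minimal sketch of streaming a Kafka topic into a Hudi table with Spark Structured Streaming. The broker address, topic name, message schema, table name, and paths are all placeholders, and the write options are the same DataSourceWriteOptions constants used in the write examples later in this article; exact streaming-sink behavior depends on the Hudi and Spark versions shipped with your EMR release.

    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .appName("kafka to hudi")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Hypothetical JSON schema of the messages in the topic.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType),
      StructField("ts", LongType),
      StructField("dt", StringType)))

    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "events")                     // placeholder topic
      .load()
      .select(from_json(col("value").cast("string"), schema).as("r"))
      .select("r.*")

    // Stream upserts into a Hudi table; each micro-batch becomes a Hudi commit.
    source.writeStream
      .format("hudi")
      .option(TABLE_NAME, "hudi_events")
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
      .option(RECORDKEY_FIELD_OPT_KEY, "id")
      .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
      .option(PARTITIONPATH_FIELD_OPT_KEY, "dt")
      .option("checkpointLocation", "/tmp/hudi/checkpoints/hudi_events")
      .outputMode("append")
      .start("/tmp/hudi/hudi_events")
      .awaitTermination()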

Table type

Hudi supports the following two table types:

  • Copy On Write

    Data is stored in a columnar file format (Parquet). An update to a Copy On Write table rewrites the data files that contain the updated records.

  • Merge On Read

    Data is stored in a combination of a columnar file format (Parquet) and a row-based file format (Avro). Merge On Read stores the base data in columnar files and the incremental data in row-based log files. Newly written incremental data goes into the row-based files, and a COMPACTION operation, triggered according to a configurable strategy, merges the incremental data into the columnar files. The table type is chosen when the table is written (a selection sketch follows this list).
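
As a minimal sketch of choosing the table type, the write option TABLE_TYPE_OPT_KEY selects Copy On Write (the default) or Merge On Read when the table is first created. The DataFrame df, the table name, and the path are placeholders modeled on the write examples later in this article.

    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
    import org.apache.spark.sql.SaveMode

    // Create the table as Merge On Read instead of the default Copy On Write.
    df.write.format("hudi")
      .option(TABLE_NAME, "hudi_mor_test")
      .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL) // or COW_TABLE_TYPE_OPT_VAL
      .option(RECORDKEY_FIELD_OPT_KEY, "id")
      .option(PRECOMBINE_FIELD_OPT_KEY, "version")
      .option(PARTITIONPATH_FIELD_OPT_KEY, "dt")
      .mode(SaveMode.Overwrite)
      .save("/tmp/hudi/h_mor")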

Query type

Hudi supports the following three query types:

  • Snapshot Queries

    Queries the latest snapshot of the data as of the most recent COMMIT. For a Merge On Read table, the query merges the columnar base data with the real-time data in the row-based logs on the fly; for a Copy On Write table, it reads the latest version of the Parquet data directly.

    The Copy On Write and Merge On Read tables support this type of query.

  • Incremental Queries

    Queries only the data that is new or changed after a given COMMIT (see the sketch after this list).

    Only the Copy On Write table supports this type of query.

  • Read Optimized Queries

    Queries only the latest data that has been compacted into columnar base files as of the given COMMIT. Read Optimized Queries are an optimization of snapshot queries for Merge On Read tables, trading data freshness for query performance.

    The Copy On Write and Merge On Read tables support this type of query.
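
A minimal sketch of issuing a snapshot query and an incremental query from Spark follows; it assumes the SparkSession from the write examples below and the table written to /tmp/hudi/h0, and the begin-instant value is a placeholder commit timestamp.

    import org.apache.hudi.DataSourceReadOptions._

    // Snapshot query (the default query type): read the latest committed data.
    // Following the quick-start convention, the glob covers the partition
    // directories (one level here, "dt") plus the data files beneath them.
    val snapshot = spark.read.format("hudi")
      .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_SNAPSHOT_OPT_VAL)
      .load("/tmp/hudi/h0/*/*")
    snapshot.select("id", "name", "price", "version", "dt").show()

    // Incremental query: read only records committed after the given instant.
    // "20210101000000" is a placeholder; use a real commit time from the table's timeline.
    val incremental = spark.read.format("hudi")
      .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL)
      .option(BEGIN_INSTANTTIME_OPT_KEY, "20210101000000")
      .load("/tmp/hudi/h0")
    incremental.show()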

Write data

In EMR-3.32.0 and later versions, the Hudi service is included by default, so to use Hudi you only need to add the Hudi dependency to your pom file:

<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-spark_2.11</artifactId>
    <version>0.6.0</version>
    <scope>provided</scope>
</dependency>

Examples of writing data are as follows:

  • Insert or update data (an upsert variant is sketched after these examples)

    
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig._
    import org.apache.hudi.hive.MultiPartKeysValueExtractor
    import org.apache.hudi.keygen.SimpleKeyGenerator
    import org.apache.spark.sql.SaveMode._
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("hudi test")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    import spark.implicits._
    val df = (for (i <- 0 until 10) yield (i, s"a$i", 30 + i * 0.2, 100 * i + 10000, s"p${i % 5}"))
      .toDF("id", "name", "price", "version", "dt")

    df.write.format("hudi")
      .option(TABLE_NAME, "hudi_test_0")
      // .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL) // update data
      .option(OPERATION_OPT_KEY, INSERT_OPERATION_OPT_VAL) // insert data
      .option(RECORDKEY_FIELD_OPT_KEY, "id")
      .option(PRECOMBINE_FIELD_OPT_KEY, "version")
      .option(KEYGENERATOR_CLASS_OPT_KEY, classOf[SimpleKeyGenerator].getName)
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[MultiPartKeysValueExtractor].getCanonicalName)
      .option(PARTITIONPATH_FIELD_OPT_KEY, "dt")
      .option(HIVE_PARTITION_FIELDS_OPT_KEY, "dt") // Hive partition column, kept consistent with the partition path field
      .option(META_SYNC_ENABLED_OPT_KEY, "true")   // enable metadata synchronization
      .option(HIVE_USE_JDBC_OPT_KEY, "false")
      .option(HIVE_DATABASE_OPT_KEY, "default")
      .option(HIVE_TABLE_OPT_KEY, "hudi_test_0")
      .option(INSERT_PARALLELISM, "8")
      .option(UPSERT_PARALLELISM, "8")
      .mode(Overwrite)
      .save("/tmp/hudi/h0")
  • Delete data

    
    // Reuses the SparkSession, imports, and df from the insert example above;
    // the record keys in df determine which rows are deleted.
    df.write.format("hudi")
      .option(TABLE_NAME, "hudi_test_0")
      .option(OPERATION_OPT_KEY, DELETE_OPERATION_OPT_VAL) // delete data
      .option(RECORDKEY_FIELD_OPT_KEY, "id")
      .option(PRECOMBINE_FIELD_OPT_KEY, "version")
      .option(KEYGENERATOR_CLASS_OPT_KEY, classOf[SimpleKeyGenerator].getName)
      .option(PARTITIONPATH_FIELD_OPT_KEY, "dt")
      .option(DELETE_PARALLELISM, "8")
      .mode(Append)
      .save("/tmp/hudi/h0")
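
The insert example above writes the whole table with Overwrite mode. To update existing records in place instead, a minimal sketch (reusing the same SparkSession, implicits, Hudi imports, and table as above) switches the operation to UPSERT and appends; the updated values below are illustrative only.

    // Modify a few of the existing rows: new price and a bumped version for ids 0-4.
    val updates = (for (i <- 0 until 5) yield (i, s"a$i", 40 + i * 0.2, 100 * i + 20000, s"p${i % 5}"))
      .toDF("id", "name", "price", "version", "dt")

    updates.write.format("hudi")
      .option(TABLE_NAME, "hudi_test_0")
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL) // update rows whose key already exists, insert the rest
      .option(RECORDKEY_FIELD_OPT_KEY, "id")
      .option(PRECOMBINE_FIELD_OPT_KEY, "version")
      .option(KEYGENERATOR_CLASS_OPT_KEY, classOf[SimpleKeyGenerator].getName)
      .option(PARTITIONPATH_FIELD_OPT_KEY, "dt")
      .option(UPSERT_PARALLELISM, "8")
      .mode(Append) // Append keeps the existing table contents and merges the changes
      .save("/tmp/hudi/h0")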

Query data

When querying a Hudi table with Hive or Presto, the metadata synchronization function must be enabled during the write phase (set META_SYNC_ENABLED_OPT_KEY to true).

For the community version of Hudi, querying Copy On Write and Merge On Read tables requires setting hive.input.format = org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat. With the EMR version, Copy On Write tables do not require this setting.
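
Once metadata synchronization is enabled, the table is also registered in the Hive Metastore and can be queried by name. A rough illustration from Spark SQL follows; it assumes a Hive-enabled SparkSession with the Hudi bundle on the classpath, and for the community version querying a Copy On Write table you may additionally need spark.sql.hive.convertMetastoreParquet=false.

    // List the synced partitions, then read a single partition of the table written above.
    spark.sql("SHOW PARTITIONS default.hudi_test_0").show()
    spark.sql("SELECT id, name, price, version, dt FROM default.hudi_test_0 WHERE dt = 'p0'").show()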

For more information about Hudi, please see https://hudi.apache.org/.
