What is Apache Hudi?

Foreword

This article is part of the column "Big Data Technology System". The column is the author's original work; please credit the source when citing it. Please point out any shortcomings or mistakes in the comments. Thank you!

For the column's table of contents and references, please refer to Big Data Technology System.


Background

In 2016, Uber developed Apache Hudi (originally called Hoodie), an incremental processing framework that powers business-critical data pipelines with low latency and high efficiency.

A year later, Uber chose to open-source the solution, allowing other data-dependent organizations to take advantage of it, and then in 2019, took that commitment a step further by donating it to the Apache Software Foundation.

A year and a half later, Apache Hudi graduated to become a top-level project of the Apache Software Foundation.


What is Hudi?


Apache Hudi is a next-generation streaming data lake platform. It builds a streaming data lake with incremental data pipelines on top of a self-managing storage layer, while also being optimized for data lake engines and regular batch processing.

Not only is Apache Hudi well suited to real-time computing workloads, it also enables building efficient incremental offline (batch) data pipelines.

Using primitives such as upsert and incremental pull, Apache Hudi brings stream-style processing to batch-oriented big data.

These features help serve faster, fresher data to our services through a unified serving layer with data latency on the order of minutes, avoiding the overhead of maintaining multiple systems.

In addition, for flexibility, Apache Hudi can run on the Hadoop Distributed File System (HDFS) or on cloud storage.

Hudi implements Atomicity, Consistency, Isolation, and Durability (ACID) semantics on data lakes.

Two of Hudi's most widely used features are upsert (update/insert) and incremental pull, which let users ingest change data capture (CDC) streams and apply them to the data lake at scale.
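As a concrete illustration of the upsert primitive, here is a minimal, hedged sketch using Hudi's Spark DataSource API from PySpark. It assumes a Spark session with the Hudi Spark bundle on the classpath; the table name, path, and column names are purely illustrative.

    from pyspark.sql import SparkSession

    # Assumes the Hudi Spark bundle is on the classpath (e.g. via --packages).
    spark = (SparkSession.builder
             .appName("hudi-upsert-sketch")
             .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
             .getOrCreate())

    base_path = "hdfs:///warehouse/hudi/trips"   # hypothetical table location

    # Illustrative change records: `id` is the record key, `ts` breaks ties between versions.
    updates = spark.createDataFrame(
        [(1, "trip-1", 9.5, 1700000000, "2023-01-01"),
         (2, "trip-2", 3.2, 1700000001, "2023-01-01")],
        ["id", "name", "fare", "ts", "dt"])

    hudi_options = {
        "hoodie.table.name": "trips",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.partitionpath.field": "dt",
        "hoodie.datasource.write.operation": "upsert",
    }

    # Upsert: rows whose key already exists are updated, new keys are inserted.
    (updates.write.format("hudi")
            .options(**hudi_options)
            .mode("append")
            .save(base_path))

Re-running the write with changed fare values for the same ids would update those records in place rather than appending duplicates, which is what makes CDC-style ingestion practical.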

Hudi also provides extensive, pluggable indexing capabilities, along with its own data-indexing implementation.

Hudi's ability to control and manage the layout of files in the data lake is critical not only for overcoming HDFS NameNode and cloud-storage limitations, but also for maintaining a healthy data ecosystem with better reliability and query performance.

To this end, Hudi supports multiple query engine integrations such as Presto, Apache Hive, Apache Spark, and Apache Impala.



Characteristics

  • Upserts and deletes via fast, pluggable indexing.
  • Incremental queries and record-level change streams.
  • Transactions, rollbacks, and concurrency control.
  • SQL reads and writes from Spark, Presto, Trino, Hive, and more (see the sketch after this list).
  • Automatic file sizing, data clustering, compaction, and cleaning.
  • Streaming ingestion with built-in CDC sources and tools.
  • Built-in metadata tracking for scalable storage access.
  • Backward-compatible schema evolution and enforcement.
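To illustrate the SQL bullet above, the following is a hedged sketch of creating and modifying a Hudi table from Spark SQL, reusing the spark session from the earlier sketch. It assumes the Hudi SQL extension is enabled (spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension); table and column names are illustrative, and the exact DDL properties vary slightly between Hudi versions.

    # Create a copy-on-write Hudi table with a primary key and precombine field.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS hudi_trips (
            id INT, name STRING, fare DOUBLE, ts BIGINT, dt STRING
        ) USING hudi
        TBLPROPERTIES (type = 'cow', primaryKey = 'id', preCombineField = 'ts')
        PARTITIONED BY (dt)
    """)

    # Standard DML is routed through Hudi's transactional write path.
    spark.sql("INSERT INTO hudi_trips VALUES (1, 'trip-1', 9.5, 1700000000, '2023-01-01')")
    spark.sql("UPDATE hudi_trips SET fare = 11.0 WHERE id = 1")
    spark.sql("SELECT id, name, fare FROM hudi_trips").show()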

Internal details


At the top level, Hudi is conceptually divided into three main components: the raw data to be stored, a data index that provides the upsert capability, and the metadata used to manage the dataset.

At its core, Hudi maintains a timeline of all actions performed on a data table at different points in time, called instants in Hudi. This provides instantaneous views of the table while also efficiently supporting retrieval of data in order of arrival.

Hudi guarantees that the actions performed on the timeline are atomic and consistent with respect to the instant time, i.e. the time at which the change took place in the database.

With this information, Hudi provides different views of the same Hudi table: a read-optimized view for fast columnar query performance, a real-time view for fast data ingestion, and an incremental view that reads a Hudi table as a changelog stream.
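Assuming the spark session and base_path from the earlier sketch, these views are selected with the hoodie.datasource.query.type option of the Hudi Spark DataSource. A rough illustration of the snapshot and read-optimized views follows; an incremental read is sketched further down.

    # Snapshot (real-time) view: the latest committed data; for merge-on-read tables
    # base and log files are merged on the fly.
    snapshot_df = spark.read.format("hudi").load(base_path)
    snapshot_df.select("_hoodie_commit_time", "id", "fare").show()

    # Read-optimized view: only the columnar base files, trading some freshness for
    # pure columnar scan performance.
    read_optimized_df = (spark.read.format("hudi")
                         .option("hoodie.datasource.query.type", "read_optimized")
                         .load(base_path))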

Hudi organizes data tables into a directory structure under the base path of the distributed file system.

Tables are broken up into partitions, and within each partition files are organized into file groups, each uniquely identified by a file ID.

Each file group contains several file slices, where each slice consists of a base file (*.parquet) produced at a particular commit/compaction instant, together with a set of log files (*.log.*) containing the inserts/updates made to the base file since it was generated.

Hudi employs Multi-Version Concurrency Control (MVCC), where compaction operations merge log files and base files to produce new file slices, and cleaning operations remove unused/older file slices to reclaim space on the file system.
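To make this layout concrete, here is a rough, standard-library-only sketch that simply lists a Hudi table's timeline and data files on a local filesystem. The path is hypothetical, and file/directory naming details (for example hidden log files and the exact instant-file suffixes) vary across Hudi versions.

    import os

    table_path = "/tmp/hudi/trips"   # hypothetical local table path

    # Timeline: each completed action is recorded under .hoodie as an instant file
    # named by its instant time (commits, delta commits, cleans, compactions, ...).
    for name in sorted(os.listdir(os.path.join(table_path, ".hoodie"))):
        if name.endswith((".commit", ".deltacommit", ".clean")):
            print("instant:", name)

    # Data files: base files are Parquet, log files hold the inserts/updates recorded
    # since the corresponding base file was written.
    for root, _, files in os.walk(table_path):
        for name in files:
            if name.endswith(".parquet") or ".log." in name:
                print(os.path.join(root, name))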

Hudi supports two table types: copy-on-write and merge-on-read.

The copy-on-write table type stores data exclusively in columnar file formats such as Apache Parquet.

With copy-on-write, updates simply version and rewrite the affected files by performing a synchronous merge during the write.

The merge-on-read table type stores data using a combination of columnar (e.g. Apache Parquet) and row-based (e.g. Apache Avro) file formats.

Updates are logged to delta files and then compacted to generate new versions of the columnar files either synchronously or asynchronously.
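As a hedged sketch of the merge-on-read path, the snippet below writes the illustrative updates DataFrame from the earlier upsert example as a merge-on-read table with inline compaction, using Hudi's table-type and compaction write configs; the threshold value is just an example.

    mor_options = {
        "hoodie.table.name": "trips_mor",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.partitionpath.field": "dt",
        "hoodie.datasource.write.operation": "upsert",
        # Merge log files into new base files synchronously after every 5 delta commits;
        # asynchronous compaction is also supported.
        "hoodie.compact.inline": "true",
        "hoodie.compact.inline.max.delta.commits": "5",
    }

    (updates.write.format("hudi")
            .options(**mor_options)
            .mode("append")
            .save("hdfs:///warehouse/hudi/trips_mor"))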

Hudi also supports two query types: snapshot and incremental queries.

A snapshot query requests a "snapshot" of the table as of a given commit or compaction action.

For snapshot queries, the copy-on-write table type exposes only the base/columnar files in the latest file slices and guarantees the same columnar query performance as a non-Hudi columnar table.

Copy-on-write therefore provides a drop-in replacement for existing Parquet tables while adding upsert/delete and other functionality.

For merge-on-read tables, snapshot queries expose near-real-time data (on the order of minutes) by merging the base and delta files of the latest file slices on the fly.

On copy-on-write tables, an incremental query returns the new data written to the table since a given commit or compaction, providing a change stream that enables incremental data pipelines.
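Under the same assumptions as the earlier snippets (the spark session and base_path are reused, and the begin instant below is a placeholder that a real pipeline would checkpoint), an incremental pull looks roughly like this:

    # Read only the records committed after the given instant time.
    begin_time = "20230101000000"   # placeholder instant time (yyyyMMddHHmmss)

    incremental_df = (spark.read.format("hudi")
                      .option("hoodie.datasource.query.type", "incremental")
                      .option("hoodie.datasource.read.begin.instanttime", begin_time)
                      .load(base_path))

    # Each row carries Hudi metadata columns such as _hoodie_commit_time, which a
    # downstream pipeline can checkpoint to drive its next incremental pull.
    incremental_df.select("_hoodie_commit_time", "id", "fare").show()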
