Hudi basics explained

Hudi overview

Hudi is a data lake storage format that adds the ability to update and delete data, and to consume changed data, on top of the Hadoop file system. It supports a variety of compute engines, provides IUD (insert, update, delete) interfaces, and offers streaming primitives for upserts and incremental pulls on HDFS datasets.
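As a rough illustration of the IUD interface, the sketch below builds the write options that the Hudi Spark datasource expects for an upsert. The option keys are standard Hudi configuration names; the helper `hudi_upsert_options` and the table and column names (`orders`, `order_id`, `updated_at`, `order_date`) are made up for this example. The block only builds a dictionary, so it runs without Spark.

```python
def hudi_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Build Hudi Spark datasource write options for an upsert (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        # Field that uniquely identifies a record
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the latest record when two share the same key
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Field used to derive the partition path
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Other documented values include "insert", "bulk_insert" and "delete"
        "hoodie.datasource.write.operation": "upsert",
    }


# Table and column names here are assumptions for illustration only.
options = hudi_upsert_options("orders", "order_id", "updated_at", "order_date")

# With a Spark session and the Hudi bundle available, the options would be used as:
#   df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Switching the operation value to `delete` would turn the same write path into a delete of the records whose keys appear in the DataFrame.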

Architecture diagram

[Image: Hudi architecture diagram, not reproduced here]

Hudi features

  • ACID transactions support both real-time and batch ingestion into the lake.
  • Multiple view capabilities (read-optimized, incremental, and real-time views) support fast data analysis.
  • MVCC design supports tracing back to earlier data versions.
  • Automatically manages file size and layout to optimize query performance, while near-real-time ingestion provides the latest data for queries.
  • Supports concurrent reads and writes: snapshot-based isolation allows data to be read while it is being written.
  • Supports in-place table conversion, turning existing historical tables into Hudi datasets.

Hudi key technologies and advantages

  • Pluggable index mechanism: Hudi provides several index types, which let it complete update and delete operations on massive datasets quickly.
  • Good ecosystem support: Hudi can be accessed from multiple engines, including Hive, Spark, HetuEngine, and Flink.
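To make the index idea concrete, here is a toy model in Python. It is not Hudi's implementation: an in-memory dictionary stands in for the index and maps each record key to the file that holds it. The names `tag_records` and `key_to_file` are hypothetical. In a real table the index type is selected with the `hoodie.index.type` option.

```python
def tag_records(key_to_file, incoming_keys):
    """Split incoming keys into updates (key already indexed) and inserts (new key)."""
    updates = {}   # file id -> keys that must be updated in that file
    inserts = []   # keys that the index does not know yet
    for key in incoming_keys:
        file_id = key_to_file.get(key)
        if file_id is None:
            inserts.append(key)
        else:
            updates.setdefault(file_id, []).append(key)
    return updates, inserts


# Hypothetical index contents: record key -> file id
key_to_file = {"k1": "file-A", "k2": "file-A", "k3": "file-B"}
updates, inserts = tag_records(key_to_file, ["k2", "k3", "k9"])
print(updates)  # {'file-A': ['k2'], 'file-B': ['k3']}
print(inserts)  # ['k9']
```

Because the lookup tells the writer which files contain the affected keys, an update or delete only has to touch those files instead of scanning the whole table.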

Hudi supports two table types

Copy On Write

A copy-on-write table, also called a COW table, stores its data in Parquet files. An update is applied by rewriting the original Parquet file.

  • Advantages: when reading, only one data file in the corresponding partition needs to be read, which is more efficient.
  • Disadvantages: when writing, the previous version of the file must be copied and a new data file generated from it, which is time-consuming. Because writes take longer, the data seen by read requests lags behind.
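The copy-on-write trade-off can be sketched with a toy model in Python. This is a simplification, not Hudi code: a list of dictionaries stands in for the successive versions of one Parquet data file, and `cow_upsert` and `cow_read` are hypothetical names.

```python
def cow_upsert(file_versions, updates):
    """Copy the latest file version, apply the changes, and store it as a new version."""
    new_version = dict(file_versions[-1])  # full copy of the previous data file
    new_version.update(updates)            # apply inserts and updates by record key
    file_versions.append(new_version)
    return len(new_version)                # number of records rewritten by this commit


def cow_read(file_versions):
    """A read only needs the latest version of the file."""
    return file_versions[-1]


versions = [{"k1": "a", "k2": "b", "k3": "c"}]
rewritten = cow_upsert(versions, {"k2": "B"})
print(rewritten)           # 3
print(cow_read(versions))  # {'k1': 'a', 'k2': 'B', 'k3': 'c'}
print(versions[0]["k2"])   # b
```

Changing one record rewrote all three, which is where the write cost comes from, while the read stays a single lookup. The older version is still present, which is what makes version backtracking possible.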

Merge On Read

A merge-on-read table, also called a MOR table, stores data in two formats: columnar Parquet and row-based Avro. Parquet files store the base data, and Avro files (also called log files) store the incremental data.

  • Advantages: writes go to the delta log first, and because the delta log is small, the write cost is low.
  • Disadvantages: the table needs to be compacted regularly, otherwise fragmented files accumulate. Read performance is also worse, because the delta log has to be merged with the old data files at read time.
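A toy model in Python shows where the merge-on-read costs move. It is a sketch under simplifying assumptions, not Hudi's implementation: a dictionary stands in for the Parquet base file, a list of dictionaries for the Avro log file, and the class name `MorFileGroup` is hypothetical.

```python
class MorFileGroup:
    """Toy merge-on-read file group: one base file plus an append-only log."""

    def __init__(self, base_records):
        self.base = dict(base_records)  # stands in for the Parquet base file
        self.log = []                   # stands in for the Avro log file

    def upsert(self, updates):
        """Append the changes to the log; the base file is not rewritten."""
        self.log.append(dict(updates))
        return len(updates)             # number of records written by this commit

    def read_latest(self):
        """Merge every log entry over the base data, oldest first."""
        merged = dict(self.base)
        for delta in self.log:
            merged.update(delta)
        return merged

    def compact(self):
        """Fold the log into a new base file and clear the log."""
        self.base = self.read_latest()
        self.log = []


group = MorFileGroup({"k1": "a", "k2": "b", "k3": "c"})
written = group.upsert({"k2": "B"})
print(written)              # 1
print(len(group.log))       # 1
print(group.read_latest())  # {'k1': 'a', 'k2': 'B', 'k3': 'c'}
group.compact()
print(len(group.log))       # 0
```

The write touched one record instead of three, but every read has to replay the log until `compact` runs, and the log keeps growing if compaction is skipped.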

Hudi supports three views, providing corresponding reading capabilities for different scenarios

Snapshot View

Real-time view: this view provides the latest snapshot of the current Hudi table. As soon as new data is written to the table, it can be queried through this view.
Both COW and MOR tables support this view.

Incremental View

Incremental view: this view provides incremental queries. It returns the data written after a specified commit and can be used to pull changes quickly.
COW tables support this view. MOR tables can support it as well, but once a MOR table completes a compaction, its incremental view capability is lost.
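The incremental pull can be sketched in Python as a filter over the commit timeline. This is a toy model with made-up instant times and records, and `incremental_pull` is a hypothetical name. It assumes the begin instant is exclusive, which matches the Hudi 0.x Spark datasource as far as I know; check the behavior for your version. The option keys in `incremental_options` are the standard Hudi read options.

```python
def incremental_pull(timeline, begin_instant):
    """Return the records from commits whose instant time is later than begin_instant."""
    changed = {}
    # Instant times are fixed-width timestamp strings, so string order is time order.
    for instant, records in sorted(timeline.items()):
        if instant > begin_instant:
            changed.update(records)
    return changed


# Hypothetical timeline: commit instant -> records written by that commit
timeline = {
    "20230701120000": {"k1": "a", "k2": "b"},
    "20230702120000": {"k2": "B"},
    "20230703120000": {"k3": "c"},
}
changes = incremental_pull(timeline, "20230701120000")
print(changes)  # {'k2': 'B', 'k3': 'c'}

# Read options for the same query through the Hudi Spark datasource:
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20230701120000",
}
#   spark.read.format("hudi").options(**incremental_options).load(base_path)
```

A downstream job would typically remember the last instant it processed and pass it as the begin instant on the next run.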

Read Optimized View

Read-optimized view: this view only provides the data stored in the latest version of the Parquet files.

This view behaves differently on COW and MOR tables:

For a COW table, the view is equivalent to the real-time view, because a COW table stores data only in Parquet files.

For a MOR table, only the base files are read, so the view exposes the data of each file slice as of the last compaction. Put simply, the view returns only the data stored in the MOR table's Parquet files and ignores the data in the log files. The result is therefore not necessarily up to date. Once a compaction of the MOR table completes, the incremental log data is merged into the base data, and at that point the view returns the same data as the real-time view.
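The difference between the two views on a MOR table can be shown with a small Python sketch. It is a toy model, not Hudi code: `base` stands in for a Parquet base file, `log` for its Avro log file, and the function names are hypothetical. In the Spark datasource the view is selected with `hoodie.datasource.query.type`, using the value `read_optimized` or `snapshot`.

```python
def read_optimized_view(base_file, log_file):
    """Read only the base (Parquet) data; the log is ignored."""
    return dict(base_file)


def snapshot_view(base_file, log_file):
    """Merge the log entries over the base data to get the latest state."""
    merged = dict(base_file)
    for delta in log_file:
        merged.update(delta)
    return merged


def compact(base_file, log_file):
    """Fold the log into a new base file and return it with an empty log."""
    return snapshot_view(base_file, log_file), []


base = {"k1": "a", "k2": "b"}
log = [{"k2": "B"}, {"k3": "c"}]

before_ro = read_optimized_view(base, log)
before_snap = snapshot_view(base, log)
print(before_ro)                             # {'k1': 'a', 'k2': 'b'}
print(before_snap)                           # {'k1': 'a', 'k2': 'B', 'k3': 'c'}

base, log = compact(base, log)
after_ro = read_optimized_view(base, log)
print(after_ro == snapshot_view(base, log))  # True
```

Before compaction the read-optimized result misses the update to `k2` and the new key `k3`; after compaction both views return the same data.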


Origin blog.csdn.net/weixin_43114209/article/details/131677029