Data Lake Architecture Hudi (3) Hudi Core Concepts

3. Core concepts of Apache Hudi

3.1 Basic concepts

Hudi provides the concept of a Hudi table. These tables support CRUD operations, can store their data files on an existing big data cluster such as HDFS, and can then be queried and analyzed with engines such as Spark SQL or Hive.
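As a rough illustration of that workflow, here is a minimal Spark (Scala) sketch; the table name user_events, the HDFS paths, and the field names uuid/name/ts/dt are hypothetical, and exact options may vary with the Hudi and Spark versions in use:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-basic-example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Hudi recommends Kryo serialization
  .getOrCreate()

val basePath = "hdfs:///warehouse/hudi/user_events" // hypothetical base path on HDFS

// df is assumed to contain the columns uuid, name, ts and dt
val df = spark.read.json("hdfs:///staging/user_events.json")

df.write.format("hudi")
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.recordkey.field", "uuid")    // record key
  .option("hoodie.datasource.write.partitionpath.field", "dt")  // partition path
  .option("hoodie.datasource.write.precombine.field", "ts")     // pre-combine field used for deduplication
  .mode(SaveMode.Append)
  .save(basePath)

// Read the table back and query it with Spark SQL
spark.read.format("hudi").load(basePath).createOrReplaceTempView("user_events")
spark.sql("select uuid, name, ts from user_events where dt = '2023-01-01'").show()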

There are three main components of a Hudi table:

  • Ordered timeline metadata, similar to a database transaction log.

  • Data files with hierarchical layout: the data actually written into the table;

  • Index (multiple implementations): maps a given record key to the data file / file group that contains it.

3.1.1 Timeline (time axis)

Hudi maintains a timeline (Timeline) of all operations performed on the table (such as adds, modifications, or deletes) at different instants (Instant).

Every operation on a Hudi table's dataset generates an Instant on the table's Timeline, which makes it possible to query only the data successfully committed after a certain point in time, or only the data up to a certain point in time, effectively avoiding scans over a larger time range.

At the same time, the state before a change can still be queried efficiently (for example, after an Instant has committed a change, a query as of an earlier point in time still returns the data as it was before the modification).

The Timeline is the abstraction Hudi uses to manage commits. Each commit is bound to a fixed timestamp and placed on the timeline. On the Timeline, each commit is abstracted as a HoodieInstant, which records the action, timestamp, and state of that commit.


An instant consists of the following three parts:

1) Instant action: the type of operation performed on the table
COMMITS: a commit denotes an atomic write of a batch of records into a table.
CLEANS: a background activity that removes older versions of files in the table that are no longer needed.
DELTA_COMMIT: a delta commit denotes an atomic write of a batch of records into a MergeOnRead-type table, where some or all of the data may be written to delta logs.
COMPACTION: a background activity that reconciles Hudi's internal differential data structures, for example merging updates from row-based log files into columnar data files. Internally it appears as a special commit on the timeline.
ROLLBACK: indicates that a commit/delta commit was unsuccessful and was rolled back, removing any partial files produced during that write.
SAVEPOINT: marks certain file groups as "saved" so that they will not be deleted. In disaster-recovery scenarios it helps restore the dataset to a point on the timeline.

2) Instant time
Usually a timestamp (e.g. 20190117010349) that increases monotonically in the order in which the actions begin.

3) State
REQUESTED: the action has been scheduled but not yet started.
INFLIGHT: the action is currently being executed.
COMPLETED: the action has been completed on the timeline.

4) Time concepts
Arrival time: the time when the data arrives at Hudi, i.e. the commit time.
Event time: the time recorded inside the record.

In the figure below, the hour is used as the partition field. Commits are generated successively from 10:00 onward, and a record with an event time of 9:00 arrives at 10:20; it still lands in the partition corresponding to 9:00. By consuming incremental updates after 10:00 directly from the timeline (consuming only the file groups with new commits), this delayed data can still be picked up.
(figure: commits on the timeline, with a late-arriving 9:00 record landing in the 9:00 partition)

3.1.2 File Management

Hudi organizes datasets on DFS into a directory structure under the base path (HoodieWriteConfig.BASE_PATH_PROP).

A dataset is divided into partitions (DataSourceOptions.PARTITIONPATH_FIELD_OPT_KEY), which, much like in a Hive table, are folders containing the data files for that partition.

As shown in the figure below, within each partition, files are organized into file groups, each uniquely identified by a file id. Each file group contains multiple file slices, where each slice contains a base columnar file (*.parquet) produced at a certain commit/compaction instant, together with a set of log files (*.log.*) containing inserts/updates to the base file since it was produced.
(figure: partition layout with file groups and file slices)

Hudi's base file (a parquet file) stores, in its footer metadata, a BloomFilter built from the record keys; the file-based index implementation uses it for efficient key-containment checks.
Hudi's log file (an avro file) is self-describing: accumulated data buffers are written out in units of LogBlocks. Each LogBlock contains a magic number, size, content, footer and other information, which is used for reading, validating, and filtering the data.


3.1.3 Index

3.1.3.1 Index Principle

  • Hudi provides efficient Upsert operations through its indexing mechanism, which maps the combination of RecordKey + PartitionPath, used as a unique identifier, to a file ID. The mapping between this unique identifier and the file group/file ID never changes once the record has been written to a file group. Therefore, a FileGroup contains all versions of a given batch of records, and the index is what distinguishes whether an incoming message is an INSERT or an UPDATE.

  • To eliminate unnecessary reads and writes, Hudi introduced indexes. With an index, updated data can be quickly located in its corresponding File Group. Taking the picture below as an example (white boxes are base files, yellow boxes are incoming updates), the indexing mechanism makes it possible to avoid reading unnecessary files, avoid updating unnecessary files, and avoid a distributed join between the updated data and the historical data: the merge only needs to happen within each File Group.

(figure: white base files and yellow updates mapped to their File Groups by the index)

3.1.3.2 Index Types

Bloom Index (default): uses a Bloom filter to decide whether a record already exists, and can optionally use record-key ranges to prune the set of candidate files.
  Advantages: efficient, no dependence on external systems, data and index stay consistent.
  Shortcomings: because of false positives, candidate files still need to be read back and checked.

Simple Index: joins the incoming update/delete records against the existing data.
  Advantages: simplest to implement, requires no additional resources.
  Shortcomings: poor performance.

HBase Index: stores the index in HBase. During the tag-location phase, all tasks send batch Get requests to HBase to look up the mapping information for each record key.
  Advantages: high lookup efficiency for small batches of keys.
  Shortcomings: requires an external system, which increases the operations burden.

Flink State-based Index: the Flink writer introduced in Hudi 0.8.0 uses Flink state as the underlying index storage; each record computes its target bucket before being written. Unlike the BloomFilter index, it avoids re-probing file-based indexes on every write.

Note: Flink has only the state-based index (plus bucket_index); the other index types are optional configurations for Spark.

3.1.3.3 Global and non-global indexes

  • Global index: It is mandatory to keep the key unique under all partition ranges of the whole table, that is, to ensure that there is only one corresponding record for a given key.

  • Non-global index: The key is only required to be unique within a certain partition of the table, and it relies on the writer to provide a consistent partition path for the update and deletion of the same record.

From the perspective of index maintenance cost and write performance, maintaining a global index is more difficult and has a greater impact on write performance, so non-global indexes are usually preferred.

The HBase index is essentially a global index, and both bloom and simple index have global options:

Ø hoodie.index.type=GLOBAL_BLOOM

Ø hoodie.index.type=GLOBAL_SIMPLE
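For illustration, a hedged sketch of switching the index type on a Spark write (reusing the hypothetical user_events table, df and basePath from the sketch in 3.1):

// The index type is a write option; GLOBAL_BLOOM / GLOBAL_SIMPLE enforce key uniqueness across all partitions
df.write.format("hudi")
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.index.type", "GLOBAL_BLOOM") // or GLOBAL_SIMPLE, BLOOM, SIMPLE, HBASE
  .mode(SaveMode.Append)
  .save(basePath)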

3.1.3.4 Index selection strategy

(1) Delayed update of the fact table

Most updates land on the latest few partitions, with only a small fraction touching older partitions. For this workload the Bloom index performs very well, because a properly sized Bloom filter can prune a large number of data files during index lookup. In addition, if the generated keys follow some ordering, the number of files to compare can be reduced further by range pruning: Hudi builds an interval tree from the key ranges of all files and uses it to efficiently exclude files that cannot contain the keys of the incoming update/delete records.

Hudi supports dynamic Bloom filters (set hoodie.bloom.index.filter.type=DYNAMIC_V0), which adjust their size according to the number of records stored in the file so as to achieve the configured false-positive rate.

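A hedged sketch of the index options one might set for this fact-table workload (the values are illustrative; they would be passed via .options(...) on the same kind of write as in 3.1):

// Index configuration for a fact table whose updates mostly hit recent partitions
val factTableIndexOpts = Map(
  "hoodie.index.type" -> "BLOOM",                   // default Bloom index with range pruning
  "hoodie.bloom.index.filter.type" -> "DYNAMIC_V0"  // dynamic Bloom filter sized to meet the target false-positive rate
)
// df.write.format("hudi").options(factTableIndexOpts) ... .save(basePath)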
(2) Deduplication of the event table

The number of events emitted by Kafka or other similar message buses is typically 10 to 100 times the size of the fact table. Since most of this data is append-only, inserts and updates only touch the most recent few partitions. Because duplicate events can be introduced at any node of the data pipeline, deduplicating before storing the events in the data lake is a common requirement.

In general, deduplication at low cost is a very challenging task. Although a key-value store could be used for deduplication (i.e. the HBase index), the index storage would grow linearly with the number of events and quickly become infeasible. In practice, a Bloom index with range pruning is the best solution here.

(3) Random updates and deletes of dimension tables

Updates to this kind of table touch most files even when range comparison is used. The Simple index is a better fit for this scenario, because it does not attempt any up-front pruning and instead joins directly against the relevant fields of all files. If the additional operational cost is acceptable, the HBase index can also be adopted, offering superior lookup efficiency for these tables.

3.1.4 Types of Hudi tables

3.1.4.1 Copy On Write (COW)

  • In a COW table there are only data files/base files (.parquet); there are no incremental log files (.log.*).

  • Each new batch of writes creates a new version of the affected data files (a new FileSlice); the new version contains the records of the old version merged with the matching records from the incoming batch (a full, latest copy).

  • COW incurs some write latency because of the merging done during writes, but its advantage is simplicity: it does not require additional table services (such as compaction) and is relatively easy to debug.
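A hedged sketch of writing a COW table (COPY_ON_WRITE is the default table type; the table name and path are hypothetical, continuing the example from 3.1):

df.write.format("hudi")
  .option("hoodie.table.name", "user_events_cow")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") // default: each write rewrites full base files
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/hudi/user_events_cow")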

(figure: a Copy On Write example with File groups 1-4)

As shown in the figure above, a batch of new writes arrives; after index lookup, some records are found to match File group 1 and File group 2, while the remaining inserts go to a newly created file group (File group 4). Both data_file1 and data_file2 therefore get newer versions: data_file1 V2 is the merge of the contents of data_file1 V1 with the matching records from the incoming batch.

3.1.4.2 Merge On Read (MOR)

  • A MOR table contains columnar base files (.parquet) and row-based incremental log files (avro format, .log.*).

  • As the name implies, the merge cost of the MOR table is on the read side. So during write we don't merge or create newer data file versions. After marking/indexing is complete, for existing data files that have records to be updated, Hudi creates incremental log files and names them appropriately so that they all belong to one filegroup.

  • The read side merges the base files with their respective incremental log files on the fly. Every read therefore has higher latency (because of the query-time merge), so Hudi provides a compaction mechanism to merge the data files and log files together and create an updated version of the data file.

Compaction can be run in inline or asynchronous mode. Hudi also provides different compaction strategies to choose from; the most commonly used one is based on the number of commits. For example, you can configure the maximum number of delta commits before compaction to 4: after 4 incremental writes, the data file is compacted and an updated version of it is created. Once compaction completes, readers only need to read the latest data file and no longer care about the old versions.
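For example, a hedged sketch of a MOR table configured for inline compaction after 4 delta commits (table name, path and option values are illustrative):

df.write.format("hudi")
  .option("hoodie.table.name", "user_events_mor")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.compact.inline", "true")                // run compaction inline as part of the write
  .option("hoodie.compact.inline.max.delta.commits", "4") // compact after 4 delta commits
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/hudi/user_events_mor")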

The write behavior of a MOR table differs slightly depending on the index:

For index schemes that cannot index log files (such as BloomFilter), INSERT messages are still written to base files (parquet format); only UPDATE messages are appended to log files (because the base file already records the FileGroup ID for those records).

For index schemes that can index log files, such as the state-based index of the Flink writer, every write goes to the log format and keeps appending and rolling over.
Trade-offs between COW and MOR:

Ø Data latency: COW higher, MOR lower

Ø Query latency: COW lower, MOR higher

Ø Update cost (I/O): COW high (rewrites the entire parquet file), MOR low (appends to the incremental log)

Ø Parquet file size: COW smaller (higher update I/O cost), MOR larger (lower update cost)

Ø Write amplification: COW high, MOR low (depends on the compaction strategy)

3.1.5 Query type of Hudi table

3.1.5.1 Snapshot Queries (snapshot query)

  • A snapshot query returns the latest snapshot of the table as of a specified commit/delta commit instant.

  • In the case of merge-on-read (MOR) tables, it provides near-real-time tables (minutes) by merging the base and delta files of the latest file slice on the fly.

  • For copy-on-write (COW), the latest version of the Parquet data file is queried.


3.1.5.2 Incremental Queries

An incremental query returns only the data newly written since a given commit/delta commit instant, effectively providing a change stream that enables incremental data pipelines.

3.1.5.3 Read Optimized Queries (MOR only)

A read-optimized query sees the latest snapshot of the table as of a given commit/compaction instant. Only the base/columnar files of the latest file slices are exposed to the query, which guarantees the same columnar query performance as a non-Hudi table.

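To make the query types concrete, a hedged Spark (Scala) sketch (reusing the spark session and the hypothetical paths from earlier; hoodie.datasource.query.type selects the view):

// Snapshot query (the default): for MOR tables it merges base and log files at query time
val snapshotDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load("hdfs:///warehouse/hudi/user_events_mor")

// Read-optimized query: only the latest base/columnar files are read (meaningful for MOR tables)
val readOptimizedDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("hdfs:///warehouse/hudi/user_events_mor")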
3.2 Data reading, writing and merging

3.2.1 Data read

3.2.1.1 Snapshot read

Read the files in the latest FileSlice of each FileGroup under all partitions: a Copy On Write table reads only parquet files, while a Merge On Read table reads parquet + log files.

3.2.1.2 Incremental Read

The current Spark data source can specify the start and end commit times to consume, reading the incremental dataset produced between those commits.

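A hedged sketch of such an incremental read (the spark session is the one from the earlier sketch; the instant timestamps are made up):

// Read only the records written between two commit instants
val incrementalDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20230101000000") // start commit time (hypothetical)
  .option("hoodie.datasource.read.end.instanttime", "20230102000000")   // optional end commit time (hypothetical)
  .load("hdfs:///warehouse/hudi/user_events")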
3.2.1.3 Streaming read

Since version 0.8.0, the Hudi Flink writer supports real-time incremental subscription, which can be used for CDC data synchronization and routine incremental ETL pipelines. Flink's streaming read is a true streaming read: the source periodically monitors newly added changed files and distributes them to the read tasks.

3.2.2 Data write

3.2.2.1 Write operation

(1) UPSERT: the default behavior. Records are first tagged by the index as INSERT or UPDATE, and heuristics decide how the messages are organized to optimize file sizes => CDC import

(2) INSERT: skips the index and writes more efficiently => log deduplication

(3) BULK_INSERT: sorts the input before writing; friendly for initializing a Hudi table with a large volume of data; makes a best effort on file size limits (writes HFile)
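The operation is chosen per write; a hedged sketch (same hypothetical table as in 3.1):

df.write.format("hudi")
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "bulk_insert") // "upsert" (default) | "insert" | "bulk_insert"
  .mode(SaveMode.Append)
  .save(basePath)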

3.2.2.2 Write process (UPSERT)

1)COW

(1) First deduplicate the records according to the record key

(2) Then create an index for this batch of data (HoodieKey => HoodieRecordLocation); the index distinguishes which records are updates and which are inserts (keys written for the first time)

(3) For the update message, it will directly find the base file of the latest FileSlice corresponding to the key, and write a new base file (new FileSlice) after doing merge

(4) For the insert message, all SmallFiles (base files smaller than a certain size) of the current partition will be scanned, and then merged to write a new FileSlice; if there is no SmallFile, directly write a new FileGroup + FileSlice

2)MOR

(1) First deduplicate the records according to the record key (optional)

(2) Then create an index for this batch of data (HoodieKey => HoodieRecordLocation); the index distinguishes which records are updates and which are inserts (keys written for the first time)

(3) If it is an insert message, if the log file cannot be indexed (default), it will try to merge the smallest base file (FileSlice that does not include the log file) in the partition to generate a new FileSlice; if there is no base file, write a new FileGroup + FileSlice + base file; if the log file can be indexed, try to append a smaller log file, if not, write a new FileGroup + FileSlice + base file

(4) If it is an update message, write the corresponding file group + file slice, and directly append the latest log file (if it happens to be the current smallest small file, it will merge the base file and generate a new file slice)

(5) When the log file size reaches the threshold, it will roll over a new one

3.2.2.3 Write process (INSERT)

1)COW

(1) First deduplicate the records according to the record key (optional)

(2) Index will not be created

(3) If there is a small base file, merge base file to generate a new FileSlice + base file, otherwise directly write a new FileSlice + base file

2)MOR

(1) First deduplicate the records according to the record key (optional)

(2) Index will not be created

(3) If the log file is indexable and has a small FileSlice, try to append or write the latest log file; if the log file is not indexable, write a new FileSlice + base file

3.2.2.4 Write process (INSERT OVERWRITE)

Creates a new set of file groups in the same partition; the existing file groups are marked "deleted". The number of new file groups depends on the number of new records.
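A hedged sketch of issuing such a write (same hypothetical table as in 3.1; insert_overwrite replaces the data of the partitions present in the incoming batch). The per-table behavior is summarized below.

df.write.format("hudi")
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "insert_overwrite") // overwrite only the touched partitions
  .mode(SaveMode.Append)
  .save(basePath)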

1) COW

Before the insert, the partition contains file1-t0.parquet and file2-t0.parquet.

Ø Insert overwrite with the same number of records: the partition adds file3-t1.parquet and file4-t1.parquet; file1 and file2 are marked invalid in the metadata after t1.

Ø Insert overwrite with more records: the partition adds file3-t1.parquet, file4-t1.parquet, ..., fileN-t1.parquet; file1 and file2 are marked invalid in the metadata after t1.

Ø Insert overwrite with 1 record: the partition adds file3-t1.parquet; file1 and file2 are marked invalid in the metadata after t1.

2) MOR

Before the insert, the partition contains file1-t0.parquet, file2-t0.parquet and .file1-t00.log.

Ø Insert overwrite with the same number of records: the partition adds file3-t1.parquet and file4-t1.parquet; file1 and file2 are marked invalid in the metadata after t1.

Ø Insert overwrite with more records: the partition adds file3-t1.parquet, file4-t1.parquet, ..., fileN-t1.parquet; file1 and file2 are marked invalid in the metadata after t1.

Ø Insert overwrite with 1 record: the partition adds file3-t1.parquet; file1 and file2 are marked invalid in the metadata after t1.
3) Advantages

(1) COW and MOR are very similar in execution. Does not interfere with the compaction of MOR.

(2) Reduce the parquet file size.

(3) There is no need to update external indexes in the critical path. Index implementations can check for invalid filegroups (similar to how commits are checked for invalidity in HBaseIndex).

(4) The cleanup strategy can be extended to delete old filegroups after a certain time window.

4) Disadvantages

(1) Requires forwarding previously committed metadata.

Ø At t1, for example, file1 is marked invalid; we store "invalidFiles=file1" in t1.commit (or in the delta commit for MOR)

Ø At t2, file2 is also marked invalid. We forward the previous entry and record "invalidFiles=file1, file2" in t2.commit (or the delta commit for MOR)

(2) Ignoring parquet files that exist on disk is also new behavior for Hudi and may be error-prone: we must recognize this new behavior and update every view of the file system to ignore such files, which can cause problems when implementing other features.

3.2.2.5 Key generation strategy

Used to generate HoodieKey (record key + partition path), currently supports the following strategies:

Ø Record keys composed of multiple fields

Ø Partition paths composed of multiple fields (customizable time formats, Hive-style path names)

Ø Non-partitioned tables
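For illustration, a hedged sketch of configuring a composite key and a multi-field partition path (the key generator classes live in Hudi's org.apache.hudi.keygen package; the field names event_type and region are hypothetical):

// Composite record key (uuid,event_type) and multi-field partition path (region,dt)
df.write.format("hudi")
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option("hoodie.datasource.write.recordkey.field", "uuid,event_type")
  .option("hoodie.datasource.write.partitionpath.field", "region,dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save(basePath)
// For a non-partitioned table, org.apache.hudi.keygen.NonpartitionedKeyGenerator can be used;
// for time-formatted partition paths, org.apache.hudi.keygen.TimestampBasedKeyGenerator.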

3.2.2.6 Delete policy

1) Tombstone (soft delete): set all value fields of the record to null

2) Physical deletion:

(1) Delete all input records by setting the write operation (OPERATION_OPT_KEY) to delete

(2) Configure PAYLOAD_CLASS_OPT_KEY = org.apache.hudi.EmptyHoodieRecordPayload to delete all input records

(3) Add a field to the input record: _hoodie_is_deleted
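A hedged sketch of physical deletion via the write operation, i.e. option (1) above (deletesDF is assumed to hold the keys and partition values of the records to remove; table and path continue the earlier example):

deletesDF.write.format("hudi")
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "delete") // physically remove the matching records
  .mode(SaveMode.Append)
  .save(basePath)
// Alternative (3): include a boolean column _hoodie_is_deleted = true on the rows of a
// normal upsert batch to delete just those records.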

3.2.2.7 Summary

The core advantages of Apache Hudi over other data lake solutions:

(1) The write path is fully optimized for the small-file problem of file storage: Copy On Write keeps writing to a bucket's (FileGroup's) base file until it reaches the configured threshold size before opening a new bucket; Merge On Read keeps appending to the log file within the same bucket until its size exceeds the configured threshold, then rolls over to a new one.

(2) Support for UPDATE and DELETE is very efficient. All lifecycle operations of a record happen within the same bucket, which not only reduces the number of small files but also improves read efficiency (no unnecessary joins or merges).

3.2.3 Data Merging

(1) No base file (i.e. parquet file): go through the copy on write insert process, directly merge all log files and write the base file

(2) There is a base file: follow the copy-on-write upsert process — first read the log files to build an index, then read the base file, merge, and finally write a new base file

Both the Flink and Spark streaming writers can apply asynchronous compaction strategies: compaction tasks are triggered at intervals based on the number of commits or on time, and are executed in an independent pipeline.


Origin blog.csdn.net/qq_44665283/article/details/129286627