03_Hudi Core Concepts: Timeline, File Management, Index, Storage Types, Computing Models (Batch, Streaming, Incremental), Query Types, and the Data Write Operation Process

This article is from the "Dark Horse Programmer" hudi course

Chapter 3 Hudi Core Concepts
3.1 Basic Concepts
3.1.1 Timeline
3.1.2 File Management
3.1.3 Index
3.2 Storage Types
3.2.1 Computing Models
3.2.1.1 Batch Model (Batch)
3.2.1.2 Streaming Model (Stream)
3.2.1.3 Incremental Model (Incremental)
3.2.2 Query Types
3.2.3 Copy On Write
3.2.4 Merge On Read
3.2.5 Comparison of COW and MOR
3.3 Data Write Operation Process
3.3.1 UPSERT Write Process
3.3.1.1 Copy On Write
3.3.1.2 Merge On Read
3.3.2 INSERT Write Process
3.3.2.1 Copy On Write
3.3.2.2 Merge On Read

Chapter 3 Hudi Core Concepts

The basic concepts and table types of the Hudi data lake framework embody the framework's design principles and lie at the core of its table design.
Documentation: https://hudi.apache.org/docs/concepts.html

3.1 Basic concepts

Hudi provides the concept of a Hudi table. These tables support CRUD operations: data files can be stored on an existing big data cluster such as HDFS, and analysis engines such as Spark SQL or Hive can then be used to analyze and query the data.
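To make this concrete, here is a minimal Spark (Scala) sketch, e.g. in spark-shell, of writing a DataFrame into a Hudi table on HDFS and reading it back. It assumes the hudi-spark bundle is on the classpath; the table name, HDFS path, and field names (uuid, ts, dt) are illustrative only.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hudi recommends Kryo serialization for Spark jobs.
val spark = SparkSession.builder()
  .appName("hudi-basic-crud")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

import spark.implicits._

// A tiny illustrative dataset: uuid is the record key, ts the precombine field,
// dt the partition field.
val df = Seq(
  ("id-1", "2021-08-01 10:00:00", "2021-08-01", "Beijing"),
  ("id-2", "2021-08-01 10:05:00", "2021-08-01", "Shanghai")
).toDF("uuid", "ts", "dt", "city")

// UPSERT the rows into a Hudi table stored under an HDFS base path (path is illustrative).
df.write.format("hudi")
  .option("hoodie.table.name", "tbl_demo")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .mode(SaveMode.Append)
  .save("hdfs://node1:8020/hudi-warehouse/tbl_demo")

// Query the table back with Spark SQL.
spark.read.format("hudi")
  .load("hdfs://node1:8020/hudi-warehouse/tbl_demo")
  .createOrReplaceTempView("tbl_demo")
spark.sql("SELECT uuid, ts, dt, city FROM tbl_demo").show()
```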

A Hudi table has three main components: 1) ordered timeline metadata, similar to a database transaction log; 2) hierarchically laid-out data files: the data actually written into the table; 3) indexes (with multiple implementations): mapping a record to the data files that contain it.

3.1.1 Timeline

At its core, Hudi maintains a timeline (Timeline) containing all operations (such as inserts, updates, or deletes) performed on a table at different instants (Instant). Every operation on a Hudi table generates a new Instant on the table's Timeline, so queries can be restricted to data successfully committed after a certain point in time, or to data as of a certain point in time, which effectively avoids scanning a larger time range. It also makes it possible to efficiently query the files as they were before a change: after a change has been committed as an Instant, the data as it was before the modification can still be queried by asking for an earlier point in time.

The Timeline is the abstraction Hudi uses to manage commits. Each commit is bound to a fixed timestamp and placed on the timeline. On the Timeline, every commit is abstracted as a HoodieInstant; an Instant records the action, timestamp, and state of a commit. Hudi's read and write APIs can conveniently filter commits by condition through the Timeline interface, apply various strategies to historical and in-flight commits, and quickly select the target commits to operate on.

In the figure above, time (hour) is used as the partition field. Commits are generated one after another starting from 10:00. A record with event time 9:00 arrives at 10:20 and can still land in the partition for 9:00. If incremental updates after 10:00 are consumed through the timeline (consuming only file groups with new commits), this late-arriving data can still be consumed.

The Timeline implementation classes are located in hudi-common-xx.jar; the timeline-related classes live under the org.apache.hudi.common.table.timeline package.
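As a small illustration, the timeline of an existing table can be inspected through these classes. This is only a sketch: it reuses the SparkSession and base path from the earlier example, and class/method names may differ slightly between Hudi versions.

```scala
import org.apache.hudi.common.table.HoodieTableMetaClient

// Build a metadata client for the table (base path is illustrative).
val metaClient = HoodieTableMetaClient.builder()
  .setConf(spark.sparkContext.hadoopConfiguration)
  .setBasePath("hdfs://node1:8020/hudi-warehouse/tbl_demo")
  .build()

// Each completed commit appears on the timeline as a HoodieInstant:
// action + timestamp + state.
metaClient.getActiveTimeline()
  .getCommitsTimeline()
  .filterCompletedInstants()
  .getInstants()
  .forEach(instant =>
    println(s"${instant.getTimestamp} ${instant.getAction} ${instant.getState}"))
```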

3.1.2 File Management

Hudi organizes a dataset on DFS into a directory structure under a base path (HoodieWriteConfig.BASE_PATH_PROP). The dataset is divided into partitions (DataSourceOptions.PARTITIONPATH_FIELD_OPT_KEY); much like a Hive table, each partition is a folder containing the data files of that partition.

Within each partition, files are organized into file groups, uniquely identified by a file id. Each file group contains multiple file slices, where each slice contains a base columnar file (*.parquet) produced at a certain commit/compaction instant time, together with a set of log files (*.log.*) containing inserts and updates to the base file since it was produced.

  • A new base commit time corresponds to a new FileSlice, which is effectively a new data version.
  • Each FileSlice contains one base file (which may be absent in Merge On Read mode) and multiple log files (which do not exist in Copy On Write mode).
  • Each file name carries its FileID (i.e., the FileGroup identifier) and base commit time (i.e., the InstantTime). The FileGroup relationship is organized through the file id in the file name; the FileSlice relationship is organized through the base commit time in the file name.
  • Hudi's base file (Parquet) stores a BloomFilter built from the record keys in its footer metadata; the file-based index uses it for efficient key-containment checks, and a full file scan is only needed for keys that hit the BloomFilter, in order to rule out false positives.
  • Hudi's log files (Avro) are self-encoded and written in LogBlock units by accumulating data buffers; each LogBlock contains a magic number, size, content, footer and other information used for reading, validating, and filtering data.

Hudi is designed around MVCC (Multi-Version Concurrency Control): compaction merges log files and base files to produce new file slices, and cleaning deletes unused/older file slices to reclaim space on DFS.
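To see this layout on DFS, a plain Hadoop FileSystem listing of the base path is enough: the .hoodie directory holds the timeline metadata, and each partition folder holds the base (Parquet) and log (Avro) files. The sketch below uses only the Hadoop API; the base path is illustrative.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Base path of the Hudi table (illustrative).
val basePath = new Path("hdfs://node1:8020/hudi-warehouse/tbl_demo")
val fs: FileSystem = basePath.getFileSystem(new Configuration())

// Recursively print the directory tree: .hoodie metadata plus partition folders.
def ls(path: Path, indent: String = ""): Unit =
  fs.listStatus(path).foreach { status =>
    println(indent + status.getPath.getName)
    if (status.isDirectory) ls(status.getPath, indent + "  ")
  }

ls(basePath)
```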

3.1.3 Index

Hudi provides efficient upsert operations through its index mechanism, which maps the combination RecordKey + PartitionPath as a unique identifier to a file id. The mapping between this unique identifier and the file group / file id never changes once the record is first written.

Hudi ships with four categories (six concrete implementations) of indexes, all of which inherit from the top-level abstract class HoodieIndex (a configuration sketch follows at the end of this section):

  • Global index: requires the key to be unique across all partitions of the table, i.e., a given key maps to at most one record. The global index provides a stronger guarantee, but the cost of updates and deletes grows with the table size (O(table size)), so it is better suited to small tables.
  • Non-global index: the key only needs to be unique within a given partition of the table. It relies on the writer always providing the same partition path when updating or deleting a given record, but in exchange it is much more efficient, since the index lookup complexity becomes O(number of records being updated or deleted), which copes well with growing write volume.

The mapping between a Hoodie key (record key + partition path) and a file id (FileGroup) never changes after the data is first written to a file, so a FileGroup contains all versions of a batch of records. The index is used to decide whether an incoming message is an INSERT or an UPDATE.

BloomFilter Index (bloom filter index)

  • For the incoming records, build the mapping: record key => target partition.
  • From the current latest data, build the mapping: partition => (fileID, minRecordKey, maxRecordKey) LIST (this can be accelerated for base files).
  • For the incoming records, derive the mapping that needs to be checked: fileID => HoodieKey (record key + partition path) LIST, where the fileID is a candidate file.
  • Probe the target files through HoodieKeyLookupHandle (accelerated by the BloomFilter).

Flink State-based Index (state-backed index)

  • The Flink writer introduced by Hudi in version 0.8.0 uses Flink state as the underlying index storage. Before being written, each record first computes its target bucket ID, which, unlike the BloomFilter index, avoids repeated file-index lookups on every write.
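The index implementation used by a Spark write can be chosen through the hoodie.index.type option (values such as BLOOM, GLOBAL_BLOOM, SIMPLE, GLOBAL_SIMPLE, and HBASE; the exact set depends on the Hudi version). A hedged sketch, reusing the DataFrame and base path from the earlier example:

```scala
// The GLOBAL_* variants enforce key uniqueness across all partitions (global index);
// the others only within a partition (non-global index).
df.write.format("hudi")
  .option("hoodie.table.name", "tbl_demo")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.index.type", "GLOBAL_BLOOM") // global bloom filter index
  .mode(SaveMode.Append)
  .save("hdfs://node1:8020/hudi-warehouse/tbl_demo")
```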

3.2 Storage Types

Hudi provides two types of tables: Copy on Write (COW) tables and Merge On Read (MOR) tables. The main differences are as follows:

  • For a Copy-On-Write table, an update rewrites the files containing the affected data, so write amplification is very high but read amplification is zero; it suits scenarios with few writes and many reads.
  • For a Merge-On-Read table, the overall structure resembles an LSM-Tree: writes first go into delta data stored in row format, and this delta data can then be merged into the existing files, which are organized as columnar Parquet files.

3.2.1 Computing Models

Hudi is an open source data lake framework initiated at Uber, so most of its design goals come from Uber's own scenarios, such as joining driver data and passenger data by order id. In Hudi's early usage scenarios, like the architecture of most companies, a Lambda architecture with batch and streaming coexisting was adopted; the following compares the batch (Batch) and streaming (Stream) computing models in terms of latency, data integrity, and cost.

3.2.1.1 Batch model (Batch)

The batch model uses typical batch computing engines such as MapReduce, Hive, and Spark to compute data in the form of hourly or daily tasks.

  • Latency: hour-level or day-level. The latency here is not just the scheduling interval of the task; in a data architecture it is usually the scheduling interval + the compute time of a chain of dependent tasks + the time until the data platform can finally display the result. When the data volume is large and the logic is complex, the real latency of data produced by an hourly task is usually 2-3 hours.
  • Data integrity: relatively good. Taking processing time as an example, the raw data an hourly task computes over usually already includes all the data of that hour, so the result is relatively complete. But if the business requirement is event time, the late-reporting mechanisms of client devices come into play, and batch tasks can do little about that.
  • Cost: very low. Resources are occupied only while a task is computing; when no task is running, these batch resources can be lent to online services. From another perspective, however, the cost can be quite high: if the raw data is later inserted, deleted, or updated, or data arrives late, the batch task has to be fully recomputed.

3.2.1.2 Streaming model (Stream)

The streaming model typically uses Flink for real-time data calculation .

  • Latency : Very short, even real time.
  • Data integrity: poor. A streaming engine does not wait for all data to arrive before computing; instead it uses watermarks, and data whose time is earlier than the watermark is discarded, so there is no absolute guarantee of data completeness. In Internet scenarios, the streaming model is mainly used for large-scale dashboards during events, where completeness requirements are not very strict. In most scenarios, users end up developing two programs: a streaming job producing real-time results and a batch job that repairs the real-time results the next day.
  • Cost: very high. Streaming jobs are long-running, and multi-stream join scenarios usually need memory or a database for state storage; at large data volumes, neither the serialization overhead nor the extra IO from interacting with external components can be ignored.

3.2.1.3 Incremental model (Incremental)

Given the advantages and disadvantages of batch and streaming, Uber proposed the incremental model (Incremental Mode), which is more real-time than batch and more economical than streaming.

Put simply, the incremental model runs near-real-time tasks as mini batches. Hudi supports the two most important features of the incremental model:

  • Upsert: this mainly solves the problem that data cannot be inserted or updated in place in the batch model. With this feature, incremental data can be written into Hive instead of overwriting everything each time. (Hudi itself maintains the key-to-file mapping, so finding the file for a key during an upsert is cheap.)
  • Incremental Query: incremental queries reduce the amount of raw data to compute on. Taking Uber's driver/passenger data stream join as an example, each run can capture just the increments of the two streams and batch-join them, which is orders of magnitude cheaper than a streaming join; a sketch of this mini-batch pattern follows immediately below.
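A hedged sketch of the mini-batch pattern: each run pulls only the rows committed after the last processed instant from two Hudi tables and joins them, instead of a full recompute or a resident streaming join. It reuses the SparkSession from the earlier example; the table paths, join key (order_id), and checkpointed instant time are illustrative.

```scala
// Instant time checkpointed by the previous mini-batch run (illustrative).
val lastInstant = "20210801100000"

// Incremental read: only records committed after lastInstant.
def incrementalRead(path: String) =
  spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", lastInstant)
    .load(path)

val driverDelta    = incrementalRead("hdfs://node1:8020/hudi-warehouse/tbl_driver")
val passengerDelta = incrementalRead("hdfs://node1:8020/hudi-warehouse/tbl_passenger")

// Batch join over the two increments only, instead of recomputing the full tables.
driverDelta.join(passengerDelta, Seq("order_id")).show()
```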

In the incremental model, Hudi provides two types of Tables, Copy-On-Write and Merge-On-Read .

3.2.2 Query Type

Depending on the table type, Hudi supports three ways of querying a table: Snapshot Queries, Incremental Queries, and Read Optimized Queries (a Spark sketch of all three follows at the end of this section).

Type 1: Snapshot Queries (snapshot query)

  • Queries the latest snapshot of the dataset as of an incremental commit operation; the latest base files (Parquet) and delta files (Avro) are merged on the fly to provide a near-real-time dataset (usually with a delay of a few minutes).
  • Reads the files in the latest FileSlice of every FileGroup under all partitions: for Copy On Write this means reading Parquet files; for Merge On Read, Parquet + log files.

Type 2: Incremental Queries (incremental query)

  • Queries only the data newly written to the dataset: a Commit/Compaction instant time (an Instant on the Timeline) must be specified as the condition, and only data committed after that instant is returned.
  • Views data newly written since a given commit/delta commit instant, effectively providing a change stream that enables incremental data pipelines.

Type 3: Read Optimized Queries (read optimized query)

  • Queries the base files directly (the latest compacted snapshot of the dataset), which are columnar files (Parquet), and guarantees the same columnar query performance as a non-Hudi columnar dataset.
  • Views the latest snapshot of the table as of a given commit/compaction instant.
  • Like snapshot queries, read-optimized queries access only the base files, providing the data of a given file slice as of the last compaction; the freshness of the query results therefore depends on the compaction strategy.
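A hedged sketch of issuing the three query types through the Spark datasource (reusing the SparkSession from the earlier example); the option names follow the Hudi datasource options, and the base path and instant time are illustrative.

```scala
// 1) Snapshot query (the default): merged view of base + log files.
val snapshot = spark.read.format("hudi")
  .load("hdfs://node1:8020/hudi-warehouse/tbl_demo")

// 2) Incremental query: only records committed after the given instant time.
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210801100000")
  .load("hdfs://node1:8020/hudi-warehouse/tbl_demo")

// 3) Read-optimized query: base (Parquet) files only, as of the last compaction.
val readOptimized = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("hdfs://node1:8020/hudi-warehouse/tbl_demo")
```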

3.2.3 Copy On Write

Abbreviated COW. As the name suggests, writing data copies the original file and adds the new data on top of it; read requests read the latest complete copy, which is similar in spirit to MySQL's MVCC.

In the figure above, each color contains all of the data up to its point in time. Old copies of the data are deleted once they exceed a retention limit. This table type has no compaction instants, because the data is already compacted at write time.

  • Advantage: when reading, only the single data file of the corresponding partition needs to be read, which is efficient;
  • Disadvantage: when writing, a previous copy must be duplicated and a new data file generated from it, which is time-consuming; as a result, the data seen by read requests lags behind;

For this type of table, two query types are provided:

  • Snapshot Query: queries the latest snapshot data, i.e., the latest data.
  • Incremental Query: the user specifies a commit time, and Hudi scans the records in the files to filter out the records whose commit_time is greater than the specified commit time.

The COW table stores data mainly in a columnar file format (Parquet). During a write it performs a synchronous merge, updating the data version and rewriting the data files, similar to a B-Tree update in an RDBMS.

  • 1) Update: when a record is updated, Hudi first finds the file containing the record and rewrites that file with the updated value (the latest data); files containing other records remain unchanged. A sudden burst of writes therefore causes a large number of files to be rewritten, leading to huge I/O overhead.
  • 2) Read: when reading the dataset, the latest updates are obtained by reading the latest data files. This storage type suits scenarios with a small write volume and a large read volume.

Every write to a Copy On Write table produces a new FileSlice holding a new base file (stamped with the instant time of the write); when a user reads a snapshot, all base files of the latest FileSlices are scanned.
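A sketch of writing a COW table through the Spark datasource (reusing the DataFrame from the earlier example). The table type is selected with hoodie.datasource.write.table.type; COPY_ON_WRITE is also the default. Paths and fields are illustrative.

```scala
// Each write produces a new base file (a new FileSlice); there are no log files to compact.
df.write.format("hudi")
  .option("hoodie.table.name", "tbl_demo_cow")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("hdfs://node1:8020/hudi-warehouse/tbl_demo_cow")
```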

3.2.4 Merge On Read

Abbreviated MOR. Newly inserted data is stored in delta logs, and the delta logs are periodically merged into Parquet data files. When reading, the delta logs are merged with the older data files to return the complete data. The figure below illustrates the two read/write paths of MOR.

Like a COW table, an MOR table can also ignore the delta logs and read only the latest complete data files.

  • Advantage: since writes go to the delta log first and the delta log is small, the write cost is low;
  • Disadvantage: regular compaction is needed, otherwise many fragmented files accumulate; read performance is poor because the delta logs must be merged with the older data files.

For this type of table, three query types are provided:

  • Snapshot Query: queries the latest snapshot data, i.e., the latest data; this is a query over a mix of row-based and columnar data.
  • Incremental Query: the user specifies a commit time, and Hudi scans the records in the files to filter out the records whose commit_time is greater than the specified commit time; this is also a query over a mix of row-based and columnar data.
  • Read Optimized Query: only reads the compacted data and ignores the incremental data; since only columnar file formats are involved, it is efficient.

The MOR table is an upgraded version of the COW table. It stores data using a mix of columnar (Parquet) and row-based (Avro) files, and updating records resembles an LSM-Tree update in NoSQL (a write sketch with inline compaction follows at the end of this section).

  • 1) Update: when records are updated, they are only written to the incremental files (Avro); compaction is then performed asynchronously (or synchronously) to produce a new version of the columnar files (Parquet). This storage type suits write-heavy workloads, because new records are appended to the incremental files.
  • 2) Read: when reading the dataset, the incremental files must first be merged with the older files; queries can then run against the columnar files produced once compaction succeeds.
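A sketch of writing an MOR table with inline compaction enabled (reusing the DataFrame from the earlier example); the compaction frequency shown is illustrative.

```scala
// Updates are appended to Avro log files and periodically compacted into new Parquet base files.
df.write.format("hudi")
  .option("hoodie.table.name", "tbl_demo_mor")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.operation", "upsert")
  // trigger compaction inside the write job after every 5 delta commits (illustrative)
  .option("hoodie.compact.inline", "true")
  .option("hoodie.compact.inline.max.delta.commits", "5")
  .mode(SaveMode.Append)
  .save("hdfs://node1:8020/hudi-warehouse/tbl_demo_mor")
```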

3.2.5 Comparison of COW and MOR

Hudi's WriteClient is the same for copy-on-write (COW) and merge-on-read (MOR) writers.

  • For a COW table, a snapshot read scans all base files under the latest FileSlices.
  • For an MOR table, READ OPTIMIZED mode only reads data as of the most recent compaction commit.

3.3 Data Write Operation Process

The Hudi data lake framework supports three ways of writing data: UPSERT (insert or update), INSERT, and BULK_INSERT (sorted write); a Spark sketch of selecting the operation follows the list below.

  • UPSERT: the default behavior. Records are first tagged by the index as INSERT or UPDATE, and heuristics decide how messages are organized to optimize file sizes.
  • INSERT: skips the index, giving more efficient writes.
  • BULK_INSERT: sorted write, friendly to initializing Hudi tables with large data volumes; file size limits are enforced on a best-effort basis (write HFile).
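A sketch of selecting the write operation through hoodie.datasource.write.operation (reusing the DataFrame from the earlier example); upsert is the default. Table name and path are illustrative.

```scala
// Helper that writes the same DataFrame with a chosen operation.
def writeWith(operation: String): Unit =
  df.write.format("hudi")
    .option("hoodie.table.name", "tbl_demo")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "dt")
    .option("hoodie.datasource.write.operation", operation)
    .mode(SaveMode.Append)
    .save("hdfs://node1:8020/hudi-warehouse/tbl_demo")

writeWith("upsert")      // index lookup, insert/update tagging (default)
writeWith("insert")      // skips index tagging, cheaper writes
writeWith("bulk_insert") // sorted write, for large initial loads
```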

3.3.1 UPSERT write process

Since Hudi tables come in two types, COW and MOR, the specific UPSERT write process differs for each.

3.3.1.1 Copy On Write

  • Step 1: deduplicate the incoming records by record key;
  • Step 2: build an index for this batch of data (HoodieKey => HoodieRecordLocation); the index distinguishes which records are updates and which are inserts (keys being written for the first time);
  • Step 3: for update messages, find the base file of the latest FileSlice for the key directly, merge in the updates, and write a new base file (a new FileSlice);
  • Step 4: for insert messages, scan all SmallFiles of the current partition (base files smaller than a given size), merge the inserts into one and write a new FileSlice; if there is no SmallFile, write a new FileGroup + FileSlice directly.

3.3.1.2 Merge On Read

  • Step 1: deduplicate the incoming records by record key (optional);
  • Step 2: build an index for this batch of data (HoodieKey => HoodieRecordLocation); the index distinguishes which records are updates and which are inserts (keys being written for the first time);
  • Step 3: for insert messages, if the log files are not indexable (the default), try to merge into the smallest base file in the partition (a FileSlice without log files) and generate a new FileSlice; if there is no base file, write a new FileGroup + FileSlice + base file. If the log files are indexable, try to append to a small log file; otherwise write a new FileGroup + FileSlice + base file;
  • Step 4: for update messages, write to the corresponding FileGroup + FileSlice by appending to the latest log file (if it happens to be the current smallest small file, the base file is merged and a new FileSlice is generated); when the log file size reaches a threshold, it rolls over to a new log file.

3.3.2 INSERT write process

Likewise, because Hudi tables come in the COW and MOR types, the INSERT write process also differs for each.

3.3.2.1 Copy On Write

  • Step 1: deduplicate the incoming records by record key (optional);
  • Step 2: no index is created;
  • Step 3: if a small base file exists, merge with it to generate a new FileSlice + base file; otherwise write a new FileSlice + base file directly;

3.3.2.2 Merge On Read

  • Step 1: deduplicate the incoming records by record key (optional);
  • Step 2: no index is created;
  • Step 3: if the log files are indexable and a small FileSlice exists, append to or write the latest log file; if the log files are not indexable, write a new FileSlice + base file.
