Data Lake and Hudi Overview

Foreword

The data lake is a hot concept at present, and many enterprises are building or planning to build their own data lakes.

A data lake is a centralized repository that allows you to store all structured and unstructured data at any scale. You can store data as is (without first structuring it) and run different types of analytics – from dashboards and visualizations to big data processing, real-time analytics and machine learning to guide better decision making.

Looking at the data lake from the data warehouse's perspective

Refer to AWS's official comparison between the data warehouse and the data lake.

  • A data warehouse is a database optimized for analyzing relational data from transactional systems and line-of-business applications. The data structure and schema are defined in advance to support fast SQL queries. Raw data goes through a series of ETL transformations to give users a trusted "single source of truth".
  • A data lake is different: it stores not only relational data from line-of-business applications, but also non-relational data from mobile apps, IoT devices, and social media. No data structure or schema needs to be defined in advance when capturing data, which means a data lake can store all types of data without elaborate upfront modeling. Different types of analytics (SQL queries, big data analytics, full-text search, real-time analytics, machine learning, and so on) can then be run on the data.

Summary: data lakes are more inclusive about what data they store and support more diverse processing methods; the trade-off is that the data in a data lake is less organized.

These trends are pushing the lake and the warehouse toward integration:

  1. The metadata of the lake and the warehouse are seamlessly connected and complement each other: the data warehouse's models are fed back into the data lake (becoming part of its raw data), and the structured results produced in the lake are deposited into the data warehouse.
  2. Lakes and warehouses are developed in a unified way, and data stored in different systems can be managed uniformly through one platform.
  3. For data in the data lake and the data warehouse, business needs determine which data is stored in the warehouse and which is stored in the lake, thereby integrating the lake and the warehouse.
  4. Data sits in the lake, models sit in the warehouse, and data is transformed back and forth between the two.

What is Hudi?

Hudi is a data lake framework open-sourced by Uber. It is a streaming data lake platform built around a database-style core, and represents a relatively new technical architecture.

To meet streaming requirements, Hudi designs its own file storage and management layer and implements two data models: COW and MOR.

To fit this data model into the existing big data ecosystem, Hudi provides plug-ins for the major components.

As a data lake solution, Hudi itself does not generate any business data and does not need to be deployed as a standalone service; it relies entirely on other big data components.


  • Hudi's underlying data can be stored on HDFS, S3, Azure storage, Alluxio, and so on.
  • Hudi can use the Spark/Flink compute engines to consume data from message queues such as Kafka and Pulsar; this data may come from the business data and log data of apps or microservices, or from the binlog of databases such as MySQL.
  • Spark/Flink with Hudi first write this data into raw tables (original tables) in Hudi format; derived tables, also in Hudi format, can then be produced from the raw tables through incremental ETL (incremental processing). A minimal write sketch follows this list.
  • The query engines supported by Hudi include Trino, Hive, Impala, Spark, Presto, and others.
    Compute engines such as Spark, Flink, and MapReduce can also continue processing Hudi data.
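
As a minimal, hedged sketch (assuming a Spark session started with the Hudi Spark bundle on the classpath; the table name, columns, and path below are hypothetical), writing a batch of records into a Hudi raw table through the Spark datasource looks roughly like this:

```python
# Minimal sketch: writing a small batch of records into a Hudi table with PySpark.
# Assumes the Hudi Spark bundle is on the classpath; names and path are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-write-sketch").getOrCreate()

df = spark.createDataFrame(
    [("o1", "2024-01-01", 10.5), ("o2", "2024-01-02", 20.0)],
    ["order_id", "dt", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",   # unique record key
    "hoodie.datasource.write.partitionpath.field": "dt",     # partition column
    "hoodie.datasource.write.precombine.field": "dt",        # dedup/ordering field
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/tmp/hudi/orders"))
```

The later examples in this article continue from this hypothetical table.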

Data organization structure

The data files of a Hudi table can be stored on an ordinary operating-system file system or on a distributed file system such as HDFS. For analysis performance and data reliability, HDFS is generally used. From the perspective of HDFS storage, the files of a Hudi table fall into two categories.

  1. .hoodie files: because CRUD operations are fragmented, each operation generates a file, and a growing number of these small files would seriously degrade HDFS performance. Hudi therefore designed a file-merging mechanism, and the .hoodie folder stores the log files related to the corresponding operations and file merges.
  2. Paths such as americas and asia hold the actual data files, stored by partition; the partition path keys can be specified. A hypothetical layout is sketched below.
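
For orientation only, a Hudi table directory might look roughly like the following (the partition names echo the example above; file names are illustrative and vary by Hudi version):

```
hudi_table/                              <- table base path (hypothetical)
├── .hoodie/                             <- timeline and table metadata files
│   ├── hoodie.properties
│   └── 20240101123000123.commit         <- one instant per operation
├── americas/brazil/sao_paulo/           <- partition path
│   ├── .hoodie_partition_metadata
│   └── <fileId>_<writeToken>_<instantTime>.parquet
└── asia/india/chennai/
    ├── .hoodie_partition_metadata
    └── <fileId>_<writeToken>_<instantTime>.parquet
```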

hoodie file

Hudi calls the series of CRUD operations performed on a table over time the Timeline, and a single operation on the Timeline is called an Instant. An Instant consists of:

  • Instant Action: whether the operation is a data commit (COMMITS), a file compaction (COMPACTION), a file cleanup (CLEANS), and so on;
  • Instant Time: the time at which the operation occurred;
  • State: the status of the operation, i.e. requested (REQUESTED), in progress (INFLIGHT), or completed (COMPLETED).

The state records of the corresponding operations are stored as files in the .hoodie folder.

The Timeline is designed to solve the data-ordering problems caused by delayed (late-arriving) data.
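
Purely as an illustration (the instant time is made up, and the exact file naming differs between Hudi versions), a single commit typically leaves a trail of state files in .hoodie like this:

```
.hoodie/20240101123000123.commit.requested   <- REQUESTED: operation scheduled
.hoodie/20240101123000123.inflight           <- INFLIGHT: operation in progress
.hoodie/20240101123000123.commit             <- COMPLETED: operation finished
```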

data file

Hudi's actual data consists of a small partition metadata file (recording the partition) plus the data files themselves, stored as Parquet columnar files.
To implement CRUD on the data, each record must be uniquely identifiable. Hudi combines a unique field of the dataset (the record key) with the data partition (partitionPath) to form the unique key of a record.


  1. The directory structure of a Hudi dataset is very similar to Hive's: a dataset corresponds to a root directory. The dataset is broken into multiple partitions, and the partition fields exist as folders that contain all the files of that partition.
  2. Under the root directory, each partition has a unique partition path, and each partition's data is stored in multiple files.
  3. Each file is identified by a unique fileId and the commit that produced it. If an update occurs, multiple files share the same fileId but carry different commits (see the sketch after this list).
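
These identifiers are visible from the query side, since Hudi attaches meta columns to every record. A small sketch, continuing the hypothetical table from the earlier write example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-meta-columns-sketch").getOrCreate()

# Hudi meta columns expose the commit that wrote a record, its record key,
# its partition path, and the file it currently lives in.
df = spark.read.format("hudi").load("/tmp/hudi/orders")
df.select(
    "_hoodie_commit_time",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
).show(truncate=False)
```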

Index

Hudi maintains an index so that, when a record key already exists, the key of an incoming record can be quickly mapped to the fileId that contains it.

  1. Bloom filter: stored in the footer of each data file. This is the default option, does not depend on any external system, and the data and index are always consistent.
  2. Apache HBase: can efficiently look up a small batch of keys, and can be several seconds faster during index tagging. A configuration sketch follows this list.
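
As a hedged sketch, the index type is selected through write options (BLOOM and HBASE are real Hudi index types; the ZooKeeper quorum, port, and HBase table below are placeholders):

```python
# Default index: bloom filters kept in the Parquet file footers.
bloom_index = {"hoodie.index.type": "BLOOM"}

# External index backed by Apache HBase (connection values are placeholders).
hbase_index = {
    "hoodie.index.type": "HBASE",
    "hoodie.index.hbase.zkquorum": "zk1,zk2,zk3",
    "hoodie.index.hbase.zkport": "2181",
    "hoodie.index.hbase.table": "hudi_record_index",
}
```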


Hudi's table types

Hudi provides two types of tables: Copy on Write (COW) tables and Merge On Read (MOR) tables.

  • For a Copy-On-Write table, a user update rewrites the files containing the affected data, so write amplification is very high but read amplification is zero; this suits write-light, read-heavy scenarios.
  • For a Merge-On-Read table, the overall structure is somewhat like an LSM-Tree: writes first go into delta data stored in row format, and this delta data can later be merged into the existing base files, organized in the columnar Parquet format. A sketch of choosing the table type follows this list.
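
A minimal sketch of choosing the table type at write time (the option key and both values are Hudi's actual identifiers; they are merged into the write options of the earlier example):

```python
# COPY_ON_WRITE is the default table type; MERGE_ON_READ enables delta-log writes.
cow_table_type = {"hoodie.datasource.write.table.type": "COPY_ON_WRITE"}
mor_table_type = {"hoodie.datasource.write.table.type": "MERGE_ON_READ"}
```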

Copy on Write

Abbreviated as COW. As the name suggests, when data is written a copy of the original file is made and the new data is added on top of it. Read requests read the latest complete copy, which is similar in spirit to MySQL's MVCC.
The COW table stores data mainly in the columnar file format (Parquet). During writes it performs a synchronous merge, updating the data version and rewriting the data files, similar to a B-Tree update in an RDBMS.

  1. Update: when a record is updated, Hudi first finds the file containing the record, then rewrites that file with the updated value (the latest data); files containing other records remain unchanged. A sudden burst of write operations therefore causes many files to be rewritten, resulting in huge I/O overhead (a sketch follows this list).
  2. Read: reads simply fetch the latest data files, so the latest update is always visible. This storage type suits scenarios with little writing and a lot of reading.
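
Continuing the hypothetical orders table from the first sketch, an update is just another upsert with the same record key; on a COW table this rewrites the Parquet file that holds the record:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-cow-update-sketch").getOrCreate()

# Same write options as the first sketch (record key "order_id", partition "dt").
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.precombine.field": "dt",
    "hoodie.datasource.write.operation": "upsert",
}

# Upserting an existing key ("o1") rewrites the file containing it;
# files that hold only other records are left untouched.
updates = spark.createDataFrame(
    [("o1", "2024-01-01", 99.9)],
    ["order_id", "dt", "amount"],
)
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/orders"))
```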

Merge On Read

Abbreviated as MOR. Newly written data is stored in delta logs, and the delta logs are periodically merged (compacted) into Parquet data files.
When reading, the delta logs are merged with the older data files to return the complete data. MOR therefore supports two ways of reading and writing data, described below.
The MOR table is an upgraded version of the COW table. It stores data in a mix of columnar (Parquet) and row-based (Avro) files, and updating a record resembles an LSM-Tree update in NoSQL systems.

  1. Update: when a record is updated, the change is written only to an incremental log file (Avro); compaction then runs asynchronously (or synchronously) to produce a new version of the columnar files (Parquet). This storage type suits write-heavy workloads, because new records are appended to the incremental files.
  2. Read: when reading the dataset, the incremental files must first be merged with the older files, or the query waits until compaction has successfully generated the new columnar files.

In READ OPTIMIZED mode, only data up to the most recently compacted commit is read.
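
As a hedged sketch, inline (synchronous) compaction for a MOR table can be enabled through write options like these, merged into the earlier write options (the delta-commit threshold of 5 is illustrative):

```python
# MOR write options: run compaction inline after every 5 delta commits.
mor_compaction_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}
```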

Queries

Hudi supports three different ways of querying tables:

  1. Snapshot Queries:
    dynamically merge the latest base files (Parquet) and incremental files (Avro) to provide a near-real-time view of the dataset.
    • A Copy On Write table reads only Parquet files;
    • A Merge On Read table reads Parquet + log files.
  2. Incremental Queries:
    only query files newly written to the dataset. A commit/compaction instant time (an instant on the Timeline) must be specified, and only data written after that instant is returned.
  3. Read Optimized Queries:
    directly query the base files (the latest compacted snapshot of the dataset), which are columnar Parquet files.
    A sketch of all three modes with the Spark datasource follows this list.
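
A hedged sketch of the three query modes using the Spark datasource (the path and the begin instant time are hypothetical; the option keys and values are Hudi's actual identifiers):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-query-sketch").getOrCreate()
base_path = "/tmp/hudi/orders"

# 1. Snapshot query (the default): the latest merged view of the table.
snapshot_df = spark.read.format("hudi").load(base_path)

# 2. Incremental query: only records committed after the given instant time.
incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101123000000")
    .load(base_path))

# 3. Read-optimized query: only the base (compacted) files, relevant for MOR tables.
read_optimized_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(base_path))
```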

Main references

" Data Lake Technology Architecture Evolution "
" Data Lake Series Articles "
" Hudi Official Documents "


Origin: blog.csdn.net/y3over/article/details/127247807