1 Introduction
Apache Hudi is a new-generation big-data storage platform built on stream computing, also known as a data lake platform. It combines the core capabilities of traditional databases and data warehouses, providing data integration, data processing, and data storage in one platform. Hudi's core features include table management, transaction management, efficient insert/update/delete and query operations, an advanced indexing system, streaming data ingestion, clustering and compaction optimization, and high-performance concurrency control. Data in a Hudi data lake is organized and stored in open-source file formats.
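The upsert and transaction capabilities listed above are typically exercised through Spark's DataFrame writer. The sketch below shows the core write options Hudi expects for an upsert; the table name and field names ("orders", "order_id", "updated_at") are illustrative assumptions, not taken from the text.

```python
# Minimal sketch of the write options a Spark job would pass to Hudi
# when upserting into a table. Table and field names are placeholders.
hudi_write_options = {
    "hoodie.table.name": "orders",                             # managed table name
    "hoodie.datasource.write.recordkey.field": "order_id",     # record-key field
    "hoodie.datasource.write.precombine.field": "updated_at",  # dedup/ordering field
    "hoodie.datasource.write.operation": "upsert",             # insert-or-update semantics
}

# In a real Spark job these options would be applied roughly like:
#   df.write.format("hudi").options(**hudi_write_options) \
#     .mode("append").save("s3://bucket/path/orders")
print(hudi_write_options["hoodie.datasource.write.operation"])  # prints "upsert"
```

The precombine field lets Hudi pick the latest version when two records share the same key, which is how it keeps update semantics consistent under streaming ingestion.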
Apache Hudi supports large-scale stream-processing workloads while also enabling efficient incremental and batch data pipelines.
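Incremental pipelines rely on Hudi's incremental query mode, which returns only the records committed after a given instant. A minimal sketch of the read options involved; the instant timestamp below is a placeholder assumption:

```python
# Sketch of the options for a Hudi incremental query: only records
# committed after the given instant time are returned. The instant
# value (yyyyMMddHHmmss) is an illustrative placeholder.
incremental_read_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20240101000000",
}

# Applied in a Spark job roughly like:
#   spark.read.format("hudi").options(**incremental_read_options) \
#        .load("s3://bucket/path/orders")
```

Chaining such incremental reads with incremental writes is what makes the efficient pipelines mentioned above possible without full-table rescans.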
Apache Hudi can be easily deployed on any cloud storage platform and, combined with popular analysis and query engines such as Apache Spark, Flink, Presto, Trino, and Hive, delivers high-performance data analytics.
2 Architecture Description
The overall application architecture of the Apache Hudi data lake platform is as follows:
Data Sources: external systems that provide data input.
Apps & Microservices: application and microservice sources that emit events.
Databases: SQL or NoSQL databases that emit events as input.
Event Streams: message or event middleware that accepts events from other data sources and aggregates them into event streams.
Hudi Data Lake: the Hudi platform itself, which uses stream-computing technology to process and store large-scale structured or unstructured data.
DeltaStreamer/CDC: a streaming event processor / change-data-capture component that ingests event streams and applies event changes.
Row Tables: row-oriented data tables that store the events processed in the previous step.
Incremental ETL: the standard data-warehouse processing steps, implemented as incremental, streaming, pipelined processors whose output feeds the next event stream.
Derived Tables: tables that store the output stream of the previous step, or the final data to be analyzed.
Lake Storage: physical storage for Hudi tables, supporting HDFS or object storage in public cloud environments.
Queries: query engines providing query and retrieval services over the Hudi data lake.
Pipelines: analysis engines providing query and analysis services over the Hudi data lake.
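The DeltaStreamer component in the architecture above ships with Hudi as a Spark utility. A hedged sketch of how it might be launched to pull events from Kafka into a Hudi table; the bundle jar name, target path, and table name are placeholder assumptions:

```shell
# Illustrative spark-submit invocation of HoodieDeltaStreamer, which reads
# events from a Kafka source and continuously upserts them into a Hudi table.
# Jar name, target path, and table name below are placeholders.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --target-base-path s3://bucket/hudi/orders \
  --target-table orders \
  --op UPSERT \
  --continuous
```

The `--continuous` flag keeps the ingestion job running as a long-lived stream rather than a one-shot batch, matching the streaming role DeltaStreamer plays in the diagram.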
(to be continued)