Apache Hudi Technology and Architecture (Part 1)

1 Introduction

Apache Hudi is a new-generation, stream-processing-oriented data storage platform for big data, also known as a data lake platform. It combines core capabilities of traditional databases and data warehouses to provide a unified platform for data integration, data processing, and data storage. Hudi's core features include table management, transaction management, efficient insert/update/delete/query operations, an advanced indexing system, streaming data ingestion, clustering and compaction optimization, and high-performance concurrency control. Data in a Hudi data lake is organized and stored in open source file formats.
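As a hedged sketch of the add/delete/modify/query capabilities described above, the following shows Hudi DML running through the Spark SQL CLI. It assumes Spark 3.x with a matching hudi-spark-bundle available; the table name, columns, and bundle version are illustrative, not prescribed by this article.

```shell
# Sketch only: Hudi table DDL/DML via the Spark SQL CLI.
# Assumes network access for --packages and a compatible Spark 3.x install.
spark-sql \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  -e "
    CREATE TABLE IF NOT EXISTS orders (
      id BIGINT, amount DOUBLE, ts BIGINT
    ) USING hudi
    TBLPROPERTIES (type = 'mor', primaryKey = 'id', preCombineField = 'ts');

    INSERT INTO orders VALUES (1, 9.99, 1000);     -- add
    UPDATE orders SET amount = 19.99 WHERE id = 1;  -- modify
    DELETE FROM orders WHERE id = 1;                -- delete
    SELECT * FROM orders;                           -- query
  "
```

Hudi's transaction management means each statement above is recorded as a commit on the table's timeline, which is what later enables incremental reads.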

Apache Hudi supports large-scale stream processing workloads and, at the same time, makes it possible to build efficient incremental and batch data pipelines.

Apache Hudi can be deployed on any cloud storage platform, and it integrates with popular analysis and query engines such as Apache Spark, Flink, Presto, Trino, and Hive to deliver high-performance data analytics.

2 Architecture Description

The overall application architecture of the Apache Hudi data lake platform consists of the following components:

 Data Sources

Data sources that feed input into the platform.

 Apps & Microservices

Application and microservice data sources that produce events.

 Databases

SQL or NoSQL databases that produce events.

 Event Streams

Message or event middleware that receives events from the other data sources and aggregates them into event streams.

 Hudi Data Lake

The Hudi data lake platform, which uses stream computing technology to provide large-scale processing and storage of structured and unstructured data.

 DeltaStreamer/CDC

A streaming event processor and change data capture (CDC) component that ingests event streams and applies the resulting data changes.
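A hedged sketch of launching Hudi's DeltaStreamer ingestion utility to pull events from a Kafka topic into a Hudi table. The jar path, table name, storage path, and properties file are illustrative assumptions; only the class and flag names follow DeltaStreamer's documented interface.

```shell
# Sketch only: ingest a Kafka event stream into a Hudi table with DeltaStreamer.
# Assumes a running Spark cluster, Kafka, and the Hudi utilities bundle jar.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-bundle_2.12-0.14.0.jar \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path hdfs:///data/hudi/events \
  --target-table events \
  --props /path/to/kafka-source.properties
```

The `--source-ordering-field` tells Hudi how to pick the latest record when several events arrive for the same key, which is how CDC updates are resolved.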

 Row Tables

Row-oriented data tables that store the events ingested in the previous step.

 Incremental ETL

The standard processing steps of a data warehouse: incremental, streaming, pipelined processors whose output becomes the input of the next event stream.
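One way to realize this chaining, sketched here as an assumption rather than the article's prescribed method, is to run DeltaStreamer with Hudi's incremental source so that a downstream table is built only from the changes committed to an upstream table. All paths and names are illustrative.

```shell
# Sketch only: incremental ETL from one Hudi table into a derived Hudi table.
# HoodieIncrSource reads only new commits from the upstream table's timeline.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-bundle_2.12-0.14.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.HoodieIncrSource \
  --source-ordering-field ts \
  --target-base-path hdfs:///data/hudi/derived_events \
  --target-table derived_events \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=hdfs:///data/hudi/events \
  --continuous
```

With `--continuous`, the job keeps polling the upstream table for new commits, turning a batch-style ETL step into a standing incremental pipeline.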

 Derived Tables

Tables that store the stream events produced by the previous step, or the final data to be analyzed.

 Lake Storage

The physical storage layer for Hudi data tables, supporting HDFS or object storage in public cloud environments.
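For orientation, the on-storage layout of a Hudi table looks roughly like the following (the path is hypothetical; the `.hoodie` directory name is Hudi's standard metadata location):

```shell
# Sketch only: inspect a Hudi table's layout on HDFS.
hdfs dfs -ls /data/hudi/events
# .hoodie/                 <- timeline, commit metadata, table configuration
# partition=2023-01-01/    <- base files (Parquet) and log files per partition
# partition=2023-01-02/
```

The same layout applies on object stores such as S3 or GCS; only the URI scheme of the base path changes.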

 Queries

Query engines that provide query and retrieval services over the Hudi data lake.
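As one hedged example of such a query engine, Trino ships a Hudi connector; assuming a catalog named `hudi` has been configured, a table can be queried directly from the Trino CLI (schema and table names are illustrative):

```shell
# Sketch only: read a Hudi table through Trino's Hudi connector.
trino --execute "SELECT id, amount FROM hudi.warehouse.orders LIMIT 10"
```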

 Pipelines

Analysis pipelines that provide query and analysis services over the Hudi data lake.

(to be continued)

Origin: blog.csdn.net/uesowys/article/details/126589829