From theory to practice: functional architecture design and implementation of the real-time lakehouse

In the previous article, we explained why the real-time lakehouse is the answer for enterprises undergoing digital transformation, and introduced application scenarios that combine real-time computing with the data lake. (See: "In the 'data-driven' era, why do companies need a real-time lakehouse?")

In this article, we introduce in detail the functional architecture design of the real-time lakehouse on the DTStack real-time development platform, together with concrete practical cases.

Introduction to the functional architecture

The real-time lakehouse is not a standalone product module; its complete practice is built on the DTStack real-time development platform. To present our overall approach to building a real-time lakehouse more intuitively, we have extracted the architecture diagram separately for reference.

[Image: real-time lakehouse functional architecture diagram]

Lakehouse management

Lakehouse management is the foundation for building a real-time lakehouse. Through this layer, you can:

· Use Flink Catalog management to build a virtual, layered lakehouse architecture, similar to the subject-domain and DW layering of a traditional offline data warehouse

· Visually create lake tables. The platform supports creating three types of lake table: Paimon, Hudi, and Iceberg, and provides a corresponding DDL demo for each (see the FlinkSQL sketch after this list)

· Through Flink table management, persistently store Flink mapping tables created over RDBs and Kafka; together with the lake tables, this provides table-management capabilities for real-time computation

· Since Kafka is the most commonly used data medium in real-time computing, the platform also supports basic operations on Kafka topics, such as creation, deletion, modification, and data statistics/analysis
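
To make the Catalog hierarchy and lake-table DDL concrete, here is a minimal FlinkSQL sketch, assuming a Paimon catalog backed by a Hive Metastore; all names, URIs, and paths are hypothetical, and the platform's visual forms generate equivalent DDL.

```sql
-- Create a Paimon catalog: the root of the virtual lakehouse hierarchy
-- (hypothetical metastore URI and warehouse path).
CREATE CATALOG lakehouse WITH (
  'type'      = 'paimon',
  'metastore' = 'hive',
  'uri'       = 'thrift://hive-metastore:9083',
  'warehouse' = 'hdfs:///warehouse/lakehouse'
);
USE CATALOG lakehouse;

-- A database per layer, similar to subject domains in an offline warehouse.
CREATE DATABASE IF NOT EXISTS ods;

-- A Paimon lake table in the ODS layer.
CREATE TABLE ods.orders (
  order_id   INT,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
);
```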

Lakehouse development

Lakehouse development is the core capability for building a real-time lakehouse. It mainly covers the following application scenarios:

· Data ingestion into the lake: by consuming Kafka in real time, or reading CDC data from an RDB, business data is written into the data lake in real time, building the ODS layer of the real-time lakehouse and providing a unified data foundation for subsequent stream/batch reads and writes

· Lakehouse processing: using the transaction and snapshot capabilities of the lake table formats, read and write lake tables through FlinkSQL tasks to build the middle layers of the lakehouse

· Integrated stream/batch reading: during lakehouse processing, you can choose streaming or batch reads depending on the business scenario. With the stream-batch-unified design, you can first batch-read the existing data and then seamlessly switch to streaming the incremental data; or you can stream the incremental data first and later correct it with a batch read (see the sketch after this list)
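
As an illustration of the stream/batch-unified reads described above, here is a minimal FlinkSQL sketch using Apache Paimon's documented scan modes; the table name ods_orders is hypothetical.

```sql
-- Batch read: a bounded query over the latest snapshot of the lake table.
SET 'execution.runtime-mode' = 'batch';
SELECT * FROM ods_orders;

-- Streaming read that first consumes the full existing snapshot, then
-- seamlessly continues with incremental changes (Paimon's 'latest-full').
SET 'execution.runtime-mode' = 'streaming';
SELECT * FROM ods_orders /*+ OPTIONS('scan.mode' = 'latest-full') */;

-- Streaming read of incremental changes only, skipping existing data.
SELECT * FROM ods_orders /*+ OPTIONS('scan.mode' = 'latest') */;
```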

Lakehouse governance

During lakehouse development, we can continuously optimize and improve the real-time lakehouse through the lakehouse governance capabilities:

· Lake table file management: lakehouse development produces large numbers of small files, expired snapshots, orphan files, and similar data, which seriously degrade the read/write performance of lake tables. The file-management function lets you periodically merge small files and clean up expired snapshots and orphan files, improving development efficiency (see the sketch after this list)

· Metadata query: besides basic information queries on Catalogs/Databases/Tables, it also gathers statistics on a lake table's storage size, row count, task dependencies, and other information, making it easy to judge the value of a lake table globally

· Hive table conversion: for historical Hive tables, the platform supports one-click conversion of the table type without affecting historical data
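
For the file-management bullet above, here is a minimal sketch of how such housekeeping can be expressed as Paimon table options in FlinkSQL; the option names follow Paimon's documentation, while the table name and values are hypothetical.

```sql
ALTER TABLE dws_orders SET (
  'snapshot.time-retained'        = '1 h',  -- expire snapshots older than one hour
  'snapshot.num-retained.min'     = '10',   -- but always keep at least 10 snapshots
  'snapshot.num-retained.max'     = '50',   -- and cap the total at 50
  'full-compaction.delta-commits' = '20'    -- full-compact small files every 20 commits
);
```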

Practical case sharing

The following walks through a concrete data case to show in detail how data ingestion, lakehouse development, and lakehouse governance are implemented on the platform.

Data ingestion into the lake (collecting DB2 data in real time and writing it to the PaimonA lake table)

● First, create the DB2-CDC Flink mapping table and the Paimon lake table

[Images: creating the DB2-CDC Flink mapping table and the Paimon lake table]
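
For readers who prefer code, here is a minimal FlinkSQL sketch of the two tables created in the screenshots above, using the Flink CDC db2-cdc connector; all connection parameters and column definitions are hypothetical, and the PaimonA DDL assumes the current catalog is a Paimon catalog.

```sql
-- Flink mapping table over the DB2 source, via the Flink CDC connector.
CREATE TABLE db2_orders (
  order_id   INT,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector'     = 'db2-cdc',
  'hostname'      = 'db2-host',
  'port'          = '50000',
  'username'      = 'flinkuser',
  'password'      = '******',
  'database-name' = 'SAMPLE',
  'schema-name'   = 'DB2INST1',
  'table-name'    = 'ORDERS'
);

-- The PaimonA lake table that will hold the ingested ODS data.
CREATE TABLE paimonA (
  order_id   INT,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
);
```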

● Develop the ingestion task

[Image: the FlinkSQL task that ingests data into the lake]
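
The ingestion task itself then reduces to a single streaming INSERT, sketched below under the same hypothetical table names.

```sql
-- Continuously sync DB2 change data into the lake table.
INSERT INTO paimonA
SELECT order_id, amount, order_time
FROM db2_orders;
```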

Lakehouse development (stream-reading PaimonA, stream-writing PaimonB)

● Create PaimonB

The method is the same as above and is not repeated here.

● Develop the lake-table read/write task

The platform supports form-based configuration of read/write parameters, with no need to define them in SQL code, which greatly improves development efficiency. For example, when reading a lake table from a chosen point in time: with pure SQL development you first have to query the snapshot data in the background and convert the date to a timestamp yourself, whereas with the configuration form you simply pick or enter a date and time, and the platform converts it to a timestamp automatically when the task is submitted.

[Image: form-based configuration of lake-table read/write parameters]
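
For reference, here is a minimal FlinkSQL sketch of what such a configured task corresponds to: a streaming read of PaimonA starting from a point in time, written continuously into PaimonB. The scan options follow Paimon's documentation; the timestamp value is hypothetical and stands in for what the platform derives from the date-time picker.

```sql
INSERT INTO paimonB
SELECT order_id, amount, order_time
FROM paimonA /*+ OPTIONS(
  'scan.mode'             = 'from-timestamp',
  'scan.timestamp-millis' = '1699200000000'   -- 2023-11-05 16:00:00 UTC
) */;
```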

Lakehouse governance

● Metadata query

Provides metadata queries for Catalogs, Databases, lake tables (Paimon/Hudi/Iceberg), and Flink mapping tables.

[Image: metadata query pages for Catalog, Database, and tables]
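
For comparison, the same metadata can be inspected directly in FlinkSQL; the sketch below uses standard SHOW/DESCRIBE statements plus Paimon's built-in system tables, with hypothetical catalog and table names.

```sql
SHOW CATALOGS;
USE CATALOG lakehouse;
SHOW DATABASES;
SHOW TABLES;
DESCRIBE paimonA;

-- Paimon exposes table metadata as system tables, e.g. the snapshot history:
SELECT snapshot_id, commit_time, total_record_count
FROM `paimonA$snapshots`;
```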

● Data file management

Reading and writing lake tables, especially in real-time scenarios, generates a large number of small files, and too many small files degrade read performance. The lake-table file-management function is therefore an indispensable part of building a real-time lakehouse.

[Image: lake-table file management page]
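
As a point of reference, Paimon also exposes this kind of maintenance as Flink SQL procedures (assuming Flink 1.18+, where CALL statements are available); a minimal sketch with a hypothetical table name:

```sql
CALL sys.compact(`table` => 'default.paimonB');                  -- merge small files
CALL sys.expire_snapshots(`table` => 'default.paimonB',
                          older_than => '2023-11-05 00:00:00');  -- drop old snapshots
CALL sys.remove_orphan_files(`table` => 'default.paimonB');      -- clean orphan files
```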

Summary

The real-time lakehouse is an application scenario that combines "real-time computing" with the "data lake"; it does not refer to a specific product module. Through the design of the relevant functions, the platform lets data developers understand concepts such as Flink Catalog, the data lake, and stream-batch integration more simply and intuitively, and makes them easier to implement in real business scenarios.

This article is based on a summary of the live session "Five Lectures on Real-Time Lakehouse Practice, Issue 2". Interested readers can click the links below to watch the replay and download the courseware for free.

Live courseware:

https://www.dtstack.com/resources/1053?src=szgzh

Live replay video:

https://www.bilibili.com/video/BV1Uw411k7iS/?spm_id_from=333.999.0.0

"Dtstack Product White Paper": https://www.dtstack.com/resources/1004?src=szsm

"Data Governance Industry Practice White Paper" download address: https://www.dtstack.com/resources/1001?src=szsm Friends who want to know or consult more about Kangaroo Cloud big data products, industry solutions, and customer cases, please browse Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg

Students interested in big data open source projects are also welcome to join the "Kangaroo Cloud Open Source Framework DingTalk Technology Group" to exchange the latest open source technology news. Group number: 30537511; project address: https://github.com/DTStack
