Some thoughts on the Streaming Warehouse and its future trends


A note.

Take the Hudi, Iceberg, and Paimon frameworks as examples. They support efficient streaming and batch reads and writes, data backtracking, and data updates. They have some features that traditional real-time and offline data warehouses lack, mainly in the following respects:

  1. These storage engines are natively unified stream-batch storage. They support batch access to the complete table data, and they also support processing the full table data first and then consuming the changelog incrementally as a stream;

  2. They support UPSERT streams, which is very important; the file organization is also more efficient (for example, LSM-based);

  3. They support time travel, so in theory batch or stream processing can start from any point in time;

  4. They also support other operations typical of offline data warehouses.
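The first three features can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyLakeTable` and its methods are hypothetical names made up for illustration, not the API of Hudi, Iceberg, or Paimon, and real engines store snapshots and changelogs as files rather than Python dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# One changelog entry: (snapshot_id, kind, key, value).
# kind is "+I" for an insert and "+U" for an update of an existing key.
Change = Tuple[int, str, str, int]


@dataclass
class ToyLakeTable:
    """Hypothetical in-memory stand-in for a primary-key lake table."""
    # Snapshot 0 is the empty table; each commit appends a new snapshot.
    snapshots: List[Dict[str, int]] = field(default_factory=lambda: [{}])
    changelog: List[Change] = field(default_factory=list)

    def commit(self, rows: Dict[str, int]) -> int:
        """Upsert rows by key and publish the result as a new snapshot."""
        snapshot_id = len(self.snapshots)
        state = dict(self.snapshots[-1])
        for key, value in rows.items():
            kind = "+U" if key in state else "+I"
            state[key] = value
            self.changelog.append((snapshot_id, kind, key, value))
        self.snapshots.append(state)
        return snapshot_id

    def read_batch(self, snapshot_id: Optional[int] = None) -> Dict[str, int]:
        """Batch read of a full snapshot; an older snapshot_id is time travel."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return dict(self.snapshots[snapshot_id])

    def read_changes(self, after_snapshot: int) -> List[Change]:
        """Incremental read of the changes committed after a given snapshot."""
        return [c for c in self.changelog if c[0] > after_snapshot]


table = ToyLakeTable()
s1 = table.commit({"a": 1, "b": 2})
s2 = table.commit({"a": 10, "c": 3})  # "a" is updated, "c" is inserted

full = table.read_batch(s1)    # full read as of snapshot s1 (time travel)
tail = table.read_changes(s1)  # then continue incrementally from s1
latest = table.read_batch()    # latest state after both commits
```

Reading `full` and then applying `tail` reproduces `latest`, which is the "full first, then incremental" access pattern described in item 1.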

If we build a new data warehouse system, the Streaming Warehouse, on top of these lake frameworks, all of our development becomes table-oriented and can be done in pure SQL.

Such an architecture addresses several core problems:

  1. If performance is sufficient, it can achieve latency comparable to that of a real-time pipeline;

  2. Batch and stream processing are natively unified, so metric definitions stay consistent and computing semantics align naturally, which ensures data consistency;

  3. Intermediate results can be queried, which is a great advantage over today's popular real-time data warehouses;

  4. Restoring historical data is very convenient;

  5. Development and storage costs are low.
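Item 2 can be made concrete with a small sketch. It assumes a made-up orders table and a per-user total; the function names and the `"+"`/`"-"` change markers are hypothetical and only illustrate the idea that a batch computation over the final table and an incremental computation over its changelog should agree.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Order = Tuple[str, str, int]  # (order_id, user, amount)


def batch_totals(orders: Iterable[Order]) -> Dict[str, int]:
    """Batch view: aggregate the complete table in one pass."""
    totals: Dict[str, int] = defaultdict(int)
    for _, user, amount in orders:
        totals[user] += amount
    return dict(totals)


def stream_totals(changes: Iterable[Tuple[str, Order]]) -> Dict[str, int]:
    """Stream view: apply changes one by one; "+" adds a row, "-" retracts it."""
    totals: Dict[str, int] = defaultdict(int)
    for op, (_, user, amount) in changes:
        totals[user] += amount if op == "+" else -amount
    return dict(totals)


# Changelog: order o1 is first written with amount 10, then updated to 30.
changes = [
    ("+", ("o1", "alice", 10)),
    ("+", ("o2", "bob", 5)),
    ("-", ("o1", "alice", 10)),  # retract the old version of o1
    ("+", ("o1", "alice", 30)),  # add the new version of o1
]

# Final table state after all changes have been applied.
final_table = [("o1", "alice", 30), ("o2", "bob", 5)]

batch_result = batch_totals(final_table)
stream_result = stream_totals(changes)
```

Both paths yield the same per-user totals because they read the same table and apply the same aggregation, which is the consistency the list above refers to.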

Many articles make the same point: unify stream and batch in both computing and storage, support stream, batch, and OLAP processing at the same time, and express all data processing in terms of tables.

Some scenarios can already be replaced today. Examples include cases where minute-level end-to-end latency is acceptable, cases where the data logic is complex and would otherwise be moved to offline processing, cases that require strong consistency between real-time and offline results, and traditional database-centric online serving scenarios such as materialized views and stored procedures.

That said, the above is an idealized vision of the future, and many problems remain unresolved at the current stage. For example, end-to-end latency is much higher than in pure real-time scenarios, and it depends on the checkpoint interval.
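A rough back-of-the-envelope sketch of why the checkpoint interval matters. It assumes that each layer's output becomes visible to the next layer only when a checkpoint completes, so a record that arrives just after a checkpoint waits almost one full interval per layer. The function name, the layer count, and the commit overhead are hypothetical illustration values, not measurements.

```python
def worst_case_delay_seconds(
    checkpoint_interval_s: float,
    layers: int,
    commit_overhead_s: float = 0.0,
) -> float:
    """Upper-bound estimate of end-to-end visibility delay.

    Assumption: every layer adds at most one checkpoint interval plus
    a fixed commit overhead before its output can be read downstream.
    """
    return layers * (checkpoint_interval_s + commit_overhead_s)


# Three chained table layers with a 60-second checkpoint interval.
delay_three_layers = worst_case_delay_seconds(60, layers=3)

# The same pipeline with a hypothetical 5-second commit overhead per layer.
delay_with_overhead = worst_case_delay_seconds(60, layers=3, commit_overhead_s=5)
```

Under these assumptions, a three-layer pipeline with a one-minute checkpoint interval can lag by up to three minutes, which is why minute-level latency is the realistic target.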

However, as these frameworks continue to iterate and mature, this may change.



Origin blog.csdn.net/u013411339/article/details/132419194