Take Hudi, Iceberg, and Paimon as examples. These lake frameworks support efficient streaming and batch reads and writes, data backtracking, and data updates. They offer capabilities that traditional real-time and offline data warehouses lack, mainly in the following aspects:
These storage engines are natively stream-batch unified: they support batch access to the complete Table data, and they also support fully processing a Table first and then incrementally stream-processing its Changelog;
They support UPSERT streams, which is very important, and their file organization (LSM) is also more efficient;
They support TimeTravel, so in theory batch or stream processing can start from any point in time;
They also cover various other offline data warehouse operations.
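The features above can be illustrated with a small sketch. This is plain Python, not a real lake-format API: the `Table` class, `commit`, `read`, and the snapshot ids are all illustrative stand-ins for how UPSERT-on-key merging (LSM-style) and TimeTravel over immutable snapshots behave conceptually.

```python
# Conceptual sketch only: key-based UPSERT merge plus snapshot-based
# time travel, as described for lake formats like Hudi/Iceberg/Paimon.
# All names here are hypothetical, not a real framework API.

class Table:
    def __init__(self):
        self.snapshots = []  # immutable versions of the table, newest last

    def commit(self, upserts):
        # Merge-on-key against the latest snapshot, like an LSM compaction:
        # a newer value for the same key replaces the old one.
        base = dict(self.snapshots[-1]) if self.snapshots else {}
        base.update(upserts)
        self.snapshots.append(base)
        return len(self.snapshots) - 1  # snapshot id of this commit

    def read(self, snapshot_id=None):
        # Batch read of the full table; passing an earlier snapshot id
        # "travels" back to that point in time.
        if not self.snapshots:
            return {}
        sid = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

t = Table()
s0 = t.commit({"user1": 10, "user2": 20})
s1 = t.commit({"user2": 25})   # UPSERT: user2 is updated in place, not appended
print(t.read())                # latest state: {'user1': 10, 'user2': 25}
print(t.read(s0))              # time travel:  {'user1': 10, 'user2': 20}
```

A stream consumer would, in the same spirit, first read a full snapshot and then subscribe to the per-commit Changelog from that snapshot onward.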
If we build a new data warehouse system, a Streaming Warehouse, on top of such a lake framework, all of our development becomes oriented toward Tables and pure SQL operations.
Such an architecture solves several core problems:
If performance is sufficient, it can achieve latency comparable to a purely real-time link;
Stream and batch are naturally unified, with consistent metric definitions and naturally aligned computing semantics, ensuring data consistency;
Intermediate results can be queried, which is a great advantage over the currently popular real-time data warehouses;
Restoring historical data is very convenient;
Development and storage costs are low.
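The queryability of intermediate results can be made concrete with a toy example. The layer names (ODS, DWD, ADS) follow common warehouse convention; the data and logic below are invented for illustration, not taken from any real pipeline.

```python
# Toy layered pipeline where every intermediate layer is itself a table.
# In a message-queue-based real-time warehouse, the DWD equivalent would
# live in an opaque stream; here it can be inspected directly.

# ODS layer: raw events (illustrative data)
ods = [{"uid": 1, "amt": 5}, {"uid": 1, "amt": 7}, {"uid": 2, "amt": 3}]

# DWD layer: cleaned records (drop non-positive amounts), stored as a table
dwd = [r for r in ods if r["amt"] > 0]

# ADS layer: per-user aggregate, also stored as a table
ads = {}
for r in dwd:
    ads[r["uid"]] = ads.get(r["uid"], 0) + r["amt"]

# Because each layer is a table, a suspicious number in `ads` can be
# debugged by querying `dwd` directly instead of replaying a stream.
print(ads)  # {1: 12, 2: 3}
```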
Many articles make the same point: unify stream and batch in both compute and storage, support stream, batch, and OLAP processing at the same time, and carry out all data processing in the form of Tables.
Some scenarios can already be replaced today: for example, cases where minute-level end-to-end latency is acceptable, where the data logic is complex and would otherwise be pushed offline, where strong consistency is required alongside near-real-time results, and traditional database-centric online serving scenarios such as materialized views and stored procedures.
That said, the above is an ideal vision for the future; many problems remain unsolved at the current stage. For example, end-to-end latency is much higher than in purely real-time scenarios, since results are only published at each Checkpoint interval.
However, as these frameworks continue to iterate and develop, the future may look different.
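The checkpoint-driven latency point can be estimated with back-of-envelope arithmetic. The 60-second interval and three-layer pipeline below are illustrative assumptions, and the linear model is a rough worst-case sketch, not a measured result.

```python
# Rough worst-case estimate: if each layer of a multi-layer pipeline only
# makes its results visible on a checkpoint, end-to-end delay grows roughly
# linearly with (number of layers) x (checkpoint interval).
checkpoint_interval_s = 60   # illustrative Flink checkpoint interval
num_layers = 3               # illustrative ODS -> DWD -> ADS pipeline depth

worst_case_delay_s = num_layers * checkpoint_interval_s
print(worst_case_delay_s)    # 180 seconds: minute-level, not second-level
```

This is why such an architecture suits minute-level scenarios today, while purely real-time links (which publish results continuously) remain ahead on latency.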