Building a new streaming lakehouse solution based on Flink SQL and Paimon

This article is compiled from the talk given by Li Jinsong (the person in charge of open source table storage at Alibaba Cloud, founder of Apache Paimon, and Flink PMC member) at the open source big data session of the Computing Conference. The content is divided into four parts:

  • Data analysis architecture evolution
  • Introducing Apache Paimon
  • Flink + Paimon streaming lakehouse
  • Streaming lakehouse demo

Data analysis architecture evolution

Data analysis architectures are currently evolving from Hive to the lakehouse. Traditional data warehouses built on Hive and Hadoop are moving toward lake and lakehouse architectures, which typically combine query engines such as Presto and Spark, object storage such as OSS, and a lake format (Delta Lake, Hudi, Iceberg). This is a major trend, and the lakehouse architecture brings many new capabilities.

First, compared with traditional HDFS, OSS is more flexible and naturally separates compute from storage. OSS also supports hot and cold tiering: data can be archived to cold storage, which is very cheap, giving you storage flexibility.

Further up the stack, the lake formats bring their own benefits. What are they specifically?

The first is operability: a lake format provides ACID transactions, Time Travel, and Schema Evolution, giving you much better control over the data.

The second is that queries can be faster, for example because the planning phase takes less time. Hive runs into query problems when there is a large amount of data and a very large number of files, and the lake formats handle this better.

These two benefits, however, are not necessarily enough to convince a company's decision-makers. In fact, not every company is upgrading or has already upgraded. A big reason is that, old as Hive is, it still gets the job done, and the first two benefits are not strictly necessary for every business. A large number of companies keep using Hive; the underlying storage may have been replaced by OSS (or OSS-HDFS), but it is still the same old Hive.

Imagine you already have a train that runs stably. You could upgrade it: add a dining car, refurbish it, split it into more carriages for flexibility. But that means migrating to a whole new architecture, so are you willing to take the risk? What if, instead, it could be upgraded into a high-speed train?

That brings me to the third benefit: the lakehouse can achieve better timeliness.

Better timeliness does not mean every business needs to go from day-level to minute-level latency. You can pick only part of the data to make real-time, or make it real-time only at certain times, while the bulk of the data stays offline.

Better timeliness may bring real changes to some of your businesses, and it can even significantly simplify your architecture and make the entire data warehouse more stable.

On the compute side, the leader in timeliness is Apache Flink. As I just said, improving timeliness is the next development focus of the lakehouse, so what we need to do now is bring the de facto standard for stream computing, Apache Flink, into the lakehouse architecture.

Therefore, over the past few years we did a lot of exploration in this direction, including investing in Iceberg and Hudi and polishing the integration between Flink and both of them. The results, however, were not that good. If you have used Flink + Iceberg or Flink + Hudi, you probably have complaints of your own. The key problem is that Iceberg and Hudi are data lake technologies oriented toward Spark and offline processing, which do not match real-time workloads and Flink well.

So we developed a new data lake format, Apache Paimon, a streaming data lake format. Let's look at the history and original intent of the "four musketeers" of the data lake.

Apache Iceberg and Delta Lake are essentially upgrades of the traditional Hive format. At their core they are still oriented toward append data. They are more capable and more convenient than Hive for T+1 analysis in an offline data warehouse, but they remain oriented toward traditional offline processing.

Apache Hudi's original intent was to add incremental updates on top of Hive. Its infrastructure is still oriented toward full plus incremental merging, and its Flink integration is not as good as its Spark integration; some functions are only available in Spark, not in Flink.

Apache Paimon is a data lake format born in the Flink community, designed to support large-scale updates and true streaming reads.

The difficulty of combining streaming with the lake really comes down to updates. If you are familiar with Flink, one reason Flink SQL succeeded is that it natively handles changelogs, and a changelog is, by nature, a stream of updates.

Iceberg, Hudi, and Delta are all built around batch processing and Spark-style incremental-plus-full merging. Once a merge is required, it is a very large merge of incremental data into full data. Suppose the full dataset is 10 TB: even a 1 GB increment may touch all files, and all 10 TB must be rewritten before the merge completes. The cost of merging is very high.

The alternative is an update technology called LSM, short for Log-Structured Merge-Tree. This structure is used by a large number of databases in the real-time field, including RocksDB, ClickHouse, Doris, StarRocks, and others.

The change LSM brings is that each merge can be partial: each compaction only merges data according to a certain strategy. This lets the format make a genuinely better trade-off in the triangle of write cost, freshness, and query latency, and different parameter choices pick different points in that triangle.

Introducing Apache Paimon

The first part covered the architecture evolution, why building a Flink lakehouse requires Flink plus lake storage, and where the difficulties lie. This second part introduces Apache Paimon.

What is Apache Paimon? You can simply think of its basic architecture as lake storage plus LSM. The basic capabilities of lake storage are writing and reading; on top of that, Apache Paimon integrates deeply with Flink: through Flink CDC, all kinds of CDC data, even an entire database, can be synchronized into Paimon with schema evolution.

Data can also be written to Paimon through Flink, Spark, or Hive, via wide-table merging or batch overwrite, which is basic lakehouse capability. Later you can batch read it and run analysis with Flink, Spark, StarRocks, or Trino, and you can also use Flink to stream-read the data in Paimon. The changelog that stream reading relies on, and the other features of stream reading, will be introduced later.
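
As a minimal illustration (the catalog name, warehouse path, and table names are placeholders, not from the talk), the Flink SQL entry point looks roughly like this:

    -- Minimal sketch: register a Paimon catalog in Flink SQL
    CREATE CATALOG paimon_catalog WITH (
        'type' = 'paimon',
        'warehouse' = 'oss://my-bucket/paimon'
    );
    USE CATALOG paimon_catalog;

    -- Write into a Paimon table from any Flink source table (batch or streaming)
    INSERT INTO orders SELECT * FROM default_catalog.default_database.kafka_orders;

    -- Batch-query the same table, or run the query in streaming mode to follow changes
    SELECT * FROM orders;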

This slide shows the general development history of Paimon as a stream-batch unified, real-time data lake. At the beginning of 2022 we identified this missing piece in the open source ecosystem and proposed Flink Table Store in the Flink community. The first stable version, 0.3, was released in January 2023, and in March the project entered the Apache Incubator as Apache Paimon. Paimon 0.5, released in September this year, is a fully mature version, covering CDC ingestion into the lake and append data processing.

We also benchmarked Apache Paimon against Hudi on Alibaba Cloud, testing the update (merge-on-read) performance of the lake storage. On the left, roughly 500 million records are ingested into the lake under comparable configurations and with the same index, and we measure how long the ingestion takes. The result is that Paimon's ingestion throughput is about 4x that of Hudi, while on queries over the same data Paimon is 10x or even 20x faster; with less memory, Hudi can even fail to read at all.

Why? Our analysis is that Hudi MOR writing is pure append: although compaction runs in the background, writes do not wait for it at all, so in the test Hudi barely compacted anything and read performance was particularly poor.

Based on this we also ran the benchmark on the right: copy-on-write over 100 million records, to measure merge and compaction performance in the copy-on-write case. The result is that whether the interval is 2 minutes, 1 minute, or 30 seconds, Paimon is clearly ahead, with a performance gap of up to 12x. At 30 seconds Hudi could no longer finish, while Paimon still ran fairly normally.

Looking back, I want to describe what Paimon can do with three key points.

First, a low-latency, low-cost streaming data lake. If you have used Hudi, we hope that after switching to Paimon you can run the same workload with 1/3 of the resources.

Second, it is easy to use, easy to ingest into, and efficient to develop against. You can easily synchronize database data into the Paimon data lake via CDC.

Third, it is deeply integrated with Flink, so the data truly flows.

Flink + Paimon streaming lakehouse

The first part discussed the evolution of data architecture, which is why we want to build Paimon. The second part introduced what Paimon can do, its integrations, advantages, and performance. This third part covers how Flink + Paimon build a streaming lakehouse.

First, a rough picture. In essence, a streaming lakehouse is still a lakehouse. What can a lakehouse do? At minimum, batch writing and batch reading, where it already beats the traditional Hive warehouse. On top of that, it must provide powerful streaming updates into the lake and streaming reads of incremental data, to achieve full-pipeline real-time processing and stream-batch unification. The hard parts are streaming updates and streaming reads.

One of the most typical scenarios a streaming lakehouse can solve is CDC data on Hive, that is, the pipeline from MySQL and other traditional databases, via CDC, into the warehouse or lake. The following is a fairly old architecture, but it is still widely used in enterprises.

In the first run (or whenever needed) a full synchronization writes a partition of the Hive full table. After that, incremental CDC data is sent to Kafka every day and periodically flushed into an incremental table in Hive. Once the nightly synchronization finishes, at around 00:10 the incremental table is merged with the full table, and the newly produced partition becomes the new day's full snapshot of MySQL.

With this approach the output latency is very high: at least T+1, plus the time spent waiting for the incremental and full data to be merged. The full and incremental data are also fragmented across systems, and storage is very wasteful: every partition of the Hive full table holds a complete copy of the data, so keeping 100 days of data means storing roughly 100 copies.

Third, the pipeline is long and complex and involves many different technologies. In real business scenarios it is very easy for output to be delayed: if any component has a problem, the data cannot be produced and the downstream chain of offline jobs cannot run. So what is described here is the "three highs": high latency, high cost, and high pipeline complexity.

Turning to streaming CDC updates with Flink + Paimon, we want to make the architecture very simple. We no longer need Hive partition tables; we only need to define a Paimon primary-key table without partitions, whose definition is very similar to that of a MySQL table, as in the sketch below.
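
A minimal sketch of such a primary-key table in Flink SQL (column names and the bucket count are illustrative, not from the talk):

    CREATE TABLE users (
        user_id    BIGINT,
        user_name  STRING,
        age        INT,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (user_id) NOT ENFORCED
    ) WITH (
        'bucket' = '4'   -- number of buckets, tuned to data volume
    );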

A Flink CDC job is enough to continuously integrate the full plus incremental CDC data into Paimon, after which you can see and query the table's state in real time. The data is synchronized in real time, but an offline warehouse still needs daily views, and this is where Paimon's Tag feature comes in. If you create a Tag today, it remembers today's state: every read of that Tag returns the same, immutable data. Tags can therefore equivalently replace the partitions of the Hive full table, and Flink and Spark can access a Tag's data through Time Travel syntax.
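
As a hedged illustration of the Tag workflow (option names follow the Paimon documentation; depending on the version, Tags can also be created explicitly with the create-tag action or procedure):

    -- Create daily Tags automatically on the table
    ALTER TABLE users SET (
        'tag.automatic-creation' = 'process-time',
        'tag.creation-period'    = 'daily'
    );

    -- Read the immutable snapshot behind a Tag via a Flink SQL dynamic-option hint
    SELECT * FROM users /*+ OPTIONS('scan.tag-name' = '2023-11-01') */;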

But traditional Hive tables are partitioned tables, and Hive SQL has no Time Travel semantics. What then? Paimon can also map Tags onto Hive partitions, so in Hive SQL you can still query multiple days of data with ordinary partition predicates; existing Hive SQL can query the Paimon table without changing a single line. After this architectural transformation, the data is visible in real time at minute-level latency, full and incremental data are integrated, and storage is reused. The pipeline is simpler and more stable, synchronization is one-click, and both storage and compute costs can be greatly reduced.
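
For Hive access, a hedged sketch of the tag-to-partition mapping (this assumes the table is registered in a Hive metastore catalog, and 'dt' is a hypothetical partition field name visible to Hive):

    ALTER TABLE users SET ('metastore.tag-to-partition' = 'dt');

    -- Then an ordinary Hive SQL partition query reads a day's Tag:
    -- SELECT * FROM users WHERE dt = '2023-11-01';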

On storage cost: thanks to Paimon's file reuse between Tags, keeping Tags for ten days costs only about one or two full copies. Compared with retaining 100 daily Hive partitions, the final storage cost can therefore be cut by roughly 50x.

In terms of compute cost, although you now have to maintain streaming jobs that run 24 hours a day, Paimon's asynchronous compaction keeps the resources needed for synchronization as low as possible. Paimon even provides whole-database synchronization, so hundreds of tables can be synchronized in a single job. The whole pipeline thus achieves the "three lows": low latency, low cost, and low pipeline complexity.
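
As a hedged sketch of keeping the 24/7 synchronization job cheap, the writer can be made write-only so that compaction runs asynchronously elsewhere (option and procedure names follow the Paimon docs; verify against your version):

    ALTER TABLE users SET ('write-only' = 'true');
    -- A separate, dedicated compaction job (Paimon's compact action, or the sys.compact
    -- procedure in newer versions) then merges files in the background.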

Next, let me introduce two streaming-read mechanisms. You may hear "Paimon is designed for real-time and is better at stream reading" and find the claim vague, since Hudi and Iceberg can also be stream-read. So I will use two mechanisms to show how much work Paimon has put into stream reading.

The first is the consumer mechanism. Without it, the most common annoyance in stream reading is FileNotFoundException. Why does it happen? Data production continuously creates snapshots; too many snapshots mean a huge number of files and very redundant storage, so a snapshot expiration (cleanup) mechanism is required. But other streaming jobs know nothing about that cleanup: if the snapshot a job is currently stream-reading is deleted by snapshot expiration, the job hits FileNotFoundException. Worse, a stream-reading job may fail over; if it is down for two hours, the snapshot it was reading may already be gone when it resumes, and the job can no longer be restored.

So Paimon introduced the consumer mechanism, which records the reading progress in the file system. While a consumer is still reading a snapshot, expiration will not delete it, which keeps stream reading safe. It also works like a Kafka group id for saving read progress: a job restarted without state resumes from the same position. The consumer mechanism can therefore be called the foundation of stream reading.
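
A minimal sketch of a consumer-based streaming read in Flink SQL (the consumer id is an arbitrary placeholder):

    SET 'execution.runtime-mode' = 'streaming';
    -- Expiration keeps the snapshot being consumed, and progress survives a stateless restart
    SELECT * FROM users /*+ OPTIONS('consumer-id' = 'downstream-etl') */;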

The second is changelog generation. Suppose we have a Paimon primary-key table where the key is a name and the value is a count, with the upstream continuously writing and the downstream continuously reading. Streaming writes may write the same key repeatedly: first Jason -> 1, then Jason -> 2. For the stream-reading job to compute correctly, say a SUM, the result should be 2, not 3 (1 + 2). Without a generated changelog, the downstream cannot know these rows share the same primary key; it needs to first receive a retraction of Jason -> 1 and then the new value Jason -> 2. So the lake storage itself has to behave like a database emitting a binlog, so that downstream streaming computation is correct and accurate.

What are the options for generating a changelog? In Flink streaming jobs, you may have written a lot of state-based deduplication to recover the changelog downstream, but the state cost of this approach is relatively high, the data ends up stored in multiple copies, and consistency is hard to guarantee. Another option is full compaction: Delta, Hudi, and Paimon can all produce the changelog during a full merge. That works, but every changelog emission then requires a full merge, which is very expensive.

The third option is unique to Paimon: changelog-producer = 'lookup'. Because Paimon is an LSM, it supports point lookups, so you can configure this mode to generate the changelog at write time through efficient batched point lookups, letting downstream stream processing handle the stream correctly.
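
A minimal sketch of enabling this on a primary-key table (mirroring the name/count example above; 'input' and 'full-compaction' are the other documented changelog-producer modes):

    CREATE TABLE word_count (
        word STRING,
        cnt  BIGINT,
        PRIMARY KEY (word) NOT ENFORCED
    ) WITH (
        'changelog-producer' = 'lookup'
    );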

The two parts above are Paimon's updates and stream reading. The streaming lakehouse is built for Flink's stream-batch unification: in the past we only had stream-batch unified computation; now, with storage, we have stream-batch unified computation plus stream-batch unified storage.

However, some users of Alibaba Cloud Serverless Flink asked: doesn't it lack the basic capabilities needed for batch processing, namely scheduling and workflows?

A streaming lakehouse has to address not only streaming but also batch, offline processing, because batch is the foundation of the lakehouse. The truly streaming part of a streaming lakehouse may be only 10% or 20%, not the whole thing. Flink's stream-batch unification is therefore inseparable from real batch processing in Flink.

As you can see in the streaming lakehouse diagram, processing the data takes roughly four steps.

The first step is one-click ingestion into the lake via Flink CTAS/CDAS (a sketch follows the four steps).

The second step makes the whole pipeline real-time and streaming, which requires the storage to support streaming reads and writes.

The third step is that all of this data can be analyzed by open analysis engines.

The fourth step is the essential lakehouse capability of batch reads and batch writes; on the product side, that basically means scheduling and workflows.
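
As a hedged sketch of step one, whole-database ingestion with CDAS on Alibaba Cloud Flink looks roughly like the following (catalog and database names are placeholders; the exact syntax is product-specific, so check the product documentation):

    CREATE DATABASE IF NOT EXISTS paimon_catalog.ods_db
    AS DATABASE mysql_catalog.app_db INCLUDING ALL TABLES;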

After a long wait, Alibaba Cloud Serverless Flink now officially offers scheduling and workflow capabilities in the product, so a truly complete batch processing pipeline can be built on Serverless Flink.

Next, consider a quasi-real-time streaming lakehouse case: e-commerce data analysis. Flink ingests data into ODS-layer Paimon tables in real time, and the data then flows through streaming jobs to DWD, then DWM, and finally DWS, forming a complete streaming lakehouse.
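
A minimal sketch of one hop in this layered pipeline (ODS to DWD; table and column names are illustrative, and each hop is a long-running streaming INSERT that consumes the upstream Paimon table's changelog):

    SET 'execution.runtime-mode' = 'streaming';
    INSERT INTO dwd_orders
    SELECT order_id, user_id, amount, order_time
    FROM ods_orders
    WHERE amount > 0;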

Streaming lakehouse demo

Demo viewing address:

https://yunqi.aliyun.com/2023/subforum/YQ-Club-0044

Open source big data session replay, 01:52:42 to 01:59:00.

Serverless Flink now offers not only streaming ETL but also a fairly complete batch processing capability. In the past, streaming might sit on one development platform and batch on another, which was very fragmented. Now the entire development platform can be Serverless Flink, the compute engine can be unified on Flink, and the underlying storage can be a single unified layer of Paimon, covering offline, real-time, and quasi-real-time processing: a solution unified from development through compute to storage. The batch processing version will be released soon; if you need it, contact us for an early trial.
