Streaming Warehouse Data Consistency Management Based on Flink & Paimon

Abstract: This article is compiled from a talk given by Li Ming, a ByteDance infrastructure engineer, at the Apache Paimon Meetup. The content is divided into four parts:

  1. Background

  2. Scheme Design

  3. Current Progress

  4. Future Planning


1. Background

The early data warehouse production system was mainly based on offline data warehouses. Businesses divided the warehouse into different layers according to their own needs, such as DWD, DWS, and ADS. In an offline data warehouse, business data is loaded into the warehouse through offline ETL, and transformations between layers are also handled by offline ETL. The ADS layer can directly provide serving capabilities externally, while the intermediate layers usually store their data in Hive, which can also provide some OLAP query capability.

The advantage of the offline data warehouse production system is its maturity: the production system is complete, the tool chain is mature, storage and maintenance costs are relatively low, and the development threshold for users is low. The disadvantages are also obvious. First, data freshness is very low, usually T+1, at the hour level or even the day level. Second, changelog support is imperfect: although development is table-oriented, the intermediate storage, Hive, mainly supports append-only data, and offline ETL is better suited to processing full data than incremental updates.

As data volumes grow, offline ETL jobs take longer and longer to run, while businesses have ever higher requirements for data freshness and urgently need a new low-latency data warehouse production system. The offline data warehouse therefore evolved further into a real-time data warehouse production system.

A typical example is the real-time data warehouse production system based on the Lambda architecture. In the Lambda architecture, the business must maintain two links: the production link is split into a stream processing layer and a batch processing layer. The stream processing layer processes incremental data in real time and acts as the acceleration layer for the batch layer; it usually uses a real-time compute engine such as Storm or Flink, and the intermediate results are stored in Kafka to provide low-latency streaming consumption.

The batch processing layer is the same as the offline data warehouse and produces the T+1 results. The serving layer then combines the results of the stream processing layer and the batch processing layer to serve external queries.

With the continuous development of streaming compute engines, taking Flink as an example, stream-batch unification has been achieved at the compute layer. In some scenarios the batch processing layer can be removed entirely and the stream processing layer can handle both the full and the incremental computation. However, in order to provide OLAP query capability over key intermediate data, the Kafka data still has to be dumped into Hive.

The advantage of the real-time data warehouse production system is that data freshness is very high, and a lot of precomputation can be done in the stream processing layer to reduce query latency.

The disadvantages are also obvious:

  • First, data warehouse maintainers have to operate two completely different links, from compute to storage, so development and maintenance costs are high.
  • Second, storage costs are high. To provide low-latency streaming consumption, Kafka costs more than offline storage such as HDFS and S3. In addition, an extra full offline copy of the intermediate data must be kept so that it can be queried offline.
  • Third, it is difficult to align the data between the offline and real-time links, because two completely different technology stacks are used to build the stream and batch processing layers. Even though the logical abstraction is the same, the concrete implementations differ. Moreover, the stream processing layer keeps processing data incrementally, so it is hard to align its results with those of the batch layer at a fixed point in time.
  • Finally, the intermediate results of the streaming link cannot be queried, because Kafka only supports sequential consumption in stream mode and offers no point-query or batch-query capability. Although the Kafka data can be dumped into Hive, the freshness of that copy is relatively poor.

Although the compute engine has achieved stream-batch unification, the other pain points of the real-time data warehouse are largely caused by limitations of the storage layer. With the rise of data lake technology, a new kind of storage has emerged that supports efficient streaming and batch reads and writes, data backtracking, and data updates. On top of a data lake, a new data warehouse production system can be built: the Streaming Warehouse.

In a Streaming Warehouse, each intermediate table is abstracted as a Dynamic Table that supports both streaming and batch access, giving users the same production experience as the offline data warehouse. A Streaming Warehouse brings the following benefits.

First, it provides users with a unified table abstraction, so users only need to maintain one set of schemas. The technology stack is also unified, which greatly reduces development and operations costs for the business.

Second, it adopts unified stream-batch storage that supports both streaming consumption and OLAP queries, so the intermediate results of real-time computation can be queried at any time.

Finally, while guaranteeing data freshness, storage costs are lower than in the real-time data warehouse, because the intermediate storage can use relatively cheap storage such as HDFS or S3.

Next, we will make an overall comparison of these three data warehouse production systems.

  • In terms of data freshness, the real-time data warehouse and the Streaming Warehouse are close, and both offer a near-real-time production experience.

  • In terms of query latency, all three production systems have relatively low query latency, but querying the intermediate results of the real-time data warehouse requires extra cost, such as exporting those results to Hive.

  • In terms of development cost, the Streaming Warehouse is close to the offline data warehouse: their development models are similar, so development and data verification can be carried out easily and with a low threshold. Since the intermediate results of the real-time data warehouse cannot be queried, its debugging and data verification costs are relatively high.

  • In terms of operations cost, the Streaming Warehouse is also close to the offline data warehouse, because their production systems are similar: operators only need to maintain one link with a single technology stack. Both can also choose cheaper offline storage, so storage costs are lower.

So, does the Streaming Warehouse really cover all of our needs?

Let's first look at a business scenario: a typical join computation over product orders. In this scenario, order data and product data are imported, after some simple processing, into the ODS-layer tables of the Streaming Warehouse, namely the order table and the product table.

The order table and the product table are then joined into the DWD-layer product order detail table. Finally, some aggregations are computed over the DWD-layer table to produce the DWS-layer result tables, for example the total revenue of all products today, or the top 10 best-selling products today.
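As a rough illustration of this layering, the pipeline could be expressed with Flink SQL over Paimon tables as in the sketch below. All table and column names are hypothetical, the catalog setup and DDL are omitted, and in practice each layer would typically be its own Flink job; this is not the speaker's actual production code.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OrderPipelineSketch {
    public static void main(String[] args) {
        // Streaming mode: each INSERT below is a continuously running job.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // DWD layer: join the ODS order table with the ODS product table
        // (ods_order, ods_product, dwd_order_detail are hypothetical tables).
        tEnv.executeSql(
                "INSERT INTO dwd_order_detail "
                        + "SELECT o.order_id, o.product_id, p.product_name, o.amount, o.order_time "
                        + "FROM ods_order AS o "
                        + "JOIN ods_product AS p ON o.product_id = p.product_id");

        // DWS layer: aggregate the detail table, e.g. revenue per product.
        tEnv.executeSql(
                "INSERT INTO dws_product_revenue "
                        + "SELECT product_id, SUM(amount) AS revenue "
                        + "FROM dwd_order_detail "
                        + "GROUP BY product_id");
    }
}
```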

In such a scenario, the business will also perform common data warehouse operations; for example, it may modify the fields of the order table. If the fields of the order table are modified, how do we determine which downstream tables might be affected by this change? This reflects the first capability currently missing from the Streaming Warehouse: lineage management.

In addition, if the data in the order table is wrong, how do we correct the data along the production link? In the offline data warehouse, operations such as rerunning a task or overwriting data are easy to perform. Is such an operation as convenient in the Streaming Warehouse?

Since the Streaming Warehouse runs on a real-time production link, correcting this table also requires reprocessing its downstream tables, and during the whole correction the intermediate changes to the data should not be visible to the serving layer. For example, if an aggregation result has already reached 10, during the correction it may roll back to 1 and then gradually accumulate back to 10; that intermediate state should not be exposed.

Beyond these two problems, when running OLAP queries, how do we analyze the share of the top 10 products in total revenue? In an offline data warehouse we could run a batch query once both tables are ready. But there is no notion of "ready" in the Streaming Warehouse: the two tables are produced by two different jobs, and there is no data alignment between the jobs. When we run a multi-table join query, the results are not fully consistent, because a consistency guarantee is missing.

Let's summarize the problems of the Streaming Warehouse.

  • Lineage management is missing, including both table lineage and data lineage. Table lineage refers to the upstream and downstream dependencies between tables, while data lineage refers to which upstream data a piece of data was derived from and which downstream data was produced from it.

  • Unified version management is missing. In the offline data warehouse we can align data by hour or by day. In the Streaming Warehouse everything is processed in a streaming manner, so there is no concept of data alignment or version division, which leads to the lack of a consistency guarantee for multi-table join queries.

  • Data correction is difficult. Correction requires a lot of manual work, such as double-running the link and adjusting business logic, so the operations cost is high.

To address these problems, we propose a Streaming Warehouse based on Flink and Paimon that provides data consistency management capabilities.

2. Scheme Design

Let's now introduce the detailed design of the data consistency management solution based on Flink and Paimon.

The overall design of the consistency management scheme consists of two parts.

  • The first part is establishing the lineage between upstream and downstream. We introduce a System Database that records the lineage of all tables and data in the Streaming Warehouse. During job submission and data production, table lineage and data lineage are automatically written into the lineage tables.

  • The second part is data version control in the Streaming Warehouse: data visibility is maintained per version, and the data versions of multiple tables are coordinated so that they are processed consistently.

Below we introduce the design of these two parts in detail.

The first part is table lineage management. We introduce a System Database in the Streaming Warehouse and create source and sink lineage tables in it. During the submission phase of a job, the tables used by the job are parsed, and this information is recorded in Paimon's lineage tables.

The table structure shown in the talk is mainly used to record the relationships between tables and jobs. Based on these relationships, we can construct the data lineage between tables.
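The original slide with the exact table structure is not reproduced here. As a purely hypothetical sketch of what such source/sink lineage tables might look like (all database, table, and column names below are illustrative, not Paimon's actual system tables), assuming the current catalog is a Paimon catalog:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LineageTablesSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Hypothetical system database holding the lineage tables.
        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS sys");

        // Which job reads which table (source lineage).
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS sys.source_job_lineage ("
                        + "  job_name      STRING,"
                        + "  database_name STRING,"
                        + "  table_name    STRING"
                        + ")");

        // Which job writes which table (sink lineage).
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS sys.sink_job_lineage ("
                        + "  job_name      STRING,"
                        + "  database_name STRING,"
                        + "  table_name    STRING"
                        + ")");
    }
}
```

Joining the two tables on the job name then yields the table-level upstream/downstream relationships.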

 

Next, we introduce the design of data version control, starting with the basic concepts.

  • The first concept is the Flink Checkpoint, the mechanism Flink uses to periodically persist state and take snapshots, mainly for fault tolerance and two-phase commit.

  • The second concept is the Paimon Snapshot. When Flink completes a checkpoint, Paimon generates one or two snapshots, depending on whether Paimon performed a compaction during this period, but at least one snapshot is generated as a new data version.

  • The third concept is the Data Version. When computing, the engine aligns data according to its version and then processes it, which effectively gives micro-batch-style processing.

For now, in the short term, we align the concepts of Paimon Snapshot and Data Version, meaning that one Paimon Snapshot corresponds to one data version.
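Since Paimon commits on checkpoint completion, the checkpoint interval effectively becomes the cadence at which new snapshots, and therefore new data versions, appear. A minimal configuration sketch, using only standard Flink checkpoint settings (the interval values are arbitrary examples):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointCadenceSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Paimon commits when a checkpoint completes, so a 1-minute checkpoint
        // interval roughly means a new snapshot (data version) every minute.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Leave some headroom between checkpoints so commits do not pile up.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);

        // ... build the job that writes to a Paimon table here ...
    }
}
```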

Let's briefly look at an example of data alignment. Suppose Job-A and Job-B both read Table-A and produce their respective downstream tables Table-B and Table-C. When Job-C wants to run a join query over Table-B and Table-C, it should query based on consistent versions.

For example, Job-A produces Snapshot-11 of Table-B based on Snapshot-20 of Table-A, and Job-B produces Snapshot-15 of Table-C based on the same Snapshot-20 of Table-A. Then Job-C's query should be computed from Snapshot-11 of Table-B and Snapshot-15 of Table-C, so that the computation is consistent.
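When the query is issued as a plain batch query, one manual way to approximate this alignment today is Paimon's snapshot time travel, pinning each table to the consistent snapshot looked up from the lineage tables. The sketch below reuses the snapshot ids from the example above; the table and column names are hypothetical, and the dynamic table options hint assumes that feature is enabled in the Flink session:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ConsistentQuerySketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Pin table_b to snapshot 11 and table_c to snapshot 15, the pair that
        // the lineage tables say was derived from the same Snapshot-20 of table_a.
        tEnv.executeSql(
                "SELECT * "
                        + "FROM table_b /*+ OPTIONS('scan.snapshot-id' = '11') */ "
                        + "JOIN table_c /*+ OPTIONS('scan.snapshot-id' = '15') */ "
                        + "ON table_b.id = table_c.id")
                .print();
    }
}
```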

Next, we introduce the implementation of data alignment, which is divided into two phases.

  • In the submission phase, the consistent versions of the upstream and downstream tables are looked up in the lineage tables, and the initial consumption position of each upstream table is set based on the result (a manual approximation is sketched after this section).
  • In the running phase, checkpoints are coordinated with the consumed snapshots. When Flink's checkpoint coordinator sends a checkpoint request to the source, the checkpoint is forced to fall on the boundary between two snapshots: if the current snapshot has not been fully consumed, the checkpoint is postponed, so that data is divided and processed snapshot by snapshot.

After a Flink checkpoint succeeds, the sink operator is notified to commit to the table. Once the commit completes, the data of this snapshot becomes visible downstream, and a Commit Listener writes the data lineage into the system tables.
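Conceptually, the effect of the submission phase is similar to manually pinning each upstream source to a starting snapshot with Paimon's scan options, as in the sketch below; the consistency service does this automatically, and the table names and snapshot id here are hypothetical:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class InitialPositionSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Start streaming consumption of the upstream table from the snapshot
        // that follows the last consistent version found in the lineage tables.
        tEnv.executeSql(
                "INSERT INTO downstream_table "
                        + "SELECT * FROM upstream_table "
                        + "/*+ OPTIONS('scan.mode' = 'from-snapshot', 'scan.snapshot-id' = '21') */");
    }
}
```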

Now that these two capabilities are in place, what are the concrete application scenarios?

  • First, automated management of data lineage. Lineage is a very important part of the whole data warehouse: based on lineage we can quickly trace data and assess risk, and we can also analyze who uses a table, how much it is used, and how its data trends, in order to evaluate its actual value.
  • Second, consistent queries. Data can be automatically aligned by version for OLAP queries, guaranteeing consistent query results. Developing and debugging against consistent data also reduces development and operations costs, and the business side no longer has to align multiple tables manually.
  • Third, data correction. Based on data consistency management and data lineage, the correction process can be simplified. According to the lineage, we can automatically create mirror tables for the downstream tables that need to be corrected and then perform the correction. Two correction modes can be provided: full correction and incremental correction (a rough sketch of full correction follows this list).
    • Full correction: based on a consistent version of the data, the full data is re-consumed from upstream to regenerate the whole link. Once the new link has caught up, the tables can be switched over automatically.
    • Incremental correction: this can be combined with Flink's Savepoint mechanism, so the state does not have to be rebuilt from scratch, reducing the amount of data that needs to be backtracked.
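In spirit, a full correction could look like the following: rebuild a mirror of the affected downstream table from the corrected upstream data in batch mode, then switch readers over once it has caught up. The table names are hypothetical, the mirror table is assumed to already exist with the same schema, and the automated mirror creation and switchover described above are not shown:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FullCorrectionSketch {
    public static void main(String[] args) {
        // Batch mode: rebuild the downstream data in one pass.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // dwd_order_detail_mirror is a hypothetical mirror table assumed to
        // exist with the same schema as dwd_order_detail.
        tEnv.executeSql(
                "INSERT OVERWRITE dwd_order_detail_mirror "
                        + "SELECT o.order_id, o.product_id, p.product_name, o.amount, o.order_time "
                        + "FROM ods_order AS o "
                        + "JOIN ods_product AS p ON o.product_id = p.product_id");
    }
}
```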

3. Current Progress

Below we introduce the current phased progress of data consistency management.

In the community, we have opened related issues, PIPs, and mailing-list discussions. If you are interested, you can follow their progress, and if you have new needs or ideas, you are welcome to discuss them with us.

Inside ByteDance, we have completed the development and testing of a POC version. In this version, we provide a third-party external service to manage lineage, coordinate data versions, and so on.

4. Future Planning

Finally, let me introduce our future plans for the Streaming Warehouse.

  • First, end-to-end latency optimization. During the POC we found that end-to-end latency largely depends on the Flink checkpoint interval, and the internal business requirements we collected are fairly strict about end-to-end latency. This creates a trade-off: shortening the checkpoint interval to reduce latency produces more small files. In the next stage we will focus on the end-to-end latency problem.
  • Second, enhanced data correction. This is currently one of the most frequently raised pain points in real-time data warehouse production. Businesses want the cost of data correction to be low enough, and they want the intermediate results generated during correction to be invisible externally.
  • Third, state reuse. Many data warehouse scenarios are multi-table joins. In Flink today, the Join operator stores the full details of both input streams in state, and in cascading multi-way joins each Join operator also stores the result of the previous join, effectively storing the details of the earlier tables again, which causes severe state expansion. Businesses hope these states can be reused, that is, the data of the same table only needs to be stored once, which would greatly reduce state storage overhead. They also hope this intermediate state can be queried; if the state could be stored in Paimon tables and accessed via Lookup Join, the intermediate state could be queried directly with Flink SQL (a rough lookup-join sketch follows this list).
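As a rough sketch of what a lookup join against a Paimon table looks like in Flink SQL (table and column names are hypothetical, and the left table is assumed to have a processing-time attribute proc_time); whether this is how the state-reuse idea will ultimately be implemented is still an open question:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LookupJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // orders is assumed to have a processing-time attribute proc_time;
        // product_dim is a Paimon table holding the "historical" side of the join.
        tEnv.executeSql(
                "INSERT INTO dwd_order_detail "
                        + "SELECT o.order_id, o.product_id, d.product_name, o.amount "
                        + "FROM orders AS o "
                        + "JOIN product_dim FOR SYSTEM_TIME AS OF o.proc_time AS d "
                        + "ON o.product_id = d.product_id");
    }
}
```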

Q&A

Q: Is the lineage analysis based on Flink's Calcite?

A: No, it is implemented based on Flink's table factory mechanism. When the DynamicTableSource and DynamicTableSink are created, the relevant table and job information is extracted and then written into Paimon's lineage tables.
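As an illustrative-only sketch of the idea (the actual implementation is not public, and every name below is hypothetical), a source factory could capture the table identifier when the planner asks it to create a source and record it before delegating to the real connector:

```java
import java.util.Collections;
import java.util.Set;

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.factories.DynamicTableSourceFactory;

public class LineageRecordingSourceFactory implements DynamicTableSourceFactory {

    @Override
    public DynamicTableSource createDynamicTableSource(Context context) {
        // catalog.database.table of the source table read by this job.
        String tableIdentifier = context.getObjectIdentifier().asSummaryString();

        // Hypothetical hook: persist (job name, table identifier) into the
        // source lineage table.
        recordSourceLineage(tableIdentifier);

        // A real implementation would delegate to the underlying connector
        // factory here; omitted in this sketch.
        throw new UnsupportedOperationException("sketch only");
    }

    private void recordSourceLineage(String tableIdentifier) {
        // Write the lineage record, e.g. through a small client for the lineage tables.
    }

    @Override
    public String factoryIdentifier() {
        return "lineage-recording"; // hypothetical connector identifier
    }

    @Override
    public Set<ConfigOption<?>> requiredOptions() {
        return Collections.emptySet();
    }

    @Override
    public Set<ConfigOption<?>> optionalOptions() {
        return Collections.emptySet();
    }
}
```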

Q: How is data corrected when a job produces wrong results? That is, what is the process of returning to normal, and how long does it take?

A: Our goal is for the data correction process to be completed automatically by the system. The initial idea is: during correction, generate mirror tables for the downstream tables based on the table lineage, and double-run the jobs on this mirror link. Thanks to data lineage, the data can still be processed version by version. When the latency of the two links is basically aligned, the jobs and tables are switched over. How long this takes depends on the amount of data processed, the complexity of the link, and so on.

Q: Have you considered building a unified Paimon management service on this basis, for example Paimon metadata management, compaction management, and lineage management?

A: At present we are only considering metadata management and lineage management. Compaction management is probably better suited to something like a Table Service.

Q: The business time span is relatively large. Does the Flink join cache the full data?

A: A regular Flink join stores all of the tables' data in state, and cascading joins cause severe state expansion. Following the principle of the join, it could be implemented as Lookup Join + Delta Join: for historical data, use a lookup join against the historical table; for the latest incremental data, keep it in state and join via state lookups. In this way the large amount of full data is stored in Paimon tables and only a small portion is cached in state. This relies on version management to distinguish whether the data being joined is historical or incremental.

Q: Will column-level (field) lineage be supported? Presumably it would be parsed from the SQL syntax.

A: Field-level lineage is not being considered for the time being.


