Xingfuli’s streaming data warehouse practice based on Flink & Paimon

Abstract: This article is compiled from ByteDance infrastructure engineer Li Guojun's talk at the Streaming Lakehouse Meetup. Xingfuli is a typical transaction-oriented business, and its real-time data warehouse modeling faces many challenges. This talk introduces Xingfuli's practical experience in building a streaming data warehouse based on Flink & Paimon, covering the business background, the stream-batch unified data warehouse architecture, the problems encountered in practice and their solutions, the benefits ultimately obtained with Paimon, and future plans.


01

Business background

Xingfuli is ByteDance's real estate business line. The data BP team supports this business in many directions, and one of the most important is the work order system. Its users are the agents and store managers on the front line of the Xingfuli business. The figure below shows how data is generated and flows through the work order system.

[Figure: data flow through the work order system]

First, a broker submits a work order for a viewing task they have completed, and the corresponding store manager then reviews the work order. This process generates two pieces of data, which are updated into the Binlog of the business database. The Binlog serves as the data source of the real-time data warehouse; after computation it produces data reports or feeds the assessment system directly. The data reports are used to display and evaluate whether the work of front-line agents meets the standard; the assessment system is used by store managers to set assessment quotas for front-line agents and to automatically grant rewards once the quotas are met. From the real-time data warehouse modeling behind these applications, we found that the real estate business has two typical characteristics:

  • Accuracy must be 100%: no data may be lost or duplicated.

  • Full-data computation is required: because incremental data is retained in MQ for only a limited time, a view over the full data has to be assembled for computation.


Features of real-time data warehouse modeling

[Figure: data sources of the real-time data warehouse]

In the actual business pipeline, multiple data sources with different characteristics feed the real-time data warehouse: the real-time incremental part is stored in MQ, while the full data is stored in Hive.

[Figure: original real-time data warehouse pipeline]

Each layer of the real-time data warehouse in the figure above is connected by a Flink Streaming SQL job. The main job of the DW layer is to join and widen multiple data sources and write the resulting wide table directly to MQ. Because MQ retention is limited, an hourly or daily periodic task is added so that at the end of each cycle the data in MQ is also landed into Hive. The DWM layer mainly performs aggregation, and its results are likewise written directly to MQ; its computation model is the same as the layer before it. The results of the real-time data warehouse serve online data applications through the Service layer, such as the data reports and assessment systems mentioned above. The Hive offline data produced by each layer can be used by the data BP team for offline work such as data troubleshooting and verification.
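To make the shape of such a DW-layer job concrete, here is a minimal sketch of a join-and-widen Flink SQL job that reads two MQ-backed source tables and writes the wide table back to MQ. All table names, fields, and connector options are invented for illustration; they are not the actual production jobs.

```sql
-- Hypothetical DW-layer job on the original MQ-based link:
-- join the work-order stream with the review stream and write the wide table to MQ.
CREATE TABLE ods_work_order (            -- incremental work-order data from MQ (illustrative)
  order_id    STRING,
  agent_id    STRING,
  store_id    STRING,
  submit_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'ods_work_order',
  'properties.bootstrap.servers' = 'localhost:9092',   -- placeholder broker address
  'format' = 'json'
);

CREATE TABLE ods_order_review (          -- incremental review data from MQ (illustrative)
  order_id    STRING,
  reviewer_id STRING,
  review_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'ods_order_review',
  'properties.bootstrap.servers' = 'localhost:9092',   -- placeholder broker address
  'format' = 'json'
);

CREATE TABLE dw_work_order_wide (        -- widened result, written back to MQ
  order_id    STRING,
  agent_id    STRING,
  store_id    STRING,
  submit_time TIMESTAMP(3),
  reviewer_id STRING,
  review_time TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'dw_work_order_wide',
  'properties.bootstrap.servers' = 'localhost:9092',   -- placeholder broker address
  'key.format' = 'json',
  'value.format' = 'json'
);

-- A regular unbounded join keeps both inputs in Flink state,
-- which is one source of the large-state problem described later.
INSERT INTO dw_work_order_wide
SELECT o.order_id, o.agent_id, o.store_id, o.submit_time, r.reviewer_id, r.review_time
FROM ods_work_order AS o
LEFT JOIN ods_order_review AS r
  ON o.order_id = r.order_id;
```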

Looking back at the two characteristics of this real-time data warehouse: the 100% accuracy requirement means that the stateful operators of the real-time jobs across the warehouse must maintain full data in state; the full-computation requirement means that, because the storage is heterogeneous, with real-time data in MQ and historical data in Hive, each layer has to consume the incremental data from MQ together with the full data from Hive. From the perspective of a development engineer, this real-time data warehouse model has the following pain points:

[Figure: development pain points]

During development, engineers constantly have to handle logic beyond the business logic itself, such as deduplicating data in SQL; a single field is not precise enough for deduplication, so a nanosecond timestamp has to be introduced, which brings non-deterministic computation, and so on. The root cause of these problems is that real-time data and offline data are stored separately across the whole link, and this heterogeneous storage makes it inherently hard to align the two parts of the data.
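The deduplication just mentioned is roughly the standard Flink SQL pattern sketched below, with invented table and column names and assuming the source declares a processing-time column. Because the ROW_NUMBER operator has to remember every key it has ever seen, its state effectively grows to the full data set.

```sql
-- Hypothetical deduplication step on the original link. The operator keeps one
-- entry per business key forever, which is where the full-data state comes from.
SELECT order_id, agent_id, submit_time
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY proc_time ASC   -- proc_time AS PROCTIME() is assumed on the source;
                                    -- in practice a nanosecond column played this role
         ) AS rn
  FROM ods_work_order
)
WHERE rn = 1;
```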

[Figure: data operation and maintenance pain points]

Data operation and maintenance here covers three parts: data troubleshooting, data verification, and data correction. The problem is that when troubleshooting or verification reveals that some SQL job on the link needs to be corrected, the corrected SQL writes its results to MQ, and landing the MQ data into storage incurs a T+1 cost. In addition, the intermediate results rolled back during the correction are directly exposed to users.

The second problem is that the purple part in the figure above is a simplified link; in actual production the complexity is much higher. On the main link, some tables and jobs are depended on by many other jobs or tables, so a correction can affect many tables or jobs in unpredictable ways. There are two root causes: the results of data correction are rolled back and exposed to users, and the lineage is complex and has to be maintained manually.

[Figure: large state pain points]

On the current link, the state that Flink real-time jobs maintain is very large, which consumes a great deal of storage and compute, and recovering a job from such a large state is also very slow. The two main causes of the large state are that the deduplication operator keeps the full data in state and that the cascading joins hold duplicated state.

Why choose Paimon

Given these pain points, we considered building a Streaming Lakehouse with the Flink ecosystem to solve the problems of the original link, which can be summarized as:

  • Heterogeneous storage: the Base (full) and Delta (incremental) data are hard to align;

  • Deduplication introduces non-deterministic computation and large state;

  • Complex lineage, and data correction results are rolled back and exposed to users.

To solve these problems on the original link, we chose Paimon:

  • Stream-batch unified storage is exposed as a single table: real-time and offline data live in one Paimon table, which directly solves the alignment problem;

  • Deduplication is no longer needed: the Changelog Producer generates a complete changelog and persists it on storage, replacing the stateful operator on the original link (see the sketch after this list);

  • Lineage management and data consistency management support data correction that is imperceptible to users.
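As a sketch of the second point: a Paimon primary-key table can be configured to generate and persist a complete changelog on storage, so downstream jobs consume it directly instead of rebuilding the changelog with a stateful operator. The table name, columns, and bucket count below are illustrative only.

```sql
-- Illustrative Paimon primary-key table that persists its own changelog.
-- Assumes the current catalog is a Paimon catalog.
CREATE TABLE dwd_work_order_wide (
  order_id    STRING,
  agent_id    STRING,
  store_id    STRING,
  review_time TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'bucket' = '4',
  -- 'input' reuses the upstream changelog as-is; 'lookup' or 'full-compaction'
  -- let Paimon generate the complete changelog itself.
  'changelog-producer' = 'lookup'
);
```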


02

Streaming data warehouse practice

[Figure: streaming data warehouse architecture]

First, here is the architectural design used in the streaming data warehouse practice:

  • The storage layer uses HDFS or S3 object storage as the storage base, with Paimon as the unified table abstraction;

  • The computing layer uses Flink as a single technology stack, unifying stream and batch computing;

  • The data management layer implements both table-level and data-level lineage management. On top of this lineage, data consistency can be achieved, and lineage also serves data traceability needs, helping to guarantee data quality;

  • Data consistency management covers stream-batch unified ETL data management: during multi-table consistency debugging, data is aligned automatically, without developers having to align it manually.

As shown in the figure above, the upper layer exposes services through the Gateway or Service layer, and end users access the entire system through the SQL Client or REST API.
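For example, from the SQL Client an end user would typically start by registering a Paimon catalog over the shared storage base and then work with the lakehouse tables directly; the catalog name and warehouse path below are placeholders.

```sql
-- Register a Paimon catalog over the shared HDFS/S3 storage base (path is a placeholder).
CREATE CATALOG lakehouse WITH (
  'type'      = 'paimon',
  'warehouse' = 'hdfs://namenode:8020/warehouse/paimon'
);

USE CATALOG lakehouse;

-- The same tables are now visible to both streaming and batch Flink jobs.
SHOW TABLES;
```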

[Figure: streaming data warehouse pipeline]

The figure above shows the streaming data warehouse pipeline. The data sources are the same as before: offline data in Hive and real-time data in MQ. The difference is that on entering the Streaming Lakehouse, an ODS layer first materializes each data source into a Paimon table through Flink Streaming SQL. The second layer, the DWD layer, joins and widens multiple data sources and materializes the results into Paimon tables. The final APP layer aggregates the metrics and exposes them.

Since the intermediate data is materialized in Paimon tables, developers can work on it directly when troubleshooting and verifying data. From this pipeline you can see that storage is stream-batch unified and that the Flink technology stack unifies stream and batch computing, which reduces both development and operation costs. Moreover, the materialized intermediate data can be queried directly, eliminating more tedious operations during operation and maintenance.
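For instance, a developer troubleshooting the DWD layer can switch the same SQL session to batch mode and query the intermediate Paimon table directly, optionally pinning a historical snapshot; the table name and snapshot id below are illustrative.

```sql
-- Ad-hoc verification of an intermediate table with the same technology stack.
SET 'execution.runtime-mode' = 'batch';

SELECT store_id, COUNT(*) AS order_cnt
FROM dwd_work_order_wide
GROUP BY store_id;

-- Time travel to a specific snapshot when checking a suspected bad revision.
SELECT *
FROM dwd_work_order_wide /*+ OPTIONS('scan.snapshot-id' = '42') */;
```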

[Figure: benefits of the streaming data warehouse practice]

After landing the Streaming Lakehouse practice described above, we summarized the following benefits:

  • Simplified development process

    • Stream-batch unified storage solves the heterogeneity between real-time and offline storage;

    • Less intrusion into business logic: the deduplication operator is removed and non-deterministic computation is eliminated.

  • Improved operation and maintenance experience

    • Intermediate data can be queried and traced back;

    • Lineage and multi-table consistency strengthen multi-table debugging and make data correction imperceptible to users.

  • Reduced state size

    • Persisting the changelog reduces state size by about 30%.

Alongside these benefits, we also ran into new problems in practice, two in particular:

  • Reduced data freshness: end-to-end latency increases to the minute level;

  • Small files: too many small files can affect read and write performance.


03

Tuning the streaming data warehouse

End-to-end latency tuning

[Figure: data visibility on the write path]

First, we need to analyze what determines when data on the link becomes visible. As shown in the figure above, after the Source receives data it keeps sending records to the corresponding Bucket. The Bucket Writer first caches the data in a memory buffer; when the buffer is full, all of its contents are flushed to disk, producing a data file that is not yet visible to the outside. Only when the upstream triggers a Checkpoint and the Commit operator on the link produces a Snapshot pointing to the newly generated data file does the data become externally visible.

Analyzing the entire process, two conclusions can be drawn:

  • Data visibility is bound to the Checkpoint; more precisely, the data visibility period is strictly bound to the Checkpoint period.

  • Checkpoint period = Checkpoint interval + Checkpoint latency. Checkpoint interval is the frequency of Checkpoint triggering; Checkpoint latency is the time required to complete a Checkpoint.

[Figure: Paimon Sink write and compaction process]

So for end-to-end latency tuning, is adjusting the Checkpoint period all we need to do? Is the easiest option simply to reduce the Checkpoint interval?

Before drawing a conclusion, let's look at the write path. In the Paimon Sink operator, the Bucket Writer keeps flushing data into data files on disk. The Paimon Sink also contains another component, the Compact Manager, which mainly compacts the data files on disk. As shown on the right of the figure above, a Bucket is a logical unit that corresponds to a directory in storage; the directory holds many data files, which are organized as an LSM tree across multiple levels. When the Compact Manager performs compaction, it works across these different levels of data.

From this we can infer that the whole compaction process is I/O heavy. If we blindly shrink the Checkpoint interval, Checkpoints become frequent: data that could have gone into one file, say 100 records, may instead be split across several files, each holding very little data. Too many small files then make compaction more expensive overall and also hurt write performance. Chasing data freshness is therefore a trade-off between write performance and freshness. After repeated verification in practice, we recommend setting the Checkpoint interval to 1-2 minutes.
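In job configuration this trade-off mostly comes down to the checkpoint settings; the sketch below simply mirrors the 1-2 minute recommendation (values are illustrative).

```sql
-- The Checkpoint period drives when data in Paimon tables becomes visible.
SET 'execution.checkpointing.interval' = '1min';
-- Exactly-once checkpointing matches the 100% accuracy requirement of this business.
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
```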

Checkpoint Latency optimization can be divided into several directions:

  • Log-based incremental Checkpoint

    Use features available in newer Flink versions, such as log-based incremental Checkpoints, to shorten the upload phase (a configuration sketch follows this list).

  • Reduce the amount of state

    For example, transmitting less data reduces the upload time.

  • Continuous Checkpoint upload

    Continuously upload local state files ahead of the Checkpoint.

  • Build an independent HDFS cluster

    Reduce the probability of hitting slow nodes.
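Assuming a reasonably recent Flink version, the first two directions map roughly to the configuration below; the keys are standard Flink options, while the path is a placeholder.

```sql
-- Log-based (changelog) incremental checkpoints: state changes are uploaded
-- continuously, shrinking the upload work done at checkpoint time.
SET 'state.backend.changelog.enabled' = 'true';
SET 'state.backend.changelog.storage' = 'filesystem';
SET 'dstl.dfs.base-path' = 'hdfs://namenode:8020/flink/changelog';  -- placeholder path

-- Incremental RocksDB checkpoints also reduce the amount of state uploaded per checkpoint.
SET 'state.backend' = 'rocksdb';
SET 'state.backend.incremental' = 'true';
```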

[Figure: Checkpoint latency optimization directions]

After optimization in the above four directions, we have verified in practice that the end-to-end delay can be reduced to the minute level.

Small file optimization

[Figure: impact of small files on HDFS]

ByteDance's internal practice uses HDFS as the storage base, and we define a small file as one significantly smaller than an HDFS block. The most direct problem with too many small files is that they require more blocks and therefore more block operations (create block, delete block, and so on), which means frequent I/O and heavy pressure on the HDFS NameNode, affecting stability. In addition, regardless of a file's actual size, its block metadata has a fixed footprint and lives in NameNode memory, so too much block metadata can cause OOM problems. When the data is scattered across too many files, it may also be distributed across more HDFS DataNodes, causing reads to hop back and forth between DataNodes, increasing random I/O, and lowering efficiency and performance; and the more DataNodes involved, the higher the probability of hitting a slow node.

[Figure: where small files are generated]

For the small file problem, the points at which small files may be generated, and the factors that determine this, are as follows:

  • File generation. Data files are flushed to disk at two trigger points: a Checkpoint, which forces the data currently in the write buffer to be flushed, and the write buffer itself filling up, which also flushes its data to disk. If the Checkpoint interval is set too small, or the write buffer is sized too small, data is flushed more frequently and too many small files result.

  • File distribution. The partition keys and bucket keys you set determine where data is routed and which files it lands in. For example, if the actual data volume in production is small but too many buckets are configured, the amount of data assigned to each bucket is predictably small, which again leads to small files. In addition, an unreasonable choice of partition key or bucket key can skew the files, that is, create a hot-key problem.

  • File cleaning. Paimon has a file-cleaning mechanism that deletes useless files during compaction. Data is also managed by snapshots: when a snapshot expires, the corresponding data files are removed from disk. If the compaction trigger conditions and snapshot expiration conditions are not managed well, redundant files remain on disk.

Based on the above, here are some small-file tuning parameters we have accumulated in practice, listed below; a combined table-option example follows the list.

  • Checkpoint interval: 1-2 minutes is recommended;

  • Write buffer size: use the default value unless you run into related problems that require adjusting it;

  • Business data volume: adjust the number of buckets according to the business data volume, aiming for roughly 1 GB per bucket;

  • Key settings: choose the bucket key and partition according to the characteristics of the actual business keys to avoid hot-key skew;

  • Compaction and snapshot management parameters: use the default values unless you run into related problems that require adjusting them.
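Put together, these recommendations typically end up as table options like the sketch below; the table, key choices, and sizes are placeholders to be adapted to the actual business data volume.

```sql
-- Illustrative Paimon table combining the small-file recommendations above.
-- Assumes the current catalog is a Paimon catalog.
CREATE TABLE app_agent_daily_metrics (
  dt        STRING,
  agent_id  STRING,
  order_cnt BIGINT,
  PRIMARY KEY (dt, agent_id) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
  'bucket' = '4',                    -- sized so a single bucket holds roughly 1 GB
  'bucket-key' = 'agent_id',         -- spread hot keys instead of skewing one bucket
  'write-buffer-size' = '256 mb',    -- default value; raise it only if flushes are too frequent
  'snapshot.time-retained' = '1 h',  -- expired snapshots let Paimon clean their files
  'snapshot.num-retained.min' = '10'
);
```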

[Figure: small-file tuning parameters]

After completing the architecture transformation, we compared the new link with the original real-time data warehouse. As shown below, we achieved gains on a number of metrics.

  • End-to-end latency: compared with the original real-time data warehouse with mini-batch enabled, end-to-end latency does not degrade noticeably, and near-real-time visibility of 1-2 minutes is supported;

  • Data troubleshooting timeliness improves from hours to minutes;

  • State size is reduced by about 30%;

  • The development cycle is shortened by approximately 50%.

[Figure: comparison with the original real-time data warehouse]


04

Future plans

[Figure: future plans]

Currently, the following four directions are mainly planned:

  • First, pursue second-level end-to-end latency. This may be done in several phases; the plan is to introduce an embedded log system, and in the long run to decouple data visibility from Checkpoints;

  • Second, data consistency management. Lineage management and data consistency management are both very valuable in day-to-day data operation and maintenance;

  • Third, state reuse, mainly to solve the problem of reusing Join state. We also hope to make intermediate state queryable;

  • Fourth, monitoring and operations. As the scale grows, we hope to establish a monitoring system and make its metrics observable.


Q&A

Q: With heterogeneous data sources, did you consider other technology options for lake ingestion? Why did you ultimately choose Paimon?

A: In the initial technology selection we mainly considered two points: integration with the Flink ecosystem and fit with the Streaming Warehouse model. At the time, Paimon was the best combination of the two. In addition, it performed better in our mainstream streaming-upsert scenarios.

Also, the intermediate storage layer follows a plugin model, so it does not have to be deeply bound to Paimon.

Q: What considerations have you made when doing data backtracking, snapshots and rollbacks? Can you give some suggestions for reference?

A: For this we mainly implemented lineage management on top of Paimon. Simply put, lineage management has two parts: table lineage and data lineage.

For table lineage, for example, when a job is submitted its upstream and downstream table information can be extracted from the task and inserted into a System Database. For data lineage, data versions can be divided by Checkpoint: the completion of a Checkpoint marks the generation of a data version, and the specific version consumed is then recorded in the System Database.

Based on these two kinds of lineage management, the old link can stay online and serving while the new link traces back or corrects data; in production, a data traceback can be completed by switching tables automatically at the system level.
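Paimon does not ship this System Database out of the box; it is an in-house component whose real schema is not public. Purely as an invented illustration, the two kinds of lineage records could be modeled like this:

```sql
-- Invented sketch of table-level and data-level lineage records (not the real schema).
-- Assumes a catalog (e.g. a Paimon catalog) that supports plain CREATE TABLE.
CREATE TABLE table_lineage (
  job_name     STRING,    -- Flink job, extracted at submission time
  source_table STRING,
  sink_table   STRING,
  PRIMARY KEY (job_name, source_table, sink_table) NOT ENFORCED
);

CREATE TABLE data_lineage (
  job_name        STRING,
  source_table    STRING,
  source_snapshot BIGINT,  -- data version consumed; one version per completed Checkpoint
  sink_table      STRING,
  sink_snapshot   BIGINT,  -- data version produced by that Checkpoint
  PRIMARY KEY (job_name, source_table, source_snapshot) NOT ENFORCED
);
```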

Q: If the data is processed purely as a stream and the data volume is too large, will it cause backpressure at the source so that data cannot come in?

A: This is a matter of write performance optimization. The Paimon official website has detailed guidance on this that you can consult.

Event video replay & PPT download

PC version

It is recommended to go to the Apache Flink learning website:

https://flink-learning.org.cn/activity/detail/69d2ec07bc2f664d000a954f49ed33aa

Mobile terminal

Video replay / PPT download: follow the Apache Flink or Apache Paimon public account and reply 0729
