How Station B (Bilibili) builds a real-time data lake on Hudi


01

Background and Pain Points

[Figure: Station B's data reporting, processing, and consumption links]

In big data scenarios, the business needs not only correct results but also timeliness. Our company has evolved two links: data with high timeliness requirements goes through a Kafka + Flink real-time link, while data with low timeliness requirements goes through a Spark offline link. The figure above outlines how data is reported, processed, and consumed at Station B. Collection centers on the behavioral event data reported by the APP; log data reported by the server side is stream-distributed to the big data warehouse through the gateway and distribution layers.

Business data stored in MySQL is periodically batch-synchronized to the warehouse through DataX. Time-sensitive data is stream-computed with Flink + Kafka; data with low timeliness requirements is batch-computed with Spark + HDFS. The results are finally delivered to storage such as MySQL, Redis, and Kafka for AI model training, BI report analysis, and similar scenarios.

In practice, we found a number of problems.

1. Offline data is not timely enough. Offline batch computation runs at hour/day granularity, while more and more business teams want minute-level freshness, which hourly or daily jobs cannot deliver. To get higher timeliness, a business team ends up building a second, real-time link.

2. But the observability of the real-time link is weak. Data in Kafka is inconvenient to inspect; it has to be moved into other storage before it can be viewed. Real-time links are also hard to align with business time, and it is difficult to locate the exact starting point from which a rerun should begin. When data goes wrong, the business generally does not rerun on the real-time stream but performs a T-1 repair on the offline link.

3. The dual real-time/offline links double the resource, development, and operations costs. Inconsistent calibers between the two links add further cost in explaining discrepancies.

4. The daily compute peak is concentrated in the early morning: the whole big data cluster peaks between 2:00 and 8:00 a.m., mostly running day-level jobs, with visible task queuing, while resources sit relatively idle the rest of the day. These pronounced peaks and valleys leave room to improve overall resource utilization.

5. Data silos: users need to clone copies of data out of the warehouse, which brings data consistency problems; and once data is out of the warehouse, permission control and federated queries fall short.

We hope to solve the above pain points through the real-time data lake solution.

1. With Flink + Hudi we store both data increments and the results of incremental computation in Hudi, supporting minute-level data freshness and further improving timeliness.

2. In addition, Hudi has stream-table duality: it can be consumed incrementally as a stream and queried directly as a table. Compared with a Kafka link, observability is much better (see the sketch after this list).

3. The real-time data lake serves both real-time and offline needs, cutting cost and raising efficiency. It also supports rerunning offline data-warehouse data.

4. Through incremental computation, the resources that used to be claimed all at once after midnight are spread across every minute of the day, so resources are used in staggered peaks.

5. Through sorting, indexing, materialization, and similar techniques, Hudi tables can be queried directly, so data no longer has to leave the warehouse.
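To make the stream-table duality in point 2 concrete, here is a minimal sketch using the Flink SQL API with the hudi-flink connector. The table name, schema, and HDFS path are assumptions for illustration; the option keys follow the hudi-flink connector, though availability varies by Hudi version.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HudiDualitySketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // A Hudi table on HDFS (placeholder path/schema); MERGE_ON_READ suits
        // frequent incremental writes.
        tEnv.executeSql(
            "CREATE TABLE user_events (" +
            "  uid BIGINT," +
            "  event_id STRING," +
            "  ts TIMESTAMP(3)," +
            "  dt STRING" +
            ") PARTITIONED BY (dt) WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///warehouse/user_events'," +
            "  'table.type' = 'MERGE_ON_READ'," +
            "  'read.streaming.enabled' = 'true'," +      // consume incrementally as a stream
            "  'read.streaming.check-interval' = '60'" +  // poll for new commits every 60s
            ")");

        // 1) Streaming incremental consumption: minute-level freshness.
        tEnv.executeSql("SELECT event_id, COUNT(*) FROM user_events GROUP BY event_id").print();

        // 2) The very same table can be queried as a plain batch table for
        //    offline ETL, e.g. by switching off streaming read per query:
        // SELECT * FROM user_events
        //   /*+ OPTIONS('read.streaming.enabled'='false') */ WHERE dt = '2022-11-01'
    }
}
```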

02

Scenario Exploration

2.1 DB ingestion scenario

[Figure: the DB ingestion scenario]

Business system data lives in MySQL and needs to be imported into the big data warehouse for reporting, analysis, computation, and other scenarios. Today, ingested business data is used not only for offline ETL; timeliness is increasingly expected. In the manuscript (content submission) review scenario, for example, staff want to know whether the growth in submissions over the past ten minutes matches the review manpower, with demands for real-time monitoring and alerting. That data comes from the business database, and day-level or hour-level sync no longer satisfies them; they want minute-level freshness.

Against the backdrop of cost reduction and efficiency improvement, we must satisfy business demands while staying efficient. Keeping a real-time link just for real-time scenarios plus an offline link for batch ETL is wasteful. So we want a unified stream-batch solution built on the real-time data lake that serves real-time and offline scenarios at once. Our research found that existing approaches fall into the following categories:

First, DataX periodically batch-exports data to Hive.

Hive itself cannot update data, so this approach generally exports a daily full snapshot, which fails the timeliness requirement. It is also redundant: each daily partition of the Hive table is a snapshot of the MySQL table on that day. Take a user-information table that barely changes: a full snapshot is still stored every day, so each user record is stored once per day. With a 365-day lifecycle on the Hive table, every record ends up stored 365 times over.

Second, the Canal/CDC-to-Hudi solution.

DB data is written into Hudi through Canal or Flink CDC, which satisfies the timeliness requirement. But because Hudi data is updated in real time, it lacks repeatable reads, so this approach does not fit the ETL scenario, even with Hudi's snapshot-read capability: although one can read a historical Hudi commit and obtain the snapshot at some instant, retaining commit data for a long time produces too many files, hurting timeline access performance and, in turn, Hudi read/write performance.
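For reference, a hedged sketch of the Flink CDC-to-Hudi path with Flink SQL. The mysql-cdc options come from the flink-cdc-connectors project; all hostnames, credentials, schemas, and table names below are placeholders.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class MySqlCdcToHudiSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);  // Hudi commits are driven by checkpoints
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source: MySQL binlog via the mysql-cdc connector (placeholder values).
        tEnv.executeSql(
            "CREATE TABLE orders_src (" +
            "  id BIGINT," +
            "  status STRING," +
            "  update_time TIMESTAMP(3)," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'mysql-cdc'," +
            "  'hostname' = 'mysql.internal'," +
            "  'port' = '3306'," +
            "  'username' = 'reader'," +
            "  'password' = '******'," +
            "  'database-name' = 'biz'," +
            "  'table-name' = 'orders'" +
            ")");

        // Sink: an upsert-able Hudi MOR table keyed on the same primary key.
        tEnv.executeSql(
            "CREATE TABLE orders_hudi (" +
            "  id BIGINT," +
            "  status STRING," +
            "  update_time TIMESTAMP(3)," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///warehouse/orders_hudi'," +
            "  'table.type' = 'MERGE_ON_READ'," +
            "  'precombine.field' = 'update_time'" +  // latest business update wins on merge
            ")");

        tEnv.executeSql("INSERT INTO orders_hudi SELECT * FROM orders_src");
    }
}
```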

Third, the Hudi-export-to-Hive solution.

This combines the previous two: DB data is written to Hudi through Canal/CDC and then periodically exported to a Hive table. The Hudi table serves real-time scenarios and the Hive table serves offline ETL, meeting both kinds of demands at once. The drawback is that users now work with two tables, which costs extra understanding and duplicates data.

Fourth, the Hudi Savepoint solution.

This mainly addresses data redundancy. A periodic Savepoint preserves Hudi's timeline metadata at that moment; accessing the Savepoint maps to Hudi's actual data files, avoiding redundant storage. One Savepoint per day is equivalent to one MySQL snapshot per day, meeting the ETL requirement of repeatable reads, while the latest Hudi data remains directly accessible for real-time demands.

But this scheme still has a flaw: it cannot precisely split data across day boundaries. When Flink writes to Hudi incrementally, commits are generated periodically, and there is no way to align commits with business time. If yesterday's and today's data land in the same commit, then a Savepoint, whose minimum granularity is a commit, taken for yesterday will include today's data, which is not what the user expects.

[Figure: Hudi Snapshot View — filtering cross-day data at the commit level]

To solve this, we propose the Hudi Snapshot View: an improvement over the Hudi Savepoint scheme, essentially a Hudi Savepoint with a filter.

In the export-to-Hive scheme, filter conditions can be added to drop the next day's data; we push that filtering logic down into the Hudi snapshot view itself. The filter is implemented inside Hudi and stored in Hudi Meta, and when the snapshot view is accessed, only data passing the filter is returned, solving the cross-day data problem.

As shown in the figure above, Delta Commit T3 contains data of both November 1st and November 2nd. The snapshot view's metadata records all of the historical data of T1, T2, and T3, plus the filter condition Delta <= November 1. Data is filtered at read time, so the query side only sees data on or before November 1st, never November 2nd's.

The snapshot view itself stores only metadata and maps to the actual data files, so there is no redundant storage. It serves real-time and offline scenarios at once, unifying streaming and batch. In addition, the snapshot view carves out an independent timeline that supports operations such as compaction and clustering to speed up queries.

[Figure: timing of Snapshot View generation — event time vs. processing time]

Next, the timing of Snapshot View generation: after which commit should the snapshot view be taken? Two concepts matter here: event time and processing time.

When data is delayed, the wall clock may have reached midnight while the pipeline is still processing data from 22:00. Taking the snapshot view at that point would give users less data than expected, because a commit reflects processing time, not business event time. The Snapshot View must be taken only after the event time has advanced past midnight, to guarantee data completeness. For this we added a processing-progress notion to Hudi, similar to how Flink uses Watermarks to track progress: we extended Hudi Meta to store the processing progress in each commit. When the event time advances past midnight, the Snapshot View operation runs and notifies downstream tasks that they can start. With the event-time progress in Meta, queries on Hudi can also read that progress and make their own judgments. A hedged sketch of reading this progress back follows.
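This is a conceptual sketch only: the "watermark" key in commit extra metadata is the custom convention described above, not a stock Hudi field, and the exact builder/timeline APIs vary across Hudi versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

public class SnapshotViewTrigger {
    // Returns true once the event-time progress recorded in the latest commit
    // has crossed the given midnight timestamp, i.e. the previous day's
    // Snapshot View can be cut safely.
    public static boolean eventTimePassedMidnight(String basePath, long midnightMillis)
            throws Exception {
        HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
            .setConf(new Configuration())
            .setBasePath(basePath)
            .build();
        HoodieTimeline timeline = metaClient.getActiveTimeline()
            .getCommitsTimeline().filterCompletedInstants();
        if (timeline.empty()) {
            return false;
        }
        HoodieInstant last = timeline.lastInstant().get();
        HoodieCommitMetadata meta = HoodieCommitMetadata.fromBytes(
            timeline.getInstantDetails(last).get(), HoodieCommitMetadata.class);
        // "watermark" is the assumed custom key carrying event-time progress.
        String watermark = meta.getExtraMetadata().get("watermark");
        return watermark != null && Long.parseLong(watermark) >= midnightMillis;
    }
}
```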

[Figure: engine-layer adaptation for snapshot and real-time partitions]

We also adapted the engine layer. For users, the SQL is essentially unchanged from before: a hint or a set parameter chooses whether to query the snapshot partition or the real-time partition. In the DB ingestion scenario this satisfies both the timeliness of real-time scenarios and the offline ETL demands, unifying real-time and offline and achieving cost reduction and efficiency gains.
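Purely illustrative: the post does not show the internal hint/parameter syntax, so the 'read.snapshot.date' key below is an invented stand-in; only the OPTIONS hint mechanism itself is stock Flink SQL.

```java
// Hypothetical usage sketch: the same SQL reads either the frozen snapshot
// view (for offline ETL) or the latest data (real-time), switched by a
// made-up dynamic option standing in for the internal parameter.
public class SnapshotOrRealtimeQuery {
    public static void run(org.apache.flink.table.api.TableEnvironment tEnv) {
        // Offline ETL: read yesterday's frozen snapshot view.
        tEnv.executeSql(
            "SELECT COUNT(*) FROM user_events " +
            "/*+ OPTIONS('read.snapshot.date'='2022-11-01') */").print();

        // Real-time: same SQL, latest data.
        tEnv.executeSql("SELECT COUNT(*) FROM user_events").print();
    }
}
```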

2.2 Tracking-event (buried point) ingestion scenario

[Figure: the tracking-event ingestion scenario]

Next, the tracking-event ("buried point") ingestion scenario. Like any Internet company, we define user behavior events, collect and report the data into the warehouse, then analyze and use it, letting data drive business development.

Tracking-event reporting has reached a considerable scale: tens of thousands of behavior event IDs are defined, hundreds of billions of records are added daily, and the traffic is very large. Event ingestion is a company-level project; every business team on the site reports events, and cross-team usage across department business lines is heavy. For example, the advertising AI team uses data reported by other business lines for sample collection and training.

In the original architecture, data reported from the APP is transmitted, cleaned, and lands into pre-divided warehouse table partitions for business teams to develop against. The division is coarse-grained, based on business attributes such as BU and event type. Downstream tasks can use their own department's tables and other departments' tables simply by applying for permission.

But this architecture has several pain points:

1. Insufficient isolation. Tens of thousands of tracking events share the same transport channel and cleaning pipeline, so one event type spiking during a campaign can drag down overall processing progress.

2. Business lines must filter out large amounts of irrelevant data. A downstream task may analyze only one of its own events, but that event is mixed with the department's other events, and the engine layer can only filter at partition level: the whole partition's files are loaded and then filtered, wasting IO on large reads.

3. Permission management is difficult for cross-department usage. Using a single event of another BU requires applying for permission on that BU's entire table, which is coarse-grained and risky.

4. Downstream teams want minute-level freshness, but streaming data is currently cleaned hourly, so downstream timeliness is stuck at the hour level.

[Figure: optimized tracking-event ingestion architecture with per-business Hudi tables and Views]

To solve these problems, we optimized the architecture. As shown in the figure above, after data is reported and transmitted, it lands in each business's own Hudi table. Users access the data through Views, which serve offline ETL, real-time computation, BI report analysis, and other scenarios.

Data that needs second-level freshness, a relatively small share, still goes through a high-quality Kafka link for online services. The Polaris event management platform, together with metadata management, governs the full lifecycle of each tracking event: definitions, distribution rules, and so on.

Platform governance starts from edge-side reporting and rule-based distribution, isolating businesses from one another and improving isolation. After data lands in the business's Hudi table, Clustering sorts and indexes it; the engine layer can then skip data at file and data-block level, reducing the IO actually read. Users read data through Hive Views, and the platform achieves event-level permission management by adding the authorized event IDs to a user's View: when someone in department A needs a department-B behavior event, the platform simply adds that event's ID to department A's View, and event-level permission verification happens when the SQL is submitted for inspection. Because the Hudi table supports incremental consumption, the incremental transmission-and-cleaning pipeline achieves minute-level freshness, and downstream real-time tasks can sit behind the same View, unifying stream and batch. A sketch of such a view is shown below.
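A minimal sketch of the event-level permission view; the table, view, and event-ID names are made up for illustration.

```java
// Sketch: department A's view exposes only authorized event IDs; granting A a
// department-B event is just adding that ID to the view's filter list.
public class EventPermissionView {
    public static void create(org.apache.flink.table.api.TableEnvironment tEnv) {
        tEnv.executeSql(
            "CREATE VIEW dept_a_events AS " +
            "SELECT * FROM ods_tracking_events " +  // assumed source Hudi table
            "WHERE event_id IN (" +
            "  'a.page.view', 'a.video.click'," +   // department A's own events
            "  'b.play.start'" +                    // a department-B event granted to A
            ")");
    }
}
```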

On the Hudi side, since traffic data is never updated once it enters the lake, we write in no-index mode and remove the bucket-assign step and related processes to speed up writes. Meanwhile, Flink's incremental Clustering did not add significant delay to downstream ETL. After Clustering the data is ordered, and the index records the distribution of behavior events, so conditional queries can filter at file and data-block level. Engines such as Flink and Spark also support predicate pushdown on Hudi tables, improving efficiency further. As for Flink's support for Views, a Watermark can be defined downstream of a View, WITH attributes can be defined on the View itself, and so on. A hedged sketch of the writer configuration follows.
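A hedged sketch of the traffic-ingestion writer: append-only (no index lookup or bucket assign) plus async clustering sorted by event_id. Option keys follow the hudi-flink connector, but the table name, path, and schema are assumptions and option availability varies by version.

```java
public class TrackingSinkDdl {
    public static void create(org.apache.flink.table.api.TableEnvironment tEnv) {
        tEnv.executeSql(
            "CREATE TABLE tracking_sink (" +
            "  event_id STRING," +
            "  uid BIGINT," +
            "  ts TIMESTAMP(3)," +
            "  dt STRING" +
            ") PARTITIONED BY (dt) WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///warehouse/tracking'," +
            "  'table.type' = 'MERGE_ON_READ'," +
            // append-only write: skip index lookup / bucket assign entirely
            "  'write.operation' = 'insert'," +
            // writer schedules clustering plans, executed asynchronously
            "  'clustering.schedule.enabled' = 'true'," +
            "  'clustering.async.enabled' = 'true'," +
            // sort by event_id so queries can skip files/data blocks
            "  'clustering.plan.strategy.sort.columns' = 'event_id'" +
            ")");
    }
}
```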

Combining these architecture adjustments with Hudi's capabilities, we improved the isolation and timeliness of tracking-event management and saved resources.

2.3 BI real-time report scenario

[Figure: the original BI real-time report architecture]

Next, the BI real-time report scenario. In the original architecture, traffic data and DB data are imported into the warehouse, joined and widened, aggregated, and the results written to storage such as MySQL; BI reports connect directly to MySQL for display, while a separate offline link performs T-1 data repair.

The pain points of the original architecture: real-time and offline links are built twice over, with high compute and storage costs, high development and operations costs, and extra cost explaining caliber differences. Kafka data must be copied to other storage before it can be queried, so observability is weak. Kafka links are also hard to repair: the starting point of a repair is difficult to pin down, so T-1 repair is the norm. And there are data-silo problems.

[Figure: BI report links with Hudi replacing Kafka]

BI real-time reports generally do not need second-level freshness; minute-level is enough. Replacing Kafka with Hudi meets the real-time and offline requirements at once, unifies stream and batch, reduces cost, and unifies the data caliber.

Compared with Kafka, data in Hudi can be queried directly, and alerting on it is simpler and more convenient.

For example, to alert when current data deviates beyond a threshold from the data of seven days ago: on Kafka you must consume both the week-old data and the current data, compute over both, and then alert, a rather involved process. With Hudi, the query SQL is identical to offline query SQL, so for the DQC system the real-time and offline DQC schemes are unified and cheap to develop. (Tasks that truly need second-level freshness still require the Kafka link.) A sketch of such a check is below.
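A sketch of a real-time DQC check running as plain batch SQL against the Hudi table; the table name, schema, path, and the 50% drop threshold are all assumptions for illustration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.types.Row;

public class HudiDqcCheck {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());
        tEnv.executeSql(
            "CREATE TABLE user_events (event_id STRING, ts TIMESTAMP(3)) WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///warehouse/user_events')");

        // Last hour vs. the same hour seven days ago (168h earlier).
        long cur = count(tEnv,
            "ts >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR");
        long base = count(tEnv,
            "ts >= CURRENT_TIMESTAMP - INTERVAL '169' HOUR " +
            "AND ts < CURRENT_TIMESTAMP - INTERVAL '168' HOUR");

        if (base > 0 && cur < base * 0.5) {
            System.err.println("ALERT: volume dropped vs. 7 days ago: " + cur + " / " + base);
        }
    }

    private static long count(TableEnvironment tEnv, String predicate) throws Exception {
        try (var it = tEnv.executeSql(
                "SELECT COUNT(*) FROM user_events WHERE " + predicate).collect()) {
            Row row = it.next();
            return (Long) row.getField(0);
        }
    }
}
```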

In addition, data no longer has to leave the warehouse: BI reports can query the Hudi table directly for display.

[Figure: read amplification when aggregating directly on Hudi detail tables]

In actual use there were still problems: aggregate queries run directly against Hudi detail tables took too long, with read amplification.

Suppose a real-time DQC job counts the last hour of data every five minutes to monitor row counts. Every five minutes it recomputes nearly an hour of data, so as the window slides, the intermediate data is computed over and over: severe IO amplification.

Or take the BI report scenario: suppose a DAU curve is displayed, where each point is a cumulative value over history. The value at 1:00 accumulates data from 0:00 to 1:00; the value at 2:00 accumulates data from 0:00 to 2:00. Rendering the chart computes n points, each repeating the earlier points' work, so queries are slow and reads are amplified.

Development and operations costs are also high. A BI panel shows multiple indicators, possibly different dimensions of the same Hudi detail table. Producing ten indicators means developing and maintaining ten real-time tasks: high cost and low reliability, since when one task fails, its indicators disappear from the panel.

Our optimization is to build Projection materialized views with Flink + Hudi. Using Flink state, only incremental data needs to be ingested and computed, avoiding read amplification; query results are computed in advance, and queries read the result data directly, accelerating them.

Concretely: the user submits a query SQL to Excalibur Server for parsing. During parsing, a Projection creation is submitted as a streaming task that incrementally reads the original table, performs the materialized computation, and writes into the Projection materialized table. When a query SQL hits a materialization rule, it is rewritten to query the result table directly, achieving the acceleration.

[Figure: Projection materialization and query-rewrite flow]

By extending Flink Batch SQL's parsing, materialization rules and Projection metadata are loaded at query time, and the materialized table's current Watermark progress is checked; if it meets the requirement, the query is rewritten onto the Projection materialized table.

[Figure: materialization rules and TVF syntax support]

We drew on Calcite's materialization rules and added TVF syntax support.

Projection creation is supported: when users submit a batch query, they can add a hint to the SELECT statement telling the query engine the query will be reused, and the engine creates a Projection for that query.

Flink SQL Projection DDL syntax and SQL query-rewrite rules are supported: when a user submits a batch query and a matching Projection exists, the query can be rewritten to use the Projection's results directly, speeding queries up to second-level or even millisecond-level response.

Projection rewrite and downgrade is driven by indicators such as the Watermark, which shields users from delays and failures of the Projection's real-time task and keeps query results reliable. We added Watermark processing-progress information in Hudi Meta: during writes, the materialization progress is recorded in the Commit Meta. At materialization-rule matching time, if the progress lags too far behind the current time, rewriting onto that Projection is rejected and the query is downgraded to the original table.

Adding a creation hint on the SELECT statement thus accelerates queries through materialization; it solves the read-amplification problem, and the automatic downgrade reduces users' development and operations costs. An illustrative (hypothetical-syntax) example follows.
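Purely illustrative: the post does not show the internal Excalibur hint or Projection DDL syntax, so the hint name below is invented; only the overall flow (hint on first submission, automatic rewrite or downgrade afterwards) comes from the text above.

```java
public class ProjectionUsageSketch {
    public static void run(org.apache.flink.table.api.TableEnvironment tEnv) {
        // First submission: a (hypothetical) hint asks the engine to
        // materialize this query as a Projection.
        tEnv.executeSql(
            "SELECT /*+ CREATE_PROJECTION */ dt, COUNT(DISTINCT uid) AS dau " +
            "FROM user_events GROUP BY dt").print();

        // Later submissions of the same query are rewritten onto the
        // Projection result table automatically -- unless its Watermark lags
        // too far behind, in which case the engine silently falls back to
        // the original table.
    }
}
```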

In the future we will optimize Projection efficiency: recycle Projections that go unhit for a long time; merge Projections with the same dimensions to cut materialization compute cost; and connect to the indicator system, using its cache to accelerate queries and serve streaming computation in some high-QPS scenarios.

[Figure: stream-batch unified data repair]

In the past, Flink SQL handled streaming writes and Spark SQL handled batch repair, so two sets of SQL had to be developed and operated. Under the Hudi real-time data-warehouse solution, we borrowed from the offline correction approach.

Rerunning a historical partition uses Flink Batch Overwrite, consistent with the offline repair method.

Rerunning the current partition uses what we call Flink Stream Overwrite: the current partition's data must first be cleared or deleted, then rewritten. Because writes are no-index, there is no way to overwrite previously written data via updates. So we extended the Hudi Catalog so that Flink SQL's ALTER TABLE ... DROP PARTITION deletes both the partition and its data; streaming writes then resume, implementing the Stream Overwrite. A sketch follows.
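A sketch of the current-partition rerun flow described above. Note that DROP PARTITION deleting data relies on the extended Hudi Catalog described in the text; the table names, schema, and partition values are placeholders.

```java
public class CurrentPartitionRerun {
    public static void run(org.apache.flink.table.api.TableEnvironment tEnv) {
        // 1) Clear today's partition: with the extended catalog this deletes
        //    both the partition and its data files.
        tEnv.executeSql("ALTER TABLE dws_play_cnt DROP PARTITION (dt = '2022-11-02')");

        // 2) Re-stream today's data into the emptied partition.
        tEnv.executeSql(
            "INSERT INTO dws_play_cnt " +
            "SELECT dt, event_id, COUNT(*) AS cnt FROM ods_tracking_events " +
            "WHERE dt = '2022-11-02' GROUP BY dt, event_id");
    }
}
```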

With tooling support for cascading task reruns, we can repair from the source ODS layer all the way downstream, and no longer need to develop and maintain Spark T-1 repair tasks: true stream-batch unification.

03

Infrastructure optimization

[Figure: splitting the Table Service from write tasks]

For infrastructure, we optimized the Table Service. Tasks such as compaction and clustering consume substantial resources and interfere with write tasks, degrading write performance; we solve this by splitting the Table Service out to run on independent resources.

The generation of Compaction and Clustering execution plans stays inside the write task, while the actual execution of Compaction and Clustering runs as independent tasks. This avoids interference between writing and the Table Service and improves write performance. It also supports dynamically adjusting the Compaction plan strategy, reducing unnecessary IO by tuning the frequency. A hedged configuration sketch follows.
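A hedged sketch of this split using hudi-flink connector options: the streaming writer only schedules compaction plans, and a separate process executes them, so heavy merge IO never competes with ingestion. Table name, schema, and path are placeholders; option availability varies by Hudi version.

```java
public class CompactionSplitSketch {
    public static void create(org.apache.flink.table.api.TableEnvironment tEnv) {
        tEnv.executeSql(
            "CREATE TABLE user_events_w (uid BIGINT, ts TIMESTAMP(3)) WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///warehouse/user_events'," +
            "  'table.type' = 'MERGE_ON_READ'," +
            "  'compaction.schedule.enabled' = 'true'," +  // writer only generates compaction plans
            "  'compaction.async.enabled' = 'false'" +     // never executes them in the write job
            ")");
        // The scheduled plans are then executed by an independent job, e.g.
        // Hudi's offline Flink compactor:
        //   flink run -c org.apache.hudi.sink.compact.HoodieFlinkCompactor \
        //     hudi-flink-bundle.jar --path hdfs:///warehouse/user_events
    }
}
```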

[Figure: Hudi Manager hosting table services]

Hudi Manager handles large-scale management and hosting, including table-service hosting: Compaction, Clustering, and Projection tasks run independently with resource isolation, improving write stability. It supports automatic restart and can run in batch or streaming mode.

For table management, Hudi's metadata is now created at table-creation time rather than at the first write, to avoid important parameters being missed. For example, a batch import into Hudi may not care about the preCombine comparison field, yet it initializes the table's metadata; later streaming writes do not modify that metadata, and the missing merge field makes it impossible to obtain correct merged results. A sketch of declaring it up front follows.
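A sketch of declaring the merge-ordering (preCombine) field in the table DDL so that later streaming writes merge correctly even when the first writer is a batch import that ignores it. Field and table names are placeholders; the option key follows the hudi-flink connector.

```java
public class CreateWithPrecombine {
    public static void create(org.apache.flink.table.api.TableEnvironment tEnv) {
        tEnv.executeSql(
            "CREATE TABLE orders_hudi (" +
            "  id BIGINT PRIMARY KEY NOT ENFORCED," +
            "  status STRING," +
            "  update_time TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///warehouse/orders_hudi'," +
            "  'table.type' = 'MERGE_ON_READ'," +
            "  'precombine.field' = 'update_time'" +  // merge ordering fixed at creation time
            ")");
    }
}
```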

For policy configuration, when users choose between OLAP and ETL scenarios, the execution intervals of the different table services are configured automatically. For example, a downstream ETL scenario scheduled daily can use a lower compaction frequency than an OLAP scenario.

[Figure: community contributions across Hudi modules]

As shown in the figure above, in actual use we found and fixed many problems in data quality, stability, and performance, and contributed functional enhancements back to the community, covering Sink, Compaction, Clustering, the Common package, Source, Catalog, and more. Some of the capabilities mentioned in the earlier scenarios will be pushed to the community as PRs or RFCs.

04

Summary and Outlook

[Figure: summary of real-time data lake practices]

We have carried out a series of practices around streaming data into the lake, DB data into the lake, report scenarios, and stream-batch unification.

Next, we will go deeper into the data warehouse itself: exploring whether materialized tasks can reduce warehouse layering, and performing intelligent layering through analysis, diagnosis, and optimization of layered tasks, so that business colleagues can focus on using data while the layering workload shrinks, evolving toward lakehouse unification. We will strengthen incremental computing, support Join ETL around Hudi with Join logic optimized at the storage layer, and explore the use of Hudi in the AI field.

On the kernel side, we will enhance the Hudi Meta Store to unify metadata management, enhance the Table Service, and enhance Hudi Join's column-splicing ability.


Source: blog.csdn.net/u013411339/article/details/131318211