Exploration and practice of the stream-batch integration architecture in the Kangaroo Cloud data stack

1. About the stream-batch integrated data warehouse

Stream-batch integration is an architectural idea: for the same business, a single set of SQL logic serves both stream processing and batch processing.

From an efficiency perspective, batch processing can only present business data as t+1, while stream processing presents it as t+0. When the two are kept separate, an enterprise must develop, run, and maintain two sets of code, which means high labor cost and a long delivery cycle. Stream-batch integration presents both views of the business data with one set of code, cutting development and O&M costs roughly in half and significantly improving timeliness.
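As a minimal sketch of the idea in Flink SQL (table and column names here are hypothetical), the statement below can back both views of the business: submitted as a streaming job it yields continuously updated t+0 results, and submitted as a batch job over the bounded offline data it yields the t+1 report.

```sql
-- One set of SQL logic serving both stream and batch computation.
INSERT INTO dws_product_gmv_daily
SELECT
    product_id,
    DATE_FORMAT(trade_time, 'yyyy-MM-dd') AS dt,
    SUM(amount) AS gmv
FROM dwd_trade
GROUP BY product_id, DATE_FORMAT(trade_time, 'yyyy-MM-dd');
```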

So, what is a stream-batch integrated data warehouse? Simply put, it is a data collection that processes data from heterogeneous sources with a single computing engine and, combined with the layered storage architecture of data warehouse theory, serves both real-time and offline analysis.

Such a data collection has the following characteristics:

Subject-oriented: the data warehouse organizes data by subject domain;

Integrated: inconsistencies in the source data are eliminated to guarantee consistent, enterprise-wide information;

Relatively stable: data in the collection is retained for a long time and only needs to be loaded and refreshed periodically;

Reflecting historical change: historical information is stored, enabling quantitative analysis of the enterprise's development to date and prediction of future trends.

2. The data stack's evolution toward a stream-batch integrated data warehouse

As customer volume grew and customer requirements increased, the data stack technical team faced more and more challenges in processing PB-scale batch and stream data. Through this process the team gradually refined the data stack's warehouse architecture: starting from batch processing on a traditional architecture in 2017, it iterated over four years to today's integrated data warehouse on a hybrid architecture, as shown in the figure:

Figure: evolution of the hybrid data warehouse architecture

1. Batch processing based on a traditional architecture

In the early days of the Internet era, although data volume grew sharply and single-day fact-table data reached the tens of millions of rows, most customer scenarios were of the "t+1" kind, so offline analysis alone was sufficient.

As the Hadoop ecosystem was just emerging, and facing the twin pressures of exploding data and strained storage, the data stack technical team adopted the Hadoop stack: data was periodically imported into HDFS, and Hive served as the data warehouse, enabling offline analysis over the massive datasets on HDFS.

In essence, this stage did not differ much from the traditional architecture: data was still loaded periodically and then analyzed. What changed was the tooling, from classic data warehouse tools to big data tools.
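A sketch of what this stage looks like in practice (the paths and names are hypothetical): files imported into HDFS are registered as a dated Hive partition, and the t+1 analysis runs over yesterday's partition.

```sql
-- Hive dialect. Register the nightly HDFS import as yesterday's partition.
ALTER TABLE ods_orders ADD IF NOT EXISTS PARTITION (dt = '2021-11-01')
LOCATION '/warehouse/ods_orders/dt=2021-11-01';

-- T+1 offline analysis over the newly loaded partition.
SELECT product_id, COUNT(*) AS order_cnt, SUM(amount) AS gmv
FROM ods_orders
WHERE dt = '2021-11-01'
GROUP BY product_id;
```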

2. Independent stream and batch processing based on the Lambda architecture

With the development of network and communication technology, "two-day-old" data could no longer meet customer needs. Customers increasingly expected real-time data, so that in scenarios such as finance, securities trading, retail, and real-time port monitoring and early warning, decision makers could form an informed judgment immediately, improving efficiency and reducing losses.

In response to this change, the data stack technical team drew on the mainstream big data technologies of the time. Spark, then the most advanced unified stream-batch computing engine, was added alongside the original Hive warehouse to accelerate offline computing. At the same time, on top of the existing offline architecture, a stream processing link based on Kafka storage and the Flink computing engine was added to compute metrics with high real-time requirements.

Although Spark and Flink satisfied customers' real-time presentation scenarios, Spark, while stream-batch unified in concept, essentially implements streams on top of batches (micro-batching) and still fell short in practice. Flink in the same period was not yet mature either, so the data stack technical team extended Flink's functionality to a certain degree.

In this process, two projects were incubated: FlinkX, which synchronizes a wider range of data sources, and FlinkStreamSql, which computes over and writes to more data sources in real time through SQL. (Taken from open source, given back to open source: the data stack technical team has shared both on GitHub; readers who need them can click through to the original post.)

At this stage, through the self-developed FlinkX and FlinkStreamSql, the data stack technical team added a stream computing link for real-time analysis alongside the original offline link, completing the transition from the traditional big data architecture to the Lambda architecture.

The core idea of the Lambda architecture is to split the business: businesses with high real-time requirements take the real-time computing path, businesses with low real-time requirements take the offline computing path, and finally the data service layer aggregates all results for downstream use.

Figure: independent stream and batch processing under the Lambda architecture
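In practice, the split means the same metric is written and maintained twice, once per link. A hedged sketch (engine dialects and names are hypothetical):

```sql
-- Speed layer: streaming SQL on the real-time link (e.g. Flink over Kafka).
INSERT INTO rt_product_gmv
SELECT product_id, SUM(amount) AS gmv
FROM kafka_orders
GROUP BY product_id;

-- Batch layer: a second, separately maintained query on the offline link
-- (e.g. Hive/Spark SQL over the warehouse table). Logic drift between the
-- two is exactly the "inconsistent caliber" pain point described later.
INSERT OVERWRITE TABLE offline_product_gmv
SELECT product_id, SUM(amount) AS gmv
FROM hive_orders
GROUP BY product_id;
```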

3. Real-time processing based on the Kappa architecture

Deploying the Lambda architecture basically met customers' demands for real-time data. A large number of customers use the data stack DTinsight to power data-driven production tasks, and at daily volumes in the tens of thousands DTinsight maintains stable operation, providing a solid backing for customers' data-driven business.

Although the Lambda architecture met customers' real-time business needs, business volume grew as enterprises developed, and development and O&M costs grew with it. By this time Flink's stream processing technology had matured: exactly-once semantics and stateful computation could fully guarantee the accuracy of final results. The data stack technical team therefore began considering how to adjust the Lambda architecture.

Jay Kreps, former chief engineer at LinkedIn, proposed an improvement to the Lambda architecture: strengthen the Speed Layer so that it not only handles real-time processing but can also reprocess previously processed historical data when the business logic changes, without a separate batch layer.

Inspired by Kreps, the data stack team recommends that customers with heavier real-time business retain Kafka's data log for the required number of days. When a stream task's code changes or upstream data needs to be recomputed, the original Job N is left untouched; a second job, Job N+1, is started from a specified historical offset and writes into a new table, Table N+1. Once Job N+1's progress catches up with Job N, Job N is replaced by Job N+1, and downstream applications simply analyze or display data from the table produced by Job N+1. In this way the offline link can be removed, sparing customers the extra code development and maintenance while unifying the business's computing caliber.
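A sketch of this replay pattern in Flink SQL (the topic, timestamp, and table names are hypothetical): Job N+1 reads the retained Kafka log from a chosen starting point and writes to a new result table while Job N keeps serving.

```sql
-- Job N+1's source: replay the retained log from a historical position
-- instead of the latest offset.
CREATE TABLE orders_replay (
    order_id   BIGINT,
    product_id BIGINT,
    amount     DECIMAL(10, 2),
    trade_time TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json',
    'scan.startup.mode' = 'timestamp',
    'scan.startup.timestamp-millis' = '1630425600000'
);

-- Write the recomputed result into a new table (version N+1); downstream
-- switches to it once this job catches up, and Job N is then retired.
INSERT INTO result_v2
SELECT product_id, SUM(amount) AS gmv
FROM orders_replay
GROUP BY product_id;
```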

The Lambda architecture's drawback is precisely that it maintains code producing the same result in two complex distributed systems; replaying historical data through the real-time path with increased parallelism can effectively replace the offline processing system. The resulting architecture is simple and avoids maintaining two code bases that must be kept consistent.

Figure: Kappa architecture real-time data warehouse

4. A stream-batch integrated data warehouse based on a hybrid Kappa + Lambda architecture

With the Lambda and Kappa architectures, the data stack could address most enterprises' real-time scenarios and development and O&M requirements. Some businesses, however, have very high real-time requirements, and extreme out-of-order data can make real-time results inaccurate; the streaming task then faces a data quality problem.

For this situation, the data stack technical team combined the strengths of the Kappa and Lambda architectures: the offline link of the Lambda architecture periodically revises the output of the real-time link, and all computing tasks in the pipeline are unified on the FlinkX engine to guarantee eventual consistency of the data.
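One way to express the periodic correction (a hedged sketch; the partition layout and names are hypothetical): a scheduled batch task recomputes the affected day from the complete data and overwrites what the streaming job produced, so late or out-of-order events are eventually accounted for.

```sql
-- Runs nightly in batch execution mode: rebuild yesterday's partition of
-- the result table from the full offline data, overwriting the streaming
-- output for that day.
INSERT OVERWRITE dws_product_gmv_daily PARTITION (dt = '2021-11-01')
SELECT product_id, SUM(amount) AS gmv
FROM dwd_trade
WHERE DATE_FORMAT(trade_time, 'yyyy-MM-dd') = '2021-11-01'
GROUP BY product_id;
```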

3. A technical look at FlinkX, the core stream-batch engine of the data stack

FlinkX is a Flink-based, stream-batch unified data synchronization and SQL computing tool. It can collect static data, such as business data in MySQL or HDFS, as well as continuously changing data, such as MySQL binlog and Kafka. In FlinkX 1.12, FlinkStreamSql will also be merged in, so that FlinkX 1.12 can not only collect static and dynamic data through synchronization tasks, but also process the collected data in streaming or batch fashion, according to business timeliness, through SQL tasks.

Within the data stack, FlinkX's stream-batch integration manifests at two layers: the data acquisition layer and the data computing layer.

1. Data acquisition layer

From the perspective of data timeliness, data divides into real-time data and offline data. High-throughput message middleware such as Kafka and EMQ usually carries a continuous stream of data, so FlinkX real-time collection tasks can land it as it arrives, letting subsequent tasks run near-real-time business computations accurately. OLTP databases such as MySQL and Oracle usually hold historical transaction data, stored and computed by day or by month; such data can be synchronized incrementally or in full, at intervals, through FlinkX offline synchronization tasks into the OLAP data warehouse or data lake, where it is then layered and analyzed in batches according to the various business metrics.
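FlinkX's synchronization jobs are themselves configured as JSON scripts, but the logic they carry can be sketched in SQL terms (the connectors and names here are hypothetical):

```sql
-- Real-time collection: continuously land the message stream (e.g. a Kafka
-- source table) into storage so downstream jobs can compute in near real time.
INSERT INTO ods_orders_rt
SELECT order_id, product_id, amount, trade_time
FROM kafka_orders;

-- Offline incremental sync, scheduled daily: pull yesterday's increment
-- from the OLTP source (e.g. a JDBC source table over MySQL) into the
-- OLAP warehouse / data lake ODS layer.
INSERT INTO ods_orders
SELECT order_id, product_id, amount, trade_time
FROM mysql_orders
WHERE trade_time >= TIMESTAMP '2021-11-01 00:00:00'
  AND trade_time <  TIMESTAMP '2021-11-02 00:00:00';
```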

Besides landing data in the storage layer, FlinkX synchronization tasks can also perform data cleaning, transformation, and dimension completion, following the data standards defined in data governance together with data warehouse conventions, improving the timeliness of the data and the accuracy of business computations.
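Cleaning, transformation, and dimension completion during synchronization can be sketched like this (a hypothetical lookup join; it assumes the source table declares a processing-time attribute `proc_time AS PROCTIME()`):

```sql
-- Drop malformed rows, normalize the amount, and complete the product
-- dimension while the data is being landed.
INSERT INTO dwd_orders
SELECT
    o.order_id,
    o.product_id,
    p.product_name,                           -- dimension completion
    CAST(o.amount AS DECIMAL(10, 2)) AS amount,
    o.trade_time
FROM ods_orders_rt AS o
JOIN dim_product FOR SYSTEM_TIME AS OF o.proc_time AS p
    ON o.product_id = p.product_id
WHERE o.amount IS NOT NULL AND o.amount > 0;  -- basic cleaning rules
```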

2. Data computing layer

Once data has been collected into the designated storage layer, routine business computation proceeds according to storage type and business timeliness. FlinkX SQL's ability to support both stream and batch computing comes from Flink 1.12's unified metadata management and its support for a batch execution mode on the DataStream API, which improves job reusability and maintainability: a FlinkX job can switch freely between streaming and batch execution modes while maintaining a single code base, with no rewriting. Moreover, compared with open-source Flink, FlinkX 1.12 not only provides more sources and sinks to support real-time and offline computing over a wide range of data sources, it also implements a dirty-data management plug-in, so that erroneous or non-compliant data in the ETL stage can be both detected and tolerated.
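The switch is configuration rather than code. In the Flink 1.12+ SQL client, for example, the same INSERT statement can be submitted under either mode:

```sql
-- Continuous, low-latency processing over unbounded sources:
SET 'execution.runtime-mode' = 'streaming';

-- ...or a scheduled offline run over bounded input, with no query rewrite:
SET 'execution.runtime-mode' = 'batch';
```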

Figure: how FlinkX implements stream-batch integration in the data stack

3. The data stack's stream-batch integration practice in the data warehouse

The following walks through the data stack's stream-batch integration, combining the architecture diagram with a concrete scenario.

Scenario: the K-line in stock trading comes in several granularities, such as the intraday (time-sharing) chart, the daily chart, and the weekly chart. After a user's stock trade completes, the buy/sell points and the transaction amount must be displayed on the K-line.

How the data stack handled this scenario before stream-batch integration:

For the scenario above, before stream-batch integration the data stack computed the intraday chart's trading points with Flink, while daily-K, weekly-K, and other trading points were computed by periodically scheduled Spark tasks, that is, the classic Lambda architecture. Its pain points are obvious: maintaining two code bases is inefficient, running two computing engines is costly, and the data calibers diverge.

How the data stack handles this scenario with stream-batch integration:

On the data stack platform, first create real-time collection and data synchronization tasks that collect the business database's data into Kafka and Iceberg, which form the ODS layer of the warehouse.

From ODS to the DWD layer, the cleaning and widening logic of the real-time warehouse and the offline warehouse is the same, and the table definitions are identical, so this step needs only one set of Flink SQL; the platform automatically translates it into a Flink Stream task and a Flink Batch task, serving both warehouses. At the DWS layer, the real-time warehouse stores the intraday-chart trading points while the offline warehouse stores daily-K, weekly-K, and similar data; since the processing logic differs, two sets of SQL are developed at this layer. A streaming Flink SQL job reads the real-time DWD layer and computes intraday trading points in real time, and a batch Flink SQL job reads the offline DWD layer and computes daily-K, weekly-K, and other trading-point data on a schedule. The application layer then reads the buy/sell point data directly from the DWS layer for display.
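A condensed sketch of the layered SQL in this scenario (all names are hypothetical, and trade_time is assumed to be declared as the event-time attribute). The ODS-to-DWD statement is written once; only the DWS layer needs two business-specific queries:

```sql
-- ODS -> DWD: one cleaning/widening statement. The platform translates it
-- into a Flink Stream task (reading the Kafka ODS) and a Flink Batch task
-- (reading the Iceberg ODS); the DWD table definitions are identical.
INSERT INTO dwd_stock_trade
SELECT user_id, stock_code, price, volume, trade_time
FROM ods_stock_trade
WHERE price > 0 AND volume > 0;

-- DWD -> DWS, streaming side: per-minute trading points for the intraday
-- (time-sharing) chart.
INSERT INTO dws_intraday_points
SELECT stock_code,
       TUMBLE_START(trade_time, INTERVAL '1' MINUTE) AS window_start,
       SUM(price * volume) AS turnover
FROM dwd_stock_trade
GROUP BY stock_code, TUMBLE(trade_time, INTERVAL '1' MINUTE);

-- DWD -> DWS, batch side: periodically scheduled daily-K aggregation
-- (weekly-K and others follow the same pattern).
INSERT OVERWRITE dws_daily_k PARTITION (dt = '2021-11-01')
SELECT stock_code,
       MAX(price) AS high,
       MIN(price) AS low,
       SUM(price * volume) AS turnover
FROM dwd_stock_trade
WHERE DATE_FORMAT(trade_time, 'yyyy-MM-dd') = '2021-11-01'
GROUP BY stock_code;
```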

The example shows that the data stack chose Iceberg as the stream-batch unified storage layer, for the following reasons:

1. Iceberg stores raw data and accommodates diverse data structures;

2. Iceberg supports multiple computing models; it is a generically designed table format that cleanly decouples computing engines from the underlying storage;

3. Iceberg is flexible about the underlying storage, typically cheap distributed file systems such as S3, OSS, or HDFS, and specific file formats and caches can be chosen to meet a scenario's analysis needs;

4. The Iceberg project has a rich community behind it, and many large companies at home and abroad already run massive data volumes on it;

5. Iceberg keeps the full data, so when a stream computing task needs to rerun historical data it can read from Iceberg and then switch seamlessly back to Kafka (see the sketch below).
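Points 2 and 5 can be made concrete with the Iceberg Flink connector (a hedged sketch; the catalog and table names are hypothetical): the same Iceberg table serves bounded batch scans and, via snapshot-following incremental reads, streaming consumption, so replaying history is not limited by Kafka retention.

```sql
-- Batch: a bounded scan over the full table.
SELECT * FROM iceberg_catalog.ods.stock_trade;

-- Streaming: the same table read incrementally; the job keeps following
-- newly committed snapshots.
SELECT * FROM iceberg_catalog.ods.stock_trade
/*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */;
```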

4. Stream-batch integration empowers enterprises

As the big data field has developed, enterprises' business demands have shifted from offline sufficiency to high real-time requirements. The data stack products have kept iterating through this process, improving the quality of computed results for enterprises, raising the efficiency of business R&D, and reducing maintenance costs.

1. Improve the quality of data calculation results

High-quality, high-accuracy data helps enterprises make sound decisions. The data stack's hybrid-architecture integrated warehouse unifies the computing engines and eliminates the problem of SQL logic that cannot be reused across two code bases on different engines, so data consistency and quality are guaranteed.

2. Improve business R&D efficiency

From development to launch, business developers need to write only one set of SQL tasks per business, switching flexibly between streaming and batch computation according to the business's latency requirements. Application developers likewise need to build only one set of SQL encapsulation logic per business.

3. Improve enterprise resource utilization and reduce maintenance costs

An enterprise's real-time and offline services can run on a single computing engine; there is no need to purchase separate high-spec hardware for different engines serving real-time and offline workloads. When the business changes, developers modify only the corresponding SQL task, without separately adapting real-time and offline tasks.

5. Future planning

Although FlinkX SQL has improved stream-batch computing capability to a degree, the practical performance of batch processing still needs work. Next, the data stack technical team will optimize operators and tasks at the Flink source-code level to raise batch computing efficiency and cut business time costs. It will also further unify metadata standards across data sources, so that the modules involved in enterprise data governance, such as the data dictionary, data lineage, data quality, and permission management, can respond quickly at the usage level, reducing enterprise management costs.

Through iteration, the data stack's stream-batch integrated architecture has realized the combination of a real-time data warehouse with OLAP scenarios: one set of code drives multiple computing modes, meeting enterprises' low-latency, high-timeliness, business-driven needs while reducing development, O&M, and labor costs. Of course, this is only the first step of the stream-batch exploration. The data stack technical team will continue digging at the data storage level, fusing the convenient management and high-quality data of warehouses with the explorability and flexibility of data lakes, completing the data stack's transformation from data warehouse to lakehouse, enabling data to be stored first and explored flexibly later, and going a step further at the data architecture level.
