Practical application of Apache Flink at E-surfing Payment

Abstract: This article summarizes the talk given by Yin Chunguang, senior big data engineer at E-surfing Payment, in the industry case track of Flink Forward Asia 2022. It is organized into five parts:

  1. Real-time business scenarios of the company

  2. Platform introduction

  3. Architecture practice

  4. Application scenarios

  5. Future outlook


1. Real-time business scenarios of the company

E-surfing E-Commerce Co., Ltd. (hereinafter "E-surfing Payment") is a member enterprise of China Telecom Group Co., Ltd. It is among the fourth batch of "dual pilot" enterprises of the SASAC "Double Hundred Reform" and the National Development and Reform Commission, and the only fintech company among these "dual pilot" enterprises. With the E-surfing Payment app as its carrier, the company provides livelihood payment, consumer shopping, financial management, and other services to 70 million monthly active users. Relying on blockchain, cloud computing, big data, artificial intelligence, and other technologies, it empowers more than 10 million offline merchant stores and more than 170 well-known online e-commerce companies.

The platform currently faces massive data volumes and highly concurrent data service requests. Its workloads are characterized not only by low-latency computation but also by business diversity and scenario complexity, spanning financial payment, consumer shopping, and livelihood payment.

2. Platform introduction

As shown in the figure above, the development of E-surfing Payment's stream computing has gone through four stages.

In the first stage, we developed siloed Spark Streaming jobs case by case to meet real-time data needs. As demand kept growing, this stage ran into low development efficiency, unwieldy manual management, complex operations and maintenance, and difficulty freeing up engineering manpower.

In the second stage, we built a real-time computing platform based on Spark Structured Streaming. Users could create real-time computing tasks through web-page configuration, which solved most of the development-efficiency problems, and automated task management on the platform improved operations efficiency. However, as the business developed, new problems appeared:

  1. The latency of Spark-based real-time computing tasks was relatively high.

  2. New requirements for real-time data integration emerged.

  3. Analysts needed to analyze business data in real time through SQL.

  4. Developers needed to submit complex computing tasks to the platform for hosting.

To support real-time computing tasks across these scenarios, we introduced Flink and unified the platform on it as the real-time computing engine.

In the third stage, with Flink SQL as the real-time computing engine, we built a platform supporting real-time computing tasks in multiple scenarios.

At present, in the fourth stage, we are exploring an integrated lakehouse to solve three problems: delays in warehousing real-time business data, frequent business table changes, and the heavy load that online extraction places on the business databases. After comparing the mainstream data lakes Iceberg, Hudi, and Delta Lake on functional completeness, fit with our business scenarios, and community activity, we chose to build the lakehouse on Flink CDC + Hudi.

The data development platform covers real-time task development for multiple scenarios, including the following task types:

  1. SQL tasks, with support for online SQL debugging and analysis; suited to analysts and data warehouse engineers.

  2. Real-time indicator tasks, where indicators configured on web pages are converted into Flink computing tasks; suited to business personnel.

  3. Real-time data integration tasks, which cover the data warehouse's real-time ingestion needs and technical staff's data synchronization needs.

  4. Jar tasks, which handle complex custom development requirements; suited to all big data developers.

The figure above is a flowchart of real-time indicator development; a sketch of the kind of SQL such a configuration might compile into follows the steps below.

  • The first step is to create a real-time indicator.

  • The second step is to configure the data sources; multi-level nested data and multiple data sources are supported.

  • The third step is to preprocess the data sources, including data conversion, adding new fields, and completing records with external data.

  • The fourth step is to define the result storage type.

  • The fifth step is to develop the indicators; a single real-time task can be configured with multiple indicators.

  • The sixth step is to configure each indicator's calculation logic, including the processing algorithm, filter conditions, and aggregation dimensions. A new indicator can reuse an existing indicator's configuration.

  • The seventh step, after all indicator configurations are complete, is debugging.

  • The eighth step, once the logic checks out, is to bring the task online.
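
As a concrete illustration, the sketch below shows what an indicator such as "per-minute payment amount by province" might compile into. This is a minimal, hedged example: the table, topic, and field names (pay_order, detail, province, amount) are invented for illustration, not the platform's actual generated code, and it assumes the Kafka connector is on the classpath.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IndicatorCompileSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Step 2: the configured data source -- a Kafka JSON topic with a
        // nested payload.
        tEnv.executeSql("CREATE TABLE pay_order (" +
                " order_id STRING," +
                " detail ROW<province STRING, amount DECIMAL(18,2)>," +
                " pay_time TIMESTAMP(3)," +
                " WATERMARK FOR pay_time AS pay_time - INTERVAL '5' SECOND" +
                ") WITH ('connector' = 'kafka', 'topic' = 'pay_order'," +
                " 'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json')");

        // Step 3: preprocessing -- flatten the nested fields (data conversion
        // and new derived fields would live in this view as well).
        tEnv.executeSql("CREATE TEMPORARY VIEW pay_flat AS " +
                "SELECT order_id, detail.province AS province," +
                " detail.amount AS amount, pay_time FROM pay_order");

        // Step 4: the result storage type (print here; an external store in
        // production).
        tEnv.executeSql("CREATE TABLE indicator_sink (" +
                " window_start TIMESTAMP(3), province STRING, pay_amount DECIMAL(18,2)" +
                ") WITH ('connector' = 'print')");

        // Steps 5-6: one indicator = processing algorithm (SUM) + filter
        // condition + aggregation dimensions, compiled to a windowed GROUP BY.
        tEnv.executeSql("INSERT INTO indicator_sink " +
                "SELECT window_start, province, CAST(SUM(amount) AS DECIMAL(18,2)) " +
                "FROM TABLE(TUMBLE(TABLE pay_flat, DESCRIPTOR(pay_time), INTERVAL '1' MINUTES)) " +
                "WHERE amount > 0 " +
                "GROUP BY window_start, window_end, province");
    }
}
```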

The figures above illustrate task development on the real-time data development platform, taking real-time indicator tasks as the example. The four screenshots show creating a real-time indicator task, configuring and preprocessing the data source, configuring a new indicator, and configuring multiple indicators within one task.

3. Architecture practice

The initial architecture of the real-time data development platform is shown in the figure. The core computation was implemented with Spark Structured Streaming plus custom state, caching intermediate results to a dedicated intermediate store, similar in role to Flink's checkpointed state. Result data was written to middleware and HBase, providing real-time data analysis and decision-making for services such as real-time dashboards, smart credit, and real-time recommendation.

In the figure above, the left side shows the data sources and dimension tables, the middle the core of the computing engine, and the right side the intermediate-result and result-data storage.

The computing engine module in the middle has the indicator set on top and a DSL parser underneath. The parser analyzes each indicator's calculation logic and converts it into a custom algorithm template, which is then submitted as a Spark Structured Streaming task and runs on the cluster.

This V1 development engine architecture had the following pain points:

  1. New UDFs had to be developed continuously. All algorithms were custom-built, so many common SQL functions had to be implemented by hand, such as variance, deduplication, the 99th percentile, and custom encryption/decryption functions. These functions were not compatible with those of the Streaming SQL engine module, which therefore had to develop the same UDFs all over again.

  2. Intermediate results were stored in Redis, and the volume of cached data was large. Computing an aggregation indicator required reading data back from the cache, so overall computing performance was low.

  3. Redis Cluster could not balance efficiency against data safety. Under heavy reads and writes, guaranteeing that no data is lost meant enabling AOF, at a significant performance cost. There was also no state isolation between computing tasks, so the cache volume of one task could threaten the safety of other tasks' data.

  4. The indicator templates were built on a custom DSL; extending it to also support full SQL, data integration, and similar tasks would have been difficult. Building separate engines for those tasks instead would have meant maintaining multiple codebases, with high development and maintenance costs.

Therefore, facing the demands of different real-time computing task scenarios, we upgraded the architecture of the platform's development engine module.

The upgraded architecture uses Flink directly as the real-time computing engine, unifying the real-time computing architecture. The real-time data development module is divided into real-time indicators, Streaming SQL, and real-time data integration. Relying on Flink State, we removed the intermediate-result module, which improved the stability and performance of computing tasks.

We unified real-time data integration, real-time SQL analysis, and real-time indicator computation on Flink SQL. Implementing all computing tasks with Flink SQL lowers the maintenance burden of the development engine module and enables module reuse. SQL parsing identifies and loads the required UDFs, the Flink streaming graph is generated, and the job pipeline is submitted to the computing cluster. After the upgrade, the platform's code is easier to maintain, and the management and monitoring of real-time computing tasks are unified.
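
To make this submission path concrete, here is a minimal sketch using only public Table API calls: the UDF found during SQL parsing is registered, all DML statements of the task are batched into one StatementSet, and execute() compiles them into a single pipeline and submits it. The mask UDF and the table definitions are illustrative stand-ins, not the platform's actual code; datagen and print are Flink's built-in connectors, so this runs as-is.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

public class UnifiedSqlSubmitSketch {

    // Stand-in for a platform UDF loaded from the UDF repository.
    public static class MaskFn extends ScalarFunction {
        public String eval(String s) {
            return s == null ? null : s.replaceAll(".", "*");
        }
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // 1. Register the UDFs that SQL parsing found in the user's statements.
        tEnv.createTemporarySystemFunction("mask", MaskFn.class);

        // Illustrative source and sinks.
        tEnv.executeSql("CREATE TABLE src (user_id STRING, province STRING," +
                " amount DOUBLE) WITH ('connector' = 'datagen'," +
                " 'rows-per-second' = '5')");
        tEnv.executeSql("CREATE TABLE sink_a (user_id STRING, amount DOUBLE)" +
                " WITH ('connector' = 'print')");
        tEnv.executeSql("CREATE TABLE sink_b (province STRING, cnt BIGINT)" +
                " WITH ('connector' = 'print')");

        // 2. Batch every DML statement of the task into one StatementSet so
        //    Flink compiles them into a single streaming graph.
        StatementSet stmts = tEnv.createStatementSet();
        stmts.addInsertSql("INSERT INTO sink_a SELECT mask(user_id), amount FROM src");
        stmts.addInsertSql("INSERT INTO sink_b" +
                " SELECT province, COUNT(*) FROM src GROUP BY province");

        // 3. execute() translates the set into one job pipeline and submits it.
        stmts.execute();
    }
}
```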

Around the real-time computing engine module, we have done the following work:

  • Metadata management: parse Flink SQL with Calcite to derive data lineage (a minimal parsing sketch follows this list).

  • Task sandbox testing: lets users debug tasks online and inspect the intermediate result of each SQL step.

  • UDF management: including UDF permission management. When users submit SQL, the referenced UDFs are extracted through SQL parsing and checked against permissions, mainly to control access to encryption-related UDFs.

  • Fine-grained resource configuration: parse the Flink SQL into a streaming graph and set parallelism per operator, solving the problem that Flink SQL otherwise only supports one global parallelism.

  • Task status monitoring: monitor task status through Flink metrics. For example, if the data processing speed drops below a threshold while the Kafka backlog exceeds a threshold, we judge that the task's data processing has a problem.

  • Automatic recovery: automatic task management handled at the application layer.
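
The lineage bullet above can be illustrated with a minimal Calcite sketch. It handles only plain SELECT/JOIN shapes (no subqueries, UNNEST, or table functions), uses Calcite's default parser configuration rather than Flink's dialect, and is an assumption-laden illustration rather than the platform's production parser.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.calcite.sql.SqlBasicCall;
import org.apache.calcite.sql.SqlIdentifier;
import org.apache.calcite.sql.SqlJoin;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.SqlSelect;
import org.apache.calcite.sql.parser.SqlParser;

public class LineageSketch {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT o.order_id, u.province FROM orders AS o "
                + "JOIN users AS u ON o.user_id = u.user_id";
        SqlNode query = SqlParser.create(sql).parseQuery();

        List<String> upstream = new ArrayList<>();
        collectTables(((SqlSelect) query).getFrom(), upstream);
        // Prints [ORDERS, USERS] (Calcite upper-cases unquoted identifiers):
        // the upstream tables of this query, i.e. its lineage.
        System.out.println(upstream);
    }

    // Walk the FROM clause: bare identifiers are source tables.
    static void collectTables(SqlNode from, List<String> out) {
        if (from instanceof SqlIdentifier) {
            out.add(from.toString());
        } else if (from instanceof SqlJoin) {
            collectTables(((SqlJoin) from).getLeft(), out);
            collectTables(((SqlJoin) from).getRight(), out);
        } else if (from instanceof SqlBasicCall) { // e.g. "orders AS o"
            collectTables(((SqlBasicCall) from).operand(0), out);
        }
    }
}
```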

We also ran into several problems in practice. First, multiple SQL statements within one task could not be isolated. A real-time computing task usually carries multiple indicators; users want those indicators to go online and offline together, yet remain isolated from one another.

Currently, Flink SQL cannot isolate state across multiple SQL statements, so a change to one SQL statement breaks the others' recovery from a savepoint: the task starts normally, but their state data is lost.

To solve this, our approach is to isolate the state of each SQL statement within a Flink SQL task, so that all of them recover normally.

The figure above shows our Flink-based multi-SQL isolation design. The SQL parser produces the streaming graph, and each operator's UID is set to a computed hash: a digest of the SQL combined with the operator's position in the graph, determined from information such as operator type and upstream dependencies. This yields state isolation between the graphs of different SQL statements.

The figure above shows simplified schematic code: the SQL parser extracts the DML statements from the submitted SQLs, converts them into a streaming graph via Flink's execution plan, and finally sets the UID of every node in the graph to achieve isolation.
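
A hedged sketch of the idea follows. It relies on Flink's internal StreamGraph/StreamNode API (non-public and version-dependent; the method names exist in recent 1.x releases but may change), and it omits how the StreamGraph is obtained from the SQL planner. The "position" encoding is simplified relative to the design described above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import org.apache.flink.streaming.api.graph.StreamGraph;
import org.apache.flink.streaming.api.graph.StreamNode;

public class SqlStateIsolationSketch {

    // Give every operator a deterministic uid derived from its own SQL's
    // digest plus its position in the graph, so the operators of one SQL keep
    // stable state handles even when sibling SQLs in the same task change.
    public static void isolate(StreamGraph graph, String sqlDigest) throws Exception {
        for (StreamNode node : graph.getStreamNodes()) {
            // "Position" is simplified here to operator name + input count; a
            // real implementation would also encode the upstream chain.
            String position = node.getOperatorName() + "#" + node.getInEdges().size();
            node.setTransformationUID(md5(sqlDigest + "|" + position));
        }
    }

    private static String md5(String s) throws Exception {
        StringBuilder hex = new StringBuilder();
        for (byte b : MessageDigest.getInstance("MD5")
                .digest(s.getBytes(StandardCharsets.UTF_8))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```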

Second, parallelism adjustment for Flink SQL tasks. After parsing the Flink SQL into a streaming graph, we can set the parallelism of individual operators. Based on Flink metrics, rules combining operator processing speed, backpressure, input buffer usage ratio, and so on prompt operations staff to scale up specific nodes.

Which operators become busy differs by task type, so the rules must be defined per SQL scenario. For example, a Kafka source node usually has high throughput, and since its effective parallelism is capped by the number of Kafka partitions, there is no need to give the source a large parallelism. In a scenario that computes daily incremental monthly-active users by province, pressure typically appears on the GroupAggregate node, so it is that node's parallelism that should be adjusted, rather than that of the source or rank nodes.
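
Under the same internal-API assumption as the previous sketch, per-operator parallelism adjustment might look like the following; matching on the operator name is a simplification of how the platform locates the GroupAggregate node.

```java
import org.apache.flink.streaming.api.graph.StreamGraph;
import org.apache.flink.streaming.api.graph.StreamNode;

public class ParallelismTuningSketch {

    // Raise parallelism only on the busy aggregation operators; the source is
    // left alone because its effective parallelism is capped by the number of
    // Kafka partitions.
    public static void scaleGroupAggregate(StreamGraph graph, int parallelism) {
        for (StreamNode node : graph.getStreamNodes()) {
            if (node.getOperatorName().contains("GroupAggregate")) {
                node.setParallelism(parallelism);
            }
        }
    }
}
```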

Third, debugging practice for Flink SQL tasks. Business users want to debug SQL logic on the platform, both to ease development and to locate task exceptions; in production they also want to investigate questions such as why records failed to join or why a count does not match expectations. Platform engineers want to know which operator each SQL filter or group maps to, and whether the corresponding data exists in the running state. Facing these demands, we compared three solutions.

  • Solution 1: based on MiniCluster, replace the application's result table. The advantage is that the implementation is simple; the disadvantages are that it cannot cover debugging in production, and submitted tasks consume heavy resources on the server.

  • Solution 2: submit a real computing task to the cluster, rewriting each logical query to add an extra result table so that users can see the intermediate data. The advantage is that it covers debugging in the development environment and some production scenarios; the disadvantage is that, for now, it covers only some scenarios.

  • Solution 3: mirror the task and instrument it with low-level APIs such as ProcessFunction. The advantage is fine-grained troubleshooting; the disadvantages are a very complex implementation and extra resource consumption.

Comparing the three solutions above, we finally chose Solution 2 for task debugging.
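
A minimal sketch of the Solution 2 approach: the user's query is split into named views, and in debug mode an extra print sink is attached per view so the submitted job also emits each step's intermediate rows. All names are illustrative; datagen and print are built-in connectors, so this runs as-is.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class SqlDebugSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        tEnv.executeSql("CREATE TABLE src (user_id STRING, amount DOUBLE)" +
                " WITH ('connector' = 'datagen', 'rows-per-second' = '2')");

        // The user's original logic, split into named intermediate views.
        tEnv.executeSql("CREATE TEMPORARY VIEW filtered AS" +
                " SELECT user_id, amount FROM src WHERE amount > 0");

        // Debug mode: attach an extra result table per intermediate view so
        // the submitted job also emits each step's intermediate rows.
        tEnv.executeSql("CREATE TABLE debug_filtered (user_id STRING," +
                " amount DOUBLE) WITH ('connector' = 'print')");

        StatementSet stmts = tEnv.createStatementSet();
        stmts.addInsertSql("INSERT INTO debug_filtered SELECT * FROM filtered");
        // ... the original "INSERT INTO <real sink> ..." is added here as well,
        // so the debug run computes exactly what the production run computes.
        stmts.execute();
    }
}
```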

Fourth, optimization of task monitoring and alerting.

  1. For real-time data, we added quality monitoring, which detects data anomalies and alerts on them promptly.

  2. Real-time tasks are managed in tiers: tasks for critical business lines such as risk control and marketing receive manual intervention, while report-style tasks are handled uniformly.

  3. Task status is monitored through combinations of metrics; for example, if Flink's processing speed falls below a threshold while the Kafka backlog exceeds a threshold, an alert is raised (see the sketch after this list).

  4. Task parallelism and memory are also tuned based on metric combinations.
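
As an illustration of the combined rule in item 3, the sketch below polls Flink's REST metrics API for a vertex's aggregated numRecordsInPerSecond and combines it with an externally supplied Kafka lag figure. The JSON handling is deliberately crude, and the thresholds and lag source are placeholders, not the platform's actual monitoring code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TaskHealthRuleSketch {

    // Combined rule: alert when processing speed is below a threshold AND the
    // Kafka backlog is above a threshold. The endpoint is Flink's standard
    // REST metrics API; kafkaLag is assumed to be computed elsewhere (e.g.
    // from consumer-group offsets).
    public static boolean unhealthy(String flinkRestUrl, String jobId, String vertexId,
                                    double minRate, long kafkaLag, long maxLag)
            throws Exception {
        String url = flinkRestUrl + "/jobs/" + jobId + "/vertices/" + vertexId
                + "/subtasks/metrics?get=numRecordsInPerSecond";
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // Crude extraction of the aggregated "sum" field; a real
        // implementation would use a JSON library.
        double rate = Double.parseDouble(resp.body()
                .replaceAll(".*\"sum\"\\s*:\\s*([0-9.Ee+-]+).*", "$1"));
        return rate < minRate && kafkaLag > maxLag;
    }
}
```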

4. Application scenarios

The figure above shows a real-time dashboard scenario. Business data is extracted in real time through Flink CDC and behavioral data through Flink SQL; the two are joined in computation, with HBase dimension tables completing the records. The final results are stored in ClickHouse and served to the BI platform and dashboards for real-time analysis.
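
A hedged Flink SQL sketch of such a pipeline follows. All connector options and names are illustrative; it assumes the Kafka, HBase, and JDBC connectors plus a ClickHouse JDBC driver are on the classpath (the talk does not specify which ClickHouse sink is used).

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DashboardPipelineSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Behavioral/business stream from Kafka, with a processing-time
        // attribute for the lookup join.
        tEnv.executeSql("CREATE TABLE orders (" +
                " order_id STRING, province_id STRING, amount DECIMAL(18,2)," +
                " proc_time AS PROCTIME()" +
                ") WITH ('connector' = 'kafka', 'topic' = 'orders'," +
                " 'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json')");

        // HBase dimension table used to complete (enrich) each record.
        tEnv.executeSql("CREATE TABLE dim_province (" +
                " rowkey STRING, f ROW<province_name STRING>," +
                " PRIMARY KEY (rowkey) NOT ENFORCED" +
                ") WITH ('connector' = 'hbase-2.2', 'table-name' = 'dim:province'," +
                " 'zookeeper.quorum' = 'zk:2181')");

        // ClickHouse sink, reached here through the JDBC connector.
        tEnv.executeSql("CREATE TABLE ck_order_wide (" +
                " order_id STRING, province_name STRING, amount DECIMAL(18,2)" +
                ") WITH ('connector' = 'jdbc'," +
                " 'url' = 'jdbc:clickhouse://ck:8123/rt', 'table-name' = 'order_wide')");

        // Lookup join: enrich each order with the dimension row as of
        // processing time, then write the widened record to ClickHouse.
        tEnv.executeSql("INSERT INTO ck_order_wide " +
                "SELECT o.order_id, d.f.province_name, o.amount " +
                "FROM orders AS o " +
                "LEFT JOIN dim_province FOR SYSTEM_TIME AS OF o.proc_time AS d " +
                "ON o.province_id = d.rowkey");
    }
}
```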

In this scenario, logs and business data must land in storage quickly. Writing real-time logs to the cluster produces a small-file problem, the business tables change frequently, and batch extraction puts heavy pressure on the source databases. We therefore connect MySQL to Kafka through Flink CDC and write the data into the ODS layer through Flink and Hudi. In this link MySQL is not written to the warehouse directly, mainly so that the real-time business data can be reused; for now Hudi is applied only at the ODS layer, and the overall lakehouse construction will be extended according to business scenarios later.
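
A minimal sketch of this link under stated assumptions: the mysql-cdc, upsert-kafka, and hudi connectors are on the classpath, and the two INSERTs, shown in one program for brevity, would run as separate tasks (MySQL to Kafka for reuse, then Kafka to the Hudi ODS layer) as the text describes. All names and options are illustrative.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcToHudiSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Incremental capture from the business MySQL via the mysql-cdc
        // connector; connection details are placeholders.
        tEnv.executeSql("CREATE TABLE mysql_order (" +
                " order_id STRING, user_id STRING, amount DECIMAL(18,2)," +
                " update_time TIMESTAMP(3), PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH ('connector' = 'mysql-cdc', 'hostname' = 'mysql'," +
                " 'port' = '3306', 'username' = 'cdc', 'password' = '***'," +
                " 'database-name' = 'pay', 'table-name' = 'orders')");

        // Changelog relay topic so other real-time jobs can reuse the business
        // data; upsert-kafka preserves the CDC upsert semantics.
        tEnv.executeSql("CREATE TABLE kafka_order (" +
                " order_id STRING, user_id STRING, amount DECIMAL(18,2)," +
                " update_time TIMESTAMP(3), PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH ('connector' = 'upsert-kafka', 'topic' = 'ods_order'," +
                " 'properties.bootstrap.servers' = 'kafka:9092'," +
                " 'key.format' = 'json', 'value.format' = 'json')");

        // Hudi ODS table; MERGE_ON_READ absorbs frequent upserts and avoids
        // the small-file problem of writing change logs directly.
        tEnv.executeSql("CREATE TABLE ods_order (" +
                " order_id STRING, user_id STRING, amount DECIMAL(18,2)," +
                " update_time TIMESTAMP(3), PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH ('connector' = 'hudi', 'path' = 'hdfs:///lake/ods/ods_order'," +
                " 'table.type' = 'MERGE_ON_READ')");

        tEnv.executeSql("INSERT INTO kafka_order SELECT * FROM mysql_order");
        tEnv.executeSql("INSERT INTO ods_order SELECT * FROM kafka_order");
    }
}
```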

On the platform, real-time synchronization tasks are configured from database table metadata. Business databases such as OceanBase, MySQL, and Oracle are collected incrementally through Flink CDC, while Kafka data is extracted incrementally with Flink SQL; both are synchronized in real time to components such as Kafka, Hudi, and ClickHouse. This greatly improves the development and management efficiency of real-time data integration tasks.

5. Future outlook

The future plan of our platform is as follows:

  • First, Flink containerization, mainly to address daytime traffic surges versus low nighttime traffic: dynamic scale-out and scale-in through Kubernetes combined with containerized Flink.

  • Second, batch-stream unification, to align computing semantics and metadata management across real-time and offline processing.

  • Third, improving the integrated lakehouse to meet business users' demands for both data warehouse analysis and real-time data analysis.
