Exploration and Practice of Flink CDC in JD.com

Abstract: This article is compiled from a talk given by Han Fei, a senior technical expert at JD.com, in the data integration session of Flink Forward Asia 2022. The content is divided into four parts:

  1. Introduction to JD's self-developed CDC
  2. Flink CDC optimizations for JD scenarios
  3. Business case
  4. Future plans


1. Introduction to JD’s self-developed CDC

JD's self-developed CDC, code-named Fregata, is the underlying framework we built for real-time data collection and distribution. Fregata is the genus of the frigate bird, one of the fastest-flying birds in the world, which maintains good flight ability and maneuverability even in bad weather, reflecting the efficiency and stability we aim for in our real-time collection and distribution services.

At present, Fregata is the unified entry point for real-time data collection and distribution in JD Group's data middle platform, serving BGs and BUs such as JD Retail, Logistics, Technology, Health, and Industrials, and covering core businesses such as order transactions, the Golden Eye business intelligence system, real-time risk control, JD Baitiao, and real-time dashboards.

Fregata currently runs more than 20,000 stable online tasks. During major promotions, peak throughput reaches 6.41 billion records/min across collection and distribution combined, corresponding to a peak transmitted data volume of 8.3 TB/min.

The collection capacity for a single database instance exceeds 5 million records/min, far exceeding the speed of the database's own master-slave replication.

Fregata tasks currently use more than 60,000 CPU cores and more than 180,000 GB of memory in total.

Fregata tasks are deployed and run in containers on JD's JDOS platform, with support for deployment across data centers. Tasks are currently distributed mainly across the Huitian and Langfang data centers, which serve as active and standby for each other.

For disaster recovery, Fregata supports one-click failover of tasks. In the event of a large-scale data center failure or network outage, tasks can be quickly switched to the standby data center, ensuring rapid recovery and stable operation.

The left side of the figure above mainly shows the overall architecture of Fregata.

First of all, Fregata is divided by function into real-time collection and real-time distribution. Real-time collection is based on the principle of database master-slave replication: Binlog data is captured in real time, parsed, packaged in a certain format, and sent to JDQ (JD's message queue) for downstream businesses to consume in real time. The currently supported source database types include physical MySQL, JD's self-developed elastic database JED, JD Cloud RDS, JD Digits CDS, and Oracle, where Oracle uses LogMiner to collect database logs in real time.

The real-time distribution part synchronizes data in JDQ, in multiple formats, to different target storages in real time. The currently supported message formats are CSV, JSON, ProtoBuf, XML, Avro, etc. The currently supported targets are HDFS/Hive (corresponding to the offline data warehouse), the OLAP engines Doris and ClickHouse, the message queue JDQ, Elasticsearch, and the data lake storage Iceberg. The supported sources and targets will continue to be enriched according to actual needs.

The separation of collection and distribution in Fregata's design is mainly based on the idea of "collect once, distribute many times". To support consumption and short-term data replay, data in JDQ is generally retained for 7 days.

The right side of the figure above shows the design of the Fregata engine. The engine is divided into three layers of operators, namely Source, Parse, and Sink, and the layers are linked through RingBuffers (we use the LMAX Disruptor); a minimal sketch follows the operator list below.

  • The Source operator pulls source data and pushes it to the RingBuffer according to the data source type.
  • The Parse operator pulls data from the RingBuffer, parses and assembles it, performs some ETL processing, and then sends it to the downstream RingBuffer.
  • The Sink operator pulls data from the downstream RingBuffer, assembles it into the format required by the target data source, and sends it to the different target storages.
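Here is a minimal sketch, assuming the LMAX Disruptor library, of chaining Source, Parse, and Sink stages through RingBuffers. It only illustrates the pattern described above and is not Fregata's actual code; everything beyond the Disruptor API (class names, helper methods) is made up.

```java
// A minimal sketch (not Fregata's actual code) of a Source -> Parse -> Sink pipeline
// chained through LMAX Disruptor RingBuffers, as described above.
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class PipelineSketch {

    // Event wrapper carried on the RingBuffer; payload is a raw or parsed record.
    static class RecordEvent {
        Object payload;
    }

    public static void main(String[] args) {
        // RingBuffer between the Source and Parse stages.
        Disruptor<RecordEvent> sourceToParse =
                new Disruptor<>(RecordEvent::new, 1024, DaemonThreadFactory.INSTANCE);
        // RingBuffer between the Parse and Sink stages.
        Disruptor<RecordEvent> parseToSink =
                new Disruptor<>(RecordEvent::new, 1024, DaemonThreadFactory.INSTANCE);

        // Parse stage: pull from the upstream buffer, do ETL, publish downstream.
        sourceToParse.handleEventsWith((event, sequence, endOfBatch) ->
                parseToSink.publishEvent((out, seq) -> out.payload = parse(event.payload)));

        // Sink stage: pull from the downstream buffer and write to the target storage.
        parseToSink.handleEventsWith((event, sequence, endOfBatch) ->
                writeToTarget(event.payload));

        parseToSink.start();
        sourceToParse.start();

        // Source stage: in Fregata this is a Binlog connector; here a placeholder record.
        sourceToParse.getRingBuffer().publishEvent((out, seq) -> out.payload = "binlog-event");
    }

    static Object parse(Object raw) { return raw; }              // placeholder ETL
    static void writeToTarget(Object record) { /* e.g. send to JDQ or a target store */ }
}
```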

In addition, there is a BarrierService that periodically generates Barriers; the entire task's state is committed and recorded through the BarrierService, on a principle similar to Flink's checkpoint mechanism. The BarrierService periodically generates a Barrier and passes it to the Source operator. The Source operator broadcasts it to the downstream Parse operator, which in turn broadcasts it to all Sink operators. Once a Sink operator has received the Barrier from all of its inputs, it sends an ack to the BarrierService, and the BarrierService then performs a round of state commits, such as committing the consumption offset and the recorded Binlog position.
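The following is a simplified sketch of that barrier/ack protocol: emit a barrier, wait for one ack per Sink, then commit state. Class and method names are illustrative, not Fregata's implementation.

```java
// A simplified sketch of the barrier/ack protocol described above: a barrier is emitted
// periodically, forwarded Source -> Parse -> Sinks, and state is committed only after
// every Sink has acknowledged it. Names are illustrative.
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierServiceSketch {

    static class Barrier {
        final long id;
        Barrier(long id) { this.id = id; }
    }

    private final int numSinks;
    private final AtomicInteger acks = new AtomicInteger();

    BarrierServiceSketch(int numSinks) { this.numSinks = numSinks; }

    // Called on a timer: inject a new barrier, which the Source broadcasts downstream.
    Barrier emitBarrier(long id) {
        acks.set(0);
        return new Barrier(id);
    }

    // Each Sink calls this once it has received the barrier from all of its inputs.
    void ack(Barrier barrier) {
        if (acks.incrementAndGet() == numSinks) {
            commitState(barrier);
        }
    }

    // Commit the task's state, e.g. the consumed offset and the recorded Binlog position.
    void commitState(Barrier barrier) {
        System.out.println("commit state for barrier " + barrier.id);
    }
}
```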

Now let's look at Fregata's technical characteristics, starting with Binlog position tracking.

The right side of the figure above introduces the whole process of starting and running a real-time collection task. The position service records the Binlog position information the task last consumed, mainly including the Binlog file name, the consumed offset within that file, the server id of the database instance, the transaction time corresponding to the Binlog position, and GTID information.

When a collection task starts, it obtains the last recorded Binlog position from the position service and passes the recorded BinlogPosition or GTID information to the Binlog Connector. The Binlog Connector generates a dump command based on that BinlogPosition or GTID and sends it to the database instance; the instance then pushes Binlog logs to the Binlog Connector, which deserializes the received logs, encapsulates them into Binlog Events, and passes them to Fregata. Fregata processes the Binlog Events and sends them to JDQ.

Since MySQL only supports GTID from version 5.6.5 onward, and some of JD's existing online databases run lower versions, Fregata supports both the BinlogPosition and GTID modes. It also supports flexibly configuring the consumption mode to start from a specified point in time, the latest position, the start position, or a specified Binlog position.

In addition, when an upstream database is upgraded to a higher version with GTID enabled, the collection task may need to switch from BinlogPosition mode to GTID mode. Fregata therefore also supports automatically switching a task's position mode between BinlogPosition and GTID, and guarantees that no data is lost or duplicated during the switch.

The switching process is shown in the lower left corner of the figure above. First, the task restarts in BinlogPosition mode and, during the restart, queries and caches the set of GTID transactions that have already been executed. The task then continues to process GTID events from the Binlog in BinlogPosition mode, checking whether each consumed GTID is contained in the cached set. Once a GTID is not in the cache, it means consumption has caught up, and the task switches its position recording directly to GTID mode.
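A hedged sketch of this switching logic follows, assuming the executed-GTID set can be snapshotted at restart; the class is illustrative and simplified (a real gtid_executed set is a set of interval ranges, not individual GTIDs).

```java
// A hedged sketch of the BinlogPosition -> GTID switching logic described above.
// Names are illustrative; the gtid_executed snapshot is simplified to a set of strings.
import java.util.Set;

public class PositionModeSwitcher {

    // Snapshot of the GTID transactions already executed, cached when the task restarts.
    private final Set<String> gtidsExecutedAtRestart;
    private boolean gtidMode = false;

    PositionModeSwitcher(Set<String> gtidsExecutedAtRestart) {
        this.gtidsExecutedAtRestart = gtidsExecutedAtRestart;
    }

    // Invoked for every GTID event while still consuming in BinlogPosition mode.
    void onGtidEvent(String gtid) {
        if (gtidMode) {
            return; // already switched; positions are recorded as GTIDs from here on
        }
        if (!gtidsExecutedAtRestart.contains(gtid)) {
            // The current GTID is not in the cached set, so consumption has caught up
            // and it is safe to switch position recording to GTID mode without
            // losing or duplicating data.
            gtidMode = true;
        }
    }

    boolean isGtidMode() { return gtidMode; }
}
```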

Next, let's introduce Fregata's dynamic-awareness capabilities. Fregata's real-time collection tasks are configured with a database domain name. When an online database fails or is taken offline, the database instance behind the domain name may need to be changed; Fregata can sense this change and switch automatically.

Because the Binlog files of the instances before and after the switch are generally not consistent, if the task records positions in BinlogPosition mode, it needs to automatically perform a Binlog alignment after the switch to ensure data integrity (GTID mode does not have this problem).

The entire switching process is shown on the right side of the figure above. In BinlogPosition mode, the task queries all Binlog files on the new instance, traverses them in reverse order, and locates the corresponding position according to the timestamp recorded by the position service; it then resumes consumption from that position. Searching in reverse order is mainly aimed at online instance-switch scenarios, where it is more efficient: usually only the Binlog from the last 1-2 minutes needs to be searched.
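The reverse-order search could look roughly like the sketch below. The helper methods (listing Binlog files, reading a file's first event timestamp, seeking within a file) are hypothetical placeholders, e.g. backed by "SHOW BINARY LOGS" and a Binlog reader, not real APIs.

```java
// An illustrative sketch of the reverse-order Binlog search; helper methods are hypothetical.
import java.util.List;

public class BinlogAlignment {

    static class StartPosition {
        final String file;
        final long position;
        StartPosition(String file, long position) { this.file = file; this.position = position; }
    }

    StartPosition findStartPosition(long recordedTimestampMillis) {
        // Binlog files on the new instance, newest first.
        List<String> files = listBinlogFilesNewestFirst();
        for (String file : files) {
            // Stop at the first file whose earliest event is older than the recorded time;
            // usually this only needs to look back 1-2 minutes of Binlog.
            if (firstEventTimestamp(file) <= recordedTimestampMillis) {
                long pos = seekToTimestamp(file, recordedTimestampMillis);
                return new StartPosition(file, pos);
            }
        }
        // Fall back to the oldest available file; offset 4 is right after the Binlog magic header.
        String oldest = files.get(files.size() - 1);
        return new StartPosition(oldest, 4L);
    }

    List<String> listBinlogFilesNewestFirst()        { throw new UnsupportedOperationException("sketch"); }
    long firstEventTimestamp(String file)            { throw new UnsupportedOperationException("sketch"); }
    long seekToTimestamp(String file, long tsMillis) { throw new UnsupportedOperationException("sketch"); }
}
```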

Fregata's dynamic awareness also covers DDL changes. Fregata can recognize DDL operations in the database and adapt automatically. The DDL change types currently supported include adding and deleting fields, modifying field types, adjusting field order, and so on.

Since downstream business teams also care about DDL operations on the database, when Fregata recognizes a DDL operation it automatically notifies administrators and users by email or voice call.

Fregata also has some data processing and enrichment capabilities.

While collecting Binlog, Fregata adds a unique version number called Mid (message id) to each record. Downstream users can use this version number to deduplicate records or determine the latest change record. For example, when incremental data is distributed to Hive or other storage without primary-key constraints, users can use Mid to determine which of multiple operation records for the same primary key is the latest change.
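As an illustration of how a downstream user might apply Mid, the following hedged example keeps, for each primary key, only the record with the largest Mid. Table and column names (ods_orders, order_id, order_status, mid) are made up, and the table is assumed to be registered already, for example through a Hive catalog.

```java
// A hedged example of deduplicating by Mid downstream; names are illustrative.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DedupByMid {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inBatchMode().build());

        // For each primary key, keep only the change record with the largest Mid,
        // i.e. the latest change distributed by Fregata.
        tEnv.executeSql(
            "SELECT order_id, order_status, mid FROM (" +
            "  SELECT order_id, order_status, mid," +
            "         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY mid DESC) AS rn" +
            "  FROM ods_orders" +
            ") WHERE rn = 1").print();
    }
}
```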

In addition, Fregata encapsulates metadata such as the database, table, and database instance into each message body, so that downstream businesses with related needs can use it to determine the source of the data.

In terms of data processing, the collection process also supports applying various functions to the data, such as encryption and decryption of sensitive fields, type conversion, and time conversion.

In terms of deployment, if the upstream business database is sharded into multiple databases and tables across multiple instances, Fregata starts one collection task per database instance, with a one-to-one correspondence between tasks and instances.

The advantage is that tasks are independent of each other and resources are isolated, so a change to one database instance does not affect the collection tasks of other instances. The disadvantage is that when the number of instances is large, configuration and maintenance costs are somewhat higher; on the configuration side, we address this through the productized workflow, so the configuration only needs to be done once.

In terms of alarms, Fregata supports task-liveness alarms: if a task dies abnormally, operations staff receive voice or email alerts. At the same time, collection tasks report monitoring metrics at minute granularity, such as collection delay, database master-slave delay, and whether zero records were extracted, so users can observe the task's running status.

In terms of full and incremental data support, Fregata currently only supports extracting incremental data; extracting full data depends on how long the Binlog is retained.

In other words, if all Binlog data is retained, all data can be extracted; otherwise only the retained Binlog data can be extracted, and earlier historical data must be supplemented through offline extraction.

2. Flink CDC optimizations for JD scenarios

That concludes the introduction to Fregata. Generally speaking, our use of Flink CDC is still at a relatively early, multi-faceted verification stage. For JD's internal scenarios, we have added some features to Flink CDC to meet our actual needs. Let's take a look at these optimizations.

In practice, some business teams want to backtrack historical data from a specified time, which is one type of requirement; there is also a scenario where, once the original Binlog files have all been cleaned up, the task needs to be reset to a newly generated Binlog file.

For the above scenarios, we extended three startup modes for the Binlog phase, earliest-offset, timestamp, and specific-offset, by reusing the scan.startup.mode parameter.

In specific-offset mode, you set scan.startup.specific-offset.file to specify the Binlog file name and scan.startup.specific-offset.pos to specify a position within that file; these two parameters determine the starting position for the incremental stage. In earliest-offset mode, the earliest available Binlog file is read by default. In timestamp mode, a time parameter scan.startup.timestamp-millis needs to be set.
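A sketch of how these startup options might appear in a MySQL CDC source DDL follows; the scan.startup.* option names follow the text above, while the host, credentials, database, and table are placeholders.

```java
// A sketch of a mysql-cdc source DDL using the extended startup modes described above.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StartupModeExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        tEnv.executeSql(
            "CREATE TABLE orders_cdc (" +
            "  order_id BIGINT," +
            "  order_status STRING," +
            "  PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'mysql-cdc'," +
            "  'hostname' = 'mysql-host'," +
            "  'port' = '3306'," +
            "  'username' = 'user'," +
            "  'password' = '***'," +
            "  'database-name' = 'order_db'," +
            "  'table-name' = 'orders'," +
            // Resume the Binlog phase from an explicit file + position ...
            "  'scan.startup.mode' = 'specific-offset'," +
            "  'scan.startup.specific-offset.file' = 'mysql-bin.000003'," +
            "  'scan.startup.specific-offset.pos' = '4'" +
            // ... or instead use 'earliest-offset', or 'timestamp' together with
            // 'scan.startup.timestamp-millis' = '1667232000000'.
            ")");
    }
}
```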

As shown on the right side of the figure above, in timestamp startup mode the corresponding Binlog file and position are searched in reverse order according to the time specified by the user; under the hood this mode completely reuses the specific-offset approach.

Whichever mode is used, the correct starting Binlog offset is eventually constructed according to the startup mode, and the MySqlBinlogSplit is then built from it.

In production with lower MySQL versions, a database instance may be taken offline, or a replica may develop significant master-slave delay (requiring migration to another replica); in both scenarios an instance switch is generally performed. So how do we switch instances automatically, especially in BinlogPosition mode on lower MySQL versions where GTID is not available?

As shown on the right side of the figure above, we added a step that checks the database instance:

First, the MySQL-level server id is stored in MySqlBinlogSplit, and the handling of the MySqlBinlogSplit object during state saving and restoration is modified accordingly.

Then, the MySQL instance is queried to obtain its server id, which is compared with the server id stored in the MySqlBinlogSplit object.

If they are not consistent, an instance switch is considered to have occurred. In that case, a reverse-order search is performed on the new instance according to the timestamp of the consumed position saved in the Binlog offset, the starting Binlog offset is rebuilt, and the MySqlBinlogSplit is constructed from it.
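A simplified sketch of this check-and-realign step is shown below; the class and method names are illustrative and do not correspond to the actual Flink CDC internals.

```java
// An illustrative sketch of the server-id check and timestamp-based realignment.
public class ServerIdCheckSketch {

    static class RestoredBinlogSplit {
        long serverId;        // server id persisted in state alongside the offset
        String binlogFile;
        long binlogPos;
        long timestampMillis; // transaction time of the recorded offset
    }

    RestoredBinlogSplit checkAndRealign(RestoredBinlogSplit split, long currentServerId) {
        if (split.serverId == currentServerId) {
            return split; // same instance: resume from the recorded file and position
        }
        // The instance behind the domain name has changed, so the recorded file/position
        // is not valid on the new instance. Rebuild the starting offset by searching the
        // new instance's Binlog files in reverse order using the saved timestamp
        // (findOffsetByTimestamp is a hypothetical helper for that search).
        RestoredBinlogSplit realigned = findOffsetByTimestamp(split.timestampMillis);
        realigned.serverId = currentServerId;
        return realigned;
    }

    RestoredBinlogSplit findOffsetByTimestamp(long timestampMillis) {
        throw new UnsupportedOperationException("sketch");
    }
}
```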

Currently, Flink MySQL CDC exposes monitoring metrics for collection delay, sending delay, and idle time. In actual production, users have asked to also monitor the master-slave delay of the upstream database, and all metrics need visualization and abnormal-condition alarms.

Based on this, we first added a metric for database master-slave delay and connected all of these metrics to the Byzer monitoring system. As shown in the figure above, the overall process is as follows: the Flink JobManager and TaskManagers carry an agent when they start, and send monitoring data to the Byzer system through that agent.

Users can configure monitoring and alarm rules on the JRC platform (the real-time computing platform), and these rules are synchronized to the Byzer system. The JRC platform, in turn, pulls data from the Byzer monitoring system and displays it visually.

Finally, let's look at an application-oriented improvement. In actual business there are many sharded (sub-database, sub-table) scenarios, and online shards are basically distributed across multiple MySQL instances.

With the community version of Flink MySQL CDC, supporting multiple instances in one job requires the user to copy the DDL definition multiple times and modify the hostname configuration each time. With many instances this hurts both the user experience and the readability of the SQL.

To address this, we implemented multi-instance support in combination with the platform. The user's SQL is analyzed with Calcite to find the mysql-cdc DDL definition, and the hostname field is parsed to determine whether multiple instances (multiple hosts) are configured. If so, the definition is automatically split by instance, a table is created for each instance, and the tables are finally unioned into a view. As shown in the blue box in the figure, only one DDL definition is required.
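The following hedged illustration shows roughly what the expansion produces for two instances: one mysql-cdc table per instance plus a union view. Table names, hosts, and the schema are made up; the user only writes a single DDL listing multiple hosts, and the platform generates something along these lines.

```java
// A hedged illustration of the generated per-instance tables and the union view.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MultiInstanceExpansion {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // One CDC table per database instance.
        for (String host : new String[]{"jed-instance-1", "jed-instance-2"}) {
            tEnv.executeSql(
                "CREATE TABLE orders_" + host.replace('-', '_') + " (" +
                "  order_id BIGINT, order_status STRING, PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = '" + host + "'," +
                "  'port' = '3306', 'username' = 'user', 'password' = '***'," +
                "  'database-name' = 'order_db', 'table-name' = 'orders'" +
                ")");
        }

        // Union the per-instance tables into one view for the user's query logic.
        tEnv.executeSql(
            "CREATE VIEW orders_all AS " +
            "SELECT * FROM orders_jed_instance_1 " +
            "UNION ALL " +
            "SELECT * FROM orders_jed_instance_2");
    }
}
```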

In addition, we made an optimization for the scenario of collecting multiple instances and writing to a Sink with a primary key. After Flink MySQL CDC enters the Binlog stage, only the first subtask of the Source operator does the work, and a primary-key Sink causes the Flink engine to add a NotNullEnforcer operator that checks the NOT NULL fields of the data, after which records are hash-distributed to the SinkMaterializer operator and the subsequent Sink operator.

Since the connection between Source and NotNullEnforcer is forward, NotNullEnforcer also ends up with only one subtask processing data, which may not provide enough throughput in scenarios with many Sources.

To make full use of the NotNullEnforcer operator's parallelism, we added the parameter table.exec.sink.not-null-enforcer.hash, and added logic in CommonExecSink that uses this parameter to decide whether to accelerate the NotNullEnforcer operator. When acceleration is enabled, records are hashed by the primary key before reaching NotNullEnforcer, so the data is distributed across all of its subtasks.
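A minimal sketch of turning the optimization on is shown below. Note that table.exec.sink.not-null-enforcer.hash is the parameter added in JD's internal build as described above; it is not a community Flink option.

```java
// A minimal sketch of enabling the hash distribution before NotNullEnforcer.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class NotNullEnforcerHashExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hash-distribute records by primary key before the NotNullEnforcer operator,
        // so that all of its parallel subtasks are used instead of only the first one.
        // (JD-internal parameter as described in the text, not a community option.)
        tEnv.getConfig().getConfiguration()
            .setString("table.exec.sink.not-null-enforcer.hash", "true");
    }
}
```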

Let's take a look at the comparison before and after optimization.

As can be seen in the first figure (red box), only the first subtask of the NotNullEnforcer operator is processing data.

After the optimization, the second figure shows that all 10 parallel subtasks of the NotNullEnforcer operator are utilized, and the connection between the Source operator and the NotNullEnforcer operator is a hash distribution.

3. Business case

In this case, we combined Flink CDC, Flink's core computing capabilities, and the Hudi data lake to pilot a transformation of the technical architecture of one of our platform's business users, a business data system of JD Logistics.

This system is the real-time operation monitoring system for small and medium parcels in LDC, the logistics operation data center. It is used frequently across JD Logistics, by managers for decision-making and by front-line staff for fine-grained progress management.

It covers the three core operational links of logistics, namely pickup, sorting, and delivery, and drills down along different dimensions to provide monitoring and visualization of order volumes at each link.

The upstream is the elastic database JED, sharded into multiple databases and tables and distributed over multiple instances.

In the offline link (the upper part of the figure), data is first extracted to the BDM layer of the offline data warehouse through Plumber. Plumber is JD's basic service for offline heterogeneous data exchange; it extracts data from different data sources into the data warehouse and pushes computed results from the data warehouse out to different data sources.

After the data is extracted to the BDM layer, it goes through the zipper-table processing of the FDM layer and the processing of the subsequent layers. The business results are finally aggregated into the APP layer and then pushed to ES through Plumber; the product used by LDC users queries ES underneath. There is also another path: the OLAP engine StarRocks imports the APP-layer data and makes it available for user queries.

In the real-time link (the lower part of the figure), Fregata collects the database Binlog and sends it to JDQ, and Flink consumes the JDQ data and writes results back into JDQ. In this way a real-time data warehouse is built on JDQ, layered to mirror the offline data warehouse. The final results are synchronized from JDQ to ES and StarRocks by a synchronization tool called Syncer.

There is also another link in which the most upstream JDQ distributes data directly to the offline BDM layer through Fregata, building a quasi-real-time BDM table. Overall, this is a typical Lambda architecture.

There are several pain points in the current architecture:

  • The offline link has the problem of bumping against its SLA deadline: when upstream computing resources are congested or abnormal retries occur, data may not be delivered on time.
  • The storage cost of the ES servers is relatively high, roughly one million RMB per year.
  • The typical problems of a Lambda architecture: because the stream and batch links are separate, server resources cannot be reused, the technology stacks differ, development efficiency is low, and data calibers are inconsistent.

Since this business can accept minute-level end-to-end latency for its real-time data, we made some modifications to the data architecture.

First, based on our enhanced Flink CDC capabilities, a single Flink job collects the sharded JED data across the upstream multiple instances, with integrated full and incremental ingestion.

At the data processing level, combined with Flink SQL, users get a low-code development experience (drag-and-drop plus SQL), and the computed results are written into the Hudi data lake.

Then, based on Hudi's incremental read capability, the data is further processed through the logic of the FDM, GDM, APP, and other layers; the results are exposed through StarRocks external tables on Hudi and made available to end LDC users for queries. Through this transformation, an end-to-end quasi-real-time data link is constructed.
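As a rough sketch of the write side of this link (paths, table names, and connector options are illustrative, and the source is assumed to be the unioned CDC view from the multi-instance example above), the CDC stream is upserted into a Hudi table that downstream layers can read incrementally:

```java
// A rough, illustrative sketch of writing the CDC stream into a Hudi table.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HudiSinkSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        tEnv.executeSql(
            "CREATE TABLE dwd_orders_hudi (" +
            "  order_id BIGINT, order_status STRING, PRIMARY KEY (order_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 'hdfs:///warehouse/dwd_orders_hudi'," +
            "  'table.type' = 'MERGE_ON_READ'," +
            // Allow the next layer's job to read this table as an incremental stream.
            "  'read.streaming.enabled' = 'true'" +
            ")");

        // Continuously upsert the full + incremental CDC data into the lake table.
        tEnv.executeSql("INSERT INTO dwd_orders_hudi SELECT * FROM orders_all");
    }
}
```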

Summary: First, by combining Flink CDC, Flink's core computing capabilities, and Hudi, we achieved end-to-end stream-batch unification for the first time, covering the three links of collection, storage, and computation. The resulting link has an end-to-end data latency of 2-3 minutes, and this improved timeliness drives new business value, such as better fulfillment of logistics commitments and improved user experience. In terms of timeliness and cost, the quasi-real-time link avoids the offline SLA-deadline problem, and the combined cost of Hudi plus StarRocks is significantly lower than that of ES (roughly one third of the original after evaluation). Compared with the Lambda architecture, there are significant improvements in server cost, development efficiency, and data quality.

4. Future plans

Future planning includes the following aspects:

  • Try to implement schema evolution without stopping tasks, for example for Hudi and for JDQ.
  • Continue to enhance Flink CDC for JD scenarios, for example data encryption and full integration with the JRC real-time computing platform.
  • Try to migrate some Fregata production tasks to Flink CDC. The advantage is a unified technology stack, in line with the overall trend of technology convergence.
  • Combine with stream-batch unified storage to improve overall end-to-end timeliness, for example using Table Store to try to achieve lower end-to-end latency, such as second-level latency.

