FlinkX data synchronization

The pioneer of Flink-based data synchronization: FlinkX

FlinkX is a distributed offline/real-time data synchronization plug-in based on Flink that can efficiently synchronize data among multiple heterogeneous data sources. It was initially developed by Kangaroo Cloud (DTStack) in 2016 and is continuously maintained by a stable R&D team. It is open source on GitHub (see the end of the article for the address). In June of this year, batch-stream unification was completed, so data synchronization tasks for both offline computing and stream computing can be built on FlinkX.

FlinkX abstracts different source databases into Reader plug-ins and different target databases into Writer plug-ins. It has the following characteristics:

  1. Built on Flink, it supports distributed execution;
  2. Reading and writing are bidirectional: a database can serve as either the source or the target;
  3. It supports multiple heterogeneous data sources and can collect bidirectionally from nearly 20 data sources such as MySQL, Oracle, SQL Server, Hive, and HBase;
  4. It is highly scalable and flexible: a newly added data source can immediately interoperate with the existing ones.

 

In my personal understanding, FlinkX is a Connector that links various data sources. It shields a series of component compatibility issues and provides a unified way of connecting data sources and abstracting data entities; it is the infrastructure of a data channel. Speaking of data channels, the more comprehensive one is DataX.

DataX  is an offline synchronization tool for heterogeneous data sources, dedicated to achieving stable and efficient data synchronization functions between various heterogeneous data sources including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, etc.


As an offline data synchronization framework, DataX abstracts data source reading and writing into Reader/Writer plug-ins and incorporates them into the overall synchronization framework.


  • Reader: the data collection module, responsible for reading data from the source and sending it to the Framework.
  • Writer: the data writing module, responsible for continuously fetching data from the Framework and writing it to the destination.
  • Framework: connects Reader and Writer, acting as the data transmission channel between them and handling core technical issues such as buffering, flow control, concurrency, and data conversion.

In the Flink ecosystem, the counterpart that benchmarks against DataX is FlinkX; it may even come from the same group of developers.

Existing features of FlinkX

  • Supports **offline** data synchronization for about 25 data sources (MySQL, HBase, MongoDB, Redis, Hive, etc.) and **real-time** data synchronization (Kafka, MySQL binlog, etc.);

  • Most plug-ins support concurrent reading and writing of data, which can greatly increase the speed of reading and writing;

  • Some plug-ins support the function of failure recovery, which can restore tasks from the failed location, saving running time;

  • The Reader plug-in of the relational database supports interval polling function, which can continuously collect changing data;

  • Some databases support Kerberos security authentication;

  • It can limit the reading speed of the reader and reduce the impact on the business database;

  • Can record dirty data generated when the writer plug-in writes data;

  • The maximum number of dirty data can be limited;

  • Support multiple operating modes;

So when it comes to better support for Flink Connectors, FlinkX is second to none.


Low-level implementation

When I used DataX, it was a stand-alone (single-machine) synchronization tool, and distributed support in its core low-level channel was not great.

As a synchronization channel plug-in, the entire synchronization process must be high-performance, highly concurrent, and highly reliable, and it should also support incremental synchronization, resumable transfer, and real-time collection.

Consider a typical synchronization scenario:

For example, when synchronizing a table, if the sharding strategy is reasonable we can add multiple data pipelines to improve synchronization performance up to the theoretical limits of the source and target, with each pipeline synchronizing a different shard.
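As a rough illustration of what such a sharding strategy could look like, here is a minimal sketch (not FlinkX code, assuming a numeric primary key column): the table's key range is split into one interval per pipeline.

```java
// A minimal sketch (not FlinkX code), assuming a numeric primary key column:
// split the table's key range into one interval per pipeline so that several
// parallel readers can each synchronize a different shard.
import java.util.ArrayList;
import java.util.List;

public class ShardingSketch {

    /** A shard described by an inclusive lower bound and an exclusive upper bound on the key. */
    public static class Shard {
        final long fromId;
        final long toId;
        Shard(long fromId, long toId) { this.fromId = fromId; this.toId = toId; }
        @Override public String toString() { return "[" + fromId + ", " + toId + ")"; }
    }

    /** Split the key range [minId, maxId] into `channels` roughly equal intervals. */
    public static List<Shard> split(long minId, long maxId, int channels) {
        List<Shard> shards = new ArrayList<>();
        long span = (maxId - minId + channels) / channels;   // ceiling of rangeSize / channels
        for (int i = 0; i < channels; i++) {
            long from = minId + i * span;
            long to = Math.min(from + span, maxId + 1);
            if (from <= maxId) {
                shards.add(new Shard(from, to));
            }
        }
        return shards;
    }

    public static void main(String[] args) {
        // e.g. ids 1..1000 read by 4 pipelines, each issuing
        // "WHERE id >= fromId AND id < toId" against the source table.
        split(1, 1000, 4).forEach(System.out::println);
    }
}
```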

And Flink makes up for exactly this shortcoming.

(Figure from the community, not reproduced here.)

The following briefly explains how FlinkX implements three aspects: incremental synchronization, resumable transfer, and real-time collection.

Incremental synchronization

Incremental synchronization means that the maximum value of an incremental column is recorded on each run, and the next run starts synchronizing from that position.

An accumulator is a simple structure that supports an add operation and a final accumulated result, which is available after the job finishes.

Building on Flink, you can use Flink's accumulator to record the maximum value seen by a job, and each run of the synchronization task uses the value recorded by the previous instance as its starting position.
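For illustration, here is a minimal sketch (not FlinkX's actual implementation) that uses Flink's `LongMaximum` accumulator to record the largest value of a hypothetical incremental column `id`; the value is read back from the `JobExecutionResult` after the job finishes and could be persisted as the starting position of the next run.

```java
// A minimal sketch (not FlinkX's actual implementation): use a Flink LongMaximum
// accumulator to record the largest value of a hypothetical incremental column "id",
// read it back after the job finishes, and persist it as the next run's start position.
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.LongMaximum;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalSyncSketch {

    /** Hypothetical record type: "id" is the incremental column. */
    public static class SourceRow {
        public long id;
        public String payload;
        public SourceRow() { }
        public SourceRow(long id, String payload) { this.id = id; this.payload = payload; }
    }

    /** Passes rows through while tracking the maximum id seen so far. */
    public static class MaxIdTracker extends RichMapFunction<SourceRow, SourceRow> {
        private final LongMaximum maxId = new LongMaximum();

        @Override
        public void open(Configuration parameters) {
            // Register the accumulator so its result is visible after the job ends.
            getRuntimeContext().addAccumulator("maxId", maxId);
        }

        @Override
        public SourceRow map(SourceRow row) {
            maxId.add(row.id);
            return row;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(new SourceRow(1, "a"), new SourceRow(5, "b"), new SourceRow(3, "c"))
           .map(new MaxIdTracker())
           .print();

        JobExecutionResult result = env.execute("incremental-sync-sketch");
        // Persist this value (e.g. in job metadata); the next run only reads rows with id > maxId.
        Long maxId = result.getAccumulatorResult("maxId");
        System.out.println("max id of this run: " + maxId);
    }
}
```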

Resumable transfer

Resumable transfer means that when an error occurs during a synchronization task, there is no need to re-synchronize from the beginning; the task only needs to resume from where it failed last time, reducing the cost of synchronization.

Get up from wherever you fall, no need to start again from the starting line.

The principle is to record the synchronized position in real time and read the last synchronized record on the next run. Flink's CheckPoint mechanism is a perfect fit for this.

CheckPoint is the core of Flink's fault-tolerance mechanism, implemented through asynchronous, lightweight distributed snapshots. A distributed snapshot captures a globally consistent snapshot of Task/Operator state at the same point in time.


Flink injects checkpoint barriers into the data stream at intervals; the barriers divide the stream, and the state for the data between two barriers is saved as one CheckPoint. When the application fails, all states can be restored from the last snapshot.
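A minimal sketch of enabling this mechanism in a plain Flink job (generic Flink configuration, not FlinkX-specific) might look like this:

```java
// A minimal sketch of enabling checkpoints in a plain Flink job (generic Flink
// configuration, not FlinkX-specific), so a failed job can restore from the last snapshot.
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a snapshot every 60 seconds with exactly-once guarantees.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        // Keep the last completed checkpoint when the job is cancelled,
        // so the task can later be resumed from it.
        config.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // At most one checkpoint in flight, and at least 30 seconds between checkpoints.
        config.setMaxConcurrentCheckpoints(1);
        config.setMinPauseBetweenCheckpoints(30_000);

        env.fromElements("a", "b", "c").print();
        env.execute("checkpoint-sketch");
    }
}
```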

Real-time collection

Real-time collection refers to real-time data synchronization: when data in the source is inserted, deleted, or updated, the synchronization task monitors these changes and synchronizes the changed data to the target data source in real time. Because of this real-time nature, the synchronization task keeps running and never stops. Kafka is generally used as a real-time collection tool; here FlinkX supports capturing mysql-binlog and mongodb-oplog.


The difficulty of real-time collection lies in the update strategy for correcting data. For example, when old data is updated in large batches, the pressure on the target data source can be particularly high, so choosing an update strategy is hard. As far as I know, most users do not handle corrections at all and simply append. Many people use the Lambda architecture and run offline batches to periodically correct the results. Flink's subsequent API roadmap plans for unified stream and batch processing, which is good news for development, operations, and maintenance.
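To make the two strategies concrete, here is a hedged illustration (hypothetical table names and plain JDBC, not FlinkX code) of the difference between simply appending every change event and applying it as an upsert/delete against the target:

```java
// A hedged illustration (not FlinkX code): a change record captured from a binlog-style
// source, and the difference between appending it and applying it as an upsert/delete.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ChangeEventSketch {

    enum Op { INSERT, UPDATE, DELETE }

    /** Hypothetical change record: primary key plus new value and operation type. */
    static class ChangeEvent {
        final long id; final String value; final Op op;
        ChangeEvent(long id, String value, Op op) { this.id = id; this.value = value; this.op = op; }
    }

    /** Append-only strategy: every event becomes a new row; corrections are left to offline batches. */
    static void append(Connection conn, ChangeEvent e) throws Exception {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO target_log(id, val, op) VALUES (?, ?, ?)")) {
            ps.setLong(1, e.id); ps.setString(2, e.value); ps.setString(3, e.op.name());
            ps.executeUpdate();
        }
    }

    /** Upsert/delete strategy: the target always reflects the latest state, but large
     *  correction batches put much more pressure on the target database. */
    static void applyChange(Connection conn, ChangeEvent e) throws Exception {
        String sql = (e.op == Op.DELETE)
                ? "DELETE FROM target_table WHERE id = ?"
                : "INSERT INTO target_table(id, val) VALUES (?, ?) "
                  + "ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val";   // PostgreSQL-style upsert
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, e.id);
            if (e.op != Op.DELETE) ps.setString(2, e.value);
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical connection; any database with an upsert syntax works similarly.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "user", "pwd");
        applyChange(conn, new ChangeEvent(42L, "new-value", Op.UPDATE));
        append(conn, new ChangeEvent(42L, "new-value", Op.UPDATE));
        conn.close();
    }
}
```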

 

FlinkX application scenarios

The FlinkX data synchronization plug-in is mainly used in the data synchronization/data integration module of a big data development platform. It usually combines the efficient low-level synchronization plug-in with an interface-based configuration method so that big data developers can develop data synchronization tasks concisely and quickly: business database data is synchronized to the big data storage platform for data modeling and development, and once development is complete, the processed result data is synchronized back to the business application databases for use by the enterprise's data services.

 

Detailed explanation of the working principle of FlinkX

FlinkX is implemented on top of Flink; its technology selection and advantages are detailed at https://mp.weixin.qq.com/s/uQbGLY3_cj0h2H_PZZFRGw . A FlinkX data synchronization task is essentially a Flink program: the configured read and write operations are translated into a StreamGraph and executed on Flink. FlinkX developers only need to focus on implementing the InputFormat and OutputFormat interfaces. The working principle is as follows:

 

 

Engine is a task scheduling engine encapsulated by Kangaroo Cloud. A data synchronization task configured on the web side is first submitted to this scheduling engine. The Template module then loads the Reader and Writer plug-ins corresponding to the source and target databases according to the task's configuration. The Reader plug-in implements the InputFormat interface and obtains a DataStream object from the source database; the Writer plug-in implements the OutputFormat interface and associates the target database with that DataStream. Reading and writing are then chained together through the DataStream, assembled into a Flink job, and submitted to the Flink cluster to run.
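The following is a minimal sketch of this Reader/Writer pattern using Flink's generic `InputFormat`/`OutputFormat` interfaces; `DemoReader` and `DemoWriter` are hypothetical stand-ins for real database plug-ins, not FlinkX's actual classes.

```java
// A minimal sketch of the pattern FlinkX plug-ins follow (not FlinkX's actual classes):
// a Reader implements Flink's InputFormat, a Writer implements OutputFormat, and the
// engine wires them together into one Flink job via createInput / writeUsingOutputFormat.
import java.io.IOException;
import org.apache.flink.api.common.io.GenericInputFormat;
import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReaderWriterSketch {

    /** Hypothetical "reader": emits a few rows, standing in for a database read. */
    public static class DemoReader extends GenericInputFormat<String> {
        private int current = 0;
        private static final String[] ROWS = {"row-1", "row-2", "row-3"};

        @Override
        public boolean reachedEnd() {
            return current >= ROWS.length;
        }

        @Override
        public String nextRecord(String reuse) {
            return ROWS[current++];
        }
    }

    /** Hypothetical "writer": prints records, standing in for a database write. */
    public static class DemoWriter implements OutputFormat<String> {
        @Override
        public void configure(Configuration parameters) { }

        @Override
        public void open(int taskNumber, int numTasks) { }

        @Override
        public void writeRecord(String record) throws IOException {
            System.out.println("write -> " + record);
        }

        @Override
        public void close() { }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Reader -> DataStream -> Writer, assembled into one Flink job.
        env.createInput(new DemoReader())
           .writeUsingOutputFormat(new DemoWriter());

        env.execute("reader-writer-sketch");
    }
}
```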

Previously, based on Flink's sharding and accumulator features, FlinkX solved scenarios such as incremental synchronization, multi-channel control, dirty data management, and error management during data synchronization. Over the past six months, based on Flink's checkpoint mechanism, it has added features such as resumable transfer and continuous running of streaming collection tasks. Let's look at these new features.

(1) Resumable transfer (resume from breakpoint)

During data synchronization, suppose a task needs to synchronize 500 GB of data to the target database and has already been running for 15 minutes, but fails at 400 GB because of insufficient cluster resources, network problems, or other factors. If the whole task had to be rerun, the person responsible would go crazy. FlinkX supports resumable transfer based on the checkpoint mechanism: when a synchronization task fails for such reasons, there is no need to rerun it from scratch; it simply resumes from the breakpoint, saving rerun time and cluster resources.

Flink's checkpoint feature is the core of its fault tolerance. According to the configuration, it periodically takes snapshots of Operator/Task state in the job and persists that state data. When a Flink program crashes unexpectedly and is restarted, it can selectively restore from these snapshots, thereby correcting the state inconsistencies caused by the failure.

Resumable transfer can also cooperate with the task failure retry mechanism: when task execution fails, the system automatically retries, and if the retry succeeds it continues synchronizing from the breakpoint position, thereby reducing manual operations and maintenance.
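A minimal sketch of pairing checkpoints with an automatic restart strategy in plain Flink (again, not FlinkX's own configuration) could look like this:

```java
// A minimal sketch (not FlinkX configuration): pairing checkpoints with an automatic
// restart strategy, so a failed job retries on its own and resumes from the last snapshot.
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetrySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000);   // snapshots to resume from
        // Retry up to 3 times, waiting 10 seconds between attempts; state is
        // restored from the latest completed checkpoint on each restart.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        env.fromElements("a", "b", "c").print();
        env.execute("retry-sketch");
    }
}
```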

(2) Real-time collection and continuous running

In June of this year, the Kangaroo Cloud DataStack R&D team implemented streaming data collection based on FlinkX. It can collect data sources such as MySQL binlog, Filebeat, and Kafka in real time, and write to data sources such as Kafka, Hive, HDFS, and Greenplum. Tasks also support limiting the number of concurrent jobs and the job rate, as well as dirty data management. Based on the checkpoint mechanism, real-time collection tasks can keep running continuously: when collection is interrupted by the business data source or by the Flink program itself, the stream's read position is preserved in the snapshots Flink stores periodically, so that after the fault is fixed, the task can resume from a saved data breakpoint, ensuring data integrity. This feature is available in Kangaroo Cloud's StreamWorks product; everyone is welcome to check it out.

(3) Dirty data management of streaming data

Previously, in the BatchWorks offline computing product, dirty data management for offline data synchronization had already been implemented, with dirty data error management based on Flink's accumulators: when the error count reaches the configured threshold, the task fails. Now real-time collection of streaming data also supports this feature, i.e., while source data is written to the target database, error records are stored so that the dirty data produced during synchronization can be analyzed and processed later. However, since streaming collection is uninterrupted, the task is not stopped when the error count reaches the threshold; users analyze and handle the dirty data themselves.
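As an illustration of the accumulator-based error management described above, here is a minimal sketch (not FlinkX's implementation) that counts dirty records and fails the task once a configured threshold is exceeded; in streaming collection the failure would be skipped and the record only stored instead.

```java
// A minimal sketch (not FlinkX's implementation): count dirty records with a Flink
// accumulator and fail the (batch-style) task once a configured threshold is exceeded.
import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class DirtyDataSketch {

    public static class DirtyDataFilter extends RichFlatMapFunction<String, Long> {
        private final long maxErrors;                        // configured threshold
        private final LongCounter dirtyCount = new LongCounter();

        public DirtyDataFilter(long maxErrors) { this.maxErrors = maxErrors; }

        @Override
        public void open(Configuration parameters) {
            getRuntimeContext().addAccumulator("dirtyCount", dirtyCount);
        }

        @Override
        public void flatMap(String value, Collector<Long> out) {
            try {
                out.collect(Long.parseLong(value));          // "clean" record
            } catch (NumberFormatException e) {
                dirtyCount.add(1L);                          // record the dirty row
                // In offline sync the task fails once the threshold is reached;
                // in streaming collection the error is only logged/stored instead.
                if (dirtyCount.getLocalValue() > maxErrors) {
                    throw new RuntimeException(
                            "too many dirty records: " + dirtyCount.getLocalValue());
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("1", "2", "oops", "4")
           .flatMap(new DirtyDataFilter(10))
           .print();
        env.execute("dirty-data-sketch");
    }
}
```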

(4) Writing data to Greenplum and OceanBase data sources

Greenplum is an MPP database based on PostgreSQL that supports the storage and management of massive amounts of data and is currently used by many companies. Recently, DataStack implemented writing from multiple types of data sources to Greenplum based on FlinkX; besides full synchronization, it also supports incremental synchronous writing from some databases. OceanBase is a scalable relational database for the financial sector developed by Alibaba, and its usage is basically the same as MySQL. Reading and writing OceanBase data is likewise implemented through JDBC connections, synchronizing and writing data tables and fields, and it supports incremental writing to OceanBase as well as job synchronization channel and concurrency control.

When writing to relational databases such as Greenplum, transactions are not used by default, because when the data volume is particularly large a failed task would otherwise have a huge impact on the business database. However, transactions must be enabled to use resumable transfer; if the database does not support transactions, resumable transfer cannot be implemented. With resumable transfer enabled, the transaction is committed whenever Flink takes a snapshot, writing the current batch of data to the database. If the task fails between two snapshots, the data in the open transaction is not written to the database, and when the task resumes it continues synchronizing from the position recorded in the last snapshot. In this way, data can still be synchronized accurately even if the task fails repeatedly.
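A minimal sketch of this commit-on-snapshot idea, assuming a plain JDBC connection and a hypothetical target table (this is not FlinkX's JDBC writer), could look like this:

```java
// A minimal sketch of the commit-on-snapshot idea, assuming a plain JDBC connection and
// a hypothetical target table (not FlinkX's JDBC writer): rows are buffered in an open
// transaction and committed whenever Flink takes a snapshot, so the data between two
// snapshots is either fully written or rolled back and re-sent after recovery.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class TransactionalJdbcSinkSketch
        extends RichSinkFunction<String> implements CheckpointedFunction {

    // Hypothetical connection settings; Greenplum speaks the PostgreSQL protocol.
    private static final String URL = "jdbc:postgresql://greenplum-host:5432/demo";

    private transient Connection connection;
    private transient PreparedStatement statement;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(URL, "user", "password");
        connection.setAutoCommit(false);    // keep one transaction open between snapshots
        statement = connection.prepareStatement("INSERT INTO target_table(col) VALUES (?)");
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        statement.setString(1, value);
        statement.addBatch();               // buffered inside the open transaction
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Called when Flink takes a checkpoint: flush and commit everything
        // written since the previous snapshot.
        statement.executeBatch();
        connection.commit();
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // On restore, the source replays data from the last snapshot,
        // so this simple sketch needs no extra state here.
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.rollback();          // discard any uncommitted data on shutdown/failure
            connection.close();
        }
    }
}
```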

 

 

FlinkX open source address: https://github.com/DTStack/flinkx 

For specific technical implementation, please refer to https://mp.weixin.qq.com/s/VknlH8L2kpnlcJ3990ZkUw for details

Summary

This article mainly records some new ideas and insights about data synchronization gained after learning about the combination of Flink and DataX. The characteristic of many open source tools today is that they are just tools: they lack a technology ecosystem and an application ecosystem. From the user's perspective, more attention is paid to job development and operational control; from the job developer's perspective, both FlinkX and DataX are currently used through JSON configuration, with no dedicated development tool. The follow-up plans published for FlinkX are metadata management and job archiving. And our company's data governance and data development products make up for such shortcomings to a certain extent.


Source: blog.csdn.net/Baron_ND/article/details/112327154