Exploration and practice of open source big data integration framework ChunJun in data restoration

Hadoop is one of the most well-known open source infrastructure projects under the Apache Foundation. Since its birth in 2006, it has gradually developed into the most important basic component for massive data storage and processing, forming a very rich technical ecosystem.

As the top Hadoop open source ecological technology summit in China, the 4th China Apache Hadoop Meetup was successfully held in Shanghai on September 24, 2022.

Focusing on the theme of "Cloud Data Intelligence Gathers the Mainstay", guests from enterprises such as Huawei, Alibaba, NetEase, ByteDance, bilibili, Ping An Bank, Kangaroo Cloud, Intel, Kyligence, and Ampere, as well as from open source communities including Spark, Fluid, ChunJun, Kyuubi, Ozone, IoTDB, Linkis, Kylin, and Uniffle, took part in the talks and discussions.


As one of the participating communities in this Meetup and a project in the field of big data, ChunJun also brought some new voices:

What is the implementation and principle of the ChunJun framework in real-time data collection and restoration? What progress has ChunJun made recently, and what ideas does the community have for its future development?

Chao Xu, a senior big data engine development expert at Kangaroo Cloud, shared the exploration and practice of ChunJun data integration in data restoration from a unique perspective.

1. Introduction to ChunJun Framework

The first question: what is the ChunJun framework? What can it do?

ChunJun (formerly FlinkX) is a data integration framework developed by Kangaroo Cloud on top of Flink. After more than four years of iteration, it has become a stable, efficient, and easy-to-use batch-stream unified data integration tool that enables efficient data synchronization between a variety of heterogeneous data sources. It currently has 3.2K+ stars on GitHub.

Open source project address:

https://github.com/DTStack/chunjun

https://gitee.com/dtstack_dev_0/chunjun

01 ChunJun frame structure

The ChunJun framework is developed on top of Flink, provides a rich set of plug-ins, and adds features such as resuming from breakpoints, dirty data management, and data restoration.


02 ChunJun batch synchronization

• Supports incremental synchronization

• Supports resuming from breakpoints (checkpoint-based)

• Supports multiple channels & concurrency

• Supports dirty data handling (logging and control)

• Supports rate limiting

• Supports transformers

03 ChunJun Offline


2. Implementation and principle of real-time data collection

01 An example


02 ChunJun plugin loading logic


03 ChunJun plugin definition


04 ChunJun data flow


05 ChunJun dynamic execution

When monitoring multiple tables, including tables created after the job starts, how do we write to the downstream:

• Supports converting an Update into its before and after images

• Adds extended parameters: DB, Schema, Table, ColumnInfo

• Supports dynamic construction of PreparedStatement
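As a rough illustration of the last point, the statement behind a PreparedStatement can be assembled from dynamically discovered table metadata (schema, table, column names). This is a minimal sketch, not ChunJun's actual API; the class and method names are hypothetical:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: build an INSERT statement for a PreparedStatement
// from dynamically discovered metadata (schema, table, column names).
class DynamicInsertBuilder {

    /** Builds "INSERT INTO schema.table (c1, c2) VALUES (?, ?)" style SQL. */
    public static String buildInsertSql(String schema, String table, List<String> columns) {
        String cols = String.join(", ", columns);
        // One "?" placeholder per column, bound later via PreparedStatement.setObject.
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + schema + "." + table
                + " (" + cols + ") VALUES (" + placeholders + ")";
    }
}
```

Because the statement is rebuilt from the extended parameters, a newly discovered table only needs its ColumnInfo to be written downstream.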

06 ChunJun interval polling

What is interval polling? How do we do it?

• Check the polling field type: if it is not a numeric type and the source parallelism is greater than 1, an unsupported-configuration error is reported

• Create three data shards: startLocation is null or the configured value, and mod is 0, 1, and 2 respectively

• Construct the SQL: different databases use different modulo functions, implemented by their respective plugins

select id,name,age from table where (id > ? and ) mod(id, 3) = 0 order by id;

select id,name,age from table where (id > ? and ) mod(id, 3) = 1 order by id;

select id,name,age from table where (id > ? and ) mod(id, 3) = 2 order by id;

• Execute SQL, query and update lastRow

• After the first batch of results is returned, if startLocation is not configured in the script, the query SQL changes from:

select id,name,age from table where mod(id, 3) = 1 order by id;

to:

select id,name,age from table where id > ? and mod(id, 3) = 1 order by id;

• At checkpoint (CP) time, read the id value from lastRow and save it to state
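The polling SQL above can be sketched as simple string construction. This is a simplified illustration; `IntervalPollBuilder` and its methods are hypothetical names, and the real plugins delegate the modulo function to each database dialect:

```java
// Hypothetical sketch of interval-polling SQL construction: each shard filters
// rows by mod(id, parallelism) == shardMod, and once a start location exists
// the "id > ?" condition is added (bound to the last id saved in state).
class IntervalPollBuilder {

    /** First query for a shard: no start location yet, only the modulo condition. */
    public static String firstSql(int shardMod, int parallelism) {
        return "select id,name,age from table where mod(id, " + parallelism + ") = "
                + shardMod + " order by id";
    }

    /** Subsequent queries: also filter by the last seen id via the ? placeholder. */
    public static String pollSql(int shardMod, int parallelism) {
        return "select id,name,age from table where id > ? and mod(id, " + parallelism
                + ") = " + shardMod + " order by id";
    }
}
```

Each parallel subtask keeps polling with `pollSql`, binding the id taken from lastRow at the previous round, which is exactly the value snapshotted to state at checkpoint time.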

3. Implementation and principle of real-time data restoration

01 Introduction to data restoration

Data restoration builds on the CDC capture capability of the corresponding database, such as the Oracle LogMiner and MySQL binlog mentioned above. It supports completely restoring the captured data to the downstream, so not only DML but also DDL needs to be monitored, and all changes to the upstream data source are sent to the downstream database for restoration.

Difficulties:

• How DDL and DML are sent downstream in order

• How DDL statements are executed according to the characteristics of the downstream data source (DDL conversion between heterogeneous data sources)

• How insert, update, and delete in DML statements are handled

02 An example


03 Overall Process

After data is captured from the upstream data source and processed by a series of operators, it is restored to the target data source exactly in the order in which it occurred in the original table, completing the real-time data collection and restoration link.


04 DDL analysis


Data Restoration - DDL Transformation

• Parse the source DdlSql into a SqlNode based on Calcite

• Convert the SqlNode into the intermediate representation DdlData

• Convert DdlData back to SQL: handles conversion between different syntaxes and mutual conversion of field types between different data sources
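The intermediate DdlData step can be sketched as follows: a parsed DDL statement is held in a dialect-neutral form and re-rendered for the target data source, with field types mapped between dialects. This is a minimal illustration under assumed names (`DdlData`, `toPostgresSql`, and the type table are hypothetical, not ChunJun's actual classes):

```java
import java.util.Map;

// Hypothetical sketch of the DdlData intermediate representation and its
// conversion to a target dialect, including field type mapping.
class DdlDataConverter {

    /** Dialect-neutral description of a CREATE TABLE statement. */
    static class DdlData {
        final String table;
        final Map<String, String> columns; // column name -> neutral type
        DdlData(String table, Map<String, String> columns) {
            this.table = table;
            this.columns = columns;
        }
    }

    /** Maps a neutral type to a PostgreSQL type (tiny illustrative subset). */
    static String toPgType(String neutral) {
        switch (neutral) {
            case "INT":      return "INTEGER";
            case "DATETIME": return "TIMESTAMP";
            default:         return neutral;
        }
    }

    /** Renders DdlData as a PostgreSQL CREATE TABLE statement. */
    public static String toPostgresSql(DdlData ddl) {
        StringBuilder sb = new StringBuilder("CREATE TABLE " + ddl.table + " (");
        boolean first = true;
        for (Map.Entry<String, String> e : ddl.columns.entrySet()) {
            if (!first) sb.append(", ");
            sb.append(e.getKey()).append(' ').append(toPgType(e.getValue()));
            first = false;
        }
        return sb.append(")").toString();
    }
}
```

The real pipeline obtains the neutral form from Calcite's SqlNode rather than building it by hand, but the dialect-specific rendering step works on the same principle.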

05 Name Mapping

In real-time restoration, the upstream and downstream table and field correspondences currently must be identical: a table under the upstream database/schema can only be written to the table with the same name under the downstream database/schema, and the field names must also match. This iteration adds custom mapping for table paths and custom mapping for field types.

• db or schema mapping

• Table name mapping

• Field name mapping (case conversion provided) and implicit type conversion
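The mapping rules can be pictured as a small rewriting layer applied to every table path and field name before writing downstream. The sketch below assumes a single illustrative schema rule; `NameMapper` and its configuration model are hypothetical, not ChunJun's actual one:

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the name-mapping layer: rewrite schema.table paths
// by configured rules and apply case conversion to field names.
class NameMapper {
    // Illustrative rule: upstream schema "prod" maps to downstream schema "ods".
    static final Map<String, String> SCHEMA_RULES = Map.of("prod", "ods");

    /** Rewrites a schema.table path according to the configured rules. */
    public static String mapTablePath(String schema, String table) {
        return SCHEMA_RULES.getOrDefault(schema, schema) + "." + table;
    }

    /** Case conversion for field names, as provided by the name-mapping feature. */
    public static String mapField(String field, boolean toUpper) {
        return toUpper ? field.toUpperCase(Locale.ROOT) : field.toLowerCase(Locale.ROOT);
    }
}
```

With such a layer in place, the upstream and downstream paths no longer have to be literally identical, which is exactly what the custom mapping in this iteration enables.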

06 Intermediate data cache

Data (whether DDL or DML) is sent to an unblocked queue keyed by the corresponding table name. During polling, the worker processes the data in unblocked queues. When it encounters DDL data, it sets that queue to the blocked state and hands the queue reference to the store for processing.

After the store takes the queue reference, it sends the DDL data at the head of the queue to the external storage and monitors the external storage's feedback on the DDL (the monitoring is performed by an additional thread in the store). During this time, the queue remains blocked.

After the feedback from the external storage is received, the DDL data at the head of the queue is removed, the queue returns to the unblocked state, and the queue reference is returned to the worker.
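The block/unblock handshake above can be sketched as a per-table queue with a blocked flag. This is a single-threaded simulation for clarity (the real design uses separate worker and store threads), and the `TableQueue` class is a hypothetical illustration:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the per-table cache queue: the worker drains DML
// until it hits a DDL record, then blocks the queue; once the external
// storage acknowledges the DDL, the store removes it and unblocks the queue.
class TableQueue {
    private final Deque<String> queue = new ArrayDeque<>();
    private boolean blocked = false;

    public void offer(String record) { queue.addLast(record); }

    /** Worker side: process DML until DDL is seen, then block the queue. */
    public List<String> drainDml() {
        List<String> processed = new ArrayList<>();
        while (!blocked && !queue.isEmpty()) {
            if (queue.peekFirst().startsWith("DDL")) {
                blocked = true; // hand the queue reference to the store here
                break;
            }
            processed.add(queue.pollFirst());
        }
        return processed;
    }

    /** Store side: external storage acknowledged the DDL; remove it and unblock. */
    public String ackDdl() {
        String ddl = queue.pollFirst();
        blocked = false;
        return ddl;
    }

    public boolean isBlocked() { return blocked; }
}
```

Blocking per table keeps the DDL ordered relative to the DML of that table while other tables' queues continue to flow.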


07 The destination receives data


• Get the DdlOperator object

• Convert it to target data source SQL with the DDLConvertImpl parser corresponding to the target data source

• Execute the corresponding SQL, such as dropping a table

• A trigger updates the DDLChange table, modifying the corresponding DDL status

• The intermediate store Restore operator monitors the status change and performs the subsequent data delivery
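The status handshake around the DDLChange table can be pictured as a tiny state store: the sink marks a DDL as executed, and the Restore operator polls that status before releasing the blocked queue. The sketch below is hypothetical (the `DdlStatusTable` class and its status values are illustrative names, not ChunJun's actual schema):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the DDLChange status handshake: the destination
// marks a DDL as executed; the Restore operator polls until it sees that
// status and only then resumes delivering the blocked table's data.
class DdlStatusTable {
    enum Status { PENDING, EXECUTED }

    private final Map<String, Status> statusById = new HashMap<>();

    /** Register a DDL as pending when it is sent downstream. */
    public void register(String ddlId) { statusById.put(ddlId, Status.PENDING); }

    /** Destination side: mark the DDL as executed after running its SQL. */
    public void markExecuted(String ddlId) { statusById.put(ddlId, Status.EXECUTED); }

    /** Restore operator side: check whether delivery may resume. */
    public boolean isExecuted(String ddlId) {
        return statusById.get(ddlId) == Status.EXECUTED;
    }
}
```

In the real pipeline this table lives in external storage so that both sides of the handshake survive restarts.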

4. ChunJun's future plan

• Provide Session management

• Provide RESTful services, with ChunJun itself acting as a service that is easy to integrate with surrounding systems

• Enhancements to real-time data restoration, including extended DDL parsing to support more data sources

In addition, the full video of this talk is available for replay. If you are interested, please visit Kangaroo Cloud's Bilibili channel to watch it.

Apache Hadoop Meetup 2022

ChunJun Video Review:

https://www.bilibili.com/video/BV1sN4y1P7qk/?spm_id_from=333.337.search-card.all.click

Kangaroo Cloud Open Source Framework DingTalk Technical Exchange Group (30537511): students interested in big data open source projects are welcome to join and exchange the latest technical information. Open source project library address: https://github.com/DTStack/Taier


Origin my.oschina.net/u/3869098/blog/5583386