Hadoop is one of the most well-known open source infrastructure projects under the Apache Software Foundation. Since its birth in 2006, it has gradually developed into the most important foundational component for massive-scale data storage and processing, forming a very rich technical ecosystem.
As the premier Apache Hadoop open source ecosystem summit in China, the 4th China Apache Hadoop Meetup was successfully held in Shanghai on September 24, 2022.
Under the theme "Cloud Data Intelligence Gathers the Mainstay", guests from enterprises such as Huawei, Alibaba, NetEase, ByteDance, bilibili, Ping An Bank, Kangaroo Cloud, Intel, Kyligence, and Ampere, as well as from open source communities such as Spark, Fluid, ChunJun, Kyuubi, Ozone, IoTDB, Linkis, Kylin, and Uniffle, participated in the sharing and discussion.
As one of the participating communities in this Meetup and a project in the field of big data, ChunJun also brought some new voices:
How are real-time data collection and restoration implemented in the ChunJun framework, and on what principles? What progress has ChunJun made recently, and what are the ideas for its future development?
Chao Xu, a senior big data engine development expert at Kangaroo Cloud, shared his answers, introducing the exploration and practice of ChunJun data integration in data restoration from a unique perspective.
1. Introduction to ChunJun Framework
The first question: what is the ChunJun framework, and what can it do?
ChunJun (formerly FlinkX) is a data integration framework developed by Kangaroo Cloud on top of Flink. After more than 4 years of iteration, it has become a stable, efficient, and easy-to-use batch-stream unified data integration tool that enables efficient data synchronization between a variety of heterogeneous data sources; the project currently has 3.2K+ stars.
Open source project address:
https://github.com/DTStack/chunjun
https://gitee.com/dtstack_dev_0/chunjun
01 ChunJun frame structure
The ChunJun framework is developed based on Flink, provides a wealth of plug-ins, and adds features such as breakpoint resuming, dirty data management, and data restoration.
02 ChunJun batch synchronization
• Supports incremental synchronization
• Supports resuming from breakpoints
• Supports multiple channels & concurrency
• Supports dirty data management (logging and control)
• Supports rate limiting
• Supports transformers
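These capabilities are driven by a job description file. As a rough sketch of what a batch sync job might look like (the plugin names and parameter keys below are illustrative assumptions; consult the ChunJun documentation for the exact schema):

```json
{
  "job": {
    "content": [{
      "reader": { "name": "mysqlreader", "parameter": { "...": "source connection and table settings" } },
      "writer": { "name": "hdfswriter",  "parameter": { "...": "target settings" } }
    }],
    "setting": {
      "speed":      { "channel": 3 },
      "errorLimit": { "record": 100 },
      "restore":    { "isRestore": true }
    }
  }
}
```

Here `speed.channel` would correspond to the multi-channel/concurrency feature, `errorLimit` to dirty data control, and `restore` to resuming from breakpoints.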
03 ChunJun Offline
2. Implementation and principles of real-time data collection
01 A sample
02 ChunJun plugin loading logic
03 ChunJun plugin definition
04 ChunJun data flow
05 ChunJun dynamic execution
When monitoring multiple tables, including data from newly added tables, how do we write to the downstream:
• Supports converting an Update into a before/after record pair
• Adds extended parameters: DB, Schema, Table, ColumnInfo
• Supports dynamically constructing the PreparedStatement
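The dynamic statement construction can be sketched as follows. This is a simplified Python illustration, not ChunJun's actual Java/JDBC implementation; the function name and signature are hypothetical:

```python
def build_insert_sql(schema: str, table: str, columns: list[str]) -> str:
    """Build a parameterized INSERT for a (possibly newly discovered) table.

    Sketch only: ChunJun builds a JDBC PreparedStatement per
    (db, schema, table) from the extended CDC metadata it carries.
    """
    cols = ", ".join(columns)
    placeholders = ", ".join(["?"] * len(columns))
    return f"INSERT INTO {schema}.{table} ({cols}) VALUES ({placeholders})"

sql = build_insert_sql("public", "orders", ["id", "name", "age"])
# -> "INSERT INTO public.orders (id, name, age) VALUES (?, ?, ?)"
```

Because the statement is derived from the extended parameters (DB, Schema, Table, ColumnInfo), tables that appear after the job starts can still be written downstream without redeploying the job.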
06 ChunJun interval polling
What is interval polling? How do we do it?
• Check the polling field type; if it is not a numeric type and the source parallelism is greater than 1, throw a "not supported" error
• Create three data shards, with startlocation set to null or the configured value, and mod set to 0, 1, and 2 respectively
• Construct the SQL: different databases have different remainder functions, implemented by their respective plugins
(the parenthesized "id > ? and" predicate is included only once a start location is known)
select id,name,age from table where (id > ? and) mod(id, 3) = 0 order by id;
select id,name,age from table where (id > ? and) mod(id, 3) = 1 order by id;
select id,name,age from table where (id > ? and) mod(id, 3) = 2 order by id;
• Execute SQL, query and update lastRow
• After the first batch of results is returned, if startlocation was not configured in the script, the query SQL is updated from:
select id,name,age from table where mod(id, 3) = 1 order by id;
to:
select id,name,age from table where id > ? and mod(id, 3) = 1 order by id;
• During a checkpoint (CP), get the id value from lastRow and save it to state
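The sharding and SQL-rewriting steps above can be sketched as follows. This is a simplified Python illustration of the scheme; ChunJun itself builds these queries inside its JDBC plugins, and each database plugin supplies its own remainder function:

```python
def polling_sql(columns: list[str], table: str, shard: int,
                num_shards: int, has_start_location: bool) -> str:
    """Build the interval-polling query for one parallel shard.

    Each shard reads only rows where mod(id, num_shards) == shard;
    the "id > ?" predicate is added once a start location is known.
    """
    where = f"mod(id, {num_shards}) = {shard}"
    if has_start_location:
        where = f"id > ? and {where}"
    return f"select {', '.join(columns)} from {table} where {where} order by id"

# First poll, no startlocation configured in the script:
polling_sql(["id", "name", "age"], "table", 1, 3, False)
# -> "select id, name, age from table where mod(id, 3) = 1 order by id"

# After the first batch returns, the query is rewritten to:
polling_sql(["id", "name", "age"], "table", 1, 3, True)
# -> "select id, name, age from table where id > ? and mod(id, 3) = 1 order by id"
```

The `?` placeholder is then bound to the id saved from lastRow, so each poll only fetches rows newer than the last checkpointed position.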
3. Implementation and principle of real-time data restoration
01 Introduction to data restoration
Data restoration builds on the CDC capture capability of the source database, such as the Oracle LogMiner and MySQL binlog mentioned above, and supports completely replaying the captured data to the downstream. Therefore not only DML but also DDL must be monitored, so that all changes to the upstream data source are restored to the downstream database.
Difficulties:
• How to send DDL and DML downstream in the correct order
• How to execute DDL statements according to the characteristics of the downstream data source (DDL conversion between heterogeneous data sources)
• How to handle insert, update, and delete in DML statements
02 A sample
03 Overall Process
After data is captured from the upstream data source and processed by a series of operators, it is accurately restored to the target data source in the same order as in the original table, completing the real-time data collection and restoration link.
04 DDL analysis
Data Restoration - DDL Conversion
• Parse the source DdlSql into a SqlNode based on Calcite
• Convert the SqlNode into the intermediate representation DdlData
• Convert DdlData to SQL: conversion between different dialects, and mapping of field types between different data sources
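The three-step conversion can be sketched as follows. This is a minimal Python illustration: ChunJun's real parser is Calcite-based Java, and the field names of the intermediate representation and the type mapping below are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class DdlData:
    """Hypothetical dialect-neutral DDL representation (stands in for
    ChunJun's DdlData, which is produced from a Calcite SqlNode)."""
    operation: str       # e.g. "ADD_COLUMN"
    schema: str
    table: str
    column: str
    source_type: str     # field type in the source dialect

# Illustrative (not exhaustive) MySQL-to-Oracle type mapping.
MYSQL_TO_ORACLE = {"VARCHAR": "VARCHAR2", "DATETIME": "TIMESTAMP", "INT": "NUMBER(10)"}

def to_oracle_sql(d: DdlData) -> str:
    """Render the dialect-neutral DdlData as Oracle DDL."""
    col_type = MYSQL_TO_ORACLE.get(d.source_type, d.source_type)
    if d.operation == "ADD_COLUMN":
        return f"ALTER TABLE {d.schema}.{d.table} ADD {d.column} {col_type}"
    raise NotImplementedError(d.operation)

ddl = DdlData("ADD_COLUMN", "app", "orders", "note", "VARCHAR")
to_oracle_sql(ddl)  # -> "ALTER TABLE app.orders ADD note VARCHAR2"
```

The key design point is the intermediate representation: once the source DDL is parsed into a dialect-neutral form, each target plugin only needs to know how to render that form in its own dialect.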
05 Name Mapping
Previously in real-time restoration, upstream and downstream table and field correspondences had to be identical: a table under an upstream database/schema could only be written to the identically named table under the downstream database/schema, and the field names also had to match. This iteration adds custom mapping for table paths and for field names and types.
• db or schema conversion
• Table name conversion
• Field name mapping (case conversion provided) and implicit type conversion
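A minimal sketch of such mapping rules follows; the rule shape and function here are hypothetical illustrations, not ChunJun's actual configuration format:

```python
# Hypothetical table-path mapping rules: (src db, src schema) -> (dst db, dst schema).
TABLE_MAPPING = {("src_db", "src_schema"): ("dst_db", "dst_schema")}

def map_identifier(db: str, schema: str, table: str, field: str,
                   upper_case: bool = True) -> tuple[str, str, str, str]:
    """Map an upstream (db, schema, table, field) path to its downstream form,
    applying the path mapping and a case conversion on the field name."""
    dst_db, dst_schema = TABLE_MAPPING.get((db, schema), (db, schema))
    dst_field = field.upper() if upper_case else field.lower()
    return dst_db, dst_schema, table, dst_field

map_identifier("src_db", "src_schema", "orders", "user_id")
# -> ("dst_db", "dst_schema", "orders", "USER_ID")
```

Paths without an explicit rule fall through unchanged, so only the tables that actually need remapping have to be configured.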
06 Intermediate data cache
Data (whether DDL or DML) is sent to the unblocked queue for the corresponding table name. During polling, the worker processes the data in the unblocked queues. When it encounters DDL data, it sets that queue to the blocked state and hands the queue reference to the store for processing.
After the store gets the queue reference, it sends the DDL data at the head of the queue to the external storage and monitors the external storage's feedback on the DDL (this monitoring is performed by an additional thread in the store). During this time, the queue remains in the blocked state.
After receiving the feedback from the external storage, the store removes the DDL data at the head of the queue, returns the queue to the unblocked state, and hands the queue reference back to the worker.
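The worker/store handoff above can be sketched as follows. This is a single-threaded Python simplification: in ChunJun the store's feedback monitoring runs on a separate thread, whereas here the external storage's feedback is assumed to arrive immediately:

```python
import queue

class TableQueue:
    """Per-table cache queue plus a block flag (simplified sketch)."""
    def __init__(self):
        self.q = queue.Queue()
        self.blocked = False

def worker_poll(tq, apply_dml, hand_to_store):
    """Drain DML until DDL is met; then block the queue and hand it to the store."""
    while not tq.blocked and not tq.q.empty():
        kind, payload = tq.q.queue[0]       # peek at the head
        if kind == "DDL":
            tq.blocked = True               # stop processing this table's DML
            hand_to_store(tq)               # store applies the DDL and unblocks
            return
        tq.q.get()
        apply_dml(payload)

def store_apply(tq, execute_ddl):
    """Apply the head DDL externally; on feedback, remove it and unblock."""
    kind, payload = tq.q.get()              # remove the head DDL
    execute_ddl(payload)                    # feedback assumed immediate here
    tq.blocked = False                      # return the queue to the worker

# Usage: DML arriving before the DDL is applied first, then the DDL, then later DML.
tq = TableQueue()
for item in [("DML", "insert#1"), ("DDL", "alter table"), ("DML", "insert#2")]:
    tq.q.put(item)
applied = []
worker_poll(tq, applied.append, lambda t: store_apply(t, applied.append))
worker_poll(tq, applied.append, lambda t: store_apply(t, applied.append))
# applied == ["insert#1", "alter table", "insert#2"]
```

Blocking the queue while the DDL is in flight is what guarantees that DML written after a schema change never reaches the downstream before the schema change itself.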
07 The destination receives data
• Get the DdlOperator object
• Convert it to the target data source's SQL using the DDLConvertImpl parser corresponding to the target data source
• Execute the corresponding SQL, such as dropping a table
• Update the DDLChange table and modify the corresponding DDL status
• The intermediate store/Restore operator monitors status changes and performs the subsequent data delivery
4. ChunJun's future plan
• Provide session management
• Provide RESTful services, so that ChunJun itself runs as a service and can be easily integrated with surrounding systems
• Enhance real-time data restoration, including extending DDL parsing to support more data sources
In addition, the full video of this talk is available for viewing at any time. If you are interested, please visit Kangaroo Cloud's Bilibili channel.
Apache Hadoop Meetup 2022
ChunJun Video Review:
https://www.bilibili.com/video/BV1sN4y1P7qk/?spm_id_from=333.337.search-card.all.click
Kangaroo Cloud Open Source Framework DingTalk Technical Exchange Group (30537511): anyone interested in big data open source projects is welcome to join and exchange the latest technical information. Open source project library address: https://github.com/DTStack/Taier