Building a CDC Data Synchronization Pipeline Based on Apache SeaTunnel


Introduction

In today's rapidly developing, data-driven era, real-time and accurate data synchronization has become an indispensable part of enterprise information systems. With the advancement of distributed computing and big data technology, building an efficient and reliable data synchronization pipeline has become a real challenge.

Apache SeaTunnel, an advanced data integration development platform, makes it possible to build efficient CDC data synchronization pipelines. This article walks through the process of building a CDC data synchronization pipeline with Apache SeaTunnel, explains the key technologies and practical strategies behind it, and aims to provide practical guidance for professionals facing data synchronization challenges.

Good afternoon, everyone. The topic I am sharing today is building a CDC data synchronization pipeline based on Apache SeaTunnel. I previously worked mainly on a computing platform for APM monitoring and later moved to a data integration development platform. I am currently developing the CDC data synchronization pipeline based on Apache SeaTunnel and have long been active in the open source community: I am a PMC member of Apache SeaTunnel and a committer of Apache SkyWalking.

Introduction to Apache SeaTunnel

Apache SeaTunnel is a data integration development platform whose development has gone through several important stages:

  1. ETL era (1990s): data synchronization between structured databases, used to build data warehouses.
  2. The era of MPP and distributed technology: data warehouses are built with technologies such as Hive; at this stage, MapReduce programs are mainly used to move and transform data.
  3. The rise of data lake technology: data integration gains importance, with data first synchronized into the data lake or warehouse and business-oriented transformation and modeling performed afterwards.


Technical positioning and challenges


In the ELT pipeline, Apache SeaTunnel mainly solves simple transformation problems and moves data quickly. The challenges include:

  • Handling diverse data sources and differences between storage systems.
  • Minimizing the impact on source databases.
  • Adapting to different data integration scenarios, such as offline synchronization and real-time CDC synchronization.
  • Providing monitoring and quantitative metrics for data integration.

Important features

  • Easy to use: no code required; jobs are submitted through configuration files (see the minimal example below).
  • Operation monitoring: detailed read and write metrics are provided.
  • Rich ecosystem: a plug-in architecture with a unified reading and writing API.
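To make "submit jobs through configuration files" concrete, here is a minimal job definition in the spirit of SeaTunnel's documented config format. Exact option names vary between versions and connectors, so treat the keys below as an assumption to verify against the docs for your release.

```hocon
env {
  parallelism = 2
  job.mode = "STREAMING"   # CDC jobs run as streaming jobs
}

source {
  MySQL-CDC {
    base-url = "jdbc:mysql://localhost:3306/shop"
    username = "seatunnel"
    password = "secret"
    table-names = ["shop.orders"]
  }
}

sink {
  Console {}               # print captured changes; swap in a real sink
}
```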

The development history of Apache SeaTunnel

The predecessor of Apache SeaTunnel is Waterdrop. The project joined the Apache Incubator in 2021 and released its first Apache version in 2022. In October 2022, a major refactoring was carried out and a unified API was introduced.


In November 2022, an engine dedicated to data synchronization was developed. By the end of 2022, the connectors' read and write capabilities covered more than 100 data sources. In 2023, the main focus has been CDC and whole-database synchronization.


Introduction to CDC (Change Data Capture)

CDC, or Change Data Capture, is a technique for capturing database change events (such as inserts, updates, and deletes). In business databases the data is constantly changing; the role of CDC is to capture these events and synchronize them to data warehouses, data lakes, or other platforms so that the target storage stays consistent with the source database.

CDC application scenarios

  1. Data replication: e.g., building standby databases or read-write splitting.
  2. Data analysis: BI-oriented data analysis on a big data platform.
  3. Search services: e.g., synchronizing a product or document library to a search platform such as Elasticsearch (ES).
  4. Operational auditing: recording system changes for financial audits and similar purposes.

Pain Points of Common CDC Solutions

  1. Single-table job limitation: in most open source solutions, one job can usually process only one table.
  2. Separation of reads and writes: some platforms focus only on data capture, while others are responsible only for data writing.
  3. Multi-database support: different databases may require different synchronization platforms, which increases maintenance effort.
  4. Difficulty with large tables: performance bottlenecks may appear when processing very large tables.
  5. DDL change synchronization: synchronizing database schema (DDL) changes in real time is a complex and important requirement.

Application of Apache SeaTunnel in CDC

Through its connectors, Apache SeaTunnel implements the abstract Source and Sink APIs, i.e., the read and write APIs, to achieve data synchronization. Its design goals are:

  1. Support for multiple databases: such as MySQL, Oracle, etc.
  2. Zero coding: tables are created automatically, and tables can be added or removed dynamically without writing code.
  3. Efficient reading: first take a data snapshot, then track binlog changes.
  4. Consistency guarantees: exactly-once semantics, so there is no data duplication even after interruption and recovery.

Apache SeaTunnel CDC design practice focus

Apache SeaTunnel CDC handles data synchronization in two stages: snapshot reading and incremental tracking.

Snapshot reading phase

Basic process

  • Chunk division (splitting): to synchronize a large amount of historical data efficiently, each table is divided into multiple chunks (splits), and each chunk covers a portion of the data.
  • Parallel processing: the splits of each table are assigned to different readers through a routing algorithm and read in parallel (see the sketch after this list).
  • Event feedback mechanism: each reader reports its progress (watermark) to the split enumerator after finishing a split.
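As a rough illustration of the routing idea (the class and method names below are invented for this sketch, not SeaTunnel's actual API), splits can be assigned to readers with a stable hash, and each reader reports a watermark when it finishes a split:

```java
import java.util.List;

// Illustrative sketch only: assigns snapshot splits to parallel readers.
public class SplitRouter {
    // Route a split to one of N readers; a stable hash keeps the
    // assignment deterministic across restarts.
    static int assignReader(String splitId, int readerCount) {
        return Math.floorMod(splitId.hashCode(), readerCount);
    }

    public static void main(String[] args) {
        List<String> splits = List.of("orders-0", "orders-1", "orders-2", "users-0");
        int readers = 2;
        for (String split : splits) {
            System.out.printf("split %s -> reader %d%n", split, assignReader(split, readers));
        }
        // In the real pipeline, each reader would report a progress
        // watermark to the split enumerator after finishing a split.
    }
}
```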


Detailed explanation of Split

  • Composition: a split includes a unique ID, the ID of the table it belongs to, and partitioning details (such as the data range); a possible shape is sketched below.
  • Division method: splits can partition ranges based on different column types (such as numbers or timestamps).
  • Processing flow: the divided splits are distributed to readers, and the data watermark is reported after each split has been read.
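A hypothetical shape for such a split, following the composition described above; the record and field names are assumptions for illustration only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical snapshot split: a unique ID, the table it belongs to,
// and the key range [start, end) it covers.
public record SnapshotSplit(String splitId, String tableId, long start, long end) {

    // Divide a numeric key range into fixed-size chunks.
    static List<SnapshotSplit> divide(String tableId, long min, long max, long chunkSize) {
        List<SnapshotSplit> splits = new ArrayList<>();
        int i = 0;
        for (long lo = min; lo < max; lo += chunkSize) {
            splits.add(new SnapshotSplit(tableId + "-" + i++, tableId,
                    lo, Math.min(lo + chunkSize, max)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three splits: [0,4000), [4000,8000), [8000,10000)
        divide("shop.orders", 0, 10_000, 4_000).forEach(System.out::println);
    }
}
```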


Incremental tracking phase

Single-threaded stream reading


  • Streaming read characteristics: unlike the parallel reads of the snapshot phase, incremental tracking is usually a single-threaded operation.
  • Reduced load on the business database: this avoids pulling the binlog repeatedly and reduces pressure on the source database.


Split management

  • Unbounded split: the incremental split has no end point, which means stream reading is continuous.
  • Watermark management: the incremental split starts from the minimum watermark across all snapshot splits (see the sketch below).
  • Resource optimization: one reader occupies a single connection, keeping incremental tracking efficient and resource-friendly.
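The starting point of the incremental split can be pictured as the minimum of the watermarks reported by the snapshot splits, so no change between the snapshot and stream phases is skipped. A minimal sketch, with assumed names:

```java
import java.util.List;

// Sketch: the incremental split starts from the smallest watermark
// reported by the finished snapshot splits, so nothing is skipped.
public class IncrementalStart {
    static long startOffset(List<Long> snapshotHighWatermarks) {
        return snapshotHighWatermarks.stream().mapToLong(Long::longValue).min().orElse(0L);
    }

    public static void main(String[] args) {
        // Binlog offsets reported by three snapshot splits.
        System.out.println(startOffset(List.of(1500L, 1200L, 1800L))); // 1200
    }
}
```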

Apache SeaTunnel CDC is thus designed to synchronize both historical data (snapshot reads) and real-time changes (incremental tracking) efficiently. Through split management and resource optimization strategies, data synchronization stays efficient while the impact on the source database remains minimal.

Exactly-Once implementation of Apache SeaTunnel CDC


The core of Apache SeaTunnel CDC's exactly-once implementation is handling inconsistencies and system failures during data synchronization.


Exactly-Once implementation mechanism

Watermark management of snapshot reads

  • Low watermark and high watermark: when reading a snapshot split, the low watermark is recorded first, and the high watermark is recorded after the read completes. The interval between the two watermarks covers the changes the database made in the meantime.
  • In-memory table merge: changes between the low and high watermarks are merged into the in-memory table so that no change is missed (see the sketch after this list).
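A simplified sketch of the snapshot-read protocol described above: record the low watermark, copy the chunk, record the high watermark, then merge the binlog changes that happened in between. Everything here (names, stubs) is illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified snapshot read with low/high watermark reconciliation.
public class SnapshotWithWatermarks {
    record Change(long key, String value, boolean deleted) {}

    public static void main(String[] args) {
        long low = currentBinlogOffset();          // 1) record low watermark
        Map<Long, String> chunk = readChunkRows(); // 2) read the split's rows
        long high = currentBinlogOffset();         // 3) record high watermark

        // 4) Replay binlog changes in (low, high] into the in-memory table,
        //    so the emitted snapshot reflects every concurrent change.
        for (Change c : binlogBetween(low, high)) {
            if (c.deleted()) chunk.remove(c.key());
            else chunk.put(c.key(), c.value());
        }
        System.out.println(chunk); // prints {2=b2}: key 1 deleted, key 2 updated
    }

    // Stubs standing in for real database and binlog access.
    static long currentBinlogOffset() { return 0L; }
    static Map<Long, String> readChunkRows() { return new HashMap<>(Map.of(1L, "a", 2L, "b")); }
    static List<Change> binlogBetween(long lo, long hi) {
        return List.of(new Change(2L, "b2", false), new Change(1L, null, true));
    }
}
```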


Gap processing between splits

  • Handling data gaps: gaps between splits are processed to ensure no change is missed.
  • Reverse filtering and checking: each data point from the snapshot stage is checked to confirm it was not already covered by a previous split, avoiding data duplication (a sketch of the filter follows this list).
  • Staged reconciliation: reconciliation is split into two stages (Stage 1 and Stage 2), handling the gaps between splits and the gaps between tables respectively, so that all changes are captured.
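The reverse-filtering step can be sketched as a predicate applied during the incremental phase: a change is dropped when its key falls inside a finished split whose high watermark already covers the change's offset, because the snapshot merge has absorbed it. The names below are assumptions:

```java
import java.util.List;

// Sketch: drop incremental changes already folded into a snapshot split.
public class ReverseFilter {
    record FinishedSplit(long keyStart, long keyEnd, long highWatermark) {}

    static boolean shouldEmit(long key, long offset, List<FinishedSplit> finished) {
        for (FinishedSplit s : finished) {
            // Covered by a finished split whose snapshot already merged
            // changes up to its high watermark: emitting would duplicate.
            if (key >= s.keyStart() && key < s.keyEnd() && offset <= s.highWatermark()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<FinishedSplit> finished = List.of(new FinishedSplit(0, 4_000, 1_500));
        System.out.println(shouldEmit(100, 1_400, finished)); // false: duplicate
        System.out.println(shouldEmit(100, 1_600, finished)); // true: new change
    }
}
```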

Resumable synchronization and distributed snapshots

Distributed snapshot mechanism


  • Adaptation to different engines: the distributed snapshot API is adapted to different execution engines to keep state consistent.
  • Checkpoint saving: checkpoint operations are initiated periodically; every component uploads its own state, and the complete checkpoint state is saved.
  • Recovery selection: during recovery, any checkpoint version can be chosen as the restore point.


Distributed state alignment

  • Inter-process state synchronization: the different in-memory states across multiple processes must reach a consistent view at a single point in time.
  • Signal propagation and persistence: a distributed snapshot signal is initiated from one process; the other processes save their own state when the signal arrives and forward it, until the state of every node is aligned (see the toy sketch below).
  • Practical application: in a CDC task, the enumerator node, reading nodes, and writing nodes all participate in this process to keep the state of the whole pipeline consistent.
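Conceptually this is barrier alignment in the style of Chandy-Lamport snapshots (the mechanism Flink-like checkpointing uses): a node snapshots its state only once the checkpoint barrier has arrived on all of its input channels. A toy sketch with invented names:

```java
import java.util.HashSet;
import java.util.Set;

// Toy barrier alignment: a node saves its state only after the
// checkpoint barrier has arrived on every input channel.
public class BarrierAlignment {
    private final int inputChannels;
    private final Set<Integer> seen = new HashSet<>();

    BarrierAlignment(int inputChannels) { this.inputChannels = inputChannels; }

    // Returns true when the node is aligned and may snapshot its state,
    // then forward the barrier downstream.
    boolean onBarrier(int channel) {
        seen.add(channel);
        return seen.size() == inputChannels;
    }

    public static void main(String[] args) {
        BarrierAlignment node = new BarrierAlignment(2);
        System.out.println(node.onBarrier(0)); // false: still waiting
        System.out.println(node.onBarrier(1)); // true: snapshot + forward
    }
}
```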

An in-depth discussion of DDL synchronization


In Apache SeaTunnel CDC, DDL synchronization is a key challenge: the database structure may change while data is flowing, and these changes must be handled with care.

DDL parsing and abstraction

  • DDL event parsing: DDL events are first parsed and converted into a structured, abstract form. This decouples DDL processing from the syntax of any specific database.
  • Structured event processing: for example, an add-column operation is converted into a generic structured event and no longer depends on database-specific syntax (see the sketch after this list).
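One way to picture the abstraction: a dialect-specific ALTER TABLE statement is parsed into a generic event object that any sink can interpret, regardless of the source database's syntax. The types below are hypothetical, not SeaTunnel's actual event classes:

```java
// Hypothetical structured DDL events, decoupled from any SQL dialect.
sealed interface SchemaChangeEvent permits AddColumnEvent, DropColumnEvent {}

record AddColumnEvent(String tableId, String column, String type, boolean nullable)
        implements SchemaChangeEvent {}

record DropColumnEvent(String tableId, String column) implements SchemaChangeEvent {}

class DdlDemo {
    public static void main(String[] args) {
        // "ALTER TABLE shop.orders ADD COLUMN note VARCHAR(64)" becomes:
        SchemaChangeEvent e = new AddColumnEvent("shop.orders", "note", "STRING", true);
        System.out.println(e); // each sink translates this into its own dialect
    }
}
```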


Separation of data flow and structure flow

  • Signal insertion: before and after a DDL operation, the system inserts specific signals to separate the structure stream from the data stream. This allows data processing to be paused during the DDL operation, avoiding inconsistencies while the structure is changing.

Pre- and post-signal processing

  • Pre-signal: before the DDL operation, the in-memory data state is flushed and data processing is paused, ensuring data integrity before the structural change.
  • Post-signal: after the DDL operation completes, the system resumes data processing and continues subsequent synchronization (a sink-side sketch follows this list).
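A sink's handling of the two signals might look like the following sketch: flush buffered state and pause on the pre-signal, apply the translated DDL, then resume on the post-signal. All names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a sink reacting to DDL pre/post signals in the stream.
public class DdlAwareSink {
    private final List<String> buffer = new ArrayList<>();
    private boolean paused = false;

    void onPreSignal() {
        flush();          // drain in-memory state built on the old schema
        paused = true;    // stop consuming rows during the structure change
    }

    void onDdlApplied() { /* run the translated DDL on the target here */ }

    void onPostSignal() { paused = false; } // resume normal data flow

    void write(String row) {
        if (!paused) buffer.add(row);
    }

    void flush() {
        System.out.println("flushing " + buffer.size() + " rows");
        buffer.clear();
    }
}
```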


Detailed optimization of data transmission

In terms of data transmission, Apache SeaTunnel CDC ensures the efficiency and consistency of data synchronization through a series of optimizations.

Typed processing of data operations

  • Insert: handles new data; only the post-operation state is involved.
  • Update: involves the state both before and after the operation, which must be handled precisely to keep data consistent.
  • Delete: only the pre-operation state matters, because the data no longer exists after the operation (a generic row model is sketched after this list).
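These three operation types are commonly modeled as a row kind plus before/after images: an update carries both images, a delete only the before image. A generic sketch, not SeaTunnel's exact row model:

```java
// Operation kinds for a change record.
enum Kind { INSERT, UPDATE, DELETE }

// Generic change record: an operation kind plus before/after row images.
public record ChangeRecord(Kind kind, String before, String after) {
    public static void main(String[] args) {
        var insert = new ChangeRecord(Kind.INSERT, null, "{id:1,name:'a'}");
        var update = new ChangeRecord(Kind.UPDATE, "{id:1,name:'a'}", "{id:1,name:'b'}");
        var delete = new ChangeRecord(Kind.DELETE, "{id:1,name:'b'}", null);
        System.out.println(insert);
        System.out.println(update); // carries both images
        System.out.println(delete); // only the before image matters
    }
}
```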

Efficient data flow management

To improve efficiency, CDC applies several optimizations in data flow management:

  • Table-level data splitting: keeps data processing within one table ordered.
  • Key-level data ordering: operations on the same key are processed in order, guaranteeing the correctness of the data state.
  • Parallel data writing: data within the same table can be written in parallel, increasing processing speed (see the routing sketch below).
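Key-level ordering combined with table-level parallelism can be achieved by hashing the primary key to a writer: all operations on one key land on the same writer in order, while different keys proceed in parallel. A sketch with assumed names:

```java
// Sketch: route each change by primary key so one key is always
// handled by the same writer (ordered), while different keys spread
// across writers (parallel).
public class KeyRouter {
    static int writerFor(String primaryKey, int writerCount) {
        return Math.floorMod(primaryKey.hashCode(), writerCount);
    }

    public static void main(String[] args) {
        int writers = 4;
        for (String key : new String[]{"order-1", "order-2", "order-1"}) {
            System.out.printf("%s -> writer %d%n", key, writerFor(key, writers));
        }
        // "order-1" always maps to the same writer, preserving its order.
    }
}
```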

Update optimization


For target storage that does not support update operations, CDC applies an optimization strategy: an update is converted into a delete followed by an insert, bypassing the storage limitation.
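The rewrite is mechanical: an update with before/after images becomes a delete of the before image followed by an insert of the after image. A minimal sketch:

```java
import java.util.List;

// Sketch: rewrite UPDATE into DELETE(before) + INSERT(after) for
// targets that cannot update in place.
public class UpdateRewriter {
    record Op(String kind, String row) {}

    static List<Op> rewrite(String before, String after) {
        return List.of(new Op("DELETE", before), new Op("INSERT", after));
    }

    public static void main(String[] args) {
        rewrite("{id:1,name:'a'}", "{id:1,name:'b'}").forEach(System.out::println);
    }
}
```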

Shared mining and multi-target writing


To reduce the burden on source databases, CDC adopts a shared reading ("shared mining") mechanism: the data is read once and then shared among multiple write plug-ins, allowing it to be written to multiple target stores. This consolidates previously scattered read and write processes and improves overall efficiency.
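The fan-out can be pictured as a single reader pushing each captured change into one queue per target sink, so the source is read exactly once however many targets consume the stream. An illustrative sketch, not SeaTunnel's internal mechanism:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: one reader, many sinks; the binlog is pulled once and each
// change is replicated into a queue per target store.
public class SharedReaderFanOut {
    public static void main(String[] args) throws InterruptedException {
        List<BlockingQueue<String>> sinkQueues =
                List.of(new LinkedBlockingQueue<>(), new LinkedBlockingQueue<>());

        String change = "INSERT {id:1}"; // read once from the source
        for (BlockingQueue<String> q : sinkQueues) {
            q.put(change); // each sink plug-in consumes its own copy
        }
        System.out.println("delivered to " + sinkQueues.size() + " sinks");
    }
}
```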

Automatic table creation

Purpose

  • Automatic conversion: the table structure of the source database is automatically converted for the target database, which is useful when you are not familiar with the business database's schema or when there are very many tables.

Implementation process

file

  1. Table structure projection: all configured tables are converted into common data types and table structures.
  2. Interaction with the write plug-in: on startup, the plug-in receives the table structure, inspects it, and creates or updates the table in the target database.
  3. Type promotion: type mismatches between heterogeneous databases are handled by promoting narrow types to wider ones (see the sketch after this list).
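Type promotion can be sketched as a lookup that widens a source type until the target has a compatible one, e.g. TINYINT to INT or FLOAT to DOUBLE; the mapping below is an illustrative assumption, not SeaTunnel's actual type table:

```java
import java.util.Map;

// Sketch: promote narrow source types to wider target types when the
// target database lacks an exact match.
public class TypePromotion {
    static final Map<String, String> PROMOTIONS = Map.of(
            "TINYINT", "INT",
            "SMALLINT", "INT",
            "FLOAT", "DOUBLE",
            "CHAR", "STRING");

    static String targetType(String sourceType) {
        return PROMOTIONS.getOrDefault(sourceType, sourceType);
    }

    public static void main(String[] args) {
        System.out.println(targetType("TINYINT")); // INT
        System.out.println(targetType("BIGINT"));  // BIGINT (unchanged)
    }
}
```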

Community development and engagement


Current development

  • Multi-table reading and writing: advancing multi-table and multi-engine support.
  • API promotion: promoting APIs such as automatic table creation to the community and implementing them across the various plug-ins.
  • Connector upgrades: upgrading connectors to support the new multi-table read and write capabilities.
  • DDL parsing: developing DDL parsing that supports target-side table structure changes.

Web interface

  • Release and improvement: the web interface has been released and is being continuously improved, supporting data queries and synchronization task configuration across different databases.

Community Involvement

  • Join the community: get support through the official WeChat public account or by joining the Chinese user group.
  • Online resources: find resources and support through the project's issue tracker, Slack channel, or official website.
  • Contribute and communicate: download and try it out, report bugs, pick up newcomer tasks, or communicate through the mailing list and Slack.


This article is published by Beluga Open Source Technology!
