Flink CDC 2.4 is officially released: a quick look at what's new, including a Vitess data source, incremental snapshot support for more connectors, and an upgraded Debezium version

Source: https://ververica.github.io/flink-cdc-connectors/master/

01. Introduction to Flink CDC

Flink CDC [1] is a log-based change data capture (CDC) technology that implements a data integration framework for unified full and incremental reading. Combined with Flink's excellent pipeline capabilities and rich upstream and downstream ecosystem, Flink CDC can efficiently implement real-time integration of massive data.

For a more detailed introduction to Flink CDC, see the project documentation [2].

As a new-generation real-time data integration framework, Flink CDC offers technical advantages such as unified full and incremental reading, lock-free reading, parallel reading, automatic synchronization of table schema changes, and a distributed architecture. The community also provides comprehensive documentation in both Chinese and English [2].

02. Overview of Flink CDC 2.4

With the joint efforts of community users and developers, Flink CDC 2.4 was officially released after the Dragon Boat Festival holiday:

https://github.com/ververica/flink-cdc-connectors/releases/tag/release-2.4.0


A total of 32 community contributors participated in version 2.4: 141 issues were resolved, 86 PRs were merged, and 96 commits were contributed. Judging from the code distribution, the MySQL CDC, MongoDB CDC, and PostgreSQL CDC connectors, the incremental snapshot framework (flink-cdc-base) module, and the documentation module received the most features and improvements.

This article gives you a quick overview of the major improvements and core features of the Flink CDC 2.4 release.


  • Added a Vitess CDC connector that supports incremental data synchronization from Vitess.

  • The two major connectors PostgreSQL CDC and SQL Server CDC are now integrated with the incremental snapshot framework, providing lock-free reading, parallel reading, and resumption from checkpoints.

  • Version 2.4 upgrades the Debezium dependency to 1.9.7.Final, bringing in the new Debezium version's features, optimizations, and fixes, such as: fixing DDL statements that could not be parsed, fixing the parsing of MySQL JSON functions, and adding SCN information to Oracle events.

  • Version 2.4 of the incremental snapshot framework adds automatic closing of idle Readers after the full phase ends. This feature is very practical and saves resources in production environments.

  • The MySQL CDC connector supports reading tables without primary keys in version 2.4, and allows the existing real-time synchronization link to keep streaming while newly added tables are being captured.

  • Version 2.4 is compatible with five Flink versions, 1.13 through 1.17. The CDC SQL connectors can run on these different Flink clusters without any modification, achieving cross-version compatibility. DataStream jobs, however, need to bring in the flink-shaded-guava dependency matching their Flink version; DataStream users can refer to how the SQL connectors package their dependencies to manage this correctly.

  • MongoDB CDC supports consuming data from a specified timestamp, supports the mongodb+srv connection protocol, and fixes several problems, such as: failing to parse database names containing hyphens, the 'poll.await.time.ms' option not taking effect, and a null pointer when parsing DDL.

  • The OceanBase CDC connector supports setting JDBC parameters, supports specifying the Oracle driver, and improves support for Oracle data types.

03. Detailed explanation of core features and important improvements

3.1 In-depth interpretation

Flink CDC version 2.4 brings many important improvements and features. This article selects the five most important for further interpretation.


1. Add Vitess CDC connector

Vitess [3] is a database solution for deploying, scaling and managing large clusters of MySQL instances.

Vitess's VStream is a change event subscription service that provides the same information as the binary logs of the underlying MySQL shards in a Vitess cluster. A downstream consumer can subscribe to multiple shards of a keyspace, which makes it convenient to build CDC tooling on top of Vitess. The Vitess CDC connector uses VStream to obtain and forward data change messages. Currently it only supports synchronizing changes in the incremental phase, which is equivalent to supporting only the latest-offset startup mode.

There is a small story behind the Vitess CDC connector. It was developed by Simonas Gelazevicius from Vinted. In the spirit of upstream-first open source contribution, he had been asking the community to merge it since version 2.0. However, very few users in the community used this data source, and none of the Maintainers were familiar with its technical details, so for a long time it could not be merged into the community's main branch. After each Flink CDC release, Simonas Gelazevicius would proactively rebase the PR, and this persistence impressed all the Maintainers. Community Maintainers Ren Qingsheng and Fang Shengkai then actively studied the relevant Vitess technology and helped review and improve the PR. The connector was finally completed by contributors Simonas Gelazevicius, Gintarasm, Fang Shengkai, and Ren Qingsheng.

2. PostgreSQL CDC and SQL Server CDC connectors access incremental snapshot framework

In version 2.4, both the PostgreSQL CDC connector and the SQL Server CDC connector were integrated with the Flink CDC incremental snapshot framework and implement the incremental snapshot algorithm, thereby providing lock-free reading, parallel reading, and resumption from checkpoints.
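The lock-free, parallel full read relies on splitting each table into key-range chunks (SnapshotSplits) that independent readers can process. Below is a minimal, hypothetical Python sketch of range-based chunk splitting; it is not the actual Flink CDC implementation, which also handles non-numeric keys and uneven key distributions:

```python
from typing import List, Tuple

def split_into_chunks(min_key: int, max_key: int,
                      chunk_size: int) -> List[Tuple[int, int]]:
    """Split the key range [min_key, max_key] into half-open chunks of
    at most chunk_size keys, so readers can process them in parallel."""
    chunks = []
    lo = min_key
    while lo <= max_key:
        hi = min(lo + chunk_size, max_key + 1)
        chunks.append((lo, hi))  # covers rows with lo <= key < hi
        lo = hi
    return chunks
```

Each chunk can then be snapshotted by a separate reader, with a backfill step replaying the log changes that occurred while the chunk was being read.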


3. How the PostgreSQL CDC connector supports incremental snapshots

The PostgreSQL CDC connector uses logical decoding [4] to read change data from the PostgreSQL transaction log. This requires starting a Replication Slot that is unique within the cluster and processing the changes with an output plugin [5]; by recording the WAL position that has been read, the connector implements switching into the incremental phase and failure recovery.

In addition to reading change data in the incremental phase, the incremental snapshot framework must start a Backfill Task for each SnapshotSplit in the full phase to catch up on changes that occur during the snapshot. To avoid Replication Slot conflicts, the PostgreSQL CDC connector creates slots as follows. The 'slot.name' option is required and must be specified by the user; the slot named here is used for the incremental phase, is created when the job starts, and is not deleted when the job stops, ensuring that the incremental phase reads complete change data after startup and that the job can restart from a checkpoint. The slots used by the Backfill Tasks in the full phase are named in the style "slotname_subTaskId"; to avoid conflicts and wasted slot resources, these slots are dropped once the full read finishes.
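The slot-naming convention described above can be summarized in a tiny, hypothetical sketch (the function names are illustrative, not Flink CDC APIs):

```python
def incremental_slot_name(user_slot_name: str) -> str:
    # The user-configured 'slot.name' is used directly for the long-lived
    # incremental-phase slot; it is created at job start and kept after
    # the job stops, so the job can restart from a checkpoint.
    return user_slot_name

def backfill_slot_name(user_slot_name: str, sub_task_id: int) -> str:
    # Each full-phase Backfill Task gets its own temporary slot, named
    # "slotname_subTaskId"; these are dropped when the full read finishes.
    return f"{user_slot_name}_{sub_task_id}"
```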

4. How the SQL Server CDC connector supports incremental snapshots

The SQL Server CDC connector reads change data for the specified database and tables through SQL Server's change data capture feature [6], which stores row-level changes in dedicated change tables; CDC must be enabled on the target database and tables for these changes to be captured. By recording the LSN (Log Sequence Number) of the database log, the connector implements switching into the incremental phase and failure recovery.
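As a conceptual sketch (not Flink CDC's actual code), recovery via a recorded log position works like this: the reader checkpoints the highest position it has delivered (the LSN here, or the WAL position for PostgreSQL), and after a failure resumes strictly after that position instead of from the beginning of the log:

```python
class LogReader:
    """Toy model of position-based failure recovery."""

    def __init__(self, log):
        self.log = log              # list of (lsn, change) pairs, ordered by lsn
        self.checkpointed_lsn = -1  # last position persisted in a checkpoint

    def read_from_checkpoint(self):
        # Deliver only changes strictly after the checkpointed position,
        # then advance the checkpoint to the end of what was delivered.
        delivered = [c for lsn, c in self.log if lsn > self.checkpointed_lsn]
        if self.log:
            self.checkpointed_lsn = self.log[-1][0]
        return delivered
```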

The set of data sources for which Flink CDC supports the incremental snapshot algorithm thus continues to expand; in the next version, the community plans to connect even more connectors to the incremental snapshot framework.

5. The incremental snapshot framework supports automatic release of resources

The incremental snapshot framework of Flink CDC has two main phases, the full phase and the incremental phase, which differ in parallelism. The full phase supports multiple parallel readers, speeding up synchronization of large volumes of data; the incremental phase reads the change log with a single reader to guarantee event ordering and correctness. After the full phase finishes, only one reader is needed in the incremental phase, leaving many idle Readers and wasting resources. In version 2.4, connectors built on the incremental snapshot framework can be configured to automatically close these idle Readers. Because this relies on the "checkpoints on finished tasks" feature introduced in Flink 1.14, it is only available on Flink 1.14 and newer.
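As a hedged illustration, enabling this in a Flink SQL MySQL CDC source might look like the following; the table, columns, and connection details are placeholders, and the option name 'scan.incremental.close-idle-reader.enabled' should be verified against the connector documentation for your version:

```sql
-- Illustrative Flink SQL DDL; all names and connection details are placeholders.
CREATE TABLE orders (
  id BIGINT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flink',
  'password' = '***',
  'database-name' = 'mydb',
  'table-name' = 'orders',
  -- close idle readers after the full phase finishes (requires Flink >= 1.14)
  'scan.incremental.close-idle-reader.enabled' = 'true'
);
```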

6. MySQL CDC connector function update

MySQL CDC is the community's most popular connector, and version 2.4 introduces some advanced features for it, including:

  1. Support for tables without primary keys

    MySQL CDC connector version 2.4 supports tables without primary keys. Compared with MySQL tables that have primary keys, such tables require some extra care. You must designate a column as the chunk key via the 'scan.incremental.snapshot.chunk.key-column' option; the connector uses it to split the table into multiple chunks for synchronization. It is recommended to choose an indexed column as the chunk key, because using a non-indexed column causes table locks during parallel synchronization in the full phase. In addition, the chosen chunk key column must not be subject to update operations (for example, a value changing from 1 to 2); if it is updated, only at-least-once semantics can be guaranteed.

  2. Uninterrupted streaming when adding new tables

    Previously, when MySQL CDC handled a newly added table, the existing real-time synchronization link was interrupted and had to wait for the full read of the new table to finish before continuing, which seriously affected latency-sensitive users. For example, if a newly added table has a lot of historical data and its full synchronization takes 30 minutes, tables already in the incremental phase would have to wait 30 minutes before their incremental data continued to flow. Version 2.4 further optimizes the handling of newly added tables so that their full phase does not affect the existing real-time synchronization link, greatly improving the user experience.

  3. Bug fixes

    In version 2.4, the MySQL CDC connector fixed usage problems reported by community users, such as being unable to start from a savepoint when a Binlog offset is specified, failures when handling special characters in database names, and chunk-splitting errors caused by case sensitivity.
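To illustrate items 1 and 2 above, here is a hedged Flink SQL sketch of a MySQL CDC source on a table without a primary key; the table, columns, and connection details are placeholders, and the option names ('scan.incremental.snapshot.chunk.key-column', 'scan.newly-added-table.enabled') should be checked against the connector documentation for your version:

```sql
-- Illustrative DDL only; all names and connection details are placeholders.
CREATE TABLE user_events (
  user_id BIGINT,
  event_time TIMESTAMP(3),
  payload STRING
  -- note: no PRIMARY KEY declared
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flink',
  'password' = '***',
  'database-name' = 'mydb',
  'table-name' = 'user_events',
  -- required for tables without a primary key: an (ideally indexed,
  -- never-updated) column used to split the table into chunks
  'scan.incremental.snapshot.chunk.key-column' = 'user_id',
  -- let newly added tables do their full read without interrupting
  -- the existing incremental stream
  'scan.newly-added-table.enabled' = 'true'
);
```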

3.2 Other improvements

The Debezium dependency was upgraded to version 1.9.7.Final, introducing the features and fixes of the corresponding Debezium version.

Flink CDC version 2.4 is compatible with five major Flink versions, 1.13 through 1.17, greatly reducing users' connector upgrade and maintenance costs.

The OceanBase CDC connector supports setting JDBC parameters, supports specifying the driver, improves support for Oracle data types, and fixes problems such as reconnection always failing after an exception.

MongoDB CDC supports consuming data from a specified timestamp, supports the mongodb+srv connection protocol, and fixes problems such as failing to parse database names containing hyphens, the 'poll.await.time.ms' option not taking effect, and a null pointer when parsing DDL.
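A hedged sketch of starting the MongoDB CDC source from a timestamp in Flink SQL; the option names 'scan.startup.mode' and 'scan.startup.timestamp-millis', along with all names and connection details, are assumptions to be verified against the connector documentation:

```sql
-- Illustrative Flink SQL DDL; all names and connection details are placeholders.
CREATE TABLE products (
  _id STRING,
  name STRING,
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector' = 'mongodb-cdc',
  'hosts' = 'localhost:27017',
  'username' = 'flink',
  'password' = '***',
  'database' = 'mydb',
  'collection' = 'products',
  -- start consuming change streams from a specific epoch-millisecond timestamp
  'scan.startup.mode' = 'timestamp',
  'scan.startup.timestamp-millis' = '1688140800000'
);
```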

Oracle CDC fixed a data correctness problem in the full phase.

All CDC connectors support printing their configuration, making troubleshooting easier.

04. Future planning

The growth of the Flink CDC open source community is driven by the selfless contributions of all contributors and the excellent community work of the Maintainers, and is inseparable from the active use of and feedback from the broad Flink CDC user base. The community will continue to invest in open source community building. It is currently planning version 2.5 [7], and feedback from contributors and users is welcome. In the next version, the community will focus on the following five areas:

  • Richer data sources

    Support more data sources, and promote the adoption of the incremental snapshot framework across the CDC connectors, so that more data sources gain features such as lock-free reading, parallel reading, and resumption from checkpoints.

  • Optimize the incremental snapshot framework

    Address problems encountered when integrating connectors with the incremental snapshot framework, and extract and organize the code within the framework that the CDC connectors can reuse.

  • Improved rate limiting and monitoring

    Provide a rate-limiting feature to reduce query pressure on the database during the full phase, and provide richer monitoring metrics, including task-progress metrics for observing job status.

  • More usage modes

    Support at-least-once semantics, a snapshot-only startup mode, and more, to cover more application scenarios.

  • Converging the supported Flink versions

    As the number of Flink versions grows, the maintenance burden of keeping CDC compatible with multiple Flink versions also grows. Following the current Flink connector rules [8], subsequent versions of the CDC connectors will consider supporting only the latest 3-4 Flink versions.

[1] https://github.com/ververica/flink-cdc-connectors

[2] https://ververica.github.io/flink-cdc-connectors

[3] https://vitess.io/

[4] https://www.postgresql.org/docs/current/logicaldecoding-explanation.html

[5] https://www.postgresql.org/docs/current/logicaldecoding-output-plugin.html

[6] https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-2017

[7] https://github.com/ververica/flink-cdc-connectors/issues/2239

[8] https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development#ExternalizedConnectordevelopment-Flinkcompatibility

Origin blog.csdn.net/qq_32727095/article/details/131432500