Flink from Entry to Abandonment (9) - A Long-Form Explanation of CDC Design (Part 1)

1. Preparation

Before starting to study how Flink CDC works (this article covers the CDC 1.0 version; the features of 2.0 will be introduced in a later article, and the environment used here is Flink 1.12), a few preparations are needed:

  1. Open the Flink official website (see the introduction of the Connector module).

  2. Open GitHub and download the source code of the following projects (links are not included here; readers can search for them on GitHub):

apache-flink

flink-cdc-connectors

debezium

  3. Then start digging in.

2. Design Proposal

2.1. Design motivation

CDC (Change Data Capture) is a widely used pattern in enterprises, mainly for data synchronization, search indexing, cache updating and similar scenarios. Early on, the community wanted Flink to be able to ingest and interpret change logs directly in the Table API and SQL, in order to broaden Flink's usage scenarios.

On the other hand, the concept of the "dynamic table" had been proposed earlier, defining two modes for streams: append mode and update mode. Flink already supported converting an append stream into a dynamic table, but not an update stream, so changelog interpretation was the missing piece of the puzzle needed to complete the dynamic table concept. Today not only update mode but also retract mode is supported, which will be explained separately later.

2.2. CDC tool selection

Common CDC schemes are compared as follows:

(figure: comparison of common CDC solutions)

1. Comprehensive functionality

There are currently many CDC tools to choose from, such as Debezium and Canal. Debezium currently supports MySQL, PostgreSQL, SQL Server, Oracle, Cassandra and MongoDB. If Flink supports Debezium, it means Flink can consume the changelogs of all the databases shown below, which benefits the whole ecosystem. Debezium also supports full (snapshot) + incremental synchronization, which is very flexible and makes exactly-once semantics possible.

The database types supported by Debezium are as follows:

(figure: database types supported by Debezium)

2. Embrace the community and facilitate expansion

Choosing Debezium as Flink's embedded engine means it can be embedded into the code base as a dependency instead of running through Kafka Connect. Flink then does not need to talk to the MySQL server directly, nor deal with complex snapshots, GTIDs, locks and so on, because Debezium takes care of that, while Flink embraces and collaborates with the Debezium community.

3. Similar internal data structures

Flink SQL's internal data structure RowData carries a piece of metadata called RowKind, which has four values: insert (INSERT), update (UPDATE_BEFORE before the update and UPDATE_AFTER after the update) and delete (DELETE). These four types correspond almost one-to-one with the structure of the binlog.

(figure: correspondence between RowKind and binlog change types)

Here is the meaning of each field in Debezium's change event structure:

· before: an optional field that represents the state of the row before the event occurred. For a create operation its value is null.

· after: an optional field that represents the state of the row after the event occurred. For a delete operation its value is null.

· source: a mandatory field containing event metadata such as the offset, binlog file, database and table.

· ts_ms: the timestamp at which Debezium processed the event.

· op: this field has four values: c (create), u (update), d (delete) and r (read). For a u operation the payload contains both before and after.

(figure: example of a Debezium change event)

3. Concepts

Quite a few concepts are involved here; get a first impression of them now, and they will be broken down and explained separately later.

3.1. Stream

The concept of a stream is easy to understand. A stream has two characteristics: boundedness and change mode. Let's look at these two characteristics separately:

  1. Change mode:

(figure: the two dynamic table modes: append and replace)

As shown in the figure above, there are two dynamic table modes: append mode and replace mode (upsert and retract). The difference between the two is briefly introduced next.

Append mode is easy to understand: if no primary key is defined on the table, every stream record is appended to the table as a new row, and once a record has been added it is never updated or deleted.

In replace mode, a primary key is defined on the table: if no record with the same key exists, the record is inserted, otherwise the existing record is replaced. Replace mode is further subdivided into upsert mode and retract mode:

Upsert mode uses two kinds of messages, UPSERT (insert and update) and DELETE. The main difference from retract mode is that an UPDATE change is encoded as a single message, which is more efficient.

Retract mode uses two kinds of messages, ADD and RETRACT. An INSERT change is encoded as an ADD message, a DELETE change as a RETRACT message, and an UPDATE change as a RETRACT message for the old row (update-before) plus an ADD message for the new row (update-after). Because an update has to be split into two messages, this mode is less efficient. Here is a brief illustration of a retract stream, shown in the table below (a sample query producing such a stream is sketched after the table):

| Operation | User | Count(url) | Note |
| --- | --- | --- | --- |
| I (Insert) | Mary | 1 | |
| I (Insert) | Bob | 1 | |
| -U (Update-Before) | Mary | 0 | retracts the previous record |
| +U (Update-After) | Mary | 2 | |
| I (Insert) | Liz | 1 | |
| -U (Update-Before) | Bob | 0 | retracts the previous record |
| +U (Update-After) | Bob | 2 | |
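As a concrete illustration (a minimal sketch with hypothetical table, topic and column names), a per-user click count in Flink SQL produces exactly this kind of update stream: every new click either inserts a first count of 1 or retracts the old count and emits the new one.

-- Hypothetical source table of user clicks (an append-only stream).
CREATE TABLE clicks (
  `user` STRING,
  url    STRING,
  ts     TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',          -- assumed Kafka source; any append source works
  'topic' = 'clicks',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- The aggregation result is an updating table: each incoming click for an
-- existing user is encoded as -U (old count) followed by +U (new count).
SELECT `user`, COUNT(url) AS cnt
FROM clicks
GROUP BY `user`;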

  2. Boundedness

Bounded stream: consists of a bounded set of events; a query processes the data currently available and then terminates.

Unbounded stream: consists of an infinite set of events; if the input is an unbounded stream, the query runs continuously, processing data as it arrives.

3. Summary:

| Boundedness \ Change Mode | Append | Update |
| --- | --- | --- |
| Unbounded | Append unbounded stream, e.g. Kafka logs | Update unbounded stream, e.g. continuously capture changes of a MySQL table |
| Bounded | Append bounded stream, e.g. a Parquet file in HDFS, a MySQL table | Update bounded stream, e.g. capture changes of a MySQL table until a point in time |

3.2. Dynamic Table

A dynamic table is a table that changes over time and can be queried like a traditional, static table. A dynamic table can be converted into a stream, and a stream can be converted into a dynamic table (they need to share the same schema, and the conversion method depends on whether the table schema defines a primary key). Note that all tables created in Flink SQL are dynamic tables. A query on a dynamic table produces a new dynamic table (which is updated according to its input), and whether the query terminates depends on the boundedness of the input.

A dynamic table is a conceptual object; a stream is its physical representation.
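As a sketch of this table/stream duality (names are hypothetical and reuse the clicks table from the earlier sketch; upsert-kafka is used purely as an illustration of a keyed conversion, and it is available since Flink 1.12), writing an updating dynamic table to a sink whose schema declares a primary key turns it into an upsert stream, while an append-only table maps directly onto an append stream:

-- A keyed sink table: the declared primary key determines how the dynamic
-- table is converted into a stream (upsert encoding instead of append).
CREATE TABLE user_counts (
  `user` STRING,
  cnt    BIGINT,
  PRIMARY KEY (`user`) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',                 -- assumed sink
  'topic' = 'user_counts',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);

-- The updating result of the aggregation is continuously materialized into
-- the keyed table; each change becomes an upsert or delete message.
INSERT INTO user_counts
SELECT `user`, COUNT(url) AS cnt
FROM clicks
GROUP BY `user`;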

3.3. Changelog

A changelog is an append stream whose rows contain a change-operation column (an insert/delete flag, possibly more in the future) together with the actual data columns. The goal of Flink CDC is to extract the events of such a change log and interpret them as change operations (insert, update and delete events).

Converting an append stream into an update stream is called interpreting a changelog (Interpret Changelog).

Converting an update stream into an append stream is called emitting a changelog (Emit Changelog).

In Flink SQL, data flows from one operator to another as a changelog stream. A changelog stream can, at any point in time, be translated into a table or into a stream, as shown in the following figure:

(figure: a changelog stream can be translated into a table or a stream at any time)

The following figure shows the full set of conversions between stream types and tables:

(figure: conversions between stream types and tables)

Different CDC tools may encode the changelog differently, which is a big challenge for Flink. Here are two popular solutions, Debezium and Canal, as examples:

  1. Take Debezium as an example. Debezium is a CDC tool built on top of Kafka Connect that can stream real-time changes into Kafka, and it produces change logs in a unified format. Take an update operation as an example:
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "[email protected]"
  }, // "before" is optional; for a create operation this field is null
  "after": {
    "id": 1004,
    "first_name": "Anne Marie",
    "last_name": "Kretchmar",
    "email": "[email protected]"
  }, // "after" is optional; for a delete operation this field is null
  "source": { ... }, // mandatory field with event metadata such as offset, binlog file, database, table, etc.
  "op": "u",   // mandatory field describing the operation type: c (create), u (update), d (delete)
  "ts_ms": 1465581029523
}

By default, Debezium emits two events for a delete operation: a DELETE event and a tombstone event (with a null value/payload); the tombstone event exists for Kafka's log compaction mechanism. Note that Debezium is not a storage system but rather a storage format based on JSON; the deserialized result rows can be converted into ChangeRow or Tuple2<Boolean, Row>.

  2. Canal is a CDC tool that is popular in China. It captures changes from MySQL and ships them to other systems, supporting both JSON and protobuf formats for streaming changes into Kafka and RocketMQ. Here is an example of an update operation:
{
  "data": [  
    { // the actual data: for an update this is the state after the update, for a delete it is the state before the delete
      "id": "13",
      "username": "13",
      "password": "6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9",
      "name": "Canal Manager V2"
    }
  ],
  "old": [ //可选字段,如果不是update操作,那么该字段为null
    {
      "id": "13",
      "username": "13",
      "password": "6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9",
      "name": "Canal Manager"
    }
  ],
  "database": "canal_manager",
  "es": 1568972368000,
  "id": 11,
  "isDdl": false,
  "mysqlType": {...},
  "pkNames": [
    "id"
  ],
  "sql": "",
  "sqlType": {...},
  "table": "canal_user",
  "ts": 1568972369005,
  "type": "UPDATE"
}

Flink supports both of these mainstream CDC encodings; they can be selected with format = 'canal-json' or format = 'debezium-json'.
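A minimal sketch (the topic name, cluster address and columns are hypothetical, mirroring the Debezium example above) of consuming a Debezium changelog from Kafka as an updating table:

-- Interprets Debezium-encoded change events from Kafka as an updating table;
-- Flink maps the before/after/op fields onto RowKind (+I, -U, +U, -D).
CREATE TABLE customers (
  id         INT,
  first_name STRING,
  last_name  STRING,
  email      STRING
) WITH (
  'connector' = 'kafka',                         -- assumed topic and cluster address
  'topic' = 'dbserver1.inventory.customers',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);

-- Queries on this table see inserts, updates and deletes from the source database.
SELECT COUNT(*) FROM customers;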


4. Tracing the Source Code

As mentioned in the section above, Flink chose Debezium as its embedded engine to implement CDC. The connectors currently supported by Flink CDC are listed below (a minimal mysql-cdc table definition is sketched after the version tables):

Note: the MongoDB connector requires Flink CDC 2.0.

| Database | Version |
| --- | --- |
| MySQL | Database: 5.7, 8.0.x; JDBC Driver: 8.0.16 |
| PostgreSQL | Database: 9.6, 10, 11, 12; JDBC Driver: 42.2.12 |
| MongoDB | Database: 4.0, 4.2, 5.0; MongoDB Driver: 4.3.1 |

The Flink versions corresponding to the Flink CDC Connectors are as follows:

| Flink CDC Connector Version | Flink Version |
| --- | --- |
| 1.0.0 | 1.11.* |
| 1.1.0 | 1.11.* |
| 1.2.0 | 1.12.* |
| 1.3.0 | 1.12.* |
| 1.4.0 | 1.13.* |
| 2.0.0 | 1.13.* |
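As a quick sketch of how the connector is used (database name, table name, columns and credentials below are placeholders), a MySQL table can be declared directly as a CDC source in Flink SQL:

-- Declares a MySQL table as a CDC source; Debezium runs embedded inside the
-- source and performs the initial snapshot followed by binlog reading.
CREATE TABLE orders (
  order_id    INT,
  customer_id INT,
  amount      DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',     -- provided by flink-cdc-connectors
  'hostname' = 'localhost',      -- placeholder connection settings
  'port' = '3306',
  'username' = 'user',
  'password' = 'password',
  'database-name' = 'mydb',
  'table-name' = 'orders'
);

-- The query keeps running and reflects inserts, updates and deletes in MySQL.
SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id;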

4.1. Debezium-MySQL

Before looking at how Flink integrates with Debezium and the details of their interaction, let's take a brief look at how Debezium itself captures change events. The figure below shows the role Debezium (version 1.2 is used as the example) plays in the overall CDC pipeline:

(figure: the role of Debezium in the overall CDC pipeline)

Taking MySQL as an example, the following preparations are needed before using Debezium:

4.1.0. Preparations

1. Grant the required privileges to the MySQL account

 GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'user' IDENTIFIED BY 'password';

2. Enable the MySQL binlog

-- 1. Check whether the binlog is already enabled
SELECT variable_value as "BINARY LOGGING STATUS (log-bin) ::"
FROM information_schema.global_variables WHERE variable_name='log_bin';
 
-- 2. Configure the server configuration file to enable the binlog
server-id         = 223344
log_bin           = mysql-bin
binlog_format     = ROW
binlog_row_image  = FULL
expire_logs_days  = 10

3. Enable GTID

Global Transaction Identifiers (GTIDs) uniquely identify transactions that occur on servers within a cluster. Although the Debezium MySQL connector does not require them, using GTIDs simplifies replication and makes it easier to confirm whether master and replica are consistent. GTIDs are only available in MySQL 5.6 and later.

-- 1. Enable gtid_mode
gtid_mode=ON
 
-- 2. Enable enforce_gtid_consistency
enforce_gtid_consistency=ON
 
-- 3. Check that the settings have taken effect
show global variables like '%GTID%';

4.1.1. Performing a Database Snapshot (full read)

When the MySQL connector starts for the first time, it performs an initial consistent snapshot of the database. The main reason is that MySQL is usually configured to purge binlogs after a certain period, so to guarantee exactly-once semantics a snapshot is taken first. The default snapshot mode is initial and can be adjusted with the snapshot.mode parameter. The snapshot proceeds as follows:

  1. First acquire a global read lock to block writes from other clients (if Debezium detects that global locks are not allowed, it falls back to table-level locks). Note that the snapshot itself cannot prevent other clients from executing DDL, which might interfere with the connector's attempt to read the binlog position and table schemas. The global read lock is held while the binlog position is read and released in a later step.

  2. Start a REPEATABLE READ transaction to ensure that all subsequent reads are performed against a consistent snapshot.

  3. Read the current binlog position.

  4. Read the schemas of the databases and tables allowed by the connector configuration.

  5. Release the global read lock (or table-level locks); other clients can write again from this point.

  6. Write the DDL change statements to the corresponding topic (all DDL statements are saved for consistency: when the connector restarts after a crash or a graceful stop, it can read all DDL statements from this topic and rebuild the table structure as of a specific point in time, up to the binlog position before the crash, preventing exceptions caused by schema inconsistencies).

  7. Scan the database tables and generate an event for each row on the topic of the corresponding table.

  8. Commit the transaction and record the completed snapshot in the connector offsets.

If the connector fails while taking a snapshot, it creates a new snapshot when it restarts. Once the snapshot is complete, it starts reading from the recorded binlog position, so no change events are lost. If the connector is stopped for too long and the MySQL server has purged the older binlog files in the meantime, the connector's last position may no longer exist; in that case the connector performs another initial snapshot when it restarts. (The commands sketched after the figure below can be used to check whether the recorded binlog files are still available.)

(figure: Debezium MySQL snapshot flow)
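Whether a recorded position can still be resumed can be checked with standard MySQL commands (shown here only as an illustration):

-- List the binlog files the server still retains; if the file recorded in the
-- connector offset is no longer in this list, a new snapshot is required.
SHOW BINARY LOGS;

-- Show the binlog file and position the server is currently writing to.
SHOW MASTER STATUS;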

4.1.2. Incremental reading

  1. A binlog client is initialized, and a thread named binlog-client is started.

  2. The client registers an event listener whose handleEvent method handles logic such as updating offsets, forwarding events and heartbeat notifications. For example, when mysqld rotates to a new binlog file, when flush logs is executed, or when the current binlog file grows beyond max_binlog_size, the binlog position is reset (see the sketch after this list for how rotation can be observed).

  3. A series of deserializers is configured to parse the different event types, such as delete, update and insert events; the corresponding handlers, such as handleDelete, handleUpdate and handleInsert, are then invoked.

  4. When an event is detected, the specific handler is called according to the event type. That is not the focus here; this article mainly explains how Flink receives this data and converts it into the data structures it supports.
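The rotation behaviour mentioned in step 2 can be observed with ordinary MySQL statements (illustration only):

-- The size threshold after which mysqld rotates to a new binlog file.
SHOW VARIABLES LIKE 'max_binlog_size';

-- Force a rotation manually; the binlog reader sees a ROTATE event and
-- resets its recorded position to the new file.
FLUSH LOGS;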

4.2. Flink-cdc-mysql

A separate article will cover how Flink calls the Debezium engine. For now, here is a call-relationship diagram that readers can study on their own:

(figure: call relationship between Flink CDC and the Debezium engine)

4.3. Debezium-MySQL Properties

The properties below come with Debezium and can still be reused in Flink CDC; you only need to add the prefix "debezium." before the property name for it to take effect.
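For example (a sketch; the connection settings are placeholders), Debezium options such as snapshot.mode can be passed through the Flink CDC table definition with the debezium. prefix:

CREATE TABLE orders_no_snapshot (
  order_id INT,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'user',
  'password' = 'password',
  'database-name' = 'mydb',
  'table-name' = 'orders',
  -- Debezium properties are forwarded by adding the "debezium." prefix:
  'debezium.snapshot.mode' = 'never',          -- skip the initial snapshot
  'debezium.snapshot.locking.mode' = 'none'    -- do not take table locks
);

The full list of Debezium MySQL connector properties follows: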

| Property | Default | Description |
| --- | --- | --- |
| name | | The unique name of the connector; registering a second connector with the same name fails with an error. |
| connector.class | | The connector implementation class to load; for the MySQL connector use io.debezium.connector.mysql.MySqlConnector. |
| tasks.max | 1 | The maximum number of tasks the connector creates; the MySQL connector only uses a single task. |
| database.hostname | | Address of the MySQL database. |
| database.port | 3306 | Port of the MySQL database. |
| database.user | | User name used to authenticate against the MySQL database. |
| database.password | | Password used to authenticate against the MySQL database. |
| database.server.name | | Logical name of the MySQL server. |
| database.server.id | random | ID of the MySQL server/client. |
| database.history.kafka.topic | | Kafka topic that stores the history of the database table schemas. |
| database.history.kafka.bootstrap.servers | | Address of the Kafka cluster that stores the schema history of the database tables. |
| database.whitelist | empty string | An optional comma-separated list of regular expressions matching the names of databases to monitor; any database whose name is not in the whitelist is excluded from monitoring. By default all databases are monitored. Cannot be used together with database.blacklist. |
| database.blacklist | empty string | An optional comma-separated list of regular expressions matching database names to exclude from monitoring; any database whose name is not in the blacklist is monitored. Cannot be used together with database.whitelist. |
| table.whitelist | empty string | An optional comma-separated list of regular expressions matching fully qualified table identifiers (databaseName.tableName) of tables to monitor; any table not in the whitelist is excluded from monitoring. By default the connector monitors every non-system table in each monitored database. Cannot be used together with table.blacklist. |
| table.blacklist | empty string | An optional comma-separated list of regular expressions matching fully qualified table identifiers (databaseName.tableName) of tables to exclude from monitoring; any table not in the blacklist is monitored. Cannot be used together with table.whitelist. |
| column.blacklist | empty string | An optional comma-separated list of regular expressions matching the fully qualified names of columns that should be excluded from change event values. Fully qualified column names take the form databaseName.tableName.columnName or databaseName.schemaName.tableName.columnName. |
| column.truncate.to.length.chars | n/a | An optional comma-separated list of regular expressions matching the fully qualified names of character-based columns whose values should be truncated in change event messages when they exceed the specified number of characters. Multiple properties with different lengths can be used in one configuration, but the length must be a positive integer in each. Fully qualified column names take the form databaseName.tableName.columnName. |
| column.mask.with.length.chars | n/a | When a column value exceeds the specified length, the excess is replaced with * characters. |
| column.mask.hash.hashAlgorithm.with.salt.salt | n/a | Applies a salted hash to the column values. |
| time.precision.mode | adaptive_time_microseconds | Time, date and timestamp values can be represented with different precisions: adaptive_time_microseconds (default) captures date, datetime and timestamp values exactly, using millisecond, microsecond or nanosecond precision depending on the column type, except for TIME columns, which are always captured as microseconds; adaptive (deprecated) captures the values exactly as in the database using millisecond, microsecond or nanosecond precision depending on the column type; connect always represents time and timestamp values using Kafka Connect's built-in Time, Date and Timestamp representations, which use millisecond precision regardless of the column's precision. |
| decimal.handling.mode | precise | Specifies how the connector should handle DECIMAL and NUMERIC column values: precise (default) represents them exactly using java.math.BigDecimal values, encoded in binary form in change events; double represents them as double values, which may lose precision but is easier to use; string encodes the values as formatted strings, which is easy to consume but loses the semantic information about the real type. |
| bigint.unsigned.handling.mode | long | Specifies how BIGINT UNSIGNED columns are represented in change events: precise uses java.math.BigDecimal, encoded in change events in binary form with Kafka Connect's org.apache.kafka.connect.data.Decimal type; long (default) uses Java's long, which may not preserve precision but is much easier to use in consumers. The precise setting should only be used for values larger than 2^63, which cannot be expressed as a long. |
| include.schema.changes | true | Specifies whether database schema change events should be published to the topic. |
| include.query | false | Specifies whether the connector should include the original SQL query that produced the change event. Note: this option requires MySQL to be configured with binlog_rows_query_log_events set to ON; the query will not appear in events produced during the snapshot. Enabling this option may expose tables or columns that are explicitly blacklisted or masked, because the original SQL statement is included in the change event; for this reason the option defaults to "false". |
| event.processing.failure.handling.mode | fail | Specifies how the connector reacts to exceptions while deserializing binlog events: fail propagates the exception and stops the connector; warn logs the problematic event and its binlog offset and then skips it; skip simply skips the problematic event. |
| inconsistent.schema.handling.mode | fail | Specifies how the connector should react to binlog events that refer to tables missing from its internal schema representation (i.e. the internal representation is inconsistent with the database): fail throws an exception (indicating the problematic event and its binlog offset) and stops the connector; warn skips the problematic event and logs it together with its binlog offset; skip skips the problematic event. |
| max.queue.size | 8192 | Specifies the maximum size of the blocking queue into which change events read from the database log are placed before they are written to Kafka. This queue can provide back-pressure to the binlog reader, for example when writing to Kafka is slow or Kafka is unavailable. Events in the queue are not included in the offsets the connector records periodically. Defaults to 8192 and should always be larger than the maximum batch size specified in max.batch.size. |
| max.batch.size | 2048 | Specifies the maximum size of each batch of events processed in each iteration of the connector. Defaults to 2048. |
| poll.interval.ms | 1000 | Specifies how many milliseconds the connector waits for new change events to appear in each iteration. Defaults to 1000 ms (1 second). |
| connect.timeout.ms | 30000 | Specifies the maximum time in milliseconds the connector waits after attempting to connect to the MySQL server before timing out. Defaults to 30 seconds. |
| tombstones.on.delete | true | Controls whether a tombstone event should be generated after a delete event. When true, a delete operation is represented by a delete event followed by a tombstone event; when false, only the delete event is emitted. Emitting tombstone events (the default) allows Kafka to completely remove all events for a given key once the source record has been deleted. |
| message.key.columns | empty string | A semicolon-separated list of regular expressions matching fully qualified tables and columns to map to a custom message key. Each entry must match <fully-qualified table>:<comma-separated list of columns> representing the custom key. Fully qualified tables take the form databaseName.tableName. |
| binary.handling.mode | bytes | Specifies how binary columns (blob, binary, varbinary, etc.) are represented in change events: bytes represents binary data as a byte array (default), base64 as a base64-encoded string, hex as a hex-encoded (base16) string. |
| connect.keep.alive | true | Specifies whether a separate thread should be used to keep the connection to the MySQL server/cluster alive. |
| table.ignore.builtin | true | Specifies whether built-in system tables should be ignored, regardless of the table whitelist or blacklist. By default system tables are excluded from monitoring and no events are generated when they change. |
| database.history.kafka.recovery.poll.interval.ms | 100 | Specifies the maximum number of milliseconds the connector should wait while polling persisted data during startup/recovery. Defaults to 100 ms. |
| database.history.kafka.recovery.attempts | 4 | The maximum number of times the connector tries to read persisted history data before recovery fails. The maximum wait time after no data has been received is recovery.attempts x recovery.poll.interval.ms. |
| database.history.skip.unparseable.ddl | false | Specifies whether the connector should ignore malformed or unknown database statements, or stop processing so an operator can fix the problem. The safe default is false. Skipping should be used with care, since it can lead to data loss or corruption when the binlog is processed. |
| database.history.store.only.monitored.tables.ddl | false | Specifies whether the connector should record all DDL statements or (when true) only those that relate to tables monitored by Debezium (via the filter configuration). The safe default is false. This feature should be used with care, since missing data may be needed when the filters are changed. |
| database.ssl.mode | disabled | Specifies whether an encrypted connection is used. The default is disabled, meaning an unencrypted connection. preferred establishes an encrypted connection if the server supports secure connections, otherwise falls back to an unencrypted one. required establishes an encrypted connection and fails if one cannot be established for any reason. verify_ca behaves like required but additionally verifies the server's TLS certificate against the configured certificate authority (CA) certificates, failing if it does not match any valid CA certificate. verify_identity behaves like verify_ca but additionally verifies that the server certificate matches the host of the remote connection. |
| binlog.buffer.size | 0 | The size of the buffer used by the binlog reader. Under certain conditions the MySQL binlog may contain uncommitted data that ends with a ROLLBACK statement, typically when savepoints are used or when temporary and regular table changes are mixed in a single transaction. When the start of a transaction is detected, Debezium tries to roll the binlog position forward to find the COMMIT or ROLLBACK, so that it can decide whether the changes of the transaction should be streamed. The buffer size defines the maximum number of changes within a transaction that Debezium can buffer while searching for the transaction boundary. If a transaction is larger than the buffer, Debezium has to rewind and re-read the events that did not fit into the buffer while streaming. A value of 0 disables buffering, which is the default. Note: this feature is still in beta. |
| snapshot.mode | initial | Specifies the mode used for snapshots when the connector starts. The default is initial and specifies that a snapshot runs only when no offsets have been recorded for the logical server name. when_needed runs a snapshot at startup whenever the connector deems it necessary (when no offsets are available, or when a previously recorded offset points to a binlog position or GTID that is no longer available on the server). never specifies that no snapshot should be run. schema_only only captures changes that occur after startup. schema_only_recovery is a recovery option for an existing connector, used to recover a corrupted or lost database history topic. |
| snapshot.locking.mode | minimal | Controls whether and for how long the connector holds the global MySQL read lock (which blocks any updates to the database) while performing a snapshot. There are three possible values: minimal, extended and none. minimal holds the global read lock only for the initial part of the snapshot, while the connector reads the database schemas and other metadata; the remaining work involves selecting all rows from each table and can be done in a consistent way using a REPEATABLE READ transaction even after the global read lock has been released and while other MySQL clients are updating the database. extended blocks all writes for the whole duration of the snapshot, which may be necessary when clients commit operations that MySQL excludes from REPEATABLE READ semantics. none prevents the connector from acquiring any table locks during the snapshot; this value can be used with all snapshot modes, but it is only safe if no schema changes happen while the snapshot is running. Note: tables defined with the MyISAM engine are still locked, since MyISAM acquires table locks; the InnoDB engine acquires row-level locks. |
| snapshot.select.statement.overrides | | Controls which rows of which tables are included in the snapshot. This property contains a comma-separated list of fully qualified tables (DB_NAME.TABLE_NAME). The SELECT statement for an individual table is given in a further configuration property, one per table, identified by the id snapshot.select.statement.overrides.[DB_NAME].[TABLE_NAME]; its value is the SELECT statement to use when retrieving data from that table during the snapshot. A possible use case for large append-only tables is to set a specific point at which to start (resume) the snapshot in case a previous snapshot was interrupted. Note: this setting only affects the snapshot; events captured from the binlog are not affected by it at all. |
| min.row.count.to.stream.results | 1000 | During a snapshot the connector queries each included table and produces a read event for every row of that table. This parameter determines whether the MySQL connection pulls all results of a table into memory or streams the results (which may be slower but works for very large tables). The value specifies the minimum number of rows a table must contain before the connector streams the results; the default is 1000. Setting the parameter to 0 skips all table size checks and always streams all results during the snapshot. |
| database.initial.statements | | A semicolon-separated list of SQL statements to execute when a JDBC connection to the database (not the transaction-log reading connection) is established. Use a double semicolon (';;') to use a semicolon as a character rather than as a delimiter. Note: the connector may establish JDBC connections at its own discretion, so this is usually only used for configuring session parameters, not for executing DML statements. |
| snapshot.delay.ms | | The interval the connector should wait after startup before taking a snapshot; this can avoid snapshot interruptions when multiple connectors are started in a cluster. |
| snapshot.fetch.size | | Specifies the maximum number of rows that should be read from a table at a time during a snapshot. |
| snapshot.lock.timeout.ms | 10000 | Specifies the maximum time to wait for table locks when taking a snapshot; if the locks cannot be acquired within this interval, the snapshot fails. |
| enable.time.adjuster | | MySQL allows users to insert year values with either 2 or 4 digits; 2-digit values are automatically mapped to the range 1970-2069. This is usually done by the database. Set to true (the default) when Debezium should do the conversion; set to false when the conversion is fully delegated to the database. |
| sanitize.field.names | true when the connector configuration explicitly specifies Avro for the key.converter or value.converter parameters, otherwise false | Whether field names are sanitized to comply with Avro naming requirements. |
| skipped.operations | | A comma-separated list of operations to skip. Operations include: c for inserts, u for updates and d for deletes. By default no operations are skipped. |
