Flink CDC principles and production practice (continuously updated...)

The MySQL CDC connector allows reading snapshot data and incremental data from a MySQL database.
This document explains, based on the Ververica official documentation, how to set up the MySQL CDC connector to run SQL queries against a MySQL database.

1. Dependency

To set up the MySQL CDC connector, the following sections provide dependency information both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with the SQL JAR bundle.

1. Maven dependency

<dependency>
  <groupId>com.alibaba.ververica</groupId>
  <artifactId>flink-connector-mysql-cdc</artifactId>
  <version>1.1.0</version>
</dependency>

2. SQL client JAR

Download flink-sql-connector-mysql-cdc-1.1.0.jar and place it under <FLINK_HOME>/lib/.

2. Set up the MySQL server

You must define a MySQL user with appropriate permissions for all databases monitored by the Debezium MySQL connector.

1. Create a MySQL user:

mysql> CREATE USER 'user'@'localhost' IDENTIFIED BY 'password';

2. Grant the required permissions to the user:

mysql> GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'user'@'localhost';

3. Finalize user permissions:

mysql> FLUSH PRIVILEGES;

For more details on these permissions, see: https://debezium.io/documentation/reference/1.2/connectors/mysql.html#_permissions_explained .

3. Notes

1. How does the MySQL CDC source work?

When the MySQL CDC source starts, it acquires a global read lock (FLUSH TABLES WITH READ LOCK), which blocks writes from other clients. It then reads the current binlog position and the schemas of the monitored databases and tables, after which the global read lock is released. Next, it scans the database tables, and then reads the binlog from the previously recorded position. Flink periodically performs checkpoints to record the binlog position; if a failure occurs, the job is restarted and recovers from the binlog position of the last completed checkpoint, thereby guaranteeing exactly-once semantics.
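The steps above can be sketched as the sequence of MySQL statements the source roughly issues (a simplified illustration of the snapshot procedure, not the exact Debezium implementation; `mydb.orders` is a placeholder table):

```sql
FLUSH TABLES WITH READ LOCK;   -- acquire the global read lock (blocks writes)
SHOW MASTER STATUS;            -- record the current binlog file name and position
-- read the schemas of the monitored databases and tables ...
UNLOCK TABLES;                 -- release the global read lock
SELECT * FROM mydb.orders;     -- scan the snapshot data of each monitored table
-- then tail the binlog from the recorded position, checkpointing it periodically
```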

2. Grant RELOAD permissions to MySQL users

If the MySQL user is not granted the RELOAD permission, the MySQL CDC source falls back to table-level locks to perform the snapshot. This blocks writes for a longer period of time.
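If the user created in the setup section is missing RELOAD, it can be granted separately (a sketch assuming the same 'user'@'localhost' account as above):

```sql
-- grant RELOAD so the source can use the shorter-lived global read lock
GRANT RELOAD ON *.* TO 'user'@'localhost';
FLUSH PRIVILEGES;
```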

3. Global read lock (FLUSH TABLES WITH READ LOCK)

The global read lock is held while reading the binlog position and the schemas. This may take a few seconds, depending on the number of tables. Because the global read lock blocks writes, it may still affect online business.
If you want to skip the read lock and can tolerate at-least-once semantics, you can add the option 'debezium.snapshot.locking.mode' = 'none' to skip it.
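For example, the option goes in the WITH clause of the table definition (a sketch reusing the orders table and placeholder connection values from the creation section below; remember this trades exactly-once for at-least-once):

```sql
CREATE TABLE orders_nolock (
  order_id INT,
  order_status BOOLEAN
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'mydb',
  'table-name' = 'orders',
  -- skip the global/table read lock during the snapshot phase
  'debezium.snapshot.locking.mode' = 'none'
);
```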

4. Set a different SERVER ID for each job

Every MySQL database client that reads the binlog should have a unique ID, called the server id. The MySQL server uses this ID to maintain network connections and binlog positions. If different jobs share the same server id, they may read from the wrong binlog position.
Tip: By default, a random server id is generated when the TaskManager starts. If the TaskManager fails, it may get a different server id when restarted. But this should not happen frequently (a job exception does not restart the TaskManager), nor does it have much impact on the MySQL server.

Therefore, it is recommended to set a different server id for each job, for example:

  • Via SQL hints: SELECT * FROM source_table /*+ OPTIONS('server-id'='123456') */;

  • Via the DataStream API when building the source: MySQLSource.builder() ... .serverId(123456);

Important: MySQL's binlog is essentially at the database-instance level, so pulling different tables of one database, or the same table, with the same server id may cause data loss. It is therefore recommended to set a distinct server id per job. (I also confirmed this with Jark on the community mailing list.)
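Since 'server-id' is a regular connector option (as the SQL hint above shows), it can also be fixed in the table's WITH clause; a sketch with placeholder connection values:

```sql
CREATE TABLE orders_with_server_id (
  order_id INT,
  order_status BOOLEAN
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'mydb',
  'table-name' = 'orders',
  -- pick a server id that no other job or MySQL replica uses
  'server-id' = '123456'
);
```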

5. Checkpoints cannot be performed while scanning database tables

While scanning a table, checkpoints cannot be performed because there is no position to recover from. To avoid performing a checkpoint, the MySQL CDC source keeps the checkpoint waiting until it times out. A timed-out checkpoint is marked as a failed checkpoint, which by default triggers failover of the Flink job. Therefore, if the database tables are large, it is recommended to add the following Flink configuration to avoid failover due to checkpoint timeouts:

execution.checkpointing.interval: 10min
execution.checkpointing.tolerable-failed-checkpoints: 100
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647

6. Set MySQL session timeout

When an initial consistent snapshot is created for a large database, the connection may time out while the tables are being read. You can prevent this by configuring interactive_timeout and wait_timeout in the MySQL configuration file.

  • interactive_timeout: the number of seconds the server waits for activity on an interactive connection before closing it. See the MySQL documentation.

  • wait_timeout: the number of seconds the server waits for activity on a non-interactive connection before closing it. See the MySQL documentation.
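Both timeouts can also be raised at runtime for new connections (the value 86400 seconds, i.e. one day, is only an illustrative choice; for production, set them persistently in my.cnf):

```sql
SET GLOBAL interactive_timeout = 86400;
SET GLOBAL wait_timeout = 86400;
-- verify the new values (existing sessions keep their old settings)
SHOW VARIABLES LIKE '%timeout%';
```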

4. How to create a MySQL CDC table

1. SQL:

(1) Define the table as follows:

-- register a MySQL table 'orders' in Flink SQL
CREATE TABLE orders (
  order_id INT,
  order_date TIMESTAMP(0),
  customer_name STRING,
  price DECIMAL(10, 5),
  product_id INT,
  order_status BOOLEAN
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'mydb',
  'table-name' = 'orders'
);

-- read snapshot and binlogs from orders table
SELECT * FROM orders;

(2) Connector options

      To be added...

2. DataStream API

The MySQL CDC connector can also be used as a DataStream source. You can create a SourceFunction as follows:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import com.alibaba.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import com.alibaba.ververica.cdc.connectors.mysql.MySQLSource;

public class MySqlBinlogSourceExample {
  public static void main(String[] args) throws Exception {
    SourceFunction<String> sourceFunction = MySQLSource.<String>builder()
      .hostname("localhost")
      .port(3306)
      .databaseList("inventory") // monitor all tables under inventory database
      .username("flinkuser")
      .password("flinkpw")
      .deserializer(new StringDebeziumDeserializationSchema()) // converts SourceRecord to String
      .build();

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env
      .addSource(sourceFunction)
      .print().setParallelism(1); // use parallelism 1 for sink to keep message ordering

    env.execute();
  }
}

5. Features

1. Exactly-Once Processing

The MySQL CDC connector is a Flink source connector. It first reads a snapshot of the database and then continues reading the binlog, even across failures, with exactly-once processing. Please read how the connector performs database snapshots.

2. Single Thread Reading

The MySQL CDC source cannot read in parallel, because only one task can receive binlog events.

6. Data type mapping

To be added...

7. Frequently asked questions

1. How to skip the snapshot and read only from binlog?

This can be controlled with the debezium.snapshot.mode option, which you can set to:

  • never: specifies that the connector should never use snapshots and that, on first start with a logical server name, it should read from the beginning of the binlog. Use this with caution, as it is only valid when the binlog is guaranteed to contain the entire history of the database.

  • schema_only: if you don't need a consistent snapshot of the data, but only the changes made since the connector was started, use the schema_only option, with which the connector snapshots only the schema (not the data).
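For example, to snapshot only the schema at startup and then read changes from the binlog (a sketch with placeholder connection values):

```sql
CREATE TABLE orders_schema_only (
  order_id INT,
  order_status BOOLEAN
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'mydb',
  'table-name' = 'orders',
  -- skip the data snapshot; capture only changes from now on
  'debezium.snapshot.mode' = 'schema_only'
);
```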

2. How to read a shared database containing multiple tables (such as user_00, user_01,..., user_99)?

The table-name option supports regular expressions, so one source can monitor multiple tables matching a pattern. For example, you can set table-name to user_.* to monitor all tables with the user_ prefix. The database-name option works the same way. Note that the sharded tables should have the same schema.
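A sketch of such a source (placeholder schema and connection values; the regex must match the shard naming):

```sql
CREATE TABLE user_source (
  user_id INT,
  user_name STRING
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'mydb',
  -- regular expression matching user_00 .. user_99
  'table-name' = 'user_.*'
);
```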

3. ConnectException: Received DML '...' for processing; the binlog probably contains events generated with statement- or mixed-based replication format

If you see the above exception, check whether binlog_format is ROW; you can check it in the MySQL client by running show variables like '%binlog_format%'. Note that even if binlog_format is configured as ROW for the database, this configuration can be changed by other sessions, e.g. SET SESSION binlog_format='MIXED'; SET SESSION tx_isolation='REPEATABLE-READ'; COMMIT;. Also make sure that no other session is changing this configuration.
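The check and the fix can be sketched as follows (SET GLOBAL affects only new sessions; existing sessions keep their own setting until they reconnect):

```sql
-- check the current binlog format (must be ROW for the CDC source)
SHOW VARIABLES LIKE '%binlog_format%';
-- if it is not ROW, change it globally
SET GLOBAL binlog_format = 'ROW';
```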

8. Problems encountered in practice

To be added. . .

1. Dependency conflicts between different Kafka versions can cause CDC to report errors: http://apache-flink.147419.n8.nabble.com/cdc-td8357.html#a8393
2. Timeout problem: set wait_timeout as mentioned above.
[Figure: screenshot of the CDC timeout bug]



Origin blog.csdn.net/weixin_44500374/article/details/112611082