Build a real-time data synchronization pipeline with Databend Kafka Connect

Author: Han Shanjie

Databend Cloud R&D Engineer

https://github.com/hantmac

Introduction to Kafka Connect

Kafka Connect is a tool for scalable and reliable streaming of data between Apache Kafka® and other data systems. By standardizing the movement of data in and out of Kafka, it becomes simple to quickly define connectors to transfer large data sets in Kafka, making it easier to build large-scale real-time data pipelines.

We use Kafka connectors to read from or write to external systems, manage data flows, and scale systems, all without writing new code. Kafka Connect handles the common concerns of connecting to other systems (schema management, fault tolerance, parallelism, latency, delivery semantics, and so on), so each connector only needs to focus on copying data between the target system and Kafka.

Kafka connectors are usually used to build data pipelines, and there are generally two usage scenarios:

  • Start and end endpoints: For example, export data from Kafka into the Databend database, or import data from a MySQL database into Kafka.

  • Intermediary for data transmission: For example, in order to store massive log data in Elasticsearch, you can first transfer the log data to Kafka, and then import the data from Kafka into Elasticsearch for storage. Kafka connectors can serve as buffers for each stage of the data pipeline, effectively decoupling consumer programs and producer programs.

Kafka connectors come in two types:

  • Source Connector: responsible for importing data into Kafka.
  • Sink Connector: responsible for exporting data from Kafka to the target system.

Databend Kafka Connect

Kafka currently provides hundreds of connectors on Confluent Hub, such as the Elasticsearch Service Sink Connector, the Amazon Sink Connector, the HDFS Sink Connector, and so on. Users can use these connectors to build data pipelines between arbitrary systems with Kafka at the center. We now also provide a Kafka Connect sink plugin for Databend. In this article we will show how to use the MySQL JDBC Source Connector and the Databend Sink Connector to build a real-time data synchronization pipeline.

Start Kafka Connect

This article assumes that Apache Kafka is already installed on your machine. If not, you can refer to the Kafka quickstart to install it.

Kafka Connect currently supports two execution modes: Standalone mode and distributed mode.

Startup modes

Standalone mode

In Standalone mode, all work is done in a single process. This mode is easier to configure and get started, but it does not take full advantage of some important features of Kafka Connect, such as fault tolerance. We can start the Standalone process using the following command:

bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties ...]

The first parameter config/connect-standalone.properties is the worker configuration. This includes configurations such as Kafka connection parameters, serialization format, and frequency of offset submission:

bootstrap.servers=localhost:9092
key.converter.schemas.enable=true
value.converter.schemas.enable=true
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000

The remaining parameters are the configuration files of the Connectors to be started. The default configuration provided above is intended for a local cluster running with the default settings in config/server.properties. If you use a different configuration or deploy in production, you need to adjust these defaults. In any case, all workers (standalone and distributed) require some configuration:

  • bootstrap.servers: This parameter lists the broker servers that will work with Connect. Connector will write data to these brokers or read data from them. You don't need to specify all brokers in the cluster, but it is recommended to specify at least 3.

  • key.converter and value.converter: Specify the converters used for message keys and message values respectively, which convert between the Kafka Connect format and the serialization format written to Kafka. This controls the format of keys and values in messages written to or read from Kafka. Since converters are independent of connectors, any connector can be used with any serialization format. By default, the JSONConverter provided by Kafka is used. Some converters also have their own configuration parameters; for example, you can specify whether JSON messages contain a schema by setting key.converter.schemas.enable to true or false.

  • offset.storage.file.filename: File used to store Offset data.

These configuration parameters allow Kafka Connect producers and consumers to access the configuration, offset, and status topics. To configure the producer used by Kafka Source tasks and the consumer used by Kafka Sink tasks, you can use the same parameters, but they need to be prefixed with 'producer.' and 'consumer.' respectively. bootstrap.servers is the only Kafka client parameter that does not require a prefix, as sketched below.
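A minimal sketch of what this looks like in the worker properties file (the converter classes below are the Kafka-provided JSONConverter defaults, and the producer/consumer overrides are illustrative assumptions, not required settings):

# Converters for message keys and values (JSONConverter is Kafka Connect's default)
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true

# Client overrides: 'producer.' applies to Source tasks, 'consumer.' to Sink tasks
producer.compression.type=lz4
consumer.max.poll.records=500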

Distributed mode

Distributed mode automatically balances workloads, can dynamically scale up (or down), and provides fault tolerance. Running in distributed mode is very similar to standalone mode:

bin/connect-distributed.sh config/connect-distributed.properties

The difference lies in the scripts launched and the configuration parameters. In distributed mode, use connect-distributed.sh instead of connect-standalone.sh. The first worker configuration parameter uses the config/connect-distributed.properties configuration file:

bootstrap.servers=localhost:9092
group.id=connect-cluster
key.converter.schemas.enable=true
value.converter.schemas.enable=true
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
#offset.storage.partitions=25
config.storage.topic=connect-configs
config.storage.replication.factor=1
status.storage.topic=connect-status
status.storage.replication.factor=1
#status.storage.partitions=5
offset.flush.interval.ms=10000

Kafka Connect stores offsets, configurations, and task status in Kafka topics. It is recommended to create the offset, configuration, and status topics manually so that they get the required number of partitions and replication factor. If these topics do not exist when Kafka Connect starts, they will be created automatically with the default number of partitions and replication factor, which may not suit our application. It is important to configure the following parameters before starting the cluster (a manual-creation sketch follows the list below):

  • group.id: The unique name of the Connect cluster, the default is connect-cluster. Workers with the same group id belong to the same Connect cluster. Note that this cannot conflict with the consumer group ID.

  • config.storage.topic: the topic used to store connector and task configurations; the default is connect-configs. Note that this should be a single-partition, highly replicated, compacted topic. You may need to create the topic manually to ensure the configuration is correct, because an automatically created topic may have multiple partitions or be configured with the delete cleanup policy instead of compact.

  • offset.storage.topic: Topic used to store Offset, the default is connect-offsets. This Topic can have multiple partitions.

  • status.storage.topic: Topic used to store status, default is connect-status. This Topic can have multiple partitions.
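A minimal sketch of creating these topics manually (the replication factor of 1 matches the single-broker setup above and should be increased for a production cluster; the partition counts mirror the commented defaults in the worker configuration):

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic connect-configs --partitions 1 --replication-factor 1 --config cleanup.policy=compact
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic connect-offsets --partitions 25 --replication-factor 1 --config cleanup.policy=compact
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic connect-status --partitions 5 --replication-factor 1 --config cleanup.policy=compact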

Note that in distributed mode, connectors must be managed through the REST API.

For example:

GET /connectors – returns the names of all running connectors.
POST /connectors – creates a new connector; the request body must be JSON and contain a name field and a config field, where name is the connector's name and config is a JSON object with the connector's configuration.
GET /connectors/{name} – returns information about the specified connector.
GET /connectors/{name}/config – returns the configuration of the specified connector.
PUT /connectors/{name}/config – updates the configuration of the specified connector.
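As a quick sketch of managing a connector this way (assuming the worker's REST interface listens on the default port 8083, and using an abbreviated version of the Databend sink configuration shown later in this article):

# List running connectors
curl http://localhost:8083/connectors

# Create a connector (config abbreviated for illustration)
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "databend",
  "config": {
    "connector.class": "com.databend.kafka.connect.DatabendSinkConnector",
    "connection.url": "jdbc:databend://localhost:8000",
    "topics": "test_kafka"
  }
}'

# Check the connector status
curl http://localhost:8083/connectors/databend/status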

Configure Connector

MySQL Source Connector

  1. Install MySQL Source Connector Plugin

Here we use the JDBC Source Connector provided by Confluent.

Download the Kafka Connect JDBC plugin from Confluent Hub and extract the zip file to the /path/kafka/libs directory.

  2. Install the MySQL JDBC Driver

Because the connector needs to communicate with the database, a JDBC driver is also required. The JDBC Connector plugin does not ship with a MySQL driver, so we need to download the driver separately. MySQL provides JDBC drivers for many platforms. Select the Platform Independent option and download the compressed TAR file, which contains the JAR file and source code. Extract the contents of this tar.gz file to a temporary directory, then copy the JAR file (for example, mysql-connector-java-8.0.17.jar), and only this JAR file, to the same libs directory as the kafka-connect-jdbc JAR file:

cp mysql-connector-j-8.0.32.jar /opt/homebrew/Cellar/kafka/3.4.0/libexec/libs/
  3. Deploy the MySQL Source Connector

Create a configuration file named mysql.properties under /path/kafka/config with the following configuration:

name=test-source-mysql-autoincrement
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://localhost:3306/mydb?useSSL=false
connection.user=root
connection.password=123456
#mode=timestamp+incrementing
mode=incrementing
table.whitelist=mydb.test_kafka
poll.interval.ms=1000
table.poll.interval.ms=3000
incrementing.column.name=id
#timestamp.column.name=tms
topics=test_kafka

Among these settings, we focus on the mode, incrementing.column.name, and timestamp.column.name fields. The Kafka Connect MySQL JDBC Source provides three incremental synchronization modes:

  • incrementing
  • timestamp
  • timestamp+incrementing
  1. In incrementing mode, each poll uses the column specified by the incrementing.column.name parameter and queries for rows whose value is greater than the maximum id seen in the last pull:
SELECT * FROM mydb.test_kafka
WHERE id > ?
ORDER BY id ASC

The disadvantage of this mode is that it cannot capture changes made by UPDATE or DELETE operations on existing rows, because a row's id does not change.

  2. timestamp mode detects new or modified rows based on a timestamp column on the table. Ideally, this column is updated with every write and its values increase monotonically. The timestamp column is specified with the timestamp.column.name parameter.

Note that the timestamp column must not be nullable in the source table.

In timestamp mode, each poll uses the column specified by the timestamp.column.name parameter and queries for rows whose timestamp is greater than the timestamp recorded at the last successful pull:

SELECT * FROM mydb.test_kafka
WHERE tms > ? AND tms < ?
ORDER BY tms ASC

This mode can capture UPDATE changes to rows, but its disadvantage is that it may lose data. Because the timestamp column is not unique, two or more rows may share the same timestamp. Suppose a crash occurs while the second of such rows is being imported: since the first row was imported successfully, its timestamp was recorded as consumed, and when synchronization resumes it starts from records with a greater timestamp, so the second and any later rows with the same timestamp are lost. In addition, you must ensure that the timestamp column increases over time; if a timestamp value is manually changed to something smaller than the largest timestamp already synchronized, that change will not be synchronized.

  3. Using only incrementing or only timestamp mode has pitfalls. Combining timestamp and incrementing gives you the best of both: the incrementing mode's guarantee of not losing data and the timestamp mode's ability to capture changes from UPDATE operations. Use the incrementing.column.name parameter to specify the strictly increasing column and the timestamp.column.name parameter to specify the timestamp column.
SELECT * FROM mydb.test_kafka
WHERE tms < ?
  AND ((tms = ? AND id > ?) OR tms > ?)
ORDER BY tms, id ASC

Because the MySQL JDBC Source Connector is query-based and uses SELECT statements to retrieve data, it has no sophisticated mechanism for detecting deleted rows, so DELETE operations are not supported. For that, you can use a log-based connector such as Kafka Connect Debezium.

The effects of these modes are demonstrated separately later in this article. For more configuration parameters, please refer to MySQL Source Configs.

Databend Kafka Connector

  1. Install or compile the Databend Kafka Connector

You can compile the JAR from source or download it directly from the releases page.

git clone https://github.com/databendcloud/databend-kafka-connect.git && cd databend-kafka-connect
mvn -Passembly -Dmaven.test.skip package

Copy databend-kafka-connect.jar to the /path/kafka/libs directory.

  2. Install the Databend JDBC Driver

Download the latest Databend JDBC driver from Maven Central and copy it to the /path/kafka/libs directory.
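For example, assuming the downloaded driver JAR is named databend-jdbc-<version>.jar (substitute the actual version you downloaded):

cp databend-jdbc-<version>.jar /path/kafka/libs/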

  3. Deploy the Databend Kafka Connector

Create a configuration file named databend.properties under /path/kafka/config with the following configuration:

name=databend
connector.class=com.databend.kafka.connect.DatabendSinkConnector

connection.url=jdbc:databend://localhost:8000
connection.user=databend
connection.password=databend
connection.attempts=5
connection.backoff.ms=10000
connection.database=default

table.name.format=default.${topic}
max.retries=10
batch.size=1
auto.create=true
auto.evolve=true
insert.mode=upsert
pk.mode=record_value
pk.fields=id
topics=test_kafka
errors.tolerance=all

When auto.create and auto.evolve are set to true, the target table is created automatically, and changes to the source table structure are synchronized to the target table as well. For an introduction to more configuration parameters, please refer to Databend Kafka Connect Properties.

Test Databend Kafka Connect

Prepare various components

  1. Start MySQL
version: '2.1'
services:
  postgres:
    image: debezium/example-postgres:1.1
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
  mysql:
    image: debezium/example-mysql:1.1
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=123456
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw
  2. Start Databend
version: '3'
services:
  databend:
    image: datafuselabs/databend
    volumes:
      - /Users/hanshanjie/databend/local-test/databend/databend-query.toml:/etc/databend/query.toml
    environment:
      QUERY_DEFAULT_USER: databend
      QUERY_DEFAULT_PASSWORD: databend
      MINIO_ENABLED: 'true'
    ports:
      - '8000:8000'
      - '9000:9000'
      - '3307:3307'
      - '8124:8124'
  3. Start Kafka Connect in standalone mode and load the MySQL Source Connector and Databend Sink Connector:
./bin/connect-standalone.sh config/connect-standalone.properties config/databend.properties config/mysql.properties
[2023-09-06 17:39:23,128] WARN [databend|task-0] These configurations '[metrics.context.connect.kafka.cluster.id]' were supplied but are not used yet. (org.apache.kafka.clients.consumer.ConsumerConfig:385)
[2023-09-06 17:39:23,128] INFO [databend|task-0] Kafka version: 3.4.0 (org.apache.kafka.common.utils.AppInfoParser:119)
[2023-09-06 17:39:23,128] INFO [databend|task-0] Kafka commitId: 2e1947d240607d53 (org.apache.kafka.common.utils.AppInfoParser:120)
[2023-09-06 17:39:23,128] INFO [databend|task-0] Kafka startTimeMs: 1693993163128 (org.apache.kafka.common.utils.AppInfoParser:121)
[2023-09-06 17:39:23,148] INFO Created connector databend (org.apache.kafka.connect.cli.ConnectStandalone:113)
[2023-09-06 17:39:23,148] INFO [databend|task-0] [Consumer clientId=connector-consumer-databend-0, groupId=connect-databend] Subscribed to topic(s): test_kafka (org.apache.kafka.clients.consumer.KafkaConsumer:969)
[2023-09-06 17:39:23,150] INFO [databend|task-0] Starting Databend Sink task (com.databend.kafka.connect.sink.DatabendSinkConfig:33)
[2023-09-06 17:39:23,150] INFO [databend|task-0] DatabendSinkConfig values:...

Insert

In Insert mode we need to use the following MySQL Connector configuration:

name=test-source-mysql-jdbc-autoincrement
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://localhost:3306/mydb?useSSL=false
connection.user=root
connection.password=123456
#mode=timestamp+incrementing
mode=incrementing
table.whitelist=mydb.test_kafka
poll.interval.ms=1000
table.poll.interval.ms=3000
incrementing.column.name=id
#timestamp.column.name=tms
topics=test_kafka

Create the mydb database and the test_kafka table in MySQL:

CREATE DATABASE mydb;
USE mydb;

CREATE TABLE test_kafka (
  id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  description VARCHAR(512)
);
ALTER TABLE test_kafka AUTO_INCREMENT = 10;

Before any data is inserted, databend-kafka-connect receives no events, so the target table is not created and nothing is written on the Databend side.

Insert data:

INSERT INTO test_kafka VALUES (default,"scooter","Small 2-wheel scooter"),
(default,"car battery","12V car battery"),
(default,"12-pack drill bits","12-pack of drill bits with sizes ranging from #40 to #3"),
(default,"hammer","12oz carpenter's hammer"),
(default,"hammer","14oz carpenter's hammer"),
(default,"hammer","16oz carpenter's hammer"),
(default,"rocks","box of assorted rocks"),
(default,"jacket","water resistent black wind breaker"),
(default,"cloud","test for databend"),
(default,"spare tire","24 inch spare tire");

After the data is inserted into the source table, the Databend target table is created automatically, and the data is inserted into it successfully.
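As a quick sanity check, you can query Databend directly (a sketch; the table name default.test_kafka follows the table.name.format setting in the sink configuration):

SELECT * FROM default.test_kafka ORDER BY id;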

DDL Support

We set auto.evolve=true in the configuration file, so when the source table structure changes, the DDL is synchronized to the target table. Here we just need to change the mode of the MySQL Source Connector from incrementing to timestamp+incrementing, which requires adding a timestamp field and enabling the timestamp.column.name=tms configuration. Execute the following on the original table:

alter table test_kafka add column tms timestamp;

Then insert a row:

insert into test_kafka values(20,"new data","from kafka",now());

Checking the target table, we find that the tms column has been synchronized to the Databend table and the new row has been inserted successfully.
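A minimal way to verify the schema change on the Databend side (a sketch):

DESC default.test_kafka;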

Upsert

Modify the configuration of MySQL Connector as follows:

name=test-source-mysql-jdbc-autoincrement
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://localhost:3306/mydb?useSSL=false
connection.user=root
connection.password=123456
mode=timestamp+incrementing
#mode=incrementing
table.whitelist=mydb.test_kafka
poll.interval.ms=1000
table.poll.interval.ms=3000
incrementing.column.name=id
timestamp.column.name=tms
topics=test_kafka

The main changes are switching mode to timestamp+incrementing and adding the timestamp.column.name field.

Restart Kafka Connect.

Update a row in the source table:

update test_kafka set name="update from kafka test" where id=20;

Query the target table to see the updated data.
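A minimal check, assuming the default.test_kafka table name from the sink configuration:

SELECT id, name, description, tms FROM default.test_kafka WHERE id = 20;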

Summary

As you can see from the above, Databend Kafka Connect has the following characteristics:

  • Automatic table and column creation: with auto.create and auto.evolve enabled, tables and columns are created automatically, and the table name is derived from the Kafka topic name;

  • Kafka Schema support: the connector supports Avro, JSON Schema, and Protobuf input data formats. Schema Registry must be enabled to use Schema Registry-based formats;

  • Multiple write modes: the connector supports insert and upsert write modes;

  • Multi-tasking support: With the capabilities of Kafka Connect, Connector supports running one or more tasks. Increasing the number of tasks can improve system performance;

  • High availability: Distributed mode can automatically balance workloads, dynamically expand (or shrink), and provide fault tolerance.

At the same time, Databend Kafka Connect can also use the configuration options supported by native Kafka Connect. For more configurations, please refer to Kafka Connect Sink Configuration Properties for Confluent Platform.

Connect With Us

Databend is an open source, flexible, low-cost, new data warehouse based on object storage that can also perform real-time analysis. We look forward to your attention and exploring cloud native data warehouse solutions together to create a new generation of open source Data Cloud.
