Apache Kafka S3-based data export, import, backup, restore, and migration solutions

"Big Data Platform Architecture and Prototype Implementation: Practical Combat of Data Middle Platform Construction" The blogger spent three years carefully creating the book "Big Data Platform Architecture and Prototype Implementation: Practical Practice of Data Middle Platform Construction" is now available Published and distributed by the well-known IT book brand Electronic Industry Press Bowen Viewpoint, click "Heavy Recommendation: Building a big data platform is too difficult!" Send me a project prototype! 》Learn more about the book, JD.com purchase link: https://item.jd.com/12677623.html, Scan the QR code on the left to enter the JD.com mobile book purchase page.
When upgrading or migrating a system, users often need to export (back up) the data in a Kafka cluster and then import (restore) it into a new or different cluster. Kafka MirrorMaker is usually used for data replication and synchronization between Kafka clusters. However, in some scenarios, environmental restrictions mean the network between the two Kafka clusters may not be connected, or the Kafka data needs to be persisted as files for future use. In such cases, a solution based on the Kafka Connect S3 Source / Sink Connector is a more suitable choice. This article introduces the concrete implementation of this solution.

Exporting, importing, backing up, and restoring data are usually one-off operations, so there is no need to build a complete and durable infrastructure for them; saving time and effort, simplicity, and convenience are the priorities. To this end, this article provides an out-of-the-box solution that uses Docker to build Kafka Connect. All operations come with automated shell scripts: users only need to set a few environment variables and execute the corresponding scripts to complete all the work. This Docker-based single-node model can handle small and medium-sized data synchronization and migration. If you are looking for a stable and robust solution, consider migrating the Docker version of Kafka Connect to Kubernetes or Amazon MSK Connect for a clustered deployment.

1. Overall structure

First, let’s introduce the overall architecture of the solution. Export/import and backup/restore are actually two highly similar scenarios, but for the sake of clarity of description, we will discuss them separately. Let’s first take a look at the export/import architecture diagram:


Figure 1. Data export/import between Kafka clusters

In this architecture, the Kafka cluster on the Source side is the starting point of the data flow. A Kafka Connect with the S3 Sink Connector installed extracts the data of the specified Topics from the Source-side Kafka and stores it on S3 in the form of Json or Avro files; meanwhile, another Kafka Connect with the S3 Source Connector installed reads these Json or Avro files from S3 and writes them to the corresponding Topics of the Sink-side Kafka. If the Source-side and Sink-side Kafka clusters are not in the same Region, you can complete the export and import in their respective Regions and then use S3 Cross-Region Replication to synchronize the data between the two Regions.

With simple adjustments, this architecture can also be used for the backup/restore of a Kafka cluster, as shown in the following figure: first back up the data of the Kafka cluster to S3, and then, after completing the upgrade, migration, or rebuild of the cluster, restore the data from S3 to the newly created cluster.

Figure 2. Data backup/restore of Kafka cluster

This article will give complete environment setup instructions and practical scripts based on the export/import architecture shown in Figure 1. The backup/restore architecture shown in Figure 2 can also be implemented based on the guidance and scripts provided in this article.

2. Preset conditions

This article focuses on the data export/import and backup/restore operations of Kafka Connect; due to space limitations, it cannot cover in detail how to build and configure each component in the architecture. The following prerequisites therefore need to be prepared by readers in their own environments in advance:

① An EC2 instance based on Amazon Linux 2 (a newly created, clean instance is recommended). All practical scripts in this article are executed on this instance, which is also the host running the Kafka Connect Docker container.

② Two Kafka clusters, one as Source and one as Sink; if only one Kafka cluster is available, the verification can still be completed, with that cluster serving as both Source and Sink.

③ In order to focus on the core configuration of the Kafka Connect S3 Source / Sink Connector, we assume the Kafka clusters have identity authentication disabled (that is, the authentication type is Unauthenticated) and transmit data as PLAINTEXT, which simplifies the connection configuration of Kafka Connect.

④ Network connectivity requires that the EC2 instance can access S3, the Source-side Kafka cluster, and the Sink-side Kafka cluster. If the Source and Sink cannot be reached at the same time in your environment, you can operate on two EC2 instances in different networks, as long as both can access S3. For cross-Region or cross-account isolation, configure S3 Cross-Region Replication or copy the data files manually. A quick connectivity check is sketched below.
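
Since network connectivity is the most common stumbling block, below is a minimal, optional connectivity check. It is only a sketch: it assumes the global variables from Section 3 have already been set and the AWS CLI credentials from Section 4.1 have already been configured, and it uses bash's built-in /dev/tcp so that no extra tools are required:

# optional connectivity sanity check (run after the global configuration and AWS CLI setup)
aws s3 ls > /dev/null && echo "S3: reachable" || echo "S3: NOT reachable"
for servers in "$SOURCE_KAFKA_BOOTSTRAP_SEVERS" "$SINK_KAFKA_BOOTSTRAP_SEVERS"; do
    for hp in $(IFS=,; echo $servers); do
        host=${hp%%:*}; port=${hp##*:}
        timeout 3 bash -c "echo > /dev/tcp/$host/$port" \
            && echo "$hp: reachable" || echo "$hp: NOT reachable"
    done
done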

3. Global configuration

Since the actual operations inevitably depend on account- and environment-specific information (such as the AKSK, service addresses, paths, Topic names, etc.), in order to keep the operation scripts in this article portable, we extract all environment-related information and configure it centrally as global variables before starting. Below is the global variable configuration script; readers need to set the values of these variables according to their own environment:

# account-specific configs
export REGION="<your-region>"
export S3_BUCKET="<your-s3-bucket>"
export AWS_ACCESS_KEY_ID="<your-aws-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-aws-secret-access-key>"
export SOURCE_KAFKA_BOOTSTRAP_SEVERS="<your-source-kafka-bootstrap-servers>"
export SINK_KAFKA_BOOTSTRAP_SEVERS="<your-sink-kafka-bootstrap-servers>"
# kafka topics import and export configs
export SOURCE_TOPICS_LIST="<your-source-topic-list>"
export SINK_TOPICS_LIST="<your-sink-topic-list>"
export TOPIC_REGEX_LIST="<your-topic-regex-list>"
export SOURCE_TOPICS_REGEX="<your-source-topics-regex>"
export SINK_TOPICS_REPLACEMENT="<your-sink-topics-replacement>"    

To facilitate demonstration and explanation, this article uses the global configuration below. The first 6 items are strongly tied to the account and environment and still need to be modified by the user; the values given in the script are only placeholders. The next 5 items are closely related to the Kafka data import and export, and it is not recommended to modify them, because the subsequent explanations are based on the values set here. After the verification is completed, you can freely adjust the last five items as needed to carry out the actual import and export work.

Back to the operation process: log in to the prepared EC2 instance, modify the first 6 account- and environment-related items in the script below, and then execute the modified script. Also note that some later scripts do not return after execution; they keep occupying the current window to output logs or Kafka messages, so new command line windows will need to be opened, and this global configuration script must be executed once in every new window.

# Step 1: Global configuration
# account and environment configs
export REGION="us-east-1"
export S3_BUCKET="source-topics-data"
export AWS_ACCESS_KEY_ID="ABCDEFGHIGKLMNOPQRST"
export AWS_SECRET_ACCESS_KEY="abcdefghigklmnopqrstuvwxyz0123456789"
export SOURCE_KAFKA_BOOTSTRAP_SEVERS="b-1.cluster1.6ww5j7.c1.kafka.us-east-1.amazonaws.com:9092"
export SINK_KAFKA_BOOTSTRAP_SEVERS="b-1.cluster2.2au4b8.c2.kafka.us-east-1.amazonaws.com:9092"
# kafka topics import and export configs
export SOURCE_TOPICS_LIST="source-topic-1,source-topic-2"
export SINK_TOPICS_LIST="sink-topic-1,sink-topic-2"
export TOPIC_REGEX_LIST="source-topic-1:.*,source-topic-2:.*"
export SOURCE_TOPICS_REGEX="source-topic-(\\\d)" # to be resolved to "source-topic-(\\d)" in json configs
export SINK_TOPICS_REPLACEMENT="sink-topic-\$1" # to be resolved to "sink-topic-$1" in json configs

Regarding the last five configurations in the above script, there are detailed instructions as follows:

SOURCE_TOPICS_LIST — sample value: source-topic-1,source-topic-2. This value is assigned to the topics configuration item of the S3 Sink Connector and specifies the list of Topics to be exported (comma-separated).

SINK_TOPICS_LIST — sample value: sink-topic-1,sink-topic-2. This is the list of Sink Topics corresponding to the Source Topics on the Sink side (comma-separated). It does not appear in any Connector configuration, because the S3 Source Connector learns which Source-side Topics exist from the directory structure on S3, and the Sink-side Topic names are mapped from the Source-side Topic names via regular expressions. This value is only used in the script that creates the Topics on the Sink side. (Note: technically this variable does not have to be set, since its value could be derived from SOURCE_TOPICS_LIST, TOPIC_REGEX_LIST, and SINK_TOPICS_REPLACEMENT, but doing so would complicate the script and make it harder to read.)

TOPIC_REGEX_LIST — sample value: source-topic-1:.*,source-topic-2:.*. This value is assigned to the topic.regex.list configuration item of the S3 Source Connector. Its format is <topic1>:<regex1>,<topic2>:<regex2>,.... It tells the S3 Source Connector which files are the data files of each Topic. The regular expression matches file names only, not the intermediate path of the file (such as partition=0); the intermediate path is controlled by the partitioner.class configuration item, and the S3 Source Connector must use the same Partitioner as the S3 Sink Connector in order to match the file paths correctly.

SOURCE_TOPICS_REGEX — sample value: source-topic-(\\\d). This value is assigned to the transforms.xxx.regex configuration item of the S3 Source Connector. It is a regular expression matching all Topics on the Source-side Kafka cluster and usually contains capture groups; the associated SINK_TOPICS_REPLACEMENT expression references these groups in order to map to the target Topics on the Sink side.

SINK_TOPICS_REPLACEMENT — sample value: sink-topic-\$1. This value is assigned to the transforms.xxx.replacement configuration item of the S3 Source Connector. It is the replacement expression that produces the Topic names on the Sink-side Kafka cluster and usually references the capture groups of SOURCE_TOPICS_REGEX in order to map to the target Topics on the Sink side.

Taking the values set in the script as an example, here is what these five configuration items accomplish when combined, which is also the main task of this article:

On the Source-side Kafka cluster there are two Topics, source-topic-1 and source-topic-2. A Kafka Connect (Docker container) with the S3 Sink Connector installed exports the data of these two Topics to a designated S3 bucket; then a Kafka Connect with the S3 Source Connector installed (a Docker container; in this article both connectors actually run in the same Kafka Connect container) writes the data in the S3 bucket into the Sink-side Kafka cluster, where the data originally from source-topic-1 is written to sink-topic-1 and the data originally from source-topic-2 is written to sink-topic-2.
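
To make the regular-expression mapping concrete, the following one-liner roughly simulates what the RegexRouter transform will do with the values above (it uses sed purely for illustration, not the connector itself):

# rough illustration of the topic-name mapping performed by the RegexRouter transform
for t in source-topic-1 source-topic-2; do
    echo "$t -> $(echo $t | sed -E 's/source-topic-([0-9])/sink-topic-\1/')"
done
# expected output:
#   source-topic-1 -> sink-topic-1
#   source-topic-2 -> sink-topic-2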

In particular, for a backup/restore scenario, the exported and imported Topic names need to stay the same. In that case, you can simply delete the four configuration items beginning with transforms from the S3 Source Connector (they appear below), or change the following two items to:

export SOURCE_TOPICS_REGEX=".*"
export SINK_TOPICS_REPLACEMENT="\$0"

If you only have one Kafka cluster, you can still complete the verification in this article: simply point both SOURCE_KAFKA_BOOTSTRAP_SEVERS and SINK_KAFKA_BOOTSTRAP_SEVERS to that cluster. The cluster then acts as both the Source side and the Sink side, and since the Source Topics and Sink Topics have different names in the configuration, there is no conflict.
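
For example, in the single-cluster case the two bootstrap variables would simply point to the same address (the address below reuses the sample value from the script above and is illustrative only):

# single-cluster verification: Source and Sink point to the same cluster
export SOURCE_KAFKA_BOOTSTRAP_SEVERS="b-1.cluster1.6ww5j7.c1.kafka.us-east-1.amazonaws.com:9092"
export SINK_KAFKA_BOOTSTRAP_SEVERS="$SOURCE_KAFKA_BOOTSTRAP_SEVERS"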

4. Environment preparation

4.1. Installation toolkit

Execute the following script on the EC2 instance to install and configure the five necessary software packages: jq, yq, docker, jdk, and the kafka console client. You can install all or only some of them depending on the state of your EC2 instance; a clean EC2 instance on which all of them are installed is recommended:

# Step 2: Install toolkit
# install jq
sudo yum -y install jq
jq --version

# install yq
sudo wget https://github.com/mikefarah/yq/releases/download/v4.35.1/yq_linux_amd64 -O /usr/bin/yq
sudo chmod a+x /usr/bin/yq
yq --version

# install docker
sudo yum -y install docker
# enable & start docker
sudo systemctl enable docker
sudo systemctl start docker
sudo systemctl status docker
# configure docker, add current user to docker user group
# and refresh docker group to take effect immediately
sudo usermod -aG docker $USER
newgrp docker
docker --version

# install docker compose
dockerConfigDir=${dockerConfigDir:-$HOME/.docker}
mkdir -p $dockerConfigDir/cli-plugins
wget "https://github.com/docker/compose/releases/download/v2.20.3/docker-compose-$(uname -s)-$(uname -m)" -O $dockerConfigDir/cli-plugins/docker-compose
chmod a+x $dockerConfigDir/cli-plugins/docker-compose
docker compose version

# install jdk
sudo yum -y install java-1.8.0-openjdk-devel
# configure jdk
sudo tee /etc/profile.d/java.sh << EOF
export JAVA_HOME=/usr/lib/jvm/java
export PATH=\$JAVA_HOME/bin:\$PATH
EOF
# make the java cli available to the current ssh session and other linux users
source /etc/profile.d/java.sh
sudo -i -u root source /etc/profile.d/java.sh || true
sudo -i -u ec2-user source /etc/profile.d/java.sh || true
java -version

# install kafka console client
kafkaClientUrl="https://archive.apache.org/dist/kafka/3.5.1/kafka_2.12-3.5.1.tgz"
kafkaClientPkg=$(basename $kafkaClientUrl)
kafkaClientDir=$(basename $kafkaClientUrl ".tgz")
wget $kafkaClientUrl -P /tmp/
sudo tar -xzf /tmp/$kafkaClientPkg -C /opt
sudo tee /etc/profile.d/kafka-client.sh << EOF
export KAFKA_CLIENT_HOME=/opt/$kafkaClientDir
export PATH=\$KAFKA_CLIENT_HOME/bin:\$PATH
EOF

# make the kafka console cli available to the current ssh session and other linux users
source /etc/profile.d/kafka-client.sh
sudo -i -u root source /etc/profile.d/kafka-client.sh || true
sudo -i -u ec2-user source /etc/profile.d/kafka-client.sh || true

# verify if kafka client available
kafka-console-consumer.sh --version

# set aksk for s3 and other aws operation
aws configure set default.region $REGION
aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID
aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY
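
Optionally, the following quick check (a simple sketch; it only verifies that each command can be found on the PATH) confirms in one pass that all the tools installed above are available:

# optional: verify that all required tools are available on the PATH
for cmd in jq yq docker java aws kafka-topics.sh kafka-console-consumer.sh kafka-console-producer.sh; do
    command -v "$cmd" > /dev/null && echo "OK: $cmd" || echo "MISSING: $cmd"
done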

4.2. Create an S3 bucket

The entire solution uses S3 as the data dump medium, for which a bucket needs to be created on S3. The data from the Source-side Kafka cluster will be exported to this bucket and saved in the form of a Json file. When importing data to the Sink-side Kafka cluster, the Json file stored in this bucket will also be read.

# Step 3: Create the S3 bucket
aws s3 rm --recursive s3://$S3_BUCKET || aws s3 mb s3://$S3_BUCKET

4.3. Create Source Topics on source Kafka

In order to ensure that Topic data can be fully backed up and restored, the S3 Source Connector recommends that the number of partitions of the Sink Topics be consistent with that of the Source Topics (for details, please refer to the [Official Document]). If Kafka is left to create Topics automatically, the partition counts of the Source Topics and Sink Topics will very likely differ, so we choose to create the Source Topics and Sink Topics manually and make sure their partition counts match. The following script creates two Topics, source-topic-1 and source-topic-2, each with 9 partitions:

# Step 4: Create Source Topics on the source Kafka
for topic in $(IFS=,; echo $SOURCE_TOPICS_LIST); do
    # create topic
    kafka-topics.sh --bootstrap-server $SOURCE_KAFKA_BOOTSTRAP_SEVERS --create --topic $topic --replication-factor 3 --partitions 9
    # describe topic
    kafka-topics.sh --bootstrap-server $SOURCE_KAFKA_BOOTSTRAP_SEVERS --describe --topic $topic
done

4.4. Create Sink Topics on target Kafka

For the same reason as above, the following script creates two Topics, sink-topic-1 and sink-topic-2, each with 9 partitions:

# Step 5: Create Sink Topics on the target Kafka
for topic in $(IFS=,; echo $SINK_TOPICS_LIST); do
    # create topic
    kafka-topics.sh --bootstrap-server $SINK_KAFKA_BOOTSTRAP_SEVERS --create --topic $topic --replication-factor 3 --partitions 9
    # describe topic
    kafka-topics.sh --bootstrap-server $SINK_KAFKA_BOOTSTRAP_SEVERS --describe --topic $topic
done
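
Since the partition counts on both sides must match, the following optional check is a sketch that compares them pair by pair (it assumes the pairing convention used in this article, i.e. the N-th Source Topic maps to the N-th Sink Topic, and defines a small helper partitions_of inline for illustration):

# optional: compare the partition counts of each Source/Sink topic pair
partitions_of() {
    kafka-topics.sh --bootstrap-server "$1" --describe --topic "$2" \
        | head -n 1 | awk '{for(i=1;i<=NF;i++) if($i=="PartitionCount:") print $(i+1)}'
}
src_topics=($(IFS=,; echo $SOURCE_TOPICS_LIST))
snk_topics=($(IFS=,; echo $SINK_TOPICS_LIST))
for i in "${!src_topics[@]}"; do
    echo "${src_topics[$i]}: $(partitions_of $SOURCE_KAFKA_BOOTSTRAP_SEVERS ${src_topics[$i]}) partitions" \
         "<-> ${snk_topics[$i]}: $(partitions_of $SINK_KAFKA_BOOTSTRAP_SEVERS ${snk_topics[$i]}) partitions"
done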

5. Create Kafka Connect image

The next step is to build a Kafka Connect image with the S3 Sink Connector and S3 Source Connector installed. The image and container are both named kafka-s3-syncer. The specific operations are as follows:

# Step 6: Build the Kafka Connect image
# note: do NOT use current dir as building docker image context dir,
# it is advised to create a new clean dir as image building context folder.
export DOCKER_BUILDING_CONTEXT_DIR="/tmp/kafka-s3-syncer"
mkdir -p $DOCKER_BUILDING_CONTEXT_DIR

# download and unpackage s3 sink connector plugin
wget https://d1i4a15mxbxib1.cloudfront.net/api/plugins/confluentinc/kafka-connect-s3/versions/10.5.4/confluentinc-kafka-connect-s3-10.5.4.zip \
    -O $DOCKER_BUILDING_CONTEXT_DIR/confluentinc-kafka-connect-s3-10.5.4.zip
unzip -o $DOCKER_BUILDING_CONTEXT_DIR/confluentinc-kafka-connect-s3-10.5.4.zip -d $DOCKER_BUILDING_CONTEXT_DIR

# download and unpackage s3 source connector plugin
wget https://d1i4a15mxbxib1.cloudfront.net/api/plugins/confluentinc/kafka-connect-s3-source/versions/2.4.5/confluentinc-kafka-connect-s3-source-2.4.5.zip \
    -O $DOCKER_BUILDING_CONTEXT_DIR/confluentinc-kafka-connect-s3-source-2.4.5.zip
unzip -o $DOCKER_BUILDING_CONTEXT_DIR/confluentinc-kafka-connect-s3-source-2.4.5.zip -d $DOCKER_BUILDING_CONTEXT_DIR

# make dockerfile
cat << EOF > Dockerfile
FROM confluentinc/cp-kafka-connect:7.5.0
# provision s3 sink connector
COPY confluentinc-kafka-connect-s3-10.5.4 /usr/share/java/confluentinc-kafka-connect-s3-10.5.4
# provision s3 source connector
COPY confluentinc-kafka-connect-s3-source-2.4.5 /usr/share/java/confluentinc-kafka-connect-s3-source-2.4.5
EOF

# build image
docker build -t kafka-s3-syncer -f Dockerfile $DOCKER_BUILDING_CONTEXT_DIR
# check if plugin is deployed in container
docker run -it --rm kafka-s3-syncer ls -al /usr/share/java/

6. Configure and start Kafka Connect

Once the image is built, Kafka Connect can be started. Kafka Connect has many configuration items; for details, please refer to its [Official Document]. Note that in the configuration below we use Kafka Connect's built-in message converter JsonConverter; if your input/output format is Avro or Parquet, you need to install the corresponding plug-in separately and set the correct Converter class.
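
As a rough, unverified sketch of what would change for Avro (this article only validates the Json path): the value converter and the S3 format class would typically be switched to the Confluent Avro classes, which additionally requires a Schema Registry and the Avro converter plugin to be installed in the image. The fragment below only indicates where these settings live; the Schema Registry address is a placeholder:

# sketch only -- not validated in this article; class names follow Confluent's documentation
# in docker-compose.yml (environment section):
#   CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
#   CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: http://<your-schema-registry>:8081
# in the connector JSON configs:
#   "value.converter": "io.confluent.connect.avro.AvroConverter",
#   "value.converter.schema.registry.url": "http://<your-schema-registry>:8081",
#   "format.class": "io.confluent.connect.s3.format.avro.AvroFormat"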

# Step 7: Configure and start Kafka Connect
cat << EOF > docker-compose.yml
services:
  kafka-s3-syncer:
    image: kafka-s3-syncer
    hostname: kafka-s3-syncer
    container_name: kafka-s3-syncer
    ports:
      - 8083:8083
    environment:
      CONNECT_BOOTSTRAP_SERVERS: $SOURCE_KAFKA_BOOTSTRAP_SEVERS
      CONNECT_REST_ADVERTISED_HOST_NAME: kafka-s3-syncer
      CONNECT_REST_PORT: 8083
      CONNECT_GROUP_ID: kafka-s3-syncer
      CONNECT_CONFIG_STORAGE_TOPIC: kafka-s3-syncer-configs
      CONNECT_OFFSET_STORAGE_TOPIC: kafka-s3-syncer-offsets
      CONNECT_STATUS_STORAGE_TOPIC: kafka-s3-syncer-status
      CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter
      CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: false
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 3
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 3
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 3
      CONNECT_CONFLUENT_TOPIC_REPLICATION_FACTOR: 3
      CONNECT_PLUGIN_PATH: /usr/share/java
      AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY
EOF
# validate, format and print the yaml with yq
yq . docker-compose.yml
docker compose up -d --wait
docker compose logs -f kafka-s3-syncer
# docker compose down # stop and remove container

After the above script is executed, the command window will no longer return, but will continue to output the container log, so the next step requires opening a new command line window.
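
In the new window, a quick way to confirm that Kafka Connect is up and that both S3 connector plugins were loaded is to query the standard Kafka Connect REST endpoints (remember to run "Step 1: Global configuration" first in any new window):

# confirm the Kafka Connect REST API is reachable
curl -s http://localhost:8083/ | jq
# list installed connector plugins; the S3 sink and S3 source connector classes should both appear
curl -s http://localhost:8083/connector-plugins | jq -r '.[].class' | grep -i s3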

7. Configure and start the S3 Sink Connector

In Section 5, we installed the S3 Sink Connector into the Kafka Connect Docker image, but it still needs to be explicitly configured and started. Open a new command line window, first execute "Step 1: Global configuration" to declare the global variables, and then execute the following script:

# Step 8: Configure and start the S3 Sink Connector
cat << EOF > s3-sink-connector.json
{
  "name": "s3-sink-connector",
  "config": {
    "tasks.max": "1",
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "topics": "$SOURCE_TOPICS_LIST",
    "s3.region": "$REGION",
    "s3.bucket.name": "$S3_BUCKET",
    "s3.part.size": "5242880",
    "flush.size": "1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner"
  }
}
EOF
# validate, format and print the json with jq
jq . s3-sink-connector.json
# delete connector configs if existing
curl -X DELETE localhost:8083/connectors/s3-sink-connector
# submit connector configs
curl -i -X POST -H "Accept:application/json" -H  "Content-Type:application/json" http://localhost:8083/connectors/ -d @s3-sink-connector.json
# the connector starts automatically after creation; restart it explicitly only if needed
curl -X POST localhost:8083/connectors/s3-sink-connector/restart
# check connector status
# very useful! if connector has errors, it will show in message.
curl -s http://localhost:8083/connectors/s3-sink-connector/status | jq
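
If you prefer to wait programmatically rather than re-run the status command by hand, a small polling loop like the sketch below keeps checking until the connector reports RUNNING or FAILED (the same pattern applies to the S3 Source Connector in the next section, with the connector name changed):

# optional: poll the connector status until it is RUNNING (or FAILED)
while true; do
    state=$(curl -s http://localhost:8083/connectors/s3-sink-connector/status | jq -r '.connector.state')
    echo "s3-sink-connector state: $state"
    if [ "$state" = "RUNNING" ] || [ "$state" = "FAILED" ]; then break; fi
    sleep 5
done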

8. Configure and start the S3 Source Connector

Same as above, in the operation in Section 5, we have installed the S3 Source Connector into the Docker image of Kafka Connect. We also need to explicitly configure and start it:

# Step 9: Configure and start the S3 Source Connector
cat << EOF > s3-source-connector.json
{
  "name": "s3-source-connector",
  "config": {
    "tasks.max": "1",
    "connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "confluent.topic.bootstrap.servers": "$SOURCE_KAFKA_BOOTSTRAP_SEVERS",
    "mode": "RESTORE_BACKUP",
    "topics.dir": "topics",
    "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "topic.regex.list": "$TOPIC_REGEX_LIST",
    "transforms": "mapping",
    "transforms.mapping.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.mapping.regex": "$SOURCE_TOPICS_REGEX",
    "transforms.mapping.replacement": "$SINK_TOPICS_REPLACEMENT",
    "s3.poll.interval.ms": "60000",
    "s3.bucket.name": "$S3_BUCKET",
    "s3.region": "$REGION"
  }
}
EOF
# validate, format and print the json with jq
jq . s3-source-connector.json
# delete connector configs if existing
curl -X DELETE localhost:8083/connectors/s3-source-connector
# submit connector configs
curl -i -X POST -H "Accept:application/json" -H  "Content-Type:application/json" http://localhost:8083/connectors/ -d @s3-source-connector.json
# the connector starts automatically after creation; restart it explicitly only if needed
curl -X POST localhost:8083/connectors/s3-source-connector/restart
# check connector status
# very useful! if connector has errors, it will show in message.
curl -s http://localhost:8083/connectors/s3-source-connector/status | jq

At this point, the entire environment has been set up, and a Kafka data export, import, backup, and restore link using S3 as the transfer medium is already running.

9. Testing

Now let's verify that the entire pipeline works properly. First, use kafka-console-consumer.sh to monitor the two Topics source-topic-1 and sink-topic-1, then use a script to continuously write data to source-topic-1. If the same data appears in sink-topic-1, it means the data was successfully exported from source-topic-1 and then imported into sink-topic-1; correspondingly, the persisted data files can also be seen in the S3 bucket.

9.1. Open Source Topic

Open a new command line window, first execute "Step 1: Global configuration" to declare the global variables, and then use the following command to continuously monitor the data in source-topic-1:

# Step 10: Open the Source Topic
kafka-console-consumer.sh --bootstrap-server $SOURCE_KAFKA_BOOTSTRAP_SEVERS --topic ${SOURCE_TOPICS_LIST%%,*}

9.2. Open Sink Topic

Open a new command line window, first execute "Step 1: Global configuration" to declare the global variables, and then use the following command to continuously monitor the data in sink-topic-1:

# Step 11: Open the Sink Topic
kafka-console-consumer.sh --bootstrap-server $SOURCE_KAFKA_BOOTSTRAP_SEVERS --topic ${SINK_TOPICS_LIST%%,*}

9.3. Write data to Source Topic

Open a new command line window, first execute "Step 1: Global configuration" to declare the global variables, and then use the following command to write data to source-topic-1:

# Step 12: Write data to the Source Topic
# download a public dataset
wget https://data.ny.gov/api/views/5xaw-6ayf/rows.json?accessType=DOWNLOAD -O /tmp/sample.raw.json
# extract pure json data
jq -c .data /tmp/sample.raw.json > /tmp/sample.json
# feeding json records to kafka
for i in {1..100}; do
    kafka-console-producer.sh --bootstrap-server $SOURCE_KAFKA_BOOTSTRAP_SEVERS --topic ${SOURCE_TOPICS_LIST%%,*} < /tmp/sample.json
done

9.4. Phenomenon and conclusion

After the above write operation is performed, the written data quickly appears in the command line window monitoring source-topic-1, which shows that the Source-side Kafka has begun to continuously produce data. Shortly afterwards (about 1 minute), the same data appears in the command line window monitoring sink-topic-1, which means the data synchronization to the target side is also working normally. If you open the S3 bucket at this point, you will find a large number of Json files: these files were exported from source-topic-1 by the S3 Sink Connector and stored on S3, then read by the S3 Source Connector and written into sink-topic-1. At this point, the demonstration and verification of the entire solution is complete.
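
To inspect the persisted data files directly from the command line, the sketch below lists the exported objects and prints the start of the first one (the key layout shown in the comment is what the DefaultPartitioner typically produces; exact file names may differ):

# list the exported data files; with the DefaultPartitioner the object keys look roughly like:
#   topics/source-topic-1/partition=0/source-topic-1+0+0000000000.json
aws s3 ls --recursive s3://$S3_BUCKET/topics/ | head -n 20
# fetch the first object and print the beginning of its content
firstKey=$(aws s3api list-objects-v2 --bucket $S3_BUCKET --prefix topics/ \
    --max-items 1 --query 'Contents[0].Key' --output text)
aws s3 cp "s3://$S3_BUCKET/$firstKey" - | head -c 500; echo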

10. Clean up

During the verification process, we may need to adjust and retry multiple times. It is best to restore to the initial state for each retry. The following script will help us clean up all created resources:

# Step 13: Clean up
docker compose down
aws s3 rm --recursive s3://$S3_BUCKET || aws s3 mb s3://$S3_BUCKET
kafka-topics.sh --bootstrap-server $SOURCE_KAFKA_BOOTSTRAP_SEVERS --delete --topic 'sink.*|source.*|kafka-s3-syncer.*|_confluent-command'
kafka-topics.sh --bootstrap-server $SOURCE_KAFKA_BOOTSTRAP_SEVERS --list
kafka-topics.sh --bootstrap-server $SINK_KAFKA_BOOTSTRAP_SEVERS --delete --topic 'sink.*|source.*|kafka-s3-syncer.*'
kafka-topics.sh --bootstrap-server $SINK_KAFKA_BOOTSTRAP_SEVERS --list

11. Summary

This solution is positioned primarily as lightweight and easy to use. The S3 Sink Connector and S3 Source Connector offer many configuration items related to performance and throughput, such as s3.part.size, flush.size, s3.poll.interval.ms, and tasks.max, which readers can adjust according to actual needs. In addition, Kafka Connect can easily be migrated to Kubernetes or Amazon MSK Connect for a clustered deployment.
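
As a hedged illustration only (the values below are arbitrary examples, not recommendations): a bulk migration would typically raise flush.size and tasks.max and shorten s3.poll.interval.ms, roughly as follows in the two connector configs:

# illustrative tuning fragment only; adjust against your own data volume and broker capacity
# S3 Sink Connector (s3-sink-connector.json):
#   "tasks.max": "4",
#   "flush.size": "1000",           # records per S3 object, instead of one file per record
#   "s3.part.size": "26214400",     # 25 MB multipart upload part size
# S3 Source Connector (s3-source-connector.json):
#   "tasks.max": "4",
#   "s3.poll.interval.ms": "10000"  # scan S3 every 10 seconds instead of every 60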


Appendix: Common Mistakes

Question 1: Error when starting Kafka Connect: java.lang.NoSuchMethodError: 'void org.apache.kafka.connect.util.KafkaBasedLog.send

This problem was found with confluentinc-kafka-connect-s3-source-2.5.7 + kafka-connect-7.5.0. A NoSuchMethodError is generally caused by multiple components depending on different versions of the same Jar, with the lower version ultimately being loaded. Because Kafka Connect provides only limited log information, the specific offending Jar could not be identified. Downgrading confluentinc-kafka-connect-s3-source to 2.4.5 solved the problem.

Question 2: Error when starting the S3 Source Connector: java.lang.IllegalArgumentException: Illegal group reference

This problem is caused by a configuration error: transforms.mapping.replacement in the S3 Source Connector was incorrectly set to sink-topic-$(1). Regex group references take the form $0, $1, …, not $(0), $(1), …; changing the value to sink-topic-$1 solves the problem.

Appendix: References

Amazon S3 Sink Connector Official Documentation

Amazon S3 Source Connector official documentation

Kafka Connect Transformations :: RegexRouter


Origin blog.csdn.net/bluishglc/article/details/132826681