Detailed explanation and application practice of Kafka Connect

1. Introduction

Kafka Connect is a tool for importing data into and exporting data out of Kafka.
It connects external systems (such as MySQL, HDFS, etc.) with Kafka, enabling data to flow between different systems.

Kafka Connect has the following advantages:

  • Extensibility: Kafka Connect supports custom Connectors, and users can connect to more data sources by writing their own Connectors.
  • Reliability: Kafka Connect ensures data reliability by using the data replication mechanism provided by Kafka itself.
  • Ease of use: Kafka Connect provides a large number of Connectors and corresponding configuration files, allowing users to get started quickly.

Kafka Connect is suitable for the following scenarios:

  • Data migration: move data from relational databases into Kafka so it can be processed in a unified way.
  • Offline analysis of data: offline tasks obtain data from Kafka for analysis.
  • Real-time computation: streaming jobs consume data from Kafka for computation.

2. Configuration

Kafka Connect must be configured before it can run. The following is an example of a connector configuration file:

name=kafka-connect-example
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
topics=my-topic
file=/opt/kafka/sinks/my-file.txt

This configuration writes the data from my-topic to the file /opt/kafka/sinks/my-file.txt. Here, name is the name of the Connector, connector.class is the Connector class to use, tasks.max is the maximum number of tasks that may run concurrently, topics is the Kafka topic to consume from, and file is the path of the output file.
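
To try this out, the connector can be run with a standalone worker (assuming the properties above are saved as kafka-connect-example.properties and the stock worker configuration shipped with Kafka is used):

$ connect-standalone config/connect-standalone.properties kafka-connect-example.properties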

3. Development API introduction

3.1 Working principle

Kafka Connect is a framework for connecting Kafka clusters with external systems. It can import data from external systems into Kafka topics and export data from Kafka topics to external systems. The core parts of the framework are the Connector and the Task: the Connector defines the logic for importing or exporting data, and Tasks are the processing units that actually move the data once the Connector is instantiated.

3.2 Commonly used Connector types (Source Connector, Sink Connector)

Kafka Connect provides two types of Connectors: Source Connectors and Sink Connectors. A Source Connector imports data from an external system into Kafka, and a Sink Connector exports data from Kafka to an external system. Because Connectors are built against a common interface, it is easy to implement a custom Connector.

3.3 How to write a custom Connector

To write a custom Connector, you extend the abstract class org.apache.kafka.connect.connector.Connector (in practice, one of its subclasses), which has 4 main methods:

  • start(Map<String, String> props)
  • stop()
  • taskClass()
  • config()

Among them, the start() method is called when the Connector starts, the stop() method is called when the Connector stops, the taskClass() method returns the Task class corresponding to the Connector, and the config() method returns a ConfigDef describing the Connector's configuration options.

In addition, a Sink Connector extends the org.apache.kafka.connect.sink.SinkConnector class, and a Source Connector extends the org.apache.kafka.connect.source.SourceConnector class.
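
As an illustration, here is a minimal sketch of a custom Source Connector. Everything specific in it is made up for the example: the class names, the poll.interval.ms option, and the do-nothing Task. Note that taskConfigs() and version() must be implemented as well:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class MySourceConnector extends SourceConnector {
    private Map<String, String> props;

    @Override
    public void start(Map<String, String> props) {
        this.props = props;           // called once when the connector starts
    }

    @Override
    public void stop() {
        // release any resources acquired in start()
    }

    @Override
    public Class<? extends Task> taskClass() {
        return MySourceTask.class;    // the Task class that actually moves data
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            configs.add(props);       // in this sketch every task gets the same config
        }
        return configs;
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef()
            .define("poll.interval.ms", ConfigDef.Type.LONG, 5000L,
                    ConfigDef.Importance.MEDIUM, "How often to poll the source");
    }

    @Override
    public String version() {
        return "1.0";
    }

    // A trivial Task: a real implementation would read from the external system
    public static class MySourceTask extends SourceTask {
        @Override public String version() { return "1.0"; }
        @Override public void start(Map<String, String> props) { }
        @Override public List<SourceRecord> poll() throws InterruptedException {
            Thread.sleep(5000);       // nothing to read in this sketch
            return null;              // null means "no records this time"
        }
        @Override public void stop() { }
    }
}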

Kafka Connect also provides some ready-made Connectors, such as JDBC Connector, HDFS Connector, etc., which can be used directly.

4. Practical cases

This article introduces three practical cases of Kafka Connect: data synchronization, real-time database backup, and data stream transformation.

4.1 Data Synchronization Case

In the case of data synchronization, we use Kafka Connect to synchronize data between two Kafka clusters. The specific steps are as follows:

Step 1: Create a Kafka Connect connector configuration file

We need a Kafka Connect environment that can reach both the source and the target Kafka clusters, and a connector configuration file, for example:

name=kafka-connect-replicator
connector.class=io.confluent.connect.replicator.ReplicatorSourceConnector
config.action.reload=restart
tasks.max=1
src.kafka.bootstrap.servers=source-kafka:9092
dest.kafka.bootstrap.servers=target-kafka:9092
topic.whitelist=some-topic

The configuration above sets the connector name, the connector class (here Confluent's ReplicatorSourceConnector), the number of tasks, the bootstrap servers of the source and target Kafka clusters, and the topics to be synchronized.

Step 2: Start the Kafka Connect connector

We then start a Kafka Connect worker with this connector configuration by entering the following command in the shell:

$ connect-standalone connect-standalone.properties kafka-connect-replicator.properties

Step 3: Perform data synchronization

Data synchronization will be performed between the source Kafka cluster and the target Kafka cluster, and the topic to be synchronized is specified through the topic.whitelist parameter in the connector configuration file. Data synchronization occurs automatically after the connector is started.
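
To verify that replication is working, the replicated topic can be consumed on the target cluster (host and topic names match the configuration above):

$ kafka-console-consumer --bootstrap-server target-kafka:9092 --topic some-topic --from-beginning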

4.2 Case of real-time database backup

In the real-time database backup case, we use Debezium to capture the change events of the MySQL database in real time and persist them to the Kafka cluster. Specific steps are as follows:

Step 1: Download and configure Debezium

We need to download and configure Debezium in the system first. For details, please refer to the official documentation.

Step 2: Create a Kafka Connect connector configuration file

Next, we need to create a connector configuration file to set the information about Debezium connecting to the MySQL database and Kafka cluster, for example:

name=mysql-connector
connector.class=io.debezium.connector.mysql.MySqlConnector
tasks.max=1
database.hostname=mysql-source
database.port=3306
database.user=debezium
database.password=dbz
database.server.id=184054
database.server.name=my-app-connector
database.whitelist=mydb
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=my-app-connector-history

The configuration above sets the connector name, the connector class (here Debezium's MySqlConnector), the number of tasks, the MySQL host name, port, user name and password, a unique server ID and logical server name, the database to back up, and the Kafka topic that stores the schema history.

Step 3: Start the Kafka Connect connector

We need to enter the following command in the shell to start the Kafka Connect connector:

$ connect-standalone connect-standalone.properties mysql-connector.properties

Step 4: Perform the database backup

After the connector starts, change events in the MySQL database are automatically captured and persisted to the Kafka cluster.
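
Debezium writes change events to topics named <server.name>.<database>.<table>. For example, changes to a hypothetical customers table in mydb could be inspected with:

$ kafka-console-consumer --bootstrap-server kafka:9092 --topic my-app-connector.mydb.customers --from-beginning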

4.3 Data Stream Transformation Case

In the data stream transformation case, we use a Kafka Connect Single Message Transform (SMT) to flatten nested JSON records as they pass through a connector. Specific steps are as follows:

Step 1: Prepare the transform

The Flatten transform used below ships with Apache Kafka, so it normally needs no separate download; third-party transforms must be placed on the worker's plugin.path. For details, please refer to the official documentation.

Step 2: Create a Kafka Connect connector configuration file

Next, we need to create a connector configuration file that attaches the transform to a connector, for example:

name=json-transformer
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
topics=json-topic
file=/opt/kafka/sinks/flattened.txt
transforms=flatten
transforms.flatten.type=org.apache.kafka.connect.transforms.Flatten$Value

The configuration above attaches the built-in Flatten$Value transform to a file sink connector; the topic name json-topic and the output path are illustrative. Every record value consumed from the topic is flattened before it is written out.

Step 3: Start the Kafka Connect connector

We need to enter the following command in the shell to start the Kafka Connect connector:

$ connect-standalone connect-standalone.properties json-transformer.properties

Step 4: Perform data flow conversion

After the connector starts, records consumed from the topic are automatically flattened before they reach the sink.
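
For example, with the default "." delimiter, a nested record value such as

{"user": {"id": 1, "name": "alice"}}

would be flattened to

{"user.id": 1, "user.name": "alice"}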

5. Kafka Connect performance optimization

5.1 How to evaluate the performance of Kafka Connect application

The performance of Kafka Connect depends on many aspects, including but not limited to the following factors:

  • Complexity of Connector Implementation
  • Network bandwidth and latency for data transfers
  • Hardware specifications and configuration of Kafka cluster
  • Number of threads for consumers and producers
  • Batch size, interval, and cache size

The performance of Kafka Connect applications can be measured through the following indicators:

  • Throughput and latency of connector tasks
  • Latency for configuration changes
  • Memory usage

5.2 Optimizing Data Transmission Efficiency and Throughput

Optimizing data transmission efficiency and throughput can start from the following aspects:

5.2.1 Increase batch size and cache size

Setting the batch size and cache size too small will result in frequent data submissions and increase network overhead. A good value can usually be found by gradually increasing the batch size and cache size.
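
In a Connect worker, the embedded producer and consumer clients can be tuned through producer.- and consumer.-prefixed overrides in the worker configuration file. The values below are illustrative starting points, not recommendations:

producer.batch.size=65536
producer.linger.ms=50
producer.buffer.memory=67108864
consumer.fetch.min.bytes=65536
consumer.fetch.max.wait.ms=500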

5.2.2 Increase the number of tasks and workers

Raising a connector's tasks.max, and adding worker nodes in distributed mode, increases the parallelism of data transfer and thereby the throughput. When increasing parallelism, pay attention to the physical resource limits of the Kafka Connect nodes, otherwise the extra load may undermine the stability of the system.

5.2.3 Using compression algorithms

For scenarios that transfer large amounts of data, consider enabling compression on the embedded producer. Kafka supports multiple compression algorithms, including gzip, snappy, lz4, and zstd.
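
For example, a single worker-level override enables lz4 compression for all data the worker produces to Kafka:

producer.compression.type=lz4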

5.3 Implementing a data caching mechanism

A data caching (buffering) mechanism reduces the number of network round trips and improves system throughput. Caching can be implemented in the following ways (a sketch follows the list):

  • Increase the batch size of the connector worker
  • Cache on the data source side, such as setting a read cache on the database side or using Redis cache
  • Configure memory cache on Kafka Connect nodes to balance memory usage and latency
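
As one concrete possibility, a custom SinkTask can buffer records in memory in put() and write them out in bulk when the framework calls flush(). This is a minimal sketch, not a production implementation; the bulk write is only simulated with a log line:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BufferingSinkTask extends SinkTask {
    private final List<SinkRecord> buffer = new ArrayList<>();

    @Override
    public String version() {
        return "1.0";
    }

    @Override
    public void start(Map<String, String> props) {
        // read buffer-related settings from props if needed
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        buffer.addAll(records);   // cache records instead of writing them one by one
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // a real task would push the whole batch to the external system here
        System.out.println("flushing " + buffer.size() + " buffered records");
        buffer.clear();
    }

    @Override
    public void stop() {
        buffer.clear();
    }
}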

6. Application of Kafka Connect in production

6.1 High availability cluster deployment

Kafka Connect provides a distributed mode for deployment, and high availability can be achieved by running multiple Connect worker nodes as one group. One of the workers is elected leader and is responsible for assigning connectors and tasks, while the other workers execute the work assigned to them; if a worker fails, its tasks are rebalanced onto the remaining workers.

When deploying a high availability cluster, there are a few things to consider:

  • Make sure that all workers belonging to the same Connect cluster share the same group.id (different Connect clusters must use different ones), and set bootstrap.servers in each worker's configuration file to the broker addresses of the Kafka cluster so that every node can connect to Kafka;
  • Configure the communication mechanism between nodes, including the protocol, port, and authentication method of the REST interface;
  • Specify offset.storage.topic, config.storage.topic, and status.storage.topic in the configuration file so that all workers share the same offsets, connector configurations, and status information (a sample worker configuration follows this list);
  • A reverse proxy or load balancer can be used to distribute requests from external clients for better load balancing and failover.
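
A minimal distributed worker configuration might look like the following; the host names, topic names, and group id are illustrative:

# connect-distributed.properties
bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
offset.storage.replication.factor=3
config.storage.replication.factor=3
status.storage.replication.factor=3
listeners=http://0.0.0.0:8083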

6.2 Monitoring and alarming

Kafka Connect supports monitoring and management using JMX. By connecting to the JMX port of the Connect worker node, you can view information such as running status, performance indicators, and log output in real time. At the same time, Kafka Connect can also integrate third-party monitoring tools, such as Prometheus and Grafana, to achieve more comprehensive monitoring and alarming.

When monitoring and alarming, you need to pay attention to the following aspects:

  • Health status: including whether the node is alive, whether the connection is normal, task execution status, etc.;
  • Performance indicators: including processing speed, delay, load, etc.;
  • Error information: including connection errors, data format errors, task failures, etc.;
  • Log output: including standard output and error output.

Here is a sketch of a Java client that connects to the JMX port of a running Connect worker and reads a worker metric. It assumes the worker was started with JMX enabled on port 10010 (for example via the KAFKA_JMX_OPTS environment variable); connect-worker-metrics is the worker-level metrics group exposed by Connect:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ConnectJmxCheck {
    public static void main(String[] args) throws Exception {
        // URL of the worker's JMX endpoint (the port is illustrative)
        String jmxUrl = "service:jmx:rmi:///jndi/rmi://localhost:10010/jmxrmi";
        JMXServiceURL serviceUrl = new JMXServiceURL(jmxUrl);
        try (JMXConnector jmxConnector = JMXConnectorFactory.connect(serviceUrl)) {
            MBeanServerConnection mbeanConn = jmxConnector.getMBeanServerConnection();
            // worker-level metrics, e.g. the number of tasks running on this worker
            ObjectName workerMetrics = new ObjectName("kafka.connect:type=connect-worker-metrics");
            Object taskCount = mbeanConn.getAttribute(workerMetrics, "task-count");
            System.out.println("task-count = " + taskCount);
        }
    }
}

6.3 Log Management

The log output of Kafka Connect can be divided into the following categories:

  • Error log: records errors during the startup and operation of the Connect worker;
  • Information log: records connection status, task status, configuration updates, and other messages;
  • Debug log: records more detailed debugging information, such as the message sending, processing, and conversion process.

When doing log management, you need to consider the following points:

  • Make sure the log output level is set properly to avoid too much or too little output;
  • Configure an appropriate log rotation policy and size limit to avoid the impact of large log files on performance;
  • More detailed log analysis and visualization can be achieved using third-party tools or libraries.
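
As an example of a rotation policy, the log4j configuration shipped with Kafka (config/connect-log4j.properties) can use a size-based rolling appender; the file path and limits below are illustrative:

log4j.rootLogger=INFO, connectAppender
log4j.appender.connectAppender=org.apache.log4j.RollingFileAppender
log4j.appender.connectAppender.File=/var/log/kafka/connect.log
log4j.appender.connectAppender.MaxFileSize=100MB
log4j.appender.connectAppender.MaxBackupIndex=10
log4j.appender.connectAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.connectAppender.layout.ConversionPattern=[%d] %p %m (%c)%n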
