In-depth understanding of Kafka series (6) - Kafka data pipeline

Series Article Directory

Kafka: The Definitive Guide series articles

Preface

This series contains my notes and thoughts after reading the book "Kafka: The Definitive Guide".

Main text

Kafka essentially acts as a data pipeline; it is positioned as middleware. Data flows into Kafka, Kafka manages it, and the data then flows out of Kafka to wherever it is needed. Acting as a large buffer between the various segments of the data pipeline, Kafka effectively decouples the producers and consumers of the pipeline's data. Next, let's talk about Kafka Connect and walk through some examples.

Kafka Connect

Connect is part of Kafka; it provides a reliable and scalable way to move data between Kafka and external data storage systems.

  1. Connect runs as a cluster of worker processes, and connector plug-ins are installed on these workers.
  2. The connectors are then managed and configured through a REST API. The worker processes run as long-running jobs.
  3. A connector starts additional tasks, which make effective use of the worker nodes' resources and move large amounts of data in parallel.
  4. A source connector is responsible for reading data from the source system and handing the data objects to the worker processes.
  5. A sink connector is responsible for getting data from the worker processes and writing it to the target system.

Run Connect

Step 1: modify the configuration file (in the kafka/config directory)

vi connect-distributed.properties 

The Connect process generally has three important configuration parameters (here we only need to modify bootstrap.servers):

  1. bootstrap.servers

This parameter lists the broker servers that will work with Connect, and the connector will write data to these brokers or read data from them.

  2. group.id

Workers with the same group.id belong to the same Connect cluster.

  3. key.converter and value.converter

Connect can process data stored in Kafka in different formats. These two parameters specify the converters used for the keys and values of messages. The JsonConverter provided by Kafka is used by default.
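
As a reference, the relevant lines in config/connect-distributed.properties look roughly like the sketch below; the broker address is just an example, and the exact defaults may differ slightly between Kafka versions.

# connect-distributed.properties (excerpt, values are examples)
bootstrap.servers=192.168.237.130:9092
# workers sharing this id form one Connect cluster
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# whether the JSON converter embeds a schema in every record
key.converter.schemas.enable=true
value.converter.schemas.enable=true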

Step 2: Start Connect in the background (distributed mode, but on a single machine here)

 ./bin/connect-distributed.sh config/connect-distributed.properties &

Step 3: Verify whether the startup is successful

1. Verify that the port is open:

netstat -nlp | grep :8083

2. Verify that it is running normally via REST API:

curl http://192.168.237.130:8083/
curl http://192.168.237.130:8083/connector-plugins

If it is normal, it will return the current Connect version number and other information.
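
The REST API on port 8083 is also what we will use to manage connectors later. A few endpoints worth knowing (a sketch; adjust the host and port to your environment, and replace <connector-name> with a real connector name):

curl http://192.168.237.130:8083/connectors
curl http://192.168.237.130:8083/connectors/<connector-name>/status
curl http://192.168.237.130:8083/connectors/<connector-name>/config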


Connect Demo1: File source and file sink

After making sure that ZooKeeper, Kafka, and Connect are all running, let's work through the example.

1. Step 1: Start a file source connector. For convenience, have it read one of Kafka's own configuration files, i.e., send the contents of that configuration file to a topic.

echo '{"name":"load-kafka-config","config":{"connector.class":"FileStreamSource","file":"config/server.properties","topic":"kafka-config-topic"}}' | curl -X POST -d @- http://192.168.237.130:8083/connectors --header "content-Type:application/json"

This JSON fragment contains the connector's name, load-kafka-config, and the connector's configuration (including the connector class, the file to be read, and the topic name).
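
For readability, here is the same request body pretty-printed (this is just the JSON from the command above, not a new configuration):

{
  "name": "load-kafka-config",
  "config": {
    "connector.class": "FileStreamSource",
    "file": "config/server.properties",
    "topic": "kafka-config-topic"
  }
}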

2. Step 2: Use the Kafka console consumer to verify that the configuration file has been loaded into the topic.

 ./bin/kafka-console-consumer.sh --bootstrap-server=192.168.237.130:9092 --topic kafka-config-topic --from-beginning

Result: If the following result appears, it means success.

1. The output above is the content of config/server.properties, which the connector converts line by line into JSON records and sends to the kafka-config-topic topic.
2. By default, the JSON converter attaches a schema to each record. The schema here is very simple: there is only a payload field of type string, which contains one line of the file.
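
Each record in the topic therefore looks roughly like this (a sketch of the JsonConverter output with schemas.enable=true; the payload line will of course vary):

{"schema":{"type":"string","optional":false},"payload":"broker.id=0"}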

3. Step 3: Use a file sink connector to export the content of the topic into a file.

echo '{"name":"dump-kafka-config1","config":{"connector.class":"FileStreamSink","file":"copyServerProperties","topics":"kafka-config-topic"}}' | curl -X POST -d @- http://192.168.237.130:8083/connectors --header "content-Type:application/json"

Result: output like the following indicates success.

1. First of all, the configuration has changed this time: the class name FileStreamSink replaces FileStreamSource, and the file property points to the target file rather than the source file. We also specify topics instead of topic: a sink can write multiple topics into one file, whereas a source is only allowed to write to a single topic.
2. If everything is normal, you will get a file called copyServerProperties whose content is exactly the same as server.properties.
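
A quick way to check this (assuming you started the worker from the Kafka directory, so the sink wrote copyServerProperties there):

diff config/server.properties copyServerProperties && echo "files are identical"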


If you run the command again, you will get an error message saying that the connector already exists.
Delete the connector:

curl -X DELETE http://192.168.237.130:8083/connectors/dump-kafka-config1

View connectors:

curl -X GET http://192.168.237.130:8083/connectors/
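
Besides GET and DELETE, the same REST API can pause, resume, and restart a connector, which is handy when debugging (a sketch; replace <connector-name> with your own connector name):

curl -X PUT  http://192.168.237.130:8083/connectors/<connector-name>/pause
curl -X PUT  http://192.168.237.130:8083/connectors/<connector-name>/resume
curl -X POST http://192.168.237.130:8083/connectors/<connector-name>/restart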

At this point, we have used Kafka's built-in connectors to write the content of a file into a topic, and then to write the data from the topic out to a target file.

Note: a source connector can only write to a single topic, whereas a sink connector can read from multiple topics and write them to a single target (many-to-one).


Connect Demo2: From MySQL to Kafka

Step 1: Get the relevant jar packages (I built them with Maven myself, which is rather troublesome, so I'll just give you the download link directly). My Baidu Cloud link password: 6eq9


Step 2: Upload the jar packages to the kafka/libs directory, and then restart Connect.

 ./bin/connect-distributed.sh config/connect-distributed.properties &

Verify plugin:

curl http://192.168.237.130:8083/connector-plugins

result:


Step 3: Create a table in MySQL (installing and starting MySQL is not covered here)

create database test;
use test;
create table login(username varchar(30),login_time datetime);
insert into login values('test1',now());
insert into login values('test2',now()); 
commit;

Then create and configure the JDBC connector (see the official configuration documentation and examples for details):

echo '{"name":"mysql-login-connector","config":{"connector.class":"JdbcSourceConnector","connection.url":"jdbc:mysql://localhost:3306/test?user=root&useSSL=true","connection.password":"ljj000","mode":"timestamp","table.whitelist":"login","validate.non.null":false,"timestamp.column.name":"login_time","topic.prefix":"mysql."}}' | curl -X POST -d @- http://192.168.237.130:8083/connectors --header "content-Type:application/json"

Parameters:

  1. table.whitelist: the table whitelist, i.e., the tables to monitor.
  2. topic.prefix: the topic prefix; combined with the table name it forms the final topic.
  3. connector.class: the connector class to use.
  4. connection.url: the database connection address.
  5. timestamp.column.name: the name of the timestamp column to use.
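
Here is the same connector configuration pretty-printed (the connection URL, password, and topic prefix are of course specific to my environment). As I understand it, "mode": "timestamp" tells the JDBC source to detect new and modified rows incrementally based on the timestamp column.

{
  "name": "mysql-login-connector",
  "config": {
    "connector.class": "JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/test?user=root&useSSL=true",
    "connection.password": "ljj000",
    "mode": "timestamp",
    "table.whitelist": "login",
    "validate.non.null": false,
    "timestamp.column.name": "login_time",
    "topic.prefix": "mysql."
  }
}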

I ran into a big pitfall here, so let me help you avoid it:

  1. Make sure the MySQL version matches the version of the mysql-connector-java jar. (At first I used MySQL 8+ on the virtual machine, and later switched to a local MySQL 5.7 instance. My jar is version 5.1.37, which certainly cannot work with MySQL 8.)
  2. Make sure the related jar packages have been placed under kafka/libs.

If the following words appear, it means success:
Verify with the Kafka console consumer:

./bin/kafka-console-consumer.sh --bootstrap-server=192.168.237.130:9092 --topic mysql.login --from-beginning

At this point, we have successfully imported the data from MySQL into a Kafka topic.


Deep understanding of Connect

To understand the working principle of Connect, we must first know a few basic concepts and how they interact.

Connectors and tasks

The connector is responsible for three things:

  1. Deciding how many tasks need to run.
  2. Deciding how to split the data copying work between the tasks.
  3. Getting task configurations from the worker processes and passing them along.

Example:
1. The JDBC connector connects to the database and counts the tables that need to be replicated, which determines how many tasks are needed.
2. It then chooses the smaller of the tasks.max configuration parameter and the number of tables as the number of tasks.
3. Once the number of tasks is determined, the connector generates a configuration for each task. The configuration includes the connector's configuration items, such as connection.url, and the tables each task needs to copy.
4. The taskConfigs() method returns a list of these mappings. The worker processes are responsible for starting and configuring the tasks, and each task copies only the tables specified in its configuration.
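
You can actually see the task configurations a connector generated through the REST API, which makes this easier to picture (a sketch; replace the connector name with one of your own, e.g. mysql-login-connector from the demo above):

# list the tasks of a connector and the configuration generated for each of them
curl http://192.168.237.130:8083/connectors/mysql-login-connector/tasks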

Task:

  1. Responsible for moving data into or out of Kafka.

1. When a task is initialized, it receives a context allocated by the worker process. The source context (SourceContext) contains an object that lets the task store the offsets of source system records (such as the primary key ID of a data table).
2. Just as the source system has a context, the target system also has one; it provides methods that the connector uses to manipulate the records it gets from Kafka.
3. After initialization, the task starts working according to the configuration specified by the connector.
4. A source task polls the external system and returns some records, which the worker process sends to Kafka.
5. A sink task receives records from Kafka through the worker process and writes them to the external system.

worker process

  1. The worker process acts as a container for connectors and tasks and is responsible for handling HTTP requests, which are used to define connectors and their configurations.
  2. It is also responsible for saving connector configurations, starting the connectors and their tasks, and passing the configurations along.

It can be understood this way: connectors and tasks are responsible for moving the data, while worker processes are responsible for the REST API, configuration management, reliability, high availability, and load balancing.
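
In practice, high availability and load balancing simply mean starting more workers with the same group.id against the same Kafka cluster; connectors and tasks are then rebalanced across them. A minimal sketch, assuming Kafka is installed at the same path on a second machine:

# on another machine, with the same bootstrap.servers and group.id in the config
./bin/connect-distributed.sh config/connect-distributed.properties &
# if a worker crashes, its connectors and tasks are reassigned to the remaining workers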

Converters and Connect's data model

Connect provides a set of data APIs, including data objects and schemas for describing data.

Example:
1. The JDBC connector reads a field from the database and creates a Connect Schema object based on the field's data type.
2. It then uses this Schema object to create a Struct that contains all of the record's fields.
3. A source connector reads events from the source system and generates a schema and a value for each event (the value is the data object itself).
4. A sink connector does the opposite: it gets a schema and a value, uses the schema to parse the value, and writes it to the target system.

The converter is the step that data goes through when it flows into or out of Kafka; for example, a data object is converted into JSON before being written to Kafka.
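
The converter is chosen in the worker configuration (and, in newer versions, can be overridden per connector). For example, to store plain JSON without the embedded schema, something like the following could be set in connect-distributed.properties (a sketch; restart the workers after changing it):

value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false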

To summarize briefly:

  1. The connector is responsible for obtaining task configurations and moving data into and out of Kafka.
  2. The connector generates the tasks, and once the tasks start, the data starts moving.
  3. While the data is being transferred, it passes through the converter, which turns it into a storable format such as a String or JSON.

Offset management

In addition to providing REST API services, the worker process also provides offset management services, allowing the connector to know which data has been processed.

  1. The records returned by a source connector actually include a source system partition and offset, and the worker sends these records to Kafka (this ties back to the task context mentioned above).

For example:
1. For a file source, the partition can be a file, and the offset can be a line number or character position in the file.
2. For a JDBC source, the partition can be a data table, and the offset can be the primary key of a record.

  2. If Kafka confirms that the records were written successfully, the worker process saves the offsets, usually in a Kafka topic.

  3. Sink connectors do the opposite: they read records containing topic, partition, and offset information from Kafka and then call the connector's put() method, which saves the records to the target system.

  4. If the save is successful, the connector commits the offsets back to Kafka through the consumer client.
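
In distributed mode these offsets (along with connector configs and status) live in internal Kafka topics whose names are set in connect-distributed.properties. A sketch of the defaults from the sample config, and how to peek at the stored source-connector offsets (the topic names may differ if you changed them):

# connect-distributed.properties (defaults in the sample config)
# offset.storage.topic=connect-offsets
# config.storage.topic=connect-configs
# status.storage.topic=connect-status

./bin/kafka-console-consumer.sh --bootstrap-server=192.168.237.130:9092 --topic connect-offsets --from-beginning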


Summary

This article covered several main points:

  1. The Connect connectors that come with Kafka.
  2. Two demos: from a source file through Kafka to a destination file, and from MySQL to Kafka.
  3. The related concepts in Connect and how they fit together: worker processes, connectors, converters, and tasks.

The next article will cover Kafka administration.


Origin blog.csdn.net/Zong_0915/article/details/109676679