Dry goods | Debezium: efficient real-time synchronization from MySQL to Elasticsearch

Preface

A question from the Elasticsearch Chinese community:

The tables in MySQL have neither a unique auto-incrementing field nor a unique time field. How can Logstash import incremental data from MySQL into ES in real time?

Both Logstash and kafka-connector only support incremental synchronization based on an auto-incrementing id or an update-timestamp field.

Back to the question itself: if the table has no such field, what can you do?

This article discusses the problem and gives a solution.

1. Understanding binlog

1.1 What is binlog?

Binlog is a binary log maintained at the MySQL server layer. It is completely different from the redo/undo logs of the InnoDB engine: it records SQL statements that update (or may update) MySQL data, and it is saved on disk in units of transactions.

Its main uses are:

  • 1) Replication: keeping master and slave data consistent.
  • 2) Data recovery: recovering data with the mysqlbinlog tool.
  • 3) Incremental backup.
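
For example, the mysqlbinlog tool mentioned in 2) can decode and replay the recorded events. A minimal recovery sketch, assuming shell access to the MySQL machine (the binlog file name is an example):

# Inspect the events recorded in a binlog file.
mysqlbinlog --verbose mysql-bin.000001 | less

# Replay a binlog file against the server to recover data.
mysqlbinlog mysql-bin.000001 | mysql -uroot -p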

1.2 Alibaba's Canal implements incremental MySQL synchronization

(Figure: Canal architecture)

A picture is worth a thousand words. Canal is a Java middleware based on parsing a database's incremental log; it provides incremental data subscription and consumption.
At present Canal mainly supports MySQL binlog parsing, and a canal client processes the parsed data. Purpose: incremental data subscription & consumption.

In short, binlog breaks through the limitation of Logstash and kafka-connector that an auto-incrementing id or timestamp field must exist, and makes incremental synchronization possible.

2. Synchronization methods based on binlog

1) Debezium, an open source project based on Kafka Connect: https://debezium.io/

2) Maxwell, an open source standalone application with no third-party dependencies: http://maxwells-daemon.io/

Since Confluent (a Kafka distribution that bundles ZooKeeper, Kafka, KSQL, Kafka Connect, and more) was already deployed in my environment, this article covers Debezium only.

3. Introduction to Debezium

Debezium is an open source distributed platform for capturing data changes in real time. It captures inserts, updates, and deletes from data sources (MySQL, MongoDB, PostgreSQL) and synchronizes them to Kafka in real time, with strong stability and very high speed.

Features:

  • 1) Simple. No changes to the application are needed, and it can serve external consumers.
  • 2) Stable. Tracks every change to every row.
  • 3) Fast. Built on Kafka; scalable and officially verified to handle large data volumes.

4. Synchronization architecture

(Figure: synchronization architecture: MySQL → Debezium → Kafka → kafka-connector → Elasticsearch)

As the figure shows, the MySQL-to-ES synchronization strategy takes an indirect, two-step route.

Step 1: Based on Debezium's binlog mechanism, synchronize MySQL data to Kafka.

Step 2: Based on the kafka-connector mechanism, synchronize the Kafka data to Elasticsearch.

5. Debezium: real-time synchronization of MySQL inserts, updates, and deletes to ES

Software versions:

confluent: 5.1.2
Debezium: 0.9.2.Final
MySQL: 5.7.x
Elasticsearch: 6.6.1

5.1 Debezium installation

For the installation and deployment of Confluent, see http://t.cn/Ef5poZk; it is not repeated here.

Installing Debezium only requires unpacking the debezium-connector-mysql archive into Confluent's plugin directory (share/java).

Download link of MySQL Connector plugin compressed package:

https://debezium.io/docs/install/

Note: restart Confluent for Debezium to take effect.
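
A minimal installation sketch, assuming Confluent is installed under /home/confluent-5.1.0 (both the path and the Maven download URL are assumptions; adjust to your environment):

# Download the Debezium MySQL connector plugin archive.
wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-mysql/0.9.2.Final/debezium-connector-mysql-0.9.2.Final-plugin.tar.gz

# Unpack it into Confluent's plugin directory.
tar -xzf debezium-connector-mysql-0.9.2.Final-plugin.tar.gz -C /home/confluent-5.1.0/share/java

# Restart Confluent so the new plugin is picked up.
confluent stop && confluent start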

5.2 MySQL binlog and related configuration

Debezium uses MySQL's binlog mechanism to monitor data changes, so binlog must be configured in MySQL beforehand.

The core configuration is as follows; add it under [mysqld] in /etc/my.cnf on the MySQL machine:

[mysqld]

server-id         = 223344
log_bin           = mysql-bin
binlog_format     = row
binlog_row_image  = full
expire_logs_days  = 10

Then restart MySQL for binlog to take effect:

systemctl restart mysqld.service
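
To confirm the settings took effect, a quick check (assuming root access to the MySQL instance):

mysql -uroot -p -e "SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format', 'binlog_row_image');"
# Expected: log_bin = ON, binlog_format = ROW, binlog_row_image = FULL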

5.3 Configure the connector.

Go to the etc directory of the Confluent installation path.

Create a folder:

mkdir kafka-connect-debezium

Store the connector configuration in mysql2kafka_debezium.json:

[root@localhost kafka-connect-debezium]# cat mysql2kafka_debezium.json
{
        "name" : "debezium-mysql-source-0223",
        "config":
        {
             "connector.class" : "io.debezium.connector.mysql.MySqlConnector",
             "database.hostname" : "192.168.1.22",
             "database.port" : "3306",
             "database.user" : "root",
             "database.password" : "XXXXXX",
             "database.whitelist" : "kafka_base_db",
             "table.whitelist" : "accounts",
             "database.server.id" : "223344",
             "database.server.name" : "full",
             "database.history.kafka.bootstrap.servers" : "192.168.1.22:9092",
             "database.history.kafka.topic" : "account_topic",
             "include.schema.changes" : "true",
             "incrementing.column.name" : "id",
             "database.history.skip.unparseable.ddl" : "true",
             "transforms": "unwrap,changetopic",
             "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
             "transforms.changetopic.type": "org.apache.kafka.connect.transforms.RegexRouter",
             "transforms.changetopic.regex": "(.*)",
             "transforms.changetopic.replacement": "$1-smt"
        }
}

Note the following configuration items:

  1. "database.server.id" corresponds to the server-id setting in MySQL.
  2. "database.whitelist": the name of the MySQL database to synchronize.
  3. "table.whitelist": the name of the MySQL table to synchronize.
  4. Important: "database.history.kafka.topic" stores the schema change history of the database; it is not the topic that row data is written to.
  5. "database.server.name": a logical name, unique per connector, used as the prefix of the Kafka topics that data is written to.

Pit 1: The five transforms-related configuration lines convert the format of the written data.

Without them, each written message would contain the record state both before and after the change, plus metadata (source, op, ts_ms, and so on).

This information is not needed when the data is later written to Elasticsearch. (Adjust this to your own business scenario.)

On the principle behind the format conversion, see: http://t.cn/EftoaIi
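
To make the difference concrete, a simplified sketch of one update event with and without the unwrap transform (all field values are illustrative):

Without unwrap (full Debezium envelope):
{
  "before": { "id": 1, "name": "old_name" },
  "after":  { "id": 1, "name": "new_name" },
  "source": { "db": "kafka_base_db", "table": "account" },
  "op": "u",
  "ts_ms": 1550000000000
}

With unwrap (only the new row state, ready for the ES sink):
{ "id": 1, "name": "new_name" }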

5.4 Start the connector

curl -X POST -H "Content-Type:application/json" \
--data @mysql2kafka_debezium.json \
http://192.168.1.22:18083/connectors | jq
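
Once the POST returns, the connector's registration and running state can be checked through the Kafka Connect REST API (same host and port as above):

# List registered connectors.
curl -X GET http://192.168.1.22:18083/connectors | jq

# Check the state of this connector and its tasks (should be RUNNING).
curl -X GET http://192.168.1.22:18083/connectors/debezium-mysql-source-0223/status | jq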

5.5 Verify that the write succeeded.

5.5.1 View the Kafka topics

kafka-topics --list --zookeeper localhost:2181

The topic the data was written to will appear in the list.

Note the format of the newly written data topic: it consists of three parts, server-name.database.table, plus the -smt suffix added by the RegexRouter transform.

Topic name of this example:

full.kafka_base_db.account-smt

5.5.2 Consume the data to verify that writes are normal

./kafka-avro-console-consumer --topic full.kafka_base_db.account-smt --bootstrap-server 192.168.1.22:9092 --from-beginning

At this point, Debezium has completed the MySQL-to-Kafka synchronization.

6. kafka-connector: synchronizing Kafka to Elasticsearch

6.1 Introduction to Kafka-connector

See official website: https://docs.confluent.io/current/connect.html

Kafka Connect is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems.

Connectors write data from common sources (MySQL, MongoDB, PostgreSQL, etc.) into Kafka, or write Kafka data into a target store; you can also develop your own connector.
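
The plugins available on a Connect worker can be listed through its REST API, for example (assuming the worker's default port 8083):

# List connector plugins installed on the Connect worker.
curl -X GET http://localhost:8083/connector-plugins | jq
# Both io.debezium.connector.mysql.MySqlConnector (source) and
# io.confluent.connect.elasticsearch.ElasticsearchSinkConnector (sink) should appear.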

6.2 Kafka-to-ES connector synchronization configuration

Configuration path:

/home/confluent-5.1.0/etc/kafka-connect-elasticsearch/quickstart-elasticsearch.properties

Configuration content (in .properties format):

name=elasticsearch-sink-test
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=full.kafka_base_db.account-smt
key.ignore=true
connection.url=http://192.168.1.22:9200
type.name=_doc

6.3 Start the Kafka-to-ES connector

Start command:

confluent load elasticsearch-sink-test \
-d /home/confluent-5.1.0/etc/kafka-connect-elasticsearch/quickstart-elasticsearch.properties
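
After the sink starts, you can confirm that documents are arriving in Elasticsearch. By default the sink writes to an index named after the topic (host as configured above):

curl -X GET "http://192.168.1.22:9200/full.kafka_base_db.account-smt/_search?size=1&pretty"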

6.4 Viewing connectors through the kafka-connector RESTful API

The connector details of both mysql2kafka and kafka2ES can be viewed with Postman, a browser, or the command line.

curl -X GET http://localhost:8083/connectors

7. Pit recap.

Pit 2: Errors may occur during synchronization, for example the Kafka topic cannot be consumed.
Troubleshooting ideas:

  • 1) Confirm that the topic being consumed is the one the data was written to;

  • 2) Confirm that no errors occurred during synchronization; check the connector with the following command (where xxx is the connector name):
curl -X GET http://localhost:8083/connectors/xxx/status

Pit 3: MySQL-to-ES does not recognize the date format.

This is an issue with the MySQL jar (the JDBC driver). Solution: configure the time zone in my.cnf.
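
A minimal sketch of the fix (the '+08:00' value is an assumption; use your server's actual time zone):

[mysqld]
default-time-zone = '+08:00'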

Pit 4: Kafka-to-ES: no data is written to ES.

Troubleshooting ideas:

  • 1) Suggestion: first create an index with the same name as the topic, with a statically defined mapping rather than a dynamically generated one (see the sketch after this list).
  • 2) Analyze the cause of the error via connector/status, and work through it step by step.
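
A sketch of pre-creating such an index with a static mapping (field names are illustrative; ES 6.x syntax with the _doc type configured above):

curl -X PUT "http://192.168.1.22:9200/full.kafka_base_db.account-smt" \
-H 'Content-Type: application/json' -d '
{
  "mappings": {
    "_doc": {
      "properties": {
        "id":   { "type": "long" },
        "name": { "type": "keyword" }
      }
    }
  }
}'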

8. Summary

  1. The binlog-based approach breaks through the field limitations; in fact the community project go-mysql-elasticsearch implements the same idea.

  2. Compared with Logstash and kafka-connector, Debezium's indirect, two-step route to real-time synchronization offers better stability and real-time performance.

  3. Recommended. If you have a better synchronization method, please leave a comment to discuss.

