4 solutions for synchronizing MySQL data to ES, all keepers!

Last week I heard a colleague at the company share a solution for synchronizing data from MySQL to ES. I found it very interesting, felt the topic was worth summarizing and refining, and so wrote this article.

This article first describes four data synchronization solutions and then surveys common data migration tools. It's packed with practical content!

Without further ado, let's get straight into it.

1. Introduction

In real-world project development, we often use MySQL as the business database and ES as the query database: reads and writes are separated, the query pressure on MySQL is relieved, and complex queries over massive data can be handled.

The most important issue in this setup is how to keep the data in MySQL and ES synchronized. Today I'll walk you through the various schemes for synchronizing data between MySQL and ES.

Let's take a look at the following four commonly used data synchronization schemes.

2. Data synchronization schemes

2.1 Synchronous double write

This is the simplest approach: while writing data to MySQL, also write the same data to ES.

[Figure: synchronous double-write architecture]
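To make this concrete, below is a minimal sketch of synchronous double write, assuming JDBC plus the Elasticsearch high-level REST client; the `orders` table, index name, and connection details are illustrative placeholders, not from any particular project.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class SyncDoubleWrite {
    private final RestHighLevelClient esClient;

    public SyncDoubleWrite(RestHighLevelClient esClient) {
        this.esClient = esClient;
    }

    public void createOrder(long id, String orderJson) throws Exception {
        // 1. Write MySQL first (connection details are placeholders).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO orders (id, payload) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setString(2, orderJson);
            ps.executeUpdate();
        }
        // 2. Write the same document to ES on the same request path.
        // If this step fails, MySQL and ES diverge: the double-write risk listed below.
        esClient.index(new IndexRequest("orders")
                        .id(String.valueOf(id))
                        .source(orderJson, XContentType.JSON),
                RequestOptions.DEFAULT);
    }
}
```

Both writes sit on the same request path, which is exactly where the coupling and the performance cost listed below come from.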

Advantages:

  • Simple business logic;

  • High real-time performance.

Disadvantages:

  • Hard coding: everywhere the code writes to MySQL, code to write to ES must be added as well;

  • Strong business coupling;

  • There is a risk of data loss due to double-write failure;

  • Poor performance: MySQL's write throughput is not very high to begin with, and adding a synchronous ES write inevitably degrades system performance.

2.2 Asynchronous double write

For scenarios that write to multiple data sources, MQ can be used to make the multi-source writes asynchronous.

[Figure: asynchronous double write via MQ]
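Below is a minimal sketch of the producer side, assuming Kafka as the MQ; the `order-changes` topic and broker address are placeholder assumptions. A separate consumer (similar to the sketch in section 2.4) reads the topic and writes ES.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncDoubleWrite {
    private final KafkaProducer<String, String> producer;

    public AsyncDoubleWrite() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Call this after the MySQL write commits: instead of writing ES inline,
    // publish a change message and let a dedicated consumer index it into ES.
    public void publishChange(long id, String orderJson) {
        producer.send(new ProducerRecord<>("order-changes",
                String.valueOf(id), orderJson));
    }
}
```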

Advantages:

  • High performance;

  • Not prone to data loss, thanks to MQ's consumption guarantees: if ES is down or a write fails, the message can simply be consumed again;

  • Multi-source writes are isolated from each other, making it easy to add more data source writes.

Disadvantages:

  • Hard coding remains: each new data source requires new consumer code;

  • System complexity increases, since message middleware is introduced;

  • MQ consumption is asynchronous, so data written by users may not be immediately visible, i.e. there is a delay.

2.3 SQL-based extraction

Both of the solutions above suffer from hard coding and heavy code intrusion. If real-time requirements are not strict, you can consider handling synchronization with a timer:

  1. Add a timestamp field to the relevant tables; any CRUD operation will update this field;

  2. The CRUD operations in the original program stay unchanged;

  3. Add a timer program that scans the specified tables on a fixed period and extracts the data changed within that period;

  4. Write the changed data into ES one by one.

[Figure: timer-based SQL extraction]
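Below is a minimal sketch of such a timer worker, assuming an `update_time` column maintained by MySQL (e.g. `TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP`) and a 5-second polling period; table names and connection details are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TimedExtractor {
    // Watermark: only rows changed after this moment are fetched on the next run.
    private Timestamp lastSyncTime = new Timestamp(0);

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::syncOnce, 0, 5, TimeUnit.SECONDS);
    }

    private void syncOnce() {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, payload, update_time FROM orders WHERE update_time > ?")) {
            ps.setTimestamp(1, lastSyncTime);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    writeToEs(rs.getLong("id"), rs.getString("payload"));
                    Timestamp t = rs.getTimestamp("update_time");
                    if (t.after(lastSyncTime)) {
                        lastSyncTime = t; // advance the watermark
                    }
                }
            }
        } catch (SQLException e) {
            e.printStackTrace(); // in practice: retry and alert
        }
    }

    private void writeToEs(long id, String payloadJson) {
        // Index the changed row into ES, e.g. with the client call shown in 2.1.
    }
}
```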

Advantages:

  • No changes to the original code, no intrusion, no hard coding;

  • No strong business coupling, and the performance of the original program is unaffected;

  • The worker code is easy to write; it does not need to distinguish insert, update, and delete operations.

Disadvantages:

  • Poor timeliness: since the timer polls the tables at a fixed frequency, even with a second-level synchronization period there is still some delay;

  • Polling puts a certain amount of pressure on the database; one improvement is to run the polling queries against a lightly loaded slave.

A classic implementation of this scheme is Logstash: based on its configuration, it periodically runs a SQL query and writes the new data into ES, achieving incremental synchronization.

2.4 Real-time synchronization based on Binlog

All three solutions above involve code intrusion, hard coding, or delay. So, is there a scheme that guarantees real-time data synchronization with no code intrusion at all?

There is: use MySQL's Binlog for synchronization.

[Figure: Binlog-based real-time synchronization]

Specific steps are as follows:

  • Read MySQL's Binlog to obtain the change log of the specified tables;

  • Push the parsed log information to MQ;

  • Write an MQ consumer program;

  • Consume from MQ continuously; every time a message is consumed, write it into ES (a minimal consumer sketch follows).
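To make the last two steps concrete, here is a minimal consumer sketch, assuming Kafka as the MQ with auto-commit disabled; the `binlog-orders` topic and `orders` index are placeholder names.

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class BinlogMqConsumer {
    // Assumes the consumer was created with enable.auto.commit=false.
    public static void run(KafkaConsumer<String, String> consumer,
                           RestHighLevelClient esClient) throws Exception {
        consumer.subscribe(Collections.singletonList("binlog-orders"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // Each message carries one parsed Binlog row change as JSON.
                // (Deletes would use a DeleteRequest instead; omitted for brevity.)
                esClient.index(new IndexRequest("orders")
                                .id(record.key())
                                .source(record.value(), XContentType.JSON),
                        RequestOptions.DEFAULT);
            }
            consumer.commitSync(); // ack only after the ES writes succeed
        }
    }
}
```

Committing the offset only after the ES writes succeed is what provides the replay-on-failure guarantee mentioned in section 2.2.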

Advantages:

  • No code intrusion, no hard coding;

  • The original system requires no changes and is completely unaware of the synchronization;

  • High performance;

  • Business decoupling: there is no need to care about the original system's business logic.

Disadvantages:

  • Building a Binlog subscription system is complex;

  • Since the parsed Binlog information is delivered through MQ, there is the same risk of MQ delay as in the second scheme.

3. Selection of data migration tools

Of the four schemes above, "Binlog-based real-time synchronization" is currently the most commonly used, and many excellent data migration tools have grown up around it. This section introduces the main ones.

Most of these data migration tools are built on Binlog subscription: they simulate a MySQL slave subscribing to the Binlog, implementing CDC (Change Data Capture) and sending the committed changes, including INSERTs, UPDATEs, and DELETEs, downstream.

How does a tool disguise itself as a slave? You first need to understand MySQL's master-slave replication principle; readers who want to pick up this knowledge can read the high-concurrency tutorial I wrote earlier, which explains it in detail.
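To get a feel for what "simulating a slave" means in code, here is a minimal sketch using the open-source mysql-binlog-connector-java library; host, port, and account are placeholders, and the account needs the REPLICATION SLAVE privilege. This is the same trick Canal and similar tools perform at a much larger scale.

```java
import com.github.shyiko.mysql.binlog.BinaryLogClient;

public class FakeSlaveDemo {
    public static void main(String[] args) throws Exception {
        // The client registers itself with the master like a replica
        // and then receives the Binlog as a continuous event stream.
        BinaryLogClient client =
                new BinaryLogClient("localhost", 3306, "repl_user", "repl_pass");
        client.registerEventListener(event -> {
            // Row events (WRITE_ROWS / UPDATE_ROWS / DELETE_ROWS and their EXT_
            // variants) carry the committed changes; hand them off to MQ/ES here.
            System.out.println("binlog event: " + event.getHeader().getEventType());
        });
        client.connect(); // blocks and streams events
    }
}
```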

3.1 Canal

Canal parses the database's incremental log to provide incremental data subscription and consumption; currently it mainly supports MySQL.

The principle of Canal is to pretend to be a MySQL slave node and subscribe to the master node's Binlog. The main process is:

  1. The Canal server sends a dump request to the MySQL master node using the replication dump protocol;

  2. After receiving the dump request, the MySQL master pushes its Binlog to the Canal server, which parses the Binlog objects (originally byte streams) and converts them into JSON;

  3. The Canal client listens to the Canal server over TCP or via MQ and synchronizes the data to ES.

[Figure: Canal subscribing to the MySQL master's Binlog]

The following is Canal's core execution flow: the Binlog Parser is responsible for extracting, parsing, and pushing the Binlog, and the EventSink is responsible for data filtering, routing, and processing. This is just for understanding.

[Figure: Canal core execution flow]
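Below is a minimal sketch of a Canal client consuming over TCP, based on Canal's open-source client API; the server address, destination `example`, table filter, and batch size are placeholder values.

```java
import java.net.InetSocketAddress;
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;

public class CanalToEs {
    public static void main(String[] args) throws Exception {
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        connector.connect();
        connector.subscribe("shop\\.orders"); // database.table filter (placeholder)
        connector.rollback();                 // replay anything un-acked from a previous run
        while (true) {
            Message message = connector.getWithoutAck(100); // batch of up to 100 entries
            long batchId = message.getId();
            if (batchId == -1 || message.getEntries().isEmpty()) {
                Thread.sleep(1000);
                continue;
            }
            for (CanalEntry.Entry entry : message.getEntries()) {
                if (entry.getEntryType() != CanalEntry.EntryType.ROWDATA) {
                    continue; // skip transaction begin/end entries
                }
                CanalEntry.RowChange rowChange =
                        CanalEntry.RowChange.parseFrom(entry.getStoreValue());
                for (CanalEntry.RowData rowData : rowChange.getRowDatasList()) {
                    // Map the columns to a JSON document and index/delete it in ES here.
                    System.out.println(rowChange.getEventType() + ": " + rowData);
                }
            }
            connector.ack(batchId); // ack only after the ES writes succeed
        }
    }
}
```

Acknowledging each batch only after the ES writes succeed means an interrupted run resumes from the last un-acked batch.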

3.2 Alibaba Cloud DTS

DTS (Data Transmission Service) is Alibaba Cloud's data transmission service; it supports data transmission between RDBMS, NoSQL, OLAP, and other data sources.

It provides several transmission modes: data migration, real-time data subscription, and real-time data synchronization. Compared with third-party data streaming tools, DTS offers more varied, higher-performance, more secure and reliable transmission links, plus many convenience features that make creating and managing links much easier.

Features:

  • Multiple data sources: support data transmission between RDBMS, NoSQL, OLAP and other data sources;

  • Multiple transmission methods: support multiple transmission methods, including data migration, real-time data subscription and real-time data synchronization;

  • High performance: the underlying layer applies a variety of performance optimizations; at the peak of a full data migration, throughput can reach 70 MB/s and 200,000 TPS, and high-specification servers ensure that every migration or synchronization link performs well;

  • High availability: the bottom layer is a service cluster; if any node in the cluster goes down or fails, the control center quickly switches all of its tasks to other nodes, so link stability is high;

  • Ease of use: a visual management interface and a wizard-style link creation process let users easily create transmission links from the console;

  • Note that it is a paid service.

Now take a look at the DTS system architecture.

[Figure: DTS system architecture]

  • High availability: every module in DTS has an active/standby architecture to keep the system highly available. The disaster-recovery system checks the health of each node in real time and, as soon as a node is found to be abnormal, quickly switches its links to other nodes.

  • Dynamic adaptation to data source addresses: for data subscription and synchronization links, the disaster-recovery system also watches for changes such as data source connection address switches. Once it detects that a connection address has changed, it dynamically adapts to the new address, keeping the link stable while the data source changes.

For more information, see the official Alibaba Cloud documentation: https://help.aliyun.com/product/26590.html

3.3 Databus

Databus is a low-latency, reliable, transactional, consistent data change capture system, open-sourced by LinkedIn in 2013.

By mining database logs, Databus pulls database changes in a real-time, reliable manner; a business can receive the changes in real time through a customized client and apply its own logic.

Features:

  • Multiple data sources: Databus supports change capture from multiple data sources, including Oracle and MySQL.

  • Scalable, highly available: Databus can scale to thousands of consumers and transactional data sources while maintaining high availability.

  • Ordered transaction commit: Databus preserves the transactional integrity of the source database and delivers change events in order, grouped by transaction and in source commit order.

  • Low latency, multiple subscription mechanisms: once a change commits at the data source, Databus can deliver the transaction to consumers within milliseconds. Consumers can also use Databus's server-side filtering to fetch only the specific data they need.

  • Unlimited backtracking: consumers can rewind arbitrarily far back; for example, a consumer that needs to rebuild a complete copy of the data imposes no extra burden on the database. This also helps when a consumer lags far behind the source database.

Now look at the Databus system architecture.

Databus consists of Relays, the Bootstrap service, a client library, and other components; the Bootstrap service in turn includes a Bootstrap Producer and a Bootstrap Server.

[Figure: Databus system architecture]

  • Fast consumers get events directly from the Relay;

  • If a consumer's data is far behind and what it needs is no longer in the Relay's log, it requests the Bootstrap service instead, which returns a snapshot of all data changes since the consumer last processed a change.

Open source address: https://github.com/linkedin/databus

3.4 Others

Flink

  • A distributed processing engine and framework for stateful computation over bounded and unbounded data streams.

  • Official website address: https://flink.apache.org

CloudCanal

  • A data synchronization and migration system; a commercial product.

  • Official website address: https://www.clougence.com/?utm_source=wwek

Maxwell

  • Simple to use: it outputs data changes directly as JSON strings, so there is no need to write a client.

  • Official website address: http://maxwells-daemon.io

DRDS

  • A distributed database middleware product developed in-house by Alibaba Group; it focuses on solving the scalability problems of single-instance relational databases and is lightweight (stateless), flexible, stable, and efficient.

  • Official address: https://www.aliyun.com/product/drds

yugong

  • Helps users migrate data from Oracle to MySQL.

  • Open source address: https://github.com/alibaba/yugong

4. Summary

Through this article, you have seen the various schemes for synchronizing data between MySQL and ES, as well as the commonly used data migration tools, which should help you make a better choice.

This is also a standard technology selection article. I have previously written selection pieces on message queues, microservice gateways, registry centers, configuration centers, and data monitoring; this one can serve as the sixth in the technology selection series.

Writing the article is not the goal; what matters most is applying it to a project. The MySQL + Canal + ES data synchronization code has already landed in the project, and the next article will explain the concrete implementation logic, so stay tuned!
