Near-real-time synchronization to Elasticsearch based on the MySQL binlog

This article first appeared on the "Bianlifeng Product Technology" WeChat public account. You are welcome to follow it.

Bianlifeng campus recruitment internal referral link: https://app-tc.mokahr.com/m/recommendation-apply/bianlifenghr/8007?sharePageId=9193&recommendCode=NTAANCF&codeType=1&hash=%23%2Frecommendation%2Fpage%2F9193

Bianlifeng.gllue.me/portal/share_url/0b9e0fe4-a3ec-4bbb-a541-e65d83cf6364

Original article: Near-real-time synchronization to Elasticsearch based on the MySQL binlog

This page is not an official company release channel; the content is distributed by the author personally.


Background

In our development work, we often use multiple database systems in one project. In certain scenarios, we want to synchronize data from one database to another, heterogeneous database in order to perform data analysis and statistics, real-time monitoring, real-time search, and other functions. This process of capturing and synchronizing changes between heterogeneous data sources is called Change Data Capture (CDC).

[Figure: Change Data Capture from a source database to a heterogeneous target]

What we discuss in this article is the incremental and full synchronization process in the scenario where the Source is MySQL and the Target is ElasticSearch. As we all know, MySQL has become the most popular relational database thanks to its excellent performance, stable service, open source code, active community, and other factors. However, when the data volume is large, when multi-table operations are involved, or when queries need to be based on geographic location, MySQL usually cannot support us well.

To deal with queries that are slow or impossible in MySQL, we usually pair it with ElasticSearch (or similar systems) for retrieval. Traditionally, data synchronization from MySQL to ElasticSearch is done with double writes, MQ messages, and similar approaches, which carry risks such as tight coupling, poor performance, and data loss.

Therefore, we need to explore a solution that synchronizes MySQL to the heterogeneous ElasticSearch database without intruding on the business. This article discusses it at two levels: incremental synchronization and full synchronization.

Incremental synchronization

Architecture

Real-time incremental synchronization of MySQL data to ElasticSearch is generally achieved with the help of MySQL's incremental log, the binlog. The most popular middleware for parsing and obtaining the binlog is Canal, open-sourced by Alibaba. The word "canal" translates to waterway/pipeline/ditch; its main purpose is to provide incremental data subscription and consumption based on parsing the MySQL incremental log.

So our overall solution is: upstream, Canal parses the binlog and sends the resulting messages to Kafka; downstream, an Adapter parses, assembles, and consumes the messages according to its configuration.

[Figure: overall architecture, MySQL binlog -> Canal -> Kafka -> Adapter -> ElasticSearch]

As the figure above shows, the whole process is divided into three steps:

  1. Canal pretends to be a MySQL slave: it simulates the MySQL slave interaction protocol and sends a dump request to the MySQL master. The master accepts the dump request and starts pushing the binlog to Canal, which then parses it;

  2. Canal sends the parsed and serialized binlog information to Kafka;

  3. The Adapter consumes the messages from Kafka, parses and processes them according to the user's configuration, and writes the final data to ElasticSearch (see the sketch after this list).
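To make step 3 concrete, here is a minimal sketch of the Adapter's entry point, assuming Canal is configured to write its flat-message JSON to Kafka. The BinlogMessage class and its fields below merely mirror that JSON for illustration; the handler methods are placeholders, not Canal's API.

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Map;

// Hypothetical POJO mirroring the fields of Canal's flat-message JSON.
@JsonIgnoreProperties(ignoreUnknown = true)
class BinlogMessage {
    public String database;
    public String table;
    public String type;                      // INSERT / UPDATE / DELETE
    public List<Map<String, String>> data;   // row images after the change
    public List<Map<String, String>> old;    // changed columns' values before an UPDATE
}

public class BinlogDispatcher {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Parses one Kafka message body and routes it to the matching handler.
    public void dispatch(String kafkaValue) throws Exception {
        BinlogMessage msg = MAPPER.readValue(kafkaValue, BinlogMessage.class);
        switch (msg.type) {
            case "INSERT":
            case "UPDATE":
                upsertToElasticSearch(msg);      // apply the configured SQL mapping, then write
                break;
            case "DELETE":
                deleteFromElasticSearch(msg);
                break;
            default:
                break;                           // DDL and other statements are ignored here
        }
    }

    private void upsertToElasticSearch(BinlogMessage msg) { /* covered in later sections */ }
    private void deleteFromElasticSearch(BinlogMessage msg) { /* covered in later sections */ }
}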

Adapter design ideas

After research, we decided to configure the Adapter with SQL statements: the system parses the SQL to obtain the required tables and field mapping relationships. The parsing is implemented with Druid, Alibaba's open-source database connection pool, which also ships a SQL parser.

For example, in the demonstration below, the user configures a SQL statement, and the system automatically parses it, determines the ElasticSearch field information, and caches the field mapping between the MySQL tables/columns and ElasticSearch.

Users can set the corresponding field name in ElasticSearch by giving a field an alias; fields with the same name do not need aliases.

[Figure: a configured SQL statement and the parsed MySQL-to-ElasticSearch field mapping]
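As a rough illustration of this parsing step, the sketch below uses the SQL parser that ships with Druid (the same library mentioned above) to pull the select items and their aliases out of a configured statement. It is a simplified assumption of how the field-mapping cache could be built, not the Adapter's actual code.

import com.alibaba.druid.sql.SQLUtils;
import com.alibaba.druid.sql.ast.SQLStatement;
import com.alibaba.druid.sql.ast.statement.SQLSelectItem;
import com.alibaba.druid.sql.ast.statement.SQLSelectQueryBlock;
import com.alibaba.druid.sql.ast.statement.SQLSelectStatement;
import com.alibaba.druid.util.JdbcConstants;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfigSqlParser {

    // Returns a map from the ElasticSearch field name (alias if present, otherwise the
    // plain column name) to the MySQL select expression, e.g. {_id=s.id, id=s.id, name=s.name}.
    public static Map<String, String> parseFieldMapping(String configSql) {
        Map<String, String> mapping = new LinkedHashMap<>();
        SQLStatement stmt = SQLUtils.parseStatements(configSql, JdbcConstants.MYSQL).get(0);
        SQLSelectQueryBlock query = ((SQLSelectStatement) stmt).getSelect().getQueryBlock();
        for (SQLSelectItem item : query.getSelectList()) {
            String expr = item.getExpr().toString();          // e.g. "s.name"
            String alias = item.getAlias();                   // e.g. "_id" or null
            String esField = alias != null ? alias : expr.substring(expr.lastIndexOf('.') + 1);
            mapping.put(esField, expr);
        }
        return mapping;
    }

    public static void main(String[] args) {
        String sql = "SELECT s.id AS _id, s.id, s.name, s.age, s.birthday FROM student s";
        System.out.println(parseFieldMapping(sql));
    }
}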

The overall process of the Adapter can be represented by the following figure:

[Figure: overall Adapter processing flow]

Scenario handling

We have clarified the overall architecture. Next, we need to consider what parsing capabilities the Adapter must provide.

Single table scenario

Single table synchronization is the simplest synchronization scenario. When the contents of a table in the database change, the required fields are extracted and written into ElasticSearch.

For example, we now have a student table, and we want to synchronize the id, name, age, and birthday field information in it to ElasticSearch, as shown below:

[Figure: the student table and the target ElasticSearch document]

The statement that needs to be configured is:

SELECT
    s.id AS _id,
    s.id,
    s.name,
    s.age,
    s.birthday
FROM
    student s;

Here, _id represents the ElasticSearch document's unique identifier, while id is an actual field stored in ElasticSearch.

What we need to do is relatively simple: filter the original binlog data and write the required fields into ElasticSearch.
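For instance, once the field mapping has been cached, turning one binlog row into an ElasticSearch document is little more than a filter and a rename. The sketch below is a simplified illustration (column names follow the student example; the table-alias prefix is assumed to be stripped already):

import java.util.HashMap;
import java.util.Map;

public class SingleTableMapper {

    // fieldMapping: ES field name -> MySQL column name, e.g. {_id=id, id=id, name=name, age=age, birthday=birthday}
    // row:          one row image from the binlog, column name -> value
    public static Map<String, Object> toDocument(Map<String, String> fieldMapping,
                                                 Map<String, String> row) {
        Map<String, Object> doc = new HashMap<>();
        for (Map.Entry<String, String> e : fieldMapping.entrySet()) {
            String esField = e.getKey();
            String column = e.getValue();
            if (row.containsKey(column)) {
                doc.put(esField, row.get(column));   // keep only configured columns, renamed
            }
        }
        return doc;   // "_id" is later used as the document id, the rest as the source
    }
}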

Simple scenario with multiple tables

Multi-table synchronization refers to the scenario where two tables are associated using Left Join. Generally, a field in the left table records the ID information of the right table. After Left Joining the two tables, all the required information can be obtained.

For example, we want to record the class information of students, so we add the class_id field to the student table, which corresponds to the id field of the class table;

As shown in the two tables below, we hope to associate the student table and class table and store the id, name, class_id, and class_name information in ElasticSearch.

[Figure: the student and class tables and the target ElasticSearch document]

Then we can configure it in the following form:

SELECT
    s.id AS _id,
    s.id,
    s.name,
    c.class_id,
    c.class_name
FROM
    student s
    LEFT JOIN class c
    ON s.class_id = c.id;

 

First, let us agree to call fields that carry the association "associated fields". For example, the class_id field of the student table and the id field of the class table in the SQL above are both associated fields.

For this scenario, the Adapter monitors updates to both tables, student and class, and decides how to handle the changed fields (a sketch follows the list below):

  • Non-associated fields in the left table: updated into ElasticSearch directly from the information in the binlog;

  • Non-associated fields in the right table: after searching ElasticSearch to determine the affected documents, write the modified data into ElasticSearch;

  • Associated field update: splice a where condition onto the configured SQL statement, query MySQL again, and update with the result.
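The branching above can be organized roughly as in the sketch below. The class and method names are ours, and the actual ElasticSearch search, SQL splicing, and write-back are left as stubs; it only shows how a change could be routed.

import java.util.Map;
import java.util.Set;

public class MultiTableChangeHandler {

    private final Set<String> leftTableAssociatedFields;    // e.g. {"class_id"} on student
    private final Set<String> rightTableAssociatedFields;   // e.g. {"id"} on class

    public MultiTableChangeHandler(Set<String> leftAssoc, Set<String> rightAssoc) {
        this.leftTableAssociatedFields = leftAssoc;
        this.rightTableAssociatedFields = rightAssoc;
    }

    // Routes one UPDATE according to which table changed and whether an associated field changed.
    public void onUpdate(String table, Map<String, String> changedColumns, Map<String, String> rowAfter) {
        Set<String> associated = "student".equals(table)
                ? leftTableAssociatedFields : rightTableAssociatedFields;
        boolean associatedChanged = associated.stream().anyMatch(changedColumns::containsKey);

        if (associatedChanged) {
            // Associated field changed: splice a where condition onto the configured SQL,
            // query MySQL again and rewrite the affected documents.
            requeryAndUpsert(table, rowAfter);
        } else if ("student".equals(table)) {
            // Left-table non-associated field: the binlog row already carries everything we need.
            partialUpdateById(rowAfter.get("id"), changedColumns);
        } else {
            // Right-table non-associated field: search ES to find the affected documents first,
            // then apply the change to each of them.
            searchAffectedDocsAndUpdate(rowAfter, changedColumns);
        }
    }

    private void requeryAndUpsert(String table, Map<String, String> rowAfter) { }
    private void partialUpdateById(String id, Map<String, String> changed) { }
    private void searchAffectedDocsAndUpdate(Map<String, String> rowAfter, Map<String, String> changed) { }
}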

Complex scenarios with multiple tables

Multiple rows to one row

Converting multiple rows of data into a single row is also a form of multi-table association; typically the pieces of data are joined with a designated separator.

Examples include the multiple courses a student takes, or the multiple phone numbers of a contact.

Say we now have a student-parent table, parent. We want to store all of a student's parent names in ElasticSearch, including the student's id, name, age, and parentNames, where parentNames is the comma-separated list of the names of all of this student's parents. As shown below:

[Figure: the student and parent tables and the target ElasticSearch document]

Then we can achieve this effect by configuring the following statement:


SELECT
    s.id AS _id,
    s.id,
    s.name,
    s.age,
    p.parentNames
FROM
    student s
    LEFT JOIN (
    SELECT
        student_id,
        group_concat( parent_name ORDER BY id DESC SEPARATOR ',' ) AS parentNames
    FROM
        parent
    GROUP BY
        student_id
    ) p
    ON s.id = p.student_id

Field subquery

The above example of changing multiple rows into one row can also be achieved through field subqueries. We can configure the following statement:

SELECT
    s.id AS _id,
    s.id,
    s.name,
    s.age,
    (
    SELECT
        group_concat(
            p.parent_name
        ORDER BY
            p.id DESC SEPARATOR ','
        ) AS parentNames
    FROM
        parent p
    WHERE
        p.student_id = s.id
    GROUP BY
        p.student_id
    ) parentNames
FROM
    student s

When parsing the configuration, the Adapter caches the relationship between the subquery table and the outer table, and takes different actions when it detects changes in each table:

  • parent table update: obtain the associated field student_id, splice the outer table's qualification condition, such as s.id = ××, onto the whole SQL, query, and then update;

  • student table update: obtain the main table's id field, splice the restriction onto the whole SQL, query, and then update.

Of course, the Adapter also supports more complex forms, such as a Left Join inside the subquery or in the outer query. A minimal sketch of the condition-splicing step follows.
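The "splice a condition onto the whole SQL and re-query" step in both bullets can be pictured as plain string composition around the configured statement. A minimal sketch, assuming the configured SQL has no WHERE / GROUP BY / ORDER BY at the outer level (which holds for the examples in this article); the value itself would be bound through JDBC:

public final class SqlSplicer {

    // Appends a restriction such as "s.id = ?" so that only the rows affected by one
    // binlog event are re-queried from MySQL.
    public static String restrict(String configSql, String qualifiedColumn) {
        String trimmed = configSql.trim();
        if (trimmed.endsWith(";")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1);
        }
        return trimmed + " WHERE " + qualifiedColumn + " = ?";
    }

    public static void main(String[] args) {
        String configSql = "SELECT s.id AS _id, s.id, s.name FROM student s "
                + "LEFT JOIN class c ON s.class_id = c.id";
        // prints "... ON s.class_id = c.id WHERE s.id = ?"
        System.out.println(restrict(configSql, "s.id"));
    }
}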

Save to ElasticSearch

From obtaining the binlog to storing the data in ElasticSearch, there are some properties we need to guarantee and some problems we need to solve.

Properties to guarantee

Ordering

In some single-table and multi-table scenarios we do not go back to MySQL to query the latest data; we write the data carried in the binlog directly into ElasticSearch, which means we must guarantee overall ordering.

If we process two binlog entries in the wrong order, the data we write may end up being an intermediate version of the update rather than the final version.

Ordering has two parts: the ordering of the binlog itself and the ordering of the Adapter's processing.

Before MySQL 5.6, the prepare_commit_mutex lock ensured that the commit order in the binary log was consistent with the commit order in the InnoDB storage engine layer. In 5.6 and later, BLGC (Binary Log Group Commit) was introduced: binary log commit is split into three stages, the Flush stage, the Sync stage, and the Commit stage, keeping the binary log order consistent with the actual commit order.

Ordered processing in the Adapter is achieved by grouping associated tables and routing each group through its own topic. So what counts as associated tables? Suppose, for example, that the user configured the following three SQL statements:

SELECT
    a.id AS _id,
    a.student_name,
    a.student_age,
    a.class_id,
    b.class_name,
    b.teacher_name
FROM
    student a
    LEFT JOIN teacher b
    ON a.teacher_id = b.id;
 
SELECT
    b.id AS _id,
    b.teacher_name,
    b.teacher_age,
    b.campus_id,
    c.campus_address,
    c.campus_name
FROM
    teacher b
    LEFT JOIN campus c
    ON b.campus_id = c.id;
    
SELECT
    d.id AS _id,
    d.class_name,
    d.class_introduce
FROM
    class d;

We can see that the first two statements both involve the teacher table, along with the student table and the campus table respectively, while the third statement only involves the class table. Since the first two statements both depend on the teacher binlog during synchronization, we put student, teacher, and campus into one group and class into another. Binlog data within the same group must stay in order.

Since the downstream Adapter can hardly influence how the upstream assigns Kafka partitions, our recommended approach is one topic with a single partition per group. The Adapter also has an internal grouping mechanism: if several groups are mixed into one stream, it can still separate them and parse them efficiently with multiple threads (a grouping sketch follows).
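One way to build these groups automatically is a small union-find over the tables referenced by each configured statement: statements that share any table end up in the same group, and each group can then be given its own topic and partition. A sketch under that assumption:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TableGrouper {

    private final Map<String, String> parent = new HashMap<>();

    private String find(String table) {
        parent.putIfAbsent(table, table);
        String p = parent.get(table);
        if (!p.equals(table)) {
            p = find(p);
            parent.put(table, p);   // path compression
        }
        return p;
    }

    private void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    // Each inner list holds the tables referenced by one configured SQL statement.
    public Map<String, Set<String>> group(List<List<String>> tablesPerStatement) {
        for (List<String> tables : tablesPerStatement) {
            if (tables.isEmpty()) {
                continue;
            }
            find(tables.get(0));                       // register single-table statements too
            for (int i = 1; i < tables.size(); i++) {
                union(tables.get(0), tables.get(i));
            }
        }
        Map<String, Set<String>> groups = new HashMap<>();
        for (String table : new ArrayList<>(parent.keySet())) {
            groups.computeIfAbsent(find(table), k -> new TreeSet<>()).add(table);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<String>> configured = Arrays.asList(
                Arrays.asList("student", "teacher"),
                Arrays.asList("teacher", "campus"),
                Arrays.asList("class"));
        // prints two groups: {campus, student, teacher} and {class}
        System.out.println(new TableGrouper().group(configured));
    }
}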

Reliability

If we want the data in ElasticSearch to stay consistent with the MySQL database, we must process every binlog entry completely without losing messages; in other words, we need reliability.

Reliability means every message gets consumed and that neither message loss nor repeated consumption introduces errors, so consumption must be idempotent.

Considering that the Adapter may exit normally or abnormally at any time, we use Kafka's manual offset mechanism: the offset of a message is committed only after the message has been confirmed as successfully written to ElasticSearch, which ensures no message is lost.

For update-type messages that are written directly into ElasticSearch without querying back to MySQL, we check the current state of the document in ElasticSearch and update it only if it matches the oldData carried in the binlog message; for scenarios that do query back to MySQL, the latest data has already been fetched, so repeated consumption has no effect. A sketch of this consumer loop follows.
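Putting these two ideas together, a minimal consumer loop might look like the sketch below: auto-commit is disabled, the offset is committed only after every record in the poll has been written to ElasticSearch, and for direct updates the current document is compared against the oldData in the message first. The topic name, servers, and handler body are placeholders.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReliableBinlogConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "binlog-adapter");
        props.put("enable.auto.commit", "false");   // manual offset management
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("binlog-group-1"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record.value());   // parse and write to ElasticSearch; throws on failure
                }
                consumer.commitSync();        // commit only after every record was written
            }
        }
    }

    private static void handle(String message) {
        // For direct (non-requery) updates: read the current ES document and apply the change
        // only if it still matches the oldData carried in the binlog message, which makes
        // re-consumption after a crash idempotent.
    }
}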

Problems to solve

Bulk submission

To improve throughput, writes to ElasticSearch are submitted in batches. Calling the add method automatically accumulates requests into separate BulkRequest objects per target ElasticSearch cluster, and each batch is submitted once the whole group has been processed, as sketched below.
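As an illustration only (using the Elasticsearch high-level REST client; the cluster-keyed map and method names are ours), the per-cluster batching could look like this:

import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class BulkWriter {

    private final Map<String, BulkRequest> pending = new HashMap<>();   // one batch per ES cluster
    private final Map<String, RestHighLevelClient> clients;

    public BulkWriter(Map<String, RestHighLevelClient> clients) {
        this.clients = clients;
    }

    // The "add" step: buffers one document under the BulkRequest of its target cluster.
    public void add(String cluster, String index, String id, Map<String, Object> source) {
        pending.computeIfAbsent(cluster, k -> new BulkRequest())
               .add(new IndexRequest(index).id(id).source(source));
    }

    // Submits every buffered batch once the whole group of messages has been processed.
    public void flush() throws IOException {
        for (Map.Entry<String, BulkRequest> e : pending.entrySet()) {
            if (e.getValue().numberOfActions() > 0) {
                clients.get(e.getKey()).bulk(e.getValue(), RequestOptions.DEFAULT);
            }
        }
        pending.clear();
    }
}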

JSON data

We must admit that storing JSON in MySQL is not an elegant solution, but in some scenarios we still end up storing JSON strings there.

Our Adapter can automatically recognize JSON and insert it as a nested document; before using this feature, however, we must make sure that fields with the same name always carry the same type (see the sketch below).
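A rough sketch of that auto-recognition using Jackson: a column value that parses as a JSON object or array is converted into a Map/List so the client serializes it as a nested document rather than an escaped string. The heuristic here is deliberately simple.

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonColumnConverter {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Returns a Map/List when the MySQL column value looks like a JSON object/array,
    // otherwise returns the original string unchanged.
    public static Object convert(String value) {
        if (value == null) {
            return null;
        }
        String trimmed = value.trim();
        if (!(trimmed.startsWith("{") || trimmed.startsWith("["))) {
            return value;
        }
        try {
            return MAPPER.readValue(trimmed, new TypeReference<Object>() {});
        } catch (Exception e) {
            return value;   // not valid JSON after all; keep the raw string
        }
    }
}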

Geodata

In MySQL, longitude and latitude are usually two separate fields, but we expect them to be stored as a geo_point type in ElasticSearch; likewise, a polygon is stored as a string in MySQL, but we expect it to be stored as a geo_shape type in ElasticSearch.

A plain field-for-field update cannot handle this, especially geo_shape scenarios such as geographic polygons. We finally chose a tag-function approach: the user wraps a field in a custom geo_shape tag function in the configuration to indicate that the system needs to convert it. When the Adapter parses the SQL, it picks out the fields that need conversion from this user-defined function.

For example, the following SQL:

SELECT a.id AS _id, geo_shape(a.geo_shape) FROM geography a;

It is worth mentioning that in ElasticSearch, geographic shapes often fail to be stored because they are constructed illegally. Therefore, we use spatial4j to validate a shape before updating geo-typed data; if there is a problem, it is reported directly to the monitoring center for automatic detection and notification (see the sketch below).
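The pre-check might be done with spatial4j roughly as below; readShapeFromWkt is spatial4j's (older, but still available) WKT entry point, and the reporting call is a stand-in for the real monitoring client.

import org.locationtech.spatial4j.context.jts.JtsSpatialContext;

public class GeoShapeValidator {

    // JTS-backed context so polygons and other complex WKT shapes are supported.
    private static final JtsSpatialContext CTX = JtsSpatialContext.GEO;

    // Returns true when the WKT string parses into a legal shape; otherwise reports
    // the problem so the field can be skipped instead of failing the ES write.
    public static boolean isValid(String wkt, String docId) {
        try {
            CTX.readShapeFromWkt(wkt);   // throws when the WKT cannot be parsed into a legal shape
            return true;
        } catch (Exception e) {
            reportToMonitoring(docId, wkt, e);   // placeholder for the real alerting client
            return false;
        }
    }

    private static void reportToMonitoring(String docId, String wkt, Exception e) {
        System.err.println("illegal geo_shape for doc " + docId + ": " + e.getMessage());
    }
}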

Content filtering

We want the Adapter to flexibly select, based on user definitions, which data gets dumped to ElasticSearch. Soft (tombstone) deletion is a very common scenario, but existing open-source adapters usually have trouble supporting where conditions.

Looking at the issues in existing open-source adapters, the where-based deletion logic fails for two reasons: ① a single-table update does not query back to the table, so there is no chance to splice on the where condition; ② after querying back to the table, rows that satisfy the condition are written into ElasticSearch, but documents that used to satisfy the condition and no longer do cannot be deleted in time.

To solve these two problems and improve efficiency, we chose to let users configure MVEL expressions. Before submission, the expression is evaluated against the data to be submitted: if it matches, the data is submitted; if not, the corresponding document is deleted from ElasticSearch by _id and no further updates for it are written. A sketch follows.
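A sketch of the expression check, using MVEL2's eval API; the example expression, field names, and the submit/delete calls are illustrative stubs:

import org.mvel2.MVEL;
import java.util.Map;

public class ContentFilter {

    // User-configured expression, e.g. "is_deleted == 0 && status == 'ONLINE'".
    private final String expression;

    public ContentFilter(String expression) {
        this.expression = expression;
    }

    // Evaluates the expression against the document to decide between submit and delete.
    public void apply(String id, Map<String, Object> doc) {
        boolean matches = Boolean.TRUE.equals(MVEL.eval(expression, doc));
        if (matches) {
            submitToElasticSearch(id, doc);
        } else {
            // The document no longer satisfies the condition: remove it by _id and stop
            // propagating further updates for it.
            deleteFromElasticSearchById(id);
        }
    }

    private void submitToElasticSearch(String id, Map<String, Object> doc) { }
    private void deleteFromElasticSearchById(String id) { }
}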

Summary

All of the above features are now supported. According to online monitoring, the average processing time for a single message is about 30 ms and the P99 is about 80 ms, which basically meets our online business needs.

Full synchronization

Overview

Full synchronization is much simpler than incremental synchronization. The main function of full synchronization is to elegantly and completely place the data in MySQL into ElasticSearch.

This seems very easy: just run SQL queries and write the results into ElasticSearch. What issues do we need to consider in advance during this process?

Points of attention

Deep paging and data loss

It is easy to see that deep paging puts a heavy burden on the database, but deep paging can also cause rows to be missed: if a row with a smaller ID is deleted, the page boundaries shift, and data between the current page and the next page can be lost.

In the example below we fetch 5 rows per page. Because the user deleted the 4th row (the one with value "5") after the first fetch, the next page starts directly from "9", and the row with value "8" is skipped. Our solution is to page based on the id range instead (see the sketch after the figure).

[Figure: deep paging with 5 rows per page; after the row with value "5" is deleted, the row with value "8" is skipped]
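Paging by id range instead of LIMIT/OFFSET means remembering the last id seen and always asking for "the next N rows after it", so a deletion in an earlier range cannot shift the window. A sketch of the corresponding query loop (table and column names follow the student example; the JDBC plumbing is simplified):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class FullSyncPager {

    private static final int PAGE_SIZE = 1000;

    // Scans the whole table in id order; a delete behind the cursor cannot skip rows ahead of it.
    public static void fullSync(Connection conn) throws SQLException {
        long lastId = 0;
        String sql = "SELECT id, name, age, birthday FROM student WHERE id > ? ORDER BY id LIMIT ?";
        while (true) {
            int fetched = 0;
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, lastId);
                ps.setInt(2, PAGE_SIZE);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        fetched++;
                        lastId = rs.getLong("id");
                        writeToElasticSearch(rs);   // in practice buffered into a BulkRequest
                    }
                }
            }
            if (fetched < PAGE_SIZE) {
                break;   // reached the end of the table
            }
        }
    }

    private static void writeToElasticSearch(ResultSet row) {
        // map the row through the configured field mapping and add it to the bulk buffer
    }
}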

Full and incremental synchronization affecting each other

Let’s take an example scenario to better understand this impact:

  1. The program reads the data with IDs 1 ~ 1000 into the memory and prepares for full update

  2. MySQL modifies the data with id = 100 and changes the name from "Zhang San" to "Li Si"

  3. The program performs incremental updates normally, verifies that the data name with _id 100 in ElasticSearch is "Zhang San", and updates it to "Li Si"

  4. The full-synchronization step then writes rows 1 ~ 1000 to ElasticSearch, and the document with _id 100 gets its name written as "Zhang San"

  5. The full synchronization continues until it finishes

At the end of this process, the name of the row with id 100 in the database is "Li Si", but because the full and incremental updates interfered with each other, the ElasticSearch document was mistakenly overwritten with "Zhang San".

By refining the lock granularity with segmented locks, full synchronization and incremental synchronization can run at the same time; a sketch follows.
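One way to refine the granularity is to stripe locks by id segment: the full synchronization locks only the segment it is currently copying (reading from MySQL and writing to ElasticSearch inside the same lock hold), while incremental updates to other segments proceed in parallel. A simplified sketch with a hypothetical segment size:

import java.util.concurrent.locks.ReentrantLock;

public class SegmentLocks {

    private static final int SEGMENT_SIZE = 1000;   // ids 0-999 share one lock, 1000-1999 the next, ...
    private final ReentrantLock[] locks;

    public SegmentLocks(int segments) {
        locks = new ReentrantLock[segments];
        for (int i = 0; i < segments; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    private ReentrantLock lockFor(long id) {
        return locks[(int) ((id / SEGMENT_SIZE) % locks.length)];
    }

    // Incremental path: lock only the segment that contains this row's id.
    public void incrementalUpdate(long id, Runnable writeToEs) {
        ReentrantLock lock = lockFor(id);
        lock.lock();
        try {
            writeToEs.run();
        } finally {
            lock.unlock();
        }
    }

    // Full-sync path: hold the segment's lock while the segment is read from MySQL and written to ES,
    // so an incremental update for the same ids happens either before the copy or after it, never in between.
    public void fullSyncSegment(long segmentStartId, Runnable copySegment) {
        ReentrantLock lock = lockFor(segmentStartId);
        lock.lock();
        try {
            copySegment.run();
        } finally {
            lock.unlock();
        }
    }
}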

Conclusion

Synchronization of heterogeneous databases has always been a relatively complex issue. I hope this article can help you learn more about heterogeneous database synchronization and provide you with more ideas and help.

Author

Qi is a 2021 campus-recruitment graduate on Bianlifeng's Construction and Equipment Technology Team.

Finally, Bianlifeng is looking for excellent partners. We will take every resume seriously and look forward to meeting you.

  • Email address: [email protected]

  • Email subject: Construction and equipment technical team


PS: You can also apply directly through the referral link at the top of the article~


Origin: blog.csdn.net/qq_20051535/article/details/120389517