Data inconsistency caused by migrating data to MongoDB, and how to fix it


Story background

Enterprise status

At the beginning of 2019, I received a mysterious phone call. The person on the other end actually greeted me by my nickname: Shanghai Xiaopang.

I thought this was unusual, so I replied: Hello, I am Xiaopang. May I ask who is calling?

"I just added you on WeChat, xxx"

Oh... he had simply read out my WeChat nickname...


After talking in depth, I learned that the caller was the technical director of the big data department at a classified unit of a centrally administered state-owned enterprise. The whole group was undergoing a digital transformation, and several obstacles had come up during the decision-making process.

First, the data foundation of most departments and business units was still weak: data standards were inconsistent, data quality was uneven, and serious data silos separated the various business blocks, all of which hindered the sharing and application of data.

Second, limited by the scale of the data and the richness of its sources, data applications had only just gotten started, concentrated on a few scenarios such as precision marketing, public-opinion monitoring, and risk control. The applications were not deep enough, and there was still plenty of application space to develop.

Third, because the value of data is hard to assess, it was difficult for the enterprise to measure the cost of its data and its contribution to the business, which made it hard to manage data assets the way tangible assets are managed.


In a spirit of seriousness, responsibility, and professional rigor, the technical director threw himself into the big data field, trying to find a product on the market that could meet his needs and help him solve these data pain points.

After further discussion, we learned that the current state of the enterprise's data was:

  • The data is scattered across departments and business units: 50+ units under 8 major departments
  • The data volume is very large: up to 100GB of data generated per hour at peak, with about 1TB stored every day
  • Rich data types, including:
    • Relational databases: Oracle, MySQL, PostgreSQL, GBase, GaussDB, etc.
    • Non-relational database: MongoDB
    • Structured files: XML, Excel, CSV, TXT
    • Unstructured files: audio, video, PDF
  • About 5 new projects come in every month, and connecting the data for each new project takes 1-3 months
  • Project cycles are long, and most of the time is spent on removing redundant data, cleaning, and filtering
  • The cost of maintaining multiple copies of the data keeps growing, which slows down research and development

Considering migration

While firmly carrying out the digital transformation strategy and fighting to turn a traditional data organization into a big data ecosystem, the technical director realized one thing: to win this battle, data integration had to be done.

Data integration: isn't that just the traditional data warehouse or data lake? After some market research, however, the technical director found that neither a data warehouse nor a data lake could support the future big data architecture he had in mind.


So what could those architectures not satisfy? Sharing data for application development.

In short, data warehouses and data lakes cannot deliver data in real time. And the current application scenarios are exactly as described above: the applications are not deep enough, and there is still plenty of application space to develop.

After several rounds of research, the technical director found a product called Tapdata. In his own words: "The concept of this product is very advanced; you could say it coincides exactly with my idea."

Spelled out, that idea is:

  • Complete data aggregation and collection through data synchronization
  • Provide data services to the outside through data publishing
  • Manage data assets effectively through data governance

And most importantly, the data is reusable and delivered in real time.

Solution

Architecture

Tapdata's data synchronization tool can set up synchronization across multi-source databases with simple drag and drop. At the same time, thanks to its flexible JS scripting capability, complex ETL scenarios can also be handled with ease.

Below is the architecture diagram we gave the technical director based on the state of their enterprise. Since this article is about data migration, I will only show the data synchronization part of the architecture.

(Architecture diagram: multi-source data synchronized in real time into a MongoDB sharded cluster)

The entire architecture uses a MongoDB sharded cluster as the underlying storage. The data synchronization tool extracts data from the multiple sources into MongoDB in real time, and data cleaning and filtering are completed during extraction.
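As a rough sketch of what that underlying storage looks like, the mongo shell commands below shard a target collection by a hashed business key. The database name, collection name, and shard key are placeholders for illustration, not the customer's actual schema.

```js
// Run against the mongos of the target cluster. All names are illustrative.
sh.enableSharding("integration")                                 // allow sharding for this database
sh.shardCollection("integration.orders", { orderNo: "hashed" })  // distribute documents by business key
sh.status()                                                      // verify shard and chunk distribution
```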

Technical implementation

When using the data synchronization tool for a migration, you need to go over the concrete scenario with the user, for example:

  • Is this a one-off data import, or does incremental synchronization need to be maintained afterward?
  • Are there complex ETL scenarios in the migration?
  • What are the latency requirements for synchronization?
  • What are the estimated data volume and peak load?

After clarifying the goals and requirements, we adopted a multi-node distributed collection setup to handle the data volume generated at application peaks. The estimate at the time was a peak of 100GB per hour and 500GB of storage per day.

Through the tool, different data sources are combined via task scheduling, and data cleaning is completed along the way.

This time, the user's requirements were mainly about synchronization performance and data volume; there were no heavy ETL requirements, only simple field renaming and field type conversion.

Therefore, setting up the synchronization from a source database to the target MongoDB takes only about a minute in the data synchronization tool.
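To give a feel for what that light ETL looks like, here is a plain JavaScript sketch of a per-record transform that renames one field and converts another's type. The field names are made up, and this is generic JavaScript rather than Tapdata's actual scripting API.

```js
// Illustrative per-record transform: one rename, one type conversion.
// Field names are hypothetical; this is not the tool's real scripting interface.
function transform(record) {
  record.customerName = record.CUST_NAME;     // rename CUST_NAME -> customerName
  delete record.CUST_NAME;

  record.amount = parseFloat(record.AMOUNT);  // convert a string column to a number
  delete record.AMOUNT;

  return record;
}
```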


Create a data source

(Screenshots: creating the data sources)

Orchestrate tasks

(Screenshot: orchestrating the synchronization task)

Before and after implementation

The data sources currently online include Oracle, MongoDB, MySQL, PostgreSQL, and GBase. More than 10 database clusters are involved, supporting 3 complete business lines at the same time, with peak concurrency reaching 180,000 operations per second.

This effectively removed the biggest obstacles blocking the technical director at the time: synchronizing very large volumes of data, and managing that data once it had landed.

When a new service is added, the user's engineers only need a simple drag and drop to set it up. This reduces development work, lets engineers spend more of their time on core business, and greatly shortens the project launch cycle.

Orphaned documents

Phenomenon

After the system had been running for a while, a newly connected application found that the data it read contained duplicates. Using the TD data comparison tool, we confirmed that the same table really did have different document counts in the source MongoDB and the target MongoDB.

For a data synchronization tool, this is about as serious as it gets. The core promise of data synchronization is data consistency, and the tool's data idempotency had been tested and certified by the China Software Evaluation Center.

Our team took this very seriously. If the inconsistency really had been caused by the synchronization tool, it would be a fatal bug, and all functionality would have to go through regression testing.

Investigation

We contacted the user's technical staff immediately and started a series of troubleshooting steps.

Confirm database type

The first step in troubleshooting was to confirm the database types and operating environment of the source and the target.

The databases involved in the task that produced the duplicates were:

  • Source database
    • MongoDB 3.2
    • single-node replica set
    • 64 cores, 256GB, SAS HDD
    • 10-gigabit fiber intranet
  • Target database
    • MongoDB 4.0
    • 6-shard cluster
    • 64 cores, 256GB, SAS HDD
    • 10-gigabit fiber intranet

Find the duplicate data

Since there is duplicate data, the first step is to find it.

Both the source and the target are MongoDB, which makes things easier: just write a small set of JS scripts. There is a gotcha here, which will come up again later: a sharded cluster has to be checked on each shard node, not through mongos.

The script is very simple. Because the data synchronization tool synchronizes by business primary key, I can traverse every document in the target collection, then query the source database by that business primary key and compare all the values.

The process is slow, but it just needs time.

One caveat: since the source database is a single node, it would in theory be better to compare against a synchronized copy of its data, but because the service had not yet gone live, the impact was small. On the target side, the comparison can read from the secondary nodes.
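A minimal sketch of that comparison script is shown below, assuming an illustrative database appdb, a collection orders, and orderNo as the business primary key; it is meant to be run against each target shard directly (not through mongos) so that orphaned copies are visible.

```js
// Sketch only: compare one target shard against the source by business primary key.
// Hosts, database, collection, and key names are all placeholders.
var source = new Mongo("source-mongo:27017").getDB("appdb"); // the single-node source
var target = db.getSiblingDB("appdb");                       // current connection: one target shard

var seen = {}; // business keys already seen on this shard (fine for a sketch, not for huge collections)

target.orders.find({}, { orderNo: 1 }).forEach(function (doc) {
  var key = doc.orderNo;

  if (seen[key]) {
    print("duplicate on target for business key: " + key);
  }
  seen[key] = true;

  // the source should contain exactly one document for each business key
  if (source.orders.count({ orderNo: key }) !== 1) {
    print("source/target mismatch for business key: " + key);
  }
});
```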

The comparison covered more than 20 tables with about 10 million documents in total, and turned up several hundred thousand duplicates. That is quite a lot of duplicated data.

What I call duplicate data here means documents whose business primary key should be unique, yet more than one document with that key is found on the target side.

Check the data synchronization tool log

Now that we had the duplicate data, the next step was to search the logs of the data synchronization tool.

If data had been written twice, or an ERROR had occurred during synchronization, the words "duplicate key" would be the lead to follow.

Unfortunately, nothing of the sort was found in the logs.

Check the MongoDB logs

The synchronization tool's logs turned up nothing, so I switched to the MongoDB logs. There I found a large number of recvChunk and moveChunk entries.

(MongoDB log excerpt: repeated moveChunk / recvChunk entries)

When I saw this, I suddenly felt a wave of drowsiness coming on.

Let me briefly explain what these log entries are doing. Because the target MongoDB is a sharded cluster, sharding has a very important concept: chunk migration. A sharded cluster stores data in units called chunks, and by default a chunk holds up to 64MB of data.

So what does this have to do with the data inconsistency? Here comes the part that wakes you right up. MongoDB balances data across shards by first copying a chunk from the shard 1 node to the shard 2 node, and only deleting the chunk on shard 1 after the copy has fully completed.

You can imagine that several links in this transfer can go wrong and cause the chunk migration to fail.
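Assuming the target namespace is appdb.orders, this is one way to look at chunk placement and recent migration activity from mongos; note that the "ns" field on config.chunks applies to MongoDB 4.x and earlier.

```js
// Inspect chunk placement and recent migrations on the target cluster (via mongos).
sh.status(); // per-shard chunk counts for every sharded collection

var configDB = db.getSiblingDB("config");

// How many chunks of the collection each shard currently owns.
configDB.chunks.aggregate([
  { $match: { ns: "appdb.orders" } },
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
]);

// Recent balancer / moveChunk events, newest first.
configDB.changelog.find({ what: /moveChunk/ }).sort({ time: -1 }).limit(20);
```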

At this point, we needed to go over with the user exactly what had been done that day.


Sure enough, there had been a network outage in the server room that day, and it happened just 10 minutes after the new service was connected. Let's reconstruct the scene of the crime.

  1. The user starts the synchronization task, and data begins flowing into the target database according to the configured rules, as expected.
  2. About 10 minutes into the synchronization, the server room loses network connectivity. The synchronization task enters its retry phase, and the MongoDB cluster nodes are all cut off from the network.
  3. During the outage, the chunk migration that MongoDB had in progress is forcibly terminated.
  4. Some time later the network recovers, and the synchronization tool's automatic retry mechanism resumes the task without any manual intervention.
  5. MongoDB starts chunk migrations again.


Did you notice? In step 5, MongoDB's new round of chunk migrations takes no account of the earlier migration that failed. To be precise, the metadata on the MongoDB config server still records that chunk as belonging to shard 1, but the data that had already been copied from the shard 1 node to the shard 2 node was never deleted. As a result, the final count of documents is larger than the original total.
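This is also why the counts diverge. On a sharded cluster, count() without a query predicate is answered from chunk metadata and can include orphaned documents, whereas an aggregation $count scans the data with shard filtering applied. The namespace below is again illustrative.

```js
// Counting the same collection two ways through mongos (illustrative namespace).
var appdb = db.getSiblingDB("appdb");

appdb.orders.count();                            // metadata-based: may include orphaned documents
appdb.orders.aggregate([{ $count: "actual" }]);  // scans with shard filtering: the logical total
```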

The fix

MongoDB's developers anticipated this problem, and the official documentation provides a solution.

To summarize it here: run the following script against the primary of each shard.

// Repeatedly call cleanupOrphaned until it has walked the whole collection.
// "<COLLECTION>" is the full namespace of the collection to clean.
var nextKey = { };
var result;

while ( nextKey != null ) {
  result = db.adminCommand( { cleanupOrphaned: "<COLLECTION>", startingFromKey: nextKey } );

  if (result.ok != 1)
     print("Unable to complete at this time: failure or timeout.")

  printjson(result);

  // stoppedAtKey is the shard key value to resume from; it is null when finished.
  nextKey = result.stoppedAtKey;
}

This script does one thing: it finds the ranges of documents on the shard that, according to the config server's metadata, do not belong to that shard (the orphaned documents) and deletes them.

Summary

Looking back at this incident alongside the official documentation, we can summarize a few points:

When using a data synchronization tool to migrate data into a MongoDB sharded cluster, the following steps are needed:

  • Stop the balancer during the migration: How to Stop the Balancer (a short sketch of the relevant shell commands follows this list)
  • Use the cleanupOrphaned command afterwards: How to clean up orphaned documents
  • When facing data inconsistency, troubleshooting can start from both the database and the synchronization logic
  • Leave professional matters to professionals.
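The balancer commands referred to above look roughly like this in the mongo shell; keep in mind that cleanupOrphaned itself must be run on each shard's primary, not through mongos.

```js
// Balancer control, run via mongos.
sh.getBalancerState();   // is the balancer currently enabled?
sh.stopBalancer();       // pause balancing, e.g. for the initial bulk load
// ... perform the migration, then run cleanupOrphaned on each shard's primary ...
sh.startBalancer();      // resume balancing afterwards
```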