Recalling a unicorn company's dual-system migration and merger solution

Preface

I ran into a former colleague a few days ago and we exchanged greetings. Seeing this old partner, who put us through a tough round of testing back then, reminded me of the dual-system migration and merger we did, since he was the one who tested my work on that project. Looking back, it was quite a remarkable project, so I'm writing it down to share, and as a keepsake for myself.

Background

My previous company was a P2P wealth-management company, back when P2P was in full swing in Hangzhou. Two things happened at the time: our own wealth-management platform was connected to a bank depository system, and our company acquired a local P2P company in Hangzhou (I have to say the company still had real strength back then). The company therefore decided to migrate the acquired company's user data into our own system. First, this avoided having to connect the acquired platform to the bank depository separately; second, it made the data easier to manage and maintain; and it also served as a demonstration for acquiring more platforms in the future (the ambition was grand, but ambition is no match for a red-header document from the regulators, hahaha).

Analysis

Below, I'll call them system A and system B: system B is the one being migrated, and it needs to be merged into system A.
For work like a system migration, I think we need to approach it from two angles:

  1. On the business side, the migrated data needs to be compatible with the existing system;
  2. On the technical side, the migration needs to keep both sides' data consistent, while minimizing the impact on users, to achieve an ideally smooth migration.

There's actually not much to say on the business side. Because the two systems' businesses differ, how to stay compatible with the existing business depends on the actual situation. For example, the user tables of systems A and B will have somewhat different fields, or conflicting concepts; that's all possible, and it comes down to your own choices. There is one special scenario worth considering, though: a user who has registered in both system A and system B and has activity in both. That also has to be handled according to the business scenario. Our approach at the time was to keep the record in system B, but not treat it as the real record, only as an identifier: from B's record you can look up the real user data in A.
Of course, since this is a solution write-up, it leans toward the technical side, so designing the solution is the overall task.
First, we must be clear about what this migration is meant to achieve:

  1. Data consistency;
  2. Migrate as smoothly as possible.

Data consistency here is actually not that hard a problem, even though it's the most essential requirement. Migrating smoothly is the real challenge. When I heard the requirement for a smooth migration, my first thought was of the garbage collectors described in JVM books, especially the classic analogy: sweeping the floor while melon-seed shells keep being spat out. This scene is the same. A smooth migration means that systems A and B, especially system B, will not stop serving, so new data may be generated during the migration, i.e. DML operations from user traffic.

Solutions

It doesn't look so easy. Is there really no solution? A similar problem that came to mind is MySQL master-slave replication. Most companies run MySQL in a master-slave architecture to guarantee availability. So how does a MySQL slave catch back up with the master after it breaks and is repaired? Right, what everyone thinks of: backup + binlog.
The binlog is a very important part of MySQL; it records every DML operation on every row. By replaying the binlog entries for a record, you can keep the data consistent across multiple copies. In other words, a backup plus the subsequent binlog entries reconstructs the final database state. This reminded me of my first company, where I used to aggregate MySQL tables into wide tables: we monitored the MySQL binlog, sent the events to Kafka, and consumed the Kafka messages to generate and maintain the wide-table records, keeping them consistent with the source tables.
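Just to illustrate that wide-table approach from my first company (it is not the migration scheme itself), here is a minimal consumer sketch. It assumes the upstream binlog listener publishes one replayable SQL statement per Kafka message; the topic name, addresses, and message format are all assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class WideTableSyncer {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption
        props.put("group.id", "wide-table-syncer");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/warehouse", "user", "pass")) {
            consumer.subscribe(Collections.singletonList("mysql-binlog-events")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each message is assumed to carry one replayable DML statement
                    // produced by the binlog listener upstream.
                    try (PreparedStatement ps = conn.prepareStatement(record.value())) {
                        ps.executeUpdate();
                    }
                }
            }
        }
    }
}
```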
Unfortunately, because there was no company-level engineering service for monitoring the MySQL binlog, we couldn't work directly at the MySQL level. But the idea still stands: we can do the incremental processing at the code level.

What is involved

System A

System A's role in this migration is not that important, but one thing is worth mentioning: because system B's users will be added, a block of auto-increment IDs in the user table has to be reserved. For example, system A's user table already had over 2,000,000 rows and system B's had over 500,000, so the auto-increment value was changed to 3,000,000, with new IDs starting from 3,000,000; the users from system B are inserted using the mapping (2,400,000 + B-system user ID).
Why do this? Mainly to make it easy to locate the data being calibrated. After the migration, data calibration is indispensable; without this mapping you could only match on the user's unique attributes, and when the same user exists in both systems A and B, that gets very awkward.
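A minimal sketch of the reservation and mapping; the offset and thresholds are the ones from the text, while the table and column names are assumptions:

```java
import java.sql.Connection;
import java.sql.Statement;

public class UserIdRemap {

    // Offset chosen so that all migrated B users (~500,000 of them) land in
    // 2,400,000..3,000,000, below the new auto-increment floor.
    private static final long B_USER_OFFSET = 2_400_000L;

    public static long mapBUserId(long bUserId) {
        return B_USER_OFFSET + bUserId;
    }

    public static void reserveIdRange(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
            // New registrations in system A now start from 3,000,000.
            st.execute("ALTER TABLE user AUTO_INCREMENT = 3000000"); // assumed table name
        }
    }
}
```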

System B

For system B, what must be guaranteed is that users can keep using it normally during the migration. Some performance loss is acceptable, but availability cannot be lost. Query operations are unaffected; but for inserts, updates, and deletes, the relevant SQL has to be recorded. Following the idea above, a backup is cut at a certain moment and the migration works against that backup, so the user's subsequent inserts, updates, and deletes must be synchronized as incremental SQL. Of course, some situations arise here: what if a user's data is being migrated right when new SQL is generated? What if the migration is already replaying incremental SQL and system B generates new SQL? We need to analyze all of these key scenarios.
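As a sketch of what recording the incremental SQL could look like, assuming a simple log table whose auto-increment ID preserves replay order (the schema and helper are my illustration, not the original code):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class IncrementalSqlLog {

    /*
     * Assumed backing table:
     *   CREATE TABLE incremental_sql (
     *     id BIGINT AUTO_INCREMENT PRIMARY KEY,   -- preserves replay order
     *     user_id BIGINT NOT NULL,
     *     sql_text TEXT NOT NULL,
     *     created_at DATETIME NOT NULL
     *   );
     */
    public void append(Connection conn, long userId, String sql) throws Exception {
        String insert = "INSERT INTO incremental_sql (user_id, sql_text, created_at) "
                + "VALUES (?, ?, NOW())";
        try (PreparedStatement ps = conn.prepareStatement(insert)) {
            ps.setLong(1, userId);
            ps.setString(2, sql);
            ps.executeUpdate();
        }
    }
}
```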

Migration and merging program

There's nothing much to the migration program itself, because the important part is the business-level migration and merging operations. It just needs to coordinate with system B, and once the data migration and merging are complete, execute those incremental SQL statements.

Calibration program

After the migration work is complete, a few days of observation need to be set aside. During those days, this calibration program runs on a schedule every day, checking the consistency of these users' data: profile, balances, products, and so on.
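A minimal sketch of one such check, reusing the ID mapping from earlier. The table and column names are assumptions (balance assumed stored as an integer number of cents), and the real program also compared user info and product data:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DailyCalibrationJob {

    private static final long B_USER_OFFSET = 2_400_000L; // same mapping as above

    /** Compares one migrated user's balance in system A against system B's data. */
    public boolean balanceMatches(Connection aDb, Connection bDb, long bUserId)
            throws Exception {
        String sql = "SELECT balance FROM user_account WHERE user_id = ?"; // assumed schema
        long aBalance = queryBalance(aDb, sql, B_USER_OFFSET + bUserId);
        long bBalance = queryBalance(bDb, sql, bUserId);
        return aBalance == bBalance;
    }

    private long queryBalance(Connection conn, String sql, long userId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : -1L; // -1 = missing row, flag for review
            }
        }
    }
}
```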

The overall flow

That was a lot of setup, and I've gone on for quite a while; I bet you're getting impatient, and if I keep rambling you'll close the page. What, you're closing it right now?! Friend, please stay a moment and give me some face; all this typing wasn't easy.

(The original post shows two flowcharts here: one of system B's request handling, and one of the migration process.)
Before explaining them, let me note two things that are done in advance:

  1. For every system B user, set a flag in Redis with value 0, e.g. key com:showyool:user:123 with value 0. 0 means the migration has not started; 1 means the migration is in progress.
  2. Send these user IDs to RocketMQ. When the migration program starts on multiple machines, each instance consumes these messages and stores the IDs in an in-memory ConcurrentLinkedQueue; the thread pool in the migration program then takes users from this queue one by one and processes them concurrently (a sketch of both steps follows this list).
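A sketch of these two preparatory steps, assuming Jedis and the RocketMQ Java client; the group names, addresses, and topic name are made up:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;
import redis.clients.jedis.Jedis;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class MigrationBootstrap {

    /** Per-instance queue that the migration thread pool drains. */
    static final ConcurrentLinkedQueue<Long> PENDING_USERS = new ConcurrentLinkedQueue<>();

    /** Seeds each B user's flag (0 = not started) and publishes the user ID. */
    public static void seed(List<Long> bUserIds) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("migration-producer"); // assumed group
        producer.setNamesrvAddr("localhost:9876");                                // assumption
        producer.start();
        try (Jedis redis = new Jedis("localhost", 6379)) {                        // assumption
            for (Long userId : bUserIds) {
                redis.set("com:showyool:user:" + userId, "0");
                producer.send(new Message("MIGRATION_USER_TOPIC",                 // assumed topic
                        String.valueOf(userId).getBytes(StandardCharsets.UTF_8)));
            }
        } finally {
            producer.shutdown();
        }
    }

    /** Each migration instance consumes the topic and queues IDs for its worker pool. */
    public static void startConsumer() throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("migration-consumer");
        consumer.setNamesrvAddr("localhost:9876");
        consumer.subscribe("MIGRATION_USER_TOPIC", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            for (MessageExt msg : msgs) {
                PENDING_USERS.offer(Long.valueOf(
                        new String(msg.getBody(), StandardCharsets.UTF_8)));
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```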

Flow chart of system B

Now we can walk through them, starting with system B's flow chart:

  1. When a user's DML request arrives, we first try to acquire the distributed lock key1; this is a user-level lock whose purpose is to coordinate with the migration program.
  2. If the lock is acquired, check in Redis whether com:showyool:user:xxx exists. Whether the value is 0 or 1 doesn't matter here: as long as the key exists, the user's data has not finished migrating, and since we hold key1 we don't have to worry about the migration program interfering (you can re-read this after the migration flow below). If the key exists, perform the DML normally, then store the assembled SQL into MySQL, and finally release key1. If the key does not exist, the user's data has already been migrated, so assemble the SQL and send it to RocketMQ. (You may ask why SQL goes to MySQL in the first case but to RocketMQ in the second. The assumption was that a message sent to RocketMQ may be consumed quickly, and in the first case the migration hasn't happened yet, so replaying inserts, updates, and deletes against records that don't exist yet would be inappropriate. Storing it in MySQL lets the migration program actively fetch it once the migration work finishes. As for the ordering of SQL execution, both MySQL and RocketMQ can guarantee it.)
  3. Going back a level: if key1 was not acquired at the start, read this user's flag from Redis. If it doesn't exist, the user's data has already been migrated and this is just normal user traffic; a previous request from this user probably grabbed the lock first. Perform the DML normally and send the assembled SQL to RocketMQ. If the value is 0, this is ordinary concurrent user traffic and the migration hasn't started yet, so perform the DML normally and store the SQL into MySQL. The remaining case is trickier: the value is 1, meaning the migration is underway; the migration program acquired key1 first and then set the flag to 1. System B can still perform the DML normally, but then comes the troublesome part: it must acquire a second user-level distributed lock, key2. Why a second lock? That's exactly the scenario we analyzed before; see the picture below (and the sketch after this list).
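Here is a condensed sketch of that decision tree. The lock, Redis, and persistence facades are hypothetical stand-ins, since the post doesn't show the real code:

```java
import java.util.Optional;

// Hypothetical facades standing in for the real project's lock/Redis/persistence code.
interface DistributedLocks { boolean tryLock(String key); void unlock(String key); }
interface RedisFlags { Optional<String> flag(long userId); }
interface SqlRecorder {
    void storeInMysql(long userId, String sql);   // incremental-SQL log table
    void sendToRocketMq(long userId, String sql); // replayed on system A's side
}

/** Condensed sketch of system B's handling of one DML request. */
public class BSystemDmlHandler {

    private final DistributedLocks locks;
    private final RedisFlags redis;
    private final SqlRecorder recorder;

    BSystemDmlHandler(DistributedLocks locks, RedisFlags redis, SqlRecorder recorder) {
        this.locks = locks;
        this.redis = redis;
        this.recorder = recorder;
    }

    public void onDml(long userId, Runnable dml, String sql) {
        if (locks.tryLock("key1:" + userId)) {         // coordinate with the migration program
            try {
                dml.run();                              // the user's actual insert/update/delete
                if (redis.flag(userId).isPresent()) {
                    recorder.storeInMysql(userId, sql);     // not migrated yet: log for replay
                } else {
                    recorder.sendToRocketMq(userId, sql);   // already migrated: forward now
                }
            } finally {
                locks.unlock("key1:" + userId);
            }
            return;
        }
        // key1 is held by someone else: decide based on the user's Redis flag.
        Optional<String> flag = redis.flag(userId);
        dml.run();
        if (flag.isEmpty()) {
            recorder.sendToRocketMq(userId, sql);       // migrated; ordinary concurrent traffic
        } else if ("0".equals(flag.get())) {
            recorder.storeInMysql(userId, sql);         // migration not started yet
        } else {
            handleDuringMigration(userId, sql);         // flag == "1": key2 path
        }
    }

    void handleDuringMigration(long userId, String sql) {
        // The key2 path; shown after the timing analysis below.
    }
}
```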

While our migration program is fetching and executing incremental SQL, new SQL from system B can arrive at three points in time. A brief analysis:

  1. The first new SQL: the migration program has not yet started processing incremental SQL, so adding it at this point is reasonable;
  2. The second new SQL: the migration program has already fetched the set of incremental SQL when a new statement is added. This is the classic phantom-read problem from database theory, and it is unreasonable;
  3. The third new SQL: the migration program has already finished processing the incremental SQL, so a statement added here would never be processed; similar to the second case.

So how do we handle the latter two cases? Look at the third one first: there's no need to store the SQL in MySQL anymore; just send it to RocketMQ to be consumed and processed. The troublesome one is the second case, which simply must not be allowed to happen, so we need concurrency control; and that is exactly why I introduced a second distributed lock!
As stated above: if the distributed lock key2 is acquired successfully, check whether the user still exists in Redis. If not, the migration program has finished with this user, so it's the third case: just send the SQL to RocketMQ. If the key still exists, it's the first case: store the SQL in MySQL. If acquiring key2 fails, then either it's ordinary concurrency among user traffic, or the migration program is processing incremental SQL; just retry and reacquire the lock, which rules out the second case.
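Continuing the sketch above, the key2 path might look like this; the retry loop is exactly what rules out the second case:

```java
    // Continuing BSystemDmlHandler: the key2 path, taken while the user's flag is "1".
    void handleDuringMigration(long userId, String sql) {
        while (true) {
            if (locks.tryLock("key2:" + userId)) {
                try {
                    if (redis.flag(userId).isEmpty()) {
                        recorder.sendToRocketMq(userId, sql);  // case 3: replay already finished
                    } else {
                        recorder.storeInMysql(userId, sql);    // case 1: replay not started yet
                    }
                } finally {
                    locks.unlock("key2:" + userId);
                }
                return;
            }
            // key2 held: either concurrent user traffic, or the migration program is
            // replaying incremental SQL right now (the would-be case 2). Retrying until
            // the lock frees means our SQL is never added mid-replay.
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
```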

Flow chart of the migration process

If you've understood system B's flow chart above, you'll certainly grasp the migration program's flow very quickly. Still, in all seriousness, let me walk everyone through it (see what a good attitude I have; how about leaving a like, pretty please):

  1. Threads in the thread pool take users from the ConcurrentLinkedQueue and migrate them one at a time.
  2. First, acquire the distributed lock key1. On success, set the user's flag in Redis to 1, marking the start of migration. Then comes the long stretch of migration and merging work; that's business code, so I won't say more about it. Once the migration and merging are done, acquire the distributed lock key2, i.e. enter the incremental-SQL replay phase. In most cases this succeeds quickly (after all, it was done in the middle of the night, when normal people are asleep; what, you weren't asleep? then what are you??). Replay the incremental SQL, clear the user's flag in Redis afterwards, and finally release key2 and then key1. If acquiring key1 fails at the start, the user is operating at that moment, so just retry.
  3. If acquiring the distributed lock key2 fails, user traffic has entered system B and holds key2. In that case record the user in Redis under a new key, e.g. com:showyool:fail:user:123, whose value is the failure count. After a while, acquire key1 and rerun the flow above; on failure, increment the counter. After 15 total failures, stop working on this user and move on to the next one. This accounts for the possibility of a hyperactive user, who might need separate handling, although that handling would be similar to the main flow (no such user appeared during the actual migration, though a few users did fail once or twice, which was quite a surprise; yes, a surprise). A sketch of the worker follows this list.
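Finally, a sketch of one migration worker, reusing the hypothetical facades from the system B sketch plus a few extra assumed methods on them; copyAndMerge and replayIncrementalSql stand in for the business code:

```java
// Extra assumed operations, beyond flag() from the earlier sketch.
interface MigrationRedis extends RedisFlags {
    void setFlag(long userId, String value);   // com:showyool:user:<id>
    void clearFlag(long userId);
    long incrementFailCount(long userId);      // com:showyool:fail:user:<id>
}

interface UserMigrator {
    void copyAndMerge(long userId);            // the long business migration/merge step
    void replayIncrementalSql(long userId);    // apply SQL logged in MySQL, in order
}

public class MigrationWorker implements Runnable {

    private static final int MAX_FAILURES = 15;

    private final DistributedLocks locks;
    private final MigrationRedis redis;
    private final UserMigrator migrator;

    MigrationWorker(DistributedLocks locks, MigrationRedis redis, UserMigrator migrator) {
        this.locks = locks;
        this.redis = redis;
        this.migrator = migrator;
    }

    @Override
    public void run() {
        Long userId;
        while ((userId = MigrationBootstrap.PENDING_USERS.poll()) != null) {
            migrate(userId);
        }
    }

    private void migrate(long userId) {
        if (!locks.tryLock("key1:" + userId)) {
            MigrationBootstrap.PENDING_USERS.offer(userId); // user traffic holds key1: retry later
            return;
        }
        try {
            redis.setFlag(userId, "1");                 // migration in progress
            migrator.copyAndMerge(userId);
            if (!locks.tryLock("key2:" + userId)) {
                // User traffic holds key2: count the failure; give up after 15 tries.
                if (redis.incrementFailCount(userId) < MAX_FAILURES) {
                    MigrationBootstrap.PENDING_USERS.offer(userId);
                }
                return;                                  // key1 released in finally
            }
            try {
                migrator.replayIncrementalSql(userId);
                redis.clearFlag(userId);                 // user is now fully on system A
            } finally {
                locks.unlock("key2:" + userId);
            }
        } finally {
            locks.unlock("key1:" + userId);
        }
    }
}
```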

Implementation

The core ideas and the plan are described above; now let me share how it actually played out at the time.

  1. Block system B's registration entrance and direct new registrations to system A.
  2. Start at 2 o'clock in the morning. This time was chosen because, first, at 2 a.m. most users are asleep, so there isn't much DML from user traffic. Second, a large wave of scheduled tasks runs in the early morning, computing interest, penalty interest, and so on (notice how Yu'ebao earnings on Alipay aren't displayed right at 0:00: the scheduled jobs are still crunching the numbers). Those scheduled tasks produce a large wave of DML while they run, so we avoided that window; they generally finish before 1:30.
  3. At 2 o'clock, take system B's service offline, release the new code to systems A and B, and deploy the migration and merging program. At the database level, take a backup of system B's database and change the auto_increment of system A's user table to 3,000,000. Before system B comes back online, set up the user flags in Redis, e.g. com:showyool:user:123=0. Strictly speaking this isn't a completely smooth migration, since it is a brief suspension of service; fortunately, this step doesn't take long.
  4. With all services back online, start the inspection and calibration program. I didn't say much about it above, because it's just business-level calibration, also in service of data consistency. In addition, send the user IDs to RocketMQ; the migration and merging programs consume this topic and store the IDs in their in-memory ConcurrentLinkedQueues, and the threads in the thread pools keep taking user IDs from those queues. Once user IDs appear, the migration begins.
  5. Next, watch the numbers: how many users have been migrated, how many are left.
  6. The flow chart above also mentions failed users. None appeared in practice, but had any been left at the end, we would have resent them to RocketMQ to be processed again.
  7. In the end, of course, it was handed over to testing... while I just watched the rising sun, not sure whether I was spacing out or sleeping.

Follow-up

My thoughts pulled me back from that moment to the bald uncle in front of me, and between the laughs there was a familiar, tacit understanding. It was a good memory after all, one of the flowers that bloomed along the road of my growth. My past work isn't necessarily perfect, and criticism and corrections are welcome; I just hope to keep going bravely, with you, with me, and with Java.


Author: showyool
Link: https://juejin.cn/post/6931718215177338893
Source: Juejin
 
