Talking about the tens-of-millions-scale system refactoring projects I have done over the years

Foreword

When the business grows to a certain scale, the old system hits bottlenecks, or the company's technology stack changes (for example, from PHP to Java), refactoring becomes necessary. But the online business must keep running stably while the refactor happens. How to make a smooth transition and reduce the risk of refactoring is therefore particularly important. This article discusses that question.

Looking back on my career: the three refactors I have done

First time: June 2016, Yidao's tens-of-millions-scale order system refactor

Overview: PHP stack; the order table was sharded across 1024 databases; it took 10 people 3 months to complete. A related article I found online: https://www.admin5.com/article/20160705/673189.shtml

Takeaway: This was my first exposure to refactoring, in my third year of work, and I was honored to join Boss Yu (the author of FastDFS and FastCFS) on the project. That was also when I began to understand "architecture".
I was particularly struck by the PHP wrapper we built over the underlying MySQL operations. I spent about three days writing a very flexible DB library that supported all kinds of chained operations, and Lao Yu rejected it outright. The boss said: "You only need insert, delete, update, and select; the simpler the better." I didn't understand at first, but the rewrite took only half a day. After we went live, I gradually came to realize that simplicity is beauty, and performance is beauty.

Second time: June 2018, image storage service refactor

Overview: Java stack (Spring Boot/Dubbo/Redis); the image storage table was sharded across 512 databases; 3 people, 2 months.

Background: 1. The database ran in one-master-many-slaves mode with a single write DB, and write pressure on that DB was very high at peak. 2. The image service was mixed in with other business services, which was quite chaotic.

Goals: 1. Storage independence: image storage separated from other storage. 2. Multi-master write mode, with data distributed across multiple DBs, so that write and read throughput scales linearly by adding physical machines. 3. Clear business boundaries: the service handles image storage only, separated from business features.

Takeaway: I wrote the technical design for this refactor and led the development. The sharding middleware was built in-house directly on top of JDBC (we did not use open-source middleware such as Sharding-JDBC mainly because it did not fit our business scenarios).

Third time: July 2021, order sharding refactor

Overview: Java stack (Spring Boot/Dubbo/Redis/MongoDB); the order table was split into 8 databases and 256 tables; 7 people, one month.

I also wrote an earlier article about it: Talking About the Road to Order Refactoring.

Takeaway: I led the technical design and the development.
With a tight schedule and a heavy workload, we advanced in an orderly way and shipped quickly, solving the order-performance problem by working from the characteristics of the business itself.
The underlying Dubbo service currently sustains 50K+ QPS at online peak, with write RT < 5 ms and read RT < 2 ms; the performance can fairly be called excellent.

According to stress-test results, a single database cluster currently supports 10K TPS for order placement and 50K TPS for reads, and the design supports horizontal expansion up to 8 database clusters.

Refactoring: what needs to be done?

It so happened that a friend's company recently switched from a PHP stack to Java, and he asked me some questions about system refactoring. I have summarized three points:

1. What to do

First, what problem is being solved? What are the benefits of the refactor? Will there be visible results when it is done? What is the input-output ratio? And so on.

1. First consider the learning cost for team members. The language switch itself is not a big problem (in my experience), but plan for a buffer period of roughly 3 months. 2. Know clearly what the bottleneck is and what the core problem to solve is. 3. Divide into modules: decide which parts to refactor into Java first, start from one point to solve the core problem, and get the Java stack complete (registry, config center, sharding middleware, monitoring, alerting, full-link tracing) and running stably before migrating the rest.

2. How to do it?

Once you know what to do, analyze the modules that need refactoring: go through every SQL statement, decide how to split databases and tables, and decide whether the table structure needs changes.

What is the shard key of the sharded tables, and how do you query on non-shard-key fields (build a redundant mapping table, or aggregate into ES + HBase for queries, etc.)?
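To make the "redundant table" option concrete, here is a minimal in-memory Java sketch. The class and field names are my own illustration, and plain maps stand in for the sharded order table and the mapping table; in a real system both would live in MySQL and the mapping table would itself be sharded by user_id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: orders are sharded by order_id, so a query by user_id
// (a non-shard-key field) goes through a redundant user_id -> order_ids table
// that is maintained at write time.
public class OrderQueryByUser {
    final Map<Long, String> ordersByOrderId = new HashMap<>();       // stands in for the order shards
    final Map<Long, List<Long>> orderIdsByUserId = new HashMap<>();  // the redundant mapping table

    public void placeOrder(long orderId, long userId, String payload) {
        ordersByOrderId.put(orderId, payload);
        // Keep the redundant index in step with every write.
        orderIdsByUserId.computeIfAbsent(userId, k -> new ArrayList<>()).add(orderId);
    }

    // Non-shard-key query: resolve order ids via the redundant table first,
    // then fetch each order from its own shard by order_id.
    public List<String> ordersOfUser(long userId) {
        List<String> result = new ArrayList<>();
        for (long orderId : orderIdsByUserId.getOrDefault(userId, List.of())) {
            result.add(ordersByOrderId.get(orderId));
        }
        return result;
    }
}
```

The trade-off is an extra write per order in exchange for turning a scatter-gather query across all shards into two targeted lookups.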

How to split: into databases only, into tables only, or both? How many databases, and how many tables?

When splitting into multiple databases, the main considerations are whether there is a write bottleneck and how convenient future horizontal expansion will be. Splitting into multiple tables is mainly for large query volume and large individual tables; it is recommended to keep a single table under 10 million rows.
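As an illustration of the 8-database, 256-table layout mentioned earlier, here is a minimal Java sketch of shard routing. The modulo scheme and the mapping of 32 consecutive table slots to each database are my assumptions for illustration, not the actual in-house middleware described in the article.

```java
// Hypothetical routing for an 8-database x 256-table layout:
// 256 logical table slots, with 32 consecutive slots per physical database.
public class ShardRouter {
    static final int TABLE_COUNT = 256;
    static final int DB_COUNT = 8;

    // Table slot: order id modulo the total number of table shards.
    static int tableIndex(long orderId) {
        return (int) Math.floorMod(orderId, (long) TABLE_COUNT);
    }

    // Database index: 256 / 8 = 32 consecutive table slots per database.
    static int dbIndex(long orderId) {
        return tableIndex(orderId) / (TABLE_COUNT / DB_COUNT);
    }

    public static void main(String[] args) {
        long orderId = 20210701001L;
        System.out.println("order " + orderId + " -> db_" + dbIndex(orderId)
                + ".order_" + tableIndex(orderId));
    }
}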

If the table structure changes, is double-writing needed? How is the double-write migration done? How do you backfill historical data, run comparison scripts, and so on?
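The double-write idea can be sketched as follows. This is a deliberately simplified in-memory version: maps stand in for the old and new databases, the switch would normally come from a config center, and the comparison job here is the batch script that flags rows needing backfill. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical double-write sketch for a table-structure migration.
public class DoubleWriteOrderDao {
    final Map<Long, String> oldDb = new HashMap<>(); // stands in for the legacy order table
    final Map<Long, String> newDb = new HashMap<>(); // stands in for the resharded table
    volatile boolean doubleWriteEnabled = true;      // runtime switch, e.g. from a config center

    public void saveOrder(long orderId, String payload) {
        oldDb.put(orderId, payload);          // the old path stays authoritative during migration
        if (doubleWriteEnabled) {
            try {
                newDb.put(orderId, payload);  // best-effort write to the new store
            } catch (RuntimeException e) {
                // log only: a new-store failure must never break the order flow
            }
        }
    }

    // Comparison job: ids whose two copies diverge and need backfilling.
    public List<Long> findMismatches() {
        List<Long> diverged = new ArrayList<>();
        for (Map.Entry<Long, String> e : oldDb.entrySet()) {
            if (!e.getValue().equals(newDb.get(e.getKey()))) {
                diverged.add(e.getKey());
            }
        }
        return diverged;
    }
}
```

Once the comparison job reports zero mismatches over a long enough window, reads can be switched to the new store and the old path eventually retired.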

How is the grayscale plan done, and how do you guarantee a smooth migration?

If time is tight, finish the core parts first, such as the business split and the database/table sharding. Auxiliary pieces such as monitoring, alerting, and full-link tracing can be completed after the grayscale launch.

3. What do you need?

Make a preliminary estimate against the expected goals.

For example, how much development manpower needs to be invested (some people work on the refactor while others must keep the existing business running),

how much order TPS must be supported after the refactor, and how many Docker service instances, MySQL database resources, Redis machines, and so on will be needed.

Refactoring: what matters most?

Answer: everything matters; the links interlock, and not one of them can be skipped.

But if I had to pick just one,

then it would be: the "grayscale plan".

Experience tells me there is no such thing as a refactor with no problems. So the core question is how to minimize the risk, and that is exactly what the grayscale plan solves.

Next, let's focus on the grayscale plan.

If you are only refactoring a single interface, it's easy: just add a switch. The blast radius will be small.

But if the entire order system is being rebuilt and the table structure has changed too, can you cut over directly? I can't imagine the consequences if something goes wrong.

The common solution is to do the grayscale rollout by traffic.

| Stage | Traffic ratio | Notes |
| --- | --- | --- |
| Stage 1 | whitelist | Usually internal staff; only these people can reach the new system, so a problem affects only the people I added. |
| Stage 2 | 1/10,000 | A problem affects only one user in ten thousand. For example, at 100,000 orders per minute, a problem affects about 10 people per minute. Most issues can already be found at this stage. |
| Stage 3 | 1% | Traffic is gradually enlarged. |
| Stage 4 | 10% | Intermittent problems, such as concurrency issues, usually surface at this stage. |
| ... | ... | ... |
| Stage N | 100% | Once 100% of traffic is on the new system, the old system exits the stage of history. |

How many stages to use depends on the company's business scenarios and request volume; the purpose is to reduce risk and keep it controllable.

So how do you implement grayscale by traffic?

Readers who follow me know I have written about this before: developing a lightweight, traffic-controlled grayscale module based on OpenResty (Nginx + Lua). When is that needed?

It depends on the scope of the refactor. If you are only refactoring the order system and other teams are unaffected, then adding a grayscale algorithm at the gateway layer is enough.

The idea is very simple: take an integer random number and compute whether rand % 10000 < X; when true, route to the new system, otherwise to the old one. With a base of 10,000, a 10% rollout means X = 1000, and the one-in-ten-thousand stage means X = 1.
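Here is the same check as a small Java sketch (the article's gateway module is OpenResty/Lua; this is my own illustration). One common variant, used here, is to hash the user id instead of drawing a fresh random number, so that each user lands consistently on either the old or the new system across requests.

```java
// Hypothetical gateway-side grayscale check with per-ten-thousand granularity.
public class GrayscaleRouter {
    static final int BASE = 10000;

    // x is the per-ten-thousand threshold: x = 1 -> 0.01%, x = 1000 -> 10%.
    // Hashing the user id makes routing sticky per user.
    static boolean useNewSystem(long userId, int x) {
        return Math.floorMod(Long.hashCode(userId), BASE) < x;
    }

    public static void main(String[] args) {
        int hits = 0;
        for (long uid = 0; uid < 100_000; uid++) {
            if (useNewSystem(uid, 1000)) hits++; // 10% rollout
        }
        System.out.println(hits + " of 100000 users routed to the new system");
    }
}
```

Ramping up a stage is then just raising x in the config, with no redeploy of either system.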

But if the scope of the refactor is large, changing the company's underlying architecture and affecting multiple teams, then it may be necessary to develop a Lua grayscale module at the Nginx layer.

Summary

Boss Yu once said: "One refactor is enough for a career," yet I have been through three. I have to say, though, that each refactor improved the system's performance, and it was a real pleasure to solve problems I had never met before. Perhaps that is the beauty of architecture.

Here, I also want to thank the brothers who fought alongside me.

Finally, if any of you run into problems with converting PHP to Java, or with refactoring in general, feel free to message me privately.

You are welcome to follow the official account "Talking About Architecture", where I share original articles from time to time. Let's exchange ideas, learn together, and encourage one another!


Origin: blog.csdn.net/weixin_38130500/article/details/122711225