Technical breakthrough, tenfold business growth: the secrets behind refactoring a billion-level e-commerce platform

Refactoring helps the business take off: the technical story of a billion-level e-commerce platform growing tenfold in a year

Introduction

In this article, I describe how we refactored, from 0 to 1, an e-commerce platform with a cumulative GMV of more than 10 billion while the business was growing at hurricane speed: tenfold in one year, from an annual GMV of 500 million to 5 billion.

Main topics:

  • Why refactor?
  • How is refactoring initiated and implemented?
  • What challenges and pitfalls have we encountered in practice?

Comments and discussion are welcome; let's learn together.

Core metrics after the refactoring

Let's look at the data first. These are the business results we achieved on the system's core metrics after the refactoring:

1. System performance improved, boosting business results : the platform supports instant flash sales with 500,000+ simultaneous online users, and full-link performance rose from 50,000 QPS to a peak of 320,000 QPS. Monthly GMV grew from 100 million+ to more than 1 billion.

2. System architecture transformed, a generational upgrade : after moving from a monolithic architecture to a microservice architecture, the system's architectural capabilities improved significantly. The technical team is no longer held back by technical debt, and technical problems no longer hinder business development.

3. A business middle platform built for efficient delivery : by building a business middle platform for the core e-commerce domains, we realized a "small front end, large middle platform" architecture, which improved our ability to deliver business functions efficiently.

4. Technical capabilities open-sourced, giving back to the community : we contributed a distributed two-level cache framework called L2Cache to the world's top IT open-source community to help solve problems such as cache avalanche, cache penetration, and hotspot keys. What we take from open source, we give back to open source. So far we have contributed four open-source projects ( GitHub repository address ).

https://github.com/weeget-tech/l2cache
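
To make the idea of a two-level cache concrete, here is a minimal sketch in Python. It is not L2Cache's actual API or implementation: the class `TwoLevelCache`, the `loader` callback, and the plain dict standing in for a shared remote cache are all assumptions made for illustration. It shows a local in-process cache in front of a shared cache, and it caches a placeholder for missing keys so that repeated lookups of nonexistent data do not reach the database, which is one common defence against cache penetration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

_MISSING = object()  # placeholder cached for keys the database does not have


class TwoLevelCache:
    """Local in-process cache (level 1) in front of a shared remote cache (level 2)."""

    def __init__(self, remote: Dict[str, object],
                 loader: Callable[[str], Optional[object]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._local: Dict[str, Tuple[float, object]] = {}  # key -> (expiry, value)
        self._remote = remote    # assumption: a dict stands in for the remote cache
        self._loader = loader    # called only when both cache levels miss
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, key: str) -> Optional[object]:
        now = self._clock()
        entry = self._local.get(key)
        if entry is not None and entry[0] > now:
            value = entry[1]                      # level 1 hit
        elif key in self._remote:
            value = self._remote[key]             # level 2 hit, refill level 1
            self._local[key] = (now + self._ttl, value)
        else:
            loaded = self._loader(key)            # full miss, go to the database
            value = _MISSING if loaded is None else loaded
            self._remote[key] = value
            self._local[key] = (now + self._ttl, value)
        return None if value is _MISSING else value
```

With a loader that records its calls, two lookups of the same key (present or absent) should reach the loader only once. A production cache would also need expiry on the remote level, invalidation between nodes, and protection against many callers loading the same hot key at once, none of which this sketch attempts.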

1. Background

Why refactor?

From 2018 to 2019, during the start-up stage of the business, we chose a monolithic architecture in order to verify the product's feasibility quickly and support rapid iterative releases. As the business developed and new features piled up, the system became large, bloated, and complex, and it was full of code where touching one place affected everything else. Whenever there was a big promotion, the only way to barely withstand the instantaneous traffic was to spend money upgrading infrastructure.

At the beginning of 2020, as the epidemic began, the shift to online shopping accelerated further, and Yunhuobao's user numbers and business volume grew explosively. At the same time, the system faced increasingly severe tests: it went down during every major promotion, and spending money on upgrades no longer helped. The underlying architecture of the old system was seriously holding back the rapid development of the business.

As the product and R&D team kept expanding (nearly 300 people at its peak), problems with multi-person collaboration began to emerge. For example, code branches and feature development often conflicted. The pain points of the monolithic architecture became more and more obvious, and technology became the biggest bottleneck to business growth.

What problems does refactoring solve?

Before the refactoring, the system faced the following main problems:

1. Low product delivery efficiency and frequent online problems : deliveries were often delayed. Code merge conflicts, requirements queuing for release, and slow SQL in production occurred from time to time, so the business often ended up waiting for technology. After release, the technical team frequently had to clean up after incidents, and product bugs kept appearing.

2. Low cross-team collaboration efficiency and high communication costs : constrained by the old technical framework and the shortcomings of the historical architecture design, it was difficult to deliver product functions efficiently when multiple teams collaborated.

3. Performance problems in the code and frequent downtime : the business logic was highly coupled, which made the code hard to maintain and hard to read. A small error could cause fatal damage to the entire system, and system performance had a huge bottleneck. The confidence of the engineers was severely hit.

4. Heavy and growing technical debt : with the rapid growth of the business, requirements iterated faster and the team expanded quickly. However, because of historical issues with technical specifications, technical frameworks, and system architecture, the team had accumulated a large amount of technical debt, making it difficult to cope with rapidly changing market demands.

These pain points are typical of a monolithic system that has grown to a certain stage, and they were the core problems we needed to break through.

How did we solve these problems?

To solve these pain points completely, we finally chose "refactoring".

Why choose global refactoring instead of local optimization?

At that time, the business was growing rapidly and technical problems kept emerging. We found that simply optimizing the existing system could not solve the problems at their root. So we made a difficult decision and chose "refactoring" to deal with these pain points.

We chose to put down the burden of history and start anew. We believed it was the best choice at the time.

Next, we describe the practical process of the refactoring in detail.


2. Practical process of system refactoring

At that time, the business was growing rapidly, but the system frequently went down. Once the decision had been made and "refactoring" was imperative, the question became: how do you change the tires on a car that is moving at high speed, without stopping?

Here we share our practical process of system refactoring, in the hope of offering some reference ideas to readers with similar refactoring needs, especially the technical lead who is ultimately responsible for the refactoring.

Below we will describe each step in detail.

Step 1: Build a team

Appoint a refactoring technical lead

System refactoring is a very important task. First of all, there must be an experienced and skilled technical lead to guide the whole process. This is the first step, and the most critical one.

Why? Because every decision about the refactoring is made by this person.

What kind of people to choose? What technology to choose? What to do and what not to do? How to do it? These decisions determine the direction and the success rate of the refactoring.

At that time, the decision was to set up a new department (the architecture department) and recruit some new people. Functionally, it was separated from the business delivery development department and focused only on refactoring.

Why create a new department?

1. In the existing team, few people had complete experience of refactoring a medium or large e-commerce system.

2. The existing team members had lived through many pain points of the old system, so they had scruples and could not make drastic changes, whereas people in the new department carried no historical burden.

3. With functions separated and focused, refactoring requirements no longer compete with business requirements for priority, a competition that would otherwise keep pushing the refactoring work down the queue.

Tips: If you are in the same situation as we were and refactoring is imperative, we recommend separating the "refactoring team" from the original business delivery and development team so that it can stay focused.

Step 2: Determine the technical framework and specifications

1. Determine the architecture selection

Why did we choose a "microservice architecture"?

1. High scalability: a microservice architecture splits the system into multiple small services, each of which can be deployed, scaled, and maintained independently, which makes horizontal scaling easier.

2. Higher reliability and fault tolerance: in a well-isolated microservice architecture, the failure of one service affects only that service's functions rather than the entire system. A single failure therefore does not crash the whole system, which reduces risk and loss.

3. Better team collaboration: each service can be developed and maintained by a different team, so coupling between teams is reduced and collaboration becomes more flexible and efficient.

4. Better flexibility: services can be added, modified, or removed as business needs change, which makes the system more flexible.

2. Determine the technology stack and framework

Moving from a monolithic architecture to a microservice architecture starts with technology selection, and our choice was Spring Cloud Alibaba.

Why did we choose Spring Cloud Alibaba?

Spring Cloud Alibaba is an open-source microservice solution based on Spring Cloud. We chose it for the following reasons:

1. A complete, standardized microservice system: it covers service registration and discovery, load balancing, a configuration center, distributed transactions, and more. With these components, a stable and reliable microservice system can be built quickly.

2. High performance and good stability: Spring Cloud Alibaba builds on Alibaba's Nacos service registry and Sentinel flow control components, which perform well, are stable, and can support large-scale microservice scenarios.

3. A broad ecosystem and an active community: the ecosystem includes many commonly used open-source components, such as Spring Boot, Dubbo, and RocketMQ, which integrate tightly with Spring Cloud Alibaba to form a complete microservice solution.

4. Easy integration and deployment: Spring Cloud Alibaba provides convenient tools and plug-ins that fit into an existing development environment, supports containerized deployment and automated operations, and makes it quick to develop, deploy, and maintain microservices.

To sum up, choosing Spring Cloud Alibaba helped us build a stable and reliable microservice system more quickly and conveniently, and it can meet the needs of large-scale application scenarios.
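
For readers unfamiliar with what a flow control component does, here is a minimal token bucket sketch in Python. It only illustrates the general idea of limiting request rates during a traffic spike; it is not Sentinel's API, and it makes no claim about which algorithm Sentinel uses internally. The class name `TokenBucket` and its parameters are assumptions made for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow about `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # over the limit: the caller should reject or degrade
```

A request handler would call `try_acquire()` before doing any work and return a "busy, try again" response when it returns `False`, so that a flash sale spike degrades gracefully instead of taking the service down.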

3. Formulate development specifications

Formulating development specifications is a necessary part of a technology upgrade. They are the guidelines and standards the entire team follows, and they help ensure code quality, maintainability, and scalability while improving development efficiency and reducing maintenance costs.

The core content of our development specifications was as follows:

- Define coding specifications: including interface specifications, code format, naming conventions, comment requirements, framework specifications, architecture specifications, etc.

- Develop a code review process: including the frequency of code review, participants, review content, etc.

- Define test requirements: including test types, test coverage, test tools, etc.

- Define document requirements: including document type, document content, document format, etc.

- Formulate a version control process: including branch strategy, version number management, commit message format, code merging process, etc.

- Develop a deployment process: including automated builds, automated deployment, etc.
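
Specifications are easiest to keep when they can be checked automatically. As an illustration of the version control item above, the following Python sketch validates the message submitted with each commit. The format `<type>(<scope>): <subject>`, the list of allowed types, and the 72-character limit are hypothetical choices for this example; the article does not state the team's actual convention.

```python
import re

# Hypothetical convention: "<type>(<scope>): <subject>", subject up to 72 characters.
COMMIT_PATTERN = re.compile(
    r"^(feat|fix|refactor|test|docs|chore)\(([a-z0-9-]+)\): (.{1,72})$"
)


def is_valid_commit_message(message: str) -> bool:
    """Check only the first line of the commit message against the convention."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    return COMMIT_PATTERN.match(first_line) is not None
```

A check like this could run in a commit hook or in the CI pipeline, so that a message such as `refactor(order): split order service from monolith` passes while `fixed stuff` is rejected before it reaches the shared branch.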

4. Quickly implement the microservice framework

To land the microservice architecture quickly, we first ran one MVP business scenario (product search) end to end, which exercised the overall framework, the technical specifications, unit testing, integration testing, acceptance testing, interface contract testing, and the new development processes such as DevOps.
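
Of the test types listed, interface contract testing is probably the least familiar, so here is a minimal sketch of the idea in Python. The consumer of a service writes down the fields and types it relies on, and a test checks that the provider's response still honors them. The field names in `SEARCH_ITEM_CONTRACT` are hypothetical and are not taken from the real product search interface.

```python
from typing import Any, Dict, List

# Hypothetical contract for one item of a product search response.
SEARCH_ITEM_CONTRACT: Dict[str, type] = {
    "productId": str,
    "title": str,
    "priceCents": int,
}


def contract_violations(item: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return readable violations; an empty list means the item honors the contract."""
    problems = []
    for field, expected_type in contract.items():
        if field not in item:
            problems.append(f"missing field: {field}")
        elif not isinstance(item[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems
```

Extra fields in the response are deliberately ignored, so the provider can add data without breaking consumers; only removing or retyping a field that a consumer depends on fails the check.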

Step 3: Determine the scope of refactoring

Determining the scope of the refactoring is critical. It determines the complexity and duration of the work, and so directly affects whether the refactoring succeeds or fails.

Therefore, we need to think carefully about what the refactoring scope is and where its boundary lies.

We chose the most challenging part, the "core main link", as the first breakthrough.

The "core main link" here includes: the login page, the platform home page, the product list page, the product details page, the coupon usage flow, the order confirmation page, the order flow, the payment flow, the refund function, and the shopping cart.

Why did we choose the "core main link"?

- Team confidence: after the setback of a previous failed refactoring, the technical team lacked confidence in a new round.

- The biggest bottleneck to business growth: while the business was growing rapidly, frequent downtime and technical problems had become the biggest bottleneck.

- The value and significance of refactoring: optimizing a few small problems can yield small wins in the short term, but it does little for problems rooted in the underlying architecture design.

In the end, to show the technical team's confidence and determination and to revive its morale, we chose the "core main link", which had the largest performance bottleneck and was the hardest to refactor, as the breakthrough.

This was not only to remove the bottleneck to business growth, but also to demonstrate the team's strength and determination and to convey our belief in victory.

Step 4: Sort out the original business logic

Why sort out the original business logic?

1. Each e-commerce platform has its own product and business rules; no single set of business logic applies to every platform.

2. A fast-growing business cannot be interrupted or affected in any way, so the business processes must remain correct after the refactoring.

3. The core of this refactoring is redesigning and rebuilding the underlying architecture, while the original business logic stays unchanged.

Therefore, before the new team starts writing code, sorting out and getting familiar with the original business logic is a critical and necessary step.

How to sort it out?

Here are several points from our practice:

1. Assign two people to sort out each core business area together, to improve the accuracy of the business logic understanding

For a specific business area, developers need to focus and go deep into the details. Once the scope was set to the "core main link", for example, 2 people were responsible for sorting out [order-related business] and 2 people for [commodity-related business].

Why 2 people? To ensure the accuracy of the business analysis.

This division of labor not only makes the work more efficient, it also lets the two people cross-check and correct each other, which better guarantees accuracy at the business logic level.

2. Have a product manager join the team, so that the team has a global business perspective

The product manager's responsibility is to focus on the overall business: sorting out an overview of the entire product and communicating quickly with team members to help the team understand the business as a whole.

From the product point of view, this ensures that what the team delivers meets business expectations.

3. When sorting out business logic, you don't have to read the old code

Not all business logic has to be understood from legacy code. As long as the business scenario is clear, we can write brand-new code directly from the product's business logic without being bound by the old code.

This way the team is not led or disturbed by historical code and can shed the historical burden faster.

Step 5: New system architecture design

System architecture planning and design

Based on our understanding of the current situation and our plans for the future, we divided the "core main link" into the basic domains of e-commerce, among which the user domain, the order domain, and the product domain are the core domains within the scope of this refactoring.

Overall system architecture diagram design

 

Step 6: Determine Refactoring Goals

Identify long-term refactoring goals

The long-term goal was to improve the performance and reliability of the entire platform through the refactoring, so that the platform could support business development over the coming year.

Identify short-term refactoring goals

With the long-term goal as the baseline, we then set short-term refactoring goals, drew up a project plan, and moved forward in small, quick steps.

The short-term goals were: no more system downtime, no business interruption, and technology no longer holding the business back .

Step 7: Implement the refactoring plan

This is a project management process, and there is plenty of material about it online, so I won't go into detail here.

Here are several points from our practice:

1. Does everyone who should know actually know?

This is the most critical step. Most problems come from "someone didn't know": a relevant party was not aware of the matter, such as the scope of the changes or the schedule.

The relevant parties here are, first of all, those on the technical and process side, and then the business departments, the users, and company management.

2. Have you rehearsed it in your head?

Almost all accidents are related to insufficient preparation. If we do not plan well in advance, overlook some steps, or fail to reserve enough time, the whole project is very likely to be delayed or go wrong. So play the entire plan through in your head as fully as possible.

For example, the project must have a clear planning meeting that lists what to do at each step, the sequence and the person in charge, the key milestones, the resources and support required, and the abnormal situations that may arise together with the plans for handling them.

For project risk management, I would like to share a mantra distilled from experience: " unified goals, specific responsibilities, and full commitments ". That is: does everyone clearly understand the goals? Does each task have a person in charge? For each task, has the person in charge committed to a completion time?

3. Is there a way out if something goes wrong?

Aim for the best and plan for the worst. While preparing thoroughly, also prepare plans for unexpected events, such as a rollback plan.

3. Pitfalls encountered during the refactoring

1. While the ordering process is being refactored, what should we do if a business iteration also touches the ordering process?

This is a critical decision that requires weighing various factors and trade-offs. Essentially, it is a matter of priorities. With the system going down frequently at the time, the refactoring was clearly more important.

Therefore, we set a policy: during the refactoring, all business requirements related to the order process were suspended so that the refactoring work came first.

Although this policy affected the needs of some business parties, it ensured the stability and reliability of the system and laid the foundation for later business development.

Being "slow" at this moment was for being "fast" in the future.

2. How should historical problems exposed during testing be handled?

For historical issues, our strategy was: a general on the march does not chase rabbits. Apart from fatal problems, the remaining problems were generally not handled during the refactoring.

On the one hand, time and people are limited, and if every historical issue were dealt with, the refactoring could not be delivered on schedule. On the other hand, the historical problems already existed in the online system and had not caused fatal consequences, so there was no need to rush to deal with them.

Therefore, during the refactoring it is necessary to focus on the refactoring itself and on the key issues, so that the work proceeds smoothly.

3. How do we switch smoothly to the new system when going live?

There are a few key points to note:

  • Go-live rehearsal : Before the official launch, run several simulation drills covering every step, including environment preparation, data migration, and system configuration, to make sure the go-live process is smooth and repeatable.

  • Switch design : After going live, move traffic to the new system through a refactoring switch, and be ready to switch back to the old system at any time.

  • Rollback design : Before going live, prepare a rollback plan to deal with possible online problems. Monitor immediately after going live, and once a problem is found, execute the rollback plan immediately to keep the system stable.

  • Monitoring and logging : Before going live, have monitoring and logging in place so that online problems can be discovered and resolved in time.

  • Reserve redundant resources : Before going live, reserve some redundant resources, such as servers, bandwidth, and database connections, so that the system can still operate normally under high concurrency and abnormal conditions after launch.
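
The switch and rollback points above can be sketched as follows. This is a minimal illustration in Python, not the team's actual implementation: the class `RefactoringSwitch`, the function `place_order`, and the percentage-based gradual rollout are assumptions added for the example. The article itself only says that a switch moves traffic to the new system and can send it back to the old one.

```python
import zlib
from typing import Callable


class RefactoringSwitch:
    """Routes a percentage of users to the new system; 0 means everyone is on the old one."""

    def __init__(self, new_system_percent: int = 0):
        self.new_system_percent = new_system_percent

    def use_new_system(self, user_id: str) -> bool:
        # Stable hash, so the same user always lands in the same bucket (0..99).
        bucket = zlib.crc32(user_id.encode("utf-8")) % 100
        return bucket < self.new_system_percent


def place_order(user_id: str, switch: RefactoringSwitch,
                old_handler: Callable[[str], str],
                new_handler: Callable[[str], str]) -> str:
    """Entry point that hides from callers which system handles the request."""
    if switch.use_new_system(user_id):
        return new_handler(user_id)
    return old_handler(user_id)
```

In this sketch, rolling back is a single configuration change: setting `new_system_percent` to 0 sends every request to the old system again without a redeploy. In practice the percentage would live in a configuration center rather than in memory, and data written by the new system while the switch was on would need its own rollback plan.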

4. Summary of Refactoring Experience

Team dimension

We came to understand deeply the importance of three elements: "choosing the right people, doing the right things, and doing things right".

1. Choose the right people, this is the premise

Winning a battle depends not only on the ability of the team members but, more importantly, on the cooperation and tacit understanding among them. Therefore, choose people who are capable, cooperative, and share a common goal.

For example, when we selected the core members of the architecture department, we recruited people with excellent skills, rich experience, a strong sense of responsibility, and a willingness to work hard.

2. Do the right things, this is the direction

Even the right people will not achieve the results you want if they are working on the wrong things. If you work hard in the wrong direction, you only get further and further from the goal.

For example, when we chose the breakthrough for the refactoring, we chose neither the whole system nor a minor business scenario, but the "core main link".

3. Do things right, this is the method and the result

If the working methods are wrong, the goals will still not be achieved, even with the right people doing the right things.

For example, when we determined the scope of the refactoring, we adopted the strategy of "running in small steps" instead of trying to finish everything in one go.

As another example, when we dealt with historical issues, we adopted the strategy of "a general on the march does not chase rabbits" instead of trying to grab everything at once.

In essence, these three elements are all about making choices: choosing whom, choosing what, and choosing which method . All three are indispensable, and it is difficult to accomplish anything if any one is missing.

Therefore, when making a choice, we need to pay special attention: choosing means not trying to have everything, but focusing .

If you choose wrong, your efforts will be in vain.

Technical dimension

System refactoring is, in essence, taking the system from chaos to order through a series of refactoring methods.

Chaos means high complexity

High complexity shows up as a system that is difficult to understand, difficult to maintain, and prone to failure. As the business iterates, complexity increases. For software, complexity has two sides: business complexity and technical complexity. When the two are mixed together, both the complexity and the difficulty of development increase further.

So where is the difficulty in architecture design? It lies in controlling complexity.

Complexity is hard to eliminate, but it can be controlled, and controlled complexity is nothing to fear. This is why the refactoring scope and boundaries need to be confirmed in advance.

How to control complexity effectively?

Here are a few design principles:

- Isolation design: clear structure. Based on the principle of separation of concerns, business logic and technical implementation are isolated to ensure that the entire system has a clear structure.

- Divide and Conquer: Control the scale. Breaking down a huge system into multiple small components allows for more efficient resolution of fine-grained problems in the problem space.

- Abstract design: responsive to change. The best way to deal with changes is to limit the changes within a controllable range. We can isolate changes of different dimensions and granularities through abstract methods such as design patterns. When changes occur, only minimal adaptations are required.
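
As one illustration of the abstract design principle, the Python sketch below keeps order logic dependent on an abstraction rather than on any concrete promotion, so that adding a new kind of promotion does not touch the order code. The names `DiscountRule`, `CouponDiscount`, and `order_total` are hypothetical and are not taken from the platform's code.

```python
from typing import List, Protocol


class DiscountRule(Protocol):
    """The stable abstraction; promotions change, this interface does not."""

    def apply(self, amount_cents: int) -> int: ...


class NoDiscount:
    def apply(self, amount_cents: int) -> int:
        return amount_cents


class CouponDiscount:
    def __init__(self, off_cents: int):
        self.off_cents = off_cents

    def apply(self, amount_cents: int) -> int:
        # Never let a coupon push the total below zero.
        return max(0, amount_cents - self.off_cents)


def order_total(item_prices_cents: List[int], rule: DiscountRule) -> int:
    """Order logic depends only on the abstraction, not on a concrete promotion."""
    return rule.apply(sum(item_prices_cents))
```

Here the change (a new promotion type) is confined to one new class that implements `apply`, while `order_total` stays untouched, which is the "minimal adaptation" the principle aims for.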

In the process of software development, it is necessary to pay attention to good design and implementation in order to avoid unnecessary technical debt, and to refactor in time to ensure the healthy iteration of the system.

Finally

Refactoring has never been just a technical activity.

The system reconstruction from 0 to 1 is not only an upgrade of the technical architecture, but also a self-remodeling of the team organization.

Documentation

Technology breaks through, performance soars tenfold: the secret of the reconstruction of the billion-level e-commerce platform (the article is first published on WeChat public account: Weige Technology Team)



I look forward to learning and growing together with everyone, and to encouraging one another. O(∩_∩)O Thank you!

If you have suggestions, or topics you would like to learn about, feel free to discuss them with me.

Questions are welcome: you can add my personal QQ, 469580884,

or join my group, 751925591, to discuss together.

Less empty talk, more doing.

Talk is cheap. Show me the code.


Origin blog.csdn.net/hemin1003/article/details/129882084