Performance testing in practice: the theory of full-link stress testing explained

Foreword
If you had to name the hottest terms in R&D right now, full-link stress testing would certainly be on the list. Several recent conferences have featured many talks on the topic, and a friend of mine was asked in an interview what full-link stress testing is and how to carry it out effectively. Today we will talk about full-link stress testing. This article does not cover a specific technology stack; it focuses on the theory behind putting full-link testing into practice.

In fact, full-link stress testing places high technical demands on the whole company. A company without sufficient technical experience should not attempt it rashly, because if it is done badly it can bring down the production environment. In short, do not rush to follow the trend.

01 Why is full-link stress testing required?
Let me first explain why full-link visibility is needed. As the business grows, the technical architecture evolves from the original monolith to today's microservices, and the growing number of applications makes it harder and harder for developers to locate problems. With a monolith, checking one application's log was enough to get a general idea of where the problem was. Under a microservice architecture, how do you find the faulty application in such a long call chain from nothing but the error returned to the front end? And if you cannot identify the application, which error log do you check?

Maybe you know the business well and can roughly guess where the problem is, but a guess is still uncertain. In this scenario we need a service governance platform that displays the full-link call relationships of the business and lets us query, by some ID, how a request flowed through the system. Such a platform must provide at least service registration and discovery, observability of service status, and traffic management. The mainstream service governance frameworks today are Spring Cloud, Dubbo, and service mesh. With service governance in place we can observe how a request flows between applications; combined with a unified logging platform, we can quickly identify which microservice has a problem and investigate it specifically. This is full-link tracing, and it is also the first foundation of full-link stress testing.
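To make the idea of querying a request by ID concrete, here is a minimal Python sketch. It assumes a hypothetical header name `X-Trace-Id`, three made-up services, and an in-memory list standing in for the unified logging platform; real governance frameworks do this propagation for you.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for illustration
LOG = []                     # stands in for a unified logging platform


def log(service, headers, message):
    # Every log entry carries the trace ID of the request that produced it.
    LOG.append({"trace_id": headers[TRACE_HEADER], "service": service, "message": message})


def gateway(request_headers):
    headers = dict(request_headers)
    # Assign a trace ID at the entry point unless the caller already sent one.
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    log("gateway", headers, "received request")
    return order_service(headers)


def order_service(headers):
    log("order", headers, "creating order")
    return payment_service(headers)  # the same headers are passed downstream


def payment_service(headers):
    log("payment", headers, "payment failed")
    return headers[TRACE_HEADER]


def trace(trace_id):
    # Reconstruct the path of a single request across all services.
    return [(e["service"], e["message"]) for e in LOG if e["trace_id"] == trace_id]
```

Calling `trace(gateway({}))` returns the entries for that one request in call order, which is the kind of lookup the tracing platform provides.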

Having explained why full-link visibility is needed, let's look at how the requirements for performance testing change with the architecture. Put simply, they can be divided into four stages:

The full-link stress testing we usually talk about belongs to the fourth stage. When a business reaches this stage, it faces the following hard problems:

The performance of each individual service is basically guaranteed, but it is unclear which link in such a long chain will cause problems;
The traffic of different service modules is not the same, so guaranteeing resource allocation for the core link becomes important, yet this cannot be simulated effectively in the test environment;
Finding the performance weak points of the cluster, and preventing a configuration or performance problem in one service from causing a cluster-wide performance avalanche, becomes the top priority.

Based on these considerations, we introduce the concept of full-link stress testing.

02 What problems does full-link stress testing solve?
Once full-link stress testing is introduced, it helps us solve the following problems:

Guaranteeing system stability during major events: with a full-link stress testing platform we can effectively safeguard system stability during the company's major events, because the tests run against the production configuration and realistically simulate user behavior. Once the problems found in the full-link stress test are fixed, we can in principle be confident about system stability during the event.

Accurate capacity assessment: with online full-link stress testing and monitoring, we can see clearly how much traffic each business carries at peak, and make targeted capacity assessments that improve the utilization of system resources.

End-to-end inspection of the full link, so that faults are discovered and located as early as possible: full-link stress testing lets us run complete end-to-end inspections, find performance bottlenecks in the business cluster, and locate and fix problems promptly, leaving no blind spots.

Establishing a company-wide performance operations system and turning one-off, campaign-style performance optimization into routine, self-driven optimization: once the full-link stress testing system is in place, it can serve as a regular method in daily testing, making performance testing a normal part of everyday work.
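As a rough illustration of the capacity assessment point above, the Python sketch below turns a measured peak into an instance count. The formula, the 70% target utilization, and the sample numbers are all assumptions for illustration, not figures from any real system.

```python
import math


def instances_needed(peak_tps, tps_per_instance, target_utilization=0.7):
    """Estimate how many instances a service needs at peak.

    peak_tps:            peak traffic observed for this service in the stress test
    tps_per_instance:    throughput one instance sustained before degrading
    target_utilization:  assumed headroom factor (0.7 = run instances at 70%)
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_tps = tps_per_instance * target_utilization
    return math.ceil(peak_tps / usable_tps)


# Hypothetical numbers: 12000 TPS at peak, 800 TPS per instance, 70% target.
print(instances_needed(12000, 800, 0.7))  # 22
```

With these made-up inputs each instance is budgeted 560 TPS, so 12000 / 560 rounds up to 22 instances.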

03 Which business scenarios are suitable?
You may have noticed that the companies that have implemented full-link stress testing are mostly e-commerce companies with high-volume transactions and high-concurrency scenarios. Because building a full-link platform is costly, we need to think about which scenarios justify introducing it. The main ones are:

Payment and transaction scenarios with heavy concurrency, including all kinds of big promotions. At present, full-link stress testing is mostly implemented by leading companies of this kind, such as Taobao, Youzan, Didi, and Meituan.
Requirements are iterated normally and pass testing, yet various system failures appear after release. In this case full-link stress testing can be introduced where appropriate. The usual cause is a large gap in hardware resource configuration between the online and offline environments, so resource usage under load cannot be evaluated correctly offline.

04 Basic technical components
Since full-link stress testing has so many advantages, should we promote it vigorously everywhere? This is exactly why many interviewers like to ask about it. No technology is a silver bullet. As mentioned at the beginning, full-link stress testing places high demands on the whole company's technology and needs the cooperation of all R&D staff to be implemented effectively; otherwise it is a castle in the air. A team implementing full-link stress testing needs to consider at least the following issues:

① How to get support from the business department?

A full-link stress testing platform is not just a matter for the testing department, nor is it just a testing platform. It touches essentially all of the company's core businesses (if it does not, there is no need to build it), so it requires the business departments to cooperate on technical changes. When KPIs are already tight, how do you persuade a business department to make those changes with you? From their point of view the work contributes nothing to their own department's KPIs, and if the changes go wrong they will hurt the business, so the risk is considerable.

② How to do a good job of data isolation?

Stress testing in the production environment must not affect real users' data, so data isolation is required. The business systems must be able to tell real traffic from stress-test traffic. Two approaches are common in the industry, traffic marking and shadow databases, and both require changes to business code.

③ How to distribute traffic?

For full-link stress testing, load generation cannot simply copy the approach of a single-interface performance test, where you write your own script to generate load. A method with stronger concurrency and better controllability is needed. The mainstream approach in the industry is to build a load generator on the Netty framework and send traffic through NIO. The traffic itself usually comes from recording real online requests and then cleaning the data, which has to be implemented by modifying the middleware.
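The following Python sketch shows the general shape of a non-blocking load generator that replays recorded requests with a concurrency cap. It uses Python's `asyncio` in place of Netty/NIO purely for illustration, and `fake_send` is a made-up stand-in that sleeps instead of doing network I/O.

```python
import asyncio
import time


async def fake_send(request):
    """Hypothetical stand-in for a non-blocking HTTP call."""
    await asyncio.sleep(0.01)
    return 200


async def replay(requests, concurrency, send=fake_send):
    # The semaphore caps how many requests are in flight at once,
    # which is what makes the generated load controllable.
    semaphore = asyncio.Semaphore(concurrency)
    results = []

    async def one(request):
        async with semaphore:
            started = time.perf_counter()
            status = await send(request)
            results.append((status, time.perf_counter() - started))

    await asyncio.gather(*(one(r) for r in requests))
    return results


def run_replay(requests, concurrency):
    return asyncio.run(replay(requests, concurrency))
```

For example, `run_replay(recorded_requests, 10)` replays the whole list with at most 10 requests in flight and returns one `(status, latency_seconds)` pair per request.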

④ Can the Mock service keep up?

During a full-link stress test you will inevitably hit third-party services (SMS, payment, third-party interfaces, and so on). How do you intercept these calls effectively and return correct data? In addition, the Mock service must not become the performance bottleneck of the stress test, so the performance requirements on the Mock service itself are high.
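One simple way to picture the interception is a thin gateway in front of the third-party client that switches to a mock for stress traffic. The class names and the response shape in this Python sketch are invented for illustration.

```python
class RealSmsClient:
    def send(self, phone, text):
        # Placeholder: the real client would call the SMS provider here.
        raise RuntimeError("would call the real SMS provider")


class MockSmsClient:
    """Cheap in-process mock, so the mock itself is unlikely to be the bottleneck."""

    def __init__(self):
        self.sent = 0

    def send(self, phone, text):
        self.sent += 1
        return {"status": "ok", "mock": True}


class SmsGateway:
    def __init__(self, real, mock):
        self.real = real
        self.mock = mock

    def send(self, phone, text, is_stress):
        # Stress traffic never reaches the third party.
        client = self.mock if is_stress else self.real
        return client.send(phone, text)
```

The `is_stress` flag here would come from whatever traffic marker the system already uses to tell stress traffic from real traffic.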

⑤ Is data monitoring in place?

During full-link stress testing, can an effective and comprehensive monitoring mechanism detect problems as soon as they appear? Is there a layered, tiered monitoring scheme? When TPS stops rising, can you easily narrow down roughly where the problem is? If not, full-link stress testing is meaningless.
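As a small illustration of narrowing down a problem from monitoring data, the Python sketch below computes TPS and 99th-percentile latency per service and picks the slowest one. The sample format `(service, latency_ms)` and the nearest-rank percentile are assumptions made for this example.

```python
from collections import defaultdict


def percentile(values, pct):
    # Nearest-rank percentile using integer arithmetic; pct is an int in 1..100.
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


def summarize(samples, window_seconds):
    """samples: iterable of (service, latency_ms) collected during the window."""
    by_service = defaultdict(list)
    for service, latency_ms in samples:
        by_service[service].append(latency_ms)
    return {
        service: {
            "tps": len(latencies) / window_seconds,
            "p99_ms": percentile(latencies, 99),
        }
        for service, latencies in by_service.items()
    }


def slowest(summary):
    # The service with the worst tail latency is the first place to look.
    return max(summary, key=lambda s: summary[s]["p99_ms"])
```

If every service shows the same TPS but one has a far higher `p99_ms`, that service is a reasonable first suspect when overall throughput stops rising.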

⑥ Is the emergency team in place?

After all, this is a stress test in production. If a service is overwhelmed, is there an adequate contingency plan? If a serious failure occurs (middleware is especially prone to problems such as database downtime, MQ message backlog, or Redis cache breakdown and penetration), can the operations team provide effective support and restore the business quickly?
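Part of a contingency plan can be automated: stop generating load as soon as the error rate crosses a threshold. The Python sketch below is one possible guard; the 5% threshold and the 100-request window are arbitrary assumptions, and the class name is made up.

```python
from collections import deque


class StressGuard:
    """Stops the stress test when the recent error rate gets too high."""

    def __init__(self, max_error_rate=0.05, window=100):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.max_error_rate = max_error_rate
        self.stopped = False

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def record(self, ok):
        self.outcomes.append(ok)
        # Only judge once the window is full, to avoid tripping on early noise.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and self.error_rate() > self.max_error_rate:
            self.stopped = True

    def allow_request(self):
        # The load generator checks this before sending each request.
        return not self.stopped
```

Once `stopped` is set it stays set, so restarting the test is a deliberate human decision rather than something the tool does on its own.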

These questions show that implementing full-link testing involves every R&D department. It is not something testers can do unilaterally, and it could even be said to have little direct relation to testers. Before implementing full-link testing, we need to consider whether the team has enough underlying technology to support it.

05 Summary
Full-link stress testing is a practice with high overall technical requirements. The IT organization as a whole has to build up the necessary technical reserves in advance and then work together; it is not a matter for one department or team, and it can only really be implemented with overall coordination and planning. As testers, we need to understand what full-link stress testing does, know roughly how it is done and what technical capabilities it needs, and then, based on the team's actual capabilities, push it forward step by step and selectively rather than blindly going straight for full-link stress testing. It is also an activity that depends on the collective: no matter how skilled you are, you cannot complete this project alone, so it is important to distinguish personal ability from what the company platform provides. For interview questions, we can explain the reasons for and consequences of implementing the full link, and lay out the technology stack and implementation ideas clearly.

Origin blog.csdn.net/2201_76100073/article/details/131246037