Stress testing: How to design a full-link stress testing platform?

Background

When traffic to your system surges, as it does on "Double Eleven", you may be at a loss when performance problems appear. To prepare, you need to know which components or services will become the bottleneck of the overall system when traffic grows several times over. This is when you need a full-link stress test.

In my past Internet projects (e-commerce), I have seen traffic multiply many times, in some cases ten-fold. I have run many full-link stress tests and stepped into a few pitfalls along the way. This article therefore focuses on how to design a full-link stress testing platform. I hope it will be useful to you.

Content

First, what exactly is a stress test? How to do a full-link stress test?

What is a stress test

You have probably heard the term stress test many times in industry talks, and you may have run stress tests during project development, so stress testing is nothing new to you.

However, think back to how you ran that stress test. Did you, like many engineers, first build a test environment with the same functionality as the production environment and import or generate a batch of test data, then start multiple threads on another server to call the interface under test concurrently? (The interface parameters are usually identical; for example, to stress test the interface that fetches product information, you use the same product ID throughout the test.) Finally, after working out the QPS from the access logs or from the monitoring system of the test environment, did you simply call it done?

Doing a stress test this way is actually incorrect. The main mistakes are as follows:

  • First, a stress test should use online data and the online environment wherever possible, because you cannot be sure whether the differences between the test environment you built and the production environment will affect the results.
  • Second, a stress test should replay online traffic rather than simulated requests. You can copy online traffic into the stress test environment with a traffic-copy tool. The access pattern of simulated traffic differs greatly from that of online traffic, and this has a large impact on the results. For example, online requests for product information fetch many different products; some hit the cache and some do not. If the test uses a single product ID, only the first request misses the cache; the data is then backfilled from the database into the cache, and every subsequent request hits it. Results from such a test have no reference value.
  • Third, do not generate all traffic from a single server. That server easily hits its own performance bottleneck, so the test cannot reach the intended QPS, which distorts the results. Moreover, to simulate user requests as realistically as possible, the traffic-generating machines should sit close to users, for example on CDN nodes. If that is not possible, they can be placed in a different data center from the service, so that the results are as realistic as possible.
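To make the cache point concrete, here is a minimal simulation sketch. It assumes an unbounded cache that starts empty, and it uses uniformly random product IDs as a rough stand-in for online traffic; the function name and the numbers are illustrative only and do not come from a real system.

```python
import random


def cache_hit_ratio(product_ids):
    """Replay product lookups against an empty, unbounded cache and
    return the fraction of requests served from the cache."""
    cache = set()
    hits = 0
    for pid in product_ids:
        if pid in cache:
            hits += 1
        else:
            cache.add(pid)  # miss: load from the database, then backfill the cache
    return hits / len(product_ids)


# Synthetic stress test: the same product ID on every request.
same_id = [42] * 10_000

# Rough stand-in for online traffic: 10,000 requests spread over 50,000 products.
rng = random.Random(0)
mixed_ids = [rng.randrange(50_000) for _ in range(10_000)]

print(f"single ID: {cache_hit_ratio(same_id):.4f}")
print(f"mixed IDs: {cache_hit_ratio(mixed_ids):.4f}")
```

With a single ID, every request after the first is a cache hit, so the database is barely exercised. With mixed IDs, most requests miss the cache, which is much closer to the load the database sees online.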

Many engineers make these mistakes because they do not fully understand what stress testing is. They assume that calling a service interface concurrently from multiple threads counts as stress testing that interface.

So what exactly is a stress test?

Stress testing refers to testing under high concurrency and large traffic conditions. Testers can find out the hidden performance risks in the system by observing the performance of the system under peak load.

Stress testing is a common way to discover problems in a system and an important means of ensuring availability and stability. A stress test cannot cover just one core module. The access layer, all back-end services, databases, caches, message queues, middleware, and dependent third-party services, together with their resources, must all be in scope. Once user traffic increases, every component on this link is hit by large and unpredictable traffic, so each one needs stress testing to expose possible performance bottlenecks. Stress testing the entire call link in this way is called "full-link stress testing".

In Internet projects, features iterate quickly and systems keep growing more complex, so newly added features and code can easily become new performance bottlenecks. A single machine that handled 1,000 requests per second in a stress test half a year ago may handle only 800 today. Stress testing should therefore be run periodically as a routine way of ensuring system stability.

However, a full-link stress test usually requires cooperation among several teams, such as DBAs, operations, the owners of dependent services, and the middleware architecture team, so the manpower and coordination costs are relatively high. In addition, without a good monitoring mechanism, the test itself can harm the online system. To solve these problems, we need to build an automated full-link stress testing platform that reduces both cost and risk.

How to build a full-link stress testing platform

There are two key points in building a full-link stress testing platform:

The first is traffic isolation. Because the stress test runs in the production environment, stress test traffic must be distinguishable from real traffic so that it can be handled separately.

The second is risk control, that is, keeping the impact of the stress test on real users to a minimum. In general, therefore, a full-link stress testing platform needs the following modules:

  • Traffic construction and generation module
  • Stress test data isolation module
  • System health check and stress test traffic intervention module

The overall architecture of the stress testing platform can look like this:

To help you understand how each module is implemented, and to make it easier to design a full-link stress testing platform that suits your own business, I will describe each module in more detail. Let's first look at how stress test traffic is generated.

Generating stress test data

Generally, the ingress traffic of a system consists of HTTP requests from clients. We can therefore copy this ingress traffic during peak periods, clean it (for example, by filtering out invalid requests), and store it in a NoSQL store such as HBase or MongoDB, or in a cloud storage service. We call this store the traffic data factory.

When we want to run a stress test, we fetch data from this factory, split it into several parts, and send the parts to multiple stress test nodes. A few points deserve special attention:

  • First, there are several ways to copy traffic.
  1. The simplest way is to copy the access log of the load balancer directly; the data is written to the traffic data factory as text. However, to launch a stress test from data captured this way you need to write a script that parses the access log, which raises the cost of the test, so this approach is not recommended.
  2. Another way is to use an open source tool. Here I recommend GoReplay (or tcpcopy), a lightweight traffic-copy tool that can capture the traffic on a given port of a machine, record it to a file, and send it to the traffic data factory. During the stress test, the same tool can replay the traffic at an accelerated rate, which lets you stress test the production environment.
  • Second, as mentioned above, the nodes that deliver stress test traffic should be close to users, or at least not in the same data center as the service nodes, so that the stress test data is as realistic as possible.
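Splitting the recorded data among several stress test nodes can be sketched as follows. This is only an illustration that assumes the requests fit in memory and that a simple round-robin split is acceptable; `split_across_nodes` is a hypothetical helper, not part of GoReplay or tcpcopy.

```python
def split_across_nodes(requests, node_count):
    """Distribute recorded requests round-robin over node_count shards."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    shards = [[] for _ in range(node_count)]
    for index, request in enumerate(requests):
        shards[index % node_count].append(request)
    return shards


# Seven recorded requests delivered by three stress test nodes.
recorded = [f"GET /product/{i}" for i in range(7)]
for node, shard in enumerate(split_across_nodes(recorded, 3)):
    print(node, shard)
```

Round-robin keeps each shard a representative sample of the recorded traffic, so every node replays roughly the same mix of requests.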

In addition, we need to color the stress test traffic, that is, mark it as stress test traffic. In a real project, I add a tag to the HTTP request header, for example "is stress test". After the traffic is copied, this tag is added to the requests in batches, and the requests are then written to the traffic data factory.
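A minimal sketch of traffic coloring might look like this. The header name `X-Stress-Test`, its value `"1"`, and the dictionary layout of a recorded request are all assumptions made for illustration; use whatever tag and request format your own platform defines.

```python
STRESS_HEADER = "X-Stress-Test"  # hypothetical header name, chosen for this sketch


def color_request(request):
    """Return a copy of a recorded request with the stress test tag added."""
    headers = dict(request.get("headers", {}))
    headers[STRESS_HEADER] = "1"
    return {**request, "headers": headers}


def is_stress_request(headers):
    """Check for the tag; HTTP header names are compared case-insensitively."""
    return any(
        name.lower() == STRESS_HEADER.lower() and value == "1"
        for name, value in headers.items()
    )


recorded = {"method": "GET", "path": "/product/1", "headers": {"Host": "shop.example"}}
colored = color_request(recorded)
print(colored["headers"])
print(is_stress_request(colored["headers"]))
```

Returning a copy instead of modifying the recorded request in place keeps the original capture reusable for later test runs.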

How to isolate data

While copying stress test traffic, we also need to modify the system so that stress test traffic is separated from real traffic and the test affects the online system as little as possible. In general, this involves two things.

On the one hand, for requests that read data (generally called downstream traffic), we mock, or handle specially, the services and components that cannot be stress tested. Two examples follow.

In business development, we generally record user behavior for each request. For example, when a user requests a product page, we record one more view of that product. This behavior data is written to a separate big data log and then passed to the data analysis department, which turns it into business reports that product managers and executives use for analysis and decision-making.

A stress test inevitably inflates this behavior data. If product pages are normally viewed 100 million times a day and the stress test raises that to one billion, the business reports, and the product decisions based on them, are distorted. We therefore handle behavior generated by stress test traffic specially and do not record it in the big data log.
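As a rough sketch of this special handling, the behavior logger can simply skip tagged requests. The function name, the event layout, and the in-memory list standing in for the big data log are assumptions made for illustration.

```python
def record_page_view(product_id, is_stress_test, behavior_log):
    """Append a browsing event unless the request carries the stress test tag.
    Returns True when the event was recorded."""
    if is_stress_test:
        return False  # stress test traffic must not reach the business reports
    behavior_log.append({"event": "view", "product_id": product_id})
    return True


behavior_log = []
record_page_view(1001, is_stress_test=False, behavior_log=behavior_log)
record_page_view(1001, is_stress_test=True, behavior_log=behavior_log)
print(len(behavior_log))
```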

As another example, our system relies on recommendation services to suggest products that users may be interested in. These services have one notable property: a product that has already been displayed is not recommended again. If stress test traffic passes through the recommendation services, it requests a large number of products, online users no longer see those products, and the quality of recommendations suffers.

We therefore need to mock these recommendation services, so that requests without the stress test tag go to the real recommendation services and requests with the tag go to the mock services. One point to note when building mock services: it is best to deploy them in the same data center as the real services, so that the real deployment structure is simulated as closely as possible and the stress test results are more realistic.
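The routing between the real and the mock recommendation service can be sketched as below. Both backend functions are placeholders invented for this example; in a real system they would be remote calls to services deployed in the same data center.

```python
def real_recommendations(user_id):
    """Placeholder for the call to the real recommendation service."""
    return [f"real-item-for-{user_id}"]


def mock_recommendations(user_id):
    """Mock that returns fixed data and never marks products as displayed."""
    return ["mock-item-1", "mock-item-2"]


def recommend(user_id, is_stress_test):
    """Send tagged requests to the mock and all other requests to the real service."""
    backend = mock_recommendations if is_stress_test else real_recommendations
    return backend(user_id)


print(recommend("u1", is_stress_test=False))
print(recommend("u1", is_stress_test=True))
```

Keeping the decision in one routing function means the business code that consumes recommendations does not need to know whether a request is part of a stress test.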

On the other hand, for requests that write data (generally called upstream traffic), we write the data generated by stress test traffic to a shadow library, a store that is kept completely separate from the online data.

Different storage types call for different ways of building the shadow library.

  1. If the data is stored in MySQL, we can create tables with the same structure as the online ones in the same MySQL instance but in a different schema, and import the online data into them.
  2. If the data is stored in Redis, we add a common prefix to the keys written by stress test traffic and store them in the same Redis instance.
  3. Some data is stored in Elasticsearch. We can put this data in a separate index.

By handling downstream traffic specially and writing upstream traffic to a shadow library, we isolate the stress test traffic.
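The three rules above all come down to deriving a shadow name from the online name. Here is a minimal sketch, assuming a `shadow_` prefix as the naming convention; the prefix and the example names are hypothetical, and the sketch only computes names without touching any database.

```python
SHADOW_PREFIX = "shadow_"  # hypothetical naming convention for this sketch


def shadow_name(name, is_stress_test):
    """Map an online schema, key, or index name to its shadow counterpart
    for stress test traffic; real traffic keeps the online name."""
    return SHADOW_PREFIX + name if is_stress_test else name


# The same rule applied to each storage type.
print(shadow_name("shop", True))            # MySQL schema in the same instance
print(shadow_name("product:1001", True))    # Redis key in the same instance
print(shadow_name("products", True))        # separate Elasticsearch index
print(shadow_name("products", False))       # real traffic is left unchanged
```

Applying this mapping inside the data access layer, based on the stress test tag carried by the request, keeps the isolation logic out of business code.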

How to implement stress testing

After copying the online traffic and modifying the online system, we can run the stress test. Before starting, we usually set a target, for example an overall system QPS of 200,000.

During the test, however, we do not raise the request rate to 200,000 per second in one go. Instead, we increase traffic gradually in fixed steps (for example, 10,000 QPS per step). After each increase, we let the system run stably for a while and observe its performance. If a dependent service or component becomes a bottleneck, we first reduce the stress test traffic, for example back to the QPS of the previous step, to keep the service stable. We then expand the capacity of that service or component and resume increasing the traffic.
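The step-wise ramp-up with rollback can be sketched as follows. The `is_healthy` callback is an assumption that stands in for "run at this level for a while and observe the system"; in practice it would query your monitoring system.

```python
def ramp_up(target_qps, step_qps, is_healthy):
    """Raise the load step by step. On an unhealthy step, return to the last
    healthy QPS and stop so that capacity can be expanded.
    Returns the QPS levels that were applied, in order."""
    if step_qps <= 0:
        raise ValueError("step_qps must be positive")
    applied = []
    current = 0
    while current < target_qps:
        candidate = min(current + step_qps, target_qps)
        applied.append(candidate)
        if not is_healthy(candidate):
            applied.append(current)  # roll back to the last stable level
            break
        current = candidate
    return applied


# Pretend a dependency becomes a bottleneck above 30,000 QPS.
levels = ramp_up(50_000, 10_000, is_healthy=lambda qps: qps <= 30_000)
print(levels)
```

In this run the load climbs to 40,000 QPS, the health check fails, and the load is rolled back to 30,000 QPS. After expanding capacity, the ramp-up can be started again from the last stable level.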

To reduce the manpower needed during a stress test, we can develop a traffic monitoring component with preset performance thresholds. For example, the threshold for container CPU usage can be set to 60% to 70%, the upper limit for the average response time to 1 second, the proportion of slow requests to 1%, and so on.

When system performance reaches one of these thresholds, the traffic monitoring component detects it promptly, tells the component that delivers stress test traffic to reduce the load, and sends an alert to the development and operations engineers. They can then quickly track down the bottleneck and resume the stress test after fixing the problem or expanding capacity.

The industry has explored full-link stress testing platforms extensively. Major companies such as Alibaba, JD.com, Meituan, and Weibo all have platforms tailored to their own businesses. In my view, these platforms are fundamentally alike: they all go through traffic copying, traffic coloring and isolation, load generation, monitoring, and circuit breaking, which matches the core ideas in this article. So when you consider building your own full-link stress testing platform, you can follow this proven approach.
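A minimal sketch of the threshold check is shown below. The threshold values are the examples given above (using 70% as the CPU limit); the metric names and the rule of stepping the load down by one step are assumptions made for illustration.

```python
THRESHOLDS = {
    "cpu_usage": 0.70,            # upper end of the 60% to 70% range
    "avg_response_seconds": 1.0,  # upper limit of the average response time
    "slow_request_ratio": 0.01,   # proportion of slow requests
}


def breached(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that have reached their thresholds.
    Metrics missing from the input are treated as 0.0."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) >= limit
    )


def next_qps(current_qps, step_qps, metrics):
    """Step the load down when any threshold is reached, otherwise keep it."""
    if breached(metrics):
        return max(current_qps - step_qps, 0)
    return current_qps


metrics = {"cpu_usage": 0.75, "avg_response_seconds": 0.4, "slow_request_ratio": 0.002}
print(breached(metrics))
print(next_qps(40_000, 10_000, metrics))
```

A real monitoring component would also send the alert to the development and operations engineers; that part is omitted here.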

Summary

I have walked you through the common misconceptions about stress testing and the process of building an automated full-link stress testing platform. The key points are:

  1. Stress testing is an important means of uncovering hidden performance risks, so it should use the production environment and production data as far as possible;
  2. Stress test traffic must be tagged so that stress test data can be isolated from real data by mocking third-party dependencies and using shadow libraries;
  3. During a stress test, system performance indicators should be monitored with real-time alerts, and resources or services that show bottlenecks should be expanded promptly to avoid affecting the production environment.

A full-link stress testing platform gives us three benefits:

  • First, it helps us discover possible performance bottlenecks so that we can prepare contingency plans in advance;
  • Second, it supports capacity assessment with real data;
  • Finally, it lets us rehearse contingency plans during the stress test. Because stress tests are generally scheduled for low-traffic periods, we can degrade some services to verify that the plans work while keeping the impact on online users to a minimum.

As your system's traffic grows rapidly, you should consider building such a full-link stress testing platform in good time to keep your system stable.

Origin blog.csdn.net/hemin1003/article/details/115208773