Yunji's road to full-link stress testing

 

To be honest, our road to full-link stress testing was a hard one. We started out agonizing over which stress testing tool to use, then moved on to stress testing frameworks, single interfaces, and a dedicated stress testing environment, and finally worked our way step by step to the online environment. Feeling our way forward and borrowing experience and solutions from peers at other companies, it took us more than a year to find, in the dark, a road to full-link stress testing that suits us "chicken farmers". We grew and matured gradually through trial and error.

 

1. What exactly is a full-link stress test?

When traffic is low, developers and testers only need to run functional tests offline; as long as the features work, that is enough. But as the user base and traffic grow, we gradually realize that regular functional testing alone is far from sufficient. Once traffic picks up, we have to pay attention to system performance. After all, no one wants their system to be mercilessly knocked over by traffic, so we need to know where we stand (know the system's capacity and water level, so that capacity planning has something to go on). At this stage, most companies choose to stress test their frameworks, middleware, and storage layer offline to establish their throughput. However, there is still a big gap between such offline results and what happens online. After all, most companies' stress testing environments are not a 1:1 replica of production (I have never seen a company that rich), so results from a stress testing environment are for reference only and cannot guide decisions about the online environment. The only way is to run stress tests directly in the online environment.

 

Stress testing directly in the online environment is easy to say, but great risks hide behind it. Most of the time our system is serving real users, and especially during peak traffic, stress test traffic must never cause failures that stop users from placing orders, and it must certainly never pollute online data. Imagine that 10 extra iPhone X orders suddenly appear in user A's order history: do you ship them or not? Or a user's balance suddenly drops, which is absolutely unacceptable to users (unless it goes up, in which case they will quietly pocket the windfall). The online environment also runs various scheduled tasks; if stress test data gets counted as revenue, the head of the operations department will probably invite the R&D folks in for "tea".

 

Difficult as it is, this is the only way to measure the system's real capacity water level, to do well-founded capacity planning before a big promotion, and to set reasonable rate-limiting thresholds. The system therefore needs to distinguish accurately between real user traffic and stress test traffic, and then route the stress test traffic into an isolated environment where its data is persisted. How to distinguish stress test traffic is covered later.

 

So what exactly is a full-link stress test? Everyone is familiar with stress testing a single interface. Suppose that stress testing interface A of a website on its own yields a QPS of 100,000 (10W). If interface B is stress tested at the same time, the QPS of interface A will no longer be 100,000. This is because no interface in a system exists in isolation; each one is constrained to some degree by shared resources, and when a shared resource becomes the bottleneck, the whole system is affected. Simply put, a full-link stress test stress tests all the core links of the system at the same time. Only when traffic hits the entire system are its performance bottlenecks exposed and its real capacity water level measured. That is the point of full-link stress testing.

 

2. How does the system distinguish between real user traffic and stress test traffic?

Before talking about how to distinguish stress test traffic, let's address a more sensitive question: do business systems need intrusive changes? The clear answer is yes. Full-link stress testing is rarely carried out in the early days of a business; it usually becomes necessary later on, and the more complex the business has become by then, the harder the retrofit is. That said, the company's infrastructure team should make sure that most of the traffic-distinguishing work is done in the middleware and Base components. Expecting no intrusion at all is almost unrealistic.

 

The key points of implementing full-link stress testing are:

1. Distinguish stress test traffic;

2. Persist stress test data in an isolated environment.

  

In the earliest days, we only ever stress tested read interfaces in the online environment. We did not dare to run combined read/write stress tests from the very beginning, because we simply had no confidence in doing so. That is the honest truth. Once we knew how to distinguish stress test traffic and how to isolate stress test data, we could boldly begin real online full-link stress tests.

 

First, let's talk about how stress test traffic is distinguished:

1. Stress test traffic is uniformly marked on the URL;

2. After the access layer receives a request, a Filter intercepts it, recognizes the stress test marker, and puts it into a ThreadLocal;

3. When the access layer calls a service, the call-chain instrumentation client reads the stress test marker from the ThreadLocal and writes it into the context so that it is passed downstream;

4. When data is persisted, the Base components read the stress test marker from the service context to decide where the data goes.
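
The four steps above can be sketched in plain Java as follows. This is only a minimal illustration, not Titan's or Yunji's actual code: the parameter name `stress_test`, the context key `x-stress-test`, and the classes `StressTestContext` and `StressTagDemo` are all hypothetical, and a simple map stands in for the RPC call context.

```java
import java.util.HashMap;
import java.util.Map;

// Per-thread holder for the stress test flag (step 2 stores it here).
class StressTestContext {
    // Hypothetical names; the article does not say what the real marker is called.
    static final String URL_PARAM = "stress_test";
    static final String CONTEXT_KEY = "x-stress-test";

    private static final ThreadLocal<Boolean> FLAG =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    static void mark(boolean stress) { FLAG.set(stress); }
    static boolean isStressTest() { return FLAG.get(); }
    static void clear() { FLAG.remove(); }
}

class StressTagDemo {
    // Steps 1-2: the check an access-layer filter would run on the request URL.
    static boolean hasStressMarker(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return false;
        }
        for (String pair : url.substring(q + 1).split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2
                    && kv[0].equals(StressTestContext.URL_PARAM)
                    && kv[1].equals("1")) {
                return true;
            }
        }
        return false;
    }

    // Step 3: copy the flag from the ThreadLocal into the outgoing call context.
    static Map<String, String> buildCallContext() {
        Map<String, String> context = new HashMap<>();
        if (StressTestContext.isStressTest()) {
            context.put(StressTestContext.CONTEXT_KEY, "1");
        }
        return context;
    }

    // Step 4: a downstream component reads the flag back from the context.
    static boolean isStressCall(Map<String, String> context) {
        return "1".equals(context.get(StressTestContext.CONTEXT_KEY));
    }

    // Simulates one request passing through the access layer.
    static Map<String, String> handle(String url) {
        StressTestContext.mark(hasStressMarker(url));
        try {
            return buildCallContext();
        } finally {
            // Clear the flag so it cannot leak to the next request on a pooled thread.
            StressTestContext.clear();
        }
    }

    public static void main(String[] args) {
        System.out.println(isStressCall(handle("/order/create?sku=42&stress_test=1")));
        System.out.println(isStressCall(handle("/order/create?sku=42")));
    }
}
```

Clearing the ThreadLocal in a `finally` block matters in practice: access-layer threads are usually pooled, and a stale flag would cause a real user's request to be treated as stress test traffic.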

 

Some business flows need to call third-party interfaces (the most typical being a bank's payment interface). In this case, the business system must check whether the request is marked as stress test traffic and, if so, mock the call directly, as shown in Figure 1:

Figure 1 Stress test traffic identification
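
As a rough sketch of the mocking idea, the service below picks a mock gateway whenever the request is flagged as stress test traffic, so the real third party is never called. The interface `PaymentGateway`, the classes `MockPaymentGateway` and `PaymentService`, and the `MOCK-` result format are all assumptions made for illustration; the flag is passed in as a plain boolean to keep the example self-contained.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical abstraction over a third-party payment call.
interface PaymentGateway {
    String pay(String orderId, long amountCents);
}

// Returns a canned result without touching any external system.
class MockPaymentGateway implements PaymentGateway {
    @Override
    public String pay(String orderId, long amountCents) {
        return "MOCK-" + orderId;
    }
}

class PaymentService {
    private final PaymentGateway realGateway;
    private final PaymentGateway mockGateway = new MockPaymentGateway();

    PaymentService(PaymentGateway realGateway) {
        this.realGateway = realGateway;
    }

    // Stress test traffic goes to the mock; real user traffic goes to the real gateway.
    String pay(boolean stressTraffic, String orderId, long amountCents) {
        PaymentGateway gateway = stressTraffic ? mockGateway : realGateway;
        return gateway.pay(orderId, amountCents);
    }
}

class MockPaymentDemo {
    public static void main(String[] args) {
        AtomicInteger realCalls = new AtomicInteger();
        // A counting stand-in for the real bank interface.
        PaymentService service = new PaymentService((orderId, amountCents) -> {
            realCalls.incrementAndGet();
            return "BANK-" + orderId;
        });

        System.out.println(service.pay(true, "T1001", 9900));
        System.out.println(service.pay(false, "U2002", 9900));
        System.out.println("real gateway calls: " + realCalls.get());
    }
}
```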

 

Once we know how to distinguish real user traffic from stress test traffic, the next question is: since stress test data must not pollute the online environment, where should it land? We approach this from two dimensions (physical isolation and logical isolation coexist). After all, we are not rolling in money, and we cannot build a storage system that is 1:1 with the production environment.

 

For data that truly needs to be persisted (for example, order data), we write the stress test data into a shadow database that is completely isolated from the online environment. This is the safest option, but the cost is relatively high, since it means running separate database instances; for safety reasons it is still well worth it. This is physical isolation. Logical isolation applies to intermediate data, such as stress test data that needs to be written to MQ, Redis, and so on. For example, when writing to Redis, the keys of stress test data are uniformly tagged with the stress test marker; when writing to MQ, we write to different topics (given our business characteristics and the way RocketMQ itself is implemented, we chose to have the NameServer route stress test traffic to a few fixed MQ machines).
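
A minimal sketch of how a Base component might pick the destination for each kind of data is shown below. The naming conventions (`_shadow` suffix, `stress:` key prefix, `STRESS_` topic prefix) and the class `StressIsolationRouter` are assumptions for illustration only; the article does not state the real conventions, and the NameServer-level routing to dedicated MQ machines is not modeled here.

```java
class StressIsolationRouter {
    // Hypothetical naming conventions, chosen for this sketch only.
    static final String SHADOW_SUFFIX = "_shadow";
    static final String REDIS_PREFIX = "stress:";
    static final String TOPIC_PREFIX = "STRESS_";

    // Physical isolation: stress test writes go to a separate shadow database.
    static String databaseFor(String database, boolean stressTraffic) {
        return stressTraffic ? database + SHADOW_SUFFIX : database;
    }

    // Logical isolation: same Redis, but stress test keys carry a marker.
    static String redisKeyFor(String key, boolean stressTraffic) {
        return stressTraffic ? REDIS_PREFIX + key : key;
    }

    // Logical isolation: stress test messages go to a different topic.
    static String topicFor(String topic, boolean stressTraffic) {
        return stressTraffic ? TOPIC_PREFIX + topic : topic;
    }

    public static void main(String[] args) {
        for (boolean stress : new boolean[] {false, true}) {
            System.out.println("stress=" + stress
                    + " db=" + databaseFor("orders", stress)
                    + " key=" + redisKeyFor("cart:1001", stress)
                    + " topic=" + topicFor("ORDER_CREATED", stress));
        }
    }
}
```

Keeping this decision in one shared component, rather than in each business service, is in line with the earlier point that most of the traffic-distinguishing work belongs in the middleware and Base components.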

 

3. How to dispatch stress test traffic and how to construct stress test data

In the beginning, R&D and QA colleagues used a whole assortment of stress testing tools, such as JMeter, ApacheBench (ab), and other conventional testing tools, but such tools cannot instantly generate very large-scale stress test traffic (they are hard to run in a distributed fashion). Later we also considered nGrinder, but it is also hard to use for full-link stress testing, because its Controller is too heavyweight and the number of Agents it can manage is very limited. JD.com has already verified this, so there was no need for us to take the same detour, and the idea of building our own full-link stress testing system was born.

 

The chicken farm's full-link stress testing military exercise system is called Titan, and its overall architecture is as follows:

Figure 2 The overall architecture of the Yunji full-link stress test military exercise system (Titan)

 

This article will not introduce Titan in detail, because we expect to officially open source it around the first half of next year. We can guarantee that the open source branch will stay in sync with Yunji's internal version, with no features cut.

 

Titan handles the dispatch of stress test traffic, so how is the stress test data constructed? In the earliest days, we hand-wrote scripts for the dynamic parameters. As everyone knows, this is very painful, especially for dynamic parameters that can only be used once; when the stress test is large, it becomes both painful and error-prone. Our next step is therefore to build a stress test data factory: the factory constructs the stress test data, Titan then dispatches the traffic, and the whole process runs end to end.
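
To make the "one-time dynamic parameter" problem concrete, here is a small sketch of what a data factory could do: generate request lines whose token is never repeated within a batch. The class `StressDataFactory`, the URL layout, and the parameter names are hypothetical; the article does not describe how the real data factory works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class StressDataFactory {
    private final String batchId;
    private final AtomicLong sequence = new AtomicLong();

    StressDataFactory(String batchId) {
        this.batchId = batchId;
    }

    // A one-time token: unique within the batch, safe to call from several threads.
    String nextToken() {
        return batchId + "-" + sequence.incrementAndGet();
    }

    // Builds one marked request line with a fresh token (hypothetical URL layout).
    String nextOrderRequest(long userId, long skuId) {
        return "/order/create?user=" + userId
                + "&sku=" + skuId
                + "&token=" + nextToken()
                + "&stress_test=1";
    }

    // Pre-generates a batch of requests for the traffic generator to replay.
    List<String> build(int count) {
        List<String> requests = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            requests.add(nextOrderRequest(900000L + i, 1000L + (i % 10)));
        }
        return requests;
    }

    public static void main(String[] args) {
        for (String request : new StressDataFactory("batch20").build(3)) {
            System.out.println(request);
        }
    }
}
```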

 

Writing takes effort. If you found this article helpful, please give it a like, and credit the source when reprinting. Thank you!

 

 

 
