Knowing Things by Learning | Understanding performance test automation for content detection in one article

NetEase Yidun's content detection service is regularly performance tested online. What pain points came up during those tests, and how can automation help address them?

01 Introduction to performance testing

1.1 What is performance testing

Performance testing is the process of applying load to the system under test in a specific way, according to a defined strategy, and collecting performance indicators such as response time and throughput in order to check whether the system can meet user needs once it goes live. From this definition it is easy to see that the core performance indicators are the frequently mentioned QPS (queries per second) and RT (response time).
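
As a rough illustration of these two indicators, the sketch below derives QPS and average RT from a list of per-request latencies collected over a fixed time window. The function name `summarize` and its inputs are hypothetical and chosen only for this example.

```python
def summarize(latencies_ms: list[float], window_s: float) -> tuple[float, float]:
    """Return (QPS, average RT in ms) for requests completed within window_s seconds."""
    if not latencies_ms or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    # Throughput: completed requests per second over the window.
    qps = len(latencies_ms) / window_s
    # Response time: arithmetic mean of the per-request latencies.
    avg_rt_ms = sum(latencies_ms) / len(latencies_ms)
    return qps, avg_rt_ms
```

In practice a percentile such as P99 is often reported alongside the average, since a mean can hide slow outliers.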

1.2 Why performance testing is required

After this brief introduction, some readers will certainly ask why performance testing is needed. Two examples from everyday life illustrate the answer.

(1) 12306: every Spring Festival travel rush is a big test

As the Spring Festival approached, workers were trying to grab train tickets home. On December 23, 2019, a large number of netizens reported that the 12306 website had suddenly gone down: users could not log in or buy tickets, train listings failed to load, and pages froze. The server was suspected to have crashed under excessive traffic.

(2) Weibo: gossip about top stars is hard to swallow

The divorce of a well-known celebrity was announced suddenly and quickly became a trending topic on Weibo. After the announcement, some netizens reported brief outages and network errors in the Weibo client.

02 Pain points of performance testing

2.1 Stress test execution is cumbersome

Executing the stress test takes up the largest share of the whole Yidun performance testing process, and Yidun's online stress tests generally use gradient stress testing. What is a gradient stress test? Simply put, it splits the large stress test target into several smaller targets, starting from a small one and working up step by step to the final one. Suppose the target for the Yidun content detection stress test is a QPS of 200: the test usually starts at 20, then steps up to 40, 60, and so on until the target of 200 is reached.

What are the concrete steps? We first create a stress test task with a target of 20 on the NPT stress testing platform and run it; one round generally lasts 10 minutes. If the target of 20 is not reached, the stress test stops so the performance problem can be located. If it is reached, we create the next task with a target of 40 and continue. The same check repeats until the stress test is terminated or the overall target is met.

Some may ask: isn't this asking for trouble? Why not just run the test at 200 and see whether it holds? For online stress testing, keeping production safe always comes first, so we must be cautious. If a stress test brings down the online service, it seriously affects customer experience and can even cause financial losses.

2.2 Monitoring depends on people

In performance testing, monitoring & analysis is a difficult part. Where does the difficulty lie? It depends on human experience and on familiarity with the "system under test". Only someone familiar with the system knows which monitoring indicators to look at and whether they are normal, and can then reach a conclusion.

At present, colleagues on duty are assigned to watch the monitoring during online stress tests. Monitoring that relies on people can hardly be real-time, because people have limited attention and cannot watch every dashboard. In practice, very few people stare at the monitors from start to finish; most only pay attention when an alarm fires. This carries risk: an alarm may never have been configured, or its configuration may have been modified and not restored in time, so a problem occurs online and no alarm is sent.

During stress testing we often find that QPS cannot be pushed any higher. This usually indicates a performance bottleneck, but finding where it is requires analysis based on monitoring data. In practice we fall back on the same three checks from experience: whether system resources such as CPU, memory, disk, and network card are normal; whether request volume and RT are reasonable; and whether there are abnormal errors such as timeouts.
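
The three checks can be written down as explicit rules. The sketch below is only an illustration: the `Snapshot` fields, the `triage` function, and every threshold are assumptions made for this example, not values taken from Yidun's monitoring.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One hypothetical reading of the indicators named in the text."""
    cpu_pct: float
    mem_pct: float
    disk_pct: float
    nic_pct: float
    qps: float
    target_qps: float
    rt_ms: float
    error_rate: float  # fraction of requests that timed out or failed


def triage(s: Snapshot, resource_limit: float = 80.0,
           rt_limit_ms: float = 500.0, error_limit: float = 0.01) -> list[str]:
    """Return the checks that look abnormal; all thresholds are assumed values."""
    findings: list[str] = []
    # Check 1: system resources (CPU, memory, disk, network card).
    resources = {"cpu": s.cpu_pct, "memory": s.mem_pct,
                 "disk": s.disk_pct, "nic": s.nic_pct}
    findings += [f"resource:{name}" for name, pct in resources.items()
                 if pct > resource_limit]
    # Check 2: request volume and RT.
    if s.qps < s.target_qps:
        findings.append("traffic:qps_below_target")
    if s.rt_ms > rt_limit_ms:
        findings.append("traffic:rt_too_high")
    # Check 3: abnormal errors such as timeouts.
    if s.error_rate > error_limit:
        findings.append("errors:rate_too_high")
    return findings
```

A reading with CPU at 95% and QPS stuck below target would flag both `resource:cpu` and `traffic:qps_below_target`, which points the investigation at CPU first.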

2.3 Stress test data is not isolated

Because stress test traffic and real traffic are not isolated at the data level, full-link stress testing cannot be run online, and some business scenarios cannot be covered. Take Yidun's data storage scenario: if writing data to storage hits a performance problem, messages back up in Kafka. At that point the Kafka topic holds both real traffic and stress test traffic, so even if the stress test is stopped immediately, the backlog still slows consumption of real online data and affects user experience. For this reason, data storage is switched off during regular online stress tests.

2.4 Stress test capital loss

Yidun relies on some external suppliers. In an earlier stress test, the plan was not evaluated thoroughly and the supplier dependency was overlooked, so the online stress test incurred extra costs. Capital loss of this kind from stress testing is unacceptable.

03 Performance test automation practice

3.1 One-click stress test execution

When a stress test task is created, multiple gradient subtasks are created automatically. Taking Yidun content detection as the example again, we create a stress test task with a QPS target of 200. The task is split into 5 gradients with target values, from small to large, of 40, 80, 120, 160, and 200. These 5 gradients correspond to 5 stress test subtasks on the NPT platform, with target values of 40, 80, 120, 160, and 200.
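
The split itself is simple arithmetic. A minimal sketch, assuming evenly spaced gradients; the function name `split_into_gradients` is hypothetical and is not part of the NPT platform:

```python
def split_into_gradients(target_qps: int, steps: int) -> list[int]:
    """Split a QPS target into evenly spaced gradient targets, ending at target_qps."""
    if steps <= 0 or target_qps < steps:
        raise ValueError("steps must be positive and no larger than target_qps")
    # Integer arithmetic keeps the last gradient exactly equal to the target.
    return [target_qps * i // steps for i in range(1, steps + 1)]
```

Calling `split_into_gradients(200, 5)` yields the five targets from the example above, and `split_into_gradients(200, 10)` yields the 20-step ladder used in the manual process.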

After the stress test task is started, the system runs the subtasks with QPS targets of 40, 80, 120, 160, and 200 on the NPT platform in sequence. When a subtask reaches its target, the next subtask is executed automatically. If the target is not reached, the stress test task is terminated automatically.
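
This run-then-decide loop can be sketched as follows. The NPT platform's API is not described here, so `run_subtask` is a hypothetical callable standing in for "create and run one subtask, then return the QPS actually achieved"; the `tolerance` parameter is likewise an assumption about how "reaches the target" might be judged.

```python
from typing import Callable, Iterable


def run_gradient_test(
    targets: Iterable[int],
    run_subtask: Callable[[int], float],
    tolerance: float = 0.95,
) -> tuple[bool, list[tuple[int, float]]]:
    """Run subtasks in order; stop at the first one that misses its target.

    Returns (completed, [(target, achieved_qps), ...]).
    """
    results: list[tuple[int, float]] = []
    for target in targets:
        achieved = run_subtask(target)  # hypothetical stand-in for the platform call
        results.append((target, achieved))
        if achieved < target * tolerance:
            # Target missed: terminate so the bottleneck can be investigated.
            return False, results
    return True, results
```

With a fake system that saturates at 130 QPS, `run_gradient_test([40, 80, 120, 160, 200], lambda t: min(t, 130))` passes the first three gradients and stops at 160, so the 200 gradient is never applied to production.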

3.2 Quantifying monitoring & analysis

For monitoring and analysis, the guiding idea is clear: quantify.

Monitoring comes first. We define in advance which applications need to be monitored and which indicators to monitor for each of them. With those two elements settled, what remains is to fetch the monitoring data through the API provided by Sentinel and write it to the database.
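
A minimal sketch of that collection step is shown below. The application names, the indicator names, and the table layout are all invented for illustration, and `fetch_metric` is a hypothetical callable standing in for the Sentinel API, whose real interface is not shown in this article. SQLite is used only to keep the example self-contained.

```python
import sqlite3
import time
from typing import Callable, Optional

# Hypothetical monitoring plan: application -> indicators to collect.
MONITOR_PLAN = {
    "content-detect-api": ["qps", "rt_ms", "cpu_pct"],
    "content-detect-worker": ["cpu_pct", "error_rate"],
}


def collect_once(
    conn: sqlite3.Connection,
    plan: dict[str, list[str]],
    fetch_metric: Callable[[str, str], float],
    now: Optional[float] = None,
) -> int:
    """Fetch every planned indicator once and store the samples; return the row count."""
    ts = time.time() if now is None else now
    rows = [
        (ts, app, metric, fetch_metric(app, metric))  # stand-in for the monitoring API
        for app, metrics in plan.items()
        for metric in metrics
    ]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monitor_sample "
        "(ts REAL, app TEXT, metric TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO monitor_sample VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Once the samples are in a table, judging whether an indicator is normal becomes a query against stored numbers rather than a person watching a dashboard.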

For the special business scenario of external suppliers, we add supplier request volume to the monitoring dashboard. When a supplier's request volume is abnormal compared with the same period earlier or with the immediately preceding period, the stress test should be stopped promptly.
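
One way to express such a stop rule is a ratio check against both baselines. This is a sketch under stated assumptions: the function name `should_stop`, the two baseline parameters, and the 1.5x threshold are illustrative and are not taken from the platform.

```python
def should_stop(current: float, same_period_baseline: float,
                previous_period: float, max_ratio: float = 1.5) -> bool:
    """Return True if supplier request volume looks abnormal against either baseline."""

    def exceeds(baseline: float) -> bool:
        if baseline <= 0:
            # No traffic expected at all: any request to the supplier is abnormal.
            return current > 0
        return current / baseline > max_ratio

    return exceeds(same_period_baseline) or exceeds(previous_period)
```

Checking both baselines catches a sudden jump as well as a volume that has crept up gradually and only looks abnormal against the same period earlier.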

3.3 Integrating the full-link stress testing component

How do we cover the storage scenario that stress tests previously could not reach? First, the "system under test" is integrated with the full-link stress testing component, which isolates the data of real traffic from that of stress test traffic.

Second, consumption of the shadow queue is controlled by a switch. When Kafka data backs up, the switch is turned off so that only real traffic data is consumed.
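
The routing and the switch can be illustrated without any Kafka client. In the sketch below, the `Message` type, the `shadow_` topic prefix, and both function names are assumptions made for the example; the actual full-link stress testing component may mark and route traffic differently.

```python
from dataclasses import dataclass


@dataclass
class Message:
    payload: str
    is_stress_test: bool = False  # hypothetical flag carried by stress test traffic


def topic_for(base_topic: str, msg: Message, shadow_prefix: str = "shadow_") -> str:
    """Producer side: send stress test messages to a separate shadow topic."""
    return f"{shadow_prefix}{base_topic}" if msg.is_stress_test else base_topic


def topics_to_consume(base_topic: str, shadow_enabled: bool,
                      shadow_prefix: str = "shadow_") -> list[str]:
    """Consumer side: the real topic is always consumed, the shadow topic only if enabled."""
    topics = [base_topic]
    if shadow_enabled:
        topics.append(f"{shadow_prefix}{base_topic}")
    return topics
```

Because stress test messages never enter the real topic, turning the switch off leaves the consumer with real traffic only, and a backlog of stress test data cannot delay it.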

04 Performance test automation platform

To carry the improvements described above, we built our own performance test automation platform.

4.1 Overall Architecture

[Figure: overall architecture of the performance test automation platform]

4.2 Rollout status

A small-traffic trial run of stress testing for the Yidun SaaS service has been completed, and it shows clear advantages over the previous conventional stress tests.

Origin blog.csdn.net/yidunmarket/article/details/129620796