Software stress testing and performance testing analysis methodology


  Basics of performance testing

  Common Classifications of Performance Testing

  Performance testing: used to verify whether the system's performance meets design expectations. The load applied is generally modest and does not push the system to its limit; it is a simple verification.

  Load testing: load is applied continuously and increased step by step to find the system's optimal processing capacity and best performance state, i.e. its maximum healthy performance. Load test results are generally somewhat higher than performance test results.

  Stability testing: can be considered a subset of load testing. A steady, even load is applied for a long period, and then all system indicators are checked for abnormalities.

  Stress testing: the most common type, and usually what we mean by "stress test". It is used to determine the maximum capacity the system can withstand. The load is generally pushed to the highest point the system can bear, from which a peak-capacity conclusion is drawn.

  Stress test types and ways of applying load

  There are generally two types of stress test: single-service stress testing and full-link (end-to-end) stress testing.

  There are two common ways of applying load:

  Concurrency mode (simulates concurrent users; the user's perspective)

  Concurrency refers to the number of concurrent users. From a business perspective, the expected number of simultaneously online users is simulated. Deriving throughput from this requires a conversion, but in some scenarios this mode matches the business expectation more closely.

  RPS mode (simulates throughput; the server's perspective)

  RPS (Requests Per Second) is the number of requests sent per second, so RPS mode is "throughput mode". By setting the number of requests issued per second, the system's throughput capacity is measured directly from the server's perspective, eliminating the tedious conversion from concurrency to RPS and getting there in one step.


  Concurrency mode and RPS mode are neither better nor worse than each other; each has its applicable scenarios.
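  To make the two modes concrete, here is a minimal Python sketch (not a real load tool; `send_request` is a hypothetical stand-in for an HTTP call). Concurrency mode runs N virtual users in a closed loop, while RPS mode fires requests at a fixed rate regardless of how fast responses come back.

```python
import threading
import time

def send_request():
    """Hypothetical stand-in for one HTTP request; replace with a real client call."""
    time.sleep(0.01)  # simulate ~10 ms of server-side processing

def concurrency_mode(num_users, duration_s):
    """Closed loop: each virtual user sends its next request as soon as
    the previous one returns, so pressure tracks the number of users."""
    completed = [0]
    lock = threading.Lock()
    deadline = time.monotonic() + duration_s

    def user():
        while time.monotonic() < deadline:
            send_request()
            with lock:
                completed[0] += 1

    workers = [threading.Thread(target=user) for _ in range(num_users)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return completed[0]

def rps_mode(rps, duration_s):
    """Open loop: issue one request every 1/rps seconds,
    independent of how quickly the server responds."""
    workers = []
    interval = 1.0 / rps
    next_at = time.monotonic()
    deadline = next_at + duration_s
    while time.monotonic() < deadline:
        w = threading.Thread(target=send_request)
        w.start()
        workers.append(w)
        next_at += interval
        time.sleep(max(0.0, next_at - time.monotonic()))
    for w in workers:
        w.join()
    return len(workers)
```

  Note the difference in what is controlled: in concurrency mode the achieved QPS depends on response time, while in RPS mode the request rate is fixed and any shortfall shows up as queueing on the server.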

  Common stress testing tools

  Commonly used stress testing tools include:

  · wrk: https://github.com/wg/wrk

  · ab: https://httpd.apache.org/docs/2.4/programs/ab.html

  · webbench

  Performance indicators

  Common performance indicators

  Business indicators: concurrency, throughput, response time

  Concurrency: the number of requests the system is processing at the same time. For Internet systems, it generally means the number of users accessing the system simultaneously.

  Throughput (its maximum value is the peak QPS): the number of requests the system processes per unit time, reflecting its processing capacity. It is generally measured with indicators such as TPS and QPS, and can be broken down into average, peak, and minimum throughput.

  Response time: the processing time of one transaction, usually the interval from when a request is sent until the response data is received after the server finishes processing. Common measures are the average response time, P95, and P99.

  Response time and throughput must reach a balance point. As throughput increases, response time first stays roughly flat, then begins to rise rapidly, after which throughput becomes hard to increase further. Since we have response time requirements, we cannot pursue throughput alone; we must find the maximum throughput achievable within an acceptable response time.

  Response time is only meaningful on the basis of the success rate: the response time of failed requests is invalid. The required success rate is generally 100%.

  The relationship between them is:

  QPS (TPS) = concurrency / average response time  (the theoretical throughput)

  Concurrency = QPS × average response time
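  A quick sanity check of these formulas, as a minimal Python sketch with hypothetical numbers:

```python
def qps(concurrency, avg_rt_s):
    # QPS (TPS) = concurrency / average response time
    return concurrency / avg_rt_s

def concurrency(qps_value, avg_rt_s):
    # Concurrency = QPS x average response time
    return qps_value * avg_rt_s

# e.g. 100 concurrent users with a 50 ms average response time
print(qps(100, 0.05))           # 2000.0 requests/second
print(concurrency(2000, 0.05))  # 100.0 concurrent users
```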

  System resources: CPU idle rate, memory usage, network I/O, disk reads and writes, number of file handles, etc.

  Performance counters are indicator data about server or operating-system performance, including system load (System Load), object and thread counts, memory usage, CPU usage, disk and network I/O, and other indicators. They are important parameters for system monitoring, reflecting key aspects of system load and processing capability, and are usually strongly correlated with performance. When these indicators run high, the corresponding resource is becoming a bottleneck, which usually signals a potential performance problem.

  Prefer percentiles over averages

  Relying on the average value is unreliable; the correct statistical approach is a percentile distribution. Best practice is to use Top Percentile (TP) indicators: TP50 means that 50% of requests complete within a certain value, and TP90 means that 90% of requests complete within a certain time.
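  To sketch why percentiles beat averages, the snippet below uses hypothetical latencies with one slow outlier and a simple nearest-rank percentile: the average is pulled up by the outlier, TP50 shows the typical request, and TP99 isolates the tail.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# hypothetical response times in milliseconds, with one 250 ms outlier
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 14, 13]

average = sum(latencies_ms) / len(latencies_ms)
print(average)                       # 37.0 -- pulled up by the outlier
print(percentile(latencies_ms, 50))  # 13  (TP50: the typical request)
print(percentile(latencies_ms, 99))  # 250 (TP99: the tail users feel)
```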

  Metrics to observe during a stress test

  Regardless of the type of stress test, the metrics observed generally need to include:

  · Success rate, failure rate

  · System resources (CPU, memory, bandwidth, I/O)

  · Response time: average, P95, and P99. Pay attention to P95 and P99, not just the average; P99 is a better indicator of the latency real online users experience

  · Throughput (QPS/TPS)

  A basic stress test report tabulates, for each load level: concurrency, QPS/TPS, average/P95/P99 response time, success rate, and resource usage (CPU, memory, bandwidth).

  Generate rigorous stress test reports

  When analyzing system performance problems, we need to find the key points, which requires the stress test report to be genuinely useful, rigorous, and clear. We must locate the bottleneck step by step, understand why it is reached, and then work out how to optimize it. This demands a rigorous stress test report. Some lessons learned:

  During a stress test it is necessary to find the performance inflection point. If the system hits its bottleneck as soon as load is applied, back the load off a little until the optimal inflection point is found. System performance follows a parabola: continuing to add load after the performance peak causes performance to degrade. The most important goal of the stress test is therefore to find the best performance inflection point. Load should be increased gradually throughout the run and continued past the peak; if adding more load makes performance fall rather than rise, the inflection point has been reached.
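  The stepped search for the inflection point can be sketched as follows; the measurements are hypothetical, and in practice each (concurrency, QPS) pair comes from one stress test run.

```python
def find_inflection(results):
    """results: (concurrency, qps) pairs measured at increasing load steps.
    Returns the last step before QPS first drops, i.e. the performance peak."""
    best = results[0]
    for step in results[1:]:
        if step[1] < best[1]:
            return best  # QPS fell after this point: inflection found
        best = step
    return best  # QPS never dropped: keep adding load

# hypothetical step-load measurements
steps = [(10, 900), (20, 1700), (40, 2900), (80, 3100), (160, 2600)]
print(find_inflection(steps))  # (80, 3100): the optimal inflection point
```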

  How to analyze the performance bottleneck and find the reason why the QPS cannot be improved?

  QPS does not rise forever; after a certain point it flattens or even declines, producing a performance inflection point. At that moment, start analyzing the causes.

  The concrete method is to capture profiles (CPU, block, I/O, memory) first while the system has not yet reached its limit, then just as it reaches the limit, and finally after it has exceeded the limit, and then analyze which system resources, or which external interfaces, are causing the performance problem in each situation.

  If a particular component or external service is the bottleneck, analyze further: is the component being used incorrectly? Is the connection count being mishandled? Finding the guilty component does not end the investigation; a further, deeper examination is needed.

  Know the capacity and inflection points that a single machine and the cluster can each carry

  What is the maximum QPS of a single machine?

  What is the QPS after horizontal scaling? Does it grow linearly? (It certainly will not: beyond a certain scale, related resources become bottlenecks. The key is to find those bottleneck points.)

  How to analyze system resources, taking CPU as an example

  Look at the CPU first. If the CPU is not saturated, the problem is not the CPU, so you need not focus on it; move on to other resources such as I/O, swap, memory, and the network card.

  If there are multiple CPU cores, observe the usage of each core rather than the overall CPU usage.

  If the CPU is saturated, capture a CPU profile and observe which calls consume the most time.
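  In Python, for example, a CPU profile can be captured with the standard-library cProfile module; `hot_path` here is a hypothetical stand-in for the request-handling code under load.

```python
import cProfile
import io
import pstats

def hot_path():
    # hypothetical stand-in for the request-handling code under load
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    hot_path()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)  # the most time-consuming calls appear at the top
```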

  Do a good job of capacity estimation

  Before the system goes online, we must be able to make an estimate, then validate it by stress testing, and understand every detail: resources, dependencies, deployment, data-center distribution, downgrade strategy, disaster recovery plan, and backup plan.

  Capacity estimation is a must before a large-scale system goes online: only with a reasonable capacity estimate can the system be designed according to its expected load. Capacity planning aims to carry the most traffic with the fewest machines. Once the plan is settled, performance stress testing is used to verify that it meets expectations. Only with a reasonable capacity plan and evaluation do we know how much load to apply when stress testing the system before launch; capacity estimation is not a guess pulled out of thin air. Capacity evaluation needs to consider the following points:

  1. Obtain business indicators and evaluate the total visits.

  Ask product and operations colleagues for indicators such as UV and PV.

  2. Evaluate the average visit QPS.

  There are 86,400 seconds in a day, but requests are generally assumed to occur during the daytime, i.e. about 40,000 seconds.

  Average QPS = total daily volume / total time, counting one day as 40,000 seconds.

  3. Assess peak QPS.

  When planning system capacity, consider not only the average QPS but also the peak QPS.

  Refer to the business traffic curve.

  Generally, the peak QPS is 3-4 times the average QPS.
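  Steps 2 and 3 amount to simple arithmetic; the numbers below are hypothetical.

```python
daily_requests = 80_000_000  # hypothetical: 80 million requests per day
effective_seconds = 40_000   # assume traffic lands in ~40k daytime seconds

avg_qps = daily_requests / effective_seconds
peak_qps = avg_qps * 4       # rule of thumb: peak is 3-4x the average

print(avg_qps)   # 2000.0
print(peak_qps)  # 8000.0
```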

  4. Assess the relevant indicators of each module and subsystem under the entire business system.

  5. Evaluate the single-machine limit QPS, and from that evaluate how many machines are needed.

  Perform stress testing and data analysis.

  6. Leave appropriate redundancy. On top of the stress test results, keep some headroom after launch, to avoid having to scale out in a hurry because real online pressure is too high.
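  Steps 5 and 6 combine into a machine-count estimate; all the numbers below are hypothetical.

```python
import math

peak_qps = 8000            # hypothetical peak QPS from capacity estimation
single_machine_qps = 1200  # hypothetical single-machine limit from stress tests
redundancy = 1.5           # 50% headroom against unexpected spikes

machines_needed = math.ceil(peak_qps * redundancy / single_machine_qps)
print(machines_needed)  # 10
```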

  Do a good job of analysis and summary

  Do a good job of analyzing and summarizing, such as:

  After this system goes online, will it really hold up? Besides stress test data, you need your own estimates: which aspects of your system may become bottlenecks and cause problems after launch? There must be sufficient preparation and an overall evaluation before the system goes live.

  What if the system cannot cope after launch? Is there a rate-limiting plan? Is there a downgrade plan?

  What is the system's situation with 100,000 users? And with 10 million users: does resource demand grow linearly, and what must be considered?


  Stress testing methods for some specific cases

  Test data preparation

  High-quality test data should truly reflect users' real usage scenarios. We generally choose real online data as the source and, after sampling, filtering, and desensitization, use it as the performance test data. But before testing with real data, first simulate test data offline; at least verify the system's basic performance requirements before running performance tests with real data.

  Stress testing method for the storage layer (database and cache)

  For stateless services it is easy to raise concurrency: capacity can be expanded without much thought. But a stateful storage system cannot scale its maximum concurrency indefinitely, so we must understand how much our data storage layer can withstand. Stress testing such a storage cluster generally proceeds as follows:

  · First stress test a single machine

  · Then analyze the overall capacity of the cluster. Note that the capacity a cluster can carry is not the sum of its single machines; as a rough rule, each machine added to the cluster can be evaluated with an 80% diminishing factor.

  · Finally, note that the cluster's overall capacity should be configured reasonably for the actual situation: more machines is not always better. Press until a value that meets expectations is reached.
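  One way to read the "80% diminishing" rule of thumb is sketched below. This interpretation is an assumption (each added machine contributes 80% of the previous machine's increment), and the numbers are hypothetical.

```python
def cluster_capacity(single_qps, machines, factor=0.8):
    """Assumed model: each additional machine contributes `factor` times
    the capacity increment of the machine added before it."""
    total = 0.0
    increment = float(single_qps)
    for _ in range(machines):
        total += increment
        increment *= factor
    return total

print(cluster_capacity(1000, 1))  # 1000.0
print(cluster_capacity(1000, 3))  # 2440.0 = 1000 + 800 + 640, not 3000
```

  Whichever model is used, the measured single-machine limit times the machine count is only an upper bound that the real cluster will not reach; the actual cluster capacity must still be verified by stress testing.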


Origin blog.csdn.net/m0_37449634/article/details/131530626