Stress test plan design

01 Why do we need stress testing?

1. What is stress testing?

Stress testing applies sustained load to the system under test in order to measure how the system performs under pressure.

2. What is the purpose of stress testing?

Determine the system's performance limits, so that a reasonable commitment value or capacity warning can be given;

Identify system performance bottlenecks and optimize performance;

Test the stability of the system under high load conditions;

Verify the system's rate limiting and degradation plans in case of overload;

3. What problems will occur if no stress testing is performed?

Online capacity is assessed inaccurately, so when traffic increases the service goes down;

No stress test is run before an upgrade, so after the upgrade performance deteriorates and availability decreases;

No accurate commitment value can be given, so either the cluster water level (utilization) is too low and resources are wasted, or the cluster water level is too high and the system becomes unstable.
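As a rough illustration of how a stress test result feeds into the commitment value and the cluster water level, here is a minimal sketch. It assumes the water level is simply peak traffic divided by the measured cluster capacity; the function names and the numbers in the usage note are hypothetical.

```python
import math


def cluster_water_level(peak_qps: float, per_instance_qps: float, instances: int) -> float:
    """Fraction of the measured cluster capacity that is consumed at peak traffic."""
    if per_instance_qps <= 0 or instances <= 0:
        raise ValueError("capacity inputs must be positive")
    return peak_qps / (per_instance_qps * instances)


def instances_needed(peak_qps: float, per_instance_qps: float, target_level: float) -> int:
    """Smallest instance count that keeps the water level at or below target_level."""
    if not 0 < target_level <= 1:
        raise ValueError("target_level must be in (0, 1]")
    return math.ceil(peak_qps / (per_instance_qps * target_level))
```

For example, if a stress test shows one instance sustaining 1000 QPS and the cluster has 10 instances, a peak of 6000 QPS corresponds to a water level of 0.6. Without the measured per-instance figure, neither number can be stated with any confidence.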

02 Stress test plan design

Stress testing can be broadly divided into module-level stress testing and link-level stress testing. Their main features and differences are as follows:

1. Module-level stress testing

Application scenario: Compare the performance before and after the change to see if there is any degradation in performance; locate the performance bottleneck of the module itself.

Environment requirements: The environment does not need to match the online environment exactly. It only needs to be the same for the two stress tests before and after the change.

Industry solution: Maintain a fixed offline environment and run stress tests in it periodically as a routine.
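The before/after comparison can be automated. The following is a minimal sketch, assuming each run produces a list of response times in milliseconds and that the 99th percentile is the metric being compared; the 10% tolerance is an arbitrary illustrative value, not a recommendation from the original text.

```python
from statistics import quantiles


def p99(latencies_ms):
    """99th percentile of a list of latencies (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    return quantiles(latencies_ms, n=100)[98]


def has_regressed(baseline_ms, candidate_ms, tolerance=0.10):
    """True if the candidate run's p99 exceeds the baseline p99 by more than tolerance."""
    return p99(candidate_ms) > p99(baseline_ms) * (1 + tolerance)
```

Because both runs use the same fixed environment, the absolute numbers matter less than the relative change between them.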

2. Link-level stress testing

Application scenario: Evaluate the capacity of the entire link; evaluate the overall availability of the system.

Environment requirements: The environment must match the online environment as closely as possible. Only then can the stress test data be used as a reference.

Industry solution: Use an online environment and use different solutions according to different isolation methods:

No traffic isolation: stress test traffic and business traffic coexist. Because nothing is isolated, stress testing can only be performed during off-peak periods.

Logical isolation: traffic scheduling or offloading routes stress test traffic to a stress testing environment. Stress test traffic and business traffic run in the same data center, but do not hit the same business instances.

Physical isolation: the multi-site active-active architecture is used to shift business traffic out of one data center, leaving an empty data center for stress testing.

The first solution is closest to the real online environment, but it carries some safety risk; the latter two solutions are much safer, but they do not exercise the entire online architecture, so the results are somewhat distorted.

3. How to ensure the safety of online stress testing?

Traffic isolation, as described above. But traffic isolation alone is not enough. Even with physical isolation, stress test traffic still modifies online data, so data isolation is also required.

Traffic tagging. When stress test traffic passes through the middleware, it is tagged as stress test traffic. For example, HTTP traffic can carry a special header.

Data isolation. The business cluster isolates data according to the traffic tag. For example, logs generated by stress test traffic are written to a separate path (some systems run analysis and statistics on the logs); for storage and caching, data generated by stress test traffic is stored in shadow tables, while normal traffic accesses the normal tables.

Message shielding. If the message queue cannot identify stress test messages, online messages will pile up and online traffic will be affected. Therefore, stress test messages need to be shielded.

Mock third-party services that do not support stress testing.
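The tagging and data isolation steps above can be sketched as follows. This is only an illustration: the header name `X-Stress-Test`, the `shadow_` table prefix and the `stress` log subdirectory are all hypothetical conventions, not ones named in the original text.

```python
from pathlib import PurePosixPath

STRESS_HEADER = "x-stress-test"  # hypothetical header name, compared in lower case
SHADOW_PREFIX = "shadow_"        # hypothetical shadow table naming convention


def is_stress_request(headers: dict) -> bool:
    """True if the request carries the stress test tag (header names are case-insensitive)."""
    normalized = {k.lower(): str(v).lower() for k, v in headers.items()}
    return normalized.get(STRESS_HEADER) == "true"


def resolve_table(table: str, headers: dict) -> str:
    """Route tagged traffic to the shadow table and normal traffic to the normal table."""
    return SHADOW_PREFIX + table if is_stress_request(headers) else table


def resolve_log_path(path: str, headers: dict) -> str:
    """Write logs of tagged traffic to a separate directory so log analysis is not polluted."""
    if not is_stress_request(headers):
        return path
    p = PurePosixPath(path)
    return str(p.parent / "stress" / p.name)
```

The important design point is that the tag must be propagated along the whole call chain; if any hop drops it, writes from stress test traffic land in the normal tables.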

03 Stress test model

What scenarios should the stress test cover? How are stress test requests and data constructed? How to simulate business traffic patterns? The above three questions correspond to the business model, data model and traffic model in the stress testing model respectively.

1. Business model

What business scenarios need to be covered by stress testing?

It is necessary to sort out the core business scenarios, which must include core interfaces and high-traffic interfaces. High-traffic interfaces may be interfaces that are not exposed to users and are frequently used internally.

How to simulate business scenarios?

It is necessary to clarify the relationships between interfaces. Simple query interfaces have no dependencies on earlier or later calls, so you only need to pay attention to the traffic ratio; complex business scenarios require reconstructing the business processing flow and clarifying the order in which interfaces are called. This can be worked out through scenario recording and scenario playback.
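For independent query interfaces, reproducing the traffic ratio comes down to weighted random selection. A minimal sketch, assuming the ratio has already been measured; the interface paths and weights are hypothetical.

```python
import random

# Hypothetical traffic ratio between three independent query interfaces.
TRAFFIC_RATIO = {"/search": 0.7, "/detail": 0.2, "/cart": 0.1}


def pick_interfaces(ratio: dict, count: int, seed=None) -> list:
    """Draw `count` interface names so that their frequencies follow `ratio`."""
    rng = random.Random(seed)
    names = list(ratio)
    weights = [ratio[name] for name in names]
    return rng.choices(names, weights=weights, k=count)
```

Scenarios with dependencies between calls cannot be modeled this way; there each virtual user has to replay the recorded call sequence in order.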

2. Data model

Transformation based on online data

For requests, online traffic can be recorded directly, tagged as stress test traffic, and have its key IDs modified; for the underlying data, the online storage data can be copied directly into separate stress testing tables.
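A recorded request might be transformed along these lines. This is a sketch under stated assumptions: requests are represented as plain dictionaries, the tag is a hypothetical `X-Stress-Test` header, and the key ID is a hypothetical `user_id` parameter shifted into a range reserved for stress test data.

```python
import copy

STRESS_HEADER = "X-Stress-Test"  # hypothetical tag header
ID_OFFSET = 9_000_000_000        # hypothetical ID range reserved for stress test data


def to_stress_request(recorded: dict) -> dict:
    """Return a tagged copy of a recorded request with its key ID rewritten."""
    request = copy.deepcopy(recorded)  # never mutate the recorded original
    request.setdefault("headers", {})[STRESS_HEADER] = "true"
    params = request.setdefault("params", {})
    if "user_id" in params:
        params["user_id"] = int(params["user_id"]) + ID_OFFSET
    return request
```

Shifting IDs by a fixed offset keeps the mapping reversible, which helps when tracing a stress test record back to the online record it was copied from.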

Model-based construction

By analyzing online logs and requests, we can sort out the data characteristics and request characteristics that have an impact on performance, and construct data based on these characteristics. The underlying data needs to be constructed through real business applications.

Pros and cons of transformation based on online data

The solution is simple and data construction is fast, but it depends on the system's existing data and so cannot cover new scenarios, and the model is inflexible to adjust. It is suitable for online stress testing of old services.

Pros and cons of model-based construction

It does not rely heavily on online data and can construct new scenarios manually. The maintenance cost is low: you only need to adjust the interfaces, with no need to track changes to the online storage tables, and the model can be adjusted flexibly. However, the solution is complex and data construction is slow. It applies to a wide range of scenarios: online and offline, new services and old services alike.

Special case of stress testing model: traffic recording, playback as it is

Features: No business scenarios need to be simulated and no data needs to be constructed; only services and interfaces that already have online traffic can be recorded; playback is only possible in the online environment, and only for read-only interfaces. It is therefore only suitable for stress testing the read interfaces of old services.

Traffic can be recorded during off-peak, normal and peak periods so that no traffic pattern is missed.

3. Traffic model: simulate business traffic patterns

The service already has online traffic

Observe the online traffic pattern.

Most open-source monitoring tools available online have a granularity of more than 5 seconds. Ideally the granularity reaches the millisecond level, which can be achieved by analyzing logs.

The service has no online traffic yet

Analyze user behavior or caller behavior.

Common business traffic patterns fall into two types: the continuously increasing type and the pulse type (such as grabbing red envelopes).
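The two patterns can be expressed as per-second target QPS profiles that a load generator then follows. A minimal sketch; the function names and parameters are illustrative and not taken from any particular tool.

```python
def incremental_profile(start_qps: int, step_qps: int, steps: int, step_seconds: int) -> list:
    """Staircase ramp: hold each level for step_seconds, then raise it by step_qps."""
    return [start_qps + step_qps * i for i in range(steps) for _ in range(step_seconds)]


def pulse_profile(base_qps: int, peak_qps: int, total_seconds: int,
                  pulse_start: int, pulse_seconds: int) -> list:
    """Flat baseline with one sudden burst, as in a red-envelope grab."""
    return [
        peak_qps if pulse_start <= t < pulse_start + pulse_seconds else base_qps
        for t in range(total_seconds)
    ]
```

For example, `incremental_profile(100, 100, 3, 2)` yields `[100, 100, 200, 200, 300, 300]`, and `pulse_profile(10, 1000, 6, 2, 2)` yields `[10, 10, 1000, 1000, 10, 10]`.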

4. Traffic estimation

The traffic pattern describes the shape of the simulated online traffic curve. In addition, we need to estimate the traffic volume in order to determine the magnitude of the stress test.

Taking the Double Eleven event as an example, we can divide the interfaces into three categories:

Background interface

Traffic does not change with the event. In the stress test it serves only as background traffic, and the recent peak value can be used;

Ordinary interface of interest

Traffic changes with the event and is calculated using a general model;

Critical (key-guarantee) interface

For example, the transaction interface; use the peak value of historical promotions.
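The three categories can be turned into a small estimation helper. The original text does not say what the "general model" is, so the sketch below substitutes the simplest possible assumption: recent peak multiplied by an expected growth factor. The category labels `"background"`, `"ordinary"` and `"critical"` are illustrative names for the first, second and third categories above.

```python
def estimate_target_qps(category: str, recent_peak_qps: float,
                        promo_peak_qps=None, growth_factor: float = 1.0) -> float:
    """Target QPS for one interface, depending on its category."""
    if category == "background":
        # Traffic is unaffected by the event: use the recent peak as is.
        return recent_peak_qps
    if category == "ordinary":
        # Assumed stand-in for the "general model": recent peak times expected growth.
        return recent_peak_qps * growth_factor
    if category == "critical":
        # Use the peak observed during historical promotions.
        if promo_peak_qps is None:
            raise ValueError("critical interfaces need a historical promotion peak")
        return promo_peak_qps
    raise ValueError(f"unknown category: {category}")
```

In practice the growth factor would come from business forecasts for the event, and a safety margin is often added on top of the historical peak.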

04 Analysis of stress test results

1. Observation indicators

System indicators

QPS/TPS: the maximum TPS must be stable. If it jitters, the system already has a problem.

Response time: the total time from the client sending the request to receiving the response.

Error rate: the acceptable threshold is determined by the SLA.

Resource indicators

CPU utilization should generally stay below 80%; an average below 60% is safer.

Memory usage is safer below 80%; otherwise the system may fall into a GC death loop.

Disk throughput/network throughput

Business-specific indicators, determined according to the specific business

Connection pool usage

Message queue accumulation

PPS (packets per second)
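The system and resource thresholds above lend themselves to an automated check after each run. The sketch below uses the 80% and 60% figures from the text; measuring TPS jitter as the coefficient of variation with a 5% limit is an assumption made for illustration, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def tps_jitter(tps_samples) -> float:
    """Coefficient of variation of the TPS samples (0 means perfectly steady)."""
    avg = mean(tps_samples)
    return pstdev(tps_samples) / avg if avg else float("inf")


def check_run(tps_samples, cpu_peak, cpu_avg, mem_peak, error_rate,
              sla_error_rate, max_jitter=0.05) -> list:
    """Return the names of the indicators that violate their threshold."""
    checks = {
        "tps_jitter": tps_jitter(tps_samples) <= max_jitter,  # assumed 5% limit
        "cpu_peak": cpu_peak < 0.80,
        "cpu_avg": cpu_avg < 0.60,
        "mem_peak": mem_peak < 0.80,
        "error_rate": error_rate <= sla_error_rate,
    }
    return [name for name, ok in checks.items() if not ok]
```

An empty list means the run stayed within every threshold; business-specific indicators such as connection pool usage or message queue accumulation would need their own limits.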

2. Fidelity analysis: Are the stress test results valuable?

Compare how similarly the service behaves in the stress test scenario and in the real online scenario at the same water level (load level). Indicators that can be used for this analysis:

  • Traffic, traffic proportion, interface coverage

  • Link coverage

  • Machine resources, CPU utilization, memory utilization

  • Availability metrics, latency, error rate

  • Business metrics

Assemble these indicators into a vector and compare it with the vector of online indicators. The smaller the difference, the higher the fidelity of the stress test.
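One simple way to turn this comparison into a number is the mean relative difference between the two vectors. The original text does not prescribe a distance measure, so the metric and the function name below are assumptions chosen for illustration.

```python
def fidelity(stress: dict, online: dict) -> float:
    """Score in [0, 1] for non-negative indicators; 1.0 means the vectors are identical.

    Uses one minus the mean relative difference over the indicators
    present in both vectors.
    """
    keys = sorted(stress.keys() & online.keys())
    if not keys:
        raise ValueError("no shared indicators to compare")
    diffs = []
    for key in keys:
        scale = max(abs(stress[key]), abs(online[key]))
        # Two zero values are treated as identical to avoid dividing by zero.
        diffs.append(0.0 if scale == 0 else abs(stress[key] - online[key]) / scale)
    return 1.0 - sum(diffs) / len(diffs)
```

Using relative differences keeps indicators with large absolute values, such as QPS, from drowning out ratios such as error rate. Weighting the indicators by importance would be a natural refinement.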

05 Development Trend of Stress Testing

Existing pain points:

  • Requires constant observation and monitoring, with on-call staff on standby

  • Insufficient security

  • The plan is complex and costly

Future trends:

  • Intelligent

  • Unattended


Origin blog.csdn.net/NHB456789/article/details/135084201