Practical notes: how big tech companies design their internal stress testing schemes

01 Why Stress Test

1. What is a stress test?

A stress test continuously applies load to the system under test to measure how it performs under pressure.

2. What is the purpose of stress testing?

Measure the system's performance limits, so that a reasonable committed capacity or capacity alert can be given;

Find the system's performance bottlenecks and optimize them;

Test the stability of the system under high load;

Verify the system's rate limiting and degradation plans under overload.

3. What problems arise without stress testing?

Online capacity is estimated inaccurately, so the service goes down when traffic grows;

No stress test is run before an upgrade, so performance regresses and availability drops after the upgrade;

No accurate committed capacity can be given, so the cluster either runs at a low water level (utilization) and wastes resources, or runs at a high water level and suffers stability problems.

02 Stress Test Scheme Design

Stress testing can be broadly divided into module-level stress testing and link-level stress testing. Their main features and differences are as follows:

1. Module-level stress testing

Application scenario: compare performance before and after a change to check for regressions; locate the performance bottlenecks of the module itself.

Environment requirements: the environment does not have to match the online environment exactly. It only has to be the same for the two test runs before and after the change.

Industry practice: maintain a fixed offline environment and run stress tests periodically as a routine.

2. Link-level stress testing

Application scenario: evaluate the capacity of the entire link; evaluate the overall availability of the system.

Environment requirements: the environment should match the online environment as closely as possible. Only then can the stress test data serve as a reference.

Industry practice: use the online environment, with a different approach for each isolation method:

  • No traffic isolation: stress test traffic and business traffic coexist. Because nothing is isolated, the test can only run during off-peak periods.

  • Logical isolation: traffic scheduling or routing sends stress test traffic to a dedicated stress test environment. Stress test traffic and business traffic run in the same data center but never hit the same business instance.

  • Physical isolation: relying on geo-distributed active-active deployment, business traffic is switched away from one data center, leaving an empty data center for stress testing.

The first approach is closest to the real online environment but carries some safety risk. The latter two are much safer, but they do not exercise the entire online architecture, so the results are distorted to some degree.

3. How to ensure the safety of online stress testing?

  • Traffic isolation: isolate traffic as described above. Traffic isolation alone is not enough, though. Even with physical isolation the test still modifies online data, so data isolation is also required.

  • Traffic marking: when stress test traffic passes through the middleware, tag it with a stress test marker. For example, HTTP traffic can carry a special header.

  • Data isolation: the business cluster isolates data based on the traffic marker. For example, logs generated by stress test traffic are written to a separate path (some systems analyze and aggregate logs). For storage and caching, data generated by stress test traffic goes to shadow tables, while normal traffic accesses the normal tables.

  • Message shielding: if the message queue cannot recognize stress test messages, online messages pile up and online traffic is affected, so stress test messages need to be shielded.

  • Mocking: mock third-party services that do not support stress testing.
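
The marking, data isolation, and message shielding steps above can be sketched roughly as follows. This is only an illustration, not any particular company's implementation: the header name `X-Stress-Test`, the `shadow_` table prefix, the log paths, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical names: the article only says "a special header" and "shadow table".
STRESS_HEADER = "X-Stress-Test"
SHADOW_PREFIX = "shadow_"


@dataclass(frozen=True)
class RequestContext:
    """Per-request flag, derived once at the entry point and passed downstream."""
    is_stress_test: bool


def context_from_headers(headers: Mapping[str, str]) -> RequestContext:
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return RequestContext(is_stress_test=normalized.get(STRESS_HEADER.lower()) == "1")


def table_for(ctx: RequestContext, table: str) -> str:
    """Stress test traffic uses the shadow table, normal traffic the normal one."""
    return f"{SHADOW_PREFIX}{table}" if ctx.is_stress_test else table


def log_path_for(ctx: RequestContext) -> str:
    """Keep stress test logs away from the path that log analysis jobs read."""
    return "/var/log/app/stress/app.log" if ctx.is_stress_test else "/var/log/app/app.log"


def should_deliver(ctx: RequestContext, consumer_supports_stress: bool) -> bool:
    """Message shielding: drop stress test messages for consumers that cannot handle them."""
    return consumer_supports_stress or not ctx.is_stress_test
```

The point of deriving the flag once is that every later decision (table, log path, message delivery) depends on the same marker, so a request can never be half isolated.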

03 Stress Test Model

Which scenarios should the stress test cover? How are the requests and data constructed? How are business traffic patterns simulated? These three questions correspond to the business model, data model, and traffic model of the stress test model.

1. Business model

What business scenarios do stress tests need to cover?

It is necessary to sort out the core business scenarios, which must include core interfaces and high-traffic interfaces. High-traffic interfaces may be interfaces that are not exposed to users and are frequently used internally.

How to simulate business scenarios?

The relationships between interfaces need to be clarified. Simple query interfaces have no dependencies on one another, so only the traffic ratio matters. Complex business scenarios require reconstructing the business processing flow and working out how the interfaces are chained together. Scenario recording and replay can help sort this out.

2. Data model

Transformation based on online data

For requests, record online traffic directly, mark each request as a stress test request, and apply an offset to key IDs. For base data, copy the online storage data into a separate stress test table.
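
A rough sketch of the request transformation, assuming recorded requests are plain dictionaries and assuming the key ID handling means shifting IDs by a fixed offset so they cannot collide with real ones. The header name, the `user_id` field, and the offset value are all hypothetical.

```python
import copy

STRESS_HEADER = "X-Stress-Test"   # hypothetical marker header
ID_OFFSET = 10_000_000_000        # hypothetical; must be larger than any real ID


def to_stress_request(recorded: dict, id_fields=("user_id",)) -> dict:
    """Return a marked copy of a recorded request with its key IDs shifted."""
    request = copy.deepcopy(recorded)  # never mutate the recorded sample
    request.setdefault("headers", {})[STRESS_HEADER] = "1"
    body = request.setdefault("body", {})
    for field in id_fields:
        if field in body:
            body[field] = body[field] + ID_OFFSET
    return request
```

For example, a recorded request with `user_id` 42 becomes a stress test request with `user_id` 10000000042, while the recorded sample itself stays unchanged and can be reused.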

Model-based construction

Analyze online logs and requests to identify the data characteristics and request characteristics that affect performance, then construct data based on those characteristics. The base data needs to be created through the real business applications.

Transformation based on online data: pros and cons

The approach is simple and data construction is fast, but the data already in the system cannot cover new scenarios and the model is hard to adjust, so it suits online stress testing of old services.

Model-based construction: pros and cons

It does not depend heavily on online data, new scenarios can be constructed by hand, and maintenance cost is low: only interface changes need to be followed, not changes to the online storage tables. The model can be adjusted flexibly, but the approach is more complex and data construction is slow. It applies widely: online or offline, new services or old.

A special case of the stress test model: record traffic and replay it as is

Features: no need to simulate business scenarios or construct data; only services and interfaces that already have online traffic can be recorded; replay is possible only in the online environment and only for read-only interfaces. It therefore applies only to stress testing the read interfaces of old services.

Record traffic during off-peak, normal, and peak periods so that no traffic pattern is missed.

3. Traffic model: Simulate business traffic patterns

  • The service already has online traffic

    Observe the online traffic patterns.

    Most open source monitoring tools have a granularity of 5 s or coarser. Ideally the granularity reaches the millisecond level, which can be achieved by analyzing logs.

  • The service has no online traffic yet

    Analyze user behavior or caller behavior.

    Common business traffic patterns fall into two types: continuously increasing, and pulse (for example, grabbing red envelopes).
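
The two traffic patterns can be expressed as per-second target QPS schedules that a load generator then follows. The sketch below is illustrative only; the function names and the linear ramp shape are assumptions, not something the article prescribes.

```python
from typing import List


def ramp_profile(start_qps: float, peak_qps: float, duration_s: int) -> List[float]:
    """Continuously increasing load: a linear ramp from start_qps to peak_qps."""
    if duration_s < 2:
        return [peak_qps] * duration_s
    step = (peak_qps - start_qps) / (duration_s - 1)
    return [start_qps + step * t for t in range(duration_s)]


def pulse_profile(
    base_qps: float,
    peak_qps: float,
    duration_s: int,
    pulse_start_s: int,
    pulse_len_s: int,
) -> List[float]:
    """Pulse load: a flat baseline with a sudden burst, as in a red envelope grab."""
    pulse_end_s = pulse_start_s + pulse_len_s
    return [
        peak_qps if pulse_start_s <= t < pulse_end_s else base_qps
        for t in range(duration_s)
    ]
```

A real curve taken from logs would replace these synthetic shapes when the service already has online traffic.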

4. Traffic forecast

The traffic pattern simulates the shape of the online traffic curve. In addition, we need to forecast the traffic volume to work out how much load the stress test should apply.

Taking Double Eleven as an example, we can divide the interfaces into three categories:

  • Background interfaces

    Traffic does not change with the promotion. In the stress test these interfaces only serve as background traffic, and the recent peak value can be used.

  • Interfaces of general concern

    Traffic varies with the promotion and is estimated with a generic model.

  • Critical (key-guarantee) interfaces

    For example, the transaction interface. Use the peak value from past promotions.
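
One possible way to turn the three categories into target numbers is sketched below. The article does not describe the generic model, so this example assumes the simplest one: the recent peak multiplied by an expected growth factor. The category names, field names, and that model are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    BACKGROUND = "background"
    GENERAL = "general"
    CRITICAL = "critical"


@dataclass
class InterfaceStats:
    name: str
    category: Category
    recent_peak_qps: float
    promo_peak_qps: float = 0.0   # peak seen in past promotions, if any
    growth_factor: float = 1.0    # assumed multiplier for the coming event


def target_qps(stats: InterfaceStats) -> float:
    """Pick the stress test target for one interface based on its category."""
    if stats.category is Category.BACKGROUND:
        return stats.recent_peak_qps
    if stats.category is Category.GENERAL:
        # Assumed generic model: scale the recent peak by the expected growth.
        return stats.recent_peak_qps * stats.growth_factor
    return stats.promo_peak_qps
```

Summing `target_qps` over all interfaces gives a first estimate of the total load the test should generate.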

04 Analysis of Stress Test Results

1. Observation indicators

System indicators

  • QPS/TPS: the maximum TPS must be stable. If it jitters, the system already has a problem.

  • Response time: the total time from when the client sends the request to when it receives the response.

  • Error rate: judged against the SLA.

Resource indicators

  • CPU utilization: generally below 80%; an average below 60% is safer.

  • Memory usage: below 80% is safer; otherwise the process may fall into a GC death spiral.

  • Disk throughput / network throughput

Feature indicators, determined by the specific business

  • Connection pool usage

  • Message queue backlog

  • PPS (packets per second)

2. Simulation fidelity analysis: are the stress test results meaningful?

Compare how similarly the service behaves in the stress test scenario and in the real online scenario at the same water level (load level). Indicators that can be used for the comparison:

  • Traffic: traffic ratio, interface coverage

  • Link coverage

  • Machine resources: CPU utilization, memory utilization

  • Availability metrics: latency, error rate

  • Business indicators

Assemble these indicators into a vector and compare it with the vector of online indicators. The smaller the difference between the two, the higher the simulation fidelity.
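
The vector comparison can be sketched as below. The article does not say which distance measure is used; this example assumes the mean relative difference per indicator, mapped to a score between 0 and 1, and assumes all indicators are non-negative. The indicator names in the usage note are illustrative.

```python
def fidelity_score(stress: dict, online: dict) -> float:
    """Return 1.0 for identical indicator vectors, lower as they diverge."""
    keys = sorted(stress.keys() & online.keys())
    if not keys:
        raise ValueError("no shared indicators to compare")
    diffs = []
    for key in keys:
        s, o = stress[key], online[key]
        scale = max(abs(s), abs(o))
        # A relative difference keeps indicators with different units comparable.
        diffs.append(0.0 if scale == 0 else abs(s - o) / scale)
    return 1.0 - sum(diffs) / len(diffs)
```

For instance, calling `fidelity_score({"cpu": 0.45, "p99_ms": 100.0}, {"cpu": 0.5, "p99_ms": 100.0})` yields a score close to 1, which suggests the stress test run resembles the online behavior at that water level.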

05 Trends in Stress Testing

Existing pain points:

  • Tests must be watched and monitored the whole time, with on-call staff standing by

  • Safety is lacking

  • The schemes are complex and expensive

Future trends:

  • Intelligent

  • Unattended

Origin blog.csdn.net/weixin_50829653/article/details/130410949