How Much Load Can It Withstand? Building Didi's Full-Link Stress Test Simulation Measurement System

To ensure the stability of production systems before major holidays and events, Didi runs multiple rounds of risk investigation and capacity acceptance through full-link stress testing. We often hear questions such as "How different is your full-link stress test from real production business scenarios?", "If the stress test reaches the target line, can the system really withstand that much traffic?", and "During the stress test, one of my modules seems to be under much more pressure than in production." We lacked a way to see the difference between stress test coverage and real system traffic, and subjective verification is error-prone and unsystematic. We therefore built a stress test simulation measurement system to evaluate, in a principled way, the difference between stress test coverage and the real system.

Starting in 2020, the ride-hailing stress test team began building a stress test simulation measurement system and put it into engineering use. This article systematically describes how Didi built the simulation measurement system for ride-hailing full-link stress testing.

Background

As the production business scenarios of Didi's ride-hailing service grow more complex, and given the tidal traffic pattern of ride-hailing and the traffic surges during holidays, guaranteeing stability is a major challenge. Full-link stress testing is the core method for dealing with the system availability problems that the many uncertainties of today's complex distributed environment can cause. It uses technical means to simulate the real requests and data of a large number of users, tests and verifies the entire business link in various scenarios, continuously uncovers potential stability risks in production systems, helps verify whether the system can withstand the traffic estimated for planned major activities and events, and helps service owners investigate and resolve risks. The following figure shows how full-link stress testing is built for the ride-hailing service:

30a50c8be70530ca4c42d48828bb9b2b.png

For ride-hailing full-link stress testing, the biggest challenge is the many uncertainties involved in simulating real scenarios:

  • Uncertainty in time and space. After a passenger places an order, a driver must accept it within a fixed time, and driver-passenger matching depends on the distance between them;

  • Uncertainty in order distance and driver pick-up time. Pick-up and drop-off by stress test drivers cannot fully simulate real road conditions in production, such as traffic jams. If stress test drivers pick up and drop off too quickly, capacity is oversupplied; if too slowly, capacity is insufficient. Either way, the stress test deviates from the real production scenario;

  • Uncertainty in order completion. Which orders a driver receives depends entirely on the order matching strategy, so the same driver at the same location may receive different orders;

  • Other uncertainties, such as uncertain scenarios, uncertain categories, and preemption of shared modules' resources by traffic from other business lines.

With so many uncertain factors compounding one another, we need to ensure that the full-link stress test simulates production closely enough to achieve the goal of "as much pressure as possible". The simulation measurement system is a key way to measure the credibility of the full-link stress test. It serves two purposes. First, it makes the current state of full-link stress test coverage clearly visible, so that engineers responsible for stability can understand the real stress test coverage of their own systems, services, and links, which improves everyone's trust in the full-link stress test. Second, it exposes the weak points of stress test coverage and guides the engineers who maintain the stress test to make targeted improvements, forming a positive closed loop that safeguards stress test effectiveness.

Didi built the simulation measurement system in four main stages, as shown in the figure below:

03d3f3049e79e294a8396645f8d6dea5.png

  • Measurement requirements define the scope and depth of what the simulation measurement covers; the scope makes the boundaries of the measurement system explicit;

  • Measurement system construction decomposes the measurement requirements and performs systematic calculation over the defined scope, including acquiring measurement data, building scoring models, and selecting weight factors;

  • Measurement effect verification validates the measurement system: the formulas are calibrated through multiple rounds of measurement experiments until a model that best matches the expected results is found;

  • Measurement architecture construction is the engineering implementation of the measurement system, including architecture selection, architecture construction, and business logic implementation.

Clarify measurement requirements

85656148d088e966800cdf469d8698a3.png

The process of improving the degree of simulation of full-link stress testing can be summarized in three major stages:

  1. Improving coverage. Early full-link stress testing focused its coverage on 8 dimensions. The standard of measurement was the gap found when comparing real peak traffic with online traffic. Driven by business needs, there was a lot of subjective verification, and the goals were mostly about covering a number of interfaces and categories.

  2. Seeing the problems clearly. Once coverage in the different dimensions had risen to a certain level, we began to pay attention to stress test effectiveness and to analyze the weak points of stress testing so as to make targeted improvements. During this period, we built a series of visualization products based on the data of the 8 dimensions to help us see stress test problems clearly.

  3. Goal orientation. Seeing the problems clearly is not enough. We also need an indicator measurement system to quantify the current degree of simulation and to set a reasonable annual target based on the current situation, so as to create traction. This led to the measurement system for the degree of simulation. Within the scope of the original improvement dimensions, and considering the current infrastructure and the cost of technical implementation, we finally settled on five measurable dimensions and broke each one down further, as follows:

  • Interface coverage (stress test interface coverage, traffic achievement degree of simulation [router, inrouter])

  • Scenario coverage (scenario definition, scenario traffic, traffic comparison)

  • Category coverage (full category set, category traffic, traffic comparison)

  • Link coverage (interface dimension, traffic dimension)

  • Module coverage (module traffic coverage, module interface coverage, resource dimension (CPU/memory))

Measurement system construction

Survey of Metrics

Data normalization

Before measuring the degree of simulation, the data must first be normalized. The goals are:

  • Positive orientation (consistency processing): unify the types of evaluation indicators so that attribute values are positively correlated with measurement accuracy

  • Dimensionless processing: remove the effect of different indicators having different units

  • Normalization: bring different indicators, whose value ranges differ, onto a common scale

Method selection

ba304c1bc59a9babeee179fe3af32fe0.png

Scoring Formula Argument Selection

Consider the interface traffic achievement indicator. By its nature it is an intermediate-type indicator and calls for an intermediate-type transformation: the closer the total traffic is to the target traffic, the better the indicator is fulfilled. There are two ways to compute it.

The first uses the absolute value of the difference between total traffic and target traffic as the independent variable; the second uses the ratio of total traffic to target traffic. Judging from the traffic data, however, with the absolute difference the sample points differ so widely that the normalized scores cluster at one point, most of them at 100, as shown in the following figure:

ba2135598ce4e4370b3a028c0ac2ba98.png

We therefore use the ratio as the independent variable, which improves the distribution, as shown in the following figure:

9183c85b07fafe7a7f37cb174fcd46d7.png
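A minimal sketch of why the ratio behaves better than the absolute difference. The interface names and traffic numbers are made up for illustration, and the linear score `100 * (1 - deviation / largest deviation)` is an assumed stand-in for the normalization step, not the production formula.

```python
def scores(deviations):
    """Map deviations to 0-100, where the largest deviation scores 0."""
    worst = max(deviations)
    if worst == 0:
        return [100.0 for _ in deviations]
    return [100.0 * (1.0 - d / worst) for d in deviations]


# Hypothetical (total_qps, target_qps) pairs. The interfaces overshoot their
# targets by 50%, 20%, 10% and 2%, at very different traffic scales.
samples = {
    "big_api": (150_000, 100_000),
    "mid_api": (1_200, 1_000),
    "small_api": (110, 100),
    "tiny_api": (51, 50),
}

abs_dev = [abs(total - target) for total, target in samples.values()]
ratio_dev = [abs(total / target - 1.0) for total, target in samples.values()]

abs_scores = dict(zip(samples, scores(abs_dev)))
ratio_scores = dict(zip(samples, scores(ratio_dev)))

for name in samples:
    print(f"{name:10s} abs={abs_scores[name]:6.2f} ratio={ratio_scores[name]:6.2f}")
```

With the absolute difference, the one high-traffic interface dominates the scale and every other interface lands above 99 points, which mirrors the clustering described above. With the ratio, the scores spread out according to how far each interface is from its own target.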

Metric Data Preprocessing

Stress test traffic prediction

Generally, the peak order volume used in a stress test is set a certain percentage above the historical holiday peak. However, because the stress test peak has never actually occurred in production, being able to predict the water level and traffic of the production business system at that peak greatly improves the accuracy of the degree of simulation.

Didi has a dedicated internal forecasting capability that, based on historical traffic growth, brings in factors such as weather and solar terms to estimate traffic for the coming month. Building on this supply and demand forecasting model, we take holiday factors and growth fluctuations into account and minimize the error between the objective function and the real traffic curve to obtain the traffic growth curve and interface peak values that best fit the stress test. Multiple rounds of verification on actual holidays show an error below 5% on core business indicators.

b87bebc3902365bdf104023193e37c08.png

Traffic forecasting model implementation and effect verification results

Traffic forecasting is itself a sizable piece of systems engineering, so we will not cover it in detail here. What needs to be emphasized is that traffic forecasting predicts the ideal state of production traffic, while the actual situation in production is far more complicated. Any production change or anomaly affects the forecast, for example:

  • During holiday peaks, when traffic is too high, the business performs degradation and similar operations, and the production traffic model changes significantly;

  • In the order matching business, if a downstream underlying service hangs, crashes, or suffers an avalanche, the upstream link receives an impact far larger than the estimated traffic.

Data Noise Reduction

Even with the ratio as the independent variable, the score distribution was still uneven, with a large number of scores falling between 95 and 100. Analysis showed that a few outliers with very large ratios were distorting the overall distribution. The distribution after removing these points is shown below.

f059ac43abd01f049f3a9e3f7305e5d1.png

After removing the outliers, the samples are evenly distributed across the 100-point scoring scale. Outliers whose ratio is far higher than that of most data therefore cannot be included in the sample for calculation. Depending on the specific business situation, such data should either be removed or added to the whitelist and considered separately.

We generally treat such data as noise. Denoising mainly relies on traffic filtering, black and white lists, thresholds, and data preprocessing.

Traffic filtering

Data with very low traffic is filtered out directly. Because its traffic is so low, its ratio becomes extremely high under normal stress test pressure. An excessively high ratio inflates the worst-case term of the formula, so that other indicators that are not actually that good still receive high scores. Figure 2 below shows a large number of scores above 95, which users perceive as not objective.

f8de3ca0d9027ed81ec1fdff535eee7a.png

Figure 1

b2c419ff1d0f272eb3d4d1dc7371bcb4.png

Figure 2

We therefore filter out data with router<100 and inrouter<500. Figure 4 shows the result after filtering; the scores are now more in line with expectations.

6b7f83a25da55dca1c5f6f3f3310a953.png

Figure 3

e39977b2b365d6524db95c454d083bbf.png

Figure 4

Black and white lists

For some scenarios, traffic data cannot be obtained, or stress testing is not possible for some reason, for example non-ride-hailing business scenarios or certain offline task calls. Such data is added to the blacklist and excluded from the score calculation. Interfaces that are low-traffic but core are added to the whitelist so that they are not identified as low-frequency traffic and filtered out.
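The traffic filter and the two lists can be combined into a single selection step. The sketch below is illustrative only: the `Interface` record, the interface names, the assumption that the thresholds are in requests per second, and the rule that the blacklist takes precedence over the whitelist are all assumptions rather than details from the production system.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; the unit (requests per second) is assumed.
MIN_TRAFFIC = {"router": 100, "inrouter": 500}


@dataclass(frozen=True)
class Interface:
    name: str
    layer: str      # "router" or "inrouter"
    traffic: float  # observed online traffic


def select_for_scoring(interfaces, blacklist=frozenset(), whitelist=frozenset()):
    """Return the interfaces that should enter the score calculation."""
    kept = []
    for itf in interfaces:
        if itf.name in blacklist:
            continue  # cannot be stress tested, or no usable traffic data
        if itf.name in whitelist:
            kept.append(itf)  # low-traffic but core: never filtered out
            continue
        if itf.traffic >= MIN_TRAFFIC[itf.layer]:
            kept.append(itf)
    return kept


# Hypothetical interfaces for illustration.
demo = [
    Interface("pNewOrder", "router", 5000),
    Interface("pRareButCore", "router", 20),
    Interface("pLowTraffic", "router", 30),
    Interface("offlineJob", "inrouter", 900),
    Interface("innerMatch", "inrouter", 400),
]
kept_names = [
    itf.name
    for itf in select_for_scoring(
        demo, blacklist={"offlineJob"}, whitelist={"pRareButCore"}
    )
]
print(kept_names)
```

Keeping the lists as explicit inputs makes every exception visible and reviewable, instead of burying special cases in the thresholds.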

Data Thresholding and Preprocessing

We want each final score to be stable, so that adding new samples does not change the scores of items that were already scored. We therefore set a fixed threshold for the max|x - k| term of the intermediate-type transformation formula: x = 7. When x is greater than 7, the score is 0.

In observing the scoring results, we found that the scores for the two cases 0<x<2 and 2<x<7 differed considerably. For the case 0<x<2 we therefore let x = 1/x, which resolves the large difference in scores between the two cases.
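A minimal sketch of an intermediate-type score with a fixed worst-case threshold. It assumes the common form `1 - |x - k| / |worst - k|` for the intermediate-type transformation and an optimal ratio of `k = 1`; the exact production formula is not given in the text. The reciprocal preprocessing for small x is not modeled here.

```python
def achievement_score(x, best=1.0, worst=7.0):
    """Score a traffic ratio x on a 0-100 scale.

    x     : total traffic / target traffic
    best  : ratio that earns 100 points (assumed to be 1.0)
    worst : fixed threshold at or beyond which the score is 0 (7 in the text)
    """
    if x < 0:
        raise ValueError("ratio must not be negative")
    if x >= worst:
        return 0.0
    return max(0.0, 100.0 * (1.0 - abs(x - best) / abs(worst - best)))


for ratio in (1.0, 1.5, 4.0, 7.0, 12.0):
    print(ratio, round(achievement_score(ratio), 2))
```

Because `worst` is a constant rather than the maximum over the current sample, a newly added interface with an extreme ratio scores 0 without shifting the scores of interfaces that were already evaluated.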

Analyzing the link coverage data, we found that some interfaces have no production traffic, or very little. These interfaces should not be included in link coverage. We therefore designed filtering rules that use the traffic at the main link entrance as a reference. A custom rule might, for example, filter out interfaces whose traffic is below 1% or 2% of the main link entrance traffic. Because this filter threshold grows when the main link traffic is large, we also cap it at 100 qps: when the computed threshold exceeds 100, it is treated as 100.

3a6c92f149e16e72f0ddd927dcdf65b3.png

Before filtering

48d3b887af2ec29894567cda3be3dabf.png

After filtering
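The capped, entrance-relative filter rule can be sketched as follows. The function names and the sample traffic values are hypothetical; the 1% fraction and the 100 qps cap come from the text.

```python
def link_filter_threshold(entry_qps, fraction=0.01, cap_qps=100.0):
    """Minimum qps an interface needs in order to count toward link coverage."""
    return min(entry_qps * fraction, cap_qps)


def filter_link_interfaces(interface_qps, entry_qps, fraction=0.01, cap_qps=100.0):
    """Drop interfaces whose production traffic is below the threshold."""
    threshold = link_filter_threshold(entry_qps, fraction, cap_qps)
    return {name: qps for name, qps in interface_qps.items() if qps >= threshold}


# Hypothetical link: the entrance receives 50,000 qps, so 1% would be 500 qps,
# which the cap reduces to 100 qps.
link = {"entry": 50_000, "pricing": 12_000, "coupon": 150, "legacy_hook": 3}
kept = filter_link_interfaces(link, entry_qps=link["entry"])
print(kept)
```

Without the cap, a busy entrance would push the threshold so high that interfaces with meaningful traffic, such as `coupon` in this example, would be dropped from the link.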

Metric model building

Interface coverage

Interface coverage looks at stress test coverage from the router and inrouter dimensions. It has two parts: first, how well the stress test model covers the interfaces of the access layer; second, whether the stress test traffic reaches the expected target, which reflects whether the stress test model or scenario configuration is reasonable. The stress test targets are set using the traffic forecasting capability. Interface coverage is calculated as follows:

807b0f74a2e243dce96bf457b0cbfa69.png

Here, 0c9e9f729860d2779aab9a6fb17eb9d7.png represents the traffic achievement degree, cb166c4659ebce29b2b2c28971ffb9d8.png is the formula for calculating interface achievement, and 202f7e3dc0c0569814873bfe0c564cc1.png is the ratio of total traffic (stress test traffic + online traffic) to target traffic; k is the optimal expected value of x, and y is the worst-case threshold of x. The weight factor sets different weights according to the level of the interface, with 44be95378de9eacf2d7151098344cfdd.png representing the weight calculation. For interface coverage 19b40b9b6bbd09c2436634b061fddda6.png, a weight factor is generally introduced, and the final interface coverage value is calculated as the product of the two dimensions: d7b8b4093655fd762dc5b76d0953d1c4.png. The final effect is as follows:

3140c5d639de72b1caecbe928a92e352.png

Interface Coverage Effect

b25d122e060a94ddb9a51ded0efe0ee8.png

Stress test traffic and forecast traffic curves
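The formula images are not reproduced in this text, so the following is only one plausible reading of "the product of the two dimensions": the weighted share of interfaces that the stress test covers, multiplied by the weighted average traffic achievement of the covered interfaces. The record layout, weights, and example values are assumptions.

```python
def interface_coverage(interfaces):
    """Combine coverage and traffic achievement into one 0-100 value.

    interfaces: iterable of (weight, covered, achievement) tuples, where
    weight reflects the interface level, covered is a bool, and achievement
    is a 0-100 traffic achievement score.
    """
    interfaces = list(interfaces)
    total_weight = sum(w for w, _, _ in interfaces)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    covered = [(w, a) for w, is_covered, a in interfaces if is_covered]
    covered_weight = sum(w for w, _ in covered)
    if covered_weight == 0:
        return 0.0
    coverage_share = covered_weight / total_weight
    mean_achievement = sum(w * a for w, a in covered) / covered_weight
    return coverage_share * mean_achievement


# Hypothetical interfaces: two covered, one missed by the stress test model.
example = [(3, True, 80.0), (2, True, 100.0), (1, False, 0.0)]
print(round(interface_coverage(example), 2))
```

Under this reading, a stress test model scores poorly either because it misses important interfaces or because the covered interfaces are far from their traffic targets.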

Module coverage

For module coverage, the core idea is to analyze how changes in business order volume, machine hardware resource consumption, and module traffic correlate during peak hours or holiday peaks, and to fit them against the stress test model. The key quantities are the fit between business volume and module traffic, and the fit between business volume and resource usage.

Here, resource usage mainly considers CPU usage, and module traffic is split into two dimensions: the coverage rate of the module's stress test interfaces, and the fit between the module's total stress test traffic and production traffic. The degree of fit is calculated as follows:

Degree of fit between module interfaces, module traffic, and business volume:

  • Perform a linear fit of the order volume in the same production peak period (for example the Friday evening peak) against the traffic of each interface, link, and module to obtain an equation of the general form y = w'x + e

  • Calculate the R-squared value from the module's actual peak-period traffic and the predicted traffic given by the equation above. The algorithm is: 9eaf7c5ebb8b39e907ea0bb306736ce8.png. The result lies in (-∞, 1] and is mapped to [0, 100]; a result below 0 (a fit worse than simply predicting the mean) is set to 0.

Degree of fit between module CPU usage and business order volume:

  • Perform a linear fit of the order volume in the same production peak period (for example the Friday evening peak) against the actual number of CPU cores used by each module to obtain an equation of the general form y = w'x + e

  • Calculate the R-squared value from the number of CPU cores each module actually used in the same peak period and the predicted value calculated in the previous step: 13226978dc8c508000af0d322a942002.png. A result below 0 (a fit worse than simply predicting the mean) is set to 0.
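The two fitting steps above can be sketched in plain Python. The sketch assumes a single explanatory variable (order volume), ordinary least squares, the standard definition R² = 1 - SS_res / SS_tot, and a linear mapping of [0, 1] onto [0, 100]; the sample numbers are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + e with one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are constant; the slope is undefined")
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    e = mean_y - w * mean_x
    return w, e


def r_squared(actual, predicted):
    """Coefficient of determination; can be negative for very poor fits."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R-squared is undefined")
    return 1.0 - ss_res / ss_tot


def fit_score(actual, predicted):
    """Map R-squared to 0-100, clamping fits worse than the mean to 0."""
    return 100.0 * max(0.0, r_squared(actual, predicted))


# Hypothetical peak-hour samples: order volume vs. module traffic (qps).
orders = [100, 200, 300, 400]
traffic = [1000, 2000, 3000, 4000]
w, e = fit_line(orders, traffic)
predicted = [w * x + e for x in orders]
print(w, e, fit_score(traffic, predicted))
```

The same functions apply unchanged to the CPU case by passing the number of CPU cores used as `ys` instead of module traffic.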

The higher the degree of fit, the stronger the relationship between the module and business volume, and the more accurate the traffic forecast. We therefore divide the core module coverage calculation into two categories:

The first category: the degree of fit is greater than 0.7

  • Compute predicted values using the fit equation (y = w'x + e from the goodness-of-fit calculation)

  • Score with a variant of MSE, which is widely recognized in the industry as a strict measure of goodness of fit: r = 1 - (1/n) ∑((Y - y)/y)²

The second category: the degree of fit is less than 0.7

  • Calculate the average and peak values of traffic and CPU for the production peak period to obtain FlowAver, FlowMax, CpuAver, and CpuMax

  • Calculate the average and peak values of traffic and CPU for the stress test period to obtain PreFlowAver, PreFlowMax, etc.

  • R = 0.5*abs(PreFlowAver - FlowAver) / FlowAver + 0.5*abs(PreFlowMax - FlowMax) / FlowMax
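Both categories can be written directly from the formulas above. In this sketch, Y is assumed to be the value observed during the stress test and y the value expected from the fit equation; the text does not state which is which. Note that R as written is a relative deviation (0 means a perfect match), and how it is converted into a score is not specified here.

```python
def mse_variant_score(observed, expected):
    """First category: r = 1 - (1/n) * sum(((Y - y) / y) ** 2).

    observed: values seen during the stress test (assumed to be Y)
    expected: values predicted by the fit equation (assumed to be y, non-zero)
    """
    if len(observed) != len(expected) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    return 1.0 - sum(((Y - y) / y) ** 2 for Y, y in zip(observed, expected)) / n


def mean_peak_deviation(pre_flow_aver, pre_flow_max, flow_aver, flow_max):
    """Second category: equal-weight relative deviation of average and peak."""
    return (0.5 * abs(pre_flow_aver - flow_aver) / flow_aver
            + 0.5 * abs(pre_flow_max - flow_max) / flow_max)


# Hypothetical values for illustration.
r = mse_variant_score(observed=[110, 180], expected=[100, 200])
R = mean_peak_deviation(pre_flow_aver=90, pre_flow_max=180,
                        flow_aver=100, flow_max=200)
print(round(r, 4), round(R, 4))
```

Dividing each error by the expected value makes the first measure scale-free, so modules with very different traffic levels can be compared on the same footing.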

The degree of simulation for this dimension is calculated as:

70d9d631e4012608649fdaad8dff8914.png

The result in practice is as follows:

1af2c3cd0bb9bd1dfbe2a41840b2d829.png

efac51cc6ccf290c3f99c894c4ef38ef.png

Link coverage

Link coverage refers to the fit between the direction of stress test traffic along a link and that of production traffic. Its core purpose is to show how well the core links are covered, from the entry access layer down to the final storage layer, and so help stress test engineers find weak points in link coverage. The following common scenarios affect the score:

  • The stress test scenarios and user characteristics are not fully covered, so some logic is never exercised;

  • Link anomalies, or inconsistencies between a service's stress test configuration and its production configuration, cause behavior to diverge from the production logic;

  • The stress test link has been given special handling.

The link coverage score is calculated on the same principle as module interface coverage. By computing the degree of fit between business volume and each interface, we derive the interface's expected traffic at the stress test water level and use it to determine whether the interface is covered. When calculating the total score of a link, weight factors are introduced:

d38878ea177769f8bcd285b37a4f6a15.png

The link dimension contains multiple links, so the simulation score of the link dimension is then calculated according to link weights:

c584f9ce0ea233ff079dee4a5d5db787.png

9c8316d7e22216228412960dacdbfa44.png

Single link pressure test coverage topology diagram

6842f135169e3e5fc2e3a87e2555b515.png

Link pressure test coverage front-end display renderings

Category & scenario coverage

For categories and scenarios, the reference is their real proportion in production before the stress test, together with the expected stress test water level, both of which are relatively clear. It is only necessary to adjust the stress test model before the stress test and then calculate the stress test coverage and target achievement for each category and scenario. The weight factor is determined by traffic: the greater the traffic, the greater the weight.

668eab4dac2b97e68b97756d9fa1f34a.png

Here, 9d705f0d9a106f1f60584dc265c1fe60.png is the traffic and 13240dad21a34b24e906aeab227423dc.png is the weight. The weight factor of scenario coverage is determined mainly by traffic volume, so the ratio of the current item's traffic to the total traffic, 0b4bbb610b8c78728ccbcdc1185b7f56.png, is used as the weight c89c779995569d7d2887c48b1abed4f4.png. The higher the traffic, the greater its share.
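A minimal sketch of traffic-proportional weighting, assuming the dimension score is the weighted sum of per-item scores (the formula image is not reproduced here). The category names, traffic figures, and scores are hypothetical.

```python
def traffic_weighted_score(items):
    """Combine per-item scores using each item's share of total traffic.

    items: mapping of name -> (traffic, score), with score on a 0-100 scale.
    """
    total = sum(traffic for traffic, _ in items.values())
    if total <= 0:
        raise ValueError("total traffic must be positive")
    return sum((traffic / total) * score for traffic, score in items.values())


# Hypothetical categories with their traffic and achievement scores.
categories = {
    "express": (8000, 90.0),
    "premium": (1500, 70.0),
    "carpool": (500, 40.0),
}
overall = traffic_weighted_score(categories)
print(round(overall, 2))
```

Because the weights sum to 1, the result stays on the same 0-100 scale as the individual scores, and a poorly covered low-traffic category lowers the total only slightly.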

521a5b2b08b25e1c7bdf9e7a3efcd3b6.png

Overall Effect Demonstration 

9a34e5c5e2faecd688b7f13867b33f76.png

  Simulation Measurement Effects (Beta)

Summary and Outlook

Given Didi's business scenarios and the way full-link stress testing is carried out, building the degree-of-simulation measurement is an indispensable part of the stress testing closed loop. By quantitatively evaluating the degree of simulation along the five dimensions of interfaces, links, modules, categories, and scenarios, we can, to a certain extent, explain stress test coverage with objective data and give on-duty engineers substantial help in calibrating the stress test model and improving stress test coverage.

However, the production environment is complex and changeable, and these dimensions alone are not enough. We need to keep exploring data in other dimensions, refine the accuracy of the degree of simulation, and unlock its value, offering it as a platform capability that empowers neighboring businesses.

To improve accuracy and unlock the value of the degree of simulation, we will start from the following aspects:

  • Accuracy improvement: expand the dimension boundaries by bringing in the total resource usage of IDC data centers, network usage (such as leased line bandwidth), and the overall traffic of the access layer.

  • Value mining: adjust the stress test model intelligently to support intelligent stress testing and reduce stress testing costs; use traffic forecasting and capacity planning capabilities to help calibrate rate limit values for neighboring businesses and pre-assess capacity risks; detect major changes in business traffic and issue timely warnings; and so on.

Building the degree-of-simulation measurement would not have been possible without cooperation from all parties within Didi. We thank the group stress testing platform, the data team, the business teams, and the basic services team for their help and support, and we look forward to working together to improve it further and empower the business.

 END 

Department introduction 

This article comes from the quality middle-platform team of ride-hailing travel technology. As the R&D team for the ride-hailing business, travel technology builds the end-user experience platform, the C-end user product ecosystem, the B-end capacity supply ecosystem, the travel safety ecosystem, the service governance ecosystem, and the core guarantee system, to create a travel platform that is safe, reliable, efficient, convenient, and trusted by users.


Origin blog.csdn.net/DiDi_Tech/article/details/132353278