Exploration and Practice of Full-Link Stress Testing at Ele.me

Background: Since 2015, with the rapid development of the Internet industry, Ele.me's business has entered a stage of rapid expansion. Its online food delivery platform now serves 260 million users across more than 2,000 cities nationwide.


The food delivery business itself has the following characteristics:

  • Timeliness: the entire process, from the user placing an order to the merchant accepting it to delivery to the customer's door, must complete within a bounded time window, so the timeliness requirements are very high;

  • High concurrency: tens of millions of orders from a huge number of users are concentrated in the two peak periods around lunch and dinner, which puts enormous pressure on the entire system;

  • Flash sales: to make full use of machine resources during idle periods, flash-sale promotions are run on the hour at several fixed times, and the instantaneous traffic they generate can even exceed the lunchtime peak;

  • Regularity: these traffic surges are not one-off events but a routine occurrence, which places extremely high demands on system stability.

Given these factors, together with the recurring capacity-related incidents in production, full-link stress testing of the overall system became imperative.

A difficult journey

Ele.me's full-link stress testing is performed in the production environment rather than a test environment, mainly for two reasons:

  • 1. The hardware resources and test data in the test environment differ too much from production, so the metrics obtained there have little reference value;

  • 2. Dependencies between services are intricate; the test environment is hard to simulate faithfully and not stable enough.

Full-link stress testing was not achieved overnight; it went through three main stages.

The first stage: shrinking the cluster

During off-peak hours, servers are removed from the cluster one at a time so that the request volume on each remaining server keeps increasing. This makes it possible to evaluate the current cluster capacity and to estimate how many servers will be needed as order volume grows.
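To make the extrapolation concrete, here is a minimal sketch of the arithmetic this stage relies on; the traffic numbers, safety factor, and growth factor are assumptions for illustration, not Ele.me's real figures.

```java
// A minimal sketch of the capacity arithmetic behind the "shrink the cluster"
// approach: keep removing servers until one node nears its limit, then extrapolate
// how many nodes a projected peak would need. All numbers are hypothetical.
public class ShrinkClusterEstimate {
    public static void main(String[] args) {
        double clusterPeakQps  = 12_000; // observed cluster-wide peak QPS (assumed)
        double singleNodeLimit = 900;    // QPS at which one remaining node hit its bottleneck (assumed)
        double safetyFactor    = 0.7;    // plan to run nodes at ~70% of their limit
        double growthFactor    = 1.5;    // expected growth in order volume

        double projectedPeak = clusterPeakQps * growthFactor;
        int serversNeeded = (int) Math.ceil(projectedPeak / (singleNodeLimit * safetyFactor));
        System.out.printf("Projected peak %.0f QPS -> roughly %d servers%n", projectedPeak, serversNeeded);
    }
}
```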

Advantages:

  • Real traffic is used, so the capacity estimate is the most realistic;

  • No stress test cases or test data need to be prepared, saving a lot of preparation time;

  • No dirty data is generated;

Disadvantages:

  • The risk is relatively high: once a bottleneck appears, the removed servers must be restored immediately;

  • Because the total request volume stays constant, removing servers does not increase the load on underlying components such as the DB and MQ, so their capacity cannot be estimated and the overall evaluation is inaccurate;

  • A certain level of incoming traffic is required; if traffic is too low, no bottleneck will ever be reached.

The second stage: single-business stress testing

Stress-test a single business line during off-peak hours. This requires testers to have a good feel for the business and a deep understanding of both the business and the system architecture.

To give an example, the following is a simple architecture diagram of the merchant system.
[Figure: simplified merchant system architecture]
Merchant read requests go through the API, while write requests go through EOSAPI, the basic order service. If the stress traffic comes only from the merchant system and capacity is evaluated on that basis, the results look better than the real online situation, because in production EOSAPI is called not only by the merchant system but also by the order placement service at the same time.

Based on the above situations, we finally decided to conduct a full-link stress test online.

The third stage: online full-link stress testing

Simulate orders placed on the delivery platform and through open platforms (third-party entry points), simulate merchants accepting orders and logistics dispatching, simulate large volumes of user query operations, and cover every critical-path interface. By steadily increasing the pressure from each entry point, the capacity of every service can be evaluated and the performance indicators of the underlying services (including every middleware component) can be observed, making the capacity evaluation of the whole business system reasonably accurate.

Of course, this approach also brings some problems, the most typical being how to handle the dirty data generated by write requests. At first we wanted to route stress-test traffic, identified by a marker, to physically isolated storage, but we found that too many places would have to be modified (various business systems, middleware, and so on), and the online capacity problem needed to be solved urgently. We therefore decided to isolate the stress-test data logically, that is, to distinguish it from real data with a special marker. Downstream, big-data analytics, clearing and settlement, and similar jobs filter out the stress-test data by that marker.
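As an illustration of the logical-isolation idea, the sketch below shows how a stress-test marker might be read from a request and stamped onto the data it writes; the header name, table layout, and filter are assumptions, not Ele.me's actual convention.

```java
// A minimal sketch of logical isolation by marker: stress-test requests carry a
// flag (the header name is hypothetical), writes persist that flag, and downstream
// jobs such as settlement or BI filter rows on it.
import java.util.Map;

public class StressTestMarker {
    public static final String HEADER = "X-Load-Test"; // hypothetical header name

    public static boolean isStressTraffic(Map<String, String> headers) {
        return "1".equals(headers.get(HEADER));
    }

    // Orders created by stress traffic are tagged so later jobs can exclude or purge them.
    public static String buildInsertSql(long orderId, long userId, boolean stress) {
        return "INSERT INTO orders(id, user_id, is_stress) VALUES ("
                + orderId + ", " + userId + ", " + (stress ? 1 : 0) + ")";
    }

    // Downstream analytics / clearing jobs simply add the opposite filter.
    public static final String REPORTING_FILTER = "WHERE is_stress = 0";
}
```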

Implementation

How is a full-link stress test carried out? It mainly involves the following aspects.

1. Sorting out the business model

Sorting out the business model requires considering both the business itself and the system architecture behind it. This step is very important: how thoroughly it is done directly determines whether the final stress test results have any reference value.

For example, the following is the architecture of one service involved in the full-link stress test.

Here, attention must be paid not only to the performance of the API exposed by X Service itself but also to the consumption capacity of the X Service Cache MQ Consumer. Sorting out business models in this way demands that stress testers be highly attuned to the business architecture.

The specific sorting work mainly covers the following aspects (a sketch of capturing them as data follows the list):

  • Critical path

  • Business call relationships

  • The interfaces each business exposes

  • Interface type (HTTP, Thrift, SOA, etc.)

  • Whether each interface is a read or a write

  • The traffic ratio between interfaces
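As a simple illustration, the combed model can be captured as data that the stress scripts consume directly; the endpoint names, protocols, and traffic shares below are made up.

```java
// A minimal sketch of recording the combed business model as data: each endpoint
// carries its protocol, read/write nature, and share of traffic, so test plans can
// reproduce the real request mix. All names and ratios here are illustrative.
import java.util.List;

public class BusinessModel {
    enum Protocol { HTTP, THRIFT, SOA }
    enum Kind { READ, WRITE }

    record Endpoint(String name, Protocol protocol, Kind kind, double trafficShare) {}

    static final List<Endpoint> MERCHANT_CRITICAL_PATH = List.of(
        new Endpoint("listMerchants", Protocol.HTTP,   Kind.READ,  0.55),
        new Endpoint("searchKeyword", Protocol.HTTP,   Kind.READ,  0.30),
        new Endpoint("acceptOrder",   Protocol.THRIFT, Kind.WRITE, 0.15)
    );
}
```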

2. Data model construction

The general principle of data model construction: closely follow the business scenario and simulate the requests of real users as much as possible.

Example 1: Write request

Stress test scenario: User orders

Stress test approach:

Scale the numbers of users, merchants, dishes, etc. in proportion to production;

Tag the stress-test traffic with a special marker;

Mock payment, SMS, and other external steps based on the marker (see the sketch below);

Clean up data afterwards based on the marker.
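For instance, the mock-by-marker step might look like the following sketch; PaymentClient and SmsClient are hypothetical stand-ins for the real gateways, not Ele.me's actual interfaces.

```java
// A minimal sketch of mocking external calls on the write path: when an order
// carries the stress-test flag, payment and SMS are short-circuited with canned
// results instead of hitting real third parties. The client interfaces are hypothetical.
public class OrderWritePath {
    interface PaymentClient { boolean charge(long orderId, long amountCents); }
    interface SmsClient { void notifyUser(long userId, String text); }

    private final PaymentClient payment;
    private final SmsClient sms;

    public OrderWritePath(PaymentClient payment, SmsClient sms) {
        this.payment = payment;
        this.sms = sms;
    }

    public boolean placeOrder(long orderId, long userId, long amountCents, boolean stressFlag) {
        // Stress traffic: pretend the charge succeeded and skip the SMS notification.
        boolean paid = stressFlag || payment.charge(orderId, amountCents);
        if (paid && !stressFlag) {
            sms.notifyUser(userId, "Your order " + orderId + " is confirmed");
        }
        return paid;
    }
}
```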

Example 2: Read request

Stress test scenario: Merchant list and keyword query

Stress test approach:

Pull production access logs and replay them according to the real interface mix (sketched below).
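A rough sketch of such a replay driver follows; the log path and line format are assumptions, and real tooling would feed the sampled requests into the load generator rather than print them.

```java
// A minimal sketch of log replay that preserves the real interface mix: bucket the
// pulled log lines by endpoint to report their shares, then sample uniformly over
// all lines, which automatically keeps the original proportions. Log format is assumed.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

public class LogReplaySampler {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Path.of("access.log")); // hypothetical log file
        if (lines.isEmpty()) return;

        Map<String, Integer> countByEndpoint = new HashMap<>();
        for (String line : lines) {
            String endpoint = line.split(" ")[1]; // assumes "METHOD /path ..." lines
            countByEndpoint.merge(endpoint, 1, Integer::sum);
        }
        countByEndpoint.forEach((ep, n) ->
            System.out.printf("%s -> %.1f%% of traffic%n", ep, 100.0 * n / lines.size()));

        Random rnd = new Random();
        for (int i = 0; i < 10; i++) {
            System.out.println("would replay: " + lines.get(rnd.nextInt(lines.size())));
        }
    }
}
```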

Example 3: A new service with no logs

Stress test scenario: Merchant qualification query

Stress test approach:

Build test data for a 0% cache hit rate to measure service interface and database performance;

Measure service interface performance at a 100% cache hit rate;

Measure service interface performance at the cache hit rate estimated by the business (a sketch of driving the hit rate follows).
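One way to steer the hit rate during data construction is sketched below; the key naming, warm key pool, and target ratio are all illustrative.

```java
// A minimal sketch of controlling the cache hit rate for a log-less new service:
// a target of 0.0 generates only unique (cold) keys, 1.0 reuses a pre-warmed key
// set, and anything in between mixes the two. Key names are illustrative.
import java.util.Random;
import java.util.UUID;

public class CacheHitRateDriver {
    private static final Random RND = new Random();

    static String nextQualificationKey(double targetHitRate, int warmKeyCount) {
        if (RND.nextDouble() < targetHitRate) {
            return "merchant:qualification:" + RND.nextInt(warmKeyCount); // pre-warmed key -> hit
        }
        return "merchant:qualification:cold:" + UUID.randomUUID();        // unique key -> miss
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(nextQualificationKey(0.8, 1_000)); // ~80% hit rate, per business estimate
        }
    }
}
```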

Of course, we have also hit some pitfalls caused by stress-test data that differed from the real scene:

Test user data did not account for the sharding distribution, causing a single DB shard to overheat (see the sketch after this list);

Too few test users, so each test user accumulated far too many orders;

Too few test merchants, so inventory-deduction locks were fiercely contended.
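A sketch of sidestepping the first pitfall: generate test user IDs so they spread evenly over the DB shards. The shard count and ID offset here are assumptions.

```java
// A minimal sketch of shard-aware test user generation: IDs are constructed so that
// id % SHARD_COUNT cycles through every shard, spreading load evenly instead of
// overheating one node. Shard count and offset are hypothetical.
public class ShardedUserIdGenerator {
    private static final int SHARD_COUNT = 32;                  // hypothetical shard count
    private static final long TEST_ID_OFFSET = 9_000_000_000L;  // keep clear of real user IDs

    public static long[] generate(int usersPerShard) {
        long[] ids = new long[SHARD_COUNT * usersPerShard];
        int idx = 0;
        for (int shard = 0; shard < SHARD_COUNT; shard++) {
            for (int n = 0; n < usersPerShard; n++) {
                // TEST_ID_OFFSET is a multiple of SHARD_COUNT, so id % SHARD_COUNT == shard
                ids[idx++] = TEST_ID_OFFSET + (long) n * SHARD_COUNT + shard;
            }
        }
        return ids;
    }
}
```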

Clearly, building the stress-test data model is not a one-off effort; it is a process of continuous adjustment and optimization in which the realism, freshness, and safety of the data must all be considered.

3. Choosing a stress testing tool

At present, Ele.me's stress testing is mainly built on JMeter, for the following reasons:

Open source and lightweight, so the implementation of every component can be understood and even modified;

Easy to extend with plugins, so request types such as Thrift can be supported (a skeleton sampler is sketched below);

Rich built-in capabilities (e.g., remote/distributed load generation, synchronizing timers for rendezvous points);

Stress test results line up with the company's monitoring metrics.
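The plugin point deserves a concrete illustration: a JMeter Java Request sampler skeleton like the one below can wrap a non-HTTP protocol such as Thrift. The remote call itself is only a placeholder here; this is a generic skeleton, not Ele.me's actual plugin.

```java
// A minimal skeleton of a JMeter Java Request sampler for a custom protocol such as
// Thrift. The remote call is a placeholder; wire in the real Thrift client where indicated.
import org.apache.jmeter.config.Arguments;
import org.apache.jmeter.protocol.java.sampler.AbstractJavaSamplerClient;
import org.apache.jmeter.protocol.java.sampler.JavaSamplerContext;
import org.apache.jmeter.samplers.SampleResult;

public class ThriftSampler extends AbstractJavaSamplerClient {

    @Override
    public Arguments getDefaultParameters() {
        Arguments args = new Arguments();
        args.addArgument("host", "localhost");
        args.addArgument("port", "9090");
        return args;
    }

    @Override
    public SampleResult runTest(JavaSamplerContext ctx) {
        SampleResult result = new SampleResult();
        result.sampleStart();
        try {
            // Placeholder: open a connection to ctx.getParameter("host") /
            // ctx.getParameter("port") and invoke the Thrift service method here.
            Thread.sleep(5); // stands in for the remote call
            result.setSuccessful(true);
            result.setResponseCodeOK();
        } catch (Exception e) {
            result.setSuccessful(false);
            result.setResponseMessage(e.getMessage());
        } finally {
            result.sampleEnd();
        }
        return result;
    }
}
```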

4. Monitoring and collecting stress test metrics

Application level

  • Error rate

  • Throughput

  • Response time (median, p90, p95, p99)

  • GC

Server resources

  • CPU utilization and load

  • RAM

  • Disk I/O

  • Network I/O

  • Number of connections

Basic services

  • SQM

  • Redis

  • DB

  • Other middleware

Key points

  • Do not judge response time by the average; watch the 95th percentile (see the sketch below);

  • Look at throughput and response time together;

  • Look at throughput and success rate together.
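To illustrate why the average is misleading, here is a small sketch with made-up latencies: the mean looks tolerable while the tail percentiles expose the slow requests.

```java
// A minimal sketch showing average vs. tail latency on made-up data: a couple of
// slow outliers barely move the mean but dominate the 95th/99th percentiles.
import java.util.Arrays;

public class LatencyPercentiles {
    // Nearest-rank percentile over a sorted array of millisecond latencies.
    static long percentile(long[] sortedMillis, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sortedMillis.length) - 1;
        return sortedMillis[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        long[] latencies = {12, 14, 15, 15, 16, 17, 18, 20, 450, 900}; // ms, two slow outliers
        Arrays.sort(latencies);
        double avg = Arrays.stream(latencies).average().orElse(0);
        System.out.printf("avg=%.0fms p95=%dms p99=%dms%n",
                avg, percentile(latencies, 95), percentile(latencies, 99));
        // avg is about 148 ms, yet the tail percentiles sit near a full second.
    }
}
```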


