How to calmly conduct performance testing ahead of a major sales event

Author: Zhao Jiajia

Every year, during shopping festivals such as Double Eleven, Christmas, and the Spring Festival sales, brand live broadcast rooms host millions of people at peak hours, all posting bullet comments, snapping up goods, and grabbing red envelopes at the same time. Brand stores see consultations, add-to-cart actions, orders, and payments on the same scale. An ever-larger user base, increasingly frequent interactions, and irregular pulses of traffic pose considerable challenges to application services.

Before looking at the performance challenges, let us briefly sort out the user journey and business logic of retail e-commerce. Taking e-commerce websites and apps as an example, an application service faced with a sudden influx of users sees a mix of activity: some users are constantly querying product information, some are registering accounts, some are modifying their shopping carts, and some are placing orders and paying. An e-commerce platform includes components such as web applications, middleware, and databases. The front-end web application receives and processes HTTP requests from users and generates the web pages returned to them; the middleware executes the business logic; the back-end database and storage read and store user and product information and status. To improve the access experience and service performance, some platforms deploy a data cache in front of the database. Load balancers deployed at the edge distribute the massive volume of user requests across multiple servers.

As mentioned above, during peak hours brand live broadcast rooms must let millions of people post comments, snap up goods, and grab red envelopes at the same time. If the service goes down, the consequences are serious and users are badly inconvenienced. The larger the user base, the wider the impact: an outage damages the brand's reputation with users and directly reduces business revenue.

What challenges will we encounter when preparing for stress testing?

(1) Difficult to predict traffic scale

The application services of an e-commerce website usually consist of two parts. The first is recommendation and search, which surfaces products to users and makes it easier for them to choose. The second is transaction and payment, which covers adding to cart, placing orders, paying, and related steps. The traffic models of the two parts are completely different. Recommendation and search traffic follows a slowly rising curve, so the pressure on the services increases gradually. Transaction traffic rises sharply: during the Double Eleven shopping event in particular, the pressure jumps to its peak in an instant, leaving the product and R&D teams and the services themselves no buffer time. This is an important reason why e-commerce products have higher availability requirements than other Internet products.

However, how high the peak will be and when it will arrive are difficult for the product and R&D team to predict in advance from the business team's descriptions alone. When simulating real users, the cost of a stress test depends on the pressure level chosen: set it too high and the test is expensive; set it too low and the test proves nothing. Estimating traffic is therefore a very important part of event preparation. It has also been the biggest difficulty in preparing for Double Eleven over the years: how to evaluate the actual carrying capacity of the core pages and of transaction and payment across the entire chain, from user login to completed purchase. Since the first Double Eleven in 2009, the business scale of the event has grown rapidly every year, and so has the uncertainty around the peak traffic at midnight.

(2) Diverse scenario stress testing

Once the traffic scale has been estimated, the second challenge is to build business scenarios that match real usage as closely as possible. E-commerce application services usually expose many service interfaces, and each user calls only some of them. One approach is to stress test all interfaces uniformly: apply the same pressure to each, measure the service capacity, and then scale out the relevant clusters accordingly. The other approach is scenario-based stress testing. Because users follow different paths, different interfaces carry different amounts of pressure; applying pressure in those proportions shows what each service actually needs, so that the minimum amount of service resources can support the maximum traffic.
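The scenario-based approach can be sketched roughly as follows. This is a minimal illustration, not a prescribed method: the interface names, the traffic ratios, and the per-instance capacity used in the example are all hypothetical assumptions.

```python
import math

# Hypothetical share of total requests that each interface receives in the
# scenario. Names and ratios are illustrative assumptions, not measured data.
SCENARIO_MIX = {
    "search": 0.50,
    "product_detail": 0.30,
    "add_to_cart": 0.12,
    "place_order": 0.05,
    "pay": 0.03,
}


def per_api_targets(total_qps, mix):
    """Split a total target QPS across interfaces according to the scenario mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("scenario ratios must sum to 1")
    return {api: total_qps * ratio for api, ratio in mix.items()}


def instances_needed(target_qps, qps_per_instance):
    """Smallest number of instances whose combined capacity covers the target."""
    return math.ceil(target_qps / qps_per_instance)


if __name__ == "__main__":
    targets = per_api_targets(100_000, SCENARIO_MIX)
    for api, qps in targets.items():
        # 800 QPS per instance is an assumed single-machine capacity.
        print(f"{api}: {qps:.0f} QPS, {instances_needed(qps, 800)} instances")
```

Under uniform stress testing every interface would be sized for the same pressure; with a mix like this one, the payment service is sized for a small fraction of what search needs.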

How to choose the stress testing solution that best matches your business characteristics

When designing and selecting a stress testing solution, two problems are often encountered. First, stress testing affects the online business and the users who are accessing the system normally. Second, stress testing pollutes online data, because the test data is written to the online database. To solve these two problems, the industry generally adopts a number of solutions, each with its own advantages, disadvantages, and applicable scenarios. You can choose flexibly according to the stage your business is at.

Build stress testing scripts and stress models

After selecting the stress testing solution, you can start configuring the stress testing script and the pressure model. Stress testing tools commonly used in the industry include JMeter and Performance Testing PTS. Without exception, these tools require the business APIs under test to be assembled into a stress testing script. The focus of this step is to confirm which APIs are to be tested, make sure none are omitted, and make sure their order matches the way users actually use the system. In an e-commerce stress test, if the login authentication API is omitted from the script, the subsequent order, logistics, inventory, and other APIs will fail at the permission check. The normal business logic then cannot execute, and the test cannot simulate the real business scenario.
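The effect of omitting the login step can be illustrated with a small, self-contained sketch. The `FakeShop` class below is a hypothetical in-memory stand-in for a backend, not a real service or any tool's API; it only shows why the order of steps in a script matters.

```python
class FakeShop:
    """Hypothetical in-memory stand-in for an e-commerce backend (no network)."""

    def __init__(self):
        self._tokens = set()

    def login(self, user):
        """Issue a token and remember it as valid."""
        token = f"token-{user}"
        self._tokens.add(token)
        return token

    def place_order(self, token, sku):
        """Return an HTTP-like status code for an order request."""
        # Permission check: requests without a valid token are rejected.
        if token not in self._tokens:
            return 401
        return 200


def run_script(shop, steps, user="u1", sku="sku-1"):
    """Run the steps in order and return the status code of each step."""
    token = None
    statuses = []
    for step in steps:
        if step == "login":
            token = shop.login(user)
            statuses.append(200)
        elif step == "place_order":
            statuses.append(shop.place_order(token, sku))
        else:
            raise ValueError(f"unknown step: {step}")
    return statuses


if __name__ == "__main__":
    print(run_script(FakeShop(), ["login", "place_order"]))
    print(run_script(FakeShop(), ["place_order"]))  # login omitted
```

A script that skips the login step only ever exercises the permission check, so its throughput and latency figures say nothing about the order logic behind it.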

At the same time, choose the pressure model that best fits the actual business scenario. For example, the pulse model simulates a sudden, instantaneous surge in traffic and is often used for flash sales and rush purchases; the incremental model simulates a steady increase in the number of users over a period of time and is commonly used to simulate warm-up scenarios. After determining the pressure value and the pressure model, you also need to determine the regional distribution of the load and fit it to the real user distribution as closely as possible, so that the test results are credible. For a regional online business, it is acceptable for the load generators to be located in the same local data center. For a nationwide online business, load generators should be deployed in regions across the country according to the user distribution.
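As a rough sketch of the two pressure models and the regional split, the functions below generate a per-second target concurrency. They are a simplified illustration under assumed parameters; the function names and the region shares are hypothetical and do not correspond to any tool's configuration format.

```python
def pulse_model(peak, duration_s, start_s=0, baseline=0):
    """Target concurrency per second: jumps from baseline to peak at start_s."""
    return [peak if t >= start_s else baseline for t in range(duration_s)]


def incremental_model(start, peak, ramp_s, duration_s):
    """Target concurrency per second: rises linearly over ramp_s, then holds at peak."""
    schedule = []
    for t in range(duration_s):
        if t >= ramp_s:
            schedule.append(peak)
        else:
            schedule.append(round(start + (peak - start) * t / ramp_s))
    return schedule


def split_by_region(total, shares):
    """Split a concurrency target across regions by each region's share of users."""
    return {region: round(total * share) for region, share in shares.items()}


if __name__ == "__main__":
    print(pulse_model(1000, 5, start_s=2))
    print(incremental_model(0, 1000, 4, 6))
    # Hypothetical user distribution across three regions.
    print(split_by_region(1000, {"east": 0.5, "north": 0.3, "south": 0.2}))
```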

Observability during stress testing

After completing the above preparations, the formal stress test can begin. During the test we focus on three core indicators: request success rate, request response time (RT), and system throughput (QPS). For the success rate, look not only at the global figure but also at the success rate of the core APIs, to avoid the situation where the overall success rate meets the target while a core API falls short. For response time, check whether key percentiles such as P99, P95, P90, and P80 are in line with expectations. The average response time has little reference value, because the stress test needs to guarantee the experience of the majority of users, and when the dispersion is unknown the average can easily lead to misjudgment. System throughput measures how much traffic the system can withstand and is an indispensable criterion in stress testing.
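The point about averages versus percentiles, and about global versus per-API success rates, can be shown with a short sketch. It uses the nearest-rank percentile definition, which is one of several common definitions; the sample data is made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def success_rate_by_api(samples):
    """samples: iterable of (api, ok) pairs. Returns {api: success rate}."""
    totals, oks = {}, {}
    for api, ok in samples:
        totals[api] = totals.get(api, 0) + 1
        oks[api] = oks.get(api, 0) + (1 if ok else 0)
    return {api: oks[api] / totals[api] for api in totals}


if __name__ == "__main__":
    # Made-up response times in ms: 95 fast requests and 5 very slow ones.
    rts = [20] * 95 + [2000] * 5
    print("mean:", sum(rts) / len(rts))
    print("P90:", percentile(rts, 90), "P99:", percentile(rts, 99))

    # Made-up outcomes: search always succeeds, pay fails half the time.
    samples = [("search", True)] * 990 + [("pay", True)] * 5 + [("pay", False)] * 5
    print(success_rate_by_api(samples))
```

In this made-up data the mean response time is 119 ms, a value that no request actually experienced, while P99 exposes the 2000 ms tail. Likewise the global success rate is 99.5%, yet the payment API succeeds only half the time.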

When the three core indicators reach an inflection point, the service can be considered to have hit a performance bottleneck; the stress test can be stopped and work can begin on locating and analyzing the performance problem. If the three core indicators remain stable throughout the stress test, the service has met the availability expectations. Even then, we can keep increasing the pressure value in steps of 10-20% to run a peak "touch high" test against the service and observe where its limit lies, so that we truly know what the service can take.
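A step-up plan and a simple inflection check might look like the sketch below. The thresholds (minimum success rate, maximum P99, minimum throughput gain) are illustrative assumptions and would need to be set per service; the field names of each round are hypothetical as well.

```python
def step_up_plan(start_qps, step_ratio, steps):
    """Pressure values that grow by step_ratio (for example 0.1 to 0.2) each round."""
    plan, qps = [], float(start_qps)
    for _ in range(steps):
        plan.append(round(qps))
        qps *= 1 + step_ratio
    return plan


def find_inflection(rounds, min_success=0.999, max_p99_ms=500.0, min_gain=0.05):
    """Return the index of the first round that looks like a bottleneck, else None.

    rounds: list of dicts with keys offered_qps, achieved_qps, success_rate, p99_ms.
    All thresholds are illustrative assumptions, not recommended values.
    """
    for i, r in enumerate(rounds):
        # Success rate dropped or tail latency blew past the limit.
        if r["success_rate"] < min_success or r["p99_ms"] > max_p99_ms:
            return i
        if i > 0:
            prev = rounds[i - 1]
            gain = (r["achieved_qps"] - prev["achieved_qps"]) / prev["achieved_qps"]
            # Offered load grew but throughput barely moved: throughput has plateaued.
            if r["offered_qps"] > prev["offered_qps"] and gain < min_gain:
                return i
    return None


if __name__ == "__main__":
    print(step_up_plan(1000, 0.2, 3))
```

The check flags a round when any one of the three indicators turns: the success rate falls, P99 exceeds its limit, or achieved throughput stops following the offered load.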

About Performance Testing PTS

As the steps above show, a great deal has to be prepared in advance over the course of a stress test. To help enterprises and developers run performance stress tests more efficiently and conveniently, Alibaba Cloud has launched Performance Testing PTS. PTS has been tempered by years of Double Eleven events. Its elasticity makes it easy to generate millions of concurrent requests, saving machine and labor costs; by generating traffic from regions around the world and precisely controlling the traffic model, it can realistically simulate where users' traffic comes from. Complex user interactions can be simulated easily with features such as zero-code scenario orchestration and traffic recording. After the stress test, rich monitoring and problem diagnosis features help the business quickly find bottlenecks and improve system performance and stability.

A cosmetics brand with more than 10 million followers online used to be under great pressure to keep its systems stable during major sales. Third-party interfaces and a few slow SQL queries could cause serious online failures; the system resources needed during major sales differed greatly from those needed day to day, requiring frequent scaling out and in; and stress testing and capacity assessment were needed so often that a standing, routine mechanism was required to support them. To solve these problems and support rapid business growth, the customer made full use of PTS to evaluate the system's single-machine capability and overall capacity and to predict in advance the business volume that a single machine and the whole system can carry, so that resources and costs can be planned sensibly for future sales events.

A banking institution held a live broadcast event in its mobile app. The event had to support millions of users online at the same time, with frequent interactions such as text messages, likes, and red envelopes, and it placed extremely high demands on system performance, stability, and emergency response. The customer used PTS's traffic recording and multi-protocol stress testing capabilities to accurately simulate the online operations of mobile users, and its powerful load and rate adjustment capabilities to simulate end-user traffic sources and models. This helped the customer plan system capacity accurately, rehearse possible emergencies, and ensure the stability of the entire system. PTS also greatly shortened the preparation time of the stress test engines: the time to bring up the engines needed to simulate one million online users was cut to 40 seconds. During the stress test it provided a variety of ways to monitor the test traffic and help locate problems. When the stress test ended, the test traffic was stopped immediately, and multi-dimensional stress test reports and problem diagnosis tools were provided, which greatly improved the efficiency of stress testing. PTS helped the customer complete the live broadcast event successfully: that night the number of views exceeded 2 million and the number of likes exceeded 17 million.


Origin my.oschina.net/u/3874284/blog/10344490