With millions of daily active users, how can Tencent products avoid the risk of server downtime in advance?

Good application performance is the foundation of a good user experience. A product with slow server responses, freezes, and crashes cannot retain users, no matter how beautifully it is designed.

On February 28, 2017, Baidu played a small joke on its users: from 20:54 to 21:24 that day, Baidu search was down for 30 minutes. Many netizens joked that these became Baidu's most meaningful 30 minutes, but Baidu's later public statement mentioned "missing hundreds of millions of search requests from everyone", which shows how large the impact was.

Coincidentally, Toutiao (Jinri Toutiao, "Today's Headlines") also experienced downtime in January this year. The system did not respond for more than 30 minutes, and Toutiao's editorial back end could not be accessed. Outages like these cause users a great deal of trouble, and the larger the user base, the wider the impact: they damage not only the product's reputation but also its revenue.

If a product's monthly revenue is divided down to the minute, you can calculate the direct loss caused by 30 minutes, 60 minutes, or even 12 or 24 hours of service downtime. On top of that come the resulting loss of users and the damage to the brand's word of mouth.
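As a rough illustration of that per-minute arithmetic, the sketch below spreads a monthly revenue figure evenly over a 30-day month and multiplies by the outage length. The revenue figure and the function name are hypothetical, and real revenue is rarely spread evenly over the day, so this is only a first estimate that leaves out user and brand loss.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def downtime_loss(monthly_revenue: float, outage_minutes: float) -> float:
    """Direct revenue lost during an outage, assuming revenue is spread evenly over time."""
    return monthly_revenue / MINUTES_PER_MONTH * outage_minutes


if __name__ == "__main__":
    # Hypothetical figure, chosen only to keep the arithmetic easy to follow.
    monthly_revenue = 43_200_000
    for minutes in (30, 60, 12 * 60, 24 * 60):
        loss = downtime_loss(monthly_revenue, minutes)
        print(f"{minutes:>5} minutes of downtime: {loss:,.0f}")
```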

A well-known foreign game reached second place on the iOS free chart shortly after launch, but it was not prepared for the influx of players. Server lag, downtime, and client crashes disappointed players, who chose to leave. Its download ranking dropped as low as 475, and it took two months of server optimization to recover.

There are many such examples. As games and products grow heavier, more and more teams pay close attention to server performance optimization. This article shares some methods and ideas based on the Tencent WeTest team's experience in server stress testing for Tencent games and products.

What are the core indicators of server performance?

There are many indicators in server stress testing. To make them easier to understand, here is an example from real life:

You go to Haidilao for lunch. We can regard the restaurant "Haidilao" as the system under test. When you go to eat, you initiate a request to the system under test, which puts a certain load on the system. The more people you bring, the busier the restaurant becomes; in other words, the greater the load on the restaurant.

You start ordering. At the same moment, the people at the next table also start ordering, so the two tables are making concurrent requests to the system. Meanwhile, some of the other tables are eating and some are waiting for their food; these are all concurrent transactions. A complete meal transaction can be defined as four steps: choosing dishes, placing the order, being served, and paying. For a client/server (C/S) system, these correspond to establishing a connection, sending a request, receiving a response, and disconnecting.

An important factor in how well a restaurant does is how quickly it serves food. Serving speed shows up in two ways:

1. The time it takes to process one customer's request, that is, the wait from placing the order to being served. We call this response time.

2. The number of customers the restaurant serves per unit of time. We call this throughput.

How many customers arrive is beyond the restaurant's control, but serving speed and the number of seats limit the flow of customers, so there must be a peak customer flow. When the number of customers exceeds this peak, customers have to wait for a seat, or the food arrives so slowly that they cannot tolerate it. Capacity testing uses tools to simulate enough customers to find the customer flow that puts just the right load on the restaurant: the point at which it can receive the most customers while still guaranteeing a short waiting time. Beyond that, the restaurant's staffing and table layout can be tuned to achieve the best resource utilization and efficiency.

Customer flow depends on how many customers arrive, and also on the restaurant's reception capacity. If only the number of arriving customers is increased, complaints and wrongly served dishes become more likely.

There are many performance indicators, and it is impossible to look at them all, so what are the core indicators?

1. 90% response time

Sort the response times of all users from smallest to largest; the 90% response time is the value that 90% of them do not exceed. It is one of the important indicators used to evaluate system capacity.

2. TPS (transactions per second), which reflects the server's service capacity.

The number of transactions (passed, failed, and stopped) that the system processes per second. It shows the actual transaction load on the system at any given moment.

3. The maximum number of online users supported.

The maximum number of users logged in to the site at the same time, or the maximum number of downloads the server handles at the same time.

4. Changes in the server's overall CPU, memory, and other resource usage during the stress test.

CPU utilization = CPU execution time of all processes other than the system idle process / total CPU execution time. Memory usage is the memory consumed by the process.

5. Transaction success rate

Transaction success rate = successfully processed transactions / all transactions * 100%. It is an important indicator of how reliably the server processes transactions.
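To make three of these indicators concrete, here is a minimal sketch that computes the 90% response time, TPS, and transaction success rate from a list of recorded transactions. The `Transaction` record and the function names are assumptions made for illustration, and the percentile uses the simple nearest-rank method; a real tool may record more fields and interpolate differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    response_time: float   # seconds taken by this transaction
    ok: bool               # True if it was processed successfully


def percentile_90(response_times: List[float]) -> float:
    """Nearest-rank 90% response time: sort ascending, take the value 90% of the way up."""
    ordered = sorted(response_times)
    rank = (9 * len(ordered) + 9) // 10   # ceil(0.9 * n) in integer math, 1-based
    return ordered[rank - 1]


def tps(transactions: List[Transaction], duration_seconds: float) -> float:
    """Transactions (passed or failed) processed per second over the whole test."""
    return len(transactions) / duration_seconds


def success_rate(transactions: List[Transaction]) -> float:
    """Successfully processed transactions / all transactions * 100%."""
    return 100 * sum(t.ok for t in transactions) / len(transactions)


if __name__ == "__main__":
    # Ten made-up transactions from a 5-second test; the slowest one fails.
    sample = [Transaction(response_time=0.1 * (i + 1), ok=(i != 9)) for i in range(10)]
    print("90% response time:", percentile_90([t.response_time for t in sample]))
    print("TPS:", tps(sample, duration_seconds=5.0))
    print("success rate (%):", success_rate(sample))
```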

What server stress testing methods are available on the market?

To help users obtain the server's core data more quickly, a variety of stress testing methods have appeared on the market, but each has its own problems:

1. Live network data estimation

A model is fitted to part of the data collected during stress testing and used to predict how the server will behave when large numbers of users access it in the future.

Problem: this only suits simple servers; for complex servers the fitted predictions are not very accurate.

2. Real-user stress test

A certain number of real users are invited to play the game, which puts a test load on the server.

Problems: the performance problems it can expose are limited, because the number of beta testers is usually too small. Even with hundreds or thousands of users playing, the concurrency is not enough to expose server performance problems. It is also unsuitable for tuning: real people cannot repeat exactly the same behavior, which makes regression testing of the server difficult.

3. Interface test

A few representative functions are selected and used to judge overall server performance from a small sample.

Problem: it is impossible to cover every interface of the server, so some smaller problems are easily missed.

4. Record and playback

The protocol traffic of a game session is obtained by capturing data packets. The captured packets are then resent to the server, and a tool multiplies the traffic at the protocol level to generate load for performance testing.

Problem: with complex protocol interactions, simply multiplying captured packets cannot generate enough pressure.

5. Robot simulation

By closely reproducing the behavior of real players and simulating high-concurrency scenarios, this method achieves a test effect similar to many people playing the game at the same time.

Each of these methods has its own advantages and disadvantages. Tencent generally uses robot simulation for stress testing, but robot simulation requires sufficient testing time and a lot of manpower. For this reason, Tencent has developed a more general testing process to improve the efficiency of stress testing.

Introduction to Tencent's internal server performance testing process

Based on the needs of Tencent's internal games and products, the Tencent WeTest team first worked out a general stress testing process for pages served over HTTP and HTTPS.

1. Determine the stress test scenarios, such as logging in or fetching information lists.

The tester's first step is to confirm the test plan, which mainly means modeling in advance the scenarios involved in the actual business and the user behavior within them. The following points usually need to be confirmed:

1) The user's login status, and whether it keeps changing

2) The contextual relationship between access paths after the user logs in

3) How parameters are passed between access paths

2. Testers write test cases

Writing test cases makes the simulated scenarios above concrete. This includes confirming the number of simulated users, the logic for increasing that number, the specific interfaces to be stress tested, and the parameters passed between interfaces.

3. Start the robots for testing and gradually increase their number

Once the test plan is confirmed, this step executes it, gradually increasing the number of simulated users up to the number estimated in the test plan.

4. Record and analyze data and transaction processing, and check changes in server load and the server's current carrying capacity.

The previous step said that robots should be added gradually. Why? As server concurrency rises, the core indicators described above must be monitored continuously while the load keeps probing the limit of the server's processing capacity. Starting directly with a concurrency far beyond that limit would only overwhelm the server and would not serve the goal of performance optimization. Generally speaking, a CPU that suddenly maxes out or a response time that suddenly jumps as robots are added may indicate a server bottleneck. Stress testers therefore need to monitor changes in server status in real time during the ramp-up so that they can locate the problem.

5. Adjust the configuration, test iteratively, and estimate the server's carrying capacity and possible performance bottlenecks

After a problem is found, testers locate its cause through continued debugging and then rerun the stress test, repeating until the test goal is achieved.
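The ramp-up in steps 3 and 4 can be sketched in two small pieces: a schedule that adds robots stage by stage, and a check that flags the first stage where the CPU saturates or the response time jumps. Everything named here is an assumption for illustration, including the `StageResult` record, the 90% CPU limit, and the "response time doubles" rule; the right thresholds depend on the service being tested.

```python
from typing import List, NamedTuple, Optional


def ramp_schedule(initial: int, step: int, maximum: int) -> List[int]:
    """Robots per stage: start at `initial`, add `step` each stage, finish at `maximum`."""
    if initial <= 0 or step <= 0 or maximum < initial:
        raise ValueError("need initial > 0, step > 0 and maximum >= initial")
    stages = list(range(initial, maximum, step))
    stages.append(maximum)   # the last stage always runs at the configured maximum
    return stages


class StageResult(NamedTuple):
    robots: int          # concurrent robots in this stage
    p90_seconds: float   # 90% response time measured in this stage
    cpu_percent: float   # server CPU utilization in this stage


def first_bottleneck(results: List[StageResult],
                     cpu_limit: float = 90.0,
                     jump_factor: float = 2.0) -> Optional[StageResult]:
    """Return the first stage with saturated CPU or a sharp response-time jump, else None."""
    previous = None
    for stage in results:
        cpu_full = stage.cpu_percent >= cpu_limit
        rt_jump = (previous is not None
                   and stage.p90_seconds >= jump_factor * previous.p90_seconds)
        if cpu_full or rt_jump:
            return stage
        previous = stage
    return None


if __name__ == "__main__":
    print("stages:", ramp_schedule(initial=100, step=100, maximum=400))
    # Made-up measurements, one per stage of the schedule above.
    measured = [
        StageResult(100, 0.20, 35.0),
        StageResult(200, 0.25, 55.0),
        StageResult(300, 0.70, 80.0),   # response time nearly triples here
        StageResult(400, 1.90, 99.0),
    ]
    hit = first_bottleneck(measured)
    print("no bottleneck seen" if hit is None else f"suspect bottleneck at {hit.robots} robots")
```

Stopping at the first suspicious stage, rather than jumping straight to the maximum, keeps the server within reach of its limit so that the monitored data still points at the cause.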

Based on this testing process, Tencent summarized internally the characteristics a stress testing product needs to have.

1. Easy to use

Business scenarios vary, but a good stress testing product should make configuring them easy. Users should be able to test an interface simply by entering its URL. Most test settings should come with a default value, and users can adjust them freely once they understand the features better.

2. Complete advanced features

Beyond being simple and easy to use, the product also needs to offer advanced features. In addition to plain URL entry, it should support user-defined variables, variables read from files, and even variable values taken from the responses of other URLs. This simulates real scenarios more faithfully and avoids sending the same request parameters every time.

3. Provide distributed load generators for stress testing

Because a single machine is limited, a stress testing product can use a distributed framework that dynamically allocates multiple load-generating machines according to the number of robots the user configures, greatly raising the upper limit of the pressure it can apply.

4. Detailed test data statistics

Stress Test Master records many kinds of data during a test, including changes in the number of online users, TPS changes, response time, sent and received packet traffic, server CPU and memory status, hardware load on the load generators, and test result statistics. With these, the server's capacity and bottlenecks can be located quickly.
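Two of the features above are easy to sketch: filling request variables from an earlier response (feature 2) and spreading the configured robots over several load-generating machines (feature 3). This is not how Stress Test Master is implemented; the function names, the `${name}` placeholder syntax, the JSON response shape, the example.com URL, and the per-machine limit are all assumptions chosen for illustration.

```python
import json
from string import Template
from typing import Dict, List


def extract(response_body: str, field: str) -> str:
    """Pull one top-level field out of a JSON response body."""
    return str(json.loads(response_body)[field])


def build_request(template: str, variables: Dict[str, str]) -> str:
    """Fill ${name} placeholders in a URL or body template from the variable table."""
    return Template(template).substitute(variables)


def allocate_robots(total_robots: int, per_machine_limit: int) -> List[int]:
    """Split robots as evenly as possible over the fewest machines that can hold them."""
    if total_robots <= 0 or per_machine_limit <= 0:
        raise ValueError("both arguments must be positive")
    machines = -(-total_robots // per_machine_limit)   # ceiling division
    base, extra = divmod(total_robots, machines)
    # The first `extra` machines take one more robot so the total adds up exactly.
    return [base + 1 if i < extra else base for i in range(machines)]


if __name__ == "__main__":
    variables = {"uid": "10001"}             # could equally be read from a file
    login_response = '{"token": "abc123"}'   # pretend this came back from a login URL
    variables["token"] = extract(login_response, "token")
    print(build_request("https://example.com/enter_room?uid=${uid}&token=${token}", variables))
    print(allocate_robots(total_robots=2500, per_machine_limit=1000))
```

Choosing the fewest machines that fit and then balancing the load across them keeps every generator below its limit while avoiding one machine running nearly idle.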

Based on these requirements, the Tencent WeTest team developed "Stress Test Master", a product focused on server stress testing that simplifies the configuration process. Users can deploy online, debug online, and view reports online, which helps them become the most efficient "stress test masters".

Test case: realistic simulation of a "NOW Live" event scenario

Tencent NOW Live is a live streaming application that is currently growing very rapidly. For an online event, every interface involved in the event needs to be stress tested so that problems are exposed and solved in advance and the event runs smoothly. To this end, the NOW Live team chose "Stress Test Master" to run a complete set of scenario tests for the event.

"Stress Test Master" includes three functions: "Page Test", "URL Test" and "Advanced Mode".

"Page Test" applies to the HTTP and HTTPS protocols and can stress test Web, H5, and other pages. It mainly measures the load on a page's static resources, which helps improve a website's stability when an official site promotes large-scale operational events.

In "URL Test", users can set how the number of concurrent users grows, configure context parameters between different URLs, and enable server monitoring. It supports up to 16 user scenarios at the same time, allowing more scenario configurations.

"Advanced Mode" applies to HTTPS and other protocols, including custom protocols. It supports stress testing of game and product protocols, and users can set up protocol stress tests to suit their own needs through code configuration.

During testing, the NOW Live team first worked out its testing approach. On the one hand, single-interface stress tests exposed problems in the core modules in advance; on the other hand, multi-scenario stress tests that simulate real user behavior made the results more persuasive.

During the test, "NOW Live" used the URL test function of "Stress Test Master" to run single-interface stress tests on independent behaviors such as "send message", "like", "pull announcement", "register", "read room information", and "enter room". The load on each interface was shaped by setting the initial number of users, the number of users added in each stage, and the maximum number of users (see the figure below).

"Number of users" settings in the URL test of "Stress Test Master"

In addition, to cover different user access behaviors, "NOW Live" ran a multi-scenario stress test on the flow "register, read room information, enter room" (see the figure below). A GET request reads a user's login status, and the functional interface randomly generates robots with different behavior logic to simulate real QQ users. POST requests then execute specific business actions in sequence. Finally, the "context configuration" in the URL test selects and calls different functional interfaces, which reveals logic problems that arise between functions.

"Context settings" in the URL test of "Stress Test Master"

After several days of intensive testing, the figures for every NOW Live event scenario improved considerably. The response time of the "user enters the room" scenario was cut by nearly half, and the TPS of the "user sends a message" and "like" scenarios improved fourfold, providing a solid guarantee for the event to run stably.

Across its games and products, Tencent has been through countless server tests, and from them it has gradually distilled a general application performance management approach: before launch, run stress tests with real business scenarios and user behaviors to find server-side performance bottlenecks, then tune performance in a targeted way. The content above is experience summarized from countless stress tests of Tencent products. The Tencent WeTest team hopes that products such as "Stress Test Master" will keep simplifying the server stress testing process and improve the efficiency of stress testers.

Origin blog.csdn.net/m0_59868866/article/details/131516044