Lessons from optimizing after a production incident

After a routine promotional event, customer service began reporting that some users could not open the webpage or the APP when rushing for bids, and that by the time the page did open, the bids had already been snapped up. Fair enough, isn't it the same when trying to grab a Xiaomi phone? As the event went on, more users protested strongly. Users who had received interest-rate-increase coupons or cash coupons could not grab any bids, and believed the platform was deliberately keeping them from using the coupons to save resources.

Analysis process

In fact, users had been giving this kind of feedback all along, and customer service had always brushed it off with the Xiaomi flash-sale example. This time the feedback was too strong to ignore, so we took it seriously. We have three front-end products: the app, the official website, and H5. The app is used the most and the official website second. H5 normally sees very little use, but its traffic skyrockets during events (events are generally H5 games, and H5 is also convenient for promotion and marketing). Each of the three front-end products is load balanced by LVS across two back-end web servers (as shown in the figure below). This time the complaints came mostly from the website and the app, so I focused on those four servers.

First, I suspected the network bandwidth was saturated, so I asked a network engineer to monitor it with tools. Peak bandwidth utilization during bidding was only about 70%, so I ruled that out. Next, I suspected the web servers could not cope, and used the top command to check their load. The load on the two official website servers soared to around 6-8 at the moment of bidding and slowly returned to normal afterwards. The two app servers peaked at 10-12 and then returned to normal.

Tracking the web servers' business logs, I found errors at the database update layer reporting that no new database connection could be obtained or that the database connections had been used up. I concluded that the database's maximum number of connections was too small, so I raised the MySQL maximum to 3 times its previous value. During the next round of bidding I kept watching the business logs and found that the database connection errors were no longer reported. However, many users still reported that the page could not be opened during bidding.

I kept tracking the web servers. The command ( ps -ef|grep httpd|wc -l) showed that the number of httpd connections was about 1,000 during bidding, and I then checked the Apache configuration file and found the maximum number of connections set to 1024 (Apache's default maximum is 256). During bidding the connection count had hit that maximum, so many users could not obtain an HTTP connection, leaving the page unresponsive or the app stuck waiting. I therefore raised the maximum number of connections in the Apache configuration file to 1024*3.

I continued to observe during bidding. The number of Apache connections still soared to between 2600 and 2800. According to customer service, many users still reported problems bidding, though things were slightly better than before. There were also sporadic reports from users who had successfully grabbed a bid, only to have it rolled back afterwards. I then turned to the database servers, using the top command and MySQL Workbench to view the load on the MySQL master and slave (as shown in the figure below). The master's indicators had all hit their peaks, while the slave was under almost no pressure.

Tracing the code, I found that the business code of all three front ends connected to the master, and the slave was used only by back-office query services, so we started the rework immediately: except for the queries in the bidding process, all queries from other pages or services were changed to query the slave. After the rework, the pressure on the master dropped significantly and the pressure on the slave began to rise, as shown below:

According to customer service, after the rework there were almost no more cases of a bid being grabbed and then rolled back. The problem of pages failing to open or opening slowly during bidding was alleviated to some extent, but some users still reported it. From the analysis above, we inferred:

  • 1 Both load-balanced servers have reached their processing limit, and more servers need to be added to share the load.
  • 2 The pressure on the MySQL master has dropped significantly, but the pressure on the slave has increased. The current one-master-one-slave model needs to become one master with multiple slaves.
  • 3 To solve these problems completely, the platform needs overall optimization, such as: business optimization (removing hot spots in the business), adding caching, and making some pages static (the front-end optimization rules from Yahoo and Google can be applied, and many online testing sites can evaluate the result), and so on.

An optimization report was written based on these findings, see below:

Optimization report

1 Background

As the company's business has grown, business volume and user numbers have surged: the official website's PV has risen from the original xxx-xxx to xxx-xxxx, and the APP's active users have increased significantly. This poses a big challenge to the platform's current technical architecture. In particular, with the recent shortage of bid sources on the platform, bids fill up faster and faster, and the pressure on the servers keeps growing. The current system architecture therefore needs to be upgraded to support a larger user and business volume.

2 Schematic diagram of user access

At present, the platform has three user-facing products: the official website, the APP, and the small webpage (H5). The official website and the APP are under relatively high pressure.

3 Existing problems

When users rush for bids, the problems are concentrated in the following areas:
1. The webpage or APP cannot be opened;
2. The website or APP opens slowly;
3. The connections are used up, so investment records cannot be added after the bid is full, and the bid's progress is rolled back.

4 Analysis

Through in-depth analysis of recent server parameters, concurrency, and system logs, we concluded:
1. The platform's official website and APP are under great pressure during bidding, and the APP's problem is the more prominent of the two. During peak bidding, the number of Apache connections on a single APP server approaches 2600, which is close to Apache's maximum processing capacity.

2. The database server is under great pressure. The database pressure is most prominent in two periods:
1) When the platform runs an event, visits to the official website, small webpages, and APP increase dramatically, and the volume of data queries grows enormously. When the database reaches its processing limit, the symptom is that the website opens slowly.
2) When users rush for bids. The pressure here falls into two stages: before bidding and during bidding. Before bidding, because bids fill so quickly, users open the bidding page in advance and refresh it continuously, so the query pressure on the database keeps rising. If very many users are waiting, the database connections are used up before bidding even starts. During bidding, a single purchase modifies and queries about 15 tables. Each 10-million bid takes roughly 100-200 investors to fill. Taking the median of 150 people, the data must be updated 2000-3000 times within a few seconds (updates only, not counting queries). This produces heavy concurrency, which can cause failed updates or connection timeouts, affecting both users' bidding and the system's normal handling of a full bid.
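
The update estimate above can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming that each of the roughly 15 tables touched by a purchase receives exactly one update. The report only says the tables are modified and queried, so this is an upper bound, and the function name is purely illustrative.

```python
# Back-of-the-envelope estimate of the update volume needed to fill one bid.
# Assumption (not stated in the report): every one of the ~15 tables touched
# by a purchase receives exactly one update.
TABLES_PER_PURCHASE = 15


def updates_per_full_bid(investors: int, tables: int = TABLES_PER_PURCHASE) -> int:
    """Return the number of row updates needed for `investors` purchases."""
    return investors * tables


if __name__ == "__main__":
    # Low end, median, and high end of the 100-200 investor range.
    for investors in (100, 150, 200):
        print(investors, "investors ->", updates_per_full_bid(investors), "updates")
```

Under that assumption the median of 150 investors gives 2250 updates, which falls inside the 2000-3000 range quoted in the report.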

5 Solutions


1. Web server solution

Schematic diagram of a single user accessing web services

At present, the website and the platform APP each use two servers for load balancing. Apache is installed on each server for server-side processing, and each Apache instance can handle about 2,000 connections at most. In theory, therefore, the website or the APP can currently handle about 4,000 user requests. Supporting 10,000 simultaneous requests would take 5 Apache servers for each product, which is 3 more than each has today, so 6 web servers are currently missing.
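
The server count in this paragraph follows from a ceiling division. Here is a small sketch of that calculation, using the figures quoted above (about 2,000 connections per Apache server, a target of 10,000 simultaneous requests, two existing servers per product). The function names are hypothetical.

```python
import math


def servers_needed(target_requests: int, per_server_capacity: int) -> int:
    """Smallest number of servers whose combined capacity covers the target."""
    return math.ceil(target_requests / per_server_capacity)


def shortfall(target_requests: int, per_server_capacity: int, current_servers: int) -> int:
    """How many servers are missing for one product (never negative)."""
    needed = servers_needed(target_requests, per_server_capacity)
    return max(0, needed - current_servers)


if __name__ == "__main__":
    per_product = shortfall(10_000, 2_000, current_servers=2)
    # Two products (official website and APP) share the same shortfall.
    print("missing per product:", per_product)
    print("missing in total:", per_product * 2)
```

With these inputs each product needs 5 servers and is 3 short, which gives the 6 missing servers mentioned in the report.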
Schematic diagram of access after upgrading the server 

2. Database solution 
The deployment plan of the current database 

1) Master-slave separation takes 80% of the query pressure off the master database. At present, the platform's official website and APP both connect to the MySQL master database, which doubles the pressure on it. Migrating all the queries in these services to the slave database can greatly reduce the pressure on the master.
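
The routing rule behind master-slave separation can be sketched as follows. This is only an illustration under assumptions: the class name, the host names, and the simple SELECT-prefix check are hypothetical, and queries inside the bidding flow are kept on the master, as in the rework described in the analysis.

```python
import itertools
from typing import Sequence


class ReadWriteRouter:
    """Pick a database host for a statement: reads go to slaves, writes to the master."""

    def __init__(self, master: str, slaves: Sequence[str]) -> None:
        if not slaves:
            raise ValueError("at least one slave is required")
        self.master = master
        # Rotate through the slaves so reads are spread evenly (round robin).
        self._slaves = itertools.cycle(slaves)

    def route(self, sql: str, in_bidding_flow: bool = False) -> str:
        is_read = sql.lstrip().lower().startswith("select")
        # Reads outside the bidding flow can be served by a slave.
        if is_read and not in_bidding_flow:
            return next(self._slaves)
        # Writes, and every statement in the bidding flow, stay on the master.
        return self.master


if __name__ == "__main__":
    router = ReadWriteRouter("db-master", ["db-slave-1", "db-slave-2"])
    print(router.route("SELECT * FROM bid_list"))
    print(router.route("UPDATE account SET balance = 0 WHERE id = 1"))
    print(router.route("SELECT * FROM bid WHERE id = 1", in_bidding_flow=True))
```

In a real deployment this decision usually lives in the data access layer or in a database proxy rather than in string matching on SQL text.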

2) Add cache servers. When queries on the slave database peak, master-slave synchronization is also affected, which in turn affects transactions. Queries that users run frequently are therefore cached to reduce the request pressure on the database. Three new cache servers need to be added to build a Redis cluster.
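
Caching frequently used queries typically follows the cache-aside pattern: look in the cache first and fall back to the database only on a miss. The sketch below uses an in-memory dictionary as a stand-in for Redis so that it runs on its own. The class name, the TTL value, and the loader function are assumptions for illustration, not part of the original plan.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache-aside helper with per-entry expiry (in-memory stand-in for Redis)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not touched
        value = loader()  # cache miss or expired entry: query the database once
        self._store[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    def load_bid_list() -> list:
        # Hypothetical placeholder for a real database query.
        print("querying database")
        return ["bid-1", "bid-2"]

    cache = TTLCache(ttl_seconds=5)
    cache.get_or_load("bid_list", load_bid_list)  # goes to the database
    cache.get_or_load("bid_list", load_bid_list)  # served from the cache
```

A short TTL keeps pages that users refresh continuously before bidding from hitting the database on every request, at the cost of data that may be a few seconds stale.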

3. Other optimizations 
1) Make the homepage of the official website static. According to cnzz statistics, the homepage accounts for about 15% of the website's overall traffic. Homepage data that does not change frequently is rendered statically to make the official website more responsive.

2) Apache server optimization: enable gzip compression, configure a reasonable number of connections, etc.

3) Remove the update hot spot in the investment process: the bid progress table. Every time a bid attempt succeeds or fails, the bid progress table has to be updated, and concurrent updates from multiple threads run into problems such as optimistic lock conflicts. The update is removed from the process, and the progress information is saved to the bid progress table only after the bid is full, which reduces the pressure on the database during the investment process.
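
To illustrate why a single shared progress row becomes a hot spot, the sketch below simulates an optimistic (version-checked) update in memory, then shows the alternative of deriving progress once from append-only investment records. All names here are hypothetical, and the real tables and locking scheme may differ.

```python
from typing import List


class ProgressRow:
    """In-memory stand-in for the single shared progress row of one bid."""

    def __init__(self) -> None:
        self.invested = 0
        self.version = 0


def optimistic_update(row: ProgressRow, version_read: int, amount: int) -> bool:
    """Apply the update only if nobody changed the row since it was read."""
    if row.version != version_read:
        return False  # conflict: the caller would have to re-read and retry
    row.invested += amount
    row.version += 1
    return True


def progress_from_records(records: List[int]) -> int:
    """Derive the progress once, after the bid is full, from the investment records."""
    return sum(records)


if __name__ == "__main__":
    row = ProgressRow()
    seen = row.version  # two buyers read the same version at the same time
    print(optimistic_update(row, seen, 1000))  # first writer
    print(optimistic_update(row, seen, 2000))  # second writer, stale version
    print(progress_from_records([1000, 2000]))
```

Every concurrent purchase competes for the same row in the first approach, while inserting independent investment records avoids that contention and leaves the summary to be written once at the end.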

6 Server upgrade plan

1. The biggest pressure on the platform comes from the database. The current one-master-one-slave setup needs to become one master and four slaves. The large volume of queries generated by the official website, app, and small webpage is distributed across three slaves through a virtual IP, and back-office management queries go to the remaining slave. The database needs three additional servers.
Schematic diagram after the database is upgraded:

2. To add caching and reduce the pressure on the database, two cache servers with large memory need to be added.

3. Three new web servers need to be added to spread user access requests.

The app needs two new servers. During bidding, the app servers are under the most pressure, so two new servers need to be added. The schematic diagram after the configuration is completed:

The official website needs one new server. The official website is also under some pressure during bidding, so one new server needs to be added. The schematic diagram after completion is as follows:

In total, 8 servers need to be purchased, two of which require large memory (above 64G).
