A perfect "seckill": API-accelerated business logic

About the author: Cong Lei
Baishan Partner and Vice President of Engineering
Mr. Cong Lei joined Baishan in 2016 and is mainly responsible for the design and implementation of the cloud chain system.
Mr. Cong Lei worked at Sina from 2006 to 2015, where he founded SAE (SinaAppEngine) and served as its general manager and chief architect. From 2010 he led Sina's cloud computing team in research and development in cloud-related fields. (Note: SAE is the largest public-cloud PaaS platform in China, with 700,000 users.)
Mr. Cong Lei holds two invention patents and currently serves as a judge for the Ministry of Industry and Information Technology's trusted cloud service certification.

Early one morning I was woken by a customer call. The customer was very anxious: could CDN help them solve their "seckill" problem? They had run an "on-the-hour seckill" the day before, but the concurrency was so large that the service went down and users complained.

To clarify my thinking, I asked the customer three questions:

(1) What were the symptoms of the downtime?
(2) What is the basic architecture of the business?
(3) What was the peak concurrency of the seckill?

Following these clues, we first reconstructed the application scenario together:

An e-commerce business structure diagram

 

The company is a P2P wealth-management website that often runs "on-the-hour seckills" in which users snap up high-interest wealth-management products on the hour. As shown in the figure above, end-user requests first pass through front-end load balancing and then reach the Web Servers running the actual e-commerce logic. Below them sit eight Redis instances running on VMs, which store business-related cache data such as user profiles, financial-product information, and user billing information. The durable data lives in MySQL, which uses only simple sharding by database and table plus read-write separation.

 

When running a seckill, risk-control and operations staff first select financial products and mark them in the database; product staff then publish the activity, and end users snap the products up.

 

 

The company's business comes mainly from mobile. Traffic is light at ordinary times, but a seckill event generates a huge amount of traffic in an instant, with peak concurrency above 100,000 (possibly including bots). This concurrency is concentrated mainly on the following two kinds of interface:

 

(1) The refresh interface for financial products, similar to GET /get_fprod.php?uid={$1}&pid={$2}&sid={$3}. This type of interface receives the most requests, accounting for about 90%.

 

(2) The order interface for wealth-management products, similar to GET /order_fprod?uid={$1}&pid={$2}&oid={$3}&sid={$4}. The request volume of this interface is small, less than 1% of the total, but a large share of its requests time out with 504.

 

 

Here uid is the user ID, pid is the financial-product ID, oid is the order number, and sid is a random token that varies per client user.

 

 

Interpreting the scenario

 

Based on the scenarios communicated with customers, the following conclusions were initially drawn:

 

(1) The customer's business is mainly mobile, and the product renders its UI on the client through APIs. There are almost no static resources and bandwidth is low, so a traditional CDN cannot offload the pressure;

 

(2) During the seckill, a large number of 502/504 timeouts are generated, indicating that user requests exceed the servers' carrying capacity; capacity expansion appears urgently needed.

 

Based on these two points, I did not recommend that the company purchase CDN services, but instead suggested expanding capacity. However, as we analyzed the business more deeply, we gradually discovered some strange things.

 

 

"Weird" phenomena

 

(1) The master-slave load of the database is extremely unbalanced: MySQL management tools show that the master carries as much as 80% of the query volume;

 

(2) The load across the Redis cache nodes is extremely unbalanced: redis INFO shows that one Redis instance receives a huge share of the requests, more than 90%, while the others are nearly idle.

 

 

The above anomalies caught the interest of the technical staff on both sides; this might be the crux of the matter. As the analysis deepened, the cause of the first phenomenon emerged: the company does not route and distribute MySQL requests through a database middleware layer, as some large e-commerce platforms do. Instead, read-write separation is implemented on the business-code side by the language-level framework. This brings two drawbacks:

 

(1) Programmers sometimes develop around the language-level framework, so read-write separation is not actually enforced;

 

(2) Product staff demand real-time display, forcing developers to modify the business logic in ways that sacrifice read-write separation, so data is both read and written on the master.


 Cache penetration diagram

 

 

The cause of the second phenomenon then became clear: during the seckill, a very large number of users access a very small number of financial products. When the pids of these products happen to hash to the same Redis instance, the cache load becomes hot-spotted and unbalanced; all requests eventually concentrate on one Redis, and that instance becomes the bottleneck of the entire business!
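A minimal sketch of why this happens, assuming typical client-side key sharding (the node count comes from the architecture above; the key format is invented for illustration):

```python
import zlib

NODES = 8  # the customer runs eight Redis instances

def node_for(key: str) -> int:
    """Typical client-side sharding: hash the key, then modulo the node count."""
    return zlib.crc32(key.encode()) % NODES

# During a seckill, almost all traffic targets a handful of hot products.
hot_pids = ["fprod:1001", "fprod:1002", "fprod:1003"]

# 100,000+ requests spread over only 3 keys can land on at most 3 of the
# 8 nodes -- and if the hashes collide, on even fewer.
hit_nodes = {node_for(pid) for pid in hot_pids}
print(f"{len(hit_nodes)} of {NODES} nodes carry the entire seckill load")
```

However many requests arrive, the hashing scheme pins them all to the few nodes holding the hot keys; the other nodes cannot help.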

 

Prescribing the right medicine

 

1. Use database middleware for read-write separation and horizontal expansion control

 

Using database middleware brings many benefits, the most important of which is that it hides database details from the business layer and gives better control over the business. Of course, introducing a database middle layer also has an obvious disadvantage: adding another component to the overall architecture runs against the design principle of "simple and effective". For many Internet companies it is normal to have no database middle tier in the early or even middle stages, but once the business reaches a certain scale, the advantages of introducing one outweigh the disadvantages.

 

 

Based on experience, we recommended that the customer use MySQL Router, which covers the simple needs here: connection multiplexing, load balancing, and read-write separation.

 


MySQL Router architecture diagram

 

The figure above is the official MySQL Router architecture diagram. As it shows, MySQL Router's strength lies in its plug-in design, and a series of plug-ins is provided out of the box.
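A minimal routing configuration might look like the following sketch (hostnames and ports are illustrative, not the customer's):

```ini
# mysqlrouter.conf -- minimal sketch; hostnames and ports are illustrative
[routing:read_write]
bind_address = 0.0.0.0
bind_port = 7001
destinations = db-master.example.internal:3306
mode = read-write

[routing:read_only]
bind_address = 0.0.0.0
bind_port = 7002
destinations = db-slave1.example.internal:3306,db-slave2.example.internal:3306
mode = read-only
```

The application then sends writes to port 7001 and reads to port 7002; the router forwards each connection to the master or to one of the slaves, so read-write separation no longer depends on every programmer using the framework correctly.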

 

 

Besides MySQL Router, many open-source database middlewares from China can also be used, such as those from Alibaba and Meituan.

 

A database middle layer not only solves performance problems but also helps with security: auditing, rate limiting, and even interception of SQL injection and poor-quality SQL statements.

 

2. Use API acceleration services to ease server pressure

 

Cache imbalance is the harder problem. During a seckill, users frequently access the information of a few wealth-management products; when their cache data happens to be allocated to the same node, a huge number of requests concentrate on one or a few nodes in an instant. That is the essential cause of the cache imbalance. The problem is not limited to e-commerce seckills: any business with instantaneous hotspot access has it. Take Weibo as an example: interfaces slowing down, or the service going down outright, during celebrity hot events comes down to the same cause. The moment news breaks, one Weibo post spreads widely in a short time; its ID is opened by everyone at once, and all the traffic concentrates on a single Redis node.

 

 

How can this be solved? A cache usually shards by hashing the key of each data structure, so when a large number of users access only one or a few keys, the load of the Redis cache nodes becomes unbalanced. Whether this actually harms the service depends on the concurrency, but it is a huge hidden danger. The customer proposed a solution: split the cache data of a wealth-management product, turning one key into several and reducing the probability that the keys land on the same cache node. But this method has big drawbacks:

 

(1) The code must be modified: logic that one GET request used to complete now needs multiple requests plus reassembly;

 

(2) The cost of every routine get/set operation multiplies, because 1% of hot events adds latency to 99% of regular operations, which badly violates the 80/20 principle.
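To make drawback (1) concrete, here is a sketch of the splitting workaround, with a plain dict standing in for Redis (the key layout and field names are invented for illustration):

```python
# One product record is split across N sub-keys so they hash to different
# nodes -- at the cost of N lookups plus reassembly on every read.
N_SHARDS = 4

def split_set(cache: dict, pid: str, fields: dict) -> None:
    """Scatter one product's fields across shard-suffixed keys."""
    for i, (field, value) in enumerate(fields.items()):
        cache[f"fprod:{pid}:shard{i % N_SHARDS}:{field}"] = value

def split_get(cache: dict, pid: str, fields: list) -> dict:
    """What used to be ONE get is now one lookup per field, plus
    reassembly logic living in application code."""
    out = {}
    for i, field in enumerate(fields):
        out[field] = cache[f"fprod:{pid}:shard{i % N_SHARDS}:{field}"]
    return out

cache = {}
split_set(cache, "1001", {"name": "A", "rate": "9.9%", "quota": "1000000"})
product = split_get(cache, "1001", ["name", "rate", "quota"])
assert product["rate"] == "9.9%"
```

Every read now pays the multi-key cost, whether or not the product is hot, which is exactly the 80/20 violation described above.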

 

 

Given these problems, we recommended that the customer use the "API acceleration" feature of Baishan Cloud Aggregation CLN-X to solve them.

 

 

API acceleration

 

API acceleration is completely different from the link acceleration of traditional CDNs: it optimizes API requests by caching the content the APIs return and combining that with TCP WAN-optimization techniques. Baishan API Acceleration caches each API's response data at edge nodes across the whole network for milliseconds at a time; response data in node memory is evicted with the LRU (Least Recently Used) algorithm. During a hot event the hottest information therefore stays at the edge nodes, and when a client calls the API the edge node can return the result directly without going back to the origin site. The overall structure is as follows:
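The eviction idea can be sketched in a few lines. This is a toy illustration of millisecond TTLs combined with LRU, not Baishan's actual implementation:

```python
from collections import OrderedDict
import time

class EdgeCache:
    """Toy edge-node cache: millisecond TTLs plus LRU eviction."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        item = self.store.get(key)
        if item is None or item[1] < time.monotonic():
            return None                      # miss or expired -> go to origin
        self.store.move_to_end(key)          # mark as recently used
        return item[0]

    def set(self, key, value, ttl_ms: float):
        self.store[key] = (value, time.monotonic() + ttl_ms / 1000)
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least recently used

cache = EdgeCache(capacity=2)
cache.set("/get_fprod.php?pid=1", "hot", ttl_ms=50)
cache.set("/get_fprod.php?pid=2", "warm", ttl_ms=50)
cache.get("/get_fprod.php?pid=1")                     # touch: pid=1 is hottest
cache.set("/get_fprod.php?pid=3", "new", ttl_ms=50)   # evicts pid=2, not pid=1
assert cache.get("/get_fprod.php?pid=1") == "hot"
assert cache.get("/get_fprod.php?pid=2") is None
```

The effect matches the description above: under memory pressure, the hottest entries survive in node memory while cold ones are evicted first.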

 


API acceleration architecture diagram

 

The API acceleration service provides API acceleration capabilities at network edge nodes, including caching of API results and network acceleration of back-to-origin API requests.

 

 

The traditional view is that dynamic resources (APIs) cannot be cached, but Baishan's position is that "any resource can be cached; only the expiration time differs". Common static resources get long expiration times; APIs are not uncacheable, their expiration times are just very short. For example, an API that quotes stock prices can set its expiration to 50 milliseconds; a 100-meter sprinter's reaction time is 100-200 milliseconds, so 50 milliseconds will not affect the user experience on PC or mobile.

 

 

Without a cache, if 10,000 users access the API within one second, the back end bears 10,000 concurrent requests. With the cache time set to 50 milliseconds, back-end concurrency theoretically drops to 20 (1 second / 50 milliseconds = 20), a tiny fraction of the original load; all other requests are answered directly by the cache server.
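A quick simulation confirms the arithmetic (evenly spaced arrivals are an assumption; real traffic is burstier):

```python
# 10,000 evenly spaced requests in one second against a 50 ms cache.
TTL = 0.050
arrivals = [i / 10_000 for i in range(10_000)]  # seconds within [0, 1)

backend_hits = 0
expires_at = -1.0
for t in arrivals:
    if t >= expires_at:          # cache stale: this request goes to origin
        backend_hits += 1
        expires_at = t + TTL     # the response is cached for the next 50 ms
# 1 second / 50 ms = 20 origin requests; the other 9,980 are cache hits
print(backend_hits)  # 20
```

Because origin fetches can happen at most once per TTL window, the origin load is bounded by the TTL, not by the number of users.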

 

 

In summary, Baishan API Acceleration gives customers millisecond-level caching, which improves response speed for end users without hurting the user experience, and relieves the business load on the servers.

 

 

API acceleration also supports custom caching rules that fit the business more closely, keyed on three dimensions: query string, header, and path. For this scenario, the following rule was set:

 

GET /get_fprod.php?uid={$1}&pid={$2}&sid={$3}: each financial product has an independent ID, and the product information does not change with the user ID or the client's random token, so the cache key can ignore the parameters {$1} and {$3} in the URI. /get_fprod.php?pid={$2} becomes the millisecond-level cache key stored at the edge nodes.
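As a sketch, the key normalization described above could be expressed like this (the helper is hypothetical; in practice the rule is configured in the acceleration service, not written by the customer):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def cache_key(uri: str, keep: set) -> str:
    """Drop user-varying parameters (uid, sid) so every user of one
    product shares a single edge-cache entry."""
    parts = urlsplit(uri)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in keep]
    return f"{parts.path}?{urlencode(sorted(kept))}"

a = cache_key("/get_fprod.php?uid=42&pid=7&sid=abc123", keep={"pid"})
b = cache_key("/get_fprod.php?uid=99&pid=7&sid=zzz999", keep={"pid"})
assert a == b == "/get_fprod.php?pid=7"
```

Two different users refreshing the same product now hit the same cache entry instead of generating two origin requests.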

 

 

How is the cache expiration time determined? It is business-dependent and requires analysis of the desensitized logs provided by the customer; an initial value of 500 milliseconds works here. Finally, an RTT correction must be applied to adapt to the WAN environment; the RTT is captured automatically by the API acceleration service and updated in real time.

 

 

 

Actual effect

 

We configured the API acceleration service for the customer's main bottleneck interfaces and, at peak time, compared its effect when turned on versus off along the following two dimensions:

 

(1) Average response time of end-user requests, and the proportion of responses with code 200

(2) Average load of the service cluster

 

 

The final effect is as follows:

 


Figure A



 Figure B

 


Figure C

 

As Figure A shows, the average response time of end-user requests during the peak was compressed from about 3 seconds to within 40 milliseconds. As Figure B shows, the proportion of requests returning code 200 during the peak rose from about 70% to 100%. Figure C shows that during the peak the back-end CPU idle rose from about 10% to about 97%. The measured comparison shows that API acceleration is very effective at cutting average response time and improving user experience, and its effect on reducing back-end server load is even more pronounced: with API acceleration enabled, back-end CPU idle stays above 91%.

 

 

Follow-up advice

 

The database imbalance and the Redis cache imbalance have been resolved, but beyond these problems there are still several areas that can be improved:

 

 

1. Queue service to absorb requests asynchronously

At present the customer's requests go straight to MySQL with no queue buffering in between. We recommend a queue service to line up peak requests and protect the database back end.
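A minimal sketch of the idea, with a list standing in for the MySQL insert (names and structure are illustrative):

```python
import queue
import threading

# Order writes go into a bounded queue; a single worker drains it into the
# database at a rate the database can sustain.
orders = queue.Queue(maxsize=10_000)   # bounded: provides back-pressure
written = []

def db_writer():
    while True:
        order = orders.get()
        if order is None:              # shutdown sentinel
            break
        written.append(order)          # stand-in for the INSERT into MySQL
        orders.task_done()

t = threading.Thread(target=db_writer)
t.start()

# The web tier during a seckill: enqueue and return "order accepted"
# immediately, instead of holding a connection open against the database.
for oid in range(100):
    orders.put({"oid": oid, "pid": 7})
orders.put(None)
t.join()
assert len(written) == 100
```

The burst lands on the queue, not on MySQL; the database sees a steady trickle it can handle even while 100,000 users click at once.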

 

 

2. API firewall to block malicious bots

The user logs contain a large number of obvious, regular traces of scanning software such as sqlmap and fimap. Although these have not yet seriously affected the business, they tie up server resources. We recommend blocking scanning behavior at the load balancer, the forefront of the service, to improve security and efficiency. Besides malicious bots, order-grabbing and order-brushing behavior also affects the service; we recommend an API protection service to identify and block it.
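A first-cut filter for the scanner signatures mentioned above might look like this (patterns and placement are illustrative; a production deployment would use the load balancer's own ACL or WAF rules):

```python
import re

# User-Agent substrings seen in the customer's logs (per the text above).
SCANNER_UA = re.compile(r"sqlmap|fimap", re.IGNORECASE)

def allow(request_headers: dict) -> bool:
    """Return False for requests that should be dropped at the front door."""
    ua = request_headers.get("User-Agent", "")
    if not ua or SCANNER_UA.search(ua):
        return False                   # empty or scanner UA: reject early
    return True

assert allow({"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_0)"}) is True
assert allow({"User-Agent": "sqlmap/1.3.11#stable"}) is False
assert allow({}) is False
```

Dropping these requests before they reach the web tier frees server resources for real users.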

 

 

3. Service-degradation design at the product layer

The customer's overall business has no service-degradation design and no priority tiers for product functions, so the important database, cache, and other basic services are all mixed together. Once a seckill triggers a serious problem such as cache penetration into the database, the whole service becomes unavailable. The business units should be reorganized and basic services divided by priority: the first screen, product list, purchase, and order information have the highest priority; next come non-critical functions such as comments and bills. When back-end load is heavy, secondary functions can be dropped outright, reducing back-end load and keeping the service stable.
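The priority tiers described above can be sketched as a simple feature gate (the tier assignments are illustrative, not the customer's actual feature list):

```python
# Tier 0 = core functions that must survive a seckill; tier 1 = sheddable.
PRIORITY = {
    "home_screen": 0, "product_list": 0, "purchase": 0, "order": 0,
    "comments": 1, "billing_history": 1,
}

def enabled(feature: str, load_level: int) -> bool:
    """load_level 0 = normal, 1 = heavy: shed everything below tier 0.
    Unknown features default to the sheddable tier."""
    return PRIORITY.get(feature, 1) < 1 or load_level == 0

assert enabled("purchase", load_level=1) is True      # core survives
assert enabled("comments", load_level=1) is False     # secondary is shed
assert enabled("comments", load_level=0) is True      # restored when calm
```

With a gate like this, the seckill burst degrades comments and billing history instead of taking down purchasing itself.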

 

 

Summary

 

Handling a scenario like the "on-the-hour seckill" is a complex systems project. The problems this customer exposed, such as unbalanced database load and unbalanced cache load, can be solved with technologies such as a database middle layer and API acceleration, ultimately achieving the desired effect.

 

 

The "seckill" case above is just one typical application scenario for API acceleration. In a follow-up article I will analyze API acceleration more systematically. Colleagues in the industry are also welcome to reach us through the WeChat account (baishancloud) to discuss API-related topics.

 

  

 

 
