The road to seckill (flash-sale) business architecture design in Java

1. Why is the seckill business hard to do?

In an IM system, such as QQ or Weibo, everyone reads their own data (friend list, group list, personal profile).

In the Weibo feed system, everyone reads the data of the people they follow: one person reads many people's data.

In a seckill system, there is only one piece of inventory data. Everyone reads and writes it within a concentrated window of time: many people read and write the same record.

Take Xiaomi's Tuesday phone seckill: there may be only 10,000 phones, yet the instantaneous traffic can reach hundreds of millions of requests. Or take grabbing tickets on 12306: the tickets are limited, the inventory is one shared record, the instantaneous traffic is enormous, and everyone reads the same inventory. Read-write conflicts and lock contention are severe, and that is exactly what makes the seckill business hard. So how do we optimize the architecture of a seckill business?

2. Optimization directions

There are two optimization directions:

  1. Intercept requests as far upstream in the system as possible (don't let lock conflicts reach the database). Traditional seckill systems crash because requests overwhelm the back-end data layer: read-write lock conflicts are severe, concurrency is high, responses are slow, and almost all requests time out. The traffic is huge, but the effective traffic that produces a successful order is tiny. Take 12306: a train actually has only 2,000 tickets, 2,000,000 people come to buy them, basically nobody buys successfully, and the request effectiveness rate approaches 0.
  2. Make full use of the cache. Ticket buying is a typical read-heavy, write-light scenario: most requests are train queries and remaining-ticket queries; ordering and payment are the write requests. A train actually has only 2,000 tickets, 2,000,000 people try to buy, and at most 2,000 of them order successfully; everyone else is merely querying the inventory. The write ratio is only 0.1% and the read ratio is 99.9%, which is very well suited to cache optimization. Now let's go into the details of "intercepting requests as far upstream as possible" and "using the cache".

3. Common seckill architecture

A common site architecture basically looks like this (especially for sites with traffic in the hundreds of millions):

(figure: the four-layer site architecture: browser, site layer, service layer, data layer)

  1. The browser layer, at the top, executes some JS code
  2. The site layer, which accesses back-end data and assembles the HTML page returned to the browser
  3. The service layer, which shields upstream layers from the underlying data details and provides data access
  4. The data layer, where the final inventory lives; MySQL is typical (along with caches, of course)

Simple as this picture is, it vividly captures the structure of a high-traffic, high-concurrency seckill business. Keep it in mind.

Below, we analyze in detail how to optimize each layer.

4. Optimization details for each layer

The first layer: how to optimize the client (browser layer, APP layer)

Let me ask a question: everyone has played WeChat's Shake to grab red envelopes, right? Does every shake really send a request to the backend? Recall the ticket-grabbing scenario: after clicking the "Query" button, the progress bar crawls along, and as a user you instinctively click "Query" again, and again, and again... Does it help? No, it only adds load to the system for no reason. If a user clicks 5 times, 80% of the requests are produced this way. How do we fix it?

  • At the product level, gray out the "Query" or "Buy Ticket" button once the user clicks it, preventing repeated submissions;
  • At the JS level, limit the user to one submission every x seconds.

The same can be done at the APP level: however frantically you shake WeChat, a request is sent to the backend only once every x seconds. This is what "intercept requests as far upstream as possible" means: the further upstream the better. The browser layer and the APP layer together can block 80%+ of the requests. This method only stops ordinary users (but 99% of users are ordinary users); it cannot stop the high-end programmers in the crowd.
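To make this concrete, here is a minimal Java sketch of the same throttle on the APP side, assuming a one-request-per-5-seconds limit; the class name and the window length are illustrative, not from any real client SDK:

```java
// Hypothetical client-side throttle: swallow taps locally, let at most one
// request through per window. The window length is an assumed example value.
public class ClientThrottle {
    private final long windowMillis;
    private long lastSentAt = 0;

    public ClientThrottle(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if a request may be sent now; false means drop it locally. */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - lastSentAt >= windowMillis) {
            lastSentAt = now;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ClientThrottle throttle = new ClientThrottle(5_000); // one request per 5s
        for (int i = 1; i <= 3; i++) {
            System.out.println("tap " + i + " sent? " + throttle.tryAcquire());
        }
        // Only tap 1 passes; taps 2 and 3 never reach the backend.
    }
}
```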

Open Firebug and capture one request, and you know exactly what the HTTP looks like; JS can never stop a programmer from writing a for loop that calls the HTTP interface directly. How do we handle this part of the requests?

The second layer: request interception at the site layer

How do we stop them? How do we stop programmers' for-loop calls? Is there a basis for deduplication? IP? Cookie id? ... It gets complicated. This kind of business requires login, so just use the uid. At the site layer, count and deduplicate requests per uid. You don't even need to store the counts centrally: keep them directly in each site node's memory (the counts will be inaccurate this way, but it is the simplest). Allowing one request per uid every 5 seconds blocks 99% of the for-loop requests.
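A minimal sketch of this per-uid counting, assuming the counts live in a single site node's local memory (so, as conceded above, they are approximate across the cluster); names and the 5-second window are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical site-layer limiter: one request per uid per window, counted in
// this node's own memory (inaccurate across nodes, but simplest).
public class UidRateLimiter {
    private final long windowMillis;
    private final ConcurrentHashMap<Long, Long> lastPassedAt = new ConcurrentHashMap<>();

    public UidRateLimiter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true only for the one request per uid allowed in each window. */
    public boolean allow(long uid) {
        long now = System.currentTimeMillis();
        boolean[] passed = {false};
        // compute() is atomic per key, so two concurrent requests for the
        // same uid cannot both slip through.
        lastPassedAt.compute(uid, (k, last) -> {
            if (last == null || now - last >= windowMillis) {
                passed[0] = true;
                return now;
            }
            return last;
        });
        return passed[0];
    }
}
```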

If only one request passes per 5s, what about the rest? Cache them: page caching. For the same uid, limit the access frequency and do page caching: all requests reaching the site layer within x seconds return the same page. For queries on the same item, such as a train number, do page caching too: requests reaching the site layer within x seconds all get the same page back. Throttling like this preserves a good user experience (no 404s are returned) while protecting the robustness of the system (page caching intercepts requests at the site layer).

Page caching does not have to guarantee that all nodes return identical pages; the cache can live directly in each node's memory. The advantage is simplicity; the disadvantage is that HTTP requests landing on different nodes may see different ticket data. This is request interception and cache optimization at the site layer.
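A minimal sketch of such a per-node page cache, under the same assumptions (entries live in this node's memory and expire after x seconds, so different nodes may serve slightly different pages); the renderer parameter is a hypothetical stand-in for the real page-building code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical site-layer page cache: all requests for the same query within
// the TTL get the same rendered page straight from this node's memory.
public class PageCache {
    private static final class Entry {
        final String html;
        final long renderedAt;
        Entry(String html, long renderedAt) { this.html = html; this.renderedAt = renderedAt; }
    }

    private final long ttlMillis;
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public PageCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public String getPage(String query, Function<String, String> renderer) {
        long now = System.currentTimeMillis();
        Entry e = cache.get(query);
        if (e != null && now - e.renderedAt < ttlMillis) {
            return e.html;                    // cache hit: the backend is untouched
        }
        // On expiry a few threads may render concurrently; harmless for a sketch.
        String html = renderer.apply(query);
        cache.put(query, new Entry(html, now));
        return html;
    }
}
```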

Good: this method stops programmers who write for loops over HTTP. But some high-end programmers (hackers) control 100,000 zombie machines ("broilers"), hold 100,000 uids, and send requests simultaneously (set aside the real-name question: grabbing a Xiaomi phone requires no real-name registration). What now? Per-uid rate limiting at the site layer can no longer stop this.

The third layer: interception at the service layer (in any case, don't let requests fall on the database)

How does the service layer intercept? Brother, I am the service layer: I know perfectly well that Xiaomi has only 10,000 phones, and I know perfectly well that a train has only 2,000 tickets. What is the point of letting 100,000 requests through to the database? That's right: a request queue!

For write requests (ordering and payment are such write businesses), build a request queue and let only a limited number of writes through to the data layer each round:

  • For 10,000 phones, only 10,000 order requests go to the db;
  • For 3,000 train tickets, only 3,000 order requests go to the db.

If a whole batch is written successfully, release another batch; once the inventory runs out, all write requests remaining in the queue return "sold out" (see the sketch below).
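Here is a minimal sketch of such a write queue, assuming a single service node, a plain stock counter, and a lone consumer thread; the result strings and the db write are illustrative stand-ins for the real order path:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical service-layer write queue: at most `stock` order requests are
// ever forwarded to the db; everything beyond that answers "sold out" without
// touching the data layer.
public class OrderQueue {
    private final BlockingQueue<CompletableFuture<String>> queue = new LinkedBlockingQueue<>();
    private final AtomicInteger stock;

    public OrderQueue(int initialStock) {
        this.stock = new AtomicInteger(initialStock);
    }

    /** Request threads call this; the future completes with the outcome. */
    public CompletableFuture<String> submitOrder() {
        CompletableFuture<String> result = new CompletableFuture<>();
        if (stock.get() <= 0) {
            result.complete("sold out");   // short-circuit: do not even enqueue
        } else {
            queue.offer(result);
        }
        return result;
    }

    /** A single consumer thread drains the queue toward the data layer. */
    public void drainLoop() throws InterruptedException {
        while (true) {
            CompletableFuture<String> req = queue.take();
            if (stock.getAndDecrement() > 0) {
                // here the real system would write the order to the db
                req.complete("order placed");
            } else {
                req.complete("sold out");  // inventory exhausted while queued
            }
        }
    }
}
```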

How do we optimize read requests? Let the cache absorb them. Whether it is memcached or Redis, a single machine should have no problem withstanding 100,000 requests per second. With throttling like this, only very few write requests and very few reads that miss the cache penetrate to the data layer; 99.9% of requests are blocked.
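A minimal sketch of that read path, assuming a Redis client such as Jedis; the key scheme, the 3-second TTL, and loadStockFromDb() are illustrative assumptions:

```java
import redis.clients.jedis.Jedis;

// Hypothetical read path: inventory queries hit Redis; only the rare cache
// miss falls through to the database, matching the 99.9%-read profile.
public class StockReader {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public int remainingStock(String itemId) {
        String key = "stock:" + itemId;              // assumed key scheme
        String cached = jedis.get(key);
        if (cached != null) {
            return Integer.parseInt(cached);         // the 99.9% path
        }
        int fromDb = loadStockFromDb(itemId);        // rare cache-miss path
        jedis.setex(key, 3, String.valueOf(fromDb)); // short TTL keeps it fresh
        return fromDb;
    }

    // Stand-in for the real data-layer query.
    private int loadStockFromDb(String itemId) {
        return 0;
    }
}
```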

Of course, there are also optimizations on the business rules. Recall what 12306 did: selling tickets by time slot. Tickets originally all went on sale at 10 o'clock; now a batch is released every half hour: 8:00, 8:30, 9:00, ... This spreads the traffic evenly.

Second, optimize the data granularity: when you query the remaining tickets, do you really care whether 58 or 26 are left? In fact, we only care about tickets versus no tickets. When the traffic is heavy, a coarse-grained cache of just "tickets available" / "no tickets" will do.
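A minimal sketch of this coarse-grained idea: cache a boolean that flips only when the count crosses zero, instead of a count that changes on every sale; all names here are illustrative:

```java
// Hypothetical coarse-grained stock flag: the cached value changes only at
// the zero boundary, so it almost never needs invalidating, unlike an exact
// count that changes with every order.
public final class CoarseGrainedStock {
    private volatile boolean hasTickets = true;

    /** Called on the write path whenever the real count changes. */
    public void onStockChanged(int remaining) {
        hasTickets = remaining > 0;   // updates rarely: only at the boundary
    }

    /** Called on the (massive) read path; no count lookup needed. */
    public String display() {
        return hasTickets ? "tickets available" : "no tickets";
    }
}
```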

Third, make some business logic asynchronous: for example, separate the ordering business from the payment business. These optimizations are all grounded in the business. I have shared this view before: "any architecture design divorced from the business is hooliganism." Architectural optimization, too, must target the business.

Finally, the database layer

The browser intercepts 80%, the site layer intercepts 99.9% more and caches pages, and the service layer runs a write-request queue plus data caching. Every request that penetrates to the database layer is controllable. The db is basically under no pressure; for a single machine it is a walk in the park. Once again: the inventory is limited and Xiaomi's production capacity is limited; letting that many requests reach the database is meaningless.

If everything penetrated to the database: 1,000,000 orders placed, 0 succeed, request effectiveness 0%. Let 3,000 through to the data layer: all succeed, request effectiveness 100%.

5. Summary

The above should already be clear enough, so there is little to summarize. For seckill systems, I will repeat the two architecture optimization ideas from my personal experience:

  1. Intercept requests as far upstream in the system as possible (the more upstream the better);
  2. For read-heavy, write-light data, make good use of the cache (the cache absorbs the read pressure).

Browser and APP: rate-limit. Site layer: rate-limit by uid and cache pages. Service layer: throttle write requests with a queue according to the business, and cache data. Data layer: a walk in the park. And always optimize along the lines of the business.

6. Q&A

Question 1: In your architecture, the site layer actually bears the most pressure. Assuming there are 10 million real, valid requests, we can hardly limit the number of requests, so how do we handle this part of the pressure?

Answer: The concurrency per second may well not reach 10 million. Assuming it does, there are two solutions:

  1. The site layer can scale out by adding machines: at least 1,000 of them.
  2. If machines are not enough, shed load: drop 50% of the requests (that 50% directly gets "please try again later"). The principle is to protect the system rather than let every user fail.

Question 2: How do you solve the problem of "controlling 100,000 broilers, holding 100,000 uids, and sending requests simultaneously"?

Answer: As described above, intercept at the service layer with the write-request queue.

Question 3: Can the access-frequency-limited cache also be used for search? For example, if user A searches for "mobile phone" and user B then searches for "mobile phone", should B preferentially reuse the page cached by A's search?

Answer: Yes, that works. This approach is also often used on "dynamic" operational activity pages, for example doing page caching when an app-push campaign is sent to 40 million users in a short time.

Question 4: What if queue processing fails? What if the broilers blow up the queue?

Answer: If processing fails, return "order failed" and let the user retry. The queue is cheap, so it is hard to blow up. In the worst case, once enough requests are buffered, subsequent requests directly return "no tickets" (with 1,000,000 requests already queued and all waiting, accepting more is meaningless).

Question 5: With site-layer filtering, are the per-uid request counts stored separately in each node's memory? If so, how do you handle a load balancer in a multi-server cluster distributing the same user's requests to different servers? Or should the site-layer filtering be moved in front of the load balancer?

Answer: The counts can live in memory. In that case each server seemingly limits one request per 5s, but globally (say, with 10 machines) it actually lets through 10 requests per 5s. Solutions:

  1. Raise the limit accordingly (the suggested solution: it is the simplest);
  2. Do layer-7 balancing at the nginx layer so that requests from the same uid land on the same machine as much as possible (sketched below).
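For illustration only, the routing idea behind option 2 in Java (in a real deployment the hashing happens in nginx/LVS, not in application code): hash the uid to pick a node, so one uid's requests always hit the same node and that node's in-memory counter becomes effectively global for that uid.

```java
// Hypothetical illustration of uid-affinity routing: same uid, same node.
public final class UidAffinity {
    public static int nodeFor(long uid, int nodeCount) {
        return (int) Math.floorMod(uid, (long) nodeCount);
    }

    public static void main(String[] args) {
        // The same uid always maps to the same site-layer node.
        System.out.println(nodeFor(10_000_123L, 10)); // -> 3
        System.out.println(nodeFor(10_000_123L, 10)); // -> 3 again
    }
}
```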

Question 6: If the service layer filters, is the queue a unified queue for the whole service layer, or one queue per service instance? If it is a unified queue, do the requests submitted by each server need lock control before being enqueued?

Answer: A unified queue is not necessary: give each service its own queue that passes a smaller number of requests (total tickets / number of services). That is simple; a unified queue is more complex.

Question 7: After the seckill, how do you update the remaining inventory promptly once payment completes, and release the hold when an order is never paid?

Answer: The database keeps a status: "unpaid". If the time limit is exceeded, say 45 minutes, the inventory is restored (the well-known "return to the warehouse"). The lesson for ticket grabbing: after the seckill starts, try again 45 minutes later, and there may be tickets again (a sketch follows).
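A minimal sketch of this rule, assuming a single-node scheduler; isPaid() and restoreStock() are hypothetical stand-ins for real data-layer calls, and 45 minutes is the example timeout from the answer:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical "return to the warehouse": when an order is placed, check it
// 45 minutes later; if still unpaid, put its stock back on sale.
public class PaymentTimeout {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void onOrderPlaced(long orderId) {
        scheduler.schedule(() -> {
            if (!isPaid(orderId)) {     // status in the db is still "unpaid"
                restoreStock(orderId);  // the ticket becomes grabbable again
            }
        }, 45, TimeUnit.MINUTES);
    }

    // Stand-ins for real data-layer calls.
    private boolean isPaid(long orderId) { return false; }
    private void restoreStock(long orderId) { }
}
```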

Question 8: Different users browsing the same product may see different inventory from different cache instances. How do you make the cached data consistent, or is dirty reading allowed?

Answer: With the current architecture design, requests land on different sites and the data may be inconsistent (different page caches). This business scenario can tolerate that. The real data at the database level is fine.

Question 9: Even with the "3,000 train tickets, only 3,000 order requests go to db" optimization, won't those 3,000 orders still cause congestion?

Answer: (1) The database can still withstand 3,000 write requests; (2) the data can be split; (3) if 3,000 cannot be handled, the service layer can reduce the number of concurrent requests it lets through according to stress-test results; 3,000 is just an example.

Question 10: If back-end processing fails at the site layer or the service layer, should that batch of failed requests be replayed, or simply thrown away?

Answer: Don't replay. Return "query failed" or "order failed" to the user. One principle of architecture design is "fail fast".

Question 11: For the seckill of large-scale systems such as 12306, with many seckill activities running at the same time, how do you split the traffic?

Answer: Vertical splitting (split by business).

Question 12: An extra question that comes to mind: should this process be synchronous or asynchronous? If synchronous, responses may still be slow; if asynchronous, how do you route the response back to the correct requester?

Answer: The user layer must be synchronous (the user's HTTP request is held open waiting), while the service layer can be either synchronous or asynchronous.

Question 13: From the seckill group: at which stage is the inventory decremented? What if a large number of malicious users place orders to lock the inventory without paying?

Answer: The write-request volume at the database level is very low, which is fine. Orders that are placed but never paid are "returned to the warehouse" after the time expires, as mentioned earlier.
