[Spike] Ideas for optimizing the architecture of a spike (flash-sale) system

Source: W3CSCHOOL architect road


 

First, why the spike business is difficult


1) IM systems, such as QQ or microblogging: everyone reads their own data (buddy lists, group lists, personal information);
2) microblog systems: each person reads the data of the people they follow, so one person reads many people's data;
3) spike systems: there is only one copy of the inventory, and everyone reads and writes it within a concentrated time window, so many people read the same piece of data.

For example: in Xiaomi's Tuesday phone spike there may be only 10,000 phones, but the instantaneous incoming traffic may be in the millions or tens of millions.

Another example: the 12306 ticket rush. Tickets are limited, there is a single inventory, the instantaneous traffic is very large, and everyone reads the same inventory. Read-write lock conflicts are severe, and that is where the difficulty of the spike business lies. So how do we optimize the architecture of a spike business?

 

Second, the optimization direction


There are two optimization directions (these two points are today's topic):

(1) Intercept requests as far upstream in the system as possible (do not let lock conflicts fall on the database). Traditional spike systems go down because requests overwhelm the back-end data layer: read-write lock conflicts are severe, concurrency is high, responses are slow, and almost all requests time out. Traffic is large, but the effective traffic that ends in a successful order is very small. Take 12306 as an example: a train really has only 2,000 tickets, and when 2 million (200w) people try to buy them, almost no one succeeds and the effective request rate is 0.

(2) Make full use of the cache. Buying tickets in a spike is a typical read-heavy, write-light scenario: most requests are train queries and ticket queries, and only placing orders and paying are write requests. A train has only 2,000 tickets, so of 2 million (200w) buyers at most 2,000 place an order successfully and everyone else is querying inventory. The write ratio is only 0.1% and the read ratio is 99.9%, which is ideal for caching. The details of how to "intercept requests as far upstream as possible" and how to "use the cache" follow.
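The read/write split quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch that uses the article's example numbers; the variable names are made up here.

```python
# Example numbers from the text: one train, 2,000 tickets, 2 million (200w) buyers.
tickets = 2_000
buyers = 2_000_000

# At most one successful order (a write) per ticket; everything else is a read.
write_ratio = tickets / buyers
read_ratio = 1 - write_ratio

print(f"write ratio: {write_ratio:.1%}, read ratio: {read_ratio:.1%}")
# prints "write ratio: 0.1%, read ratio: 99.9%"
```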

 

Third, a common spike architecture


A common basic site architecture looks like this (definitely not the kind of architecture diagram drawn to bluff people)

[Figure: spike architecture]

(1) the browser side: the top layer, which runs some JS code
(2) the site layer: accesses back-end data and assembles HTML pages to return to the browser
(3) the service layer: shields the upstream layers from the details of the underlying data and provides data access
(4) the data layer: the final inventory lives here; MySQL is a typical choice (of course there is also a cache)

The figure is simple, but it illustrates the architecture of a high-traffic, high-concurrency spike business vividly, so keep this picture in mind. Below we go through how to optimize each level.

 

Fourth, optimization details at each level


The first layer: how to optimize the client (browser layer, APP layer)

A question for you: we have all used WeChat Shake to grab red envelopes. Does every shake send a request to the back end? Now recall placing an order to grab tickets: you click "Search", the page hangs, the progress bar crawls, and as a user you unconsciously click "Search" again, right? Click, click, and click again... Does it help? It only adds load to the system for nothing: if a user clicks five times, 80% of the requests are redundant. What can be done?

(A) Product level: after the user clicks "Search" or "Buy tickets", gray out the button so the user cannot submit the request repeatedly;

(B) JS level: allow the user to submit only one request every x seconds;

The APP layer can do something similar: however frantically you shake in WeChat, it actually sends only one request to the back end every x seconds. This is what "intercept requests as far upstream as possible" means, and the further upstream the better: stopping requests at the browser layer and the APP layer can block 80%+ of them. This approach only stops ordinary users (but 99% of users are ordinary users); it cannot stop the skilled programmers in the crowd. With Firebug and a packet capture they know exactly what the HTTP request looks like, and JS cannot stop a programmer from writing a for loop that calls the HTTP interface. How do we handle this part of the requests?

The second layer: request interception at the site level

How to intercept? How do we stop programmers from calling the interface in a for loop? What is the basis for deduplication: IP? cookie-id? ... That is overthinking it. This kind of business requires login, so use the uid. At the site level, count and deduplicate requests by uid. The counts do not even need unified storage; they can be kept directly in the memory of each site-layer machine (this makes the count inaccurate, but it is the simplest option). Allow only one request per uid through every 5 seconds, which stops 99% of for-loop requests.
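A minimal sketch of such a per-uid limiter, kept in the memory of one site process, might look like the following. The class name `UidRateLimiter`, the injectable `clock` parameter, and the sample uids are assumptions made for illustration; the article does not give an implementation.

```python
import time


class UidRateLimiter:
    """Let at most one request per uid through every `window` seconds.

    State lives in this process only, so the count is not global (as the
    text notes). Old entries are never evicted in this sketch.
    """

    def __init__(self, window=5.0, clock=time.monotonic):
        self.window = window
        self.clock = clock          # injectable so the example can fake time
        self.last_pass = {}         # uid -> time of the last admitted request

    def allow(self, uid):
        now = self.clock()
        last = self.last_pass.get(uid)
        if last is not None and now - last < self.window:
            return False            # still inside the window: intercept here
        self.last_pass[uid] = now
        return True


# Usage with a fake clock: a for loop from one uid gets one request through.
fake_now = [0.0]
limiter = UidRateLimiter(window=5.0, clock=lambda: fake_now[0])
passed = sum(limiter.allow("uid-1") for _ in range(100))
print(passed)  # prints 1
```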

Only one request gets through in 5 s, so what happens to the rest? Cache them with page caching. For the same uid, limit the access frequency and cache the page, so that requests reaching the site layer within x seconds all get the same page. Queries for the same item, such as a train, are page-cached as well, so requests reaching the site layer within x seconds all get the same page. Limiting traffic this way gives users a good experience (no 404) and keeps the system robust (with the page cache, requests are intercepted at the site level).

The page cache does not have to guarantee that all sites return the same page; keeping it directly in the memory of each site is also fine. The advantage is simplicity; the downside is that when HTTP requests land on different sites, the ticket data returned may differ. That is request interception and cache optimization at the site level.
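The page cache described above can be sketched as a small in-memory cache with a time-to-live. This is an assumption-laden illustration, not the article's code: `PageCache`, `render_train_page`, and the train number "G101" are hypothetical names, and `ttl` stands in for the "x seconds" in the text.

```python
import time


class PageCache:
    """Serve the same rendered page for `ttl` seconds (one cache per site process)."""

    def __init__(self, render, ttl=1.0, clock=time.monotonic):
        self.render = render      # hypothetical function: key -> HTML string
        self.ttl = ttl            # the "x seconds" from the text
        self.clock = clock
        self.pages = {}           # key -> (time rendered, page)

    def get(self, key):
        now = self.clock()
        entry = self.pages.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]       # cache hit: the back end is not touched
        page = self.render(key)   # cache miss: render once, then reuse
        self.pages[key] = (now, page)
        return page


# Usage with a fake clock and a record of back-end renders.
fake_now = [0.0]
render_calls = []


def render_train_page(train_no):
    render_calls.append(train_no)
    return f"<html>tickets for {train_no}</html>"


cache = PageCache(render_train_page, ttl=2.0, clock=lambda: fake_now[0])
first = cache.get("G101")
second = cache.get("G101")     # within 2 s: same page, no second render
fake_now[0] = 2.0
third = cache.get("G101")      # expired: rendered again
```

Because each site process holds its own `pages` dictionary, two sites can serve pages rendered at different moments, which is the inconsistency the text accepts.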

This stops programmers who send HTTP requests in a for loop. But some high-end programmers (hackers) control 100,000 (10w) zombie machines and hold 100,000 uids, all sending requests at the same time (leave real-name verification aside for now; grabbing a Xiaomi phone does not require a real name). What then? Rate limiting by uid at the site level cannot stop this.

The third layer: interception at the service layer (in any case, do not let requests fall on the database)

How does the service layer intercept? Brother, I am the service layer: I know perfectly well that there are only 10,000 Xiaomi phones and that a train has only 2,000 tickets, so what is the point of letting 100,000 (10w) requests through to the database? Right: a request queue!

For write requests, use a request queue and let only a limited number of write requests through to the data layer at a time (write operations such as orders and payments).

With 10,000 (1w) phones, let only 10,000 order requests through to the db.

With 3,000 (3k) train tickets, let only 3,000 order requests through to the db.

If they all succeed, release the next batch; if the inventory runs out, return "sold out" for all write requests still in the queue.
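The batching rule above could be sketched as follows. This is a single-process illustration under stated assumptions: `WriteRequestQueue`, `submit`, `drain`, and the `place_order` callback (which stands in for the real database write) are hypothetical names, not an API from the article.

```python
from collections import deque


class WriteRequestQueue:
    """Release order requests to the data layer in batches no larger than the stock."""

    def __init__(self, stock):
        self.stock = stock
        self.pending = deque()

    def submit(self, uid):
        self.pending.append(uid)

    def drain(self, place_order):
        """Process the queue; `place_order(uid)` returns True if the db write succeeded."""
        results = []
        while self.pending and self.stock > 0:
            # Let through at most as many requests as there is stock left.
            batch_size = min(self.stock, len(self.pending))
            batch = [self.pending.popleft() for _ in range(batch_size)]
            for uid in batch:
                if place_order(uid):
                    self.stock -= 1
                    results.append((uid, "ok"))
                else:
                    results.append((uid, "failed"))
        # Inventory is gone: everything still queued is answered without touching the db.
        while self.pending:
            results.append((self.pending.popleft(), "sold out"))
        return results


# Usage: 2 units of stock, 4 queued requests, the first db write fails.
queue = WriteRequestQueue(stock=2)
for uid in ["a", "b", "c", "d"]:
    queue.submit(uid)
results = queue.drain(lambda uid: uid != "a")
print(results)
# prints [('a', 'failed'), ('b', 'ok'), ('c', 'ok'), ('d', 'sold out')]
```

The database only ever sees as many order writes per batch as there is stock, regardless of how many requests were queued.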


How do we optimize read requests? Absorb them with a cache: whether it is memcached or redis, a single machine should handle 100,000 (10w) requests per second without trouble. With this rate limiting, only very few write requests and very few cache-miss read requests get through to the data layer, and 99.9% of requests are stopped.

Of course, business rules can also be optimized. Recall what 12306 did: it sells tickets in time slots. Tickets used to go on sale all at once at 10:00; now a batch is released every half hour at 8:00, 8:30, 9:00, ..., which spreads the traffic evenly.

Second, optimize data granularity: when you buy tickets and query the remaining tickets, do you really care whether 58 or 26 are left? In fact we only care whether there are tickets or not. When traffic is heavy, a coarse-grained "tickets available" / "no tickets" cache is enough.

Third, make some business logic asynchronous: for example, separate the order-placing business from the payment business. These optimizations are all tied to the business. I have shared this view before: "all architecture design divorced from the business is hooliganism". Architecture optimization must also target the business.
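The coarse-grained "tickets available" / "no tickets" cache mentioned above can be illustrated with a small sketch. One plausible benefit, which is an interpretation rather than a claim from the article, is that the cached value only changes when availability flips, not on every sale. `AvailabilityCache`, `on_stock_change`, and the train number "G101" are hypothetical names.

```python
class AvailabilityCache:
    """Cache only whether tickets remain, not how many."""

    def __init__(self):
        self.flags = {}    # train number -> True (available) / False (sold out)
        self.writes = 0    # how many times the cached value actually changed

    def on_stock_change(self, train_no, remaining):
        has_tickets = remaining > 0
        if self.flags.get(train_no) != has_tickets:
            self.flags[train_no] = has_tickets
            self.writes += 1

    def has_tickets(self, train_no):
        return self.flags.get(train_no, False)


# Usage: stock drops from 58 to 0, one ticket at a time.
cache = AvailabilityCache()
for remaining in range(58, -1, -1):
    cache.on_stock_change("G101", remaining)

print(cache.writes, cache.has_tickets("G101"))  # prints "2 False"
```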

 

Finally, the fourth layer: the database layer

The browser blocks 80%, the site level blocks 99.9% and adds a page cache, and the service layer adds a write request queue and a data cache, so the requests that reach the database layer are under control. The db is under basically no pressure and is practically idle; a single machine can carry it. Again: inventory is limited and Xiaomi's production capacity is limited, so letting that many requests through to the database makes no sense.

Let everything through to the database: 1 million (100w) orders, 0 succeed, request efficiency 0%. Let 3,000 (3k) through: all succeed, request efficiency 100%.

 

Fifth, summary


The above should already be clear and needs little summarizing. For spike systems, let me repeat my two personal ideas on architecture optimization:
(1) intercept requests as far upstream in the system as possible (the further upstream the better);

(2) for read-heavy, write-light workloads, use the cache heavily (the cache absorbs read pressure);

Browser and APP: rate limiting

Site level: rate limiting by uid, page caching

Service layer: a write request queue to control traffic according to the business, data caching

Data layer: practically idle

And: optimize in combination with the business

 

Sixth, Q&A


Question 1: With your architecture, the most stressed part is actually the site level. Suppose there are 10 million real, valid requests. It is unlikely that we can limit the number of request connections, so how is the pressure on this part handled?

A: Concurrency per second is probably not 10 million (1kw). Assuming it is, there are two solutions:

(1) The site layer can be scaled out by adding machines; at worst, throw 1,000 (1k) machines at it.
(2) If there are not enough machines, discard requests: drop 50% (50% get "try again later" directly). The principle is to protect the system; we cannot let all users fail.
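Option (2) amounts to load shedding. A minimal sketch, assuming a random drop with a configurable ratio, is shown below; the function name `handle`, the `process` callback, and the injectable `rng` parameter are illustrative assumptions, and a real system would likely tie the ratio to measured load.

```python
import random


def handle(request, process, drop_ratio=0.5, rng=random.random):
    """Shed roughly `drop_ratio` of requests before they reach `process`."""
    if rng() < drop_ratio:
        return "try again later"     # answered immediately, no back-end work
    return process(request)


# Usage with a seeded generator so the run is repeatable.
seeded = random.Random(0)
replies = [handle(i, lambda r: "ok", drop_ratio=0.5, rng=seeded.random)
           for i in range(10_000)]
shed = replies.count("try again later")
print(f"shed {shed} of {len(replies)} requests")
```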

Question 2: How do you solve the problem of "controlling 100,000 (10w) zombie machines, holding 100,000 uids, and sending requests at the same time"?

A: As said above, the write request queue in the service layer controls it.

 

Question 3: Can the cache that limits access frequency also be used for search? For example, user A searches for "mobile phone" and then user B searches for "mobile phone": should B be served the cached page generated by A's search first?

A: Yes, this is possible. The method is also often used for "dynamic" operational campaign pages; for example, when an app push for a campaign is sent to 40 million (4kw) users in a short time, use a page cache.

 

Question 4: What if queue processing fails? What if the queue is flooded by zombie machines?

A: If processing fails, return an order failure and let the user try again. A queue is very cheap and hard to overflow. In the worst case, after a certain number of requests have been buffered, return "no tickets" directly to subsequent requests (if 1 million (100w) requests are already queued and waiting, accepting more makes no sense).

 

Question 5: For site-level filtering, is the per-uid request count stored separately in the memory of each site machine? If so, how do you handle a load balancer distributing the same user's requests across different servers in a cluster? Or should the filtering be placed in front of the site-layer load balancer?

A: Yes, it can be kept in memory. In that case, if each server limits a uid to one request per 5 s, then globally (assuming 10 machines) the limit is actually 10 requests per 5 s. Solutions:

1) Tighten the limit (this is the recommended solution, and the simplest)
2) Do layer-7 balancing at the nginx layer so that requests from one uid land on the same machine as far as possible
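The idea behind option 2) is to pick the site machine from a hash of the uid, so the in-memory counter for that uid lives on one machine. In practice nginx would do this itself through its load-balancing configuration; the Python below only illustrates the routing rule, and `pick_server` and the server names are made up for the example.

```python
import hashlib


def pick_server(uid, servers):
    """Map a uid to one server deterministically, so repeats land on the same machine."""
    digest = hashlib.md5(uid.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Usage: the same uid is always routed to the same site machine.
servers = [f"site-{i}" for i in range(10)]
first_choice = pick_server("uid-42", servers)
same_again = all(pick_server("uid-42", servers) == first_choice for _ in range(100))
print(same_again)  # prints True
```

Plain modulo hashing reshuffles most uids when the number of servers changes; that is usually acceptable for a 5-second counter, but it is a trade-off of this sketch rather than something the article discusses.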

Question 6: For service-level filtering, is the queue a single unified queue for the whole service layer, or does each server that provides the service have its own queue? If it is a unified queue, do the requests submitted by each server need lock control before entering the queue?

A: A unified queue is not needed. Each service instance simply lets through a smaller number of requests (total number of tickets / number of service instances), which is simple. A unified queue is complicated.
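The per-instance quota (total tickets / number of service instances) is easy to compute. The sketch below also hands out the remainder one unit at a time so the quotas add up to the total exactly; that remainder handling and the function name `split_quota` are assumptions added for illustration.

```python
def split_quota(total_tickets, num_servers):
    """Split the inventory into per-server quotas that sum to the total."""
    base, extra = divmod(total_tickets, num_servers)
    # The first `extra` servers take one more so nothing is lost to rounding.
    return [base + 1 if i < extra else base for i in range(num_servers)]


print(split_quota(3000, 8))  # prints [375, 375, 375, 375, 375, 375, 375, 375]
print(split_quota(2000, 3))  # prints [667, 667, 666]
```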

 

Question 7: After the spike, when payments complete or unpaid orders are cancelled and release their holds, how is the remaining inventory updated in time?

A: Keep a status in the database, for example "unpaid". If the time limit passes, for example 45 minutes, the inventory is restored (the well-known "return to stock"). The lesson for grabbing tickets: after the spike starts, try again 45 minutes later, and there may be tickets again~
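A minimal in-memory sketch of this "return to stock" rule follows. It assumes stock is locked when the order is placed and released if payment has not arrived within the hold period; the class `Inventory`, its method names, and the injectable `clock` are hypothetical, and a real system would keep this state in the database as the answer says.

```python
import time


class Inventory:
    """Lock stock at order time and return it if the order stays unpaid too long."""

    def __init__(self, stock, hold_seconds=45 * 60, clock=time.monotonic):
        self.stock = stock
        self.hold_seconds = hold_seconds   # 45 minutes, as in the example above
        self.clock = clock
        self.unpaid = {}                   # order_id -> time the stock was locked

    def place_order(self, order_id):
        if self.stock <= 0:
            return False
        self.stock -= 1
        self.unpaid[order_id] = self.clock()
        return True

    def pay(self, order_id):
        locked_at = self.unpaid.pop(order_id, None)
        if locked_at is None:
            return False                   # unknown or already released order
        if self.clock() - locked_at >= self.hold_seconds:
            self.stock += 1                # too late: the held unit goes back
            return False
        return True

    def release_expired(self):
        """Return stock held by orders that were not paid in time; report how many."""
        now = self.clock()
        expired = [o for o, t in self.unpaid.items() if now - t >= self.hold_seconds]
        for order_id in expired:
            del self.unpaid[order_id]
            self.stock += 1
        return len(expired)


# Usage with a fake clock: order "b" is never paid and its ticket comes back.
fake_now = [0.0]
inv = Inventory(stock=2, clock=lambda: fake_now[0])
inv.place_order("a")
inv.place_order("b")
sold_out = not inv.place_order("c")    # no stock left at this point
inv.pay("a")
fake_now[0] = 45 * 60
released = inv.release_expired()
print(sold_out, released, inv.stock)   # prints "True 1 1"
```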

 

Question 8: Different users viewing the same item hit different cache instances and see completely different inventory. How do you keep the cached data consistent, or are dirty reads allowed?

A: With the current architecture, requests that land on different sites may see inconsistent data (the page caches differ); this business scenario can accept that. The real data at the database level is not affected.

 

Question 9: Even with the business optimization "3,000 (3k) train tickets, only 3,000 order requests go through to the db", won't those 3,000 orders cause congestion?

A: (1) a database can still handle 3,000 write requests; (2) the data can be split; (3) if it cannot handle 3,000, the service layer can control the number of concurrent requests it lets through, based on load-test results. 3k is only an example.

 

Question 10: If processing fails in the background at the site layer or the service layer, should the failed batch of requests be replayed, or simply discarded?

A: Do not replay. Return "query failed" or "order failed" to the user. One principle of architecture design is "fail fast".

 

Question 11: For a large system such as 12306, where many spike activities run at the same time, how do you split them?

A: Vertical Split

 

Question 12: One more question. Is this flow synchronous or asynchronous? If it is synchronous, responses should still be slow. If it is asynchronous, how do you make sure the correct result is returned to the requester?

A: At the user level it is certainly synchronous (the user's HTTP request is held open); the service level can be synchronous or asynchronous.

 

Question 13, asked in the spike group: At which stage is inventory deducted? If stock is locked when the order is placed, how do you handle large numbers of malicious users who place orders to lock stock and never pay?

A: The write request volume at the database level is very low, so this is fine. Orders that are not paid are "returned to stock" once the time limit passes, as mentioned before.


Origin www.cnblogs.com/clarino/p/11932814.html