Degradation techniques for high-concurrency systems

When building a high-concurrency system, there are three powerful tools for protecting it: caching, degradation (downgrading), and rate limiting. Caching and rate limiting have been covered in earlier articles; this article discusses downgrading in detail. When traffic spikes, when services run into problems (such as slow or unresponsive calls), or when non-core services start to affect the performance of core flows, we still need to keep services available, even in a reduced form. The system can downgrade automatically based on key metrics, or it can be downgraded manually via switches. This article introduces some downgrade schemes the author has used or seen in real work, for your reference.

 

The ultimate goal of downgrading is to keep core services available, even in a degraded form. Note that some services cannot be downgraded at all (for example, add to cart and checkout).

 

Downgrade plan

Before downgrading anything, you need to survey the system and decide which parts can be sacrificed to protect the whole: which functions must be kept alive at all costs and which ones can be downgraded. For example, you can classify them along the lines of log levels:

General: for example, a service occasionally times out because of network jitter or because it is being deployed; this can be downgraded automatically;

Warning: for some services the success rate fluctuates over a period of time (for example between 95% and 100%); they can be downgraded automatically or manually, and an alert should be sent;

Error: for example, availability drops below 90%, the database connection pool is exhausted, or traffic suddenly rises to the maximum the system can bear; depending on the situation, downgrade automatically or manually;

Severe error: for example, data is wrong for some special reason; an emergency manual downgrade is required.

 

By whether it is automated, downgrading can be divided into automatic switch downgrade and manual switch downgrade.

By function, it can be divided into read service downgrade and write service downgrade.

By system level, it can be divided into multi-level downgrade.

 

The function points to downgrade are mainly identified along the server-side call chain, that is, by walking the chain of services a user request passes through and deciding where downgrading makes sense:

Page downgrade: during a big promotion or in other special circumstances, some pages occupy scarce service resources; in an emergency the whole page can be downgraded to protect the more important ones;

Page fragment downgrade: for example, the merchant section of the product detail page has to be downgraded because its data is wrong;

Page asynchronous request downgrade: for example, the product detail page loads recommendations and delivery estimates asynchronously; if these calls respond slowly or the backend services have problems, they can be downgraded;

Service function downgrade: for example, rendering the product detail page calls some less important services (related categories, best-seller lists, etc.); when something goes wrong these calls are simply skipped, that is, downgraded;

Read downgrade: for example, in a multi-level cache setup, if the backend service has problems, reads can be downgraded to read-only from cache; this is suitable for scenarios that do not require strong read consistency;

Write downgrade: for example, when deducting inventory we can update only the cache and deduct the DB inventory asynchronously afterwards to reach eventual consistency; at that point the DB has been downgraded to the cache;

Crawler downgrade: during a big promotion, crawler traffic can be directed to static pages or given empty data, downgrading it to protect scarce backend resources.

 

Automatic switch downgrade

Automatic downgrade is driven by metrics such as system load, resource usage, and SLA indicators.

Timeout downgrade

When a database, HTTP service, or remote call responds slowly or keeps timing out, and the service is not a core one, it can be downgraded automatically after a timeout. For example, the product detail page shows recommendations and reviews; if they are temporarily not displayed, it has little impact on the user's shopping flow, so this kind of service can be downgraded on timeout. If you are calling someone else's remote service, agree on a maximum response time with the other party and downgrade automatically when it is exceeded.
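As a rough illustration, the following Java sketch wraps a non-core call (say, fetching recommendations) with a hard deadline and falls back to an empty result on timeout; the 200 ms deadline and the thread-pool size are illustrative, not prescriptive:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

// Minimal sketch of timeout downgrade for a non-core call. The actual
// recommendation client is assumed to be wrapped in a Callable elsewhere.
public class TimeoutDowngrade {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public List<String> loadRecommendations(Callable<List<String>> remoteCall) {
        Future<List<String>> future = pool.submit(remoteCall);
        try {
            // Agreed maximum response time for the non-core service.
            return future.get(200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true);             // give up on the slow call
            return Collections.emptyList();  // degraded result: render the page without it
        }
    }
}
```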

 

I summarized some of this in earlier articles, "Parameter settings, code writing, and existing risks that you must know when using httpclient" and "Summary of dbcp configuration and jdbc timeout settings". In real scenarios, the main things to configure are the timeout values, the number of retries, and the retry mechanism.
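For reference, a minimal sketch of such timeout settings, assuming Apache HttpClient 4.x (the concrete millisecond values are examples only and should come from configuration):

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Illustrative timeout configuration for an HTTP client used to call
// non-core downstream services.
public class HttpClientTimeouts {
    public static CloseableHttpClient build() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(500)             // TCP connect timeout (ms)
                .setSocketTimeout(1000)             // max wait for response data (ms)
                .setConnectionRequestTimeout(200)   // wait for a connection from the pool (ms)
                .build();
        return HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build();
    }
}
```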

 

Failure-count downgrade

Sometimes we depend on unstable APIs, such as an external ticketing service; when the number of failed calls reaches a threshold, the call is downgraded automatically, and an asynchronous thread then probes whether the service has recovered and cancels the downgrade.
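A minimal sketch of this idea, assuming a hypothetical probeRemoteService() health check against the external API and a threshold of five failures:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// After THRESHOLD failures the unstable API is marked degraded; a background
// probe cancels the downgrade once the service answers again.
public class FailureCountDowngrade {
    private static final int THRESHOLD = 5;
    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean degraded = false;
    private final ScheduledExecutorService probe = Executors.newSingleThreadScheduledExecutor();

    public FailureCountDowngrade() {
        // Periodically check whether the remote service has recovered.
        probe.scheduleWithFixedDelay(() -> {
            if (degraded && probeRemoteService()) {
                degraded = false;
                failures.set(0);
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    public boolean isDegraded() { return degraded; }

    public void recordSuccess() { failures.set(0); }

    public void recordFailure() {
        if (failures.incrementAndGet() >= THRESHOLD) {
            degraded = true;   // stop calling the unstable API
        }
    }

    private boolean probeRemoteService() {
        // Hypothetical lightweight health check; replace with a real call.
        return false;
    }
}
```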

 

Fault downgrade

For example, if the remote service being called is down (network failure, DNS failure, the HTTP service returns an error status code, the RPC service throws an exception), the call can be downgraded directly. Options for handling the downgraded call include: a default value (for example, if the inventory service is down, report the item as in stock by default), bottom-line data (for example, if the advertising service is down, return static pages prepared in advance), and cached data (data cached earlier).
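A sketch of such a fallback chain for an inventory query; InventoryClient is a hypothetical remote client, and the "in stock" default is the bottom-line answer:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// When the remote call fails, fall back first to the last cached value,
// then to a static default ("treat as in stock").
public class InventoryFallback {
    private final Map<Long, Integer> localCache = new ConcurrentHashMap<>();

    public int queryStock(long skuId, InventoryClient client) {
        try {
            int stock = client.getStock(skuId);      // remote call that may fail
            localCache.put(skuId, stock);            // remember the last good value
            return stock;
        } catch (Exception e) {
            Integer cached = localCache.get(skuId);  // 1) cached data
            if (cached != null) {
                return cached;
            }
            return 1;                                // 2) default value: assume in stock
        }
    }

    /** Hypothetical remote client interface. */
    public interface InventoryClient {
        int getStock(long skuId) throws Exception;
    }
}
```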

 

Rate-limit downgrade

For flash sales (seckill) or purchases of limited products, the system may collapse under the traffic, so developers use rate limiting to cap the number of requests; once the limit is reached, subsequent requests are downgraded. The downgraded handling can be: a queueing page (send the user to a queue page and ask them to retry shortly), out of stock (tell the user directly that the item is sold out), or an error page (the event is too popular, please try again later).
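A minimal sketch using a Java Semaphore as the limiter; the permit count and the page names returned are illustrative only:

```java
import java.util.concurrent.Semaphore;

// Only a fixed number of requests are processed concurrently; the rest are
// downgraded to a queueing / "try again later" page.
public class SeckillLimiter {
    private final Semaphore permits = new Semaphore(200);  // tuning value, not prescriptive

    public String handleRequest(Runnable placeOrder) {
        if (!permits.tryAcquire()) {
            return "queue_page";      // degraded: ask the user to wait and retry
        }
        try {
            placeOrder.run();
            return "success_page";
        } finally {
            permits.release();
        }
    }
}
```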

 

Manual switch downgrade

During a big promotion, monitoring may show that some online services have problems and need to be taken out of rotation temporarily. Sometimes services are invoked by a task system, but the database the service depends on is struggling: the network card is saturated, queries are hanging, or there are many slow queries; then the downstream task system must be paused to give the servers time to recover. It may also turn out that call volume is too high and the handling needs to change (for example, converting synchronous calls to asynchronous ones). In all these cases a switch can be used to perform the downgrade. The switch can be stored in a configuration file, in a database, or in Redis/ZooKeeper; if it is not stored locally, the switch data can be synchronized periodically (for example, once per second), and the value of a key then decides whether to downgrade.
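A sketch of such a switch, assuming a hypothetical fetchSwitchFromStore() lookup against Redis/ZooKeeper/DB that is refreshed once per second:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// The switch value lives in a central store and is flipped manually by ops;
// the application refreshes its local copy on a fixed schedule.
public class DowngradeSwitch {
    private volatile boolean recommendServiceOff = false;
    private final ScheduledExecutorService syncer = Executors.newSingleThreadScheduledExecutor();

    public DowngradeSwitch() {
        syncer.scheduleAtFixedRate(
                () -> recommendServiceOff = fetchSwitchFromStore("recommend.off"),
                0, 1, TimeUnit.SECONDS);
    }

    public boolean isRecommendServiceOff() {
        return recommendServiceOff;   // callers skip the recommendation call when true
    }

    private boolean fetchSwitchFromStore(String key) {
        // Hypothetical: read the key from Redis/ZooKeeper/DB.
        return false;
    }
}
```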

 

In addition, a newly developed service may go online as a grayscale test before you are sure its logic is correct; a switch is needed so that, if the new service has problems, you can switch back to the old one. There is also the multi-data-center case: if one data center goes down, its traffic has to be switched to another data center, which can likewise be done with a switch.

 

Some functions also have to be blocked temporarily because of data problems. For example, the product specification and parameter data are wrong and the problem cannot be fixed by a rollback; then switch control is needed to downgrade the feature.

 

Read service downgrade

The strategies generally used for read service downgrade are: temporarily switching the read path (downgrade to reading from cache, downgrade to static content) and temporarily blocking reads (block the read entry point, or block a particular read service). In "Applying Multi-Level Cache Mode to Support Massive Read Services" the read path was introduced as access-layer cache --> application-layer local cache --> distributed cache --> RPC service / DB. We set switches at the access layer and the application layer; when the distributed cache, RPC service, or DB has problems, the call is automatically downgraded to not being made. Of course, this applies to scenarios that do not require strong read consistency.
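A sketch of this read path with a downgrade switch; LocalCache, RedisCache, and ProductRpc are hypothetical interfaces standing in for the real layers:

```java
// Read path: local cache -> distributed cache -> RPC/DB, with a switch that
// blocks the backend call when it is degraded (weak read consistency assumed).
public class ProductReadService {
    private final LocalCache localCache;
    private final RedisCache redisCache;
    private final ProductRpc rpc;
    private volatile boolean backendDegraded = false;  // flipped automatically or by hand

    public ProductReadService(LocalCache l, RedisCache r, ProductRpc p) {
        this.localCache = l; this.redisCache = r; this.rpc = p;
    }

    public String getProduct(long id) {
        String v = localCache.get(id);
        if (v != null) return v;
        v = redisCache.get(id);
        if (v != null) return v;
        if (backendDegraded) {
            return null;                // degraded: serve only what the caches have
        }
        v = rpc.get(id);                // normal path: hit the backend service / DB
        if (v != null) redisCache.put(id, v);
        return v;
    }

    public void setBackendDegraded(boolean degraded) { this.backendDegraded = degraded; }

    public interface LocalCache { String get(long id); }
    public interface RedisCache { String get(long id); void put(long id, String v); }
    public interface ProductRpc { String get(long id); }
}
```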

 

Page downgrade, page fragment downgrade, and page asynchronous request downgrade are all read service downgrades; their purpose is either to stop the service (for example, because it competes for core resources or consumes bandwidth that affects core services) or to block it temporarily because of data problems.

 

There is also the static-page scenario:

Dynamic downgraded to static: for example, the product detail page is normally rendered dynamically, but when a big promotion arrives it can be switched to a static version to reduce the load on core resources and improve performance; list pages, the home page, and channel pages can be handled the same way. The static pages can be pushed to a cache or generated to disk by a scheduled program, and traffic is switched to them directly when there is a problem;

Static downgraded to dynamic: for example, when the product detail page is normally served as static pages, but the static pages break for some special reason, the page can be temporarily switched back to dynamic rendering to keep the service correct.

 

With the above in place, there is a plan for when something goes wrong, and users can still use the website and complete their purchases.

 

Write service downgrade

Write services cannot be downgraded in most scenarios, but there are some roundabout tactics: for example, converting synchronous operations to asynchronous ones, or limiting the volume or proportion of writes.

For example, inventory deduction is usually done in one of the following ways:

 

Option 1:

1. Deduct the DB inventory; 2. update the Redis inventory after the deduction succeeds.

Option 2:

1. Deduct the Redis inventory; 2. deduct the DB inventory synchronously, rolling back the Redis inventory if the DB deduction fails.

The first two options depend heavily on the DB; if DB performance cannot keep up, inventory deduction runs into trouble. That leads to option 3:

1. Deduct the Redis inventory; 2. deduct the DB inventory synchronously under normal conditions, and when the DB cannot keep up, downgrade to sending a message that deducts the DB inventory asynchronously, reaching eventual consistency (see the sketch after this list).

Sending these DB-deduction messages may itself become a bottleneck; in that case, consider option 4:

1. Deduct the Redis inventory; 2. deduct the DB inventory synchronously under normal conditions, and when performance cannot keep up, downgrade to writing the DB-deduction message to local storage on the machine, which then deducts the DB inventory asynchronously to reach eventual consistency.
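A sketch of option 3, assuming hypothetical StockRedis, StockDao, and MessageQueue abstractions; on the normal path the DB is deducted synchronously, and when the switch is on or the DB call fails, a message is sent for asynchronous deduction:

```java
// Option 3: Redis is the source of truth for the hot path; DB deduction is
// synchronous normally and downgraded to an async message under pressure.
public class StockDeductService {
    private final StockRedis redis;
    private final StockDao dao;
    private final MessageQueue mq;
    private volatile boolean dbDegraded = false;   // manual/automatic downgrade switch

    public StockDeductService(StockRedis redis, StockDao dao, MessageQueue mq) {
        this.redis = redis; this.dao = dao; this.mq = mq;
    }

    public boolean deduct(long skuId, int num) {
        if (!redis.deduct(skuId, num)) {
            return false;                           // no stock left in Redis
        }
        if (dbDegraded) {
            mq.send("deduct-stock", skuId, num);    // degraded: async DB deduction
            return true;
        }
        try {
            dao.deduct(skuId, num);                 // normal path: synchronous DB deduction
        } catch (Exception e) {
            mq.send("deduct-stock", skuId, num);    // DB struggling: degrade this request
        }
        return true;
    }

    public interface StockRedis { boolean deduct(long skuId, int num); }
    public interface StockDao { void deduct(long skuId, int num); }
    public interface MessageQueue { void send(String topic, long skuId, int num); }
}
```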

 

In other words, inventory is deducted synchronously under normal conditions and downgraded to asynchronous deduction when performance cannot keep up; in a flash-sale scenario it can be downgraded to asynchronous directly to protect the system. Similarly, some order operations can be temporarily downgraded during a big promotion by writing the order data to Redis and synchronizing it back to the DB after the peak has passed. There are better solutions, of course, but they are more complex and not the focus of this article.

 

User reviews are similar: if review volume is too high, writing reviews can be downgraded from synchronous to asynchronous. The review button can also be opened proportionally (so some users do not see it at all). And if rewards are issued after a successful review, issuing them can likewise be downgraded from synchronous to asynchronous when necessary.

 

Multi-level downgrade

For caches, the closer to the user, the more efficient; for downgrades, the closer to the user, the better the backend is protected, because as business complexity grows toward the backend, the QPS/TPS the backend can sustain drops.

 

Page JS downgrade switch: mainly controls downgrading of page functions; a downgrade switch is deployed in the page via a JS script and turned on/off at the appropriate time;

Access-layer downgrade switch: mainly controls downgrading at the request entry point. After a request arrives it first reaches the access layer, where function downgrade switches can be configured and automatic/manual downgrades performed according to the situation (see "JD.com Product Detail Page Service Closed-Loop Practice" for reference). This is especially useful when a backend application service has problems: downgrading at the access layer gives the application service enough time to recover;

Application-layer downgrade switch: mainly controls downgrading of services; configure the corresponding function switches in the application and perform automatic/manual downgrades according to the actual business situation.

 

http://jinnianshilongnian.iteye.com/blog/2306477

 
