Degradation techniques for high-concurrency systems

When developing a high-concurrency system, there are three powerful tools for protecting it: caching, degradation, and rate limiting. Previous articles have covered caching and rate limiting; this one discusses degradation in detail. When traffic spikes, when a service runs into problems (responding slowly or not at all), or when non-core services start to affect the performance of core flows, we still need to keep the core services available, even in a reduced form. The system can degrade automatically based on key metrics, or switches can be configured so that operators can degrade it manually. This article introduces some degradation schemes the author has used or seen in real-world work, for your reference.

 

The ultimate goal of degradation is to keep core services available, even in a reduced form. Note that some services cannot be degraded at all (e.g., add to cart, checkout).

 

Degradation plan

Before degrading anything, you need to audit the system and decide whether it can sacrifice the pawns to protect the king: sort out which services must be kept alive at all costs and which can be degraded. For example, you can classify them along the lines of log levels:

General: for example, some services occasionally time out because of network jitter or an ongoing deployment; they can be degraded automatically;

Warning: the success rate of a service fluctuates for a period of time (for example, between 95% and 100%); it can be degraded automatically or manually, and an alert is sent;

Error: for example, the availability rate drops below 90%, the database connection pool is exhausted, or traffic suddenly spikes to the maximum threshold the system can bear; depending on the situation, the service can be degraded automatically or manually;

Serious error: for example, the data is wrong for some special reason and an emergency manual degradation is required.
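As a rough illustration of such a plan, here is a minimal sketch (the names and the mapping are my own, not from the original article) of how each level could be tied to whether automatic degradation is allowed and whether an alert is sent:

```java
// Hypothetical classification of degradation levels, mirroring the log-level-style plan above.
public enum DegradeLevel {
    GENERAL(true, false),   // occasional timeouts: may auto-degrade, no alert needed
    WARNING(true, true),    // success rate fluctuates: auto or manual degrade, send alert
    ERROR(true, true),      // availability drops / pool exhausted: auto or manual degrade, send alert
    CRITICAL(false, true);  // wrong data: emergency manual degrade only, send alert

    private final boolean autoDegradeAllowed;
    private final boolean alert;

    DegradeLevel(boolean autoDegradeAllowed, boolean alert) {
        this.autoDegradeAllowed = autoDegradeAllowed;
        this.alert = alert;
    }

    public boolean isAutoDegradeAllowed() { return autoDegradeAllowed; }
    public boolean shouldAlert() { return alert; }
}
```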

 

By degree of automation, degradation can be divided into automatic switch degradation and manual switch degradation.

By function, degradation can be divided into read service degradation and write service degradation.

By system level, degradation can be applied at multiple levels (multi-level degradation).

 

The functional points that can be degraded are mainly identified along the service call chain, that is, by walking the call path a user request takes and deciding where degradation is possible:

Page degradation: during a big promotion or in other special circumstances, some pages tie up scarce service resources; in an emergency the entire page can be degraded to protect the core;

Page fragment degradation: for example, the merchant section of the product detail page has to be degraded because its data is wrong;

Page asynchronous request degradation: for example, the product detail page loads recommendations / delivery information asynchronously; if these requests respond slowly or the backend service has problems, they can be degraded;

Service feature degradation: for example, rendering the product detail page requires calling some less important services (related categories, best-seller lists, and so on); under abnormal conditions these are simply not fetched, i.e., they are degraded;

Read degradation: for example, in a multi-level cache setup, if the backend service has problems, reads can be degraded to cache-only; this suits scenarios that do not require strong read consistency;

Write degradation: for example, we can update only the cache and deduct inventory from the DB asynchronously to achieve eventual consistency; in this case the DB write is degraded to a cache write;

Crawler degradation: during a big promotion, crawler traffic can be directed to static pages or given empty data, protecting scarce backend resources (see the sketch below).
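For the crawler case, a minimal sketch might look like the following. The `crawlerDegradeOn` flag and the user-agent keywords are illustrative assumptions; in practice the flag would be synchronized from a config center, as described in the manual switch section later.

```java
// Sketch: degrade crawler traffic to empty data when the switch is on.
public class CrawlerDegrade {

    private volatile boolean crawlerDegradeOn = false; // hypothetical switch flag

    public void setCrawlerDegradeOn(boolean on) { this.crawlerDegradeOn = on; }

    /** Returns the degraded (empty) payload for crawlers, or null to continue normal handling. */
    public String tryDegrade(String userAgent) {
        if (!crawlerDegradeOn || userAgent == null) {
            return null;
        }
        String ua = userAgent.toLowerCase();
        boolean isCrawler = ua.contains("bot") || ua.contains("spider") || ua.contains("crawler");
        // For crawlers, return empty data (or redirect to a static page) to protect backend resources.
        return isCrawler ? "{}" : null;
    }
}
```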

 

Automatic switch degradation

Automatic degradation is driven by metrics such as system load, resource usage, and SLA.

Timeout degradation

When a call to a database / HTTP service / remote service responds slowly or keeps timing out, and the callee is not a core service, it can be degraded automatically on timeout. For example, the product detail page shows recommendations and reviews; if these are temporarily unavailable, the impact on the user's shopping flow is small, so such services are good candidates for timeout degradation. If you call someone else's remote service, agree on a maximum response time with the other party and degrade automatically when it is exceeded.
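A minimal sketch of timeout-based degradation for a non-core call; the `RecommendClient` interface and the timeout value are hypothetical placeholders:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

public class RecommendationWithTimeout {

    /** Hypothetical non-core remote call. */
    interface RecommendClient {
        List<String> recommend(long productId);
    }

    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final RecommendClient client;

    public RecommendationWithTimeout(RecommendClient client) {
        this.client = client;
    }

    /** Degrade to an empty list if the non-core recommendation call exceeds the agreed timeout. */
    public List<String> recommendOrDegrade(long productId, long timeoutMillis) {
        Future<List<String>> future = pool.submit(() -> client.recommend(productId));
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);              // give up on the slow call
            return Collections.emptyList();   // degraded result: show the page without recommendations
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Collections.emptyList();
        } catch (ExecutionException e) {
            return Collections.emptyList();   // treat failures the same way for a non-core service
        }
    }
}
```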

 

I have written related articles before: "Parameter settings, code writing, and existing risks that you must know when using httpclient" and "Summary of dbcp configuration and jdbc timeout settings". In practice, the main things to configure are the timeout period, the number of timeout retries, and the retry mechanism.
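As a hedged example of the kind of timeout configuration those articles discuss (assuming Apache HttpClient 4.x; the values are illustrative only and should come from your own capacity testing):

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class HttpClientTimeoutConfig {

    public static CloseableHttpClient build() {
        RequestConfig config = RequestConfig.custom()
                .setConnectionRequestTimeout(100) // max wait for a connection from the pool (ms)
                .setConnectTimeout(500)           // max time to establish the TCP connection (ms)
                .setSocketTimeout(1000)           // max idle time between data packets (ms)
                .build();
        return HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build();
    }
}
```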

 

Failure-count degradation

Sometimes we depend on unstable APIs, such as an external ticketing service. When the number of failed calls reaches a certain threshold, the call is degraded automatically; an asynchronous thread then probes whether the service has recovered and, if so, cancels the degradation.
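A minimal failure-count sketch of this idea (a simplified, circuit-breaker-like pattern; the ticket client, the threshold, and the probe interval are hypothetical):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FailureCountDegrade {

    /** Hypothetical unstable external call, e.g. a ticket service. */
    interface TicketClient {
        String book(String order);
        boolean ping();
    }

    private static final int FAILURE_THRESHOLD = 10;

    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean degraded = false;
    private final ScheduledExecutorService probe = Executors.newSingleThreadScheduledExecutor();
    private final TicketClient client;

    public FailureCountDegrade(TicketClient client) {
        this.client = client;
        // Asynchronously probe the dependency; cancel the degradation once it recovers.
        probe.scheduleWithFixedDelay(() -> {
            if (degraded && client.ping()) {
                failures.set(0);
                degraded = false;
            }
        }, 5, 5, TimeUnit.SECONDS);
    }

    public String bookOrDegrade(String order) {
        if (degraded) {
            return null; // degraded: skip the unstable dependency, caller shows a fallback
        }
        try {
            String result = client.book(order);
            failures.set(0); // reset on success
            return result;
        } catch (RuntimeException e) {
            if (failures.incrementAndGet() >= FAILURE_THRESHOLD) {
                degraded = true; // too many failures: degrade automatically
            }
            return null;
        }
    }
}
```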

 

Failure degradation

If the remote service being called is down (network failure, DNS failure, the HTTP service returning an error status code, the RPC service throwing an exception), the call can be degraded directly. The degraded response can be: a default value (for example, if the inventory service is down, assume items are in stock), bottom-line data (for example, if the ad service is down, return static pages prepared in advance), or cached data (data cached earlier).
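A sketch of those fallback choices (default value or previously cached data); the inventory client and the in-memory "last known" cache are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class StockWithFallback {

    /** Hypothetical remote inventory service. */
    interface StockClient {
        int queryStock(long skuId);
    }

    private final StockClient client;
    // Last known values kept as bottom-line data for when the remote service is down.
    private final ConcurrentMap<Long, Integer> lastKnown = new ConcurrentHashMap<>();

    public StockWithFallback(StockClient client) {
        this.client = client;
    }

    public int stockOrFallback(long skuId) {
        try {
            int stock = client.queryStock(skuId);
            lastKnown.put(skuId, stock);      // remember the latest good value
            return stock;
        } catch (RuntimeException e) {        // network/DNS failure, bad status, RPC exception...
            Integer cached = lastKnown.get(skuId);
            if (cached != null) {
                return cached;                // degrade to previously cached data
            }
            return 1;                         // degrade to a default value: assume "in stock"
        }
    }
}
```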

 

Rate-limit degradation

In flash sales or purchases of limited products, the system may crash because of excessive traffic, so developers use rate limiting to cap the traffic; when the rate-limit threshold is reached, subsequent requests are degraded. The degraded handling can be: a queuing page (send the user to a queuing page and ask them to try again shortly), "out of stock" (tell the user directly that the item is sold out), or an error page (the activity is too popular, please try again later).
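A minimal sketch of degrading once a limit is hit, using a `Semaphore` to cap concurrent flash-sale requests (the permit count and the "queue page" marker are placeholders):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class LimitThenDegrade {

    // Cap concurrent flash-sale requests; everything beyond the cap is degraded.
    private final Semaphore permits = new Semaphore(200);

    public String handle(Supplier<String> seckillAction) {
        if (!permits.tryAcquire()) {
            // Over the threshold: degrade to a queuing page / "out of stock" / error page.
            return "QUEUE_PAGE";
        }
        try {
            return seckillAction.get();
        } finally {
            permits.release();
        }
    }
}
```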

 

Manual switch degradation

During a promotion, monitoring may reveal problems with some online services, and these services need to be taken offline temporarily. Sometimes services are invoked through a task system, but the database the service depends on may have problems: the network card is saturated, queries hang, or there are many slow queries; in that case the downstream task system needs to be paused so the database can catch up. Sometimes call volume suddenly becomes too large and the processing method needs to change (for example, from synchronous to asynchronous). In all of these cases a switch can be used to perform the degradation. The switch can be stored in a configuration file, in a database, or in Redis/ZooKeeper; if it is not stored locally, the switch data can be synchronized periodically (for example, once per second), and the value of a key then decides whether to degrade.
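A sketch of a switch that is not stored locally and is synchronized roughly once per second (the Jedis client is assumed for Redis; the key name and "1" convention are placeholders):

```java
import redis.clients.jedis.Jedis;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RedisDegradeSwitch {

    private static final String SWITCH_KEY = "degrade:detail-page:recommend"; // placeholder key

    private volatile boolean degraded = false;
    private final ScheduledExecutorService syncer = Executors.newSingleThreadScheduledExecutor();

    public RedisDegradeSwitch(String redisHost, int redisPort) {
        // Pull the switch value from Redis roughly once per second and cache it locally.
        syncer.scheduleWithFixedDelay(() -> {
            try (Jedis jedis = new Jedis(redisHost, redisPort)) {
                degraded = "1".equals(jedis.get(SWITCH_KEY));
            } catch (RuntimeException e) {
                // If the switch store itself is unreachable, keep the last known value.
            }
        }, 0, 1, TimeUnit.SECONDS);
    }

    /** Business code checks this local copy instead of hitting Redis on every request. */
    public boolean isDegraded() {
        return degraded;
    }
}
```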

 

In addition, for a newly developed service that you want to release for grayscale testing, you may not be sure its logic is correct; in that case you also need a switch, so that when the new service has a problem you can switch back to the old service. The same applies to multi-datacenter deployments: if one datacenter goes down, its traffic needs to be switched to another datacenter, which can also be done with a switch.

 

There are also features that have to be temporarily blocked because of functional problems. For example, the product specification data is wrong and the problem cannot be fixed by a rollback; in that case a switch is needed to degrade the feature.

 

Read service degradation

The strategies generally used for read service degradation are: temporarily switching the read path (degrading to read from cache, degrading to static content) and temporarily blocking reads (blocking a read entry point or a particular read service). In "Applying Multi-level Cache Mode to Support Massive Read Services", the read path was introduced as access-layer cache --> application-layer local cache --> distributed cache --> RPC service / DB; we set switches at the access layer and the application layer, and when the distributed cache, the RPC service, or the DB has problems, the call is automatically degraded to not being made at all. Of course, this applies to scenarios that do not require strong read consistency.
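A sketch of that read path with an application-layer switch that degrades the backend call; the local cache is a plain map here, and the distributed cache and RPC/DB clients are hypothetical interfaces:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class MultiLevelReadWithDegrade {

    /** Hypothetical distributed cache (e.g. a Redis wrapper). */
    interface DistributedCache { String get(String key); }
    /** Hypothetical backend source: RPC service or DB. */
    interface Backend { String load(String key); }

    private final ConcurrentMap<String, String> localCache = new ConcurrentHashMap<>();
    private final DistributedCache distributedCache;
    private final Backend backend;

    // Application-layer switch: when true, the backend is not called at all.
    private volatile boolean backendDegraded = false;

    public MultiLevelReadWithDegrade(DistributedCache distributedCache, Backend backend) {
        this.distributedCache = distributedCache;
        this.backend = backend;
    }

    public void setBackendDegraded(boolean degraded) { this.backendDegraded = degraded; }

    public String read(String key) {
        String value = localCache.get(key);                 // 1. application-layer local cache
        if (value != null) return value;

        value = distributedCache.get(key);                  // 2. distributed cache
        if (value != null) {
            localCache.put(key, value);
            return value;
        }

        if (backendDegraded) {
            return null;                                    // 3. degraded: skip RPC service / DB entirely
        }
        value = backend.load(key);                          // 4. RPC service / DB
        if (value != null) localCache.put(key, value);
        return value;
    }
}
```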

 

Page degradation, page fragment degradation, and page asynchronous request degradation are all read service degradations: a service is taken out of the flow (for example, because it also uses core resources, or occupies bandwidth and affects core services) or temporarily blocked because of data problems.

 

There are also scenarios for switching between static and dynamic pages:

Dynamic degraded to static: for example, the product detail page is normally rendered dynamically, but when a big promotion arrives it can be switched to a static version to reduce pressure on core resources and improve performance; list pages, the home page, and channel pages can be handled the same way. The static pages can be pushed to a cache or generated to disk periodically by a program, so that we can switch over directly when there is a problem;

Static degraded to dynamic: for example, when the product detail page is built to be served statically, static pages normally handle the traffic; but if the static pages have problems for some special reason, we need to switch back to dynamic rendering temporarily to keep the service correct.
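A sketch of switching between the pre-generated static page and dynamic rendering via a switch; the file location and the renderer are placeholders, and `Files.readString` assumes Java 11+:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

public class StaticDynamicSwitch {

    // true  -> serve the pre-generated static page (e.g. during a big promotion)
    // false -> render the page dynamically (e.g. when the static page has bad data)
    private volatile boolean serveStatic = true;

    private final Function<Long, String> dynamicRenderer; // placeholder for the real renderer

    public StaticDynamicSwitch(Function<Long, String> dynamicRenderer) {
        this.dynamicRenderer = dynamicRenderer;
    }

    public void setServeStatic(boolean serveStatic) { this.serveStatic = serveStatic; }

    public String productDetailHtml(long productId) {
        if (serveStatic) {
            Path staticFile = Path.of("/data/static/product/" + productId + ".html"); // placeholder path
            try {
                return Files.readString(staticFile);
            } catch (IOException e) {
                // Static page missing or broken: fall back to dynamic rendering.
            }
        }
        return dynamicRenderer.apply(productId);
    }
}
```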

 

With the above in place, there is a plan for when something goes wrong, and users can continue to use the website without their shopping being affected.

 

Write service degradation

Write services cannot be degraded in most scenarios, but there are some workarounds, such as converting synchronous operations into asynchronous ones, or limiting the volume / proportion of writes.

For example, inventory deduction is generally handled in one of the following ways:

 

Option 1:

1. Deduct the DB inventory; 2. Update the Redis inventory after the deduction succeeds;

Option 2:

1. Deduct the Redis inventory; 2. Deduct the DB inventory synchronously, and roll back the Redis inventory if the DB deduction fails;

The first two options depend heavily on the DB; if the DB's performance cannot keep up, inventory deduction will run into problems. Hence we can think of option 3:

1. Deduct the Redis inventory; 2. Deduct the DB inventory synchronously under normal conditions; when performance cannot keep up, degrade to sending a message to deduct the DB inventory, then deduct the DB inventory asynchronously to achieve eventual consistency;

Sending the DB inventory deduction messages may itself become a bottleneck; in that case we can consider option 4:

1. Deduct the Redis inventory; 2. Deduct the DB inventory synchronously under normal conditions; when performance cannot keep up, degrade to writing the DB inventory deduction message to the local machine, which then deducts the DB inventory asynchronously to achieve eventual consistency.

 

In other words, inventory can be deducted synchronously under normal conditions and degraded to asynchronous deduction when performance cannot keep up; in a flash-sale scenario it can be degraded to asynchronous directly to protect the system. Similarly, order placement can be temporarily degraded during a big promotion by writing the order data to Redis and synchronizing it back to the DB after the peak has passed. There are better solutions, but they are more complex and not the focus of this article.
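A sketch of options 3/4 above: deduct the Redis inventory first, deduct the DB inventory synchronously under normal conditions, and degrade to an asynchronous message when the switch is flipped. Jedis is assumed for the Redis client; the DB DAO, the message channel, and the key naming are hypothetical, and a single Jedis instance is used here only for brevity (a pool would be used in practice):

```java
import redis.clients.jedis.Jedis;

public class StockDeductWithDegrade {

    /** Hypothetical synchronous DB deduction. */
    interface StockDao { boolean deduct(long skuId, int num); }
    /** Hypothetical async channel: an MQ, or a local file/queue drained by a background thread. */
    interface DeductMessageQueue { void send(long skuId, int num); }

    private final Jedis jedis;
    private final StockDao stockDao;
    private final DeductMessageQueue queue;

    // Degrade switch: flipped (manually or automatically) when the DB cannot keep up.
    private volatile boolean asyncDegrade = false;

    public StockDeductWithDegrade(Jedis jedis, StockDao stockDao, DeductMessageQueue queue) {
        this.jedis = jedis;
        this.stockDao = stockDao;
        this.queue = queue;
    }

    public void setAsyncDegrade(boolean on) { this.asyncDegrade = on; }

    public boolean deduct(long skuId, int num) {
        // 1. Deduct the Redis inventory first.
        long remaining = jedis.decrBy("stock:" + skuId, num);
        if (remaining < 0) {
            jedis.incrBy("stock:" + skuId, num); // sold out: roll back the Redis deduction
            return false;
        }
        // 2. Deduct the DB inventory synchronously in the normal case...
        if (!asyncDegrade) {
            if (!stockDao.deduct(skuId, num)) {
                jedis.incrBy("stock:" + skuId, num); // DB deduction failed: roll back Redis
                return false;
            }
            return true;
        }
        // ...or, when degraded, send a deduction message and rely on eventual consistency.
        queue.send(skuId, num);
        return true;
    }
}
```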

 

Another example is user reviews: if the review volume is too large, review writes can also be degraded from synchronous to asynchronous. The review button can also be opened to users by proportion (so that some users do not see the review button at all). Likewise, rewards issued after a successful review can be degraded from synchronous to asynchronous issuance when necessary.
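A sketch of opening the review button by proportion; the percentage would normally come from a switch, and any stable hash of the user id works:

```java
public class ReviewButtonRollout {

    // Percentage of users who can see the review button, 0..100 (placeholder; normally read from a switch).
    private volatile int openPercent = 100;

    public void setOpenPercent(int percent) { this.openPercent = percent; }

    /** Stable per-user decision: the same user always gets the same answer for a given percentage. */
    public boolean showReviewButton(long userId) {
        return Math.floorMod(Long.hashCode(userId), 100) < openPercent;
    }
}
```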

 

Multi-level degradation

Caching is most efficient when it is closest to the user, and degradation protects the system best when it happens closest to the user, because business complexity means that the further a request travels into the backend, the lower the QPS/TPS the system can sustain.

 

Page JS degradation switch: mainly controls the degradation of page features; a feature degradation switch is deployed in the page via a JS script and turned on/off at the appropriate time;

Access-layer degradation switch: mainly controls degradation at the request entry; requests first pass through the access layer, where feature degradation switches can be configured and automatic/manual degradation performed according to the actual situation (see "JD Product Detail Page Service Closed-loop Practice"); especially when the backend application service has problems, degrading at the access layer gives the application service enough time to recover;

Application-layer degradation switch: mainly controls the degradation of business logic; feature switches are configured in the application, and automatic/manual degradation is performed according to the actual business situation.
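As an application-layer illustration, here is a sketch of a switch checked at the request entry (assuming the javax.servlet API; the switch field and the degraded response text are placeholders, and in practice the flag would be synchronized from a config center as described earlier):

```java
import javax.servlet.*;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

public class DegradeFilter implements Filter {

    // Placeholder application-layer switch; in practice synchronized from a config center.
    private volatile boolean degraded = false;

    public void setDegraded(boolean degraded) { this.degraded = degraded; }

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (degraded) {
            // Degraded: short-circuit the request with a lightweight response instead of
            // letting it reach the (troubled) backend business logic.
            HttpServletResponse resp = (HttpServletResponse) response;
            resp.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            resp.getWriter().write("Service temporarily degraded, please try again later");
            return;
        }
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() { }
}
```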

 

Original article (Chinese): http://jinnianshilongnian.iteye.com/blog/2306477

 
