Ideas for system fault tolerance

1. Make good use of cache

  • In a wealth management product, a user's income data is queried from the DB in real time. If the DB fails and there is no backup data source, users cannot see their income. If Redis is used as a cache, the cached copy can serve as backup data and be shown to users, which avoids complaints.
  • If Redis goes down as well, consider whether a local copy of the data exists. If the business scenario tolerates stale data for a while, local memory or a second-level cache can cover the gap. If the business does not accept stale data, fail fast: tell the user that the request failed and ask them to try again later.
  • Users usually keep retrying after they see a failure, which puts more pressure on the backend, causes messages to pile up, and can end in an avalanche. Flow control is required for this situation, and fail-fast can be enforced on both the front end and the back end. The key point in system design is to think through how exceptions are handled.
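The fallback order described above (DB, then Redis, then local memory, otherwise fail fast) could be sketched roughly as follows. This is only an illustration: `load_income`, `LocalCache`, `SourceUnavailable`, and the `db_query` / `redis_get` callables are hypothetical names standing in for the real data-access code, and the 300-second staleness limit is an assumed value.

```python
import time


class SourceUnavailable(Exception):
    """Raised by a data source that cannot answer right now (hypothetical)."""


class LocalCache:
    """In-process cache that remembers when each entry was stored."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())

    def get(self, key, max_age):
        entry = self._data.get(key)
        if entry is None or time.monotonic() - entry[1] > max_age:
            return None  # missing, or too old to show
        return entry[0]


def load_income(user_id, db_query, redis_get, local, allow_stale, max_age=300.0):
    """Try the DB, then Redis, then local memory; otherwise fail fast."""
    try:
        value = db_query(user_id)
        local.put(user_id, value)  # keep a local copy for bad days
        return value, "db"
    except SourceUnavailable:
        pass
    try:
        value = redis_get(user_id)
        if value is not None:
            return value, "redis"
    except SourceUnavailable:
        pass
    if allow_stale:
        value = local.get(user_id, max_age)
        if value is not None:
            return value, "local"
    # Nothing acceptable is available: fail fast with a clear message.
    raise SourceUnavailable("income temporarily unavailable, please try again later")
```

Returning the source name next to the value lets the caller decide whether to mark the figure as possibly out of date.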

2. Actively retry

  • Establish an automatic retry mechanism to avoid frequent manual intervention and reduce our workload, especially when calling partners whose services are extremely unstable.
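One possible shape for such a mechanism is a small wrapper that retries a call a bounded number of times. The exponential backoff and jitter are assumptions added here so that retries do not pile extra load onto an unstable partner; the original text only asks for automatic retry. `call_with_retry` and its parameters are hypothetical names.

```python
import random
import time


def call_with_retry(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable error, wait and try again.

    The wait doubles after each failure (capped at max_delay) and is
    randomly shortened so many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller or an alert handle it
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Only errors listed in `retry_on` are retried, so a programming error or a rejected request still surfaces immediately instead of being repeated.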

3. DB slow queries must be optimized

  • Slow queries are very dangerous. Confirm that the table has an index the query can use. Without one, high concurrency drives the DB CPU up sharply and the program stops responding. When users see a request fail they keep sending new requests, and the service eventually avalanches.
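A quick way to confirm that a query really uses an index is to ask the database for its query plan. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's standard `sqlite3` module purely so that it is self-contained; the article does not say which database is in use, and the `income` table and `idx_income_user_id` index are made-up names. MySQL's `EXPLAIN` plays a similar role.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE income (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
sql = "SELECT amount FROM income WHERE user_id = ?"

# Without an index on user_id the plan is a full table scan.
before = query_plan(conn, sql, (42,))

conn.execute("CREATE INDEX idx_income_user_id ON income (user_id)")

# With the index in place the plan should mention idx_income_user_id.
after = query_plan(conn, sql, (42,))

print(before)
print(after)
```

The exact wording of the plan differs between SQLite versions, so it is safer to look for the index name in the output than to compare whole strings.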

4. Introduce asynchronous processing

  • Long-running tasks can be split off into asynchronous operations that return to the upstream caller quickly. The upstream can then carry on with other work, or it can poll for the result a few times after receiving the response and then continue with the next step.
    For the user experience, this is much better than simply keeping users waiting.
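As a rough illustration of "return quickly, then let the upstream query a few times", the sketch below hands the slow work to a thread pool and gives the caller a task id to poll. `TaskService` and `poll` are hypothetical names, and an in-memory dictionary stands in for whatever task store a real system would use.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class TaskService:
    """Accepts slow jobs, returns at once, and lets callers ask for status."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures = {}

    def submit(self, func, *args):
        """Start func in the background and return a task id immediately."""
        task_id = uuid.uuid4().hex
        self._futures[task_id] = self._pool.submit(func, *args)
        return task_id

    def status(self, task_id):
        """Return ('pending', None), ('done', result) or ('failed', error)."""
        future = self._futures[task_id]
        if not future.done():
            return "pending", None
        error = future.exception()
        if error is not None:
            return "failed", error
        return "done", future.result()


def poll(service, task_id, attempts=5, interval=0.1):
    """Upstream side: query a few times, then stop and report 'pending'."""
    for _ in range(attempts):
        state, value = service.status(task_id)
        if state != "pending":
            return state, value
        time.sleep(interval)
    return "pending", None
```

Capping the number of polls matters here: an upstream that polls forever is just a slower version of the blocking call it was meant to replace.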

5. Program execution order

  • Under abnormal conditions, the order in which a program does its work can itself cause problems. For example, some applications process retry tasks. If the code is written to process all failed tasks immediately after the system restarts, it may block normal requests, which leads to even more failures.
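One way to avoid this, shown here only as a sketch, is to serve live requests first after a restart and drain the retry backlog a little at a time with whatever capacity is left. The `Dispatcher` class and the `backlog_per_tick` limit are hypothetical; the original text describes the problem rather than a specific remedy.

```python
from collections import deque


class Dispatcher:
    """Serves live requests first and drains the retry backlog gradually."""

    def __init__(self, backlog, backlog_per_tick=1):
        self.live = deque()
        self.backlog = deque(backlog)  # failed tasks found at restart
        self.backlog_per_tick = backlog_per_tick

    def tick(self, capacity):
        """Pick up to `capacity` jobs for this cycle, in run order."""
        batch = []
        # Live traffic always goes first.
        while self.live and len(batch) < capacity:
            batch.append(self.live.popleft())
        # Spare capacity goes to old retries, but only a few per cycle.
        spare = min(capacity - len(batch), self.backlog_per_tick)
        for _ in range(spare):
            if not self.backlog:
                break
            batch.append(self.backlog.popleft())
        return batch
```

The trade-off is that under sustained full load the backlog never drains, so a real system would probably also reserve a small guaranteed share for retries or raise an alert when tasks wait too long.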

6. Clear logs

  • When a transaction involves calls between multiple systems, a single shared field (for example, a transaction or trace ID) should tie the whole transaction together across their logs. This makes it easier for operations staff to find logs, locate problems, and set up monitoring.
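A minimal sketch of such a shared field, using Python's standard `logging` and `contextvars` modules: every log line written while a request is being handled carries the same trace id, and an id received from the calling system is reused rather than replaced. The names `trace_id_var`, `TraceIdFilter`, `build_logger`, and `handle_request`, as well as the log messages, are illustrative assumptions.

```python
import contextvars
import logging
import uuid

# Holds the trace id of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace id onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def build_logger(name, stream=None):
    """Create a logger whose output format includes the trace id."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s trace=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_trace_id=None):
    """Reuse the caller's trace id if present, otherwise start a new one."""
    trace_id = incoming_trace_id or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("order received")
        logger.info("calling payment service")
        return trace_id  # pass this on to downstream systems
    finally:
        trace_id_var.reset(token)
```

With the id in a fixed position in each line, finding every log entry for one transaction across systems becomes a single text search on `trace=<id>`.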

7. Flow control

8. Downgrade switch

9. Bypass

10. Message-driven patterns

