Circuit breaker mechanism: preventing one service failure from bringing down the entire system

1. Introduction

A system usually includes multiple services, and smooth data interaction between services is a basic requirement for the entire system to work properly.

In practice, however, many teams find that when a single service misbehaves, the whole system collapses. How can we prevent and handle this situation? This chapter aims to offer some ideas.


2. Scenario

2.1 Overview of the interfaces

In a new retail architecture system, there is a general user service that is used by many pages, and it contains two interfaces.

The first interface: user mode interface
Function: returns the location of the user's vehicle.
Used by: user information display pages, such as the user information page in the customer service system.
Data source: a third-party system. The interface fetches the data from the third party and returns it to the caller.


The second interface: the user's permission list
Function: returns the list of operations a user is permitted to perform, made up of the general permissions and any user-defined permissions.
Used by: the APP, every time the user opens it.
Data source: read from the cache or the database.


2.2 Problems encountered with the interfaces

The problem encountered in the first interface: the request is slow.
The call chain between the services has three steps:
1) The APP calls the User API;
2) The User API calls the /currentCarLocation interface of the Basic Data Service;
3) The Basic Data Service interacts with the third-party system to obtain the data.

Problem: the third party responds very slowly and sometimes erratically, so the response time grows, and the interface often times out and reports an error.



The problem encountered in the second interface: the cache expires during a traffic peak
The call chain between the services has three steps:
1) The APP calls the User API;
2) The User API calls the /commonAccess interface of the Basic Data Service;
3) The Basic Data Service returns the general permission list. Because this list is the same for all users, we cache it in Redis; if it is not found in Redis, we look it up in the database.
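
The lookup in step 3 can be sketched as a cache-first read. This is only a minimal illustration, not the original service code: `TtlCache` is a hypothetical in-memory stand-in for Redis, and `get_common_access`, `load_from_db`, the key name, and the 60-second TTL are all assumptions made for the example.

```python
import time


class TtlCache:
    """Hypothetical in-memory stand-in for Redis: entries expire after a TTL."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # The entry has expired, so treat it as a cache miss.
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def get_common_access(cache, load_from_db, ttl=60.0):
    """Return the general permission list, preferring the cache."""
    cached = cache.get("commonAccess")
    if cached is not None:
        return cached
    # Cache miss: fall through to the database, then refill the cache.
    permissions = load_from_db()
    cache.set("commonAccess", permissions, ttl)
    return permissions
```

Note that nothing in this sketch stops many concurrent callers from missing the cache at the same moment and all reaching the database together.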

Problem:

  • During the traffic peak, the general permission list cached in Redis expires. At that moment every thread has to read the data from the database, and the DB's CPU usage immediately soars to 100%.
  • Once the DB hangs, the Basic Data Service hangs as well: all of its threads are blocked waiting for a database connection, so it cannot accept new requests.
  • The User API's threads are in turn blocked on their calls to the Basic Data Service, until every thread of the User API is blocked too. The User API hangs, and all operations in the APP become unavailable.


3. Solutions

To solve the two problems above, the following two mechanisms are needed:

(1) Thread isolation

For the first problem, consider the setup: each User API service instance is configured with a maximum of 1000 connections, and every call from the User API to /currentCarLocation on the Basic Data Service is very slow, so these slow calls can end up occupying most of the connections.

We therefore want to cap the number of concurrent calls to /currentCarLocation at 50, which guarantees that at least 950 connections remain available for regular requests. If the number of concurrent calls to /currentCarLocation exceeds 50, we handle the excess with fallback logic, such as showing a prompt to the user in the UI.
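
The cap of 50 with a fallback can be sketched as follows. This is a rough illustration under stated assumptions: it limits concurrency with a semaphore rather than a dedicated thread pool, and `Bulkhead`, `car_location_bulkhead`, and `location_unavailable` are hypothetical names that do not come from the original system.

```python
import threading


class Bulkhead:
    """Caps the number of concurrent calls to one downstream dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        # Non-blocking acquire: if no slot is free, do not wait;
        # run the fallback immediately instead.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return func()
        finally:
            self._slots.release()


# Hypothetical wiring: at most 50 concurrent calls to /currentCarLocation.
car_location_bulkhead = Bulkhead(max_concurrent=50)


def location_unavailable():
    # Hypothetical fallback payload that the UI could turn into a prompt.
    return {"ok": False, "message": "Vehicle location is temporarily unavailable"}
```

A caller would wrap the slow request, for example `car_location_bulkhead.call(fetch_location, location_unavailable)`, where `fetch_location` is whatever function performs the real call. The key design choice is that a rejected request returns at once instead of queueing, so slow calls can never hold more than 50 slots.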



(2) Circuit breaking

For the second problem, the DB was not deadlocked at the time; it was simply under too much pressure once the cache expired during the traffic peak. In this case we can have the Basic Data Service pause briefly and stop accepting new requests. This gives the Redis cache time to be repopulated, the number of database connections drops, and the service recovers.

In short, the solution must meet the following two requirements:
1) If requests to an interface have been failing recently, stop calling the service behind that interface for a while;
2) If a request to an interface times out, first determine whether the service behind it is overloaded; if it is, hold off on calling it.
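
These two requirements can be sketched as a small circuit breaker. This is only an illustrative sketch, not the implementation used in the original system: `CircuitBreaker`, `CircuitOpenError`, and the default threshold and pause length are assumptions, timeouts are simply counted as failures, and the class is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still pausing: reject without touching the downstream service.
                raise CircuitOpenError("downstream service is being given time to recover")
            # The pause is over: let this call through as a trial.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast and can fall back to a default response, which is what gives the overloaded service the pause described above.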

Origin blog.csdn.net/locahuang/article/details/123935849