A record of a service-unavailable fault in production

(1) Phenomenon:
Microservice A could not be reached for a period of time: connections to the service timed out, and the service became available again after about 20 minutes.
(2) Service request link: client -> nginx -> microservice gateway -> microservice A.
Troubleshooting process: The requests to microservice A were traced through nginx and the microservice gateway for the failure period. Microservice A had no request log entries for that period, which means the requests never reached the microservice during the failure.
(3) Analysis of the cause of the failure: There are generally two possible causes for this phenomenon. In the first, the time the microservice takes to process a request exceeds the service circuit-breaker timeout, so the circuit breaker trips and the microservice is unavailable while the circuit is open. In the second, the processing time exceeds the read timeout in the service's load balancing configuration, so the request is sent to another node of the microservice. If processing there also exceeds the read timeout, the load balancer tries the remaining nodes of A one by one. Once every node has exceeded the read timeout, the gateway tells the client that the service is unavailable.
The system log of microservice A before the failure period showed suspicious requests that take a long time to process, for example download requests, or requests to interfaces that themselves have serious performance problems and respond slowly. Once such a request occurs and exceeds the circuit-breaker timeout or the load balancing read timeout, the microservice's protection mechanism starts: the health status of the service is changed to unavailable so that requests sent to the microservice cannot reach it, which protects the service. This protection is not permanent. It lasts a short time, about 20 minutes, and can be changed through configuration.
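The second cause can be illustrated with a toy model. The sketch below is not Ribbon code: the class ReadTimeoutRetrySketch, the method route, and all timing values are hypothetical and only mimic the behaviour described above, where a slow request is tried on each node in turn and the client gets a "service unavailable" answer once every node has exceeded the read timeout.

```java
import java.util.List;

/** Toy model (hypothetical, not Ribbon itself) of trying a slow request on each node. */
class ReadTimeoutRetrySketch {

    /** Outcome of routing one request through the simulated load balancer. */
    enum Outcome { OK, SERVICE_UNAVAILABLE }

    /**
     * Tries each node in turn. A node "succeeds" only if its simulated
     * processing time fits within the read timeout; otherwise the request
     * is tried on the next node.
     */
    static Outcome route(List<Long> nodeProcessingMillis, long readTimeoutMillis) {
        for (long processingMillis : nodeProcessingMillis) {
            if (processingMillis <= readTimeoutMillis) {
                return Outcome.OK; // this node answered before the read timeout
            }
            // Read timeout exceeded: fall through and try the next node.
        }
        return Outcome.SERVICE_UNAVAILABLE; // every node exceeded the read timeout
    }

    public static void main(String[] args) {
        long readTimeoutMillis = 5_000; // assumed value, for illustration only

        // A download that needs 30 s on every node times out everywhere.
        System.out.println(route(List.of(30_000L, 30_000L, 30_000L), readTimeoutMillis));

        // A normal request that needs 200 ms succeeds on the first node.
        System.out.println(route(List.of(200L, 200L, 200L), readTimeoutMillis));
    }
}
```

In this model, the only ways out are to raise the read timeout above the slowest request or to make the request faster, which is what the optimization scheme addresses.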
(4) Optimization scheme:
1. Modify the service circuit-breaker timeout and the load balancing read timeout so that both are greater than the longest request processing time. The configuration formula is: longest request processing time < service circuit-breaker timeout < load balancing read timeout (load balancing read timeout = load balancing connection timeout)

hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds # service circuit-breaker timeout
ribbon.ReadTimeout # load balancing read timeout
ribbon.ConnectTimeout # load balancing connection timeout

2. Optimize the slow, time-consuming interfaces so that their processing time falls within a reasonable range.
3. Move the slow, time-consuming interfaces to a separate microservice so that their availability problems do not affect the availability of the other interfaces in the same service.
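The ordering rule in item 1 can be checked mechanically before a configuration is rolled out. The sketch below is only an illustration: the class TimeoutOrderingCheck, the method satisfiesRule, and every millisecond value are hypothetical, and the rule is encoded exactly as the article states it. When Ribbon retries are enabled, the Spring Cloud Netflix documentation advises keeping the Hystrix timeout longer than the total Ribbon timeout including retries, so the ordering may need adjusting for such setups.

```java
/** Checks candidate timeout values against the ordering rule from item 1 (hypothetical helper). */
class TimeoutOrderingCheck {

    /**
     * Returns true when
     *   longest processing time < Hystrix timeout < Ribbon ReadTimeout
     * and Ribbon ReadTimeout == Ribbon ConnectTimeout, as stated in the article.
     */
    static boolean satisfiesRule(long longestProcessingMillis,
                                 long hystrixTimeoutMillis,
                                 long ribbonReadTimeoutMillis,
                                 long ribbonConnectTimeoutMillis) {
        return longestProcessingMillis < hystrixTimeoutMillis
                && hystrixTimeoutMillis < ribbonReadTimeoutMillis
                && ribbonReadTimeoutMillis == ribbonConnectTimeoutMillis;
    }

    public static void main(String[] args) {
        // All values are assumptions for illustration, not recommendations.
        long longestProcessingMillis = 30_000; // slowest known request, e.g. a download
        long hystrixTimeoutMillis = 40_000;    // ...thread.timeoutInMilliseconds
        long ribbonReadTimeoutMillis = 50_000; // ribbon.ReadTimeout
        long ribbonConnectTimeoutMillis = 50_000; // ribbon.ConnectTimeout

        System.out.println(satisfiesRule(longestProcessingMillis, hystrixTimeoutMillis,
                ribbonReadTimeoutMillis, ribbonConnectTimeoutMillis));
    }
}
```

A check like this could run in a configuration test so that a later change to one of the three properties does not silently break the ordering.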

If there is enough time to fix the fault, applying points 1, 2, and 3 of the optimization scheme together is the most thorough way to resolve it.
