Nepxion Discovery Study Notes 2: Service Avalanche and Fault Tolerance Solutions

Table of Contents

Note 1:

Service avalanche effect:

Note 2:

Fault tolerance scheme:

Note 3:

Fault-tolerant components:


Note 1:

Service avalanche effect:

1. In a distributed system, a service cannot be guaranteed to be 100% available, whether because of network problems or faults in the service itself. If a service has a problem, threads that call it will block. If a large number of requests arrive at that moment, many threads end up blocked and waiting, which in turn paralyzes the calling service. Because services depend on one another, the failure propagates and can have catastrophic consequences for the entire microservice system. This is the "avalanche effect" of service failures.

Note 2:

Fault tolerance scheme:

To prevent an avalanche from spreading, services must be made fault tolerant. The following introduces common fault tolerance ideas and components. Common ideas include isolation, timeouts, rate limiting, circuit breaking (fusing), and degradation:

Isolation

1. Isolation means dividing the system into several service modules according to certain principles, so that each module is relatively independent and has no strong dependency on the others. When a failure occurs, the problem and its impact are contained within one module and do not spread to other modules, so the system as a whole keeps serving.

2. Common isolation methods are thread pool isolation and semaphore isolation.
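
Semaphore isolation can be sketched in a few lines of plain Java. This is only an illustration of the idea, not the code of any library mentioned in these notes; the class name SemaphoreBulkhead and its call method are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: caps the number of concurrent calls to one dependency.
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a permit is free. Otherwise returns the "rejected" value
    // at once, so caller threads do not pile up behind a slow dependency.
    <T> T call(Supplier<T> remoteCall, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get();
        }
        try {
            return remoteCall.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }
}
```

Thread pool isolation follows the same idea but runs each dependency's calls on its own bounded pool, which costs a thread switch and in exchange lets the caller stop waiting on a call that is already running.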

Timeout

1. When an upstream service calls a downstream service, it sets a maximum response time. If the downstream service has not responded within that time, the upstream service abandons the request and releases the thread.
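
A minimal sketch of a call with a timeout, using only the JDK. The class TimeoutCall, the method callWithTimeout, and the returned marker strings are made up for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class TimeoutCall {
    // Hypothetical helper: waits at most timeoutMillis for the downstream call.
    static String callWithTimeout(Supplier<String> downstream, long timeoutMillis) {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(downstream);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller stops waiting here. Note that cancel() on a CompletableFuture
            // does not interrupt the worker thread, so a real client should also set
            // connect/read timeouts on the underlying connection.
            future.cancel(true);
            return "timeout";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "interrupted";
        } catch (ExecutionException e) {
            return "error"; // the downstream call itself threw
        }
    }
}
```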

Rate limiting

1. Rate limiting restricts the traffic flowing into and out of the system in order to protect it. To keep the system running stably, once traffic reaches a configured threshold, the excess must be restricted by some measure, such as rejecting or queueing requests.
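
One simple way to enforce such a threshold is a fixed-window counter. The sketch below is an assumption chosen for brevity, not how any particular component does it; FixedWindowLimiter is a hypothetical name, and the clock is passed in so the behavior is easy to check.

```java
import java.util.function.LongSupplier;

// Hypothetical fixed-window limiter: at most `limit` requests per window.
class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis
    private long windowStart;
    private int count;

    FixedWindowLimiter(int limit, long windowMillis, LongSupplier clock) {
        this.limit = limit;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Returns true if the request may pass, false if it should be restricted.
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now; // a new window begins, so reset the counter
            count = 0;
        }
        if (count < limit) {
            count++;
            return true;
        }
        return false;
    }
}
```

A fixed window can let through up to twice the limit around a window boundary; sliding windows and token buckets smooth that out at the cost of a little more bookkeeping.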

Circuit breaking (fuse)

1. In Internet systems, when a downstream service responds slowly or fails because of excessive load, the upstream service can temporarily cut off calls to it in order to protect the overall availability of the system. This measure of sacrificing a part to preserve the whole is called circuit breaking (fusing).

2. A circuit breaker generally has three states:

      2.1. Closed: While the service is healthy, the circuit breaker is closed and places no restriction on the caller's calls.

      2.2. Open: Subsequent calls to the service interface no longer go over the network; the local fallback method is executed directly.

      2.3. Half-Open: The breaker tries to restore the service call by allowing a limited amount of traffic through and monitoring the success rate. If the success rate reaches the expected level, the service is considered recovered and the breaker returns to the Closed state; if the success rate is still low, it goes back to the Open state.
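
The three states can be sketched as a small state machine. This is a deliberately simplified illustration: it counts consecutive failures instead of a success rate, treats every call made in the Half-Open state as a trial call, and all names (SimpleCircuitBreaker, failureThreshold, openMillis) are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to stay open before a trial
    private final LongSupplier clock;   // e.g. System::currentTimeMillis
    private State state = State.CLOSED;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        // Once the open period has elapsed, allow trial calls through.
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // short-circuit: the remote call is skipped
        }
        try {
            T result = remoteCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED; // a successful call (or trial) closes the breaker
    }

    private synchronized void onFailure() {
        failures++;
        // A failed trial, or too many consecutive failures, opens the breaker.
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```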

Degradation

1. Degradation provides a fallback plan for a service: once the service cannot be called normally, the fallback is used instead.
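
As one possible fallback plan, a client can serve the last value it fetched successfully when the remote call fails. The sketch below assumes a hypothetical PriceClient and a placeholder default string; a real fallback might instead return a cached page, a default recommendation, or a friendly error message.

```java
import java.util.function.Supplier;

// Hypothetical client that degrades to the last known good value.
class PriceClient {
    private volatile String lastKnownGood = "price unavailable";

    String getPrice(Supplier<String> remoteCall) {
        try {
            String price = remoteCall.get();
            lastKnownGood = price; // remember the latest successful answer
            return price;
        } catch (RuntimeException e) {
            return lastKnownGood;  // fallback: stale data instead of an error
        }
    }
}
```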


Note 3:

Fault-tolerant components:

Hystrix: thread pool isolation / semaphore isolation; one of the Spring Cloud family components. It is a latency and fault tolerance library open sourced by Netflix, used to isolate access to remote systems, services, or third-party libraries and prevent cascading failures, thereby improving system availability and fault tolerance.

Resilience4j: semaphore isolation; I have not used it, so I will not cover it here. If you are interested, you can look up the details yourself.

Sentinel: semaphore isolation (limiting the number of concurrent threads); a circuit breaker implementation open sourced by Alibaba. It takes traffic as its entry point and protects service stability along several dimensions, including flow control, circuit breaking and degradation, and system load protection. It has been adopted on a large scale within Alibaba and is very stable. (Recommended)

Origin blog.csdn.net/weixin_42585386/article/details/109219042