circuit breaker design pattern

Reprinted: http://www.cnblogs.com/yangecnu/p/Introduce-Circuit-Breaker-Pattern.html

You may have noticed, especially in summer, that when the electrical load at home is too high, for example when many appliances are switched on at once, the circuit "trips" and is disconnected. The older mechanism for this was the fuse. When the load is too high, or the circuit is faulty, the current keeps rising; left unchecked, it can damage important components or valuable devices, burn out the wiring, or even start a fire. When the current rises abnormally to a certain level and generates enough heat, the fuse melts and cuts off the current, protecting the circuit.

Similarly, in a large-scale software system, if a remote service or resource being invoked becomes unavailable and there is no such overload protection, requests pile up on the server waiting for the resource, eventually exhausting system or server resources. Often a failure starts out local and small, but for various reasons its scope keeps growing until it has system-wide consequences. The software counterpart of this overload protection is the circuit breaker pattern, which this article discusses.

One origin of the problem

In large distributed systems, applications routinely call remote services or resources. Such calls can fail for reasons outside the caller's control, such as a slow network connection or a resource that is busy or temporarily unavailable. These faults usually clear up on their own after a short time.

In some cases, however, the failure has an unforeseen cause, and the remote service or resource may take a long time to repair. Such failures can be severe enough that part of the system stops responding, or the whole service becomes unavailable. Retrying over and over does not help here; instead, the application should return immediately and report the error.

Often, when a server is very busy, a partial failure can turn into a "cascading failure". For example, an operation may call a remote web service with a timeout configured, so that an exception is thrown if the response takes longer than the timeout. With this strategy, however, every concurrent request to the same operation blocks until the timeout expires. The blocked requests hold valuable system resources such as memory, threads, and database connections; eventually those resources run out, unrelated parts of the system that need them are starved, and the entire system is dragged down. In this situation it is better for the operation to fail immediately than to wait for the timeout, and to call the service only when the call is likely to succeed.

Two solutions

The circuit breaker pattern prevents an application from repeatedly attempting an operation that is likely to fail, so the application can carry on without waiting for the fault to be fixed and without wasting CPU time on long timeouts. The pattern also lets the application detect whether the fault has been fixed; if so, the application tries the operation again.

The circuit breaker acts as a proxy for operations that are prone to failure. The proxy records the number of recent failures and uses it to decide whether to let the operation proceed or to return an error immediately.

The circuit breaker can be implemented as a state machine with the following states.

    Closed state: A request from the application is passed straight through to the operation. The proxy keeps a count of recent failures and increments it each time a call fails. If the count exceeds the allowed threshold within a given period, the proxy switches to the Open state and starts a timeout timer; when the timer expires, it switches to the Half-Open state. The timeout gives the system a chance to fix the problem that caused the failures.
    Open state: A request from the application fails immediately with an error response.
    Half-Open state: A limited number of requests from the application are allowed through to the service. If they succeed, the fault is assumed to be fixed and the circuit breaker switches to the Closed state (resetting the failure counter). If any of them fails, the fault is assumed to persist, so the circuit breaker switches back to the Open state and restarts the timeout timer to give the system more time to recover. The Half-Open state keeps a recovering service from being overwhelmed by a sudden flood of requests.

The transition between the various states is as follows:

(Figure: Circuit Breaker State Change)

In the Closed state, the failure counter is time-based: it resets automatically at a fixed interval, so occasional errors do not trip the circuit breaker. The circuit breaker switches to the Open state only if the number of failures reaches the threshold within that interval. The Half-Open state uses a consecutive-success counter that records the number of successful calls. When it reaches a specified value, the circuit breaker switches to the Closed state; if any call fails, it switches to the Open state immediately. The consecutive-success counter is reset to zero the next time the circuit breaker enters the Half-Open state.
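The time-based failure counter can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name WindowedFailureCounter and its members are hypothetical, the caller supplies the current time, and the class is not thread-safe on its own (it would be used under the circuit breaker's lock).

```csharp
using System;

// Hypothetical helper (not part of the article's code): a failure counter
// that resets itself once a fixed time window has elapsed.
public sealed class WindowedFailureCounter
{
    private readonly TimeSpan window;
    private DateTime windowStart;

    public int Count { get; private set; }

    public WindowedFailureCounter(TimeSpan window, DateTime start)
    {
        if (window <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(window));
        }
        this.window = window;
        windowStart = start;
    }

    // Records one failure at time `now` and returns the number of
    // failures seen in the current window.
    public int RecordFailure(DateTime now)
    {
        ResetIfExpired(now);
        Count++;
        return Count;
    }

    // Starts a new window (and clears the count) when the old one is over.
    private void ResetIfExpired(DateTime now)
    {
        if (now - windowStart >= window)
        {
            windowStart = now;
            Count = 0;
        }
    }
}
```

The Closed state would compare the returned count with the failure threshold, so that isolated errors spread over a long period never add up to a trip.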

Implementing the circuit breaker pattern makes the system more stable and resilient: it provides stability while the system recovers from a failure and reduces the impact of failures on performance. Calls that are likely to fail are rejected quickly, without waiting for the operation to time out (or hang forever), which keeps response times low. If the circuit breaker raises an event each time it changes state, that information can be used to monitor the health of the protected service and to alert an administrator when the circuit breaker switches to the Open state.
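Raising an event on every state change might look like the sketch below. The names BreakerStatus, StateChangedEventArgs and BreakerStateNotifier are hypothetical and stand apart from the CircuitBreaker class implemented later in this article; the sketch only shows the notification mechanism.

```csharp
using System;

// Hypothetical names, used only for this illustration.
public enum BreakerStatus { Closed, Open, HalfOpen }

public sealed class StateChangedEventArgs : EventArgs
{
    public BreakerStatus From { get; }
    public BreakerStatus To { get; }

    public StateChangedEventArgs(BreakerStatus from, BreakerStatus to)
    {
        From = from;
        To = to;
    }
}

public sealed class BreakerStateNotifier
{
    public BreakerStatus Current { get; private set; } = BreakerStatus.Closed;

    // Subscribers (monitoring, alerting) are told about every transition.
    public event EventHandler<StateChangedEventArgs> StateChanged;

    public void MoveTo(BreakerStatus next)
    {
        if (next == Current)
        {
            return; // not a transition, nothing to report
        }
        var args = new StateChangedEventArgs(Current, next);
        Current = next;
        StateChanged?.Invoke(this, args);
    }
}
```

A monitoring component could subscribe to StateChanged and, for example, send an alert whenever the To value is Open.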

The circuit breaker pattern can be customized for the specific kinds of failure a remote service may exhibit. For example, the timeout can grow over time: a few seconds the first time the circuit breaker opens, then a few minutes if the fault has not been resolved, and so on. In some cases, instead of throwing an exception in the Open state, the circuit breaker can return a default value that is meaningful to the application.
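One possible way to grow the timeout is to double it on each consecutive trip, up to a cap. The sketch below assumes exactly that policy; the class OpenTimeoutPolicy, the doubling factor and the cap are illustrative choices, not something prescribed by the pattern.

```csharp
using System;

// Hypothetical policy: the Open-state timeout doubles with each
// consecutive trip and never exceeds maxTimeout.
public static class OpenTimeoutPolicy
{
    public static TimeSpan NextTimeout(TimeSpan baseTimeout, int consecutiveTrips, TimeSpan maxTimeout)
    {
        if (consecutiveTrips < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(consecutiveTrips));
        }

        // Cap the exponent so the multiplication stays in a sane range.
        int exponent = Math.Min(consecutiveTrips - 1, 30);
        double milliseconds = baseTimeout.TotalMilliseconds * Math.Pow(2, exponent);

        return milliseconds >= maxTimeout.TotalMilliseconds
            ? maxTimeout
            : TimeSpan.FromMilliseconds(milliseconds);
    }
}
```

With a base of 5 seconds and a cap of 5 minutes, the first trip waits 5 seconds, the second 10, the third 20, and later trips settle at the cap.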
Three factors to consider

When implementing the circuit breaker pattern, the following factors may need to be considered:

    Exception handling: Code that calls a service through a circuit breaker must handle the exception raised when the service is unavailable. How it is handled depends on the business scenario. For example, the application might temporarily degrade its functionality, switch to an alternative service that performs the same task or returns the same data, or report the error to the user and ask them to try again later.
    Types of exceptions: A request can fail for many reasons, some more serious than others. For example, a request may fail because the remote service has crashed and needs several minutes to recover, or because the server is temporarily overloaded and the request timed out. The circuit breaker should be able to examine the type of error and adjust its strategy accordingly. For example, it might require many timeout exceptions before switching to the Open state, but only a few errors indicating that the service is unavailable.
    Logging: A circuit breaker should log all failed requests (and possibly successful ones) so that administrators can monitor the health of the operations it protects.
    Testing service availability: In the Open state, instead of relying on a timer to switch to the Half-Open state, the circuit breaker can periodically ping the remote service or resource to find out whether it has recovered. The ping can be an attempt to invoke the operation that previously failed, or a call to a health-check operation that the remote service provides for this purpose.
    Manual reset: It is hard to predict how long a failed operation will take to recover. A manual reset lets an administrator force the circuit breaker into the Closed state. Likewise, an administrator can force it into the Open state if the protected service is temporarily unavailable.
    Concurrency: The same circuit breaker may be accessed by a large number of concurrent requests. The implementation should not block concurrent requests or add significant overhead to each call.
    Differences between resources: Take care when using a single circuit breaker for a resource that is spread across several places. For example, data may be stored in several partitions (shards), one of which is fully accessible while another has a temporary problem. If the error responses from these partitions are lumped together, the application may keep trying to access the failing partition, where failure is highly likely, while access to the healthy partitions may be blocked.
    Accelerated tripping: Sometimes the error returned by the service is enough for the circuit breaker to trip immediately and stay open for a while. For example, if the response from a shared resource indicates that it is overloaded, an immediate retry is not advisable and the application should wait a few minutes before retrying. (HTTP defines "503 Service Unavailable" to indicate that the requested service is currently unavailable; the response can include additional information, such as how long to wait before retrying.)
    Replaying failed requests: In the Open state, rather than simply failing each request, the circuit breaker can record the details of each request so that the requests can be replayed once the remote service recovers.
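As an illustration of the "types of exceptions" point, failures can be given different weights so that serious errors trip the circuit breaker sooner than timeouts. The sketch below is one possible approach; the class WeightedFailureScore and the weights 1 and 5 are assumptions chosen for the example.

```csharp
using System;

// Hypothetical helper: each failure adds a weight to a score, and the
// breaker should trip once the score reaches the threshold.
public sealed class WeightedFailureScore
{
    private readonly int threshold;

    public int Score { get; private set; }

    public WeightedFailureScore(int threshold)
    {
        if (threshold < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(threshold));
        }
        this.threshold = threshold;
    }

    // Assumed weights: a timeout counts as 1, any other exception as 5.
    public static int WeightOf(Exception e)
    {
        return e is TimeoutException ? 1 : 5;
    }

    // Adds the failure's weight; returns true when the breaker should trip.
    public bool Record(Exception e)
    {
        Score += WeightOf(e);
        return Score >= threshold;
    }

    public void Reset()
    {
        Score = 0;
    }
}
```

With a threshold of 10, it takes ten timeouts to trip the circuit breaker but only two of the more serious errors.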

Four usage scenarios

This pattern should be used to:

    Prevent applications from directly calling remote services or shared resources that are likely to fail.

This pattern is not suitable:

    For access to local private resources in the application, such as in-memory data structures; here a circuit breaker only adds overhead.
    As a substitute for handling exceptions in the application's business logic.

Five implementations

Based on the state transition diagram above, a basic circuit breaker is easy to implement: we only need to maintain a state machine and define the transition rules, which the State pattern handles well. First, we define an abstract class, CircuitBreakerState, that represents the operations each state must support:

public abstract class CircuitBreakerState
{
    protected CircuitBreakerState(CircuitBreaker circuitBreaker)
    {
        this.circuitBreaker = circuitBreaker;
    }

    /// <summary>
    /// Called before the protected method is invoked
    /// </summary>
    public virtual void ProtectedCodeIsAboutToBeCalled()
    {
        //If the circuit is open, fail immediately;
        //the Open state's timer later moves the breaker to Half-Open
        if (circuitBreaker.IsOpen)
        {
            throw new OpenCircuitException();
        }
    }

    /// <summary>
    /// Operation after the circuit breaker-protected method is successfully called
    /// </summary>
    public virtual void ProtectedCodeHasBeenCalled()
    {
        circuitBreaker.IncreaseSuccessCount();
    }

    /// <summary>
    /// Called after the protected method throws an exception
    /// </summary>
    /// <param name="e">The exception thrown by the protected method</param>
    public virtual void ActUponException(Exception e)
    {
        //Increment the failure counter and save the exception
        circuitBreaker.IncreaseFailureCount(e);
        //Reset the consecutive-success counter
        circuitBreaker.ResetConsecutiveSuccessCount();
    }

    protected readonly CircuitBreaker circuitBreaker;
}

In the abstract class, the CircuitBreaker state machine is injected through the constructor. When a failure occurs, we increment the failure counter and reset the consecutive-success counter; incrementing the failure counter also records the exception.

Next we implement the classes for the three states. First, the Closed state, ClosedState:

public class ClosedState : CircuitBreakerState
{
    public ClosedState(CircuitBreaker circuitBreaker)
        : base(circuitBreaker)
    {
        //Reset failure counter
        circuitBreaker.ResetFailureCount();
    }

    public override void ActUponException(Exception e)
    {
        base.ActUponException(e);
        //If the failure count reaches the threshold, switch to the Open state
        if (circuitBreaker.FailureThresholdReached())
        {
            circuitBreaker.MoveToOpenState();
        }
    }
}

In the Closed state, when a failure occurs and the failure count reaches the threshold, the state machine switches to the Open state. The Open state, OpenState, is implemented as follows:

public class OpenState : CircuitBreakerState
{
    //System.Timers.Timer (it provides the Elapsed event and AutoReset used below)
    private readonly Timer timer;

    public OpenState(CircuitBreaker circuitBreaker)
        : base(circuitBreaker)
    {
        timer = new Timer(circuitBreaker.Timeout.TotalMilliseconds);
        timer.Elapsed += TimeoutHasBeenReached;
        timer.AutoReset = false;
        timer.Start();
    }

    //When the timeout elapses, switch automatically to the Half-Open state
    private void TimeoutHasBeenReached(object sender, ElapsedEventArgs e)
    {
        circuitBreaker.MoveToHalfOpenState();
    }

    public override void ProtectedCodeIsAboutToBeCalled()
    {
        base.ProtectedCodeIsAboutToBeCalled();
        throw new OpenCircuitException();
    }
}

The Open state maintains a timer internally. When the timeout elapses, the circuit breaker switches automatically to the Half-Open state. While the circuit breaker is open, any attempt to perform the operation throws an exception immediately.

Finally, the Half-Open state is implemented as follows:

public class HalfOpenState : CircuitBreakerState
{
    public HalfOpenState(CircuitBreaker circuitBreaker)
        : base(circuitBreaker)
    {
        //Reset the continuous success count
        circuitBreaker.ResetConsecutiveSuccessCount();
    }

    public override void ActUponException(Exception e)
    {
        base.ActUponException(e);
        //A single failure switches back to the Open state immediately
        circuitBreaker.MoveToOpenState();
    }

    public override void ProtectedCodeHasBeenCalled()
    {
        base.ProtectedCodeHasBeenCalled();
        //If the number of consecutive successes reaches the threshold, switch to the Closed state
        if (circuitBreaker.ConsecutiveSuccessThresholdReached())
        {
            circuitBreaker.MoveToClosedState();
        }
    }
}

On entering the Half-Open state, the consecutive-success counter is reset to 0. Each successful call increments the counter, and when it reaches the threshold the circuit breaker switches to the Closed state. If a call fails, the circuit breaker switches to the Open state immediately.

With the three states in place, we can implement the CircuitBreaker class:

public class CircuitBreaker
{
    private readonly object monitor = new object();
    private CircuitBreakerState state;
    public int FailureCount { get; private set; }
    public int ConsecutiveSuccessCount { get; private set; }
    public int FailureThreshold { get; private set; }
    public int ConsecutiveSuccessThreshold { get; private set; }
    public TimeSpan Timeout { get; private set; }
    public Exception LastException { get; private set; }

    public bool IsClosed
    {
        get { return state is ClosedState; }
    }

    public bool IsOpen
    {
        get { return state is OpenState; }
    }

    public bool IsHalfOpen
    {
        get { return state is HalfOpenState; }
    }

    internal void MoveToClosedState()
    {
        state = new ClosedState(this);
    }

    internal void MoveToOpenState()
    {
        state = new OpenState(this);
    }

    internal void MoveToHalfOpenState()
    {
        state = new HalfOpenState(this);
    }

    internal void IncreaseFailureCount(Exception ex)
    {
        LastException = ex;
        FailureCount++;
    }

    internal void ResetFailureCount()
    {
        FailureCount = 0;
    }

    internal bool FailureThresholdReached()
    {
        return FailureCount >= FailureThreshold;
    }

    internal void IncreaseSuccessCount()
    {
        ConsecutiveSuccessCount++;
    }

    internal void ResetConsecutiveSuccessCount()
    {
        ConsecutiveSuccessCount = 0;
    }

    internal bool ConsecutiveSuccessThresholdReached()
    {
        return ConsecutiveSuccessCount >= ConsecutiveSuccessThreshold;
    }
}

In this class:

    Variables record the status: FailureCount and ConsecutiveSuccessCount hold the number of failures and of consecutive successes, and FailureThreshold and ConsecutiveSuccessThreshold hold the maximum number of failed calls and the required number of consecutive successful calls. These are read-only to the outside world.
    A state variable of type CircuitBreakerState represents the current state of the circuit breaker.
    The properties IsOpen, IsClosed, and IsHalfOpen report the current state, and the methods MoveToOpenState, MoveToClosedState, and so on perform the state transitions. These members are simple, and their purpose is clear from their names.

The constructor takes the maximum number of failures allowed in the Closed state, the number of consecutive successes required in the Half-Open state, and the timeout for the Open state:

public CircuitBreaker(int failedthreshold, int consecutiveSuccessThreshold, TimeSpan timeout)
{
    if (failedthreshold < 1 || consecutiveSuccessThreshold < 1)
    {
        throw new ArgumentOutOfRangeException("threshold", "Threshold should be greater than 0");
    }

    if (timeout.TotalMilliseconds < 1)
    {
        throw new ArgumentOutOfRangeException("timeout", "Timeout should be greater than 0");
    }

    FailureThreshold = failedthreshold;
    ConsecutiveSuccessThreshold = consecutiveSuccessThreshold;
    Timeout = timeout;
    MoveToClosedState();
}

The circuit breaker starts in the Closed state.

Calls are then made through AttemptCall, which takes the method to execute as a delegate and runs it under the circuit breaker's protection. Locks are used to handle concurrent access.

public void AttemptCall(Action protectedCode)
{
    using (TimedLock.Lock(monitor))
    {
        state.ProtectedCodeIsAboutToBeCalled();
    }

    try
    {
        protectedCode();
    }
    catch (Exception e)
    {
        using (TimedLock.Lock(monitor))
        {
            state.ActUponException(e);
        }
        throw;
    }

    using (TimedLock.Lock(monitor))
    {
        state.ProtectedCodeHasBeenCalled();
    }
}
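AttemptCall relies on a TimedLock helper that is not part of the .NET class library and is not defined in this article. A minimal sketch of such a helper, assuming it simply wraps Monitor.TryEnter with a timeout so that a deadlock surfaces as an exception, could look like this. The 10-second default timeout is an assumption made for the example.

```csharp
using System;
using System.Threading;

// Hypothetical helper: acquires a monitor lock with a timeout and
// releases it when disposed, so it can be used in a using block.
public sealed class TimedLock : IDisposable
{
    private readonly object target;

    private TimedLock(object target)
    {
        this.target = target;
    }

    // Assumed default: give up after 10 seconds.
    public static TimedLock Lock(object target)
    {
        return Lock(target, TimeSpan.FromSeconds(10));
    }

    public static TimedLock Lock(object target, TimeSpan timeout)
    {
        if (!Monitor.TryEnter(target, timeout))
        {
            throw new TimeoutException("Timed out waiting for the lock.");
        }
        return new TimedLock(target);
    }

    public void Dispose()
    {
        Monitor.Exit(target);
    }
}
```

Note that AttemptCall holds the lock only while it consults or updates the state, not while the protected code runs, so slow remote calls do not serialize other callers.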

Finally, provide Close and Open methods to manually switch the current state.

public void Close()
{
    using (TimedLock.Lock(monitor))
    {
        MoveToClosedState();
    }
}

public void Open()
{
    using (TimedLock.Lock(monitor))
    {
        MoveToOpenState();
    }
}

Six tests

With the circuit breaker implemented, we can build unit tests for it. First we write a few helper methods to simulate a number of consecutive calls:

private static void CallXAmountOfTimes(Action codeToCall, int timesToCall)
{
    for (int i = 0; i < timesToCall; i++)
    {
        codeToCall();
    }
}

The following helper asserts that a specific exception is thrown:

private static void AssertThatExceptionIsThrown<T>(Action code) where T : Exception
{
    try
    {
        code();
    }
    catch (T)
    {
        return;
    }

    Assert.Fail("Expected exception of type {0} was not thrown", typeof(T).FullName);
}

Then, using NUnit, we can write the following test case:

[Test]
public void ClosesIfProtectedCodeSucceedsInHalfOpenState()
{
    var stub = new Stub(10);
    //Create a circuit breaker that opens after 10 failures,
    //moves to Half-Open after 5 seconds,
    //and closes again after 15 consecutive successes in Half-Open
    var circuitBreaker = new CircuitBreaker(10, 15, TimeSpan.FromMilliseconds(5000));
    Assert.That(circuitBreaker.IsClosed);
    //Fail 10 calls
    CallXAmountOfTimes(() => AssertThatExceptionIsThrown<ApplicationException>(() => circuitBreaker.AttemptCall(stub.DoStuff)), 10);

    Assert.AreEqual(10, circuitBreaker.FailureCount);

    Assert.That(circuitBreaker.IsOpen);

    // Waiting to go from Open to HalfOpen
    Thread.Sleep(6000);
    Assert.That(circuitBreaker.IsHalfOpen);
    //Succeed 15 calls
    CallXAmountOfTimes(() => circuitBreaker.AttemptCall(stub.DoStuff), 15);

    Assert.AreEqual(15, circuitBreaker.ConsecutiveSuccessCount);
    Assert.AreEqual(0, circuitBreaker.FailureCount);
    Assert.That(circuitBreaker.IsClosed);
}

This test case walks through the circuit breaker's state transitions. The circuit breaker starts in the Closed state; 10 consecutive calls then throw exceptions, which moves it to the Open state. The thread then sleeps for 6 seconds, and at the 5-second mark the state switches to Half-Open. Finally, 15 consecutive successful calls switch the state back to Closed.
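The test uses a Stub class, and the state classes throw OpenCircuitException, but neither type is defined in this article. The sketch below shows what they might look like, assuming that Stub(10) means "fail the first 10 calls with ApplicationException, then succeed", which is what the test's assertions imply. Both definitions are hypothetical.

```csharp
using System;

// Hypothetical: thrown when a call is rejected because the circuit is open.
public class OpenCircuitException : Exception
{
    public OpenCircuitException()
        : base("The circuit is open; the call was rejected.")
    {
    }
}

// Hypothetical test double: fails a fixed number of calls, then succeeds.
public class Stub
{
    private int remainingFailures;

    public Stub(int failuresBeforeSuccess)
    {
        remainingFailures = failuresBeforeSuccess;
    }

    public void DoStuff()
    {
        if (remainingFailures > 0)
        {
            remainingFailures--;
            throw new ApplicationException("Simulated failure");
        }
        // Otherwise the call succeeds and does nothing.
    }
}
```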
Seven conclusions

In an application, we often call remote services or resources (frequently provided by third parties). Such calls can fail, or hang without responding until they time out. In extreme cases, a large number of requests block on calls to the failing remote service, key system resources are exhausted, and the resulting cascading failure brings down the entire system. The circuit breaker pattern wraps these failure-prone remote calls in a state machine. When the remote service fails, incoming requests get an error response immediately and the system administrator is notified, so the failure is contained locally and the stability and reliability of the system improve.

This article introduced the circuit breaker pattern, the problems it solves, and the factors to consider when implementing it, then showed how to implement a simple circuit breaker in code and gave a test case. I hope it is helpful, especially for improving the stability and reliability of a system that calls external remote services or resources under heavy traffic.
