Improve system availability with Hystrix

(Image source: https://github.com/Netflix/Hystrix/wiki)

No service is available 100% of the time, and the network is fragile as well. When a service I depend on becomes unavailable, will my service be dragged down with it? When the network is unstable, will my service be dragged down too? These questions do not arise in a stand-alone environment, but they cannot be ignored in a distributed one. Suppose my service has 5 dependencies, each with 99.95% availability, that is, a little over 4 hours of downtime per year. My own availability is then at most 99.95% to the power of 5, about 99.75%, which is close to one day of downtime per year. Network instability and additional dependencies push availability lower still. Given that the services we depend on are bound to be unavailable at times, and that the network is bound to be unstable, how should we design our own services? In other words, how do we design for failure?
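The arithmetic above can be checked with a short sketch. It assumes that the dependencies fail independently and that every one of them must be up for the service to work; the class and method names are illustrative only.

```csharp
using System;

public static class Availability
{
    // Upper bound on availability when all `count` independent
    // dependencies, each with the same availability, must be up.
    public static double Compound(double perDependency, int count)
    {
        return Math.Pow(perDependency, count);
    }

    // Expected downtime in hours over a 365-day year.
    public static double DowntimeHoursPerYear(double availability)
    {
        return (1.0 - availability) * 365 * 24;
    }

    public static void Main()
    {
        double single = 0.9995;
        double combined = Compound(single, 5);

        // Roughly 4.4 hours for one dependency, roughly 21.9 hours for five.
        Console.WriteLine("One dependency:    {0:F2} hours of downtime per year",
            DowntimeHoursPerYear(single));
        Console.WriteLine("Five dependencies: {0:F2} hours of downtime per year",
            DowntimeHoursPerYear(combined));
    }
}
```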

Michael T. Nygard summarizes many patterns for improving system availability in his excellent book "Release It!". Two of them are especially important:

  1. Use timeouts
  2. Use a circuit breaker

First, every call to an external dependency over the network must have a timeout. Under healthy conditions, a remote call within a local area network usually returns within tens of milliseconds. When the network is congested or the dependency is unavailable, however, the call may take many seconds or hang indefinitely. A remote call usually occupies a thread or process. If the response is too slow or never arrives, that thread or process is stuck and will not be released any time soon. Because threads and processes consume system resources, my own service's resources are eventually exhausted and my service becomes unavailable. Suppose my service depends on many services. If just one non-core dependency is unavailable and there is no timeout, that non-core dependency can bring my service down, even though in theory my service could keep working in most cases without it.
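As a rough illustration of bounding the wait on a slow call, here is a minimal sketch that uses only the standard .NET task library, not Hystrix. The helper name TryCallWithTimeout is hypothetical. Note that the slow call is abandoned rather than cancelled when the deadline passes, so it still occupies a thread-pool thread until it finishes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutDemo
{
    // Runs `call` on a thread-pool thread and waits at most `timeout`.
    // Returns true with the result if the call finished in time,
    // false if the deadline passed first.
    public static bool TryCallWithTimeout<T>(Func<T> call, TimeSpan timeout, out T result)
    {
        Task<T> task = Task.Run(call);
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }

        // The caller is released here; the abandoned call keeps running.
        result = default(T);
        return false;
    }

    public static void Main()
    {
        int value;
        bool ok = TryCallWithTimeout(
            () => { Thread.Sleep(2000); return 42; }, // simulated slow dependency
            TimeSpan.FromMilliseconds(200),
            out value);

        Console.WriteLine(ok ? "got " + value : "timed out");
    }
}
```

The caller gets its thread back after 200 ms instead of 2 seconds, which is the property the paragraph above is concerned with.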

Circuit breakers are familiar to all of us (do you know how to change a fuse?). Without circuit protection in your home, an overload or short circuit would not disconnect the circuit; the wires would heat up, start a fire, and burn down the house. With protection in place, an overload burns out the fuse first and disconnects the circuit, which prevents a greater disaster (although you then have to replace the fuse).

When calls from our service to a dependency are timing out in large numbers, letting new requests through makes little sense; it only consumes resources needlessly. Even with a timeout of 1 second, if the dependency is known to be unavailable and 100 more requests are sent to it anyway, 100 threads each waste 1 second. A circuit breaker avoids this waste. We place a circuit breaker between our service and its dependency and track call results in real time. When timeouts or failures reach a certain threshold (for example, 50% of requests timing out, or 20 failures in a row), the circuit breaker opens and subsequent requests fail immediately without consuming resources. At intervals (for example, every 5 minutes), the circuit breaker then tries to close itself (that is, replace the fuse) to see whether the dependency is back in service.
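The behaviour described above can be sketched as a small class. This is a simplified illustration, not how Hystrix implements it: it trips on consecutive failures only, SimpleCircuitBreaker and its members are hypothetical names, and the injectable clock exists only to make the behaviour easy to test.

```csharp
using System;

public class SimpleCircuitBreaker
{
    private readonly int failureThreshold;
    private readonly TimeSpan retryInterval;
    private readonly Func<DateTime> clock;
    private readonly object sync = new object();
    private int consecutiveFailures;
    private DateTime openedAt;
    private bool isOpen;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan retryInterval, Func<DateTime> clock = null)
    {
        this.failureThreshold = failureThreshold;
        this.retryInterval = retryInterval;
        this.clock = clock ?? (() => DateTime.UtcNow);
    }

    public bool IsOpen
    {
        get { lock (sync) { return isOpen; } }
    }

    // Runs `call` unless the breaker is open; returns `fallback`
    // when the call is rejected or throws.
    public T Execute<T>(Func<T> call, T fallback)
    {
        lock (sync)
        {
            // While open, fail fast until the retry interval has elapsed.
            if (isOpen && clock() - openedAt < retryInterval)
            {
                return fallback;
            }
        }

        try
        {
            // Either the breaker is closed, or this is a trial call.
            // (Simplification: several threads may make trial calls at once.)
            T result = call();
            lock (sync) { consecutiveFailures = 0; isOpen = false; }
            return result;
        }
        catch (Exception)
        {
            lock (sync)
            {
                consecutiveFailures++;
                if (isOpen || consecutiveFailures >= failureThreshold)
                {
                    // Trip the breaker, or re-open it after a failed trial call.
                    isOpen = true;
                    openedAt = clock();
                }
            }
            return fallback;
        }
    }
}
```

A caller might create one breaker per dependency, for example `new SimpleCircuitBreaker(20, TimeSpan.FromMinutes(5))`, and route every call to that dependency through `Execute`.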

Timeouts and circuit breakers protect our service well from the unavailability of its dependencies. For details, see the article "Using Circuit Breaker Design Patterns to Protect Software". Implementing these two patterns by hand is still fairly complex, though. Fortunately, Netflix's open source Hystrix framework greatly simplifies both. Hystrix is a latency and fault tolerance library for distributed systems: it isolates points of access to remote systems, services, and third-party libraries, stops cascading failures, and keeps complex distributed systems resilient in the face of inevitable failure. A .NET port is available on Codeplex: https://hystrixnet.codeplex.com/ .

To use Hystrix, wrap each call to a remote dependency in a Command:

    using System;
    using System.Net;
    using System.Xml.Linq;
    // Also requires the Hystrix.NET namespace that provides HystrixCommand<T>.

    public class GetCurrentTimeCommand : HystrixCommand<long>
    {
        // Last successfully fetched value, reused as the fallback result.
        private static long currentTimeCache;

        public GetCurrentTimeCommand()
            : base(HystrixCommandSetter.WithGroupKey("TimeGroup")
                .AndCommandKey("GetCurrentTime")
                .AndCommandPropertiesDefaults(new HystrixCommandPropertiesSetter()
                    .WithExecutionIsolationThreadTimeout(TimeSpan.FromSeconds(1.0))
                    .WithExecutionIsolationThreadInterruptOnTimeout(true)))
        {
        }

        // The remote call that Hystrix protects.
        protected override long Run()
        {
            using (WebClient wc = new WebClient())
            {
                string content = wc.DownloadString("http://tycho.usno.navy.mil/cgi-bin/time.pl");
                XDocument document = XDocument.Parse(content);
                currentTimeCache = long.Parse(document.Element("usno").Element("t").Value);
                return currentTimeCache;
            }
        }

        // Called when Run() times out or fails.
        protected override long GetFallback()
        {
            return currentTimeCache;
        }
    }

Then invoke the Command wherever it is needed:

    GetCurrentTimeCommand command = new GetCurrentTimeCommand();
    long currentTime = command.Execute();

The above is a synchronous call. If the business logic allows it and performance matters more, you may choose an asynchronous call instead.

In this example, it does not matter whether WebClient.DownloadString() has a timeout of its own (you will find that many remote-call interfaces provide none). Once the call is wrapped in a HystrixCommand, a timeout is enforced, and it defaults to 1 second. You can of course adjust the Command's timeout in the constructor as needed, for example to 2 seconds:

    HystrixCommandSetter.WithGroupKey("TimeGroup")
        .AndCommandKey("GetCurrentTime")
        .AndCommandPropertiesDefaults(new HystrixCommandPropertiesSetter()
            .WithExecutionIsolationThreadTimeout(TimeSpan.FromSeconds(2.0))
            .WithExecutionIsolationThreadInterruptOnTimeout(true))

When a command times out or fails, Hystrix tries to call a fallback, which supplies an alternative result. To provide a fallback for a HystrixCommand, just override the protected virtual R GetFallback() method.

By default, Hystrix allocates a dedicated thread pool for a Command. The number of threads in the pool is fixed, which is itself a protection mechanism: if you depend on many services, you do not want calls to one of them to consume so many threads that none are left for calling the others. The default size of this thread pool is 10, so at most 10 commands can execute concurrently, and calls beyond that number have to queue. If the queue grows too long (more than 5 by default), Hystrix immediately falls back or throws an exception.
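The isolation idea can be illustrated with a minimal sketch. It uses a semaphore to cap concurrent calls to one dependency instead of a dedicated thread pool, and it rejects immediately rather than queueing, so it is a simplification of what is described above. Bulkhead is a hypothetical name, not a Hystrix type.

```csharp
using System;
using System.Threading;

public class Bulkhead
{
    private readonly SemaphoreSlim slots;

    public Bulkhead(int maxConcurrent)
    {
        slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
    }

    // Runs `call` if a slot is free; otherwise returns `fallback` at once,
    // so one slow dependency cannot absorb every caller thread.
    public T Execute<T>(Func<T> call, T fallback)
    {
        if (!slots.Wait(0))
        {
            return fallback; // limit reached: reject without waiting
        }

        try
        {
            return call();
        }
        finally
        {
            slots.Release();
        }
    }
}
```

With one Bulkhead instance per dependency, a stalled dependency can hold at most `maxConcurrent` callers; everyone else gets the fallback immediately.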

Depending on your needs, you may want to adjust a Command's thread pool size. For example, if calls to a dependency have an average response time of 200 ms and the peak QPS is 200, the concurrency is at least 0.2 x 200 = 40 (Little's Law). Allowing some headroom, a thread pool size of 60 may be more appropriate:

    public GetCurrentTimeCommand()
        : base(HystrixCommandSetter.WithGroupKey("TimeGroup")
            .AndCommandKey("GetCurrentTime")
            .AndCommandPropertiesDefaults(new HystrixCommandPropertiesSetter()
                .WithExecutionIsolationThreadTimeout(TimeSpan.FromSeconds(1.0))
                .WithExecutionIsolationThreadInterruptOnTimeout(true))
            .AndThreadPoolPropertiesDefaults(new HystrixThreadPoolPropertiesSetter()
                .WithCoreSize(60) // size of thread pool
                .WithKeepAliveTime(TimeSpan.FromMinutes(1.0)) // minutes to keep a thread alive (though in practice this doesn't get used as by default we set a fixed size)
                .WithMaxQueueSize(100) // size of queue (but we never allow it to grow this big ... this can't be dynamically changed so we use 'queueSizeRejectionThreshold' to artificially limit and reject)
                .WithQueueSizeRejectionThreshold(10) // number of items in queue at which point we reject (this can be dynamically changed)
                .WithMetricsRollingStatisticalWindow(10000) // milliseconds for rolling number
                .WithMetricsRollingStatisticalWindowBuckets(10)))
    {
    }
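The sizing arithmetic behind the value 60 can be sketched as follows. The 50% headroom factor is an assumption chosen to reproduce the number used above, and the class and method names are illustrative only.

```csharp
using System;

public static class ThreadPoolSizing
{
    // Little's Law: average concurrency = arrival rate x average time in system.
    public static double RequiredConcurrency(double peakRequestsPerSecond, double averageLatencyMs)
    {
        return peakRequestsPerSecond * averageLatencyMs / 1000.0;
    }

    // Adds proportional headroom and rounds up to a whole number of threads.
    public static int PoolSize(double peakRequestsPerSecond, double averageLatencyMs, double headroom)
    {
        double concurrency = RequiredConcurrency(peakRequestsPerSecond, averageLatencyMs);
        return (int)Math.Ceiling(concurrency * (1.0 + headroom));
    }

    public static void Main()
    {
        // 200 requests per second at 200 ms each keeps 40 calls in flight.
        Console.WriteLine(RequiredConcurrency(200, 200));
        // With 50% headroom (an assumed margin) that becomes 60 threads.
        Console.WriteLine(PoolSize(200, 200, 0.5));
    }
}
```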

After all this, Hystrix's circuit breaker has not yet been mentioned. For users, the circuit breaker is enabled by default, and the programming interface rarely needs to deal with it. The mechanism is the one described earlier: Hystrix counts Command invocations and tracks the proportion of failures. By default, when more than 50% of invocations fail, the circuit breaker opens, and for a period of time afterwards Command invocations fail immediately (or fall back). After 5 seconds, Hystrix tries to close the circuit breaker to see whether requests can be served normally again. The following lines of Hystrix source code show how it counts the failure rate:

    public HealthCounts GetHealthCounts()
    {
        // we put an interval between snapshots so high-volume commands don't
        // spend too much unnecessary time calculating metrics in very small time periods
        long lastTime = this.lastHealthCountsSnapshot;
        long currentTime = ActualTime.CurrentTimeInMillis;
        if (currentTime - lastTime >= this.properties.MetricsHealthSnapshotInterval.Get().TotalMilliseconds || this.healthCountsSnapshot == null)
        {
            if (Interlocked.CompareExchange(ref this.lastHealthCountsSnapshot, currentTime, lastTime) == lastTime)
            {
                // our thread won setting the snapshot time so we will proceed with generating a new snapshot
                // losing threads will continue using the old snapshot
                long success = counter.GetRollingSum(HystrixRollingNumberEvent.Success);
                long failure = counter.GetRollingSum(HystrixRollingNumberEvent.Failure); // fallbacks occur on this
                long timeout = counter.GetRollingSum(HystrixRollingNumberEvent.Timeout); // fallbacks occur on this
                long threadPoolRejected = counter.GetRollingSum(HystrixRollingNumberEvent.ThreadPoolRejected); // fallbacks occur on this
                long semaphoreRejected = counter.GetRollingSum(HystrixRollingNumberEvent.SemaphoreRejected); // fallbacks occur on this
                long shortCircuited = counter.GetRollingSum(HystrixRollingNumberEvent.ShortCircuited); // fallbacks occur on this
                long totalCount = failure + success + timeout + threadPoolRejected + shortCircuited + semaphoreRejected;
                long errorCount = failure + timeout + threadPoolRejected + shortCircuited + semaphoreRejected;
                healthCountsSnapshot = new HealthCounts(totalCount, errorCount);
            }
        }
        return healthCountsSnapshot;
    }

Here, failure counts errors raised by the command itself, success needs no explanation, and timeout counts timeouts. threadPoolRejected counts command invocations rejected because the thread pool was full, shortCircuited counts invocations rejected while the circuit breaker was open, and semaphoreRejected counts invocations rejected under the semaphore mechanism (used instead of the thread pool).
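To connect these counts to the 50% rule mentioned earlier, here is a minimal sketch of how a total and an error count might be turned into an open-or-stay-closed decision. It is an illustration under the article's description ("more than 50%"), not the actual Hystrix code; HealthCheck and its methods are hypothetical names.

```csharp
using System;

public static class HealthCheck
{
    // Whole-number error percentage; 0 when nothing has been recorded yet.
    public static int ErrorPercentage(long totalCount, long errorCount)
    {
        if (totalCount <= 0)
        {
            return 0;
        }
        return (int)(errorCount * 100 / totalCount);
    }

    // True when the error percentage exceeds the threshold,
    // for example 50 for the default described above.
    public static bool ShouldOpen(long totalCount, long errorCount, int thresholdPercent)
    {
        return ErrorPercentage(totalCount, errorCount) > thresholdPercent;
    }
}
```

For instance, 6 errors out of 10 calls gives 60%, which exceeds 50 and would open the breaker, while 5 out of 10 gives exactly 50% and would not under this strict comparison.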

 

 

http://www.cnblogs.com/shanyou/p/4752226.html
