Solution (4): Fuse (circuit breaker) protection

Preface

When one service calls another, the calling (client) side must guard against the callee becoming unavailable. Otherwise requests pile up waiting on timeouts, the client service itself can crash, and the failure cascades into an avalanche.
The caller should therefore put a "fuse" (circuit breaker) around every call to an unreliable service.

  • Circuit breakers are not only for microservices. Any architecture that calls unreliable dependencies should use one.
  • The fuse mechanism tolerates phantom (dirty) reads: failure counts do not have to be strictly consistent, so a few extra requests may get through around the threshold.
  • A fuse can be placed in two places: at the gateway layer, or at the call site (the caller).
  • Gateway-layer fusing requires every call between services to pass through the gateway. It cannot protect calls to third-party services or direct service-to-service calls that bypass the gateway. It suits a gateway + http/grpc/rpc architecture. Where every call does go through the gateway, gateway-layer fusing is a superset of caller-side fusing.
  • Caller-side fusing requires a small code change at each call site. The implementation in this article is caller-side.

[Fuse]: When the number of failures of a given kind of request reaches a threshold within a unit of time, that kind of request enters the fused state. While fused, subsequent requests return an error immediately instead of actually being sent and waiting for a timeout. The fused state lasts for a fixed duration.

[The role of the circuit breaker]: it sits on the client side of a call, whether the client is calling a grpc service, an http service, or a tcp service.

Analysis

Fusing process: [flowchart figure]

  • A distributed fuse can be built on redis.
  • A single-point (in-process) fuse can be built in local memory.

A fuse rule needs the following 4 attributes:

  • [Key]: Which operation the fuse applies to. One failing route must not make the whole service unavailable.
  • [FuseTimes]: The fuse threshold, i.e. how many failures trigger fusing.
  • [Last]: How many seconds the fused state lasts once triggered.
  • [Perns]: The counting window in seconds. Fusing is triggered when the number of failures reaches the threshold within this many seconds.

Example:
For the get-user-info request: if it fails 50 times within 10 seconds, the request is fused for 20 seconds.

{
    "key": "/user/get-user-info/",
    "fuseTimes": 50,
    "last": 20,
    "perns": 10
}
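A rule in this JSON shape could be loaded and sanity-checked as follows. This is a minimal sketch: the `Rule` type and `ParseRule` function are hypothetical names that are not part of the article's packages, and the validation simply rejects values that could never produce a working fuse.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Rule mirrors the four attributes described above. The type is illustrative.
type Rule struct {
	Key       string `json:"key"`       // operation the fuse applies to
	FuseTimes int    `json:"fuseTimes"` // failures that trigger fusing
	Last      int    `json:"last"`      // fused duration, seconds
	Perns     int    `json:"perns"`     // counting window, seconds
}

// ParseRule decodes one rule and rejects empty keys and non-positive numbers.
func ParseRule(data []byte) (Rule, error) {
	var r Rule
	if err := json.Unmarshal(data, &r); err != nil {
		return Rule{}, err
	}
	if r.Key == "" || r.FuseTimes <= 0 || r.Last <= 0 || r.Perns <= 0 {
		return Rule{}, fmt.Errorf("invalid fuse rule: %+v", r)
	}
	return r, nil
}

func main() {
	r, err := ParseRule([]byte(`{"key":"/user/get-user-info/","fuseTimes":50,"last":20,"perns":10}`))
	fmt.Println(r, err)
}
```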

Implementation

1. A distributed fuse based on redis

package redistool

import (
	"fmt"
	"github.com/garyburd/redigo/redis"
)

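// Fuse implemented on top of redis.
// Example: fusing an http request.
/*

var fuseScheme = NewFuse(20, 30, 10) // 20 failures within 10 seconds trigger the fuse, which then lasts at least 30 seconds
func HTTPUtil(url string, ...) error {
    key := url
    if !fuseScheme.FuseOk(conn, key) {
        return fmt.Errorf("request '%s' is fused", key)
    }

    ...

    resp, e := c.Do(req)
    if e != nil {
        fuseScheme.Fail(conn, key)
        return e
    }

    if resp.StatusCode() == 404 || resp.StatusCode() == 500 {
        fuseScheme.Fail(conn, key)
        return fmt.Errorf("unexpected status %d", resp.StatusCode())
    }
    return nil
}
*/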

type Fuse struct {
	fuseTimes int // number of failures that triggers the fuse. Not strictly consistent, because FuseOk() may read a stale value.
	last      int // how long the fused state lasts, in seconds
	perns     int // counting window in seconds: <fuseTimes> failures within <perns> seconds trigger the fuse
}

func NewFuse(fuseTimes int, last int, perns int) Fuse {
	return Fuse{
		fuseTimes: fuseTimes,
		last:      last,
		perns:     perns,
	}
}

// FuseOk reports whether requests for key may proceed.
// true: not fused, let the request through.
// false: fused, reject the request.
func (f Fuse) FuseOk(conn redis.Conn, key string) bool {
	fuseKey := fmt.Sprintf("is_fused:%s", key)
	rs, e := redis.String(conn.Do("get", fuseKey))

	// The key does not exist: not fused.
	if e == redis.ErrNil {
		fmt.Printf("get '%s' not fused \n", fuseKey)
		return true
	}

	if rs == "fused" {
		fmt.Printf("get '%s' fused \n", fuseKey)
		return false
	}

	// Any other redis error or unexpected value is treated as fused.
	return false
}

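// Fail must be called each time a request fails.
// Once the failure count passes the threshold, f.FuseOk(conn, key) returns false and the caller uses that to fuse the operation.
func (f Fuse) Fail(conn redis.Conn, key string) {
	// MaxPerNSecond is called unqualified, so it is a helper in this package; its body is not shown in this article.
	ok := MaxPerNSecond(conn, key, f.fuseTimes, int64(f.perns))

	// Below the configured threshold: nothing to do.
	if ok {
		return
	}

	// Threshold reached: mark the key as fused for f.last seconds.
	fuseKey := fmt.Sprintf("is_fused:%s", key)
	fmt.Printf("set '%s' fused\n", fuseKey)
	conn.Do("setex", fuseKey, f.last, "fused")
}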

Drawbacks

  • Failures are counted in fixed windows, so the effective threshold lies between a and 2a, where a is the fuse threshold: failures that straddle a window boundary are split across two counters.
  • Every check and every recorded failure costs a network round trip to redis.

Since fusing does not need strong consistency, neither drawback is a big problem.
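The "a to 2a" effect can be reproduced with a small simulation. This sketch assumes the counter uses fixed windows keyed by `unix time / perns`, the same scheme the single-point version in this article uses; whether the redis helper `MaxPerNSecond` counts the same way is an assumption, since its body is not shown. The names `windowCounter` and `burst` are illustrative.

```go
package main

import "fmt"

// windowCounter counts failures per fixed window of perns seconds.
type windowCounter struct {
	perns  int64
	counts map[string]int
}

// fail records one failure at unix second ts and reports whether the
// count for that window has gone past threshold.
func (w *windowCounter) fail(key string, ts int64, threshold int) bool {
	bucket := fmt.Sprintf("%s:%d", key, ts/w.perns)
	w.counts[bucket]++
	return w.counts[bucket] > threshold
}

// burst records n failures at second ts and reports whether any of them tripped.
func burst(w *windowCounter, key string, ts int64, n, threshold int) bool {
	tripped := false
	for i := 0; i < n; i++ {
		if w.fail(key, ts, threshold) {
			tripped = true
		}
	}
	return tripped
}

func main() {
	w := &windowCounter{perns: 10, counts: map[string]int{}}
	// 50 failures at second 9 and 50 more at second 10 land in different
	// windows, so 100 failures within two seconds never trip a threshold of 50.
	a := burst(w, "/user/get-user-info/", 9, 50, 50)
	b := burst(w, "/user/get-user-info/", 10, 50, 50)
	fmt.Println(a, b) // prints "false false"
}
```

A sliding window would close this gap at the cost of more bookkeeping. For a fuse, where approximate counts are acceptable, the fixed window is usually good enough.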

2. A single-point fuse

There are three ways to build a single-point fuse:

  • A single map guarded by a lock
  • A lock-free map
  • Hashing plus several locked maps

With the first, every fuse key contends for the same lock.
The second requires all fuse keys to be registered in the map while the program initializes. After that the map itself stays read-only, and the values are modified directly through the atomic package.
The third uses a hash to spread keys across slots so that different keys rarely contend, which requires a well-designed sharded map.
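The second approach can be sketched with the standard library alone. This is an illustration only, with hypothetical names (`counters`, `register`, `fail`): keys are registered once before any concurrent use, the map is never written afterwards, and each value is updated through `sync/atomic`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counters is filled once during initialization and never written afterwards,
// so concurrent reads of the map itself need no lock.
var counters = map[string]*int64{}

// register must only be called before any goroutine starts calling fail.
func register(keys ...string) {
	for _, k := range keys {
		counters[k] = new(int64)
	}
}

// fail atomically increments the failure count of a registered key.
// Unregistered keys are not counted and are reported with ok == false.
func fail(key string) (count int64, ok bool) {
	p, ok := counters[key]
	if !ok {
		return 0, false
	}
	return atomic.AddInt64(p, 1), true
}

func main() {
	register("/user/get-user-info/")
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fail("/user/get-user-info/")
		}()
	}
	wg.Wait()
	fmt.Println(atomic.LoadInt64(counters["/user/get-user-info/"])) // prints 100
}
```

The cost of this design is visible in `fail`: a key that was not registered up front cannot be counted at all, which is the maintenance burden mentioned above.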

Different business routes clearly should not contend with each other, and registering every key up front adds maintenance burden for a team, so the implementation below uses the third approach.
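The idea behind the third approach can be sketched as a sharded counter. This is not how `cmap` is implemented; it is a standard-library illustration with hypothetical names (`ShardedCounter`, `Incr`, `Get`) that shows how hashing a key to one of several independently locked maps keeps unrelated keys from sharing a lock. Expiry of entries is left out.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shard is one locked map. Keys in different shards never contend.
type shard struct {
	mu sync.Mutex
	m  map[string]int64
}

// ShardedCounter spreads keys over slotNum independent shards.
type ShardedCounter struct {
	shards []*shard
}

// NewShardedCounter creates the shards. slotNum must be greater than zero.
func NewShardedCounter(slotNum int) *ShardedCounter {
	s := &ShardedCounter{shards: make([]*shard, slotNum)}
	for i := range s.shards {
		s.shards[i] = &shard{m: map[string]int64{}}
	}
	return s
}

// shardFor picks the shard for key by hashing it.
func (s *ShardedCounter) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s.shards[h.Sum32()%uint32(len(s.shards))]
}

// Incr adds one to key's counter and returns the new value.
func (s *ShardedCounter) Incr(key string) int64 {
	sh := s.shardFor(key)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	sh.m[key]++
	return sh.m[key]
}

// Get returns key's current counter value.
func (s *ShardedCounter) Get(key string) int64 {
	sh := s.shardFor(key)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	return sh.m[key]
}

func main() {
	c := NewShardedCounter(128)
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Incr(fmt.Sprintf("/route/%d", i%2))
		}(i)
	}
	wg.Wait()
	fmt.Println(c.Get("/route/0"), c.Get("/route/1")) // prints "500 500"
}
```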

package fuse

import (
	"fmt"
	"github.com/fwhezfwhez/cmap"
	"time"
)

type Fuse struct {
	m *cmap.MapV2

	fuseTimes int
	last      int // second
	perns     int // second
}

func NewFuse(fuseTimes int, last int, perns int, slotNum int) Fuse {
	return Fuse{
		m:         cmap.NewMapV2(nil, slotNum, 30*time.Minute),
		fuseTimes: fuseTimes,
		last:      last,
		perns:     perns,
	}
}

func (f *Fuse) FuseTimes() int {
	return f.fuseTimes
}
func (f *Fuse) Last() int {
	return f.last
}
func (f *Fuse) Perns() int {
	return f.perns
}

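// FuseOk reports whether requests for key may proceed.
// true: not fused, let the request through.
// false: fused, reject the request.
func (f *Fuse) FuseOk(key string) bool {
	fuseKey := fmt.Sprintf("is_fused:%s", key)
	v, exist := f.m.Get(fuseKey)

	if !exist {
		return true
	}

	vs, ok := v.(string)
	if ok && vs == "fused" {
		return false
	}
	// An entry exists but holds an unexpected value: treat it as fused.
	return false
}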

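// Fail must be called each time a request fails.
// Once the failure count passes the threshold, f.FuseOk(key) returns false and the caller uses that to fuse the operation.
func (f *Fuse) Fail(key string) {
	// Failures are counted per fixed window of f.perns seconds.
	multi := time.Now().Unix() / int64(f.perns)

	timeskey := fmt.Sprintf("%s:%d", key, multi)

	rs := f.m.IncrByEx(timeskey, 1, f.perns)

	var ok bool
	ok = rs <= int64(f.fuseTimes)

	// At or below the configured threshold: nothing to do.
	if ok {
		return
	}

	// Threshold exceeded: mark the key as fused for f.last seconds.
	fuseKey := fmt.Sprintf("is_fused:%s", key)
	f.m.SetEx(fuseKey, "fused", f.last)
}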


How does a service plug the fuse in? Here is an example for http.

Attaching the fuse to all http apis

  • Here the fuse is attached on the service provider or gateway itself. If that provider/gateway process goes down, the fuse goes down with it.
  • Fusing only affects the failing route, not the whole service.
  • To use this fuse, every healthy response must use status 200. (The code checks against 410 because many developers routinely use 307, 400 and 403 for redirects, invalid parameters and authentication errors. These codes occur easily in normal operation and must not count toward fusing.)

package middleware

import (
	"fmt"
	"github.com/fwhezfwhez/fuse"
	"github.com/gin-gonic/gin"
)

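// fm is the package-level fuse: 20 failures within 5 seconds fuse a route for 10 seconds, using 128 map slots.
var fm = fuse.NewFuse(20, 10, 5, 128)

// ResetFm replaces the package-level fuse with new settings.
func ResetFm(fuseTimes int, last int, pern int, slotNum int) {
	fm = fuse.NewFuse(fuseTimes, last, pern, slotNum)
}

// GinHTTPFuse rejects requests to a fused route and records a failure for every response whose status is above 410.
func GinHTTPFuse(c *gin.Context) {
	if ok := fm.FuseOk(c.FullPath()); !ok {
		c.AbortWithStatusJSON(400, gin.H{
			"tip": fmt.Sprintf("http api '%s' has be fused for setting {%d times/%ds} and will lasting for %d second to retry", c.FullPath(), fm.FuseTimes(), fm.Perns(), fm.Last()),
		})
		return
	}

	c.Next()

	if c.Writer.Status() > 410 {
		fm.Fail(c.FullPath())
		return
	}
}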
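The `> 410` check in the middleware decides which responses count toward fusing. The sketch below isolates that rule in a hypothetical helper, `countsAsFailure`, to make its consequences visible: 307, 400 and 403 are ignored as intended, while every status from 411 upward, including 4xx codes such as 429, is counted as a failure.

```go
package main

import "fmt"

// countsAsFailure mirrors the `c.Writer.Status() > 410` check in the middleware.
func countsAsFailure(status int) bool {
	return status > 410
}

func main() {
	for _, s := range []int{200, 307, 400, 403, 410, 429, 500, 502} {
		fmt.Printf("%d -> %v\n", s, countsAsFailure(s))
	}
}
```

If a service uses codes above 410 for ordinary business outcomes, a stricter rule such as `status >= 500` may fit better. That is a design choice outside the article's code.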

Test case:

package middleware

import (
	"fmt"
	"github.com/gin-gonic/gin"
	"io/ioutil"
	"net/http"
	"sync"
	"testing"
	"time"
)

func TestGinFuse(t *testing.T) {
	go func() {
		r := gin.Default()
		// Attach the fuse middleware.
		r.Use(GinHTTPFuse)
		r.GET("/", func(c *gin.Context) {
			c.JSON(500, gin.H{"message": "pretend hung up"})
		})
		r.Run(":8080")
	}()

	// Give the server time to start.
	time.Sleep(3 * time.Second)

	wg := sync.WaitGroup{}
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Spread the requests over up to 20ms.
			time.Sleep(time.Duration(time.Now().UnixNano()%20) * time.Millisecond)
			rsp, e := http.Get("http://localhost:8080/")
			if e != nil {
				panic(e)
			}
			defer rsp.Body.Close()

			bdb, e := ioutil.ReadAll(rsp.Body)
			if e != nil {
				panic(e)
			}

			fmt.Println(rsp.StatusCode, string(bdb))
		}()
	}

	// The fuse lasts 10s. After 15s it has expired, so the route answers 500 again.
	time.Sleep(15 * time.Second)
	rsp, e := http.Get("http://localhost:8080/")
	if e != nil {
		panic(e)
	}
	defer rsp.Body.Close()

	bdb, e := ioutil.ReadAll(rsp.Body)
	if e != nil {
		panic(e)
	}

	fmt.Println(rsp.StatusCode, string(bdb))
	wg.Wait()
}

Test Results:

// Before the threshold is reached, the handler's own error is returned
...
500 {"message":"pretend hung up"}
500 {"message":"pretend hung up"}
...
// Once the threshold is reached, requests are fused immediately
...
400 {"tip":"http api '/' has be fused for setting {20 times/5s} and will lasting for 10 second to retry"}

400 {"tip":"http api '/' has be fused for setting {20 times/5s} and will lasting for 10 second to retry"}

400 {"tip":"http api '/' has be fused for setting {20 times/5s} and will lasting for 10 second to retry"}

// After sleeping until the fuse expires, the error is returned again. This cycle repeats until the service recovers
500 {"message":"pretend hung up"}

Conclusion

  1. How should the fuse be tuned in production? Follow these rules:
  • Every failure should feed an alarm mechanism, and the alarm threshold must be lower than the fuse threshold, so that you receive an alarm before fusing happens.
  • The fuse threshold [fuseTimes]/[perns] can be set reasonably high, and the fuse duration [last] reasonably short.
  2. Why not adopt an open-source fuse component?
  • A fuse is not complicated to implement.
  • Gateway-integrated fuse components only take effect when a service is called exclusively through the gateway: it must not be called from outside or directly by other sub-services. Otherwise the fuse has no effect.
  • For historical reasons, many services call each other directly without going through the gateway.
  • Calls to third-party services need a fuse scoped to one specific request.
  3. Does the fuse need to be distributed?
  • No. A single-point fuse and a distributed fuse behave almost the same, and there is no need to share failure counts between machines. Sharing can even hurt: if one machine cannot reach service B over the network while the others can, that machine alone fills up the shared failure count and the healthy machines get fused along with it.
  4. When the fused state ends, should the accumulated failure count be cleared, or left to expire naturally with its key?
  • Natural expiry (of [last]) is the better choice. First, the key is short-lived: 10-15 seconds is suitable and causes no serious consequences. Second, a service that has hit a fusing scenario usually cannot recover quickly anyway, so those 10-15 seconds do not matter.
  • Clearing the count carries a hidden risk. An overloaded server (high cpu/mem) may serve requests only intermittently. Each time the count is cleared, another threshold's worth of requests is allowed to fail. Because fusing already tolerates phantom reads, the real number of failures may be much higher than the threshold. So for a route that has already been failing, do not clear its count: once it fails again, it is better to make it more likely to fuse so that other healthy nodes can serve the traffic.
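The argument in point 3 can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the names `counter` and `simulate` are hypothetical, time windows are ignored, and a counter is fused once its failure count goes past the threshold. One of three instances cannot reach service B and reports 51 failures against a threshold of 50.

```go
package main

import "fmt"

// counter is a stand-in for one failure counter with a threshold.
type counter struct {
	fails     int
	threshold int
}

func (c *counter) fail()       { c.fails++ }
func (c *counter) fused() bool { return c.fails > c.threshold }

// simulate lets every broken instance report n failures and returns, per
// instance, whether it ends up fused. With shared == true all instances use
// one counter (distributed fuse); otherwise each has its own (single-point).
func simulate(broken []bool, n, threshold int, shared bool) []bool {
	sharedCounter := &counter{threshold: threshold}
	counters := make([]*counter, len(broken))
	for i := range counters {
		if shared {
			counters[i] = sharedCounter
		} else {
			counters[i] = &counter{threshold: threshold}
		}
	}
	for i, isBroken := range broken {
		if !isBroken {
			continue
		}
		for j := 0; j < n; j++ {
			counters[i].fail()
		}
	}
	out := make([]bool, len(broken))
	for i, c := range counters {
		out[i] = c.fused()
	}
	return out
}

func main() {
	broken := []bool{true, false, false} // only instance 0 cannot reach service B
	fmt.Println("shared:", simulate(broken, 51, 50, true))  // prints "shared: [true true true]"
	fmt.Println("local: ", simulate(broken, 51, 50, false)) // prints "local:  [true false false]"
}
```

With a shared counter the two healthy instances are fused by failures they never saw. With local counters only the instance that actually failed stops calling the service.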

Origin blog.csdn.net/fwhezfwhez/article/details/114447362