Improve system reliability - automatic retry

Introduction

Retry is an important way to improve system availability, and business code is often full of retry logic. Is there a way to implement retry non-intrusively, so that the business code is completely unaware of it?

Business retry

Typical business code:

// ExampleRPCSend simulates an RPC call that always fails.
func ExampleRPCSend(ctx context.Context, msg string) error {
   fmt.Printf("\nsend msg %v", msg)
   return errors.New("rpc err")
}

// RetrySend calls ExampleRPCSend up to 3 times, pausing 10ms after each failure.
func RetrySend(ctx context.Context, msg string) error {
   var err error
   for i := 0; i < 3; i++ {
      err = ExampleRPCSend(ctx, msg)
      if err == nil {
         break
      }
      time.Sleep(10 * time.Millisecond)
   }
   return err
}

Usage example:

err := RetrySend(ctx, "example")

This retry loop is clearly generic logic that can be reused. Can it be kept out of the business code in an elegant way?

Simple wrapper

func RetryFun(ctx context.Context, fn func(ctx context.Context) error) error {
   var err error
   for i := 0; i < 3; i++ {
      err = fn(ctx)
      if err == nil {
         break
      }
      time.Sleep(10 * time.Millisecond)
   }
   return err
}

Usage example:

err = RetryFun(ctx, func(ctx context.Context) error {
   return ExampleRPCSend(ctx, "example")
})

Limitation

Only functions with one fixed signature can be retried this way; any other function has to be wrapped in a closure at every call site.

Generic method encapsulation

Following the referenced blog post, it is easy to write a generic retry method with reflection, and to encapsulate the error check and the retry backoff along the way:

// Decorate sets *decoPtr to a wrapper around f that retries up to 3 times.
// decoPtr must be a pointer to a func variable of the same type as f.
func Decorate(decoPtr, f interface{}) error {
   fn := reflect.ValueOf(f)
   decoratedFunc := reflect.ValueOf(decoPtr).Elem()
   logicFunc := func(in []reflect.Value) []reflect.Value {
      var ret []reflect.Value
      for i := 0; i < 3; i++ {
         ret = fn.Call(in)
         if !needRetryErr(ret) {
            break
         }
         time.Sleep(10 * time.Millisecond)
      }
      return ret
   }
   v := reflect.MakeFunc(fn.Type(), logicFunc)
   decoratedFunc.Set(v)
   return nil
}
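The text mentions a retry backoff, while the snippet above sleeps a fixed 10ms. As a minimal sketch, the fixed sleep could be replaced by an exponential backoff with jitter. The helper name backoffDelay, its parameters, and the "half fixed, half random" jitter scheme are assumptions for illustration and do not come from the original post.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay is a hypothetical helper (not part of the original post).
// It doubles base once per attempt (attempt is 0-based), caps the result at
// maxDelay, and then returns a random duration between half of that value
// and the full value, so that many callers do not retry in lockstep.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	if base <= 0 {
		return 0
	}
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	half := d / 2
	return half + time.Duration(rand.Int63n(int64(d-half)+1))
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, 10*time.Millisecond, 80*time.Millisecond))
	}
}
```

Inside the retry loop, time.Sleep(backoffDelay(i, base, maxDelay)) would then take the place of the fixed sleep.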

The error check is encapsulated separately because not every error should be retried; retries should be selective. This check also helps avoid retry avalanches, as described later.

The code assumes the usual Go convention that the last return value is an error.

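// RetryBizCode lists the business error codes that should be retried.
var RetryBizCode = []string{"err_01", "err_02"}

func needRetryErr(out []reflect.Value) bool {
   // Errors returned by the framework, such as network errors
   if err, ok := out[len(out)-1].Interface().(error); ok && err != nil {
      return true
   }

   // Business error codes (BizCode) that require a retry
   if isContain(GetBizCode(out), RetryBizCode) {
      return true
   }

   return false
}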
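The post does not show GetBizCode or isContain. The following self-contained sketch shows one way the same check might look. The Resp type, its BizCode field, and the helpers bizCode, shouldRetry, and classify are hypothetical names introduced only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// Resp is a hypothetical response type; the BizCode field is an assumption.
type Resp struct{ BizCode string }

// retryBizCodes holds the business codes that should trigger a retry.
var retryBizCodes = map[string]bool{"err_01": true, "err_02": true}

var errNetwork = errors.New("network error")

// bizCode returns the BizCode of the first non-nil *Resp among the return values.
func bizCode(out []reflect.Value) string {
	for _, v := range out {
		if r, ok := v.Interface().(*Resp); ok && r != nil {
			return r.BizCode
		}
	}
	return ""
}

// shouldRetry reports true for a non-nil trailing error or a retryable BizCode.
func shouldRetry(out []reflect.Value) bool {
	if len(out) == 0 {
		return false
	}
	if err, ok := out[len(out)-1].Interface().(error); ok && err != nil {
		return true
	}
	return retryBizCodes[bizCode(out)]
}

// classify calls a zero-argument function through reflection and checks its results.
func classify(f interface{}) bool {
	return shouldRetry(reflect.ValueOf(f).Call(nil))
}

func main() {
	fmt.Println(classify(func() (*Resp, error) { return nil, errNetwork }))
	fmt.Println(classify(func() (*Resp, error) { return &Resp{BizCode: "err_01"}, nil }))
	fmt.Println(classify(func() (*Resp, error) { return &Resp{BizCode: "ok"}, nil }))
}
```

Like the original, this sketch retries on any non-nil error. Narrowing that branch to specific error types is what makes the retry selective.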

Usage example:

retryFun := ExampleRPCSend
Decorate(&retryFun, ExampleRPCSend)
err := retryFun(ctx, "example")

Compared with the business retry and the simple wrapper, this usage is rather "ugly".
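To see the reflection-based decorator work end to end, here is a runnable sketch under stated assumptions. flakySend and the calls counter are hypothetical stand-ins for an RPC, and decorate is a variant of the Decorate function with argument validation added; that validation is not in the original.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"time"
)

var calls int // number of flakySend invocations, for demonstration only

// flakySend is a hypothetical RPC stand-in: it fails twice, then succeeds.
func flakySend(msg string) error {
	calls++
	if calls < 3 {
		return errors.New("rpc err")
	}
	return nil
}

// lastIsErr reports whether the final return value is a non-nil error.
func lastIsErr(out []reflect.Value) bool {
	if len(out) == 0 {
		return false
	}
	err, ok := out[len(out)-1].Interface().(error)
	return ok && err != nil
}

// decorate stores in *decoPtr a wrapper around f that retries up to 3 times.
// It rejects arguments that would make reflect panic. Variadic functions are
// not handled in this sketch.
func decorate(decoPtr, f interface{}) error {
	fn := reflect.ValueOf(f)
	ptr := reflect.ValueOf(decoPtr)
	if fn.Kind() != reflect.Func || ptr.Kind() != reflect.Ptr || ptr.IsNil() ||
		ptr.Elem().Type() != fn.Type() {
		return errors.New("decorate: decoPtr must point to a func of the same type as f")
	}
	wrapped := reflect.MakeFunc(fn.Type(), func(in []reflect.Value) []reflect.Value {
		var out []reflect.Value
		for i := 0; i < 3; i++ {
			out = fn.Call(in)
			if !lastIsErr(out) {
				break
			}
			time.Sleep(10 * time.Millisecond)
		}
		return out
	})
	ptr.Elem().Set(wrapped)
	return nil
}

func main() {
	send := flakySend
	if err := decorate(&send, flakySend); err != nil {
		fmt.Println("decorate failed:", err)
		return
	}
	err := send("example")
	fmt.Println(err, calls)
}
```

The caller keeps the original signature, send(msg), which is the point of the reflection approach. The price is that type mismatches are only detected at run time.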

Middleware encapsulation

The retry logic can be encapsulated as middleware and implemented directly in the middleware.

func RpcRetryMW(next endpoint.EndPoint) endpoint.EndPoint {
   return func(ctx context.Context, req interface{}) (resp interface{}, err error) {
      // Scenarios without the retry flag call through unchanged
      if !retryFlag(ctx) {
         return next(ctx, req)
      }
      // Decorate the RPC call with retry; fall back to a plain call on failure
      decoratorFunc := next
      if err := Decorate(&decoratorFunc, next); err != nil {
         return next(ctx, req)
      }
      return decoratorFunc(ctx, req)
   }
}

Registering this middleware at the framework level enables retry for RPC calls:

AddGlobalMiddleWares(RpcRetryMW)
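The framework types endpoint.EndPoint and AddGlobalMiddleWares are not shown in the post. As a rough sketch of how such a middleware chain might be composed, assume the following hypothetical Endpoint, Middleware, and Chain definitions, with a reflection-free RetryMW standing in for RpcRetryMW.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Endpoint and Middleware are hypothetical stand-ins for the framework types.
type Endpoint func(ctx context.Context, req interface{}) (interface{}, error)
type Middleware func(next Endpoint) Endpoint

// Chain wraps ep so that the first middleware listed becomes the outermost one.
func Chain(ep Endpoint, mws ...Middleware) Endpoint {
	for i := len(mws) - 1; i >= 0; i-- {
		ep = mws[i](ep)
	}
	return ep
}

// RetryMW calls the wrapped endpoint up to maxAttempts times until it succeeds.
func RetryMW(maxAttempts int) Middleware {
	return func(next Endpoint) Endpoint {
		return func(ctx context.Context, req interface{}) (resp interface{}, err error) {
			for i := 0; i < maxAttempts; i++ {
				if resp, err = next(ctx, req); err == nil {
					return resp, nil
				}
			}
			return resp, err
		}
	}
}

// demo runs an endpoint that fails twice and then succeeds behind RetryMW(3).
func demo() (resp interface{}, attempts int, err error) {
	flaky := func(ctx context.Context, req interface{}) (interface{}, error) {
		attempts++
		if attempts < 3 {
			return nil, errors.New("rpc err")
		}
		return "ok", nil
	}
	resp, err = Chain(flaky, RetryMW(3))(context.Background(), "example")
	return resp, attempts, err
}

func main() {
	fmt.Println(demo())
}
```

Because the endpoint signature is fixed, a middleware can retry without reflection; the reflection-based Decorate is only needed when the wrapped signature is not known in advance.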

The retryFlag check in the middleware can be customized, so that only the scenarios that need it are retried, or, defined the other way around, so that certain scenarios are never retried.

func retryFlag(ctx context.Context) bool {
   ...
   return true
}

If retryFlag checks a flag, that flag has to be set somewhere. We usually call SetFlag at the entry point of the scenario, that is, in the first service the request reaches.

func SetFlag(ctx context.Context, flag string) context.Context {
   return context.WithValue(ctx, flagKey, flag)
}
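The post leaves flagKey and the body of retryFlag open. A minimal sketch of how the two could fit together is shown here; the unexported key type, the rule that any non-empty flag enables retry, and the flag value "order_create" are assumptions.

```go
package main

import (
	"context"
	"fmt"
)

// flagKeyType is an unexported type, so this context key cannot collide
// with keys defined in other packages.
type flagKeyType struct{}

var flagKey = flagKeyType{}

// SetFlag returns a copy of ctx that carries the scenario flag.
func SetFlag(ctx context.Context, flag string) context.Context {
	return context.WithValue(ctx, flagKey, flag)
}

// retryFlag treats any non-empty flag as "retry enabled" (an assumption).
func retryFlag(ctx context.Context) bool {
	flag, ok := ctx.Value(flagKey).(string)
	return ok && flag != ""
}

// demo reports the flag state before and after SetFlag is applied.
func demo() (before, after bool) {
	ctx := context.Background()
	before = retryFlag(ctx)
	ctx = SetFlag(ctx, "order_create") // hypothetical scenario name
	return before, retryFlag(ctx)
}

func main() {
	fmt.Println(demo())
}
```

Contexts are immutable, so the caller has to keep using the context returned by SetFlag; the original ctx never sees the flag.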

The call chain can be long, so how does the flag travel along it? Each service needs a server-side middleware that restores the flag into the context, and the RPC layer has to carry the flag in the request.

func CtxFlagMW(next endpoint.EndPoint) endpoint.EndPoint {
   return func(ctx context.Context, req interface{}) (resp interface{}, err error) {
      // SetFlag returns a new context, so the result must be passed on
      if flag, ok := getFlagFromReq(ctx, req); ok {
         ctx = SetFlag(ctx, flag)
      }
      return next(ctx, req)
   }
}

Use(CtxFlagMW)

Assisting business idempotency

Retry is only safe when the downstream is idempotent; otherwise it corrupts data. If a business interface is already idempotent, it can be retried directly, but most business interfaces are not. How can automatic retry be applied to them?

Analysis shows that our business interfaces are non-idempotent mainly because of database writes. If, before retrying, we can tell that the write has already been performed and skip it, most scenarios become idempotent. To tell whether the write has happened, create a local transaction log table in the business database and record each write in it, in the same local database transaction as the business data.


Code

func TransactionDecorator(ctx context.Context, txFunc func(ctx context.Context) error) error {
   // Scenarios that have not opted in keep the original logic unchanged
   if !retryFlag(ctx) {
      return TransactionManager(ctx, txFunc)
   }

   // Build the local transaction log entry
   event := genEventFromCtx(ctx)

   // If the entry already exists, do an empty compensation and skip the write
   if ExistEvent(ctx, event) {
      return nil
   }

   // Otherwise write the business data and the log entry in one transaction
   return TransactionManager(ctx, func(ctx context.Context) error {
      // Business logic
      if err := txFunc(ctx); err != nil {
         return err
      }
      // Write the local transaction log entry
      return SaveEvent(ctx, event)
   })
}
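The effect of the transaction log can be illustrated without a database. In the sketch below, a hypothetical in-memory store stands in for the business database, and a mutex stands in for the local transaction: the business row and the log entry are written together, and a repeated event ID is skipped. The names store, newStore, and writeOnce are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical in-memory stand-in for the business database.
type store struct {
	mu     sync.Mutex
	events map[string]bool // local transaction log, keyed by event ID
	rows   []string        // business data
}

func newStore() *store {
	return &store{events: make(map[string]bool)}
}

// writeOnce appends row and records eventID as one atomic step. If eventID
// was already recorded, it changes nothing and reports false.
func (s *store) writeOnce(eventID, row string) (applied bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.events[eventID] {
		return false // already applied: skip the write
	}
	s.rows = append(s.rows, row)
	s.events[eventID] = true
	return true
}

func main() {
	s := newStore()
	fmt.Println(s.writeOnce("evt-1", "order#1")) // first attempt writes
	fmt.Println(s.writeOnce("evt-1", "order#1")) // retry is skipped
	fmt.Println(len(s.rows))
}
```

In a real database the existence check and the insert are separate statements, so two concurrent retries could both pass the check. A unique constraint on the event ID in the log table would make the second transaction fail and roll back instead of writing twice.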

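Business idempotency can only be assisted by measures like this. Whether they are sufficient for a given integration still has to be judged by the business: guaranteeing that the database write happens only once is not the same as guaranteeing idempotency.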

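Preventing retry avalanches

The biggest risk of retrying is that requests pile up and overwhelm the downstream. We guard against retry avalanches in the following ways.

1. The needRetryErr method filters out errors that do not need a retry and returns them directly, so that business logic errors are not retried.

2. The retryFlag method sets a flag for the scenarios that need retry. A retry happens only when the flag is present, rather than for every request to the service, which avoids useless requests.

3. For non-timeout errors, make sure requests are not amplified.

There are two approaches. The first is the general approach above: retry at the point where the downstream call fails, and return upward only after the retries finish.

The second is to return success upward first and retry internally. It is more restrictive and fits fewer scenarios. For how to return success first within the generic wrapper, see the referenced blog post.

4. With timeouts, several services may detect the timeout at the same time. How do we make sure requests are not amplified?

Take a timeout on the chain A->B->C->D: A, B, and C all detect the error at the same time and all start retrying, which clearly amplifies the requests. Is there a way to let only one service retry?

The needRetryErr method can identify the error type, so it can detect a timeout error.

SetFlag is called at the entry scenario, so it can also set an entry marker, entranceFlag. The CtxFlagMW middleware passes on only retryFlag and not entranceFlag, so only service A carries the entranceFlag marker.

When needRetryErr identifies the error as a timeout, retryFlag checks that the ctx holds both retryFlag and entranceFlag, and a retry starts only if both are present. This guarantees that across the whole chain only the entry service A retries.

In summary, requests are not amplified in the timeout case either: only one service retries.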
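The timeout rule can be condensed into a single decision function. The sketch below assumes that timeouts surface as context.DeadlineExceeded (real RPC frameworks often use their own timeout error type) and that both flags are stored as booleans under the hypothetical keys retryFlagKey and entranceFlagKey.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type ctxKey string

// Hypothetical context keys for the two flags described in the text.
const (
	retryFlagKey    ctxKey = "retryFlag"
	entranceFlagKey ctxKey = "entranceFlag"
)

func hasFlag(ctx context.Context, k ctxKey) bool {
	v, _ := ctx.Value(k).(bool)
	return v
}

// shouldRetry allows a retry only when retryFlag is set. For timeouts it
// additionally requires entranceFlag, which only the entry service carries.
func shouldRetry(ctx context.Context, err error) bool {
	if err == nil || !hasFlag(ctx, retryFlagKey) {
		return false
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return hasFlag(ctx, entranceFlagKey)
	}
	return true
}

// demo evaluates the rule for the entry service A and a downstream service B.
func demo() (entryTimeout, downstreamTimeout, downstreamOther bool) {
	base := context.WithValue(context.Background(), retryFlagKey, true)
	entry := context.WithValue(base, entranceFlagKey, true) // service A
	downstream := base                                      // only retryFlag is propagated

	timeout := fmt.Errorf("call downstream: %w", context.DeadlineExceeded)
	other := errors.New("connection reset")

	return shouldRetry(entry, timeout),
		shouldRetry(downstream, timeout),
		shouldRetry(downstream, other)
}

func main() {
	fmt.Println(demo())
}
```

With this rule, a timeout seen by A, B, and C at once leads to a retry only from A, while a non-timeout error is still retried at the point where it occurred.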


Origin juejin.im/post/7115344614025855012