Golang: troubleshooting a strange 499/504 service network failure

  • The incident
  • Investigation
  • Summary

The incident

At noon on 11-01, in the middle of lunch, my phone suddenly received an alarm for non-200 responses from our business gateway. Normally a little network jitter will trigger the odd 499 alarm, but it recovers quickly (the current alarm threshold is 5%, and the threshold is directly tied to the QPS in the sampling window at the time).

The non-200 error rate had already reached 10% and kept rising. Going by past experience it should have recovered quickly, so we watched our phones for a few minutes (while eating some very tasty dumplings). But after a few minutes the fault had not recovered; the error rate kept climbing past 50% and the incident kept escalating (if a fault is not resolved within a fixed time it is escalated level by level, and every escalation pulls a higher-level boss into the incident group). The phones kept buzzing with alarms, the failure rate approached 100%, and the impact suddenly became much larger.

At this point the withdrawal system also started alarming: a large number of payout orders were backing up (the backlog alarm only fires after a threshold is crossed, so it is not real-time). Colleagues also reported a small number of connection errors in the payment system. The situation suddenly felt much more complicated, so we stopped eating and hurried back to the office to investigate.

We got back to our desks around 12:40. A quick look at the monitoring dashboards showed that the errors were basically all 499 and 504, both of which are caused by network timeouts. Both machines in the cluster were failing, and their QPS was roughly even, so a single bad machine could be ruled out.

The RT99 line sat flat at around 5s, which is the 5s timeout at which the upstream sidecar proxy actively cancels the call; the real RT may have been even longer.

The fault still had not recovered. Ops came in to help with the investigation, and by now the incident group had been escalated all the way up to the head of the tech center. The pressure ramped up instantly.

Checking the gateway system logs, calls to one of our internal systems were producing a large number of "downstream server timeout" errors. Based on the log message we judged that a network problem was causing the timeouts, but the call we were making was to an internal (intranet) service; if the network was the problem, why was only our system affected?

Between 12:51 and 13:02 the error rate improved a little, but then it started climbing again.

Around this time ops told us that another department was also seeing a large number of 302 alarms, and the timeline roughly matched ours; by now it was almost 13:30. That department's systems have nothing to do with ours, but with so many open questions we sat down together and started troubleshooting jointly.

They tried a version rollback, which did not help. They then switched the domain that was returning 302 over to the internal network, and the fault recovered immediately; it was exactly 14:00. According to their feedback, an experiment had ramped up traffic, producing a traffic peak right at 12:00. But why would their traffic peak shake the systems on my call chain? We were baffled, with plenty of unanswered questions.

The fault lasted far too long: alarms fired for a full two hours, every escalation brought another round of alarm calls, and dozens of "disaster"-level alerts piled up in the WeChat alarm group. The severity is easy to imagine.

Investigation

Although the failure was triggered by another department's traffic ramp-up, there were still too many unanswered questions, and without answers it would happen again. For engineers the production environment is sacred, restricted ground; we have to find the root cause of every failure, otherwise we cannot account for ourselves. So we started peeling the onion, layer by layer.

The questions we needed to sort out next:

1. What caused the 302s, and why did switching the domain name make everything recover?
2. Where do the two systems intersect on the call chain? If there is no intersection at the application layer, do they intersect at the network layer?
3. Why did the "downstream server timeout" errors in our business gateway not affect other systems? Is the wording of the log message misleading or ambiguous?
4. The 504s were triggered by the sidecar proxy's timeout; why didn't the timeouts configured in the gateway service take effect?

1. What caused the 302s, and why did switching the domain name make everything recover?

After investigation by our ops team and Alibaba Cloud experts: the flood of 302s was caused by the domain's DDoS/CC "high defense" protection policy. The domain is configured with a DDoS/CC high-defense policy, and the surge of requests triggered one of its rules (exactly which rule is not something we can disclose), so requests were rejected with a 302. The false positives were resolved by adding a whitelist.
(From a design point of view, internal calls should not go out over the public network at all; part of this is a legacy problem.)

2. Where do the two systems intersect on the call chain? If there is no intersection at the application layer, do they intersect at the network layer?

Everyone's attention was focused on the high-defense service: the assumption was that our failure happened because the gateway was also hitting an address under high-defense protection. But our gateway simply has no high-defense address configured, and our internal systems do not have public-network addresses at all.

When we investigated the withdrawal system, its configuration did reference one of the public-network addresses under high-defense protection, so we assumed the payout backlog was also caused by hitting a high-defense address. In fact that address only plays a bypass role and does not affect the payout flow, but since it really was in the configuration, it was reasonable to suspect it was being used and causing the impact. At the time this looked like an important clue, a potential breakthrough.

Following this clue, we reasoned that although the gateway system itself does not call any high-defense address, one of the downstream services it calls might, which could make the whole call chain avalanche.

We combed through the downstream services, reading code and logs, and found essentially no clue on the application-layer call chain. So we started looking for clues at the network layer. Since these are internal calls the path is fairly simple: client -> slb -> gateway -> slb -> sidecar proxy -> ecs. The downstream systems were all serving requests normally, and the slb and sidecar proxy monitoring looked normal too. Neither the application layer nor the network layer gave us an answer.

The sidecar proxy has logging disabled, so we could not see its requests (in fact some calls are not direct and still go through slb/vtm). From the monitoring, the downstream sidecar proxies also looked perfectly normal. If this were a network problem, there would surely be a chain reaction.

With no leads left, we started carefully reading the logs of every system that failed that day (with microservices being the fashion there are a lot of services, and a lot of error logs). While going through the payment system's channel service we found a clue: during the incident there was a small number of "connection reset by peer" errors. This error usually shows up when a connection pool hands out a dead connection, or when the downstream server restarts; but there had been no deployment at the time of the incident.
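
As an aside, in Go this class of error is usually mitigated on the client side by bounding how long pooled connections may sit idle, so the pool is less likely to hand out a connection the other side has already torn down. A minimal sketch using the standard net/http transport; the numbers are illustrative assumptions, not our production settings:

package client

import (
    "net/http"
    "time"
)

// NewPooledClient returns an http.Client whose transport drops idle
// connections early, reducing the chance that a reused pooled connection
// has already been closed by the peer.
func NewPooledClient() *http.Client {
    return &http.Client{
        Timeout: 1 * time.Second, // per-request timeout
        Transport: &http.Transport{
            MaxIdleConns:        100,
            MaxIdleConnsPerHost: 10,
            IdleConnTimeout:     30 * time.Second, // shorter than typical server/proxy idle timeouts
        },
    }
}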

Comparing against the previous week's logs, this error had not occurred before, so it was very likely an important clue. We contacted Alibaba Cloud to help check whether there was any problem on the link of the ECS instances at the time. To our surprise, Alibaba Cloud reported that during the incident the NAT gateway had been rate-limited and was dropping packets. Suddenly all the question marks disappeared.

The rate-limit packet loss was the main cause of the flood of errors in our systems. So the full story is: the campaign traffic ramp-up triggered both the high-defense 302s and outbound rate-limit packet loss on the NAT gateway, and our systems were affected because they need to reach the public network. Withdrawal payouts go through payment channels such as Alipay and WeChat Pay, and the payment system also needs outbound access to Alipay, WeChat Pay, UnionPay and other channels.
(Because we had no alarm on the NAT gateway at the time, we all assumed the high-defense service was intercepting our traffic.)

That still leaves the question of why the gateway's calls to internal systems were failing, but the answer was now obvious. A quick check showed that one of the calls does go out to the public network: the gateway endpoint calls three downstream services, and the very first of them makes an outbound call.

So that part was found. But why did we not see a single error from the timeouts configured on the downstream calls, and why does the stack trace of the "downstream server timeout" log point at an internal call? That was still not understood.

3. Why did the "downstream server timeout" errors in our business gateway not affect other systems? Is the wording of the log message misleading or ambiguous?

Analyzing the code, this log line is not emitted because a call to some downstream service directly timed out; it comes from a notification on Go's Context.Done() channel. Let's look at the code:

func Send(ctx context.Context, serverName, method, path string, in, out interface{}) (err error) {
    e := make(chan error)
    // run the actual HTTP call in its own goroutine so that we can also
    // watch the upstream context at the same time
    go func() {
        opts := []utils.ClientOption{
            utils.WithTimeout(time.Second * 1),
        }
        if err = utils.HttpSend(method, path, in, out, ops, opts...); err != nil {
            e <- err
            return
        }
        e <- nil
    }()

    select {
    case err = <-e:
        // the HTTP call finished first (with or without an error)
        return
    case <-ctx.Done():
        // the upstream connection/context was canceled first
        err = errors.ErrClientTimeOut
        return
    }
}

The Send method launches the call in a goroutine and then uses select on a channel to learn the result of the HTTP call, while ctx.Done() tells it that the upstream HTTP connection for this request has been canceled.

err = errors.ErrClientTimeOut
ErrClientTimeOut         = ErrType{64012, "下游服务器超时"}

This errors.ErrClientTimeOut is the error object behind the "下游服务器超时" ("downstream server timeout") log line.

Which is odd: why is there no timeout error from calling the downstream server, when the timeout is clearly set to 1s?

        opts := []utils.ClientOption{
                    utils.WithTimeout(time.Second * 1),
                }
        if err = utils.HttpSend(method, path, in, out, ops, opts...); err != nil {
            e <- err
            return
        }

utils.HttpSend is supposed to set a call timeout, so why was there not a single call-timeout error in the logs? Tracing the code we found that although the opts are passed into utils.HttpSend, the timeout is never actually applied to the http.Client object.

client := &http.Client{}
    // handle option
    {
        options := defaultClientOptions
        for _, o := range opts {
            o(&options)
        }
        for _, o := range ops {
            o(req)
        }
        
        // set timeout: this assignment is the line that was missing
        client.Timeout = options.timeout

    }

    // do request
    {
        if resp, err = client.Do(req); err != nil {
            err = err502(err)
            return
        }
        defer resp.Body.Close()
    }

The whole bug is one missing line, client.Timeout = options.timeout, which left the HTTP calls with no timeout at all. Once it is added, a call that times out raises the error "net/http: request canceled (Client.Timeout exceeded while awaiting headers)".
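
To make the fix concrete, here is a minimal, self-contained sketch of the same option pattern; the names mirror the snippets above, but everything in it is illustrative rather than the real utils package:

package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "time"
)

type clientOptions struct{ timeout time.Duration }

type ClientOption func(*clientOptions)

// WithTimeout plays the role of utils.WithTimeout in the snippets above.
func WithTimeout(d time.Duration) ClientOption {
    return func(o *clientOptions) { o.timeout = d }
}

func main() {
    // A fake downstream that is slower than the client-side timeout.
    slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(2 * time.Second)
    }))
    defer slow.Close()

    options := clientOptions{timeout: 30 * time.Second} // stand-in for defaultClientOptions
    for _, o := range []ClientOption{WithTimeout(1 * time.Second)} {
        o(&options)
    }

    client := &http.Client{}
    client.Timeout = options.timeout // the assignment that was missing

    _, err := client.Get(slow.URL)
    fmt.Println(err) // the error mentions "Client.Timeout exceeded while awaiting headers"
}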

So we now roughly understood the problem: because we never set a timeout on the downstream calls, the upstream connection timed out and closed first, which in turn triggered the context.Canceled event.

The calls at the layer above run one by one, synchronously:

    couponResp, err := client.Coupon.GetMyCouponList(ctx, r)
    // do not return the error; degrade to "no coupons"
    if err != nil {
        logutil.Logger.Error("get account coupon  faield",zap.Any("err", err))
    }
    coins, err := client.Coin.GetAccountCoin(ctx, cReq.UserID)
    // do not return the error; degrade to "no coins"
    if err != nil {
        logutil.Logger.Error("get account coin faield",zap.Any("err", err))
    }
    subCoins, err := client.Coin.GetSubAccountCoin(ctx, cReq.UserID)
    // do not return the error; degrade to "no sub-account coins"
    if err != nil {
        logutil.Logger.Error("get sub account coin faield",zap.Any("err", err))
    }

client.Coupon.GetMyCouponList – get the user's coupons
client.Coin.GetAccountCoin – get the coin account
client.Coin.GetSubAccountCoin – get the coin sub-account

All three of these methods call Send internally. The endpoint's logic is to gather all of the user's cash-deduction benefits and to degrade gracefully within the timeout. But there is one problem here: the code never distinguishes the type of error returned by Send. Once the connection is gone, it is pointless for the program to keep going, and ignoring that defeats the purpose of context.Canceled. A sketch of an early exit follows below.
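
A minimal sketch of bailing out once the context is gone; the client calls follow the excerpt above, and the surrounding handler signature and return values are assumed placeholders:

    couponResp, err := client.Coupon.GetMyCouponList(ctx, r)
    if err != nil {
        // If the upstream connection is already canceled, stop here:
        // the remaining downstream calls (and the response itself) are wasted work.
        if ctx.Err() != nil {
            return nil, ctx.Err()
        }
        // Otherwise degrade to "no coupons" as before.
        logutil.Logger.Error("get account coupon failed", zap.Any("err", err))
    }
    // ... same check before GetAccountCoin and GetSubAccountCoin
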
(One big difference between Go and other mainstream languages is the thread model. Go has no user-facing Thread concept (under the hood it still schedules onto threads); there are only goroutines, and the goroutine itself is hidden: there is nothing like a thread ID or thread-local storage. Goroutines cooperate through context.Context. In Java, for example, cancelling a thread means calling Thread.interrupt and then catching and propagating the interrupt; in Go you catch and propagate the Context signal instead.)

4. The 504s were the sidecar proxy timing out and closing the connection; why didn't the timeouts configured on the gateway server take effect?

The sidecar proxy closes connections in three scenarios:

1. On a 499 it also closes the downstream connection.
2. On a 504 timeout it closes the downstream connection directly.
3. It closes a downstream connection that has been idle for more than 60s.

During the incident the 499s and 504s were the sidecar proxy actively closing connections; the gateway service noticed the cancellation through Context.Done() and raised the error, and the layer above logged "downstream server timeout". So why did the gateway server's own timeouts not kick in?

The http.Server object has four timeout parameters, none of which we had set, and this class of parameter is easy to overlook. A server should itself bound how long it serves any incoming request; we usually pay plenty of attention to downstream timeouts and neglect the server's own.

type Server struct {
    // ReadTimeout is the maximum duration for reading the entire
    // request, including the body.
    //
    // Because ReadTimeout does not let Handlers make per-request
    // decisions on each request body's acceptable deadline or
    // upload rate, most users will prefer to use
    // ReadHeaderTimeout. It is valid to use them both.
    ReadTimeout time.Duration

    // ReadHeaderTimeout is the amount of time allowed to read
    // request headers. The connection's read deadline is reset
    // after reading the headers and the Handler can decide what
    // is considered too slow for the body.
    ReadHeaderTimeout time.Duration

    // WriteTimeout is the maximum duration before timing out
    // writes of the response. It is reset whenever a new
    // request's header is read. Like ReadTimeout, it does not
    // let Handlers make decisions on a per-request basis.
    WriteTimeout time.Duration

    // IdleTimeout is the maximum amount of time to wait for the
    // next request when keep-alives are enabled. If IdleTimeout
    // is zero, the value of ReadTimeout is used. If both are
    // zero, ReadHeaderTimeout is used.
    IdleTimeout time.Duration
}

All of these timeouts are computed by setDeadline into absolute points in time and set on the netFD object (the network file descriptor).
Since we had not set any of them, effectively every connection close was driven by the sidecar proxy and propagated down to us.
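
A minimal sketch of what setting the server-side timeouts could look like; the values are illustrative assumptions, not recommendations for this system:

package main

import (
    "log"
    "net/http"
    "time"
)

func main() {
    srv := &http.Server{
        Addr:              ":8080",
        Handler:           nil, // falls back to http.DefaultServeMux
        ReadHeaderTimeout: 5 * time.Second,
        ReadTimeout:       10 * time.Second,
        WriteTimeout:      10 * time.Second,
        // IdleTimeout should stay above the sidecar proxy's 60s idle timeout
        // so the proxy never reuses a connection we have already closed.
        IdleTimeout: 90 * time.Second,
    }
    log.Fatal(srv.ListenAndServe())
}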

We already understand scenarios 1 and 2 in which the sidecar proxy closes connections. Scenario 3 happens on HTTP keep-alive connections, and that kind of close is completely invisible to the application.

The default tcpKeepAliveListener sets the TCP keepalive period to 3 minutes.

func (ln tcpKeepAliveListener) Accept() (net.Conn, error) {
    tc, err := ln.AcceptTCP()
    if err != nil {
        return nil, err
    }
    tc.SetKeepAlive(true)
    tc.SetKeepAlivePeriod(3 * time.Minute)
    return tc, nil
}

Our service is hosted with the endless framework, which also defaults to 3 minutes (in practice something like 90s is a common convention; too small a value will affect the upstream proxy).

func (el *endlessListener) Accept() (c net.Conn, err error) {
    tc, err := el.Listener.(*net.TCPListener).AcceptTCP()
    if err != nil {
        return
    }

    tc.SetKeepAlive(true)                  // see http.tcpKeepAliveListener
    tc.SetKeepAlivePeriod(3 * time.Minute) // see http.tcpKeepAliveListener

    c = endlessConn{
        Conn:   tc,
        server: el.server,
    }

    el.server.wg.Add(1)
    return
}

The sidecar proxy's idle timeout is 60s, so even if we do set a keepalive timeout ourselves it has to be greater than 60s, to prevent the sidecar proxy from reusing a connection we have already closed.
(This can still go wrong on an unstable network: if an HA failover happens and, within a small-probability heartbeat window, the service's state has not yet reached the registry, the sidecar proxy may reuse a previous HTTP keep-alive connection. It is a trade-off; checking connection state on every request would certainly cost performance.)

One more question out of curiosity: how does the HTTP layer become aware of the layer-4 TCP state, and how is the Context cancel event propagated upward? Let's take a quick look while we are here.

type conn struct {
    // server is the server on which the connection arrived.
    // Immutable; never nil.
    server *Server

    // cancelCtx cancels the connection-level context.
    cancelCtx context.CancelFunc
}
func (c *conn) serve(ctx context.Context) {
    
    // HTTP/1.x from here on.
    
    ctx, cancelCtx := context.WithCancel(ctx)
    c.cancelCtx = cancelCtx
    defer cancelCtx()

    c.r = &connReader{conn: c}
    c.bufr = newBufioReader(c.r)
    c.bufw = newBufioWriterSize(checkConnErrorWriter{c}, 4<<10)

    for {
        w, err := c.readRequest(ctx)

        if !w.conn.server.doKeepAlives() {
            // We're in shutdown mode. We might've replied
            // to the user without "Connection: close" and
            // they might think they can send another
            // request, but such is life with HTTP/1.1.
            return
        }

        if d := c.server.idleTimeout(); d != 0 {
            c.rwc.SetReadDeadline(time.Now().Add(d))
            if _, err := c.bufr.Peek(4); err != nil {
                return
            }
        }
        c.rwc.SetReadDeadline(time.Time{})
    }
}
// handleReadError is called whenever a Read from the client returns a
// non-nil error.
//
// The provided non-nil err is almost always io.EOF or a "use of
// closed network connection". In any case, the error is not
// particularly interesting, except perhaps for debugging during
// development. Any error means the connection is dead and we should
// down its context.
//
// It may be called from multiple goroutines.
func (cr *connReader) handleReadError(_ error) {
    cr.conn.cancelCtx()
    cr.closeNotify()
}
// checkConnErrorWriter writes to c.rwc and records any write errors to c.werr.
// It only contains one field (and a pointer field at that), so it
// fits in an interface value without an extra allocation.
type checkConnErrorWriter struct {
    c *conn
}

func (w checkConnErrorWriter) Write(p []byte) (n int, err error) {
    n, err = w.c.rwc.Write(p)
    if err != nil && w.c.werr == nil {
        w.c.werr = err
        w.c.cancelCtx()
    }
    return
}

In fact the TCP state is not pushed up to the HTTP layer by an event; the HTTP layer discovers it each time it actively touches the connection.

On reads, the connReader senses the TCP state; on writes, the checkConnErrorWriter does; the cancellation is then propagated recursively through cancelCtx on the server's conn object.
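
Seen from a handler, that propagation is what makes a client (or sidecar proxy) disconnect observable. A small illustrative sketch:

package main

import (
    "log"
    "net/http"
    "time"
)

func main() {
    http.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
        select {
        case <-time.After(5 * time.Second):
            w.Write([]byte("done"))
        case <-r.Context().Done():
            // The connection-level cancelCtx fired: the peer closed the
            // connection (e.g. the sidecar proxy's 499/504) before we finished.
            log.Printf("request canceled: %v", r.Context().Err())
        }
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}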

Summary

This troubleshooting took a full two and a half days, and there are many points to reflect on and improve:

1. Our network calls do not surface the original, underlying error message (once a log message is post-processed it becomes seriously misleading).
2. The configured timeouts did not actually take effect; without complete failure drills and load tests, timeouts easily turn out to be no-ops.
3. Internal and external domain names were not isolated; internal and external network calls need to be separated and given distinct environments.
4. The http server's own timeouts were not set; if an internal bug makes handlers hang, concurrency piles up and the server will collapse.
5. You have to be very familiar with the call chain and the cloud network architecture in order to locate this kind of problem quickly.

In fact, once the whole system is on the cloud, the network architecture becomes complicated and there are too many confounding factors; investigations depend heavily on others, and with such a wide alarm and monitoring surface it is hard to notice problems in an individual business line. (Some questions, honestly, still have no answer.)
The hardest faults to troubleshoot are the ones that cannot be reproduced, because afterwards you can only build a chain of circumstantial evidence, and anything involving the network makes the situation even more complicated.

Author: Wang Qingpei (Fun Headlines Tech Leader)

Origin www.cnblogs.com/wangiqngpei557/p/11873096.html