[Original] Experience Sharing: A bloody case triggered by Content-Length (almost...)

Preface

I ran into an interesting problem at work last week, so I am writing it down. The title is an exaggeration: the problem only almost caused a murder. Brother Hua is a very strict person, after all...

In the test environment, the front end found that calling one of our service's interfaces was extremely slow: the response time exceeded 30s, which was simply unbearable!

The logs showed that our service kept timing out when it used Feign to call a GET interface of another service, retrying until it failed. The strange thing was that the same interface responded quickly when requested manually via ip+port with a GET.

This was very strange. Why would an interface that had always worked suddenly keep timing out? At that moment I was full of question marks...

Phenomenon

The front end calls a query interface of our service (called Service A here) with a POST request, and our service uses Feign to call an interface of another service (called Service B here), which exposes that interface as a GET.

Seen from the outside, calls to our service were extremely slow, with a single request taking tens of seconds to get a response. The flow is as follows:

Request process.png

Troubleshooting

My first reaction was that this made no sense: an interface we had been calling for a long time should not behave like this, and it responded quickly when called manually via ip+port. So I went to find the colleague who develops Service B. Because I had overlooked some important log information, I took quite a few detours here. With my colleague's help, we eventually sorted out the problem.

The root cause was that we passed a Content-Length header in the GET request, while Service B had recently added a jar whose interceptor did something that triggered the problem. Below I trace the root cause at the source-code level and discuss how to avoid this kind of problem in the future.

To investigate, we each started Service A and Service B locally in DEBUG mode. The problem reproduced reliably, and the thread stack showed where the call to Service B was stuck:

Thread stack information.png

The request initiated by Service A hangs because the handling thread is suspended in awaitLatch(). This is the entry point for finding the cause: tracing step by step from here leads to the problem. The following sections analyze it in detail.

Cause of the problem

The cause actually follows from the troubleshooting above:

  1. When the front end calls the server-side interface, it uses a POST request, so a Content-Length header is passed in. When the Feign request is made, whether it is a GET or a POST, a Feign interceptor in the company's underlying package copies the front-end request's headers onto the Feign request, so the GET request we send also carries a Content-Length header.

ps: This is a pitfall. The underlying package we depend on adds a Feign interceptor. We only noticed the Content-Length header in the console after enabling Feign request logging, and from there tracked it down to the FeignInterceptor.

  2. Service B happened to depend on another jar that contains a Filter, which reads the body of each incoming request and logs it. This jar dependency had only just been added, because they use some other utility classes from it.

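// Decompiled excerpt from the jar. The imports, the log field, init(), the
// requestWrapper declaration, the body assignment and getBody() are missing
// from the excerpt and are filled in here as assumptions.
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import org.slf4j.*;
import com.alibaba.fastjson.JSON; // assumption: JSON refers to fastjson

public class ChannelFilter implements Filter {
    private static final Logger log = LoggerFactory.getLogger(ChannelFilter.class);

    public void init(FilterConfig filterConfig) {
    }

    public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException {
        ServletRequest requestWrapper = servletRequest;
        if (servletRequest instanceof HttpServletRequest) {
            // Constructing the wrapper reads the whole request body, whatever the HTTP method.
            requestWrapper = new RequestWrapper((HttpServletRequest) servletRequest);
            log.info("Http RequestURL : {}, Method : {}, RequestParam : {}, RequestBody : {}", new Object[]{((HttpServletRequest) servletRequest).getRequestURL(), ((HttpServletRequest) servletRequest).getMethod(), JSON.toJSON(servletRequest.getParameterMap()), ((RequestWrapper) requestWrapper).getBody()});
        }

        filterChain.doFilter(requestWrapper, servletResponse);
    }

    public void destroy() {
    }
}

public class RequestWrapper extends HttpServletRequestWrapper {
    private static final Logger log = LoggerFactory.getLogger(RequestWrapper.class);
    private final String body;

    public RequestWrapper(HttpServletRequest request) {
        super(request);
        StringBuilder stringBuilder = new StringBuilder();
        BufferedReader bufferedReader = null;
        ServletInputStream inputStream = null;
        try {
            inputStream = request.getInputStream();
            if (inputStream != null) {
                bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
                char[] charBuffer = new char[4096];
                int bytesRead;
                // This read is where the request thread ends up waiting.
                while ((bytesRead = bufferedReader.read(charBuffer)) != -1) {
                    stringBuilder.append(charBuffer, 0, bytesRead);
                }
            }
        } catch (IOException var19) {
            log.error(var19.getMessage(), var19);
        }
        this.body = stringBuilder.toString();
    }

    public String getBody() {
        return this.body;
    }
}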

The code that reads the request body uses:

while((bytesRead = bufferedReader.read(charBuffer)) != -1) {
   stringBuilder.append(charBuffer, 0, bytesRead);
}

bufferedReader.read() eventually calls into Tomcat's org.apache.tomcat.util.net.NioBlockingSelector.read() method to read the body of the request:

int keycount = 1;
while (!timedout) {
    if (keycount > 0) { //only read if we were registered for a read
        read = socket.read(buf);
        if (read != 0) {
            break;
        }
    }
    try {
        if ( att.getReadLatch()==null || att.getReadLatch().getCount()==0) att.startReadLatch(1);
        poller.add(att,SelectionKey.OP_READ, reference);
        if (readTimeout < 0) {
            att.awaitReadLatch(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
        } else {
            att.awaitReadLatch(readTimeout, TimeUnit.MILLISECONDS);
        }
    } catch (InterruptedException ignore) {
        // Ignore
    }
}

Because the body of the GET request is empty, socket.read() returns 0 here, so execution falls through to att.awaitReadLatch(readTimeout, TimeUnit.MILLISECONDS), which ends up in:

protected void awaitLatch(CountDownLatch latch, long timeout, TimeUnit unit) throws InterruptedException {
    if ( latch == null ) throw new IllegalStateException("Latch cannot be null");
    latch.await(timeout,unit);
}

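This in turn calls LockSupport.parkNanos(time) and waits until the timeout. At this point you may still be wondering why passing Content-Length in the header leads down this code path. Don't worry, keep reading: a more detailed analysis follows...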
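The wait described above can be reproduced in isolation. The following is a minimal sketch using only the JDK, not Tomcat code; the class and method names are made up for illustration. It waits on a latch that nobody counts down, much as the read waits for body bytes that never arrive.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchTimeoutDemo {

    /**
     * Waits on a latch that is never counted down and returns how long
     * the wait took, in milliseconds.
     */
    static long waitOnUnreleasedLatch(long timeoutMillis) {
        CountDownLatch latch = new CountDownLatch(1);
        long start = System.nanoTime();
        try {
            // await() returns false when the timeout elapses before the count reaches zero.
            boolean released = latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
            if (released) {
                throw new IllegalStateException("latch was released unexpectedly");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

Calling LatchTimeoutDemo.waitOnUnreleasedLatch(200) keeps the calling thread parked for roughly the full 200 ms. In the incident the same thing happens with the read timeout, once per Feign attempt.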

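Solutions

  1. Service B removes the dependency on the problematic jar.

  2. Change the Filter in the problematic jar so that it reads the body only for POST requests.

  3. The caller adds configuration to strip the Content-Length header when the request is a GET (this addresses the main cause).

  4. Modify the FeignInterceptor in the underlying package to check the request method before copying headers (the company's underlying package is not easy for us to change).

Option 4 is really the one that should be fixed, but this is an underlying package that the whole company depends on; changing it means notifying the architecture group and so on, and the impact would be fairly large.

In the end we went with option 3 for now: add a check in our request chain that stops Content-Length from being passed on GET requests.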
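The check behind option 3 can be sketched as follows. The article does not show the actual configuration, so this is only an illustration in plain Java: the class name ForwardedHeaders and the method select are hypothetical, and in a real service the same logic would presumably live wherever headers are copied onto the outgoing Feign request.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ForwardedHeaders {

    /** Returns the inbound headers that are safe to copy onto the outgoing request. */
    static Map<String, String> select(String outgoingMethod, Map<String, String> incoming) {
        boolean isGet = "GET".equalsIgnoreCase(outgoingMethod);
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : incoming.entrySet()) {
            // Header names are case-insensitive, so compare accordingly.
            if (isGet && "Content-Length".equalsIgnoreCase(header.getKey())) {
                continue; // it describes the inbound POST body, not this GET
            }
            result.put(header.getKey(), header.getValue());
        }
        return result;
    }
}
```

With this check, an outgoing GET no longer inherits the Content-Length of the front end's POST, while POST requests keep their headers unchanged.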

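How the fix works

Now for the actual mechanism. After the server side sends the Feign request, handling of that request always goes through Tomcat's org.apache.coyote.http11.Http11Processor.prepareRequest() method; the code is shown in the figure: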

Http11Processor.prepareRequest().png

If contentLength >= 0, an org.apache.coyote.http11.filters.IdentityInputFilter is added, and the bufferedReader.read() call in the RequestWrapper from the jar added to Service B ends up in org.apache.coyote.http11.filters.IdentityInputFilter.doRead():

wE7F6s.png

This method in turn calls directly into org.apache.tomcat.util.net.NioBlockingSelector.read():

NioBlockingSelector.read().png

Because the body of the GET request is empty, the socket read returns 0 here and execution goes straight to awaitReadLatch(). That calls LockSupport.parkNanos(time) and waits until the timeout, which is why every Feign request times out.

But what if the requesting service does not pass Content-Length? Then an org.apache.coyote.http11.filters.VoidInputFilter is constructed instead; where this filter is constructed is also marked in the Http11Processor.prepareRequest() figure above:
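The stall can be reproduced without Tomcat or Feign. The sketch below is a toy model built on plain JDK sockets, with hypothetical class and method names: a "client" sends only the request head and keeps the connection open, and a "server" tries to read as many body bytes as Content-Length declares.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentLengthStallDemo {

    /** Returns true if reading the declared body hit the socket read timeout. */
    static boolean bodyReadTimesOut(String rawRequest, int timeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            // Client side: send the request and keep the connection open.
            client.getOutputStream().write(rawRequest.getBytes(StandardCharsets.US_ASCII));
            client.getOutputStream().flush();

            accepted.setSoTimeout(timeoutMillis);
            InputStream in = accepted.getInputStream();

            // Server side: read the head up to the blank line.
            StringBuilder head = new StringBuilder();
            while (!head.toString().endsWith("\r\n\r\n")) {
                int b = in.read();
                if (b == -1) {
                    return false;
                }
                head.append((char) b);
            }

            long remaining = declaredLength(head.toString());
            try {
                // Each read blocks until data, end of stream, or the timeout.
                while (remaining > 0 && in.read() != -1) {
                    remaining--;
                }
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the Content-Length value, or -1 if the header is absent. */
    static long declaredLength(String head) {
        String prefix = "content-length:";
        for (String line : head.split("\r\n")) {
            if (line.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                return Long.parseLong(line.substring(prefix.length()).trim());
            }
        }
        return -1;
    }
}
```

A GET head carrying Content-Length: 5 with no body makes the read time out, while the same head without the header, or followed by the five promised bytes, returns immediately. The host name and path in such a request are arbitrary.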

VoidInputFilter.png

Evidently, -1 is returned directly here and NioBlockingSelector.read() is never called, which is the key to why the fix works.
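The difference between the two filters can be modeled in a few lines. This is a simplified sketch of the behavior described above, not Tomcat's implementation: the names InputFilterSketch, BodySource and IdentityLike are invented, and the socket is represented by an IntSupplier so that the example can count how often it is touched.

```java
import java.util.function.IntSupplier;

class InputFilterSketch {

    /** Something the servlet layer pulls body bytes from; -1 means end of body. */
    interface BodySource {
        int read();
    }

    /** Mirrors the choice described above: a declared length selects the identity case. */
    static BodySource choose(long contentLength, IntSupplier socket) {
        if (contentLength >= 0) {
            return new IdentityLike(contentLength, socket);
        }
        return () -> -1; // void case: report end of body without touching the socket
    }

    /** Expects exactly the declared number of body bytes from the socket. */
    static final class IdentityLike implements BodySource {
        private long remaining;
        private final IntSupplier socket;

        IdentityLike(long contentLength, IntSupplier socket) {
            this.remaining = contentLength;
            this.socket = socket;
        }

        @Override
        public int read() {
            if (remaining <= 0) {
                return -1;
            }
            remaining--;
            return socket.getAsInt(); // in the real server this read is what can block
        }
    }
}
```

In this model only a positive declared length ever reaches the socket, which matches the incident: the GET carried the length of the front end's POST body, so the server waited for bytes that were never sent.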

Summary

I won't go into the concept of Content-Length here and will assume everyone knows it. If you are not clear on it, you can refer to:
https://blog.piaoruiqing.com/2019/09/08/do-you-know-content-length/

A simple Content-Length really stumped me; the non-compliant request was the real cause of this problem. It took a lot of time to track down, but it was all worthwhile. Nobody grows without being tested by all kinds of problems. I hope everyone gains something from reading this.


Origin blog.51cto.com/7592962/2543108