[Original] Experience Sharing: A bloody case triggered by Content-Length (almost...)
Preface
I ran into an interesting problem at work last week, so I am writing it down. The title is a bit of clickbait: this problem almost caused a murder (Brother Hua is a very strict person). Almost...
In the test environment, when the front end called an interface of our service, the call was extremely slow, with response times over 30s. Simply unbearable!
The logs showed that our service kept timing out when it called another service's GET interface through Feign, and then retried until it failed. The strange thing is that when we requested this timing-out GET interface manually via ip + port, it responded very quickly.
This was very strange. Why would an interface that had been working fine suddenly keep timing out? At that moment I was full of question marks...
Symptoms
The front end calls a query interface of our service (called Service A here) with a POST request. Our service then uses Feign to call an interface of another service (called Service B here), which is exposed externally as a GET interface.
Seen from the outside, calls to our service were extremely slow, with a single request taking tens of seconds to respond. The call chain is: front end -> POST -> Service A -> Feign GET -> Service B.
Troubleshooting
My first reaction was that this was too strange. An interface we had been calling before should not behave like this, and calling it manually via ip + port was very fast. So I went to a colleague who develops Service B. Because I had overlooked some important log information, I took a lot of detours here. With my colleagues' help, I eventually sorted out the problem.
The root cause is that we passed a Content-Length parameter in the Header of the GET request, while Service B had recently added a jar package in which an interceptor did something that triggered the problem. Below I trace the root cause at the source code level and look at how to avoid this kind of problem in the future.
To investigate, we started Service A and Service B locally in DEBUG mode. The problem reproduced reliably, and the stack information showed where the call to Service B was stuck.
The request initiated by Service A hangs because the thread is suspended in awaitLatch(). This is the breakthrough point: tracing step by step from here leads to the cause. The following sections go through it carefully.
Cause of the problem
The cause follows directly from the troubleshooting above:
When the front end calls our server-side interface, it uses a POST request, so the Header carries a Content-Length attribute. When we then make the Feign call, regardless of whether it is a GET or a POST request, a Feign interceptor in the company's underlying package copies the Header attributes of the front-end request onto the Header of the Feign request. As a result, the Header of the GET request we send also contains a Content-Length attribute.
ps: This is a pitfall. The Feign interceptor is added by an underlying package we depend on. We only noticed the Content-Length attribute after printing the Feign request log to the console, and from there finally tracked it down to the FeignInterceptor.
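To make the header leak concrete, here is a minimal sketch of what a blanket header copy does. The class and method names are hypothetical and plain JDK maps stand in for the real request objects; this is not the company interceptor, whose source is not shown in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderCopySketch {
    // Hypothetical stand-in for the shared interceptor: every inbound header
    // is forwarded, whatever the HTTP method of the outbound request is.
    static Map<String, String> copyAllHeaders(Map<String, String> inbound) {
        return new LinkedHashMap<>(inbound);
    }

    public static void main(String[] args) {
        Map<String, String> frontEndPost = new LinkedHashMap<>();
        frontEndPost.put("Content-Type", "application/json");
        frontEndPost.put("Content-Length", "27"); // length of the POST body (made-up value)

        // Headers that end up on the outbound GET, which has no body at all.
        Map<String, String> outboundGet = copyAllHeaders(frontEndPost);
        System.out.println("GET headers: " + outboundGet);
    }
}
```

The outbound GET now announces a body length it will never send, which is the irregular request the rest of this post is about.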
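Service B happened to depend on another jar package. That package contains a Filter that reads the body of the incoming request and then prints some logs. The dependency on this jar had only just been added, because they use some other utility classes from it.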
public class ChannelFilter implements Filter {
    private static final Logger log = LoggerFactory.getLogger(ChannelFilter.class);

    public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain)
            throws IOException, ServletException {
        // The declaration is missing from the excerpt; falling back to the original request is an assumption.
        ServletRequest requestWrapper = servletRequest;
        if (servletRequest instanceof HttpServletRequest) {
            // Wrapping the request reads the whole body eagerly, whatever the HTTP method is.
            requestWrapper = new RequestWrapper((HttpServletRequest) servletRequest);
            log.info("Http RequestURL : {}, Method : {}, RequestParam : {}, RequestBody : {}",
                    new Object[]{((HttpServletRequest) servletRequest).getRequestURL(),
                            ((HttpServletRequest) servletRequest).getMethod(),
                            JSON.toJSON(servletRequest.getParameterMap()),
                            ((RequestWrapper) requestWrapper).getBody()});
        }
        filterChain.doFilter(requestWrapper, servletResponse);
    }

    public void destroy() {
    }
}

public class RequestWrapper extends HttpServletRequestWrapper {
    private static final Logger log = LoggerFactory.getLogger(RequestWrapper.class);
    private final String body;

    public RequestWrapper(HttpServletRequest request) {
        super(request);
        StringBuilder stringBuilder = new StringBuilder();
        BufferedReader bufferedReader = null;
        ServletInputStream inputStream = null;
        try {
            inputStream = request.getInputStream();
            if (inputStream != null) {
                bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
                char[] charBuffer = new char[4096];
                int bytesRead;
                // Blocks until the container delivers body data or reports end of stream.
                while ((bytesRead = bufferedReader.read(charBuffer)) != -1) {
                    stringBuilder.append(charBuffer, 0, bytesRead);
                }
            }
        } catch (IOException e) {
            log.error(e.getMessage(), e);
        }
        // The excerpt omits this assignment and the getter; both are reconstructed here.
        this.body = stringBuilder.toString();
    }

    public String getBody() {
        return body;
    }
}
The code that reads the request body uses this loop:
while ((bytesRead = bufferedReader.read(charBuffer)) != -1) {
    stringBuilder.append(charBuffer, 0, bytesRead);
}
bufferedReader.read() eventually calls org.apache.tomcat.util.net.NioBlockingSelector.read() in Tomcat to read the body of the request:
int keycount = 1;
while (!timedout) {
    if (keycount > 0) { // only read if we were registered for a read
        read = socket.read(buf);
        if (read != 0) {
            break;
        }
    }
    try {
        if (att.getReadLatch() == null || att.getReadLatch().getCount() == 0) att.startReadLatch(1);
        poller.add(att, SelectionKey.OP_READ, reference);
        if (readTimeout < 0) {
            att.awaitReadLatch(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
        } else {
            att.awaitReadLatch(readTimeout, TimeUnit.MILLISECONDS);
        }
    } catch (InterruptedException ignore) {
        // Ignore
    }
}
Because the body of the GET request is empty, socket.read() returns 0, so execution continues to att.awaitReadLatch(readTimeout, TimeUnit.MILLISECONDS):
protected void awaitLatch(CountDownLatch latch, long timeout, TimeUnit unit) throws InterruptedException {
    if (latch == null) throw new IllegalStateException("Latch cannot be null");
    latch.await(timeout, unit);
}
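This ends up calling LockSupport.parkNanos(time), which parks the thread until the timeout. You may still be wondering why passing Content-Length in the Header leads down this code path. Don't worry, keep reading: a more detailed analysis follows.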
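The waiting behavior can be reproduced with nothing but the JDK. The sketch below is a minimal illustration, not Tomcat code: the class and method names are hypothetical, and the 200 ms timeout is an arbitrary demo value. It waits on a CountDownLatch that nobody ever counts down, much like a read latch waiting for body bytes that never arrive.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutSketch {
    // Waits on a latch that is never counted down.
    // Returns true only if the latch was released before the timeout.
    static boolean awaitData(long timeoutMillis) {
        CountDownLatch readLatch = new CountDownLatch(1);
        try {
            return readLatch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        boolean released = awaitData(200);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("released=" + released + ", waited about " + elapsedMs + " ms");
    }
}
```

The timed await parks the calling thread for the full timeout and then reports that the latch was never released, which matches the stuck stack seen while debugging.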
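Solutions
1. Service B removes the dependency on the problematic jar package.
2. Modify the Filter configuration in the problematic jar package so that the body is read only for POST requests.
3. The caller of the interface adds configuration to filter out the Content-Length attribute when the request is a GET (the main cause).
4. Modify the FeignInterceptor in the underlying dependency so that it checks the request method before assigning Headers (the company's underlying package is not easy for us to change).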
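What should really be changed is option 4. However, this is an underlying package that the whole company depends on, so changing it would mean notifying the architecture team and so on, and the impact would be fairly broad.
In the end we went with option 3 first: we add a check in our own request path and stop passing Content-Length on GET requests.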
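A minimal sketch of the check behind option 3 might look like the following. It uses plain JDK maps and hypothetical names so that it runs on its own; in a real project this logic would presumably sit in a Feign request interceptor on the calling side, which is an assumption and is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GetHeaderSanitizer {
    // Returns the headers to forward. For GET, Content-Length is dropped
    // because the outbound request carries no body.
    static Map<String, String> headersToForward(String method, Map<String, String> inbound) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : inbound.entrySet()) {
            boolean isContentLength = "Content-Length".equalsIgnoreCase(e.getKey());
            if ("GET".equalsIgnoreCase(method) && isContentLength) {
                continue; // skip the stale length copied from the front-end POST
            }
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> inbound = new LinkedHashMap<>();
        inbound.put("Content-Type", "application/json");
        inbound.put("Content-Length", "27"); // made-up value
        System.out.println("GET  -> " + headersToForward("GET", inbound));
        System.out.println("POST -> " + headersToForward("POST", inbound));
    }
}
```

Header names are compared case-insensitively because HTTP header names are case-insensitive; other headers and non-GET requests pass through untouched.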
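How the fix works
Now for the actual underlying mechanism. After the server side sends the feign request, the request always passes through the org.apache.coyote.http11.Http11Processor.prepareRequest() method in Tomcat. The relevant logic is: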
If contentLength >= 0, an org.apache.coyote.http11.filters.IdentityInputFilter is added. The bufferedReader.read() call in the RequestWrapper from the jar package that Service B added then ends up in org.apache.coyote.http11.filters.IdentityInputFilter.doRead().
This method in turn calls directly into org.apache.tomcat.util.net.NioBlockingSelector.read():
Because the GET request has no body, while the copied Content-Length announces one, the socket read returns 0 here and execution goes straight to awaitReadLatch(). That ends up in LockSupport.parkNanos(time), which parks the thread until the timeout, and this is why every feign request times out.
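The same effect can be reproduced without Tomcat. The sketch below is an illustration under stated assumptions, not the code path described above: it uses plain blocking sockets on the loopback interface, a made-up request for `/demo`, and a hypothetical `bodyReadTimesOut` helper. The client sends GET headers that declare a 5-byte body and then sends nothing; the receiving side consumes the headers and tries to read the promised body.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class MissingBodySketch {
    // Returns true if reading the declared (but never sent) body timed out.
    static boolean bodyReadTimesOut(int readTimeoutMillis) {
        byte[] head = "GET /demo HTTP/1.1\r\nHost: localhost\r\nContent-Length: 5\r\n\r\n"
                .getBytes(StandardCharsets.US_ASCII);
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            OutputStream out = client.getOutputStream();
            out.write(head); // headers only, no body follows
            out.flush();

            accepted.setSoTimeout(readTimeoutMillis);
            InputStream in = accepted.getInputStream();
            // Consume exactly the header bytes so the next read targets the body.
            int remaining = head.length;
            byte[] buf = new byte[head.length];
            while (remaining > 0) {
                int n = in.read(buf, 0, remaining);
                if (n < 0) throw new IOException("connection closed while reading headers");
                remaining -= n;
            }
            try {
                in.read(); // first of the 5 promised body bytes
                return false;
            } catch (SocketTimeoutException expected) {
                return true; // nothing arrived before the read timeout
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("body read timed out: " + bodyReadTimesOut(300));
    }
}
```

The client keeps the connection open, so the reader sees neither data nor end of stream and can only wait for its timeout, just as the Feign caller waits for a response that is never produced in time.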
But what if the requesting service is configured not to pass Content-Length? In that case an org.apache.coyote.http11.filters.VoidInputFilter is set up instead, which also happens in Http11Processor.prepareRequest().
VoidInputFilter.doRead() simply returns -1, so NioBlockingSelector.read() is never called. This is the key to solving the problem.
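The branch described in this section can be summarized in a small model. This is a simplified sketch and not Tomcat's source: the enum and method names are hypothetical, -1 is used to model an absent Content-Length header, and Transfer-Encoding handling is left out entirely.

```java
public class InputFilterChoiceSketch {
    enum BodyFilter { IDENTITY, VOID }

    // A declared Content-Length (>= 0) selects the identity filter;
    // with no declared length (-1 here) the void filter is used.
    static BodyFilter choose(long contentLength) {
        return contentLength >= 0 ? BodyFilter.IDENTITY : BodyFilter.VOID;
    }

    // What a body read does under each filter, as described in this post.
    static String firstRead(BodyFilter filter) {
        return filter == BodyFilter.VOID
                ? "returns -1 immediately"
                : "delegates to the socket read and may wait for data";
    }

    public static void main(String[] args) {
        long[] declaredLengths = {27, -1}; // 27 is a made-up copied value
        for (long len : declaredLengths) {
            BodyFilter f = choose(len);
            System.out.println("Content-Length=" + len + " -> " + f + ": " + firstRead(f));
        }
    }
}
```

Removing the header on the calling side moves the GET request from the first branch to the second, which is why option 3 is enough to stop the timeouts.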
Summary
I won't go into the concept of Content-Length here, since I assume everyone is familiar with it. If you are not, you can refer to:
https://blog.piaoruiqing.com/2019/09/08/do-you-know-content-length/
A simple Content-Length really stumped me, and the irregular request was the real cause of the problem. It took a lot of time to track down, but it was all worth it. Nobody grows without going through all kinds of problems. I hope everyone gets something out of reading this.