Troubleshooting "ClientAbortException: java.io.IOException: Broken pipe"

Today, a colleague from technical support reported that a customer's service had stopped working and urgently asked for help, so I logged in to the server remotely to troubleshoot.

    I checked the tomcat log of the data-collection application and, out of habit, jumped to the end of the log to see whether any exceptions were being printed. Sure enough, there were several kinds of exceptions, but the most frequent one was this:

[java]
 
  24-Nov-2016 09:54:21.116 SEVERE [http-nio-8081-Acceptor-0] org.apache.tomcat.util.net.NioEndpoint$Acceptor.run Socket accept failed
   java.io.IOException: Too many open files
      at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
      at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
      at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:688)
      at java.lang.Thread.run(Thread.java:745)

    The "Too many open files" problem is clear enough: the process has exhausted its file descriptors, so it can no longer open files or create network connections, and that in turn triggers other failures. Surely ulimit had not been tuned, so I checked its settings:

[plain]
 
  [root@sdfassd logs]# ulimit -a
  core file size          (blocks, -c) 0
  data seg size           (kbytes, -d) unlimited
  scheduling priority             (-e) 0
  file size               (blocks, -f) unlimited
  pending signals                 (-i) 62819
  max locked memory       (kbytes, -l) 64
  max memory size         (kbytes, -m) unlimited
  open files                      (-n) 65535
  pipe size            (512 bytes, -p) 8
  POSIX message queues     (bytes, -q) 819200
  real-time priority              (-r) 0
  stack size              (kbytes, -s) 10240
  cpu time               (seconds, -t) unlimited
  max user processes              (-u) 62819
  virtual memory          (kbytes, -v) unlimited
  file locks                      (-x) unlimited

 

     The open files limit was already 65535, so it had been tuned. Had tomcat and the other services been started before ulimit was raised, so that they were still running with the old limit? That was plausible, and in that case restarting the services would pick up the new limit, so I restarted everything. It all came back up normally, data appeared in the report after a while, so I told technical support the problem was solved and went off to handle other work;
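    Incidentally, rather than guessing, the limits a running process actually inherited can be read directly from /proc, along with how many descriptors it currently holds. A quick sketch (here <tomcat-pid> stands for the Tomcat process id, e.g. taken from ps -ef | grep tomcat):

[plain]
 
  # limits the running process was actually started with
  [root@sdfassd logs]# cat /proc/<tomcat-pid>/limits | grep "open files"
  # number of file descriptors it holds right now
  [root@sdfassd logs]# ls /proc/<tomcat-pid>/fd | wc -l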

    Less than 20 minutes later, technical support reported that there was still no data in the report. I checked the tomcat log of the data-collection application again and found a pile of exceptions, almost all of them the same error:

[java]
 
  24-Nov-2016 09:54:24.574 WARNING [http-nio-18088-exec-699] org.apache.catalina.core.StandardHostValve.throwable Exception Processing ErrorPage[exceptionType=java.lang.Throwable, location=/views/error/500.jsp]
   org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe
      at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:393)
      at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:426)
      at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:342)
      at org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:295)
      at org.apache.catalina.connector.Response.finishResponse(Response.java:453)
      at org.apache.catalina.core.StandardHostValve.throwable(StandardHostValve.java:378)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:174)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79)
      at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610)
      at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:537)
      at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1085)
      at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:658)
      at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:222)
      at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1556)
      at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1513)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
      at java.lang.Thread.run(Thread.java:745)


    There were a huge number of these exceptions. From the error message, tomcat's connector hit a Broken pipe while performing a write. The connector is the part of tomcat that handles network requests, so could the network be at fault? But then why did only writes fail while reads were fine? To check whether it was a network problem, I hit one of the server's interfaces with wget, and it hung for a long time with no response, where normally it would respond immediately. So it was not the network but the server itself. I then checked the current TCP connection states:

[plain]
 
  [root@sdfassd logs]# netstat -n | awk '/^tcp/ {++state[$NF]} END {for(key in state) print key,"\t",state[key]}'
  CLOSE_WAIT        3853
  TIME_WAIT         40
  ESTABLISHED       285
  LAST_ACK          6

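    As a side note, where the iproute2 ss utility is available, the same count can be obtained a bit faster than with netstat on a host holding thousands of sockets (a sketch; the output of the first command includes one header line):

[plain]
 
  # count sockets stuck in CLOSE_WAIT
  [root@sdfassd logs]# ss -tn state close-wait | wc -l
  # or list them to see which local ports and peers are involved
  [root@sdfassd logs]# ss -tn state close-wait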

    There were 3853 connections in CLOSE_WAIT, which is far from normal. It means the client closed the connection first and the server never performed its own close, so the server side keeps those sockets sitting in CLOSE_WAIT. If the operating system's keepalive settings are not tuned, this state is held for two hours by default, so I checked the system settings:

[plain]
 
  [root@sdfassd logs]# sysctl -a |grep keepalive
  net.ipv4.tcp_keepalive_time = 7200
  net.ipv4.tcp_keepalive_probes = 9
  net.ipv4.tcp_keepalive_intvl = 75

    Sure enough, 7200 seconds. That explains why the first look at the tomcat log ended with "Too many open files" exceptions: within those two hours the number of CLOSE_WAIT connections must have exploded and pushed the file descriptor count past the 65535 limit;
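    For reference, these intervals can be shortened at runtime with sysctl, or persisted in /etc/sysctl.conf and reloaded. The values below are purely illustrative; tightening keepalive only softens the symptom, it does not fix whatever is leaving the connections half-closed:

[plain]
 
  [root@sdfassd logs]# sysctl -w net.ipv4.tcp_keepalive_time=1800
  [root@sdfassd logs]# sysctl -w net.ipv4.tcp_keepalive_probes=3
  [root@sdfassd logs]# sysctl -w net.ipv4.tcp_keepalive_intvl=30
  # to persist, add the same keys to /etc/sysctl.conf and apply with
  [root@sdfassd logs]# sysctl -p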

    And this CLOSE_WAIT build-up is presumably caused by the broken pipe exceptions. But what causes the broken pipe in the first place? Why does the probe close the connection while the data-collection server does not? The exception is reported by tomcat's connector, and tomcat is not going to simply forget to call close on a connection, so having ruled out a bug in the program, I could not think of a cause;

    So I pulled the logs of the probe that uploads data to the collection server, and there turned out to be a huge number of one particular exception:

[plain]
 
  2016-11-24 16:27:36,217 [TingYun Harvest Service 1] 166 WARN  - Error occurred sending metric data to TingYun. There can be intermittent connection failures. Please wait for a short period of time: java.net.SocketTimeoutException: Read timed out
  java.net.SocketTimeoutException: Read timed out
      at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.7.0_60]
      at java.net.SocketInputStream.read(SocketInputStream.java:152) ~[na:1.7.0_60]
      at java.net.SocketInputStream.read(SocketInputStream.java:122) ~[na:1.7.0_60]
      at com.tingyun.agent.libs.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SourceFile:136) ~[tingyun-agent-java.jar:2.1.3]
          .................

    They are all read-timeout exceptions, so the picture is now clear: the probe side timed out on its read and dropped the connection, while the data-collection server was still processing the request, unaware that the probe had already disconnected; when it finished processing and sent the result back to the probe, broken pipe;

    So the root cause of the exception: the client times out on its read and closes the connection, and when the server then writes data to a connection the client has already dropped, a broken pipe exception is thrown!
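    This sequence can also be confirmed on the wire with tcpdump on the collection server: the client's FIN arrives first, and when the server writes afterwards the client answers with an RST. A sketch (port 18088 is taken from the connector thread names in the log above; the flag filter follows the tcpdump man page):

[plain]
 
  # capture only FIN/RST segments on the collector port
  [root@sdfassd logs]# tcpdump -i any -nn 'tcp port 18088 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'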

 

    The probe's read timeout is 2 minutes, so why would the server take that long to respond? I used the jstack command to dump tomcat's thread stacks and analyzed them, and eventually found that a time-consuming operation in the code was being executed while holding a lock, which left other threads blocked (for confidentiality reasons the code is not posted here);
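    For completeness, the dump and the first pass over it looked roughly like this (a sketch; <tomcat-pid> again stands for the Tomcat process id):

[plain]
 
  # dump all thread stacks, including lock ownership information
  [root@sdfassd logs]# jstack -l <tomcat-pid> > /tmp/tomcat-threads.txt
  # how many threads are blocked, and what they are blocked on
  [root@sdfassd logs]# grep -c "java.lang.Thread.State: BLOCKED" /tmp/tomcat-threads.txt
  [root@sdfassd logs]# grep -B 2 -A 15 "java.lang.Thread.State: BLOCKED" /tmp/tomcat-threads.txt | less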
