Investigating timeouts on some dubbox REST interfaces

 

During a business peak, some REST interfaces timed out for a period of time. We suspected kafka, nginx, log4j, the network, and other causes in turn and optimized each of them, but nothing changed much. We run a total of four nginx reverse-proxy gateways in production. By grepping the logs on one of the nginx instances, the ops team saw that during the peak a single nginx was proxying 100+ requests per second to one backend tomcat, so the four nginx together sent 400+ per second. That exceeded the number of concurrent requests tomcat could hold, i.e. its worker threads plus its fully connected queue (200 + 100 = 300). Two days ago the ops team raised the tomcat thread count from 200 to 600, and the timeouts have not reappeared since.

 

Our post-mortem analysis of the causes is as follows:

Tomcat has a configuration parameter called acceptCount, which is the backlog of the server's listening socket, i.e. the upper limit on the number of established connections waiting in the fully connected (accept) queue. This value cannot be set through dubbox; it stays at the default of 100.

Another parameter, threadCount, is the size of the tomcat worker thread pool. Each time a request arrives, the pool takes a thread from the head of its idle queue to handle it, and puts the thread back at the tail of the queue when the request finishes; in other words, two consecutive requests are generally not served by the same thread.
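As a hedged illustration (the attribute names come from dubbo's protocol schema; the thread value mirrors the change described above, and nothing else here is from the original post), the thread-pool increase the ops team made could be expressed in the provider configuration roughly like this:

```xml
<!-- Illustrative only: raises the rest server's worker pool from the 200 default.
     acceptCount has no corresponding attribute here, so the backlog stays at 100. -->
<dubbo:protocol name="rest" server="tomcat" threads="600" />
```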


 

In other words, with the previous settings of threadCount=200 and acceptCount=100, tomcat could handle at most 200 requests concurrently. If more than 200 arrived at once, up to 100 extra connections would wait in the fully connected queue; anything beyond that could not even enter the queue and was rejected by the server (see Note 1 for details). Three further situations are possible:

1. The half-open (SYN) queue is full: see Note 2 for details.

2. Timeout inside the fully connected queue: this happens mainly when interfaces respond slowly, so no idle thread in the tomcat worker pool is available to pick up requests from the fully connected queue. Once a queued request has waited past the timeout (20 s here), the tcp connection is actively closed by nginx (http response code 504 in the nginx log). Note that a request that has entered the fully connected queue will still eventually be processed by tomcat, but by then the connection has been torn down and nginx never receives tomcat's response.

3. Interface processing timeout: in this case the request did reach a tomcat worker thread, but the interface responded so slowly that nginx eventually closed the tcp connection itself.
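The threadCount + acceptCount admission limit described above can be mimicked with a plain `ThreadPoolExecutor`. This is a scaled-down analogy, not Tomcat's actual code, and the class and method names are made up:

```java
import java.util.concurrent.*;

// Scaled-down analogy (not Tomcat's real code): `threads` workers plus a bounded
// queue of `backlog` slots; anything beyond threads + backlog gets rejected,
// just like a connection that cannot enter the full accept queue.
public class AdmissionModel {
    static int accepted(int threads, int backlog, int offered) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(backlog));      // accept-queue analogue
        int ok = 0;
        for (int i = 0; i < offered; i++) {
            try {
                // each task blocks until released, keeping every worker busy
                pool.execute(() -> {
                    try { release.await(); } catch (InterruptedException ignored) { }
                });
                ok++;                                    // running or queued
            } catch (RejectedExecutionException e) {
                // analogue of the kernel dropping/resetting the connection
            }
        }
        release.countDown();
        pool.shutdown();
        try { pool.awaitTermination(5, TimeUnit.SECONDS); } catch (InterruptedException ignored) { }
        return ok;
    }

    public static void main(String[] args) {
        // threadCount=200 and acceptCount=100 scaled down to 4 and 2:
        System.out.println(accepted(4, 2, 10)); // prints 6 (4 running + 2 queued)
    }
}
```

Offering 10 concurrent requests to a 4-thread pool with a 2-slot queue admits exactly 6, which is the same arithmetic as 200 + 100 = 300 in production.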

 

Conclusion:

Treating the symptom means enlarging the tomcat worker thread pool (threadCount); treating the root cause means improving the response time (rt) of the interfaces. Imagine that every one of our interfaces stayed within 100 ms: with a worker pool of 200, a single machine could sustain a peak of 2000 concurrent requests per second (200 / 0.1), assuming of course that memory and cpu are not too weak.
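The back-of-the-envelope figure above is simply worker count divided by average response time; a minimal sketch (hypothetical class and method names):

```java
public class ThroughputEstimate {
    // Upper bound on sustained throughput: each of the workerThreads completes
    // one request every avgResponseMillis, so together they can finish
    // workerThreads * (1000 / avgResponseMillis) requests per second.
    static int maxRequestsPerSecond(int workerThreads, int avgResponseMillis) {
        return workerThreads * 1000 / avgResponseMillis;
    }

    public static void main(String[] args) {
        // 200 worker threads, every interface answering within 100 ms:
        System.out.println(maxRequestsPerSecond(200, 100)); // prints 2000
    }
}
```

The same formula also explains the symptomatic fix: raising the pool to 600 threads triples the bound even if response times stay the same.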

 

 

Note 1: When the accept queue is full, the server stops responding even if the client keeps sending ACK packets, and each such event increments the ListenOverflows counter. What the server does next is decided by /proc/sys/net/ipv4/tcp_abort_on_overflow: 0 means silently drop the ACK, 1 means send an RST to notify the client. Correspondingly, the client sees either a read timeout or a connection reset by peer.
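The overflow policy in Note 1 is switchable through the kernel parameter it names; a hedged sketch of the relevant setting (the value shown is the common default, not necessarily what this system used):

```
# /etc/sysctl.conf fragment (illustrative; 0 is the usual default)
# 0: when the accept queue overflows, silently drop the client's final ACK
#    (client later reports a read timeout)
# 1: send an RST so the client fails fast with "connection reset by peer"
net.ipv4.tcp_abort_on_overflow = 0
```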

Note 2: The size of the SYN half-open queue is controlled by the kernel parameter /proc/sys/net/ipv4/tcp_max_syn_backlog; on some kernels it also seems to be bounded by the backlog argument of listen(), taking the smaller of the two values. When this queue is full and syncookies are not enabled, the server drops incoming SYN packets, and the client returns a connection timed out error after retransmitting its SYN several times without a response. However, when the server enables syncookies (syncookies=1), the SYN half-open queue effectively has no logical upper limit, and the value set in /proc/sys/net/ipv4/tcp_max_syn_backlog is ignored.
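The two knobs from Note 2, sketched as a config fragment (illustrative values, not taken from this system):

```
# /etc/sysctl.conf fragment (illustrative values)
net.ipv4.tcp_max_syn_backlog = 1024   # SYN half-open queue size (ignored when syncookies are on)
net.ipv4.tcp_syncookies = 1           # lifts the logical cap on the SYN queue
```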

 

To close, the original post appended a classic diagram of tcp connection establishment and of the half-open and fully connected queues.



 

 

Reference links:

TCP's half-open and fully connected queues

https://segmentfault.com/a/1190000008224853

Tomcat's acceptCount and maxConnections

https://segmentfault.com/a/1190000008064162

 
