Too many CLOSE_WAIT connections on the server

Last week, a service that our server calls became unavailable, and a large number of TCP connections in the CLOSE_WAIT state piled up on the server. This left tomcat apparently hung: many requests were stuck, and tomcat could no longer serve new incoming requests. It was the first time I had run into this situation, so I looked up CLOSE_WAIT and found that it is simply one of the TCP connection states. Let's first go through the TCP states.

Status:
CLOSED: No connection at all; this is the start and end point of the TCP state machine (the server does nothing at this time)
LISTEN: Listening for connection requests from any remote TCP port
SYN_SENT: The client has sent a connection request (SYN) to the server and is waiting for the reply; it enters ESTABLISHED on success and goes directly to CLOSED on failure
SYN_RECEIVED: A connection request has been received and answered; waiting for the other party to confirm the connection request
ESTABLISHED: The three-way handshake is completed and an open connection has been established; data can be sent
FIN_WAIT_1: Waiting for a connection termination request from the remote TCP, or for confirmation of the termination request already sent
FIN_WAIT_2: The remote ACK has been received; waiting for the connection termination request from the remote TCP
CLOSE_WAIT: The remote side has closed the connection; waiting for the local application to close it
CLOSING: Waiting for the remote TCP's confirmation of the connection termination request
LAST_ACK: Waiting for confirmation of the connection termination request previously sent to the remote TCP
TIME_WAIT: Waiting long enough to ensure that the remote TCP has received the confirmation of its connection termination request

Comparing the state changes on the client side and on the server side makes this easier to digest.
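
The close-related states above can be sketched as a small transition function. This is only a simplified model for illustration (the class, enum, and event names are made up here, and cases such as a combined FIN+ACK are left out), but it shows the point that matters for this article: CLOSE_WAIT is only left when the local application closes.

```java
final class TcpCloseSketch {
    enum State { ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT, CLOSE_WAIT, LAST_ACK, CLOSED }
    enum Event { APP_CLOSE, RECV_FIN, RECV_ACK, TIMEOUT }

    /** Returns the next state; events that do not apply leave the state unchanged. */
    static State next(State s, Event e) {
        switch (s) {
            case ESTABLISHED:
                if (e == Event.APP_CLOSE) return State.FIN_WAIT_1; // active close: we send FIN
                if (e == Event.RECV_FIN) return State.CLOSE_WAIT;  // passive close: kernel ACKs the FIN
                break;
            case FIN_WAIT_1:
                if (e == Event.RECV_ACK) return State.FIN_WAIT_2;
                if (e == Event.RECV_FIN) return State.CLOSING;     // both sides closed at once
                break;
            case FIN_WAIT_2:
                if (e == Event.RECV_FIN) return State.TIME_WAIT;
                break;
            case CLOSING:
                if (e == Event.RECV_ACK) return State.TIME_WAIT;
                break;
            case TIME_WAIT:
                if (e == Event.TIMEOUT) return State.CLOSED;
                break;
            case CLOSE_WAIT:
                if (e == Event.APP_CLOSE) return State.LAST_ACK;   // only the application can move on
                break;
            case LAST_ACK:
                if (e == Event.RECV_ACK) return State.CLOSED;
                break;
            default:
                break;
        }
        return s;
    }
}
```

In this model no network event moves a connection out of CLOSE_WAIT; only APP_CLOSE does.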

Problem analysis:

After going through the TCP state transitions above, you should have a general picture. To see the state of the TCP connections on the server, you can use the command

netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
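
For readers less familiar with awk, the same counting logic can be sketched in Java: keep only lines starting with tcp and count them by their last column, which is the connection state. The class and method names are hypothetical, and the input is assumed to be netstat -n output that has already been read into a list of lines.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NetstatStateCount {
    /** Counts TCP connections per state, like the awk one-liner does. */
    static Map<String, Integer> countStates(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (!line.startsWith("tcp")) {
                continue; // same filter as /^tcp/ (also matches tcp6)
            }
            String[] fields = line.trim().split("\\s+");
            String state = fields[fields.length - 1]; // $NF: the last field
            counts.merge(state, 1, Integer::sum);     // ++S[$NF]
        }
        return counts;
    }
}
```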

At that time, about 1000 connections were in CLOSE_WAIT, which is close to the maxConnTotal value configured for httpClient.

Closing a TCP connection has an active side and a passive side. When the passive side receives the FIN, its kernel acknowledges it with an ACK right away and the connection enters CLOSE_WAIT. It stays there until the local application closes the socket, which sends that side's own FIN. So a connection stuck in CLOSE_WAIT means that the remote service has already closed the connection, but our server (which acts as the client of the remote service here) never closed its end. This can happen because the application is busy processing data or doing other work, or because some code path never releases the connection.
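
The following is a minimal sketch of the passive-close side using plain loopback sockets, not the code from the incident. The class and method names are hypothetical. The accepted socket plays the remote service and closes first; the local socket then sees end-of-stream and, at the OS level, sits in CLOSE_WAIT until it is closed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class CloseWaitDemo {
    /** Returns what read() yields on our side after the peer has closed. */
    static int readAfterPeerClose() {
        try (ServerSocket listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket local = new Socket(listener.getInetAddress(), listener.getLocalPort())) {
            // The accepted socket stands in for the remote service; closing it sends a FIN.
            listener.accept().close();
            // The kernel ACKs that FIN on its own; read() reports it as end-of-stream.
            int eof = local.getInputStream().read();
            // From here until close() runs (end of this try block), the local socket
            // is the one that netstat would list as CLOSE_WAIT.
            return eof;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the code never reached close(), for example because an error branch returned early while holding a pooled connection, the socket would remain in CLOSE_WAIT indefinitely.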

I checked the code that sends the request; the response and other streams were all closed properly. Looking for other causes, I found that when a response came back with a status other than 200, the method returned directly without handling it. It looked roughly like this:

if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
    // actual processing
}

If the status is not 200, the response is not processed at all. After the modification:

if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
    // actual processing
} else {
    // abort requests whose status is not OK
    httpRequest.abort();
}
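
To see why the missing else branch matters, here is a toy model of a bounded connection pool built on java.util.concurrent.Semaphore. It is not httpClient's API; the class, method, and parameter names are invented for illustration, and a real pool would typically make the caller wait rather than return false. It only shows how unreleased connections on the error path eventually use up maxConnTotal.

```java
import java.util.concurrent.Semaphore;

final class PoolLeakSketch {
    private final Semaphore leases;

    PoolLeakSketch(int maxConnTotal) {
        this.leases = new Semaphore(maxConnTotal);
    }

    /** Performs one call; returns false if no connection could be leased. */
    boolean call(int status, boolean releaseOnError) {
        if (!leases.tryAcquire()) {
            return false; // pool exhausted: every connection is still held
        }
        if (status == 200) {
            // ... actual processing, then the connection goes back to the pool
            leases.release();
        } else if (releaseOnError) {
            // the fixed version: give the connection up on the error path too
            leases.release();
        }
        // otherwise the lease is never returned, like the original code
        return true;
    }

    int available() {
        return leases.availablePermits();
    }
}
```

With releaseOnError set to false, a pool of size N stops serving after N non-200 responses, which matches the symptom of the CLOSE_WAIT count approaching maxConnTotal.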

Then we added the following three parameters to /etc/sysctl.conf:
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 2
# tcp_keepalive_time is in seconds; 1800 is 30 minutes
net.ipv4.tcp_keepalive_time = 1800

Then run sysctl -p to make the configuration take effect.

1. tcp_keepalive_time
When keepalive is enabled, how long a connection must be idle before TCP starts sending keepalive probes to check whether it is still alive. The default is 2 hours.
2. tcp_keepalive_intvl
The interval at which a probe is resent when the previous one was not acknowledged. The default is 75 seconds.
3. tcp_keepalive_probes
How many keepalive probes TCP sends without a reply before deciding that the connection has failed.
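
As a rough worked example, the time until a dead idle connection is dropped is approximately tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes. Note also that these kernel settings only apply to sockets that have keepalive switched on. The sketch below uses hypothetical class and method names; Socket.setKeepAlive is the standard Java call for enabling SO_KEEPALIVE on a socket.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

final class KeepaliveSketch {
    /** Approximate seconds of silence before the kernel gives up on a peer. */
    static long secondsUntilDeclaredDead(long time, long intvl, long probes) {
        return time + intvl * probes;
    }

    /** Turns on SO_KEEPALIVE for a socket and reports whether it is set. */
    static boolean enableKeepAlive() {
        try (Socket socket = new Socket()) {
            socket.setKeepAlive(true); // without this the sysctl values are not used
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the values from this article, secondsUntilDeclaredDead(1800, 30, 2) gives 1860 seconds, i.e. 31 minutes.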

These were the only two changes I made, and the number of CLOSE_WAIT connections went down. Other causes can lead to the same situation; you can refer to the following materials.

Reference materials:

1. https://www.cnblogs.com/jessezeng/p/5616518.html
2. https://blog.csdn.net/shootyou/article/details/6615051

Origin blog.csdn.net/sc9018181134/article/details/100678894