How to handle TCP TIME_WAIT and CLOSE_WAIT

You don't need to memorize every TCP state. It is enough to understand the three most common ones: ESTABLISHED means the connection is actively communicating, TIME_WAIT means our side closed the connection (active close), and CLOSE_WAIT means the peer closed it (passive close). Normally there is no need to inspect the network state at all; when a server misbehaves, 80 to 90% of the time it is one of the following two situations:
1. The server maintains a large number of TIME_WAIT states
2. The server maintains a large number of CLOSE_WAIT states
This matters because Linux limits the number of file handles each user may hold (see: http://blog.csdn.net/shootyou/article/details/6579139). Every connection sitting in TIME_WAIT or CLOSE_WAIT keeps its channel occupied while doing no useful work; as the idiom goes, it is "hogging the pit without using it". Once the handle limit is hit, new requests can no longer be processed, "Too Many Open Files" exceptions pile up, and tomcat crashes.
Let's discuss how to handle each case. A lot of material online conflates the two and assumes that tuning kernel parameters solves both. Tuning kernel parameters does resolve TIME_WAIT fairly easily, but dealing with CLOSE_WAIT has to start from the program itself. Here are the two situations in turn:
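Before digging in, it helps to see the actual distribution of states on the box (netstat -n or ss -tan will show the same information). Below is a minimal, Linux-only sketch that tallies states by parsing /proc/net/tcp; the hex state codes (01, 06, 08) come from the kernel's tcp_states.h:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TcpStateCount {
    // Hex state codes as defined in the kernel's tcp_states.h
    private static final Map<String, String> STATES = new HashMap<>();
    static {
        STATES.put("01", "ESTABLISHED");
        STATES.put("06", "TIME_WAIT");
        STATES.put("08", "CLOSE_WAIT");
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/tcp"));
        for (String line : lines.subList(1, lines.size())) { // skip header row
            // Fields: sl, local_address, rem_address, st (hex state), ...
            String[] fields = line.trim().split("\\s+");
            String state = STATES.getOrDefault(fields[3], fields[3]);
            counts.merge(state, 1, Integer::sum);
        }
        counts.forEach((state, n) -> System.out.println(state + ": " + n));
    }
}

If either TIME_WAIT or CLOSE_WAIT dominates the output, the corresponding section below applies.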

1. The server maintains a large number of TIME_WAIT states.
This situation is quite common. Crawler servers and web servers (where the administrator did not tune the kernel parameters at installation time) run into it all the time. So how does it arise?
As the TCP state diagram shows, TIME_WAIT is held by the party that actively closes the connection. A crawler server is itself a "client": after finishing a crawl task it actively closes the connection, enters TIME_WAIT, and only after holding that state for 2MSL (MSL = maximum segment lifetime) is the connection fully closed and its resources reclaimed. Why keep the resources for a while when the connection has already been actively closed? The designers of TCP/IP specified this for two reasons:
1. To prevent stray packets from the old connection from reappearing after being delayed in the network and corrupting a new connection on the same address/port pair (after 2MSL, every duplicate packet from the old connection will have expired).
2. To close the TCP connection reliably. The final ACK sent by the active closer (acknowledging the peer's FIN) may be lost; the passive side then retransmits its FIN. If the active side were already in CLOSED it would answer with an RST instead of an ACK, so it must stay in TIME_WAIT rather than jump straight to CLOSED. TIME_WAIT also lets resources be reclaimed on a predictable schedule, and it does not consume much unless a large number of requests arrive in a short time or the server is under attack.
The following passage about MSL is quoted:
MSL is the time a TCP segment can survive while traveling from source to destination, i.e. the lifetime of a packet on the network. TCP was defined in RFC 793 back in 1981, when networks were far slower than today's Internet. Can you imagine waiting 4 minutes for the first byte to appear after typing a URL into the browser? In today's network environment that is essentially impossible, so we can greatly shorten how long TIME_WAIT persists and free ports up for other connections sooner.
Another quote from a network resource:
It is worth mentioning that for HTTP over TCP it is the server side that closes the TCP connection, so the server is the one that enters TIME_WAIT. For a heavily visited web server this means a large number of TIME_WAIT entries: if the server handles 1000 requests per second and 2MSL is 240 seconds (4 minutes), a backlog of 240 × 1000 = 240,000 TIME_WAIT records accumulates, and maintaining all of them burdens the server. Modern operating systems do use fast lookup structures to manage them, so checking whether a new TCP connection request hits an existing TIME_WAIT entry does not take much time, but carrying that many states around is still bad.
HTTP/1.1 makes Keep-Alive the default behavior, so multiple request/response exchanges reuse one TCP connection; this very problem was one of the main motivations.

In other words, HTTP interaction differs from the diagram above: it is not the client but the server that closes the connection, so web servers also end up with a lot of TIME_WAIT.

Now let's talk about how to solve this problem.

The solution is very simple: let the server recycle and reuse those TIME_WAIT resources quickly.

Let's take a look at the changes our network admin made to /etc/sysctl.conf:

# For a new connection, how many SYN retransmissions the kernel sends before
# giving up; should not exceed 255. The default is 5, roughly 180 seconds.
net.ipv4.tcp_syn_retries = 2
#net.ipv4.tcp_synack_retries = 2
# How long a connection sits idle before TCP sends keepalive probes (when
# keepalive is enabled). The default is 2 hours (7200 s); lowered to 1200.
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_orphan_retries = 3
# When the local end has requested the close, how long the socket may stay
# in FIN-WAIT-2.
net.ipv4.tcp_fin_timeout = 30
# Length of the SYN queue; the default is 1024. Raising it accommodates more
# connections waiting to complete the handshake.
net.ipv4.tcp_max_syn_backlog = 4096
# Enable SYN cookies: when the SYN queue overflows, cookies fend off
# small-scale SYN flood attacks. The default is 0 (off).
net.ipv4.tcp_syncookies = 1

# Enable reuse: allow TIME-WAIT sockets to be reused for new TCP
# connections. The default is 0 (off).
net.ipv4.tcp_tw_reuse = 1
# Enable fast recycling of TIME-WAIT sockets. The default is 0 (off).
net.ipv4.tcp_tw_recycle = 1

# Reduce the number of keepalive probes sent before a connection is
# declared dead.
net.ipv4.tcp_keepalive_probes = 5
# Enlarge the network device receive queue.
net.core.netdev_max_backlog = 3000
After modification, execute /sbin/sysctl -p to make the parameters take effect.
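To confirm the new values actually took effect, you can read them back from /proc/sys, which is where sysctl stores them. A small verification sketch, assuming a Linux host (note that tcp_tw_recycle no longer exists on kernels 4.12 and later, hence the existence check):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SysctlCheck {
    public static void main(String[] args) throws IOException {
        // The keys tuned above; tcp_tw_recycle may be absent on newer kernels.
        String[] keys = {"tcp_tw_reuse", "tcp_tw_recycle",
                         "tcp_fin_timeout", "tcp_keepalive_time"};
        for (String key : keys) {
            Path p = Paths.get("/proc/sys/net/ipv4/" + key);
            String value = Files.exists(p)
                    ? Files.readAllLines(p).get(0).trim()
                    : "(not present on this kernel)";
            System.out.println("net.ipv4." + key + " = " + value);
        }
    }
}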

The parameters to pay particular attention to are net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, net.ipv4.tcp_fin_timeout, and the net.ipv4.tcp_keepalive_* family.

net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle both exist to reclaim TIME_WAIT resources faster. (One caution worth knowing: tcp_tw_recycle is known to break clients behind NAT and was removed from the kernel in Linux 4.12, so tcp_tw_reuse is the safer of the two today.)
net.ipv4.tcp_fin_timeout shortens how long the server holds a connection in FIN-WAIT-2 when the peer misbehaves and never finishes the close.
The net.ipv4.tcp_keepalive_* family configures how the server probes connections to check whether they are still alive.
For the use of keepalive, please refer to: http://hi.baidu.com/tantea/blog/item/580b9d0218f981793812bb7b.html
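Keepalive also has to be requested on the application side, per socket; the kernel parameters above only control when and how often probes are sent once it is on. A minimal sketch in Java (the host and port are placeholders):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class KeepAliveClient {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) {
            // SO_KEEPALIVE asks the OS to probe idle connections; probe timing
            // and count come from the net.ipv4.tcp_keepalive_* settings above.
            socket.setKeepAlive(true);
            socket.connect(new InetSocketAddress("example.com", 80), 5000);
            System.out.println("keep-alive enabled: " + socket.getKeepAlive());
        }
    }
}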

2. The server maintains a large number of CLOSE_WAIT states.
Take a breather. At first I only planned to cover the difference between TIME_WAIT and CLOSE_WAIT, but the topic kept pulling me deeper. That is the benefit of writing a summary blog post: there are always unexpected gains.

TIME_WAIT can be solved by tuning server kernel parameters, because a TIME_WAIT pile-up is under the server's own control: it stems from the peer's connections misbehaving or from the kernel not reclaiming resources fast enough, not from a bug in our own program.
CLOSE_WAIT is different. As the state diagram shows, a socket stuck in CLOSE_WAIT means exactly one thing: after the peer closed the connection, our server program never noticed, or simply forgot that the connection now needs to be closed on our side, so the resource stays occupied by the program. In my view no kernel parameter can fix this; the kernel has no right to reclaim a resource the program is holding, short of killing the program.

If you are using HttpClient and have run into a lot of CLOSE_WAIT, this post may be useful to you: http://blog.csdn.net/shootyou/article/details/6615051
There I described a scenario that illustrates the difference between CLOSE_WAIT and TIME_WAIT; here it is again:
Server A is a crawler server. It uses a simple HttpClient to request resources from the apache on resource server B. Normally, when the request succeeds, server A actively closes the connection after grabbing the resource, so A's connections show up as TIME_WAIT. But what if something goes wrong? Suppose the requested resource does not exist on server B; then B initiates the close, and server A closes passively. If, on this passive-close path, the programmer forgets to have HttpClient release the connection, the result is CLOSE_WAIT.
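To make the fix concrete, here is a sketch of the pattern using Apache HttpClient 4.x (the exact API depends on which HttpClient version you use, and the URL is a placeholder). The point is that the response must be consumed and closed on every path, including error responses such as a 404; otherwise the passively closed connection lingers in CLOSE_WAIT:

import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class CrawlerFetch {
    public static void main(String[] args) throws IOException {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("http://server-b.example/resource");
            // try-with-resources closes the response even when server B
            // answers 404 and closes first, so the socket does not get
            // stuck in CLOSE_WAIT on our side.
            try (CloseableHttpResponse response = client.execute(get)) {
                // Fully consume the entity so the connection can be released
                // back to the connection manager (or closed) cleanly.
                EntityUtils.consume(response.getEntity());
                System.out.println(response.getStatusLine());
            }
        }
    }
}

With the older HttpClient 3.x that the original scenario used, the equivalent step is calling method.releaseConnection() in a finally block.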

So the fix for a large number of CLOSE_WAIT can be summed up in one sentence: check your code. The problem lies in the server program.

Reference materials:
1. Handling TIME_WAIT under Windows, see this author's post: http://blog.miniasp.com/post/2010/11/17/How-to-deal-with-TIME_WAIT-problem-under-Windows.aspx
2. WebSphere server tuning, of some reference value: http://publib.boulder.ibm.com/infocenter/wasinfo/v6r0/index.jsp?topic=/com.ibm.websphere.express.doc/info/exp/ae/tprf_tunelinux.html
3. The meaning of various kernel parameters: http://haka.sharera.com/blog/BlogTopic/32309.htm
4. sysctl optimization of the Linux network, from "Linux server adventures": http://blog.csdn.net/chinalinuxzend/article/details/1792184
