Understanding TIME_WAIT: how to thoroughly understand and fix "TCP: time wait bucket table overflow"

  I have long known about this problem without understanding its cause. I ran into it again recently, read a lot of material, and finally resolved it for good. Let's look at the figure first, since everything below revolves around it. The figure shows the whole process of the four-way close (the "four waves"):

 

[Figure: the TCP four-way close (four waves) process; image wKiom1cd6_mwEZr2AACU62IiAp4333.png]

 

Using this figure, let's first explain a few concepts:

How TIME_WAIT arises: the side that actively closes the connection enters the TIME_WAIT state after sending the final ACK of the four-way close, and stays in that state for two MSLs (on Linux one MSL is 30s and is not configurable, so TIME_WAIT lasts 60s).
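
To see how many sockets are currently sitting in TIME_WAIT, one option is to read /proc/net/tcp, where the fourth column holds the state as a hex code and 06 means TIME_WAIT. The following is only a minimal sketch: the function name count_states and the reduced state table are my own choices for illustration, and IPv6 sockets (in /proc/net/tcp6) are not covered.

```python
from collections import Counter

# Hex state codes as they appear in /proc/net/tcp; 06 is TIME_WAIT.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts
```

On a Linux host this could be called as count_states(open("/proc/net/tcp").read())["TIME_WAIT"]; a value that keeps climbing toward the tw threshold is the early sign of the overflow discussed later.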

 

The role of the 2MSL wait in TIME_WAIT: it lets the TCP connection close reliably and safely. For example, if the network is congested and the active party's final ACK never reaches the passive party, the passive party retransmits its FIN, possibly several times. While the active party is still in TIME_WAIT it can answer those FINs with a fresh ACK; without the wait, the retransmitted FINs would affect new connections and other services, as described below.

 

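Resources occupied by TIME_WAIT: a small amount of kernel memory (roughly 4K at most; on Linux the slimmed-down time-wait structure is much smaller) and the connection's address/port 4-tuple. No fd is held, because the application has already closed the socket.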

 

The harm of skipping TIME_WAIT: 1. When network conditions are poor and the active party does not wait in TIME_WAIT, the two parties may establish a new TCP connection right after closing the previous one. If a retransmitted or delayed FIN from the passive party then arrives, it directly affects the new TCP connection;

2. Under the same poor network conditions and without TIME_WAIT, suppose no new connection is established after the close. When the passive party's retransmitted or delayed FIN arrives, the active party answers with an RST packet, which may affect other services and connections on the passive party.

 

The cause and impact of "TCP: time wait bucket table overflow": the cause is that the number of TIME_WAIT (tw) sockets has exceeded the system's threshold, net.ipv4.tcp_max_tw_buckets. The harm is that once the threshold is exceeded, the system destroys the excess time-wait sockets immediately and logs this warning. In a NAT environment with heavy traffic, this leads to unstable connections and unexpected disconnects.
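
A rough way to tell how close a machine is to this threshold is to compare the tw counter in /proc/net/sockstat with the value of tcp_max_tw_buckets. The sketch below assumes the usual "TCP: inuse N orphan N tw N alloc N mem N" line layout; the helper names parse_tw_count and tw_bucket_usage are hypothetical and exist only for this example.

```python
def parse_tw_count(sockstat_text):
    """Return the 'tw' counter from the TCP line of /proc/net/sockstat."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]  # drop the "TCP:" label
            # Remaining fields alternate between a name and its value.
            pairs = dict(zip(fields[0::2], fields[1::2]))
            return int(pairs["tw"])
    raise ValueError("no TCP line found in sockstat text")


def tw_bucket_usage(tw_count, max_tw_buckets):
    """Fraction of the time-wait bucket limit currently in use."""
    return tw_count / max_tw_buckets
```

For instance, feeding it the contents of /proc/net/sockstat and /proc/sys/net/ipv4/tcp_max_tw_buckets gives a ratio; a result near 1.0 suggests the overflow warning is about to appear.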

 

 

Tuning the relevant parameters (the values must of course fit your server's actual situation; here we focus on what the parameters mean):

    Now that you know what TIME_WAIT is for, try to tune it in ways that stay within the TCP protocol. Reuse and recycle of tw sockets actually violate the protocol, so if server resources allow and the load is not heavy, try not to enable them. When "TCP: time wait bucket table overflow" appears, first try increasing the following parameter:

net.ipv4.tcp_max_tw_buckets = 256000

While adjusting this parameter, also shorten the timeout for the transition from FIN_WAIT_2 to TIME_WAIT; the default is 60s, and it can be reduced to 30s:

net.ipv4.tcp_fin_timeout = 30
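
Before changing anything, it can help to check what the machine is actually running with. The sketch below maps a sysctl key to its file under /proc/sys and reports which of the two values above differ from the current ones. The names DESIRED, sysctl_path, read_from_proc and diff_settings are made up for this example, and the simple dot-to-slash mapping is an assumption that holds for these keys but not for every sysctl name.

```python
from pathlib import Path

# The two values suggested in the text; adjust to the server's real needs.
DESIRED = {
    "net.ipv4.tcp_max_tw_buckets": 256000,
    "net.ipv4.tcp_fin_timeout": 30,
}


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key such as net.ipv4.tcp_fin_timeout to its /proc/sys file."""
    return Path(root) / key.replace(".", "/")


def read_from_proc(key):
    """Read the current value of a sysctl key (Linux only)."""
    return sysctl_path(key).read_text().strip()


def diff_settings(desired, read_value):
    """Return {key: (current, wanted)} for every setting that differs."""
    changes = {}
    for key, wanted in desired.items():
        current = int(read_value(key))
        if current != wanted:
            changes[key] = (current, wanted)
    return changes
```

On a Linux host, diff_settings(DESIRED, read_from_proc) lists what would change; passing the reader as a parameter keeps the comparison easy to try out with fake values first.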

Other related TCP parameters, such as the number of SYN-ACK retransmissions and SYN retransmissions, will be introduced later; tuning them is also beneficial.

 

     Next, let's talk about reuse and recycle, two optimization parameters on Linux that apply specifically to TIME_WAIT and are turned off by default. Both take effect only when timestamps are enabled:

net.ipv4.tcp_timestamps = 1

net.ipv4.tcp_tw_reuse = 1

When the machine acts as a client (the side initiating connections), enabling tcp_tw_reuse lets a TIME_WAIT socket be reused for a new outgoing connection after about one second.

net.ipv4.tcp_tw_recycle = 0 (do not enable it: NAT is now widespread on the Internet, and with it enabled the three-way handshake may fail outright)

With it enabled, TIME_WAIT sockets are reclaimed within 3.5*RTO (the RTO is computed from the RTT), and within a 60s window the timestamps in connection requests from the same source IP must keep increasing. From the server's point of view, one source IP may hide many machines behind NAT, whose timestamps cannot be guaranteed to increase relative to one another. The server rejects connection requests with non-increasing timestamps, which directly causes the three-way handshake to fail. Note that tcp_tw_recycle was removed entirely in Linux 4.12.
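
The NAT problem is easier to see with a toy model. The class below is not kernel code; it is a deliberately simplified sketch of a per-source-IP timestamp check, assuming the server remembers the last timestamp it saw from each IP for 60s and drops any SYN whose timestamp is lower. The class name, method name and the example address 203.0.113.7 are all hypothetical.

```python
PAWS_WINDOW = 60  # seconds; the per-host window mentioned in the text


class PerHostTimestampCheck:
    """Toy model: remember the last TCP timestamp seen per source IP."""

    def __init__(self):
        self.last_seen = {}  # source_ip -> (tsval, time_seen_in_seconds)

    def accept_syn(self, source_ip, tsval, now):
        """Return True if the SYN passes the check, False if it is dropped."""
        prev = self.last_seen.get(source_ip)
        if prev is not None:
            prev_tsval, prev_time = prev
            # Inside the window, a timestamp that goes backwards is rejected.
            if now - prev_time < PAWS_WINDOW and tsval < prev_tsval:
                return False
        self.last_seen[source_ip] = (tsval, now)
        return True


# Two machines behind one NAT address, each with its own timestamp clock.
check = PerHostTimestampCheck()
first = check.accept_syn("203.0.113.7", tsval=500000, now=0)    # machine A
second = check.accept_syn("203.0.113.7", tsval=120000, now=5)   # machine B
later = check.accept_syn("203.0.113.7", tsval=120000, now=70)   # B, after 60s
```

In this model machine A's SYN is accepted, machine B's SYN five seconds later is dropped because its timestamp looks like it went backwards, and B only gets through once the 60s window has passed. That is the kind of intermittent handshake failure clients behind NAT run into.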

 

 


 

Reprinted from: https://blog.51cto.com/benpaozhe/1767612


Origin blog.csdn.net/qq_21514303/article/details/88095253