A troubleshooting record: too many TIME_WAIT connections in a project

Symptoms:

Users reported that some client connections were being dropped and reconnected for no apparent reason, forcing new connections to be assigned to those clients.

Troubleshooting

First, locate the anomaly

  1. When the problem occurred, the first thing to check was server status and monitoring metrics.
    Basic metrics such as system memory and CPU usage were all in the normal range.
  2. Since this was a disconnection problem, once the basic metrics checked out we looked at the TCP connection states and found fewer than 200 connections in ESTABLISHED while TIME_WAIT had reached around 60,000. A quick search confirmed that this ratio is far from normal, pointing the analysis toward TCP connection problems (the command sketch below shows one way to count states).
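
A minimal way to reproduce this check on the server (a sketch; column positions can vary slightly between netstat versions):

    # count TCP sockets grouped by connection state
    netstat -tan | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn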

Second, analyze the cause

  1. How the TIME_WAIT state is generated
TIME_WAIT arises during the four-way handshake that tears down a TCP connection, and it appears on the side that actively initiates the close.
Suppose the client initiates the disconnection; the process is as follows:
a. The client sends a FIN and enters the FIN_WAIT1 state.
b. The server receives the FIN and replies with an ACK, entering the CLOSE_WAIT state; when the client receives this ACK it enters FIN_WAIT2.
c. The server sends its own FIN and enters the LAST_ACK state.
d. The client receives the FIN and replies with an ACK, entering the TIME_WAIT state; when the server receives that ACK it enters the CLOSED state.
The client remains in TIME_WAIT for twice the MSL, roughly 60 s on Linux, before moving to CLOSED.
Note: MSL is short for Maximum Segment Lifetime, the longest time a TCP segment may survive on the internet. Every concrete TCP implementation must choose a definite MSL value; RFC 1122 recommends 2 minutes, but the traditional BSD implementation uses 30 seconds. The maximum time spent in TIME_WAIT is therefore 2 * MSL, i.e. 1 to 4 minutes. Red Hat and CentOS systems also use an MSL of 30 seconds.

A simple diagram from the internet illustrates this:

[Figure: simplified four-way handshake closing a TCP connection]
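
To watch this happen on a live system, open and close a connection and then look for the leftover socket (a sketch; example.com is a placeholder, and which end holds the TIME_WAIT depends on which end closed first -- here curl, the active closer, does):

    # make one short-lived HTTP request, then list local sockets left in TIME_WAIT
    curl -s -o /dev/null http://example.com/
    ss -tan state time-wait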
  2. The impact of too many TIME_WAIT connections
    Until a socket leaves the TIME_WAIT state, the local port it occupies cannot be released. In a system that communicates over short-lived connections at high concurrency, after running under high load for a while the client program often fails to establish new socket connections to the server. Running "netstat -tanlp" at that point shows a large number of sockets in the TIME_WAIT state, occupying a large number of local ports. Once the machine's available local ports are used up (or the user's file-handle limit is reached) while many old TIME_WAIT sockets have not yet been reclaimed by the system, new socket connections can no longer be created. At that point the system nearly grinds to a halt, no matter how powerful the hardware is.
    Note: this touches on how many connections a server can hold at most. Since any TCP connection that differs in one element of the four-tuple (source address, source port, destination address, destination port) counts as a new connection, in theory the number of TCP connections a server can accept is effectively unlimited (ignoring the resources consumed by connection-tracking tables and the like) ---- something to dig into another time.
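
The client-side port pool that gets exhausted is a kernel setting and can be inspected directly (a sketch):

    # the range of ephemeral ports available for outgoing connections
    sysctl net.ipv4.ip_local_port_range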

  3. Can TIME_WAIT be skipped so the connection closes directly?
    Why doesn't the actively closing side go straight to CLOSED, instead of entering TIME_WAIT and lingering there for twice the MSL? Because TCP has to be a reliable protocol built on top of an unreliable network.
    Example: the passive closer receives the final ACK sent by the active closer only after delay, because the network itself may delay that ACK; the delay triggers the passive closer to retransmit its FIN. In the extreme case this round trip takes exactly twice the MSL. If the actively closing side skipped TIME_WAIT and went straight to CLOSED, or stayed in TIME_WAIT for less than twice the MSL, then when the passive closer's delayed packets arrive the following problems can appear:
    the old TCP connection no longer exists, so the system can only answer with an RST packet;
    a new TCP connection with the same four-tuple has been established, and the delayed packets may interfere with it.
    Either way TCP would no longer be reliable, so the TIME_WAIT state has to exist. The 2MSL duration also follows from this explanation: MSL is the maximum lifetime of a segment, so one retransmitted FIN plus one ACK, plus assorted delays, generally fit within 2MSL.

  4. How to solve the problem
    Having understood why TIME_WAIT is generated -- it is produced under normal operation and is a necessary step that cannot be skipped -- we have a rough direction for solutions:

a. Adjust the upper limit on the number of connections in the TIME_WAIT state
b. Enable the fast-recycling mechanism
c. Enable the reuse mechanism
d. Replace short-lived connections with long-lived ones
e. Have the client initiate the disconnection

Scheme a: modify the kernel parameter

Edit /etc/sysctl.conf. net.ipv4.tcp_max_tw_buckets is the maximum number of TIME_WAIT sockets the system keeps at one time; beyond that number the system immediately destroys the excess TIME_WAIT connections and the system log shows the warning "TCP: time wait bucket table overflow", so the final count never exceeds the configured value. The largest value net.ipv4.tcp_max_tw_buckets can be set to is 262144 (a hardware limitation).
Although this method quickly pushes the number of TIME_WAIT sockets below the preset value, under high concurrency with short-lived connections TIME_WAIT sockets are generated very fast; once the count reaches the configured value, normal TIME_WAIT connections get destroyed as fast as new ones are created. That is exactly the "skipped TIME_WAIT" situation described earlier, so some connections may misbehave or new connections may fail.
Following the official recommendation, net.ipv4.tcp_max_tw_buckets should not be set too small.
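
A minimal sketch of applying this setting (the value follows the figure quoted above; tune it for your own workload):

    # /etc/sysctl.conf
    net.ipv4.tcp_max_tw_buckets = 262144

    # load the new value and verify it
    sysctl -p
    sysctl net.ipv4.tcp_max_tw_buckets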

Scheme b: enable fast recycling

What is the fast-recycling mechanism? We said earlier that waiting 2 * MSL in TIME_WAIT is necessary, so how can these connections be recycled quickly?
Fast recycling is the system's mechanism for reclaiming TCP connections quickly; the corresponding kernel parameter is net.ipv4.tcp_tw_recycle. To explain that parameter we first have to mention another kernel parameter: net.ipv4.tcp_timestamps.

net.ipv4.tcp_timestamps is a TCP option defined in RFC 1323.
The essence of tcp_timestamps is to record the send time of each packet.
Since TCP is a reliable transport protocol, one of its key mechanisms is retransmission on timeout, so computing an accurate (appropriate) RTO has a major impact on TCP performance. The tcp_timestamps option was designed *primarily* for that purpose.

PAWS — Protection Against Wrapped Sequence numbers
Its purpose is to solve the problem that, at high bandwidth, TCP sequence numbers can wrap around and be reused.
PAWS likewise relies on timestamps, and assumes that within one TCP flow the timestamp values of packets received in order increase monotonically. Under normal conditions, the timestamps carried by packets sent in order on a TCP flow do indeed increase linearly.
If the PAWS mechanism is applied per host rather than per connection, it handles the very case TIME_WAIT guards against -- a packet from the previous flow being accepted as valid data in the next flow -- so there is no longer any need to wait 2 * MSL for TIME_WAIT to end. Waiting long enough (an RTO) to deal with retransmission of the last ACK is sufficient. Linux therefore implements exactly such a mechanism:
when the timestamp and tw_recycle options are enabled together, the per-host PAWS mechanism is turned on, allowing TCP flows in the TIME_WAIT state to be recycled quickly.

How fast is fast recycling?

According to figures turned up via Baidu and Google, a connection can be recycled in roughly 700 ms.

Using fast recycling can in some cases greatly relieve the problem of excessive TIME_WAIT states; after all, the recycle time falls from the original 1 minute to under 1 second. But enabling it introduces a problem the PAWS mechanism cannot solve: some hosts behind NAT networks will fail to connect normally.
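
For completeness, a sketch of what enabling fast recycling looks like (read the next section before doing this in production, and note that the tcp_tw_recycle parameter was removed entirely in Linux 4.12):

    # /etc/sysctl.conf -- both flags are required for per-host PAWS
    net.ipv4.tcp_timestamps = 1
    net.ipv4.tcp_tw_recycle = 1

    sysctl -p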

The conflict between NAT networks and the PAWS mechanism

a. Enabling tcp_timestamps and tcp_tw_recycle together activates the per-host PAWS mechanism in the TCP/IP stack.
b. To the server, traffic from different real clients behind the same NAT appears to come from one and the same host.
c. Although translated by the same NAT, the different real clients each carry their own timestamp values, so there is no guarantee that the packets arriving after NAT translation carry strictly increasing timestamps.
d. Once the server's per-host PAWS mechanism is triggered, it drops packets whose timestamps fail the monotonically increasing check.
The direct result is a lower connection success rate.

Why, then, do some people recommend enabling tcp_timestamps and tcp_tw_recycle together?

If you search Baidu for the problem of too many TIME_WAIT connections, many blog posts offer enabling tcp_tw_recycle as the solution, because with both options on, sockets in the TIME_WAIT state are recycled faster <== which is precisely the point of extending PAWS from per-connection to per-host. The logic is correct, but it perhaps fails to account for the problems that widely deployed NAT can bring.
Still, if someone writes that enabling tcp_tw_recycle solved their problem, it is worth digging a little deeper.
On Linux, tcp_timestamps is enabled by default; when a user manually enables tcp_tw_recycle as well, the PAWS mechanism starts up and fast recycling kicks in, and the TIME_WAIT count really does drop. So why don't they see the NAT/PAWS connectivity problem described above? Look again at point b of "The conflict between NAT networks and the PAWS mechanism" -- traffic from different real clients behind the same NAT appears to the server to come from a single host -- and it becomes clear that the problem needs two trigger conditions to hold:

a. The host making the request sits behind a NAT.
b. tcp_timestamps is enabled on both the requesting host and the requested server, and the server also has tcp_tw_recycle enabled.

So although this is a big pit, it is not that easy to notice from the server side. It still surfaces now and then, though, because when a server program sends requests to other services -- database connection requests, for instance -- the server itself plays the client's role for those connections (one way to spot the problem is sketched below).
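
One way to check whether a server is dropping packets for this reason (a sketch; the exact counter wording varies across kernel versions):

    # non-zero, growing counters here suggest PAWS is rejecting clients behind NAT
    netstat -s | grep -i timestamp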

When the NAT/PAWS conflict does occur and we are the client, how do we avoid connection failures against such a server?

Since the trigger condition is that the client has tcp_timestamps enabled while the server has both tcp_timestamps and tcp_tw_recycle enabled, avoiding the problem means breaking one of those conditions. That gives two approaches:

a. Disable tcp_tw_recycle, which amounts to giving up fast recycling of TIME_WAIT states.
b. Disable tcp_timestamps. In many situations this causes no problems, but given the original purpose of the tcp_timestamps parameter, keeping it in its default enabled state is recommended. (If, acting as a client, you keep failing to connect to someone else's server, it is worth trying to disable tcp_timestamps -- after all, the server is in someone else's hands and you cannot change its parameters.)
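
As sysctl commands, the two approaches look like this (a sketch; on kernels 4.12 and later option a is moot because the parameter no longer exists):

    # option a: give up fast recycling on the server
    sysctl -w net.ipv4.tcp_tw_recycle=0

    # option b (client side, last resort): stop sending timestamps
    sysctl -w net.ipv4.tcp_timestamps=0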

Scheme c: enable the reuse mechanism

Enable the reuse mechanism with net.ipv4.tcp_tw_reuse, which allows sockets still in TIME_WAIT to be reused for new TCP connections; the default is 0, i.e. off.
The reuse mechanism depends on the tcp_timestamps feature.
The condition for reusing a TIME_WAIT socket is that more than 1 s has passed since its last packet was received.
The official documentation describes this parameter from the standpoint of protocol safety, and unlike the description of fast recycling it does not advise against it, only that it should not be enabled without special need; from this one can infer that enabling reuse is somewhat safer than enabling fast recycling.
The official manual still carries a warning: it should not be changed without advice/request of technical experts.
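
A sketch of turning it on (tcp_timestamps must remain enabled for reuse to work):

    # /etc/sysctl.conf
    net.ipv4.tcp_timestamps = 1   # prerequisite, on by default
    net.ipv4.tcp_tw_reuse = 1     # reuse TIME_WAIT sockets for outgoing connections

    sysctl -p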

Advantages and disadvantages of the reuse mechanism

Advantage: with tcp_timestamps keeping the protocol safe, connections in TIME_WAIT can serve new TCP connections, cutting the effective wait from the default 60 s to 1 s.
Disadvantage: the mechanism is only effective for the "client", i.e. the side that initiates the connection. For example, a web server receives requests from clients, but it also needs to connect to its back-end database; at that moment the machine is both a server and a client, and the parameter only takes effect for the TIME_WAIT sockets produced by connections it actively initiates.

Scheme d: replace short-lived connections with long-lived ones

How long-lived and short-lived connections differ:

Short-lived connection
connect -> transfer data -> close the connection
HTTP is stateless: the browser and the server establish a new connection for every HTTP operation and tear it down as soon as the task finishes.
In other words, with a short-lived connection the SOCKET is closed as soon as the data has been sent and received.

Long-lived connection
connect -> transfer data -> keep the connection open -> transfer data -> ... -> close the connection.
A long-lived connection keeps the SOCKET open after it is established, whether or not it is being used, at some cost in safety.

As the comparison shows, long-lived connections fundamentally reduce how often connections are opened and closed, and with that the number of TIME_WAIT states generated.
When to use short-lived versus long-lived connections:

Long-lived connections are mostly used for frequent, point-to-point communication where the number of connections cannot grow too large. Every TCP connection requires a three-way handshake, which takes time; if every operation had to connect before doing its work, processing would slow down considerably. So the connection is kept open after each operation, and the next operation sends its packets directly, with no new TCP connection to set up.
HTTP services on web sites, on the other hand, generally use short-lived connections, because long-lived connections cost the server a certain amount of resources; with connections as frequent as a web site's, coming from thousands or even hundreds of millions of clients, short-lived connections are more economical. If long-lived connections were used and thousands of users were online at once, each holding a dedicated connection, you can imagine the load. So when concurrency is high but each user operates infrequently, short-lived connections are the better fit (the curl sketch below shows the reuse a long-lived connection buys).
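
A quick way to see connection reuse in action (a sketch; example.com is a placeholder, and curl reuses one TCP connection for both requests when the server allows keep-alive):

    # with keep-alive, the second request reports "Re-using existing connection"
    curl -v http://example.com/ http://example.com/ 2>&1 | grep -Ei 'connected to|re-using'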

At present Nginx supports long-lived upstream connections only for reverse-proxy servers configured in an upstream block; servers configured directly in the proxy_pass directive are not supported, nor is proxy_pass with variable parameters. In addition, long-lived connections require HTTP/1.1 (HTTP/1.0 can achieve them by setting the Connection request header to "keep-alive", but that is not recommended).
Furthermore, since the HTTP proxy module sets the Connection header of reverse-proxied requests to Close by default, the Connection header has to be cleared there (clearing a header means it is not sent at all; in HTTP/1.1 connections are long-lived by default).
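
A minimal sketch of such an Nginx configuration (the upstream name and address are placeholders):

    upstream backend {
        server 127.0.0.1:8080;
        keepalive 32;                        # idle keep-alive connections kept per worker
    }

    server {
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;          # keep-alive requires HTTP/1.1
            proxy_set_header Connection "";  # clear the default "Connection: close"
        }
    }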

For other aspects of long-lived connections, see the linked post: long connection

Scheme e: have the client disconnect first (a dodge, not a method studied here)

With this approach the client becomes the side that initiates the disconnection, so the TIME_WAIT stays on the client.
A short study of which side should actively close a TCP connection



Author: attack the fat of
Link: https://www.jianshu.com/p/a2938fc35573
Source: Jianshu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
