Analysis of the TCP connection disconnection problem

From: http://www.ibm.com/developerworks/cn/aix/library/0808_zhengyong_tcp/

Introduction

In official documents, the TCP/IP protocol suite is also called the Internet protocol suite.

The TCP/IP protocol suite is the most widely used internetworking technology in the world. Its layered structure is shown in Figure 1.
Figure 1. The layered structure of the TCP/IP protocol suite

As Figure 1 shows, the data link layer handles the transmission media and other physical interface details; the network layer handles the movement of packets through the network, including fragmentation of upper-layer packets and routing; the transport layer provides end-to-end communication between two hosts; and the application layer handles application-specific details. The IP protocol is the core protocol of the network layer and provides an unreliable, connectionless data delivery service. The TCP protocol sits at the transport layer and builds connection-oriented, reliable communication on top of the unreliable, connectionless IP protocol.

Since TCP is a connection-oriented protocol, before two hosts can communicate, a connection needs to be established first. Below we will briefly introduce the establishment of a TCP connection and how the two communicating parties maintain the established TCP connection.
Establishing and maintaining a TCP connection

A TCP connection is established through the well-known "three-way handshake". The following example shows the establishment process concretely.

In the rest of this article, the client host is testClient.cn.ibm.com (Linux) and the server host is testServer.cn.ibm.com (AIX). On one terminal of the testClient host, run the command tcpdump -i eth0 host testServer to monitor network traffic (eth0 is the network interface the client host uses to communicate with the external network). At the same time, on another terminal of the client host, run the following command: (root@testClient /)>telnet testServer. The tcpdump output on the client host at this point is shown in Listing 1.
Listing 1. Three-way handshake to create a TCP connection

# tcpdump -S -i en0 host testServer
1 14:02:38.384918 IP testClient.cn.ibm.com.43370 >
testServer.cn.ibm.com.telnet: S 3392458353:3392458353(0) …
2 14:02:38.629578 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.43370: S 881279296:881279296(0) ack 3392458354 …
3 14:02:38.629592 IP testClient.cn.ibm.com.43370 >
testServer.cn.ibm.com.telnet: . ack 881279297 …

Note: We removed some extraneous information from the tcpdump output. For ease of understanding, Figure 2 presents the above output as a sequence diagram.
Figure 2. Sequence diagram of the three-way handshake that establishes a TCP connection

From Figure 2, we can clearly see that when establishing a connection between testClient and testServer, the following three-way handshake process is required:

    testClient sends a SYN segment to testServer to open the handshake. Its sequence number is 3392458353; the segment carries no data, but the SYN flag itself consumes one sequence number.
    testServer sends its own SYN segment to testClient with sequence number 881279296, which likewise consumes one sequence number; in the same segment it returns ACK 3392458354 in response to the segment with sequence number 3392458353 sent by testClient.
    testClient returns ACK 881279297 to testServer in response to the segment with sequence number 881279296 sent by testServer.
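
The acknowledgment numbers in the handshake follow one rule: a SYN consumes one sequence number, so each side acknowledges the peer's initial sequence number plus one, modulo 2^32. The following is a minimal Python sketch of that arithmetic using the values from Listing 1; the function name expected_ack is only illustrative.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def expected_ack(initial_seq: int, payload_len: int = 0) -> int:
    """Return the ACK number that acknowledges a SYN carrying payload_len bytes."""
    # The SYN flag itself occupies one sequence number.
    return (initial_seq + 1 + payload_len) % SEQ_SPACE


client_isn = 3392458353  # packet 1 in Listing 1
server_isn = 881279296   # packet 2 in Listing 1

print(expected_ack(client_isn))  # 3392458354
print(expected_ack(server_isn))  # 881279297
```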

A TCP connection is established after the above three-way handshake is completed; after that, the two ends of the connection can transfer information to each other. Therefore, a TCP connection can be regarded as a communication channel identified by the IP addresses and ports of both ends, and the establishment of a TCP connection is the process of registering the above-mentioned communication channel with both communicating parties. Once the TCP connection is established, as long as the intermediate nodes between the two communicating parties (including network devices such as gateways, switches, routers, etc.) work normally, the TCP connection will be maintained until either of the communicating parties actively closes the connection.
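
As a rough illustration of a connection being identified by the IP addresses and ports of both ends, the following Python sketch opens a connection over the loopback interface and prints how each endpoint sees it. The loopback address and the OS-chosen port are assumptions made only so that the example is self-contained.

```python
import socket

# Listen on an ephemeral loopback port (port 0 lets the OS choose one).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

# Each endpoint holds the same two (IP, port) pairs, mirrored.
client_view = (client.getsockname(), client.getpeername())
server_view = (server_side.getsockname(), server_side.getpeername())
print("client view:", client_view)
print("server view:", server_view)

for s in (client, server_side, listener):
    s.close()
```

The local address of one end is the remote address of the other, and this pair of addresses is what both hosts use to recognize the connection.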

This property of TCP allows an idle connection that exchanges no data to persist for hours, days, or even months. Intermediate routers can crash and restart, and network cables can be unplugged and reconnected, yet the TCP connection is maintained as long as the hosts at both ends are not restarted.

Factors that cause a TCP connection to drop

Ideally, a TCP connection can be held for a long time. In practice, however, a TCP connection that looks normal on the client or server side may already have been broken. Two kinds of factors can break a TCP connection: the intermediate nodes in the network, and the two communicating endpoints (the client and server nodes).

In real networks, communication between two hosts often passes through multiple intermediate nodes, such as routers, gateways, and firewalls. The TCP connection between the two hosts is therefore also affected by those intermediate nodes, especially by firewalls. A firewall can be implemented in software, in hardware, or as a combination of both. It inspects incoming and outgoing traffic against a set of rules, allowing exchanges that comply with the rules and blocking those that violate them. Because of the way a firewall works, tracking each network connection consumes resources, and enterprise firewalls usually sit at the entry and exit points of the enterprise network, so holding inactive TCP connections for a long time inevitably degrades network performance. For this reason most firewalls, by default, close connections that have been inactive for a long time, which breaks the TCP connection. Similarly, if a fault in an intermediate node prevents the client's request to close the connection from reaching the server, the corresponding connection on the server is left broken as well.

On the other hand, for the hosts at both ends, creating a TCP connection takes a certain amount of system resources. When a connection is no longer used, we want the two communicating hosts to close it actively so that the resources it occupies are released. However, if a fault on the client (such as a crash or an abnormal restart) prevents the connection from being closed gracefully, the server is left holding a broken connection.

On either a client node or a server node, a broken TCP connection can no longer carry any data, so holding a large number of broken TCP connections wastes system resources. This waste may not matter much on client nodes; on server hosts, however, it can exhaust system resources (especially memory and sockets), so that the server can no longer serve new user requests. In practice, therefore, the server side needs a way to detect whether a TCP connection has been broken.

Three common methods for detecting disconnected TCP connections

The principle for detecting whether a TCP connection is broken or still working is relatively simple: periodically send a message in an agreed format to the remote node of the connection and wait for its reply. If the correct reply arrives within the specified time, the connection is healthy; otherwise, it is considered broken. Three detection methods based on this principle are in common use.
Application self-probing

The application itself probes the TCP connections it has established. This method is very flexible: the detection mechanism and its implementation can be chosen to suit the application. In practice, however, most applications have no such self-probing capability.
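
An application-level probe can be sketched as a request and reply with a timeout. The Python example below is only an illustration under stated assumptions: the PING/PONG message format, the function names, and the use of a local socket pair in place of a real client connection are all hypothetical, and a real protocol would also need proper message framing.

```python
import socket
import threading


def peer_alive(sock, timeout=1.0):
    """Send one probe and report whether the expected reply arrives in time."""
    sock.settimeout(timeout)
    try:
        sock.sendall(b"PING\n")
        return sock.recv(16) == b"PONG\n"
    except OSError:  # timeout, reset, or broken pipe
        return False


def answer_probes(sock):
    """Remote side: answer every PING with PONG until the peer closes."""
    while True:
        data = sock.recv(16)
        if not data:
            return
        if data == b"PING\n":
            sock.sendall(b"PONG\n")


# A peer that answers probes.
local, remote = socket.socketpair()
threading.Thread(target=answer_probes, args=(remote,), daemon=True).start()
responsive = peer_alive(local)

# A peer that never answers (nothing reads from silent_remote).
silent_local, silent_remote = socket.socketpair()
silent = peer_alive(silent_local, timeout=0.2)

print(responsive, silent)
local.close()
```

The flexibility mentioned above shows up in the choice of message format and timeout; the cost is that both ends of the connection must implement the same probe protocol.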
Detection by third-party applications

This method installs a third-party application on the server node to check whether each TCP connection on the node is healthy or broken. Its biggest drawback is that every client that is to be probed must be able to recognize the packets sent by the detection application, so it is rarely used in practice.
Keep-alive detection at the TCP protocol layer

The most commonly used detection method is the keep-alive function provided by the TCP protocol layer, that is, the TCP keep-alive timer. Although this feature is not a required part of the TCP specification, almost all Unix-like systems implement it, which makes this detection method widely used.

In the following sections, we will focus on keepalive detection methods from the TCP protocol layer.

TCP connection keep-alive timers on Unix-like systems

Keep-alive timers for TCP connections can be implemented at the application layer or provided by TCP itself. Which is better is a matter of debate, which is why keep-alive detection is not a required part of the TCP specification. For convenience, however, almost all Unix-like systems provide the function in TCP.
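
From an application's point of view, TCP-layer keep-alive is typically switched on per socket with the standard SO_KEEPALIVE socket option. The Python sketch below assumes the application has direct access to its socket. The per-socket TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT options are not available on every platform, so the code sets them only where the socket module exposes them; the values (in seconds, as on Linux) and the helper name enable_keepalive are illustrative assumptions.

```python
import socket


def enable_keepalive(sock, idle=120, interval=75, count=8):
    """Turn on TCP keep-alive and, where supported, tune it for this socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the system-wide timers; platform dependent.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print("keep-alive enabled:", keepalive_on != 0)
sock.close()
```

Where the per-socket options are missing, the system-wide timers listed in Listing 2 apply.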
Listing 2. Keepalive timers on common Unix systems
OS        Keepalive timers
AIX       # no -a | grep keep
          tcp_keepcnt = 8
          tcp_keepidle = 14400
          tcp_keepintvl = 150
Linux     # sysctl -A | grep keep
          net.ipv4.tcp_keepalive_intvl = 75
          net.ipv4.tcp_keepalive_probes = 9
          net.ipv4.tcp_keepalive_time = 7200
FreeBSD   # sysctl -A | grep net.inet.tcp
          net.inet.tcp.keepidle=…
          net.inet.tcp.keepintvl=…

The time unit of each parameter differs between systems. On AIX, the unit for tcp_keepidle/tcp_keepinit/tcp_keepintvl is 0.5 seconds; on Linux, the unit for net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_time is seconds. Also, these parameters apply only to connections of the server applications running on that host.
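
Because of the different units, the AIX values have to be halved before they can be compared with the Linux ones. A small Python sketch of the conversion, using the AIX defaults from Listing 2 (the helper name is illustrative):

```python
AIX_TICKS_PER_SECOND = 2  # AIX expresses these timers in half-second units


def aix_ticks_to_seconds(ticks):
    """Convert an AIX keep-alive timer value to seconds."""
    return ticks / AIX_TICKS_PER_SECOND


aix_defaults = {"tcp_keepidle": 14400, "tcp_keepintvl": 150}
for name, ticks in aix_defaults.items():
    print(f"{name} = {ticks} -> {aix_ticks_to_seconds(ticks):g} s")
```

The results, 7200 seconds and 75 seconds, match the Linux defaults for net.ipv4.tcp_keepalive_time and net.ipv4.tcp_keepalive_intvl shown in Listing 2.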

Note: On Solaris, similar parameters can be displayed with the "ndd /dev/tcp \?" command; on HP-UX they can be queried with the nettune or ndd command.

Since almost all Unix-like systems support this function, the following sections use AIX to describe the meaning of the above parameters and how they work.

TCP connection keep-alive detection mechanism and principle in AIX

As shown in Listing 2, the keep-alive detection mechanism on AIX is controlled by a small set of parameters, whose meanings are given in Listing 3:
Listing 3. Keep-alive timer control parameters on AIX
Control parameter   Description
tcp_keepcnt         Maximum number of probes sent before an inactive connection is closed; the default is 8.
tcp_keepidle        Maximum idle time before a validity probe is performed on a connection; the default value is 14400 (i.e. 2 hours).
tcp_keepintvl       Time interval between two probes; the default value is 150 (i.e. 75 seconds).

Let's look at a concrete example. On testServer (the AIX host), set tcp_keepidle=240 (i.e. 2 minutes), tcp_keepcnt=8, and tcp_keepintvl=150 (i.e. 75 seconds). Start tcpdump on testServer to watch the packet exchange, then open a telnet connection from testClient to testServer. Once the connection is established, unplug the network cable from testClient and observe the output on the server (see Listing 4).
Listing 4. tcpdump output of telnet connection on server side

1 # tcpdump -i en1 host testServer.cn.ibm.com
2 04:51:51.379716 IP testClient.cn.ibm.com.40621 >
testServer.cn.ibm.com.telnet: S 4097149880:4097149880(0)
3 04:51:51.379755 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: S 2543529892:2543529892(0) ack 4097149881
4 04:51:51.380609 IP testClient.cn.ibm.com.40621 >
testServer.cn.ibm.com.telnet: . ack 1
5 ...
6 04:51:54.924058 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: P 676:696(20) ack 87
7 04:51:54.924909 IP testClient.cn.ibm.com.40621 >
testServer.cn.ibm.com.telnet: . ack 696

8 04:53:54.550192 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: . 695:696(1) ack 86
9 04:55:09.550997 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: . 695:696(1) ack 86
10 04:56:24.552053 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: . 695:696(1) ack 86
11 04:57:39.552615 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: . 695:696(1) ack 86
12 04:58:54.553446 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: . 695:696(1) ack 86
13 05:00:09.554287 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: . 695:696(1) ack 86
14 05:01:24.555117 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: . 695:696(1) ack 86
15 05:02:39.555958 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: . 695:696(1) ack 86
16 05:03:54.557282 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: . 695:696(1) ack 86
17 05:05:09.559795 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.40621: R 696:696(0) ack 87

As Listing 4 shows, the packet on line 6 is the last data sent on this connection, and line 7 is the acknowledgment of that data. After that, no data is exchanged and the connection stays idle. After 2 minutes of inactivity (the difference between the timestamp of line 8, 04:53:54, and that of line 7, 04:51:54, which equals tcp_keepidle), line 8 shows the first keep-alive probe sent by the server. Because the server receives no response to the probe, it sends another keep-alive probe after the tcp_keepintvl interval (75 seconds), as shown on line 9. After the server has sent tcp_keepcnt probes in a row (the output above shows that AIX actually sends tcp_keepcnt+1 probes) without receiving any response from the client, it sends a reset segment to the client on line 17 and closes the connection on the server side.
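
The timing in Listing 4 can be reproduced with simple arithmetic: the first probe goes out after the idle time, later probes follow at fixed intervals, and the reset follows one interval after the last probe. The Python sketch below uses the settings from this example, converted to seconds, and the tcp_keepcnt+1 probe count observed above. The function name is illustrative, the calendar date is arbitrary, and fractions of a second are ignored.

```python
from datetime import datetime, timedelta


def keepalive_timeline(last_activity, idle_s, interval_s, probes):
    """Return (probe_times, reset_time) for a peer that never answers."""
    first = last_activity + timedelta(seconds=idle_s)
    probe_times = [first + timedelta(seconds=i * interval_s) for i in range(probes)]
    reset_time = probe_times[-1] + timedelta(seconds=interval_s)
    return probe_times, reset_time


last_activity = datetime(2008, 1, 1, 4, 51, 54)  # time of line 7; the date is arbitrary
probe_times, reset_time = keepalive_timeline(
    last_activity,
    idle_s=240 / 2,      # tcp_keepidle = 240 half-seconds
    interval_s=150 / 2,  # tcp_keepintvl = 150 half-seconds
    probes=8 + 1,        # tcp_keepcnt + 1, as observed on AIX
)

for t in probe_times:
    print("probe", t.strftime("%H:%M:%S"))
print("reset", reset_time.strftime("%H:%M:%S"))
```

With these settings the server needs 120 + 9 × 75 = 795 seconds, a little over 13 minutes, to give up on the unreachable client, which agrees with the reset at 05:05:09 on line 17.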

Note that although keep-alive detection sends TCP probe packets, these packets have no effect on a healthy TCP connection. As Listing 4 shows, the probe on line 8 carries 1 byte of data starting at sequence number 695, and that byte was already sent on line 6 and acknowledged by the client on line 7. On a healthy connection, the client answers the probe with an ACK like the one on line 7, which tells the server that the connection is working properly.

Next, we will analyze the impact of the above mechanisms on TCP connection retention through an actual TCP disconnection example, and propose two optional solutions for applications that need to maintain TCP connections for a long time.

TCP disconnection and data analysis on AIX
Figure 3. Schematic diagram of network topology with TCP disconnection

All server hosts are placed in one LAN behind Firewall B. For work reasons, the host testClient in the work-area LAN needs to connect over TCP/IP to the database on testServer in the server LAN, and the upper-layer application on testClient operates on that database through this connection.

In actual testing, we found that even when both testClient and testServer were working normally, the connection held by the client on testClient would be dropped unexpectedly, without the client receiving any prior warning. (An attempt to perform a database operation over the connection then fails with the error connection is reset by foreign host.)

Because this happens repeatedly while the intermediate nodes in the network (routers, switches, and so on) are all working normally, physical causes (such as power failure or downtime) can be ruled out. To analyze the cause of the disconnection, we first checked the default keep-alive settings on testServer:

# no -a | grep keep
tcp_keepcnt = 8
tcp_keepidle = 14400
tcp_keepintvl = 150

The tcp_keepidle value on testServer is 14400, i.e. 2 hours. Since the intermediate nodes are working fine, why doesn't the keep-alive mechanism keep the connection up? To find out, we used tcpdump to capture packets on both testServer and testClient, as shown in Listing 5 and Listing 6.
Listing 5. Server-side tcpdump data output

1 10:18:58.881950 IP testClient.cn.ibm.com.59098 >
testServer.cn.ibm.com.telnet: S 1182666808:1182666808(0) ...
2 10:18:58.882001 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.59098: S 3333341833:3333341833(0) ack 1182666809 ...
3 10:18:58.882845 IP testClient.cn.ibm.com.59098 >
testServer.cn.ibm.com.telnet: . ack 1 ...
4 ...
5 10:19:03.165568 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.59098: P 1010:1032(22) ack 87 ...
6 10:19:03.166457 IP testClient.cn.ibm.com.59098 >
testServer.cn.ibm.com.telnet: . ack 1032 ...
7 12:19:05.445336 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.59098: . 1031:1032(1) ack 86 ...
8 12:19:05.445464 IP testClient.cn.ibm.com.59098 >
testServer.cn.ibm.com.telnet: R 86:87(1) ack 1031 ...

Listing 6. Client-side tcpdump data output

1 # tcpdump -e -i eth0 host testServer.cn.ibm.com
2 10:18:55.800553 IP testClient.cn.ibm.com.59098 >
testServer.cn.ibm.com.telnet: S 1182666808:1182666808(0) ...
3 10:18:55.801778 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.59098: S 3333341833:3333341833(0) ack 1182666809 ...
4 10:18:55.801799 IP testClient.cn.ibm.com.59098 >
testServer.cn.ibm.com.telnet: . ack 1 ...
5 ...
6 10:19:00.084662 IP testServer.cn.ibm.com.telnet >
testClient.cn.ibm.com.59098: P 1010:1032(22) ack 87 ...
7 10:19:00.084678 IP testClient.cn.ibm.com.59098 >
testServer.cn.ibm.com.telnet: . ack 1032 ...

As Listing 5 shows, the server host sends its first keep-alive probe (line 7 in Listing 5) once the connection has been inactive for the 2 hours set by tcp_keepidle. Immediately afterwards, the server host receives a connection reset that appears to come from testClient (line 8 in Listing 5), and the server then closes the connection (this can be checked with netstat -ni). However, the tcpdump data in Listing 6 shows that testClient never sent such a packet. So who sent the reset to testServer?

To identify the sender of the reset, we used the same tcpdump command to capture packets on the server and on firewall B (note: on a firewall host it is usually necessary to capture on both the ingress and the egress network interface). The result shows that firewall B sends a reset to testServer immediately after receiving the first probe from testServer.

This analysis shows that at some point between the last data exchange and the server's first keep-alive probe, firewall B had already terminated the connection. When the probe reached the firewall, the firewall discarded it and sent back a reset packet.

Two common solutions

There are two common solutions for the above TCP disconnection:

Solution 1. Extend the time after which the firewall terminates inactive TCP connections. In the case above, for example, the firewall can be configured with a timeout greater than the 2 hours set on the server side.

Solution 2. Shorten the TCP keep-alive time on the server side. The aim is to send keep-alive probes before the firewall terminates the connection, which both checks the client's state and keeps the connection active.

With the first solution, extending the time for which TCP connections are held may reduce firewall performance, especially when there are many connections that stay inactive for a long time. With the second solution, shortening the keep-alive time on the server side increases the number of packets in the network and consumes additional bandwidth. Each solution therefore has its own advantages and disadvantages, and the choice depends on the actual application scenario.
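
The trade-off can be made concrete with a small calculation. The Python sketch below checks whether the first keep-alive probe would be sent before the firewall drops an idle connection, and estimates how many probes per day a healthy but idle connection generates. The firewall timeout of 3600 seconds is purely hypothetical, since the article does not state the value configured on firewall B, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86400


def keepalive_beats_firewall(keepidle_s, firewall_idle_timeout_s):
    """True if the first probe is sent before the firewall drops the connection."""
    return keepidle_s < firewall_idle_timeout_s


def probes_per_day(keepidle_s):
    """Approximate probes per day on a healthy idle connection.

    Each answered probe restarts the idle timer, so one probe is sent
    roughly every keepidle_s seconds.
    """
    return SECONDS_PER_DAY // keepidle_s


firewall_timeout_s = 3600  # hypothetical value, not taken from the article

for keepidle_s in (7200, 1800):
    print(keepidle_s,
          keepalive_beats_firewall(keepidle_s, firewall_timeout_s),
          probes_per_day(keepidle_s))
```

Under this assumption, the default of 7200 seconds is too long to keep the connection alive, while 1800 seconds works at the cost of four times as many probes per idle connection.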

Summary

This article introduced the concepts behind establishing and maintaining TCP connections and the common factors that affect whether a connection is kept alive. It listed the keep-alive configuration parameters on common Unix-like systems, analyzed a real TCP disconnection case on AIX with the help of tcpdump, and finally offered two feasible solutions for that case.
