Solving the TIME_WAIT accumulation problem caused by Tengine health checks

1. Problem background

"After the service moved to the cloud, our TCP ports are mostly stuck in the TIME_WAIT state." "This problem never occurred in our on-premises data center." This is how the customer described the issue.

The customer runs a self-built Tengine as a layer-7 reverse proxy with about 18,000 NGINX servers as backends. After Tengine moved to the cloud, a large number of TCP sockets in the TIME_WAIT state appeared on the Tengine server; with so many backends, this could potentially affect business availability. Based on their previous experience, the customer suspected the move to Alibaba Cloud was the cause and asked us to analyze the problem in detail.

Note: The real risk of accumulating TIME_WAIT sockets is that the host can no longer allocate dynamic ports for new outbound connections. You can widen net.ipv4.ip_local_port_range (e.g. 5000-65535) to enlarge the selection range, but the range can still be exhausted within 2 MSL.
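A back-of-envelope sketch makes the exhaustion risk concrete. The figures are assumptions taken from this article: 18,000 backends, the 5 s check interval used in the recommended configuration later on, and the kernel's fixed 60 s TIME_WAIT hold:

```python
# Back-of-envelope: how fast do short-lived health checks pile up in
# TIME_WAIT? Assumed inputs: 18,000 backends (from this article), a 5 s
# check interval, and the 60 s TIME_WAIT hold (TCP_TIMEWAIT_LEN).
backends = 18_000
check_interval_s = 5
time_wait_hold_s = 60

# Each check of each backend is one short-lived connection.
new_sockets_per_s = backends / check_interval_s
steady_state_time_wait = new_sockets_per_s * time_wait_hold_s

print(f"new health-check sockets per second: {new_sockets_per_s:.0f}")
print(f"steady-state TIME_WAIT sockets:      {steady_state_time_wait:.0f}")
# Even the widened range 5000-65535 offers only ~60,000 dynamic ports,
# far below the ~216,000 sockets lingering in TIME_WAIT at steady state.
```

This is why merely widening the port range only postpones the problem rather than fixing it.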

2. TIME_WAIT reason analysis

First, reviewing the TCP state machine, we know that TIME_WAIT appears only on the side that actively closes the connection (regardless of whether that side is the client or the server). When the TCP stack processes a connection close, only the actively closing side enters TIME_WAIT.

And this is exactly what puzzled the customer.

On the one hand, the health check uses HTTP/1.0, a short connection. Logically, the backend NGINX server should actively close the connection, so most TIME_WAIT sockets should appear on the NGINX side.

On the other hand, packet captures confirmed that most of the first FIN packets closing these connections were indeed initiated by the backend NGINX servers. In theory, the sockets on the Tengine server should then go straight to CLOSED, without leaving so many TIME_WAIT entries.
The capture below is filtered by the port number of a TIME_WAIT socket on the Tengine side.

Figure 1: An HTTP request interaction process

Although the capture above makes Tengine's behavior look strange, analysis shows the situation is logically possible. To explain it, we first need to understand that the packets tcpdump captures are only the "result" of what the host sent and received. Even though from the capture's perspective the Tengine side looks like the passive party, whether a socket is actively closed depends on how the operating system's TCP stack handled that socket.

Our conclusion from this capture analysis is: there may be a race condition. If the application closes the socket at roughly the same time as the peer's FIN arrives, then whether the socket ends up in TIME_WAIT or CLOSED depends on which happens first: the active close (the Tengine process calling the close() system call on the socket) or the passive close (the kernel's tcp_v4_do_rcv handler processing the received FIN).
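A minimal sketch of that race, modeling only the two orderings. The function and event names are illustrative (not Tengine or kernel APIs), and the intermediate ACK/FIN-ACK exchanges are assumed to complete between events:

```python
def final_state(events):
    """Walk a simplified close-side TCP state machine.

    events is a sequence of "app_close" (the application calls close())
    and "peer_fin" (the kernel processes the peer's FIN). Simplified:
    the ACK/FIN-ACK exchanges are assumed to finish between events.
    """
    state = "ESTABLISHED"
    for ev in events:
        if state == "ESTABLISHED" and ev == "app_close":
            state = "FIN_WAIT2"   # active close: our FIN sent and ACKed
        elif state == "ESTABLISHED" and ev == "peer_fin":
            state = "CLOSE_WAIT"  # passive close: peer's FIN arrived first
        elif state == "FIN_WAIT2" and ev == "peer_fin":
            state = "TIME_WAIT"   # we closed first, so we hold TIME_WAIT
        elif state == "CLOSE_WAIT" and ev == "app_close":
            state = "CLOSED"      # peer closed first; we exit via LAST_ACK
    return state

# Same two events, opposite order, different final state:
print(final_state(["app_close", "peer_fin"]))  # TIME_WAIT
print(final_state(["peer_fin", "app_close"]))  # CLOSED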

In many cases, environmental factors such as network latency and CPU processing capacity can tip the result either way. For example, in the on-premises environment the latency was low, so the passive close tended to happen first; after moving to the cloud, the distance between Tengine and the backend NGINX lengthened the latency, so Tengine's active close tended to come first. This explains the inconsistency between the on-premises and cloud environments.

However, if the current behavior complies with the protocol standard, attacking the problem head-on becomes difficult. We cannot delay Tengine's active close by slowing down its host, nor can we eliminate the propagation delay caused by physical distance to deliver the FIN sooner. In such a situation we would normally recommend adjusting system configuration to mitigate the problem.

Note: There are several ways to quickly mitigate this problem on current Linux systems, for example:

a) With timestamps enabled, configure tw_reuse:
   net.ipv4.tcp_tw_reuse = 1
   net.ipv4.tcp_timestamps = 1
b) Configure max_tw_buckets:
   net.ipv4.tcp_max_tw_buckets = 5000
   The disadvantage is that the kernel then writes "time wait bucket table overflow" to syslog.
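As a quick sanity check, the current values of these knobs can be read straight from /proc/sys. A small sketch; the dot-to-path mapping is how the sysctl tool itself resolves names:

```python
import os

def read_sysctl(name, base="/proc/sys"):
    """Read a sysctl value the way `sysctl <name>` does:
    dots in the name map to directories under /proc/sys."""
    path = os.path.join(base, *name.split("."))
    with open(path) as f:
        return f.read().strip()

if __name__ == "__main__":
    for knob in ("net.ipv4.tcp_tw_reuse",
                 "net.ipv4.tcp_timestamps",
                 "net.ipv4.tcp_max_tw_buckets"):
        try:
            print(knob, "=", read_sysctl(knob))
        except FileNotFoundError:
            print(knob, "(not available on this kernel)")
```

Writing the values back requires root (sysctl -w or /etc/sysctl.conf); the sketch only reads.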

Since the customer runs a self-built Tengine and is unwilling to forcibly clean up TIME_WAIT sockets, we analyzed Tengine's code to see whether its behavior could be changed, without modifying the source, so that Tengine no longer actively closes these sockets.
Tengine version: Tengine/2.3.1
NGINX version: nginx/1.16.0

2.1 Tengine code analysis

From the earlier captures we can see that most of the TIME_WAIT sockets are created for backend health checks, so we focus on Tengine's health check behavior. The following is an excerpt of the socket cleanup logic from the open-source ngx_http_upstream_check_module.

Figure 2: Process of cleaning socket after Tengine health check is completed

From this logic we can see that if any of the following conditions holds, Tengine closes the connection directly after receiving the response:

  • c->error != 0
  • cf->need_keepalive == false
  • c->requests > ucscf->check_keepalive_requests

Figure 3: The function in Tengine that actually completes the socket closing

If we can make all of these conditions unsatisfied, the operating system hosting Tengine gets the chance to process the passive close first, clean up the socket, and move it to CLOSED, because under HTTP/1.0 the NGINX side will actively close the connection.
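The three conditions can be restated as a single predicate. This is an illustrative restatement mirroring the excerpt above, not Tengine source code:

```python
def tengine_actively_closes(error: bool,
                            need_keepalive: bool,
                            requests: int,
                            check_keepalive_requests: int) -> bool:
    """Tengine closes the health-check connection itself
    if any one of the three conditions holds."""
    return (error
            or not need_keepalive
            or requests > check_keepalive_requests)

# If the request count exceeds the configured threshold, Tengine closes:
print(tengine_actively_closes(False, True, 2, 1))   # True
# With check_keepalive_requests 2 and a single request per check, none of
# the conditions holds, so Tengine waits and lets NGINX close first:
print(tengine_actively_closes(False, True, 1, 2))   # False
```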

2.2 Solution

Under normal circumstances we don't need to care much about TIME_WAIT connections; the system releases them automatically after 2 MSL (60 s by default). If you need to reduce them, consider long (keep-alive) connections, or tune kernel parameters as in the note above.

In this case the customer understood the protocol well but was still wary of forcibly releasing TIME_WAIT sockets; at the same time, with 18,000 backend hosts, the overhead of keeping long-lived connections to all of them would be even harder to bear.

Therefore, based on the preceding code analysis, we recommended the following health check configuration to the customer:

check interval=5000 rise=2 fall=2 timeout=3000 type=http default_down=false;
check_http_send "HEAD / HTTP/1.0\r\n\r\n";
check_keepalive_requests 2;
check_http_expect_alive http_2xx http_3xx;

The reasoning is simple: we need to make the three conditions above unsatisfied. We can ignore the error case, and need_keepalive is enabled by default in the code (and can be adjusted through configuration if not), so we only need to ensure that check_keepalive_requests is greater than 1 to enter Tengine's keep-alive logic and prevent Tengine from actively closing the connection.

Figure 4: Tengine health check reference configuration

Because the HTTP/1.0 HEAD method is used, the backend server actively closes the connection after responding, so the socket created by Tengine goes to CLOSED instead of TIME_WAIT and no longer ties up dynamic port resources.
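To verify the effect after applying the configuration, TIME_WAIT counts can be read from /proc/net/tcp (Linux; the fourth column is the hex state code, 06 for TIME_WAIT). A small sketch:

```python
from collections import Counter

# Hex state codes from the Linux kernel's include/net/tcp_states.h.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def tcp_state_counts(path="/proc/net/tcp"):
    """Count sockets per TCP state; the state is the 4th column."""
    counts = Counter()
    with open(path) as f:
        next(f)                       # skip the header line
        for line in f:
            fields = line.split()
            if len(fields) > 3:
                counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts

if __name__ == "__main__":
    try:
        print("TIME_WAIT sockets:", tcp_state_counts().get("TIME_WAIT", 0))
    except OSError:
        print("/proc/net/tcp not available on this system")
```

In practice `ss -tan state time-wait | wc -l` gives the same number; the sketch just shows where that number comes from.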


This article is the original content of Alibaba Cloud and may not be reproduced without permission.


Origin: blog.csdn.net/weixin_43970890/article/details/112770535