TCP socket file handle leak

Today I noticed socket-count alarms being raised for one of our Redis machines, which was very strange: only a few Redis instances are deployed on that server, so the number of open ports should be limited.

1. netstat shows a normal number of TCP connections


netstat -n | awk '/^tcp/ {++state[$NF]} END {for(key in state) print key,"\t",state[key]}'

TIME_WAIT        221
ESTABLISHED      103

netstat  -nat |wc -l
368

The number of established TCP connections is not particularly large.

2. ss -s shows a large number of closed sockets

ss -s

Total: 158211 (kernel 158355)
TCP:   157740 (estab 103, closed 157624, orphaned 0, synrecv 0, timewait 173/0), ports 203

Transport Total     IP        IPv6
*         158355    -         -        
RAW       0         0         0        
UDP       9         6         3        
TCP       116       80        36       
INET      125       86        39       
FRAG      0         0         0        
Note the closed count: 157624 sockets.

The way my monitoring system obtains this value is:

cat /proc/net/sockstat | grep sockets | awk '{print $3}'
158391

cat /proc/net/sockstat
sockets: used 158400
TCP: inuse 89 orphan 2 tw 197 alloc 157760 mem 16
UDP: inuse 6 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

A huge number of sockets are in the alloc state: their kernel socket structures (and associated sk_buff memory) are still allocated even though the connections are already closed.
In other words, file descriptors are being leaked on the Redis machine, so the kernel cannot reclaim these sockets.
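As an aside, a minimal Python sketch of this kind of sockstat check might look like the following. The threshold and alarm handling are made-up values for illustration, not the actual monitoring code:

# sockstat_check.py - sketch of a "sockets: used" check (illustrative only)

THRESHOLD = 100000  # assumed alarm threshold, not the real monitoring value

def sockets_used(path="/proc/net/sockstat"):
    """Return the 'sockets: used' counter from /proc/net/sockstat."""
    with open(path) as f:
        for line in f:
            if line.startswith("sockets:"):
                # the line looks like: "sockets: used 158400"
                return int(line.split()[2])
    raise RuntimeError("no sockets line found in " + path)

if __name__ == "__main__":
    used = sockets_used()
    print("sockets used:", used)
    if used > THRESHOLD:
        print("ALARM: socket count %d exceeds threshold %d" % (used, THRESHOLD))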

3. Tracking down the culprit
The information above indicates that a socket fd leak exists, so the next step is to use lsof to inspect the sock file handles on the system.

lsof | grep sock
java        4684      apps *280u     sock                0,6       0t0 675441359 can't identify protocol
java        4684      apps *281u     sock                0,6       0t0 675441393 can't identify protocol
java        4684      apps *282u     sock                0,6       0t0 675441405 can't identify protocol
java        4684      apps *283u     sock                0,6       0t0 675441523 can't identify protocol
java        4684      apps *284u     sock                0,6       0t0 675441532 can't identify protocol
java        4684      apps *285u     sock                0,6       0t0 675441566 can't identify protocol

As shown, the NAME column reads "can't identify protocol": lsof cannot match these socket fds to any open connection.

This indicates that the Java process (pid 4684) is leaking socket file descriptors.
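To see at a glance which process holds the most of these leaked fds, a small script over the lsof output works. This is just a sketch; it assumes lsof's default column layout and should be run as root to see every process:

# leak_count.py - count "can't identify protocol" sockets per process (sketch)
import subprocess
from collections import Counter

def leaked_sockets_by_pid():
    """Count lsof lines whose NAME is "can't identify protocol", grouped by process."""
    out = subprocess.check_output(["lsof", "-nP"]).decode("utf-8", "replace")
    counts = Counter()
    for line in out.splitlines():
        if "can't identify protocol" in line:
            fields = line.split()
            # default lsof columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
            counts[(fields[0], fields[1])] += 1
    return counts

if __name__ == "__main__":
    for (command, pid), n in leaked_sockets_by_pid().most_common(10):
        print("%s (pid %s): %d leaked sockets" % (command, pid, n))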

ps auxww | grep 4684

It turns out to be the Flume log-collection agent running on the Redis machine.

4. Solution

Today I found that even after restarting the flume agent, a large number of closed sockets still appeared.
Attaching strace to the flume process showed that it was hung:

sudo strace -p 36111
Process 36111 attached - interrupt to quit
futex(0x2b80e2c2e9d0, FUTEX_WAIT, 36120, NULL

The futex call never returns (which is why strace shows no closing parenthesis): the main thread is blocked waiting on another thread.
My first suspicion was that the file handle limit was too low, because the information Google turned up also pointed to an insufficient fd limit as the cause of this problem.
However, on my machine the maximum number of open files is 131072, and only a small fraction of that is actually in use.

lsof | wc -l 
10201

ulimit -n
131072
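Out of caution, per-process fd usage can also be compared against each process's own limit by walking /proc. The sketch below assumes a standard Linux /proc layout and was not part of the original troubleshooting:

# fd_usage.py - sketch: list the processes with the most open fds and their limits
import os

def fd_count(pid):
    try:
        return len(os.listdir("/proc/%s/fd" % pid))
    except OSError:
        return 0  # process exited, or no permission (run as root for full coverage)

def fd_limit(pid):
    try:
        with open("/proc/%s/limits" % pid) as f:
            for line in f:
                if line.startswith("Max open files"):
                    return int(line.split()[3])  # soft limit column
    except (OSError, ValueError):
        pass
    return None

if __name__ == "__main__":
    pids = [p for p in os.listdir("/proc") if p.isdigit()]
    for count, pid in sorted(((fd_count(p), p) for p in pids), reverse=True)[:10]:
        print("pid %s: %d fds open (limit %s)" % (pid, count, fd_limit(pid)))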

At this point a colleague pointed out that many other machines were showing the same issue (flume had been running in production for three months without any problem).
That reminded me that flume has its own log worth checking. The flume log showed that flume could not find broker 5.
Wait, doesn't the Kafka cluster have only 4 brokers? Then I remembered an e-mail from the Spark development colleagues a few days earlier about expanding the Kafka cluster.
Port 9092 on the new cluster nodes was not reachable from the network segment where this Redis machine sits.

[SinkRunner-PollingRunner-DefaultSinkProcessor] (kafka.utils.Logging$class.warn:89) - Failed to send producer request with correlation id 63687539 to broker 5 with data for partitions [titan,4]

5. Reproducing the problem
Following the lsof "can't identify protocol" clue above, the situation can be reproduced with a short Python script, sketched below.

:)
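The original script is not included here, so the following is a minimal sketch of one way to reproduce it: open TCP sockets whose connect attempts fail (an arbitrary closed local port stands in for the unreachable broker) and never close them. lsof then reports those fds as "can't identify protocol", because the underlying sockets are already closed while the descriptors are still held:

# leak_repro.py - sketch: leak socket fds that lsof shows as "can't identify protocol"
import os
import socket
import time

TARGET = ("127.0.0.1", 59999)   # assumed closed port, standing in for the unreachable broker
LEAKED = []                      # hold references so the fds are never closed

for _ in range(100):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(1)
    try:
        s.connect(TARGET)        # fails with connection refused (or times out)
    except socket.error:
        pass                     # "forget" to close the socket, like the buggy client
    LEAKED.append(s)

print("pid %d now holds %d leaked sockets" % (os.getpid(), len(LEAKED)))
print("check with: lsof -p %d | grep \"can't identify protocol\"" % os.getpid())
time.sleep(600)                  # keep the process alive while inspecting it with lsof

While the script sleeps, ss -s on the same machine should also show the closed count growing by roughly the number of leaked sockets.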

When troubleshooting, Google is an efficient tool, but sometimes the search results can also bias the direction of the investigation.
After reading the search results, my first instinct was that the operating system's max open files setting was too small. Even after discovering that this was not the reason, my thinking stayed stuck on whether the kernel parameters were reasonable. Only when I learned that other machines running flume showed the same symptom did I realize the problem was in flume itself, and that is when I went to strace the flume process and read the flume log.

Origin blog.51cto.com/13120271/2451351