TCP socket file handle leak
Today I noticed socket-count warnings on one of our Redis machines, which was strange: only a few Redis instances are deployed on that server, so the number of open ports should be limited.
1. netstat shows a normal number of TCP connections
netstat -n | awk '/^tcp/ {++state[$NF]} END {for(key in state) print key,"\t",state[key]}'
TIME_WAIT 221
ESTABLISHED 103
netstat -nat |wc -l
368
The number of established TCP connections is not large.
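The netstat/awk one-liner above can also be done by reading the kernel's own table directly. A minimal sketch (Linux only) that counts TCP connection states by parsing /proc/net/tcp, which is where netstat gets its data:

```python
# Count TCP connection states from /proc/net/tcp (Linux).
# The "st" column (4th field) holds the state as a hex code.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(path="/proc/net/tcp"):
    counts = {}
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            state = TCP_STATES.get(fields[3], "UNKNOWN")
            counts[state] = counts.get(state, 0) + 1
    return counts

if __name__ == "__main__":
    for state, n in sorted(count_tcp_states().items()):
        print(state, n)
```

Note that, like netstat, this only sees sockets the kernel still lists in /proc/net/tcp; fully closed sockets with a leaked fd do not show up here, which is exactly why the numbers below look so different.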
2. ss -s shows a large number of closed sockets
ss -s
Total: 158211 (kernel 158355)
TCP: 157740 (estab 103, closed 157624, orphaned 0, synrecv 0, timewait 173/0), ports 203
Transport Total IP IPv6
158355 - -
RAW 0 0 0
UDP 9 6 3
TCP 116 80 36
INET 125 86 39
FRAG 0 0 0
closed 157624
The value my monitoring system collects comes from:
cat /proc/net/sockstat | grep sockets | awk '{print $3}'
158391
cat /proc/net/sockstat
sockets: used 158400
TCP: inuse 89 orphan 2 tw 197 alloc 157760 mem 16
UDP: inuse 6 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
A large number of sockets are in the alloc state: the kernel has allocated them, but they are already closed. Some process is leaking socket file descriptors, so the kernel cannot reclaim these sockets.
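The gap between alloc and the other counters is the signal here. A small sketch that parses /proc/net/sockstat and computes a rough estimate of leaked TCP sockets, i.e. those allocated but neither in use, orphaned, nor in TIME_WAIT (this is only a heuristic, not an exact leak count):

```python
# Parse /proc/net/sockstat into {protocol: {counter: value}} and
# estimate how many TCP sockets are allocated but unaccounted for.
def parse_sockstat(path="/proc/net/sockstat"):
    stats = {}
    with open(path) as f:
        for line in f:
            proto, rest = line.split(":", 1)
            fields = rest.split()
            # fields come in name/value pairs, e.g. "inuse 89 orphan 2 ..."
            stats[proto] = {fields[i]: int(fields[i + 1])
                            for i in range(0, len(fields), 2)}
    return stats

stats = parse_sockstat()
tcp = stats.get("TCP", {})
leaked_estimate = (tcp.get("alloc", 0) - tcp.get("inuse", 0)
                   - tcp.get("tw", 0) - tcp.get("orphan", 0))
print("possibly leaked TCP sockets:", leaked_estimate)
```

With the numbers above (alloc 157760, inuse 89, tw 197, orphan 2) the estimate is over 157,000 sockets, which matches the "closed 157624" figure from ss -s.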
3. Tracking down the culprit
The numbers above point to a socket fd leak, so I used lsof to inspect the system's sock file handles.
lsof | grep sock
java 4684 apps *280u sock 0,6 0t0 675441359 can't identify protocol
java 4684 apps *281u sock 0,6 0t0 675441393 can't identify protocol
java 4684 apps *282u sock 0,6 0t0 675441405 can't identify protocol
java 4684 apps *283u sock 0,6 0t0 675441523 can't identify protocol
java 4684 apps *284u sock 0,6 0t0 675441532 can't identify protocol
java 4684 apps *285u sock 0,6 0t0 675441566 can't identify protocol
In the NAME column lsof prints "can't identify protocol": it cannot associate these sockets with any open protocol entry. This shows that the Java process (pid 4684) is leaking socket fds.
ps auxww | grep 4684
ps shows that pid 4684 is the Flume log-collection agent running on the Redis machine.
4. Solution
Today I found that even after restarting the Flume agent, a large number of closed sockets appeared again. Attaching strace to the Flume process showed that it had hung.
sudo strace -p 36111
Process 36111 attached - interrupt to quit
futex(0x2b80e2c2e9d0, FUTEX_WAIT, 36120, NULL
At first I suspected the file-handle limit was too low, because the information I found through Google also pointed to an insufficient fd limit as a cause of this problem. But on my machine the maximum number of open files is 131072, and only a small fraction of the fds were actually in use.
lsof | wc -l
10201
ulimit -n
131072
Then a colleague pointed out that many other machines had hit the same issue (Flume had been running online for three months without problems before).
That reminded me that Flume writes its own log. Checking it showed that Flume could not find broker 5. Wait, didn't the Kafka cluster have only 4 brokers (nodes)? Then I remembered that a few days earlier the Spark development colleagues had emailed about expanding the Kafka cluster. Port 9092 on the new broker nodes was not yet reachable from the data center where these Redis machines live.
[SinkRunner-PollingRunner-DefaultSinkProcessor] (kafka.utils.Logging$class.warn:89) - Failed to send producer request with correlation id 63687539 to broker 5 with data for partitions [titan,4]
5. Reproducing the problem
The "can't identify protocol" entries in lsof can be reproduced with a short Python program.
:)
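Here is one sketch of such a repro. The idea: a failed connect() leaves the TCP socket in the CLOSED state, so it disappears from /proc/net/tcp, but the fd stays open because we never call close(). This assumes nothing is listening on 127.0.0.1 port 1, so the connection is refused:

```python
# Reproduce lsof's "can't identify protocol": hold open fds of TCP
# sockets that the kernel has already moved to the CLOSED state.
import socket

leaked = []
for _ in range(10):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.connect(("127.0.0.1", 1))  # expect ECONNREFUSED
    except OSError:
        pass
    leaked.append(s)  # keep references so the fds are not closed by GC

print("leaked fds:", [s.fileno() for s in leaked])
# while this process is alive, inspect it with:
#   lsof -p <pid> | grep "identify protocol"
```

Keep the process running and check it with lsof from another terminal; the ten sockets should appear with "can't identify protocol" in the NAME column.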
When troubleshooting, Google is an efficient tool, but sometimes the results it returns steer the investigation in the wrong direction. After seeing the search results, my first instinct was that the operating system's max open files parameter was too small. Even after ruling that out, my thinking stayed stuck on kernel configuration parameters. Only when I learned that Flume deployments on other machines showed the same symptom did I realize the problem was in Flume itself, which finally led me to strace the Flume process and read its log.