Worker drift in a Storm task [worker reassignment] usually occurs when the executors in a worker fail to update their heartbeat information to ZooKeeper, typically due to OOM, excessive full GC, or high ZooKeeper load. The scenario described here is different: it was caused by an operating-system kernel bug.
Problem Description
- In dataVVCount, 3 worker drifts occurred between June 4th and 6th; the most recent was the drift of the worker with port=5723.
- Worker drift occurred on both machines, a01 and a02.
Environment check
- a01/a02 run CentOS 6.6, kernel 2.6.32-302.el6.x86_64 [inconsistent with the other machines]
- Storm platform monitoring shows that the load on both the host machines and the cluster as a whole was relatively low during the drift periods.
Task check
- No logs of any communication errors with ZooKeeper were found in the worker logs.
- The worker process monitoring metrics showed no significant fluctuations.
Worker drift process
- Nimbus determines that the executor is not alive [(current time - last heartbeat time [updated every 2 seconds]) > nimbus.task.timeout.secs = 30]
- Nimbus reassigns the executor
- The supervisor shuts down the original worker [port=5723]
- A new worker starts
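The liveness rule in the first step above can be sketched as follows. This is a hypothetical illustration of the timeout arithmetic, not Storm's actual implementation; the class and method names are invented for this example.

```java
// Sketch of Nimbus's liveness rule: an executor is considered dead when its
// last ZooKeeper heartbeat is older than nimbus.task.timeout.secs (30 s by
// default), even though executors normally refresh it every few seconds.
public class LivenessCheck {
    static final long TASK_TIMEOUT_SECS = 30; // nimbus.task.timeout.secs

    static boolean isAlive(long nowMillis, long lastHeartbeatMillis) {
        long ageSecs = (nowMillis - lastHeartbeatMillis) / 1000;
        return ageSecs <= TASK_TIMEOUT_SECS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Heartbeat 2 seconds old: alive.
        System.out.println(isAlive(now, now - 2_000));   // true
        // Heartbeat 40 seconds old: not alive -> triggers reassignment.
        System.out.println(isAlive(now, now - 40_000));  // false
    }
}
```

If a worker hangs for any reason and stops refreshing the heartbeat, this check trips after 30 seconds and reassignment begins, regardless of whether ZooKeeper itself saw the connection drop.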
Preliminary Conclusions
The information gathered so far offers few useful clues; the cause remains unclear.
zookeeper & storm logs
2016-06-04 13:45:06,529 [myid:3] - INFO [CommitProcessor:3:ZooKeeperServer@595] - Established session 0x353df7b6a55026f with negotiated timeout 40000 for client /10.x.x.18:55783
2016-06-06 06:22:13,626 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x353df7b6a55026f due to java.io.IOException: Connection reset by peer
2016-06-06 06:22:13,628 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.x.x.18:55783 which had sessionid 0x353df7b6a55026f
2016-06-06 06:22:16,000 [myid:2] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x353df7b6a55026f, timeout of 40000ms exceeded
2016-06-06 06:22:16,001 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x353df7b6a55026f
2016-06-06 06:22:10 b.s.d.nimbus [INFO] Executor dataVVCount-214-1464683013:[544 544] not alive
2016-06-06 06:22:10 b.s.s.EvenScheduler [INFO] Available slots: (["8033393c-e639-41a5-a565-066e6bd1748b" 5724]…..
2016-06-06 06:22:10 b.s.d.nimbus [INFO] Reassigning dataVVCount-214-1464683013 to 20 slots
2016-06-06 06:22:12 b.s.d.supervisor [INFO] Shutting down and clearing state for id 8277aca4-aa83-4a66-b59c-80db17da807e.
2016-06-06 06:22:12 b.s.d.supervisor [INFO] Shutting down 8033393c-e639-41a5-a565-066e6bd1748b:8277aca4-aa83-4a66-b59c-80db17da807e
2016-06-06 06:22:13 b.s.d.supervisor [INFO] Shut down 8033393c-e639-41a5-a565-066e6bd1748b:8277aca4-aa83-4a66-b59c-80db17da807e
Log analysis
- The ZooKeeper session expiry happened after the worker drifted [session expired at 06:22:16, worker shut down at 06:22:12].
- The connection to the zkServer was not broken before the worker drifted [the worker log contains no ZooKeeper error messages].
Final conclusion
Under normal operation an executor is never judged not alive while its worker is running; the executor must have stopped sending heartbeats to ZooKeeper for some reason. Based on earlier cases of Java processes hanging, I boldly guessed that the worker process had hung unexpectedly. We therefore decisively upgraded the CentOS 6.6 kernel from 2.6.32-504.el6.x86_64 to a newer version [2.6.32-573.el6.x86_64]. After 3-4 months of observation, no further worker drift occurred, proving that the problem was indeed caused by a kernel bug.
Kernel hang bug
Tene describes the flaw as follows:
"The impact of this kernel vulnerability is very simple: under some seemingly impossible circumstances, the user process will deadlock and be suspended. Any futex call waiting (even if properly awakened) may be blocked from executing forever. Like Thread.park() in Java might block all the time, etc. If you are lucky, you will find soft lockup messages in the dmesg log; if you are not so lucky (like us), you will have to spend a few Months of labor costs to troubleshoot problems in the code, and possibly nothing.”
http://www.infoq.com/cn/news/2015/06/redhat-futex
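The blocking primitive Tene mentions can be demonstrated with a minimal park/unpark round trip. On a healthy kernel this completes immediately; on a kernel with the futex_wait bug, the parked thread could miss the wakeup and stay blocked forever. This is only an illustration of the affected pattern, not a reproducer for the bug itself.

```java
import java.util.concurrent.locks.LockSupport;

// LockSupport.park() ultimately waits on a futex on Linux; unpark() issues
// the futex wake. The kernel bug could leave the parked thread blocked even
// after a correct wakeup.
public class ParkDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            System.out.println("parking");
            LockSupport.park();              // blocks in futex wait
            System.out.println("unparked");  // never reached if the wakeup is lost
        });
        waiter.start();
        Thread.sleep(100);            // give the waiter time to park
        LockSupport.unpark(waiter);   // futex wake
        waiter.join(5_000);           // on a healthy kernel this returns promptly
        System.out.println("waiter alive after join: " + waiter.isAlive());
    }
}
```

A worker whose heartbeat thread hangs this way stops updating ZooKeeper without logging any error, which matches exactly the symptoms observed above.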