Storm worker process hang causing worker drift

 

Worker drift (worker reassignment) of a Storm task usually happens when the executor heartbeats inside a worker cannot be written to ZooKeeper, typically because of OOM, excessive full GC, or a heavily loaded ZooKeeper. The case described here is different from those scenarios: it was caused by an operating system kernel bug.

 

Problem Description

  • The topology dataVVCount experienced 3 worker drifts between June 4th and June 6th; the most recent was the drift of the worker on port=5723.
  • The drifting workers appeared on both machines, a01 and a02.

Environment check

  • a01/a02 run CentOS 6.6 with kernel 2.6.32-302.el6.x86_64 [different from the other machines].
  • Storm platform monitoring shows that the load on the host and across the whole cluster was relatively low during the drift window.

Task check

  • No ZooKeeper communication errors were found in the worker log.
  • The worker process's monitoring metrics showed no significant fluctuations.

 

Worker drift process

  • nimbus determines that the executor is not alive: (current time - last heartbeat time [heartbeats are updated every 2 seconds]) > nimbus.task.timeout.secs = 30 (see the sketch after this list)
  • nimbus reassigns the executors
  • the supervisor shuts down the original worker [port=5723]
  • a new worker starts
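
To make the timeout condition concrete, below is a minimal sketch of the liveness check in Java. The class and method names (ExecutorLiveness, isExecutorAlive, TASK_TIMEOUT_SECS) are illustrative only; Storm's nimbus is implemented in Clojure and its real code is organized differently.

// Minimal sketch of the "not alive" decision described above.
// Names here are illustrative, not Storm's actual identifiers.
public class ExecutorLiveness {

    // nimbus.task.timeout.secs = 30 in the cluster discussed in this article
    static final long TASK_TIMEOUT_SECS = 30;

    // An executor is considered dead when the heartbeat it writes to ZooKeeper
    // (roughly every 2 seconds) has not been refreshed for longer than the timeout.
    static boolean isExecutorAlive(long lastHeartbeatMillis, long nowMillis) {
        long silenceSecs = (nowMillis - lastHeartbeatMillis) / 1000;
        return silenceSecs <= TASK_TIMEOUT_SECS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // A heartbeat last updated 40 seconds ago triggers reassignment, as in this incident.
        System.out.println(isExecutorAlive(now - 40_000, now));  // prints false -> "not alive"
    }
}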

Preliminary Conclusions

Based on the information gathered so far there is not much to go on, and the cause is still unclear.

 

ZooKeeper & Storm logs

 
Analysis shows that the sessionId used by the port=5723 worker is 0x353df7b6a55026f.
Connection established
2016-06-04 13:45:06,529 [myid:3] - INFO [CommitProcessor:3:ZooKeeperServer@595] - Established session 0x353df7b6a55026f with negotiated timeout 40000 for client /10.x.x.18:55783
 
 
Disconnect
2016-06-06 06:22:13,626 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x353df7b6a55026f due to java.io.IOException: Connection reset by peer 
2016-06-06 06:22:13,628 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.x.x.18:55783 which had sessionid 0x353df7b6a55026f 
 
 
Session expired and removed [this happened on another zk server]
2016-06-06 06:22:16,000 [myid:2] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x353df7b6a55026f, timeout of 40000ms exceeded 
2016-06-06 06:22:16,001 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x353df7b6a55026f 
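
The excerpts above trace an ordinary ZooKeeper session lifecycle: the session is established with a negotiated 40000 ms timeout, the socket later closes, and the session is expired server side once no heartbeat arrives within that timeout. A minimal client sketch of such a session follows; it uses the standard org.apache.zookeeper API, but the connect string is a placeholder and this is not the Storm worker's actual heartbeat code.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative client only: shows how a 40000 ms session timeout is negotiated
// and how connection-state changes surface on the client side.
public class SessionLifecycleDemo {
    public static void main(String[] args) throws Exception {
        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // SyncConnected -> "Established session ..." on the server
                // Disconnected  -> the socket was closed
                // Expired       -> "Expiring session ..., timeout of 40000ms exceeded"
                System.out.println("connection state: " + event.getState());
            }
        };

        // 40000 ms matches the negotiated timeout seen in the zookeeper log above;
        // the connect string is a placeholder.
        ZooKeeper zk = new ZooKeeper("10.x.x.1:2181", 40000, watcher);
        Thread.sleep(5_000);  // give the connection time to come up
        System.out.printf("sessionId=0x%x%n", zk.getSessionId());
        zk.close();
    }
}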
 
 
nimbus judges the executor dead
2016-06-06 06:22:10 b.s.d.nimbus [INFO] Executor dataVVCount-214-1464683013:[544 544] not alive 
2016-06-06 06:22:10 b.s.s.EvenScheduler [INFO] Available slots: (["8033393c-e639-41a5-a565-066e6bd1748b" 5724] ...
2016-06-06 06:22:10 b.s.d.nimbus [INFO] Reassigning dataVVCount-214-1464683013 to 20 slots 
 
 
supervisor shuts down the worker
2016-06-06 06:22:12 b.s.d.supervisor [INFO] Shutting down and clearing state for id 8277aca4-aa83-4a66-b59c-80db17da807e. 
2016-06-06 06:22:12 b.s.d.supervisor [INFO] Shutting down 8033393c-e639-41a5-a565-066e6bd1748b:8277aca4-aa83-4a66-b59c-80db17da807e 
2016-06-06 06:22:13 b.s.d.supervisor [INFO] Shut down 8033393c-e639-41a5-a565-066e6bd1748b:8277aca4-aa83-4a66-b59c-80db17da807e 
 

 

Log analysis

  • The ZooKeeper disconnect happened after the worker drift [the session was removed at 06:22:16, while the worker was shut down at 06:22:12].
  • Before the drift, the worker's connection to the zk server was never broken [the worker log contains no ZooKeeper-related errors].

Final conclusion

Under normal operation an executor in a running worker is never judged not alive, so for some reason the executor must have stopped sending heartbeats to ZooKeeper. Having run into hung Java processes before, I made a bold guess that this worker process had unexpectedly hung, and promptly upgraded the kernel on CentOS 6.6 from 2.6.32-504.el6.x86_64 to a newer version [2.6.32-573.el6.x86_64]. After 3-4 months of observation no worker drift occurred again, which shows that the problem at the time was indeed caused by a kernel bug.
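
As a side note, on Linux the JVM exposes the running kernel release through the os.version system property, so which kernel a worker's JVM is actually running on can also be confirmed from Java (a convenience sketch, not something Storm itself provides):

// Prints the kernel release the JVM runs on, e.g. "2.6.32-573.el6.x86_64" on Linux.
public class KernelVersion {
    public static void main(String[] args) {
        System.out.println(System.getProperty("os.name") + " "
                + System.getProperty("os.version"));
    }
}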

 

 

Kernel hang bug

Gil Tene describes this defect as follows:

"The impact of this kernel bug is very simple: under some seemingly impossible circumstances, user processes deadlock and hang. Any futex wait call (even one that is correctly woken up) may end up blocked forever, so things like Java's Thread.park() can block indefinitely, and so on. If you are lucky enough, you will find soft lockup messages in the dmesg log; if you are not so lucky (as in our case), you will spend months of engineering effort chasing problems in your own code, possibly without finding anything."

http://www.infoq.com/cn/news/2015/06/redhat-futex
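
The futex wait Tene refers to is what the JVM's park/unpark support ultimately relies on under Linux (pthread condition variables are built on futexes). Below is a minimal example of that kind of wait: on a healthy kernel the program finishes almost immediately, while under the buggy kernel a wakeup like this could be lost and the parked thread would hang forever.

import java.util.concurrent.locks.LockSupport;

// On a healthy kernel the parked thread is woken by unpark() and the program
// exits almost at once. Under the futex_wait bug described above, such a
// wakeup could be lost and the thread would stay parked indefinitely.
public class ParkDemo {
    public static void main(String[] args) throws Exception {
        Thread waiter = new Thread(() -> {
            System.out.println("parking...");
            LockSupport.park();             // ends up in a futex wait on Linux
            System.out.println("woken up"); // never printed if the wakeup is lost
        });
        waiter.start();

        Thread.sleep(1_000);                // let the waiter reach park()
        LockSupport.unpark(waiter);         // futex wake
        waiter.join();
        System.out.println("done");
    }
}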

 
 
 
 
 
