Worker drift in a Storm task [worker reassignment] usually occurs when the executors in a worker fail to update their heartbeat information to ZooKeeper, typically due to OOM, excessive full GC, or high ZooKeeper load. The scenario described here is different: it was caused by an operating-system kernel bug.
Problem Description
- In dataVVCount, 3 worker drifts occurred between June 4th and 6th; the most recent was the drift of the worker with port=5723.
- Worker drift occurred on both machines, a01 and a02.
Environment check
- a01/a02 run CentOS 6.6, kernel 2.6.32-302.el6.x86_64 [inconsistent with the other machines]
- Storm platform monitoring shows that the load on both the host machines and the cluster as a whole was relatively low during the drift periods.
Task check
- No logs of any communication errors with ZooKeeper were found in the worker logs.
- The worker process monitoring metrics showed no significant fluctuations.
Worker drift process
- Nimbus determines that the executor is not alive [(current time - last heartbeat time [updated every 2 seconds]) > nimbus.task.timeout.secs = 30]
- Nimbus reassigns the executor
- The supervisor shuts down the original worker [port=5723]
- A new worker starts
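The liveness rule in the first step above can be sketched as follows. This is a hypothetical illustration of the timeout arithmetic, not Storm's actual implementation; the class and method names are invented for this example.

```java
// Sketch of Nimbus's liveness rule: an executor is considered dead when its
// last ZooKeeper heartbeat is older than nimbus.task.timeout.secs (30 s by
// default), even though executors normally refresh it every few seconds.
public class LivenessCheck {
    static final long TASK_TIMEOUT_SECS = 30; // nimbus.task.timeout.secs

    static boolean isAlive(long nowMillis, long lastHeartbeatMillis) {
        long ageSecs = (nowMillis - lastHeartbeatMillis) / 1000;
        return ageSecs <= TASK_TIMEOUT_SECS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Heartbeat 2 seconds old: alive.
        System.out.println(isAlive(now, now - 2_000));   // true
        // Heartbeat 40 seconds old: not alive -> triggers reassignment.
        System.out.println(isAlive(now, now - 40_000));  // false
    }
}
```

If a worker hangs for any reason and stops refreshing the heartbeat, this check trips after 30 seconds and reassignment begins, regardless of whether ZooKeeper itself saw the connection drop.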
Preliminary Conclusions
The information gathered so far offers few useful clues; the cause remains unclear.
zookeeper & storm logs
2016-06-04 13:45:06,529 [myid:3] - INFO [CommitProcessor:3:ZooKeeperServer@595] - Established session 0x353df7b6a55026f with negotiated timeout 40000 for client /10.x.x.18:55783
2016-06-06 06:22:13,626 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x353df7b6a55026f due to java.io.IOException: Connection reset by peer
2016-06-06 06:22:13,628 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.x.x.18:55783 which had sessionid 0x353df7b6a55026f
2016-06-06 06:22:16,000 [myid:2] - INFO [SessionTracker:ZooKeeperServer@325] - Expiring session 0x353df7b6a55026f, timeout of 40000ms exceeded
2016-06-06 06:22:16,001 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@476] - Processed session termination for sessionid: 0x353df7b6a55026f
2016-06-06 06:22:10 b.s.d.nimbus [INFO] Executor dataVVCount-214-1464683013:[544 544] not alive
2016-06-06 06:22:10 b.s.s.EvenScheduler [INFO] Available slots: (["8033393c-e639-41a5-a565-066e6bd1748b" 5724]…..
2016-06-06 06:22:10 b.s.d.nimbus [INFO] Reassigning dataVVCount-214-1464683013 to 20 slots
2016-06-06 06:22:12 b.s.d.supervisor [INFO] Shutting down and clearing state for id 8277aca4-aa83-4a66-b59c-80db17da807e.
2016-06-06 06:22:12 b.s.d.supervisor [INFO] Shutting down 8033393c-e639-41a5-a565-066e6bd1748b:8277aca4-aa83-4a66-b59c-80db17da807e
2016-06-06 06:22:13 b.s.d.supervisor [INFO] Shut down 8033393c-e639-41a5-a565-066e6bd1748b:8277aca4-aa83-4a66-b59c-80db17da807e
Log analysis
- The ZooKeeper session expiry happened after the worker drifted [session expired at 06:22:16, worker shut down at 06:22:12].
- The connection to the zkServer was not broken before the worker drifted [the worker log contains no ZooKeeper error messages].
Final conclusion
Under normal operation an executor is never judged not alive while its worker is running; the executor must have stopped sending heartbeats to ZooKeeper for some reason. Based on earlier cases of Java processes hanging, I boldly guessed that the worker process had hung unexpectedly. We therefore decisively upgraded the CentOS 6.6 kernel from 2.6.32-504.el6.x86_64 to a newer version [2.6.32-573.el6.x86_64]. After 3-4 months of observation, no further worker drift occurred, proving that the problem was indeed caused by a kernel bug.
Kernel hang bug
Tene describes the flaw as follows:
"The impact of this kernel vulnerability is very simple: under some seemingly impossible circumstances, the user process will deadlock and be suspended. Any futex call waiting (even if properly awakened) may be blocked from executing forever. Like Thread.park() in Java might block all the time, etc. If you are lucky, you will find soft lockup messages in the dmesg log; if you are not so lucky (like us), you will have to spend a few Months of labor costs to troubleshoot problems in the code, and possibly nothing.”
http://www.infoq.com/cn/news/2015/06/redhat-futex
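The blocking primitive Tene mentions can be demonstrated with a minimal park/unpark round trip. On a healthy kernel this completes immediately; on a kernel with the futex_wait bug, the parked thread could miss the wakeup and stay blocked forever. This is only an illustration of the affected pattern, not a reproducer for the bug itself.

```java
import java.util.concurrent.locks.LockSupport;

// LockSupport.park() ultimately waits on a futex on Linux; unpark() issues
// the futex wake. The kernel bug could leave the parked thread blocked even
// after a correct wakeup.
public class ParkDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            System.out.println("parking");
            LockSupport.park();              // blocks in futex wait
            System.out.println("unparked");  // never reached if the wakeup is lost
        });
        waiter.start();
        Thread.sleep(100);            // give the waiter time to park
        LockSupport.unpark(waiter);   // futex wake
        waiter.join(5_000);           // on a healthy kernel this returns promptly
        System.out.println("waiter alive after join: " + waiter.isAlive());
    }
}
```

A worker whose heartbeat thread hangs this way stops updating ZooKeeper without logging any error, which matches exactly the symptoms observed above.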