Alluxio 1.2.0 HA: A Collection of Issues

1. Issue: last execution took xxxx ms. Longer than the interval xxx

The cluster is set up as follows:
hdfs-yarn-1; IP address: 192.168.1.151; services: Master & Worker & Zookeeper
hdfs-yarn-2; IP address: 192.168.1.152; services: Master & Worker
hdfs-yarn-3; IP address: 192.168.1.153; services: Worker

Symptom:
After the master on hdfs-yarn-1 is killed, the workers log the following warnings while registering with the master on hdfs-yarn-2:

2016-08-20 10:52:30,425 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ HDFS-YARN-2/192.168.1.152:19998
2016-08-20 10:52:48,509 WARN  logger.type (SleepingTimer.java:tick) - Worker Pin List Sync last execution took 43787 ms. Longer than the interval 1000
2016-08-20 10:52:48,520 WARN  logger.type (SleepingTimer.java:tick) - Worker FileSystemMaster Sync last execution took 43673 ms. Longer than the interval 1000

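Afterwards, the master on hdfs-yarn-2 has no available workers at all, as the following screenshot shows:
[Figure: no available workers on hdfs-yarn-2]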

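Judging from the warnings above, restoring the Pin List and FileSystemMaster metadata took longer than the configured heartbeat interval of 1000 ms (more than 43 seconds in both cases), which is probably why the workers failed to register with the new master.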
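The way such a warning arises can be illustrated with a simplified model of a fixed-interval heartbeat timer. This is only a sketch of the general idea, not Alluxio's actual `SleepingTimer` code: the class name `IntervalTimer` and its injectable `clock_ms` and `sleep_ms` parameters are assumptions made for illustration. If the work done between two ticks takes longer than the interval, the timer records a warning instead of sleeping.

```python
import time


class IntervalTimer:
    """Simplified model of a fixed-interval heartbeat timer.

    Hypothetical illustration only; this is not Alluxio code.
    """

    def __init__(self, name, interval_ms,
                 clock_ms=lambda: int(time.monotonic() * 1000),
                 sleep_ms=lambda ms: time.sleep(ms / 1000.0)):
        self.name = name
        self.interval_ms = interval_ms
        self._clock_ms = clock_ms      # returns the current time in ms
        self._sleep_ms = sleep_ms      # blocks for the given number of ms
        self._previous_tick_ms = None  # time of the previous tick, if any
        self.warnings = []             # collected warning messages

    def tick(self):
        """Sleep out the rest of the interval, or warn if already late."""
        if self._previous_tick_ms is not None:
            elapsed_ms = self._clock_ms() - self._previous_tick_ms
            if elapsed_ms > self.interval_ms:
                # The work since the last tick overran the interval.
                self.warnings.append(
                    f"{self.name} last execution took {elapsed_ms} ms. "
                    f"Longer than the interval {self.interval_ms}"
                )
            else:
                self._sleep_ms(self.interval_ms - elapsed_ms)
        self._previous_tick_ms = self._clock_ms()
```

In this model a heartbeat loop calls `tick()` once per iteration, so a single slow iteration (for example, 43787 ms of work against a 1000 ms interval) produces exactly one warning of the form seen in the log.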

Solution:
In alluxio-site.properties, increase the heartbeat settings for the Pin List and FileSystemMaster syncs, as follows:

alluxio.worker.block.heartbeat.timeout.ms=60000
alluxio.worker.filesystem.heartbeat.interval.ms=60000
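
Before restarting the workers, it can be worth confirming that the file on each node really contains both values. The following is a minimal sketch, assuming a plain `key=value` file; the helper names `parse_properties` and `missing_or_different` are hypothetical, and the parser deliberately ignores Java properties features such as line continuations and escape sequences.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


# The two settings from this post and the values used there.
EXPECTED = {
    "alluxio.worker.block.heartbeat.timeout.ms": "60000",
    "alluxio.worker.filesystem.heartbeat.interval.ms": "60000",
}


def missing_or_different(props, expected=EXPECTED):
    """Return {key: actual_value_or_None} for every setting that does not match."""
    return {k: props.get(k) for k, v in expected.items() if props.get(k) != v}
```

An empty result from `missing_or_different(parse_properties(text))` means both settings are present with the expected values.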

After re-running the test, the problem no longer occurred.
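
One way to confirm this is to scan the worker log for any remaining warnings of this kind. The sketch below assumes the log lines follow the format of the excerpt shown earlier; `find_slow_heartbeats` is a hypothetical helper name, not part of Alluxio.

```python
import re

# Matches e.g. "... - Worker Pin List Sync last execution took 43787 ms.
# Longer than the interval 1000"
PATTERN = re.compile(
    r"- (?P<name>.+?) last execution took (?P<took>\d+) ms\. "
    r"Longer than the interval (?P<interval>\d+)"
)


def find_slow_heartbeats(lines):
    """Return (heartbeat name, elapsed ms, interval ms) for each matching line."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            results.append((
                match.group("name"),
                int(match.group("took")),
                int(match.group("interval")),
            ))
    return results
```

The function accepts any iterable of lines, so an open file handle for the worker log can be passed in directly; an empty list means no such warnings were found.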

Reposted from blog.csdn.net/sun_qiangwei/article/details/52260310