Our HDFS cluster was hitting SocketTimeoutException, and the cause was a mystery. Quite a few community issues mention it, but none pin down exactly why it happens:
https://issues.apache.org/jira/browse/HDFS-693
https://issues.apache.org/jira/browse/HDFS-770
https://issues.apache.org/jira/browse/HDFS-3342
http://www.quora.com/What-are-some-tips-for-configuring-HBase
The discussions there suggest setting dfs.datanode.socket.write.timeout to 0 as a fix, but that never felt like it got to the root of the problem.
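For reference, that workaround is a DataNode-side setting in hdfs-site.xml (a value of 0 disables the write timeout entirely; this hides the symptom rather than fixing the cause):

```xml
<!-- hdfs-site.xml: workaround discussed in the issues above.
     0 means the DataNode never times out waiting to write to a client. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
```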
Today, while testing, I stumbled on a reproduction by accident. Steps to reproduce:
CacheConfig cacheConf = new CacheConfig(conf);
SchemaMetrics.setUseTableNameInTest(false);
Reader reader = HFile.createReader(fs, path, cacheConf);
reader.loadFileInfo();
HFileScanner scanner = reader.getScanner(false, false); // pread=false: sequential seek+read, not positional read
// Align scanner at start of the file.
scanner.seekTo();
ByteBuffer key = scanner.getKey();
byte[] keyBytes = Bytes.toBytes(key);
ByteBuffer val = scanner.getValue();
byte[] valBytes = Bytes.toBytes(val);
System.out.println(" key: " + Bytes.toString(keyBytes));
System.out.println(" value: " + Bytes.toString(valBytes));
// Read only the first KV, then keep the stream open without reading any more.
int count = 0;
while (true) {
    try {
        Thread.sleep(60000);
        count++;
    } catch (Exception e) {
        e.printStackTrace();
    }
}
At this point the DataNode has a DataXceiver thread with the following stack:
"DataXceiver for client /127.0.0.1:35602 [sending block blk_912297361534887040_1518]" daemon prio=10 tid=0x08259400 nid=0x225 runnable [0xb34e1000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
locked <0x508e9be8> (a sun.nio.ch.Util$2)
locked <0x508e9bd8> (a java.util.Collections$UnmodifiableSet)
locked <0x508d9798> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:339)
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:249)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:207)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:391)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:291)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:180)
After 480000 ms, the DataNode logs the following:
2012-08-13 00:55:14,155 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:50010, dest: /127.0.0.1:35634, bytes: 330240, op: HDFS_READ, cliID: DFSClient_1059963309, offset: 0, srvID: DS-383488255-0:0:0:0:0:0:0:1-50010-1343028787069, blockid: blk_912297361534887040_1518, duration: 480154978318
2012-08-13 00:55:14,155 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-383488255-0:0:0:0:0:0:0:1-50010-1343028787069, infoPort=50075, ipcPort=50020):Got exception while serving blk_912297361534887040_1518 to /127.0.0.1:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:35634]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:250)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:207)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:391)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:291)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:180)
2012-08-13 00:55:14,162 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: check disk error costs(ms): 7
2012-08-13 00:55:14,163 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-383488255-0:0:0:0:0:0:0:1-50010-1343028787069, infoPort=50075, ipcPort=50020):DataXceiver remoteAddress:/127.0.0.1:35634
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:35634]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:250)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:207)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:391)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:291)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:180)
Looking at the DataNode process stack again at this point, the DataXceiver stack shown above is gone: the thread has exited.

The cause: the client requested data from the DataNode but read only part of it and never came back for the rest, so after 480000 ms the DataNode's write side times out.

This can happen in scan scenarios. A scan goes through seek+read, and when the BlockReader is created the requested length is (blk.getNumBytes() - offsetIntoBlock), i.e. everything still readable in the block, from the current position to the end of the block. If the scan stops before reaching the end of the block, and for the next 480000 ms no further seek+read goes through this stream, the DataNode throws SocketTimeoutException.
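The mechanism can be reproduced without Hadoop at all. The sketch below is plain NIO, not Hadoop code (the class name and the 2000 ms stand-in for the DataNode's 480000 ms are mine): the sender fills its kernel send buffer while the receiver never reads, then waits for writability on a Selector with a deadline. A select that returns 0 here is essentially the condition under which Hadoop's SocketIOWithTimeout.waitForIO throws SocketTimeoutException.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class WriteTimeoutDemo {

    // Returns true if the sender's channel never became writable within timeoutMs,
    // i.e. the condition under which the DataNode's timed write gives up.
    static boolean writeTimesOut(long timeoutMs) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel receiver = SocketChannel.open(server.getLocalAddress());
                 SocketChannel sender = server.accept();
                 Selector selector = Selector.open()) {
                sender.configureBlocking(false);

                // Fill the kernel send buffer until write() returns 0:
                // the receiver (like a stalled HFile scanner) never reads.
                ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
                while (sender.write(chunk) > 0) {
                    chunk.clear();
                }

                // Wait for writability with a deadline, as the DataNode does
                // (its deadline comes from dfs.datanode.socket.write.timeout).
                sender.register(selector, SelectionKey.OP_WRITE);
                return selector.select(timeoutMs) == 0;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + writeTimesOut(2000));
    }
}
```

Since the receiver never drains its socket, the channel never becomes writable and the select times out, just as the DataXceiver thread did above.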