java.net.SocketTimeoutException: 480000 millis timeout hdfs

A SocketTimeoutException kept showing up on our HDFS cluster, but the cause was unclear. Plenty of community issues mention it, yet none of them pin down the exact reason:

https://issues.apache.org/jira/browse/HDFS-693

https://issues.apache.org/jira/browse/HDFS-770

https://issues.apache.org/jira/browse/HDFS-3342

http://search-hadoop.com/m/zjdcx1aR6Aa1/millis+timeout+while+waiting+for+channel+to+be+ready+for+write.+ch&subj=Lots+of+Different+Kind+of+Datanode+Errors

http://www.quora.com/What-are-some-tips-for-configuring-HBase

The related discussions suggest setting dfs.datanode.socket.write.timeout to 0 as a fix (480000 ms is simply that timeout's default of 8 minutes), but that never felt like it got to the root of the problem.
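
For reference, a minimal sketch of what that workaround amounts to. The property normally belongs in the DataNode's hdfs-site.xml; the programmatic Configuration override and the class name below are only for illustration.

import org.apache.hadoop.conf.Configuration;

public class DisableDataNodeWriteTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default is 8 * 60 * 1000 = 480000 ms, exactly the value in the exception.
        // 0 disables the DataNode-side write timeout entirely.
        conf.setInt("dfs.datanode.socket.write.timeout", 0);
        System.out.println("dfs.datanode.socket.write.timeout = "
                + conf.get("dfs.datanode.socket.write.timeout"));
    }
}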

Today, while testing, I happened to reproduce it. The reproduction steps are as follows:

import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;
import org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics;
import org.apache.hadoop.hbase.util.Bytes;

public class HFilePartialScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]); // an existing HFile on HDFS

        CacheConfig cacheConf = new CacheConfig(conf);
        SchemaMetrics.setUseTableNameInTest(false);
        HFile.Reader reader = HFile.createReader(fs, path, cacheConf);
        reader.loadFileInfo();

        // pread = false: sequential (streaming) read, as used by scans
        HFileScanner scanner = reader.getScanner(false, false);

        // Align the scanner at the start of the file and read only the first key/value.
        scanner.seekTo();
        ByteBuffer key = scanner.getKey();
        byte[] keyBytes = Bytes.toBytes(key);
        ByteBuffer val = scanner.getValue();
        byte[] valBytes = Bytes.toBytes(val);
        System.out.println(" key: " + Bytes.toString(keyBytes));
        System.out.println(" value: " + Bytes.toString(valBytes));

        // Stop reading here: never advance the scanner and never close it.
        int count = 0;
        while (true) {
            try { Thread.sleep(60000); count++; } catch (Exception e) { e.printStackTrace(); }
        }
    }
}

At this point, the DataNode has a DataXceiver thread still serving the block; its stack looks like this:

"DataXceiver for client /127.0.0.1:35602 [sending block blk_912297361534887040_1518]" daemon prio=10 tid=0x08259400 nid=0x225 runnable [0xb34e1000]

java.lang.Thread.State: RUNNABLE

at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)

at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)

at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)

at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)

locked <0x508e9be8> (a sun.nio.ch.Util$2)

locked <0x508e9bd8> (a java.util.Collections$UnmodifiableSet)

locked <0x508d9798> (a sun.nio.ch.EPollSelectorImpl)

at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)

at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:339)

at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:249)

at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164)

at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:207)

at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:391)

at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:291)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:180)

After 480000 ms, the DataNode logs the following:

2012-08-13 00:55:14,155 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:50010, dest: /127.0.0.1:35634, bytes: 330240, op: HDFS_READ, cliID: DFSClient_1059963309, offset: 0, srvID: DS-383488255-0:0:0:0:0:0:0:1-50010-1343028787069, blockid: blk_912297361534887040_1518, duration: 480154978318
2012-08-13 00:55:14,155 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-383488255-0:0:0:0:0:0:0:1-50010-1343028787069, infoPort=50075, ipcPort=50020):Got exception while serving blk_912297361534887040_1518 to /127.0.0.1:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:35634]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:250)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:207)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:391)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:291)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:180)
2012-08-13 00:55:14,162 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: check disk error costs(ms): 7
2012-08-13 00:55:14,163 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-383488255-0:0:0:0:0:0:0:1-50010-1343028787069, infoPort=50075, ipcPort=50020):DataXceiver remoteAddress:/127.0.0.1:35634
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:35634]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:250)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:207)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:391)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:291)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:180)

Looking at the DataNode's thread dump again at this point, the DataXceiver stack shown above is gone: the thread has exited. (The duration in the clienttrace line, 480154978318, is in nanoseconds, i.e. roughly 480 seconds, matching the 480000 ms timeout.)

The cause is that the client asked the DataNode for data but read only part of it and never came back for the rest, so after 480000 ms the DataNode's write to the socket times out.

This can happen in some scan scenarios. A scan uses seek+read; when the DFSClient creates a new BlockReader, the requested length is blk.getNumBytes() - offsetIntoBlock, i.e. everything still readable in the current block, from the current position to the end of the block.

If the scan stops before reaching the end of the block and then performs no further seek+read on that stream for 480000 ms, the SocketTimeoutException above is thrown.
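
To see the same mechanism without HBase in the picture, here is a minimal sketch using only the HDFS client API. The class name and file path are made up; it assumes fs.default.name points at the cluster and that the file is large enough (several MB) that the DataNode fills the socket buffers before the client stops reading.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartialReadTimeout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Sequential (non-pread) open: the DataNode's BlockSender starts streaming
        // the rest of the block, and a DataXceiver thread stays bound to this socket.
        FSDataInputStream in = fs.open(new Path(args[0]));
        byte[] buf = new byte[4096];
        in.read(buf); // consume only the first few KB

        // Neither read the remainder nor close the stream. Once the socket buffers
        // are full, the DataNode blocks in waitForWritable and, after
        // dfs.datanode.socket.write.timeout (480000 ms by default), throws the
        // SocketTimeoutException shown above.
        Thread.sleep(10 * 60 * 1000L);

        in.close(); // closing right after the read would let the DataXceiver exit normally
    }
}

Closing the stream (or the HFile reader) as soon as the client is done reading is what lets the DataXceiver terminate normally instead of sitting in waitForWritable until the timeout fires.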

Reposted from bupt04406.iteye.com/blog/1630655