A Problem When Operating on HDFS

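Recently I wrote scheduled jobs that invoke a Hadoop MR program, generate a report, and send it by email. I started two tasks, A and B. Before invoking the MR program, the code operates on HDFS (both reads and writes). Task A runs every day at 1 a.m.; task B runs every ten minutes, never invokes the MR program, and only collects data. On the very first day I found that task A had not sent its email, so I checked the logs. The exception was: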

java.io.IOException: Failed on local exception: java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.1.23.249:52305 remote=/10.1.23.249:9000]. 60000 millis timeout left.; Host Details : local host is: "hadoop-alone-test/10.1.23.249"; destination host is: "hadoop-alone-test":9000; 
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
    at org.apache.hadoop.ipc.Client.call(Client.java:1479)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy109.getListing(Unknown Source)
........

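I was a bit baffled at first and had no idea why the IO was being interrupted. So I triggered the job again manually from the job console, and no error occurred. I wanted to read the source code, but my manager assigned me a fairly urgent task. I was busy for several days and did not look into it closely; I only kept an eye on the daily email and noticed that it did sometimes succeed. Once that task was done, I went straight back to investigating. Looking at the logs again, I found more than one kind of error. The others are listed below: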

java.io.IOException: Failed on local exception: java.io.IOException; Host Details : local host is: "hadoop-alone-test/10.1.23.249"; destination host is: "hadoop-alone-test":9000; 
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
    at org.apache.hadoop.ipc.Client.call(Client.java:1479)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy109.getFileInfo(Unknown Source)
...........

java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
    at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2083)
    at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:944)
    at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:927)
    at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:872)
    at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:868)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
................

java.io.IOException: The client is stopped
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1507)
    at org.apache.hadoop.ipc.Client.call(Client.java:1451)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy109.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
    at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

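Those are the four errors. All of them relate to IO communication, and all of them occur in the code that uses Hadoop's FileSystem. Both task A and task B hit them, though not equally often, and they occur at 1 a.m., when the two threads operate at the same time. So the problem clearly lies in concurrency. Tying this to the "Filesystem closed" error above: my code does indeed call the close method.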

So I looked at the source code. Creating a FileSystem goes through the following code:

String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
return conf.getBoolean(disableCacheName, false) ? createFileSystem(uri, conf) : CACHE.get(uri, conf);

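When a FileSystem is created, the setting "fs.%s.impl.disable.cache" is read. It defaults to false, so the second request is served from the cache. For the same URI (and the same user), only one FileSystem instance is ever created, and every caller shares it.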

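This is where multithreaded access comes in: thread B has already called filesystem.close() while thread A is still using the same filesystem, which produces the various exceptions above.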
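To make the failure mode concrete, here is a minimal sketch in plain Java. It does not use Hadoop at all: FakeFileSystem, FsCacheDemo, and the URI string are hypothetical stand-ins, and the map only imitates the idea behind CACHE.get(uri, conf). It shows that when two callers get the same cached object, a close() by one breaks the other, and that separate instances avoid this.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a cached client. This is NOT Hadoop's FileSystem.
class FakeFileSystem {
    private volatile boolean open = true;

    // Imitates the checkOpen() behaviour seen in the stack trace above.
    void listPaths() throws IOException {
        if (!open) {
            throw new IOException("Filesystem closed");
        }
    }

    void close() {
        open = false;
    }
}

class FsCacheDemo {
    // Simplified cache keyed by URI string (assumption: the real key has more fields).
    private static final Map<String, FakeFileSystem> CACHE = new ConcurrentHashMap<>();

    static FakeFileSystem get(String uri, boolean disableCache) {
        if (disableCache) {
            return new FakeFileSystem();            // every caller gets its own instance
        }
        return CACHE.computeIfAbsent(uri, k -> new FakeFileSystem()); // shared instance
    }

    // Returns true if task A's call fails after task B closes its handle.
    static boolean taskAFailsAfterTaskBCloses(boolean disableCache) {
        CACHE.clear();                              // start each run from an empty cache
        String uri = "hdfs://hadoop-alone-test:9000";
        FakeFileSystem fsA = get(uri, disableCache);
        FakeFileSystem fsB = get(uri, disableCache);
        fsB.close();                                // task B finishes first and closes
        try {
            fsA.listPaths();                        // task A is still working
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("cache enabled, A fails: " + taskAFailsAfterTaskBCloses(false));
        System.out.println("cache disabled, A fails: " + taskAFailsAfterTaskBCloses(true));
    }
}
```

With the cache enabled, fsA and fsB are the same object, so task A fails; with the cache disabled they are independent and task A is unaffected.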

Solutions

1. Synchronize the code (I won't paste code here; synchronized, Lock, and the like all work)

2. Disable the FileSystem cache

In code:

Configuration conf = new Configuration();
conf.set("fs.hdfs.impl.disable.cache", "true");

Or configure it in core-site.xml (pick one of the two):

<property>
    <name>fs.hdfs.impl.disable.cache</name>
    <value>true</value>
</property>
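For option 1 (synchronization), one possible shape is sketched below in plain Java, without Hadoop on the classpath. SharedClient, SynchronizedFsAccess, and the step and task counts are hypothetical names and values chosen for illustration, and the code assumes, as a simplification, that a closed instance is dropped from the cache. The idea is simply that the whole get, use, close sequence of each task runs under one shared lock, so no task can close the shared instance while another is still using it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the shared, cached FileSystem instance.
class SharedClient {
    private boolean open = true;                    // only touched while LOCK is held

    void use() {
        if (!open) {
            throw new IllegalStateException("Filesystem closed");
        }
    }

    void close() {
        open = false;
    }
}

class SynchronizedFsAccess {
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static SharedClient cached;             // guarded by LOCK

    // One task: acquire the shared instance, use it, close it, all under the lock.
    static void runTask(int steps) {
        LOCK.lock();
        try {
            if (cached == null) {
                cached = new SharedClient();        // created on first use, like a cache miss
            }
            SharedClient fs = cached;
            try {
                for (int i = 0; i < steps; i++) {
                    fs.use();
                }
            } finally {
                fs.close();
                cached = null;                      // assumption: closed instance leaves the cache
            }
        } finally {
            LOCK.unlock();
        }
    }

    // Runs several tasks on a thread pool and returns how many saw a closed instance.
    static int runConcurrently(int tasks) {
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < tasks; t++) {
            pool.submit(() -> {
                try {
                    runTask(1000);
                } catch (IllegalStateException e) {
                    failures.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("failures: " + runConcurrently(20));
    }
}
```

The trade-off is that tasks are fully serialized while they touch the file system. If task A holds the lock for a long time, task B's ten-minute run has to wait, which is one reason disabling the cache may be the simpler choice.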
