Analysis of the HDFS Client File Read Flow

Copyright notice: This is an original post by the author, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/answer100answer/article/details/99579469

1. Setting up an entry class for breakpoint debugging

To step through the HDFS read path in a debugger, we write a small test class:

public class FsShellTest {
    public static void main(String argv[]) throws Exception {

        FsShell shell = new FsShell();
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS","hdfs://hadoop1:9000");
        conf.setQuietMode(false);
        shell.setConf(conf);

        String[] args = {"-text","/user/hello.txt"};   // an ordinary directory
        int res;
        try {
            res = ToolRunner.run(shell, args);
        } finally {
            shell.close();
        }
        System.exit(res);
    }
}

The test class calls ToolRunner.run directly; from there on, the call path is the same as when the FsShell class is invoked from the command line.

Let's first look at the call stack of hadoop fs -text:

"main@1" prio=5 tid=0x1 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
	  at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:353)
	  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:837)
	  at org.apache.hadoop.fs.shell.Display$Cat.getInputStream(Display.java:114)
	  at org.apache.hadoop.fs.shell.Display$Text.getInputStream(Display.java:131)
	  at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:102)
	  at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:319)
	  at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:291)
	  at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:273)
	  at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:257)
	  at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:203)
	  at org.apache.hadoop.fs.shell.Command.run(Command.java:167)
	  at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
	  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	  at cn.whbing.hadoop.FsShellTest.main(FsShellTest.java:25)

From it we can see the processing flow:

ToolRunner.run --> FsShell.run --> Command.processPaths --> Display$Cat.processPath

At this point argument parsing is done: the arguments have been resolved to a read operation. Everything that follows (open and so on) can be regarded as the start of the read itself, analyzed below.

Note: to breakpoint-debug the read path, we can also call the open method directly on the client, for example:

    /**
     * Test 2: read by calling FileSystem.open directly
     */
    @Test
    public void testRead(){
        try {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS","hdfs://hadoop1:9000");
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/hello.txt");
            FSDataInputStream in = fs.open(path);
            BufferedReader buff = new BufferedReader(new InputStreamReader(in));
            String str = null;
            while ((str = buff.readLine()) != null){
                System.out.println(str);
            }
            buff.close();
            in.close();
        } catch (Exception e){
            e.printStackTrace();
        }
    }

2. Read operation analysis

1. Opening the HDFS file

Setting a breakpoint on FSDataInputStream in = fs.open(path); gives the following call stack:

"main@1" prio=5 tid=0x1 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
	  at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:353)
	  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:837)
	  at cn.whbing.hadoop.FsShellTest.testRead(FsShellTest.java:57)

This call goes to DistributedFileSystem.open, which delegates to DFSClient.open and ultimately constructs a DFSInputStream.

2. The DFSInputStream constructor and openInfo()

DFSInputStream.java

  DFSInputStream(DFSClient dfsClient, String src, boolean verifyChecksum,
                 LocatedBlocks locatedBlocks) throws IOException {
    this.dfsClient = dfsClient;
    this.verifyChecksum = verifyChecksum;
    this.src = src;
    synchronized (infoLock) {
      this.cachingStrategy = dfsClient.getDefaultReadCachingStrategy();
    }
    this.locatedBlocks = locatedBlocks;
    openInfo(false); // openInfo() now takes a boolean argument: openInfo(false)
  }

In the constructor, openInfo has gained a boolean parameter. The openInfo method is as follows:

  void openInfo(boolean refreshLocatedBlocks) throws IOException {
    final DFSClient.Conf conf = dfsClient.getConf();
    synchronized(infoLock) {
      lastBlockBeingWrittenLength =
          fetchLocatedBlocksAndGetLastBlockLength(refreshLocatedBlocks);
      int retriesForLastBlockLength = conf.retryTimesForGetLastBlockLength;
      while (retriesForLastBlockLength > 0) {
        // Getting last block length as -1 is a special case. When cluster
        // restarts, DNs may not report immediately. At this time partial block
        // locations will not be available with NN for getting the length. Lets
        // retry for 3 times to get the length.
        if (lastBlockBeingWrittenLength == -1) {
          DFSClient.LOG.warn("Last block locations not available. "
              + "Datanodes might not have reported blocks completely."
              + " Will retry for " + retriesForLastBlockLength + " times");
          waitFor(conf.retryIntervalForGetLastBlockLength);
          lastBlockBeingWrittenLength =
              fetchLocatedBlocksAndGetLastBlockLength(true);
        } else {
          break;
        }
        retriesForLastBlockLength--;
      }
      if (lastBlockBeingWrittenLength == -1
          && retriesForLastBlockLength == 0) {
        throw new IOException("Could not obtain the last block locations.");
      }
    }
  }
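
How many times openInfo retries, and how long it waits between attempts, comes from the client configuration. In the Hadoop 2.x branch these appear to be the keys dfs.client.retry.times.get-last-block-length (default 3) and dfs.client.retry.interval-ms.get-last-block-length (default 4000 ms); treat the exact key names as assumptions to verify against DFSConfigKeys for your version. A minimal sketch of tuning them on the client Configuration:

import org.apache.hadoop.conf.Configuration;

public class OpenInfoRetryConfSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop1:9000");
        // Assumed key names (verify in DFSConfigKeys): retry count and interval
        // used by openInfo() while the last block length is still -1.
        conf.setInt("dfs.client.retry.times.get-last-block-length", 5);
        conf.setInt("dfs.client.retry.interval-ms.get-last-block-length", 2000);
        System.out.println("retries = "
            + conf.getInt("dfs.client.retry.times.get-last-block-length", 3));
    }
}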

The fetchLocatedBlocksAndGetLastBlockLength method, in turn, merely gains a boolean refresh parameter:

  private long fetchLocatedBlocksAndGetLastBlockLength(boolean refresh)
      throws IOException {
    LocatedBlocks newInfo = locatedBlocks;
    if (locatedBlocks == null || refresh) {
      newInfo = dfsClient.getLocatedBlocks(src, 0);
    }
    ...
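
The block list that fetchLocatedBlocksAndGetLastBlockLength obtains through dfsClient.getLocatedBlocks(src, 0) can also be inspected from user code via the public FileSystem API. A minimal sketch, reusing the hadoop1:9000 cluster and /user/hello.txt path from the test class above:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop1:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/hello.txt");
        FileStatus status = fs.getFileStatus(path);
        // Ask the NameNode for the block locations of the whole file --
        // the same metadata DFSClient.getLocatedBlocks() fetches internally.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                + " length=" + b.getLength()
                + " hosts=" + Arrays.toString(b.getHosts()));
        }
        fs.close();
    }
}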
Back on the shell path, Display$Text.getInputStream reads the first two bytes of the stream to sniff the file type; this readShort call is what triggers the first read on the DFSInputStream:

    protected InputStream getInputStream(PathData item) throws IOException {
      FSDataInputStream i = (FSDataInputStream)super.getInputStream(item);

      // Handle 0 and 1-byte files
      short leadBytes;
      try {
        leadBytes = i.readShort();
      }
      ...

DataInputStream.readShort eventually lands in DFSInputStream.read(); the single-byte read() delegates to the byte-array read(byte[], int, int), which wraps the buffer in a ByteArrayStrategy:
  public synchronized int read() throws IOException {
    if (oneByteBuf == null) {
      oneByteBuf = new byte[1];
    }
    int ret = read( oneByteBuf, 0, 1 );
    return ( ret <= 0 ) ? -1 : (oneByteBuf[0] & 0xff);
  }
  public synchronized int read(final byte buf[], int off, int len) throws IOException {
    ReaderStrategy byteArrayReader = new ByteArrayStrategy(buf);
    TraceScope scope =
        dfsClient.getPathTraceScope("DFSInputStream#byteArrayRead", src);
    try {
      return readWithStrategy(byteArrayReader, off, len);
    } finally {
      scope.close();
    }
  }

The read method calls readWithStrategy:


  private synchronized int readWithStrategy(ReaderStrategy strategy, int off, int len) throws IOException {
    dfsClient.checkOpen();
    if (closed.get()) {
      throw new IOException("Stream closed");
    }
    Map<ExtendedBlock,Set<DatanodeInfo>> corruptedBlockMap 
      = new HashMap<ExtendedBlock, Set<DatanodeInfo>>();
    failures = 0;
    if (pos < getFileLength()) {
      int retries = 2;
      while (retries > 0) {
        try {
          // currentNode can be left as null if previous read had a checksum
          // error on the same block. See HDFS-3067
          if (pos > blockEnd || currentNode == null) {
            currentNode = blockSeekTo(pos);
          }
          int realLen = (int) Math.min(len, (blockEnd - pos + 1L));
          synchronized(infoLock) {
            if (locatedBlocks.isLastBlockComplete()) {
              realLen = (int) Math.min(realLen,
                  locatedBlocks.getFileLength() - pos);
            }
          }
          int result = readBuffer(strategy, off, realLen, corruptedBlockMap);
          
          if (result >= 0) {
            pos += result;
          } else {
            // got a EOS from reader though we expect more data on it.
            throw new IOException("Unexpected EOS from the reader");
          }
          if (dfsClient.stats != null) {
            dfsClient.stats.incrementBytesRead(result);
          }
          return result;
        } catch (ChecksumException ce) {
          throw ce;            
        } catch (IOException e) {
          if (retries == 1) {
            DFSClient.LOG.warn("DFS Read", e);
          }
          blockEnd = -1;
          if (currentNode != null) { addToDeadNodes(currentNode); }
          if (--retries == 0) {
            throw e;
          }
        } finally {
          // Check if need to report block replicas corruption either read
          // was successful or ChecksumException occured.
          reportCheckSumFailure(corruptedBlockMap, 
              currentLocatedBlock.getLocations().length);
        }
      }
    }
    return -1;
  }
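
The pos and blockEnd bookkeeping above is driven by how the caller reads: a seek() followed by read() goes through readWithStrategy (and may trigger blockSeekTo), while a positional read takes the separate pread path and leaves the stream position untouched. A minimal sketch, assuming the hadoop1:9000 cluster and /user/hello.txt file from the test class above (and that the file is at least a few bytes long):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekAndPreadTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop1:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/user/hello.txt"))) {
            byte[] buf = new byte[16];

            // Sequential path: seek() moves pos, the following read() goes
            // through readWithStrategy and may call blockSeekTo.
            in.seek(2);
            int n = in.read(buf, 0, buf.length);
            System.out.println("sequential read bytes: " + n);

            // Positional read: does not move the stream position and uses the
            // pread code path instead of readWithStrategy.
            int m = in.read(0, buf, 0, buf.length);
            System.out.println("pread bytes: " + m + ", pos still " + in.getPos());
        }
    }
}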

When no block reader is open yet, or the read position has moved past the current block's end, readWithStrategy calls private synchronized DatanodeInfo blockSeekTo(long target). That method finds the block containing the target offset, picks a datanode for it via chooseDataNode, and then builds a BlockReader for that datanode (via BlockReaderFactory.build, as the stack trace further below also shows). chooseDataNode is as follows:

  private DNAddrPair chooseDataNode(LocatedBlock block,
      Collection<DatanodeInfo> ignoredNodes) throws IOException {
    while (true) {
      try {
        return getBestNodeDNAddrPair(block, ignoredNodes);
      } catch (IOException ie) {
        String errMsg = getBestNodeDNAddrPairErrorString(block.getLocations(),
          deadNodes, ignoredNodes);
        String blockInfo = block.getBlock() + " file=" + src;
        if (failures >= dfsClient.getMaxBlockAcquireFailures()) {
          String description = "Could not obtain block: " + blockInfo;
          DFSClient.LOG.warn(description + errMsg
              + ". Throwing a BlockMissingException");
          throw new BlockMissingException(src, description,
              block.getStartOffset());
        }
        ...

getBestNodeDNAddrPair essentially walks the block's replica locations and returns the first datanode that is not in deadNodes and not in ignoredNodes; if no such node is left, it throws an IOException, which is what the catch block in chooseDataNode above handles.
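
A hypothetical, heavily simplified sketch of that selection idea (String stands in for DatanodeInfo, and the names BestNodeSketch and chooseBest are made up for illustration):

import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: return the first replica location that is neither
// marked dead nor explicitly ignored; null means the caller should refetch
// block locations and retry (or eventually give up).
public class BestNodeSketch {
    static String chooseBest(List<String> locations,
                             Set<String> deadNodes,
                             Set<String> ignoredNodes) {
        for (String dn : locations) {
            if (!deadNodes.contains(dn)
                && (ignoredNodes == null || !ignoredNodes.contains(dn))) {
                return dn;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> locations = Arrays.asList("dn1:9866", "dn2:9866", "dn3:9866");
        Set<String> deadNodes = new HashSet<>(Collections.singletonList("dn1:9866"));
        System.out.println(chooseBest(locations, deadNodes, Collections.emptySet()));
        // prints dn2:9866
    }
}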
Q: What if this while (true) loop can never return a node?

chooseDataNode does not spin forever: failed read attempts land the offending datanode in deadNodes, and once failures reaches dfsClient.getMaxBlockAcquireFailures() (3 by default), chooseDataNode throws a BlockMissingException instead of looping again. The error below shows a failed read attempt being added to deadNodes:

hadoop fs -text hdfs://hadoop1:9000/ec/t.log

19/08/14 17:24:08 WARN hdfs.DFSClient: Failed to connect to /10.179.17.22:9866 for block, add to deadNodes and continue. java.io.IOException: Got error, status message opReadBlock BP-1712821023-10.179.25.59-1564737285129:blk_-9223372036854775648_1019 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1712821023-10.179.25.59-1564737285129:blk_-9223372036854775648_1019, for OP_READ_BLOCK, self=/10.179.72.122:47858, remote=/10.179.17.22:9866, for file /ec/t.log, for pool BP-1712821023-10.179.25.59-1564737285129 block -9223372036854775648_1019
java.io.IOException: Got error, status message opReadBlock BP-1712821023-10.179.25.59-1564737285129:blk_-9223372036854775648_1019 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1712821023-10.179.25.59-1564737285129:blk_-9223372036854775648_1019, for OP_READ_BLOCK, self=/10.179.72.122:47858, remote=/10.179.17.22:9866, for file /ec/t.log, for pool BP-1712821023-10.179.25.59-1564737285129 block -9223372036854775648_1019
        at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:142)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:456)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:424)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:818)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:697)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:679)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:888)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:940)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:741)
        at java.io.DataInputStream.readShort(DataInputStream.java:312)
        at org.apache.hadoop.fs.shell.Display$Text.getInputStream(Display.java:136)
        at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:102)
        at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
        at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
        at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)

As for the BlockReader interface:

public interface BlockReader extends ByteBufferReadable {
  int read(byte[] buf, int off, int len) throws IOException;
  long skip(long n) throws IOException;
  ...
}

In the stack trace above we can see BlockReaderFactory.getRemoteBlockReaderFromTcp, which in turn calls getRemoteBlockReader:

  private BlockReader getRemoteBlockReader(Peer peer) throws IOException {
    if (conf.useLegacyBlockReader) {
      return RemoteBlockReader.newBlockReader(fileName,
          block, token, startOffset, length, conf.ioBufferSize,
          verifyChecksum, clientName, peer, datanode,
          clientContext.getPeerCache(), cachingStrategy);
    } else {
      return RemoteBlockReader2.newBlockReader(
          fileName, block, token, startOffset, length,
          verifyChecksum, clientName, peer, datanode,
          clientContext.getPeerCache(), cachingStrategy);
    }
  }
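
Which branch is taken is controlled by the flag behind conf.useLegacyBlockReader. In the 2.x branch this appears to correspond to the key dfs.client.use.legacy.blockreader, default false, so RemoteBlockReader2 is the normal choice; treat the key name as an assumption to verify in DFSConfigKeys. A minimal sketch:

import org.apache.hadoop.conf.Configuration;

public class BlockReaderConfSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumed key name (verify in DFSConfigKeys): false (the default)
        // selects RemoteBlockReader2, true falls back to RemoteBlockReader.
        conf.setBoolean("dfs.client.use.legacy.blockreader", false);
        System.out.println(conf.getBoolean("dfs.client.use.legacy.blockreader", false));
    }
}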


In the Sender's readBlock method, the client sends the opcode with value 81, org.apache.hadoop.hdfs.protocol.datatransfer.Op.READ_BLOCK, to the peer service handled by the DataXceiver. All of the request parameters are set on an OpReadBlockProto object, which is then written out over the connection, i.e. sent to the DataXceiverServer that was started on the datanode.

At that point the server-side socket thread, which has been blocking on accept, receives the read request with opcode 81 and carries out the subsequent processing.

Note that the other data operations, such as copyBlock and writeBlock, are likewise issued through the Sender.

The Sender class is declared as public class Sender implements DataTransferProtocol.
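
To make the wire format concrete, here is a hypothetical, self-contained sketch of the framing the Sender performs for READ_BLOCK: a 2-byte data transfer protocol version, a 1-byte opcode (81), then the serialized OpReadBlockProto. The version constant and the placeholder payload are assumptions for illustration; the real code writes the proto as a delimited protobuf message:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical, simplified illustration of the READ_BLOCK request framing:
// version, opcode, then the serialized OpReadBlockProto carrying block id,
// offset, length, client name and token.
public class ReadBlockFrameSketch {
    static final short DATA_TRANSFER_VERSION = 28; // assumed 2.x value
    static final byte OP_READ_BLOCK = 81;

    static byte[] frameReadBlockRequest(byte[] opReadBlockProtoBytes) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeShort(DATA_TRANSFER_VERSION); // checked by the receiver's readOp()
        out.writeByte(OP_READ_BLOCK);          // dispatched by processOp()
        out.write(opReadBlockProtoBytes);      // delimited protobuf in the real code
        out.flush();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] frame = frameReadBlockRequest(new byte[] { /* proto bytes */ });
        System.out.println("frame length = " + frame.length);
    }
}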

Now let's look at the run method of DataXceiverServer:

 public void run() {
    Peer peer = null;
    while (datanode.shouldRun && !datanode.shutdownForUpgrade) {
      try {
        // Block here on accept() until a request arrives; tracing the code shows
        // it wraps java.net.ServerSocket#accept.
        peer = peerServer.accept();

        // Make sure the xceiver count is not exceeded
        int curXceiverCount = datanode.getXceiverCount();
        if (curXceiverCount > maxXceiverCount.get()) {
          throw new IOException("Xceiver count " + curXceiverCount
              + " exceeds the limit of concurrent xcievers: "
              + maxXceiverCount.get());
        }
        
        // When a request arrives, create a daemon thread via DataXceiver.create
        // and add it to the datanode's thread group.
        new Daemon(datanode.threadGroup,
            DataXceiver.create(peer, datanode, this))
            .start();
      } catch (SocketTimeoutException ignored) {
        // wake up to see if should continue to run
      } 
      ...
  }

The DataXceiver first calls op = readOp(); to determine which operation was requested (read, write, copy, and so on), and then processOp(op); handles the corresponding logic.

Inside processOp, a switch statement dispatches each opcode to the method that implements it:

protected final void processOp(Op op) throws IOException {
    switch(op) {
    case READ_BLOCK:
      opReadBlock();
      break;
    case WRITE_BLOCK:
      opWriteBlock(in);
      break;
    case REPLACE_BLOCK:
      opReplaceBlock(in);
      break;
    case COPY_BLOCK:
      opCopyBlock(in);
      break;
    case BLOCK_CHECKSUM:
      opBlockChecksum(in);
      break;
    case TRANSFER_BLOCK:
      opTransferBlock(in);
      break;
    case REQUEST_SHORT_CIRCUIT_FDS:
      opRequestShortCircuitFds(in);
      break;
    case RELEASE_SHORT_CIRCUIT_FDS:
      opReleaseShortCircuitFds(in);
      break;
    case REQUEST_SHORT_CIRCUIT_SHM:
      opRequestShortCircuitShm(in);
      break;
    default:
      throw new IOException("Unknown op " + op + " in data stream");
    }
  }
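
On the receiving side, readOp() is the mirror image of the framing sketch shown earlier: it reads and validates the 2-byte version, then reads the 1-byte opcode that processOp switches on. A hypothetical, simplified sketch (same assumed constants as before):

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical mirror of ReadBlockFrameSketch: parse the version, then the
// opcode that the switch in processOp dispatches on.
public class ReadOpSketch {
    static final short DATA_TRANSFER_VERSION = 28; // assumed 2.x value
    static final byte OP_READ_BLOCK = 81;

    static byte readOp(DataInputStream in) throws IOException {
        short version = in.readShort();
        if (version != DATA_TRANSFER_VERSION) {
            throw new IOException("Version Mismatch (Expected: "
                + DATA_TRANSFER_VERSION + ", Received: " + version + ")");
        }
        return in.readByte();
    }

    public static void main(String[] args) throws IOException {
        byte[] frame = {0, 28, 81}; // version 28, opcode 81 (READ_BLOCK)
        byte op = readOp(new DataInputStream(new ByteArrayInputStream(frame)));
        System.out.println("op = " + op + (op == OP_READ_BLOCK ? " (READ_BLOCK)" : ""));
    }
}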

Reference:

https://zhangjun5965.iteye.com/blog/2375278
