HDFS Source Code Analysis: The HDFS Write Path (Part 9)

1. What happens when we write a file to HDFS?

Test code

/**
 * Copyright (c) 2019 leyou ALL Rights Reserved
 * Project: hadoop-main
 * Package: org.apache.hadoop
 * Version: 1.0
 *
 * @author qingzhi.wu
 * @date 2020/7/6 20:05
 */
public class TestHDFS {
    public static void main(String[] args) throws IOException {
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.newInstance(configuration);
        /**
         * Let's guess ahead of time: based on the source we have read so far, where does
         * this mkdirs actually execute? Almost certainly on the RPC server side, which means
         * our NameNodeRpcServer. So why can't we simply write
         *   NameNode namenode = new NameNode(conf);
         *   namenode.mkdirs(new Path("xx"));
         * The answer is that this FileSystem must obtain an RPC proxy internally.
         */
        //scenario-driven walkthrough (the metadata update flow)
        fileSystem.mkdirs(new Path("/user/hive/warehouse/test/my"));


        /**
         * Before: creating a directory (metadata update), INode type: INodeDirectory
         * Now:    creating a file      (metadata update), INode type: INodeFile
         *
         * Initialization work:
         * 1. Add an INodeFile and update the directory tree
         * 2. Add a lease
         * 3. Start the DataStreamer
         * 4. Start the lease-renewal thread
         */
        FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path("/data"));

        fsDataOutputStream.write("测试".getBytes());

        //fileSystem.copyFromLocalFile(new Path("aaa"),new Path("sss"));

        // Close the stream so that buffered packets are flushed and the lease is released.
        fsDataOutputStream.close();
    }
}

1.1 Adding an INodeFile to the directory tree (updating the namespace)

1.1.1 fileSystem.create(new Path("/data"));

  /**
   * Create an FSDataOutputStream at the indicated Path.
   * Files are overwritten by default.
   * 
   * @param f the file to create
   */
  public FSDataOutputStream create(Path f) throws IOException {
    //TODO important
    return create(f, true);
  }

1.1.2 DistributedFileSystem's create method

  @Override
  public FSDataOutputStream create(Path f, FsPermission permission,
      boolean overwrite, int bufferSize, short replication, long blockSize,
      Progressable progress) throws IOException {
    //TODO important
    return this.create(f, permission,
        overwrite ? EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
            : EnumSet.of(CreateFlag.CREATE), bufferSize, replication,
        blockSize, progress, null);
  }

1.1.3 create

  @Override
  public FSDataOutputStream create(final Path f, final FsPermission permission,
    final EnumSet<CreateFlag> cflags, final int bufferSize,
    final short replication, final long blockSize, final Progressable progress,
    final ChecksumOpt checksumOpt) throws IOException {
    statistics.incrementWriteOps(1);
    Path absF = fixRelativePart(f);
    return new FileSystemLinkResolver<FSDataOutputStream>() {
      @Override
      public FSDataOutputStream doCall(final Path p)
          throws IOException, UnresolvedLinkException {

        //TODO creates a DFSOutputStream and does a lot of initialization work

        /**
         * 1. Adds an INodeFile to the directory tree
         * 2. Adds a lease
         * 3. Starts the DataStreamer (the key service of the write path)
         */
        final DFSOutputStream dfsos = dfs.create(getPathName(p), permission,
                cflags, replication, blockSize, progress, bufferSize,
                checksumOpt);
        return dfs.createWrappedOutputStream(dfsos, statistics);
      }
      @Override
      public FSDataOutputStream next(final FileSystem fs, final Path p)
          throws IOException {
        return fs.create(p, permission, cflags, bufferSize,
            replication, blockSize, progress, checksumOpt);
      }
    }.resolve(this, absF);
  }

1.1.4 DFSClient's create method

  /**
   * Call {@link #create(String, FsPermission, EnumSet, boolean, short, 
   * long, Progressable, int, ChecksumOpt)} with <code>createParent</code>
   *  set to true.
   */
  public DFSOutputStream create(String src, 
                             FsPermission permission,
                             EnumSet<CreateFlag> flag, 
                             short replication,
                             long blockSize,
                             Progressable progress,
                             int buffersize,
                             ChecksumOpt checksumOpt)
      throws IOException {
    //TODO important
    return create(src, permission, flag, true,
        replication, blockSize, progress, buffersize, checksumOpt, null);
  }
 // ----------------------------------------------------------------------------------
  /**
   * Same as {@link #create(String, FsPermission, EnumSet, boolean, short, long,
   * Progressable, int, ChecksumOpt)} with the addition of favoredNodes that is
   * a hint to where the namenode should place the file blocks.
   * The favored nodes hint is not persisted in HDFS. Hence it may be honored
   * at the creation time only. HDFS could move the blocks during balancing or
   * replication, to move the blocks from favored nodes. A value of null means
   * no favored nodes for this create
   */
  public DFSOutputStream create(String src, 
                             FsPermission permission,
                             EnumSet<CreateFlag> flag, 
                             boolean createParent,
                             short replication,
                             long blockSize,
                             Progressable progress,
                             int buffersize,
                             ChecksumOpt checksumOpt,
                             InetSocketAddress[] favoredNodes) throws IOException {
    checkOpen();
    if (permission == null) {
      permission = FsPermission.getFileDefault();
    }
    FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
    if(LOG.isDebugEnabled()) {
      LOG.debug(src + ": masked=" + masked);
    }

    //TODO
    /**
     * 1. Adds the file to the directory tree
     * 2. Adds a lease
     * 3. Starts the DataStreamer
     */
    final DFSOutputStream result = DFSOutputStream.newStreamForCreate(this,
        src, masked, flag, createParent, replication, blockSize, progress,
        buffersize, dfsClientConf.createChecksum(checksumOpt),
        getFavoredNodesStr(favoredNodes));
    beginFileLease(result.getFileId(), result);
    return result;
  }

1.1.5 newStreamForCreate

static DFSOutputStream newStreamForCreate(DFSClient dfsClient, String src,
      FsPermission masked, EnumSet<CreateFlag> flag, boolean createParent,
      short replication, long blockSize, Progressable progress, int buffersize,
      DataChecksum checksum, String[] favoredNodes) throws IOException {
    TraceScope scope =
        dfsClient.getPathTraceScope("newStreamForCreate", src);
    try {
      HdfsFileStatus stat = null;

      // Retry the create if we get a RetryStartFileException up to a maximum
      // number of times
      boolean shouldRetry = true;
      int retryCount = CREATE_RETRY_COUNT;

      //TODO the retry structure
      while (shouldRetry) {
        shouldRetry = false;
        try {

          /**
           * HDFS in a nutshell:
           * 1. Creating a directory: add an INode (INodeDirectory) to the directory tree (the metadata)
           * 2. Uploading a file:
           *      1) add an INodeFile to the directory tree
           *      2) then write data into that file
           *
           * Updates the metadata
           * Adds the lease
           * TODO adds the INodeFile to the directory tree, logs the edit and adds the lease
           * TODO all of this requires talking to the NameNode server side
           */
          stat = dfsClient.namenode.create(src, masked, dfsClient.clientName,
              new EnumSetWritable<CreateFlag>(flag), createParent, replication,
              blockSize, SUPPORTED_CRYPTO_VERSIONS);
          break;
        } catch (RemoteException re) {
          IOException e = re.unwrapRemoteException(
              AccessControlException.class,
              DSQuotaExceededException.class,
              FileAlreadyExistsException.class,
              FileNotFoundException.class,
              ParentNotDirectoryException.class,
              NSQuotaExceededException.class,
              RetryStartFileException.class,
              SafeModeException.class,
              UnresolvedPathException.class,
              SnapshotAccessControlException.class,
              UnknownCryptoProtocolVersionException.class);
          if (e instanceof RetryStartFileException) {
            //TODO retry
            if (retryCount > 0) {
              shouldRetry = true;
              retryCount--;
            } else {
              throw new IOException("Too many retries because of encryption" +
                  " zone operations", e);
            }
          } else {
            throw e;
          }
        }
      }
      Preconditions.checkNotNull(stat, "HdfsFileStatus should not be null!");
      final DFSOutputStream out = new DFSOutputStream(dfsClient, src, stat,
          flag, progress, checksum, favoredNodes);
      out.start();
      return out;
    } finally {
      scope.close();
    }
  }

1.1.6 NameNodeRpcServer's create method

How did we get here? We saw the DFSClient issue the create through its namenode proxy, and from our experience reading this code base, the server-side implementation of that proxy is NameNodeRpcServer.

  @Override // ClientProtocol
  public HdfsFileStatus create(String src, FsPermission masked,
      String clientName, EnumSetWritable<CreateFlag> flag,
      boolean createParent, short replication, long blockSize, 
      CryptoProtocolVersion[] supportedVersions)
      throws IOException {
    //TODO check that the NameNode has finished starting up
    checkNNStartup();
    //TODO get the client machine
    String clientMachine = getClientMachine();
    if (stateChangeLog.isDebugEnabled()) {
      stateChangeLog.debug("*DIR* NameNode.create: file "
          +src+" for "+clientName+" at "+clientMachine);
    }
    if (!checkPathLength(src)) {
      throw new IOException("create: Pathname too long.  Limit "
          + MAX_PATH_LENGTH + " characters, " + MAX_PATH_DEPTH + " levels.");
    }

    CacheEntryWithPayload cacheEntry = RetryCache.waitForCompletion(retryCache, null);
    if (cacheEntry != null && cacheEntry.isSuccess()) {
      return (HdfsFileStatus) cacheEntry.getPayload();
    }

    HdfsFileStatus status = null;
    try {
      PermissionStatus perm = new PermissionStatus(getRemoteUser()
          .getShortUserName(), null, masked);
      //TODO start (open) the file, ready for writing
      status = namesystem.startFile(src, perm, clientName, clientMachine,
          flag.get(), createParent, replication, blockSize, supportedVersions,
          cacheEntry != null);
    } finally {
      RetryCache.setState(cacheEntry, status != null, status);
    }

    metrics.incrFilesCreated();
    metrics.incrCreateFileOps();
    return status;
  }

1.1.7 FSNamesystem's startFile method

namesystem is the FSNamesystem, whose main job is handling metadata: updating the directory tree, flushing metadata to disk, and so on.

/**
   * Create a new file entry in the namespace.
   * 
   * For description of parameters and exceptions thrown see
   * {@link ClientProtocol#create}, except it returns valid file status upon
   * success
   */
  HdfsFileStatus startFile(String src, PermissionStatus permissions,
      String holder, String clientMachine, EnumSet<CreateFlag> flag,
      boolean createParent, short replication, long blockSize, 
      CryptoProtocolVersion[] supportedVersions, boolean logRetryCache)
      throws AccessControlException, SafeModeException,
      FileAlreadyExistsException, UnresolvedLinkException,
      FileNotFoundException, ParentNotDirectoryException, IOException {

    HdfsFileStatus status = null;
    try {
      //TODO important
      status = startFileInt(src, permissions, holder, clientMachine, flag,
          createParent, replication, blockSize, supportedVersions,
          logRetryCache);
    } catch (AccessControlException e) {
      logAuditEvent(false, "create", src);
      throw e;
    }
    return status;
  }

1.1.8 startFileInt

  private HdfsFileStatus startFileInt(final String srcArg,
      PermissionStatus permissions, String holder, String clientMachine,
      EnumSet<CreateFlag> flag, boolean createParent, short replication,
      long blockSize, CryptoProtocolVersion[] supportedVersions,
      boolean logRetryCache)
      throws AccessControlException, SafeModeException,
      FileAlreadyExistsException, UnresolvedLinkException,
      FileNotFoundException, ParentNotDirectoryException, IOException {
    String src = srcArg;
  ....

    //TODO wait for the metadata (FSImage) to finish loading: the NameNode may have just
    // started, and before serving requests it must check that the metadata is in memory
    waitForLoadingFSImage();

    /**
     * If the file is in an encryption zone, we optimistically create an
     * EDEK for the file by calling out to the configured KeyProvider.
     * Since this typically involves doing an RPC, we take the readLock
     * initially, then drop it to do the RPC.
     * 
     * Since the path can flip-flop between being in an encryption zone and not
     * in the meantime, we need to recheck the preconditions when we retake the
     * lock to do the create. If the preconditions are not met, we throw a
     * special RetryStartFileException to ask the DFSClient to try the create
     * again later.
     */
    CryptoProtocolVersion protocolVersion = null;
    CipherSuite suite = null;
    String ezKeyName = null;
    EncryptedKeyVersion edek = null;

    if (provider != null) {
      readLock();
      try {
        src = dir.resolvePath(pc, src, pathComponents);
        INodesInPath iip = dir.getINodesInPath4Write(src);
        // Nothing to do if the path is not within an EZ
        final EncryptionZone zone = dir.getEZForPath(iip);
        if (zone != null) {
          protocolVersion = chooseProtocolVersion(zone, supportedVersions);
          suite = zone.getSuite();
          ezKeyName = zone.getKeyName();

          Preconditions.checkNotNull(protocolVersion);
          Preconditions.checkNotNull(suite);
          Preconditions.checkArgument(!suite.equals(CipherSuite.UNKNOWN),
              "Chose an UNKNOWN CipherSuite!");
          Preconditions.checkNotNull(ezKeyName);
        }
      } finally {
        readUnlock();
      }

      Preconditions.checkState(
          (suite == null && ezKeyName == null) ||
              (suite != null && ezKeyName != null),
          "Both suite and ezKeyName should both be null or not null");

      // Generate EDEK if necessary while not holding the lock
      edek = generateEncryptedDataEncryptionKey(ezKeyName);
      EncryptionFaultInjector.getInstance().startFileAfterGenerateKey();
    }

    // Proceed with the create, using the computed cipher suite and 
    // generated EDEK
    BlocksMapUpdateInfo toRemoveBlocks = null;
    writeLock();
    try {
      checkOperation(OperationCategory.WRITE);
      //TODO check whether the NameNode is in safe mode
      checkNameNodeSafeMode("Cannot create file" + src);
      dir.writeLock();
      try {
        //TODO resolve the path
        src = dir.resolvePath(pc, src, pathComponents);
        final INodesInPath iip = dir.getINodesInPath4Write(src);
        //TODO the important call
        toRemoveBlocks = startFileInternal(
            pc, iip, permissions, holder,
            clientMachine, create, overwrite,
            createParent, replication, blockSize,
            isLazyPersist, suite, protocolVersion, edek,
            logRetryCache);
...
    return stat;
  }

1.1.9 startFileInternal

 /**
   * Create a new file or overwrite an existing file<br>
   * 
   * Once the file is create the client then allocates a new block with the next
   * call using {@link ClientProtocol#addBlock}.
   * <p>
   * For description of parameters and exceptions thrown see
   * {@link ClientProtocol#create}
   */
  private BlocksMapUpdateInfo startFileInternal(FSPermissionChecker pc, 
      INodesInPath iip, PermissionStatus permissions, String holder,
      String clientMachine, boolean create, boolean overwrite, 
      boolean createParent, short replication, long blockSize, 
      boolean isLazyPersist, CipherSuite suite, CryptoProtocolVersion version,
      EncryptedKeyVersion edek, boolean logRetryEntry)
      throws IOException {
...
      INodeFile newNode = null;

      // Always do an implicit mkdirs for parent directory tree.
      Map.Entry<INodesInPath, String> parent = FSDirMkdirOp
          .createAncestorDirectories(dir, iip, permissions);
      if (parent != null) {
        //TODO add an INodeFile node to the directory tree
        // dir is the FSDirectory
        iip = dir.addFile(parent.getKey(), parent.getValue(), permissions,
            replication, blockSize, holder, clientMachine);
...
  }

1.1.10 addFile

  /**
   * Add the given filename to the fs.
   * @return the new INodesInPath instance that contains the new INode
   */
  INodesInPath addFile(INodesInPath existing, String localName, PermissionStatus
      permissions, short replication, long preferredBlockSize,
      String clientName, String clientMachine)
    throws FileAlreadyExistsException, QuotaExceededException,
      UnresolvedLinkException, SnapshotAccessControlException, AclException {

    long modTime = now();
    //TODO create an INodeFile
    INodeFile newNode = newINodeFile(allocateNewInodeId(), permissions, modTime,
        modTime, replication, preferredBlockSize);
    newNode.setLocalName(localName.getBytes(Charsets.UTF_8));
    newNode.toUnderConstruction(clientName, clientMachine);

    INodesInPath newiip;
    writeLock();
    try {
      //TODO add the INodeFile to the tree
      newiip = addINode(existing, newNode);
    } finally {
      writeUnlock();
    }
    if (newiip == null) {
      NameNode.stateChangeLog.info("DIR* addFile: failed to add " +
          existing.getPath() + "/" + localName);
      return null;
    }

    if(NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("DIR* addFile: " + localName + " is added");
    }
    return newiip;
  }
  /**
   * Add the given child to the namespace.
   * @param existing the INodesInPath containing all the ancestral INodes
   * @param child the new INode to add
   * @return a new INodesInPath instance containing the new child INode. Null
   * if the adding fails.
   * @throws QuotaExceededException is thrown if it violates quota limit
   */
  INodesInPath addINode(INodesInPath existing, INode child)
      throws QuotaExceededException, UnresolvedLinkException {
    cacheName(child);
    writeLock();
    try {
      //append it at the end of the path
      return addLastINode(existing, child, true);
    } finally {
      writeUnlock();
    }
  }
/**
   * Add a child to the end of the path specified by INodesInPath.
   * @return an INodesInPath instance containing the new INode
   */
  @VisibleForTesting
  public INodesInPath addLastINode(INodesInPath existing, INode inode,
      boolean checkQuota) throws QuotaExceededException {
    assert existing.getLastINode() != null &&
        existing.getLastINode().isDirectory();

    final int pos = existing.length();
    // Disallow creation of /.reserved. This may be created when loading
    // editlog/fsimage during upgrade since /.reserved was a valid name in older
    // release. This may also be called when a user tries to create a file
    // or directory /.reserved.
    if (pos == 1 && existing.getINode(0) == rootDir && isReservedName(inode)) {
      throw new HadoopIllegalArgumentException(
          "File name \"" + inode.getLocalName() + "\" is reserved and cannot "
              + "be created. If this is during upgrade change the name of the "
              + "existing file or directory to another name before upgrading "
              + "to the new release.");
    }

    //TODO get the parent directory, e.g. warehouse
    final INodeDirectory parent = existing.getINode(pos - 1).asDirectory();
    // The filesystem limits are not really quotas, so this check may appear
    // odd. It's because a rename operation deletes the src, tries to add
    // to the dest, if that fails, re-adds the src from whence it came.
    // The rename code disables the quota when it's restoring to the
    // original location because a quota violation would cause the the item
    // to go "poof".  The fs limits must be bypassed for the same reason.
    if (checkQuota) {
      final String parentPath = existing.getPath(pos - 1);
      verifyMaxComponentLength(inode.getLocalNameBytes(), parentPath);
      verifyMaxDirItems(parent, parentPath);
    }
    // always verify inode name
    verifyINodeName(inode.getLocalNameBytes());

    final QuotaCounts counts = inode.computeQuotaUsage(getBlockStoragePolicySuite());
    updateCount(existing, pos, counts, checkQuota);

    boolean isRename = (inode.getParent() != null);
    boolean added;
    try {
      //TODO add the child under the parent directory
      added = parent.addChild(inode, true, existing.getLatestSnapshotId());
    } catch (QuotaExceededException e) {
      updateCountNoQuotaCheck(existing, pos, counts.negation());
      throw e;
    }
    if (!added) {
      updateCountNoQuotaCheck(existing, pos, counts.negation());
      return null;
    } else {
      if (!isRename) {
        AclStorage.copyINodeDefaultAcl(inode);
      }
      addToInodeMap(inode);
    }
    return INodesInPath.append(existing, inode, inode.getLocalNameBytes());
  }

We can stop here, because the previous article already covered how the directory tree gets updated.

1.1.11 Summary

Uploading a file: updating the directory tree (1)

  • Suppose we want to write a 1 GB file: everything starts from the DistributedFileSystem object
  • DistributedFileSystem creates our DFSClient, which calls NameNodeRpcServer to create the metadata
  • FSDirectory first creates an INodeFile and attaches it to the directory tree
  • The metadata is then written to disk through the double-buffer (edit log) mechanism

1.2 Adding a lease (which I think of as a write lock)

The HDFS lease

Consider a scenario first: what happens if several clients try to write the same HDFS file at the same time?

That cannot be allowed, because HDFS files do not support concurrent writes, so HDFS has a mechanism called the file lease.

At any moment only one HDFS client can hold the lease for a given file on the NameNode, and only the lease holder is allowed to write data to that file.

If another client tries to acquire the lease for that file in the meantime, it fails and has to wait until the lease is released or expires.

This lease mechanism is how HDFS controls concurrent writes: it guarantees that at any time only one client is writing a given file.

How does the lease mechanism work?

If the file has no lease yet, one is created; once the client holds the lease it can start writing. A lease has an expiry time, and it must not expire while the client is still writing.

That is where renewal comes in: while writing, the client starts a separate thread that keeps renewing the lease with the NameNode, very much like a heartbeat.

Inside the NameNode there is also a dedicated background thread that monitors the renewal time of every lease; if a lease has not been renewed for too long, it is expired automatically so that other clients get a chance to write.
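To make the restriction concrete, here is a small sketch (not from the article's code base) that opens the same path for writing from two separate clients. It assumes a reachable cluster and a hypothetical path /tmp/lease-demo; while the first client's lease is active, the second create is normally rejected with an IOException (for example an AlreadyBeingCreatedException, depending on the version).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class LeaseDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path p = new Path("/tmp/lease-demo");   // hypothetical test path

        // Client 1 opens the file for writing and now holds the lease.
        FileSystem fs1 = FileSystem.newInstance(conf);
        FSDataOutputStream out1 = fs1.create(p);

        // Client 2 (a separate FileSystem/DFSClient instance) tries to create
        // the same file while client 1's lease is still active.
        FileSystem fs2 = FileSystem.newInstance(conf);
        try {
            FSDataOutputStream out2 = fs2.create(p);
            out2.close();
        } catch (IOException e) {
            // Expected while the lease is held, e.g. AlreadyBeingCreatedException.
            System.out.println("Second writer rejected: " + e.getMessage());
        } finally {
            out1.close();
            fs1.close();
            fs2.close();
        }
    }
}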

1.2.1 startFileInternal (creating the file in HDFS)

  /**
   * Create a new file or overwrite an existing file<br>
   * 
   * Once the file is create the client then allocates a new block with the next
   * call using {@link ClientProtocol#addBlock}.
   * <p>
   * For description of parameters and exceptions thrown see
   * {@link ClientProtocol#create}
   */
  private BlocksMapUpdateInfo startFileInternal(FSPermissionChecker pc, 
      INodesInPath iip, PermissionStatus permissions, String holder,
      String clientMachine, boolean create, boolean overwrite, 
      boolean createParent, short replication, long blockSize, 
      boolean isLazyPersist, CipherSuite suite, CryptoProtocolVersion version,
      EncryptedKeyVersion edek, boolean logRetryEntry)
      throws IOException {
  ...
      if (parent != null) {
        //TODO add an INodeFile node to the directory tree
        // dir is the FSDirectory
        iip = dir.addFile(parent.getKey(), parent.getValue(), permissions,
            replication, blockSize, holder, clientMachine);
        newNode = iip != null ? iip.getLastINode().asFile() : null;
      }

      if (newNode == null) {
        throw new IOException("Unable to add " + src +  " to namespace");
      }
      //TODO add the lease
      //TODO we do not need to dig too deep into the lease concept here:
      // it exists to prevent concurrent writes in HDFS -- only one client may write the file at a time --
      // so you can think of it as a write lock ("lease" is simply the word HDFS uses for it)
      leaseManager.addLease(newNode.getFileUnderConstructionFeature()
          .getClientName(), src);

...
  }

1.2.2 addLease

This step simply checks whether a lease already exists; if not, one is created, and the lease is then recorded in two data structures: a TreeMap and a TreeSet.

leases is a TreeMap holding <client name, lease> entries.

sortedLeases is a TreeSet of leases, ordered by their last renewal time.

  /**
   * Adds (or re-adds) the lease for the specified file.
   */
  synchronized Lease addLease(String holder, String src) {
    //TODO first check whether a lease already exists
    Lease lease = getLease(holder);
    if (lease == null) {
      //TODO no lease yet (we are creating the file, so there cannot be one) -- create it
      lease = new Lease(holder);
      //TODO store it in the (sortable) data structures
      //TODO leases is a TreeMap
      //TODO sortedLeases is a TreeSet
      leases.put(holder, lease);
      sortedLeases.add(lease);
    } else {
      renewLease(lease);
    }
    sortedLeasesByPath.put(src, lease);
    lease.paths.add(src);
    return lease;
  }

1.2.3 Summary

  • After we ask the NameNode to create the file
  • a lease is created (if one does not already exist)
  • and stored in the LeaseManager's data structures
  • There are two of them: a TreeMap and a TreeSet
  • The TreeMap holds <client name, lease> key-value pairs
  • The TreeSet is a TreeSet<Lease>, ordered so that the oldest lease comes first (see the sketch below)
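As a rough illustration of why two structures are kept, here is a minimal model (a sketch, not the real LeaseManager): the map answers "does this holder already have a lease?", and the time-ordered set lets the expiry monitor always start from the oldest lease.

import java.util.Comparator;
import java.util.TreeMap;
import java.util.TreeSet;

class LeaseBook {
    static class Lease {
        final String holder;      // client name
        long lastRenewed;         // last renewal timestamp (ms)
        Lease(String holder, long now) { this.holder = holder; this.lastRenewed = now; }
    }

    // holder -> lease: fast lookup when a client creates another file or renews
    private final TreeMap<String, Lease> byHolder = new TreeMap<>();
    // leases ordered by last renewal time: the expiry monitor only looks at the oldest
    private final TreeSet<Lease> byTime = new TreeSet<>(
            Comparator.comparingLong((Lease l) -> l.lastRenewed).thenComparing(l -> l.holder));

    synchronized Lease addLease(String holder, long now) {
        Lease lease = byHolder.get(holder);
        if (lease == null) {
            lease = new Lease(holder, now);
            byHolder.put(holder, lease);
            byTime.add(lease);
        } else {
            renew(lease, now);
        }
        return lease;
    }

    synchronized void renew(Lease lease, long now) {
        // remove before mutating the timestamp so the ordered set stays consistent
        byTime.remove(lease);
        lease.lastRenewed = now;
        byTime.add(lease);
    }
}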


1.3 Starting the DataStreamer

1.3.1 newStreamForCreate

static DFSOutputStream newStreamForCreate(DFSClient dfsClient, String src,
      FsPermission masked, EnumSet<CreateFlag> flag, boolean createParent,
      short replication, long blockSize, Progressable progress, int buffersize,
      DataChecksum checksum, String[] favoredNodes) throws IOException {
    TraceScope scope =
        dfsClient.getPathTraceScope("newStreamForCreate", src);
    try {
      HdfsFileStatus stat = null;

      // Retry the create if we get a RetryStartFileException up to a maximum
      // number of times
      boolean shouldRetry = true;
      int retryCount = CREATE_RETRY_COUNT;

      //TODO the retry structure
      while (shouldRetry) {
        shouldRetry = false;
        try {

          /**
           * HDFS in a nutshell:
           * 1. Creating a directory: add an INode (INodeDirectory) to the directory tree (the metadata)
           * 2. Uploading a file:
           *      1) add an INodeFile to the directory tree
           *      2) then write data into that file
           *
           * Updates the metadata
           * Adds the lease
           * TODO adds the INodeFile to the directory tree, logs the edit and adds the lease
           * TODO all of this requires talking to the NameNode server side
           */
          stat = dfsClient.namenode.create(src, masked, dfsClient.clientName,
              new EnumSetWritable<CreateFlag>(flag), createParent, replication,
              blockSize, SUPPORTED_CRYPTO_VERSIONS);
          break;
        } catch (RemoteException re) {
          IOException e = re.unwrapRemoteException(
              AccessControlException.class,
              DSQuotaExceededException.class,
              FileAlreadyExistsException.class,
              FileNotFoundException.class,
              ParentNotDirectoryException.class,
              NSQuotaExceededException.class,
              RetryStartFileException.class,
              SafeModeException.class,
              UnresolvedPathException.class,
              SnapshotAccessControlException.class,
              UnknownCryptoProtocolVersionException.class);
          if (e instanceof RetryStartFileException) {
            //TODO retry
            if (retryCount > 0) {
              shouldRetry = true;
              retryCount--;
            } else {
              throw new IOException("Too many retries because of encryption" +
                  " zone operations", e);
            }
          } else {
            throw e;
          }
        }
      }
      Preconditions.checkNotNull(stat, "HdfsFileStatus should not be null!");

      //TODO initializes the DataStreamer, the key object of the write path
      final DFSOutputStream out = new DFSOutputStream(dfsClient, src, stat,
          flag, progress, checksum, favoredNodes);
      //TODO starts the DataStreamer
      out.start();
      return out;
    } finally {
      scope.close();
    }
  }

1.3.2 new DFSOutputStream

  /** Construct a new output stream for creating a file. */
  private DFSOutputStream(DFSClient dfsClient, String src, HdfsFileStatus stat,
      EnumSet<CreateFlag> flag, Progressable progress,
      DataChecksum checksum, String[] favoredNodes) throws IOException {
    this(dfsClient, src, progress, stat, checksum);
    this.shouldSyncBlock = flag.contains(CreateFlag.SYNC_BLOCK);

    /**
     * Directory -> File -> Block (128M) -> packet -> chunk
     *
     * TODO chunk: 512 bytes
     * TODO checksum: 4 bytes
     *      chunk on the wire = 516 bytes
     * TODO packet: 65536 bytes (64K)
     * TODO block: 128M
     */

    computePacketChunkSize(dfsClient.getConf().writePacketSize, bytesPerChecksum);

    //TODO create the DataStreamer (the whole write path is driven by this class)
    streamer = new DataStreamer(stat, null);
    if (favoredNodes != null && favoredNodes.length != 0) {
      streamer.setFavoredNodes(favoredNodes);
    }
  }
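A quick back-of-the-envelope for these sizes, as a sketch with assumed defaults (512-byte chunks, 4-byte checksums, 64 KB packets, 128 MB blocks). The real computePacketChunkSize also reserves room for the packet header, so the actual chunk count per packet comes out slightly lower, around 126-127.

public class PacketMath {
    public static void main(String[] args) {
        int bytesPerChecksum = 512;            // one chunk of user data
        int checksumSize = 4;                  // CRC per chunk
        int chunkWithChecksum = bytesPerChecksum + checksumSize;  // 516 bytes on the wire
        int writePacketSize = 64 * 1024;       // assumed dfs.client-write-packet-size default

        int chunksPerPacket = writePacketSize / chunkWithChecksum;
        long blockSize = 128L * 1024 * 1024;   // assumed dfs.blocksize default
        long packetsPerBlock = blockSize / (long) (chunksPerPacket * bytesPerChecksum);

        System.out.println("chunks per packet ~ " + chunksPerPacket);     // ~127
        System.out.println("packets per 128MB block ~ " + packetsPerBlock);
    }
}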

1.3.3 DataStreamer start (it is a thread, so look at its run method)

 synchronized (dataQueue) {
            // wait for a packet to be sent.
            long now = Time.monotonicNow();
            //TODO the first time we get here there is no data yet, so we enter this loop
            //dataQueue.size() == 0
            while ((!streamerClosed && !hasError && dfsClient.clientRunning 
                && dataQueue.size() == 0 && 
                (stage != BlockConstructionStage.DATA_STREAMING || 
                 stage == BlockConstructionStage.DATA_STREAMING && 
                 now - lastPacket < dfsClient.getConf().socketTimeout/2)) || doSleep ) {
              long timeout = dfsClient.getConf().socketTimeout/2 - (now-lastPacket);
              timeout = timeout <= 0 ? 1000 : timeout;
              timeout = (stage == BlockConstructionStage.DATA_STREAMING)?
                 timeout : 1000;
              try {
                //TODO if the dataQueue is empty, the thread blocks here
                dataQueue.wait(timeout);
              } catch (InterruptedException  e) {
                DFSClient.LOG.warn("Caught exception ", e);
              }
              doSleep = false;
              now = Time.monotonicNow();
            }
            if (streamerClosed || hasError || !dfsClient.clientRunning) {
              continue;
            }
     ...
 }

1.3.4 Summary

  • Once the file and its lease have been created, the client creates a DataStreamer (crucial: this is the object that actually ships the data)
  • It first checks the dataQueue, a LinkedList of packets; if it finds no data there it waits until someone writes some (see the sketch after this list)
  • Block -> 128M, packet -> 64KB, chunk -> 516 bytes (including a 4-byte checksum)
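The waiting behaviour above is a classic guarded wait on a shared queue. Below is a stripped-down sketch of what the DataStreamer consumer loop boils down to (error handling, pipeline stages and acks omitted; MiniStreamer and send() are made-up names).

import java.util.LinkedList;

class MiniStreamer implements Runnable {
    private final LinkedList<byte[]> dataQueue = new LinkedList<>();
    private volatile boolean running = true;

    @Override
    public void run() {
        while (running) {
            byte[] packet;
            synchronized (dataQueue) {
                // wait while there is nothing to send (with a timeout, like DataStreamer)
                while (running && dataQueue.isEmpty()) {
                    try {
                        dataQueue.wait(1000);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                if (!running) {
                    return;
                }
                packet = dataQueue.removeFirst();
            }
            send(packet);   // outside the lock: ship the packet down the pipeline
        }
    }

    private void send(byte[] packet) { /* the real code writes to the pipeline socket */ }
}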


1.4 Starting lease renewal

1.4.1 beginFileLease

  /** Get a lease and start automatic renewal */
  private void beginFileLease(final long inodeId, final DFSOutputStream out)
      throws IOException {
    //TODO register the inodeId (the key for the file being written), the output stream and this DFSClient with the LeaseRenewer
    getLeaseRenewer().put(inodeId, out, this);
  }

1.4.2 LeaseRenewer's put method

  synchronized void put(final long inodeId, final DFSOutputStream out,
      final DFSClient dfsc) {
    if (dfsc.isClientRunning()) {
      if (!isRunning() || isRenewerExpired()) {
        //start a new deamon with a new id.
        final int id = ++currentId;
        daemon = new Daemon(new Runnable() {
          @Override
          public void run() {
            try {
              if (LOG.isDebugEnabled()) {
                LOG.debug("Lease renewer daemon for " + clientsString()
                    + " with renew id " + id + " started");
              }
              //TODO LeaseRenewer: the thread that keeps renewing the lease
              LeaseRenewer.this.run(id);
            } catch(InterruptedException e) {
              if (LOG.isDebugEnabled()) {
                LOG.debug(LeaseRenewer.this.getClass().getSimpleName()
                    + " is interrupted.", e);
              }
            } finally {
              synchronized(LeaseRenewer.this) {
                Factory.INSTANCE.remove(LeaseRenewer.this);
              }
              if (LOG.isDebugEnabled()) {
                LOG.debug("Lease renewer daemon for " + clientsString()
                    + " with renew id " + id + " exited");
              }
            }
          }
          
          @Override
          public String toString() {
            return String.valueOf(LeaseRenewer.this);
          }
        });
        daemon.start();
      }
      dfsc.putFileBeingWritten(inodeId, out);
      emptyTime = Long.MAX_VALUE;
    }
  }

1.4.3 run

/**
   * Periodically check in with the namenode and renew all the leases
   * when the lease period is half over.
   */
  private void run(final int id) throws InterruptedException {

    //the loop wakes up periodically and checks whether it is time to renew
    for(long lastRenewed = Time.monotonicNow(); !Thread.interrupted();
        Thread.sleep(getSleepPeriod())) {
      final long elapsed = Time.monotonicNow() - lastRenewed;
      //TODO if roughly 30 seconds (half the soft limit) have passed without a renewal
      if (elapsed >= getRenewalTime()) {
        try {
          //TODO renew the lease
          renew();
          if (LOG.isDebugEnabled()) {
            LOG.debug("Lease renewer daemon for " + clientsString()
                + " with renew id " + id + " executed");
          }
          lastRenewed = Time.monotonicNow();
        } catch (SocketTimeoutException ie) {
          LOG.warn("Failed to renew lease for " + clientsString() + " for "
              + (elapsed/1000) + " seconds.  Aborting ...", ie);
          synchronized (this) {
            while (!dfsclients.isEmpty()) {
              dfsclients.get(0).abort();
            }
          }
          break;
        } catch (IOException ie) {
          LOG.warn("Failed to renew lease for " + clientsString() + " for "
              + (elapsed/1000) + " seconds.  Will retry shortly ...", ie);
        }
      }

      synchronized(this) {
        if (id != currentId || isRenewerExpired()) {
          if (LOG.isDebugEnabled()) {
            if (id != currentId) {
              LOG.debug("Lease renewer daemon for " + clientsString()
                  + " with renew id " + id + " is not current");
            } else {
               LOG.debug("Lease renewer daemon for " + clientsString()
                  + " with renew id " + id + " expired");
            }
          }
          //no longer the current daemon or expired
          return;
        }

        // if no clients are in running state or there is no more clients
        // registered with this renewer, stop the daemon after the grace
        // period.
        if (!clientsRunning() && emptyTime == Long.MAX_VALUE) {
          emptyTime = Time.monotonicNow();
        }
      }
    }
  }

1.4.4 renew

private void renew() throws IOException {
    final List<DFSClient> copies;
    synchronized(this) {
      copies = new ArrayList<DFSClient>(dfsclients);
    }
    //sort the client names for finding out repeated names.
    Collections.sort(copies, new Comparator<DFSClient>() {
      @Override
      public int compare(final DFSClient left, final DFSClient right) {
        return left.getClientName().compareTo(right.getClientName());
      }
    });
    String previousName = "";
    for(int i = 0; i < copies.size(); i++) {
      final DFSClient c = copies.get(i);
      //skip if current client name is the same as the previous name.
      if (!c.getClientName().equals(previousName)) {
        //TODO the key call
        if (!c.renewLease()) {
          if (LOG.isDebugEnabled()) {
            LOG.debug("Did not renew lease for client " +
                c);
          }
          continue;
        }
        previousName = c.getClientName();
        if (LOG.isDebugEnabled()) {
          LOG.debug("Lease renewed for client " + previousName);
        }
      }
    }
  }

1.4.5 renewLease

/**
   * Renew leases.
   * @return true if lease was renewed. May return false if this
   * client has been closed or has no files open.
   **/
  boolean renewLease() throws IOException {
    if (clientRunning && !isFilesBeingWrittenEmpty()) {
      try {
        //renew the lease through the NameNode proxy
        namenode.renewLease(clientName);
        //update the time of the last renewal
        updateLastLeaseRenewal();
        return true;
      } catch (IOException e) {
        // Abort if the lease has already expired. 
        final long elapsed = Time.monotonicNow() - getLastLeaseRenewal();
        if (elapsed > HdfsConstants.LEASE_HARDLIMIT_PERIOD) {
          LOG.warn("Failed to renew lease for " + clientName + " for "
              + (elapsed/1000) + " seconds (>= hard-limit ="
              + (HdfsConstants.LEASE_HARDLIMIT_PERIOD/1000) + " seconds.) "
              + "Closing all files being written ...", e);
          closeAllFilesBeingWritten(true);
        } else {
          // Let the lease renewer handle it and retry.
          throw e;
        }
      }
    }
    return false;
  }

1.4.6 Summary

  • The client renews its lease through the LeaseRenewer, roughly once every 30 seconds (see the sketch below)
  • The LeaseRenewer renews by making an RPC call to the NameNode
  • On the NameNode side the lease is renewed by taking it out, updating its renewal timestamp and putting it back, so that the time-ordered structure stays consistent
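A minimal sketch of that client-side renewal loop (renewLeaseOnNameNode() is a stand-in for the real ClientProtocol#renewLease RPC; the 30-second figure assumes the default 60-second soft limit, of which the renewer uses roughly half).

class MiniLeaseRenewer implements Runnable {
    private final long renewalIntervalMs = 30_000;   // ~half of the 60s soft limit
    private volatile boolean running = true;

    @Override
    public void run() {
        long lastRenewed = System.currentTimeMillis();
        while (running && !Thread.interrupted()) {
            long elapsed = System.currentTimeMillis() - lastRenewed;
            if (elapsed >= renewalIntervalMs) {
                try {
                    renewLeaseOnNameNode();          // RPC to the NameNode
                    lastRenewed = System.currentTimeMillis();
                } catch (Exception e) {
                    // keep retrying; only the hard limit (1 hour) finally kills the lease
                }
            }
            try {
                Thread.sleep(1000);                  // wake up periodically and re-check
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    private void renewLeaseOnNameNode() { /* stub for the real RPC call */ }
}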


1.5 The lease scanning mechanism

1.5.1 The run method of LeaseManager's Monitor thread

 /** Check leases periodically. */
    @Override
    public void run() {
      for(; shouldRunMonitor && fsnamesystem.isRunning(); ) {
        boolean needSync = false;
        try {
          fsnamesystem.writeLockInterruptibly();
          try {
            if (!fsnamesystem.isInSafeMode()) {
              //TODO check the leases
              needSync = checkLeases();
            }
          } finally {
            fsnamesystem.writeUnlock();
            // lease reassignments should to be sync'ed.
            if (needSync) {
              fsnamesystem.getEditLog().logSync();
            }
          }
          // TODO check the leases every two seconds
          Thread.sleep(HdfsServerConstants.NAMENODE_LEASE_RECHECK_INTERVAL);
        } catch(InterruptedException ie) {
          if (LOG.isDebugEnabled()) {
            LOG.debug(name + " is interrupted", ie);
          }
        }
      }
    }

1.5.2 checkLeases

 /** Check the leases beginning from the oldest.
   *  @return true is sync is needed.
   */
  @VisibleForTesting
  synchronized boolean checkLeases() {
    boolean needSync = false;
    assert fsnamesystem.hasWriteLock();
    Lease leaseToCheck = null;
    try {
      //TODO take the first lease from the sorted structure (i.e. the oldest lease)
      leaseToCheck = sortedLeases.first();
    } catch(NoSuchElementException e) {}

    while(leaseToCheck != null) {
      //TODO has the oldest lease exceeded the hard limit?
      if (!leaseToCheck.expiredHardLimit()) {
        //TODO if even the oldest lease has not expired,
        //TODO then none of the newer ones can have expired either,
        //TODO so there is nothing left to check
        break;
      }

      LOG.info(leaseToCheck + " has expired hard limit");

      final List<String> removing = new ArrayList<String>();
      // need to create a copy of the oldest lease paths, because 
      // internalReleaseLease() removes paths corresponding to empty files,
      // i.e. it needs to modify the collection being iterated over
      // causing ConcurrentModificationException
      String[] leasePaths = new String[leaseToCheck.getPaths().size()];
      leaseToCheck.getPaths().toArray(leasePaths);
      for(String p : leasePaths) {
        try {
          INodesInPath iip = fsnamesystem.getFSDirectory().getINodesInPath(p,
              true);
          boolean completed = fsnamesystem.internalReleaseLease(leaseToCheck, p,
              iip, HdfsServerConstants.NAMENODE_LEASE_HOLDER);
          if (LOG.isDebugEnabled()) {
            if (completed) {
              LOG.debug("Lease recovery for " + p + " is complete. File closed.");
            } else {
              LOG.debug("Started block recovery " + p + " lease " + leaseToCheck);
            }
          }
          // If a lease recovery happened, we need to sync later.
          if (!needSync && !completed) {
            needSync = true;
          }
        } catch (IOException e) {
          LOG.error("Cannot release the path " + p + " in the lease "
              + leaseToCheck, e);
          removing.add(p);
        }
      }

      //TODO the lease has expired, remove these paths
      for(String p : removing) {
        removeLease(leaseToCheck, p);
      }
      //TODO the oldest lease has expired,
      //TODO so move on to the next oldest one, then the next, and so on...
      leaseToCheck = sortedLeases.higher(leaseToCheck);
    }

    try {
      if(leaseToCheck != sortedLeases.first()) {
        LOG.warn("Unable to release hard-limit expired lease: "
          + sortedLeases.first());
      }
    } catch(NoSuchElementException e) {}
    return needSync;
  }

1.5.3 leaseToCheck.expiredHardLimit()

    /** @return true if the Hard Limit Timer has expired */
    public boolean expiredHardLimit() {
      //TODO the lease expiry condition:
      //TODO current time - last renewal time > one hour (the hard limit)
      return monotonicNow() - lastUpdate > hardLimit;
    }

1.5.4 Summary

  • Lease checking is done by the Monitor thread inside the LeaseManager
  • A check runs every 2 seconds
  • It takes the oldest lease first and tests whether the current time minus the lease's last renewal time exceeds one hour; if the oldest one has not expired, none of the others can have expired either
  • Expired leases are released, and the scan moves on to the next oldest lease (see the sketch below)
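Because the set is sorted by last renewal time, the monitor only ever needs to look at its head. A self-contained sketch of that oldest-first scan (assuming a one-hour hard limit; the real code also triggers lease/block recovery for the file).

import java.util.NavigableSet;

class ExpiryScan {
    static final long HARD_LIMIT_MS = 60L * 60 * 1000;   // 1 hour

    static class Lease { long lastRenewed; }

    /** Walk the time-ordered leases from oldest to newest and stop at the first live one. */
    static void checkLeases(NavigableSet<Lease> sortedByTime, long now) {
        while (!sortedByTime.isEmpty()) {
            Lease oldest = sortedByTime.first();
            if (now - oldest.lastRenewed <= HARD_LIMIT_MS) {
                break;   // the oldest lease is still valid, so every newer one is too
            }
            sortedByTime.pollFirst();   // expired: release it
        }
    }
}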


1.6 How do chunks get written into the dataQueue?

/**
 * Before: creating a directory (metadata update), INode type: INodeDirectory
 * Now:    creating a file      (metadata update), INode type: INodeFile
 *
 * Initialization work:
 * 1. Add an INodeFile and update the directory tree
 * 2. Add a lease
 * 3. Start the DataStreamer
 * 4. Start the lease-renewal thread
 */
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path("/data"));

//everything above has already created the DataStreamer
fsDataOutputStream.write("测试".getBytes());

First we need to figure out which class this FSDataOutputStream really is.

  1. Looking at the return type, we find that create returns an HdfsDataOutputStream.


  2. Next we look at its constructor and find that it simply calls its superclass constructor; what we really want to know is what the out field it wraps actually is.


That is where we find DFSOutputStream. In other words, a wrapper (decorator-style) design is at work here: DFSOutputStream is what really implements the write logic, and the indirection makes that easy to miss.
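A minimal sketch of that wrapper shape (simplified names, not the real class hierarchy): the stream handed to user code only forwards each write to the stream it wraps, which is why the call ultimately lands in DFSOutputStream (via FSOutputSummer).

import java.io.IOException;
import java.io.OutputStream;

// Stands in for DFSOutputStream: the stream that knows how to talk to DataNodes.
class InnerStream extends OutputStream {
    @Override
    public void write(int b) throws IOException {
        // in HDFS this is FSOutputSummer#write -> chunks -> packets -> dataQueue
    }
}

// Stands in for FSDataOutputStream/HdfsDataOutputStream: the wrapper given to user code.
class WrappedStream extends OutputStream {
    private final OutputStream out;
    WrappedStream(OutputStream out) { this.out = out; }

    @Override
    public void write(int b) throws IOException {
        out.write(b);   // every write is simply forwarded to the wrapped stream
    }
}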

1.6.1 FSOutputSummer's write method


DFSOutputStream itself does not define this write method, so the call goes to the write method of its superclass, FSOutputSummer.

  /** Write one byte */
  @Override
  public synchronized void write(int b) throws IOException {
    buf[count++] = (byte)b;
    if(count == buf.length) {
      //TODO write the buffered data out
      flushBuffer();
    }
  }

1.6.2 flushBuffer

  /* Forces buffered output bytes to be checksummed and written out to
   * the underlying output stream. If there is a trailing partial chunk in the
   * buffer,
   * 1) flushPartial tells us whether to flush that chunk
   * 2) if flushPartial is true, keep tells us whether to keep that chunk in the
   * buffer (if flushPartial is false, it is always kept in the buffer)
   *
   * Returns the number of bytes that were flushed but are still left in the
   * buffer (can only be non-zero if keep is true).
   */
  protected synchronized int flushBuffer(boolean keep,
      boolean flushPartial) throws IOException {
    int bufLen = count;
    int partialLen = bufLen % sum.getBytesPerChecksum();
    int lenToFlush = flushPartial ? bufLen : bufLen - partialLen;
    if (lenToFlush != 0) {
      //TODO the core call
      //TODO HDFS -> file -> block (128M) -> packet (64K) -> 127 chunks; chunk (512 bytes) + checksum (4 bytes)
      writeChecksumChunks(buf, 0, lenToFlush);
      if (!flushPartial || keep) {
        count = partialLen;
        System.arraycopy(buf, bufLen - count, buf, 0, count);
      } else {
        count = 0;
      }
    }

    // total bytes left minus unflushed bytes left
    return count - (bufLen - lenToFlush);
  }

1.6.3 writeChecksumChunks

  /** Generate checksums for the given data chunks and output chunks & checksums
   * to the underlying output stream.
   */
  private void writeChecksumChunks(byte b[], int off, int len)
  throws IOException {
    //TODO compute the checksums for the chunks
    sum.calculateChunkedSums(b, off, len, checksum, 0);
    //TODO iterate over the data one chunk at a time
    for (int i = 0; i < len; i += sum.getBytesPerChecksum()) {//step: one chunk
      int chunkLen = Math.min(sum.getBytesPerChecksum(), len - i);
      int ckOffset = i / sum.getBytesPerChecksum() * getChecksumSize();
      //TODO write the data one chunk at a time
      writeChunk(b, off + i, chunkLen, checksum, ckOffset, getChecksumSize());
    }
  }

1.6.4 DFSOutputStream's writeChunk method

  // @see FSOutputSummer#writeChunk()
  @Override
  protected synchronized void writeChunk(byte[] b, int offset, int len,
      byte[] checksum, int ckoff, int cklen) throws IOException {
    TraceScope scope =
        dfsClient.getPathTraceScope("DFSOutputStream#writeChunk", src);
    try {
      //TODO start writing the chunk
      writeChunkImpl(b, offset, len, checksum, ckoff, cklen);
    } finally {
      scope.close();
    }
  }

1.6.5 writeChunkImpl

 if (currentPacket == null) {

      //TODO create a packet
      currentPacket = createPacket(packetSize, chunksPerPacket, 
          bytesCurBlock, currentSeqno++, false);
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("DFSClient writeChunk allocating new packet seqno=" + 
            currentPacket.getSeqno() +
            ", src=" + src +
            ", packetSize=" + packetSize +
            ", chunksPerPacket=" + chunksPerPacket +
            ", bytesCurBlock=" + bytesCurBlock);
      }
    }
    //TODO write the chunk's checksum (4 bytes) into the packet
    currentPacket.writeChecksum(checksum, ckoff, cklen);
    //TODO write one chunk (512 bytes) into the packet
    currentPacket.writeData(b, offset, len);
    //TODO count the chunks written so far; 127 chunks make a full packet
    currentPacket.incNumChunks();
    //TODO track how many bytes of the current block have been written (a block is 128M)
    bytesCurBlock += len;

    // If packet is full, enqueue it for transmission
    //
    //TODO two conditions for enqueueing:
    //TODO the packet is full (127 chunks), or
    //TODO the current block (128M) is full
    if (currentPacket.getNumChunks() == currentPacket.getMaxChunks() ||
        bytesCurBlock == blockSize) {
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("DFSClient writeChunk packet full seqno=" +
            currentPacket.getSeqno() +
            ", src=" + src +
            ", bytesCurBlock=" + bytesCurBlock +
            ", blockSize=" + blockSize +
            ", appendChunk=" + appendChunk);
      }
      //TODO the packet is full, put it on the queue
      waitAndQueueCurrentPacket();

1.6.6 waitAndQueueCurrentPacket

private void waitAndQueueCurrentPacket() throws IOException {
    synchronized (dataQueue) {
      try {
      // If queue is full, then wait till we have enough space
        boolean firstWait = true;
        try {
          while (!isClosed() && dataQueue.size() + ackQueue.size() >
              dfsClient.getConf().writeMaxPackets) {
            if (firstWait) {
              Span span = Trace.currentSpan();
              if (span != null) {
                span.addTimelineAnnotation("dataQueue.wait");
              }
              firstWait = false;
            }
            try {
              //TODO if the queue is full, wait
              dataQueue.wait();
            } catch (InterruptedException e) {
              // If we get interrupted while waiting to queue data, we still need to get rid
              // of the current packet. This is because we have an invariant that if
              // currentPacket gets full, it will get queued before the next writeChunk.
              //
              // Rather than wait around for space in the queue, we should instead try to
              // return to the caller as soon as possible, even though we slightly overrun
              // the MAX_PACKETS length.
              Thread.currentThread().interrupt();
              break;
            }
          }
        } finally {
          Span span = Trace.currentSpan();
          if ((span != null) && (!firstWait)) {
            span.addTimelineAnnotation("end.wait");
          }
        }
        checkClosed();
        //TODO enqueue the current packet
        queueCurrentPacket();
      } catch (ClosedChannelException e) {
      }
    }
  }

1.6.7 queueCurrentPacket

  private void queueCurrentPacket() {
    synchronized (dataQueue) {
      if (currentPacket == null) return;
      currentPacket.addTraceParent(Trace.currentSpan());
      //TODO append a packet to the dataQueue
      dataQueue.addLast(currentPacket);
      lastQueuedSeqno = currentPacket.getSeqno();
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Queued packet " + currentPacket.getSeqno());
      }
      // reset currentPacket to null
      currentPacket = null;
      //TODO the reason for the notify: during initialization a DataStreamer was started,
      //and it keeps watching this dataQueue, waiting while the queue is empty;
      //now that a packet has been added we wake up the waiting thread,
      //i.e. we wake up the DataStreamer
      dataQueue.notifyAll();
    }
  }

1.6.8 Summary

  • When we write through the HDFS output stream, the call lands in FSOutputSummer's write method
  • Data goes into the current packet one chunk at a time; 127 chunks make up one packet
  • When a packet is full it is appended to the dataQueue and currentPacket is reset to null
  • The DataStreamer is then woken up (see the producer-side sketch below)
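The producer side mirrors the consumer sketch from 1.3.4: enqueue the full packet, apply backpressure if too many packets are already in flight, and notifyAll() to wake the DataStreamer. A sketch (the cap of 80 in-flight packets is an assumed default):

import java.util.LinkedList;

class MiniWriter {
    private final LinkedList<byte[]> dataQueue = new LinkedList<>();
    private final LinkedList<byte[]> ackQueue = new LinkedList<>();
    private final int writeMaxPackets = 80;   // assumed cap on packets in flight

    void queuePacket(byte[] packet) throws InterruptedException {
        synchronized (dataQueue) {
            // backpressure: if too many packets are queued or unacknowledged, wait
            while (dataQueue.size() + ackQueue.size() >= writeMaxPackets) {
                dataQueue.wait();
            }
            dataQueue.addLast(packet);   // hand the full packet to the streamer
            dataQueue.notifyAll();       // wake the DataStreamer blocked in dataQueue.wait()
        }
    }
}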


1.7 Allocating a block

1.7.1 DataStreamer's run method (after being woken up)

          if (stage == BlockConstructionStage.PIPELINE_SETUP_CREATE) {
            if(DFSClient.LOG.isDebugEnabled()) {
              DFSClient.LOG.debug("Allocating new block");
            }
            //TODO step 1: set up the data pipeline
            /**
             * nextBlockOutputStream does two things:
             * 1. asks the NameNode for a block
             * 2. sets up the data pipeline
             */
            setPipeline(nextBlockOutputStream());
            //TODO
            initDataStreaming();
          }

1.7.2 nextBlockOutputStream

/**
     * Open a DataOutputStream to a DataNode so that it can be written to.
     * This happens when a file is created and each time a new block is allocated.
     * Must get block ID and the IDs of the destinations from the namenode.
     * Returns the list of target datanodes.
     */
    private LocatedBlock nextBlockOutputStream() throws IOException {
      LocatedBlock lb = null;
      DatanodeInfo[] nodes = null;
      StorageType[] storageTypes = null;
      int count = dfsClient.getConf().nBlockWriteRetry;
      boolean success = false;
      ExtendedBlock oldBlock = block;

      /**
       * Allocating a block and setting up the data pipeline are critical operations
       * that must succeed, yet both involve network calls, and the network can always
       * fail. A single failure must not fail the whole write, so we retry several
       * times -- which is why so much HDFS code is wrapped in loops like this one.
       */
      do {
        hasError = false;
        lastException.set(null);
        errorIndex = -1;
        success = false;

        DatanodeInfo[] excluded =
            excludedNodes.getAllPresent(excludedNodes.asMap().keySet())
            .keySet()
            .toArray(new DatanodeInfo[0]);
        block = oldBlock;
        //TODO ask the NameNode for a block
        /**
         * On the server side this:
         * 1. creates a block and attaches its info to the file in the directory tree
         * 2. records the metadata on disk (the edit log)
         * 3. records the block's metadata in the BlockManager
         */
        lb = locateFollowingBlock(excluded.length > 0 ? excluded : null);
        block = lb.getBlock();
        block.setNumBytes(0);
        bytesSent = 0;
        accessToken = lb.getBlockToken();
        nodes = lb.getLocations();
        storageTypes = lb.getStorageTypes();

        //
        // Connect to first DataNode in the list.
        //
        success = createBlockOutputStream(nodes, storageTypes, 0L, false);

        if (!success) {
          DFSClient.LOG.info("Abandoning " + block);
          dfsClient.namenode.abandonBlock(block, fileId, src,
              dfsClient.clientName);
          block = null;
          DFSClient.LOG.info("Excluding datanode " + nodes[errorIndex]);
          excludedNodes.put(nodes[errorIndex], nodes[errorIndex]);
        }
      } while (!success && --count >= 0);

      if (!success) {
        throw new IOException("Unable to create new block.");
      }
      return lb;
    }
    //TODO (inside locateFollowingBlock) call the NameNode server side via RPC
            return dfsClient.namenode.addBlock(src, dfsClient.clientName,
                block, excludedNodes, fileId, favoredNodes);

1.7.3 NameNodeRpcServer's addBlock method

@Override
  public LocatedBlock addBlock(String src, String clientName,
      ExtendedBlock previous, DatanodeInfo[] excludedNodes, long fileId,
      String[] favoredNodes)
      throws IOException {
    checkNNStartup();
    if (stateChangeLog.isDebugEnabled()) {
      stateChangeLog.debug("*BLOCK* NameNode.addBlock: file " + src
          + " fileId=" + fileId + " for " + clientName);
    }
    Set<Node> excludedNodesSet = null;
    if (excludedNodes != null) {
      excludedNodesSet = new HashSet<Node>(excludedNodes.length);
      for (Node node : excludedNodes) {
        excludedNodesSet.add(node);
      }
    }
    List<String> favoredNodesList = (favoredNodes == null) ? null
        : Arrays.asList(favoredNodes);
    //TODO add a block
      /**
       * 1. choose the DataNodes for the three replicas
       * 2. update the directory tree
       * 3. store the metadata
       */
    LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, fileId,
        clientName, previous, excludedNodesSet, favoredNodesList);
    if (locatedBlock != null)
      metrics.incrAddBlockOps();
    return locatedBlock;
  }

1.7.4 FSNamesystem's getAdditionalBlock method

   // choose targets for the new block to be allocated.
    //TODO  choose the DataNodes that will store the block (load balancing)
    //TODO  this is where HDFS's rack awareness policy comes in
    final DatanodeStorageInfo targets[] = getBlockManager().chooseTarget4NewBlock( 
        src, replication, clientNode, excludedNodes, blockSize, favoredNodes,
        storagePolicyID);

   ...

      // allocate new block, record block locations in INode.
      newBlock = createNewBlock();
      INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
      //TODO update the directory tree in memory (the in-memory metadata)
      saveAllocatedBlock(src, inodesInPath, newBlock, targets);

      //TODO persist the metadata to disk
      persistNewBlock(src, pendingFile);

1.7.5 FSNamesystem's saveAllocatedBlock method

  /**
   * Save allocated block at the given pending filename
   * 
   * @param src path to the file
   * @param inodesInPath representing each of the components of src.
   *                     The last INode is the INode for {@code src} file.
   * @param newBlock newly allocated block to be save
   * @param targets target datanodes where replicas of the new block is placed
   * @throws QuotaExceededException If addition of block exceeds space quota
   */
  BlockInfoContiguous saveAllocatedBlock(String src, INodesInPath inodesInPath,
      Block newBlock, DatanodeStorageInfo[] targets)
          throws IOException {
    assert hasWriteLock();
    //TODO add the block
    BlockInfoContiguous b = dir.addBlock(src, inodesInPath, newBlock, targets);
    NameNode.stateChangeLog.info("BLOCK* allocate " + b + " for " + src);
    DatanodeStorageInfo.incrementBlocksScheduled(targets);
    return b;
  }

1.7.6 FSDirectory's addBlock method

/**
   * Add a block to the file. Returns a reference to the added block.
   */
  BlockInfoContiguous addBlock(String path, INodesInPath inodesInPath,
      Block block, DatanodeStorageInfo[] targets) throws IOException {
    writeLock();
    try {
      final INodeFile fileINode = inodesInPath.getLastINode().asFile();
      Preconditions.checkState(fileINode.isUnderConstruction());

      // check quota limits and updated space consumed
      updateCount(inodesInPath, 0, fileINode.getPreferredBlockSize(),
          fileINode.getBlockReplication(), true);

      // associate new last block for the file
      BlockInfoContiguousUnderConstruction blockInfo =
        new BlockInfoContiguousUnderConstruction(
            block,
            fileINode.getFileReplication(),
            BlockUCState.UNDER_CONSTRUCTION,
            targets);
      //TODO record this block's info in the BlockManager
      getBlockManager().addBlockCollection(blockInfo, fileINode);
      //TODO attach the newly allocated block to the file's INode
      fileINode.addBlock(blockInfo);

      if(NameNode.stateChangeLog.isDebugEnabled()) {
        NameNode.stateChangeLog.debug("DIR* FSDirectory.addBlock: "
            + path + " with " + block
            + " block is added to the in-memory "
            + "file system");
      }
      return blockInfo;
    } finally {
      writeUnlock();
    }
  }

1.7.7 BlockManager's addBlockCollection method

  public BlockInfoContiguous addBlockCollection(BlockInfoContiguous block,
      BlockCollection bc) {
    //TODO record the new block's info in the in-memory blocksMap
    return blocksMap.addBlockCollection(block, bc);
  }

1.7.8 Summary

  • The DataStreamer creates the block through a remote RPC call that goes from NameNodeRpcServer into FSNamesystem
  • Rack awareness first decides which machines should store the block's replicas
  • The block info is created and added to the BlockManager's blocksMap
  • The block is attached to the file's INodeFile in the directory tree, waiting for the metadata to be flushed to disk
  • The block metadata is returned to the client (see the sketch below)
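For reference, what the client gets back from addBlock is a LocatedBlock: the new block's identity plus an ordered list of target DataNodes, where the first target becomes the head of the pipeline. A small sketch of reading that result with the HDFS client classes:

import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.ExtendedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

class LocatedBlockInspector {
    static void describe(LocatedBlock lb) {
        ExtendedBlock block = lb.getBlock();          // block id + generation stamp
        DatanodeInfo[] targets = lb.getLocations();   // replica targets chosen by the NameNode
        System.out.println("new block: " + block);
        for (int i = 0; i < targets.length; i++) {
            // targets[0] is the DataNode the client connects to; it forwards to targets[1], etc.
            System.out.println("  replica " + i + " -> " + targets[i]);
        }
    }
}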

1.8 Setting up the data pipeline

Why does HDFS build a pipeline rather than having the client push every replica itself?


Because it keeps the number of client connections small, and replica-to-replica transfers run over the fast internal network between DataNodes.

1.8.1 DFSOutputStream's nextBlockOutputStream method

 /**
     * Open a DataOutputStream to a DataNode so that it can be written to.
     * This happens when a file is created and each time a new block is allocated.
     * Must get block ID and the IDs of the destinations from the namenode.
     * Returns the list of target datanodes.
     */
    private LocatedBlock nextBlockOutputStream() throws IOException {
      LocatedBlock lb = null;
      DatanodeInfo[] nodes = null;
      StorageType[] storageTypes = null;
      int count = dfsClient.getConf().nBlockWriteRetry;
      boolean success = false;
      ExtendedBlock oldBlock = block;

      /**
       * Allocating a block and setting up the data pipeline are critical operations
       * that must succeed, yet both involve network calls, and the network can always
       * fail. A single failure must not fail the whole write, so we retry several
       * times -- which is why so much HDFS code is wrapped in loops like this one.
       */
      do {
        hasError = false;
        lastException.set(null);
        errorIndex = -1;
        success = false;

        DatanodeInfo[] excluded =
            excludedNodes.getAllPresent(excludedNodes.asMap().keySet())
            .keySet()
            .toArray(new DatanodeInfo[0]);
        block = oldBlock;
        //TODO ask the NameNode for a block
        /**
         * On the server side this:
         * 1. creates a block and attaches its info to the file in the directory tree
         * 2. records the metadata on disk (the edit log)
         * 3. records the block's metadata in the BlockManager
         */
        lb = locateFollowingBlock(excluded.length > 0 ? excluded : null);
        block = lb.getBlock();
        block.setNumBytes(0);
        bytesSent = 0;
        accessToken = lb.getBlockToken();
        nodes = lb.getLocations();
        storageTypes = lb.getStorageTypes();

        //
        // Connect to first DataNode in the list.
        //
        //TODO the HDFS pipeline is actually established by this call
        //hadoop1 hadoop2 [hadoop3]
        //block: do not place a replica of this block on hadoop3 (excluded nodes)
        success = createBlockOutputStream(nodes, storageTypes, 0L, false);

        if (!success) {
          DFSClient.LOG.info("Abandoning " + block);
          //TODO if the pipeline cannot be established, abandon this block
          dfsClient.namenode.abandonBlock(block, fileId, src,
              dfsClient.clientName);
          block = null;
          DFSClient.LOG.info("Excluding datanode " + nodes[errorIndex]);
          excludedNodes.put(nodes[errorIndex], nodes[errorIndex]);
        }
      } while (!success && --count >= 0);

      if (!success) {
        throw new IOException("Unable to create new block.");
      }
      return lb;
    }

1.8.2 DFSOutputStream's createBlockOutputStream method

 assert null == s : "Previous socket unclosed";
          assert null == blockReplyStream : "Previous blockReplyStream unclosed";
          //TODO open a raw socket to the first DataNode in the pipeline (not an RPC connection)
          s = createSocketForPipeline(nodes[0], nodes.length, dfsClient);
          long writeTimeout = dfsClient.getDatanodeWriteTimeout(nodes.length);

          //create the output stream (used to write data to the DataNode)
          OutputStream unbufOut = NetUtils.getOutputStream(s, writeTimeout);
          //create the input stream (used to read the DataNode's responses)
          InputStream unbufIn = NetUtils.getInputStream(s);
          //TODO the SASL negotiation happens over these socket streams
          IOStreamPair saslStreams = dfsClient.saslClient.socketSend(s,
            unbufOut, unbufIn, dfsClient, accessToken, nodes[0]);
          unbufOut = saslStreams.out;
          unbufIn = saslStreams.in;

          //TODO this output stream writes the client's data to the DataNode
          out = new DataOutputStream(new BufferedOutputStream(unbufOut,
              HdfsConstants.SMALL_BUFFER_SIZE));
          //TODO the client reads the DataNode's replies through this input stream
          blockReplyStream = new DataInputStream(unbufIn);
  
          //
          // Xmit header info to datanode
          //
  
          BlockConstructionStage bcs = recoveryFlag? stage.getRecoveryStage(): stage;

          // We cannot change the block length in 'block' as it counts the number
          // of bytes ack'ed.
          ExtendedBlock blockCopy = new ExtendedBlock(block);
          blockCopy.setNumBytes(blockSize);

          boolean[] targetPinnings = getPinnings(nodes, true);
          // send the request
          //TODO send the write-block request; this goes out over the socket
          //TODO the DataNode started a DataXceiverServer earlier precisely to accept such socket requests
          new Sender(out).writeBlock(blockCopy, nodeStorageTypes[0], accessToken,
              dfsClient.clientName, nodes, nodeStorageTypes, null, bcs, 
              nodes.length, block.getNumBytes(), bytesSent, newGS,
              checksum4WriteBlock, cachingStrategy.get(), isLazyPersistFile,
            (targetPinnings == null ? false : targetPinnings[0]), targetPinnings);

1.8.3 new Sender(out).writeBlock

  @Override
  public void writeBlock(final ExtendedBlock blk,
      final StorageType storageType, 
      final Token<BlockTokenIdentifier> blockToken,
      final String clientName,
      final DatanodeInfo[] targets,
      final StorageType[] targetStorageTypes, 
      final DatanodeInfo source,
      final BlockConstructionStage stage,
      final int pipelineSize,
      final long minBytesRcvd,
      final long maxBytesRcvd,
      final long latestGenerationStamp,
      DataChecksum requestedChecksum,
      final CachingStrategy cachingStrategy,
      final boolean allowLazyPersist,
      final boolean pinning,
      final boolean[] targetPinnings) throws IOException {
    ClientOperationHeaderProto header = DataTransferProtoUtil.buildClientHeader(
        blk, clientName, blockToken);
    
    ChecksumProto checksumProto =
      DataTransferProtoUtil.toProto(requestedChecksum);

    OpWriteBlockProto.Builder proto = OpWriteBlockProto.newBuilder()
      .setHeader(header)
      .setStorageType(PBHelper.convertStorageType(storageType))
      .addAllTargets(PBHelper.convert(targets, 1))
      .addAllTargetStorageTypes(PBHelper.convertStorageTypes(targetStorageTypes, 1))
      .setStage(toProto(stage))
      .setPipelineSize(pipelineSize)
      .setMinBytesRcvd(minBytesRcvd)
      .setMaxBytesRcvd(maxBytesRcvd)
      .setLatestGenerationStamp(latestGenerationStamp)
      .setRequestedChecksum(checksumProto)
      .setCachingStrategy(getCachingStrategy(cachingStrategy))
      .setAllowLazyPersist(allowLazyPersist)
      .setPinning(pinning)
      .addAllTargetPinnings(PBHelper.convert(targetPinnings, 1));
    
    if (source != null) {
      proto.setSource(PBHelper.convertDatanodeInfo(source));
    }

    //TODO send the request
    // the opcode used here is WRITE_BLOCK
    send(out, Op.WRITE_BLOCK, proto.build());
  }

1.8.4 DataXceiverServer's run

This service was started back when the DataNode came up; its job is to accept incoming data-transfer (write) requests.

@Override
  public void run() {
    Peer peer = null;
    while (datanode.shouldRun && !datanode.shutdownForUpgrade) {
      try {
        //TODO accept an incoming socket connection
        peer = peerServer.accept();

        // Make sure the xceiver count is not exceeded
        int curXceiverCount = datanode.getXceiverCount();
        if (curXceiverCount > maxXceiverCount) {
          throw new IOException("Xceiver count " + curXceiverCount
              + " exceeds the limit of concurrent xcievers: "
              + maxXceiverCount);
        }

        //TODO for each incoming connection a new DataXceiver thread is started to handle that block
        new Daemon(datanode.threadGroup,
            DataXceiver.create(peer, datanode, this))
            .start();
      } catch (SocketTimeoutException ignored) {
        // wake up to see if should continue to run
      } catch (AsynchronousCloseException ace) {
        // another thread closed our listener socket - that's expected during shutdown,
        // but not in other circumstances
        if (datanode.shouldRun && !datanode.shutdownForUpgrade) {
          LOG.warn(datanode.getDisplayName() + ":DataXceiverServer: ", ace);
        }
      } catch (IOException ie) {
        IOUtils.cleanup(null, peer);
        LOG.warn(datanode.getDisplayName() + ":DataXceiverServer: ", ie);
      } catch (OutOfMemoryError ie) {
        IOUtils.cleanup(null, peer);
        // DataNode can run out of memory if there is too many transfers.
        // Log the event, Sleep for 30 seconds, other transfers may complete by
        // then.
        LOG.error("DataNode is out of memory. Will retry in 30 seconds.", ie);
        try {
          Thread.sleep(30 * 1000);
        } catch (InterruptedException e) {
          // ignore
        }
      } catch (Throwable te) {
        LOG.error(datanode.getDisplayName()
            + ":DataXceiverServer: Exiting due to: ", te);
        datanode.shouldRun = false;
      }
    }

    // Close the server to stop reception of more requests.
    try {
      peerServer.close();
      closed = true;
    } catch (IOException ie) {
      LOG.warn(datanode.getDisplayName()
          + " :DataXceiverServer: close exception", ie);
    }

    // if in restart prep stage, notify peers before closing them.
    if (datanode.shutdownForUpgrade) {
      restartNotifyPeers();
      // Each thread needs some time to process it. If a thread needs
      // to send an OOB message to the client, but blocked on network for
      // long time, we need to force its termination.
      LOG.info("Shutting down DataXceiverServer before restart");
      // Allow roughly up to 2 seconds.
      for (int i = 0; getNumPeers() > 0 && i < 10; i++) {
        try {
          Thread.sleep(200);
        } catch (InterruptedException e) {
          // ignore
        }
      }
    }
    // Close all peers.
    closeAllPeers();
  }

1.8.5 DataXceiver's run method

Each DataXceiver thread serves one peer connection: its run method reads the operation code from the peer's input stream and hands it to processOp, which for a write request takes the WRITE_BLOCK branch shown in the next section. A conceptual sketch of this dispatch loop follows.

1.8.6 processOp

      //TODO a write-block request is dispatched here
    case WRITE_BLOCK:
      opWriteBlock(in);
      break;

1.8.7 opWriteBlock

/** Receive OP_WRITE_BLOCK */
  private void opWriteBlock(DataInputStream in) throws IOException {
    final OpWriteBlockProto proto = OpWriteBlockProto.parseFrom(vintPrefixed(in));
    final DatanodeInfo[] targets = PBHelper.convert(proto.getTargetsList());
    TraceScope traceScope = continueTraceSpan(proto.getHeader(),
        proto.getClass().getSimpleName());
    try {
      //TODO handle the write
      writeBlock(PBHelper.convert(proto.getHeader().getBaseHeader().getBlock()),
          PBHelper.convertStorageType(proto.getStorageType()),
          PBHelper.convert(proto.getHeader().getBaseHeader().getToken()),
          proto.getHeader().getClientName(),
          targets,
          PBHelper.convertStorageTypes(proto.getTargetStorageTypesList(), targets.length),
          PBHelper.convert(proto.getSource()),
          fromProto(proto.getStage()),
          proto.getPipelineSize(),
          proto.getMinBytesRcvd(), proto.getMaxBytesRcvd(),
          proto.getLatestGenerationStamp(),
          fromProto(proto.getRequestedChecksum()),
          (proto.hasCachingStrategy() ?
              getCachingStrategy(proto.getCachingStrategy()) :
            CachingStrategy.newDefaultStrategy()),
          (proto.hasAllowLazyPersist() ? proto.getAllowLazyPersist() : false),
          (proto.hasPinning() ? proto.getPinning(): false),
          (PBHelper.convertBooleanList(proto.getTargetPinningsList())));
    } finally {
     if (traceScope != null) traceScope.close();
    }
  }

1.8.8 DataXceiver's writeBlock (reached from opWriteBlock)

   if (isDatanode || 
          stage != BlockConstructionStage.PIPELINE_CLOSE_RECOVERY) {
        // open a block receiver
        //TODO a BlockReceiver is created here
        // look at its constructor
        // and also at its run/receive logic
        blockReceiver = new BlockReceiver(block, storageType, in,
            peer.getRemoteAddressString(),
            peer.getLocalAddressString(),
            stage, latestGenerationStamp, minBytesRcvd, maxBytesRcvd,
            clientname, srcDataNode, datanode, requestedChecksum,
            cachingStrategy, allowLazyPersist, pinning);

1.8.9 BlockReceiver's constructor

       //TODO this branch does the initialization before the pipeline starts writing: create an rbw replica
          //TODO replica states on disk: finalized = fully written, rbw = replica being written
        case PIPELINE_SETUP_CREATE:
          replicaHandler = datanode.data.createRbw(storageType, block, allowLazyPersist);
          datanode.notifyNamenodeReceivingBlock(
              block, replicaHandler.getReplica().getStorageUuid());
          break;

1.8.10 createRbw

  @Override // FsDatasetSpi
  public synchronized ReplicaHandler createRbw(
      StorageType storageType, ExtendedBlock b, boolean allowLazyPersist)
      throws IOException {
    ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(),
        b.getBlockId());
    if (replicaInfo != null) {
      throw new ReplicaAlreadyExistsException("Block " + b +
      " already exists in state " + replicaInfo.getState() +
      " and thus cannot be created.");
    }
    // create a new block
    FsVolumeReference ref;
    while (true) {
      try {
        if (allowLazyPersist) {
          // First try to place the block on a transient volume.
          //a DataNode can be configured with multiple disks (volumes)
          ref = volumes.getNextTransientVolume(b.getNumBytes());
          datanode.getMetrics().incrRamDiskBlocksWrite();
        } else {
          ref = volumes.getNextVolume(storageType, b.getNumBytes());
        }
      } catch (DiskOutOfSpaceException de) {
        if (allowLazyPersist) {
          datanode.getMetrics().incrRamDiskBlocksWriteFallback();
          allowLazyPersist = false;
          continue;
        }
        throw de;
      }
      break;
    }
    FsVolumeImpl v = (FsVolumeImpl) ref.getVolume();
    // create an rbw file to hold block in the designated volume
    File f;
    try {
      //create the rbw file on the chosen volume
      f = v.createRbwFile(b.getBlockPoolId(), b.getLocalBlock());
    } catch (IOException e) {
      IOUtils.cleanup(null, ref);
      throw e;
    }

    ReplicaBeingWritten newReplicaInfo = new ReplicaBeingWritten(b.getBlockId(), 
        b.getGenerationStamp(), v, f.getParentFile(), b.getNumBytes());
    volumeMap.add(b.getBlockPoolId(), newReplicaInfo);
    return new ReplicaHandler(newReplicaInfo, ref);
  }

1.8.11 DataXceiver

//
      // Connect to downstream machine, if appropriate
      //TODO keep going: connect to the downstream machine
      //
      if (targets.length > 0) {
        InetSocketAddress mirrorTarget = null;
        // Connect to backup machine
        //TODO 'mirror' means the copy on the downstream node, i.e. the next replica
        mirrorNode = targets[0].getXferAddr(connectToDnViaHostname);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Connecting to datanode " + mirrorNode);
        }
        mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
        mirrorSock = datanode.newSocket();
        try {
          int timeoutValue = dnConf.socketTimeout
              + (HdfsServerConstants.READ_TIMEOUT_EXTENSION * targets.length);
          int writeTimeout = dnConf.socketWriteTimeout + 
                      (HdfsServerConstants.WRITE_TIMEOUT_EXTENSION * targets.length);
          NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
          mirrorSock.setSoTimeout(timeoutValue);
          mirrorSock.setSendBufferSize(HdfsConstants.DEFAULT_DATA_SOCKET_SIZE);
          
          OutputStream unbufMirrorOut = NetUtils.getOutputStream(mirrorSock,
              writeTimeout);
          InputStream unbufMirrorIn = NetUtils.getInputStream(mirrorSock);
          DataEncryptionKeyFactory keyFactory =
            datanode.getDataEncryptionKeyFactoryForBlock(block);
          IOStreamPair saslStreams = datanode.saslClient.socketSend(mirrorSock,
            unbufMirrorOut, unbufMirrorIn, keyFactory, blockToken, targets[0]);
          unbufMirrorOut = saslStreams.out;
          unbufMirrorIn = saslStreams.in;
          mirrorOut = new DataOutputStream(new BufferedOutputStream(unbufMirrorOut,
              HdfsConstants.SMALL_BUFFER_SIZE));
          mirrorIn = new DataInputStream(unbufMirrorIn);

          // Do not propagate allowLazyPersist to downstream DataNodes.
          if (targetPinnings != null && targetPinnings.length > 0) {
            //TODO forward the write request downstream over a socket connection
            // this DataNode now acts as the socket client
            new Sender(mirrorOut).writeBlock(originalBlock, targetStorageTypes[0],
              blockToken, clientname, targets, targetStorageTypes, srcDataNode,
              stage, pipelineSize, minBytesRcvd, maxBytesRcvd,
              latestGenerationStamp, requestedChecksum, cachingStrategy,
              false, targetPinnings[0], targetPinnings);
          } else {
            new Sender(mirrorOut).writeBlock(originalBlock, targetStorageTypes[0],
              blockToken, clientname, targets, targetStorageTypes, srcDataNode,
              stage, pipelineSize, minBytesRcvd, maxBytesRcvd,
              latestGenerationStamp, requestedChecksum, cachingStrategy,
              false, false, targetPinnings);
          }

At this point the loop is closed: each DataXceiver opens a socket to the downstream DataNode and extends the pipeline hop by hop.

1.8.12 Summary

  • The DataStreamer reaches the DataXceiverServer on the first DataNode through a socket request
  • The DataXceiverServer creates a DataXceiver thread, which in turn starts a BlockReceiver
  • The BlockReceiver first creates an rbw file, since the replica is not finished yet
  • The BlockReceiver then opens its own socket connection to the downstream DataNode, and the same steps repeat there, as sketched below
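
A self-contained sketch of how the request cascades through the pipeline; the node names and the recursive call are purely illustrative (in HDFS each hop is a new socket connection plus a new DataXceiver/BlockReceiver, not a method call).

import java.util.Arrays;

// Illustrative model of pipeline setup: every node creates its rbw replica and
// forwards the same write request to the first of its remaining targets.
class PipelineSetupSketch {
  static void receiveWriteRequest(String self, String[] remainingTargets) {
    System.out.println(self + ": create rbw replica, start BlockReceiver");
    if (remainingTargets.length > 0) {
      String next = remainingTargets[0];
      String[] rest = Arrays.copyOfRange(remainingTargets, 1, remainingTargets.length);
      receiveWriteRequest(next, rest);   // in HDFS this is a socket to the downstream node
    }
  }

  public static void main(String[] args) {
    // the client sends the request to the first DataNode, naming the other two as targets
    receiveWriteRequest("dn1", new String[] {"dn2", "dn3"});
  }
}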

(figure: image-20200712120750677)

1.9 What if pipeline setup fails?

      if (!success) {
          DFSClient.LOG.info("Abandoning " + block);
          //TODO if the pipeline cannot be set up, abandon this block
          dfsClient.namenode.abandonBlock(block, fileId, src,
              dfsClient.clientName);
          block = null;
          DFSClient.LOG.info("Excluding datanode " + nodes[errorIndex]);
          //e.g. hadoop3, the node that failed
          excludedNodes.put(nodes[errorIndex], nodes[errorIndex]);
        }

(figure: image-20200712121646619)

  1. First, the block that was just allocated is removed from the INodeFile (abandonBlock)
  2. Then the DataNode that caused the failure is identified
  3. When the next block is requested, that node is handed to the NameNode as an excluded node, so only usable nodes are chosen and the pipeline can be rebuilt

1.10 Initializing the ResponseProcessor

1.10.1 The run method of DFSOutputStream's DataStreamer

          //TODO step 1: set up the data pipeline
            /**
             * nextBlockOutputStream does two things:
             * 1. asks the NameNode for a block
             * 2. sets up the data pipeline
             */
            setPipeline(nextBlockOutputStream());
            //TODO step 2: start the ResponseProcessor, which watches whether each packet was sent successfully
            //TODO steps 3 and 4 come further down
            initDataStreaming();

1.10.2 initDataStreaming

    /**
     * Initialize for data streaming
     */
    private void initDataStreaming() {
      this.setName("DataStreamer for file " + src +
          " block " + block);
      //TODO start the ResponseProcessor
      response = new ResponseProcessor(nodes);
      response.start();
      stage = BlockConstructionStage.DATA_STREAMING;
    }

1.10.3 ResponseProcessor's run

  // read an ack from the pipeline
            long begin = Time.monotonicNow();
            //TODO read the processing result from the downstream pipeline
            ack.readFields(blockReplyStream);
            synchronized (dataQueue) {
              scope = Trace.continueSpan(one.getTraceSpan());
              one.setTraceSpan(null);
              lastAckedSeqno = seqno;
              //TODO if the packet was acknowledged successfully, remove it from ackQueue
              ackQueue.removeFirst();
              dataQueue.notifyAll();

              one.releaseBuffer(byteArrayManager);
            }

1.10.4 DFSOutputStream's run method (continued)

          // send the packet
          Span span = null;
          synchronized (dataQueue) {
            // move packet from dataQueue to ackQueue
            if (!one.isHeartbeatPacket()) {
              span = scope.detach();
              one.setTraceSpan(span);
              //TODO step 3: remove the packet that was just sent from dataQueue
              dataQueue.removeFirst();
              //TODO step 4: then add the same packet to ackQueue
              ackQueue.addLast(one);
              dataQueue.notifyAll();
            }
          }

1.10.5 Summary

  • Once the data pipeline is up and the block file has been created, the client starts sending packets
  • For fault tolerance, HDFS adds an ack mechanism on top of this
  • The ResponseProcessor is responsible for ackQueue and for handling the acknowledgements reported back by the DataNodes
  • Each packet to be sent is placed on ackQueue, and after it has been sent it is removed from dataQueue
  • If every node in the pipeline succeeds, the packet is removed from ackQueue; otherwise the packets in ackQueue are pushed back onto dataQueue and sent again, as modelled in the sketch below
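
The interplay between dataQueue and ackQueue can be modelled with a few lines of plain Java. This is a conceptual sketch (the Packet class, locking and sequence-number handling are heavily simplified), not the DFSOutputStream code.

import java.util.LinkedList;

// Conceptual model of the dataQueue/ackQueue handshake described above.
class AckTrackingSketch {
  static class Packet { final long seqno; Packet(long seqno) { this.seqno = seqno; } }

  private final LinkedList<Packet> dataQueue = new LinkedList<>();
  private final LinkedList<Packet> ackQueue = new LinkedList<>();

  synchronized void enqueue(Packet p) { dataQueue.addLast(p); }

  // DataStreamer side: take the next packet to send and park it on ackQueue
  synchronized Packet sendNext() {
    Packet one = dataQueue.removeFirst();
    ackQueue.addLast(one);               // kept until the whole pipeline has acked it
    return one;
  }

  // ResponseProcessor side: a successful ack finally releases the packet
  synchronized void onAck(long seqno) {
    if (!ackQueue.isEmpty() && ackQueue.getFirst().seqno == seqno) {
      ackQueue.removeFirst();
    }
  }

  // Error path: everything not yet acked is pushed back to the front of dataQueue
  synchronized void onPipelineError() {
    dataQueue.addAll(0, ackQueue);
    ackQueue.clear();
  }
}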

(figure: image-20200712132632527)

1.11 The BlockReceiver writes incoming data to disk and the PacketResponder is initialized

1.11.1 DataXceiver's writeBlock

      if (blockReceiver != null) {
        String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
        //TODO receive the block
        // this also starts the PacketResponder
        blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
            mirrorAddr, null, targets, false);

        // send close-ack for transfer-RBW/Finalized 
        if (isTransfer) {
          if (LOG.isTraceEnabled()) {
            LOG.trace("TRANSFER: send close-ack");
          }
          //TODO send the response back
          writeResponse(SUCCESS, null, replyOut);
        }
      }

1.11.2 BlockReceiver's receiveBlock

    try {
      if (isClient && !isTransfer) {
        //TODO start the PacketResponder
        //TODO keep an eye on this object
        responder = new Daemon(datanode.threadGroup, 
            new PacketResponder(replyOut, mirrIn, downstreams));
        responder.start(); // start thread to processes responses
      }

      //TODO keep receiving data by calling receivePacket() in a loop
      while (receivePacket() >= 0) { /* Receive until the last packet */ }

      // wait for all outstanding packet responses. And then
      // indicate responder to gracefully shutdown.
      // Mark that responder has been closed for future processing
      if (responder != null) {
        ((PacketResponder)responder.getRunnable()).close();
        responderClosed = true;
      }

1.11.3 PacketResponder's run

    /**
     * Thread to process incoming acks.
     * @see java.lang.Runnable#run()
     */
    @Override
    public void run() {
      boolean lastPacketInBlock = false;
      final long startTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime() : 0;
      while (isRunning() && !lastPacketInBlock) {
        long totalAckTimeNanos = 0;
        boolean isInterrupted = false;
        try {
          Packet pkt = null;
          long expected = -2;
          PipelineAck ack = new PipelineAck();
          long seqno = PipelineAck.UNKOWN_SEQNO;
          long ackRecvNanoTime = 0;
          try {
            //TODO if this node is not the last one in the data pipeline
            if (type != PacketResponderType.LAST_IN_PIPELINE && !mirrorError) {
              // read an ack from downstream datanode
              //TODO read the processing result from the downstream DataNode
              ack.readFields(downstreamIn);
              ackRecvNanoTime = System.nanoTime();
              if (LOG.isDebugEnabled()) {
                LOG.debug(myString + " got " + ack);
              }
              // Process an OOB ACK.
              Status oobStatus = ack.getOOBStatus();
              if (oobStatus != null) {
                LOG.info("Relaying an out of band ack of type " + oobStatus);
                sendAckUpstream(ack, PipelineAck.UNKOWN_SEQNO, 0L, 0L,
                    PipelineAck.combineHeader(datanode.getECN(),
                      Status.SUCCESS));
                continue;
              }
              seqno = ack.getSeqno();
            }
            if (seqno != PipelineAck.UNKOWN_SEQNO
                || type == PacketResponderType.LAST_IN_PIPELINE) {
              pkt = waitForAckHead(seqno);
              if (!isRunning()) {
                break;
              }
              expected = pkt.seqno;
              if (type == PacketResponderType.HAS_DOWNSTREAM_IN_PIPELINE
                  && seqno != expected) {
                throw new IOException(myString + "seqno: expected=" + expected
                    + ", received=" + seqno);
              }
              if (type == PacketResponderType.HAS_DOWNSTREAM_IN_PIPELINE) {
                // The total ack time includes the ack times of downstream
                // nodes.
                // The value is 0 if this responder doesn't have a downstream
                // DN in the pipeline.
                totalAckTimeNanos = ackRecvNanoTime - pkt.ackEnqueueNanoTime;
                // Report the elapsed time from ack send to ack receive minus
                // the downstream ack time.
                long ackTimeNanos = totalAckTimeNanos
                    - ack.getDownstreamAckTimeNanos();
                if (ackTimeNanos < 0) {
                  if (LOG.isDebugEnabled()) {
                    LOG.debug("Calculated invalid ack time: " + ackTimeNanos
                        + "ns.");
                  }
                } else {
                  datanode.metrics.addPacketAckRoundTripTimeNanos(ackTimeNanos);
                }
              }
              lastPacketInBlock = pkt.lastPacketInBlock;
            }
          } catch (InterruptedException ine) {
            isInterrupted = true;
          } catch (IOException ioe) {
            if (Thread.interrupted()) {
              isInterrupted = true;
            } else {
              // continue to run even if can not read from mirror
              // notify client of the error
              // and wait for the client to shut down the pipeline
              mirrorError = true;
              LOG.info(myString, ioe);
            }
          }

          if (Thread.interrupted() || isInterrupted) {
            /*
             * The receiver thread cancelled this thread. We could also check
             * any other status updates from the receiver thread (e.g. if it is
             * ok to write to replyOut). It is prudent to not send any more
             * status back to the client because this datanode has a problem.
             * The upstream datanode will detect that this datanode is bad, and
             * rightly so.
             *
             * The receiver thread can also interrupt this thread for sending
             * an out-of-band response upstream.
             */
            LOG.info(myString + ": Thread is interrupted.");
            running = false;
            continue;
          }

          if (lastPacketInBlock) {
            // Finalize the block and close the block file
            finalizeBlock(startTime);
          }

          Status myStatus = pkt != null ? pkt.ackStatus : Status.SUCCESS;
          //TODO send the processing result straight to the upstream node
          sendAckUpstream(ack, expected, totalAckTimeNanos,
            (pkt != null ? pkt.offsetInBlock : 0),
            PipelineAck.combineHeader(datanode.getECN(), myStatus));
          if (pkt != null) {
            // remove the packet from the ack queue
            //TODO once the downstream has processed the data successfully, this DataNode removes the packet from its ackQueue
            removeAckHead();
          }
        } catch (IOException e) {
          LOG.warn("IOException in BlockReceiver.run(): ", e);
          if (running) {
            datanode.checkDiskErrorAsync();
            LOG.info(myString, e);
            running = false;
            if (!Thread.interrupted()) { // failure not caused by interruption
              receiverThread.interrupt();
            }
          }
        } catch (Throwable e) {
          if (running) {
            LOG.info(myString, e);
            running = false;
            receiverThread.interrupt();
          }
        }
      }
      LOG.info(myString + " terminating");
    }

1.11.4 receivePacket()

// put in queue for pending acks, unless sync was requested
    //TODO step 1: the first thing the BlockReceiver does with an incoming packet
    if (responder != null && !syncBlock && !shouldVerifyChecksum()) {
      //TODO enqueue the packet into this DataNode's own ackQueue
      ((PacketResponder) responder.getRunnable()).enqueue(seqno,
          lastPacketInBlock, offsetInBlock, Status.SUCCESS);
    }

    //First write the packet to the mirror:
    //TODO step 2: forward this packet to the downstream node in the data pipeline
    if (mirrorOut != null && !mirrorError) {
      try {
        long begin = Time.monotonicNow();
        //TODO forward to the downstream node
        //TODO the downstream DataNode is running its own BlockReceiver to accept this data
        packetReceiver.mirrorPacketTo(mirrorOut);
        mirrorOut.flush();
        long duration = Time.monotonicNow() - begin;
        if (duration > datanodeSlowLogThresholdMs) {
          LOG.warn("Slow BlockReceiver write packet to mirror took " + duration
              + "ms (threshold=" + datanodeSlowLogThresholdMs + "ms)");
        }
      } catch (IOException e) {
        handleMirrorOutError(e);
      }
    }
    
          //TODO step 3: write the data to the local disk
          out.write(dataBuf.array(), startByteToDisk, numBytesToDisk);
        

1.11.5 Summary

  • The DataNode's BlockReceiver first receives the packet
  • The packet is put on the local ackQueue
  • The packet is then forwarded to the downstream DataNode's BlockReceiver
  • Finally the packet is written to the local disk (a compact sketch of these steps follows)
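
The three steps can be condensed into a small sketch; AckQueue, handlePacket and their parameters are illustrative, not the BlockReceiver API.

import java.io.IOException;
import java.io.OutputStream;

// Conceptual sketch of the three steps receivePacket performs for each packet.
class PacketHandlingSketch {
  interface AckQueue { void enqueue(long seqno); }

  static void handlePacket(byte[] packet, long seqno,
                           AckQueue ackQueue,
                           OutputStream mirrorOut,   // socket to the downstream DataNode, null on the last node
                           OutputStream localOut)    // stream onto the local rbw file
      throws IOException {
    ackQueue.enqueue(seqno);             // step 1: remember the packet until the pipeline acks it
    if (mirrorOut != null) {
      mirrorOut.write(packet);           // step 2: forward it to the downstream DataNode first
      mirrorOut.flush();
    }
    localOut.write(packet);              // step 3: then persist it to the local disk
  }
}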

(figure: image-20200712134359602)

1.12 Acks are reported upstream hop by hop

              //TODO read the processing result from the downstream DataNode
              ack.readFields(downstreamIn);

  Status myStatus = pkt != null ? pkt.ackStatus : Status.SUCCESS;
          //TODO send the processing result straight to the upstream node
          sendAckUpstream(ack, expected, totalAckTimeNanos,
            (pkt != null ? pkt.offsetInBlock : 0),
            PipelineAck.combineHeader(datanode.getECN(), myStatus));
          if (pkt != null) {
            // remove the packet from the ack queue
            //TODO once the downstream has processed the data successfully, this DataNode removes the packet from its ackQueue
            removeAckHead();
          }

  • During the write, each DataNode reports its own status upstream, hop by hop
  • Every node reads the ack from its downstream neighbour and reports the combined result (its own plus everything downstream) to its upstream neighbour, as sketched below
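
Hop-by-hop aggregation can be pictured with a tiny helper; the enum, method name and array layout here are illustrative (the real PipelineAck carries per-replica statuses plus header flags).

// Conceptual sketch of how a DataNode combines its own status with the downstream ack.
class AckAggregationSketch {
  enum Status { SUCCESS, ERROR }

  // the reply sent upstream contains this node's status followed by all downstream
  // statuses, so the client ultimately sees one status per DataNode in pipeline order
  static Status[] combine(Status myStatus, Status[] downstreamStatuses) {
    Status[] combined = new Status[downstreamStatuses.length + 1];
    combined[0] = myStatus;
    System.arraycopy(downstreamStatuses, 0, combined, 1, downstreamStatuses.length);
    return combined;
  }
}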

(figure: image-20200712135134740)

1.13 Fault tolerance while writing

What does HDFS do when something goes wrong while data is flowing through the pipeline?

  1. If the failure happens while the pipeline is being set up, the block that was just allocated is abandoned and a new block is requested from the NameNode; the new request excludes the DataNodes that caused the failure

  2. If the pipeline is already up and a transfer then fails:

    1. Shut down the ResponseProcessor
    2. Close the input/output streams and the socket
    3. Push every packet in ackQueue back onto dataQueue and clear ackQueue
    4. With three replicas, if two or more DataNodes have failed, ask the NameNode for new DataNodes and rebuild the pipeline
    5. If only one of the three replicas failed, nothing more is needed: rebuild the pipeline from the remaining DataNodes and keep sending (the missing replica is restored later via the heartbeat/replication mechanism); a conceptual sketch of this decision follows
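
A conceptual sketch of this decision, under the simplifying assumptions spelled out in the comments; the class, method names and threshold are illustrative, and the real client behaviour is governed by the replace-datanode-on-failure policy.

import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of pipeline recovery for the 3-replica case described above.
class PipelineRecoverySketch {
  static List<String> rebuild(List<String> pipeline, List<String> failed, int replication) {
    List<String> survivors = new ArrayList<>(pipeline);
    survivors.removeAll(failed);
    // with replication = 3: two or more failures -> ask the NameNode for fresh targets
    if (survivors.size() < (replication + 1) / 2) {
      survivors.addAll(requestAdditionalDataNodes(replication - survivors.size()));
    }
    // with a single failure the surviving nodes are enough; the missing replica is
    // re-created later through the NameNode's heartbeat/replication mechanism
    return survivors;
  }

  static List<String> requestAdditionalDataNodes(int howMany) {
    return new ArrayList<>();   // placeholder for the getAdditionalDatanode-style RPC
  }
}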

2. Summary of the HDFS write flow

The write flow:

  1. Add an INodeFile to the HDFS directory tree
  2. Add a lease, so that only one client at a time can write this file (a background thread keeps scanning for expired leases)
  3. Start the DataStreamer, which performs the actual writing
  4. Start a thread that keeps renewing the lease
  5. Write chunks into a packet and put the packet onto dataQueue (see the sketch after this list)
  6. Ask the NameNode for a block, i.e. get back the usable DataNodes (placement is balanced via rack awareness and related policies)
  7. Build the data pipeline between those DataNodes
  8. Put the packet to be sent onto ackQueue, send it, and remove it from dataQueue; if the send fails, the packets in ackQueue are moved back to dataQueue and written again
  9. Each DataNode starts a PacketResponder thread: the packet is put on the DataNode's ack queue, forwarded to the next DataNode, written to disk, and the downstream result is collected; acks are then reported upstream hop by hop all the way to the client, and failures are again handled through the ack mechanism
  10. If building the pipeline fails, the block requested from the NameNode is abandoned, and a new block and a new pipeline are requested and built
  11. If a failure happens while DataNodes are writing, there are two cases: with 3 replicas, if 2 or more failed, a new pipeline is requested and built; if only one failed, the pipeline is simply rebuilt from the two healthy nodes, the failed one is dropped, and the NameNode's heartbeat/replication mechanism later repairs the missing replica
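
To make step 5 concrete, here is a minimal sketch of how user bytes become packets on dataQueue; the class and sizes are illustrative (in HDFS a chunk is bytes-per-checksum, by default 512 bytes, and a packet is roughly 64 KB).

import java.util.Arrays;
import java.util.LinkedList;

// Conceptual sketch for step 5: chunks are accumulated into a packet, and a full
// packet is handed to the DataStreamer by appending it to dataQueue.
class PacketBuilderSketch {
  static final int CHUNK_SIZE = 512;          // bytes of user data per checksum chunk
  static final int CHUNKS_PER_PACKET = 126;   // illustrative capacity, ~64 KB per packet

  private final LinkedList<byte[]> dataQueue = new LinkedList<>();
  private byte[] current = new byte[CHUNK_SIZE * CHUNKS_PER_PACKET];
  private int filled = 0;

  synchronized void writeChunk(byte[] chunk) {
    if (filled + chunk.length > current.length) {   // no room left: ship the packet first
      dataQueue.addLast(Arrays.copyOf(current, filled));
      current = new byte[CHUNK_SIZE * CHUNKS_PER_PACKET];
      filled = 0;
    }
    System.arraycopy(chunk, 0, current, filled, chunk.length);
    filled += chunk.length;
  }
}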


Reposted from blog.csdn.net/weixin_43704599/article/details/107299531