[Big Data Hadoop] HDFS NameNode bootstrapStandby: a step-by-step source-code analysis of metadata synchronization

Process

  1. Resolve the nameserviceId and namenodeId from the configuration.
  2. Look up the other NameNodes and establish RPC connections to them.
  3. Check the dfs.namenode.support.allow.format setting. In production it is generally recommended to set it to false, to prevent accidentally formatting existing data.
  4. Collect the directories to format (the fsImage and edits storage directories, plus the shared edits directories configuration).
  5. Format those directories: create the current directory and write the VERSION and seen_txid files.
  6. Check that the edit-log segments between the last checkpoint and the latest curTxId exist in the QJM.
  7. Download the fsImage file produced by the latest checkpoint from the remote NameNode.
  8. At this point the whole bootstrap is complete.
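Step 5 can be pictured with a small self-contained program. This is an illustrative sketch only, not Hadoop's NNStorage code; the field names mirror what HDFS writes, but the exact values (e.g. the layout version) are assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of "format": create current/, then write VERSION and seen_txid.
public class FormatSketch {
    public static void format(Path storageDir, String clusterId,
                              int namespaceId, long lastSeenTxId) throws IOException {
        Path current = storageDir.resolve("current");
        Files.createDirectories(current);
        // VERSION records the identity of the namespace this directory belongs to.
        String version = "namespaceID=" + namespaceId + "\n"
                + "clusterID=" + clusterId + "\n"
                + "storageType=NAME_NODE\n"
                + "layoutVersion=-66\n";
        Files.write(current.resolve("VERSION"), version.getBytes());
        // seen_txid records the highest transaction id this NameNode has seen.
        Files.write(current.resolve("seen_txid"),
                Long.toString(lastSeenTxId).getBytes());
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nn-format");
        format(dir, "CID-demo", 1234567, 0L);
        System.out.println(Files.exists(dir.resolve("current/VERSION")));
    }
}
```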

The metadata synchronization command

hdfs namenode [-bootstrapStandby [-force] [-nonInteractive] [-skipSharedEditsCheck] ]


# the commonly used form
hdfs namenode -bootstrapStandby

Source code interpretation

Configuration parsing

The entry point is the org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run method.
This step does the following:

  • Load the cluster configuration
  • Find the remote NameNodes and pick the first reachable one
  • Check whether formatting is allowed
  • Invoke the actual synchronization logic
  public int run(String[] args) throws Exception {
    // Parse the command-line arguments.
    parseArgs(args);
    // Disable using the RPC tailing mechanism for bootstrapping the standby
    // since it is less efficient in this case; see HDFS-14806
    conf.setBoolean(DFSConfigKeys.DFS_HA_TAILEDITS_INPROGRESS_KEY, false);
    // Parse the configuration, gather cluster information, and find the remote NNs.
    parseConfAndFindOtherNN();
    NameNode.checkAllowFormat(conf);

    InetSocketAddress myAddr = DFSUtilClient.getNNAddress(conf);
    SecurityUtil.login(conf, DFS_NAMENODE_KEYTAB_FILE_KEY,
        DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, myAddr.getHostName());

    return SecurityUtil.doAsLoginUserOrFatal(new PrivilegedAction<Integer>() {
      @Override
      public Integer run() {
        try {
          // Perform the actual metadata synchronization.
          return doRun();
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    });
  }
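The NameNode.checkAllowFormat(conf) call enforces the dfs.namenode.support.allow.format guard from step 3 of the process. A minimal stand-in for that check, using java.util.Properties in place of Hadoop's Configuration class (so this is a sketch of the idea, not Hadoop's implementation), might look like:

```java
import java.util.Properties;

// If dfs.namenode.support.allow.format is false, refuse to format, so an
// operator cannot wipe existing metadata by accident. The key defaults to true.
public class AllowFormatGuard {
    public static void checkAllowFormat(Properties conf) {
        boolean allow = Boolean.parseBoolean(
            conf.getProperty("dfs.namenode.support.allow.format", "true"));
        if (!allow) {
            throw new IllegalStateException(
                "dfs.namenode.support.allow.format is set to false; "
                + "formatting is disallowed on this site");
        }
    }
}
```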

Synchronizing the metadata

doRun drives the whole process. It mainly does the following:

  • Create a proxy object for the remote NN
  • Format the directories and create the VERSION and seen_txid files
  • Prepare to download the fsImage
private int doRun() throws IOException {
    // find the active NN
    NamenodeProtocol proxy = null;
    NamespaceInfo nsInfo = null;
    boolean isUpgradeFinalized = false;
    RemoteNameNodeInfo proxyInfo = null;
    // This whole block creates a proxy object for a remote NN: loop over the
    // candidates and use the first one that responds.
    for (int i = 0; i < remoteNNs.size(); i++) {
      proxyInfo = remoteNNs.get(i);
      InetSocketAddress otherIpcAddress = proxyInfo.getIpcAddress();
      proxy = createNNProtocolProxy(otherIpcAddress);
      try {
        // Get the namespace from any active NN. If you just formatted the primary NN and are
        // bootstrapping the other NNs from that layout, it will only contact the single NN.
        // However, if the cluster is already running and you are adding a NN later (e.g.
        // replacing a failed NN), then this will bootstrap from any node in the cluster.
        nsInfo = proxy.versionRequest();
        isUpgradeFinalized = proxy.isUpgradeFinalized();
        break;
      } catch (IOException ioe) {
        LOG.warn("Unable to fetch namespace information from remote NN at " + otherIpcAddress
            + ": " + ioe.getMessage());
        if (LOG.isDebugEnabled()) {
          LOG.debug("Full exception trace", ioe);
        }
      }
    }

    if (nsInfo == null) {
      LOG.error(
          "Unable to fetch namespace information from any remote NN. Possible NameNodes: "
              + remoteNNs);
      return ERR_CODE_FAILED_CONNECT;
    }
    // Check the layout version (currently -66).
    if (!checkLayoutVersion(nsInfo)) {
      LOG.error("Layout version on remote node (" + nsInfo.getLayoutVersion()
          + ") does not match " + "this node's layout version ("
          + HdfsServerConstants.NAMENODE_LAYOUT_VERSION + ")");
      return ERR_CODE_INVALID_VERSION;
    }
    // Print the cluster information.
    System.out.println(
        "=====================================================\n" +
        "About to bootstrap Standby ID " + nnId + " from:\n" +
        "           Nameservice ID: " + nsId + "\n" +
        "        Other Namenode ID: " + proxyInfo.getNameNodeID() + "\n" +
        "  Other NN's HTTP address: " + proxyInfo.getHttpAddress() + "\n" +
        "  Other NN's IPC  address: " + proxyInfo.getIpcAddress() + "\n" +
        "             Namespace ID: " + nsInfo.getNamespaceID() + "\n" +
        "            Block pool ID: " + nsInfo.getBlockPoolID() + "\n" +
        "               Cluster ID: " + nsInfo.getClusterID() + "\n" +
        "           Layout version: " + nsInfo.getLayoutVersion() + "\n" +
        "       isUpgradeFinalized: " + isUpgradeFinalized + "\n" +
        "=====================================================");
    // Create the storage object for the directories to be formatted.
    NNStorage storage = new NNStorage(conf, dirsToFormat, editUrisToFormat);

    if (!isUpgradeFinalized) {
      // ... upgrade-related code omitted
    } else if (!format(storage, nsInfo)) {
      // prompt the user to format storage; this step creates the VERSION and seen_txid files
      return ERR_CODE_ALREADY_FORMATTED;
    }

    // download the fsimage from the remote NN over HTTP
    int download = downloadImage(storage, proxy, proxyInfo);
    if (download != 0) {
      return download;
    }

    // ... remaining code omitted
  }
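The loop at the top of doRun is a plain "first responder wins" pattern. Stripped of the Hadoop types, it can be sketched like this (a hedged, generic version; the real code catches IOException and logs a warning per candidate):

```java
import java.util.List;
import java.util.function.Function;

// Try each candidate in order; return the first non-null probe result,
// or null if every candidate fails (doRun maps null to ERR_CODE_FAILED_CONNECT).
public class FirstReachable {
    public static <C, R> R firstReachable(List<C> candidates, Function<C, R> probe) {
        for (C c : candidates) {
            try {
                R r = probe.apply(c);
                if (r != null) return r;   // first successful responder wins
            } catch (RuntimeException e) {
                // Swallow and fall through to the next candidate, as doRun
                // does when versionRequest() throws an IOException.
            }
        }
        return null;
    }
}
```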

Downloading the fsImage file

private int downloadImage(NNStorage storage, NamenodeProtocol proxy, RemoteNameNodeInfo proxyInfo)
      throws IOException {
    // Load the newly formatted image, using all of the directories
    // (including shared edits)
    // Get the txid of the most recent checkpoint.
    final long imageTxId = proxy.getMostRecentCheckpointTxId();
    // Get the current transaction id.
    final long curTxId = proxy.getTransactionID();
    FSImage image = new FSImage(conf);
    try {
      // Copy the cluster/storage information into the image.
      image.getStorage().setStorageInfo(storage);
      // Create the journalSet object and set its state to OPEN_FOR_READING.
      image.initEditLog(StartupOption.REGULAR);
      assert image.getEditLog().isOpenForRead() :
          "Expected edit log to be open for read";

      // Ensure that we have enough edits already in the shared directory to
      // start up from the last checkpoint on the active.
      // Verify the edit logs between imageTxId and curTxId are readable from the shared QJM.
      if (!skipSharedEditsCheck &&
          !checkLogsAvailableForRead(image, imageTxId, curTxId)) {
        return ERR_CODE_LOGS_UNAVAILABLE;
      }
      // Download that checkpoint into our storage directories over HTTP;
      // it is written as an fsimage.ckpt file.
      MD5Hash hash = TransferFsImage.downloadImageToStorage(
        proxyInfo.getHttpAddress(), imageTxId, storage, true, true);
      // Save the fsImage's MD5 digest and rename the checkpoint to its final
      // name without the .ckpt suffix.
      image.saveDigestAndRenameCheckpointImage(NameNodeFile.IMAGE, imageTxId,
          hash);
      // Write seen_txid to the formatted image directories.
      storage.writeTransactionIdFileToStorage(imageTxId, NameNodeDirType.IMAGE);
    } catch (IOException ioe) {
      throw ioe;
    } finally {
      image.close();
    }
    return 0;
  }
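The download-then-rename dance performed by downloadImageToStorage and saveDigestAndRenameCheckpointImage can be illustrated with a self-contained sketch. The file names mimic NNStorage's conventions, but the logic here is illustrative, not Hadoop's implementation: write to fsimage.ckpt_&lt;txid&gt; first, verify the digest, then rename, so a partially downloaded file is never mistaken for a valid image.

```java
import java.io.IOException;
import java.nio.file.*;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CheckpointRename {
    public static Path downloadAndRename(Path dir, long txId, byte[] imageBytes,
            String expectedMd5Hex) throws IOException, NoSuchAlgorithmException {
        String padded = String.format("%019d", txId);
        Path ckpt = dir.resolve("fsimage.ckpt_" + padded);
        Files.write(ckpt, imageBytes);             // simulate the HTTP download
        // Verify integrity before the file becomes visible under its final name.
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(Files.readAllBytes(ckpt));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        if (!hex.toString().equals(expectedMd5Hex)) {
            Files.delete(ckpt);
            throw new IOException("MD5 mismatch for downloaded image");
        }
        // Rename to the final fsimage_<txid> name only after verification.
        Path finalName = dir.resolve("fsimage_" + padded);
        Files.move(ckpt, finalName, StandardCopyOption.ATOMIC_MOVE);
        return finalName;
    }
}
```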

Checking whether the shared edit logs exist

First look at checkLogsAvailableForRead.
This step obtains the edit-log input streams between imageTxId and curTxId from the QJM.
The key method is
org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams:

public Collection<EditLogInputStream> selectInputStreams(long fromTxId,
      long toAtLeastTxId, MetaRecoveryContext recovery, boolean inProgressOk,
      boolean onlyDurableTxns) throws IOException {
    List<EditLogInputStream> streams = new ArrayList<EditLogInputStream>();
    synchronized(journalSetLock) {
      Preconditions.checkState(journalSet.isOpen(), "Cannot call " +
          "selectInputStreams() on closed FSEditLog");
      // Fetch the edit logs from the shared QJM and collect the streams.
      selectInputStreams(streams, fromTxId, inProgressOk, onlyDurableTxns);
    }

    try {
      // Check whether there are gaps in the transaction ids.
      checkForGaps(streams, fromTxId, toAtLeastTxId, inProgressOk);
    } catch (IOException e) {
      if (recovery != null) {
        // If recovery mode is enabled, continue loading even if we know we
        // can't load up to toAtLeastTxId.
        LOG.error("Exception while selecting input streams", e);
      } else {
        closeAllStreams(streams);
        throw e;
      }
    }
    return streams;
  }
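The essence of checkForGaps is a coverage check over the transaction-id ranges of the returned streams. A standalone sketch of that idea follows; this is illustrative only, since the real method works on EditLogInputStream objects and also handles in-progress segments:

```java
// Given the [firstTxId, lastTxId] ranges of the edit-log segments (sorted by
// firstTxId), verify they cover fromTxId..toAtLeastTxId without holes.
public class GapCheck {
    public static boolean coversWithoutGaps(long[][] segments,
            long fromTxId, long toAtLeastTxId) {
        long next = fromTxId;                     // next txid we still need
        for (long[] seg : segments) {
            if (seg[0] > next) return false;      // hole before this segment
            next = Math.max(next, seg[1] + 1);    // extend the covered range
        }
        return next > toAtLeastTxId;              // covered all required txids
    }
}
```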

The actual download: TransferFsImage.downloadImageToStorage

public static MD5Hash downloadImageToStorage(URL fsName, long imageTxId,
      Storage dstStorage, boolean needDigest, boolean isBootstrapStandby)
      throws IOException {
    String fileid = ImageServlet.getParamStringForImage(null,
        imageTxId, dstStorage, isBootstrapStandby);
    String fileName = NNStorage.getCheckpointImageFileName(imageTxId);

    List<File> dstFiles = dstStorage.getFiles(
        NameNodeDirType.IMAGE, fileName);
    if (dstFiles.isEmpty()) {
      throw new IOException("No targets in destination storage!");
    }
    // Download the file and return its MD5 digest.
    MD5Hash hash = getFileClient(fsName, fileid, dstFiles, dstStorage, needDigest);
    LOG.info("Downloaded file " + dstFiles.get(0).getName() + " size " +
        dstFiles.get(0).length() + " bytes.");
    return hash;
  }

Metadata synchronization complete

After the bootstrap, the standby node's metadata directory contains the following:

└── current
    ├── fsimage_0000000000000000000
    ├── fsimage_0000000000000000000.md5
    ├── seen_txid
    └── VERSION

1 directory, 4 files


Origin blog.csdn.net/u013412066/article/details/129679674