NameNode -bootstrapStandby
Process
- Obtain the nameserviceId and namenodeId from the configuration.
- Obtain the other NameNode's information and establish RPC communication.
- Check whether formatting is allowed via the configuration item dfs.namenode.support.allow.format. Configuring it is generally recommended in production environments to prevent formatting existing data by mistake.
- Get the directories to format (the fsImage and edits storage directories, plus the sharedEditsDirs configuration).
- Format the directories: create the current directory and write the VERSION and seen_txid files.
- Check whether the edit log files between the last checkpoint and the latest curTxId exist in the QJM.
- Download the fsImage file generated by the latest checkpoint from the remote NameNode.
- With that, the whole bootstrap process is complete.
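The formatting guard mentioned above is set in hdfs-site.xml. A minimal fragment (the value false shown here is a suggested production setting, not the default, which is true; note the guard also blocks -bootstrapStandby itself, so it must be temporarily re-enabled when a bootstrap is really intended):

```xml
<!-- hdfs-site.xml: refuse format/bootstrap operations so existing metadata
     cannot be wiped by mistake -->
<property>
  <name>dfs.namenode.support.allow.format</name>
  <value>false</value>
</property>
```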
Synchronize metadata command
hdfs namenode [-bootstrapStandby [-force] [-nonInteractive] [-skipSharedEditsCheck] ]
# the commonly used form
hdfs namenode -bootstrapStandby
Source code interpretation
Configuration parsing
The entry point is the org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run method.
This step does the following:
- Get the cluster configuration information
- Find the remote NameNodes and use the first reachable one
- Check whether formatting is allowed
- Call the specific synchronization process
public int run(String[] args) throws Exception {
// parse the command-line arguments
parseArgs(args);
// Disable using the RPC tailing mechanism for bootstrapping the standby
// since it is less efficient in this case; see HDFS-14806
conf.setBoolean(DFSConfigKeys.DFS_HA_TAILEDITS_INPROGRESS_KEY, false);
// parse the configuration, get cluster information, and find the remote NNs
parseConfAndFindOtherNN();
NameNode.checkAllowFormat(conf);
InetSocketAddress myAddr = DFSUtilClient.getNNAddress(conf);
SecurityUtil.login(conf, DFS_NAMENODE_KEYTAB_FILE_KEY,
DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, myAddr.getHostName());
return SecurityUtil.doAsLoginUserOrFatal(new PrivilegedAction<Integer>() {
@Override
public Integer run() {
try {
// perform the metadata synchronization
return doRun();
} catch (IOException e) {
throw new RuntimeException(e);
}
}
});
}
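parseConfAndFindOtherNN is not shown here; conceptually it lists all NameNode IDs configured for the local nameservice and keeps every one except the local nnId as a remote candidate. A self-contained sketch of that selection (method and map names are illustrative, not the Hadoop API):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FindOtherNN {
    /** Return the RPC addresses of all NameNodes in the nameservice except the local one. */
    static List<String> findRemoteNNs(Map<String, String> nnIdToRpcAddr, String localNnId) {
        return nnIdToRpcAddr.entrySet().stream()
                .filter(e -> !e.getKey().equals(localNnId))
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> nns = new LinkedHashMap<>();
        nns.put("nn1", "nn1-host:8020");
        nns.put("nn2", "nn2-host:8020");
        nns.put("nn3", "nn3-host:8020");
        // bootstrapping nn2: the remote candidates are nn1 and nn3, tried in order
        System.out.println(findRemoteNNs(nns, "nn2"));
    }
}
```

The real doRun then tries each candidate's versionRequest() in order, as the loop below shows.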
Sync metadata
doRun drives the whole synchronization and mainly does the following:
- Create a proxy object for a remote NN
- Format the directories and create the VERSION and seen_txid files
- Download the fsImage
private int doRun() throws IOException {
// find the active NN
NamenodeProtocol proxy = null;
NamespaceInfo nsInfo = null;
boolean isUpgradeFinalized = false;
RemoteNameNodeInfo proxyInfo = null;
// This whole loop creates a proxy to a remote NN, picking the first one that responds.
for (int i = 0; i < remoteNNs.size(); i++) {
proxyInfo = remoteNNs.get(i);
InetSocketAddress otherIpcAddress = proxyInfo.getIpcAddress();
proxy = createNNProtocolProxy(otherIpcAddress);
try {
// Get the namespace from any active NN. If you just formatted the primary NN and are
// bootstrapping the other NNs from that layout, it will only contact the single NN.
// However, if the cluster is already running and you are adding a NN later (e.g.
// replacing a failed NN), then this will bootstrap from any node in the cluster.
nsInfo = proxy.versionRequest();
isUpgradeFinalized = proxy.isUpgradeFinalized();
break;
} catch (IOException ioe) {
LOG.warn("Unable to fetch namespace information from remote NN at " + otherIpcAddress
+ ": " + ioe.getMessage());
if (LOG.isDebugEnabled()) {
LOG.debug("Full exception trace", ioe);
}
}
}
if (nsInfo == null) {
LOG.error(
"Unable to fetch namespace information from any remote NN. Possible NameNodes: "
+ remoteNNs);
return ERR_CODE_FAILED_CONNECT;
}
// check the layout version (currently -66)
if (!checkLayoutVersion(nsInfo)) {
LOG.error("Layout version on remote node (" + nsInfo.getLayoutVersion()
+ ") does not match " + "this node's layout version ("
+ HdfsServerConstants.NAMENODE_LAYOUT_VERSION + ")");
return ERR_CODE_INVALID_VERSION;
}
// print the cluster information
System.out.println(
"=====================================================\n" +
"About to bootstrap Standby ID " + nnId + " from:\n" +
" Nameservice ID: " + nsId + "\n" +
" Other Namenode ID: " + proxyInfo.getNameNodeID() + "\n" +
" Other NN's HTTP address: " + proxyInfo.getHttpAddress() + "\n" +
" Other NN's IPC address: " + proxyInfo.getIpcAddress() + "\n" +
" Namespace ID: " + nsInfo.getNamespaceID() + "\n" +
" Block pool ID: " + nsInfo.getBlockPoolID() + "\n" +
" Cluster ID: " + nsInfo.getClusterID() + "\n" +
" Layout version: " + nsInfo.getLayoutVersion() + "\n" +
" isUpgradeFinalized: " + isUpgradeFinalized + "\n" +
"=====================================================");
// create the storage object to be formatted
NNStorage storage = new NNStorage(conf, dirsToFormat, editUrisToFormat);
if (!isUpgradeFinalized) {
//... upgrade-related code omitted
} else if (!format(storage, nsInfo)) {
// prompt the user to format storage; this step creates the VERSION and seen_txid files
return ERR_CODE_ALREADY_FORMATTED;
}
// download the fsImage from the remote (active) NN over HTTP
int download = downloadImage(storage, proxy, proxyInfo);
if (download != 0) {
return download;
}
//... some code omitted
}
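The format(storage, nsInfo) call above reduces to creating a current/ directory and writing the VERSION and seen_txid files described earlier. A minimal sketch of just that file layout, with illustrative field values (the real NNStorage writes more fields and handles multiple directories):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class FormatSketch {
    /** Create current/, then write a VERSION properties file and a seen_txid file. */
    static void formatDir(Path storageDir, String namespaceId, String clusterId,
                          String blockPoolId, long seenTxid) throws IOException {
        Path current = storageDir.resolve("current");
        Files.createDirectories(current);

        Properties version = new Properties();
        version.setProperty("namespaceID", namespaceId);
        version.setProperty("clusterID", clusterId);
        version.setProperty("blockpoolID", blockPoolId);
        version.setProperty("storageType", "NAME_NODE");
        version.setProperty("layoutVersion", "-66");
        try (OutputStream out = Files.newOutputStream(current.resolve("VERSION"))) {
            version.store(out, null);
        }
        // seen_txid holds the highest transaction id this storage has seen
        Files.writeString(current.resolve("seen_txid"), seenTxid + "\n");
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("nnstorage");
        formatDir(tmp, "123456789", "CID-demo", "BP-demo", 0L);
        System.out.println(Files.readString(tmp.resolve("current/seen_txid")).trim()); // prints 0
    }
}
```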
Download fsImage file
private int downloadImage(NNStorage storage, NamenodeProtocol proxy, RemoteNameNodeInfo proxyInfo)
throws IOException {
// Load the newly formatted image, using all of the directories
// (including shared edits)
// get the txid of the most recent checkpoint
final long imageTxId = proxy.getMostRecentCheckpointTxId();
// get the current transaction id
final long curTxId = proxy.getTransactionID();
FSImage image = new FSImage(conf);
try {
// copy the cluster/storage info into the image
image.getStorage().setStorageInfo(storage);
// create the journalSet object and set its state to OPEN_FOR_READING
image.initEditLog(StartupOption.REGULAR);
assert image.getEditLog().isOpenForRead() :
"Expected edit log to be open for read";
// Ensure that we have enough edits already in the shared directory to
// start up from the last checkpoint on the active.
// i.e. the edit logs from imageTxId up to curTxId must be readable from the shared QJM
if (!skipSharedEditsCheck &&
!checkLogsAvailableForRead(image, imageTxId, curTxId)) {
return ERR_CODE_LOGS_UNAVAILABLE;
}
// download the fsImage over HTTP as an fsimage.ckpt file into the storage directories.
// Download that checkpoint into our storage directories.
MD5Hash hash = TransferFsImage.downloadImageToStorage(
proxyInfo.getHttpAddress(), imageTxId, storage, true, true);
// save the fsImage's MD5 digest and rename the checkpoint to its final name without the ckpt suffix.
image.saveDigestAndRenameCheckpointImage(NameNodeFile.IMAGE, imageTxId,
hash);
// Write seen_txid to the formatted image directories.
storage.writeTransactionIdFileToStorage(imageTxId, NameNodeDirType.IMAGE);
} catch (IOException ioe) {
throw ioe;
} finally {
image.close();
}
return 0;
}
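saveDigestAndRenameCheckpointImage does two small things: write an .md5 sidecar file and rename fsimage.ckpt_N to its final fsimage_N name. An illustrative stand-alone sketch (the sidecar format mimics Hadoop's MD5FileUtils convention; this is not the real implementation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RenameCheckpoint {
    /** Write the digest sidecar and promote fsimage.ckpt_N to its final fsimage_N name. */
    static Path saveDigestAndRename(Path dir, long txid, String md5Hex) throws IOException {
        String suffix = String.format("%019d", txid);
        Path ckpt = dir.resolve("fsimage.ckpt_" + suffix);
        Path fin = dir.resolve("fsimage_" + suffix);
        // sidecar line format: "<md5> *<filename>"
        Files.writeString(dir.resolve(fin.getFileName() + ".md5"),
                md5Hex + " *" + fin.getFileName() + "\n");
        return Files.move(ckpt, fin, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("current");
        Files.writeString(dir.resolve("fsimage.ckpt_" + String.format("%019d", 0)), "image-bytes");
        Path fin = saveDigestAndRename(dir, 0, "d41d8cd98f00b204e9800998ecf8427e");
        System.out.println(fin.getFileName()); // prints fsimage_0000000000000000000
    }
}
```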
Check whether the shared edit logs exist
First look at checkLogsAvailableForRead. This step obtains the edit log streams between imageTxId and curTxId from the QJM. The key method is org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams:
public Collection<EditLogInputStream> selectInputStreams(long fromTxId,
long toAtLeastTxId, MetaRecoveryContext recovery, boolean inProgressOk,
boolean onlyDurableTxns) throws IOException {
List<EditLogInputStream> streams = new ArrayList<EditLogInputStream>();
synchronized(journalSetLock) {
Preconditions.checkState(journalSet.isOpen(), "Cannot call " +
"selectInputStreams() on closed FSEditLog");
// fetch the edit log streams from the shared QJM and collect them
selectInputStreams(streams, fromTxId, inProgressOk, onlyDurableTxns);
}
try {
// check for gaps in the transaction range
checkForGaps(streams, fromTxId, toAtLeastTxId, inProgressOk);
} catch (IOException e) {
if (recovery != null) {
// If recovery mode is enabled, continue loading even if we know we
// can't load up to toAtLeastTxId.
LOG.error("Exception while selecting input streams", e);
} else {
closeAllStreams(streams);
throw e;
}
}
return streams;
}
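checkForGaps then only has to verify that the selected streams cover fromTxId through toAtLeastTxId without missing any transaction ids. A simplified stand-alone version of that check, modeling each stream as a {firstTxId, lastTxId} pair (an assumption of this sketch, not Hadoop's EditLogInputStream API):

```java
import java.util.Comparator;
import java.util.List;

public class GapCheck {
    /** Each stream is {firstTxId, lastTxId}. Returns true if [fromTxId, toAtLeastTxId] is fully covered. */
    static boolean noGaps(List<long[]> streams, long fromTxId, long toAtLeastTxId) {
        List<long[]> sorted = streams.stream()
                .sorted(Comparator.comparingLong(s -> s[0]))
                .toList();
        long next = fromTxId; // the next transaction id we still need to see
        for (long[] s : sorted) {
            if (s[0] > next) {
                return false; // hole before this stream starts
            }
            next = Math.max(next, s[1] + 1);
        }
        return next > toAtLeastTxId;
    }

    public static void main(String[] args) {
        List<long[]> ok = List.of(new long[]{1, 10}, new long[]{11, 25});
        List<long[]> hole = List.of(new long[]{1, 10}, new long[]{15, 25});
        System.out.println(noGaps(ok, 1, 25));   // true
        System.out.println(noGaps(hole, 1, 25)); // false: txids 11-14 are missing
    }
}
```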
Download the fsImage over HTTP
public static MD5Hash downloadImageToStorage(URL fsName, long imageTxId,
Storage dstStorage, boolean needDigest, boolean isBootstrapStandby)
throws IOException {
String fileid = ImageServlet.getParamStringForImage(null,
imageTxId, dstStorage, isBootstrapStandby);
String fileName = NNStorage.getCheckpointImageFileName(imageTxId);
List<File> dstFiles = dstStorage.getFiles(
NameNodeDirType.IMAGE, fileName);
if (dstFiles.isEmpty()) {
throw new IOException("No targets in destination storage!");
}
// download the file and return its MD5 hash
MD5Hash hash = getFileClient(fsName, fileid, dstFiles, dstStorage, needDigest);
LOG.info("Downloaded file " + dstFiles.get(0).getName() + " size " +
dstFiles.get(0).length() + " bytes.");
return hash;
}
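Note that dstStorage.getFiles returns one target file per configured IMAGE directory, so the download fans the same bytes out to all of them while an MD5 digest is computed for verification. A toy version of that fan-out, using an in-memory source instead of HTTP (names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;

public class MultiDirDownload {
    /** Write the same image bytes into every destination file, returning their MD5 digest. */
    static String downloadToAll(InputStream src, List<Path> dsts)
            throws IOException, NoSuchAlgorithmException {
        // a real implementation streams in chunks; reading fully keeps the sketch short
        byte[] buf = src.readAllBytes();
        for (Path dst : dsts) {
            Files.write(dst, buf);
        }
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return HexFormat.of().formatHex(md5.digest(buf));
    }

    public static void main(String[] args) throws Exception {
        Path d1 = Files.createTempDirectory("image-dir-1");
        Path d2 = Files.createTempDirectory("image-dir-2");
        String hash = downloadToAll(new ByteArrayInputStream("image-bytes".getBytes()),
                List.of(d1.resolve("fsimage.ckpt"), d2.resolve("fsimage.ckpt")));
        System.out.println(hash.length()); // an MD5 digest is 32 hex characters
    }
}
```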
Finally, the metadata sync is complete
The data directory on the other node now contains:
current
├── fsimage_0000000000000000000
├── fsimage_0000000000000000000.md5
├── seen_txid
└── VERSION
1 directory, 4 files