Namenode format
process
- Obtain the nameserviceId and namenodeId from the configuration.
- Check whether formatting is allowed, using the configuration item dfs.namenode.support.allow.format. Setting it to false is recommended in production environments, to prevent existing data from being formatted by mistake.
- Get the directories to format (the fsImage and edits storage directories, and the sharedEditsDirs configuration).
- If no clusterId is specified in the startup parameters, generate a random one (CID-UUID).
- Create the FSImage object and the FSNamesystem, ready for the format operation.
- Check whether the three kinds of directories above already contain data. In HA mode, this means connecting to the journalnodes and calling an RPC to check whether their directories contain data.
- Call NNStorage.newNamespaceInfo() to generate the NamespaceInfo, which includes namespaceID, clusterID, cTime, storageType, layoutVersion, blockPoolID and other information.
- Create the current directory, then write the VERSION file and the seen_txid file.
- Connect to the journalnodes and perform the format operation. This creates directories such as ${nameserviceId}/current and ${nameserviceId}/edits.sync on each journalnode; current also holds VERSION, paxos, and committed-txid.
- The last and most important step is saveFSImageInAllDirs, which saves the FSNamesystem created above through FSImageSaver and generates the FsImage file and its md5 file. At this point, txid is 0.
- The whole format process is now complete.
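The allow-format check in the second step can be sketched as follows. This is a minimal illustration, not the Hadoop source: it uses java.util.Properties in place of Hadoop's Configuration, the class and method names are hypothetical, and it assumes the key defaults to true when it is not set.

```java
import java.util.Properties;

// Hypothetical stand-in for the allow-format guard; not the Hadoop source.
class AllowFormatCheck {
  static final String KEY = "dfs.namenode.support.allow.format";

  // Returns true when formatting may proceed. A missing key falls back to
  // "true" (assumption: the guard is permissive unless switched off).
  static boolean isFormatAllowed(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(KEY, "true"));
  }

  // Aborts before any storage directory is touched when the guard is off.
  static void checkAllowFormat(Properties conf) {
    if (!isFormatAllowed(conf)) {
      throw new IllegalStateException(
          "Formatting is disabled: " + KEY + " is set to false");
    }
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    checkAllowFormat(conf); // passes: the key is not set
    conf.setProperty(KEY, "false");
    System.out.println("allowed = " + isFormatAllowed(conf));
  }
}
```

With the key set to false, the format attempt fails fast, which is the behavior you want on a production cluster.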
format command
hdfs namenode [-format [-clusterid cid ] [-force] [-nonInteractive] ]
# Most common form: no clusterId specified
hdfs namenode -format
# Specify the clusterId manually
hdfs namenode -format -clusterId aaaaa
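The difference between the two commands comes down to how the clusterId is chosen. A minimal sketch, assuming the CID-UUID rule described in this article; the class and method names are hypothetical and this is not the Hadoop implementation:

```java
import java.util.UUID;

// Hypothetical sketch of the clusterId rule; not the Hadoop source.
class ClusterIdSketch {

  // Keeps a caller-supplied ID; otherwise generates "CID-" + a random UUID.
  static String resolveClusterId(String requested) {
    if (requested == null || requested.isEmpty()) {
      return "CID-" + UUID.randomUUID();
    }
    return requested;
  }

  public static void main(String[] args) {
    System.out.println(resolveClusterId("aaaaa")); // manual ID is kept as-is
    System.out.println(resolveClusterId(null));    // random CID-<uuid>
  }
}
```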
Source code interpretation
initialization operation
Entry point: the org.apache.hadoop.hdfs.server.namenode.NameNode.format method
This step does the following:
- Get cluster configuration information
- Initialize Journals information and set the status to BETWEEN_LOG_SEGMENTS
- Check whether it can be reformatted and whether there is historical data.
- Start the formatting process
/**
 * Verify that the configured directories exist, then
 * interactively confirm formatting for each of them.
 */
private static boolean format(Configuration conf, boolean force,
    boolean isInteractive) throws IOException {
  // Get the configured nameserviceId
  String nsId = DFSUtil.getNamenodeNameServiceId(conf);
  // Resolve the namenodeId (e.g. nn1) from the local machine's address
  String namenodeId = HAUtil.getNameNodeId(conf, nsId);
  // Initialize the generic configuration keys
  initializeGenericKeys(conf, nsId, namenodeId);
  // Check whether formatting is allowed; disabling it is recommended in production
  checkAllowFormat(conf);
  // Security configuration
  if (UserGroupInformation.isSecurityEnabled()) {
    InetSocketAddress socAddr = DFSUtilClient.getNNAddress(conf);
    SecurityUtil.login(conf, DFS_NAMENODE_KEYTAB_FILE_KEY,
        DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, socAddr.getHostName());
  }
  // Get the NameNode directories, where the FsImage is stored.
  Collection<URI> nameDirsToFormat = FSNamesystem.getNamespaceDirs(conf);
  // Get the sharedEditsDirs configuration, i.e. the shared journalnode address.
  List<URI> sharedDirs = FSNamesystem.getSharedEditsDirs(conf);
  List<URI> dirsToPrompt = new ArrayList<URI>();
  dirsToPrompt.addAll(nameDirsToFormat);
  dirsToPrompt.addAll(sharedDirs);
  // Get the edits directories; if not configured, they are the same as the FsImage directories.
  List<URI> editDirsToFormat =
      FSNamesystem.getNamespaceEditsDirs(conf);
  // if clusterID is not provided - see if you can find the current one
  // If no clusterID was given in the startup options, generate a random one.
  String clusterId = StartupOption.FORMAT.getClusterId();
  if (clusterId == null || clusterId.equals("")) {
    // Generate a new cluster id: CID-uuid()
    clusterId = NNStorage.newClusterID();
  }
  LOG.info("Formatting using clusterid: {}", clusterId);
  // Create the FSImage object
  FSImage fsImage = new FSImage(conf, nameDirsToFormat, editDirsToFormat);
  FSNamesystem fsn = null;
  try {
    // Create the FSNamesystem object, which holds the NameNode metadata.
    fsn = new FSNamesystem(conf, fsImage);
    // The editLog starts in the UNINITIALIZED state; the ready-to-write state
    // is BETWEEN_LOG_SEGMENTS. This step switches the state to
    // BETWEEN_LOG_SEGMENTS and initializes journalSet, which holds the
    // JournalManager objects. There are two kinds of JournalManager:
    // FileJournalManager writes edits to local storage;
    // QuorumJournalManager writes edits to the remote shared service.
    fsImage.getEditLog().initJournalsForWrite();
    // Abort NameNode format if reformat is disabled and if
    // meta-dir already exists
    // Check whether reformatting is disabled. The default is false.
    if (conf.getBoolean(DFSConfigKeys.DFS_REFORMAT_DISABLED,
        DFSConfigKeys.DFS_REFORMAT_DISABLED_DEFAULT)) {
      force = false;
      isInteractive = false;
      for (StorageDirectory sd : fsImage.storage.dirIterable(null)) {
        if (sd.hasSomeData()) {
          throw new NameNodeFormatException(
              "NameNode format aborted as reformat is disabled for "
                  + "this cluster.");
        }
      }
    }
    // Interactively ask for confirmation before formatting.
    if (!fsImage.confirmFormat(force, isInteractive)) {
      return true; // aborted
    }
    // Start formatting
    fsImage.format(fsn, clusterId, force);
  } catch (IOException ioe) {
    LOG.warn("Encountered exception during format", ioe);
    throw ioe;
  } finally {
    if (fsImage != null) {
      fsImage.close();
    }
    if (fsn != null) {
      fsn.close();
    }
  }
  return false;
}
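The state change performed by initJournalsForWrite can be pictured as a small state machine. The sketch below is only an illustration: the class is hypothetical, the IN_SEGMENT and CLOSED states and the precondition are assumptions added to make the example complete, and the real method also builds the journal set.

```java
// Hypothetical model of the edit log state change; not the Hadoop source.
class EditLogStateSketch {
  // UNINITIALIZED and BETWEEN_LOG_SEGMENTS are named in the article;
  // IN_SEGMENT and CLOSED are assumed here for illustration.
  enum State { UNINITIALIZED, BETWEEN_LOG_SEGMENTS, IN_SEGMENT, CLOSED }

  private State state = State.UNINITIALIZED;

  State getState() {
    return state;
  }

  // Moves the log into the ready-to-write state. Assumption: only a log
  // that is not already open for writing may be initialized.
  void initJournalsForWrite() {
    if (state != State.UNINITIALIZED && state != State.CLOSED) {
      throw new IllegalStateException("Unexpected state: " + state);
    }
    // The real implementation would create the JournalManager objects here.
    state = State.BETWEEN_LOG_SEGMENTS;
  }

  public static void main(String[] args) {
    EditLogStateSketch log = new EditLogStateSketch();
    System.out.println(log.getState());
    log.initJournalsForWrite();
    System.out.println(log.getState());
  }
}
```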
format operation
Entry point: the org.apache.hadoop.hdfs.server.namenode.FSImage.format method
This step does the following:
- Generate the cluster information NamespaceInfo (namespaceID, clusterID, cTime, storageType, layoutVersion, blockPoolID, etc.) according to fixed rules
- Create current directory, write VERSION/seen_txid file
- Call the RPC of journalnode for format operation
- Serialize and save fsImage and md5 files to all configured namenode directories
void format(FSNamesystem fsn, String clusterId, boolean force)
    throws IOException {
  long fileCount = fsn.getFilesTotal();
  // Expect 1 file, which is the root inode
  Preconditions.checkState(fileCount == 1,
      "FSImage.format should be called with an uninitialized namesystem, has " +
      fileCount + " files");
  // Generate the cluster information, which ends up in the VERSION file
  NamespaceInfo ns = NNStorage.newNamespaceInfo();
  LOG.info("Allocated new BlockPoolId: " + ns.getBlockPoolID());
  ns.clusterID = clusterId;
  // In each configured namenode directory, create current and write VERSION / seen_txid
  storage.format(ns);
  // Call the journalnode RPC to perform the format operation
  editLog.formatNonFileJournals(ns, force);
  // Save the fsImage to all configured namenode directories
  saveFSImageInAllDirs(fsn, 0);
}
The default VERSION file content is as follows:
#Mon Mar 20 07:50:13 CST 2023
namespaceID=1262756384
clusterID=CID-b6aa4a27-242a-49b0-98ae-23f4122f3f6d
cTime=1679269550070
storageType=NAME_NODE
blockpoolID=BP-389782493-10.253.128.31-1679269550070
layoutVersion=-66
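The VERSION file uses the Java properties format (the first line is the timestamp comment that java.util.Properties writes), so it can be read back with the standard library. The sketch below parses the sample above and splits the blockpoolID into its parts; the class and method names are hypothetical, and the BP-<random>-<ip>-<millis> reading of the ID is inferred from this sample rather than taken from the Hadoop source.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for inspecting a VERSION file; not part of Hadoop.
class VersionFileSketch {
  // The sample VERSION content shown in this article.
  static final String SAMPLE = String.join("\n",
      "#Mon Mar 20 07:50:13 CST 2023",
      "namespaceID=1262756384",
      "clusterID=CID-b6aa4a27-242a-49b0-98ae-23f4122f3f6d",
      "cTime=1679269550070",
      "storageType=NAME_NODE",
      "blockpoolID=BP-389782493-10.253.128.31-1679269550070",
      "layoutVersion=-66");

  // Parses properties-formatted text; lines starting with '#' are comments.
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  // Splits "BP-<random>-<ip>-<millis>" into its four parts.
  static String[] splitBlockPoolId(String blockPoolId) {
    return blockPoolId.split("-", 4);
  }

  public static void main(String[] args) {
    Properties version = parse(SAMPLE);
    String[] parts = splitBlockPoolId(version.getProperty("blockpoolID"));
    System.out.println("clusterID  = " + version.getProperty("clusterID"));
    System.out.println("bp address = " + parts[2]);
    System.out.println("bp millis  = " + parts[3]);
  }
}
```

In this sample, the timestamp at the end of the blockpoolID matches cTime.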
The default seen_txid file content is as follows:
0
Write VERSION file locally
Entry point: the org.apache.hadoop.hdfs.server.namenode.NNStorage.format method
private void format(StorageDirectory sd) throws IOException {
  // Create current; if it already exists, empty it. Also sets the POSIX permissions.
  sd.clearDirectory(); // create current dir
  // Write the VERSION file
  writeProperties(sd);
  // Write the seen_txid file
  writeTransactionIdFile(sd, 0);
  LOG.info("Storage directory {} has been successfully formatted.",
      sd.getRoot());
}
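The three steps in this method (empty current, write VERSION, write seen_txid) can be reproduced with the standard library alone. This is a rough sketch under stated assumptions, not the Hadoop code: the class and method names are hypothetical, it skips the permission handling, and it assumes seen_txid holds the transaction ID followed by a newline.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of formatting one storage directory; not Hadoop code.
class StorageDirFormatSketch {

  // Recreates <root>/current from scratch, then writes VERSION and
  // seen_txid into it. Returns the path of the current directory.
  static Path format(Path root, Properties version) {
    Path current = root.resolve("current");
    try {
      if (Files.exists(current)) {
        List<Path> doomed;
        try (Stream<Path> walk = Files.walk(current)) {
          // Deepest paths first, so children are deleted before parents.
          doomed = walk.sorted(Comparator.reverseOrder())
              .collect(Collectors.toList());
        }
        for (Path p : doomed) {
          Files.delete(p);
        }
      }
      Files.createDirectories(current);
      try (Writer w = Files.newBufferedWriter(
          current.resolve("VERSION"), StandardCharsets.UTF_8)) {
        version.store(w, null); // writes a "#<date>" header line first
      }
      Files.write(current.resolve("seen_txid"),
          "0\n".getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return current;
  }

  // Formats a fresh temporary directory with sample values.
  static Path demo() {
    Properties version = new Properties();
    version.setProperty("storageType", "NAME_NODE");
    version.setProperty("layoutVersion", "-66");
    try {
      return format(Files.createTempDirectory("nn-sketch"), version);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("Formatted " + demo());
  }
}
```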
Formatting of JournalManager
Let's go straight to the server-side implementation.
Entry point: the org.apache.hadoop.hdfs.qjournal.server.Journal.format method
/**
 * nsInfo is the cluster information passed in over RPC
 */
void format(NamespaceInfo nsInfo, boolean force) throws IOException {
  Preconditions.checkState(nsInfo.getNamespaceID() != 0,
      "can't format with uninitialized namespace info: %s",
      nsInfo);
  LOG.info("Formatting journal id : " + journalId + " with namespace info: " +
      nsInfo + " and force: " + force);
  // Do the actual formatting
  storage.format(nsInfo, force);
  this.cache = createCache();
  refreshCachedData();
}
This calls the org.apache.hadoop.hdfs.qjournal.server.JNStorage.format method, which does the following:
- Analyze the state of the directory: determine whether it has been created and whether it contains historical data, and from that work out which state it is currently in (rollback, upgrade, checkpoint, etc.)
- Create the current directory
- Write VERSION cluster data information to persist to the file system
- Create paxos directory
void format(NamespaceInfo nsInfo, boolean force) throws IOException {
  unlockAll();
  try {
    sd.analyzeStorage(StartupOption.FORMAT, this, !force);
  } finally {
    sd.unlock();
  }
  // Store the cluster information
  setStorageInfo(nsInfo);
  LOG.info("Formatting journal {} with nsid: {}", sd, getNamespaceID());
  // Unlock the directory before formatting, because we will
  // re-analyze it after format(). The analyzeStorage() call
  // below is responsible for re-locking it. This is a no-op
  // if the storage is not currently locked.
  unlockAll();
  // Create the current directory, just like on the namenode side above
  sd.clearDirectory();
  // Write the VERSION information
  writeProperties(sd);
  // Create the paxos directory
  getOrCreatePaxosDir();
  // Analyze the directory state.
  analyzeStorage();
}
So far, the first stage is done: the basic cluster information (VERSION, seen_txid, etc.) has been persisted to the file systems of the different components. What remains is persisting the fsImage.
Now back to saveFSImageInAllDirs, the last call in FSImage.format above.
Persisting FsImage files
Entry point: the org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs method
- Create SaveNamespaceContext context
- Start multithreading and use FSImageSaver to perform persistent operations
- Call FSImageFormatProtobuf to serialize the memory object.
- Determine whether the image needs compression
- Execute persistence
- Save the md5 properties file
private synchronized void saveFSImageInAllDirs(FSNamesystem source,
    NameNodeFile nnf, long txid, Canceler canceler) throws IOException {
  // Track startup progress
  StartupProgress prog = NameNode.getStartupProgress();
  prog.beginPhase(Phase.SAVING_CHECKPOINT);
  if (storage.getNumStorageDirs(NameNodeDirType.IMAGE) == 0) {
    throw new IOException("No image directories available!");
  }
  // Canceler, used to abort the save
  if (canceler == null) {
    canceler = new Canceler();
  }
  SaveNamespaceContext ctx = new SaveNamespaceContext(
      source, txid, canceler);
  try {
    List<Thread> saveThreads = new ArrayList<Thread>();
    // Save the fsImage into the current directory of each storage dir,
    // one thread per directory
    for (Iterator<StorageDirectory> it
        = storage.dirIterator(NameNodeDirType.IMAGE); it.hasNext();) {
      StorageDirectory sd = it.next();
      FSImageSaver saver = new FSImageSaver(ctx, sd, nnf);
      Thread saveThread = new Thread(saver, saver.toString());
      saveThreads.add(saveThread);
      saveThread.start();
    }
    waitForThreads(saveThreads);
    saveThreads.clear();
    storage.reportErrorsOnDirectories(ctx.getErrorSDs());
    if (storage.getNumStorageDirs(NameNodeDirType.IMAGE) == 0) {
      throw new IOException(
          "Failed to save in any storage directories while saving namespace.");
    }
    if (canceler.isCancelled()) {
      deleteCancelledCheckpoint(txid);
      ctx.checkCancelled(); // throws
      assert false : "should have thrown above!";
    }
    renameCheckpoint(txid, NameNodeFile.IMAGE_NEW, nnf, false);
    // Since we now have a new checkpoint, we can clean up some
    // old edit logs and checkpoints.
    // Do not purge anything if we just wrote a corrupted FsImage.
    if (!exitAfterSave.get()) {
      purgeOldStorage(nnf);
      archivalManager.purgeCheckpoints(NameNodeFile.IMAGE_NEW);
    }
  } finally {
    // Notify any threads waiting on the checkpoint to be canceled
    // that it is complete.
    ctx.markComplete();
    ctx = null;
  }
  prog.endPhase(Phase.SAVING_CHECKPOINT);
}
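The pattern in this method is: write the image to every directory in parallel under a temporary name, wait for all threads, then rename the files that were written successfully. Here is a simplified sketch of that pattern. The class and method names are hypothetical, the file names assume the fsimage.ckpt_<txid> to fsimage_<txid> convention with a 19-digit zero-padded txid, and cancellation and purging are left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the parallel save-then-rename pattern; not Hadoop code.
class ParallelImageSaveSketch {

  // Builds names such as fsimage_0000000000000000000 (19-digit txid).
  static String fileName(String prefix, long txid) {
    return String.format("%s_%019d", prefix, txid);
  }

  // Writes the image into every directory, one thread per directory, then
  // renames each finished checkpoint file. Returns the final image paths.
  static List<Path> saveInAllDirs(List<Path> dirs, byte[] image, long txid) {
    List<Path> failed = Collections.synchronizedList(new ArrayList<>());
    List<Thread> savers = new ArrayList<>();
    for (Path dir : dirs) {
      Thread saver = new Thread(() -> {
        try {
          Files.write(dir.resolve(fileName("fsimage.ckpt", txid)), image);
        } catch (IOException e) {
          failed.add(dir); // remember the bad directory, keep going
        }
      }, "saver-" + dir);
      savers.add(saver);
      saver.start();
    }
    for (Thread saver : savers) {
      try {
        saver.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while saving", e);
      }
    }
    List<Path> saved = new ArrayList<>();
    for (Path dir : dirs) {
      if (failed.contains(dir)) {
        continue;
      }
      try {
        saved.add(Files.move(dir.resolve(fileName("fsimage.ckpt", txid)),
            dir.resolve(fileName("fsimage", txid)),
            StandardCopyOption.ATOMIC_MOVE));
      } catch (IOException e) {
        failed.add(dir);
      }
    }
    if (saved.isEmpty()) {
      throw new IllegalStateException(
          "Failed to save in any storage directories");
    }
    return saved;
  }

  // Saves a tiny fake image into two temporary directories.
  static List<Path> demo() {
    try {
      List<Path> dirs = Arrays.asList(
          Files.createTempDirectory("name1"),
          Files.createTempDirectory("name2"));
      return saveInAllDirs(dirs, new byte[] {1, 2, 3}, 0L);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    demo().forEach(System.out::println);
  }
}
```

Writing under a temporary name first means a reader never sees a half-written fsimage file under its final name.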
saveFSImage process
void saveFSImage(SaveNamespaceContext context, StorageDirectory sd,
    NameNodeFile dstType) throws IOException {
  long txid = context.getTxId();
  File newFile = NNStorage.getStorageFile(sd, NameNodeFile.IMAGE_NEW, txid);
  File dstFile = NNStorage.getStorageFile(sd, dstType, txid);
  FSImageFormatProtobuf.Saver saver = new FSImageFormatProtobuf.Saver(context,
      conf);
  // Get the compression configuration
  FSImageCompression compression = FSImageCompression.createCompression(conf);
  // Save the image
  long numErrors = saver.save(newFile, compression);
  if (numErrors > 0) {
    // The image is likely corrupted.
    LOG.error("Detected " + numErrors + " errors while saving FsImage " +
        dstFile);
    exitAfterSave.set(true);
  }
  // Save the md5 properties file
  MD5FileUtils.saveMD5File(dstFile, saver.getSavedDigest());
  storage.setMostRecentCheckpointInfo(txid, Time.now());
}
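The md5 file saved in the last step lets a later load verify the image. The sketch below computes an MD5 digest and builds one line of such a sidecar file. The class and method names are hypothetical, and the "<hex digest> *<file name>" layout is an assumption borrowed from the md5sum convention, since the article does not show the content of the .md5 file.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of an MD5 sidecar line; not the Hadoop MD5FileUtils code.
class Md5SidecarSketch {

  // Returns the MD5 digest of the data as 32 lowercase hex characters.
  static String md5Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  // Builds the sidecar line (assumed layout: "<hex> *<file name>\n").
  static String sidecarLine(String fileName, byte[] data) {
    return md5Hex(data) + " *" + fileName + "\n";
  }

  public static void main(String[] args) {
    byte[] image = "demo image bytes".getBytes(StandardCharsets.UTF_8);
    System.out.print(sidecarLine("fsimage_0000000000000000000", image));
  }
}
```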
At this point, the entire format process is complete. The namenode directory now contains a current directory, which holds the VERSION and seen_txid files plus the fsimage file and its .md5 file. In HA mode, the journalnode directory also gets a current directory, which holds a VERSION file, a committed-txid file, and a paxos directory.
The specific content is as follows:
.
├── journal
│   └── cdp-cluster
│       ├── current
│       │   ├── committed-txid
│       │   ├── paxos
│       │   └── VERSION
│       ├── edits.sync
│       └── in_use.lock
└── namenode
    └── current
        ├── fsimage_0000000000000000000
        ├── fsimage_0000000000000000000.md5
        ├── seen_txid
        └── VERSION
7 directories, 7 files