(Reprinted) HDFS NameNode Memory Panorama

Reprinted from: http://tech.meituan.com/namenode.html

Xiaoqiao  2016-08-26 11:20

I. Overview

Viewed from the overall HDFS system architecture, the NameNode is the most important and most complex component, and also the one most prone to problems; once the NameNode fails, the entire Hadoop cluster becomes unavailable. As the cluster keeps growing, many problems that remained hidden at small scale are gradually exposed. It is therefore particularly important to understand the internal structure and operating mechanism of the NameNode from a higher level. Unless otherwise noted, this article is based on the community version Hadoop-2.4.1[1][2]. Although there have been several version iterations since 2.4.1, the basic principles remain the same.

The NameNode manages the metadata of the entire HDFS file system. From the perspective of architectural design, this metadata falls roughly into two layers: the Namespace management layer, which maintains the tree-shaped directory structure of the file system and the mapping between files and data blocks; and the block management layer (BlocksMap), which maintains the mapping between the physical data blocks of files and their actual storage locations, as shown in Figure 1 [1]. Besides residing in memory, the metadata managed by the Namespace is also periodically flushed to the FsImage file on persistent storage, while the BlocksMap metadata exists only in memory. When the NameNode restarts, it first reads the FsImage from persistent storage to rebuild the Namespace, and then reconstructs the BlocksMap from the block reports sent by the DataNodes. These two data structures occupy most of the NameNode's JVM heap space.



Figure 1 HDFS structure diagram

Besides the metadata of the file system itself, the NameNode also needs to maintain the rack and DataNode information of the entire cluster, lease management, and the cache management introduced by centralized caching. The memory footprint of these data structures is relatively fixed and comparatively small.

Test data shows that when the total number of directories and files in the Namespace reaches 200 million and the total number of data blocks reaches 300 million, the resident memory usage exceeds 90 GB.
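As a rough, back-of-envelope sanity check of this figure (my own illustrative estimate, not a measurement from the article; the per-object byte counts are assumed rules of thumb), the order of magnitude works out as follows:

```java
/**
 * Back-of-envelope estimate of NameNode resident heap (illustrative only).
 * The ~150/200 bytes-per-object figures are assumed rules of thumb, not
 * measurements; real usage depends on path lengths, replication, JVM, etc.
 */
public class HeapEstimate {
    public static void main(String[] args) {
        long namespaceObjects = 200_000_000L; // directories + files (INodes)
        long blocks = 300_000_000L;           // data blocks (BlockInfo entries)

        long bytesPerInode = 150L;            // assumed average per INode
        long bytesPerBlock = 200L;            // assumed average per block, incl. replica bookkeeping

        long totalBytes = namespaceObjects * bytesPerInode + blocks * bytesPerBlock;
        System.out.printf("Estimated heap: ~%.0f GB%n", totalBytes / 1e9);
        // Prints roughly 90 GB, the same ballpark as the measured usage above.
    }
}
```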

II. Memory Panorama

As mentioned above, the NameNode's memory can be divided roughly into four parts: Namespace, BlockManager, NetworkTopology, and others. Figure 2 shows the logical layout of each data structure in memory.



Figure 2 NameNode memory panorama

Namespace: maintains the directory tree of the entire file system and state changes on that tree;
BlockManager: maintains information about all data blocks in the file system and their state changes;
NetworkTopology: maintains the rack topology and DataNode information; the basis of rack awareness;
Others:
LeaseManager: implements mutual exclusion for writes via leases; the core data structure supporting HDFS's Write-Once-Read-Many semantics;
CacheManager: supports the centralized cache feature introduced in Hadoop 2.3.0;
SnapshotManager: supports the Snapshot feature introduced in Hadoop 2.1.0, used for data backup and rollback to guard against data loss caused by user misoperation;
DelegationTokenSecretManager: manages secure access to HDFS;
plus temporary data, statistics (metrics), and so on.

The NameNode's resident memory is consumed mainly by the Namespace and the BlockManager, each accounting for close to 50%. The memory overhead of the other parts is small and relatively fixed, and is basically negligible compared with the Namespace and BlockManager.

III. Memory Analysis

3.1 Namespace

Like a single-machine file system, HDFS maintains the directory structure of the file system as a tree. The Namespace holds the directory tree and the attributes of every directory/file node. Besides residing in memory, this data is periodically flushed to persistent storage to generate a new FsImage file, so that when the NameNode restarts the entire Namespace can be restored from the FsImage in time. Figure 3 shows the Namespace memory structure. The total number of directories and files in the aforementioned cluster equals the total number of nodes in the Namespace directory tree; clearly, the Namespace itself is a very large tree.



Figure 3 Namespace memory structure

There are two types of INode data structures in the Namespace directory tree: INodeDirectory and INodeFile, where INodeDirectory represents a directory and INodeFile represents a file. Since both inherit from INode, they share most of their common fields through INodeWithAdditionalFields. Beyond the common basic attributes, extended attributes such as Quota and Snapshot are attached through Features, and new attributes added in the future can also be extended easily through the Feature mechanism. The differences: INodeFile has a header field that encodes the replication factor and block size (the storage policy ID was added to this field after 2.6.1) together with the ordered array of Blocks belonging to the file; INodeDirectory has a list of child nodes named children. Note that children is an ArrayList with a default initial size of 5, kept sorted by child node name. Although this costs some write performance on insertion, it enables fast binary search and improves read performance; for a typical storage system, reads far outnumber writes. The inheritance relationship is shown in Figure 4.
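To make the sorted-children trade-off concrete, here is a minimal sketch (simplified and illustrative, not the actual INodeDirectory code) of a name-sorted child list with binary-search lookup and insertion:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Simplified model of a directory keeping children sorted by name (illustrative only). */
class SimpleDirectory {
    // Kept sorted by name, like INodeDirectory.children; small initial capacity.
    private final List<String> children = new ArrayList<>(5);

    /** O(log n) lookup thanks to the sorted order. */
    boolean contains(String name) {
        return Collections.binarySearch(children, name) >= 0;
    }

    /** Insertion keeps the list sorted; shifting elements makes writes a bit more expensive. */
    boolean addChild(String name) {
        int pos = Collections.binarySearch(children, name);
        if (pos >= 0) {
            return false; // a child with this name already exists
        }
        children.add(-pos - 1, name); // insertion point reported by binarySearch
        return true;
    }
}
```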



Figure 4 INode inheritance relationship

3.2 BlockManager

BlocksMap occupies a large proportion of the NameNode's memory and is managed by BlockManager. Compared with the Namespace, the data managed by BlockManager is considerably more complex. The Namespace and BlockManager are linked together through the ordered Blocks array of INodeFile mentioned above. Figure 5 shows the memory structure managed by BlockManager.



Figure 5 Memory structure managed by BlockManager

Each INodeFile contains a varying number of Blocks; the exact number is determined by the file size relative to the block size (128 MB by default in Hadoop 2.x, 64 MB in earlier releases). These Blocks form a BlockInfo array in file order, shown in Figure 5 as BlockInfo[A~K]. BlockInfo maintains the metadata of a Block, with the structure shown in Figure 6; since the data itself is managed by DataNodes, BlockInfo must also record which DataNodes actually hold the data. The core here is an Object array named triplets, of size 3*replicas, where replicas is the Block's replication factor. The triplets array holds:

  • triplets[3*i]: the DataNode (storage) holding the i-th replica of the Block;
  • triplets[3*i+1]: the previous Block on that DataNode;
  • triplets[3*i+2]: the next Block on that DataNode;

where i denotes the i-th replica of the Block, with i taking values in [0, replicas).
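A minimal sketch of this layout (simplified and illustrative; the real BlockInfo stores DatanodeStorageInfo references plus richer block state):

```java
/** Simplified model of BlockInfo's triplets layout (illustrative, not the actual HDFS class). */
class SimpleBlockInfo {
    // For replica i: [3*i] = storage holding it, [3*i+1]/[3*i+2] = previous/next block on that storage.
    private final Object[] triplets;

    SimpleBlockInfo(int replication) {
        this.triplets = new Object[3 * replication];
    }

    Object getStorage(int i)  { return triplets[3 * i]; }
    Object getPrevious(int i) { return triplets[3 * i + 1]; }
    Object getNext(int i)     { return triplets[3 * i + 2]; }

    void setStorage(int i, Object storage) { triplets[3 * i] = storage; }
    void setPrevious(int i, Object prev)   { triplets[3 * i + 1] = prev; }
    void setNext(int i, Object next)       { triplets[3 * i + 2] = next; }
}
```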



Figure 6 BlockInfo inheritance relationship

From the description so far, several important pieces of information are available: which Blocks a file contains, which DataNodes actually store those Blocks, and the linked-list relationship among all the Blocks on a given DataNode.

From the perspective of information completeness, the data above is sufficient to support all normal operations of the HDFS file system, but one common need is still not met: a Block cannot be quickly located from its block ID alone. For this, BlocksMap is introduced.

BlocksMap is implemented on top of LightWeightGSet, which is essentially a hash table that resolves collisions by chaining. To avoid the cost of rehashing, the index space is sized at initialization to 2% of the JVM's available memory and never changes afterwards. During cluster startup, each DataNode performs a BlockReport (BR); for every Block in the report, the NameNode computes its hash code and inserts the corresponding BlockInfo at the appropriate position, gradually building up the huge BlocksMap. Recall the BlockInfo collection in INodeFile: if we gather the BlockInfo objects in BlocksMap and the BlockInfo objects in all INodeFiles separately, the two collections turn out to be exactly the same — every BlockInfo in BlocksMap is a reference to the corresponding BlockInfo in some INodeFile. When looking up a Block, its hash code is computed first and used to quickly locate the corresponding BlockInfo. With this, the problems concerning the metadata of the HDFS file system itself are basically solved.
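A minimal sketch of the chained-hash idea behind this lookup (a generic illustration, not the actual LightWeightGSet, whose capacity is derived from 2% of JVM memory and whose elements carry their own chain links):

```java
/** Tiny chained hash set illustrating the BlocksMap lookup idea (not LightWeightGSet itself). */
class TinyBlockSet {
    static class Node {
        final long blockId;
        Node next; // collision chain
        Node(long blockId) { this.blockId = blockId; }
    }

    private final Node[] buckets;

    TinyBlockSet(int capacity) {
        this.buckets = new Node[capacity]; // fixed at construction, no rehash
    }

    private int index(long blockId) {
        return (Long.hashCode(blockId) & 0x7fffffff) % buckets.length;
    }

    void put(long blockId) {
        int i = index(blockId);
        Node n = new Node(blockId);
        n.next = buckets[i];
        buckets[i] = n;
    }

    Node get(long blockId) {
        for (Node n = buckets[index(blockId)]; n != null; n = n.next) {
            if (n.blockId == blockId) return n;
        }
        return null;
    }
}
```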

Everything described so far belongs to the static part of the data. In practice, the data in NameNode memory changes constantly as reads and writes occur, and BlockManager must also manage this dynamic data — chiefly so that the distribution of Blocks can be adjusted in time when a Block's replica state deviates from expectations. Several core data structures are involved:

excessReplicateMap: if a Block actually has more replicas stored than its preset replication factor, the surplus replicas need to be deleted; such replicas are placed in excessReplicateMap, which maps a DataNode's StorageID to a set of Blocks.
neededReplications: if a Block has fewer replicas than its preset replication factor, the missing replicas must be re-created; neededReplications records which Blocks are short of replicas and by how many. It is essentially a priority queue: the more replicas a Block is missing, the earlier it is processed (a simplified sketch follows this list).
invalidateBlocks: Blocks that are about to be deleted are placed in invalidateBlocks, which maps a DataNode's StorageID to a set of Blocks. When a client deletes a file, all Blocks belonging to that file are first placed in invalidateBlocks.
corruptReplicas: in some scenarios a replica becomes unusable, for example because of a timestamp or length mismatch; such replicas are temporarily recorded in corruptReplicas and handled later.
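A simplified sketch of the priority idea for neededReplications (illustrative only; the real HDFS structure uses a fixed set of priority levels with more nuanced rules, e.g. for blocks with a single remaining replica):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Simplified priority buckets for under-replicated blocks (illustrative only). */
class SimpleNeededReplications {
    // Bucket 0 is the most urgent (fewest live replicas); higher index = less urgent.
    private final List<Set<Long>> buckets = new ArrayList<>();

    SimpleNeededReplications(int levels) {
        for (int i = 0; i < levels; i++) {
            buckets.add(new LinkedHashSet<>());
        }
    }

    /** Queue a block; rough rule: the fewer live replicas, the more urgent. */
    void add(long blockId, int liveReplicas, int expectedReplicas) {
        if (liveReplicas >= expectedReplicas) {
            return; // not under-replicated
        }
        int level = Math.min(liveReplicas, buckets.size() - 1);
        buckets.get(level).add(blockId);
    }

    /** The replication monitor would poll the most urgent bucket first. */
    Long pollMostUrgent() {
        for (Set<Long> bucket : buckets) {
            if (!bucket.isEmpty()) {
                Long id = bucket.iterator().next();
                bucket.remove(id);
                return id;
            }
        }
        return null;
    }
}
```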

The core data structures above, which track dynamic changes in Block distribution, are transitional in nature. The ReplicationMonitor thread inside BlockManager (shown as Thread/Monitor in Figure 5) continuously takes entries from them and, after the appropriate processing logic, distributes them to the data structures of the corresponding DatanodeDescriptor (briefly introduced in 3.3 NetworkTopology). When the heartbeat of the corresponding DataNode arrives, the NameNode traverses the data staged in that DatanodeDescriptor, converts it into commands, and returns them with the heartbeat response; after the DataNode executes the commands it reports back to the NameNode, and the corresponding entries in the DatanodeDescriptor are cleared. For example, suppose Block B has a preset replication factor of 3 but for some reason now has 4 replicas (say, a previously offline DataNode D that happened to hold a replica of Block B has come back online). BlockManager detects the change in time and places the redundant replica, the copy of Block B on DataNode D, into excessReplicateMap. The ReplicationMonitor thread periodically checks excessReplicateMap and moves the entry into the invalidateBlocks list of the DatanodeDescriptor for DataNode D. On DataNode D's next heartbeat, the NameNode returns an instruction to delete Block B along with the heartbeat response; DataNode D then actually deletes its copy of Block B and reports back to the NameNode, after which BlockManager clears Block B on DataNode D from memory. At that point the replica count of Block B matches expectations again. The whole process is shown in Figure 7.



Figure 7 Processing flow when the replica count is abnormal

3.3 NetworkTopology

The relationship between Blocks and DataNodes has come up several times already; in fact, the NameNode also needs to manage all the DataNodes themselves. Moreover, because the target locations of data blocks must be chosen before data is written, the NameNode also maintains the rack topology of the whole cluster, NetworkTopology. Figure 8 shows the in-memory rack topology.



Figure 8 NetworkTopology memory structure

As Figure 8 shows, there are two parts here: the rack topology NetworkTopology and the DataNode node information. The tree-shaped rack topology is built at cluster startup based on rack awareness (usually computed by an external script), and generally does not change during the lifetime of the NameNode. The other part, the DataNode information, is comparatively more critical: as mentioned under BlockManager, the set of Blocks on each DataNode — more precisely, on each storage unit DatanodeStorageInfo of each DataNode — forms a doubly linked list, whose entry point is the DatanodeStorageInfo managed by the leaf node of the rack topology, i.e. the DataNode. In addition, since upper-layer applications add, delete, and query data all the time, the Blocks on a DatanodeStorageInfo also change dynamically, so the DataNode objects in NetworkTopology also manage dynamically changing structures such as replicateBlocks/recoverBlocks/invalidateBlocks. These correspond exactly to the dynamic data structures managed by BlockManager, realizing the flow in which dynamic changes are passed from BlockManager to the DataNode's in-memory objects and finally reach the physical DataNode via commands — the process already described in 3.2 BlockManager.

One question arises here: why are all the Blocks under a DatanodeStorageInfo organized as a doubly linked list rather than some other data structure? Considering the actual usage pattern, the operations on the Blocks of each DatanodeStorageInfo are concentrated on fast insertion/deletion (Blocks are added and removed dynamically) and sequential traversal (during BlockReport), so a doubly linked list is a very suitable data structure.
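A minimal sketch of the idea (illustrative; in the real implementation the list is threaded through the BlockInfo triplets described earlier rather than through separate node objects):

```java
/** Minimal doubly linked list of block replicas on one storage (illustrative only). */
class SimpleStorageBlockList {
    static class BlockNode {
        final long blockId;
        BlockNode prev, next;
        BlockNode(long blockId) { this.blockId = blockId; }
    }

    private BlockNode head; // entry point, analogous to the list head kept per DatanodeStorageInfo

    /** O(1) insertion at the head when a new replica lands on this storage. */
    void add(BlockNode n) {
        n.next = head;
        if (head != null) head.prev = n;
        head = n;
    }

    /** O(1) unlink when a replica is removed, given a reference to its node. */
    void remove(BlockNode n) {
        if (n.prev != null) n.prev.next = n.next; else head = n.next;
        if (n.next != null) n.next.prev = n.prev;
        n.prev = n.next = null;
    }

    /** Sequential traversal, e.g. when assembling a BlockReport. */
    void forEach(java.util.function.LongConsumer action) {
        for (BlockNode n = head; n != null; n = n.next) {
            action.accept(n.blockId);
        }
    }
}
```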

3.4 LeaseManager

The Lease mechanism is an important distributed protocol widely used in practical distributed systems. HDFS supports Write-Once-Read-Many semantics, and mutual exclusion for file write operations is implemented via Leases. A Lease is essentially a lock with a time limit, whose key property is exclusivity: a client must first obtain a Lease before writing a file, and while one client holds the Lease on a file, no other client can obtain it, which guarantees that a file is written by at most one client at a time. The NameNode's LeaseManager is the core of the Lease mechanism; it maintains the mapping between files and Leases and between clients and Leases, and this information changes in real time as data is written.



Figure 9 Memory data structure of LeaseManager

Figure 9 shows the LeaseManager memory structure, which includes three core data structures:

sortedLeases: the set of Leases, ordered by time, making it easy to check whether a Lease has timed out;
leases: the mapping from clients to Leases;
sortedLeasesByPath: the mapping from file paths to Leases;

Each client writing data corresponds to one Lease, and each Lease contains at least one Path identifying a file being written. The Lease itself already records its holder (the client) and the set of file paths it covers; leases and sortedLeasesByPath exist in addition so that a Lease can be quickly indexed either by its holder or by a file path.

Because a Lease is time-limited, it must be forcibly reclaimed when it times out, and the Lease-related state in memory must be cleared promptly. The timeout check and the post-timeout handling are performed centrally by LeaseManager.Monitor. The LeaseManager maintains two Lease timeouts: the soft timeout (softLimit) and the hard timeout (hardLimit), whose usage scenarios differ slightly.

Under normal circumstances, a client must apply to the NameNode's LeaseManager for a Lease before writing a file to the cluster; during the write it renews the Lease periodically to keep it from expiring (the renewal period is tied to the softLimit); and after finishing the write it asks for the Lease to be released. Two kinds of problems can occur in this process: (1) the client fails to renew the Lease in time while writing; (2) the Lease is not released successfully after the write completes. These correspond to the softLimit and the hardLimit respectively, and both trigger the LeaseManager to forcibly reclaim the timed-out Lease. If the client fails to renew in time and the softLimit is exceeded, another client attempting to write the same file triggers a forced reclamation on the soft timeout; if the client finished writing but did not release the Lease, the LeaseManager's background thread, LeaseManager.Monitor, detects the hard timeout and triggers the reclamation. In either case the handling logic is the same: FSNamesystem.internalReleaseLease. The logic itself is fairly involved and is not expanded here; in short, it checks and repairs the last Block written before the Lease expired, then releases the timed-out Lease so that subsequent writes from other clients can obtain the file's Lease normally.
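A minimal sketch of the soft/hard timeout bookkeeping (illustrative only; the limits are assumed values and the maps are simplified stand-ins for leases/sortedLeasesByPath, not the actual LeaseManager code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified lease bookkeeping with soft/hard expiry checks (illustrative only). */
class SimpleLeaseManager {
    static final long SOFT_LIMIT_MS = 60_000L;     // assumed soft limit (~1 minute)
    static final long HARD_LIMIT_MS = 3_600_000L;  // assumed hard limit (~1 hour)

    static class Lease {
        final String holder;                       // client name
        final Set<String> paths = new HashSet<>(); // files covered by this lease
        long lastRenewedMs;

        Lease(String holder, long now) { this.holder = holder; this.lastRenewedMs = now; }
        boolean softExpired(long now) { return now - lastRenewedMs > SOFT_LIMIT_MS; }
        boolean hardExpired(long now) { return now - lastRenewedMs > HARD_LIMIT_MS; }
    }

    private final Map<String, Lease> leasesByHolder = new HashMap<>(); // ~ "leases"
    private final Map<String, Lease> leasesByPath = new HashMap<>();   // ~ "sortedLeasesByPath"

    void addLease(String holder, String path, long now) {
        Lease lease = leasesByHolder.computeIfAbsent(holder, h -> new Lease(h, now));
        lease.paths.add(path);
        leasesByPath.put(path, lease);
    }

    void renew(String holder, long now) {
        Lease lease = leasesByHolder.get(holder);
        if (lease != null) lease.lastRenewedMs = now;
    }

    /** What a background monitor would do: reclaim leases past the hard limit. */
    void checkHardExpiry(long now) {
        leasesByHolder.values().removeIf(lease -> {
            if (!lease.hardExpired(now)) return false;
            lease.paths.forEach(leasesByPath::remove); // clear path index for the reclaimed lease
            return true;
        });
    }
}
```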

The NameNode's in-memory data structures are very rich; only a few important ones have been briefly described here. Besides those covered above there are also SnapshotManager, CacheManager, and others; because their memory footprint is limited and some of the features are not yet stable, they are not expanded on here.

IV. Problems

As the volume of data in the cluster keeps accumulating, the NameNode's memory usage grows in proportion. Inevitably, NameNode memory gradually becomes the bottleneck of cluster growth and begins to expose many problems.

1. Startup time grows. The NameNode startup process can be divided into several stages: loading FsImage data, replaying editlogs, Checkpoint, and receiving BlockReports from DataNodes. When the data scale is small, startup time can be kept within ~10 min. When the metadata scale reaches 500 million objects (more than 200 million INodes in the Namespace and close to 300 million Blocks), the FsImage file approaches 20 GB; loading the FsImage alone takes ~14 min and Checkpoint takes ~6 min, and with the other stages included the whole restart lasts ~50 min, in extreme cases even exceeding 60 min. Although after multiple rounds of optimization the restart has been stabilized at ~30 min, it is still very time-consuming, and if the data keeps growing the startup time will grow with it.

2. Performance starts to degrade. Almost all metadata operations of the HDFS file system are completed on the NameNode side. As memory usage grows with the data scale, the performance of metadata create/delete/update/query operations declines, and the degradation is amplified by complex processing logic, so relatively complex RPC requests (such as addBlock) degrade more noticeably.

3. The risk of NameNode JVM Full GC (FGC) rises. This shows up in two ways: (1) FGC becomes more frequent; (2) FGC pauses become longer and the risk is uncontrollable. For the NameNode's usage pattern, the CMS collector is currently the mainstream choice. Under normal circumstances, collecting a heap of more than 100 GB can keep pauses at the second level, but if a collection fails, the JVM falls back to a serial full collection whose pause can reach hundreds of seconds, which is fatal for the application.

4. Debugging an oversized JVM heap is difficult. If the performance of an online cluster deteriorates and the memory must be analyzed to reach a conclusion, the task becomes extremely hard: dumping the heap is itself time-consuming and laborious, and dumping such a large heap is very likely to make the NameNode unresponsive.

In response to the many problems caused by NameNode memory growth, both the community and industry have kept paying attention and trying different solutions. Overall there are two approaches: (1) scaling the NameNode out to spread the single-point load; (2) introducing an external system to hold the NameNode's in-memory data.

Since 2010 the community has invested a great deal of effort in this direction. The Federation scheme [3] addresses the problem by scaling NameNodes horizontally and spreading the single-point load; after several years of development it has gradually stabilized and is widely used in industry. In addition, the community has also been trying to move the Namespace into an external KV store such as LevelDB [4], thereby reducing the NameNode's memory load.

Beyond the community, industry has also been trying its own solutions. Baidu HDFS2 [5] provides the metadata management service as a master/slave cluster, essentially separating the Namespace and block management handled by the native NameNode onto different physical nodes: the Namespace service manages the directory tree of the file system and the mapping from files to sets of BlockIDs, while the mapping from BlockIDs to DataNodes is partitioned by certain rules across multiple service nodes for distributed management — an approach similar to Lustre (hash-based partitioning). Taobao HDFS2 [6] tried another idea: with the help of high-speed storage devices, metadata is persisted to external storage, keeping the NameNode completely stateless and making unlimited NameNode scaling possible. There are many other similar schemes.

Although both the community and industry have fairly mature solutions to the NameNode memory bottleneck, they are not necessarily applicable to every scenario, especially for small and medium-sized clusters. Based on practical experience and the NameNode memory problems that may arise as the cluster grows, here are a few suggestions:

  1. Merge small files. As mentioned earlier, both directories/files and Blocks occupy NameNode memory, so a large number of small files lowers memory efficiency. In addition, the read/write performance of small files is far lower than that of large files, mainly because reading and writing small files requires switching among many data sources, which seriously hurts performance.

  2. Choose an appropriate block size. Mainly for workloads dominated by large files, adjusting the default block size (parameter dfs.blocksize, 128 MB by default) can slow the growth of NameNode memory; a short illustrative sketch follows this list.

  3. HDFS Federation. When both the cluster and the data reach a certain scale, scaling the NameNode vertically can no longer sustain business growth; the HDFS Federation scheme can then be considered to scale NameNodes horizontally. Besides relieving the NameNode memory problem, Federation also provides good business isolation, so a single application cannot overwhelm the entire cluster.
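To illustrate suggestion 2 above, here is a hedged sketch of writing a file with a larger block size; the property name dfs.blocksize is standard, but the 256 MB value and the output path are illustrative choices only, to be tuned per workload:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: create a file with a larger block size to reduce the block count. */
public class LargeBlockWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A cluster-wide default would normally be set via dfs.blocksize in hdfs-site.xml;
        // here it is set on the client side for files created with this configuration.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB, illustrative value

        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/large-block-example.dat"); // hypothetical path
        try (FSDataOutputStream stream = fs.create(out)) {
            stream.writeBytes("payload written with a larger block size\n");
        }
        fs.close();
    }
}
```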

V. Summary

The NameNode occupies a pivotal position in the overall HDFS architecture, and its internal data and processing logic are relatively complex. This article briefly summarized the NameNode memory panorama and several key data structures, interpreted the NameNode from the perspective of its core in-memory data, and, combined with practical scenarios, introduced the problems NameNode memory may run into as the data scale grows, along with various industry solutions that can serve as references. The follow-up article, "HDFS NameNode Memory Details", will interpret several key NameNode data structures in detail and analyze the share of JVM heap each of them occupies.

VI. References

[1] Apache Hadoop, 2016, https://hadoop.apache.org/.
[2] Apache Hadoop Source Code, 2014, https://github.com/apache/hadoop/tree/branch-2.4.1/.
[3] HDFS Federation, 2011, https://issues.apache.org/jira/browse/HDFS-1052.
[4] NameNode Scalability, 2013, https://issues.apache.org/jira/browse/HDFS-5389.
[5] Baidu HDFS2, 2013, http://static.zhizuzhefu.com/wordpress_cp/uploads/2013/04/a9.pdf.
[6] Taobao HDFS2, 2012, https://github.com/taobao/ADFS.
