(Repost) HDFS NameNode Memory in Detail

Reprinted from: http://tech.meituan.com/namenode-memory-detail.html

Xiaoqiao 2016-12-09  17:56

Foreword

In " HDFS NameNode Memory Panorama ", from the perspective of NameNode's internal data structure, we briefly interpret its memory panorama and several key data structures, and introduce possible problems that NameNode may encounter based on actual scenarios, as well as industry A variety of solutions can be learned from horizontal scaling.

In fact, before the NameNode is scaled horizontally, its resident memory keeps growing with the size of the data, so the NameNode heap size has to be adjusted again and again. During this process, several questions arise:

  • How long the current memory space is expected to last.
  • When to adjust the heap size as the data grows.
  • By how much to increase the heap.

On the other hand, the NameNode heap cannot grow indefinitely. Beyond a certain threshold (which depends on the hardware, JVM version, GC strategy, etc.), potential problems appear:

  • Longer restart times.
  • Potential risk of Full GC (FGC).

Clearly, fine-grained insight into the NameNode's memory usage provides better decision support for optimizing memory usage or sizing the heap.

Building on the previous article "HDFS NameNode Memory Panorama" and aiming at the problems above, this article gives a detailed quantitative analysis of the memory usage of the NameNode's core data structures and provides a memory estimation model for reference. Based on the analysis results, the cluster's storage usage patterns can be optimized in a targeted way; at the same time, the estimation model can be used to plan memory resources in advance and provide a data reference for HDFS development planning.

Memory Analysis

NetworkTopology

The NameNode maintains the tree-shaped topology of the entire cluster through NetworkTopology. During cluster startup, the rack topology is gradually built up through rack awareness (usually computed by an external script), and thereafter it generally does not change much. The leaf nodes of the topology, DatanodeDescriptor, are the key structures identifying DataNodes; the inheritance relationship of this class is shown in Figure 1.



Figure 1 DatanodeDescriptor inheritance relationship

In a 64-bit JVM, the memory usage of DatanodeDescriptor is shown in Figure 2 (unless otherwise noted, the memory analysis of the other data structures below also assumes a 64-bit JVM).



Figure 2 Detailed memory usage of DatanodeDescriptor

Since a DataNode generally mounts multiple storage units of different types, such as HDD and SSD, the storageMap in Figure 2 holds the collection of DatanodeStorageInfo objects describing these storage media; its detailed data structure is shown in Figure 3.



Figure 3 Detailed explanation of DatanodeStorageInfo memory usage

In addition, DatanodeDescriptor also includes some dynamically allocated objects: replicateBlocks, recoverBlocks, and invalidateBlocks, which relate to dynamic block adjustment, and pendingCached, cached, and pendingUncached, which relate to centralized caching. Since these objects exist only temporarily and change at any time, they are not counted in detail here (introducing a slight error into the results).

According to the preceding analysis, assuming the cluster includes 2000 DataNodes, each mounting 16 storage units, the total memory the NameNode needs to maintain this information is approximately:

(64 + 114 + 56 + 109 ∗ 16) ∗ 2000 = ~4MB

In the tree-shaped rack topology, besides the leaf DatanodeDescriptor nodes, internal InnerNode nodes describe the rack information in the cluster topology.



Figure 4 Detailed explanation of memory usage of internal nodes in NetworkTopology topology

This part describes node information such as racks. Assuming the cluster includes 80 racks and 2000 DataNodes, the total memory the NameNode needs to maintain the internal nodes of the topology is approximately:

(44 + 48) ∗ 80 + 8 ∗ 2000 = ~25KB
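The two formulas above can be folded into a small helper. Below is a minimal sketch in Java, assuming the per-object byte counts from Figures 2 through 4 and 16 storage units per DataNode; the class and method names are illustrative, not HDFS APIs:

// Rough NetworkTopology memory estimate using the byte counts above.
public class TopologyEstimate {
    // Per DataNode: DatanodeDescriptor and related objects (64 + 114 + 56),
    // plus one DatanodeStorageInfo (109 bytes) per mounted storage unit.
    static long dataNodeBytes(int storagesPerNode) {
        return 64 + 114 + 56 + 109L * storagesPerNode;
    }

    // Internal nodes: (44 + 48) bytes per rack, plus an 8-byte child
    // reference for each DataNode hanging off the racks.
    static long innerNodeBytes(int racks, int dataNodes) {
        return (44 + 48) * (long) racks + 8L * dataNodes;
    }

    public static void main(String[] args) {
        long total = dataNodeBytes(16) * 2000 + innerNodeBytes(80, 2000);
        System.out.printf("NetworkTopology: ~%.1f MB%n", total / 1e6); // prints ~4.0 MB
    }
}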

As the analysis above shows, maintaining the cluster topology NetworkTopology for a 2000-node cluster requires less than 5MB of memory. Following the near-linear growth trend, even for a cluster approaching 10000 nodes this part takes only ~25MB, which is negligible within the NameNode JVM's overall memory overhead.

Namespace

Like a traditional single-machine file system, HDFS maintains the file system's directory structure as a tree. The Namespace holds the entire directory tree and the attributes of each directory/file node on it, including: name, number (id), user, group, permission, modification time (mtime), access time (atime), and subdirectories/files (children).

Figure 5 below shows the class diagram of INode in the Namespace, from which the inheritance relationship between files (INodeFile) and directories (INodeDirectory) can be seen. A directory is represented in memory by an INodeDirectory object, whose List<INode> children member lists the subdirectories and files under it; a file is represented by an INodeFile, whose BlockInfo[] blocks array records which blocks make up the file. Other attributes are held by member variables of the respective subclasses in the inheritance hierarchy.



Figure 5 File and directory inheritance relationship
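As a rough illustration of this hierarchy, the following skeleton mirrors the attributes listed above (a simplified sketch for readability, not the actual HDFS source):

import java.util.List;

// Simplified sketch of the INode hierarchy in Figure 5.
abstract class INode {
    long id;                  // node number (id)
    byte[] name;              // file/directory name
    long permission;          // packed user/group/permission bits
    INodeDirectory parent;    // link to the parent directory
    long modificationTime;    // mtime
    long accessTime;          // atime
}

class INodeDirectory extends INode {
    List<INode> children;     // subdirectories and files under this directory
}

class INodeFile extends INode {
    long header;              // packed replication factor and preferred block size
    BlockInfo[] blocks;       // the blocks that make up this file
}

class BlockInfo { /* detailed in the BlocksMap section below */ }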

The memory footprint of each attribute in the directory and file inheritance hierarchy is shown in Figure 6.



Figure 6 Detailed explanation of directory and file memory usage

Besides the attributes shown in the figure, some less common optional attributes, such as ACLs, are excluded from the statistics. In the default scenario, INodeFile and INodeDirectory.withQuotaFeature are the two most common and widely used structures.

According to the preceding analysis, assuming HDFS holds 100 million directories and 100 million files, with 100 million blocks in total, the memory usage of the entire Namespace in the JVM is:

Total(Directory) = (24 + 96 + 44 + 48) ∗ 100M + 8 ∗ num(total children)
Total(Files) = (24 + 96 + 48) ∗ 100M + 8 ∗ num(total blocks)
Total = (24 + 96 + 44 + 48) ∗ 100M + 8 ∗ num(total children) + (24 + 96 + 48) ∗ 100M + 8 ∗ num(total blocks) = ~38GB

A few notes on the estimation method:

  1. All directories in the tree are estimated using the default INodeDirectory.withQuotaFeature structure; if the cluster enables ACLs, Snapshots, or other features, the corresponding memory overhead must be added.
  2. All files in the tree are estimated as INodeFile.
  3. In the parent-child relationships of the directory tree, num(total children) is the sum of the numbers of directory nodes and file nodes.
  4. Some structures include strings, whose average length is estimated at 8; the actual value may be slightly larger.
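Putting the formula and notes together, a minimal sketch of the Namespace estimate (the coefficients come from Figure 6; the helper names are illustrative):

// Namespace memory estimate in bytes, per the formulas above.
public class NamespaceEstimate {
    static final long DIR_BYTES = 24 + 96 + 44 + 48;  // INodeDirectory.withQuotaFeature
    static final long FILE_BYTES = 24 + 96 + 48;      // INodeFile

    static long estimate(long dirs, long files, long blocks) {
        long children = dirs + files;  // each node is referenced once in its parent's children list
        return DIR_BYTES * dirs + FILE_BYTES * files
                + 8L * children + 8L * blocks;  // 8-byte references on a 64-bit JVM
    }

    public static void main(String[] args) {
        long m = 100_000_000L;  // 100M directories, 100M files, 100M blocks
        System.out.printf("Namespace: ~%.1f GB%n",
                estimate(m, m, m) / (double) (1L << 30));  // ~37.6, i.e. the ~38GB above
    }
}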

The Namespace resides in the JVM heap and stays in memory for the NameNode's entire lifetime. To ensure data reliability, the NameNode also checkpoints it periodically, persisting the Namespace to external storage. As the data grows, the number of files and directories grows with it, and the JVM memory occupied by the entire Namespace grows linearly in step.

BlocksMap

HDFS splits files into blocks of a configured size. To ensure data reliability, each block has multiple replicas stored on different DataNodes. Besides maintaining each block's own information, the NameNode must also maintain the mapping from a block to its list of DataNodes, which describes where each replica is physically stored. The BlocksMap structure inside the BlockManager holds this block-to-DataNode-list mapping; its internal data structure is shown in Figure 7.



Figure 7 BlockInfo inheritance relationship

BlocksMap reached its current form after several rounds of optimization. The initial version used a plain HashMap for the Block-to-BlockInfo mapping. Owing to problems with memory usage, collision resolution, and performance, the HashMap was replaced with a re-implemented LightWeightGSet. This data structure is essentially a hash table that resolves collisions with linked lists, but it does better in ease of use, memory footprint, and performance. For details on the introduction of LightWeightGSet, see HDFS-1114.

Compared with HashMap, to avoid collisions as much as possible, BlocksMap directly allocates 2% of the entire JVM heap as LightWeightGSet's index space at initialization. Of course, 2% is not absolute: the index is capped at Integer.MAX_VALUE/8 entries (note: Object.hashCode() returns an int, and an object reference occupies 8 bytes on a 64-bit JVM), and is automatically clamped to that upper bound. The 2% figure is essentially empirical: on a 64-bit JVM with a 64GB heap, it yields more than 100 million index entries, and with a suitable hash function collisions can largely be avoided.
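The sizing rule can be sketched as follows (a simplified illustration of the idea; the real LightWeightGSet code differs in detail):

// How the LightWeightGSet index size falls out of the 2% rule (sketch only).
public class GSetCapacitySketch {
    // Object.hashCode() returns an int and a reference costs 8 bytes on a
    // 64-bit JVM, so the index cannot usefully exceed Integer.MAX_VALUE / 8 slots.
    static final long MAX_SLOTS = Integer.MAX_VALUE / 8;

    static int computeCapacity(long heapBytes, double percent) {
        long slots = (long) (heapBytes * percent / 100) / 8;  // 8 bytes per reference slot
        slots = Math.min(slots, MAX_SLOTS);
        // Round down to a power of two so the bucket index is just (hash & (capacity - 1)).
        return Integer.highestOneBit((int) slots);
    }

    public static void main(String[] args) {
        System.out.println(computeCapacity(64L << 30, 2));  // 134217728 slots for a 64GB heap
    }
}

With a 64GB heap, 2% yields well over 100 million index slots, consistent with the claim above.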

The core function of BlocksMap is to quickly locate a specific BlockInfo given a BlockID. The detailed data structure of BlockInfo is shown in Figure 8. BlockInfo inherits from Block; besides the block ID, numBytes, and timestamp carried by the Block object, its most important member is the triplets array referencing the DataNode list where the block's replicas are physically stored.



Figure 8 Detailed explanation of BlocksMap memory usage
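The triplets layout can be read as three slots per replica (an illustrative sketch of the structure described above, not verbatim HDFS source):

// Sketch of BlockInfo's triplets array: 3 object slots per replica.
class BlockInfoSketch {
    // For a block with replication factor r, triplets holds 3 * r entries; for replica i:
    //   triplets[3*i]     -> the DatanodeStorageInfo that stores replica i
    //   triplets[3*i + 1] -> the previous BlockInfo on that same storage
    //   triplets[3*i + 2] -> the next BlockInfo on that same storage
    // The prev/next slots thread all blocks on one storage into a doubly linked
    // list, avoiding a separate list-node object per block per replica.
    Object[] triplets;

    BlockInfoSketch(int replication) {
        triplets = new Object[3 * replication];
    }
}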

The memory space backing LightWeightGSet is globally unique. Although LightWeightGSet optimizes memory usage, BlocksMap still occupies a large amount of JVM heap. Assuming the cluster holds 100 million blocks and the NameNode's available memory is fixed at 128GB, BlocksMap's memory usage is:

16 + 24 + 2% ∗ 128GB + (40 + 128) ∗ 100M = ~20GB
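In code form, a minimal sketch of this estimate (an illustrative helper mirroring the formula above):

// BlocksMap memory estimate in bytes.
public class BlocksMapEstimate {
    static long estimate(long blocks, long heapBytes) {
        long gsetIndex = (long) (0.02 * heapBytes);  // 2% of the heap for the LightWeightGSet index
        long perBlock = 40 + 128;                    // BlockInfo object plus its triplets array
        return 16 + 24 + gsetIndex + perBlock * blocks;
    }

    public static void main(String[] args) {
        long heap = 128L << 30;  // 128GB
        System.out.printf("BlocksMap: ~%.1f GB%n",
                estimate(100_000_000L, heap) / 1e9);  // ~19.5, i.e. the ~20GB above
    }
}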

BlocksMap data resides in memory for the NameNode's entire lifetime. As the data grows, so does the number of blocks, and the JVM heap space occupied by BlocksMap grows linearly in step.

Summary

The NameNode's in-memory data structures are very rich. Besides the core structures analyzed in detail above, there is also data managed by the LeaseManager, SnapshotManager, CacheManager, and so on. Because their memory usage is very limited, or the features are not yet stable or not enabled, or they lack general applicability, they are not expanded on here.

Comparing the preceding NameNode memory estimates against actual historical data from a Hadoop cluster: total directories and files ~140M, total blocks ~160M, and a NameNode JVM configured with 72GB. The estimated memory usage:

Namespace: (24 + 96 + 44 + 48) ∗ 70M + 8 ∗ 140M + (24 + 96 + 48) ∗ 70M + 8 ∗ 160M = ~27GB
BlocksMap: 16 + 24 + 2% ∗ 72GB + (40 + 128) ∗ 160M = ~26GB

Note: the calculation assumes a 1:1 ratio of directories to files, which closely matches the actual situation; the simplification has very little impact on the memory estimate.

The two together come to ~53GB, essentially matching the ~52GB resident memory shown by monitoring, in line with the actual situation.

As the preceding discussion shows, the two largest structures in the NameNode heap are the Namespace and the BlocksMap. As the data scale grows, their huge memory footprint inevitably challenges JVM memory management and may even constrain the NameNode's service capability.

There are two directions for reducing the space occupied by the Namespace and BlocksMap:

  • Merge small files. When producing data with Hive, special causes such as severe data skew or artificially fine partition granularity can write large numbers of small files to HDFS, with a potential impact on the NameNode. Merging small files promptly and keeping the growth of directories and files steady effectively avoids NameNode memory jitter.
  • Adjust the BlockSize appropriately. As noted above, fewer blocks also mean less memory, but changing the BlockSize indirectly affects compute jobs, so an appropriate trade-off is needed; see the sketch below.
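To make the block-size trade-off concrete, here is a small sketch using the ~176 bytes per block that the model at the end of this article assumes (numbers are for intuition only):

// NameNode heap consumed by block metadata for 1PB of data at various block sizes.
public class BlockSizeTradeoff {
    public static void main(String[] args) {
        long data = 1L << 50;       // 1 PB of file data, assumed block-aligned
        long perBlockBytes = 176;   // approximate NameNode cost per block (see the model below)
        for (long blockSizeMB : new long[] {64, 128, 256}) {
            long blocks = data / (blockSizeMB << 20);
            System.out.printf("blockSize=%dMB -> %,d blocks -> ~%.1f GB of NameNode heap%n",
                    blockSizeMB, blocks, blocks * perBlockBytes / 1e9);
        }
    }
}

Doubling the block size halves the block count, and therefore this slice of NameNode memory, at the cost of coarser-grained parallelism for compute jobs.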

Compared with other Java services, the NameNode's scenario is rather special, and some JVM defaults need appropriate adjustment: for example, the Young/Old generation ratio, and, to prevent CMS GC from degrading into a Full GC and hurting availability, an appropriately adjusted threshold for triggering CMS GC. For detailed JVM parameter tuning strategies, it is recommended to consult the official documentation.

Based on practice, the author offers some NameNode memory experience here for reference; a sample configuration sketch follows the list:

  • From the metadata growth trend and the estimation method described earlier in this article, the NameNode's resident memory size can be roughly derived; sizing the JVM heap so that resident memory accounts for ~60% of it is generally sufficient.
  • To avoid GC degradation, adjust CMSInitiatingOccupancyFraction to ~70.
  • During a NameNode restart, and especially while DataNodes send their BlockReports, large numbers of temporary objects are created. To avoid frequent GCs, or even FGCs, caused by objects being promoted into the Old generation, the Young generation setting (-XX:NewRatio) can be adjusted appropriately, to 10~15.
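Combining these points, a sample hadoop-env.sh fragment is sketched below. The heap size matches the 72GB cluster used for verification above; all values are illustrative and must be sized from your own estimates:

# Sketch only: derive -Xms/-Xmx from the estimation model, not from this example.
export HADOOP_NAMENODE_OPTS="-Xms72g -Xmx72g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:NewRatio=10 ${HADOOP_NAMENODE_OPTS}"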

It is understood that, using the CMS collection strategy, a HotSpot JVM heap of up to 180GB can still provide stable NameNode service. Pushing further up may challenge the JVM's memory management capabilities, especially memory reclamation; once an FGC occurs, it is fatal to the application. The 180GB figure is not absolute, and whether the heap can be grown beyond it while keeping the service stable is outside the scope of this article. Combined with the estimation method above, a 180GB JVM can manage a total of ~700M metadata objects, which basically meets the needs of small and medium clusters.

Conclusion

Based on " HDFS NameNode Memory Panorama ", this article introduces in detail several core data structures that account for a high proportion of NameNode memory usage. On this basis, an estimation model of NameNode memory data space occupation for reference is provided:

Total = 198 ∗ num(Directory + Files) + 176 ∗ num(blocks) + 2% ∗ size(JVM Memory Size)
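The coefficients follow from the earlier analysis: a directory costs 212 bytes plus an 8-byte slot in its parent's children list (220 in total), a file 168 + 8 = 176, which averages to 198 at the assumed 1:1 directory-to-file ratio; each block costs 168 bytes of BlockInfo plus an 8-byte slot in the owning file's blocks array, i.e. 176. A minimal sketch of the whole model (an illustrative helper, not an HDFS API):

// Overall NameNode memory model:
//   Total = 198 * num(dirs + files) + 176 * num(blocks) + 2% * heap
public class NameNodeMemoryModel {
    static long estimate(long dirsPlusFiles, long blocks, long heapBytes) {
        return 198 * dirsPlusFiles + 176 * blocks + (long) (0.02 * heapBytes);
    }

    public static void main(String[] args) {
        // Verification case from this article: ~140M directories+files,
        // ~160M blocks, 72GB heap.
        long bytes = estimate(140_000_000L, 160_000_000L, 72L << 30);
        System.out.printf("Estimated resident memory: ~%.0f GB%n",
                bytes / (double) (1L << 30));  // ~53 GB, matching the monitored ~52GB
    }
}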

Quantitative analysis of the memory usage of NameNode can provide reference data for HDFS optimization and development planning.

References

[1] Apache Hadoop.  https://hadoop.apache.org/ . 2016.
[2] Apache Issues.  https://issues.apache.org/ . 2016.
[3] Apache Hadoop Source Code.  https://github.com/apache/hadoop/tree/branch-2.4.1/ . 2014.
[4] HDFS NameNode Memory Panorama.  http://tech.meituan.com/namenode.html . 2016.
[5] Java HotSpot VM Options.  http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html .
