HDFS storage mechanism

Basic concepts in HDFS

Block: The block is the basic unit of storage in HDFS, with a default size of 64 MB (the default in Hadoop 1.x; later releases default to 128 MB). As in an ordinary file system, files in HDFS are split into block-sized chunks for storage. The difference is that in HDFS a file smaller than one block does not occupy the storage space of an entire block.
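To make the block arithmetic concrete, here is a minimal sketch of how a file's bytes would be spread over blocks. The function name `split_into_blocks` is made up for illustration, and the 64 MB constant is simply the default quoted above; this is not HDFS code.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # default block size quoted in the text (an assumption here)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of bytes that each block of a file would hold."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes, not a full 64 MB.
        sizes.append(remainder)
    return sizes


# A 150 MB file needs three blocks: 64 MB + 64 MB + 22 MB.
print([size // MB for size in split_into_blocks(150 * MB)])  # [64, 64, 22]
```

A 1 MB file yields a single 1 MB entry, which matches the point that a small file does not use up a whole block.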

NameNode: The metadata node. It is the master and manages the file system namespace. It stores the metadata of all files and directories in a file system tree. This information is persisted on disk as two files, the namespace image (fsimage) and the edit log, both described below. The NameNode also knows which blocks make up each file and which DataNodes hold each block. The block locations, however, are not stored on disk; they are collected from the DataNodes when the system starts.

DataNode: The data node. This is where HDFS actually stores data. Clients and the NameNode can ask a DataNode to write or read blocks. Each DataNode also periodically reports the list of blocks it stores to the NameNode.
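The periodic block report is what lets the NameNode rebuild block locations in memory without keeping them on disk. Below is a toy sketch of that idea; the class `NameNodeBlockMap` and its methods are hypothetical names for illustration, and the real protocol is considerably more involved.

```python
from collections import defaultdict


class NameNodeBlockMap:
    """Toy in-memory map: block id -> set of DataNode ids holding that block."""

    def __init__(self):
        self.locations = defaultdict(set)

    def process_block_report(self, datanode_id, block_ids):
        # Forget what this DataNode reported earlier, then record its current blocks.
        for holders in self.locations.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.locations[block_id].add(datanode_id)

    def holders(self, block_id):
        # Sorted list of DataNodes known to hold the block (empty if unknown).
        return sorted(self.locations.get(block_id, ()))


block_map = NameNodeBlockMap()
block_map.process_block_report("dn1", ["blk_1", "blk_2"])
block_map.process_block_report("dn2", ["blk_2"])
print(block_map.holders("blk_2"))  # ['dn1', 'dn2']
```

Because the map is rebuilt entirely from reports, a freshly started NameNode only needs to wait for the DataNodes to check in.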

Secondary NameNode: The secondary metadata node. Despite its name, it is not a standby that takes over when the NameNode fails. Its main job is to periodically merge the NameNode's namespace image with the edit log so that the log file does not grow too large. A copy of the merged namespace image is also kept on the Secondary NameNode, so it can be used for recovery if the NameNode fails.

edit log: The modification log. When a file system client performs a write operation, the operation is first recorded in the edit log. After the record is written, the NameNode updates its in-memory data structures. The edit log is synchronized to the file system before each write operation is reported as successful.
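The ordering described here (record the edit, update memory, sync the log before reporting success) can be sketched as follows. `EditLoggedNamespace`, the JSON record format, and the file name `edits` are all assumptions for illustration; the real edit log uses its own binary format.

```python
import json
import os
import tempfile


class EditLoggedNamespace:
    """Toy namespace (a set of paths) protected by an append-only edit log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.paths = set()

    def mkdir(self, path):
        record = {"op": "mkdir", "path": path}
        with open(self.log_path, "a", encoding="utf-8") as log:
            # 1. Record the operation in the edit log.
            log.write(json.dumps(record) + "\n")
            # 2. Update the in-memory data structure.
            self.paths.add(path)
            # 3. Force the log to disk before success is reported.
            log.flush()
            os.fsync(log.fileno())
        return True


log_path = os.path.join(tempfile.mkdtemp(), "edits")
namespace = EditLoggedNamespace(log_path)
namespace.mkdir("/user/alice")
```

Since the record is on disk before the caller sees success, an acknowledged write can always be recovered by replaying the log.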

fsimage: The namespace image. It is an on-disk checkpoint of the in-memory metadata. When the NameNode restarts after a failure, it loads the most recent checkpoint from fsimage into memory and then replays the operations recorded in the edit log. The Secondary NameNode helps the NameNode checkpoint its in-memory metadata to disk.
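Recovery is therefore "load the checkpoint, then replay the log". A minimal sketch, assuming the namespace is just a set of paths and each edit is an `(op, path)` tuple; both representations and the function names are invented for this example.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op, path = edit
    if op == "mkdir":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def recover(fsimage, edit_log):
    """Rebuild the namespace from the last checkpoint plus the logged edits."""
    namespace = set(fsimage)  # load the checkpoint into memory (as a copy)
    for edit in edit_log:     # replay edits in the order they were logged
        apply_edit(namespace, edit)
    return namespace


fsimage = {"/a", "/b"}
edit_log = [("mkdir", "/c"), ("delete", "/a")]
print(sorted(recover(fsimage, edit_log)))  # ['/b', '/c']
```

The longer the edit log, the longer this replay takes, which is why the log is periodically folded back into fsimage.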

The checkpoint process is as follows: First, the Secondary NameNode asks the NameNode to start a new log file; all subsequent edits are written to this new file. Second, the Secondary NameNode fetches the fsimage file and the old log file from the NameNode with HTTP GET. Third, the Secondary NameNode loads the fsimage file into memory, applies the operations in the log file, and generates a new fsimage file. Fourth, the Secondary NameNode sends the new fsimage file back to the NameNode with HTTP POST. Finally, the NameNode replaces the old fsimage file and the old log file with the new fsimage file and the new log file (created in the first step), and updates the fstime file with the time of this checkpoint. As a result, the fsimage file on the NameNode holds the metadata of the latest checkpoint, and the log file starts over, so it does not grow too large.
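The steps above can be sketched in a few lines. Everything here is a simplification under stated assumptions: `ToyNameNode`, `merge`, and `checkpoint` are hypothetical names, the HTTP GET and POST transfers are replaced by plain method calls, and the fstime file is modeled as a timestamp attribute.

```python
import time


class ToyNameNode:
    def __init__(self):
        self.fsimage = set()    # last checkpoint (toy: a set of paths)
        self.edits = []         # current edit log
        self.edits_new = None   # log opened while a checkpoint is running
        self.fstime = None      # time of the last checkpoint

    def log(self, op, path):
        # While a checkpoint is in progress, edits go to the new log.
        target = self.edits if self.edits_new is None else self.edits_new
        target.append((op, path))

    def roll_edit_log(self):
        # Step 1: start a new log. Step 2: hand out fsimage and the old log.
        self.edits_new = []
        return set(self.fsimage), list(self.edits)

    def install_checkpoint(self, new_fsimage):
        # Step 5: swap in the new fsimage and the new log, record the time.
        self.fsimage = new_fsimage
        self.edits = self.edits_new
        self.edits_new = None
        self.fstime = time.time()


def merge(fsimage, edits):
    # Step 3: apply the old log to the old image to get a new image.
    merged = set(fsimage)
    for op, path in edits:
        if op == "mkdir":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
    return merged


def checkpoint(namenode):
    fsimage, old_edits = namenode.roll_edit_log()
    namenode.install_checkpoint(merge(fsimage, old_edits))  # step 4 is this call
```

Edits that arrive between the roll and the install land in the new log, so nothing is lost while the merge runs.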

 

Origin blog.csdn.net/qq_32445015/article/details/102882847