Chapter 2: Detailed Explanation of the Distributed File System HDFS

1. HDFS system architecture diagram

The roles of the nodes shown in the architecture diagram are as follows:

NameNode

  • The master node. It stores the metadata of files, such as the file name, the directory structure, and file attributes (creation time, number of replicas, file permissions), as well as the block list of each file and the DataNodes where each block is located.
  • This metadata is persisted on the local disk as two files, "fsimage" (the HDFS metadata image file) and "editlog" (the log of HDFS file changes), from which the in-memory metadata is rebuilt when HDFS restarts.
  • The NameNode is a single point of failure; this is addressed with a hot-standby high-availability (HA) configuration.
  • Storage location:
    • One part is kept in memory
    • The other part is stored on the local disk (the fsimage image file and the edits edit log), so that if the in-memory data is lost it can be recovered from disk

DataNode

  • Stores file block data in the local file system, along with checksum information for each block (length, creation time, CRC32 checksum).
  • The yellow circles in the figure above indicate 7 replicas spread across the slave nodes; keeping multiple replicas protects the data against loss.
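The checksum idea can be illustrated with a small sketch. This is not the DataNode's actual on-disk format; the `BlockMeta` record and the helper names are hypothetical, and Python's `zlib.crc32` stands in for the CRC32 computation. It only shows how a stored length and checksum let a reader detect a damaged block.

```python
import time
import zlib
from dataclasses import dataclass


@dataclass
class BlockMeta:
    """Hypothetical per-block record: length, creation time, CRC32 checksum."""
    length: int
    created_at: float
    crc32: int


def store_block(data: bytes) -> BlockMeta:
    """Compute the record a DataNode-like store would keep beside the block."""
    return BlockMeta(length=len(data), created_at=time.time(), crc32=zlib.crc32(data))


def verify_block(data: bytes, meta: BlockMeta) -> bool:
    """Return True only if the data still matches the recorded length and CRC32."""
    return len(data) == meta.length and zlib.crc32(data) == meta.crc32


block = b"hello hdfs block"
meta = store_block(block)
intact = verify_block(block, meta)                   # unchanged data
corrupted = verify_block(b"hellO hdfs block", meta)  # one flipped bit
```

When verification fails, a reader can fall back to another replica of the same block, which is why checksums and replication work together.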

Secondary NameNode

  • An auxiliary daemon that monitors the status of HDFS and synchronizes a backup (snapshot) of the NameNode metadata every hour. This backup is used for recovery when the NameNode metadata is damaged and cannot be repaired. In theory, at most 1 hour of changes is lost.
  • It periodically merges the NameNode image file (fsimage) and the edit log file (editlog) into a single file, which becomes the new image file.
  • It can help restore the NameNode, but the Secondary NameNode is not a hot backup of the NameNode.
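The merge of fsimage and editlog can be pictured as replaying logged operations on top of a saved snapshot. The sketch below assumes a deliberately simplified, hypothetical format (the image maps a path to its replica count, and each log entry is an `(operation, path, replication)` tuple); the real files are binary and far richer.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified formats: the image maps a path to its replica
# count, and each edit-log entry is an (operation, path, replication) tuple.
Image = Dict[str, int]
Edit = Tuple[str, str, int]


def checkpoint(fsimage: Image, editlog: List[Edit]) -> Tuple[Image, List[Edit]]:
    """Replay the edit log on a copy of the image; return (new image, empty log)."""
    new_image = dict(fsimage)
    for op, path, replication in editlog:
        if op == "create":
            new_image[path] = replication
        elif op == "set_replication":
            if path in new_image:
                new_image[path] = replication
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return new_image, []


fsimage = {"/data/a.txt": 3}
editlog = [
    ("create", "/data/b.txt", 3),
    ("set_replication", "/data/a.txt", 2),
    ("delete", "/data/b.txt", 0),
    ("create", "/logs/c.log", 3),
]
new_fsimage, new_editlog = checkpoint(fsimage, editlog)
```

Because the new image already contains every logged change, the edit log can start empty again, which keeps the next replay short.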

NameNode startup process

1. When the NameNode starts, it first loads the fsimage (image file) into memory and replays the operations recorded in the edit log (editlog);

2. Once the file system metadata has been rebuilt in memory, it writes a new fsimage file (this step does not require the Secondary NameNode) and an empty editlog;

3. While in safe mode, each DataNode sends the latest state of its block list to the NameNode;

4. The NameNode is running in safe mode at this point, meaning its file system is read-only for clients: listing directories and reading file contents work, while writes, deletes, and renames fail;

5. The NameNode starts listening for RPC and HTTP requests. RPC (Remote Procedure Call) is a protocol for requesting a service from a program on a remote computer over the network without needing to know the underlying network technology;

6. The locations of data blocks are not persisted by the NameNode; they are held by the DataNodes in the form of block lists;

7. During normal operation, the NameNode keeps the mapping of all blocks to their locations in memory.
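Steps 3 to 7 can be sketched as a toy model in which block locations exist only in memory and are rebuilt from block reports. The `MiniNameNode` class and its exit rule (leave safe mode once every expected block has at least one reported replica) are assumptions made for illustration; the real exit condition in HDFS is configurable and more involved.

```python
from collections import defaultdict
from typing import Dict, List, Set


class MiniNameNode:
    """Toy model: block locations live only in memory and come from block reports."""

    def __init__(self, expected_blocks: Set[str]) -> None:
        # Block IDs known from the namespace (fsimage plus replayed edits).
        self.expected_blocks = expected_blocks
        # Block ID -> DataNodes holding a replica; rebuilt on every start.
        self.block_map: Dict[str, Set[str]] = defaultdict(set)
        self.safe_mode = True

    def receive_block_report(self, datanode: str, blocks: List[str]) -> None:
        for block_id in blocks:
            self.block_map[block_id].add(datanode)
        # Assumption for this sketch: leave safe mode once every expected
        # block has at least one reported replica.
        if all(self.block_map.get(b) for b in self.expected_blocks):
            self.safe_mode = False

    def write(self, path: str) -> str:
        if self.safe_mode:
            raise PermissionError(f"cannot write {path}: NameNode is in safe mode")
        return f"write accepted for {path}"


nn = MiniNameNode(expected_blocks={"blk_1", "blk_2"})
nn.receive_block_report("dn1", ["blk_1"])
still_safe = nn.safe_mode  # blk_2 has not been reported yet

try:
    nn.write("/data/a.txt")
    rejected = False
except PermissionError:
    rejected = True  # writes fail while in safe mode

nn.receive_block_report("dn2", ["blk_1", "blk_2"])
accepted = nn.write("/data/a.txt")
```

Keeping the block map out of the persisted metadata means the DataNodes remain the source of truth for which replicas actually exist.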

Summary

  • NameNode metadata (namespace) persistence: the fsimage and edits files

  • What formatting the NameNode does:
    • Creates the fsimage file, which stores the metadata image
    • Creates the edits file

  • NameNode startup process:
    1. Load the fsimage and edits files
    2. Generate new fsimage and edits files
    3. Wait for the DataNodes to register and send their block reports

  • DataNode startup process:
    • Register with the NameNode and send a block report

  • NameNode safe mode (SafeMode): while the DataNodes send their block reports, the file system is read-only for clients

The block is the smallest unit of HDFS storage

A file is divided into blocks and stored block by block. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later), and 128 MB is the commonly used setting. By default, each block is stored as 3 copies in total (including the original) on different machines; the number of copies is configurable.

The block size is set in hdfs-site.xml through dfs.block.size (named dfs.blocksize in Hadoop 2.x and later), for example 134217728 bytes (128 MB).
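As a quick check of the arithmetic, the sketch below computes how a file would be split for a given block size and replica count. The `block_layout` helper is hypothetical and assumes the last block only takes up the bytes that remain, rather than a full block.

```python
MB = 1024 * 1024


def block_layout(file_size: int, block_size: int = 128 * MB, replication: int = 3):
    """Return (number of blocks, size of the last block, total raw bytes stored)."""
    if file_size == 0:
        return 0, 0, 0
    # Integer ceiling division: a partial final block still counts as a block.
    num_blocks = (file_size + block_size - 1) // block_size
    last_block = file_size - (num_blocks - 1) * block_size
    raw_bytes = file_size * replication
    return num_blocks, last_block, raw_bytes


configured_block_size = 128 * MB  # the value 134217728 from hdfs-site.xml
blocks, last, raw = block_layout(300 * MB)
```

For a 300 MB file this gives two full 128 MB blocks plus one 44 MB block, and 900 MB of raw storage across the cluster with 3 replicas.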
