How HDFS maintains data consistency

Data consistency usually refers to whether the logical relationships among related pieces of data are correct and complete. This post looks at how HDFS ensures data consistency.

HDFS keeps metadata on the NameNode and stores the actual data on DataNodes, replicating each block 3 times by default. So how does HDFS ensure data consistency?

1. The NameNode mechanism of HDFS

HDFS has only one NameNode; if the NameNode fails, block location information can no longer be found. While running, the NameNode caches its metadata in memory and persists it to an fsimage file on disk, and every subsequent namespace change is appended to an editLog file, so the in-memory state is effectively the combination of the two. When are the editLog and fsimage merged? HDFS introduces the SecondaryNameNode for this. Periodically, or once the editLog accumulates enough changes (both thresholds can be set in hdfs-site.xml), the SecondaryNameNode copies the fsimage and editLog files from the NameNode, merges them, and pushes the updated fsimage back to the NameNode. If the metadata on the NameNode is lost, the copy held by the SecondaryNameNode serves as a backup so that the metadata is not lost entirely.
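
As a rough illustration, the checkpoint trigger is usually tuned with the dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns properties in hdfs-site.xml; the sketch below simply reads them with the Hadoop Configuration API (property names and defaults are taken from recent Hadoop releases and may differ in older versions).

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // loaded from the classpath if present

        // Merge the editLog into a new fsimage at least once per period (seconds)...
        long periodSec = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        // ...or as soon as this many uncheckpointed transactions have accumulated.
        long txns = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000);

        System.out.println("checkpoint period (s): " + periodSec);
        System.out.println("checkpoint txn threshold: " + txns);
    }
}
```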

2. Heartbeat mechanism

DataNodes confirm their status to the NameNode through heartbeat messages (every 3 seconds by default, configurable), and the NameNode updates its metadata about each DataNode accordingly. If a DataNode stops heartbeating, the NameNode stops trusting it (no further reads or writes are directed to it) and re-replicates the blocks that were stored on that node to other DataNodes. The target nodes for this re-replication are chosen mainly by network topology distance and the load on the candidate nodes.
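
For reference, both the heartbeat interval and the window after which a silent DataNode is declared dead are configurable. The sketch below assumes the standard property names dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval and shows how the dead-node timeout is commonly derived from them.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // DataNodes heartbeat to the NameNode every 3 seconds by default.
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        // The NameNode re-checks for expired heartbeats every 5 minutes by default (ms).
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);

        // A DataNode is considered dead after roughly:
        //   2 * recheck-interval + 10 * heartbeat-interval
        // which is 10 minutes 30 seconds with the defaults above.
        long deadTimeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
        System.out.println("dead-node timeout (ms): " + deadTimeoutMs);
    }
}
```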

3. Safe Mode

HDFS enters safe mode while it starts up. In safe mode the namespace is effectively read-only: the NameNode does not allow modifications to files or blocks. The NameNode checks the block reports from the connected DataNodes, and only when the proportion of blocks meeting the minimum replication level reaches the configured threshold does it leave safe mode.
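
As a hedged sketch (written against the Hadoop 2/3 APIs), the exit threshold is governed by dfs.namenode.safemode.threshold-pct (0.999 by default), and a client can ask whether the NameNode is still in safe mode through the DistributedFileSystem API; the address hdfs://namenode:8020 below is only a placeholder.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Fraction of blocks that must meet minimal replication before safe mode is left.
        float threshold = conf.getFloat("dfs.namenode.safemode.threshold-pct", 0.999f);
        System.out.println("safe mode threshold: " + threshold);

        // Placeholder NameNode address; SAFEMODE_GET only queries the current state.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
        System.out.println("in safe mode: " + inSafeMode);
    }
}
```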

4. Rollback mechanism

When HDFS is upgraded, or while data is being written, the previous version of the relevant data is kept as a backup. If the operation succeeds, the backup is updated; if it fails, HDFS falls back to the backed-up data.

5. Checksum verification

To guard against data corruption during network transfer, HDFS uses a checksum mechanism. Whenever data is replicated between nodes or read back, the checksum is recomputed and compared: if verification passes, the replica is accepted; otherwise the replica is considered failed and the copy is made again.
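
Concretely, HDFS stores a CRC checksum for every chunk of dfs.bytes-per-checksum bytes (512 by default) and verifies it on every read and block transfer. The sketch below just fetches the aggregate checksum of a hypothetical file through the public FileSystem API, assuming fs.defaultFS points at the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // assumes fs.defaultFS is an hdfs:// URI

        // Hypothetical path, used purely for illustration.
        Path p = new Path("/data/example.txt");

        // HDFS checksums every dfs.bytes-per-checksum bytes (512 by default) and verifies
        // them on read and replication; getFileChecksum exposes an aggregate checksum
        // computed over the per-chunk CRCs of the whole file.
        FileChecksum checksum = fs.getFileChecksum(p);
        System.out.println(checksum.getAlgorithmName() + ": " + checksum);
    }
}
```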

6. Recycle Bin

When a file is deleted from HDFS, it does not disappear immediately; it is moved into the user's .Trash directory. If you delete a file by mistake, you can recover it from there. How long files are kept in the trash is controlled by fs.trash.interval; after that time the NameNode removes the file's metadata and the corresponding blocks on the DataNodes are deleted.
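
A minimal sketch, assuming fs.trash.interval is set to 1440 minutes (24 hours) purely as an example: the Trash helper moves a hypothetical path into the current user's .Trash directory instead of deleting it outright, which mirrors what hdfs dfs -rm does when the trash feature is enabled.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Keep deleted files in the trash for 1440 minutes (24 hours); 0 disables the trash.
        conf.setLong("fs.trash.interval", 1440);
        FileSystem fs = FileSystem.get(conf); // assumes fs.defaultFS is an hdfs:// URI

        // Hypothetical path; moveToAppropriateTrash moves it into the user's .Trash
        // directory instead of deleting it immediately.
        Path p = new Path("/data/obsolete.txt");
        boolean moved = Trash.moveToAppropriateTrash(fs, p, conf);
        System.out.println("moved to trash: " + moved);
    }
}
```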

 
