Java Big Data Road -- HDFS Explained (4): Recycle Bin Mechanism and the dfs Directory

HDFS (Hadoop Distributed File System): recycle bin mechanism and the dfs directory

Table of Contents

1. Recycle bin mechanism
   - Overview
   - Configuration
   - Precautions
2. dfs directory
   - Overview
3. Viewing the edits and fsimage files


1. Recycle bin mechanism

Overview

  1. In HDFS, the recycle bin mechanism is disabled by default: files deleted from HDFS are removed immediately

  2. The recycle bin can be enabled through configuration, and a retention time for deleted files can be specified

Configuration

  1. Enter the etc/hadoop subdirectory under the Hadoop installation directory: cd hadoop-2.7.1/etc/hadoop
  2. Edit core-site.xml: vim core-site.xml
  3. Add the following property:
<property>
    <name>fs.trash.interval</name>
    <value>1440</value>
</property>

Here, 1440 is the number of minutes a file is retained in the recycle bin after being moved there. If the property is not configured, the default is 0, which means the recycle bin is disabled.
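After editing core-site.xml, the setting can be checked from the command line. A minimal sketch, assuming a Hadoop 2.x installation with the `hdfs` and `*-dfs.sh` scripts on the PATH:

```shell
# Print the effective trash interval; this shows 1440 once the property is set
# and the configuration has been reloaded.
hdfs getconf -confKey fs.trash.interval

# The setting is read at startup, so restart HDFS after changing core-site.xml.
stop-dfs.sh
start-dfs.sh
```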

Precautions

  1. In the recycle bin configuration, the unit of value is minutes. A value of 0 means the HDFS recycle bin mechanism is disabled.
  2. A value of 1440 means a retention period of 1 day: a file is purged after spending 1 day in the recycle bin
  3. With the recycle bin enabled, a deleted file is moved into the current user's trash directory instead of being removed immediately
  4. The files in the recycle bin can be listed recursively: hadoop fs -lsr /user/root/.Trash
  5. To restore a deleted file, simply move it back out of the trash with HDFS's mv command
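The delete/restore cycle above can be sketched as follows. The paths are hypothetical; this assumes the trash is enabled and the commands run as user root:

```shell
# Delete a file: with the trash enabled it is moved into the trash,
# not destroyed immediately.
hadoop fs -rm /demo/a.txt

# List the trash recursively to locate the deleted file.
hadoop fs -lsr /user/root/.Trash

# Restore: move the file back out of the trash with HDFS's mv command.
hadoop fs -mv /user/root/.Trash/Current/demo/a.txt /demo/a.txt

# Bypass the trash and delete permanently.
hadoop fs -rm -skipTrash /demo/a.txt
```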

2. dfs directory

Overview

  1. The dfs directory is the storage directory of HDFS (it appears after the NameNode is formatted)
    1. dfs/name is the persistence directory of the NameNode
    2. dfs/data is the storage directory of the DataNodes
    3. dfs/namesecondary is the storage directory of the SecondaryNameNode
  2. in_use.lock marks that the corresponding process is already running on the current node; it prevents multiple NameNodes from being started on the same server and thus avoids management conflicts
  3. When HDFS starts for the first time, it automatically rolls the edits file after 1 minute
  4. In HDFS, every write operation is assigned a globally increasing number, called the transaction id (txid)
  5. In HDFS, opening and closing a log segment are themselves recorded as operations: every edits file begins with OP_START_LOG_SEGMENT and ends with OP_END_LOG_SEGMENT
  6. Uploading a file produces the following operations:
    1. OP_ADD adds the file to the specified HDFS directory under a name ending in ._COPYING_, indicating that the file has not been fully written yet
    2. OP_ALLOCATE_BLOCK_ID allocates a block ID for the file
    3. OP_SET_GENSTAMP_V2 generates a generation stamp (a globally unique version number) for the block
    4. OP_ADD_BLOCK writes the block data
    5. OP_CLOSE indicates that the block data has been fully written
    6. OP_RENAME_OLD renames the file (dropping the ._COPYING_ suffix), indicating that the write is complete
  7. Once a file has been uploaded, it cannot be modified
  8. The md5 file is used to verify the fsimage file and guard against tampering
  9. The VERSION file:
    1. clusterID - the cluster ID. It is computed automatically when the NameNode is formatted, so every format produces a new clusterID (after reformatting, the original DataNodes are no longer recognized and must be given the new clusterID). The NameNode distributes the clusterID to every DataNode, and a DataNode accepts it only once; each subsequent communication carries the clusterID so that both sides can verify they belong to the same cluster.
    2. blockpoolID - the block pool ID. This requires understanding federated HDFS first: because a cluster ordinarily has only one NameNode, the NameNode easily becomes the concurrency bottleneck of HDFS. Federation splits the original NameNode into multiple NameNodes by directory, while clients still treat them as a single NameNode; each path is fixed to one node, which effectively increases concurrency, but the drawback is that the paths cannot be changed. With multiple NameNodes instead of one, sending multiple clusterIDs would increase the burden on both the DataNodes and the network, so the NameNodes within the same federation are required to share the same blockpoolID in order to act as a single whole.
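The layout described above can be inspected directly on the NameNode host. A sketch, assuming the dfs directory sits under the configured hadoop.tmp.dir (the actual path depends on your configuration):

```shell
# The dfs directory appears after `hdfs namenode -format`.
# Its current/ subdirectory holds the persisted metadata files.
ls dfs/name/current
# Expected kinds of files: fsimage_*, fsimage_*.md5, edits_*, seen_txid, VERSION

# The VERSION file holds, among other fields, the clusterID and blockpoolID
# discussed above (plus namespaceID, cTime, storageType, layoutVersion).
cat dfs/name/current/VERSION
```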

3. Viewing the edits and fsimage files

  1. View an edits file: hdfs oev -i <edits file> -o xxx.xml. For example: hdfs oev -i edits_0000000000000000001-0000000000000000003 -o edits.xml
  2. View an fsimage file: hdfs oiv -i <fsimage file> -o xxx.xml -p XML. For example: hdfs oiv -i fsimage_0000000000000000012 -o fsimage.xml -p XML
  3. About fsimage: fsimage is a binary file that records the metadata of all files and directories in HDFS.
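Because both tools emit readable XML, the operations listed in section 2 can be checked directly in the converted output. A sketch using the example file names above:

```shell
# Convert an edits file to XML, then list the opcodes it contains;
# the first and last should be OP_START_LOG_SEGMENT and OP_END_LOG_SEGMENT.
hdfs oev -i edits_0000000000000000001-0000000000000000003 -o edits.xml
grep '<OPCODE>' edits.xml

# Convert an fsimage file to XML and browse the recorded file/directory metadata.
hdfs oiv -i fsimage_0000000000000000012 -o fsimage.xml -p XML
less fsimage.xml
```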


Origin: blog.csdn.net/a34651714/article/details/102819263