The Trash (recycle bin) mechanism in HDFS

File system trash background

  • The Recycle Bin (trash can) is a system folder in the Windows operating system. It temporarily holds the files and folders that users have deleted, and anything stored there can be restored.
  • The Recycle Bin gives us a second chance after a mistaken deletion. It keeps deleted files, folders, pictures, and so on, and these items remain there until the Recycle Bin is emptied.
  • HDFS is also a file system, so it too has to handle the deletion of file data.
  • By default, the trash function in HDFS is disabled, so deleted data is removed immediately.

Functional Overview

  • The HDFS Trash mechanism is designed to guard against accidental deletion. It is not enabled by default.
  • After the Trash function is enabled, files or directories deleted from HDFS are not cleared immediately. They are moved to the Current directory of the user's trash (/user/${username}/.Trash/Current).
  • Files in .Trash are permanently deleted after a user-configurable delay.
  • To restore files or directories, simply move them from the trash to a location outside the .Trash directory.
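
As a rough illustration of the path layout described above, the following sketch computes where a deleted path would be expected to land. The helper name trash_path, the user alice, and the file path are all hypothetical, and the sketch assumes the default /user home directory prefix.

```shell
# Sketch: compute where a deleted path is expected to land in the trash.
# trash_path is a hypothetical helper, not an HDFS command.
trash_path() {
  # $1 = HDFS user name, $2 = absolute HDFS path that was deleted
  printf '/user/%s/.Trash/Current%s\n' "$1" "$2"
}

# Hypothetical user and file, for illustration only.
trash_path alice /data/logs/app.log
# prints /user/alice/.Trash/Current/data/logs/app.log
```

The original absolute path is kept underneath Current, which is what makes it possible to work out where a restored file came from.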

Trash Checkpoint

  • A checkpoint is simply a directory under the user's trash that holds all files and directories deleted before the checkpoint was created.
  • A checkpoint directory has the form /user/${username}/.Trash/{timestamp_of_checkpoint_creation}
  • Recently deleted files are moved to the Current directory of the trash. At a configurable interval, HDFS creates a checkpoint /user/${username}/.Trash/<date> from the files in Current, and deletes old checkpoints once they expire.
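
The expiry rule can be modeled with a little arithmetic. This is only a sketch of the idea, not how HDFS implements it internally: is_expired is a hypothetical helper, and the 1440-minute retention period is an assumed example value.

```shell
# is_expired is a hypothetical helper that models the retention rule:
# a checkpoint expires once it is older than the retention period.
is_expired() {
  # $1 = checkpoint creation time (epoch seconds)
  # $2 = current time (epoch seconds)
  # $3 = retention period in minutes
  [ $(( $2 - $1 )) -gt $(( $3 * 60 )) ]
}

now=$(date +%s)
two_days_ago=$(( now - 2 * 24 * 3600 ))

# With an assumed retention of 1440 minutes (24 hours),
# a checkpoint created two days ago is past the window.
if is_expired "$two_days_ago" "$now" 1440; then
  echo "checkpoint is past the retention window"
fi
```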

Enabling the Trash function

Shut down the HDFS cluster

  • On node1, run the following command to shut down the HDFS cluster: stop-dfs.sh

Modify core-site.xml

  • Edit the core-site.xml file on node1 and add the following two properties.
  • fs.trash.interval: the number of minutes after which files in the trash are permanently deleted by the system. A value of 0 disables the trash function.
  • fs.trash.checkpoint.interval: the number of minutes between the creation of two consecutive checkpoints. It should be less than or equal to fs.trash.interval. Each time a new checkpoint is created, checkpoints older than fs.trash.interval minutes are permanently deleted by the system. If 0, the value of fs.trash.interval is used.
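
A minimal sketch of what the two properties might look like inside the <configuration> element of core-site.xml. The values 1440 and 60 are illustrative assumptions, not values taken from the original article; choose ones that suit your retention needs.

```xml
<!-- Illustrative values only. -->
<!-- Keep deleted files for 1440 minutes (24 hours) before permanent removal. -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
<!-- Create a checkpoint every 60 minutes; must not exceed fs.trash.interval. -->
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>60</value>
</property>
```

After saving the file, start the cluster again (for example with start-dfs.sh) so that the new settings take effect. If other nodes keep their own copy of core-site.xml, they would presumably need the same change.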

Delete files to trash

  • Once the trash function is enabled, a normal delete operation no longer removes the file directly. The file is moved to the trash instead.
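
A sketch of a normal delete followed by a check of the trash. It assumes that /tmp/demo/a.txt is a hypothetical file that already exists in HDFS and that the HDFS user name matches the local login name. The hadoop calls only run when the hadoop command is available.

```shell
# Hypothetical file; replace with a real path on your cluster.
target=/tmp/demo/a.txt
# Expected location of the file after deletion (assumes HDFS user = local user).
trash_copy="/user/$(id -un)/.Trash/Current${target}"

if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -rm "$target"        # with trash enabled, this moves the file
  hadoop fs -ls "$trash_copy"    # the file should now be listed here
else
  echo "hadoop is not on PATH; expected trash copy: $trash_copy"
fi
```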

Delete files and skip the trash

  • Add the -skipTrash parameter to the delete operation to bypass the trash and delete the file immediately.
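
For comparison, a sketch of a delete that bypasses the trash. The path /tmp/demo/b.txt is hypothetical, and a file removed this way cannot be restored from .Trash.

```shell
# Hypothetical file; replace with a real path on your cluster.
target=/tmp/demo/b.txt
cmd="hadoop fs -rm -skipTrash $target"

if command -v hadoop >/dev/null 2>&1; then
  # Permanently removes the file without moving it to .Trash.
  hadoop fs -rm -skipTrash "$target"
else
  echo "hadoop is not on PATH; would run: $cmd"
fi
```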

Recover files from trash

  • Files in the trash can be restored with commands at any time before they expire and are automatically deleted.
  • Use hadoop fs -mv or hadoop fs -cp to move or copy the data files out of the trash directory.
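
A sketch of restoring a file by moving it back out of the trash. It assumes a hypothetical file that was originally at /tmp/demo/a.txt, that it still sits under Current, and that the HDFS user name matches the local login name.

```shell
# Hypothetical original location of the deleted file.
original=/tmp/demo/a.txt
# Where the file is expected to sit in the trash (assumes HDFS user = local user).
in_trash="/user/$(id -un)/.Trash/Current${original}"
parent=$(dirname "$original")

if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -mkdir -p "$parent"            # make sure the destination directory exists
  hadoop fs -mv "$in_trash" "$original"    # move the file back out of .Trash
  hadoop fs -ls "$original"                # confirm it is back in place
else
  echo "hadoop is not on PATH; would move $in_trash to $original"
fi
```

Using -cp instead of -mv would leave a copy in the trash until it expires.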

Clear trash

  • Besides the automatic deletion on expiry controlled by the fs.trash.interval parameter, users can also empty the trash manually with a command to free up HDFS storage space.
  • HDFS provides a command-line tool for this: hadoop fs -expunge. This command immediately deletes expired checkpoints from the file system and rolls the current contents of the trash into a new checkpoint.
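
A sketch of emptying the trash by hand and then inspecting what is left. It assumes the HDFS user name matches the local login name.

```shell
# Trash root of the current user (assumes HDFS user = local user).
trash_root="/user/$(id -un)/.Trash"

if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -expunge            # remove expired checkpoints
  hadoop fs -ls "$trash_root"   # list the checkpoints that remain
else
  echo "hadoop is not on PATH; would expunge and list $trash_root"
fi
```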

Origin blog.csdn.net/weixin_49750432/article/details/132163232