The Recycle Bin (trash can) is a system folder in the Windows operating system. It stores the documents and files that users have deleted, and files in the Recycle Bin can be restored.
The Recycle Bin gives users a chance to undo a deletion they regret. It keeps deleted files, folders, pictures, and so on, and these items remain there until the Recycle Bin is emptied.
HDFS is also a file system, so it too has to deal with the deletion of file data.
By default, HDFS has no recycle bin, and deleted data is removed immediately.
Functional Overview
The HDFS Trash mechanism is designed to protect against accidental deletion. It is not enabled by default.
After the Trash function is enabled, files or directories deleted from HDFS are not removed immediately. Instead, they are moved to the Current directory of the recycle bin (/user/${username}/.Trash/Current).
Files in .Trash are permanently deleted after a user-configurable time delay.
To restore files and directories, simply move them from the recycle bin to a location outside the .Trash directory.
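To make the path layout concrete, here is a minimal Python sketch of where a deleted path ends up. It assumes the common layout in which the original absolute path is preserved underneath the Current directory; the helper name `trash_path`, the user `alice`, and the sample path are hypothetical and used only for illustration.

```python
import posixpath


def trash_path(username: str, deleted_path: str) -> str:
    """Return where an absolute HDFS path is expected to land in the user's trash.

    Assumption: the original absolute path is kept below
    /user/<username>/.Trash/Current.
    """
    if not deleted_path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    current = posixpath.join("/user", username, ".Trash", "Current")
    # Append the normalized original path under the Current directory.
    return current + posixpath.normpath(deleted_path)


print(trash_path("alice", "/data/logs/app.log"))
# /user/alice/.Trash/Current/data/logs/app.log
```

Because the original directory structure is kept, the location to restore to can be read straight off the trash path.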
Trash Checkpoint
A checkpoint is simply a directory under the user's Recycle Bin that stores any files or directories that were deleted before the checkpoint was created.
The checkpoint directory is /user/${username}/.Trash/{timestamp_of_checkpoint_creation}.
Recently deleted files are moved to the Current directory of the recycle bin. At a configurable time interval, HDFS creates a checkpoint (/user/${username}/.Trash/<date>) from the files in the Current directory and deletes old checkpoints once they expire.
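The checkpoint cycle can be sketched as a small in-memory simulation. This is not Hadoop code: the function `run_checkpoint_cycle` and the dictionary model are hypothetical, and the timestamp format is an assumption meant to resemble Hadoop's yyMMddHHmmss checkpoint directory names.

```python
from datetime import datetime, timedelta

# Assumption: checkpoint directories are named like yyMMddHHmmss.
CHECKPOINT_FORMAT = "%y%m%d%H%M%S"


def run_checkpoint_cycle(trash, now, retention_minutes):
    """Roll Current into a new checkpoint, then drop expired checkpoints.

    `trash` maps directory names under .Trash to lists of trashed paths.
    """
    # 1. Turn the contents of Current into a timestamped checkpoint.
    if trash.get("Current"):
        trash[now.strftime(CHECKPOINT_FORMAT)] = trash.pop("Current")
    # 2. Delete checkpoints created more than retention_minutes ago.
    cutoff = now - timedelta(minutes=retention_minutes)
    for name in list(trash):
        if name != "Current" and datetime.strptime(name, CHECKPOINT_FORMAT) < cutoff:
            del trash[name]
    return trash


trash = {
    "Current": ["/data/new.csv"],
    "240101100000": ["/data/old.csv"],  # created 2024-01-01 10:00:00
}
now = datetime(2024, 1, 2, 12, 0, 0)
print(sorted(run_checkpoint_cycle(trash, now, retention_minutes=1440)))
# ['240102120000']
```

In this run the old checkpoint is more than 1440 minutes old and is dropped, while the files that were in Current survive in the newly created checkpoint.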
Enabling the Trash function
Shut down the HDFS cluster
On node1, run the following command to shut down the HDFS cluster: stop-dfs.sh
Modify core-site.xml
Modify the core-site.xml file on node1 to add the following two properties:
fs.trash.interval: the number of minutes after which files in the recycle bin are permanently deleted by the system. If set to 0, the trash function is disabled.
fs.trash.checkpoint.interval: the interval, also in minutes, between the creation of two successive checkpoints. It should be less than or equal to fs.trash.interval. Each time a new checkpoint is created, checkpoints older than fs.trash.interval minutes are permanently deleted. If set to 0, the value of fs.trash.interval is used.
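As an illustration of the two properties, the Python sketch below renders them in the `<property>` form used by core-site.xml and shows how a checkpoint interval of 0 falls back to fs.trash.interval. The values 1440 and 60 are example assumptions, not recommendations, and the helper names are hypothetical.

```python
def effective_checkpoint_interval(interval_min: int, checkpoint_min: int) -> int:
    """Return the checkpoint interval that would actually apply.

    A value of 0 falls back to fs.trash.interval. Treating a value larger
    than fs.trash.interval the same way is an assumption of this sketch.
    """
    if checkpoint_min <= 0 or checkpoint_min > interval_min:
        return interval_min
    return checkpoint_min


def render_property(name: str, value: int) -> str:
    """Render one Hadoop configuration property as an XML fragment."""
    return (
        "  <property>\n"
        f"    <name>{name}</name>\n"
        f"    <value>{value}</value>\n"
        "  </property>"
    )


# Example values only: keep trash for one day, checkpoint every hour.
settings = {"fs.trash.interval": 1440, "fs.trash.checkpoint.interval": 60}
print("\n".join(render_property(n, v) for n, v in settings.items()))
print(effective_checkpoint_interval(1440, 0))  # 1440
```

The printed fragments would go inside the existing `<configuration>` element of core-site.xml.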
Delete files to trash
After the trash function is turned on, a normal deletion does not remove the file directly; the file is moved to the recycle bin instead.
Delete files, skipping the trash
Add the -skipTrash option to the delete command to bypass the recycle bin and delete the file directly.
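The two kinds of deletion differ only in one option. The Python sketch below builds the corresponding `hadoop fs -rm` command lines without running them; the helper `rm_command` and the sample path are hypothetical.

```python
import shlex


def rm_command(path: str, skip_trash: bool = False, recursive: bool = False) -> list:
    """Build the argument list for deleting an HDFS path."""
    argv = ["hadoop", "fs", "-rm"]
    if recursive:
        argv.append("-r")  # needed when the path is a directory
    if skip_trash:
        argv.append("-skipTrash")  # bypass the recycle bin entirely
    argv.append(path)
    return argv


# Normal deletion: with trash enabled, the file is moved to the recycle bin.
print(shlex.join(rm_command("/data/logs/app.log")))
# hadoop fs -rm /data/logs/app.log

# Direct deletion: the file cannot be recovered from the recycle bin.
print(shlex.join(rm_command("/data/logs/app.log", skip_trash=True)))
# hadoop fs -rm -skipTrash /data/logs/app.log
```

The returned list can be passed to `subprocess.run` on a machine where the `hadoop` client is installed.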
Recover files from trash
Files in the recycle bin can be restored with ordinary commands as long as they have not yet expired and been deleted automatically.
Use the mv or cp command to move or copy the data files out of the trash directory.
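A restore is just a move or copy from the trash back to a normal location. The Python sketch below builds such a command; it assumes the file's original path is preserved under the trash directory, and the helper `restore_command`, the user `alice`, and the sample path are hypothetical.

```python
import shlex


def restore_command(username: str, original_path: str,
                    checkpoint: str = "Current", copy: bool = False) -> list:
    """Build the command that brings a trashed file back to its original path.

    `checkpoint` is either "Current" or the name of a checkpoint directory.
    """
    trashed = f"/user/{username}/.Trash/{checkpoint}{original_path}"
    operation = "-cp" if copy else "-mv"
    # Note: the parent directory of original_path must exist for this to succeed.
    return ["hadoop", "fs", operation, trashed, original_path]


print(shlex.join(restore_command("alice", "/data/logs/app.log")))
# hadoop fs -mv /user/alice/.Trash/Current/data/logs/app.log /data/logs/app.log
```

Using `copy=True` leaves the copy in the trash until it expires, which can be a safer choice when testing a restore.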
Empty the trash
In addition to the automatic deletion on expiry controlled by the fs.trash.interval parameter, users can also empty the recycle bin manually with a command to free up HDFS storage space.
HDFS provides a command-line tool for this: hadoop fs -expunge. This command immediately deletes expired checkpoints from the file system.
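As a small sketch of a manual cleanup, the Python helper below builds two commands: one that reports how much space a user's trash directory occupies, and the expunge command itself. The helper name `empty_trash_commands` and the user `alice` are hypothetical, and the commands are only printed here, not executed.

```python
import shlex


def empty_trash_commands(username: str) -> list:
    """Build the commands for checking and then emptying a user's trash."""
    trash_dir = f"/user/{username}/.Trash"
    return [
        # Summarize the space used by the trash directory, human-readable.
        ["hadoop", "fs", "-du", "-s", "-h", trash_dir],
        # Delete expired checkpoints right away.
        ["hadoop", "fs", "-expunge"],
    ]


for argv in empty_trash_commands("alice"):
    print(shlex.join(argv))
# hadoop fs -du -s -h /user/alice/.Trash
# hadoop fs -expunge
```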