1. Background
While checking the status of the CDH nodes, it was noticed that the disk on one of the nodes was nearly full.
2. Locating the problem
Checking disk usage on the Linux host showed that the HDFS data directories were taking up most of the space:
du -x -h --max-depth=1
List HDFS files and rank the ten largest by size:
hdfs dfs -ls -R / | sort -r -n -k 5 | head -10
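The fifth column of `hdfs dfs -ls -R` output is the file size in bytes, which is what `sort -n -k 5` keys on. A small sketch of the same ranking as a reusable function (the name `top10_by_size` is my own); it also drops directory entries, whose size column is always 0 and only clutters the ranking:

```shell
# Rank ls-style listing lines by the size field (column 5), largest first.
# Directory lines (permission string starts with 'd') are filtered out.
top10_by_size() {
  awk '$1 ~ /^-/' | sort -r -n -k 5 | head -10
}

# On a real cluster (assumption: you have read access to /):
#   hdfs dfs -ls -R / | top10_by_size
```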
The bulk of the space turned out to be Flink checkpoints. On some clusters, two other kinds of leftover data were also found: aggregated log files and uploaded jar packages (Flink, Spark):
(1) /tmp/logs/bigdata/logs/application_1634782320178_0251/node6_8041
(2) /user/bigdata/.flink/application_1634782320178_0079/xxx.jar
(3) /user/spark/applicationHistory/application_1671610454634_001
3. Solution
Clean up the files; anything that does not belong to a currently running application can be deleted:
hdfs dfs -rm -r /user/bigdata/.flink/application_*
hdfs dfs -rm -r /tmp/logs/root/logs/application_*
hdfs dfs -rm -r /user/spark/applicationHistory/application_*
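Before running wildcard deletes like the ones above, it is worth confirming that none of the application IDs are still active. A minimal sketch of such a safety check, assuming the running-application list comes from `yarn application -list -appStates RUNNING` (the helper name `is_safe_to_delete` and the temp-file path are my own):

```shell
# Return success (0) only if the directory's application ID does NOT
# appear in the list of currently running YARN applications.
is_safe_to_delete() {
  dir="$1"; running_list="$2"
  app_id=$(basename "$dir")          # e.g. application_1634782320178_0079
  ! grep -qx "$app_id" "$running_list"
}

# On a real cluster (assumption: default yarn CLI output format):
#   yarn application -list -appStates RUNNING \
#     | awk '$1 ~ /^application_/ {print $1}' > /tmp/running_apps.txt
#   hdfs dfs -ls /user/bigdata/.flink | awk '{print $NF}' \
#     | while read -r d; do
#         is_safe_to_delete "$d" /tmp/running_apps.txt && hdfs dfs -rm -r "$d"
#       done
```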
In HDFS, when the trash feature is enabled, a deleted file is not removed immediately; it is moved into the .Trash directory, from which it can be quickly restored.
A retention threshold can be configured: once a file has stayed in the trash longer than that threshold, or when the trash is emptied, the file is permanently deleted and its data blocks are released.
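The retention threshold is controlled by the `fs.trash.interval` property (in minutes) in core-site.xml; a value of 0 disables the trash entirely. For example, to keep trashed files for one day (the value shown is an illustration, not a recommendation):

```xml
<property>
  <name>fs.trash.interval</name>
  <!-- minutes; 1440 = 24 hours before permanent deletion -->
  <value>1440</value>
</property>
```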
Here we choose to delete the files under the .Trash directory directly and then empty the trash.
Delete the .Trash directory contents (clean up trash):
hdfs dfs -rm -r /user/root/.Trash/*
Empty the trash:
hdfs dfs -expunge
4. Inspection results
Checking the node disk status again shows that a large amount of space has been freed.