Cleaning up files left behind by Flink programs on HDFS

1. Background

While checking the status of the CDH nodes, we found that the disk on one of the nodes was nearly full.

2. Problem location

Checking disk usage on the Linux host shows that HDFS files account for most of the space:

du -x -h --max-depth=1

List the HDFS files, top 10 by size in descending order:

hdfs dfs -ls -R / | sort -r -n -k 5 | head -10 
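The `sort -r -n -k 5` in the pipeline above sorts on the fifth whitespace-separated column of the `hdfs dfs -ls -R` output, which is the file size in bytes. A minimal self-contained sketch of the same sorting step, using made-up sample listing lines (the paths and sizes are illustrative, not from the original cluster):

```shell
# Sample lines in the column layout produced by `hdfs dfs -ls -R`:
# permissions  replicas  owner  group  size  date  time  path
cat > /tmp/sample_ls.txt <<'EOF'
-rw-r--r--   3 bigdata supergroup       1048576 2023-04-01 10:00 /user/flink/ckpt/chk-42/_metadata
-rw-r--r--   3 bigdata supergroup     536870912 2023-04-01 10:05 /user/flink/ckpt/chk-43/op-state
-rw-r--r--   3 bigdata supergroup          4096 2023-04-01 10:06 /tmp/logs/app/node6_8041
EOF

# Sort numerically by column 5 (size) in descending order; keep the top 2.
sort -r -n -k 5 /tmp/sample_ls.txt | head -2
```

The largest file (536870912 bytes) comes out first, which is exactly how the checkpoint files surfaced at the top of the real listing.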

The space turns out to be consumed mainly by Flink checkpoints:

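To keep checkpoints from piling up again, Flink's retention settings can be tightened. A sketch of the relevant `flink-conf.yaml` keys (the directory path and values here are illustrative assumptions, not taken from the original cluster):

```yaml
# flink-conf.yaml (illustrative values -- adjust to your cluster)
state.checkpoints.dir: hdfs:///user/flink/checkpoints
# Keep only the most recent completed checkpoint per job
state.checkpoints.num-retained: 1
# Remove externalized checkpoints when a job is cancelled
execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION
```

Note that checkpoints of jobs that were killed (rather than cleanly cancelled) can still be left behind and need the manual cleanup described below.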

On some clusters, two other kinds of leftover files were found: log files and jar packages (from Flink and Spark), for example:

(1) /tmp/logs/bigdata/logs/application_1634782320178_0251/node6_8041
(2) /user/bigdata/.flink/application_1634782320178_0079/xxx.jar
(3) /user/spark/applicationHistory/application_1671610454634_001

3. Solution

Clean up the files. Any file that does not belong to a running application can be deleted:

hdfs dfs -rm -r /user/bigdata/.flink/application_*
hdfs dfs -rm -r /tmp/logs/root/logs/application_*
hdfs dfs -rm -r /user/spark/applicationHistory/application_*
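Before deleting, make sure the application really is not running. One way (a sketch; the exact `yarn` output format and the paths are assumptions about a typical Hadoop setup) is to list the running YARN application IDs and exclude them from the deletion candidates:

```shell
# In a real cluster the running IDs would come from something like:
#   yarn application -list -appStates RUNNING | awk 'NR>2 {print $1}' > /tmp/running_apps.txt
# Here we hard-code sample data so the filtering logic is self-contained.
printf '%s\n' application_1634782320178_0300 > /tmp/running_apps.txt

# Candidate directories (sample; normally from `hdfs dfs -ls /user/bigdata/.flink`)
printf '%s\n' \
  /user/bigdata/.flink/application_1634782320178_0079 \
  /user/bigdata/.flink/application_1634782320178_0300 > /tmp/candidates.txt

# Keep only candidates whose application ID is NOT in the running list.
grep -vFf /tmp/running_apps.txt /tmp/candidates.txt
# Each surviving path could then be passed to: hdfs dfs -rm -r "$path"
```

Only the directory of the finished application (`..._0079`) survives the filter; the running application's directory is protected.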

In HDFS, when the recycle bin (trash) feature is enabled, a deleted file is not removed immediately; it is moved into the .Trash directory, from which it can be quickly restored.
A retention threshold can be configured: once a file has stayed in the trash longer than this threshold, or the trash is emptied, the file is permanently deleted and its data blocks are released.
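The retention threshold mentioned above is configured in `core-site.xml` via `fs.trash.interval`, in minutes. A sketch with an illustrative value:

```xml
<!-- core-site.xml: trash retention in minutes (illustrative value) -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value> <!-- keep deleted files in .Trash for 1 day -->
</property>
```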

Here we choose to delete the files under the .Trash directory directly and then empty the recycle bin.

Delete the contents of the .Trash directory:

hdfs dfs -rm -r /user/root/.Trash/*

Empty the recycle bin:

hdfs dfs -expunge

4. Inspection results

Checking the node disk status again shows that a large amount of disk space has been freed.


Origin: blog.csdn.net/xfp1007907124/article/details/130266411