Files that need to be cleaned up regularly in big data clusters (to save space)

1. HDFS has a recycle bin (trash). If its retention setting is unreasonable, deleted data will occupy cluster space for a long time, so the first step is to clear the HDFS trash.
When deleting HDFS files you can use the command hdfs dfs -rm -skipTrash /path/to/file/you/want/to/remove/permanently; the files are then deleted directly and are not moved to the trash (note: this is a permanent deletion and the data cannot be recovered). If you delete HDFS data this way, there is no need to empty the trash afterwards.
To empty the trash, use the command hdfs dfs -expunge. The space is not necessarily freed on the spot: the command first creates a checkpoint of the current trash contents, and the data is actually removed once that checkpoint expires.
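For example, a minimal sequence might look like this (the dataset path is only a placeholder; how quickly -expunge frees space depends on the fs.trash.interval setting on your cluster):

# permanent delete that bypasses the trash (cannot be recovered)
hdfs dfs -rm -r -skipTrash /tmp/old_dataset
# check how much space the current user's trash is holding
hdfs dfs -du -h /user/$(whoami)/.Trash
# checkpoint the current trash and remove checkpoints older than fs.trash.interval
hdfs dfs -expunge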
2. Clear the Spark task execution history. If a large amount of data is written to the big data platform through Spark tasks, the Spark task history will take up a lot of space, so it needs to be cleaned regularly.
Clear the files under the /user/spark/applicationHistory/* path (to list the files under this path: hadoop fs -ls /user/spark/applicationHistory; to check how much disk space the path occupies: hadoop fs -du -h /user/spark/applicationHistory).
After clearing the Spark task execution history, clean up the recycle bin as well.
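A hedged sketch of the manual cleanup, plus the Spark History Server properties that can automate it (the paths come from the text above, but the cleaner values are only example settings for spark-defaults.conf; check them against your Spark version before relying on them):

# inspect the event-log directory and its size
hadoop fs -ls /user/spark/applicationHistory
hadoop fs -du -h /user/spark/applicationHistory
# delete the old history files, bypassing the trash
hadoop fs -rm -r -skipTrash /user/spark/applicationHistory/*
# optional: let the History Server expire old logs by itself (spark-defaults.conf)
# spark.history.fs.cleaner.enabled   true
# spark.history.fs.cleaner.maxAge    30d
# spark.history.fs.cleaner.interval  1d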
3. YARN cache file cleanup
Under /yarn/nm, mainly clear the files in filecache; this part usually occupies the most disk space.
Under /yarn/container-logs, if the data volume is large, it should also be cleaned up.
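The paths above are typical CDH-style defaults; verify them against yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs on your cluster, and run the checks on every NodeManager host. Rather than deleting files out from under a running NodeManager, it is usually safer to cap the cache and let YARN trim it, roughly as sketched below (the sizes and intervals are the usual defaults, shown only as examples for yarn-site.xml):

# on each NodeManager host: see what is occupying the local cache and log directories
du -sh /yarn/nm/filecache /yarn/nm/usercache /yarn/container-logs
# cap the localized-file cache so the NodeManager trims it automatically (yarn-site.xml)
# yarn.nodemanager.localizer.cache.target-size-mb        10240
# yarn.nodemanager.localizer.cache.cleanup.interval-ms   600000
# local container logs are kept for yarn.nodemanager.log.retain-seconds when log aggregation is off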
4. Clean up old HDFS data on a regular schedule (a combined sketch for points 4 and 5 follows this list).
5. Process the logs of each cluster component on a regular schedule.
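For points 4 and 5, a hedged sketch of both routine jobs (the data path, log path, retention window, and cron schedule are all placeholders to adapt to your cluster):

# remove HDFS data that is past its retention window, bypassing the trash
hadoop fs -rm -r -skipTrash /data/ods/dt=2023-01-01
# on each host: delete component log files not modified in the last 30 days
find /var/log/hadoop-hdfs -name "*.log.*" -mtime +30 -delete
# schedule it, e.g. from cron:
# 0 2 * * * find /var/log/hadoop-hdfs -name "*.log.*" -mtime +30 -delete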

Origin blog.csdn.net/qq_43688472/article/details/132490255