Efficiently delete redundant files in two folders

I previously used the method from "Delete duplicate images from two folders", but there were too many images: 700,000 of them needed to be deleted, and the deletion took more than 10 days.

Solution

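First make a copy of the image folder, then change the extension of every file in the copy from .jpg to .txt, so that the image names can be compared directly with the label file names.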

import os

def change_file_extension(path, old_ext, new_ext):
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(old_ext):
                old_file = os.path.join(root, file)
                new_file = os.path.splitext(old_file)[0] + new_ext
                os.rename(old_file, new_file)

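# Example: change the extension of every .jpg file under D:\dataset\image\1 to .txt
# (raw string, so the backslashes in the Windows path are not treated as escapes)
change_file_extension(r"D:\dataset\image\1", ".jpg", ".txt")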
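The "make a copy first" step can be folded into the same script. The following is only a minimal sketch: `copy_and_rename` is a hypothetical helper name, not part of the original code, and it assumes the destination folder does not exist yet.

```python
import shutil
from pathlib import Path


def copy_and_rename(src_dir, dst_dir, old_ext=".jpg", new_ext=".txt"):
    """Copy src_dir to dst_dir, then swap old_ext for new_ext in the copy only.

    Hypothetical helper: the original images in src_dir are never touched.
    Returns the number of files renamed in the copy.
    """
    # copytree raises FileExistsError if dst_dir already exists
    shutil.copytree(src_dir, dst_dir)

    renamed = 0
    # Collect the matches first so renaming does not disturb the directory scan
    for path in list(Path(dst_dir).rglob("*" + old_ext)):
        if path.is_file():
            path.rename(path.with_suffix(new_ext))
            renamed += 1
    return renamed
```

Because only the copy is renamed, the original .jpg files stay available for the deletion step later on.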

Now the images and the labels use the same file names. Compare the two folders, find the file names that appear in the image folder but not in the label folder, and write them to a txt file.

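import os

# Paths of the two folders
folder1 = r'D:\dataset\image\1'
folder2 = r'D:\dataset\ann'

# File names in folder 1 (the images; in this scenario renamed from .jpg to .txt)
files1 = set(os.listdir(folder1))

# File names in folder 2 (the labels; in this scenario .txt)
files2 = set(os.listdir(folder2))

# Missing files: names that appear in folder 1 but not in folder 2
missing_files = files1 - files2

# Save the missing file names to a txt file, one name per line
path = r'D:\code\yolov8-pytorch-master\needRM.txt'

# Print the missing file names while writing them out;
# the with block closes the file so nothing is left unflushed
print("Missing files:")
with open(path, "w", encoding="utf8") as lt:
    for file in missing_files:
        print(file)
        lt.write(file + '\n')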
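As an alternative to renaming a copy, the comparison can be done on file names with the extension stripped. This is a different technique from the one used in this post, shown only as a sketch; `files_without_match` is a hypothetical name, and it assumes each image and its label share the same base name.

```python
import os


def files_without_match(image_dir, label_dir):
    """Return image file names whose base name has no counterpart in label_dir.

    Hypothetical helper: compares names with the extension removed,
    so no files need to be copied or renamed beforehand.
    """
    # Base names (without extension) of every label file
    label_stems = {os.path.splitext(name)[0] for name in os.listdir(label_dir)}

    # Keep the images whose base name is not among the labels
    return sorted(
        name for name in os.listdir(image_dir)
        if os.path.splitext(name)[0] not in label_stems
    )
```

The returned names keep their original .jpg extension, so they could be written to the list file without a separate suffix-conversion step.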

Next, read needRM.txt, change the suffix of every name back to .jpg, and write the result to needRMNew.txt.

path = 'D:/code/yolov8-pytorch-master/needRM.txt'  # list written in the previous step
newpath = 'D:/code/yolov8-pytorch-master/needRMNew.txt'  # same list with .jpg suffixes

# read() returns the whole file content as a single string
with open(path, "r", encoding="utf8") as file:
    txt = file.read()

a = txt.replace(".txt", ".jpg")  # change every .txt suffix back to .jpg

# Write the modified text to the new file
with open(newpath, "w", encoding="utf8") as file:
    file.write(a)
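One caveat with `str.replace` on the whole text is that it also rewrites a `.txt` that appears in the middle of a file name. If that could happen in your data, a per-name swap is safer. A minimal sketch, where `swap_suffix` is a hypothetical helper and not part of the original code:

```python
import os


def swap_suffix(names, old_ext=".txt", new_ext=".jpg"):
    """Replace old_ext with new_ext only when it is the final extension.

    Hypothetical helper: names with a different extension are returned unchanged.
    """
    result = []
    for name in names:
        stem, ext = os.path.splitext(name)
        result.append(stem + new_ext if ext == old_ext else name)
    return result
```

For a name like `img.txt.backup_01.txt`, this changes only the trailing extension, whereas a plain `replace(".txt", ".jpg")` would change both occurrences.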

This gives you the names of all the files that need to be deleted, saved in needRMNew.txt.

Delete redundant files in the Image folder according to file name

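import os

# Folder holding the .jpg images to delete
# (it must contain the .jpg files, not the copy that was renamed to .txt)
image_dir = r"D:\dataset\image\1"

# Build the full path of each listed image, then delete it
with open(r"D:\code\yolov8-pytorch-master\needRMNew.txt", "r", encoding="utf8") as file:
    for line in file:
        name = line.strip('\n')
        if not name:
            continue  # skip blank lines
        delPath = os.path.join(image_dir, name)
        print("remove pic:  " + delPath)
        os.remove(delPath)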
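Since deleting hundreds of thousands of files cannot be undone, it may be worth doing a dry run first. The sketch below is one possible way to do that; `delete_listed_files` and its `dry_run` flag are hypothetical names introduced here for illustration.

```python
import os


def delete_listed_files(list_path, image_dir, dry_run=True):
    """Delete the files named in list_path from image_dir.

    Hypothetical helper. With dry_run=True nothing is deleted; the function
    only reports what would happen. Returns (matched, missing): names found
    in image_dir (and deleted unless dry_run), and names that were not found.
    """
    matched, missing = [], []
    with open(list_path, "r", encoding="utf8") as f:
        for line in f:
            name = line.strip()
            if not name:
                continue  # skip blank lines
            target = os.path.join(image_dir, name)
            if not os.path.isfile(target):
                missing.append(name)  # listed but not present in the folder
                continue
            if not dry_run:
                os.remove(target)
            matched.append(name)
    return matched, missing
```

Calling it once with the default `dry_run=True` and checking the two returned lists, then calling it again with `dry_run=False`, avoids the script stopping halfway on a missing file.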



Done!
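The improvement over the previous method is in how the two folders are compared. The previous method used a double traversal (a nested loop over both file lists), so its cost grows with the product of the two folder sizes. This version uses a set difference to find the names that have no match, which on average takes time roughly proportional to the number of files.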
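To make the difference concrete, here is a small sketch of the two approaches side by side. The previous method's code is not shown in this post, so `missing_nested` is only an assumption about what a double traversal looks like; both function names are hypothetical.

```python
def missing_nested(names1, names2):
    """Double traversal: for each name in names1, scan names2 until a match.

    In the worst case this does len(names1) * len(names2) comparisons.
    """
    missing = []
    for a in names1:
        found = False
        for b in names2:
            if a == b:
                found = True
                break
        if not found:
            missing.append(a)
    return missing


def missing_with_set(names1, names2):
    """Set difference: each membership check is a hash lookup, not a scan."""
    return set(names1) - set(names2)
```

Both functions find the same names; only the amount of work differs, and that gap widens quickly as the folders grow to hundreds of thousands of files.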

Origin blog.csdn.net/qq_41701723/article/details/134691848