Select a folder in Python, compute the MD5 value of every file in it, and log the files that share the same MD5

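While sorting out my phone's photo album, I found that some photos had been saved many times. To make sure each picture is kept only once, I decided to filter them by computing each image's MD5 value.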

Computing the MD5 value of an image is very convenient in Python.
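
As a minimal sketch of why this is convenient: `hashlib.md5` accepts data piece by piece through `update()`, so feeding the bytes in blocks yields the same digest as hashing them all at once. The sample bytes below are made up for illustration.

```python
import hashlib

data = b"hello"

# Hash all the bytes in a single call.
one_shot = hashlib.md5(data).hexdigest()

# Hash the same bytes in two pieces via update().
chunked = hashlib.md5()
chunked.update(data[:2])
chunked.update(data[2:])

print(one_shot)                         # 5d41402abc4b2a76b9719d911017c592
print(one_shot == chunked.hexdigest())  # True
```

This is what allows a large file to be hashed block by block without loading it into memory in full.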

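Running the script opens a dialog box. After you choose a directory, the script walks that directory and all of its subdirectories, computes the MD5 value of every file, and writes the files whose MD5 values are duplicated to a log file. You can then delete the duplicates by hand and keep just one copy of each.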

The Python source code follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import datetime
import hashlib
import os
import tkinter
from tkinter import filedialog


def get_file_md5(file_path):
    """Return the hex MD5 digest of a file, or None if it is not a regular file."""
    if not os.path.isfile(file_path):
        return None
    myhash = hashlib.md5()
    # Read in fixed-size blocks so that large files are never loaded whole.
    with open(file_path, 'rb') as f:
        while True:
            b = f.read(8096)
            if not b:
                break
            myhash.update(b)
    return myhash.hexdigest()


def file_name(file_dir):
    """Hash every file under file_dir and log the groups that share an MD5."""
    # The log is written next to this script as <script name>.log.
    log_file_path = os.path.realpath(__file__) + '.log'
    print(log_file_path)
    list_md5 = {}
    for root, dirs, files in os.walk(file_dir):
        for name in files:
            file_full_path = os.path.join(root, name)
            md5_val = get_file_md5(file_full_path)
            if md5_val is None:
                continue  # skip anything that is not a regular file
            list_md5.setdefault(md5_val, []).append(file_full_path)
    with open(log_file_path, 'w', encoding='utf-8') as log_file:
        for key, paths in list_md5.items():
            if len(paths) > 1:
                # One "md5<TAB>path" line per file, blank line between groups.
                for value in paths:
                    log_file.write(key + '\t' + value + '\n')
                log_file.write('\n')


if __name__ == '__main__':
    root_window = tkinter.Tk()
    root_window.withdraw()  # show only the folder dialog, not an empty window
    file_dir_path = filedialog.askdirectory()
    # Start the timer after the dialog closes so that only the hashing is timed.
    starttime = datetime.datetime.now()
    if file_dir_path:
        file_name(file_dir_path)
    endtime = datetime.datetime.now()
    print('execute time: %ds' % (endtime - starttime).seconds)
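
Once the log exists, choosing what to delete by hand can be made a little easier. The following is only a sketch, assuming the log layout the script above writes: one `md5<TAB>path` line per file, with a blank line after each group. The helper names `parse_duplicate_log` and `removal_candidates` and the sample paths are hypothetical, and nothing is deleted; the code only lists every path except the first in each group.

```python
def parse_duplicate_log(text):
    """Return a list of groups; each group lists the paths sharing one MD5."""
    groups = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current group.
            if current:
                groups.append(current)
                current = []
            continue
        # Assumed line layout: md5, a tab, then the file path.
        md5_val, _, path = line.partition('\t')
        current.append(path)
    if current:
        groups.append(current)
    return groups


def removal_candidates(groups):
    """Keep the first path of each group and return all the others."""
    return [path for group in groups for path in group[1:]]


# Made-up log content for illustration only.
sample = (
    "aaa\t/p/1.jpg\n"
    "aaa\t/p/copy/1.jpg\n"
    "\n"
    "bbb\t/p/2.jpg\n"
    "bbb\t/p/3.jpg\n"
    "bbb\t/p/4.jpg\n"
    "\n"
)

groups = parse_duplicate_log(sample)
candidates = removal_candidates(groups)
print(candidates)  # ['/p/copy/1.jpg', '/p/3.jpg', '/p/4.jpg']
```

To run it on a real log, read the file with `open(log_path, encoding='utf-8').read()` and pass the text to `parse_duplicate_log`. Reviewing the candidates before deleting anything is still advisable.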
