Use Python to merge and deduplicate all tables in a folder

Sometimes we need to merge multiple tables and remove the duplicate rows.
Suppose the three tables table01.xlsx, table02.xlsx, and table03.xlsx are stored in the table folder on the E drive, and we want to merge them, remove the duplicates, and save the result to the file merge_table.
table01.xlsx:
[image: table01]
table02.xlsx:
[image: table02]
table03.xlsx:
[image: table03]
The overall idea:
Traverse the Excel files in the folder, append the data read from each file to a list, merge everything in the list with pd.concat(), remove duplicates from the merged data with drop_duplicates(), create a new Excel file, write the deduplicated data to it as a DataFrame, and save the file. The code is as follows:

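import os
import pandas as pd

os.chdir('E:\\origin_file\\table')
frames = []   # create an empty list to hold the data (avoid naming it "list", which shadows the built-in)
# 1. Traverse the target folder
for root, dirs, files in os.walk('./'):
    for file in files:
        # Skip anything that is not an .xlsx file, and the output of a previous run
        if not file.endswith('.xlsx') or file == 'merge_table.xlsx':
            continue
        # 2. Read the Excel file (join with root so files in subfolders are found too)
        data = pd.read_excel(os.path.join(root, file))
        # 3. Append the data to the list
        frames.append(data)
# 4. Merge
merge_data = pd.concat(frames, axis=0)
# The first argument of pd.concat() is the list of objects to concatenate; axis is the direction: axis=0 (the default) stacks the tables vertically, row under row, while axis=1 places them side by side, column next to column
# For more parameters, see the blog post https://blog.csdn.net/smf1208/article/details/110726271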
# 5. Deduplicate
merge_quchong = merge_data.drop_duplicates(subset=['filename'], keep='first', inplace=False)
# subset=['filename'] names the column(s) used to identify duplicates, here the "filename" column; keep='first' keeps the first occurrence of each duplicate and drops the later ones (this is the default); inplace=False returns a new DataFrame and leaves merge_data unchanged (this is also the default)
# For details, see https://zhuanlan.zhihu.com/p/116884554
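# 6. Convert to DataFrame format (drop_duplicates already returns a DataFrame, so this step is optional)
df = pd.DataFrame(merge_quchong)
# 7. Create the Excel file; the with block saves it on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('./merge_table.xlsx') as writer:
    # 8. Write the data to the Excel file
    df.to_excel(writer, sheet_name='sheet1', startcol=0, index=False)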
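
The same steps can also be packaged as functions, which makes them easier to reuse on another folder or another key column. The following is only a sketch under some assumptions: the names merge_and_dedupe and merge_folder are hypothetical helpers invented here, the folder is searched non-recursively with pathlib (unlike os.walk above), and Excel lock files starting with "~$" are skipped.

```python
from pathlib import Path

import pandas as pd


def merge_and_dedupe(frames, key):
    """Stack DataFrames vertically, then keep the first row for each value of `key`."""
    merged = pd.concat(frames, axis=0, ignore_index=True)
    return merged.drop_duplicates(subset=[key], keep='first')


def merge_folder(folder, output_name='merge_table.xlsx', key='filename'):
    """Merge every .xlsx file directly inside `folder` and save the deduplicated result."""
    folder = Path(folder)
    paths = sorted(
        p for p in folder.glob('*.xlsx')
        # skip the output of an earlier run and Excel's temporary lock files
        if p.name != output_name and not p.name.startswith('~$')
    )
    frames = [pd.read_excel(p) for p in paths]
    result = merge_and_dedupe(frames, key)
    result.to_excel(folder / output_name, sheet_name='sheet1', index=False)
    return result
```

With the folder from this article, the call would be merge_folder('E:\\origin_file\\table'). Keeping the merge logic in merge_and_dedupe separate from the file handling means it can be tried on in-memory DataFrames without touching any Excel file.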

merge_table:
[image: the merged and deduplicated table]
Knowledge points:
1. Traversing folders
2. Merging
3. Deduplication
4. Saving
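
To make the merging and deduplication points concrete, here is a small in-memory sketch that needs no Excel files. The two sample tables and their "size" column are made up for illustration; only the "filename" column name comes from the code in this article.

```python
import pandas as pd

# Hypothetical sample data standing in for two of the Excel tables
table_a = pd.DataFrame({'filename': ['a.txt', 'b.txt'], 'size': [1, 2]})
table_b = pd.DataFrame({'filename': ['b.txt', 'c.txt'], 'size': [20, 3]})

# axis=0 stacks the rows of table_b under those of table_a (4 rows, 2 columns)
stacked = pd.concat([table_a, table_b], axis=0, ignore_index=True)

# axis=1 places the two tables side by side (2 rows, 4 columns)
side_by_side = pd.concat([table_a, table_b], axis=1)

# keep='first' keeps the b.txt row from table_a; keep='last' keeps the one from table_b
first = stacked.drop_duplicates(subset=['filename'], keep='first')
last = stacked.drop_duplicates(subset=['filename'], keep='last')

print(stacked.shape, side_by_side.shape)  # (4, 2) (2, 4)
print(first['size'].tolist())             # [1, 2, 3]
print(last['size'].tolist())              # [1, 20, 3]
```

Note that because subset only looks at "filename", the two b.txt rows count as duplicates even though their "size" values differ. Leaving subset out would compare whole rows instead.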


Origin blog.csdn.net/weixin_47970003/article/details/121792711