Sometimes we need to merge several tables and remove the duplicate rows.
Suppose the three tables table01.xlsx, table02.xlsx, and table03.xlsx are stored in the table folder on the E drive, and we want to merge them, remove the duplicates, and save the result to the file merge_table.xlsx.
table01:
table02:
table03:
The overall idea:
traverse the Excel files in the folder - add the data read from each file to a list - use pd.concat() to merge all the data in the list - use drop_duplicates() to remove duplicate rows from the merged data - convert the de-duplicated data to DataFrame format - create a new Excel file - store the data in it - save the file.
The code is as follows:
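import os
import pandas as pd

os.chdir('E:\\origin_file\\table')
data_list = []  # create a new list (named data_list so it does not shadow the built-in list)
# 1. Traverse the target folder
for root, dirs, files in os.walk('./'):
    for file in files:
        # Only read .xlsx files, and skip the output file left over from an earlier run
        if not file.endswith('.xlsx') or file == 'merge_table.xlsx':
            continue
        # 2. Read the Excel file (join with root so that files in subfolders are found too)
        data = pd.read_excel(os.path.join(root, file))
        # 3. Add the data to the new list
        data_list.append(data)
# 4. Merge
merge_data = pd.concat(data_list, axis=0)
# The first argument of pd.concat() is the list of objects to concatenate. axis is the direction:
# axis=0 (the default) stacks the tables vertically, appending rows; axis=1 joins them side by side, adding columns
# For more parameters, see the blog post https://blog.csdn.net/smf1208/article/details/110726271
# 5. De-duplicate
merge_quchong = merge_data.drop_duplicates(subset=['filename'], keep='first', inplace=False)
# subset=['filename'] names the column(s) used to detect duplicates, here the "filename" field.
# keep='first' keeps the first occurrence of each duplicated row and deletes the later ones ('first' is also the default).
# inplace=False leaves merge_data unchanged and returns a new de-duplicated DataFrame (False is also the default).
# For details, see https://zhuanlan.zhihu.com/p/116884554
# 6. Convert to DataFrame format (drop_duplicates() already returns a DataFrame, so this step is optional)
df = pd.DataFrame(merge_quchong)
# 7. Create the Excel file
writer = pd.ExcelWriter('./merge_table.xlsx')
# 8. Store the data in the Excel file
df.to_excel(writer, sheet_name='sheet1', startcol=0, index=False)
# 9. Save the file (writer.save() was removed in pandas 2.0; close() writes the file to disk)
writer.close()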
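To see what steps 4 and 5 do without touching any files, here is a minimal sketch on two small in-memory tables. The column names filename and size and the sample values are made up for illustration; only the pd.concat() and drop_duplicates() calls correspond to the code above.

```python
import pandas as pd

# Two small hypothetical tables that share one 'filename' value ("b.txt")
table_a = pd.DataFrame({"filename": ["a.txt", "b.txt"], "size": [10, 20]})
table_b = pd.DataFrame({"filename": ["b.txt", "c.txt"], "size": [99, 30]})

# axis=0 stacks the tables on top of each other: 2 rows + 2 rows = 4 rows
merged = pd.concat([table_a, table_b], axis=0)

# keep='first' keeps the first row seen for each 'filename' and drops later ones
deduped = merged.drop_duplicates(subset=["filename"], keep="first")

# keep='last' keeps the row that came from table_b instead
deduped_last = merged.drop_duplicates(subset=["filename"], keep="last")

print(merged.shape)                    # (4, 2)
print(deduped["size"].tolist())        # [10, 20, 30]
print(deduped_last["size"].tolist())   # [10, 99, 30]
```

Note that the merged table keeps each input's original row index (0, 1, 0, 1 here); passing ignore_index=True to pd.concat() renumbers the rows from 0.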
merge_table:
Knowledge points:
1. Traversing folders
2. Merging
3. Deduplication
4. Saving
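The four points above can also be wrapped into one reusable function. The following is only a sketch under stated assumptions, not the code from this article: the function name merge_and_dedupe, its parameters, and the use of pathlib instead of os.walk are all hypothetical choices made for illustration. It looks only at the top level of the folder and skips the output file, so that running it twice does not merge the previous result back in.

```python
from pathlib import Path

import pandas as pd


def merge_and_dedupe(folder, key, pattern="*.xlsx", reader=pd.read_excel,
                     output_name=None):
    """Merge the files matching `pattern` in `folder`, dropping repeated `key` rows.

    Hypothetical helper: its name and parameters are illustrative only.
    """
    folder = Path(folder)
    frames = []
    # 1. Traversing: top level of the folder only, in sorted order
    for path in sorted(folder.glob(pattern)):
        # Skip the output file and Excel's temporary "~$" lock files
        if path.name == output_name or path.name.startswith("~$"):
            continue
        frames.append(reader(path))
    if not frames:
        raise FileNotFoundError(f"no files matching {pattern} in {folder}")
    # 2. Merging: stack the rows and renumber the index
    merged = pd.concat(frames, axis=0, ignore_index=True)
    # 3. Deduplication: keep the first row for each value of `key`
    deduped = merged.drop_duplicates(subset=[key], keep="first")
    # 4. Saving: only when an output name is given
    if output_name is not None:
        with pd.ExcelWriter(folder / output_name) as writer:
            deduped.to_excel(writer, sheet_name="sheet1", index=False)
    return deduped
```

With the folder used in this article, a call might look like merge_and_dedupe('E:\\origin_file\\table', key='filename', output_name='merge_table.xlsx'). The reader parameter is there so the same function can be tried on other formats, for example with pattern='*.csv' and reader=pd.read_csv.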