Comparing and deduplicating two large Excel files with pandas in Python (about 3 million entries)

Background:

Put simply, there are two Excel files, A and B.

I want to remove from B every entry that also appears in A. There are about 3 million entries to process.
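On toy data, the operation looks like this. This is only a minimal sketch, assuming each file boils down to a single column of values; the names `a_values` and `b_values` are made up for illustration and do not appear in the script below.

```python
import pandas as pd

# Hypothetical stand-ins for the contents of files A and B.
a_values = pd.Series([2, 4])
b_values = pd.Series([1, 2, 3, 4, 5])

# Keep only the entries of B that do not occur in A.
b_without_a = b_values[~b_values.isin(a_values)].reset_index(drop=True)
print(b_without_a.tolist())  # [1, 3, 5]
```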

With this much data, the built-in deduplication in WPS and Microsoft Office does not work reliably, so a script is needed.

Without further ado, here is the code:

import pandas as pd
from tqdm import tqdm


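# Path of the workbook to deduplicate (file B in the description above)

targetExcel = r'./222.xlsx'

# Path of the reference workbook whose values are removed (file A)

basisExcel = r'./11.xlsx'

# Prefix used to name the output columns

field = 'removeRepeatResult'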


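def removeRepeat():

    count = 0
    ind   = 1
    targetIndex = field + str(ind)
    resultExcel  = {
        field+'1': []
    }
    header = ['A','B','C','D','E','F','G','H','I','J','K']

    print('Reading data')
    target_Excel = pd.read_excel(targetExcel, header=None, names=header, dtype='object')
    basis_Excel  = pd.read_excel(basisExcel, header=None, names=['A'], dtype='object')
    print('Data loaded')

    # Build the lookup set once. The original rebuilt list(basis_Excel['A'])
    # for every cell, which made each membership check scan the whole list.
    basisValues = set(basis_Excel['A'].dropna())

    for index in tqdm(header):
        for i in tqdm(target_Excel[index], leave=False):
            if pd.isnull(i):
                continue  # skip empty cells
            elif i in basisValues:
                continue  # value also appears in the reference file, so drop it
            else:
                resultExcel[targetIndex].append(i)
                count += 1
                # Start a new output column before reaching Excel's limit
                # of 1,048,576 rows per sheet.
                if count >= 1020000:
                    count = 0
                    ind += 1
                    targetIndex = field + str(ind)
                    resultExcel[targetIndex] = []

    print('Merging results')
    # Columns of unequal length are padded with NaN by concat and written as
    # empty cells. The original df.fillna(0) call discarded its result, so it
    # had no effect and is omitted here.
    df = pd.concat([pd.DataFrame(i) for i in resultExcel.values()], axis=1)
    df.to_excel('resultExcel.xlsx', header=False, index=False)  # no header row, no row numbers
    # Change the output file name in the line above as needed.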


removeRepeat()
input('>>> Press Enter to exit...')
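The cell-by-cell loop can also be written with vectorized pandas operations, which tends to be faster on millions of values. The following is a sketch rather than a drop-in replacement: the function name `remove_contained` and its parameters are hypothetical, and it works on data already in memory so it can be tried without the Excel files.

```python
import pandas as pd


def remove_contained(target_df, basis_values, rows_per_column=1_020_000):
    """Return the values of target_df that do not occur in basis_values.

    Hypothetical helper: cells are read column by column, empty cells are
    dropped, and the remaining values are laid out in columns of at most
    rows_per_column entries.
    """
    # Flatten the table column by column into a single Series.
    flat = pd.concat(
        [target_df[col] for col in target_df.columns], ignore_index=True
    ).dropna()

    # Vectorized membership test instead of a Python-level loop.
    kept = flat[~flat.isin(basis_values)].reset_index(drop=True)

    # Split the survivors into chunks that fit within one sheet column each.
    chunks = [
        kept.iloc[start:start + rows_per_column].reset_index(drop=True)
        for start in range(0, len(kept), rows_per_column)
    ]
    if not chunks:
        return pd.DataFrame()
    return pd.concat(chunks, axis=1, ignore_index=True)


# Toy data; a tiny rows_per_column makes the column split visible.
target = pd.DataFrame({"A": ["a", "b", None], "B": ["c", "d", "e"]}, dtype="object")
result = remove_contained(target, ["b", "d"], rows_per_column=2)
print(result)
```

With the real files, `target` would come from `pd.read_excel` as in the script above, and `result.to_excel('resultExcel.xlsx', header=False, index=False)` would write the output in the same layout.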

[Figure: screenshot of the script running]

Corrections and suggestions are welcome. Let's learn and improve together!

Origin blog.csdn.net/xiaozhang0316/article/details/128807913