[Python] Using pandas to match and join two Excel sheets on a column

When matching large amounts of data in Excel, VLOOKUP works, but batch matching becomes very slow once the data exceeds about 100,000 rows, so I switched to Python. It turns out that the merge function in the pandas library provides functionality similar to JOIN in SQL. For details, see:

 

https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join

 

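import pandas as pd

# %%
# Read both sheets from the same workbook.
with pd.ExcelFile('xx.xlsx') as xls:
    df1 = pd.read_excel(xls, 'Sheet1')
    df2 = pd.read_excel(xls, 'Sheet2')

# Despite the variable name, how defaults to 'inner':
# only keys present in both sheets are kept.
outer = pd.merge(df1, df2, on='key')

# Recent pandas versions no longer accept encoding= in to_excel, so it is omitted.
outer.to_excel('outer_function.xlsx', index=False)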

This matches and joins Sheet1 and Sheet2 on their shared key column.
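To see what the default merge does, here is a minimal sketch on two small in-memory frames instead of the Excel sheets. The column names left_val and right_val and all of the data are made up for illustration. It shows how the default how='inner' treats unmatched and duplicated keys.

```python
import pandas as pd

# Hypothetical stand-ins for Sheet1 and Sheet2.
df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'd'], 'right_val': [10, 11, 12]})

# Default how='inner': keep only keys found in both frames.
inner = pd.merge(df1, df2, on='key')

# 'a' appears twice in df2, so it yields two rows;
# 'b', 'c' and 'd' have no partner and are dropped.
print(inner)
print(len(df1), len(df2), len(inner))
```

So the result can have both fewer rows (unmatched keys) and more rows (duplicated keys) than either input.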

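I am not sure why, but the result of the method above always has somewhat more or fewer rows than expected. A likely cause is that merge defaults to an inner join, which drops keys missing from either sheet and repeats rows for duplicated keys. To keep the matches in order, so that I can fill in any omissions by hand, I switched to the method below. The format and row order of its result largely follow df1, which makes it easy to copy over the data that is missing for df1 and fill it in.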

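# Drop exact duplicate rows first, then keep keys from both sheets (outer join).
outer = pd.merge(df1.drop_duplicates(), df2.drop_duplicates(), on='链接', how='outer')

# Recent pandas versions no longer accept encoding= in to_excel, so it is omitted.
outer.to_excel(r'H:\e\outer_function3.xlsx', index=False)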
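If the goal is to spot the omissions rather than search for them by hand, merge's indicator parameter may help, and how='left' is the join type documented as preserving the left frame's key order. The following is only a sketch under assumptions: the name and price columns and all values are hypothetical, and '链接' ("link") is the join column as above.

```python
import pandas as pd

# Hypothetical stand-ins for the two sheets; '链接' ("link") is the join column.
df1 = pd.DataFrame({'链接': ['u3', 'u1', 'u2', 'u1'],
                    'name': ['c', 'a', 'b', 'a']})
df2 = pd.DataFrame({'链接': ['u1', 'u4', 'u3'],
                    'price': [10, 40, 30]})

left = df1.drop_duplicates()
right = df2.drop_duplicates()

# how='left' keeps every row of df1 in df1's order; unmatched rows get NaN.
by_df1 = pd.merge(left, right, on='链接', how='left')

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
outer = pd.merge(left, right, on='链接', how='outer', indicator=True)
missing = outer[outer['_merge'] != 'both']

print(by_df1)
print(missing)
```

Here missing lists the rows that exist in only one sheet, so they can be reviewed or filled in separately.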

 

Origin blog.csdn.net/u010472858/article/details/106196854