We have a set of order source data collected and consolidated from various business staff. Because the order data supplied by some of them overlaps, the duplicate records must first be flagged in the source data; once the business staff confirm them, those duplicates are deleted to produce the final, accurate order source data used for subsequent order statistics.
A sample of the source data is shown below. The four fields 订单id (order id), 用户id (user id), 国条码 (barcode), and 购买数量 (purchase quantity), taken together, serve as the unique key of a record.
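To make the four-column key concrete, here is a minimal sketch with hypothetical toy values (the real workbook's contents are not shown in the source): two rows that agree on all four key fields count as duplicates of each other.

```python
import pandas as pd

# Hypothetical toy rows: (订单id, 用户id, 国条码, 购买数量) is the row key.
df = pd.DataFrame({
    "订单id":  [1001, 1001, 1002],
    "用户id":  [501, 501, 502],
    "国条码":  ["6901234567890", "6901234567890", "6901234567891"],
    "购买数量": [2, 2, 1],
})

# The first two rows share all four key fields, so both are flagged.
print(df.duplicated(subset=["订单id", "用户id", "国条码", "购买数量"], keep=False))
# 0     True
# 1     True
# 2    False
```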
Step 1: Read the data
import pandas as pd
file_path = r"E:\临时\20220214\临时.xlsx"  # path to the raw order workbook
data = pd.read_excel(file_path)  # load the sheet into a DataFrame
data.head()  # preview the first few rows
Step 2: Check the number of duplicate record rows
data.duplicated(subset=["订单id","用户id","国条码","购买数量"],keep="first").sum()  # count duplicate rows; the first occurrence of each duplicated record is not counted
Output: 96
data.duplicated(subset=["订单id","用户id","国条码","购买数量"],keep="last").sum()  # count duplicate rows; the last occurrence of each duplicated record is not counted
Output: 96
data.duplicated(subset=["订单id","用户id","国条码","购买数量"],keep=False).sum()  # count duplicate rows; every occurrence of a duplicated record is counted
Output: 121
Putting these together: 121 rows belong to duplicated records, and 96 of them are "extra" occurrences beyond the first. So there are 121 − 96 = 25 distinct records that are duplicated; each appears at least twice, and some appear three times or more.
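The arithmetic behind the three `keep` settings can be checked on a toy frame (hypothetical values, a single key column for brevity): the `keep=False` count minus the `keep="first"` count equals the number of distinct duplicated keys.

```python
import pandas as pd

# Toy frame: key "A" appears 3 times, "B" twice, "C" once.
df = pd.DataFrame({"订单id": ["A", "A", "A", "B", "B", "C"]})

first = df.duplicated(subset=["订单id"], keep="first").sum()   # extras beyond the first: 2 + 1 = 3
last = df.duplicated(subset=["订单id"], keep="last").sum()     # extras beyond the last:  2 + 1 = 3
all_dups = df.duplicated(subset=["订单id"], keep=False).sum()  # every duplicated row:    3 + 2 = 5

print(first, last, all_dups)  # 3 3 5
print(all_dups - first)       # 2 -> number of distinct duplicated keys ("A" and "B")
```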
Step 3: Mark duplicate records
data.loc[data.duplicated(subset=["订单id","用户id","国条码","购买数量"],keep=False),"该行是否重复"]="是"  # flag all 121 rows that belong to duplicated records
Step 4: Mark duplicate records that need to be eliminated
data.loc[data.duplicated(subset=["订单id","用户id","国条码","购买数量"],keep="first"),"该重复项是否应被剔除"]="是"  # if a record occurs more than once, keep the first occurrence and flag the later ones for removal
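One detail worth noting about Steps 3 and 4: rows that are not flagged get `NaN` in the new columns, not an explicit "no". A minimal sketch on hypothetical toy data, including an optional `fillna` if the business staff prefer an explicit 否:

```python
import pandas as pd

df = pd.DataFrame({"订单id": ["A", "A", "B"]})

# Step 3 pattern: flag every row of a duplicated key.
df.loc[df.duplicated(subset=["订单id"], keep=False), "该行是否重复"] = "是"
# Step 4 pattern: flag only the rows to drop (first occurrence is kept).
df.loc[df.duplicated(subset=["订单id"], keep="first"), "该重复项是否应被剔除"] = "是"

# Unflagged cells are NaN; fill them with "否" for an explicit yes/no column.
df = df.fillna("否")
print(df)
```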
Step 5: Delete duplicate records after confirming with business personnel
Method 1:
data_drop_dup=data.drop_duplicates(subset=["订单id","用户id","国条码","购买数量"],keep="first")
Method 2:
data_drop_dup=data[data["该重复项是否应被剔除"]!="是"]
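The two methods produce the same result: `drop_duplicates(keep="first")` keeps exactly the rows that the Step 4 flag leaves unmarked. A quick check on hypothetical toy data (the flag column is dropped before comparing, since Method 1's output never contains it):

```python
import pandas as pd

df = pd.DataFrame({"订单id": ["A", "A", "B"], "购买数量": [1, 1, 2]})
keys = ["订单id", "购买数量"]

# Method 1: drop_duplicates keeps the first occurrence directly.
m1 = df.drop_duplicates(subset=keys, keep="first")

# Method 2: filter on the Step 4 flag column, then discard the helper column.
df.loc[df.duplicated(subset=keys, keep="first"), "该重复项是否应被剔除"] = "是"
m2 = df[df["该重复项是否应被剔除"] != "是"].drop(columns="该重复项是否应被剔除")

print(m1.equals(m2))  # True
```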