Marking duplicate records in source data with Python pandas

We have a set of order source data that was collected and merged from several business staff members. Because the order data supplied by different staff members can overlap, the duplicate records need to be marked in the source data. Once the business staff have confirmed them, the duplicates are deleted to obtain the final, accurate order source data, which is then used for subsequent order statistics.
In the source data, the four fields 订单id (order ID), 用户id (user ID), 国条码 (barcode), and 购买数量 (purchase quantity), taken together, uniquely identify a row.

Step 1: Read the data

import pandas as pd

# Path to the merged order workbook on the author's machine
file_path = r"E:\临时\20220214\临时.xlsx"
data = pd.read_excel(file_path)
data.head()  # preview the first five rows

Step 2: Check the number of duplicate record rows

# Count duplicate rows; the first occurrence of each duplicated record is not counted
data.duplicated(subset=["订单id", "用户id", "国条码", "购买数量"], keep="first").sum()

Output: 96

# Count duplicate rows; the last occurrence of each duplicated record is not counted
data.duplicated(subset=["订单id", "用户id", "国条码", "购买数量"], keep="last").sum()

Output: 96

# Count duplicate rows; every row of a record that appears more than once is counted
data.duplicated(subset=["订单id", "用户id", "国条码", "购买数量"], keep=False).sum()

Output: 121

To sum up: 121 - 96 = 25 distinct records are duplicated, and together they occupy 121 rows. Each of these 25 records appears at least twice, and some appear three or more times. If only one copy of each is kept, 96 rows are redundant.
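
The arithmetic above can be checked on a small example. The following is a minimal sketch on made-up toy data (the six rows are hypothetical and are not taken from the author's workbook; only the four column names come from the article). It shows that the `keep=False` count minus the `keep="first"` count equals the number of distinct duplicated records.

```python
import pandas as pd

keys = ["订单id", "用户id", "国条码", "购买数量"]

# Hypothetical toy data: one record appears 3 times, one appears twice, one is unique
toy = pd.DataFrame({
    "订单id": [1001, 1001, 1001, 1002, 1002, 1003],
    "用户id": [1, 1, 1, 2, 2, 3],
    "国条码": ["A", "A", "A", "B", "B", "C"],
    "购买数量": [2, 2, 2, 1, 1, 5],
})

# Rows that repeat an earlier row (first occurrences are not counted)
extra_rows = int(toy.duplicated(subset=keys, keep="first").sum())

# All rows that belong to a record appearing more than once
all_dup_rows = int(toy.duplicated(subset=keys, keep=False).sum())

# Number of distinct records among those duplicated rows
dup_mask = toy.duplicated(subset=keys, keep=False)
distinct_dup_records = len(toy[dup_mask].drop_duplicates(subset=keys))

print(extra_rows, all_dup_rows, distinct_dup_records)
```

In this toy frame the three numbers play the roles of 96, 121, and 25 in the article's data.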

Step 3: Mark duplicate records

# Flag all 121 rows of duplicated records with "是" (yes) in a new column "该行是否重复" (is this row a duplicate)
data.loc[data.duplicated(subset=["订单id", "用户id", "国条码", "购买数量"], keep=False), "该行是否重复"] = "是"

Step 4: Mark duplicate records that need to be eliminated

# If a record appears several times, keep its first occurrence and flag the later ones for removal
data.loc[data.duplicated(subset=["订单id", "用户id", "国条码", "购买数量"], keep="first"), "该重复项是否应被剔除"] = "是"
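
With the `data.loc[...] = "是"` assignments in Steps 3 and 4, rows that are not flagged are left as NaN in the new columns. If the reviewers would rather see an explicit value in every row, one possible variation (an assumption on my part, not something the original article does) is to map the boolean masks to "是"/"否". The sketch below uses hypothetical toy data with the article's column names.

```python
import pandas as pd

keys = ["订单id", "用户id", "国条码", "购买数量"]

# Hypothetical toy data, not the author's workbook
toy = pd.DataFrame({
    "订单id": [1001, 1001, 1001, 1002, 1002, 1003],
    "用户id": [1, 1, 1, 2, 2, 3],
    "国条码": ["A", "A", "A", "B", "B", "C"],
    "购买数量": [2, 2, 2, 1, 1, 5],
})

# Boolean masks: every row of a duplicated record, and the later copies only
is_dup = toy.duplicated(subset=keys, keep=False)
to_drop = toy.duplicated(subset=keys, keep="first")

# Write "是" (yes) or "否" (no) into every row instead of leaving NaN
yes_no = {True: "是", False: "否"}
toy["该行是否重复"] = is_dup.map(yes_no)
toy["该重复项是否应被剔除"] = to_drop.map(yes_no)

print(toy)
```

A filter of the form `!= "是"` on the second column still selects the rows to keep, because those rows now hold "否" rather than NaN.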

Step 5: Delete duplicate records after confirming with business personnel

Method 1:

data_drop_dup=data.drop_duplicates(subset=["订单id","用户id","国条码","购买数量"],keep="first")

Method 2:

data_drop_dup=data[data["该重复项是否应被剔除"]!="是"]
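
As a sanity check, the two methods can be compared on a small frame. This is a minimal sketch on hypothetical toy data (only the column names come from the article); it assumes the marker column has not been edited after Step 4, in which case both methods should keep the same rows.

```python
import pandas as pd

keys = ["订单id", "用户id", "国条码", "购买数量"]

# Hypothetical toy data, not the author's workbook
toy = pd.DataFrame({
    "订单id": [1001, 1001, 1001, 1002, 1002, 1003],
    "用户id": [1, 1, 1, 2, 2, 3],
    "国条码": ["A", "A", "A", "B", "B", "C"],
    "购买数量": [2, 2, 2, 1, 1, 5],
})

# Step 4: flag every copy after the first occurrence
toy.loc[toy.duplicated(subset=keys, keep="first"), "该重复项是否应被剔除"] = "是"

# Method 1: let pandas drop the later copies directly
method_1 = toy.drop_duplicates(subset=keys, keep="first")

# Method 2: keep the rows that were not flagged (NaN != "是" evaluates to True)
method_2 = toy[toy["该重复项是否应被剔除"] != "是"]

# Both results should contain exactly the same row labels
same_rows = method_1.index.equals(method_2.index)
print(same_rows)
```

The practical difference is that Method 2 follows the marker column, so if the business staff change a flag during review, the deletion respects that change, whereas Method 1 always recomputes the duplicates from the four key fields.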

Origin blog.csdn.net/p1306252/article/details/122941554