pandas DataFrame: viewing, detecting, and removing duplicate data

This article explains how to use pandas to view duplicate data in a DataFrame, check whether duplicates exist, and remove them.

Sample DataFrame:

import pandas as pd
df = pd.DataFrame({'name':['苹果','梨','草莓','苹果'], 'price':[7,8,9,8], 'cnt':[3,4,5,4]})

  name  price  cnt
0   苹果      7    3
1    梨      8    4
2   草莓      9    5
3   苹果      8    4

>> Viewing duplicate data in a DataFrame

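# Count the rows per price; True marks prices that occur more than once
a = df.groupby('price').count() > 1
# Collect the price values that are repeated
price = a[a['cnt']].index
# Keep only the rows whose price is one of the repeated values
repeat_df = df[df['price'].isin(price)]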
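
A shorter route to the same rows, sketched here on the sample data from the top of the article: duplicated(keep=False) flags every member of a duplicate group, so its result can be used directly as a boolean mask. The names mask and repeat_df_alt are illustrative only and do not come from the original code.

```python
import pandas as pd

# Sample data copied from the DataFrame defined at the top of the article
df = pd.DataFrame({'name': ['苹果', '梨', '草莓', '苹果'],
                   'price': [7, 8, 9, 8],
                   'cnt': [3, 4, 5, 4]})

# keep=False flags every row whose price occurs more than once,
# not only the second and later occurrences
mask = df['price'].duplicated(keep=False)

# Select the flagged rows (illustrative name, not from the article)
repeat_df_alt = df[mask]
print(repeat_df_alt)
```

This avoids the intermediate groupby and selects rows 1 and 3, the two rows priced at 8.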

>> Detecting duplicates with the duplicated() method

1. Check whether a single column contains duplicate values

flag = df.price.duplicated()

0    False
1    False
2    False
3     True
Name: price, dtype: bool

flag.any() returns True  (any() is a logical OR over the values in flag)
flag.all() returns False  (all() is a logical AND over the values in flag)
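
As a minimal sketch of how these checks might be used in practice, assuming the sample data from the top of the article, the snippet below turns the flag Series into plain Python values. The names has_dup, all_dup and n_dup are made up for this illustration.

```python
import pandas as pd

# Sample data copied from the DataFrame defined at the top of the article
df = pd.DataFrame({'name': ['苹果', '梨', '草莓', '苹果'],
                   'price': [7, 8, 9, 8],
                   'cnt': [3, 4, 5, 4]})

flag = df['price'].duplicated()

has_dup = bool(flag.any())  # is at least one price a repeat?
all_dup = bool(flag.all())  # is every row a repeat of an earlier row?
n_dup = int(flag.sum())     # True counts as 1, so this counts the repeats

print(has_dup, all_dup, n_dup)
```

With the default keep='first', the first occurrence of each value is never flagged, so flag.all() cannot be True for non-empty data. In most cases flag.any() or flag.sum() is the more useful check.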

2. Check whether entire rows are duplicated

flag = df.duplicated()
Evaluate the result the same way as in 1.

3. Check whether a combination of several columns is duplicated

df.duplicated(subset = ['price','cnt'])
Evaluate the result the same way as in 1.
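
To make the difference between the whole-row check and the multi-column check concrete, here is a small sketch that runs both on the sample data from the top of the article. The names row_flag and combo_flag are illustrative only.

```python
import pandas as pd

# Sample data copied from the DataFrame defined at the top of the article
df = pd.DataFrame({'name': ['苹果', '梨', '草莓', '苹果'],
                   'price': [7, 8, 9, 8],
                   'cnt': [3, 4, 5, 4]})

# Whole-row check: a row is flagged only if every column matches an earlier row
row_flag = df.duplicated()

# Multi-column check: only the (price, cnt) pair has to match an earlier row
combo_flag = df.duplicated(subset=['price', 'cnt'])

print(row_flag.any())    # no two rows agree on all three columns
print(combo_flag.any())  # rows 1 and 3 share price 8 and cnt 4
```

Rows 1 and 3 differ in name, so the whole-row check finds nothing, while the (price, cnt) check flags row 3.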

>> Removing duplicates with the drop_duplicates() method

1. Deduplicating a DataFrame

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Example:
df.drop_duplicates(subset=['price', 'cnt'], keep='last', inplace=True)

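drop_duplicates parameters:
  subset
    The columns to consider when identifying duplicates; defaults to all columns.
  keep
    'first' or 'last': whether to keep the first or the last occurrence of each duplicate. Defaults to 'first'. Passing False drops every duplicated row.
  inplace
    Whether to modify the original data directly (True) or return a deduplicated copy (False). Defaults to False.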
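
The effect of the keep parameter is easiest to see side by side. The following sketch assumes the sample data from the top of the article and leaves inplace at its default, so df itself is not modified. The result names first_kept, last_kept and none_kept are invented for this illustration.

```python
import pandas as pd

# Sample data copied from the DataFrame defined at the top of the article
df = pd.DataFrame({'name': ['苹果', '梨', '草莓', '苹果'],
                   'price': [7, 8, 9, 8],
                   'cnt': [3, 4, 5, 4]})

cols = ['price', 'cnt']  # rows 1 and 3 share the pair (8, 4)

first_kept = df.drop_duplicates(subset=cols)               # keeps row 1, drops row 3
last_kept = df.drop_duplicates(subset=cols, keep='last')   # keeps row 3, drops row 1
none_kept = df.drop_duplicates(subset=cols, keep=False)    # drops both rows 1 and 3

print(list(first_kept.index))
print(list(last_kept.index))
print(list(none_kept.index))
```

The returned frames keep their original index labels. Call reset_index(drop=True) on the result if a fresh 0..n-1 index is wanted.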

Origin www.cnblogs.com/trotl/p/11876292.html