import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv('demo_duplicate.csv')
df.head()
Out[40]:
Unnamed: 0 Price Seqno Symbol time
0 0 1623.0 0.0 APPL 1473411962
1 1 1623.0 0.0 APPL 1473411962
2 2 1623.0 0.0 APPL 1473411963
3 3 1623.0 0.0 APPL 1473411963
4 4 1649.0 1.0 APPL 1473411963
# Total number of elements (rows x columns) versus number of rows
df.size
Out[41]: 19945
len(df)
Out[42]: 3989
len(df['Seqno'].unique())
Out[46]: 1000
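The unique-value count above can also be obtained with `Series.nunique()`, and `Series.value_counts()` shows how often each value repeats. The sketch below uses a small made-up Series rather than `demo_duplicate.csv`, so the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample values, not taken from demo_duplicate.csv
s = pd.Series([0.0, 0.0, 1.0, 1.0, 2.0])

n_unique = s.nunique()       # number of distinct values (NaN is excluded by default)
counts = s.value_counts()    # occurrences of each distinct value

print(n_unique)
print(counts)
```

Note that `nunique()` skips NaN unless `dropna=False` is passed, whereas `len(s.unique())` counts NaN as a value, so the two can differ by one on data with missing entries.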
# Check a single column (a Series) for duplicates
df['Seqno'].duplicated().head()
Out[47]:
0 False
1 True
2 True
3 True
4 False
Name: Seqno, dtype: bool
type(df['Seqno'].duplicated())
Out[48]: pandas.core.series.Series
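Because `duplicated()` returns a boolean Series, it can be used as a mask to count or filter the repeated rows. The following is a minimal sketch on a hand-built frame that only imitates the first rows of the data shown earlier; the frame contents are assumptions.

```python
import pandas as pd

# Hypothetical sample imitating the first rows of demo_duplicate.csv
df = pd.DataFrame({
    'Price': [1623.0, 1623.0, 1623.0, 1623.0, 1649.0, 1649.0],
    'Seqno': [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    'Symbol': ['APPL'] * 6,
    'time': [1473411962, 1473411962, 1473411963,
             1473411963, 1473411963, 1473411964],
})

mask = df['Seqno'].duplicated()   # True for every repeat after the first occurrence
n_repeats = int(mask.sum())       # how many rows are flagged as repeats
first_rows = df[~mask]            # keep only the rows that are not flagged

print(n_repeats)
print(first_rows)
```

Filtering with `~mask` selects the same rows that `drop_duplicates(['Seqno'])` would keep with its default `keep='first'`.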
# Deduplicate a DataFrame with drop_duplicates
df.drop_duplicates(['Seqno']).head()
Out[49]:
Unnamed: 0 Price Seqno Symbol time
0 0 1623.0 0.0 APPL 1473411962
4 4 1649.0 1.0 APPL 1473411963
8 8 1642.0 2.0 APPL 1473411964
12 12 1636.0 3.0 APPL 1473411965
16 16 1669.0 4.0 APPL 1473411966
# The keep parameter specifies which of the duplicates to retain
df.drop_duplicates(['Seqno'], keep='last').head()
Out[53]:
Unnamed: 0 Price Seqno Symbol time
3 3 1623.0 0.0 APPL 1473411963
7 7 1649.0 1.0 APPL 1473411964
11 11 1642.0 2.0 APPL 1473411965
15 15 1636.0 3.0 APPL 1473411966
19 19 1669.0 4.0 APPL 1473411967
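Besides `'first'` (the default) and `'last'`, `keep` also accepts `False`, which drops every row whose key appears more than once. Passing several column names makes a row count as a duplicate only when all of those columns match. A small sketch on made-up data (the frame below is an assumption, not the original CSV):

```python
import pandas as pd

# Hypothetical sample, not taken from demo_duplicate.csv
df = pd.DataFrame({
    'Seqno': [0.0, 0.0, 1.0, 2.0, 2.0],
    'time': [1473411962, 1473411963, 1473411963, 1473411964, 1473411964],
})

# keep=False removes all rows whose Seqno occurs more than once
only_unique = df.drop_duplicates(['Seqno'], keep=False)

# With two columns, rows are duplicates only if both Seqno and time match
by_pair = df.drop_duplicates(['Seqno', 'time'])

print(only_unique)
print(by_pair)
```

`drop_duplicates` returns a new DataFrame and keeps the original index labels; chain `.reset_index(drop=True)` if a fresh 0..n-1 index is wanted.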
Mastering Data with Pandas (7) -- Deduplicating Series and DataFrames
Reposted from blog.csdn.net/weixin_39778570/article/details/81114746