Playing with Data in Pandas (7) -- Deduplicating Series and DataFrame

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df = pd.read_csv('demo_duplicate.csv')
df.head()
Out[40]: 
   Unnamed: 0   Price  Seqno Symbol        time
0           0  1623.0    0.0   APPL  1473411962
1           1  1623.0    0.0   APPL  1473411962
2           2  1623.0    0.0   APPL  1473411963
3           3  1623.0    0.0   APPL  1473411963
4           4  1649.0    1.0   APPL  1473411963

# size is the total number of elements (rows x columns); len gives the number of rows
df.size
Out[41]: 19945

len(df)
Out[42]: 3989

len(df['Seqno'].unique())
Out[46]: 1000
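
The count of distinct values above can also be obtained with Series.nunique(). The two differ when the column contains NaN: unique() returns NaN as one of the values, while nunique() skips it by default. A minimal sketch, assuming a small made-up Series `s` (hypothetical data, not taken from demo_duplicate.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; the NaN is included on purpose.
s = pd.Series([0.0, 0.0, 1.0, 2.0, np.nan])

# unique() returns the distinct values, with NaN as one of them.
n_unique_with_nan = len(s.unique())

# nunique() ignores NaN unless dropna=False is passed.
n_unique_default = s.nunique()
n_unique_keep_nan = s.nunique(dropna=False)

print(n_unique_with_nan, n_unique_default, n_unique_keep_nan)
```

If the column has no missing values, as the Seqno rows shown above suggest, both approaches give the same count.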

# Check a single column (a Series) for duplicates; repeats of an earlier value are marked True
df['Seqno'].duplicated().head()
Out[47]: 
0    False
1     True
2     True
3     True
4    False
Name: Seqno, dtype: bool

type(df['Seqno'].duplicated())
Out[48]: pandas.core.series.Series
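
Because duplicated() returns a boolean Series, it can be used directly as a mask. The sketch below assumes a small hand-built frame `df_demo` (hypothetical data that only imitates the columns above) and shows how to filter with the mask and count the repeats:

```python
import pandas as pd

# Hypothetical sample data imitating the columns above; not the real CSV.
df_demo = pd.DataFrame({
    "Seqno": [0.0, 0.0, 0.0, 1.0, 1.0],
    "Price": [1623.0, 1623.0, 1623.0, 1649.0, 1649.0],
    "time": [1473411962, 1473411962, 1473411963, 1473411963, 1473411964],
})

# duplicated() marks every repeat of an earlier Seqno as True.
mask = df_demo["Seqno"].duplicated()

# Negating the mask keeps only the first row for each Seqno.
first_rows = df_demo[~mask]

# True counts as 1, so the sum is the number of repeated rows.
n_repeats = int(mask.sum())

print(first_rows)
print(n_repeats)
```

Filtering with `~mask` selects the same rows that drop_duplicates(['Seqno']) would keep with its default settings.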

# Deduplicate a DataFrame with drop_duplicates (here, by the Seqno column)
df.drop_duplicates(['Seqno']).head()
Out[49]: 
    Unnamed: 0   Price  Seqno Symbol        time
0            0  1623.0    0.0   APPL  1473411962
4            4  1649.0    1.0   APPL  1473411963
8            8  1642.0    2.0   APPL  1473411964
12          12  1636.0    3.0   APPL  1473411965
16          16  1669.0    4.0   APPL  1473411966

# The keep parameter specifies which of the duplicate rows to keep
df.drop_duplicates(['Seqno'], keep='last').head()
Out[53]: 
    Unnamed: 0   Price  Seqno Symbol        time
3            3  1623.0    0.0   APPL  1473411963
7            7  1649.0    1.0   APPL  1473411964
11          11  1642.0    2.0   APPL  1473411965
15          15  1636.0    3.0   APPL  1473411966
19          19  1669.0    4.0   APPL  1473411967
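
The effect of keep is easier to see on a tiny frame. The following is a minimal sketch on hypothetical data (`df_demo` is made up for illustration), comparing the default keep='first', keep='last', and keep=False, which drops every row whose Seqno occurs more than once:

```python
import pandas as pd

# Hypothetical sample data; Seqno 0.0 and 1.0 each appear twice.
df_demo = pd.DataFrame({
    "Seqno": [0.0, 0.0, 1.0, 1.0, 2.0],
    "time": [10, 11, 11, 12, 13],
})

# Default: keep the first row of each group of duplicates.
first = df_demo.drop_duplicates(["Seqno"])

# Keep the last row of each group instead.
last = df_demo.drop_duplicates(["Seqno"], keep="last")

# keep=False removes all rows whose Seqno is duplicated.
no_dups = df_demo.drop_duplicates(["Seqno"], keep=False)

# The original index labels are preserved; reset them if a clean 0..n-1 index is wanted.
first_reset = first.reset_index(drop=True)

print(first.index.tolist(), last.index.tolist(), no_dups.index.tolist())
```

As in the outputs above, the surviving rows keep their original index labels, which is why the result of keep='last' starts at index 3 rather than 0.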

Reposted from blog.csdn.net/weixin_39778570/article/details/81114746