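Preprocessing data with apply and rename:
DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
# Signature from an older pandas release; broadcast and reduce are deprecated in favor of result_type.
# apply takes a function and runs it over the data. It is also handy when only some columns (features) are of interest and need processing.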
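To make the axis parameter concrete, here is a minimal sketch on a hypothetical toy frame (the names toy, col_range and row_diff are illustrative and not part of the original session). With the default axis=0 the function receives each column as a Series; with axis=1 it receives each row.

```python
import pandas as pd

# Hypothetical toy data, not the apply_demo.csv file used in this post.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the lambda is called once per column.
col_range = toy.apply(lambda col: col.max() - col.min())

# axis=1: the lambda is called once per row.
row_diff = toy.apply(lambda row: row["b"] - row["a"], axis=1)

print(col_range.to_dict())
print(row_diff.tolist())
```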
# Assumes: import pandas as pd; from pandas import Series, DataFrame
In [70]: df = pd.read_csv('apply_demo.csv')
In [71]: df.head() # returns the first 5 rows by default
Out[71]:
time data
0 1473411962 Symbol: APPL Seqno: 0 Price: 1623
1 1473411962 Symbol: APPL Seqno: 0 Price: 1623
2 1473411963 Symbol: APPL Seqno: 0 Price: 1623
3 1473411963 Symbol: APPL Seqno: 0 Price: 1623
4 1473411963 Symbol: APPL Seqno: 1 Price: 1649
In [72]: df.shape # 3989 samples (rows), each with two features (columns)
Out[72]: (3989, 2)
In [73]: df.size # total number of elements, i.e. 3989 * 2 = 7978
Out[73]: 7978
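In [74]: s1 = Series(['a'] * 3992) # Note: column assignment aligns on the index. If s1 is longer than df (3989 rows),
...: # the extra entries are dropped and the new column has df's length; if s1 is shorter,
...: # df's length still wins and the missing entries are filled with NaN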
In [75]: df['A'] = s1
In [76]: df.head()
Out[76]:
time data A
0 1473411962 Symbol: APPL Seqno: 0 Price: 1623 a
1 1473411962 Symbol: APPL Seqno: 0 Price: 1623 a
2 1473411963 Symbol: APPL Seqno: 0 Price: 1623 a
3 1473411963 Symbol: APPL Seqno: 0 Price: 1623 a
4 1473411963 Symbol: APPL Seqno: 1 Price: 1649 a
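The length behaviour noted above follows from index alignment: assigning a Series to a column matches labels rather than positions. A minimal sketch, using hypothetical names (frame, longer, shorter) that do not appear in the original session:

```python
import pandas as pd

frame = pd.DataFrame({"x": [1, 2, 3]})   # index 0, 1, 2

longer = pd.Series(["a"] * 5)            # index 0..4
shorter = pd.Series(["b"] * 2)           # index 0, 1

# Labels 3 and 4 have no matching row in frame, so they are dropped.
frame["from_longer"] = longer
# Row 2 has no matching label in shorter, so it becomes NaN.
frame["from_shorter"] = shorter

print(frame)
```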
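In [77]: df['A'] = df['A'].apply(str.upper) # Series.apply calls the function on every element; here it upper-cases column A
# (DataFrame.apply, by contrast, passes whole columns to the function by default, axis=0)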
In [78]: df.head()
Out[78]:
time data A
0 1473411962 Symbol: APPL Seqno: 0 Price: 1623 A
1 1473411962 Symbol: APPL Seqno: 0 Price: 1623 A
2 1473411963 Symbol: APPL Seqno: 0 Price: 1623 A
3 1473411963 Symbol: APPL Seqno: 0 Price: 1623 A
4 1473411963 Symbol: APPL Seqno: 1 Price: 1649 A
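The element-wise call above can also be written with the vectorized .str accessor, which additionally tolerates missing values. This is a small sketch on a hypothetical Series (letters), not on the post's data:

```python
import pandas as pd

letters = pd.Series(["a", "b", None])

# str.upper raises TypeError on a missing value, so drop those first when using apply.
upper_apply = letters.dropna().apply(str.upper)

# The .str accessor propagates missing values instead of failing.
upper_str = letters.str.upper()

print(upper_apply.tolist())
print(upper_str.isna().tolist())
```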
In [79]: # Each entry in data holds three values; extract them into separate columns
In [80]: l1 = df['data'][0].strip().split(' ') # strip() removes leading/trailing whitespace; split(' ') splits on spaces
In [81]: l1
Out[81]: ['Symbol:', 'APPL', 'Seqno:', '0', 'Price:', '1623']
In [82]: l1[1], l1[3],l1[5]
Out[82]: ('APPL', '0', '1623')
In [83]: # Define a function that extracts the wanted fields and returns them as a Series
In [84]: def foo(line):
    ...:     items = line.strip().split(' ')
    ...:     return Series([items[1], items[3], items[5]])
    ...:
In [85]: df_tmp = df['data'].apply(foo) # run foo on every entry; the returned Series become the rows of a DataFrame
df_tmp.head()
0 1 2
0 APPL 0 1623
1 APPL 0 1623
2 APPL 0 1623
3 APPL 0 1623
4 APPL 1 1649
In [86]: df_tmp = df_tmp.rename(columns={0:'Symbol', 1:'Seqno', 2:'Price'}) # rename the columns
In [87]: df_tmp.head()
Out[87]:
Symbol Seqno Price
0 APPL 0 1623
1 APPL 0 1623
2 APPL 0 1623
3 APPL 0 1623
4 APPL 1 1649
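The separate rename step above can be avoided by returning a labelled Series from the parsing function, since the Series index becomes the column names. The sketch below uses hypothetical names (raw, parse_line, parsed) and two sample lines copied from the output shown earlier; converting to int is an added assumption, not something the original foo does.

```python
import pandas as pd

raw = pd.DataFrame({"data": ["Symbol: APPL Seqno: 0 Price: 1623",
                             "Symbol: APPL Seqno: 1 Price: 1649"]})

def parse_line(line):
    """Split one record and return its fields as a labelled Series."""
    items = line.strip().split(" ")
    return pd.Series({"Symbol": items[1],
                      "Seqno": int(items[3]),   # assumption: numeric fields are wanted as ints
                      "Price": int(items[5])})

parsed = raw["data"].apply(parse_line)
print(parsed)
```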
df_tmp = df_tmp.rename({0:'a',1:'b',2:'c',3:'d',4:'e'},axis='index') # rename the index labels
df_tmp.head()
0 1 2
a APPL 0 1623
b APPL 0 1623
c APPL 0 1623
d APPL 0 1623
e APPL 1 1649
df_tmp = df_tmp.rename(index={'a':'A','b':'B',2:'C',3:'d',4:'e'}) # keys not present in the index (2, 3, 4) are silently ignored
df_tmp.head()
0 1 2
A APPL 0 1623
B APPL 0 1623
c APPL 0 1623
d APPL 0 1623
e APPL 1 1649
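As the output above shows, rename skips mapping keys that are not found in the index. If a missing label should be treated as a mistake, recent pandas versions accept errors='raise'. A minimal sketch on a hypothetical frame (small):

```python
import pandas as pd

small = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "c"])

# Default behaviour: the unknown key 'zzz' is ignored.
renamed = small.rename(index={"a": "A", "zzz": "Z"})

# With errors='raise', an unknown key produces a KeyError.
try:
    small.rename(index={"zzz": "Z"}, errors="raise")
    raised = False
except KeyError:
    raised = True

print(renamed.index.tolist(), raised)
```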
In [88]: df_new = df.combine_first(df_tmp) # use combine_first to add the extracted columns to the original data
In [89]: df_new.head()
Out[89]:
A Price Seqno Symbol data time
0 A 1623.0 0.0 APPL Symbol: APPL Seqno: 0 Price: 1623 1473411962
1 A 1623.0 0.0 APPL Symbol: APPL Seqno: 0 Price: 1623 1473411962
2 A 1623.0 0.0 APPL Symbol: APPL Seqno: 0 Price: 1623 1473411963
3 A 1623.0 0.0 APPL Symbol: APPL Seqno: 0 Price: 1623 1473411963
4 A 1649.0 1.0 APPL Symbol: APPL Seqno: 1 Price: 1649 1473411963
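In the output above, combine_first returned the columns in sorted order. If the original column order matters, one alternative is pd.concat along axis=1, which aligns on the index and keeps each frame's columns in place. This is a sketch with hypothetical frames (left, extra), not the author's method:

```python
import pandas as pd

left = pd.DataFrame({"time": [1473411962, 1473411963], "data": ["x", "y"]})
extra = pd.DataFrame({"Symbol": ["APPL", "APPL"], "Price": [1623, 1649]})

# Rows are matched by index label; columns keep their original order.
combined = pd.concat([left, extra], axis=1)
print(combined.columns.tolist())
```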
In [90]: del df_new['A'],df_new['data'] # delete the columns that are no longer needed
In [91]: df_new.head()
Out[91]:
Price Seqno Symbol time
0 1623.0 0.0 APPL 1473411962
1 1623.0 0.0 APPL 1473411962
2 1623.0 0.0 APPL 1473411963
3 1623.0 0.0 APPL 1473411963
4 1649.0 1.0 APPL 1473411963
In [92]: df_new.to_csv('demo_duplicate.csv')
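By default to_csv also writes the index, which is why an "Unnamed: 0" column appears when this file is read back later. Passing index=False avoids that. The sketch below uses an in-memory buffer and a hypothetical frame (table) instead of a real file:

```python
import io
import pandas as pd

table = pd.DataFrame({"Price": [1623, 1649], "Seqno": [0, 1]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
table.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
table.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```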
The DataFrame merge operation:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
This operation merges DataFrame objects by performing a database-style join on columns or indexes.
Example:
In [19]: df1 = DataFrame({'key':['x', 'y', 'z','y'],'data_set_1':[1,2,3,8]})
In [20]: df1
Out[20]:
key data_set_1
0 x 1
1 y 2
2 z 3
3 y 8
In [21]: df2 =DataFrame({'key':['A','y','C'],'data_set_2':[4,5,6]})
In [22]: df2
Out[22]:
key data_set_2
0 A 4
1 y 5
2 C 6
In [23]: pd.merge(df1,df2,on=None) # on=None (the default) joins on the columns both DataFrames share, here 'key'; how defaults to 'inner'
Out[23]:
key data_set_1 data_set_2
0 y 2 5
1 y 8 5
In [24]: pd.merge(df1,df2,on='key') # same result as the default
Out[24]:
key data_set_1 data_set_2
0 y 2 5
1 y 8 5
In [25]: # pd.merge(df1,df2,on='data_test_1') # fails with a KeyError: the column named in on
...: # must exist in both DataFrames
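When the join columns have different names in the two frames, the left_on and right_on parameters from the signature above can be used instead of on. A minimal sketch with hypothetical frames (left, right) and a hypothetical column name id:

```python
import pandas as pd

left = pd.DataFrame({"key": ["x", "y"], "data_set_1": [1, 2]})
right = pd.DataFrame({"id": ["y", "w"], "data_set_2": [5, 6]})

# Inner join matching left['key'] against right['id']; both key columns are kept.
merged = pd.merge(left, right, left_on="key", right_on="id")
print(merged)
```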
In [26]: pd.merge(df1,df2,on='key',how='left') # how='left' keeps every row of df1, like a SQL LEFT JOIN; unmatched keys get NaN
Out[26]:
key data_set_1 data_set_2
0 x 1 NaN
1 y 2 5.0
2 z 3 NaN
3 y 8 5.0
In [27]: pd.merge(df1,df2,on='key',how='right') # how='right' keeps every row of df2
Out[27]:
key data_set_1 data_set_2
0 y 2.0 5
1 y 8.0 5
2 A NaN 4
3 C NaN 6
In [28]: pd.merge(df1,df2,on='key',how='outer') # how='outer' keeps the keys from both DataFrames
Out[28]:
key data_set_1 data_set_2
0 x 1.0 NaN
1 y 2.0 5.0
2 y 8.0 5.0
3 z 3.0 NaN
4 A NaN 4.0
5 C NaN 6.0
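To see where each row of an outer join came from, the indicator parameter listed in the signature above adds a _merge column. A short sketch that rebuilds the same df1 and df2 so it runs on its own (counts is an illustrative name):

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["x", "y", "z", "y"], "data_set_1": [1, 2, 3, 8]})
df2 = pd.DataFrame({"key": ["A", "y", "C"], "data_set_2": [4, 5, 6]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)
counts = outer["_merge"].value_counts().to_dict()
print(counts)
```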
Removing duplicates:
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Returns a DataFrame with duplicate rows removed.
Parameters:
subset: column label or sequence of labels, optional
keep: {'first', 'last', False}, default 'first'
inplace: bool, default False
In [93]: df = pd.read_csv('demo_duplicate.csv')
In [94]: df.head()
Out[94]:
Unnamed: 0 Price Seqno Symbol time
0 0 1623.0 0.0 APPL 1473411962
1 1 1623.0 0.0 APPL 1473411962
2 2 1623.0 0.0 APPL 1473411963
3 3 1623.0 0.0 APPL 1473411963
4 4 1649.0 1.0 APPL 1473411963
In [95]: del df['Unnamed: 0'] # delete the 'Unnamed: 0' column
In [96]: df.head() # Seqno contains many repeated values; remove them below
Out[96]:
Price Seqno Symbol time
0 1623.0 0.0 APPL 1473411962
1 1623.0 0.0 APPL 1473411962
2 1623.0 0.0 APPL 1473411963
3 1623.0 0.0 APPL 1473411963
4 1649.0 1.0 APPL 1473411963
In [97]: df.shape # check how many rows there are
Out[97]: (3989, 4)
In [98]: len(df['Seqno'].unique()) # count the distinct values in this column
Out[98]: 1000
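A related shortcut is Series.nunique(), which counts distinct values directly. One difference worth knowing: unique() includes NaN as a value, while nunique() skips it by default. A small sketch on a hypothetical Series (values):

```python
import pandas as pd

values = pd.Series([0.0, 0.0, 1.0, None])

n_with_nan = len(values.unique())   # NaN counts as one distinct value
n_without_nan = values.nunique()    # NaN is excluded by default (dropna=True)

print(n_with_nan, n_without_nan)
```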
In [99]: df['Seqno'].duplicated().head() # flags repeated values; with the default keep='first' the first occurrence is False and later repeats are True
Out[99]:
0 False
1 True
2 True
3 True
4 False
Name: Seqno, dtype: bool
In [100]: df['Seqno'].drop_duplicates().head() # drop repeats, keeping the first occurrence by default; returns a Series
Out[100]:
0 0.0
4 1.0
8 2.0
12 3.0
16 4.0
Name: Seqno, dtype: float64
In [101]: df.drop_duplicates().head() # repeated Seqno values remain: without subset, rows are compared on all columns
Out[101]:
Price Seqno Symbol time
0 1623.0 0.0 APPL 1473411962
2 1623.0 0.0 APPL 1473411963
4 1649.0 1.0 APPL 1473411963
6 1649.0 1.0 APPL 1473411964
8 1642.0 2.0 APPL 1473411964
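The result above keeps repeated Seqno values because two rows count as duplicates only when every compared column matches. A minimal sketch on a hypothetical frame (rows) that contrasts the default with subset:

```python
import pandas as pd

rows = pd.DataFrame({"Seqno": [0, 0, 0], "time": [10, 10, 11]})

# All columns compared: row 2 differs in 'time', so it survives.
all_cols = rows.drop_duplicates()

# Only 'Seqno' compared: a single row per Seqno value remains.
by_seqno = rows.drop_duplicates(subset=["Seqno"])

print(all_cols.index.tolist(), by_seqno.index.tolist())
```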
In [102]: df.drop_duplicates(['Seqno']).head() # passing the column as subset does the job: duplicates are judged on this column only
Out[102]:
Price Seqno Symbol time
0 1623.0 0.0 APPL 1473411962
4 1649.0 1.0 APPL 1473411963
8 1642.0 2.0 APPL 1473411964
12 1636.0 3.0 APPL 1473411965
16 1669.0 4.0 APPL 1473411966
In [103]: df.drop_duplicates(['Seqno'],keep='last').head() # keep='last' keeps the last occurrence of each repeated value
Out[103]:
Price Seqno Symbol time
3 1623.0 0.0 APPL 1473411963
7 1649.0 1.0 APPL 1473411964
11 1642.0 2.0 APPL 1473411965
15 1636.0 3.0 APPL 1473411966
19 1669.0 4.0 APPL 1473411967
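The third option for keep, False, drops every row whose value is repeated, leaving only values that occur exactly once. A short sketch on a hypothetical frame (ticks), since the post's data file is not available here:

```python
import pandas as pd

ticks = pd.DataFrame({"Seqno": [0, 0, 1, 2, 2],
                      "Price": [1623, 1623, 1649, 1642, 1642]})

# keep=False removes all members of each duplicate group.
only_unique = ticks.drop_duplicates(["Seqno"], keep=False)
print(only_unique)
```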