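II. Basic Functionality
1. Reindexing
reindex is an important method on pandas objects; it creates a new object whose data conforms to a new index. Labels that were not present in the original index get missing values (NaN):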
import numpy as np
import pandas as pd
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
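For ordered data such as time series, you may want to interpolate or fill values when reindexing. The optional method parameter lets you do this with a method such as ffill, which forward-fills the values: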
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
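As a complement to the forward-fill example above, here is a minimal sketch of two other ways to handle the labels that reindex introduces: method='bfill' and fill_value. The variable names (colors, backfilled, constant_filled) and the placeholder string 'unknown' are illustrative assumptions, not part of the original notes.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Backward fill: each new label takes the next existing observation.
# Label 5 has no later observation, so it stays missing.
backfilled = colors.reindex(range(6), method='bfill')

# fill_value puts a constant in every label introduced by the reindex.
constant_filled = colors.reindex(range(6), fill_value='unknown')

print(backfilled)
print(constant_filled)
```

Note that method requires the original index to be sorted (monotonic), whereas fill_value works with any index.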
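With a DataFrame, reindex can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows in the result; columns can be reindexed with the columns keyword: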
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California'])
frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
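In older pandas versions you could reindex both axes more concisely by label-indexing with loc. Since pandas 1.0, loc raises a KeyError if any requested label is missing (here 'b' and 'Utah'), so pass both axes to reindex instead:
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)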
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
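Parameters of the reindex method:
Parameter | Description
index | New sequence to use as the index
method | Fill (interpolation) method: ffill fills forward, bfill fills backward
fill_value | Substitute value to use for missing data introduced by reindexing
limit | When forward- or backward-filling, the maximum size of gap (in number of elements) to fill
tolerance | When forward- or backward-filling, the maximum size of gap (in absolute numeric distance) to fill for inexact matches
level | Match a simple Index on a level of a MultiIndex
copy | If True, copy the underlying data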
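The limit parameter is easiest to understand by example. The sketch below uses a small made-up Series (the names sparse and limited are assumptions for illustration) and forward-fills at most two consecutive new labels after each existing observation.

```python
import pandas as pd

# Observations exist only at labels 0 and 5.
sparse = pd.Series([1.0, 2.0], index=[0, 5])

# Forward-fill while reindexing, but fill at most 2 consecutive
# new labels after each existing observation.
limited = sparse.reindex(range(9), method='ffill', limit=2)

print(limited)
```

Labels 1 and 2 are filled from label 0, labels 3 and 4 stay NaN because the limit is exhausted, and the same pattern repeats after label 5.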
2. Dropping Entries from an Axis
The drop method returns a new object with the indicated value or values deleted from an axis:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
obj.drop(['d', 'c'])
a 0.0
b 1.0
e 4.0
dtype: float64
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Calling drop with a sequence of labels drops values from the row labels (axis 0):
data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
You can drop values from the columns by passing axis=1 or axis='columns':
data.drop('two', axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
data.drop(['two', 'four'], axis='columns')
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
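Many functions, such as drop, that modify the size or shape of a Series or DataFrame can operate on the object in place, without returning a new object, when inplace=True is passed. Be careful with inplace, because the dropped data is gone for good: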
obj.drop('c', inplace=True)
obj
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
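To make the difference between the two calling styles concrete, here is a small sketch. The names s, trimmed, and in_place_result are illustrative and deliberately different from the obj used in these notes, so the running example is not disturbed.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default behaviour: drop returns a new Series and leaves s untouched.
trimmed = s.drop('c')
original_still_has_c = 'c' in s.index

# With inplace=True, s itself is modified and the call returns None.
in_place_result = s.drop('c', inplace=True)

print(trimmed.index.tolist())
print(original_still_has_c, in_place_result, s.index.tolist())
```

Because the in-place form returns None, writing s = s.drop('c', inplace=True) would accidentally replace the Series with None.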
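3. Indexing, Selection, and Filtering
Series indexing works much like NumPy array indexing, except that you can use the index labels instead of only integers. Slicing with labels also behaves differently from normal Python slicing: the endpoint is included.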
obj['c'] = 2.0
obj = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
obj['b':'c']
b 1.0
c 2.0
dtype: float64
obj['b':'c'] = 5
obj
a 0.0
b 5.0
c 5.0
d 3.0
e 4.0
dtype: float64
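The inclusive endpoint is easiest to see side by side with a positional slice. A minimal sketch, using an illustrative Series named labeled rather than the obj above:

```python
import numpy as np
import pandas as pd

labeled = pd.Series(np.arange(5.), index=list('abcde'))

# Label-based slice: both endpoints 'b' and 'c' are included.
by_label = labeled.loc['b':'c']

# Position-based slice: the stop position 2 is excluded, as in plain Python.
by_position = labeled.iloc[1:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```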
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data['two']
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int64
data[['three', 'one']]
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
Indexing like this has a few special cases. First, you can select rows with a slice or with a boolean array:
data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three'] > 5]
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data < 5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data < 5] = 0
data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Selecting data with loc (labels) and iloc (integer positions):
data.loc['Colorado', ['two', 'three']]
two 5
three 6
Name: Colorado, dtype: int64
data.iloc[2, [3, 0, 1]]
four 11
one 8
two 9
Name: Utah, dtype: int64
data.iloc[2]
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
data.iloc[[1, 2], [3, 0, 1]]
          four  one  two
Colorado     7    0    5
Utah        11    8    9
data.loc[:'Utah', 'two']
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int64
data.iloc[:, :3][data.three > 5]
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
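The last example chains two indexing operations. As an alternative sketch, a boolean mask and a list of column labels can be passed to loc together, which selects rows and columns in one step. The name table is an assumption; it is a fresh copy of the original 4x4 data, without the data[data < 5] = 0 assignment applied.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

# Rows where 'three' exceeds 5, restricted to three columns, in a single loc call.
subset = table.loc[table['three'] > 5, ['one', 'two', 'three']]

print(subset)
```

A single loc call is also the safer form when you intend to assign to the selection, since assigning through chained indexing may operate on a temporary copy.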
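With an integer index, pandas does not fall back to position-based negative indexing: ser[-1] is treated as a label lookup and raises a KeyError because no label -1 exists. With a non-integer index such as ser2 below there is no such ambiguity: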
ser = pd.Series(np.arange(3.))
ser
0 0.0
1 1.0
2 2.0
dtype: float64
ser[-1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-67-7ed1c232b3a2> in <module>()
----> 1 ser[-1]
~/.conda/envs/python36/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
~/.conda/envs/python36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
2475 try:
2476 return self._engine.get_value(s, k,
-> 2477 tz=getattr(series.dtype, 'tz', None))
2478 except KeyError as e1:
2479 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4404)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4087)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:14031)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13975)()
KeyError: -1
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2
a 0.0
b 1.0
c 2.0
dtype: float64
ser2[-1]
2.0
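To avoid the ambiguity altogether, it can help to always state whether you mean a label or a position. The sketch below uses an illustrative Series named int_ser. Newer pandas releases have deprecated position-based lookup through plain [] on a label-indexed Series, so iloc is the more future-proof spelling of ser2[-1].

```python
import numpy as np
import pandas as pd

int_ser = pd.Series(np.arange(3.))

# iloc is always positional, so negative positions work as in a Python list.
last_value = int_ser.iloc[-1]

# loc is always label-based; 0 here is the label, not the position.
first_value = int_ser.loc[0]

# The label -1 does not exist, so a label lookup fails.
try:
    int_ser.loc[-1]
    label_found = True
except KeyError:
    label_found = False

print(last_value, first_value, label_found)
```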
Arithmetic methods with fill values
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
df2
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df1 + df2
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
df1.add(df2, fill_value=0)
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
1 / df1
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
df1.rdiv(1)
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
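Flexible arithmetic methods (each r-prefixed method has its arguments reversed):
Method | Description
add, radd | Addition (+)
sub, rsub | Subtraction (-)
div, rdiv | Division (/)
floordiv, rfloordiv | Floor division (//)
mul, rmul | Multiplication (*)
pow, rpow | Exponentiation (**)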
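Two details of these methods are worth checking by hand: what the r-prefixed variants compute, and what fill_value does when both operands are missing. The following is a small sketch on made-up data; the names small, x, y, and total are assumptions for illustration.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'a': [1.0, 2.0], 'b': [4.0, 8.0]})

# r-prefixed methods flip the operands: small.rsub(10) computes 10 - small.
rsub_matches = small.rsub(10).equals(10 - small)
rdiv_matches = small.rdiv(1).equals(1 / small)

# fill_value substitutes for a value missing on one side only;
# where both sides are missing, the result is still NaN.
x = pd.Series([1.0, np.nan, np.nan], index=list('abc'))
y = pd.Series([10.0, 20.0, np.nan], index=list('abc'))
total = x.add(y, fill_value=0)

print(rsub_matches, rdiv_matches)
print(total)
```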
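Operations between DataFrame and Series. As a motivating example, consider the difference between a two-dimensional NumPy array and one of its rows: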
arr = np.arange(12.).reshape((3, 4))
arr
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
array([ 0., 1., 2., 3.])
arr - arr[0]
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
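The subtraction is performed once for each row. This is called broadcasting. Operations between a DataFrame and a Series are similar: by default the Series index is matched against the DataFrame columns and the operation is broadcast down the rows. Labels found in only one of the two objects are reindexed to form the union and filled with NaN: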
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
series
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame - series
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
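The examples above match the Series against the columns and broadcast down the rows. To broadcast across the columns instead, matching on the row labels, one of the arithmetic methods can be used with an axis argument. A minimal sketch, with grid and column_d as illustrative names:

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                    columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
column_d = grid['d']

# axis='index' matches column_d against the row labels and
# subtracts it from every column.
diff = grid.sub(column_d, axis='index')

print(diff)
```

Each column of diff is the original column minus column d, so the d column becomes all zeros.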
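4. Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work on pandas objects. Another frequent operation is applying a function to each column or row with apply; passing axis='columns' invokes the function once per row instead: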
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
               b         d         e
Utah   -0.894119 -1.707091  0.843595
Ohio   -0.413837 -0.251321  0.440044
Texas   0.871326  0.828606 -0.521924
Oregon  0.603869  0.154679  2.067279
np.abs(frame)
               b         d         e
Utah    0.894119  1.707091  0.843595
Ohio    0.413837  0.251321  0.440044
Texas   0.871326  0.828606  0.521924
Oregon  0.603869  0.154679  2.067279
f = lambda x: x.max() - x.min()
frame.apply(f)
b 1.765445
d 2.535697
e 2.589203
dtype: float64
frame.apply(f, axis='columns')
Utah 2.550685
Ohio 0.853881
Texas 1.393250
Oregon 1.912600
dtype: float64
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
            b         d         e
min -0.894119 -1.707091 -0.521924
max  0.871326  0.828606  2.067279
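Element-wise Python functions can be used too. Suppose you want to compute a formatted string from each floating-point value in frame. You can do this with applymap (deprecated since pandas 2.1 in favor of DataFrame.map, which accepts the same function):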
format = lambda x: '%.2f' % x
frame.applymap(format)
            b      d      e
Utah    -0.89  -1.71   0.84
Ohio    -0.41  -0.25   0.44
Texas    0.87   0.83  -0.52
Oregon   0.60   0.15   2.07
The method is called applymap because Series has its own map method for applying an element-wise function:
frame['e'].map(format)
Utah 0.84
Ohio 0.44
Texas -0.52
Oregon 2.07
Name: e, dtype: object
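Because applymap has been renamed in recent pandas releases, code that must run on both old and new versions can pick whichever element-wise method is available. This is only a sketch under that assumption; the names values, two_decimals, and elementwise are illustrative, and the helper is a named function so the built-in format is not shadowed.

```python
import pandas as pd

values = pd.DataFrame([[1.234, -5.678], [0.5, 2.0]], columns=['x', 'y'])

def two_decimals(x):
    """Format a single number with two decimal places."""
    return '%.2f' % x

# DataFrame.map is available from pandas 2.1; fall back to applymap on older versions.
elementwise = values.map if hasattr(values, 'map') else values.applymap
formatted = elementwise(two_decimals)

# Series.map works the same way on a single column in every version.
formatted_x = values['x'].map(two_decimals)

print(formatted)
print(formatted_x)
```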