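Basic pandas Operations (Part 2)

II. Basic Functionality

1. Reindexing

reindex is an important method on pandas objects. It creates a new object whose data conforms to a new index; any label that was not present in the original index is introduced with a missing value (NaN):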

import numpy as np
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

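For ordered data such as time series, you may want to interpolate or fill values when reindexing. The optional method argument allows this; for example, method='ffill' forward-fills, carrying the last known value forward: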

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
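
As a complement to the forward fill above, here is a minimal sketch of backward filling and of capping the fill with limit. The variable names (colors, sparse, backfilled, limited) are made up for illustration.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Backward fill: each new label takes the next known value.
# Label 5 has no later value, so it stays missing.
backfilled = colors.reindex(range(6), method='bfill')

# Forward fill, but fill at most one consecutive missing label
sparse = pd.Series(['blue', 'yellow'], index=[0, 4])
limited = sparse.reindex(range(6), method='ffill', limit=1)

print(backfilled)
print(limited)
```

Note that method-based filling only works when the original index is monotonically increasing or decreasing.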

In a DataFrame, reindex can alter the row index, the column index, or both. When passed only a sequence, it reindexes the rows:


frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

	Ohio	Texas	California
a	0	1	2
c	3	4	5
d	6	7	8

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

	Ohio	Texas	California
a	0.0	1.0	2.0
b	NaN	NaN	NaN
c	3.0	4.0	5.0
d	6.0	7.0	8.0

# Columns can be reindexed with the columns keyword
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

	Texas	Utah	California
a	1	NaN	2
c	4	NaN	5
d	7	NaN	8

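Both axes can be reindexed in a single call. Label indexing with loc is more concise, but in pandas 1.0 and later it raises a KeyError when any requested label is missing (such as 'b' or 'Utah' here), so loc only works when every label already exists:

frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)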

	Texas	Utah	California
a	1.0	NaN	2.0
b	NaN	NaN	NaN
c	4.0	NaN	5.0
d	7.0	NaN	8.0
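
The difference between reindex and loc when a label is missing can be sketched as follows. This assumes pandas 1.0 or later, where loc is strict about missing labels; loc_raised is just an illustrative flag.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# reindex tolerates labels that do not exist yet and fills them with NaN
expanded = frame.reindex(index=['a', 'b', 'c', 'd'])

# loc is strict: the missing label 'b' raises KeyError
try:
    frame.loc[['a', 'b', 'c', 'd']]
    loc_raised = False
except KeyError:
    loc_raised = True

print(expanded)
print('loc raised KeyError:', loc_raised)
```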

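Parameters of the reindex method:

Parameter	Description
index	New sequence to use as the index
method	Fill (interpolation) method: ffill fills forward, bfill fills backward
fill_value	Value to substitute for missing data introduced by reindexing
limit	When forward- or backfilling, the maximum gap size (in number of elements) to fill
tolerance	When forward- or backfilling, the maximum gap size (in absolute numeric distance) to fill for inexact matches
level	Match a simple Index on a level of a MultiIndex
copy	If True, copy the underlying data even if the new index is equivalent to the old one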
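
Two of the parameters in the table, fill_value and tolerance, can be illustrated with a short sketch. The readings Series and its index values are invented for the example.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# fill_value replaces the NaN that would otherwise appear for new labels
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

# tolerance bounds how far (in index units) an inexact match may be;
# label 5 is more than 2 away from both 0 and 10, so it stays NaN
readings = pd.Series([1.0, 2.0], index=[0, 10])
near = readings.reindex([1, 5, 10], method='nearest', tolerance=2)

print(filled)
print(near)
```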

2. Dropping Entries from an Axis

The drop method returns a new object with the indicated value or values deleted from an axis:

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Calling drop with a sequence of labels drops values from the row labels (axis 0):

data.drop(['Colorado', 'Ohio'])

	one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15

You can drop values from the columns by passing axis=1 or axis='columns':

data.drop('two', axis=1)

	one	three	four
Ohio	0	2	3
Colorado	4	6	7
Utah	8	10	11
New York	12	14	15

data.drop(['two', 'four'], axis='columns')

	one	three
Ohio	0	2
Colorado	4	6
Utah	8	10
New York	12	14

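Many methods that change the size or shape of a Series or DataFrame, drop among them, return a new object by default. Passing inplace=True makes them modify the original object in place instead and return None. Use it with care, because the dropped data is destroyed: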

obj.drop('c', inplace=True)

obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
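
The contrast between the default behavior and inplace=True can be checked with a small sketch. The label 'zzz' is a deliberately nonexistent label used to show errors='ignore'.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: a new object is returned and the original is untouched
trimmed = obj.drop('c')
print('c' in obj.index)        # the original still has 'c'

# errors='ignore' skips labels that are not present instead of raising
safe = obj.drop(['c', 'zzz'], errors='ignore')

# inplace=True mutates obj and returns None
result = obj.drop('c', inplace=True)
print(result, list(obj.index))
```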

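3. Indexing, Selection, and Filtering

Series indexing works much like NumPy array indexing, except that you can use the index labels of the Series instead of only integers. Slicing with labels also differs from normal Python slicing: the endpoint is inclusive: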

# Restore the entry dropped above, then put the labels back in order
obj['c'] = 2.0
obj = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

obj['b':'c']

b    1.0
c    2.0
dtype: float64

obj['b':'c'] = 5

obj

a    0.0
b    5.0
c    5.0
d    3.0
e    4.0
dtype: float64
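
To make the endpoint rule concrete, this sketch compares a label slice with a positional slice over the same Series. The names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

by_label = obj.loc['b':'c']    # label slice: endpoint 'c' is included
by_position = obj.iloc[1:2]    # positional slice: endpoint 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```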

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

data[['three', 'one']]

	three	one
Ohio	2	0
Colorado	6	4
Utah	10	8
New York	14	12

Indexing like this has a few special cases. First, slicing or selecting with a Boolean array selects rows rather than columns:

data[:2]

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7

data[data['three'] > 5]

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data < 5

	one	two	three	four
Ohio	True	True	True	True
Colorado	True	False	False	False
Utah	False	False	False	False
New York	False	False	False	False

data[data < 5] = 0

data

	one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15
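
A Boolean mask can also be combined with loc to restrict the columns at the same time. The following is a small sketch on a fresh copy of data; the value -1 is arbitrary.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

mask = data['three'] > 5

# Rows where the mask is True, limited to two columns
subset = data.loc[mask, ['one', 'four']]

# Assign only to column 'two' in the matching rows
data.loc[mask, 'two'] = -1

print(subset)
print(data)
```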

Selecting data with loc (axis labels) and iloc (integer positions):

data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

data.iloc[[1, 2], [3, 0, 1]]

	four	one	two
Colorado	7	0	5
Utah	11	8	9

data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

data.iloc[:, :3][data.three > 5]

	one	two	three
Colorado	0	5	6
Utah	8	9	10
New York	12	13	14
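
The same selection can usually be written with either indexer. This sketch, on a fresh copy of data, checks that a label-based and a position-based lookup agree, and that the chained expression above can be written as a single loc call.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# One row and three columns, by label and by position
by_label = data.loc['Utah', ['four', 'one', 'two']]
by_position = data.iloc[2, [3, 0, 1]]
print(by_label.equals(by_position))

# Row filter plus column slice in one step instead of chaining
chained = data.iloc[:, :3][data['three'] > 5]
single = data.loc[data['three'] > 5, 'one':'three']
print(chained.equals(single))
```

Writing the filter as a single loc call avoids creating an intermediate object, which also matters when you later assign through the selection.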

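With an integer index, pandas treats integers as labels, so a negative position such as -1 is looked up as a label and fails: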

ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

ser[-1]

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-67-7ed1c232b3a2> in <module>()
----> 1 ser[-1]


~/.conda/envs/python36/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    599         key = com._apply_if_callable(key, self)
    600         try:
--> 601             result = self.index.get_value(self, key)
    602 
    603             if not is_scalar(result):


~/.conda/envs/python36/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   2475         try:
   2476             return self._engine.get_value(s, k,
-> 2477                                           tz=getattr(series.dtype, 'tz', None))
   2478         except KeyError as e1:
   2479             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:


pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4404)()


pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4087)()


pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()


pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:14031)()


pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13975)()


KeyError: -1

ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

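# With a non-integer index there is no ambiguity, so ser2[-1] has historically
# been treated as a position. Recent pandas versions deprecate that fallback;
# iloc states the intent explicitly.
ser2.iloc[-1]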

2.0
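
To avoid the ambiguity altogether, loc and iloc can be used explicitly. A minimal sketch with the integer-indexed Series from above (label_missing is an illustrative flag):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

last = ser.iloc[-1]     # position-based: the last element
first = ser.loc[0]      # label-based: the element labeled 0

# -1 is not a label in the index, so loc raises KeyError
try:
    ser.loc[-1]
    label_missing = False
except KeyError:
    label_missing = True

print(last, first, label_missing)
```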

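Arithmetic methods with fill values: in arithmetic between differently indexed objects, a label that is missing from one object yields NaN unless you pass fill_value: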

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1

	a	b	c	d
0	0.0	1.0	2.0	3.0
1	4.0	5.0	6.0	7.0
2	8.0	9.0	10.0	11.0

df2

	a	b	c	d	e
0	0.0	1.0	2.0	3.0	4.0
1	5.0	NaN	7.0	8.0	9.0
2	10.0	11.0	12.0	13.0	14.0
3	15.0	16.0	17.0	18.0	19.0

df1 + df2

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	NaN
1	9.0	NaN	13.0	15.0	NaN
2	18.0	20.0	22.0	24.0	NaN
3	NaN	NaN	NaN	NaN	NaN

df1.add(df2, fill_value=0)

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	5.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0

1 / df1

	a	b	c	d
0	inf	1.000000	0.500000	0.333333
1	0.250000	0.200000	0.166667	0.142857
2	0.125000	0.111111	0.100000	0.090909

df1.rdiv(1)

	a	b	c	d
0	inf	1.000000	0.500000	0.333333
1	0.250000	0.200000	0.166667	0.142857
2	0.125000	0.111111	0.100000	0.090909
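
The effect of fill_value is easiest to see on two small Series with partially overlapping indexes. The Series s1 and s2 are invented for this sketch.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

# Plain operator: labels present on only one side become NaN
plain = s1 + s2

# Flexible method: the side that lacks a label is treated as 0
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```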

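Flexible arithmetic methods:

Method	Description
add, radd	Addition (+)
sub, rsub	Subtraction (-)
div, rdiv	Division (/)
floordiv, rfloordiv	Floor division (//)
mul, rmul	Multiplication (*)
pow, rpow	Exponentiation (**)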

Operations between DataFrame and Series:

arr = np.arange(12.).reshape((3, 4))
arr

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

arr[0]

array([ 0.,  1.,  2.,  3.])

arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

This is called broadcasting: the subtraction is performed once for each row.

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

frame - series

	b	d	e
Utah	0.0	0.0	0.0
Ohio	3.0	3.0	3.0
Texas	6.0	6.0	6.0
Oregon	9.0	9.0	9.0

series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2

	b	d	e	f
Utah	0.0	NaN	3.0	NaN
Ohio	3.0	NaN	6.0	NaN
Texas	6.0	NaN	9.0	NaN
Oregon	9.0	NaN	12.0	NaN
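
By default, arithmetic between a DataFrame and a Series matches the Series index against the columns and broadcasts down the rows. To match on the row index and broadcast across the columns instead, one option is the flexible method with axis='index', sketched here (series3 is an illustrative name):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# A column of the frame, indexed by the row labels
series3 = frame['d']

# Subtract column 'd' from every column, aligning on the row index
result = frame.sub(series3, axis='index')
print(result)
```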

4. Function Application and Mapping

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

	b	d	e
Utah	-0.894119	-1.707091	0.843595
Ohio	-0.413837	-0.251321	0.440044
Texas	0.871326	0.828606	-0.521924
Oregon	0.603869	0.154679	2.067279

np.abs(frame)

	b	d	e
Utah	0.894119	1.707091	0.843595
Ohio	0.413837	0.251321	0.440044
Texas	0.871326	0.828606	0.521924
Oregon	0.603869	0.154679	2.067279

f = lambda x: x.max() - x.min()
frame.apply(f)

b    1.765445
d    2.535697
e    2.589203
dtype: float64

frame.apply(f, axis='columns')

Utah      2.550685
Ohio      0.853881
Texas     1.393250
Oregon    1.912600
dtype: float64

def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

	b	d	e
min	-0.894119	-1.707091	-0.521924
max	0.871326	0.828606	2.067279
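
Many common reductions do not need apply, because DataFrame already provides them as methods. This sketch compares the apply-based results above with built-in equivalents; the seeded random generator is an assumption made only so the example is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Range of each column: apply with a lambda vs. built-in reductions
spread_apply = frame.apply(lambda x: x.max() - x.min())
spread_builtin = frame.max() - frame.min()
print(np.allclose(spread_apply, spread_builtin))

# Several reductions at once, one row per statistic
summary = frame.agg(['min', 'max'])
print(summary)
```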

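Element-wise Python functions can be used too. Suppose you want to compute a formatted string from each floating-point value in frame. You can do this with the applymap method (deprecated since pandas 2.1 in favor of DataFrame.map, which has the same element-wise behavior):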

format = lambda x: '%.2f' % x
frame.applymap(format)

	b	d	e
Utah	-0.89	-1.71	0.84
Ohio	-0.41	-0.25	0.44
Texas	0.87	0.83	-0.52
Oregon	0.60	0.15	2.07

Series has its own map method for applying an element-wise function:

frame['e'].map(format)

Utah       0.84
Ohio       0.44
Texas     -0.52
Oregon     2.07
Name: e, dtype: object
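
Because the element-wise DataFrame method was renamed across pandas versions, a version-tolerant sketch might look like the following. The helper name two_decimals and the sample values are made up for illustration; the code uses DataFrame.map when it exists and falls back to applymap otherwise.

```python
import pandas as pd

frame = pd.DataFrame([[1.234, -5.678], [9.1011, 0.5]], columns=['b', 'd'])

def two_decimals(x):
    """Format a number with two digits after the decimal point."""
    return f'{x:.2f}'

# DataFrame.map is available in pandas 2.1+; older versions use applymap
if hasattr(pd.DataFrame, 'map'):
    formatted = frame.map(two_decimals)
else:
    formatted = frame.applymap(two_decimals)

# Series.map works the same way in all versions
column = frame['b'].map(two_decimals)

print(formatted)
print(column)
```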

胖虎卖汤圆
