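pandas Basic Operations (Part 2)

II. Essential Functionality

1. Reindexing

reindex is an important method on pandas objects. It creates a new object whose data conform to a new index, introducing missing values (NaN) for any labels that were not present before: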

import numpy as np
import pandas as pd
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

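For ordered data such as time series, you may want to interpolate or fill values when reindexing. The optional method argument does this. For example, method='ffill' forward-fills, carrying the last valid value forward: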

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
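
As a complement to the forward-fill example above, the following sketch shows two other ways to fill the gaps created by reindexing: method='bfill' and fill_value. The Series name colors and the placeholder string 'missing' are chosen here purely for illustration.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Backward fill: each new label takes the next valid value after it.
# Label 5 has no later value, so it stays missing.
backfilled = colors.reindex(range(6), method='bfill')

# fill_value substitutes a constant instead of propagating neighbours.
constant = colors.reindex(range(6), fill_value='missing')

print(backfilled.tolist())
print(constant.tolist())
```

Note that the method argument requires the original index to be sorted (monotonic), whereas fill_value works with any index.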

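With a DataFrame, reindex can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows. The columns are reindexed with the columns keyword:
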
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

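Older pandas versions also allowed reindexing through loc with a list of labels. Since pandas 1.0, loc raises a KeyError when any requested label is missing (here 'b' and 'Utah'), so to reindex rows and columns together, pass both to reindex:

frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0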

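Arguments of the reindex method:

Argument     Description
index        New sequence to use as the index
method       Fill method: 'ffill' fills forward, 'bfill' fills backward
fill_value   Value to substitute for missing data introduced by reindexing
limit        When filling forward or backward, the maximum number of consecutive elements to fill
tolerance    When filling forward or backward, the maximum distance allowed between an inexact match and the requested label
level        Match a simple Index on a level of a MultiIndex
copy         If True, always copy the underlying data, even if the new index is equivalent to the old one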
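
The limit and tolerance arguments listed above are easiest to understand on a small example. The sketch below uses a made-up Series named readings with values at labels 0 and 10. The data and names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical readings taken at labels 0 and 10.
readings = pd.Series([1.0, 2.0], index=[0, 10])

# limit=2: forward-fill at most two consecutive new labels after a match.
limited = readings.reindex(range(6), method='ffill', limit=2)

# tolerance=1: accept the nearest original label only if it is within 1.
nearby = readings.reindex([1, 4, 9], method='nearest', tolerance=1)

print(limited.tolist())  # [1.0, 1.0, 1.0, nan, nan, nan]
print(nearby.tolist())   # [1.0, nan, 2.0]
```

Labels 3 to 5 stay NaN because the fill budget of two is used up, and label 4 stays NaN because its nearest neighbour (0) is four units away.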

2. Dropping Entries from an Axis

The drop method returns a new object with the indicated value or values removed from an axis:

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
new_obj = obj.drop('c')
new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
obj.drop(['d', 'c'])
a    0.0
b    1.0
e    4.0
dtype: float64
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Calling drop with a sequence of labels drops values by row label (axis 0):

data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

You can drop values from the columns by passing axis=1 or axis='columns':

data.drop('two', axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
data.drop(['two', 'four'], axis='columns')
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

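Many functions that change the size or shape of a Series or DataFrame, such as drop, return a new object by default. Passing inplace=True makes them modify the original object instead of returning a new one. Be careful with it, because the dropped data are destroyed: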

obj.drop('c', inplace=True)
obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
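
The following sketch contrasts the default copy-returning behaviour of drop with inplace=True, and shows the index= and columns= keywords as an alternative to axis. The label 'Texas' is deliberately absent from the data to illustrate errors='ignore'.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# index= and columns= name the axis explicitly, so no axis argument is needed.
trimmed = data.drop(index=['Ohio'], columns=['two', 'four'])

# errors='ignore' skips labels that do not exist instead of raising KeyError.
unchanged = data.drop(index=['Texas'], errors='ignore')

# Both calls above returned new objects; the original still has 4 x 4 entries.
print(data.shape)     # (4, 4)
print(trimmed.shape)  # (3, 2)

# With inplace=True the original object itself is modified.
data.drop(columns='four', inplace=True)
print(list(data.columns))  # ['one', 'two', 'three']
```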

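3. Indexing, Selection, and Filtering

Series indexing works much like NumPy array indexing, except that you can use the index labels instead of only integers. Slicing with labels also behaves differently from normal Python slicing: the endpoint is included: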

obj['c'] = 2.0
obj = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
obj['b':'c']
b    1.0
c    2.0
dtype: float64
obj['b':'c'] = 5
obj
a    0.0
b    5.0
c    5.0
d    3.0
e    4.0
dtype: float64
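
To make the endpoint rule concrete, here is a minimal sketch comparing a label-based slice with a position-based slice on the same Series. It uses loc and iloc to state the intent explicitly.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

by_label = obj.loc['b':'d']    # label slice: the endpoint 'd' is included
by_position = obj.iloc[1:3]    # positional slice: position 3 is excluded

print(list(by_label.index))     # ['b', 'c', 'd']
print(list(by_position.index))  # ['b', 'c']
```
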
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
data[['three', 'one']]
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

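Indexing like this has a few special cases. Rows can be selected with a slice or with a boolean array, and a boolean DataFrame (such as the result of a scalar comparison) can be used for selection or assignment: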

data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three'] > 5]
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data < 5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data < 5] = 0
data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
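
Boolean selection is often used with more than one condition. The sketch below rebuilds the original data (before the assignment above) and combines conditions with & and with isin. The particular thresholds are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Each comparison needs its own parentheses because & binds tighter than > and <.
mask = (data['three'] > 5) & (data['one'] < 12)
selected = data[mask]

# isin builds a boolean mask from a list of accepted values.
either = data[data['two'].isin([1, 13])]

print(list(selected.index))  # ['Colorado', 'Utah']
print(list(either.index))    # ['Ohio', 'New York']
```

Use & (and), | (or) and ~ (not) for element-wise logic. The Python keywords and, or and not do not work on boolean arrays.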

Use loc to select rows and columns by axis label, and iloc to select them by integer position:

data.loc['Colorado', ['two', 'three']]
two      5
three    6
Name: Colorado, dtype: int64
data.iloc[2, [3, 0, 1]]
four    11
one      8
two      9
Name: Utah, dtype: int64
data.iloc[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
data.iloc[[1, 2], [3, 0, 1]]
          four  one  two
Colorado     7    0    5
Utah        11    8    9
data.loc[:'Utah', 'two']
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
data.iloc[:, :3][data.three > 5]
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
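
The last example chains two indexing steps. As a sketch of an alternative, a single loc call can take the boolean mask for the rows and a list of labels for the columns at once. The data are rebuilt here with their original values so the block runs on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Two steps: take the first three columns, then filter the rows.
chained = data.iloc[:, :3][data.three > 5]

# One step: row mask and column labels in the same loc call.
single = data.loc[data['three'] > 5, ['one', 'two', 'three']]

print(chained.equals(single))  # True
```

The single-step form is also the one to prefer for assignment, since assigning through a chained selection may act on a temporary copy.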

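With an integer index, negative indices do not work inside []: pandas treats integers as labels, so ser[-1] below raises a KeyError because -1 is not one of the labels 0, 1, 2. With a non-integer index there is no such ambiguity, and older pandas versions fall back to positional lookup for ser2[-1]. Recent pandas releases deprecate this fallback, so prefer ser2.iloc[-1]: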

ser = pd.Series(np.arange(3.))
ser
0    0.0
1    1.0
2    2.0
dtype: float64
ser[-1] 
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-67-7ed1c232b3a2> in <module>()
----> 1 ser[-1]
...
KeyError: -1
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2
a    0.0
b    1.0
c    2.0
dtype: float64
ser2[-1]
2.0
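
To avoid the ambiguity altogether, the following sketch uses iloc for positions and loc for labels. Both behave the same way regardless of the index type.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                           # integer labels 0, 1, 2
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])   # string labels

# iloc is always positional, so -1 means "last element" for both objects.
print(ser.iloc[-1])   # 2.0
print(ser2.iloc[-1])  # 2.0

# loc is always label-based, so -1 is looked up as a label and is not found.
try:
    ser.loc[-1]
    label_found = True
except KeyError:
    label_found = False
print(label_found)    # False
```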

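Arithmetic methods with fill values

When objects with different indexes are combined, the result contains NaN wherever the labels do not overlap. The fill_value argument of methods such as add substitutes a value for the missing side before computing: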

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
df2
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df1 + df2
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
df1.add(df2, fill_value=0)
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
1 / df1
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
df1.rdiv(1)
          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909
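
One detail of fill_value is worth a small sketch: it fills a value that is missing on one side only. If a location is missing in both operands, the result stays NaN. The Series names left and right and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, np.nan], index=['x', 'y'])
right = pd.Series([10.0, np.nan, 30.0], index=['x', 'y', 'z'])

total = left.add(right, fill_value=0)

print(total['x'])            # 11.0 (present on both sides)
print(pd.isna(total['y']))   # True (missing on both sides, so not filled)
print(total['z'])            # 30.0 (missing on the left only, filled with 0)
```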

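Flexible arithmetic methods:

Method               Description
add, radd            Addition (+)
sub, rsub            Subtraction (-)
div, rdiv            Division (/)
floordiv, rfloordiv  Floor division (//)
mul, rmul            Multiplication (*)
pow, rpow            Exponentiation (**)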

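Operations between a DataFrame and a Series work much like arithmetic between NumPy arrays of different dimensions. Consider first the difference between a two-dimensional array and one of its rows: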

arr = np.arange(12.).reshape((3, 4))
arr
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
arr[0]
array([ 0.,  1.,  2.,  3.])
arr - arr[0]
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

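This is broadcasting: subtracting arr[0] from arr performs the subtraction once for each row. Operations between a DataFrame and a Series are similar. By default, the index of the Series is matched against the columns of the DataFrame and the operation is broadcast down the rows. Labels found in only one of the two objects produce NaN columns: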

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
frame - series
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
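
The examples above match the Series against the columns. To broadcast in the other direction, matching on the row labels, one option is an arithmetic method with an explicit axis. The sketch below subtracts column 'd' from every column. The name series3 is introduced here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

series3 = frame['d']  # indexed by the row labels

# axis='index' aligns series3 with the rows and broadcasts across the columns.
result = frame.sub(series3, axis='index')

print(result['b'].tolist())  # [-1.0, -1.0, -1.0, -1.0]
print(result['d'].tolist())  # [0.0, 0.0, 0.0, 0.0]
print(result['e'].tolist())  # [1.0, 1.0, 1.0, 1.0]
```

The axis argument names the axis to match on. With the plain - operator, pandas would instead try to match the row labels of series3 against the column names.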

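4. Function Application and Mapping

NumPy ufuncs (element-wise array functions) also work with pandas objects. The apply method invokes a function on each column or, with axis='columns', on each row: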

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
               b         d         e
Utah   -0.894119 -1.707091  0.843595
Ohio   -0.413837 -0.251321  0.440044
Texas   0.871326  0.828606 -0.521924
Oregon  0.603869  0.154679  2.067279
np.abs(frame)
               b         d         e
Utah    0.894119  1.707091  0.843595
Ohio    0.413837  0.251321  0.440044
Texas   0.871326  0.828606  0.521924
Oregon  0.603869  0.154679  2.067279
f = lambda x: x.max() - x.min()
frame.apply(f)
b    1.765445
d    2.535697
e    2.589203
dtype: float64
frame.apply(f, axis='columns')
Utah      2.550685
Ohio      0.853881
Texas     1.393250
Oregon    1.912600
dtype: float64
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
            b         d         e
min -0.894119 -1.707091 -0.521924
max  0.871326  0.828606  2.067279
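
Many common reductions are already DataFrame methods, so apply is not always needed. As a sketch, the range computed with apply above can also be written with max and min directly. A deterministic frame is used here instead of random numbers so the results are reproducible.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Range of each column, computed with apply ...
spread = frame.apply(lambda col: col.max() - col.min())
# ... and with the built-in reductions.
vectorized = frame.max() - frame.min()
print(spread.equals(vectorized))  # True

# The same per row: axis='columns' (or axis=1) reduces across the columns.
row_spread = frame.apply(lambda row: row.max() - row.min(), axis='columns')
row_vectorized = frame.max(axis=1) - frame.min(axis=1)
print(row_spread.equals(row_vectorized))  # True
```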

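Element-wise Python functions can be used too. Suppose you want to compute a formatted string from each floating-point value in frame. The applymap method does this (pandas 2.1 renamed it to DataFrame.map and deprecated applymap):

# named fmt rather than format to avoid shadowing the built-in format()
fmt = lambda x: '%.2f' % x
frame.applymap(fmt)
            b      d      e
Utah    -0.89  -1.71   0.84
Ohio    -0.41  -0.25   0.44
Texas    0.87   0.83  -0.52
Oregon   0.60   0.15   2.07

Series has its own map method for applying an element-wise function:

frame['e'].map(fmt)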
Utah       0.84
Ohio       0.44
Texas     -0.52
Oregon     2.07
Name: e, dtype: object
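
Because the element-wise DataFrame method goes by different names in different pandas versions (applymap in older releases, DataFrame.map from pandas 2.1), a version-tolerant sketch can check which one is available. The helper two_decimals and the small frame are illustrative assumptions.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.234, -5.678], 'd': [0.5, 2.0]},
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format one float with two decimal places."""
    return f'{x:.2f}'

# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions.
if hasattr(pd.DataFrame, 'map'):
    formatted = frame.map(two_decimals)
else:
    formatted = frame.applymap(two_decimals)

print(formatted.loc['Ohio', 'b'])  # -5.68
print(formatted.loc['Utah', 'd'])  # 0.50
```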

Reposted from blog.csdn.net/qq_44315987/article/details/104079821