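Getting Started with pandas: Essential Functionality

1 Essential Functionality

This section covers the fundamental mechanics of interacting with the data contained in a Series or DataFrame.

1.1 Reindexing

An important method on pandas objects is reindex, which creates a new object with the data conformed to a new index. Calling reindex on a Series rearranges the data according to the new index, introducing missing values for any index values that were not already present: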

import numpy as np  # needed for np.arange and np.nan in later examples
import pandas as pd
obj = pd.Series([4.5,7.2,-5.3,3.6],index = ['d','b','a','c'])
print(obj)
---------------
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj.reindex(['a','b','c','d','e'])
print(obj2)
---------------
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, you may want to interpolate or fill values when reindexing. The optional method argument does this; for example, 'ffill' forward-fills the values:

obj3 = pd.Series(['blue','purple','yellow'],index = [0,2,4])
print(obj3)
---------------
0      blue
2    purple
4    yellow
dtype: object
print(obj3.reindex(range(6),method = 'ffill'))
---------------
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

With a DataFrame, reindex can alter the row index, the columns, or both. When passed only a sequence, it reindexes the rows in the result:

frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns = ['Ohio','Texas','California'])
print(frame)
--------------------------
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
frame2 = frame.reindex(['a','b','c','d'])
print(frame2)
--------------------------
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

The columns can be reindexed with the columns keyword:

states = ['Texas','Utah','California']
print(frame.reindex(columns = states))
--------------------------
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

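Older pandas versions also allowed reindexing more succinctly by label-indexing with loc. Since pandas 1.0, loc raises a KeyError when any requested label is missing, so pass both index and columns to reindex instead:

print(frame.reindex(index = ['a','b','c','d'],columns = states))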
--------------------------
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0

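Table 1-1: reindex function arguments

Argument	Description
index	New sequence to use as the index; can be an Index instance or any other sequence-like Python data structure. An Index is used exactly as is, without copying
method	Interpolation (fill) method: 'ffill' fills forward, 'bfill' fills backward
fill_value	Substitute value to use when reindexing introduces missing data
limit	When forward- or backward-filling, the maximum size gap (in number of elements) to fill
copy	If True, always copy the underlying data, even if the new index is equivalent to the old one; if False, do not copy the data when the indexes are equivalent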
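
The fill_value, method and limit arguments from the table can be compared side by side. The following is a minimal sketch; the Series names (colors, wide) and the 'missing' placeholder are illustrative choices, not part of the original examples.

```python
import pandas as pd

# Illustrative data: a Series with gaps in its integer labels.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# fill_value replaces the missing values that reindexing would introduce.
filled = colors.reindex(range(6), fill_value='missing')

# 'bfill' pulls the next valid value backward; label 5 has no later value,
# so it stays missing.
backward = colors.reindex(range(6), method='bfill')

# limit caps how many consecutive new labels a single fill may cover.
wide = pd.Series([1.0, 2.0], index=[0, 4])
limited = wide.reindex(range(5), method='ffill', limit=1)

print(filled)
print(backward)
print(limited)
```

With limit = 1, only the label immediately after an existing one is filled; the remaining new labels keep NaN.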

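1.2 Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. Because building that list can require some data munging and set logic, the drop method does the work for you and returns a new object with the indicated value or values deleted from an axis: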

obj = pd.Series(np.arange(5),index = ['a','b','c','d','e'])
print(obj)
-------------
a    0
b    1
c    2
d    3
e    4
dtype: int32
print(obj.drop(['d','c']))
-------------
a    0
b    1
e    4
dtype: int32

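With a DataFrame, index values can be deleted from either axis. By default drop removes row labels; pass axis = 1 (or axis = 'columns') to remove columns: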

data = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
print(data)
-------------------------------
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
print(data.drop(['Colorado','Ohio']))
-------------------------------
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
print(data.drop('two',axis = 1))
-------------------------------
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

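Many functions, like drop, that modify the size or shape of a Series or DataFrame can manipulate the object in place, without returning a new object, when called with inplace = True. Be careful with inplace, because it destroys any data that is dropped: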

obj = pd.Series(np.arange(5),index = ['a','b','c','d','e'])
obj.drop('c',inplace = True)
print(obj)
------------
a    0
b    1
d    3
e    4
dtype: int32
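
The difference between the default behavior and inplace = True shows up in the return value. The following is a small sketch; the variable names trimmed and result are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

# Default: drop returns a new Series and leaves obj untouched.
trimmed = obj.drop('c')
print(len(obj), len(trimmed))

# inplace=True: obj itself is modified and the call returns None.
result = obj.drop('c', inplace=True)
print(result, len(obj))
```

Because the in-place call returns None, writing obj = obj.drop('c', inplace = True) would discard the data entirely.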

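1.3 Indexing, Selection, and Filtering

Ordinary Python slices exclude the endpoint. Slicing a Series with labels behaves differently: the endpoint is included: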

obj = pd.Series(np.arange(4),index = ['a','b','c','d'])
print(obj['b':'c'])
------------
b    1
c    2
dtype: int32
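
To make the contrast concrete, the sketch below takes the same Series and slices it once by label and once by position. The explicit loc and iloc accessors are used here only to make the intent unambiguous.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]    # positional slice: stops before position 2

# Assigning through a label slice also covers the end label.
obj.loc['b':'c'] = 5

print(by_label.index.tolist())
print(by_position.index.tolist())
print(obj.tolist())
```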

Indexing into a DataFrame with a single value or a sequence retrieves one or more columns:

data = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','Colorado','Utah','New York'],columns = ['one','two','three','four'])
print(data)
-------------------------------
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
print(data['two'])
----------------------------
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
print(data[['two','three']])
--------------------
          two  three
Ohio        1      2
Colorado    5      6
Utah        9     10
New York   13     14

You can also index with a Boolean DataFrame, such as one produced by a scalar comparison:

print(data < 5)
------------------------------------
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data < 5] = 0
print(data)
-------------------------------
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
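
Two related patterns are worth sketching here: selecting rows with a Boolean Series, and replacing values without mutating the original. The use of DataFrame.where as a non-mutating alternative is a suggestion of this sketch, not something the example above relies on.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# A Boolean Series selects the rows where the condition holds.
rows = data[data['three'] > 5]

# where keeps values that satisfy the condition and substitutes 0 elsewhere,
# returning a new DataFrame instead of modifying data.
masked = data.where(data >= 5, 0)

print(rows)
print(masked)
```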

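1.3.1 Selection with loc and iloc

For label-indexing on the rows of a DataFrame, this section introduces the special indexing operators loc and iloc. With loc, you select rows and columns by axis label: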

data = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','Colorado','Utah','New York'],columns = ['one','two','three','four'])
print(data)
-------------------------------
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
print(data.loc['Colorado',['two','three']])
----------------------------
two      5
three    6
Name: Colorado, dtype: int32

We can then perform similar selections by integer position with iloc:

print(data.iloc[2,[3,0,1]])
-------------------------
four    11
one      8
two      9
Name: Utah, dtype: int32

print(data.iloc[2])
-------------------------
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

print(data.iloc[[1,2],[3,0,1]])
------------------------
          four  one  two
Colorado     7    4    5
Utah        11    8    9

In addition to single labels or lists of labels, both indexing operators work with slices:

print(data.loc[:'Utah','two'])
-----------------------
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

print(data.iloc[:,:3][data.three > 5])
-------------------------
          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14

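Table 1-2: Indexing options with DataFrame

Type	Description
df.loc[val]	Selects a single row or a subset of rows by label
df.loc[:,val]	Selects a single column or a subset of columns by label
df.loc[val1,val2]	Selects both rows and columns by label
df.iloc[where]	Selects a single row or a subset of rows by integer position
df.iloc[:,where]	Selects a single column or a subset of columns by integer position
df.iloc[where_i,where_j]	Selects both rows and columns by integer position
df.at[label_i,label_j]	Selects a single scalar value by row and column label
df.iat[i,j]	Selects a single scalar value by row and column integer position
reindex	Selects rows or columns by label, introducing missing values for labels that are absent
get_value,set_value	Gets or sets a single value by row and column label (deprecated in pandas 0.21 and removed in 1.0; use at or iat instead)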
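
The scalar accessors at and iat from the table are not demonstrated elsewhere in this section, so here is a brief sketch using the same data layout as the earlier examples.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.at['Utah', 'three']   # scalar lookup by row and column label
by_position = data.iat[2, 2]          # the same cell, addressed by position

# at also supports assigning a single cell.
data.at['Utah', 'three'] = -1

print(by_label, by_position, data.loc['Utah', 'three'])
```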

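1.4 Integer Indexes

Working with pandas objects indexed by integers can be confusing, because an integer might mean either a label or a position. To keep things consistent, if an axis index contains integers, select data by label. For more precise handling, use loc (for labels) or iloc (for integer positions). In the example below, the plain slice ser[:1] is positional and excludes the endpoint, while ser.loc[:1] is label-based and includes it: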

ser = pd.Series(np.arange(3))
print(ser[:1])
------------
0    0
dtype: int32
print(ser.loc[:1])
------------
0    0
1    1
dtype: int32
print(ser.iloc[:1])
------------
0    0
dtype: int32
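
To see why the distinction matters, the sketch below uses a hypothetical Series whose integer labels are deliberately out of order, so that label 0 and position 0 refer to different elements.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions.
ser = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])

by_label = ser.loc[0]       # the element whose label is 0
by_position = ser.iloc[0]   # the first element, whatever its label

print(by_label, by_position)
```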

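1.5 Arithmetic and Data Alignment

The behavior of arithmetic between objects with different indexes is an important pandas feature for some applications. When you add objects together, if any index pairs are not the same, the index of the result is the union of the index pairs. For example: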

s1 = pd.Series([3.4,2.5,-1.5,2.6],index = ['a','c','d','e'])
s2 = pd.Series([-2.1,2.0,-1.8,5,0.9],index = ['a','c','e','f','g'])
print(s1)
--------------
a    3.4
c    2.5
d   -1.5
e    2.6
dtype: float64
print(s2)
--------------
a   -2.1
c    2.0
e   -1.8
f    5.0
g    0.9
dtype: float64
print(s1 + s2)
--------------
a    1.3
c    4.5
d    NaN
e    0.8
f    NaN
g    NaN
dtype: float64

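The internal data alignment introduces missing values at the label locations that do not overlap. Missing values then propagate in further arithmetic computations. In the case of a DataFrame, alignment is performed on both the rows and the columns: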

df1 = pd.DataFrame(np.arange(9).reshape((3,3)),columns = list('bcd'),index = ['Ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('bde'),index = ['Utah','Ohio','Texas','Oregon'])
print(df1)
-----------------
          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8
print(df2)
-----------------
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
print(df1 + df2)
---------------------------
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

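If you add DataFrame objects that have no column labels in common (or no row labels in common), the result contains only NaN. In the example below the row labels match but the column labels do not: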

df1 = pd.DataFrame({'A':[1,2]})
df2 = pd.DataFrame({'B':[3,4]})
print(df1)
----
   A
0  1
1  2
print(df2)
----
   B
0  3
1  4
print(df1 + df2)
---------
    A   B
0 NaN NaN
1 NaN NaN

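1.5.1 Arithmetic Methods with Fill Values

In arithmetic operations between differently indexed objects, you may want to fill with a special value, such as 0, when an axis label is found in one object but not the other. Adding the objects with + leaves NaN in the locations that do not overlap: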

df1 = pd.DataFrame(np.arange(12).reshape((3,4)),columns = list('abcd'))
df2 = pd.DataFrame(np.arange(20).reshape((4,5)),columns = list('abcde'))
df2.loc[1,'b'] = np.nan
print(df1)
---------------
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
print(df2)
-----------------------
    a     b   c   d   e
0   0   1.0   2   3   4
1   5   NaN   7   8   9
2  10  11.0  12  13  14
3  15  16.0  17  18  19
print(df1 + df2)
-----------------------------
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

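Using the add method on df1, pass df2 together with a fill_value argument, which substitutes for a value that is missing from one of the operands: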

print(df1.add(df2,fill_value = 0))
-------------------------------
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:

print(df1.reindex(columns = df2.columns,fill_value = 0))
------------------
   a  b   c   d  e
0  0  1   2   3  0
1  4  5   6   7  0
2  8  9  10  11  0

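Table 1-3: Flexible arithmetic methods

Method	Description
add,radd	Addition (+)
sub,rsub	Subtraction (-)
div,rdiv	Division (/)
floordiv,rfloordiv	Floor division (//)
mul,rmul	Multiplication (*)
pow,rpow	Exponentiation (**)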
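
Each method in the table has a counterpart starting with the letter r, which flips the operands. The sketch below, using a small illustrative DataFrame, shows that 1 / df and df.rdiv(1) compute the same thing.

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 4.0]})

# These two expressions are equivalent: rdiv swaps the operands.
left = 1 / df
right = df.rdiv(1)

print(right)
print(left.equals(right))
```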

1.5.2 Operations Between DataFrame and Series

Arithmetic between a DataFrame and a Series works much like arithmetic between NumPy arrays of different dimensions:

arr = np.arange(12).reshape(3,4)
print(arr)
---------------
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
print(arr[0])
---------
[0 1 2 3]
print(arr-arr[0])
---------------
[[0 0 0 0]
 [4 4 4 4]
 [8 8 8 8]]

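When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting.
Operations between a DataFrame and a Series are similar. By default, the index of the Series is matched against the columns of the DataFrame, and the operation is broadcast down the rows: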

frame = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('bde'),index = ['Utah','Ohio','Texas','Oregon'])
series = frame.iloc[0]
print(frame)
-----------------
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
print(series)
------------------------
b    0
d    1
e    2
Name: Utah, dtype: int32
print(frame - series)
---------------
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9

If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union:

frame = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('bde'),index = ['Utah','Ohio','Texas','Oregon'])
series2 = pd.Series(range(3),index = ['b','e','f'])
print(frame + series2)
-------------------------
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN

If you want to broadcast over the columns instead, matching on the rows, you have to use one of the arithmetic methods:

frame = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('bde'),index = ['Utah','Ohio','Texas','Oregon'])
series3 = frame['d']
print(frame.sub(series3,axis = 'index'))
---------------
        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1

The axis value you pass is the axis to match on. In the example above, we match on the DataFrame's row index (axis = 'index' or axis = 0) and broadcast across the columns.
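
A common use of the two matching directions is centering data. The sketch below is an illustration under the assumption that you want to subtract column means and row means respectively; the names col_centered and row_centered are made up for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index is matched against the columns and the
# subtraction is broadcast down the rows.
col_centered = frame - frame.mean()

# axis='index': the Series index is matched against the row labels and the
# subtraction is broadcast across the columns.
row_centered = frame.sub(frame.mean(axis='columns'), axis='index')

print(col_centered)
print(row_centered)
```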

1.6 Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

frame = pd.DataFrame(np.random.randn(4,3),columns = list('bde'),index = ['Utah','Ohio','Texas','Oregon'])
print(frame)
------------------------------------
               b         d         e
Utah    0.488348 -1.199960  0.779822
Ohio   -0.359066 -0.093988 -0.934162
Texas  -0.631553  0.342848  1.578237
Oregon  0.979741 -1.573482 -1.404697
print(np.abs(frame))
------------------------------------
               b         d         e
Utah    0.488348  1.199960  0.779822
Ohio    0.359066  0.093988  0.934162
Texas   0.631553  0.342848  1.578237
Oregon  0.979741  1.573482  1.404697

Another frequent operation is applying a function on one-dimensional arrays to each column or row. The DataFrame apply method does exactly this:

f = lambda x:x.max() - x.min()
print(frame.apply(f))
-------------
b    1.611294
d    1.916330
e    2.982935
dtype: float64

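Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame.
If you pass axis = 'columns' to apply, the function is invoked once per row instead: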

print(frame.apply(f,axis = 'columns'))
------------------
Utah      1.979781
Ohio      0.840174
Texas     2.209790
Oregon    2.553224
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary to compute them.
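
As a quick check of that point, the sketch below computes the per-column range once with apply and once with the built-in max and min reductions. The seeded random generator is an assumption made only to keep the illustration repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
frame = pd.DataFrame(rng.standard_normal((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Range of each column via apply...
via_apply = frame.apply(lambda x: x.max() - x.min())

# ...and the same result from built-in reductions, with no Python-level
# function call per column.
via_builtin = frame.max() - frame.min()

print(np.allclose(via_apply, via_builtin))
```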
The function passed to apply need not return a scalar value; it can also return a Series with multiple values:

def f(x):
    return pd.Series([x.min(),x.max()],index = ['min','max'])
print(frame.apply(f))
---------------------------------
            b         d         e
min -0.631553 -1.573482 -1.404697
max  0.979741  0.342848  1.578237

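Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with applymap (deprecated since pandas 2.1 in favor of DataFrame.map, which takes the same function):

fmt = lambda x:'%.2f' % x  # named fmt to avoid shadowing the built-in format
print(frame.applymap(fmt))
---------------------------
            b      d      e
Utah     0.49  -1.20   0.78
Ohio    -0.36  -0.09  -0.93
Texas   -0.63   0.34   1.58
Oregon   0.98  -1.57  -1.40

The reason for the name applymap is that Series has a map method for applying an element-wise function:

print(frame['e'].map(fmt))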
----------------------
Utah       0.78
Ohio      -0.93
Texas      1.58
Oregon    -1.40
Name: e, dtype: object
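
Since pandas 2.1 renamed the element-wise DataFrame method from applymap to map, code that must run on both older and newer versions can pick whichever exists. The following is one possible sketch; the helper name two_decimals and the hasattr check are choices of this example, not an official compatibility recipe.

```python
import pandas as pd

frame = pd.DataFrame({'b': [0.488348, -0.359066], 'd': [-1.19996, -0.093988]},
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format a number as a string with two decimal places."""
    return f'{x:.2f}'

# DataFrame.map exists in pandas 2.1 and later; fall back to applymap otherwise.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(two_decimals)

print(formatted)
```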

1.7 Sorting and Ranking

To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

obj = pd.Series(range(4),index = ['d','a','b','c'])
print(obj.sort_index())
------------
a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis:

frame = pd.DataFrame(np.arange(8).reshape((2,4)),index = ['three','one'],columns = ['d','a','b','c'])
print(frame.sort_index())
-----------------
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
print(frame.sort_index(axis = 1))
-----------------
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

The data is sorted in ascending order by default, but it can be sorted in descending order, too:

print(frame.sort_index(axis = 1,ascending=False))
-----------------
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

To sort a Series by its values, use its sort_values method:

obj = pd.Series([4,7,-3,2])
print(obj.sort_values())
------------
2   -3
3    2
0    4
1    7
dtype: int64

By default, any missing values are sorted to the end of the Series:

obj = pd.Series([4,np.nan,7,np.nan,-3,2])
print(obj.sort_values())
--------------
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:

frame = pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,-1]})
print(frame)
-------
   b  a
0  4  0
1  7  1
2 -3  0
3  2 -1
print(frame.sort_values(by = 'b'))
-------
   b  a
2 -3  0
3  2 -1
0  4  0
1  7  1
print(frame.sort_values(by = ['a','b']))
-------
   b  a
3  2 -1
2 -3  0
0  4  0
1  7  1
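
Two further options of sort_values are sketched below: a per-key sort direction, and placing missing values first. Both use the same data as the examples above; the variable names mixed and nan_first are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, -1]})

# Sort by 'a' ascending, then break ties by 'b' descending.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

# na_position='first' moves missing values to the front instead of the end.
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
nan_first = obj.sort_values(na_position='first')

print(mixed)
print(nan_first)
```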

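Ranking assigns ranks from one through the number of valid data points in an array. The rank methods of Series and DataFrame do this; by default, rank breaks ties by assigning each group of tied values the mean rank: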

obj = pd.Series([7,-5,7,4,2,0,4])
print(obj.rank())
--------------
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

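Ranks can also be assigned according to the order in which the values are observed in the data, so tied values receive consecutive ranks instead of a shared average: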

print(obj.rank(method = 'first'))
--------------
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

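You can rank in descending order, too. The following example also assigns tied values the maximum rank in their group: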

print(obj.rank(ascending = False,method = 'max'))
--------------
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

A DataFrame can compute ranks over the rows or the columns:

frame = pd.DataFrame({'b':[4.3,7,-3,2],'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
print(frame)
--------------
     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5
print(frame.rank(axis = 'columns'))
----------------
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0

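Table 1-4: Tie-breaking methods with rank

Method	Description
average	Default: assign the average rank to each entry in the equal group
min	Use the minimum rank for the whole group
max	Use the maximum rank for the whole group
first	Assign ranks in the order the values appear in the data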
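
To compare the tie-breaking methods from the table on the same data, the sketch below builds one column of ranks per method. Collecting them in a DataFrame named ranks is just a convenience for this illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method listed in the table.
ranks = pd.DataFrame({m: obj.rank(method=m)
                      for m in ['average', 'min', 'max', 'first']})

print(ranks)
```

The two tied 7s and the two tied 4s are the only rows where the columns differ.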

1.8 Axis Indexes with Duplicate Labels

The is_unique property of the index tells you whether its labels are unique:

obj = pd.Series(range(5),index = ['a','a','b','b','c'])
print(obj.index.is_unique) #False

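Data selection is one of the main things that behaves differently with duplicate labels. Indexing a label that has multiple entries returns a Series, while indexing a label with a single entry returns a scalar value.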
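
The difference in return type can be seen directly. The following short sketch reuses the Series from the is_unique example; the names repeated and single are illustrative.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj['a']   # the label occurs twice: the result is a Series
single = obj['c']     # the label occurs once: the result is a scalar

print(type(repeated).__name__, len(repeated))
print(single)
```

Code that indexes by label may therefore need to handle both return types when the index is not guaranteed to be unique.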