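pandas Study Notes, Part 3: Working with Series (Sorting, Concatenating, Replacing, Updating, and Missing Values)

This part covers common Series operations: sorting a Series, appending one Series to another, replacing values, updating values, and handling missing values.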

I. Sorting a Series

Sort a Series by its values or by its index:

Series.sort_values(self, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Series.sort_index(self, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)

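Parameters:

axis: for a Series this can only be 0.
ascending: defaults to True (ascending order); set it to False to sort in descending order.
inplace: whether to sort the original Series in place. If True, the original Series is sorted and nothing is returned; if False, the original is left unchanged and a sorted copy is returned.
kind: the sorting algorithm; valid values are quicksort, mergesort, and heapsort, and the default is quicksort.
na_position: first puts NaN at the beginning of the result, last puts NaN at the end.
level: for a MultiIndex, the index level to sort on. With the default of None the whole index is sorted, starting from level 0.
sort_remaining: if True and a level of a MultiIndex is given, the remaining levels are also sorted, in order, after the specified level.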

1. Sorting by value

Sort by the values of the Series; by default NaN is placed last:

>>> s = pd.Series([np.nan, 1, 3, 10, 5])
>>> s.sort_values()
1     1.0
2     3.0
4     5.0
3    10.0
0     NaN
dtype: float64
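
The ascending and na_position parameters described above can be combined. The following is a minimal sketch that reuses the same data; the variable name desc is only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, 3, 10, 5])

# Descending order, with the missing value moved to the front.
desc = s.sort_values(ascending=False, na_position='first')
print(desc)

# With the default inplace=False the original Series keeps its order.
print(s.index.tolist())  # [0, 1, 2, 3, 4]
```

Because inplace defaults to False, the sorted result has to be assigned to a name (or used directly) to have any effect.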

2. Sorting by index

Sort by the index labels:

>>> s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4])
>>> s.sort_index()
1    c
2    b
3    a
4    d
dtype: object
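
The level and sort_remaining parameters only matter for a MultiIndex. The sketch below uses a small made-up two-level index (the labels and values are assumptions chosen for illustration) to show how level changes the primary sort key.

```python
import pandas as pd

# Hypothetical two-level index: letters on level 0, numbers on level 1.
idx = pd.MultiIndex.from_tuples([('b', 2), ('a', 2), ('b', 1), ('a', 1)])
s = pd.Series([10, 20, 30, 40], index=idx)

# Default: sort the whole index, level 0 first and then level 1.
by_all = s.sort_index()

# Sort on level 1 first; with sort_remaining=True (the default),
# level 0 is then used to order entries that tie on level 1.
by_level1 = s.sort_index(level=1)

print(by_all)
print(by_level1)
```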

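II. Appending and concatenating Series

Use append to attach one Series to the end of another, producing a new Series. Note that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, pd.concat([s1, s2]) gives the same result.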

Series.append(self, to_append, ignore_index=False, verify_integrity=False)

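Parameters:

to_append: the Series (or list of Series) to append.
ignore_index: defaults to False, which keeps the original index labels; if True, the result gets a new index 0, 1, ..., n-1.
verify_integrity: defaults to False; if True, an exception is raised when the resulting index contains duplicates.

For example, combine two Series into one. When the index is kept, the labels of both Series are carried over into the result; when it is ignored, the new Series is reindexed from 0.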

>>> s1 = pd.Series([1, 2, 3])
>>> s2 = pd.Series([4, 5, 6])
>>> s1.append(s2)
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64
>>> s1.append(s2, ignore_index=True)
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
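
On pandas versions where Series.append is no longer available, the same results can be obtained with pd.concat, which accepts the same ignore_index and verify_integrity options. A minimal sketch (the names kept, rebuilt, and duplicate_error are illustrative only):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])

# Keep the original labels: the index becomes 0, 1, 2, 0, 1, 2.
kept = pd.concat([s1, s2])

# Discard the labels and rebuild the index as 0..5.
rebuilt = pd.concat([s1, s2], ignore_index=True)

# verify_integrity=True refuses to build an index with duplicates.
try:
    pd.concat([s1, s2], verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

print(kept.index.tolist())
print(rebuilt.index.tolist())
print(type(duplicate_error).__name__)
```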

III. Replacing values in a Series

Replace values in a Series with other values:

Series.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')

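Parameters:

to_replace: the existing value(s) to look for; each match is replaced by the value given in the value parameter.
inplace: defaults to False, meaning the Series is not modified in place.
limit: the maximum size of the gap to forward or backward fill when method is used.
regex: whether to interpret to_replace as a regular expression; defaults to False. If True, to_replace must be a string.
method: valid values are pad, ffill, and bfill. It is used when to_replace is a scalar, list, or tuple and value is None. The method and limit parameters are deprecated as of pandas 2.1.

Case 1: to_replace is a scalar and value is also a scalar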

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

Case 2: to_replace is a list and value is a scalar; for example, replace every value that appears in the to_replace list with 5

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace([0,2],5)
0    5
1    1
2    5
3    3
4    4
dtype: int64

Case 3: to_replace is a dict and value is None; each value matching a dict key is replaced by the corresponding dict value.

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace({0:5,1:7})
0    5
1    7
2    2
3    3
4    4
dtype: int64
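
The regex parameter is not exercised by the three cases above. As a small sketch, with made-up string data, a pattern can be passed as to_replace together with regex=True:

```python
import pandas as pd

# Hypothetical string data for illustration.
s = pd.Series(['bat', 'foo', 'bait'])

# Replace strings that are exactly "ba" followed by one character.
# 'bait' does not match because the pattern is anchored at both ends.
replaced = s.replace(to_replace=r'^ba.$', value='new', regex=True)
print(replaced.tolist())  # ['new', 'foo', 'bait']
```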

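IV. Updating a Series

There are three ways to update values in a Series: assign a scalar to a single element, assign to a slice to change several elements, or use another Series to update this one.

1. Updating with a scalar

Select a single element by position with iat and assign a new value to it: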

>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.iat[1]=7
>>> s
0    0
1    7
2    2
3    3
4    4
dtype: int64

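2. Updating through a slice

Select a slice with the loc attribute and assign a list to change several values at once. loc slices by label and includes both endpoints, so 1:2 covers labels 1 and 2: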

>>> s.loc[1:2]=[2,3]
>>> s
0    0
1    2
2    3
3    3
4    4
dtype: int64

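3. Updating with another Series

update() modifies the Series in place, aligning on the index: values in the original Series whose index labels also appear in the other Series are overwritten with the new values.

Series.update(self, other)

For example, use update() to change the values at index 0 and 2 to 'd' and 'e':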

>>> s = pd.Series(['a', 'b', 'c'])
>>> s.update(pd.Series(['d', 'e'], index=[0, 2]))
>>> s
0    d
1    b
2    e
dtype: object
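
Two further details of the index alignment are worth seeing in action: NaN values in the other Series do not overwrite anything, and labels that exist only in the other Series are ignored. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# Label 1 carries NaN, so the original value 2 is kept.
# Label 3 does not exist in s, so the value 7 is ignored.
s.update(pd.Series([4, np.nan, 6, 7]))

print(s.tolist())  # [4, 2, 6]
```

Note that update() returns None; the change is visible only on the original Series.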

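V. Handling missing values in a Series

Missing values are represented by np.nan (the np.NaN alias was removed in NumPy 2.0) or None. Use isna() to check for NA values, dropna() to remove them from a Series, and fillna() to fill them in.

1. Checking for missing values

>>> s=pd.Series(data=[1,2,np.nan,4])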
>>> s.isna()
0    False
1    False
2     True
3    False
dtype: bool

2. Dropping missing values

>>> s.dropna()
0    1.0
1    2.0
3    4.0
dtype: float64
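
In practice the boolean mask from isna() is often summed to count the missing entries. The sketch below also shows that None in numeric data is stored as NaN; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# None and np.nan are both treated as missing in a numeric Series.
s = pd.Series([1, None, np.nan, 4])

# True counts as 1, so the sum is the number of missing values.
n_missing = int(s.isna().sum())

# dropna() returns a new Series and keeps the surviving labels.
cleaned = s.dropna()

print(n_missing)               # 2
print(cleaned.index.tolist())  # [0, 3]
```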

VI. Filling from neighboring valid values

Use fillna() to find a valid value next to the missing one and fill the gap with it:

Series.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

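Parameters:

value: the value used to fill missing entries.
method: how to find the fill value: backward fill ('backfill', 'bfill'), forward fill ('pad', 'ffill'), or a fixed value (method=None, the default). This parameter is deprecated since pandas 2.1 in favor of the Series.bfill() and Series.ffill() methods.
downcast: downcast the result to a smaller dtype where possible, for example float64 to int64; the default is None.

1. Backward fill

Backward fill refers to the backfill and bfill methods, which fill a missing value with the first valid value after it: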

>>> s.fillna(method='bfill')
0    1.0
1    2.0
2    4.0
3    4.0
dtype: float64

2. Forward fill

Forward fill refers to the pad and ffill methods, which fill a missing value with the last valid value before it:

>>> s.fillna(method='ffill')
0    1.0
1    2.0
2    2.0
3    4.0
dtype: float64

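3. Filling with a fixed value

When method is None, missing values are filled with the value given in the value parameter. Typical choices are the mean, the median, or the mode: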

>>> s.fillna(value=3,method=None)
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
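
The three fill strategies above can also be written without the method keyword, using the dedicated ffill() and bfill() methods and a computed statistic as the fixed value. A minimal sketch on the same data (variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4])

forward = s.ffill()    # fill with the previous valid value
backward = s.bfill()   # fill with the next valid value

# mean() skips NaN by default, so this is (1 + 2 + 4) / 3.
mean_filled = s.fillna(s.mean())

print(forward.tolist())   # [1.0, 2.0, 2.0, 4.0]
print(backward.tolist())  # [1.0, 2.0, 4.0, 4.0]
print(mean_filled.iloc[2])
```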

VII. Filling by interpolation

Use interpolation to estimate the missing values, then fill the gaps with the estimates:

Series.interpolate(self, method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)

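Parameters:

limit_direction: the direction in which consecutive NaN values are filled; valid values are 'forward', 'backward', and 'both'.
limit_area: valid values are None, 'inside', and 'outside'. None means no restriction; 'inside' fills only NaN values surrounded by valid values (interpolation); 'outside' fills only NaN values outside the valid values (extrapolation).
method: the interpolation method; the default is 'linear'. Implemented methods are 'linear', 'time', 'index', 'pad', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'spline', 'barycentric', 'polynomial', 'krogh', 'piecewise_polynomial', 'pchip', 'akima', and 'from_derivatives'.
**kwargs: keyword arguments passed on to the interpolating function.

The most commonly used methods are linear interpolation and polynomial interpolation.

1. Linear interpolation

linear is the default method; it ignores the index and treats the values as equally spaced: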

>>> s.interpolate()
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
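
Because linear ignores the index, it can give a different answer from method='index' when the index is unevenly spaced. The sketch below uses a made-up index [0, 1, 3] to show the difference; the data and names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# The gap between labels 1 and 3 is twice the gap between 0 and 1.
s = pd.Series([1.0, np.nan, 4.0], index=[0, 1, 3])

# 'linear' treats the three points as equally spaced: midpoint of 1 and 4.
linear = s.interpolate()

# 'index' uses the labels as x coordinates: one third of the way from 1 to 4.
by_index = s.interpolate(method='index')

print(linear.iloc[1])    # 2.5
print(by_index.iloc[1])  # 2.0
```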

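2. Polynomial interpolation

polynomial performs polynomial interpolation and requires an order argument (this method depends on SciPy being installed):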

>>> s.interpolate(method='polynomial',order=2)
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
