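Pandas Study Notes (3): Grouping in Pandas

Preliminaries

For more articles and code, see the author's personal website: https://www.iwtmbtly.com/

Import the required libraries and load the data file: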

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('data/table.csv',index_col='ID')
>>> df.head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

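I. The SAC Process

1. Meaning

SAC refers to the split-apply-combine process behind grouped operations, where:

split divides the data into groups according to some rule;
apply runs a function on each group independently;
combine assembles the per-group results into a single data structure.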
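
The three steps can be spelled out by hand and compared with a single groupby call. This is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are assumptions for illustration, not the table.csv used in these notes):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the table.csv used in this post
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 87.2, 83.3, 50.6, 52.5],
})

# split: one sub-table per distinct School value
pieces = {key: toy[toy['School'] == key] for key in toy['School'].unique()}

# apply: compute the statistic on each piece independently
means = {key: piece['Math'].mean() for key, piece in pieces.items()}

# combine: assemble the per-group results into one Series
manual = pd.Series(means, name='Math').rename_axis('School').sort_index()

# groupby performs the same three steps in one call
auto = toy.groupby('School')['Math'].mean()

print(np.allclose(manual.values, auto.values))
```

The manual version is only meant to make the three stages visible; in practice the groupby call is both shorter and faster.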

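2. The apply step

In practice, the apply step usually falls into one of four kinds of problem:

Aggregation: compute a statistic per group (for example the mean, or the number of elements in each group)
Transformation: operate on every element within each group (for example standardizing values)
Filtration: keep only the groups that satisfy some rule (for example groups in which some metric is below 50)
Mixed problems: combinations of the three kinds above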
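
One way to tell the first three kinds apart is by the shape of what they return. The sketch below uses a small made-up table (an assumption for illustration) and only compares result lengths:

```python
import pandas as pd

# Hypothetical toy data, not the table.csv used in this post
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 87.2, 83.3, 50.6, 52.5],
})
g = toy.groupby('School')['Math']

# Aggregation: one value per group
aggregated = g.mean()

# Transformation: one value per original row
transformed = g.transform(lambda s: s - s.mean())

# Filtration: whole groups are kept or dropped (here: minimum Math above 50)
filtered = toy.groupby('School').filter(lambda d: d['Math'].min() > 50)

print(len(aggregated), len(transformed), len(filtered))  # 2 5 3
```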

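II. The groupby Function

(I) Basics of grouping

1. Grouping by a single column

# groupby returns a GroupBy object; it computes nothing by itself and only does work when one of its methods is called
>>> grouped_single = df.groupby('School')
# For example, extract one group: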
>>> grouped_single.get_group('S_1').head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+

2. Grouping by several columns

>>> grouped_mul = df.groupby(['School','Class'])
>>> grouped_mul.get_group(('S_2','C_4'))
     School Class Gender   Address  Height  Weight  Math Physics
ID
2401    S_2   C_4      F  street_2     192      62  45.3       A
2402    S_2   C_4      M  street_7     166      82  48.7       B
2403    S_2   C_4      F  street_6     158      60  59.7      B+
2404    S_2   C_4      F  street_2     160      84  67.7       B
2405    S_2   C_4      F  street_6     193      54  47.6       B

3. Group sizes and number of groups

>>> grouped_single.size()
School
S_1    15
S_2    20
dtype: int64

>>> grouped_mul.size()
School  Class
S_1     C_1      5
        C_2      5
        C_3      5
S_2     C_1      5
        C_2      5
        C_3      5
        C_4      5
dtype: int64

>>> grouped_single.ngroups
2
>>> grouped_mul.ngroups
7
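
Note that size counts rows, whereas count counts non-missing values, so the two differ once a column contains NaN. A minimal sketch on made-up data (the `toy` table is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing Math scores
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, np.nan, 83.3, 50.6, np.nan],
})
g = toy.groupby('School')

sizes = g.size()            # rows per group, NaN included
counts = g['Math'].count()  # non-missing Math values per group

print(sizes.tolist(), counts.tolist(), g.ngroups)  # [2, 3] [1, 2] 2
```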

4. Iterating over groups

>>> for name,group in grouped_single:
...     print(name)
...     print(group.head())
...
S_1
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+
S_2
     School Class Gender   Address  Height  Weight  Math Physics
ID
2101    S_2   C_1      M  street_7     174      84  83.3       C
2102    S_2   C_1      F  street_6     161      61  50.6      B+
2103    S_2   C_1      M  street_4     157      61  52.5      B-
2104    S_2   C_1      F  street_5     159      97  72.2      B+
2105    S_2   C_1      M  street_4     170      81  34.2       A

5. The level parameter (for a MultiIndex) and the axis parameter

>>> df.set_index(['Gender', 'School']).groupby(level=1, axis=0).get_group('S_1').head()
              Class   Address  Height  Weight  Math Physics
Gender School
M      S_1      C_1  street_1     173      63  34.0      A+
F      S_1      C_1  street_2     192      73  32.5      B+
M      S_1      C_1  street_2     186      82  87.2      B+
F      S_1      C_1  street_2     167      81  80.4      B-
       S_1      C_1  street_4     159      64  84.8      B+
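
The level can also be given by name rather than by position, which tends to be easier to read. The sketch below uses a small made-up table (an assumption for illustration) and omits axis, since axis=0 is the default:

```python
import pandas as pd

# Hypothetical toy data with a two-level index (Gender, School)
toy = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F'],
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [34.0, 32.5, 83.3, 50.6],
}).set_index(['Gender', 'School'])

by_position = toy.groupby(level=1)['Math'].mean()        # second index level
by_name = toy.groupby(level='School')['Math'].mean()     # same level, by name
both = toy.groupby(level=['Gender', 'School'])['Math'].mean()  # several levels

print(by_position.equals(by_name))
```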

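(II) Properties of the GroupBy object

1. Listing all callable methods

As the example below shows, a GroupBy object exposes a large number of functions and is very flexible: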

>>> print([attr for attr in dir(grouped_single) if not attr.startswith('_')])
['Address', 'Class', 'Gender', 'Height', 'Math', 'Physics', 'School', 'Weight', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sample', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']

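2. head and first on a GroupBy object

Calling head on a GroupBy object returns the first few rows of each group, not the first few rows of the whole dataset: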

>>> grouped_single.head(2)
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
2101    S_2   C_1      M  street_7     174      84  83.3       C
2102    S_2   C_1      F  street_6     161      61  50.6      B+

first returns, for each group, the first non-missing value of every column, with the group key as the index:

>>> grouped_single.first()
       Class Gender   Address  Height  Weight  Math Physics
School
S_1      C_1      M  street_1     173      63  34.0      A+
S_2      C_1      M  street_7     174      84  83.3       C
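
The difference between head(1) and first only shows up when the first row of a group has missing values. A minimal sketch on made-up data (the `toy` table is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first row of each group has a missing value
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [np.nan, 87.2, 83.3, 50.6],
    'Physics': ['A+', 'B+', None, 'B-'],
})
g = toy.groupby('School')

first_rows = g.head(1)   # literal first row of each group, original index kept
first_valid = g.first()  # first non-missing value per column, indexed by School

print(first_rows)
print(first_valid)
```

Here head(1) keeps the NaN in the first S_1 row, while first skips it and reports 87.2.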

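3. Grouping keys

For groupby, the grouping key is very flexible: any list-like with the same length as the DataFrame works, and grouping by a function is also supported:

# equivalent to grouping by np.random.choice(['a','b','c'],df.shape[0]) as if it were a new column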
>>> df.groupby(np.random.choice(['a','b','c'],df.shape[0])).get_group('a').head()
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1105    S_1   C_1      F  street_4     159      64  84.8      B+
1201    S_1   C_2      M  street_5     188      68  97.0      A-
2103    S_2   C_1      M  street_4     157      61  52.5      B-

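In principle, when a function is used as the key, the objects passed to it are the index labels, as the output below shows; this makes some fairly elaborate operations possible: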

>>> df[:5].groupby(lambda x:print(x)).head(0)
1101
1102
1103
1104
1105
Empty DataFrame
Columns: [School, Class, Gender, Address, Height, Weight, Math, Physics]
Index: []

Grouping by odd and even rows:

>>> df.groupby(lambda x:'odd row' if df.index.get_loc(x)%2==0 else 'even row').groups
{'even row': [1102, 1104, 1201, 1203, 1205, 1302, 1304, 2101, 2103, 2105, 2202, 2204, 2301, 2303, 2305, 2402, 2404], 'odd row': [1101, 1103, 1105, 1202, 1204, 1301, 1303, 1305, 2102, 2104, 2201, 2203, 2205, 2302, 2304, 2401, 2403, 2405]}
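
The same odd/even split can also be expressed without a lambda, by passing an array of labels built from the row positions. This is a sketch on made-up data (the `toy` table and the label strings are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with ID-like index labels
toy = pd.DataFrame({'Math': [34.0, 32.5, 87.2, 80.4, 84.8]},
                   index=[1101, 1102, 1103, 1104, 1105])

# Row positions 0, 1, 2, ...; the 1st, 3rd, 5th rows sit at even positions
position = np.arange(len(toy))
labels = np.where(position % 2 == 0, 'odd row', 'even row')

# Any array with one label per row can serve as the grouping key
sums = toy.groupby(labels)['Math'].sum()
print(sums)
```

Building the labels once avoids calling get_loc for every index value, which may matter on larger tables.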

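With a MultiIndex, the argument passed to the lambda is a tuple. The example below checks, for each school, whether the mean Math score of the male and of the female students reaches the passing mark of 60.

Note: this only demonstrates how groupby behaves; the task would not be written this way in practice.

>>> math_score = df.set_index(['Gender','School'])['Math'].sort_index()
>>> grouped_score = df.set_index(['Gender','School']).sort_index().\
...             groupby(lambda x:(x,'mean passes' if math_score[x].mean()>=60 else 'mean fails'))
>>> for name,_ in grouped_score:print(name)
(('F', 'S_1'), 'mean passes')
(('F', 'S_2'), 'mean passes')
(('M', 'S_1'), 'mean passes')
(('M', 'S_2'), 'mean fails')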

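4. The [] operation on a GroupBy object

[] selects one or several columns of a GroupBy object, so the mean-score comparison above can be written concisely as: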

>>> df.groupby(['Gender','School'])['Math'].mean()>=60
Gender  School
F       S_1        True
        S_2        True
M       S_1        True
        S_2       False
Name: Math, dtype: bool

A list selects several columns:

>>> df.groupby(['Gender','School'])[['Math','Height']].mean()
                    Math      Height
Gender School
F      S_1     64.100000  173.125000
       S_2     66.427273  173.727273
M      S_1     63.342857  178.714286
       S_2     51.155556  172.000000

5. Grouping by a continuous variable

For example, use the cut function to group by Math score:

>>> bins = [0,40,60,80,90,100]
>>> cuts = pd.cut(df['Math'],bins=bins) # optionally pass labels= to add custom labels
>>> df.groupby(cuts)['Math'].count()
Math
(0, 40]       7
(40, 60]     10
(60, 80]      9
(80, 90]      7
(90, 100]     2
Name: Math, dtype: int64
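
The labels argument mentioned in the comment replaces the interval notation with custom names. A minimal sketch, where the scores and the label names are made up for illustration; observed=False is passed explicitly so that empty bins are still reported:

```python
import pandas as pd

# Hypothetical scores and label names
scores = pd.Series([34.0, 32.5, 87.2, 80.4, 59.7, 97.0], name='Math')
bins = [0, 40, 60, 80, 90, 100]
labels = ['fail', 'weak', 'pass', 'good', 'excellent']

# Each score falls into a right-closed interval such as (0, 40]
cuts = pd.cut(scores, bins=bins, labels=labels)

# observed=False keeps categories that contain no scores (count 0)
counts = scores.groupby(cuts, observed=False).count()
print(counts)
```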

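III. Aggregation, Filtration and Transformation

(I) Aggregation

1. Common aggregation functions

Aggregation reduces a collection of values to a scalar, so mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max are all aggregation functions (describe returns several such statistics at once)

To get familiar with these operations, we can verify the standard error of the mean computed by sem. Its formula is:

$$\mathrm{SEM} = \frac{s}{\sqrt{n}}$$

where s is the sample standard deviation of the group and n is the number of non-missing values in it.

Verification: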

>>> group_m = grouped_single['Math']
>>> group_m.std().values/np.sqrt(group_m.count().values)== group_m.sem().values
array([ True,  True])
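
The exact == comparison above succeeds on this data, but comparing floating-point results for exact equality can fail because of rounding. A tolerance-based check with np.allclose is a more robust way to run the same verification; the sketch below uses made-up data (an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including one missing score
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 32.5, 87.2, 83.3, 50.6, np.nan],
})
g = toy.groupby('School')['Math']

# std uses ddof=1 by default; count ignores NaN
manual_sem = g.std() / np.sqrt(g.count())

# Compare within a small tolerance instead of exact equality
matches = np.allclose(manual_sem, g.sem())
print(matches)
```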

2. Using several aggregation functions at once

>>> group_m.agg(['sum','mean','std'])
           sum       mean        std
School
S_1      956.2  63.746667  23.077474
S_2     1191.1  59.555000  17.589305

Tuples of (new name, function) rename the result columns:

>>> group_m.agg([('rename_sum','sum'),('rename_mean','mean')])
        rename_sum  rename_mean
School
S_1          956.2    63.746667
S_2         1191.1    59.555000

A dict specifies which functions are applied to which columns:

>>> grouped_mul.agg({'Math':['mean','max'],'Height':'var'})
               Math       Height
               mean   max    var
School Class
S_1    C_1    63.78  87.2  183.3
       C_2    64.30  97.0  132.8
       C_3    63.16  87.7  179.2
S_2    C_1    58.56  83.3   54.7
       C_2    62.80  85.4  256.0
       C_3    63.06  95.5  205.7
       C_4    53.80  67.7  300.2

3. Using custom functions

# agg receives the data group by group and column by column; this property makes many things possible
>>> grouped_single['Math'].agg(lambda x:print(x.head(),'separator'))
1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
Name: Math, dtype: float64 separator
2101    83.3
2102    50.6
2103    52.5
2104    72.2
2105    34.2
Name: Math, dtype: float64 separator
School
S_1    None
S_2    None
Name: Math, dtype: object

pandas provides no built-in function for the range (maximum minus minimum), but agg makes the within-group range easy to compute:

>>> grouped_single['Math'].agg(lambda x:x.max()-x.min())
School
S_1    65.5
S_2    62.8
Name: Math, dtype: float64

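4. Multiple aggregations with NamedAgg

Note: in the pandas version used for these notes, lambda functions were not supported here (newer releases may accept them), but functions defined separately with def can be used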

>>> def R1(x):
...     return x.max()-x.min()
>>> def R2(x):
...     return x.max()-x.median()
>>> grouped_single.agg(min_score1=pd.NamedAgg(column='Math', aggfunc=R1),
...                    max_score1=pd.NamedAgg(column='Math', aggfunc='max'),
...                    range_score2=pd.NamedAgg(column='Math', aggfunc=R2)).head()
        min_score1  max_score1  range_score2
School
S_1           65.5        97.0          33.5
S_2           62.8        95.5          39.4
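
pd.NamedAgg is a named tuple of (column, aggfunc), so a plain tuple can be used in its place, and different output columns may draw on different source columns. A minimal sketch on made-up data (the `toy` table, the helper `value_range` and the output names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 87.2, 83.3, 50.6, 52.5],
    'Height': [173, 186, 174, 161, 157],
})

def value_range(s):
    # Range of a column within one group
    return s.max() - s.min()

# keyword = (source column, aggregation function or function name)
summary = toy.groupby('School').agg(
    math_range=('Math', value_range),
    math_max=('Math', 'max'),
    height_mean=('Height', 'mean'),
)
print(summary)
```

The result has flat column names taken from the keywords, which avoids the two-level columns produced by the dict form.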

5. Aggregation functions with parameters

Check whether each group has at least one Math score between 50 and 52:

>>> def f(s,low,high):
...     return s.between(low,high).max()
>>> grouped_single['Math'].agg(f,50,52)
School
S_1    False
S_2     True
Name: Math, dtype: bool

If several functions are needed and at least one of them takes parameters, wrap the parameterized function:

>>> def f_test(s,low,high):
...     return s.between(low,high).max()
>>> def agg_f(f_mul,name,*args,**kwargs):
...     def wrapper(x):
...         return f_mul(x,*args,**kwargs)
...     wrapper.__name__ = name
...     return wrapper
>>> new_f = agg_f(f_test,'at_least_one_in_50_52',50,52)
>>> grouped_single['Math'].agg([new_f,'mean']).head()
        at_least_one_in_50_52       mean
School
S_1                     False  63.746667
S_2                      True  59.555000
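
With keyword-style aggregation the output name comes from the keyword, so the `__name__` assignment is not needed; a small function factory is enough. This is a sketch on made-up data (the `toy` table and the helper `between_any` are assumptions for illustration), offered as an alternative rather than a replacement for the wrapper above:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 87.2, 83.3, 50.6, 52.5],
})

def between_any(low, high):
    # Build a one-argument aggregation function with low/high fixed
    def check(s):
        return s.between(low, high).any()
    return check

# The keyword on the left becomes the output column name
summary = toy.groupby('School')['Math'].agg(
    in_50_52=between_any(50, 52),
    mean='mean',
)
print(summary)
```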

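(II) Filtration

filter selects groups (remember that the result contains every row of each group that is kept), so the function passed in must return a boolean scalar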

>>> grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32).all()).head()
      Math Physics
ID
2101  83.3       C
2102  50.6      B+
2103  52.5      B-
2104  72.2      B+
2105  34.2       A
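
To stress that filter keeps or drops whole groups, the sketch below filters on the group mean and compares the result with an equivalent boolean mask built with transform. The data is made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: S_1 has mean 60.6, S_2 has mean about 51.2
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 87.2, 45.3, 48.7, 59.7],
})

# Keep the groups whose mean Math score is at least 60
kept = toy.groupby('School').filter(lambda d: d['Math'].mean() >= 60)

# Equivalent row mask: broadcast each group's mean back to its rows
mask = toy.groupby('School')['Math'].transform('mean') >= 60
kept_by_mask = toy[mask]

print(kept)
```

The row with Math 34.0 survives because its group passes the test as a whole; the condition is never evaluated per row.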

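(III) Transformation

1. What the function receives

The function passed to transform receives one column of one group at a time, and its return value must have exactly the same length as that column: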

>>> grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()
      Math  Height
ID
1101   2.5      14
1102   1.0      33
1103  55.7      27
1104  48.9       8
1105  53.3       0

If a scalar is returned, it is broadcast to every element of the group:

>>> grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()
           Math      Height
ID
1101  63.746667  175.733333
1102  63.746667  175.733333
1103  63.746667  175.733333
1104  63.746667  175.733333
1105  63.746667  175.733333

2. Standardizing within groups with transform

>>> grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()
          Math    Height
ID
1101 -1.288991 -0.214991
1102 -1.353990  1.279460
1103  1.016287  0.807528
1104  0.721627 -0.686923
1105  0.912289 -1.316166
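
A quick sanity check for within-group standardization is that every group ends up with mean 0 and standard deviation 1. A minimal sketch on made-up data (the `toy` table and the helper `zscore` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 32.5, 87.2, 83.3, 50.6, 52.5],
})

def zscore(s):
    # Standardize with the sample standard deviation (ddof=1), the pandas default
    return (s - s.mean()) / s.std()

toy['Math_z'] = toy.groupby('School')['Math'].transform(zscore)

# Per-group mean should be ~0 and per-group std ~1
check = toy.groupby('School')['Math_z'].agg(['mean', 'std'])
print(check)
```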

3. Filling missing values with the group mean using transform

>>> df_nan = df[['Math','School']].copy().reset_index()
>>> df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan
>>> df_nan.head()
     ID  Math School
0  1101   NaN    S_1
1  1102   NaN    S_1
2  1103   NaN    S_1
3  1104  80.4    S_1
4  1105   NaN    S_1

>>> df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(df.reset_index()['School']).head()
     ID   Math School
0  1101  62.22    S_1
1  1102  62.22    S_1
2  1103  62.22    S_1
3  1104  80.40    S_1
4  1105  62.22    S_1
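
Because the example above draws the NaN positions at random, its output changes from run to run. The sketch below uses fixed, made-up data (an assumption for illustration) and selects the Math column before transform, so the School column does not have to be joined back:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score per school
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, np.nan, 87.2, 83.3, np.nan, 50.6],
})

filled = toy.copy()
# Each NaN is replaced by the mean of the non-missing scores in its own school
filled['Math'] = toy.groupby('School')['Math'].transform(lambda s: s.fillna(s.mean()))
print(filled)
```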

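IV. The apply Function

(I) The flexibility of apply

Among all the grouped operations, apply is probably the most widely used, thanks to its flexibility.

As the printed output below shows, apply receives each group as a whole table (a DataFrame):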

>>> df.groupby('School').apply(lambda x:print(x.head(1)))
     School Class Gender   Address  Height  Weight  Math Physics
ID
1101    S_1   C_1      M  street_1     173      63  34.0      A+
     School Class Gender   Address  Height  Weight  Math Physics
ID
2101    S_2   C_1      M  street_7     174      84  83.3       C
Empty DataFrame
Columns: []
Index: []

Much of the flexibility of apply comes from the variety of return values it accepts:

1. Scalar return value

>>> df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())
       School  Math  Height
School
S_1       S_1  97.0     195
S_2       S_2  95.5     194

2. List-like return value

>>> df[['School','Math','Height']].groupby('School').apply(lambda x:x-x.min()).head()
      Math  Height
ID
1101   2.5    14.0
1102   1.0    33.0
1103  55.7    27.0
1104  48.9     8.0
1105  53.3     0.0

3. DataFrame return value

>>> df[['School','Math','Height']].groupby('School')\
...     .apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max(),
...                                   'col2':x['Math']-x['Math'].min(),
...                                   'col3':x['Height']-x['Height'].max(),
...                                   'col4':x['Height']-x['Height'].min()})).head()
      col1  col2  col3  col4
ID
1101 -63.0   2.5   -22    14
1102 -64.5   1.0    -3    33
1103  -9.8  55.7    -9    27
1104 -16.6  48.9   -28     8
1105 -12.2  53.3   -36     0
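
The three cases can be told apart by the shape of the result. The sketch below uses made-up data (an assumption for illustration) and selects the value columns explicitly, because how apply treats the grouping column has varied between pandas versions:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 87.2, 83.3, 50.6, 52.5],
    'Height': [173, 186, 174, 161, 157],
})
g = toy.groupby('School')[['Math', 'Height']]

# A scalar per group gives a Series with one entry per group
scalar_result = g.apply(lambda d: d['Math'].max())

# A Series per group gives a DataFrame with one row per group
series_result = g.apply(lambda d: d.max())

# A DataFrame per group gives a DataFrame with one row per original row
frame_result = g.apply(lambda d: d - d.min())

print(scalar_result.shape, series_result.shape, frame_result.shape)
```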

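(II) Computing several statistics at once with apply

OrderedDict can be used here to collect the statistics conveniently (a plain dict also preserves insertion order in Python 3.7+):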

>>> from collections import OrderedDict
>>> def f(df):
...     data = OrderedDict()
...     data['M_sum'] = df['Math'].sum()
...     data['W_var'] = df['Weight'].var()
...     data['H_mean'] = df['Height'].mean()
...     return pd.Series(data)
>>> grouped_single.apply(f)
         M_sum       W_var      H_mean
School
S_1      956.2  117.428571  175.733333
S_2     1191.1  181.081579  172.950000
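
When every statistic is a simple reduction of a single column, keyword-style agg can produce the same table without a custom function. The sketch below compares the two approaches on made-up data (the `toy` table is an assumption for illustration; `f` is rewritten with a plain dict):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [34.0, 87.2, 83.3, 50.6, 52.5],
    'Weight': [63, 82, 84, 61, 61],
    'Height': [173, 186, 174, 161, 157],
})

def f(d):
    # One Series of named statistics per group
    return pd.Series({
        'M_sum': d['Math'].sum(),
        'W_var': d['Weight'].var(),
        'H_mean': d['Height'].mean(),
    })

via_apply = toy.groupby('School')[['Math', 'Weight', 'Height']].apply(f)

# Same statistics with keyword-style aggregation
via_agg = toy.groupby('School').agg(
    M_sum=('Math', 'sum'),
    W_var=('Weight', 'var'),
    H_mean=('Height', 'mean'),
)

same = np.allclose(via_apply.to_numpy(dtype=float), via_agg.to_numpy(dtype=float))
print(same)
```

apply remains the more general tool, for instance when a statistic combines several columns of the group.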