pandas - GroupBy: split-apply-combine

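Key points

  • The core result is a GroupBy object
  • Grouping on axis 0 (the default, e.g. by the values in a column) splits rows; grouping on axis 1 (by column labels) splits columns
  • reset_index() reduces the number of index levels (translator's note: multi-level indexes can be hard to read)
  • agg() applies one or more functions to each group
  • agg() returns a DataFrame when given a list or dict; on a single selected column, a single function returns a Series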

"Group by" refers to a process involving one or more of the following steps:

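  • Splitting
    Split the data into groups based on some criterion
  • Applying
    Apply a function to each group independently
  • Combining
    Combine the results into a new data structure

Of these, splitting is the most straightforward. In many situations we want to split the data into groups and then do something with those groups. In the apply step, we typically want one of the following: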

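  • Aggregation
    Compute a summary statistic (or statistics) for each group, such as the mean, sum, or count

  • Transformation
    Perform a group-specific computation and return a like-indexed object, for example standardizing the data (z-score) within each group or filling NA values

  • Filtration
    Discard whole groups according to some criterion, for example dropping groups with only a few members, or filtering groups based on the group sum or mean

  • Combination
    Some mix of the three above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn't fit into any of the categories above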
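The three apply-step operations can be sketched as follows. This is a minimal illustration on a small made-up DataFrame; the column names key and val and the thresholds are assumptions chosen for the example, not part of the original text.

```python
import pandas as pd

# Hypothetical toy data: three groups of different sizes.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b", "c"],
    "val": [1.0, 3.0, 2.0, 4.0, 6.0, 10.0],
})
g = df.groupby("key")["val"]

# Aggregation: one summary value per group.
means = g.mean()

# Transformation: result has the same index as the input;
# here each value is demeaned within its own group.
demeaned = g.transform(lambda x: x - x.mean())

# Filtration: whole groups are kept or dropped;
# here only groups with at least two rows survive.
kept = df.groupby("key").filter(lambda grp: len(grp) >= 2)

print(means)
print(demeaned)
print(kept)
```

Aggregation shrinks the result to one row per group, transformation keeps the original shape, and filtration returns a subset of the original rows.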

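Operations on pandas data structures are generally rich and expressive; we usually treat each group as a DataFrame and call the relevant functions on it. Readers familiar with SQL-based tools will recognize GroupBy, which resembles statements like the following: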

SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2

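pandas aims to make such operations just as natural and easy to express. The sections below cover each area of GroupBy functionality and provide some non-trivial examples.
More advanced usage can be found in the cookbook.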
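For comparison, the SQL statement above could be written in pandas roughly as follows. This is a sketch under assumptions: the table contents are invented for illustration, and only the column names come from the SQL example.

```python
import pandas as pd

# Hypothetical stand-in for SomeTable.
some_table = pd.DataFrame({
    "Column1": ["x", "x", "y", "y"],
    "Column2": ["p", "p", "p", "q"],
    "Column3": [1.0, 3.0, 5.0, 7.0],
    "Column4": [10, 20, 30, 40],
})

# GROUP BY Column1, Column2 with mean(Column3) and sum(Column4).
# as_index=False keeps the group keys as ordinary columns, like a SQL result set.
result = (
    some_table
    .groupby(["Column1", "Column2"], as_index=False)
    .agg({"Column3": "mean", "Column4": "sum"})
)
print(result)
```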

Splitting an object into groups

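pandas objects can be split on any of their axes. A grouping is defined as a mapping from labels to group names. A GroupBy object can be created as follows: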

# default is axis=0
>>> grouped = obj.groupby(key)
>>> grouped = obj.groupby(key, axis=1)
>>> grouped = obj.groupby([key1, key2])

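The mapping can be specified in several ways:

  • A Python function, to be called on each label of the target axis
  • A list or NumPy array of the same length as the target axis
  • A dict or Series, providing a label -> group name mapping
  • For DataFrame objects, a string naming the column to group by.
    df.groupby('A') is shorthand for df.groupby(df['A']).
  • For DataFrame objects, a string naming an index level to group by (from version 0.20 on, if a string matches both a column name and an index level name, the column takes precedence and a warning is issued)
  • A list of any of the above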
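The different kinds of keys can be compared side by side. The sketch below uses a small made-up DataFrame with string row labels (r1 to r4); the labels, the group names, and the odd/even rule are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]},
    index=["r1", "r2", "r3", "r4"],
)

# A column name: rows are grouped by the values in column A.
by_column = df.groupby("A")["C"].sum()

# A function: called once per index label, returns the group name.
by_func = df.groupby(
    lambda label: "odd" if int(label[1:]) % 2 else "even"
)["C"].sum()

# An array with one entry per row.
by_array = df.groupby(np.array(["g1", "g1", "g2", "g2"]))["C"].sum()

# A dict mapping index label -> group name.
by_dict = df.groupby({"r1": "x", "r2": "x", "r3": "y", "r4": "y"})["C"].sum()

print(by_column, by_func, by_array, by_dict, sep="\n\n")
```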

Collectively, we refer to the grouping objects as keys. For example, consider the following DataFrame:

In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})
   ...: 

In [2]: df
Out[2]: 
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

Calling groupby() on the DataFrame returns a GroupBy object. We can group by column A, by column B, or by both:

In [3]: grouped = df.groupby('A')

In [4]: grouped = df.groupby(['A', 'B'])

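The code above splits the DataFrame along its index, i.e. its rows. (Translator's note: groupby('A') groups by the contents of column 'A'. That column holds two values, foo and bar, so df is split into two groups: one whose 'A' values are all foo and one whose 'A' values are all bar. This is grouping along the index axis.) The code below groups by columns instead: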

In [5]: def get_letter_type(letter):
   ...:     if letter.lower() in 'aeiou':  # columns named a, e, i, o, or u go into the 'vowel' group
   ...:         return 'vowel'
   ...:     else:
   ...:         return 'consonant'
   ...: 

In [6]: grouped = df.groupby(get_letter_type, axis=1)
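Recent pandas releases deprecate the axis argument of groupby, so the call above may emit a warning or fail depending on the installed version. One possible workaround, shown here as a sketch on a small made-up DataFrame, is to transpose first so that the column labels become the index being grouped.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar"],
    "B": ["one", "two"],
    "C": [1.0, 2.0],
    "D": [3.0, 4.0],
})

def get_letter_type(letter):
    # Column labels that are vowels go into 'vowel', the rest into 'consonant'.
    return "vowel" if letter.lower() in "aeiou" else "consonant"

# After df.T the former column labels are row labels, so the default
# axis-0 grouping partitions the original columns.
column_groups = df.T.groupby(get_letter_type).groups
print({name: list(labels) for name, labels in column_groups.items()})
```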

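pandas Index objects support duplicate values. If a non-unique index is used as the group key, all values sharing an index value are placed in the same group, so the output of aggregation functions contains only unique index values: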

In [7]: lst = [1, 2, 3, 1, 2, 3]

In [8]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)

In [9]: grouped = s.groupby(level=0)

# Translator's note: print(s)
1     1
2     2
3     3
1    10
2    20
3    30
dtype: int64


In [10]: grouped.first()
Out[10]: 
1    1
2    2
3    3
dtype: int64

In [11]: grouped.last()
Out[11]: 
1    10
2    20
3    30
dtype: int64

In [12]: grouped.sum()
Out[12]: 
1    11
2    22
3    33
dtype: int64

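Splitting is lazy: nothing is actually split until it is needed. Creating the GroupBy object only verifies that a valid mapping was passed.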

GroupBy sorting

By default the group keys are sorted during the groupby operation. Passing sort=False skips the sort and can save time:

In [13]: df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})

In [14]: df2.groupby(['X']).sum()
Out[14]: 
   Y
X   
A  7
B  3

In [15]: df2.groupby(['X'], sort=False).sum()
Out[15]: 
   Y
X   
B  3
A  7

groupby preserves the order of observations within each group: rows appear in the same order as in the original DataFrame.

In [16]: df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})

In [17]: df3.groupby(['X']).get_group('A')
Out[17]: 
   X  Y
0  A  1
2  A  3

In [18]: df3.groupby(['X']).get_group('B')
Out[18]: 
   X  Y
1  B  4
3  B  2

GroupBy object attributes

The groups attribute is a dict whose keys are the unique group labels and whose values are the axis labels belonging to each group.

In [19]: df.groupby('A').groups
Out[19]: 
{'bar': Int64Index([1, 3, 5], dtype='int64'),
 'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}

In [20]: df.groupby(get_letter_type, axis=1).groups
Out[20]: 
{'consonant': Index(['B', 'C', 'D'], dtype='object'),
 'vowel': Index(['A'], dtype='object')}

Calling Python's built-in len function on the GroupBy object returns the number of groups, i.e. the length of the groups dict:

In [21]: grouped = df.groupby(['A', 'B'])

In [22]: grouped.groups
Out[22]: 
{('bar', 'one'): Int64Index([1], dtype='int64'),
 ('bar', 'three'): Int64Index([3], dtype='int64'),
 ('bar', 'two'): Int64Index([5], dtype='int64'),
 ('foo', 'one'): Int64Index([0, 6], dtype='int64'),
 ('foo', 'three'): Int64Index([7], dtype='int64'),
 ('foo', 'two'): Int64Index([2, 4], dtype='int64')}

In [23]: len(grouped)
Out[23]: 6
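The relationship between groups, len(), and the ngroups attribute can be checked directly. The sketch below rebuilds only the A and B columns of the example DataFrame and derives per-group sizes from the groups dict; the variable name sizes is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
})
grouped = df.groupby(["A", "B"])

# Each key of .groups is a tuple (A value, B value);
# each value holds the row labels that belong to that group.
sizes = {name: len(labels) for name, labels in grouped.groups.items()}

print(len(grouped), grouped.ngroups)
print(sizes)
```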

In an interactive shell, pressing TAB after a GroupBy object auto-completes column names and other attributes:

In [24]: df
Out[24]: 
               height      weight  gender
2000-01-01  42.849980  157.500553    male
2000-01-02  49.607315  177.340407    male
2000-01-03  56.293531  171.524640    male
2000-01-04  48.421077  144.251986  female
2000-01-05  46.556882  152.526206    male
2000-01-06  68.448851  168.272968  female
2000-01-07  70.757698  136.431469    male
2000-01-08  58.909500  176.499753  female
2000-01-09  76.435631  174.094104  female
2000-01-10  45.306120  177.540920    male

In [25]: gb = df.groupby('gender')

In [26]: gb.<TAB>
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

GroupBy with MultiIndex

With hierarchically-indexed data, you can group by any level of the hierarchy. First, create a Series with a two-level MultiIndex:

In [27]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ....:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ....: 

In [28]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

In [29]: s = pd.Series(np.random.randn(8), index=index)

In [30]: s
Out[30]: 
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64

Group by one level of s:

In [31]: grouped = s.groupby(level=0)

In [32]: grouped.sum()
Out[32]: 
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64

If the MultiIndex levels have names, a name can be used in place of the level number:

In [33]: s.groupby(level='second').sum()
Out[33]: 
second
one    0.980950
two    1.991575
dtype: float64

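In the pandas version used here, aggregation functions such as sum also accept the level directly, and the result index is named after the chosen level. Later pandas releases removed this argument; s.groupby(level='second').sum() is the equivalent.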

In [34]: s.sum(level='second')
Out[34]: 
second
one    0.980950
two    1.991575
dtype: float64

Grouping by multiple levels is also supported:

In [35]: s
Out[35]: 
first  second  third
bar    doo     one     -1.131345
               two     -0.089329
baz    bee     one      0.337863
               two     -0.945867
foo    bop     one     -0.932132
               two      1.956030
qux    bop     one      0.017587
               two     -0.016692
dtype: float64

In [36]: s.groupby(level=['first', 'second']).sum()
Out[36]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64

New in version 0.20: index level names can be passed as keys:

In [37]: s.groupby(['first', 'second']).sum()
Out[37]: 
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64

Grouping DataFrame with index Levels and Columns

A DataFrame can be grouped by a combination of columns and index levels. Specify column names as strings and index levels as pd.Grouper objects.

In [38]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ....:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ....: 

In [39]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

In [40]: df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
   ....:                    'B': np.arange(8)},
   ....:                   index=index)
   ....: 

In [41]: df
Out[41]: 
              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

The following example groups by column A and the second index level:

In [42]: df.groupby([pd.Grouper(level=1), 'A']).sum()
Out[42]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

Index levels can also be specified by name:

In [43]: df.groupby([pd.Grouper(level='second'), 'A']).sum()
Out[43]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

From version 0.20 on, index level names can be passed directly as keys:

In [44]: df.groupby(['second', 'A']).sum()
Out[44]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

DataFrame column selection in GroupBy

Once you have a GroupBy object, you may want to process each column differently. Use [] to select a single column, as shown below:

In [45]: grouped = df.groupby(['A'])

In [46]: grouped_C = grouped['C']

In [47]: grouped_D = grouped['D']

This is syntactic sugar (translator's note: added syntax that does not change functionality and exists purely for convenience) for the more verbose form:

In [48]: df['C'].groupby(df['A'])
Out[48]: <pandas.core.groupby.groupby.SeriesGroupBy object at 0x1c2f67b128>
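That the two spellings give the same result can be verified on a small example. The data below is made up for illustration; only the column names A, C, and D follow the text.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Select column C from the grouped DataFrame...
via_getitem = df.groupby("A")["C"].sum()

# ...or group the Series df['C'] by the values of df['A'].
via_series = df["C"].groupby(df["A"]).sum()

print(via_getitem.equals(via_series))
```

The [] form avoids repeating the key column and lets one GroupBy object serve several columns.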

Iterating through groups

With a GroupBy object in hand, iterating over the groups is straightforward and works much like itertools.groupby():

In [49]: grouped = df.groupby('A')

In [50]: for name, group in grouped:
   ....:        print(name)
   ....:        print(group)
   ....: 
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

When grouping by multiple keys, the group name is a tuple:

In [51]: for name, group in df.groupby(['A', 'B']):
   ....:        print(name)
   ....:        print(group)
   ....: 
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652

This is standard Python iteration, so the tuple can be unpacked in the loop statement:

for (k1, k2), group in df.groupby(['A', 'B']):
    print(k1, k2)

Selecting a group

A single group can be selected with get_group():

In [52]: grouped.get_group('bar')
Out[52]: 
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526

For an object grouped on multiple columns, pass a tuple:

In [53]: df.groupby(['A', 'B']).get_group(('bar', 'one'))
Out[53]: 
     A    B         C         D
1  bar  one  0.254161  1.511763

Aggregation

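Once the GroupBy object exists, several methods are available to compute on the grouped data. These operations are similar to the aggregating API, the window functions API, and the resample API.
The most common is aggregate(), or equivalently agg():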

In [54]: grouped = df.groupby('A')

In [55]: grouped.aggregate(np.sum)
Out[55]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [56]: grouped = df.groupby(['A', 'B'])

In [57]: grouped.aggregate(np.sum)
Out[57]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429

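As shown above, the result of an aggregation uses the group names as its new index. With multiple keys, the result has a MultiIndex by default; the as_index option changes this.
(Translator's note: as_index=False replaces the multi-level index with a flat one by writing the group keys out as ordinary columns.)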

In [58]: grouped = df.groupby(['A', 'B'], as_index=False)

In [59]: grouped.aggregate(np.sum)
Out[59]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [60]: df.groupby('A', as_index=False).sum()
Out[60]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590

DataFrame's reset_index() achieves the same result:

In [61]: df.groupby(['A', 'B']).sum().reset_index()
Out[61]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429
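The equivalence of the two approaches can be checked on deterministic data. The values below are made up for the example; only the column names follow the text.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "B": ["one", "one", "two", "two", "one"],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Group keys returned as ordinary columns straight away.
flat_a = df.groupby(["A", "B"], as_index=False).sum()

# Group keys first become a MultiIndex, then are moved back into columns.
flat_b = df.groupby(["A", "B"]).sum().reset_index()

print(flat_a)
print(flat_a.equals(flat_b))
```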

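Another simple aggregation example is computing the size of each group, available as the GroupBy size method. It returns a Series whose index holds the group names and whose values are the group sizes: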

In [62]: grouped.size()
Out[62]: 
A    B    
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64

In [63]: grouped.describe()
Out[63]: 
      C                                                                ...            D                                                            
  count      mean       std       min       25%       50%       75%    ...         mean       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  0.254161    ...     1.511763       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  0.215897    ...    -0.990582       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118 -0.077118    ...     1.211526       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888 -0.450209    ...     0.807291  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495 -0.862495    ...     0.024580       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  0.609240    ...     0.592714  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]

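Note: when aggregating by named columns with the default as_index=True, the grouping columns are not returned as columns; they become the index of the returned object (translator's note: a MultiIndex when there are several keys). With as_index=False the grouping columns are returned as columns (translator's note: this result is easier to read).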

Aggregating functions reduce the dimension of the returned object. Some common ones are listed below:

Function      Description
mean()        Compute mean of groups
sum()         Compute sum of group values
size()        Compute group sizes
count()       Compute count of group
std()         Standard deviation of groups
var()         Compute variance of groups
sem()         Standard error of the mean of groups
describe()    Generates descriptive statistics
first()       Compute first of group values
last()        Compute last of group values
nth()         Take nth value, or a subset if n is a list
min()         Compute min of group values
max()         Compute max of group values

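The aggregating functions above exclude NA values. Any function that reduces a Series to a scalar value also works as an aggregation function, for example df.groupby('A').agg(lambda ser: 1).
Note that nth() can act as a reducer or a filter.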
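The NA handling and the use of a custom reducer can be illustrated with a small sketch. The data, including the position of the missing value, and the "spread" reducer are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "C": [1.0, np.nan, 3.0, 5.0],
})
g = df.groupby("A")["C"]

# count() skips the NaN in group x, while size() counts every row.
count = g.count()
size = g.size()

# sum() also ignores the NaN.
total = g.sum()

# Any Series -> scalar function can serve as a reducer;
# this one returns the range (max minus min) of each group.
spread = g.agg(lambda ser: ser.max() - ser.min())

print(count, size, total, spread, sep="\n\n")
```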

Applying multiple functions at once

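On a grouped Series, you can pass a list or dict of functions to aggregate with, which outputs a DataFrame (translator's note: otherwise a Series is returned):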

In [64]: grouped = df.groupby('A')

In [65]: grouped['C'].agg([np.sum, np.mean, np.std])
Out[65]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

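On a grouped DataFrame, passing a list of functions applies every function to each column (translator's note) and yields a result with a hierarchical column index: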

In [66]: grouped.agg([np.sum, np.mean, np.std])
Out[66]: 
            C                             D                    
          sum      mean       std       sum      mean       std
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

The resulting columns are named after the functions themselves. To rename them, chain a rename() call with a dict:

In [67]: (grouped['C'].agg([np.sum, np.mean, np.std])
   ....:              .rename(columns={'sum': 'foo',
   ....:                               'mean': 'bar',
   ....:                               'std': 'baz'})
   ....: )
   ....: 
Out[67]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

A grouped DataFrame can be renamed in the same way:

In [68]: (grouped.agg([np.sum, np.mean, np.std])
   ....:         .rename(columns={'sum': 'foo',
   ....:                          'mean': 'bar',
   ....:                          'std': 'baz'})
   ....:  )
   ....: 
Out[68]: 
            C                             D                    
          foo       bar       baz       foo       bar       baz
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785
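As an alternative to renaming after the fact, newer pandas versions (0.25 and later, which is an assumption about the installed version and newer than the release this text was written for) support named aggregation, where each output column is declared as name=(column, function). A minimal sketch with made-up data and hypothetical output names:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Each keyword names an output column; the tuple gives
# the input column and the aggregation to apply to it.
summary = df.groupby("A").agg(
    c_total=("C", "sum"),
    d_mean=("D", "mean"),
)
print(summary)
```

This produces flat column names directly, so no rename() step or hierarchical column index is involved.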

Applying different functions to DataFrame columns

By passing a dict to aggregate, you can apply a different aggregation to each column:

In [69]: grouped.agg({'C' : np.sum,
   ....:              'D' : lambda x: np.std(x, ddof=1)})
   ....: 
Out[69]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785

Function names can also be given as strings. For a string to be valid, it must either be implemented on GroupBy or be available via dispatching:

In [70]: grouped.agg({'C' : 'sum', 'D' : 'std'})
Out[70]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785
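The string form and the callable form should agree numerically, which can be checked with a small sketch. The data below is made up, and the lambdas are illustrative stand-ins for the callables in the text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})
grouped = df.groupby("A")

# Aggregations named by string.
by_name = grouped.agg({"C": "sum", "D": "std"})

# The same aggregations written as explicit callables
# (ddof=1 is the sample standard deviation, matching 'std').
by_callable = grouped.agg({
    "C": lambda x: x.sum(),
    "D": lambda x: x.std(ddof=1),
})

same = np.allclose(by_name.values, by_callable.values)
print(same)
```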

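Note: when a plain dict is passed to agg, the order of the output columns is not guaranteed on Python versions where dict does not preserve insertion order. To guarantee a specific order, pass an OrderedDict (from the collections module), as shown below: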

In [71]: grouped.agg({'D': 'std', 'C': 'mean'})
Out[71]: 
            D         C
A                      
bar  1.366330  0.130980
foo  0.884785 -0.359284

In [72]: grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))
Out[72]: 
            D         C
A                      
bar  1.366330  0.130980
foo  0.884785 -0.359284

Cython-optimized aggregation functions

Some common aggregations, currently sum, mean, std, and sem, have optimized Cython implementations for speed:

In [73]: df.groupby('A').sum()
Out[73]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [74]: df.groupby(['A', 'B']).mean()
Out[74]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.491888  0.807291
    three -0.862495  0.024580
    two    0.024925  0.592714

Transformation

To be continued...

Reposted from blog.csdn.net/z0n1l2/article/details/80828613