In-depth understanding of the groupby function of Pandas

Preface

I have been learning Pandas recently. When processing data, it is often necessary to group the data by certain fields and analyze each group, which is exactly what the groupby function is for. This article records it in detail.

Pandas version 1.4.3

The groupby function in Pandas first splits the DataFrame or Series by the fields of interest, putting rows with the same value into the same group; it then applies the desired operation to each group, and finally combines the per-group results into the returned output.

1. Basic usage

Initialize some data first for demonstration

import pandas as pd

df = pd.DataFrame({
    'name': ['香蕉', '菠菜', '糯米', '糙米', '丝瓜', '冬瓜', '柑橘', '苹果', '橄榄油'],
    'category': ['水果', '蔬菜', '米面', '米面', '蔬菜', '蔬菜', '水果', '水果', '粮油'],
    'price': [3.5, 6, 2.8, 9, 3, 2.5, 3.2, 8, 18],
    'count': [2, 1, 3, 6, 4, 8, 5, 3, 2]
})

Group by category:

grouped = df.groupby('category')
print(type(grouped))
print(grouped)

The output is as follows

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x127112df0>

The type of grouped is DataFrameGroupBy. Printing it directly only shows its memory address, which is not very informative, so here is a helper function to display the groups (why it can be written this way is explained later):

def view_group(the_pd_group):
    for name, group in the_pd_group:
        print(f'group name: {name}')
        print('-' * 30)
        print(group)
        print('=' * 30, '\n')

view_group(grouped)

The output is as follows

group name: 水果
------------------------------
    name  category  price  count
0   香蕉       水果    3.5      2
6   柑橘       水果    3.2      5
7   苹果       水果    8.0      3
============================== 
group name: 米面
------------------------------
    name  category  price  count
2   糯米       米面    2.8      3
3   糙米       米面    9.0      6
============================== 
group name: 粮油
------------------------------
   name    category  price  count
8  橄榄油       粮油   18.0      2
============================== 
group name: 蔬菜
------------------------------
    name  category  price  count
1   菠菜       蔬菜    6.0      1
4   丝瓜       蔬菜    3.0      4
5   冬瓜       蔬菜    2.5      8
============================== 

2. Parameters and source code analysis

Next, look at how the method is defined in the source code.
The groupby of DataFrame:

def groupby(
        self,
        by=None,
        axis: Axis = 0,
        level: Level | None = None,
        as_index: bool = True,
        sort: bool = True,
        group_keys: bool = True,
        squeeze: bool | lib.NoDefault = no_default,
        observed: bool = False,
        dropna: bool = True,
    ) -> DataFrameGroupBy:
    pass

The groupby of Series:

def groupby(
        self,
        by=None,
        axis=0,
        level=None,
        as_index: bool = True,
        sort: bool = True,
        group_keys: bool = True,
        squeeze: bool | lib.NoDefault = no_default,
        observed: bool = False,
        dropna: bool = True,
    ) -> SeriesGroupBy:
    pass

The groupby function of Series works much like that of DataFrame, so this article only uses DataFrame in its examples.
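As a minimal sketch of my own (not an example from the original article), a Series can be grouped by another Series of the same length, for instance grouping the price column by the category column of the df defined earlier:

# group a Series (price) by another Series (category) and aggregate
print(df['price'].groupby(df['category']).sum())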

Input parameters

by

Recall how it was written in the basic usage section

grouped = df.groupby('category')

The 'category' passed in here is the first parameter, by, which indicates what to group by. According to the official documentation, by can be a mapping, function, label, or list of labels. A label is used here, which means it could also be written in the following ways.

1. Label list

grouped = df.groupby(['category'])

2. Mapping

This approach maps on the DataFrame's index. Here 水果 (fruit) and 蔬菜 (vegetables) are merged into the larger group 蔬菜水果, while 米面 (rice and flour) and 粮油 (grain and oil) are merged into the larger group 米面粮油.
category_dict = {'水果': '蔬菜水果', '蔬菜': '蔬菜水果', '米面': '米面粮油', '粮油': '米面粮油'}
the_map = {}
for i in range(len(df.index)):
    the_map[i] = category_dict[df.iloc[i]['category']]
grouped = df.groupby(the_map)
view_group(grouped)

The output is as follows

group name: 米面粮油
------------------------------
    name  category  price  count
2   糯米       米面    2.8      3
3   糙米       米面    9.0      6
8  橄榄油      粮油   18.0      2
============================== 

group name: 蔬菜水果
------------------------------
    name  category  price  count
0   香蕉       水果    3.5      2
1   菠菜       蔬菜    6.0      1
4   丝瓜       蔬菜    3.0      4
5   冬瓜       蔬菜    2.5      8
6   柑橘       水果    3.2      5
7   苹果       水果    8.0      3
============================== 
3. Function

In this mode, the custom function also receives the DataFrame's index as its argument; the output is the same as in the mapping example.

category_dict = {'水果': '蔬菜水果', '蔬菜': '蔬菜水果', '米面': '米面粮油', '粮油': '米面粮油'}

def to_big_category(the_idx):
    return category_dict[df.iloc[the_idx]['category']]
grouped = df.groupby(to_big_category)
view_group(grouped)

axis

axis indicates which axis to split along when grouping
0 - equivalent to 'index', meaning split by rows (the default)
1 - equivalent to 'columns', meaning split by columns

Here's an example of splitting by column

def group_columns(column_name: str):
    if column_name in ['name', 'category']:
        return 'Group 1'
    else:
        return 'Group 2'
# equivalent: grouped = df.head(3).groupby(group_columns, axis='columns')
grouped = df.head(3).groupby(group_columns, axis=1)
view_group(grouped)

The output is as follows

group name: Group 1
------------------------------
    name  category
0   香蕉       水果
1   菠菜       蔬菜
2   糯米       米面
============================== 

group name: Group 2
------------------------------
   price  count
0    3.5      2
1    6.0      1
2    2.8      3
==============================

This is equivalent to cutting the table vertically: the left half is Group 1 and the right half is Group 2.

level

When the axis has a MultiIndex (hierarchical index), level groups by a particular level or levels. Note that level here is an int starting from 0: 0 means the first level, and so on.

Construct another set of test data with MultiIndex

the_arrays = [['A', 'A', 'A', 'B', 'A', 'A', 'A', 'B', 'A', 'A'],
              ['蔬菜水果', '蔬菜水果', '米面粮油', '休闲食品', '米面粮油', '蔬菜水果', '蔬菜水果', '休闲食品', '蔬菜水果', '米面粮油'],
              ['水果', '蔬菜', '米面', '糖果', '米面', '蔬菜', '蔬菜', '饼干', '水果', '粮油']]
the_index = pd.MultiIndex.from_arrays(arrays=the_arrays, names=['one ', 'two', 'three'])
df_2 = pd.DataFrame(data=[3.5, 6, 2.8, 4, 9, 3, 2.5, 3.2, 8, 18], index=the_index, columns=['price'])
print(df_2)

The output is as follows

                     price
one  two  three       
A    蔬菜水果 水果       3.5
             蔬菜       6.0
     米面粮油 米面       2.8
B    休闲食品 糖果       4.0
A    米面粮油 米面       9.0
     蔬菜水果 蔬菜       3.0
             蔬菜       2.5
B    休闲食品 饼干       3.2
A    蔬菜水果 水果       8.0
     米面粮油 粮油      18.0

1. Group by the third level

grouped = df_2.groupby(level=2)
view_group(grouped)

The output is as follows

group name: 水果
------------------------------
                      price
one  two    three       
A    蔬菜水果 水果       3.5
             水果       8.0
============================== 

group name: 米面
------------------------------
                     price
one  two    three       
A    米面粮油 米面       2.8
             米面       9.0
============================== 

group name: 粮油
------------------------------
                      price
one  two    three       
A    米面粮油 粮油      18.0
============================== 

group name: 糖果
------------------------------
                      price
one  two    three       
B    休闲食品 糖果       4.0
============================== 

group name: 蔬菜
------------------------------
                     price
one  two    three       
A    蔬菜水果 蔬菜       6.0
             蔬菜       3.0
             蔬菜       2.5
============================== 

group name: 饼干
------------------------------
                      price
one  two    three       
B    休闲食品 饼干       3.2
==============================

There are 6 groups in total.

2. Group by the first and second levels

grouped = df_2.groupby(level=[0, 1])
view_group(grouped)

The output is as follows

group name: ('A', '米面粮油')
------------------------------
                      price
one  two    three       
A    米面粮油 米面       2.8
             米面       9.0
             粮油      18.0
============================== 

group name: ('A', '蔬菜水果')
------------------------------
                      price
one  two    three       
A    蔬菜水果 水果       3.5
             蔬菜       6.0
             蔬菜       3.0
             蔬菜       2.5
             水果       8.0
============================== 

group name: ('B', '休闲食品')
------------------------------
                      price
one  two    three       
B    休闲食品 糖果       4.0
             饼干       3.2
============================== 

There are 3 groups in total; as you can see, the group name becomes a tuple.

as_index

bool type, the default value is True. For aggregated output, the result is returned with the group labels as its index.

grouped = df.groupby('category', as_index=True)
print(grouped.sum())

The output with as_index=True is as follows

            price  count
category              
水果         14.7     10
米面         11.8      9
粮油         18.0      2
蔬菜         11.5     13

grouped = df.groupby('category', as_index=False)
print(grouped.sum())

The output with as_index=False is as follows, similar to the output style of SQL's GROUP BY

    category  price  count
0       水果   14.7     10
1       米面   11.8      9
2       粮油   18.0      2
3       蔬菜   11.5     13

sort

bool type, the default is True. Whether to sort the group keys; turning off sorting can improve performance. Note: sorting the group keys does not affect the order of rows within each group.
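A small sketch of my own to show the effect: iterating the GroupBy object yields the groups in sorted key order by default, and in order of first appearance with sort=False.

# group keys in sorted order (default) vs. order of first appearance in df
print([name for name, _ in df.groupby('category', sort=True)])
print([name for name, _ in df.groupby('category', sort=False)])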

group_keys

bool type, the default is True.
If True, the group keys are added to the result's index when calling apply.
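A hedged sketch of my own (behaviour as observed on pandas 1.4.x): when apply returns a piece of each group, group_keys=True should add the category labels as an extra outer index level, while group_keys=False keeps only the original integer index.

# take the first row of each group and compare the resulting indexes
print(df.groupby('category', group_keys=True).apply(lambda g: g.head(1)))
print(df.groupby('category', group_keys=False).apply(lambda g: g.head(1)))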

squeeze

Deprecated since version 1.1.0, so it is not covered here.

observed

bool type, the default value is False.
It only applies when any of the groupers are Categoricals.
If True, only show the observed values of the categorical groupers; if False, show all values of the categorical groupers, even categories that do not occur in the data.
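A sketch of my own to illustrate, assuming the category column is converted to a categorical dtype that declares an extra, unused category 休闲食品:

cat_type = pd.CategoricalDtype(categories=['水果', '蔬菜', '米面', '粮油', '休闲食品'])
df_cat = df.assign(category=df['category'].astype(cat_type))
# observed=True: only the four categories that actually occur in the data
print(df_cat.groupby('category', observed=True)['price'].sum())
# observed=False (default): 休闲食品 also appears, with an empty aggregate
print(df_cat.groupby('category', observed=False)['price'].sum())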

dropna

bool type, the default value is True; new in version 1.1.0.
If True and the group keys contain NA values, the NA values are dropped together with their row (axis=0) or column (axis=1).
If False, NA values are also treated as group keys and are kept.
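A short sketch of my own with an NA in the grouping key:

df_na = pd.DataFrame({'category': ['水果', None, '水果', None], 'count': [1, 2, 3, 4]})
print(df_na.groupby('category', dropna=True).sum())   # rows with a missing key are dropped
print(df_na.groupby('category', dropna=False).sum())  # NaN becomes a group of its own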

Return value

The groupby function of DataFrame returns a DataFrameGroupBy, and the groupby function of Series returns a SeriesGroupBy.
Checking the source code shows that both inherit from BaseGroupBy; the inheritance relationship is sketched below.

SelectionMixin
    + aggregation()
    + agg()

BaseGroupBy(SelectionMixin)
    + int axis
    + ops.BaseGrouper grouper
    + bool group_keys
    + groups()

GroupBy(BaseGroupBy)
    + ops.BaseGrouper grouper
    + bool as_index
    + apply()

DataFrameGroupBy(GroupBy)
    + apply()
    + transform()
    + filter()

SeriesGroupBy(GroupBy)
    + apply()
    + transform()
    + filter()

The BaseGroupBy class has a grouper attribute of type ops.BaseGrouper, but BaseGroupBy has no __init__ method, so we move on to the GroupBy class. GroupBy redeclares the parent's grouper attribute and, in its __init__ method, calls get_grouper from grouper.py. Below is pseudocode extracted from the source.

The groupby.py file:

class GroupBy(BaseGroupBy[NDFrameT]):
    grouper: ops.BaseGrouper

    def __init__(self, ...):
        # ...
        if grouper is None:
            from pandas.core.groupby.grouper import get_grouper
            grouper, exclusions, obj = get_grouper(...)

The grouper.py file:

def get_grouper(...) -> tuple[ops.BaseGrouper, frozenset[Hashable], NDFrameT]:
    # ...
    # create the internals grouper
    grouper = ops.BaseGrouper(
        group_axis, groupings, sort=sort, mutated=mutated, dropna=dropna
    )
    return grouper, frozenset(exclusions), obj


class Grouping:
    """
    obj : DataFrame or Series
    """
    def __init__(
        self,
        index: Index,
        grouper=None,
        obj: NDFrame | None = None,
        level=None,
        sort: bool = True,
        observed: bool = False,
        in_axis: bool = False,
        dropna: bool = True,
    ):
        pass

The ops.py file:

class BaseGrouper:
    """
    This is an internal Grouper class, which actually holds
    the generated groups

    ......
    """
    def __init__(self, axis: Index, groupings: Sequence[grouper.Grouping], ...):
        # ...
        self._groupings: list[grouper.Grouping] = list(groupings)

    @property
    def groupings(self) -> list[grouper.Grouping]:
        return self._groupings

BaseGrouper contains the grouping information that is finally generated: a list whose elements are of type grouper.Grouping. Each grouping corresponds to one Grouping, and the obj inside a Grouping is the grouped DataFrame or Series.

In the first section a function was written to display the object returned by groupby; now let's look into why that works. An iterable object implements the __iter__() method, so first locate the corresponding method in BaseGroupBy.

class BaseGroupBy:
    grouper: ops.BaseGrouper

    @final
    def __iter__(self) -> Iterator[tuple[Hashable, NDFrameT]]:
        return self.grouper.get_iterator(self._selected_obj, axis=self.axis)

Next, step into the BaseGrouper class

class BaseGrouper:
    def get_iterator(
        self, data: NDFrameT, axis: int = 0
    ) -> Iterator[tuple[Hashable, NDFrameT]]:
        splitter = self._get_splitter(data, axis=axis)
        keys = self.group_keys_seq
        for key, group in zip(keys, splitter):
            yield key, group.__finalize__(data, method="groupby")

Stepping into group.__finalize__() in debug mode confirms that what is yielded is indeed a DataFrame object.
(Screenshot: debug details of the __iter__() method of BaseGroupBy.)

3. The four main functions

With the groundwork above, the functions applied after groupby become much easier to understand.

agg

Aggregation is the most common operation after groupby and is widely used in data analysis.
For example, to get the maximum values within each category group, any of the three forms below works, and grouped.aggregate and grouped.agg are completely equivalent, because the SelectionMixin class contains the definition agg = aggregate.
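The original screenshot is missing; a sketch of what the three equivalent forms presumably looked like:

grouped = df.groupby('category')
print(grouped.max())             # reduction method on the GroupBy object
print(grouped.agg('max'))        # agg with the function name
print(grouped.aggregate('max'))  # aggregate, of which agg is an alias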
However, when aggregating multiple fields, only aggregate or agg can be used, for example to get the maximum price and the minimum count within each category group.
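Again the screenshot is missing; a likely reconstruction using a dict that maps each column to its aggregation function:

print(grouped.agg({'price': 'max', 'count': 'min'}))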
The aggregation functions from numpy can also be used

import numpy as np

grouped.agg({'price': np.max, 'count': np.min})

The common aggregation functions are listed below

Aggregation function    Purpose
max                     maximum
mean                    average
median                  median
min                     minimum
sum                     sum
std                     standard deviation
var                     variance
count                   count

Among them, the numpy counterpart of count is np.size.
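As a further sketch of my own, several of these can be combined in one agg call, including a list of functions per column:

print(grouped.agg({'price': ['max', 'mean', 'std'], 'count': 'sum'}))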

transform

Now a new column, price_mean, is needed to show the average price of each group.

The transform function does exactly this: it returns an object that has the same index as the original df and is filled with the transformed (here, per-group mean) values, so the result can be attached to the original DataFrame as a new column.
The sample code is as follows:

grouped = df.groupby('category', sort=False)
df['price_mean'] = grouped['price'].transform('mean')
print(df)

The output is as follows: every row of df gains a price_mean column holding the average price of its category group (4.9 for 水果, about 3.83 for 蔬菜, 5.9 for 米面, and 18.0 for 粮油).

apply

Now we want the row with the highest price within each group. This can be done with apply, which accepts any custom function and therefore supports arbitrarily complex operations on the data.

from pandas import DataFrame
grouped = df.groupby('category', as_index=False, sort=False)

def get_max_one(the_df: DataFrame):
    sort_df = the_df.sort_values(by='price', ascending=True)
    return sort_df.iloc[-1, :]
max_price_df = grouped.apply(get_max_one)
print(max_price_df)

The output contains one row per group, the one with the highest price in that group: 苹果 (水果, 8.0), 菠菜 (蔬菜, 6.0), 糙米 (米面, 9.0) and 橄榄油 (粮油, 18.0).

filter

The filter function further filters the grouped data: it evaluates the filtering function on each group, drops the groups for which the condition is not satisfied, and returns a new DataFrame made of the remaining rows.

Suppose we now want to drop the groups whose average price is below 4; based on the figures in the transform section, the 蔬菜 group will be filtered out.

grouped = df.groupby('category', as_index=False, sort=False)
filtered = grouped.filter(lambda sub_df: sub_df['price'].mean() > 4)
print(filtered)

The output is the original DataFrame minus the 蔬菜 rows (菠菜, 丝瓜 and 冬瓜); all 水果, 米面 and 粮油 rows remain.

4. Summary

The groupby process splits the original DataFrame/Series into grouped sub-DataFrames/Series according to the grouping keys; there are as many of them as there are groups. All subsequent operations (agg, apply, and so on) therefore work on these sub-DataFrames/Series. Once you understand this, you understand the core principle of the groupby operation in Pandas.

5. Reference documents

pandas.DataFrame.groupby in the official Pandas documentation
pandas.Series.groupby in the official Pandas documentation
