Pandas Grouping and Aggregation

# Import the required libraries
import numpy as np 
import pandas as pd

Create the data

index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name") 
data = { 
"age": [18, 30, 35, 18, np.nan, 30], 
"city": ["Bei Jing ", "Shang Hai ", "Guang Zhou", "Shen Zhen", np.nan, " "], 
"sex": ["male", "male", "female", "male", np.nan, "female"],
"income": [3000, 8000, 8000, 4000, 6000, 7000] 
} 
user_info = pd.DataFrame(data=data, index=index) 
user_info
"""
        age        city     sex  income
name
Tom    18.0    Bei Jing    male    3000
Bob    30.0   Shang Hai    male    8000
Mary   35.0  Guang Zhou  female    8000
James  18.0   Shen Zhen    male    4000
Andy    NaN         NaN     NaN    6000
Alice  30.0              female    7000
"""

Splitting an object into groups

Before computing any group statistics, the first step is to split the data into groups.

# Group by sex
grouped = user_info.groupby(user_info["sex"]) 
grouped.groups 

# A more concise way:
# Group by sex
user_info.groupby("sex").groups
# Group by sex first, then by age within each sex
user_info.groupby(["sex", "age"]).groups
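One detail worth checking when looking at the groups: rows whose grouping key is missing (like Andy above) do not end up in any group by default. The sketch below uses a small stand-in frame (the name df and its values are made up for illustration) and assumes pandas 1.1 or later, where groupby accepts dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing group key
df = pd.DataFrame(
    {"sex": ["male", "female", np.nan, "male"],
     "income": [3000, 8000, 6000, 4000]},
    index=pd.Index(["Tom", "Mary", "Andy", "James"], name="name"),
)

# Default behaviour: the row with a NaN key is left out of every group
default_sizes = df.groupby("sex").size()

# dropna=False keeps missing keys as a group of their own
all_sizes = df.groupby("sex", dropna=False).size()

print(default_sizes)
print(all_sizes)
```

If the totals after grouping look smaller than the number of rows, missing keys are a likely cause.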

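Turning off sorting

By default, groupby sorts the group keys as part of the operation. If performance matters more than the order of the keys, pass sort=False.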

grouped = user_info.groupby(["sex"], sort=False) 
grouped.groups
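To see what sort=False changes, it helps to compare the key order of the two results. This is a minimal sketch on made-up data (df is a hypothetical stand-in, not the user_info frame above).

```python
import pandas as pd

# Hypothetical stand-in data: "male" appears before "female"
df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "income": [3000, 8000, 4000],
})

# Default: group keys come back in sorted order
sorted_keys = list(df.groupby("sex")["income"].sum().index)

# sort=False: group keys come back in order of first appearance
unsorted_keys = list(df.groupby("sex", sort=False)["income"].sum().index)

print(sorted_keys)
print(unsorted_keys)
```

The aggregated values are the same either way; only the order of the groups differs.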

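Selecting columns

After grouping with groupby, you can use the indexing operator [] to select a single column.

grouped = user_info.groupby("sex")
# Select the "city" column of each group
grouped["city"]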
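A short sketch of what column selection gives you in practice, again on made-up data (df and its values are assumptions for illustration): selecting one column with a string yields a grouped Series, while selecting with a list of names yields a grouped DataFrame.

```python
import pandas as pd

# Hypothetical stand-in data
df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": [18, 35, 30],
    "income": [3000, 8000, 4000],
})

grouped = df.groupby("sex")

# A single column name: aggregating returns a Series
income_sum = grouped["income"].sum()

# A list of column names: aggregating returns a DataFrame
means = grouped[["age", "income"]].mean()

print(income_sum)
print(means)
```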

Iterating over groups

Once the data is grouped, you can iterate over the groups.

Iterating when grouping by a single field

grouped = user_info.groupby("sex")
for name, group in grouped:
    print("name: {}".format(name))
    print("group: {}".format(group))
    print("--------------")

"""
name: female
group:         age        city     sex  income
name
Mary   35.0  Guang Zhou  female    8000
Alice  30.0              female    7000
--------------
name: male
group:         age       city   sex  income
name
Tom    18.0   Bei Jing   male    3000
Bob    30.0  Shang Hai   male    8000
James  18.0  Shen Zhen   male    4000
--------------
"""

Iterating when grouping by multiple fields

grouped = user_info.groupby(["sex", "age"])
for name, group in grouped:
    print("name: {}".format(name))
    print("group: {}".format(group))
    print("--------------")

"""
name: ('female', 30.0)
group:         age city     sex  income
name
Alice  30.0       female    7000
--------------
name: ('female', 35.0)
group:        age        city     sex  income
name
Mary  35.0  Guang Zhou  female    8000
--------------
name: ('male', 18.0)
group:         age       city   sex  income
name
Tom    18.0   Bei Jing  male    3000
James  18.0  Shen Zhen  male    4000
--------------
name: ('male', 30.0)
group:        age       city   sex  income
name
Bob   30.0  Shang Hai  male    8000
--------------
"""

Selecting a single group

.get_group()

After grouping, the get_group method selects one particular group.

grouped = user_info.groupby("sex") 
grouped.get_group("male") 
user_info.groupby(["sex", "age"]).get_group(("male", 18))
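As a quick check of what get_group returns, the sketch below runs it on a small made-up frame (df is a hypothetical stand-in). With a single grouping key you pass the key itself; with several keys you pass a tuple with one value per key.

```python
import pandas as pd

# Hypothetical stand-in data
df = pd.DataFrame(
    {"sex": ["male", "male", "female", "male"],
     "age": [18, 30, 35, 18],
     "income": [3000, 8000, 8000, 4000]},
    index=pd.Index(["Tom", "Bob", "Mary", "James"], name="name"),
)

# One key: every row whose sex is "male"
male = df.groupby("sex").get_group("male")

# Two keys: pass a tuple in the same order as the grouping columns
male_18 = df.groupby(["sex", "age"]).get_group(("male", 18))

print(male)
print(male_18)
```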

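Aggregation

The purpose of grouping is to compute statistics, and statistics require aggregation, so after splitting the data into groups the next step is to aggregate. Common aggregations include count, sum, maximum, minimum, and mean. The result is an object indexed by the group keys.

One way to aggregate is to call the agg method.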

.agg()

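# Count the number of people of each sex (len counts rows, including rows with a missing age)
grouped = user_info.groupby("sex")
grouped["age"].agg(len)
# Get the maximum age for each sex; the string "max" selects pandas' built-in group maximum
grouped["age"].agg("max")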
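When counting people per group, missing values matter. The sketch below, on made-up data (df is a hypothetical stand-in), contrasts size(), which counts rows, with count(), which counts only non-missing values in the selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: one male has no recorded age
df = pd.DataFrame({
    "sex": ["male", "male", "female"],
    "age": [18, np.nan, 35],
})

ages = df.groupby("sex")["age"]

# Number of rows in each group, missing ages included
rows_per_group = ages.size()

# Number of non-missing ages in each group
known_ages = ages.count()

print(rows_per_group)
print(known_ages)
```

Which one is right depends on the question: size() answers "how many people", count() answers "how many people have an age recorded".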

When aggregating by multiple keys, the result has a hierarchical (multi-level) index by default.

grouped = user_info.groupby(["sex", "age"])
rs = grouped.agg(len)
rs
"""
             city  income
sex    age
female 30.0     1       1
       35.0     1       1
male   18.0     2       2
       30.0     1       1
"""

Avoiding a multi-level index

.reset_index()

Call reset_index on the object that has the multi-level index.

rs.reset_index()

The as_index=False parameter

Alternatively, pass as_index=False when grouping.

grouped = user_info.groupby(["sex", "age"], as_index=False)
grouped["income"].agg("max")
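The two approaches should lead to the same flat table. Here is a minimal sketch on made-up data (df is a hypothetical stand-in) that builds the result both ways so they can be compared side by side.

```python
import pandas as pd

# Hypothetical stand-in data
df = pd.DataFrame({
    "sex": ["male", "male", "female", "male"],
    "age": [18, 30, 35, 18],
    "income": [3000, 8000, 8000, 4000],
})

# Way 1: aggregate, then move the group keys out of the index
via_reset = df.groupby(["sex", "age"])["income"].max().reset_index()

# Way 2: keep the group keys as ordinary columns from the start
via_as_index = df.groupby(["sex", "age"], as_index=False)["income"].max()

print(via_reset)
print(via_as_index)
```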

Inspecting the data with .describe()

Both Series and DataFrame provide a describe method, and it can be used on grouped data in the same way to get a summary of each group.

grouped = user_info.groupby("sex") 
grouped.describe().reset_index()

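Applying several aggregations at once

Getting multiple statistics

Sometimes a single statistic per group is not enough and you want several.

# Compute the total and the mean income for each sex
grouped = user_info.groupby("sex")
grouped["income"].agg(["sum", "mean"]).reset_index()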

Renaming the results

To rename the resulting columns, pass a dict to rename.

grouped = user_info.groupby("sex")
grouped["income"].agg(["sum", "mean"]).rename(columns={"sum": "income_sum", "mean": "income_mean"})

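Applying different aggregations to different DataFrame columns

Sometimes different columns need different aggregations.

# Compute the mean age and the total income for each sex
grouped.agg({"age": "mean", "income": "sum"}).rename(columns={"age": "age_mean", "income": "income_sum"}).reset_index()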
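Aggregating and then renaming can also be done in one step with named aggregation, which is available in pandas 0.25 and later. This is a sketch on made-up data (df and the output names age_mean and income_sum are chosen here for illustration): each keyword is an output column name, and its value is a (column, aggregation) pair.

```python
import pandas as pd

# Hypothetical stand-in data
df = pd.DataFrame({
    "sex": ["male", "male", "female"],
    "age": [18, 30, 35],
    "income": [3000, 8000, 8000],
})

# keyword = (source column, aggregation name)
summary = df.groupby("sex").agg(
    age_mean=("age", "mean"),
    income_sum=("income", "sum"),
).reset_index()

print(summary)
```

This avoids the situation where one column needs two aggregations and the dict-plus-rename approach cannot tell the results apart.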

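The transform operation

The aggregations above return a result indexed by the group keys. Even with as_index=False, the resulting index is still not the index of the original data. To get results aligned to the original index, you would have to merge them back.

The transform method simplifies this: it applies func to every group and places the results on the original index, broadcasting when the result is a scalar.

# The result of agg is indexed by the group keys
grouped = user_info.groupby("sex")
grouped["income"].agg("mean")

# The result of transform is indexed by the original index; the values are aligned to it automatically
grouped = user_info.groupby("sex")
grouped["income"].transform("mean")

As you can see, the result of transform has the same length as the original data.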
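To make the comparison concrete, the sketch below computes the per-group mean both ways on made-up data (df is a hypothetical stand-in). The manual route here uses DataFrame.join on the grouping column as one possible way to merge the aggregate back; transform gets the same values in a single call.

```python
import pandas as pd

# Hypothetical stand-in data
df = pd.DataFrame(
    {"sex": ["male", "male", "female"],
     "income": [3000, 8000, 8000]},
    index=pd.Index(["Tom", "Bob", "Mary"], name="name"),
)

# Manual route: aggregate, then join the result back on the group key
group_means = df.groupby("sex")["income"].mean().rename("income_mean")
merged = df.join(group_means, on="sex")

# transform route: the group mean is broadcast onto the original index
transformed = df.groupby("sex")["income"].transform("mean")

# A typical use: each person's income relative to their group's mean
deviation = df["income"] - transformed

print(merged)
print(transformed)
print(deviation)
```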

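The apply operation

apply splits the object into pieces, calls the given function on each piece, and then tries to combine the results with pd.concat(). The function may return a pandas object or a scalar, and a returned array-like object can be of any size.

# Use apply to perform the aggregation above
grouped = user_info.groupby("sex")
grouped["income"].apply(np.mean)

# Get the n largest incomes for each sex
def f1(ser, num=2):
    return ser.nlargest(num).tolist()
grouped["income"].apply(f1)

# Get the mean age for each sex
def f2(df):
    return df["age"].mean()
grouped.apply(f2)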
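The examples above return a scalar or a list per group. When the function returns a pandas object instead, the pieces are stitched together and the group key becomes an extra index level. A minimal sketch on made-up data (df and the helper top_income are hypothetical names introduced here):

```python
import pandas as pd

# Hypothetical stand-in data
df = pd.DataFrame(
    {"sex": ["male", "male", "female", "male", "female"],
     "income": [3000, 8000, 8000, 4000, 7000]},
    index=pd.Index(["Tom", "Bob", "Mary", "James", "Alice"], name="name"),
)

def top_income(ser, n=1):
    # Returns a Series (not a scalar), keeping the original row labels
    return ser.nlargest(n)

# Result is indexed by (sex, name): one level for the group, one for the row
top = df.groupby("sex")["income"].apply(top_income)

print(top)
```

Compared with returning a list, this keeps track of which row each value came from.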