Learning pandas (Part 6): Data Aggregation

Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai

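One aspect I have been exploring recently is grouping large data frames by different variables and applying summary functions to each group. In pandas this is done with the groupby() and agg() functions of the DataFrame object.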

Sample DataFrame

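To demonstrate how effective and simple the grouping commands are, we need some data. For the sample dataset I extracted my own mobile phone usage records. If you want to analyse this dataset yourself, you can download it and follow along.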

The dataset contains 830 entries from my phone log, spanning 5 months. The CSV file can be loaded into a DataFrame with the pandas.read_csv() function, as follows:

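import pandas as pd

# Load the phone log; the rest of the article refers to this frame as `data`
data = pd.read_csv('./phone_data.csv')
data.head()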

	index	date	duration	item	month	network	network_type
0	0	15/10/14 06:58	34.429	data	2014-11	data	data
1	1	15/10/14 06:58	13.000	call	2014-11	Vodafone	mobile
2	2	15/10/14 14:46	23.000	call	2014-11	Meteor	mobile
3	3	15/10/14 14:48	4.000	call	2014-11	Tesco	mobile
4	4	15/10/14 17:27	4.000	call	2014-11	Tesco	mobile

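The main columns in the file are:

date: the date and time of the entry;
duration: the duration of each call (in seconds), the amount of data for each data entry (in MB), and the number of texts sent for each sms entry (usually 1);
item: a description of the event, one of call, sms, or data;
month: the billing month each entry belongs to, in the form "YYYY-MM";
network: the mobile network that was called / texted for each entry;
network_type: whether the number called was a mobile, international, voicemail, landline, or other special number.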

All private phone numbers have been removed from the data. The date column can be parsed very conveniently with the dateutil library.

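import dateutil.parser  # importing the submodule explicitly makes dateutil.parser available

# Dates are written day-first (e.g. 15/10/14), so pass dayfirst=True through to the parser
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)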

	index	date	duration	item	month	network	network_type
0	0	2014-10-15 06:58:00	34.429	data	2014-11	data	data
1	1	2014-10-15 06:58:00	13.000	call	2014-11	Vodafone	mobile
2	2	2014-10-15 14:46:00	23.000	call	2014-11	Meteor	mobile
3	3	2014-10-15 14:48:00	4.000	call	2014-11	Tesco	mobile
4	4	2014-10-15 17:27:00	4.000	call	2014-11	Tesco	mobile
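As an alternative to dateutil, the same conversion can usually be done with pandas alone. The sketch below is only illustrative: it uses a few made-up date strings in the same day/month/year layout as the log, and the explicit format string is an assumption based on the sample rows shown above.

```python
import pandas as pd

# Made-up sample strings in the same layout as the log's date column
raw = pd.Series(["15/10/14 06:58", "01/11/14 14:46", "13/02/15 20:15"])

# An explicit format removes any ambiguity between day-first and month-first dates
parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")
print(parsed)
```

Spelling out the format is typically faster than parsing each value individually and fails loudly if a row does not match the expected layout.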

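Aggregating a DataFrame

Once the data is loaded into Python, pandas makes it very simple to compute various statistics, for example the mean, maximum, minimum, or standard deviation of a column.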

# How many rows are in the dataset?
print(data['item'].count())

# What was the longest phone call / data entry?
print(data['duration'].max())

# How many seconds of phone calls are recorded in total?
print(data['duration'][data['item'] == 'call'].sum())

# How many entries are there for each month?
print(data['month'].value_counts())

# Number of non-null unique network entries
print(data['network'].nunique())

830
10528.0
92321.0
2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
Name: month, dtype: int64
9

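Unless you have very specific requirements, there is little need for custom functions. The functions built into pandas cover most needs; the full list is in the pandas documentation.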

Function	Description
`count`	Number of non-null observations
`sum`	Sum of values
`mean`	Mean of values
`mad`	Mean absolute deviation (removed in pandas 2.0)
`median`	Arithmetic median of values
`min`	Minimum
`max`	Maximum
`mode`	Mode
`abs`	Absolute Value
`prod`	Product of values
`std`	Unbiased standard deviation
`var`	Unbiased variance
`sem`	Unbiased standard error of the mean
`skew`	Unbiased skewness (3rd moment)
`kurt`	Unbiased kurtosis (4th moment)
`quantile`	Sample quantile (value at %)
`cumsum`	Cumulative sum
`cumprod`	Cumulative product
`cummax`	Cumulative maximum
`cummin`	Cumulative minimum

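The .describe() function is a useful summary tool that quickly displays statistics for any variable or group it is applied to. The output of describe() depends on whether you apply it to a numeric column or a character column.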
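To make the difference concrete, here is a minimal sketch on a toy frame (the values are made up and are not the phone dataset). It applies describe() once to a numeric column and once to a text column.

```python
import pandas as pd

# Toy stand-in for the phone log (made-up values)
data = pd.DataFrame({
    "duration": [13.0, 23.0, 4.0, 4.0],
    "item": ["call", "call", "sms", "call"],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = data["duration"].describe()
print(numeric_summary)

# Text column: count, number of unique values, most frequent value and its frequency
text_summary = data["item"].describe()
print(text_summary)
```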

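Aggregating groups in a DataFrame

Mastering the pandas groupby() function gives you far more leverage. groupby() essentially splits the data into different groups based on the variables you choose. For example, the expression data.groupby('month') splits the current DataFrame by month.

The groupby() function returns a GroupBy object, which in effect describes how the rows of the original dataset have been split. The .groups attribute of the GroupBy object is a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group. For example: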

data.groupby(['month']).groups.keys()

dict_keys(['2014-11', '2014-12', '2015-01', '2015-02', '2015-03'])

len(data.groupby(['month']).groups['2014-11'])
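The following sketch shows what the groups dictionary holds, using a small made-up frame instead of the real log. It also uses get_group() and size(), two GroupBy helpers not covered in the text above, to inspect one group and count rows per group.

```python
import pandas as pd

# Toy stand-in for the phone log (made-up values)
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-11", "2014-12"],
    "duration": [13.0, 23.0, 4.0, 1.0, 34.429],
})
grouped = data.groupby("month")

# .groups maps each group key to the row labels that belong to it
labels = {key: list(idx) for key, idx in grouped.groups.items()}
print(labels)

# get_group() returns the rows of a single group as a DataFrame
november = grouped.get_group("2014-11")
print(november)

# size() counts the rows in each group
sizes = grouped.size()
print(sizes)
```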

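Functions such as max(), min(), mean(), first(), and last() can be applied directly to the GroupBy object to obtain summary statistics for each group, which is extremely useful. This functionality is similar to the dplyr and plyr libraries in R. Different variables can be selected from each summary.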

# Get the first entry for each month

data.groupby('month').first()

	index	date	duration	item	network	network_type
month
2014-11	0	2014-10-15 06:58:00	34.429	data	data	data
2014-12	228	2014-11-13 06:58:00	34.429	data	data	data
2015-01	381	2014-12-13 06:58:00	34.429	data	data	data
2015-02	577	2015-01-13 06:58:00	34.429	data	data	data
2015-03	729	2015-02-12 20:15:00	69.000	call	landline	landline

# Get the sum of the durations per month

data.groupby('month')['duration'].sum()

month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

# Get the number of dates / entries in each month

data.groupby('month')['date'].count()

month
2014-11    230
2014-12    157
2015-01    205
2015-02    137
2015-03    101
Name: date, dtype: int64

# What is the sum of durations, for calls only, to each network

data[data['item'] == 'call'].groupby('network')['duration'].sum()

network
Meteor        7200.0
Tesco        13828.0
Three        36464.0
Vodafone     14621.0
landline     18433.0
voicemail     1775.0
Name: duration, dtype: float64

You can also group by more than one variable, which allows more complex queries.

# How many calls, sms, and data entries are in each month?

data.groupby(['month', 'item'])['date'].count()

month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
         sms      48
2015-01  call     88
         data     31
         sms      86
2015-02  call     67
         data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
Name: date, dtype: int64

# How many calls, texts, and data are sent per month, split by network_type?

data.groupby(['month', 'network_type'])['date'].count()

month    network_type
2014-11  data             29
         landline          5
         mobile          189
         special           1
         voicemail         6
2014-12  data             30
         landline          7
         mobile          108
         voicemail         8
         world             4
2015-01  data             31
         landline         11
         mobile          160
         voicemail         3
2015-02  data             31
         landline          8
         mobile           90
         special           2
         voicemail         6
2015-03  data             29
         landline         11
         mobile           54
         voicemail         4
         world             3
Name: date, dtype: int64
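Results grouped by several variables carry a multi-level index. As a rough illustration on made-up data, the sketch below looks up a single combination with a tuple and then pivots the inner level into columns with unstack(); the choice of fill_value=0 for missing combinations is an assumption that suits counts.

```python
import pandas as pd

# Toy stand-in for the phone log (made-up values)
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-11", "2014-12", "2014-12"],
    "item": ["call", "call", "sms", "call", "data"],
    "duration": [13.0, 23.0, 1.0, 4.0, 34.429],
})

counts = data.groupby(["month", "item"])["duration"].count()

# A tuple selects one (month, item) combination from the multi-level index
nov_calls = counts.loc[("2014-11", "call")]
print(nov_calls)

# unstack() moves the inner level (item) into columns; absent combinations become 0
wide = counts.unstack(fill_value=0)
print(wide)
```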

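Groupby output format: Series or DataFrame?

The output of groupby and aggregation operations varies between a pandas Series and a pandas DataFrame, which can confuse new users. As a rule of thumb, if you compute more than one result column, the result is a DataFrame. For a single result column, the agg function produces a Series by default.

You can change this by selecting the operation column differently: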

data.groupby('month')['duration'].sum() # produces Pandas Series

month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

data.groupby('month')[['duration']].sum() # Produces Pandas DataFrame

	duration
month
2014-11	26639.441
2014-12	14641.870
2015-01	18223.299
2015-02	15522.299
2015-03	22750.441
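A quick way to confirm which type you have is to check it directly. The sketch below uses a toy frame with made-up values and also shows that a Series result can be turned into the equivalent one-column DataFrame with to_frame().

```python
import pandas as pd

# Toy stand-in for the phone log (made-up values)
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-12"],
    "duration": [13.0, 23.0, 4.0, 10.0],
})

# Single brackets select a Series, so the aggregate is a Series
series_result = data.groupby("month")["duration"].sum()

# Double brackets select a one-column DataFrame, so the aggregate is a DataFrame
frame_result = data.groupby("month")[["duration"]].sum()

print(type(series_result).__name__)
print(type(frame_result).__name__)

# The Series can be converted after the fact
converted = series_result.to_frame()
```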

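The groupby output has an index, or a multi-index, on the rows corresponding to the grouping variables you chose. To avoid setting this index, pass as_index=False to the groupby operation.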

data.groupby('month', as_index=False).agg({"duration": "sum"})

	month	duration
0	2014-11	26639.441
1	2014-12	14641.870
2	2015-01	18223.299
3	2015-02	15522.299
4	2015-03	22750.441

Using the as_index parameter when grouping data in pandas prevents a row index from being set on the result.
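If the grouping has already been done with the default index, calling reset_index() on the result gives the same flat layout. A minimal sketch on made-up data, comparing the two routes:

```python
import pandas as pd

# Toy stand-in for the phone log (made-up values)
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-12"],
    "duration": [13.0, 23.0, 4.0, 10.0],
})

# Route 1: never set the group keys as the index
flat = data.groupby("month", as_index=False).agg({"duration": "sum"})

# Route 2: group normally, then move the index back into a column
also_flat = data.groupby("month").agg({"duration": "sum"}).reset_index()

print(flat)
print(also_flat)
```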

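Multiple statistics per group

The final piece of syntax we will examine is the pandas agg() function. The aggregation functionality provided by agg() allows several statistics to be computed per group in a single calculation. The syntax is simple and similar to MongoDB's aggregation framework.

In May 2017, a major change was made to the pandas aggregation functions. Renaming variables inside the agg() function no longer works as shown in the figure below; see the note.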

[Figure] Aggregating variables in a pandas DataFrame with the agg() function. Note that from pandas version 0.20.1 onwards, renaming the results has to be done separately.

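Applying a single function to columns in groups

Aggregation instructions are supplied as a Python dictionary or list. The dictionary keys specify the columns to operate on, and the dictionary values specify the functions to run. For example: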

# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({'duration': 'sum',        # find the sum of the durations for each group
                                     'network_type': 'count',  # find the number of network type entries
                                     'date': 'first'})         # get the first date per group

               duration  network_type                 date
month   item
2014-11 call  25547.000           107  2014-10-15 06:58:00
        data    998.441            29  2014-10-15 06:58:00
        sms      94.000            94  2014-10-16 22:18:00
2014-12 call  13561.000            79  2014-11-14 17:24:00
        data   1032.870            30  2014-11-13 06:58:00
        sms      48.000            48  2014-11-14 17:28:00
2015-01 call  17070.000            88  2014-12-15 20:03:00
        data   1067.299            31  2014-12-13 06:58:00
        sms      86.000            86  2014-12-15 19:56:00
2015-02 call  14416.000            67  2015-01-15 10:36:00
        data   1067.299            31  2015-01-13 06:58:00
        sms      39.000            39  2015-01-15 12:23:00
2015-03 call  21727.000            47  2015-02-12 20:15:00
        data    998.441            29  2015-02-13 06:58:00
        sms      25.000            25  2015-02-19 18:46:00

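The aggregation dictionary syntax is flexible and can be defined before the operation. You can also define functions inline with lambda functions to extract statistics that the built-in options do not provide.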

# Define the aggregation procedure outside of the groupby operation
aggregations = {
    'duration':'sum',
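    'date': lambda x: max(x) - pd.Timedelta(days=1)  # assumed intent: one day before the latest date (recent pandas rejects "Timestamp - 1")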
}
data.groupby('month').agg(aggregations)
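For a runnable illustration of a lambda inside the aggregation dictionary, the sketch below computes the time span covered by each month. The data is made up, and the choice of statistic (latest date minus earliest date) is only an example of something the built-in names do not offer directly.

```python
import pandas as pd

# Toy stand-in for the phone log (made-up values)
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-12"],
    "duration": [13.0, 23.0, 4.0, 10.0],
    "date": pd.to_datetime([
        "2014-10-15 06:58", "2014-10-17 06:58",
        "2014-11-13 06:00", "2014-11-13 18:00",
    ]),
})

# The dictionary can be built ahead of time and reused
aggregations = {
    "duration": "sum",
    "date": lambda x: x.max() - x.min(),  # time span covered by each group
}

summary = data.groupby("month").agg(aggregations)
print(summary)
```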

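Applying multiple functions to columns in groups

To apply several functions to a single column of the grouped data, extend the syntax above and pass a list of functions as the value in the aggregation dictionary, as follows: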

# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({'duration': ['min', 'max', 'sum'],    # find the min, max, and sum of the duration column
                                     'network_type': 'count',              # find the number of network type entries
                                     'date': ['min', 'first', 'nunique']})  # get the min, first, and number of unique dates per group

              duration                        network_type                 date
                   min        max        sum         count                  min                first  nunique
month   item
2014-11 call     1.000   1940.000  25547.000           107  2014-10-15 06:58:00  2014-10-15 06:58:00      104
        data    34.429     34.429    998.441            29  2014-10-15 06:58:00  2014-10-15 06:58:00       29
        sms      1.000      1.000     94.000            94  2014-10-16 22:18:00  2014-10-16 22:18:00       79
2014-12 call     2.000   2120.000  13561.000            79  2014-11-14 17:24:00  2014-11-14 17:24:00       76
        data    34.429     34.429   1032.870            30  2014-11-13 06:58:00  2014-11-13 06:58:00       30
        sms      1.000      1.000     48.000            48  2014-11-14 17:28:00  2014-11-14 17:28:00       41
2015-01 call     2.000   1859.000  17070.000            88  2014-12-15 20:03:00  2014-12-15 20:03:00       84
        data    34.429     34.429   1067.299            31  2014-12-13 06:58:00  2014-12-13 06:58:00       31
        sms      1.000      1.000     86.000            86  2014-12-15 19:56:00  2014-12-15 19:56:00       58
2015-02 call     1.000   1863.000  14416.000            67  2015-01-15 10:36:00  2015-01-15 10:36:00       67
        data    34.429     34.429   1067.299            31  2015-01-13 06:58:00  2015-01-13 06:58:00       31
        sms      1.000      1.000     39.000            39  2015-01-15 12:23:00  2015-01-15 12:23:00       27
2015-03 call     2.000  10528.000  21727.000            47  2015-02-12 20:15:00  2015-02-12 20:15:00       47
        data    34.429     34.429    998.441            29  2015-02-13 06:58:00  2015-02-13 06:58:00       29
        sms      1.000      1.000     25.000            25  2015-02-19 18:46:00  2015-02-19 18:46:00       17

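The agg(...) syntax is flexible and easy to use. Remember that you can pass custom functions and lambda functions into the list of aggregations, and each one is passed the values of the column within each group.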
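As a small sketch of mixing built-in names with a custom function, the example below adds a hypothetical helper called value_range (not part of pandas) to the list. The data is made up; a named function is used rather than a lambda so that the resulting column gets a readable label.

```python
import pandas as pd

# Toy stand-in for the phone log (made-up values)
data = pd.DataFrame({
    "item": ["call", "call", "call", "sms", "sms"],
    "duration": [13.0, 23.0, 4.0, 1.0, 1.0],
})


def value_range(x):
    """Hypothetical helper: spread between the largest and smallest value in a group."""
    return x.max() - x.min()


# Each function receives the 'duration' values of one group at a time
stats = data.groupby("item").agg({"duration": ["min", "max", value_range]})
print(stats)
```

The custom function's name becomes the second-level column label, next to min and max.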

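Renaming grouped statistics from a groupby operation

When several statistics are computed on a column, the resulting data frame has a multi-index on the column axis. This can be awkward to work with, and I usually have to rename the columns after the groupby operation.

One option is to drop the top level of the newly created multi-index, as follows: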

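# The aggregation spec must be a dict; function names are passed as strings
grouped = data.groupby('month').agg({"duration": ["min", "max", "mean"]})
# Drop the top column level ('duration'), leaving min / max / mean
grouped.columns = grouped.columns.droplevel(level=0)
# rename() returns a new frame, so assign the result back
grouped = grouped.rename(columns={"min": "min_duration", "max": "max_duration", "mean": "mean_duration"})
grouped.head()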
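Two other ways of ending up with flat, descriptive column names are sketched below on made-up data. The first uses named aggregation, which is available in pandas 0.25 and later; the second joins the two levels of the multi-index into a single string. The output names such as min_duration are just illustrative choices.

```python
import pandas as pd

# Toy stand-in for the phone log (made-up values)
data = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-12"],
    "duration": [13.0, 23.0, 4.0, 10.0],
})

# Option A: named aggregation, output_name=(column, function)
named = data.groupby("month").agg(
    min_duration=("duration", "min"),
    max_duration=("duration", "max"),
    mean_duration=("duration", "mean"),
)
print(named)

# Option B: aggregate as usual, then join the two column levels with an underscore
multi = data.groupby("month").agg({"duration": ["min", "max", "mean"]})
multi.columns = ["_".join(col) for col in multi.columns]
print(multi)
```

Named aggregation avoids the multi-index altogether, while joining the levels is handy when the aggregation dictionary already exists.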

Source: shanelynn