数据聚合
quantile用于计算样本的分位数
>>> df = DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
>>>
>>> df
data1 data2 key1 key2
0 1.040176 -2.914926 a one
1 2.127639 1.089139 a two
2 0.422289 0.744127 b one
3 -0.540590 -0.698663 b two
4 0.652605 0.773524 a one
>>> grouped = df.groupby('key1')
>>> grouped['data1'].quantile(0.9)
key1
a 1.910146
b 0.326001
Name: data1, dtype: float64
aggreagte \ agg 使用自己的聚合函数
>>> def peak_to_peak(arr):
... return arr.max() - arr.min()
...
>>> grouped.agg(peak_to_peak)
data1 data2
key1
a 1.475034 4.004065
b 0.962878 1.442790
>>> grouped.describe()
data1 ... data2
count mean std ... 50% 75% max
key1 ...
a 3.0 1.273474 0.764691 ... 0.773524 0.931331 1.089139
b 2.0 -0.059151 0.680858 ... 0.022732 0.383430 0.744127
[2 rows x 16 columns]
下面加载了一家餐馆的数据,我添加了一列表示消费比例的数据
>>> tips = pd.read_csv('D:\python\DataAnalysis\data\\tips.csv')
>>> tips['tip_pct'] = tips['tip']/tips['total_bill']
>>> tips[:6]
total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808
5 25.29 4.71 No Sun Dinner 4 0.186240
面向列的多函数应用
书中的使用的文件丢失sex列,这里手动生成。
series = Series(x for x in np.random.randn(tips.shape[0]))
for x in series:
... if x > 0:
... series1.append('Male')
... else:
... series1.append('FeMale')
tips['sex'] = Series(series1)
>>> tips[:5]
total_bill tip smoker day time size tip_pct sex
0 16.99 1.01 No Sun Dinner 2 0.059447 Male
1 10.34 1.66 No Sun Dinner 3 0.160542 Male
2 21.01 3.50 No Sun Dinner 3 0.166587 FeMale
3 23.68 3.31 No Sun Dinner 2 0.139780 Male
4 24.59 3.61 No Sun Dinner 4 0.146808 FeMale
如果输入一组函数或函数名,得到的DataFrame的列就会以相应的函数命名
>>> grouped = tips.groupby(['sex','smoker'])
>>> grouped.mean()
total_bill tip size tip_pct
sex smoker
FeMale No 20.133117 3.083247 2.727273 0.158241
Yes 20.699556 3.246444 2.511111 0.174925
Male No 18.205135 2.896757 2.608108 0.160460
Yes 20.809583 2.785833 2.312500 0.152200
>>> grouped_pct = grouped['tip_pct']
>>> grouped_pct.agg('mean')
sex smoker
FeMale No 0.158241
Yes 0.174925
Male No 0.160460
Yes 0.152200
Name: tip_pct, dtype: float64
>>> grouped_pct.agg(['mean','std',peak_to_peak])
mean std peak_to_peak
sex smoker
FeMale No 0.158241 0.042148 0.209515
Yes 0.174925 0.106243 0.636362
Male No 0.160460 0.037694 0.232543
Yes 0.152200 0.057966 0.290095
GroupBy自动给出的列名的识别度较低,如果传入由(name,function)元组组成的列表,则个元组的第一个元素就会被用作DatFrame列名。
>>> grouped_pct.agg([('foo','mean'),('bar',np.std)])
foo bar
sex smoker
FeMale No 0.158241 0.042148
Yes 0.174925 0.106243
Male No 0.160460 0.037694
Yes 0.152200 0.057966
引入一组应用于全部列的函数
>>> result = grouped['tip_pct','total_bill'].agg(function)
>>> result
tip_pct total_bill
count mean max count mean max
sex smoker
FeMale No 77 0.158241 0.266312 77 20.133117 48.33
Yes 45 0.174925 0.710345 45 20.699556 50.81
Male No 74 0.160460 0.291990 74 18.205135 48.17
Yes 48 0.152200 0.325733 48 20.809583 45.35
>>> result['tip_pct']
count mean max
sex smoker
FeMale No 77 0.158241 0.266312
Yes 45 0.174925 0.710345
Male No 74 0.160460 0.291990
Yes 48 0.152200 0.325733
假设你想要对不同的列应用不同的函数,具体的方法是向agg传入一个从列名映射到函数的字典。
>>> grouped.agg({'tip':np.max,'size':'sum'})
tip size
sex smoker
FeMale No 9.00 210
Yes 10.00 113
Male No 7.58 193
Yes 6.50 111
grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})
tip_pct size
min max mean std sum
sex smoker
FeMale No 0.056797 0.266312 0.158241 0.042148 210
Yes 0.073983 0.710345 0.174925 0.106243 113
Male No 0.059447 0.291990 0.160460 0.037694 193
Yes 0.035638 0.325733 0.152200 0.057966 111
以无索引的形式返回聚合数据
>>> tips.groupby(['sex','smoker'],as_index=False).mean()
sex smoker total_bill tip size tip_pct
0 FeMale No 20.133117 3.083247 2.727273 0.158241
1 FeMale Yes 20.699556 3.246444 2.511111 0.174925
2 Male No 18.205135 2.896757 2.608108 0.160460
3 Male Yes 20.809583 2.785833 2.312500 0.152200