Python data aggregation ----- Python for Data Analysis

Data aggregation

quantile computes the sample quantiles of each group.

>>> import numpy as np
>>> import pandas as pd
>>> from pandas import DataFrame, Series
>>> df = DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
>>> 
>>> df
      data1     data2 key1 key2
0  1.040176 -2.914926    a  one
1  2.127639  1.089139    a  two
2  0.422289  0.744127    b  one
3 -0.540590 -0.698663    b  two
4  0.652605  0.773524    a  one
>>> grouped = df.groupby('key1')
>>> grouped['data1'].quantile(0.9)
key1
a    1.910146
b    0.326001
Name: data1, dtype: float64
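
Because the data above is random, the quantile values are hard to check by eye. The following is a minimal sketch on a small hand-made frame (the name df_demo and its values are my own, not part of the session above). It assumes the default linear interpolation between neighbouring sorted values.

```python
import pandas as pd

# Small fixed dataset so the result can be checked by hand.
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'a', 'b', 'b'],
    'data1': [1.0, 2.0, 3.0, 10.0, 20.0],
})

# 0.9 quantile of data1 within each key1 group.
# Group a holds 1, 2, 3: position 0.9 * (3 - 1) = 1.8, so 2 + 0.8 * (3 - 2) = 2.8.
# Group b holds 10, 20: position 0.9 * (2 - 1) = 0.9, so 10 + 0.9 * (20 - 10) = 19.
q90 = df_demo.groupby('key1')['data1'].quantile(0.9)
print(q90)
```

The same arithmetic reproduces the value for group a in the session above: its sorted data1 values are 0.652605, 1.040176 and 2.127639, and 1.040176 + 0.8 * (2.127639 - 1.040176) gives 1.910146.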

aggregate / agg: using your own aggregation functions

>>> def peak_to_peak(arr):
...     return arr.max() - arr.min()
... 
>>> grouped[['data1','data2']].agg(peak_to_peak)
         data1     data2
key1                    
a     1.475034  4.004065
b     0.962878  1.442790
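
As a check on what a custom aggregation function receives, here is a small sketch with fixed numbers (df_demo and its values are made up for illustration). The function is called once per group and per column and must return a single value. The numeric columns are selected explicitly, on the assumption that recent pandas versions no longer silently skip string columns such as key2.

```python
import pandas as pd


def peak_to_peak(arr):
    # arr is the slice of one column that belongs to one group.
    return arr.max() - arr.min()


df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 4.0, 2.0, 5.0, 3.0],
    'data2': [10.0, 20.0, 30.0, 50.0, 40.0],
})

# Select the numeric columns, then reduce each group with the custom function.
ptp = df_demo.groupby('key1')[['data1', 'data2']].agg(peak_to_peak)
print(ptp)
```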

>>> grouped.describe()
     data1                        ...        data2                    
     count      mean       std    ...          50%       75%       max
key1                              ...                                 
a      3.0  1.273474  0.764691    ...     0.773524  0.931331  1.089139
b      2.0 -0.059151  0.680858    ...     0.022732  0.383430  0.744127

[2 rows x 16 columns]

The following loads a restaurant tipping dataset. I add a tip_pct column that holds the tip as a fraction of the total bill.

>>> tips = pd.read_csv(r'D:\python\DataAnalysis\data\tips.csv')
>>> tips['tip_pct'] = tips['tip']/tips['total_bill']
>>> tips[:6]
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
5       25.29  4.71     No  Sun  Dinner     4  0.186240

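Column-wise and multiple function application

The data file used with the book is missing the sex column, so it is generated by hand here from random draws.

>>> series = Series(np.random.randn(tips.shape[0]))
>>> series1 = []
>>> for x in series:
...     if x > 0:
...         series1.append('Male')
...     else:
...         series1.append('FeMale')
... 
>>> tips['sex'] = Series(series1)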
>>> tips[:5]
   total_bill   tip smoker  day    time  size   tip_pct     sex
0       16.99  1.01     No  Sun  Dinner     2  0.059447    Male
1       10.34  1.66     No  Sun  Dinner     3  0.160542    Male
2       21.01  3.50     No  Sun  Dinner     3  0.166587  FeMale
3       23.68  3.31     No  Sun  Dinner     2  0.139780    Male
4       24.59  3.61     No  Sun  Dinner     4  0.146808  FeMale
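
The loop above can also be written without an explicit Python loop. This is a sketch of one alternative using np.where, not what the session above ran. The frame tips_demo, the seed 0 and the generator rng are my own choices for illustration, and the label spelling 'FeMale' is kept to match the output above.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the tips data; only the row count matters here.
tips_demo = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68],
    'tip': [1.01, 1.66, 3.50, 3.31],
})

# A seeded generator makes the random labels reproducible between runs.
rng = np.random.default_rng(0)
draws = rng.standard_normal(len(tips_demo))

# Positive draws become 'Male', everything else 'FeMale'.
tips_demo['sex'] = np.where(draws > 0, 'Male', 'FeMale')
print(tips_demo)
```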

If you pass a list of functions or function names, the columns of the resulting DataFrame are named after those functions.

>>> grouped = tips.groupby(['sex','smoker'])
>>> grouped.mean(numeric_only=True)
               total_bill       tip      size   tip_pct
sex    smoker                                          
FeMale No       20.133117  3.083247  2.727273  0.158241
       Yes      20.699556  3.246444  2.511111  0.174925
Male   No       18.205135  2.896757  2.608108  0.160460
       Yes      20.809583  2.785833  2.312500  0.152200
>>> grouped_pct = grouped['tip_pct']
>>> grouped_pct.agg('mean')
sex     smoker
FeMale  No        0.158241
        Yes       0.174925
Male    No        0.160460
        Yes       0.152200
Name: tip_pct, dtype: float64
>>> grouped_pct.agg(['mean','std',peak_to_peak])
                   mean       std  peak_to_peak
sex    smoker                                  
FeMale No      0.158241  0.042148      0.209515
       Yes     0.174925  0.106243      0.636362
Male   No      0.160460  0.037694      0.232543
       Yes     0.152200  0.057966      0.290095
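
To show how the column names are derived, here is a minimal sketch on a four-row frame (tips_demo and its values are invented for the example). String names such as 'mean' are used as-is, and a plain Python function contributes its own name.

```python
import pandas as pd


def peak_to_peak(arr):
    return arr.max() - arr.min()


tips_demo = pd.DataFrame({
    'sex': ['Male', 'Male', 'FeMale', 'FeMale'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25],
})

# One output column per entry in the list, named after the function.
stats = tips_demo.groupby('sex')['tip_pct'].agg(['mean', 'std', peak_to_peak])
print(stats.columns.tolist())
print(stats)
```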

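The column names that GroupBy assigns automatically are not very descriptive. If you pass a list of (name, function) tuples, the first element of each tuple is used as the DataFrame column name.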

>>> grouped_pct.agg([('foo','mean'),('bar',np.std)])
                    foo       bar
sex    smoker                    
FeMale No      0.158241  0.042148
       Yes     0.174925  0.106243
Male   No      0.160460  0.037694
       Yes     0.152200  0.057966
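
The sketch below repeats the tuple form on a small invented frame (tips_demo is hypothetical) and, for comparison, shows the keyword form that I believe newer pandas versions (0.25 and later) accept for the same purpose. Treat the second call as an assumption about your installed version.

```python
import pandas as pd

tips_demo = pd.DataFrame({
    'sex': ['Male', 'Male', 'FeMale', 'FeMale'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25],
})
grouped_demo = tips_demo.groupby('sex')['tip_pct']

# (name, function) tuples: the first element becomes the column name.
renamed = grouped_demo.agg([('foo', 'mean'), ('bar', 'std')])

# Keyword form: each keyword is the output column name.
named = grouped_demo.agg(foo='mean', bar='std')

print(renamed)
print(named)
```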

Passing a list of functions to apply to every selected column

>>> functions = ['count','mean','max']
>>> result = grouped[['tip_pct','total_bill']].agg(functions)
>>> result
              tip_pct                     total_bill                  
                count      mean       max      count       mean    max
sex    smoker                                                         
FeMale No          77  0.158241  0.266312         77  20.133117  48.33
       Yes         45  0.174925  0.710345         45  20.699556  50.81
Male   No          74  0.160460  0.291990         74  18.205135  48.17
       Yes         48  0.152200  0.325733         48  20.809583  45.35
>>> result['tip_pct']
               count      mean       max
sex    smoker                           
FeMale No         77  0.158241  0.266312
       Yes        45  0.174925  0.710345
Male   No         74  0.160460  0.291990
       Yes        48  0.152200  0.325733
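
The result has two levels of column labels: the original column on top and the function name below. The following sketch, on made-up data (tips_demo is not the real file), shows how to pull out either a whole block or a single statistic.

```python
import pandas as pd

tips_demo = pd.DataFrame({
    'sex': ['Male', 'Male', 'FeMale', 'FeMale'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25],
    'total_bill': [10.0, 20.0, 30.0, 50.0],
})

functions = ['count', 'mean', 'max']
result = tips_demo.groupby('sex')[['tip_pct', 'total_bill']].agg(functions)

# Selecting by the top-level label returns the block for that column.
pct_block = result['tip_pct']

# A (column, function) tuple selects one statistic as a Series.
bill_max = result[('total_bill', 'max')]

print(pct_block)
print(bill_max)
```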

Suppose you want to apply different functions to different columns. To do that, pass agg a dict that maps column names to functions.

>>> grouped.agg({'tip':np.max,'size':'sum'})
                 tip  size
sex    smoker             
FeMale No       9.00   210
       Yes     10.00   113
Male   No       7.58   193
       Yes      6.50   111
>>> grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})
                tip_pct                               size
                    min       max      mean       std  sum
sex    smoker                                             
FeMale No      0.056797  0.266312  0.158241  0.042148  210
       Yes     0.073983  0.710345  0.174925  0.106243  113
Male   No      0.059447  0.291990  0.160460  0.037694  193
       Yes     0.035638  0.325733  0.152200  0.057966  111
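
Here is a small, checkable sketch of the dict form on invented data (tips_demo is hypothetical). It also shows the keyword form with (column, function) pairs, which, as far as I know, newer pandas versions accept and which gives flat column names of your choosing.

```python
import pandas as pd

tips_demo = pd.DataFrame({
    'sex': ['Male', 'Male', 'FeMale', 'FeMale'],
    'tip': [1.0, 3.0, 2.0, 5.0],
    'size': [2, 3, 4, 2],
})
grouped_demo = tips_demo.groupby('sex')

# Dict form: a different function for each column.
by_dict = grouped_demo.agg({'tip': 'max', 'size': 'sum'})

# Keyword form: output_name=(input_column, function).
flat = grouped_demo.agg(max_tip=('tip', 'max'), total_size=('size', 'sum'))

print(by_dict)
print(flat)
```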

Returning aggregated data without using the group keys as the index

>>> tips.groupby(['sex','smoker'],as_index=False).mean(numeric_only=True)
      sex smoker  total_bill       tip      size   tip_pct
0  FeMale     No   20.133117  3.083247  2.727273  0.158241
1  FeMale    Yes   20.699556  3.246444  2.511111  0.174925
2    Male     No   18.205135  2.896757  2.608108  0.160460
3    Male    Yes   20.809583  2.785833  2.312500  0.152200
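
With as_index=False the group keys stay as ordinary columns. A similar layout can be obtained by calling reset_index on the indexed result. The sketch below compares the two on a small invented frame (tips_demo is hypothetical and contains only numeric value columns, so mean works without extra arguments).

```python
import pandas as pd

tips_demo = pd.DataFrame({
    'sex': ['Male', 'Male', 'FeMale', 'FeMale'],
    'smoker': ['No', 'No', 'Yes', 'Yes'],
    'tip': [1.0, 3.0, 2.0, 6.0],
})

# Keys kept as regular columns, rows numbered from 0.
flat = tips_demo.groupby(['sex', 'smoker'], as_index=False).mean()

# Default behaviour: keys form a hierarchical index, flattened afterwards.
indexed = tips_demo.groupby(['sex', 'smoker']).mean()
flattened = indexed.reset_index()

print(flat)
print(flattened)
```

Passing as_index=False avoids building the hierarchical index in the first place, which is the more direct choice when the keys are needed as columns anyway.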

Reposted from blog.csdn.net/Da___Vinci/article/details/83178407