GroupBy Techniques: Python for Data Analysis

GroupBy Techniques

>>> import numpy as np
>>> from pandas import DataFrame,Series
>>> df = DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
>>> df
      data1     data2 key1 key2
0 -1.012239  0.381608    a  one
1  0.432161 -1.384340    a  two
2  0.426435 -1.732019    b  one
3 -1.388080  0.839690    b  two
4 -0.439888 -0.603553    a  one
>>> grouped = df['data1'].groupby(df['key1'])
>>> grouped.mean()
key1
a   -0.339989
b   -0.480822
Name: data1, dtype: float64
>>> means = df['data1'].groupby([df['key1'],df['key2']]).mean()
>>> means
key1  key2
a     one    -0.726064
      two     0.432161
b     one     0.426435
      two    -1.388080
Name: data1, dtype: float64
>>> means.unstack()
key2       one       two
key1                    
a    -0.726064  0.432161
b     0.426435 -1.388080
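
Because the data above comes from np.random.randn, the numbers change on every run. The sketch below repeats the same steps on a small hand-written frame so that the group means can be checked by hand. The data values are made up for illustration and do not come from the session above.

```python
import pandas as pd

# Hand-written values (illustrative only) so the results are reproducible.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Split data1 by the values of key1, then take the mean of each group.
means_by_key1 = df['data1'].groupby(df['key1']).mean()

# Two keys give a Series with a two-level index;
# unstack() pivots the inner level (key2) into columns.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
table = means.unstack()

print(means_by_key1)
print(table)
```

Group a contains rows 0, 1 and 4, so its mean is (1 + 2 + 6) / 3.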

Using column names directly as group keys

>>> df.groupby('key1').mean()
         data1     data2
key1                    
a    -0.339989 -0.535428
b    -0.480822 -0.446165
>>> df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
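
One caveat, which depends on your installed version: in recent pandas releases (2.0 and later, as far as I know) mean() on the grouped frame no longer drops the non-numeric key2 column silently and raises an error instead. A minimal sketch that states the intent explicitly passes numeric_only=True. The data values are invented for illustration.

```python
import pandas as pd

# Illustrative values, not the random data from the session above.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Restrict the aggregation to the numeric columns explicitly,
# so the string column key2 is left out on purpose.
col_means = df.groupby('key1').mean(numeric_only=True)

# size() returns the number of rows in each (key1, key2) group.
sizes = df.groupby(['key1', 'key2']).size()

print(col_means)
print(sizes)
```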

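Iterating over groups

A GroupBy object supports iteration. It yields a sequence of 2-tuples, each holding the group name and the corresponding chunk of data.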

>>> for name, group in df.groupby('key1'):
...     print(name)
...     print(group)
... 
a
      data1     data2 key1 key2
0 -1.012239  0.381608    a  one
1  0.432161 -1.384340    a  two
4 -0.439888 -0.603553    a  one
b
      data1     data2 key1 key2
2  0.426435 -1.732019    b  one
3 -1.388080  0.839690    b  two

With multiple keys, the first element of each tuple is itself a tuple of key values:

>>> for (k1, k2), group in df.groupby(['key1', 'key2']):
...     print(k1, k2)
...     print(group)
... 
a one
      data1     data2 key1 key2
0 -1.012239  0.381608    a  one
4 -0.439888 -0.603553    a  one
a two
      data1    data2 key1 key2
1  0.432161 -1.38434    a  two
b one
      data1     data2 key1 key2
2  0.426435 -1.732019    b  one
b two
     data1    data2 key1 key2
3 -1.38808  0.83969    b  two
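
The loops can also collect something from each piece instead of printing it. The following sketch, written for Python 3 and using made-up values, records how many rows each group holds.

```python
import pandas as pd

# Illustrative values, not the random data from the session above.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Single key: each iteration yields (group name, sub-DataFrame).
seen = []
for name, group in df.groupby('key1'):
    seen.append((name, len(group)))

# Two keys: the group name is a tuple, which can be unpacked in the loop header.
pair_sizes = {}
for (k1, k2), group in df.groupby(['key1', 'key2']):
    pair_sizes[(k1, k2)] = len(group)

print(seen)
print(pair_sizes)
```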

You can do whatever you like with these pieces of data, for example collecting them into a dict:

>>> pieces = dict(list(df.groupby('key1')))

>>> pieces['b']
      data1     data2 key1 key2
2  0.426435 -1.732019    b  one
3 -1.388080  0.839690    b  two
>>> pieces['a']
      data1     data2 key1 key2
0 -1.012239  0.381608    a  one
1  0.432161 -1.384340    a  two
4 -0.439888 -0.603553    a  one
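
An equivalent way to build the same dict is a dict comprehension, and when only one piece is needed, get_group fetches it without materializing the others. A short sketch with invented values:

```python
import pandas as pd

# Illustrative values, not the random data from the session above.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Map each group name to its sub-DataFrame.
pieces = {name: group for name, group in df.groupby('key1')}

# Fetch a single group by name.
b_rows = df.groupby('key1').get_group('b')

print(pieces['a'])
print(b_rows)
```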

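By default groupby groups along axis=0, but it can be told to group along any other axis. For example, the columns of df can be grouped by dtype: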

>>> df.dtypes
data1    float64
data2    float64
key1      object
key2      object
dtype: object
>>> grouped = df.groupby(df.dtypes,axis=1)
>>> dict(list(grouped))
{dtype('O'):   key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one, dtype('float64'):       data1     data2
0 -1.012239  0.381608
1  0.432161 -1.384340
2  0.426435 -1.732019
3 -1.388080  0.839690
4 -0.439888 -0.603553}
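
The axis argument of groupby is deprecated in newer pandas releases (since 2.1, to the best of my knowledge), so the call above may warn or fail depending on your version. A version-independent sketch of the same idea splits the columns by dtype with a plain loop. The helper name split_by_dtype is hypothetical and the data values are invented.

```python
import pandas as pd

# Illustrative values, not the random data from the session above.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})


def split_by_dtype(frame):
    """Return {dtype name: sub-frame holding the columns of that dtype}."""
    columns_by_dtype = {}
    for column, dtype in frame.dtypes.items():
        columns_by_dtype.setdefault(str(dtype), []).append(column)
    return {name: frame[cols] for name, cols in columns_by_dtype.items()}


by_dtype = split_by_dtype(df)
for dtype_name, sub_frame in by_dtype.items():
    print(dtype_name, list(sub_frame.columns))
```

The name used for the string columns depends on the pandas version, so the sketch only relies on float64 being reported for data1 and data2.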

Selecting a column or a subset of columns

With large datasets you often want to aggregate only some of the columns. For example:

>>> df.groupby(['key1','key2'])[['data2']].mean()
              data2
key1 key2          
a    one  -0.110972
     two  -1.384340
b    one  -1.732019
     two   0.839690
>>> s_grouped = df.groupby(['key1','key2'])['data2']
>>> s_grouped.mean()
key1  key2
a     one    -0.110972
      two    -1.384340
b     one    -1.732019
      two     0.839690
Name: data2, dtype: float64
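
Indexing a GroupBy object with a column name is shorthand for selecting the column first and grouping it by the key columns. Indexing with a list returns a grouped DataFrame, while a single name returns a grouped Series. The sketch below checks both points on invented values.

```python
import pandas as pd

# Illustrative values, not the random data from the session above.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 60.0],
})
keys = ['key1', 'key2']

# Shorthand: index the GroupBy object with a single column name.
sugar = df.groupby(keys)['data2'].mean()

# Longhand: select the column, then group it by the key columns.
explicit = df['data2'].groupby([df['key1'], df['key2']]).mean()

# A list of names keeps the result two-dimensional.
as_frame = df.groupby(keys)[['data2']].mean()

print(sugar.equals(explicit))
print(type(sugar).__name__, type(as_frame).__name__)
```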

Grouping with a dict or Series

>>> people = DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
>>> people.iloc[2:3, [1, 2]] = np.nan
>>> people
               a         b         c         d         e
Joe    -0.507204  1.111102 -1.626998 -1.191771  0.386699
Steve   1.225585  1.202014  0.089095  0.004328 -0.660203
Wes    -0.641992       NaN       NaN -1.612848  0.327813
Jim     1.271822 -0.117422  0.919063 -0.254136 -0.957631
Travis  0.690725 -1.098159 -0.757635 -0.794666 -1.297784
>>> mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
>>> by_column = people.groupby(mapping,axis=1)
>>> by_column.sum()
            blue       red
Joe    -2.818768  0.990597
Steve   0.093423  1.767396
Wes    -1.612848 -0.314179
Jim     0.664926  0.196769
Travis -1.552301 -1.705217
>>> map_series = Series(mapping)
>>> map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
>>> people.groupby(map_series,axis=1).count()
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
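
On pandas versions where groupby(..., axis=1) warns or is unavailable, one workaround is to transpose, group the rows, and transpose back. The sketch below applies the same mapping to deterministic values (invented for illustration) so the sums can be verified by hand. Note that the unused key 'f' in the mapping is simply ignored.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame: rows hold 0..4, 5..9, and so on.
people = pd.DataFrame(
    np.arange(25, dtype=float).reshape(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)
people.loc['Wes', ['b', 'c']] = np.nan

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the column labels become the row index, group those rows
# by color, aggregate, then transpose back to the original orientation.
by_color = people.T.groupby(mapping).sum().T
counts = people.T.groupby(mapping).count().T

print(by_color)
print(counts)
```

For Joe, red is a + b + e = 0 + 1 + 4 and blue is c + d = 2 + 3. For Wes, the missing b and c are skipped by both sum and count.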

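Grouping with functions

Any function passed as a group key is called once per index value, and its return values are used as the group names. Here the index holds people's names, so passing len groups the rows by name length.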

>>> import numpy as np
>>> from pandas import DataFrame,Series
>>> people = DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
>>> people.iloc[2:3, [1, 2]] = np.nan
>>> people.groupby(len).sum()
          a         b         c         d         e
3  2.080547 -0.604547 -0.604366 -1.513836  0.497836
5  0.079461 -1.729398 -0.901477  0.569260  0.302427
6  0.005069 -0.035869 -0.793810  1.150144  2.031785
>>> people
               a         b         c         d         e
Joe     1.119423 -0.345290  0.668423 -0.658008  0.413723
Steve   0.079461 -1.729398 -0.901477  0.569260  0.302427
Wes    -0.556755       NaN       NaN -0.992753  0.124015
Jim     1.517879 -0.259257 -1.272789  0.136925 -0.039903
Travis  0.005069 -0.035869 -0.793810  1.150144  2.031785
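
To make explicit what groupby(len) does, the sketch below compares it with grouping by a precomputed index of name lengths. The values are deterministic and invented for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame: rows hold 0..4, 5..9, and so on.
people = pd.DataFrame(
    np.arange(25, dtype=float).reshape(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)
people.loc['Wes', ['b', 'c']] = np.nan

# len is applied to every index label: 'Joe' -> 3, 'Steve' -> 5, 'Travis' -> 6.
by_len = people.groupby(len).sum()

# The same grouping with the lengths computed up front.
name_lengths = people.index.map(len)
by_len_explicit = people.groupby(name_lengths).sum()

print(by_len)
```

Joe, Wes and Jim all have three-letter names, so group 3 sums those three rows, skipping the missing values in Wes's row.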

In the following example, the rows are grouped first by name length and then by the one/two labels in key_list:

>>> key_list = ['one','one','one','two','two']
>>> people.groupby([len,key_list]).min()
              a         b         c         d         e
3 one -0.556755 -0.345290  0.668423 -0.992753  0.124015
  two  1.517879 -0.259257 -1.272789  0.136925 -0.039903
5 one  0.079461 -1.729398 -0.901477  0.569260  0.302427
6 two  0.005069 -0.035869 -0.793810  1.150144  2.031785
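
The mixed-key grouping can be reproduced on deterministic values (invented for illustration) to see which rows land in which group. Each row gets a two-part key: the length of its index label and the matching entry of key_list.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame: rows hold 0..4, 5..9, and so on.
people = pd.DataFrame(
    np.arange(25, dtype=float).reshape(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)
people.loc['Wes', ['b', 'c']] = np.nan

# One label per row, in row order: Joe, Steve, Wes -> 'one'; Jim, Travis -> 'two'.
key_list = ['one', 'one', 'one', 'two', 'two']

# A function and a list can be combined as group keys.
result = people.groupby([len, key_list]).min()

print(result)
```

Group (3, 'one') holds Joe and Wes, group (3, 'two') holds only Jim, and Steve and Travis each form a group of their own.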

Grouping by index level

>>> import pandas as pd
>>> columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]],names = ['cty','tenor'])
>>> hier_df = DataFrame(np.random.randn(4,5),columns = columns)
>>> hier_df
cty          US                            JP          
tenor         1         3         5         1         3
0      0.839657  0.656362  1.034138 -1.107702  0.687075
1      0.979355  0.581277  1.024826 -0.617576  0.117190
2      0.579184 -0.629204  1.849724 -0.738685 -1.937523
3      0.168968 -0.352462 -0.791173 -0.628160  0.391682
>>> hier_df.groupby(level='cty',axis=1).count()
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
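
If your pandas version warns about or rejects axis=1 in groupby, the same count can be obtained by transposing first, so that the column levels become row index levels. This is a sketch with invented deterministic values, not the random data shown above.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
    names=['cty', 'tenor'],
)
# Deterministic stand-in for the random values.
hier_df = pd.DataFrame(np.arange(20, dtype=float).reshape(4, 5), columns=columns)

# After transposing, 'cty' is a level of the row index and can be
# passed to groupby through the level keyword.
counts = hier_df.T.groupby(level='cty').count().T

print(counts)
```

Every row has three US columns and two JP columns, so the counts are the same in each row.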

 

Reposted from blog.csdn.net/Da___Vinci/article/details/83153261