Pandas grouping (GroupBy)

Grouping in pandas works much like GROUP BY in SQL, so the two can be understood by analogy. After grouping, you can perform the following operations:

  • Aggregation, agg(): compute summary statistics for each group
  • Transformation, transform(): perform a group-specific operation on each group
  • Filtering, filter(): discard the data of groups that fail some condition

import pandas as pd
import numpy as np

# Note: 'kings' (lowercase) is a different value from 'Kings',
# so it ends up in a group of its own.
data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(data)

Printing df gives:

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690
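
To make the SQL analogy concrete, here is a minimal sketch of one GROUP BY query and a pandas counterpart. It uses a trimmed copy of the data above (only Team and Points); the SQL statement in the comment and the variable name total_points are only illustrative.

```python
import pandas as pd

# Trimmed copy of the article's data: only the two columns needed here.
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# Roughly equivalent to:
#   SELECT Team, SUM(Points) FROM df GROUP BY Team;
total_points = df.groupby('Team')['Points'].sum()
print(total_points)
```

The result is a Series indexed by Team, with one entry per distinct team name.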

Iterate over the groups

grouped = df.groupby('Year')

# Each iteration yields the group key and the sub-DataFrame for that key
for name, group in grouped:
    print(name)
    print(group)

The output is:

2014
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
...
2017
      Team  Rank  Year  Points
7    Kings     1  2017     788
11  Riders     2  2017     690
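
If you only need one group rather than a full loop, the grouped object can also be queried directly. The sketch below rebuilds part of the data above and uses ngroups, groups and get_group(); the variable name g2014 is only illustrative.

```python
import pandas as pd

# Trimmed copy of the article's data (Rank is not needed here).
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

grouped = df.groupby('Year')

print(grouped.ngroups)         # number of distinct years
print(sorted(grouped.groups))  # the group keys

# Pull out the rows of a single group without looping
g2014 = grouped.get_group(2014)
print(g2014)
```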

Use aggregate functions

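# groupby() accepts a single key; passing a list of keys produces a two-level index.
# as_index controls whether the group keys become the index of the result (default: True).

df.groupby('Team').agg(['sum', 'mean'])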

The output is:

       Rank            Year              Points
        sum      mean   sum         mean    sum        mean
Team
Devils    5  2.500000  4029  2014.500000   1536  768.000000
Kings     5  1.666667  6047  2015.666667   2285  761.666667
Riders    7  1.750000  8062  2015.500000   3049  762.250000
Royals    5  2.500000  4029  2014.500000   1505  752.500000
kings     4  4.000000  2015  2015.000000    812  812.000000
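
The as_index parameter mentioned in the comments above is easiest to see side by side. This is a small sketch on a trimmed copy of the data; the names by_index and by_column are only illustrative.

```python
import pandas as pd

# Trimmed copy of the article's data.
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# Default (as_index=True): the group key becomes the index of the result
by_index = df.groupby('Team')['Points'].sum()

# as_index=False: the group key stays an ordinary column
by_column = df.groupby('Team', as_index=False)['Points'].sum()

print(by_index)
print(by_column)
```

The second form is often more convenient when the result will be merged or exported, since Team is then a regular column.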

Of course, you can also select a single column from the aggregated result. For example, to view only the Rank column:

df.groupby('Team').agg(['sum', 'mean'])['Rank']

Note that when you group by more than one key, the index of the aggregated result is a multi-level index (MultiIndex):

mult = df.groupby(['Team', 'Year']).agg(['sum', 'mean'])
print(mult.index)

The output is:

MultiIndex([('Devils', 2014),
            ('Devils', 2015),
            ( 'Kings', 2014),
            ( 'Kings', 2016),
            ( 'Kings', 2017),
            ('Riders', 2014),
            ('Riders', 2015),
            ('Riders', 2016),
            ('Riders', 2017),
            ('Royals', 2014),
            ('Royals', 2015),
            ( 'kings', 2015)],
           names=['Team', 'Year'])

Since this is a multi-level index, how do you look up values in it? The lookup still uses .loc:

  • When a tuple (K1, K2) is passed in, it addresses two levels of the index: K1 is matched against the first level and K2 against the second
  • When a list [K1, K2] is passed in, both K1 and K2 are matched against the same (first) level

mult.loc[('Devils', 2015)]
mult.loc[['Devils', 'Riders']]

The output is:

Rank    sum       3
        mean      3
Points  sum     673
        mean    673
Name: (Devils, 2015), dtype: int64
---------------------------------------
            Rank      Points     
             sum mean    sum mean
Team   Year                      
Devils 2014    2    2    863  863
       2015    3    3    673  673
Riders 2014    1    1    876  876
       2015    2    2    789  789
       2016    2    2    694  694
       2017    2    2    690  690
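
The tuple-versus-list distinction can be checked with a short sketch. It rebuilds the data above and writes the tuple lookup with an explicit column slice, mult.loc[(K1, K2), :], which avoids any ambiguity between row and column keys; the names one_row, two_teams and rank_only are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

mult = df.groupby(['Team', 'Year']).agg(['sum', 'mean'])

# Tuple: one label per index level -> a single row, returned as a Series
one_row = mult.loc[('Devils', 2015), :]

# List: several labels on the first level -> all rows of both teams
two_teams = mult.loc[['Devils', 'Riders']]

# List of row labels plus a top-level column label
rank_only = mult.loc[['Devils', 'Riders'], 'Rank']

print(one_row)
print(two_teams)
print(rank_only)
```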

Of course, you can also select a single column at the same time:

mult.loc[['Devils', 'Riders'], 'Rank']

Use the filter function

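# Named "filtered" to avoid shadowing Python's built-in filter()
filtered = df.groupby('Team').filter(lambda x: len(x) >= 3)
print(filtered)

The output is as follows (teams that appear at least three times):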
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
4    Kings     3  2014     741
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
11  Riders     2  2017     690
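
The function passed to filter() receives each group as a DataFrame and must return a single True or False; every row of a group is then kept or dropped together. As a further sketch, the example below keeps only teams whose average Points exceeds 765. The threshold and the name high_avg are arbitrary choices for illustration.

```python
import pandas as pd

# Trimmed copy of the article's data.
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# Keep every row belonging to a team whose mean Points is above 765
high_avg = df.groupby('Team').filter(lambda g: g['Points'].mean() > 765)
print(high_avg)
```

Unlike agg(), the result keeps the original rows and their original index labels rather than one row per group.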

Use the transform function

grouped = df.groupby('Team')
# Standardize each value within its team (z-score), then scale by 10
score = lambda x: (x - x.mean()) / x.std() * 10
print(grouped.transform(score))

The output is as follows:
       Points       Rank       Year
0   12.843272 -15.000000 -11.618950
1    3.020286   5.000000  -3.872983
2    7.071068  -7.071068  -7.071068
...
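
Because transform() returns a result with the same index as the original data, it can be assigned straight back as a new column. A minimal sketch on a trimmed copy of the data, where the column names TeamMean and Diff are made up for illustration:

```python
import pandas as pd

# Trimmed copy of the article's data.
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# Broadcast each team's mean back onto every row of that team
df['TeamMean'] = df.groupby('Team')['Points'].transform('mean')

# Difference between each row and its own team's mean
df['Diff'] = df['Points'] - df['TeamMean']
print(df)
```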

The key point is to keep track of the index levels in the aggregated result!
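
When the extra levels get in the way, one option is to flatten them after aggregating. The sketch below is only one possible approach: it joins the two column levels with an underscore and turns the row levels back into ordinary columns with reset_index(). The name flat and the underscore naming scheme are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

mult = df.groupby(['Team', 'Year']).agg(['sum', 'mean'])

# Both the rows and the columns are hierarchical after this aggregation
print(mult.index.nlevels, mult.columns.nlevels)

flat = mult.copy()
# ('Rank', 'sum') -> 'Rank_sum', and so on
flat.columns = ['_'.join(col) for col in flat.columns]
# Move Team and Year from the index back into regular columns
flat = flat.reset_index()
print(flat.head())
```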

Origin blog.csdn.net/qq_44091773/article/details/106079418