Grouping in pandas works much like GROUP BY in SQL, so the two can be understood by analogy. After grouping, you can perform the following operations:
- Aggregation: agg() calculates summary statistics
- Transformation: transform() performs some group-specific operation
- Filtering: filter() discards data in some cases
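To make the difference between the three operations concrete, here is a minimal sketch on a tiny frame. The frame `toy` and its columns `key` and `val` are made up for illustration and are not part of the data used below. The point to notice is the shape of each result.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({
    'key': ['a', 'a', 'b'],
    'val': [1, 3, 10],
})
g = toy.groupby('key')['val']

# Aggregation: one value per group
aggregated = g.agg('sum')

# Transformation: one value per original row, aligned with the input index
transformed = g.transform('mean')

# Filtering: keeps or drops whole groups, returning original rows
filtered = toy.groupby('key').filter(lambda x: len(x) >= 2)

print(aggregated)
print(transformed)
print(filtered)
```

Aggregation shrinks the data to one row per group, transformation keeps the original length, and filtering returns a subset of the original rows.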
import pandas as pd
import numpy as np
data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(data)
Printing df gives:
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
2 Devils 2 2014 863
3 Devils 3 2015 673
4 Kings 3 2014 741
5 kings 4 2015 812
6 Kings 1 2016 756
7 Kings 1 2017 788
8 Riders 2 2016 694
9 Royals 4 2014 701
10 Royals 1 2015 804
11 Riders 2 2017 690
Iterate over the grouped object
grouped = df.groupby('Year')
for name, group in grouped:
    print(name)
    print(group)
Output:
2014
Team Rank Year Points
0 Riders 1 2014 876
2 Devils 2 2014 863
4 Kings 3 2014 741
9 Royals 4 2014 701
.
.
.
.
2017
Team Rank Year Points
7 Kings 1 2017 788
11 Riders 2 2017 690
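If only one group is needed, looping over all of them is not necessary. The sketch below uses a small stand-in frame (`df_demo`, a hypothetical name with a few rows in the same style as the data above) to show `size()` for counting the rows per group and `get_group()` for pulling out a single group.

```python
import pandas as pd

# Small stand-in frame for illustration only
df_demo = pd.DataFrame({
    'Team': ['Riders', 'Devils', 'Kings', 'Riders'],
    'Year': [2014, 2014, 2015, 2015],
    'Points': [876, 863, 812, 789],
})
grouped_demo = df_demo.groupby('Year')

# Number of rows in each group
sizes = grouped_demo.size()

# Fetch the rows of one group directly by its key
one_year = grouped_demo.get_group(2014)

print(sizes)
print(one_year)
```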
Use aggregate functions
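# groupby() accepts a single key; passing a list of keys produces a two-level index
# as_index controls whether the group keys become the index of the result; it defaults to True
df.groupby('Team').agg([np.sum, np.mean])
Output: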
        Rank              Year              Points
         sum      mean     sum         mean    sum        mean
Team
Devils     5  2.500000    4029  2014.500000   1536  768.000000
Kings      5  1.666667    6047  2015.666667   2285  761.666667
Riders     7  1.750000    8062  2015.500000   3049  762.250000
Royals     5  2.500000    4029  2014.500000   1505  752.500000
kings      4  4.000000    2015  2015.000000    812  812.000000
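The two groupby() options mentioned in the comments above can be seen side by side in a short sketch. The frame `df_demo` is a hypothetical three-row stand-in, not the full data set: with the default `as_index=True` the group key becomes the index, with `as_index=False` it stays an ordinary column, and a list of keys gives a two-level index.

```python
import pandas as pd

# Small stand-in frame for illustration only
df_demo = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils'],
    'Year': [2014, 2015, 2014],
    'Points': [876, 789, 863],
})

# Default (as_index=True): 'Team' becomes the index of the result
by_index = df_demo.groupby('Team')['Points'].sum()

# as_index=False: 'Team' is kept as a regular column
flat = df_demo.groupby('Team', as_index=False)['Points'].sum()

# A list of keys produces a two-level (Team, Year) index
two_level = df_demo.groupby(['Team', 'Year'])['Points'].sum()

print(by_index)
print(flat)
print(two_level)
```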
You can also select a single column of the aggregated result, for example the Rank column:
df.groupby('Team').agg([np.sum,np.mean])['Rank']
Note that grouping by more than one key makes the index of the aggregated result a multi-level index (MultiIndex):
mult = df.groupby(['Team','Year']).agg([np.sum,np.mean])
print(mult.index)
The output is:
MultiIndex([('Devils', 2014),
('Devils', 2015),
( 'Kings', 2014),
( 'Kings', 2016),
( 'Kings', 2017),
('Riders', 2014),
('Riders', 2015),
('Riders', 2016),
('Riders', 2017),
('Royals', 2014),
('Royals', 2015),
( 'kings', 2015)],
names=['Team', 'Year'])
Since this is a multi-level index, how do we look up values in it? The lookup still uses .loc:
- When a tuple (K1, K2) is passed in, it queries across two index levels: K1 is matched against the first level and K2 against the second
- When a list [K1, K2] is passed in, both keys are matched against the same (first) level
mult.loc[('Devils', 2015)]
mult.loc[['Devils', 'Riders']]
The output is:
Rank    sum      3
        mean     3
Points  sum    673
        mean   673
Name: (Devils, 2015), dtype: int64
---------------------------------------
             Rank      Points
              sum mean    sum mean
Team   Year
Devils 2014     2    2    863  863
       2015     3    3    673  673
Riders 2014     1    1    876  876
       2015     2    2    789  789
       2016     2    2    694  694
       2017     2    2    690  690
You can also pick out a single column in the same lookup:
mult.loc[['Devils','Riders'],'Rank']
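The tuple-versus-list rule is easier to see on a small Series with a hand-built MultiIndex. This is a minimal sketch with made-up names (`idx`, `s`). It also shows `xs()`, which is not used in the text above but is one way to select on the second level alone.

```python
import pandas as pd

# Hand-built two-level index, for illustration only
idx = pd.MultiIndex.from_tuples(
    [('Devils', 2014), ('Devils', 2015), ('Riders', 2014)],
    names=['Team', 'Year'],
)
s = pd.Series([863, 673, 876], index=idx, name='Points')

# Tuple: one key per level, so this addresses a single entry
single = s.loc[('Devils', 2015)]

# List: several keys on the first level, so this returns all their rows
several = s.loc[['Devils', 'Riders']]

# Cross-section on the second level only
by_year = s.xs(2014, level='Year')

print(single)
print(several)
print(by_year)
```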
Use the filter function
filtered = df.groupby('Team').filter(lambda x: len(x) >= 3)
The output is as follows (teams that appear at least three times):
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
4 Kings 3 2014 741
6 Kings 1 2016 756
7 Kings 1 2017 788
8 Riders 2 2016 694
11 Riders 2 2017 690
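The function passed to filter() does not have to count rows. Any function that returns a single True or False per group works. As a sketch on a hypothetical four-row frame (`df_demo`), the following keeps only teams whose mean Points exceeds 800. The threshold is arbitrary and chosen just for the example.

```python
import pandas as pd

# Small stand-in frame for illustration only
df_demo = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Points': [876, 789, 863, 673],
})

# Keep every row of a team if that team's mean Points is above 800
# (Riders: 832.5, Devils: 768.0)
high = df_demo.groupby('Team').filter(lambda g: g['Points'].mean() > 800)

print(high)
```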
Use transformation functions
grouped = df.groupby('Team')
# Standardize each value within its team, then scale by 10
score = lambda x: (x - x.mean()) / x.std() * 10
print(grouped.transform(score))
The output is as follows:
Points Rank Year
0 12.843272 -15.000000 -11.618950
1 3.020286 5.000000 -3.872983
2 7.071068 -7.071068 -7.071068
.
.
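Because transform() returns a result aligned with the original rows, it can be assigned straight back as a new column. The sketch below uses a hypothetical frame `df_demo` and an invented column name `diff_from_team_mean`. It broadcasts each team's mean back to its rows and subtracts it.

```python
import pandas as pd

# Small stand-in frame for illustration only
df_demo = pd.DataFrame({
    'Team': ['Devils', 'Devils', 'Riders', 'Riders'],
    'Points': [863, 673, 876, 789],
})

# Each row receives the mean of its own team
team_mean = df_demo.groupby('Team')['Points'].transform('mean')

# The result has the same index as df_demo, so it can become a column
df_demo['diff_from_team_mean'] = df_demo['Points'] - team_mean

print(df_demo)
```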
The key point is to keep track of the index and column levels that an aggregation produces!
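One way to avoid a second column level is named aggregation, where each output column gets an explicit name. This is a sketch on a hypothetical frame `df_demo`, and the output names `points_sum` and `points_mean` are invented for the example.

```python
import pandas as pd

# Small stand-in frame for illustration only
df_demo = pd.DataFrame({
    'Team': ['Devils', 'Devils', 'Riders'],
    'Points': [863, 673, 876],
})

# A list of functions gives two column levels: (column, function)
nested = df_demo.groupby('Team').agg(['sum', 'mean'])

# Named aggregation gives a single, flat level of column names
flat = df_demo.groupby('Team').agg(
    points_sum=('Points', 'sum'),
    points_mean=('Points', 'mean'),
)

print(nested.columns.nlevels)
print(flat)
```

With flat column names, the result can be indexed like any ordinary DataFrame, for example `flat['points_sum']`.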