DataWhale & Pandas (four, grouping)
Study outline:
Table of Contents
DataWhale & Pandas (four, grouping)
Example: Fill missing values by group
1. Grouping mode and its objects
1.2. The nature of grouping basis
1.4. Three operations of grouping
2.1 Built-in aggregate functions
Three, transformation and filtering
3.1. Transform function and transform method
3.2. Group Indexing and Filtering
Ex2: implement the transform function
Supplement:
Grouping statistics
The grouped object supports almost all of the DataFrame statistical methods; see the section on mathematical and statistical methods.
grouped = s.groupby(level=0)  # group by index level 0: rows sharing the same index label form one group
print(grouped)
print(grouped.first(), '→ first: first non-NaN value\n')
print(grouped.last(), '→ last: last non-NaN value\n')
print(grouped.sum(), '→ sum: sum of non-NaN values\n')
print(grouped.mean(), '→ mean: mean of non-NaN values\n')
print(grouped.median(), '→ median: median of non-NaN values\n')
print(grouped.count(), '→ count: number of non-NaN values\n')
print(grouped.min(), '→ min/max: smallest/largest non-NaN value\n')
print(grouped.std(), '→ std/var: standard deviation and variance of non-NaN values\n')
print(grouped.prod(), '→ prod: product of non-NaN values\n')
grouped.corr()
grouped.sem()
grouped.prod()
grouped.cummax()  # cumulative maximum within each group
grouped.cumsum()  # cumulative sum
grouped.mad()     # mean absolute deviation
# groupby-specific helpers
df.groupby('team').ngroups  # 5, the number of groups
df.groupby('team').ngroup() # the group number of each row
df.groupby('team').first()  # first row of each group
df.groupby('team').last()   # last row of each group
# cumcount numbers the members within each group, in ascending
# or descending order; it returns each element's position in its group
grouped.cumcount(ascending=False)
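A minimal sketch of how cumcount numbers group members, using a hypothetical `team` column (the data below is made up for illustration):

```python
import pandas as pd

# cumcount gives each row its position within its group,
# counting up by default or down with ascending=False
df = pd.DataFrame({'team': ['A', 'A', 'B', 'A', 'B']})
g = df.groupby('team')
print(g.cumcount().tolist())                 # [0, 1, 0, 2, 1]
print(g.cumcount(ascending=False).tolist())  # [2, 1, 1, 0, 0]
```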
Group by dictionary or Series
df = pd.DataFrame(np.arange(16).reshape(4,4),columns = ['a','b','c','d'])
print(df)
mapping = {'a':'one','b':'one','c':'two','d':'two','e':'three'}
by_column = df.groupby(mapping, axis = 1)
print(by_column.sum())
# in mapping, columns a and b map to 'one' and columns c and d map to 'two': grouping by a dict
s = pd.Series(mapping)
print(s,'\n')
print(s.groupby(s).count())
# in s, index labels a and b map to 'one' and c and d map to 'two': grouping by a Series
Grouping-iterable objects
df = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'), type(df.groupby('X')))
print('-----')
print(list(df.groupby('X')), '→ iterable object, can be turned into a list directly\n')
print(list(df.groupby('X'))[0], '→ each group is shown as a tuple\n')
for n,g in df.groupby('X'):
print(n)
print(g)
print('###')
print('-----')
# n is the group name, g is the grouped DataFrame
print(df.groupby(['X']).get_group('A'),'\n')
print(df.groupby(['X']).get_group('B'),'\n')
print('-----')
# .get_group() extracts one group after grouping
grouped = df.groupby(['X'])
print(grouped.groups)
print(grouped.groups['A']) # equivalently: df.groupby('X').groups['A']
print('-----')
# .groups: returns the groups as a dict-like mapping
# its elements can be looked up with dict-style indexing
sz = grouped.size()
print(sz,type(sz))
print('-----')
# .size(): the length of each group
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
print(df)
print()
print(df.groupby(['A','B']))
print()
grouped = df.groupby(['A','B']).groups
print(grouped)
print()
print(grouped[('foo', 'three')])
# grouping by two columns
Aggregations
.aggregate(), abbreviated as .agg(), applies statistical methods to the grouped object; it also supports specifying different statistical methods per column.
# apply one aggregation method to all columns
df.groupby('team').aggregate(sum)
df.groupby('team').agg(sum)
grouped.agg(np.size)
grouped['Q1'].agg(np.mean)
## Multiple aggregation methods
# apply several methods to all columns
grouped.agg([np.sum, np.mean, np.std])
# apply several methods to a specified column
grouped['Q1'].agg([sum, np.mean, np.std])
# several methods for one column, a single method for another
df.groupby('team').agg({'Q1': ['min', 'max'], 'Q2': 'sum'})
# named aggregation: the keyword is the new column name, the value gives the source column and method
df.groupby('team').Q1.agg(Mean='mean', Sum='sum')
df.groupby('team').agg(Mean=('Q1', 'mean'), Sum=('Q2', 'sum'))
df.groupby('team').agg(
Q1_max=pd.NamedAgg(column='Q1', aggfunc='max'),
Q2_min=pd.NamedAgg(column='Q2', aggfunc='min')
)
# if the column name is not a valid Python identifier, use the following
df.groupby('team').agg(**{
'1_max':pd.NamedAgg(column='Q1', aggfunc='max')})
# use functions for the aggregation result
# lambdas/functions can be used wherever a method name can
# define a function
def max_min(x):
    return x.max() - x.min()
df.groupby('team').Q1.agg(Mean='mean',
                          Sum='sum',
                          Diff=lambda x: x.max() - x.min(),
                          Max_min=max_min
                          )
# different methods for different columns
df.groupby('team').agg({'Q1': sum,      # sum
                        'Q2': 'count',  # count
                        'Q3': 'mean',   # mean
                        'Q4': max})     # max
# apply a function to the grouped object
# define the function
def max_min(var):
    return var.max() - var.min()
# call it
df.groupby('team').agg(max_min)
Filtration
filter()
filter() returns the rows of the original DataFrame that belong to the groups satisfying the condition:
# groups with at least 3 rows
df.groupby('team').filter(lambda x: len(x) >= 3)
# groups where at least one Q1 score exceeds 97
df.groupby(['team']).filter(lambda x: (x['Q1'] > 97).any())
# groups where the mean of every column is at least 60
df.groupby(['team']).filter(lambda x: (x.mean() >= 60).all())
# groups whose Q1 scores sum to more than 1060
df.groupby('team').filter(lambda g: g.Q1.sum() > 1060)
Resample time series
Use resample on a grouped object to group by time:
idx = pd.date_range('1/1/2000', periods=100, freq='T')
df = pd.DataFrame(data=100 * [range(2)],
index=idx,
columns=['a', 'b'])
# aggregate every three periods (one minute per period)
df.groupby('a').resample('3T').sum()
# group every 30 seconds
df.groupby('a').resample('30S').sum()
# group by month
df.groupby('a').resample('M').sum()
# label each interval by its right edge
df.groupby('a').resample('3T', closed='right').sum()
Example: Fill missing values by group
The source data contains missing values. Within each of the two groups marked by the 类别 (category) column, every column has exactly one non-missing value; the missing values should be filled with that value, group by group.
Source data:
类别 col1 col2 col3
A NaN NaN NaN
A d NaN NaN
A NaN NaN NaN
A NaN c e
B NaN Y Z
B NaN NaN NaN
B X NaN NaN
After filling the missing values, the data should look like this:
类别 col1 col2 col3
0 A d c e
1 A d c e
2 A d c e
3 A d c e
4 B X Y Z
5 B X Y Z
6 B X Y Z
Ideas
Since the data must be processed by group, use Pandas groupby; after grouping, fill with fillna, applied through apply. Because fillna has no "fill with the only non-null value" mode, we combine method='bfill' (backward fill) with method='ffill' (forward fill): filling backward first and then forward covers every missing position.
Method one:
# the result is the target data described above
(df.groupby('类别')
   .apply(lambda x: x.fillna(method='bfill'))
   .groupby('类别')
   .apply(lambda x: x.fillna(method='ffill')))
Method two:
# variant one
(df.groupby('类别')
   .apply(lambda x: x.fillna(method='bfill').fillna(method='ffill')))
# variant two
(df.groupby('类别')
   .apply(lambda x: x.fillna(method='bfill'))
   .fillna(method='ffill'))
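A self-contained, runnable version of the idea above, reconstructing the sample data from the example (the grouping column 类别 is kept as in the source; `bfill()`/`ffill()` are the version-stable equivalents of `fillna(method=...)`):

```python
import numpy as np
import pandas as pd

# rebuild the source data from the example
df = pd.DataFrame({
    '类别': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'col1': [np.nan, 'd', np.nan, np.nan, np.nan, np.nan, 'X'],
    'col2': [np.nan, np.nan, np.nan, 'c', 'Y', np.nan, np.nan],
    'col3': [np.nan, np.nan, np.nan, 'e', 'Z', np.nan, np.nan],
})
cols = ['col1', 'col2', 'col3']
# backward fill then forward fill within each group
df[cols] = df.groupby('类别', group_keys=False)[cols].apply(
    lambda g: g.bfill().ffill())
print(df)
# every row in group A becomes d/c/e, every row in group B becomes X/Y/Z
```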
import numpy as np
import pandas as pd
1. Grouping mode and its objects
1.1. General mode of grouping
Grouping operations are extremely common in daily life, for example:
Group by gender and compute the average life expectancy of the population
Group by season and standardize the temperatures within each group
Group by class and filter out the classes whose average math score exceeds 80
df.groupby(grouping basis)[data to use].operation
E.g:
df.groupby('Gender')['Longevity'].mean()
If you want to calculate the median height by gender:
df.groupby('Gender')['Height'].median()
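A minimal runnable illustration of this general pattern; the data below is made up for the demo and is not from any real data set:

```python
import pandas as pd

# group by one column, select another, then apply an operation
df = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M'],
    'Height': [160, 175, 164, 180],
})
print(df.groupby('Gender')['Height'].median())
# F -> 162.0, M -> 177.5
```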
1.2. The nature of grouping basis
What if you need to group by multiple dimensions? Simply pass groupby a list of the corresponding column names. Moreover, the grouping basis need not be an existing column at all: it is essentially any sequence aligned with the rows.
# e.g. group students by whether their weight exceeds the overall mean,
# again computing the mean height. First write the grouping condition:
condition = df.Weight > df.Weight.mean()
# pass it into groupby
df.groupby(condition)['Height'].mean()
The specific group categories can be inspected with drop_duplicates: df[['School', 'Gender']].drop_duplicates()
1.3. Groupby Object
All the methods called in concrete grouping operations come from the groupby object in pandas:
gb = df.groupby(['School', 'Grade'])
# the ngroups attribute gives the number of groups
gb.ngroups
# the groups attribute returns a dict mapping group names to index lists
res = gb.groups
res.keys() # the values are long index lists, so only the keys are shown here
Count the number of elements in each group:
gb.size()
The rows of one group can be fetched directly with the get_group method; the exact group name must be known: gb.get_group(('Fudan University', 'Freshman')).iloc[:3, :3]
1.4. Three operations of grouping
The three major operations are aggregation, transformation and filtering, corresponding respectively to the agg, transform and filter methods.
Two, aggregate function
2.1 Built-in aggregate functions
According to the principle of returning a scalar value, the following functions are included:
max/min/mean/median/count/all/any/idxmax/idxmin/mad/nunique/skew/quantile/sum/std/var/sem/size/prod
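A tiny demo of a few of the built-in aggregations listed above, on a hypothetical two-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                   'score': [1, 3, 2, 2]})
gb = df.groupby('team')['score']
print(gb.nunique().tolist())  # [2, 1]  distinct values per group
print(gb.idxmax().tolist())   # [1, 2]  row label of each group's maximum
print(gb.sum().tolist())      # [4, 4]
```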
2.2 agg method
Although many convenient functions are defined on the groupby object, there are still the following inconveniences:
Cannot use multiple functions at the same time
Cannot use specific aggregate functions for specific columns
Cannot use custom aggregate functions
It is not possible to directly customize the column names of the results before aggregation
[A] Use multiple functions
When using multiple aggregate functions, you need to pass in the string corresponding to the built-in aggregate function in the form of a list. All the strings mentioned earlier are legal.
gb.agg(['sum', 'idxmax', 'skew'])
[B] Use specific aggregate functions for specific columns
Column-specific aggregation can be achieved by passing a dictionary to agg, where keys are column names and values are aggregation strings or lists of them:
gb.agg({'Height':['mean','max'], 'Weight':'count'})
[C] Use custom functions
Note that the argument passed to the custom function is each column of the data source; the computation is done column by column
gb.agg(lambda x: x.mean()-x.min()) # compute the range of height and weight
[D] Rename the aggregation result
Write the function position above as a tuple: the first element is the new name, the second is the original function, which may be an aggregation string or a custom function.
gb.agg([('range', lambda x: x.max()-x.min()), ('my_sum', 'sum')])
Note
When renaming a single aggregation applied to one or more columns, wrap it in square brackets; otherwise pandas cannot tell whether it is a new name or a misspelled built-in function string.
Three, transformation and filtering
3.1. Transform function and transform method
The return value of a transformation function is a sequence of the same length. The most commonly used built-in transformation functions are the cumulative functions: cumcount/cumsum/cumprod/cummax/cummin. Their use is similar to the aggregation functions, except that they perform the cumulative operation within each group.
gb.cummax().head()
gb.transform(lambda x: (x-x.mean())/x.std()).head()
- When using a custom transformation, the transform method is needed. The custom function receives each column of the data source as a Series, consistent with agg, and the final result is a DataFrame whose row and column indices match the data source.
- transform is not limited to returning sequences of the same length; it can also return a scalar, which is then broadcast to the entire group it belongs to.
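A short sketch of that scalar-broadcasting behavior, using a made-up DataFrame:

```python
import pandas as pd

# the lambda returns a scalar (the group mean), which transform
# broadcasts to every row of the corresponding group
df = pd.DataFrame({'team': ['A', 'A', 'B'], 'Q1': [1, 3, 10]})
out = df.groupby('team')['Q1'].transform(lambda x: x.mean())
print(out.tolist())  # [2.0, 2.0, 10.0]
```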
3.2. Group Indexing and Filtering
- Filtering here removes whole groups, while indexing filters rows. The return values discussed earlier, whether boolean lists, element lists or position lists, are all essentially row filters: rows that satisfy the condition are selected into the result table, others are not.
- Group filtering generalizes row filtering: if the rows of a group collectively return True, the group is kept; if False, the group is filtered out. All surviving groups are then concatenated with their rows into a DataFrame and returned.
- The groupby object defines a filter method for filtering groups. The custom function receives the data source DataFrame itself; in the groupby object defined in the earlier example, that is df[['Height', 'Weight']], so all DataFrame methods and attributes can be used inside the custom function, as long as it returns a boolean value.
Example: keep all groups whose size exceeds 100 in the original table:
gb.filter(lambda x: x.shape[0] > 100).head()
Four, grouping across columns
4.1 Introduction of apply
- Use the apply function to process multiple columns of data at the same time.
4.2. Use of apply
- The custom function passed to apply takes exactly the same input as with filter, but filter only allows boolean return values.
- Besides a scalar, the apply method can also return a one-dimensional Series or a two-dimensional DataFrame.
[A] Scalar case: the result is a Series whose index is consistent with the agg result
gb = df.groupby(['Gender','Test_Number'])[['Height','Weight']]
gb.apply(lambda x: 0)
# although this is a list, as a return value it is still treated as a scalar
gb.apply(lambda x: [0, 0])
[B] Series case: the result is a DataFrame whose row index is consistent with the scalar case and whose column index is the index of the returned Series
gb.apply(lambda x: pd.Series([0,0],index=['a','b']))
[C] DataFrame case: the result is a DataFrame whose row index adds the returned DataFrame's row index as the innermost level under each group's original agg-style index, and whose column index is the same as the returned DataFrame's column index
gb.apply(lambda x: pd.DataFrame(np.ones((2,2)),
index = ['a','b'],
columns=pd.Index([('w','x'),('y','z')])))
Note:
The flexibility of apply comes at the cost of some performance. Unless cross-column group processing is genuinely needed, prefer the specially designed groupby object methods; otherwise the performance gap can be large. Likewise, when using aggregation and transformation functions, prefer the built-ins: they are highly optimized and generally faster than custom functions.
Five, practice
Ex1: car data set
There is a car data set in which Brand, Disp., and HP represent the car brand, engine displacement, and engine output.
df = pd.read_csv('data/car.csv')
df.head(3)
First keep only the cars whose Country appears more than twice, i.e. remove a car if its Country occurs no more than twice in the whole data set. Then group by Country and compute the mean price, the coefficient of variation of the price, and the number of cars per Country. The coefficient of variation is the standard deviation divided by the mean; rename it CoV in the results.
Group by position in the table (top third, middle third, bottom third) and compute the mean of Price.
Group by Type and compute the maximum of Price and the minimum of HP respectively. The result will have a multi-level column index; merge it into a single-level index joined by underscores.
Group by Type and perform min-max normalization of HP within each group.
Group by Type and compute the correlation coefficient between Disp. and HP.
Ex2: implement the transform function
The groupby object is constructed as my_groupby(df, group_cols)
Support single-column and multi-column grouping
Support my_groupby(df)[col].transform(my_func) with scalar broadcasting
pandas' transform cannot compute across columns; please support this: still return a Series, but allow the col parameter to be a list of columns
No need to consider performance or exception handling; just implement the functions above, give test examples, and check that the results agree with pandas' transform
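One possible sketch of the first parts of this exercise (all names here, `my_groupby` and `MyGroupby`, are hypothetical; the multi-column `col` extension and exception handling are left out):

```python
import numpy as np
import pandas as pd

class MyGroupby:
    """Minimal groupby supporting [col].transform with scalar broadcasting."""

    def __init__(self, df, group_cols):
        self.df = df
        # accept a single column name or a list of names
        self.group_cols = ([group_cols] if isinstance(group_cols, str)
                           else list(group_cols))
        self.col = None

    def __getitem__(self, col):
        self.col = col
        return self

    def transform(self, func):
        # build a hashable group key per row
        keys = [tuple(row) for row in self.df[self.group_cols].to_numpy()]
        out = pd.Series(index=self.df.index, dtype='float64')
        for key in dict.fromkeys(keys):  # unique keys, in order of appearance
            mask = np.array([k == key for k in keys])
            res = func(self.df.loc[mask, self.col])
            out[mask] = res  # a scalar result broadcasts to the whole group
        return out


def my_groupby(df, group_cols):
    return MyGroupby(df, group_cols)


# quick comparison with pandas
df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, 3.0, 5.0]})
mine = my_groupby(df, 'g')['v'].transform(lambda x: x.mean())
print(mine.tolist())  # [2.0, 2.0, 5.0]
```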
Experience:
There is a small note in the souvenir of the National Award, which says:
- With sleepless minutes and seconds in the study room
- Endless sweat and tears on the field
- One day one month one year
- You look through the book window--
- Lingyun's ambition, the posture of hard work, is now you
- The backbone of the country, the light of the world is your future