DataWhale & Pandas (four, grouping)


Pandas Learning Manual


Study outline: 


Table of Contents

supplement:

Grouping statistics

 Group by dictionary or Series

Grouping-iterable objects

Aggregations

Filtration

Resample time series

Example: Fill missing values by group

Ideas

Method one:

Method two:

1. Grouping mode and its objects

1.1. General mode of grouping

1.2. The nature of grouping basis

1.3. Groupby Object

1.4. Three operations of grouping

Two, aggregate function

2.1 Built-in aggregate functions

2.2 agg method

note

Three, transformation and filtering

3.1. Transform function and transform method

3.2. Group Indexing and Filtering

Four, grouping across columns

4.1 Introduction of apply

4.2. Use of apply

note:

Five, practice

Ex1: car data set

Ex2: implement the transform function

Experience: 


supplement:

Grouping statistics

The grouped object supports almost all of the DataFrame's statistical methods; see the section on mathematical and statistical methods.
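The snippets below assume that NumPy and pandas are imported, that s is a Series with repeated index labels, and that the later df.groupby('team') lines refer to a DataFrame with a team column. A minimal hypothetical setup for s:

import numpy as np
import pandas as pd

# hypothetical example data: a Series with repeated index labels
s = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'a', 'b', 'b', 'c', 'c'])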

grouped = s.groupby(level=0)  # group by index level 0: rows sharing the same index label form one group
print(grouped)
print(grouped.first(),'→ first: first non-NaN value\n')
print(grouped.last(),'→ last: last non-NaN value\n')
print(grouped.sum(),'→ sum: sum of non-NaN values\n')
print(grouped.mean(),'→ mean: mean of non-NaN values\n')
print(grouped.median(),'→ median: median of non-NaN values\n')
print(grouped.count(),'→ count: number of non-NaN values\n')
print(grouped.min(),'→ min/max: minimum and maximum of non-NaN values\n')
print(grouped.std(),'→ std/var: standard deviation and variance of non-NaN values\n')
print(grouped.prod(),'→ prod: product of non-NaN values\n')

grouped.corr()
grouped.sem()
grouped.prod()
grouped.cummax() # cumulative maximum within each group
grouped.cumsum() # cumulative sum within each group
grouped.mad() # mean absolute deviation

# groupby-specific convenience methods
df.groupby('team').ngroups # number of groups (here 5)
df.groupby('team').ngroup() # group number of each row
df.groupby('team').first() # first row of each group
df.groupby('team').last() # last row of each group

# cumcount: number the members within each group, in ascending or descending order
# returns, for each element, its position within its group
grouped.cumcount(ascending=False)

 Group by dictionary or Series

df = pd.DataFrame(np.arange(16).reshape(4,4),columns = ['a','b','c','d'])
print(df)
mapping = {'a':'one','b':'one','c':'two','d':'two','e':'three'}
by_column = df.groupby(mapping, axis = 1)
print(by_column.sum())
# in mapping, columns a and b map to 'one' and c and d map to 'two'; grouping by a dict
s = pd.Series(mapping)
print(s,'\n')
print(s.groupby(s).count())
# in s, index labels a and b map to 'one' and c and d map to 'two'; grouping by a Series

Grouping-iterable objects

df = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'), type(df.groupby('X')))
print('-----')

print(list(df.groupby('X')), '→ an iterable object; can be turned into a list directly\n')
print(list(df.groupby('X'))[0], '→ each element is shown as a tuple\n')
for n,g in df.groupby('X'):
    print(n)
    print(g)
    print('###')
print('-----')
# n is the group name, g is the sub-DataFrame of that group

print(df.groupby(['X']).get_group('A'),'\n')
print(df.groupby(['X']).get_group('B'),'\n')
print('-----')
# .get_group() extracts a single group

grouped = df.groupby(['X'])
print(grouped.groups)
print(grouped.groups['A'])  # equivalently: df.groupby('X').groups['A']
print('-----')
# .groups: the grouping result as a dict-like mapping from group name to row index
# its elements can be looked up with dictionary-style indexing

sz = grouped.size()
print(sz,type(sz))
print('-----')
# .size(): the number of rows in each group

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print(df)
print()
print(df.groupby(['A','B']))
print()
grouped = df.groupby(['A','B']).groups

print(grouped)
print()
print(grouped[('foo', 'three')])
# group by two columns

Aggregations

.aggregate() can be abbreviated as .agg(). It applies statistical methods to the grouped object, and it also supports applying different methods to different columns.

# apply one aggregation method to all columns
df.groupby('team').aggregate(sum)
df.groupby('team').agg(sum)
grouped.agg(np.size)
grouped['Q1'].agg(np.mean)

## multiple aggregation methods
# multiple aggregation methods for all columns
grouped.agg([np.sum, np.mean, np.std])
# multiple aggregation methods for a specified column
grouped['Q1'].agg([sum, np.mean, np.std])
# multiple methods for one column, a single method for another
df.groupby('team').agg({'Q1': ['min', 'max'], 'Q2': 'sum'})

# name the result columns; each name maps to an original column and a method
df.groupby('team').Q1.agg(Mean='mean', Sum='sum')
df.groupby('team').agg(Mean=('Q1', 'mean'), Sum=('Q2', 'sum'))
df.groupby('team').agg(
    Q1_max=pd.NamedAgg(column='Q1', aggfunc='max'),
    Q2_min=pd.NamedAgg(column='Q2', aggfunc='min')
)
# if a result name is not a valid Python identifier, use dict unpacking
df.groupby('team').agg(**{
    '1_max':pd.NamedAgg(column='Q1', aggfunc='max')})

# use custom functions in the aggregation result
# lambdas and named functions can be used anywhere a method name can
# define a function
def max_min(x):
    return x.max() - x.min()
df.groupby('team').Q1.agg(Mean='mean',
                          Sum='sum',
                          Diff=lambda x: x.max() - x.min(),
                          Max_min=max_min
                         )

# different aggregation methods for different columns
df.groupby('team').agg({'Q1': sum,  # sum
                        'Q2': 'count', # count
                        'Q3':'mean', # mean
                        'Q4': max}) # max

# apply a custom function to the grouped object
# define the function
def max_min(var):
    return var.max() - var.min()
# call the function
df.groupby('team').agg(max_min)

 Filtration

filter() keeps or drops whole groups: the result is the original DataFrame restricted to the rows of the groups that satisfy the condition, with all the detail rows of those groups kept:

# groups with at least 3 rows
df.groupby('team').filter(lambda x: len(x) >= 3)
# groups where at least one Q1 score is greater than 97
df.groupby(['team']).filter(lambda x: (x['Q1'] > 97).any())
# groups where the mean of every column is no less than 60
df.groupby(['team']).filter(lambda x: (x.mean() >= 60).all())
# groups whose Q1 scores sum to more than 1060
df.groupby('team').filter(lambda g: g.Q1.sum() > 1060)

Resample time series

Group by time using resample on the grouped object (the legacy TimeGrouper has been superseded by pd.Grouper and resample):

idx = pd.date_range('1/1/2000', periods=100, freq='T')
df = pd.DataFrame(data=100 * [range(2)],
                  index=idx,
                  columns=['a', 'b'])

# aggregate every 3 periods (one period = one minute)
df.groupby('a').resample('3T').sum()
# group every 30 seconds
df.groupby('a').resample('30S').sum()
# group by month
df.groupby('a').resample('M').sum()
# close the intervals on the right edge
df.groupby('a').resample('3T', closed='right').sum()
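An equivalent way to express the time grouping, assuming the df built above, is to pass pd.Grouper (the successor of the legacy TimeGrouper) alongside the column key:

# same 3-minute grouping expressed with pd.Grouper on the DatetimeIndex
df.groupby(['a', pd.Grouper(freq='3T')]).sum()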

Example: Fill missing values by group

In the source data below, the 类别 (category) column marks two groups. Within each group every column has exactly one non-missing value, and the missing entries must be filled with that value, group by group.

Source data:

类别 col1 col2 col3
A  NaN  NaN  NaN
A    d  NaN  NaN
A  NaN  NaN  NaN
A  NaN    c    e
B  NaN    Y    Z
B  NaN  NaN  NaN
B    X  NaN  NaN
After filling the missing values, the result should look like this:

  类别 col1 col2 col3
0  A    d    c    e
1  A    d    c    e
2  A    d    c    e
3  A    d    c    e
4  B    X    Y    Z
5  B    X    Y    Z
6  B    X    Y    Z

Ideas

Since the data must be processed group by group, use pandas groupby. After grouping, fill the values with fillna, calling it via apply.

fillna has no "fill with the only non-missing value" mode, so we first backward-fill with method='bfill' and then forward-fill with method='ffill'; together the two passes fill every missing position.
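For reference, a minimal sketch that reconstructs the source data shown above (column names follow the example):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    '类别': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'col1': [np.nan, 'd', np.nan, np.nan, np.nan, np.nan, 'X'],
    'col2': [np.nan, np.nan, np.nan, 'c', 'Y', np.nan, np.nan],
    'col3': [np.nan, np.nan, np.nan, 'e', 'Z', np.nan, np.nan],
})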

Method one:

# the result is the target data required above
(df.groupby('类别')
   .apply(lambda x: x.fillna(method='bfill'))
   .groupby('类别')
   .apply(lambda x: x.fillna(method='ffill')))

Method two:

# variant 1
(df.groupby('类别')
   .apply(lambda x: x.fillna(method='bfill').fillna(method='ffill')))
# variant 2
(df.groupby('类别')
   .apply(lambda x: x.fillna(method='bfill'))
   .fillna(method='ffill'))
import numpy as np
import pandas as pd
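The examples in the rest of these notes use a student dataset with columns such as School, Grade, Gender, Test_Number, Height and Weight. A minimal hypothetical stand-in (not the course's actual data) that makes the snippets runnable:

# hypothetical stand-in for the student dataset assumed below
df = pd.DataFrame({
    'School': ['Fudan University', 'Fudan University', 'Peking University', 'Tsinghua University'],
    'Grade': ['Freshman', 'Senior', 'Freshman', 'Sophomore'],
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Test_Number': [1, 1, 2, 2],
    'Height': [160.5, 175.3, 158.9, 181.2],
    'Weight': [50.0, 68.0, 46.0, 70.0],
})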

1. Grouping mode and its objects

1.1. General mode of grouping

Grouping operations are extremely widely used in daily life, for example:

  • Group by gender and compute the average life expectancy of the population

  • Group by season and standardize the temperatures within each season

  • Group by class and filter out the classes whose average math score exceeds 80

df.groupby(<grouping key>)[<data columns>].<operation>()

E.g:

df.groupby('Gender')['Longevity'].mean()

If you want to calculate the median height by gender:

df.groupby('Gender')['Height'].median()

1.2. The nature of grouping basis

What if you need to group by multiple dimensions? Simply pass groupby a list of the corresponding column names.
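For example, grouping by both school and gender and computing the mean height (a sketch using the same columns as the rest of this part):

df.groupby(['School', 'Gender'])['Height'].mean()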

So far the grouping key passed to groupby has been a column name, but in essence it is just a sequence of values used to split the rows; it does not have to come from an existing column.

# e.g. group by whether a student's weight exceeds the overall mean,
# again computing the mean height

# first write down the grouping condition:
condition = df.Weight > df.Weight.mean()
# pass it to groupby
df.groupby(condition)['Height'].mean()

The distinct group categories can be inspected with drop_duplicates:

df[['School', 'Gender']].drop_duplicates()

1.3. Groupby Object

The methods used in the concrete grouping operations all come from the pandas groupby object:

gb = df.groupby(['School', 'Grade'])
# the ngroups attribute gives the number of groups
gb.ngroups
# the groups attribute returns a dict mapping each group name to a list of row indices
res = gb.groups
res.keys() # the values are index lists with many elements, so only the keys are shown here

Count the number of elements in each group:

gb.size()

The rows belonging to a particular group can be obtained directly with the get_group method; the exact group name must be known:

gb.get_group(('Fudan University', 'Freshman')).iloc[:3, :3]

1.4. Three operations of grouping

The three major grouping operations are aggregation, transformation and filtration, corresponding to the agg, transform and filter methods respectively.

Two, aggregate function

2.1 Built-in aggregate functions

Built-in aggregations follow the principle of returning a scalar value; they include the following functions:

 max/min/mean/median/count/all/any/idxmax/idxmin/mad/nunique/skew/quantile/sum/std/var/sem/size/prod 
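These can be called directly on the groupby object. A short sketch, restricted to the two numeric columns that the filtering section later refers to (df[['Height', 'Weight']]):

gb = df.groupby('Gender')[['Height', 'Weight']]
gb.idxmin()        # index label of each group's minimum, per column
gb.quantile(0.95)  # 0.95 quantile within each group, per column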

2.2 agg method

Although many convenient functions are defined on the groupby object, they still have the following limitations:

  • Cannot use multiple functions at the same time

  • Cannot use specific aggregate functions for specific columns

  • Cannot use custom aggregate functions

  • The column names of the result cannot be customized directly at aggregation time

[A] Use multiple functions

To use multiple aggregate functions at once, pass a list of the strings naming the built-in aggregations; all the strings listed earlier are valid.

gb.agg(['sum', 'idxmax', 'skew'])

[B] Use specific aggregate functions for specific columns

Specific columns can be paired with specific aggregations by passing agg a dictionary whose keys are column names and whose values are aggregation strings or lists of such strings:

gb.agg({'Height':['mean','max'], 'Weight':'count'})

[C] Use custom functions

Note that the argument passed to a custom function is each column of the data source; the computation is carried out column by column.

gb.agg(lambda x: x.mean()-x.min())  # mean minus min of Height and Weight within each group

[D] Rename the aggregation result

Write each function as a tuple: the first element is the new name and the second is the original function, which may be an aggregation string or a custom function.

 gb.agg([('range', lambda x: x.max()-x.min()), ('my_sum', 'sum')])

note

When applying a single renamed aggregation to one or more columns, the tuple must still be wrapped in square brackets; otherwise pandas cannot tell whether the string is a new name or a (possibly misspelled) built-in function name.
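A sketch of the difference, assuming the same gb as above:

gb.agg([('my_sum', 'sum')])    # list form: 'my_sum' is a new column name
# gb.agg(('my_sum', 'sum'))    # without the list, the tuple is not read as a rename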

Three, transformation and filtering

3.1. Transform function and transform method

A transformation function returns a sequence of the same length as its input. The most commonly used built-in transformations are the cumulative functions cumcount/cumsum/cumprod/cummax/cummin; they are used like the aggregation functions, except that the accumulation is carried out within each group.

gb.cummax().head()
gb.transform(lambda x: (x - x.mean()) / x.std()).head()

  • When using a custom transformation, use the transform method. The custom function receives each column of the data source as a Series, the same input type as in agg, and the final result is a DataFrame whose row and column index match the data source.
  • transform is meant to return a sequence of the same length as the group, but it may in fact also return a scalar, in which case the value is broadcast to the whole group, as sketched below.
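A sketch of the scalar broadcast with the same gb: the group mean is repeated on every row of its group, so the custom and built-in forms below agree.

gb.transform(lambda x: x.mean()).head()  # scalar per group, broadcast to each row
gb.transform('mean').head()              # built-in equivalent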

3.2. Group Indexing and Filtering

  • Filtering in grouping filters whole groups, whereas indexing filters rows. Whatever an indexer returns (a boolean list, an element list or a position list), it is essentially row filtering: a row is selected into the result table if it satisfies the condition, otherwise it is not.
  • Group filtering, as a generalization of row filtering, means that if the condition evaluated on a group returns True the group is kept, and if it returns False the group is dropped; finally the rows of all kept groups are concatenated and returned as a DataFrame.
  • The groupby object defines a filter method for filtering groups. The argument passed to the custom function is the group's own sub-DataFrame of the data source; in the groupby object defined earlier it is a slice of df[['Height', 'Weight']], so all DataFrame methods and attributes can be used inside the custom function. The only requirement is that the custom function returns a Boolean value.

Example: keep all the groups whose size in the original table is greater than 100:

gb.filter(lambda x: x.shape[0] > 100).head()

Four, grouping across columns

4.1 Introduction of apply

  • The apply function is used to process multiple columns of data at the same time

4.2. Use of apply

  • The argument passed to the custom function in apply is exactly the same as in filter, but filter is only allowed to return a Boolean value.
  • Besides a scalar, the function passed to apply can also return a one-dimensional Series or a two-dimensional DataFrame.

[A] Scalar case: the result is a Series whose index is consistent with the agg result

gb = df.groupby(['Gender','Test_Number'])[['Height','Weight']]

gb.apply(lambda x: 0)

# although this is a list, as a return value it is still treated as a scalar

gb.apply(lambda x: [0, 0])

[B] Series case: the result is a DataFrame whose row index is the same as in the scalar case and whose column index is the index of the returned Series

gb.apply(lambda x: pd.Series([0,0],index=['a','b']))

[C] DataFrame case: the result is a DataFrame; the row index of the returned DataFrame is appended as the innermost level under each group's original agg result index, and the column index is the column index of the returned DataFrame

gb.apply(lambda x: pd.DataFrame(np.ones((2, 2)),
                                index=['a', 'b'],
                                columns=pd.Index([('w', 'x'), ('y', 'z')])))

note:

The flexibility of apply comes at the cost of some performance. Unless cross-column group processing is really needed, the other purpose-built groupby methods should be used; otherwise the performance gap can be large. Likewise, when aggregating or transforming, prefer the built-in functions: they are highly optimized and generally faster than custom functions.
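For instance, when only a per-column maximum is needed, the built-in method should be preferred over apply (a sketch with the same gb):

gb.max()                       # built-in, highly optimized
gb.apply(lambda x: x.max())    # same values here, but noticeably slower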

Five, practice

Ex1: car data set

There is a car data set in which Brand, Disp. and HP represent the car brand, engine displacement and engine power.

df = pd.read_csv('data/car.csv')
df.head(3)
  • First keep only the cars whose Country appears more than 2 times in the data set, i.e. remove a car if its Country occurs no more than 2 times overall. Then group by Country and compute the mean price, the coefficient of variation of the price, and the number of cars in each Country. The coefficient of variation is the standard deviation divided by the mean; rename it CoV in the result.

  • Group the rows by their position in the table (the top third, the middle third and the bottom third) and compute the mean Price of each part.

  • Group by Type and compute the maximum and minimum of Price and HP respectively. The result has a multi-level column index; merge it into a single level joined by an underscore.

  • Group by Type and apply min-max normalization to HP within each group.

  • Group by Type and compute the correlation coefficient between Disp. and HP.

Ex2: implement the transform function

  • The groupby object is constructed as my_groupby(df, group_cols)

  • Support both single-column and multi-column grouping

  • Support my_groupby(df)[col].transform(my_func), including scalar broadcasting

  • pandas' transform cannot compute across columns; add support for this, i.e. still return a Series but allow col to be a list of columns

  • No need to consider performance or exception handling; just implement the features above, provide test examples, and check that the results agree with pandas' transform

Experience: 

There is a small note in the souvenir of the National Award, which says:

  • With sleepless minutes and seconds in the study room
  • Endless sweat and tears on the field
  • One day one month one year
  • You look through the book window--
  • Lingyun's ambition, the posture of hard work, is now you
  • The backbone of the country, the light of the world is your future


Origin blog.csdn.net/adminkeys/article/details/111708271