Python pandas groupby: grouping, aggregating with mean, sum, and agg, and using transform to apply functions without changing the shape of the data


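1. groupby: group by a single column or by multiple columns;
2. Aggregate the grouped data with functions such as mean, sum, count, std, median, size, max, and min;
3. transform: apply a function without changing the shape of the data (each original element is replaced by the computed value).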


1. Grouping



Grouping is done mainly with the pandas groupby method. Other approaches can produce the same result, but groupby is more convenient, and it becomes very powerful once you are familiar with its options. The official pandas documentation lists its parameters, among them by, as_index, sort, and dropna.
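
As a rough illustration of a few commonly used groupby parameters (by, as_index, sort), the sketch below uses a small made-up DataFrame. The column names and values are hypothetical and are not taken from the article's data set.

```python
import pandas as pd

# Hypothetical toy data; not the article's air-quality files.
toy = pd.DataFrame({
    "type": ["PM2.5", "AQI", "AQI", "PM2.5"],
    "month": [1, 1, 2, 2],
    "value": [10.0, 20.0, 30.0, 45.0],
})

# by: the column (or list of columns) to group on.
# By default the group keys become the index of the result and are sorted.
by_type = toy.groupby(by="type")["value"].sum()

# as_index=False keeps the group keys as an ordinary column.
flat = toy.groupby("type", as_index=False)["value"].sum()

# sort=False keeps the groups in order of first appearance.
unsorted = toy.groupby("type", sort=False)["value"].sum()

print(by_type)
print(flat)
print(unsorted)
```

Leaving as_index at its default is convenient for lookups by key, while as_index=False is often easier when the result is going to be merged or exported.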

import pandas as pd
import glob
# Collect every file in the data directory
files = glob.glob('../../data/03表格数据处理Pandas/C3.7 数据的分组与聚合/*')
# Concatenate all files into one large DataFrame
df = pd.concat([pd.read_csv(f) for f in files])
# Drop columns that are entirely empty, then drop rows that contain any missing value
df = df.dropna(axis='columns', how='all').dropna(axis='index', how='any')
# Parse the date column (stored as YYYYMMDD) into datetime values
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
# Extract the month and the day of the month
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
# Drop the original date column
df = df.drop(columns='date')

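# Simple grouping
# A GroupBy object does not display its contents directly;
# iterating over it yields one (key, sub-DataFrame) pair per group
group = df.groupby('type')
for key, sub_df in group:
    print(key)
    print(sub_df)

# Select the rows whose type column equals a specific value
df[df['type']=='AQI']

# Grouping by multiple columns
# Each key is now a (type, month) tuple
group = df.groupby(['type', 'month'])
for key, sub_df in group:
    print(key)
    print(sub_df)

# Select the rows that satisfy conditions on several columns at once
df[(df['type']=='AQI') & (df['month']==1)]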
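
Looping is not the only way to look inside a GroupBy object. The sketch below, again on hypothetical toy data rather than the article's files, shows get_group and size alongside the equivalent boolean filter.

```python
import pandas as pd

# Hypothetical toy data standing in for the article's DataFrame.
toy = pd.DataFrame({
    "type": ["AQI", "PM2.5", "AQI", "PM2.5"],
    "month": [1, 1, 2, 2],
    "value": [20.0, 10.0, 30.0, 45.0],
})

group = toy.groupby(["type", "month"])

# Iterating yields (key, sub-DataFrame) pairs; with several grouping
# columns each key is a tuple.
keys = [key for key, sub_df in group]

# get_group returns the rows of a single group without a loop.
aqi_jan = group.get_group(("AQI", 1))

# size() reports the number of rows in each group.
sizes = group.size()

# A boolean filter selects the same rows as get_group.
same_rows = toy[(toy["type"] == "AQI") & (toy["month"] == 1)]

print(keys)
print(aqi_jan)
print(sizes)
```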


2. Aggregation

Aggregation means that, once the data has been grouped sensibly, an operation such as summing or transforming the values is applied to each group as needed. Aggregation is usually the end goal of data processing, and grouping is the preparation that makes the aggregation meaningful.


1. Aggregation using built-in functions
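# Keep only the rows whose type equals AQI, then group by hour and month.
# The constant 'type' column is dropped so that only numeric columns are aggregated
# (pandas 2.0 and later raise an error when averaging a string column).
group = df[df['type']=='AQI'].drop(columns='type').groupby(['hour', 'month'])
# Use the built-in mean method to get the average for each hour of each month
group.mean()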
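
The other built-in aggregations listed in the overview work the same way. The sketch below uses hypothetical toy data with one missing value to show how a few of them differ, in particular count (non-null values only) versus size (all rows).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in month 1.
toy = pd.DataFrame({
    "month": [1, 1, 1, 2, 2],
    "value": [10.0, np.nan, 30.0, 40.0, 60.0],
})

group = toy.groupby("month")["value"]

means = group.mean()      # missing values are skipped
totals = group.sum()      # sum of the non-null values
counts = group.count()    # number of non-null values per group
sizes = group.size()      # number of rows per group, NaN included
medians = group.median()  # median of the non-null values

print(pd.DataFrame({
    "mean": means, "sum": totals, "count": counts,
    "size": sizes, "median": medians,
}))
```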



2. Use agg for simple aggregation

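# Apply the same two functions to every column.
# String names are used instead of np.mean and np.std, so NumPy does not need to be imported
group.agg(['mean', 'std'])

# Pass a dictionary to apply different functions to different columns
group.agg({'东四': ['mean', 'std'], '天坛': ['min', 'max']})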
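
Passing a list or a dictionary of lists to agg produces two-level column labels. If flat, readable column names are preferred, pandas also supports named aggregation. The sketch below shows both on hypothetical toy data; station_a and station_b are made-up column names standing in for the station columns.

```python
import pandas as pd

# Hypothetical toy data; station_a and station_b are illustrative names.
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "station_a": [10.0, 30.0, 40.0, 60.0],
    "station_b": [5.0, 7.0, 9.0, 1.0],
})

group = toy.groupby("month")

# A list of functions yields a two-level column index:
# (column name, function name).
multi = group.agg(["mean", "std"])

# Named aggregation: output_name=(input_column, function).
named = group.agg(
    a_mean=("station_a", "mean"),
    b_max=("station_b", "max"),
)

print(multi)
print(named)
```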




3. The transform function

The transform method applies a function to a GroupBy object and returns a result that has the same shape as the original data.

The built-in aggregation methods and agg change the shape of the data, returning one row per group, as a comparison of the row and column counts before and after aggregation shows. transform leaves the shape unchanged: in effect, every value in the original data is replaced by the value computed for it.
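
The difference in shape can be checked directly. The following is a minimal sketch on hypothetical toy data, not the article's files.

```python
import pandas as pd

# Hypothetical toy data: two rows in month 1, three rows in month 2.
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "value": [10.0, 30.0, 40.0, 50.0, 60.0],
})

# agg collapses each group to a single row.
agg_result = toy.groupby("month")["value"].agg("mean")

# transform broadcasts each group's result back to every original row.
transform_result = toy.groupby("month")["value"].transform("mean")

print(agg_result.shape)        # one entry per group
print(transform_result.shape)  # one entry per original row
```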

import pandas as pd
import glob

files = glob.glob('../../data/03表格数据处理Pandas/C3.7 数据的分组与聚合/*')
df = pd.concat([pd.read_csv(f) for f in files])
df = df.dropna(axis='columns', how='all').dropna(axis='index', how='any')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df = df.drop(columns='date')
# Keep the AQI rows and drop the constant string column 'type' so only numeric columns remain
aqi = df[df['type']=='AQI'].drop(columns='type')
# Changes the shape of the data: one row per month
aqi.groupby(['month']).agg('mean')

# Keeps the shape of the data: built-in function
aqi.groupby(['month']).transform('mean')
# Keeps the shape of the data: anonymous function
aqi.groupby(['hour', 'month']).transform(lambda x: x - x.mean())
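
Because the result of transform keeps the original index, one common use is to attach group-level statistics to each row. The sketch below assumes hypothetical toy data and made-up column names (month_mean, deviation).

```python
import pandas as pd

# Hypothetical toy data; the new column names are illustrative only.
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "value": [10.0, 30.0, 40.0, 60.0],
})

by_month = toy.groupby("month")["value"]

# The transformed result lines up with the original rows, so it can be
# assigned straight back as a new column.
toy["month_mean"] = by_month.transform("mean")
toy["deviation"] = by_month.transform(lambda x: x - x.mean())

# Keep only the rows that lie above their own group's mean.
above_mean = toy[toy["value"] > toy["month_mean"]]

print(toy)
print(above_mean)
```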



Origin blog.csdn.net/qq_35240689/article/details/127073148