This article reviews and consolidates Pandas GroupBy step by step, with worked examples. Whether you're just starting out with Pandas and want to master its core functionality, or want to fill gaps in your understanding of .groupby(), it should prove useful for future work.
Data preparation
- The U.S. Congressional Dataset contains public information about historical members of Congress.
- The air quality dataset contains periodic gas sensor readings.
- The news dataset contains metadata for hundreds of thousands of news articles.
Example 1: US Congress Dataset
The goal is to explore historical data on members of the U.S. Congress.
import pandas as pd
dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}
df = pd.read_csv(
    "数据科学必备Pandas实用操作GroupBy数据分组详解/legislators-historical.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"],
)
The dataset contains members' first and last names, date of birth, gender, type ("rep" for the House, "sen" for the Senate), US state, and political party. The first few rows can be viewed with df.head().
df.head()
GroupBy Aggregation Operation
Question 1 (single-column aggregation): Across the entire history covered by the dataset, how many members of Congress has each state had? How should this be done?
SQL operations.
SELECT state, count(name)
FROM df
GROUP BY state
ORDER BY state;
Pandas operations.
n_by_state = df.groupby("state")["last_name"].count()
n_by_state.nlargest(10)
state
NY 1467
PA 1053
OH 676
IL 488
VA 433
MA 427
KY 373
CA 368
NJ 359
NC 356
Name: last_name, dtype: int64
Question 2 (multi-column aggregation): How do I break down the number of members of Congress by state and gender?
SQL operations.
SELECT state, gender, count(name)
FROM df
GROUP BY state, gender
ORDER BY state, gender;
Pandas operations.
n_by_state = df.groupby(["state", "gender"])["first_name"].count()
n_by_state.head(10)
state gender
AK M 16
AL F 3
M 203
AR F 5
M 112
...
WI M 196
WV F 1
M 119
WY F 2
M 38
Name: first_name, Length: 104, dtype: int64
How GroupBy works
.groupby() is actually a three-step process of split-apply-combine: split the table into groups, apply some operation to each smaller table, and combine the results.
The split step
A useful way to see the splits a GroupBy produces is to iterate over the GroupBy object.
by_state = df.groupby("state")

# View the first two rows of each group
for state, frame in by_state:
    print(f"First 2 rows for {state!r}")
    print("------------------------")
    print(frame.head(2), end="\n\n")
The .groups attribute maps each group name to the row labels belonging to that group. It behaves like a dictionary, so individual groups can be looked up by key.
by_state.groups["CO"]
Use .get_group() to retrieve the full sub-table for a single group.
by_state.get_group("CO")
The apply step
Apply the same operation (or callable) to each sub-table produced by the split step.
state, frame = next(iter(by_state))
state
'AK'
frame.head(3)
The combine step
Finally, Pandas combines the results of the apply step (here, one count per group) back into a single Series or DataFrame. Applied to the 'AK' group alone, the operation yields one value:
frame["first_name"].count()
17
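The three steps can be spelled out by hand on a small synthetic frame (the column names mimic the Congress dataset; the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AK", "AK", "AL", "AL", "AL"],
    "last_name": ["Ames", "Berg", "Cole", "Dunn", "Ewell"],
})

# Split: one sub-table per state
pieces = {state: frame for state, frame in df.groupby("state")}

# Apply: count the rows of each sub-table
counts = {state: frame["last_name"].count() for state, frame in pieces.items()}

# Combine: stitch the per-group results back into one Series
result = pd.Series(counts, name="last_name")
```

The `result` Series matches what the one-liner `df.groupby("state")["last_name"].count()` produces in a single call.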
Example 2: Air Quality Dataset
import pandas as pd
df = pd.read_excel(
    "数据科学必备Pandas实用操作GroupBy数据分组详解/AirQualityUCI.xlsx",
    parse_dates=[["Date", "Time"]],
)
df.rename(
    columns={
        "CO(GT)": "co",
        "Date_Time": "tstamp",
        "T": "temp_c",
        "RH": "rel_hum",
        "AH": "abs_hum",
    },
    inplace=True,
)
df.set_index("tstamp", inplace=True)
co is the average carbon monoxide reading for the hour, while temp_c, rel_hum, and abs_hum are the average temperature, relative humidity, and absolute humidity for the hour, respectively. Observations lasted from March 2004 to April 2005.
df.index.min()
Timestamp('2004-03-10 18:00:00')
df.index.max()
Timestamp('2005-04-04 14:00:00')
Grouping by a derived array
Group and aggregate by the day-of-week names (strings) derived from the index.
day_names = df.index.day_name()
day_names[:10]
Index(['Wednesday', 'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday',
'Wednesday', 'Thursday', 'Thursday', 'Thursday', 'Thursday'],
dtype='object', name='tstamp')
Question 1: Find the average carbon monoxide (co) reading for each day of the week.
df.groupby(day_names)["co"].mean()
tstamp
Friday -24.583259
Monday -30.063820
Saturday -27.126414
Sunday -35.432292
Thursday -35.806176
Tuesday -41.773864
Wednesday -44.917647
Name: co, dtype: float64
Question 2: Aggregate by day of the week and hour of the day.
hr = df.index.hour
df.groupby([day_names, hr])["co"].mean().rename_axis(["dow", "hr"])
dow hr
Friday 0 -30.517857
1 -30.792857
2 -31.158929
3 -31.398214
4 -92.416071
...
Wednesday 19 -28.662500
20 -28.916071
21 -29.710714
22 -30.378571
23 -30.516071
Name: co, Length: 168, dtype: float64
Question 3: Bin the temperature into discrete intervals and aggregate within each bin.
bins = pd.cut(df["temp_c"], bins=3, labels=("cool", "warm", "hot"))
df[["rel_hum", "abs_hum"]].groupby(bins).agg(["mean", "median"])
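Because pd.cut returns a Categorical aligned with the original index, it can be passed straight to .groupby(). A minimal self-contained sketch of the same idea on synthetic temperatures (not the air-quality file):

```python
import pandas as pd

temps = pd.DataFrame({
    "temp_c": [2.0, 5.0, 14.0, 16.0, 27.0, 30.0],
    "rel_hum": [80, 75, 60, 55, 40, 35],
})

# Slice the temperature range into 3 equal-width intervals with labels
bins = pd.cut(temps["temp_c"], bins=3, labels=("cool", "warm", "hot"))

# Group another column by the derived Categorical, not by an existing column
out = temps["rel_hum"].groupby(bins, observed=True).mean()
```

Each humidity value is assigned to the bin its row's temperature falls into, so `out` holds one mean per temperature band.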
Question 4: Aggregate by year and quarter.
df.groupby([df.index.year, df.index.quarter])["co"].agg(
    ["max", "min"]
).rename_axis(["year", "quarter"])
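Grouping by [year, quarter] pairs can also be expressed with a quarterly PeriodIndex. A sketch on a few synthetic timestamps (the values are illustrative):

```python
import pandas as pd

idx = pd.to_datetime([
    "2004-03-15", "2004-04-01", "2004-07-10", "2005-01-05",
])
s = pd.Series([1.0, 4.0, 2.0, 8.0], index=idx)

# Grouping by the [year, quarter] pair, as in the article ...
by_pair = s.groupby([s.index.year, s.index.quarter]).max()

# ... is close to grouping by a quarterly PeriodIndex in one step
by_period = s.groupby(s.index.to_period("Q")).max()
```

The first version yields a two-level (year, quarter) index; the second yields a single PeriodIndex, which is often more convenient for further time-series work.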
Example 3: News Aggregator Dataset
import datetime as dt
import pandas as pd
def parse_millisecond_timestamp(ts):
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(ts / 1000, tz=dt.timezone.utc)
df = pd.read_csv(
    "数据科学必备Pandas实用操作GroupBy数据分组详解/newsCorpora.csv",
    sep="\t",
    header=None,
    index_col=0,
    names=["title", "url", "outlet", "category", "cluster", "host", "tstamp"],
    parse_dates=["tstamp"],
    date_parser=parse_millisecond_timestamp,
    dtype={
        "outlet": "category",
        "category": "category",
        "cluster": "category",
        "host": "category",
    },
)
df.head()
The category column uses the codes b (business), t (technology), e (entertainment), and m (health).
Question 1: For each outlet, count the titles that contain a given keyword ("Fed"), and sort the results.
df.groupby("outlet", sort=False)["title"].apply(
    lambda ser: ser.str.contains("Fed").sum()
).nlargest(10)
outlet
Reuters 161
NASDAQ 103
Businessweek 93
Investing.com 66
Wall Street Journal (blog) 61
MarketWatch 56
Moneynews 55
Bloomberg 53
GlobalPost 51
Economic Times 44
Name: title, dtype: int64
Improve GroupBy performance
Going back to .groupby() .apply() again this pattern might not be optimal. What can happen with .apply() is that it will effectively execute a Python loop over each group. While the .groupby() .apply() patterns can provide some flexibility, it also prevents Pandas from using its Cython-based optimizations in other ways.
That is to say, is there a way to express operations in a vectorized way whenever .apply() is considered considered. In this case .groupby() can be used to accept not only one or more column names, but also many array-like structures.
First build a Boolean Series marking the titles that contain the keyword, then aggregate that Series by outlet.
import numpy as np

mentions_fed = df["title"].str.contains("Fed")
mentions_fed.groupby(
    df["outlet"], sort=False
).sum().nlargest(10).astype(np.uintc)
outlet
Reuters 161
NASDAQ 103
Businessweek 93
Investing.com 66
Wall Street Journal (blog) 61
MarketWatch 56
Moneynews 55
Bloomberg 53
GlobalPost 51
Economic Times 44
Name: title, dtype: uint32
Check that no rows were lost along the way.
df["outlet"].shape == mentions_fed.shape
True
Finally, compare the execution times of the two versions; the vectorized version is considerably faster.
# Version 1: using .apply()
df.groupby("outlet", sort=False)["title"].apply(
    lambda ser: ser.str.contains("Fed").sum()
).nlargest(10)

# Version 2: vectorized
mentions_fed = df["title"].str.contains("Fed")
mentions_fed.groupby(
    df["outlet"], sort=False
).sum().nlargest(10).astype(np.uintc)
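To measure the difference yourself, here is a rough timing harness on synthetic data (the outlet and title values are made up, and absolute timings vary by machine; only the relative gap is meaningful):

```python
import time
import numpy as np
import pandas as pd

# 2,000 outlets with 50 titles each; half the titles mention "Fed"
outlets = pd.Categorical(np.repeat([f"outlet_{i}" for i in range(2000)], 50))
df = pd.DataFrame({
    "outlet": outlets,
    "title": ["Fed watch", "other news"] * 50_000,
})

start = time.perf_counter()
res_apply = df.groupby("outlet", observed=True)["title"].apply(
    lambda ser: ser.str.contains("Fed").sum()  # Python loop: one call per group
)
t_apply = time.perf_counter() - start

start = time.perf_counter()
res_vector = df["title"].str.contains("Fed").groupby(
    df["outlet"], observed=True
).sum()  # str.contains runs once over the whole column
t_vector = time.perf_counter() - start
```

On most machines `t_vector` comes out noticeably smaller than `t_apply`, and the two result Series agree exactly.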
Pandas GroupBy method summary
Aggregate methods (also known as reduce methods)
These combine many data points into summary statistics about them. For example, taking the sum, mean, or median of 10 numbers yields a single number.
These include: .agg(), .aggregate(), .all(), .any(), .apply(), .corr(), .corrwith(), .count(), .cov(), .cumcount(), .cummax(), .cummin(), .cumprod(), .cumsum(), .describe(), .idxmax(), .idxmin(), .mad(), .max(), .mean(), .median(), .min(), .nunique(), .prod(), .sem(), .size(), .skew(), .std(), .sum(), .var().
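Several of these can be applied in one pass via .agg(). Named aggregation, sketched here on a small synthetic frame, also lets you control the output column names:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AK", "AK", "AL", "AL"],
    "age": [40, 60, 50, 70],
})

# Each keyword becomes an output column: name=(source column, function)
summary = df.groupby("state").agg(
    n=("age", "count"),
    mean_age=("age", "mean"),
    oldest=("age", "max"),
)
```

The result is one row per group with columns n, mean_age, and oldest, avoiding the nested column headers that a plain `.agg(["count", "mean", "max"])` would produce.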
Filter methods
These return a subset of the original DataFrame, most commonly by using .filter() to drop entire groups based on some statistic about the group and its sub-table. A number of methods that exclude particular rows from each group also fit this definition.
These include: .filter(), .first(), .head(), .last(), .nth(), .tail(), .take().
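A minimal sketch of .filter() on synthetic data: the callable sees each sub-table and returns True to keep the whole group, False to drop it:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AK", "AK", "AL", "AL", "AL"],
    "age": [40, 60, 50, 70, 90],
})

# Keep only groups whose mean age is at least 60; rows keep their original index
older_states = df.groupby("state").filter(lambda g: g["age"].mean() >= 60)
```

Here the AK group (mean 50) is dropped entirely, while all three AL rows (mean 70) survive unchanged.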
Transformation methods
These return a DataFrame with the same shape and index as the original, but with different values. With aggregation and filter methods, the result is usually smaller than the input DataFrame; a transformation instead changes the individual values while preserving the original shape.
These include: .bfill(), .diff(), .ffill(), .fillna(), .pct_change(), .quantile(), .rank(), .shift(), .transform(), .tshift().
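A sketch of the transformation idea on synthetic data: .transform() broadcasts a per-group statistic back to every row, preserving shape and index:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AK", "AK", "AL", "AL"],
    "age": [40, 60, 50, 70],
})

# Each row gets its own group's mean subtracted, so the output
# has exactly the same length and index as the input column
demeaned = df["age"] - df.groupby("state")["age"].transform("mean")
```

This within-group centering is a common use: an aggregation would collapse each group to one number, while .transform() keeps one value per original row.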
Meta methods
These focus less on the original object that .groupby() was called on and more on high-level information, such as the number of groups and the indices of those groups.
These include: .__iter__(), .get_group(), .groups, .indices, .ndim, .ngroup(), .ngroups, .dtypes.
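A quick sketch of the metadata side on synthetic data:

```python
import pandas as pd

df = pd.DataFrame({"state": ["AK", "AL", "AK"], "x": [1, 2, 3]})
g = df.groupby("state")

n = g.ngroups                 # number of distinct groups
ak_rows = g.groups["AK"]      # row labels belonging to the "AK" group
row_groups = g.ngroup()       # per-row group number (0-based, sorted keys)
```

None of these touch the values in `x`; they only describe how the rows were partitioned.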
Plotting methods
These mirror the plotting API of a Pandas Series or DataFrame, usually splitting the output into multiple subplots.
These include: .hist(), .ohlc(), .boxplot(), .plot().