Data Science Essentials: A Practical Guide to Pandas GroupBy Data Grouping

This article reviews and consolidates Pandas GroupBy step by step, with worked examples. Whether you're just starting out with Pandas and want to master its core functionality, or want to fill gaps in your understanding of .groupby(), it should be helpful for future work.

Data Preparation

  1. The U.S. Congressional Dataset contains public information about historical members of Congress.
  2. The air quality dataset contains periodic gas sensor readings.
  3. The news dataset contains metadata for hundreds of thousands of news articles.

Example 1: US Congress Dataset

The goal is to explore historical data about members of the US Congress.

import pandas as pd

# Load selected columns, reading low-cardinality string columns as categoricals
dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}
df = pd.read_csv(
    "数据科学必备Pandas实用操作GroupBy数据分组详解/legislators-historical.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"],
)

The dataset contains members' first and last names, date of birth, gender, type ("rep" for the House, "sen" for the Senate), US state, and political party. The first few rows can be viewed with df.head().

df.head()

GroupBy Aggregation Operation

Question 1 (single-column aggregation): Over the entire history covered by the dataset, how many members of Congress has each state had?

In SQL:

SELECT state, count(last_name)
FROM df
GROUP BY state
ORDER BY count(last_name) DESC
LIMIT 10;

In Pandas:

n_by_state = df.groupby("state")["last_name"].count()
n_by_state.nlargest(10)

state
NY    1467
PA    1053
OH     676
IL     488
VA     433
MA     427
KY     373
CA     368
NJ     359
NC     356
Name: last_name, dtype: int64

Question 2 (multi-column aggregation): How do I break down the number of members of Congress by state and gender?
In SQL:

SELECT state, gender, count(last_name)
FROM df
GROUP BY state, gender
ORDER BY state, gender;

In Pandas:

n_by_state = df.groupby(["state", "gender"])["last_name"].count()
n_by_state.head(10)
state  gender
AK     M          16
AL     F           3
       M         203
AR     F           5
       M         112
                ...
WI     M         196
WV     F           1
       M         119
WY     F           2
       M          38
Name: last_name, Length: 104, dtype: int64
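The result is a Series with a two-level MultiIndex (state, then gender). To view it as a state-by-gender table instead, one option is to pivot the inner level into columns with .unstack() (a quick sketch using the same grouping):

# Pivot the gender level into columns for easier reading
df.groupby(["state", "gender"])["last_name"].count().unstack()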

How GroupBy works

.groupby() is actually a three-step split-apply-combine process: split the table into groups, apply some operation to each smaller table, and combine the results.

The split step

A useful way to see how a Pandas GroupBy object splits the data is to iterate over it.

by_state = df.groupby("state")

# Look at the first two rows for each group
for state, frame in by_state:
    print(f"First 2 entries for {state!r}")
    print("------------------------")
    print(frame.head(2), end="\n\n")


The .groups attribute provides a dictionary mapping each group name to the row labels (index values) belonging to that group, so a group's labels can be looked up by key.

by_state.groups["CO"]

You can use .get_group() to retrieve the full sub-DataFrame for a single group.

by_state.get_group("CO")


The apply step

In the apply step, the same operation (or callable) is applied to each sub-table produced by the split step. For instance, pull out the first group:

state, frame = next(iter(by_state))  # take the first (state, sub-frame) pair
state
'AK'

frame.head(3)

The combine step

The combine step merges the per-group results back into a single Series or DataFrame. Applying the counting operation to the single group extracted above gives one piece of that result:

frame["first_name"].count()
17
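To make the three steps concrete, here is a minimal sketch (using df from this example) that reproduces the per-state count by hand and checks it against the built-in one-liner:

import pandas as pd

# Split: iterate over the per-state sub-frames;
# Apply: count the last names in each sub-frame;
# Combine: stitch the per-group results back into a single Series.
results = {state: frame["last_name"].count() for state, frame in df.groupby("state")}
manual = pd.Series(results).rename_axis("state")

# Should match the built-in groupby count (Series name aside)
assert manual.equals(df.groupby("state")["last_name"].count())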

Example 2: Air Quality Dataset

import pandas as pd

# Parse the separate Date and Time columns into a single datetime column
df = pd.read_excel(
    "数据科学必备Pandas实用操作GroupBy数据分组详解/AirQualityUCI.xlsx",
    parse_dates=[["Date", "Time"]],
)
df.rename(
    columns={
        "CO(GT)": "co",
        "Date_Time": "tstamp",
        "T": "temp_c",
        "RH": "rel_hum",
        "AH": "abs_hum",
    },
    inplace=True,
)

df.set_index("tstamp", inplace=True)


co is the average carbon monoxide reading for the hour, while temp_c, rel_hum, and abs_hum are the average temperature, relative humidity, and absolute humidity over the hour, respectively. Observations span March 2004 to April 2005.

df.index.min()
Timestamp('2004-03-10 18:00:00')

df.index.max()
Timestamp('2005-04-04 14:00:00')

Grouping by a derived array

Group by the day of the week, derived as strings from the datetime index.

day_names = df.index.day_name()
day_names[:10]

Index(['Wednesday', 'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday',
       'Wednesday', 'Thursday', 'Thursday', 'Thursday', 'Thursday'],
      dtype='object', name='tstamp')

Question 1: Find the average carbon monoxide (co) reading for each day of the week. (Note: the averages below are negative because the raw dataset encodes missing values as -200, and those sentinels have not been cleaned here.)

df.groupby(day_names)["co"].mean()

tstamp
Friday      -24.583259
Monday      -30.063820
Saturday    -27.126414
Sunday      -35.432292
Thursday    -35.806176
Tuesday     -41.773864
Wednesday   -44.917647
Name: co, dtype: float64

Question 2: Aggregate the data by day of the week and hour of the day.

hr = df.index.hour
df.groupby([day_names, hr])["co"].mean().rename_axis(["dow", "hr"])

dow        hr
Friday     0    -30.517857
           1    -30.792857
           2    -31.158929
           3    -31.398214
           4    -92.416071
                   ...    
Wednesday  19   -28.662500
           20   -28.916071
           21   -29.710714
           22   -30.378571
           23   -30.516071
Name: co, Length: 168, dtype: float64

Question 3: Divide the temperature into discrete intervals (bins) and aggregate by bin.

bins = pd.cut(df["temp_c"], bins=3, labels=("cool", "warm", "hot"))
df[["rel_hum", "abs_hum"]].groupby(bins).agg(["mean", "median"])

Question 4: Aggregate the data by year and quarter.

df.groupby([df.index.year, df.index.quarter])["co"].agg(
    ["max", "min"]
).rename_axis(["year", "quarter"])


Example 3: News Aggregator Dataset

import datetime as dt
import pandas as pd

def parse_millisecond_timestamp(ts):
    """Convert a millisecond UNIX timestamp to a UTC datetime."""
    return dt.datetime.fromtimestamp(ts / 1000, tz=dt.timezone.utc)

df = pd.read_csv(
    "数据科学必备Pandas实用操作GroupBy数据分组详解/newsCorpora.csv",
    sep="\t",
    header=None,
    index_col=0,
    names=["title", "url", "outlet", "category", "cluster", "host", "tstamp"],
    parse_dates=["tstamp"],
    date_parser=parse_millisecond_timestamp,
    dtype={
        "outlet": "category",
        "category": "category",
        "cluster": "category",
        "host": "category",
    },
)
df.head()

The category codes are: b = business, t = technology, e = entertainment, m = health.
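To see how many articles fall into each category, a quick check (assuming df as loaded above):

df["category"].value_counts()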

Question 1: For each outlet, count the titles containing a given keyword ("Fed"), and sort the results.

df.groupby("outlet", sort=False)["title"].apply(
	lambda ser: ser.str.contains("Fed").sum()
).nlargest(10)

outlet
Reuters                         161
NASDAQ                          103
Businessweek                     93
Investing.com                    66
Wall Street Journal \(blog\)     61
MarketWatch                      56
Moneynews                        55
Bloomberg                        53
GlobalPost                       51
Economic Times                   44
Name: title, dtype: int64

Improve GroupBy performance

Returning to the .groupby(...).apply(...) pattern above: it may not be optimal. .apply() effectively executes a Python-level loop over each group. While the .groupby(...).apply(...) pattern provides flexibility, it also prevents Pandas from using its Cython-based optimizations.

In other words, whenever .apply() comes to mind, ask whether the operation can be expressed in a vectorized way. Helpfully, .groupby() accepts not just one or more column names but also other array-like structures, such as a Series aligned with the DataFrame.

First build a boolean Series marking the titles that contain the keyword, then group that Series by outlet and sum it.

import numpy as np

# Boolean mask: does each title mention "Fed"?
mentions_fed = df["title"].str.contains("Fed")

mentions_fed.groupby(
    df["outlet"], sort=False
).sum().nlargest(10).astype(np.uintc)

outlet
Reuters                         161
NASDAQ                          103
Businessweek                     93
Investing.com                    66
Wall Street Journal \(blog\)     61
MarketWatch                      56
Moneynews                        55
Bloomberg                        53
GlobalPost                       51
Economic Times                   44
Name: title, dtype: uint32

Check that no data was lost:

df["outlet"].shape == mentions_fed.shape
True

Finally, compare the execution times; the vectorized version is considerably faster.

# Version 1: using .apply()
df.groupby("outlet", sort=False)["title"].apply(
    lambda ser: ser.str.contains("Fed").sum()
).nlargest(10)

# Version 2: vectorized
mentions_fed = df["title"].str.contains("Fed")
mentions_fed.groupby(
    df["outlet"], sort=False
).sum().nlargest(10).astype(np.uintc)

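The original post showed IPython timings here. A minimal way to reproduce the comparison yourself, assuming an IPython/Jupyter session with df and np already loaded:

# %timeit runs the statement repeatedly and reports the average wall time
%timeit df.groupby("outlet", sort=False)["title"].apply(lambda ser: ser.str.contains("Fed").sum()).nlargest(10)

mentions_fed = df["title"].str.contains("Fed")
%timeit mentions_fed.groupby(df["outlet"], sort=False).sum().nlargest(10).astype(np.uintc)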

Pandas GroupBy method summary

The GroupBy methods fall into five rough families: aggregation (reduce) methods, filter methods, transformation methods, meta methods, and plotting methods.

Aggregate methods (also known as reduce methods)

"Mix" many data points into aggregated statistics about those data points. An example would be taking the sum, average or median of 10 numbers and the result is just one number.

These include: .agg(), .aggregate(), .all(), .any(), .apply(), .corr(), .corrwith(), .count(), .cov(), .cumcount(), .cummax(), .cummin(), .cumprod(), .cumsum(), .describe(), .idxmax(), .idxmin(), .mad(), .max(), .mean(), .median(), .min(), .nunique(), .prod(), .sem(), .size(), .skew(), .std(), .sum(), .var().
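For instance, a single .agg() call can compute several reductions at once; a sketch assuming the air quality df from Example 2:

# Yearly minimum, maximum, and mean CO readings in one pass
df.groupby(df.index.year)["co"].agg(["min", "max", "mean"])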

Filter methods

These return a subset of the original DataFrame. This usually means using .filter() to remove entire groups based on some comparative statistic about the group and its sub-table. It also makes sense to include under this definition a number of methods that exclude specific rows from each group.

These include: .filter(), .first(), .head(), .last(), .nth(), .tail(), .take().
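A sketch of .filter(), assuming the Congress df from Example 1: drop every state with 500 or fewer historical members, keeping the original rows of the remaining groups:

# Keep only rows belonging to groups that satisfy the predicate
big_states = df.groupby("state").filter(lambda g: len(g) > 500)
big_states["state"].unique()  # NY, PA, and OH, per the counts above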

Transformation methods

These return a DataFrame with the same shape and indices as the original, but with different values. With aggregation and filter methods, the resulting DataFrame is usually smaller than the input. Not so with transformations, which transform individual values while preserving the shape of the original DataFrame.

These include: .bfill(), .diff(), .ffill(), .fillna(), .pct_change(), .quantile(), .rank(), .shift(), .transform(), .tshift().
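A sketch of .transform(), assuming the air quality df from Example 2: subtract each reading's day-of-week mean, yielding a result with the same shape and index as the input:

day_names = df.index.day_name()

# Broadcast each group's mean back to the original rows, then subtract
co_demeaned = df["co"] - df.groupby(day_names)["co"].transform("mean")
co_demeaned.shape == df["co"].shape  # True: the shape is preserved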

Meta methods

These focus less on the original object on which .groupby() was called and more on high-level information, such as the number of groups and the indices of those groups.

These include: .__iter__(), .get_group(), .groups, .indices, .ndim, .ngroup(), .ngroups, .dtypes.
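A sketch of a few meta methods, assuming by_state from Example 1:

by_state = df.groupby("state")
by_state.ngroups           # number of groups
list(by_state.groups)[:5]  # first few group names
by_state.indices["CO"]     # positional indices of the "CO" rows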

Plotting methods

These mirror the plotting API of a Pandas Series or DataFrame, usually splitting the output into multiple subplots.

These include: .hist(), .ohlc(), .boxplot(), .plot().
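A sketch of grouped plotting, assuming the air quality df from Example 2 and matplotlib available; each group is drawn as its own line on shared axes:

import matplotlib.pyplot as plt

# One line per calendar year, with the group keys as the legend
df.groupby(df.index.year)["co"].plot(legend=True)
plt.show()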


Origin blog.csdn.net/qq_20288327/article/details/124191331