Detailed explanation of the pandas groupby() grouping and aggregation function

1. groupby grouping and aggregation

Grouping and aggregation is a common way of analyzing data; it is usually combined with statistical functions to inspect the data group by group

  • DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=_NoDefault.no_default, squeeze=_NoDefault.no_default, observed=False, dropna=True): group a DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. It can be used to group large amounts of data and run computations on those groups
    • by: a mapping, function, label, or list of labels that determines the groups. If by is a function, it is called on each value of the object's index. If a dict or Series is passed, its values are used to determine the groups (the Series' values are first aligned). If a list or ndarray of length equal to the selected axis is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns themselves. Note that a tuple is interpreted as a (single) key
    • axis: {0 or 'index', 1 or 'columns'}, default 0. Split along rows (0) or columns (1). For a Series this parameter is unused and defaults to 0
    • level: int, level name, or sequence of such, default None. If the axis is a MultiIndex (hierarchical), group by one or more particular levels. Do not specify both by and level
    • as_index: bool, default True. For aggregated output, return an object with the group labels as the index. Only relevant for DataFrame input. as_index=False is effectively "SQL-style" grouped output (see the sketch after this parameter list)
    • sort: bool, default True. Sort the group keys. Turning it off gives better performance. Note that this does not affect the order of observations within each group; groupby preserves the order of rows within each group
    • group_keys: bool, optional. When apply is called and the by argument produces a like-indexed (i.e. transformed) result, add group keys to the index to identify the pieces. By default, group keys are not included when the result's index (and column) labels match the input, and are included otherwise. This parameter has no effect if the result does not have a like index with respect to the input
    • squeeze: bool, default False. Reduce the dimensionality of the return type if possible, otherwise return a consistent type (deprecated)
    • observed: bool, default False. Only applies if any of the groupers are Categoricals. If True, only show observed values for categorical groupers; if False, show all values of the categorical groupers
    • dropna: bool, default True. If True and the group keys contain NA values, the NA values together with their row/column are dropped. If False, NA values are also treated as keys in the groups
    • Returns: DataFrameGroupBy, a groupby object that contains information about the groups
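
As a quick illustration of the as_index and sort parameters above, here is a minimal sketch on a small made-up DataFrame (the column names key and value are hypothetical, not from the example below):

import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b", "a"], "value": [1, 2, 3, 4]})   # hypothetical data

df.groupby("key").sum()                    # group labels become the index: a -> 6, b -> 4
df.groupby("key", as_index=False).sum()    # "SQL-style" output: key stays as an ordinary column
df.groupby("key", sort=False).sum()        # keeps first-seen key order: b before a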

The code example is as follows 

import pandas as pd
import numpy as np
df = pd.DataFrame({'颜色': ['蓝色', '灰色', '蓝色', '灰色', '黑色'],    # color
                   '商品': ['钢笔', '钢笔', '铅笔', '铅笔', '文具盒'],  # product
                   '售价': [2.5, 2.3, 1.5, 1.3, 5.2],                   # list price
                   '会员价': [2.2, 2, 1.3, 1.2, 5.0]})                  # member price
df

df.groupby(['商品']).mean(numeric_only=True)   # group by product and take the mean of the numeric columns
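
The same grouping can also be combined with other aggregations or with more than one key; a small sketch reusing the DataFrame defined above (agg lets each column get its own aggregation function):

df.groupby(['商品']).agg({'售价': 'mean', '会员价': 'max'})   # a different aggregation per column
df.groupby(['颜色', '商品'])['售价'].sum()                    # group by two columns at once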

2. Hierarchical index

Grouping by different levels of the hierarchical index can be done using the level parameter

arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
          ['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))  # from_arrays converts the arrays into a MultiIndex; the outer array supplies the higher level and the inner array the lower level
index
--------------------------------------------------------------
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]}, index=index)   # index is the row-label index
df

df.groupby(level=1).mean()   # hierarchical index: the level parameter groups by a given level, here level 1 (Type) by position

df.groupby(level='Type').mean()   # the level can also be given by name instead of position

df.groupby(level=0).mean()   # group by the first level (Animal)
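
The level parameter also accepts a sequence, so several levels can be grouped at once; a small sketch continuing the same example:

df.groupby(level=['Animal', 'Type']).mean()   # group by both levels by name
df.groupby(level=[0, 1]).mean()               # the same grouping by position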

3. Choosing whether to include NaN

Whether NA values are included in the group keys can be chosen with the dropna parameter. The default is True, i.e. NaN values are not included

l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
df

df.groupby(by=["b"]).sum()   # 还可以通过设置 dropna 参数来选择是否在组键中包含 NA,默认设置为 True

df.groupby(by=["b"], dropna=False).sum()  # dropna=False,即包含NaN

l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
l

df = pd.DataFrame(l, columns=["a", "b", "c"])   # columns gives the column-label index
df

df.groupby(by="a").sum()  # 按a列分组,对其他列进行求和,默认dropna=True,即不包含NaN值

df.groupby(by="a", dropna=False).sum()  # 为False时包含NaN值

4. Including or excluding group keys

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df

df.groupby("Animal", group_keys=True).apply(lambda x: x)  # 使用 group_keys 包含或排除组键,默认为 True(包含)

df.groupby("Animal", group_keys=False).apply(lambda x: x)   # group_keys=False,即排除组键

5. Starbucks Retail Store Data

The Starbucks directory.csv data can be obtained and downloaded from: https://pan.baidu.com/s/1LG7YlezfSvPC6I7IvfUk4Q?pwd=fsp8

# read the Starbucks store data
starbucks = pd.read_csv("../data/directory.csv")
starbucks.head()   # head() returns the first five rows
---------------------------------------------------
# group by country and find the 10 countries with the most Starbucks retail stores
count = starbucks.groupby(['Country'])["Store Number"].count().sort_values(ascending=False)   # count() is the aggregation; sort_values sorts the result, and ascending=False means descending order
count.head(10)
---------------------------------------------------
# group by country, find the top 10 countries by store count, and draw a bar chart
import matplotlib.pyplot as plt
count = starbucks.groupby(['Country'])["Store Number"].count().sort_values(ascending=False)
count.head(10).plot(kind='bar')
plt.show()
---------------------------------------------------
# set multiple grouping keys: group by country and state/province
starbucks.groupby(['Country', 'State/Province']).count().head()

The output of this grouping is indexed by (Country, State/Province), a structure similar to the MultiIndex example shown in section 2.
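
Since the two-key grouping carries that (Country, State/Province) MultiIndex, the same pattern can rank the country/province combinations themselves; a small sketch reusing the starbucks DataFrame loaded above:

by_cp = starbucks.groupby(['Country', 'State/Province'])['Store Number'].count()
by_cp.sort_values(ascending=False).head(10)   # top 10 country/province combinations by store count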

