Task03: Grouping (2 days)

Pandas basics

About grouping
The groupby function
Aggregation, filtration and transformation

Aggregation
Filtration
Transformation

apply
Exercises

Exercise 1
Exercise 2

References

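About grouping

Grouping partitions a data set into categories so that each group can be analysed statistically on its own.
A grouped operation has three stages: split, apply and combine.
Split: decide which data the groups are formed by;
Apply: decide how the data in each group is processed or computed;
Combine: merge the per-group results and present them together.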
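The three stages can be written out by hand to see what groupby does for us. This is only a minimal sketch on a made-up toy table (the `toy` frame and its values are assumptions, not the table.csv used in this post):

```python
import pandas as pd

# Hypothetical toy data, not the table.csv used in this post
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [60.0, 80.0, 50.0, 70.0, 90.0],
})

# Split: one sub-DataFrame per distinct value of 'School'
pieces = {key: toy[toy['School'] == key] for key in toy['School'].unique()}

# Apply: compute a statistic on each piece separately
means = {key: piece['Math'].mean() for key, piece in pieces.items()}

# Combine: gather the per-group results into one labelled object
manual = pd.Series(means, name='Math')
manual.index.name = 'School'

# groupby performs the same three steps in one call
auto = toy.groupby('School')['Math'].mean()
print(manual)
print(auto)
```

Both printed Series should hold the same per-school means; the groupby version is shorter and much faster on real data.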

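The groupby function

SQL has GROUP BY, and pandas provides the corresponding groupby function. Calling groupby() on a DataFrame returns a DataFrameGroupBy object (a separate class, not a subclass of DataFrame) that exposes attributes such as ngroups and groups. ngroups is the number of groups. groups is a dict-like mapping whose keys are the group labels and whose values are the index labels of the rows in each group. size() returns the number of rows in each group (not a size in bytes). count() returns the number of non-missing values in each column of each group. get_group() returns the data of one specified group. describe() returns summary statistics for each group.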

grouped_single = df.groupby('School')

groupby produces a GroupBy object. The object itself computes and returns nothing; work is done only when one of its methods is called.
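Before turning to the real data, the attributes and methods mentioned above can be tried on a small table. The sketch below uses a hypothetical `toy` frame with one missing score so that the difference between size and count is visible:

```python
import pandas as pd

# Hypothetical toy data with one missing Math score
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math': [60.0, None, 50.0, 70.0, 90.0],
}, index=[1101, 1102, 2101, 2102, 2103])

g = toy.groupby('School')        # nothing is computed yet

n_groups = g.ngroups             # number of groups
labels = g.groups                # dict-like: group key -> index labels of its rows
sizes = g.size()                 # rows per group, missing values included
counts = g.count()               # non-missing values per column per group
s2 = g.get_group('S_2')          # the rows of one group as a DataFrame
summary = g['Math'].describe()   # count, mean, std, min, quartiles, max per group

print(n_groups, dict(labels), sizes, counts, s2, summary, sep='\n')
```

For S_1, size reports 2 rows while count reports 1 value in Math, because count skips the missing score.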

import numpy as np
import pandas as pd
path = 'C:/Users/86187/Desktop/第12期组队学习/组队学习Pandas/'
df = pd.read_csv(path + 'data/table.csv', index_col='ID')

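# These two lines let a notebook cell display several expressions without print
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print('Group by a single column')
grouped_single = df.groupby('School')
grouped_single.get_group('S_1').head()  # take out the 'S_1' group

print('Group by several columns')
grouped_mul = df.groupby(['School', 'Class'])
grouped_mul.get_group(('S_2', 'C_4'))

print('Check group sizes')
grouped_single.size()
grouped_mul.size()  # shows the size of every group at a glance
grouped_single.ngroups
grouped_mul.ngroups  # number of groups

print('Call count, max and mean')
grouped_single.count()
grouped_mul.max()
grouped_mul.mean(numeric_only=True)  # recent pandas no longer drops non-numeric columns silently

print('Iterate over the groups')
for name, group in grouped_single:
    print(name)
    display(group.head())

# Grouping on a MultiIndex level uses the 'level' parameter
# (axis=0 is the default and the 'axis' argument is deprecated in recent pandas, so it is omitted)
df.set_index(['Gender', 'School']).groupby(level=1).get_group('S_1')

# groupby is flexible about the grouping key: any list-like as long as the DataFrame works,
# and a function can be used as the key as well
df.groupby(np.random.choice(['a', 'b', 'c'],
                            df.shape[0])).get_group('a').head()
# Equivalent to grouping by a new column holding np.random.choice(['a','b','c'], df.shape[0])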

Group by a single column

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

Group by several columns

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2401	S_2	C_4	F	street_2	192	62	45.3	A
2402	S_2	C_4	M	street_7	166	82	48.7	B
2403	S_2	C_4	F	street_6	158	60	59.7	B+
2404	S_2	C_4	F	street_2	160	84	67.7	B
2405	S_2	C_4	F	street_6	193	54	47.6	B

Check group sizes:

School
S_1    15
S_2    20
dtype: int64

School  Class
S_1     C_1      5
        C_2      5
        C_3      5
S_2     C_1      5
        C_2      5
        C_3      5
        C_4      5
dtype: int64

Number of groups:
2
7

Call count, max and mean:

	Class	Gender	Address	Height	Weight	Math	Physics
School
S_1	15	15	15	15	15	15	15
S_2	20	20	20	20	20	20	20

		Gender	Address	Height	Weight	Math	Physics
School	Class
S_1	C_1	M	street_4	192	82	87.2	B-
	C_2	M	street_6	188	94	97.0	B-
	C_3	M	street_7	195	82	87.7	B-
S_2	C_1	M	street_7	174	97	83.3	C
	C_2	M	street_7	194	100	85.4	B-
	C_3	M	street_7	190	99	95.5	C
	C_4	M	street_7	193	84	67.7	B+

		Height	Weight	Math
School	Class
S_1	C_1	175.4	72.6	63.78
	C_2	170.6	68.2	64.30
	C_3	181.2	69.2	63.16
S_2	C_1	164.2	76.8	58.56
	C_2	180.0	83.6	62.80
	C_3	173.8	83.8	63.06
	C_4	173.8	68.4	53.80

Iterate over the groups
S_1

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

S_2

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2101	S_2	C_1	M	street_7	174	84	83.3	C
2102	S_2	C_1	F	street_6	161	61	50.6	B+
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2104	S_2	C_1	F	street_5	159	97	72.2	B+
2105	S_2	C_1	M	street_4	170	81	34.2	A

		Class	Address	Height	Weight	Math	Physics
Gender	School
M	S_1	C_1	street_1	173	63	34.0	A+
F	S_1	C_1	street_2	192	73	32.5	B+
M	S_1	C_1	street_2	186	82	87.2	B+
F	S_1	C_1	street_2	167	81	80.4	B-
F	S_1	C_1	street_4	159	64	84.8	B+
M	S_1	C_2	street_5	188	68	97.0	A-
F	S_1	C_2	street_4	176	94	63.5	B-
M	S_1	C_2	street_6	160	53	58.8	A+
F	S_1	C_2	street_5	162	63	33.8	B
F	S_1	C_2	street_6	167	63	68.4	B-
M	S_1	C_3	street_4	161	68	31.5	B+
F	S_1	C_3	street_1	175	57	87.7	A-
M	S_1	C_3	street_7	188	82	49.7	B
M	S_1	C_3	street_2	195	70	85.2	A
F	S_1	C_3	street_5	187	69	61.7	B-

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1204	S_1	C_2	F	street_5	162	63	33.8	B

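print('When a function is used as the key, it receives each index label, which allows fairly complex grouping logic')
df[:5].groupby(lambda x: print(x)).head()

print('Group by odd and even rows')
# get_loc returns a 0-based position, so position 0 is the 1st (odd) row
df.groupby(lambda x: 'odd row'
           if df.index.get_loc(x) % 2 == 0 else 'even row').groups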

When a function is used as the key, it receives each index label, which allows fairly complex grouping logic
1101
1102
1103
1104
1105

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+

Group by odd and even rows:
{'even row': Int64Index([1102, 1104, 1201, 1203, 1205, 1302, 1304, 2101, 2103, 2105, 2202,
             2204, 2301, 2303, 2305, 2402, 2404],
            dtype='int64', name='ID'),
 'odd row': Int64Index([1101, 1103, 1105, 1202, 1204, 1301, 1303, 1305, 2102, 2104, 2201,
             2203, 2205, 2302, 2304, 2401, 2403, 2405],
            dtype='int64', name='ID')}

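print('Check whether the mean Math score of boys and girls in each school reaches the pass mark')
math_score = df.set_index(['Gender', 'School'])['Math'].sort_index()
grouped_score = df.set_index(['Gender','School']).sort_index().\
            groupby(lambda x:(x,'mean passes' if math_score[x].mean()>=60 else 'mean fails'))
for name, _ in grouped_score:
    print(name)
    
# [] selects one or several columns of a groupby object, so the comparison above can be written concisely:
print('Concise version:')
df.groupby(['Gender','School'])['Math'].mean()>=60

Check whether the mean Math score of boys and girls in each school reaches the pass mark
(('F', 'S_1'), 'mean passes')
(('F', 'S_2'), 'mean passes')
(('M', 'S_1'), 'mean passes')
(('M', 'S_2'), 'mean fails')

Concise version: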
Gender  School
F       S_1        True
        S_2        True
M       S_1        True
        S_2       False
Name: Math, dtype: bool

# Group a continuous variable by cutting it into intervals
bins = [0, 40, 60, 80, 90, 100]
cuts = pd.cut(df['Math'], bins=bins)  # the optional labels argument sets custom labels
df.groupby(cuts)['Math'].count()  # count per interval

Math
(0, 40]       7
(40, 60]     10
(60, 80]      9
(80, 90]      7
(90, 100]     2
Name: Math, dtype: int64
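The `labels` argument mentioned in the comment can be sketched as follows. The scores and the grade names are made up for illustration; only `pd.cut` and its `bins`/`labels` parameters come from pandas itself:

```python
import pandas as pd

# Hypothetical scores; the bins are the same as above
scores = pd.Series([35.0, 40.0, 59.5, 60.0, 88.0, 100.0], name='Math')
bins = [0, 40, 60, 80, 90, 100]
labels = ['fail', 'poor', 'fair', 'good', 'excellent']  # hypothetical grade names

# Intervals are closed on the right by default, so 40 falls in (0, 40]
graded = pd.cut(scores, bins=bins, labels=labels)

# observed=False keeps empty categories in the result
per_grade = scores.groupby(graded, observed=False).count()
print(per_grade)
```

Because the intervals are right-closed, a score of exactly 60 is counted in the (40, 60] bin, and the empty 'fair' bin still appears with a count of 0.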

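Aggregation, filtration and transformation

After grouping, the data in each group can be aggregated. Aggregation reduces a collection of values to a single scalar, so mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max are all aggregation functions (strictly speaking, describe returns several statistics per group rather than one scalar).

Aggregation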

# Use a single aggregation function
group_m = grouped_single['Math']
group_m.max()

School
S_1    97.0
S_2    95.5
Name: Math, dtype: float64

# Use several aggregation functions at once
group_m.agg(['sum','mean','std'])

	sum	mean	std
School
S_1	956.2	63.746667	23.077474
S_2	1191.1	59.555000	17.589305

# Rename the result columns with (name, function) tuples
group_m.agg([('rename_sum','sum'),('rename_mean','mean')])

	rename_sum	rename_mean
School
S_1	956.2	63.746667
S_2	1191.1	59.555000

# Specify which functions apply to which columns
grouped_mul.agg({'Math': ['mean', 'max'], 'Height': 'var'})

		Math		Height
		mean	max	var
School	Class
S_1	C_1	63.78	87.2	183.3
	C_2	64.30	97.0	132.8
	C_3	63.16	87.7	179.2
S_2	C_1	58.56	83.3	54.7
	C_2	62.80	85.4	256.0
	C_3	63.06	95.5	205.7
	C_4	53.80	67.7	300.2

# Use a custom function
grouped_single['Math'].agg(lambda x: print(x.head(), 'separator'))
# agg passes each column of each group to the function in turn, which makes many things possible

1101    34.0
1102    32.5
1103    87.2
1104    80.4
1105    84.8
Name: Math, dtype: float64 separator
2101    83.3
2102    50.6
2103    52.5
2104    72.2
2105    34.2
Name: Math, dtype: float64 separator

School
S_1    None
S_2    None
Name: Math, dtype: object

# Compute the range (max - min) within each group
grouped_single['Math'].agg(lambda x: x.max() - x.min())

School
S_1    65.5
S_2    62.8
Name: Math, dtype: float64

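# Named aggregation with NamedAgg
# Note: some older pandas versions may reject lambda functions here; functions defined with def work
# column must name an existing column, so agg is called on the grouped DataFrame
# (min_score1 actually holds the range max - min; the result names are kept to match the output below)
def R1(x):
    return x.max() - x.min()

def R2(x):
    return x.max() - x.median()

grouped_single.agg(min_score1=pd.NamedAgg(column='Math', aggfunc=R1),
                   max_score1=pd.NamedAgg(column='Math',
                                          aggfunc='max'),
                   range_score2=pd.NamedAgg(column='Math',
                                            aggfunc=R2)).head()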

	min_score1	max_score1	range_score2
School
S_1	65.5	97.0	33.5
S_2	62.8	95.5	39.4

# Aggregation function with extra arguments
# Check whether each group has at least one Math score between 50 and 52:
def f(s, low, high):
    return s.between(low, high).any()
grouped_single['Math'].agg(f, 50, 52)

School
S_1    False
S_2     True
Name: Math, dtype: bool

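Filtration

filter selects whole groups (keep in mind that the result contains every row of the groups that pass), so the function passed in must return a single Boolean value.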

grouped_single[['Math',
                'Physics']].filter(lambda x: (x['Math'] > 32).all()).head()

	Math	Physics
ID
2101	83.3	C
2102	50.6	B+
2103	52.5	B-
2104	72.2	B+
2105	34.2	A
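To make the "whole groups" behaviour easy to check, here is a small sketch on a hypothetical `toy` table. S_1 contains one score that fails the test, so all of its rows are dropped, including the row that would pass on its own:

```python
import pandas as pd

# Hypothetical toy data: S_1 has one score (31.5) that is not above 32
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [31.5, 90.0, 45.0, 50.0],
})

# The function sees one group's DataFrame and must return a single True/False
kept = toy.groupby('School').filter(lambda g: (g['Math'] > 32).all())
print(kept)
```

Only the two S_2 rows remain, with their original index labels.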

Transformation

The function passed to transform receives each column within a group, and its return value must have exactly the same length as that column.

grouped_single[['Math', 'Height']].transform(lambda x: x - x.min()).head()

	Math	Height
ID
1101	2.5	14
1102	1.0	33
1103	55.7	27
1104	48.9	8
1105	53.3	0

# If a scalar is returned, it is broadcast to every element of the group
grouped_single[['Math', 'Height']].transform(lambda x: x.mean()).head()

	Math	Height
ID
1101	63.746667	175.733333
1102	63.746667	175.733333
1103	63.746667	175.733333
1104	63.746667	175.733333
1105	63.746667	175.733333

# Use transform to standardize within each group
grouped_single[['Math', 'Height'
                ]].transform(lambda x: (x - x.mean()) / x.std()).head()

	Math	Height
ID
1101	-1.288991	-0.214991
1102	-1.353990	1.279460
1103	1.016287	0.807528
1104	0.721627	-0.686923
1105	0.912289	-1.316166

# Use transform to fill missing values with the group mean
df_nan = df[['Math', 'School']].copy().reset_index()
df_nan.loc[np.random.randint(0, df.shape[0], 25), ['Math']] = np.nan
df_nan.head()

df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(
    df.reset_index()['School']).head()

	ID	Math	School
0	1101	34.0	S_1
1	1102	32.5	S_1
2	1103	87.2	S_1
3	1104	80.4	S_1
4	1105	84.8	S_1

	ID	Math	School
0	1101	34.0	S_1
1	1102	32.5	S_1
2	1103	87.2	S_1
3	1104	80.4	S_1
4	1105	84.8	S_1
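Because the missing values above are placed at random, the first rows may show no filled value at all. A deterministic sketch on a hypothetical `toy` table makes the group-wise filling visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score per school
toy = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2'],
    'Math': [60.0, np.nan, 80.0, 50.0, np.nan],
})

# Each missing score is replaced by the mean of its own school;
# the result keeps the original index, so it can be assigned straight back
filled = toy.groupby('School')['Math'].transform(lambda s: s.fillna(s.mean()))
print(filled)
```

The missing S_1 score becomes 70.0 (the mean of 60 and 80) and the missing S_2 score becomes 50.0, rather than both receiving the overall mean.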

apply

# apply passes each group to the function as a DataFrame
df.groupby('School').apply(lambda x: print(x.head(1)))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2101    S_2   C_1      M  street_7     174      84  83.3       C

# Scalar return value (one value per column for each group)
df[['School', 'Math', 'Height']].groupby('School').apply(lambda x: x.max())

	School	Math	Height
School
S_1	S_1	97.0	195
S_2	S_2	95.5	194

# List-like return value (same length as the group)
df[['School', 'Math',
    'Height']].groupby('School').apply(lambda x: x - x.min()).head()

	Math	Height
ID
1101	2.5	14.0
1102	1.0	33.0
1103	55.7	27.0
1104	48.9	8.0
1105	53.3	0.0

# DataFrame return value
df[['School','Math','Height']].groupby('School')\
    .apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max(),
                                  'col2':x['Math']-x['Math'].min(),
                                  'col3':x['Height']-x['Height'].max(),
                                  'col4':x['Height']-x['Height'].min()})).head()

	col1	col2	col3	col4
ID
1101	-63.0	2.5	-22	14
1102	-64.5	1.0	-3	33
1103	-9.8	55.7	-9	27
1104	-16.6	48.9	-28	8
1105	-12.2	53.3	-36	0

# Compute several statistics at once with apply
from collections import OrderedDict
def f(df):
    data = OrderedDict()
    data['M_sum'] = df['Math'].sum()
    data['W_var'] = df['Weight'].var()
    data['H_mean'] = df['Height'].mean()
    return pd.Series(data)
grouped_single.apply(f)

	M_sum	W_var	H_mean
School
S_1	956.2	117.428571	175.733333
S_2	1191.1	181.081579	172.950000

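Exercises

Exercise 1

A data set on diamonds records carat, color, depth and price in its columns. Answer the following questions:
a. Among all diamonds weighing more than 1 carat, what is the range of the price?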

df = pd.read_csv('data/Diamonds.csv')
df.head()

	carat	color	depth	price
0	0.23	E	61.5	326
1	0.21	E	59.8	326
2	0.23	E	56.9	327
3	0.29	I	62.4	334
4	0.31	J	63.3	335

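# "more than 1 carat" is a strict inequality
Max = df.loc[df['carat']>1, 'price'].max()
Min = df.loc[df['carat']>1, 'price'].min()
Max-Min

b. Using the 0.2/0.4/0.6/0.8 quantiles of depth as the group boundaries, which color is the most frequent in each group? Is that color also the most expensive per unit weight on average within the group?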

bins = df['depth'].quantile(np.linspace(0,1,6)).tolist()
cuts = pd.cut(df['depth'], bins=bins)
df['cuts'] = cuts
df.head()

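# Within each bin value_counts sorts the colors by descending count; this slice relies on
# every bin containing all 7 colors, so that every 7th row is a bin's most frequent color
df.groupby('cuts')['color'].value_counts()[::7]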

	carat	color	depth	price	cuts
0	0.23	E	61.5	326	(60.8, 61.6]
1	0.21	E	59.8	326	(43.0, 60.8]
2	0.23	E	56.9	327	(43.0, 60.8]
3	0.29	I	62.4	334	(62.1, 62.7]
4	0.31	J	63.3	335	(62.7, 79.0]

cuts          color
(43.0, 60.8]  E        2259
(60.8, 61.6]  G        2593
(61.6, 62.1]  G        2247
(62.1, 62.7]  G        2193
(62.7, 79.0]  G        2000
Name: color, dtype: int64
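The second half of question b is not worked out above. One possible approach is sketched below on a hypothetical `toy` table, assuming that "most expensive per unit weight on average" means the mean of price / carat within each (bin, color) pair; the column names `bin` and `unit_price` are illustrative only:

```python
import pandas as pd

# Hypothetical toy data standing in for the binned diamonds table
toy = pd.DataFrame({
    'bin':   ['low', 'low', 'low', 'high', 'high', 'high'],
    'color': ['E', 'E', 'G', 'G', 'G', 'E'],
    'carat': [0.5, 1.0, 1.0, 1.0, 2.0, 1.0],
    'price': [1000, 2000, 5000, 6000, 12000, 3000],
})
toy['unit_price'] = toy['price'] / toy['carat']

# Most frequent color per bin
most_common = toy.groupby('bin')['color'].agg(lambda s: s.value_counts().idxmax())

# Color with the highest mean price per carat in each bin
mean_unit = toy.groupby(['bin', 'color'], as_index=False)['unit_price'].mean()
top_rows = mean_unit.loc[mean_unit.groupby('bin')['unit_price'].idxmax()]
priciest = top_rows.set_index('bin')['color']

# Side-by-side comparison, aligned on the bin label
comparison = pd.DataFrame({'most_common': most_common, 'priciest': priciest})
comparison['same'] = comparison['most_common'] == comparison['priciest']
print(comparison)
```

Unlike slicing with `[::7]`, this version does not depend on how many colors each bin contains.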

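c. Group by weight (0-0.5, 0.5-1, 1-1.5, 1.5-2, 2+), sort each group by increasing depth, and find the maximum length of a run of consecutive strictly increasing prices in each group.

# Note: these bins stop at 2.5, so a diamond heavier than 2.5 carats falls outside every bin;
# using np.inf as the last edge would cover the whole "2+" group
weight = np.linspace(0, 2.5, 6).tolist()
df['weight'] = pd.cut(df['carat'], bins=weight)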


def f(nums):
    if not nums:
        return 0
    res = 1
    cur_len = 1
    for i in range(1, len(nums)):
        if nums[i - 1] < nums[i]:
            cur_len += 1
            res = max(cur_len, res)
        else:
            cur_len = 1
    return res


for name, group in df.groupby('weight'):
    group = group.sort_values(by='depth')
    s = group['price']
    print(name, f(s.tolist()))

(0.0, 0.5] 8
(0.5, 1.0] 8
(1.0, 1.5] 7
(1.5, 2.0] 11
(2.0, 2.5] 7
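The run length can also be computed without an explicit Python loop. The following is a sketch of one vectorized alternative, not the method used above; the function name `longest_increasing_run` and the sample prices are made up for illustration:

```python
import pandas as pd

def longest_increasing_run(s: pd.Series) -> int:
    """Length of the longest run of consecutive strictly increasing values."""
    if s.empty:
        return 0
    rising = s.diff() > 0          # True where a value exceeds its predecessor
    run_id = (~rising).cumsum()    # a new run starts wherever the rise stops
    # each run has one starting value plus one value per rise
    return int(rising.groupby(run_id).sum().max()) + 1

# Hypothetical prices: the longest strictly increasing stretch is 1, 2, 3, 4
prices = pd.Series([3, 4, 5, 5, 1, 2, 3, 4, 2])
print(longest_increasing_run(prices))
```

On the real data it could be applied per group after sorting, for example to `group.sort_values(by='depth')['price']` inside the loop.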

d. Group by color and compute the regression coefficient of price on carat for each group. (Simple linear regression with one variable, using only pandas and NumPy.)

for name, group in df[['carat', 'price', 'color']].groupby('color'):
    # design matrix: carat in the first column, a column of ones for the intercept
    x = np.array([group['carat'],
                  np.ones(group.shape[0])]).T.reshape(group.shape[0], 2)
    y = np.array(group['price']).reshape(group.shape[0], 1)
    # normal equations: theta = (X^T X)^{-1} X^T y
    theta = (np.linalg.inv(x.T.dot(x)).dot(x.T).dot(y)).reshape(2, 1)
    print('Color %s: slope = %f, intercept = %f' % (name, theta[0, 0], theta[1, 0]))

Color D: slope = 8408.353126, intercept = -2361.017152
Color E: slope = 8296.212783, intercept = -2381.049600
Color F: slope = 8676.658344, intercept = -2665.806191
Color G: slope = 8525.345779, intercept = -2575.527643
Color H: slope = 7619.098320, intercept = -2460.418046
Color I: slope = 7761.041169, intercept = -2878.150356
Color J: slope = 7094.192092, intercept = -2920.603337
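For a single predictor, the same coefficients follow from the closed form slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x), which avoids the matrix inverse. A sketch on hypothetical data that lies exactly on known lines, so the result can be checked by eye (`toy` and `simple_ols` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical data lying exactly on known lines:
# D: price = 8000 * carat - 2000, E: price = 7000 * carat - 1500
toy = pd.DataFrame({
    'color': ['D', 'D', 'D', 'E', 'E', 'E'],
    'carat': [0.5, 1.0, 1.5, 0.5, 1.0, 2.0],
})
toy['price'] = np.where(toy['color'] == 'D',
                        8000 * toy['carat'] - 2000,
                        7000 * toy['carat'] - 1500)

def simple_ols(g):
    """Closed-form least squares for one predictor."""
    x, y = g['carat'], g['price']
    slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    return pd.Series({'slope': slope, 'intercept': intercept})

# One row of coefficients per color
coefs = toy.groupby('color')[['carat', 'price']].apply(simple_ols)
print(coefs)
```

Returning a Series from the applied function gives one row per color with the slope and intercept as columns.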

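Exercise 2

A data set on illegal drugs in the United States from 2010 to 2017 records the year, state (5 states), county, substance name and number of reports. Answer the following questions:

(a) For each year, which county has the most reports? Is the state that county belongs to also the state with the most reports in that year?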

df = pd.read_csv('data/Drugs.csv')
df.head()

	YYYY	State	COUNTY	SubstanceName	DrugReports
0	2010	VA	ACCOMACK	Propoxyphene	1
1	2010	OH	ADAMS	Morphine	9
2	2010	PA	ADAMS	Methadone	2
3	2010	VA	ALEXANDRIA CITY	Heroin	5
4	2010	PA	ALLEGHENY	Hydromorphone	5

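# Note: county names can repeat across states (ADAMS appears in both OH and PA above),
# so grouping by ['State', 'COUNTY'] would be more robust than grouping by 'COUNTY' alone
county_sum = df.groupby(['COUNTY','YYYY'])['DrugReports'].sum()
state_sum = df.groupby(['State','YYYY'])['DrugReports'].sum()
for i in range(2010,2018):
    county = county_sum.xs(i, level='YYYY').idxmax()
    state = df.query('COUNTY == "%s"'%county)['State'].iloc[0]
    state_true = state_sum.xs(i, level='YYYY').idxmax()
    if state==state_true:
        print('In %d, %s county has the most reports, and its state %s also has the most reports'%(i,county,state))
    else:
        print('In %d, %s county has the most reports, but its state %s does not; %s has the most reports'%(i,county,state,state_true))

In 2010, PHILADELPHIA county has the most reports, and its state PA also has the most reports
In 2011, PHILADELPHIA county has the most reports, but its state PA does not; OH has the most reports
In 2012, PHILADELPHIA county has the most reports, but its state PA does not; OH has the most reports
In 2013, PHILADELPHIA county has the most reports, but its state PA does not; OH has the most reports
In 2014, PHILADELPHIA county has the most reports, but its state PA does not; OH has the most reports
In 2015, PHILADELPHIA county has the most reports, but its state PA does not; OH has the most reports
In 2016, HAMILTON county has the most reports, and its state OH also has the most reports
In 2017, HAMILTON county has the most reports, and its state OH also has the most reports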

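(b) From 2014 to 2015, which state had the largest increase in Heroin reports? Is Heroin the substance with the largest increase in that state? If not, find the substance that is.

df_b = df[(df['YYYY'].isin([2014,2015]))&(df['SubstanceName']=='Heroin')]
df_add = df_b.groupby(['YYYY','State'])[['DrugReports']].sum()  # sum only the numeric report counts
(df_add.loc[2015]-df_add.loc[2014]).idxmax()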

DrugReports    OH
dtype: object

df_b = df[(df['YYYY'].isin([2014,2015]))&(df['State']=='OH')]
df_add = df_b.groupby(['YYYY','SubstanceName'])[['DrugReports']].sum()
display((df_add.loc[2015]-df_add.loc[2014]).idxmax())  # absolute increase; the subtraction aligns on the index
display((df_add.loc[2015]/df_add.loc[2014]).idxmax())  # relative increase (ratio)

DrugReports    Heroin
dtype: object



DrugReports    Acetyl fentanyl
dtype: object

References

Tutorial repository link
Python for Data Analysis

Pandas study notes 3: Grouping