pandas系列学习（七）：数据透视表

作者：chen_h
微信号 & QQ：862251340
微信公众号：coderpai

介绍

大多数人可能都有使用 Excel 中的数据透视表的经验。 pandas 提供了一个类似的功能，称为 pivot_table。虽然它非常有用，但我经常发现自己很难记住如何使用语法格式化输出以满足我的需求。本文将重点介绍 pandas pivot_table 函数以及如何将其用于数据分析。

作为一个额外的奖励，我创建了一个简单的备忘录，总结了 pivot_table 。你可以在这篇文章的最后找到它，我希望它是一个有用的参考。

数据

使用 pandas 的 pivot_table 的一个挑战是确保你了解你的数据以及你尝试使用数据透视表来处理问题。这是一个看似简单的功能，但可以非常快速的产生非常强大的分析。

在这种情况下，我将跟踪销售渠道（也称为渠道）。基本问题是一些销售周期很长，管理层希望全年更详细的了解它。

典型问题包括：

管道中有多少收入？
什么产品正在筹备中？
谁在什么阶段有什么产品？
我们有多大可能在年底前完成交易？

许多公司将拥有销售用于跟踪流程的 CRM 工具或其他软件。虽然他们可能有分析数据的有用工具，但不可避免的有人会将数据导出到 Excel 并使用数据透视表来汇总数据。

使用 padnas 的数据透视表可能是一个很好的选择，因为它是：

更快（一旦设置）
自我记录（查看代码，你知道它做了什么）
易于使用生成报告或者电子邮件
更灵活，因为你可以定义客户关心的内容

读入数据

让我们先建立我们的环境，数据你可以点击这个下载。然后将我们的销售数据读入 DataFrame ，如下：

import pandas as pd
import numpy as np

df = pd.read_excel("./sales-funnel.xlsx")
df.head()

	Account	Name	Rep	Manager	Product	Quantity	Price	Status
0	714466	Trantow-Barrows	Craig Booker	Debra Henley	CPU	1	30000	presented
1	714466	Trantow-Barrows	Craig Booker	Debra Henley	Software	1	10000	presented
2	714466	Trantow-Barrows	Craig Booker	Debra Henley	Maintenance	2	5000	pending
3	737550	Fritsch, Russel and Anderson	Craig Booker	Debra Henley	CPU	1	35000	declined
4	146832	Kiehn-Spinka	Daniel Hilton	Debra Henley	CPU	2	65000	won

为了方便起见，我们将状态列定义为类别并设置我们要查看的顺序。

这不是严格要求的，但可以帮助我们在分析数据时保持我们想要的顺序。

df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)

透视数据

在我们构建数据透视表时，我认为最简单的方法就是一步一步。添加项目并检查每个步骤以验证你是否获得了预期的结果。不要害怕使用订单和变量来查看哪些需求。

最简单的数据透视表必须具有数据框和索引。在这种情况下，让我们使用 Name 作为索引。

pd.pivot_table(df,index=["Name"])

	Account	Price	Quantity
Name
Barton LLC	740150.0	35000.0	1.000000
Fritsch, Russel and Anderson	737550.0	35000.0	1.000000
Herman LLC	141962.0	65000.0	2.000000
Jerde-Hilpert	412290.0	5000.0	2.000000
Kassulke, Ondricka and Metz	307599.0	7000.0	3.000000
Keeling LLC	688981.0	100000.0	5.000000
Kiehn-Spinka	146832.0	65000.0	2.000000
Koepp Ltd	729833.0	35000.0	2.000000
Kulas Inc	218895.0	25000.0	1.500000
Purdy-Kunde	163416.0	30000.0	1.000000
Stokes LLC	239344.0	7500.0	1.000000
Trantow-Barrows	714466.0	15000.0	1.333333

你也可以拥有多个索引。是加上，大多数的 pivot_table args 都可以通过列表获取多个值。

pd.pivot_table(df,index=["Name","Rep","Manager"])

			Account	Price	Quantity
Name	Rep	Manager
Barton LLC	John Smith	Debra Henley	740150.0	35000.0	1.000000
Fritsch, Russel and Anderson	Craig Booker	Debra Henley	737550.0	35000.0	1.000000
Herman LLC	Cedric Moss	Fred Anderson	141962.0	65000.0	2.000000
Jerde-Hilpert	John Smith	Debra Henley	412290.0	5000.0	2.000000
Kassulke, Ondricka and Metz	Wendy Yule	Fred Anderson	307599.0	7000.0	3.000000
Keeling LLC	Wendy Yule	Fred Anderson	688981.0	100000.0	5.000000
Kiehn-Spinka	Daniel Hilton	Debra Henley	146832.0	65000.0	2.000000
Koepp Ltd	Wendy Yule	Fred Anderson	729833.0	35000.0	2.000000
Kulas Inc	Daniel Hilton	Debra Henley	218895.0	25000.0	1.500000
Purdy-Kunde	Cedric Moss	Fred Anderson	163416.0	30000.0	1.000000
Stokes LLC	Cedric Moss	Fred Anderson	239344.0	7500.0	1.000000
Trantow-Barrows	Craig Booker	Debra Henley	714466.0	15000.0	1.333333

这很有趣但不是特别有用。我们可能想要做的是通过 Manager 和 Rep 查看。通过更改索引可以轻松完成。

pd.pivot_table(df,index=["Manager","Rep"])

		Account	Price	Quantity
Manager	Rep
Debra Henley	Craig Booker	720237.0	20000.000000	1.250000
Daniel Hilton	194874.0	38333.333333	1.666667
John Smith	576220.0	20000.000000	1.500000
Fred Anderson	Cedric Moss	196016.5	27500.000000	1.250000
Wendy Yule	614061.5	44250.000000	3.000000

你可以看到数据透视表非常智能，可以通过将 reps 与 manager 分组来开始汇总数据并对其进行汇总。现在我们开始了解数据透视表可以为我们做些什么。

为此，“账户” 和 “数量” 列不真正有用。让我们通过使用 values 字段显式定义我们关心的列来删除它。

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])

		Price
Manager	Rep
Debra Henley	Craig Booker	20000.000000
Daniel Hilton	38333.333333
John Smith	20000.000000
Fred Anderson	Cedric Moss	27500.000000
Wendy Yule	44250.000000

价格列自动平均数据，但我们可以进行计数或者总和。使用 aggfunc 和 np.sum 添加它们很简单。

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)

		Price
Manager	Rep
Debra Henley	Craig Booker	80000
Daniel Hilton	115000
John Smith	40000
Fred Anderson	Cedric Moss	110000
Wendy Yule	177000

aggfunc 可以获取一系列函数。让我们尝试使用 np.mean 函数和 len 来计算。

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])

		mean	len
		Price	Price
Manager	Rep
Debra Henley	Craig Booker	20000.000000	4
Daniel Hilton	38333.333333	3
John Smith	20000.000000	2
Fred Anderson	Cedric Moss	27500.000000	4
Wendy Yule	44250.000000	4

如果我们想要查看按产品细分的销售额，则 columns 变量允许我们定义一个或者多格列。

我认为 pivot_table 的一个令人困惑的问题是使用列和值。请记住，列是可选的 —— 它们提供了一种额外的方法来细分你关心的实际值。聚合函数将应用于你列出的值。

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
               columns=["Product"],aggfunc=[np.sum])

		sum
		Price
	Product	CPU	Maintenance	Monitor	Software
Manager	Rep
Debra Henley	Craig Booker	65000.0	5000.0	NaN	10000.0
Daniel Hilton	105000.0	NaN	NaN	10000.0
John Smith	35000.0	5000.0	NaN	NaN
Fred Anderson	Cedric Moss	95000.0	5000.0	NaN	10000.0
Wendy Yule	165000.0	7000.0	5000.0	NaN

NaN 有点让人抓狂。如果我们想要删除它们，我们可以使用 fill_value 将它们设置为 0 。

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
               columns=["Product"],aggfunc=[np.sum],fill_value=0)

		sum
		Price
	Product	CPU	Maintenance	Monitor	Software
Manager	Rep
Debra Henley	Craig Booker	65000	5000	0	10000
Daniel Hilton	105000	0	0	10000
John Smith	35000	5000	0	0
Fred Anderson	Cedric Moss	95000	5000	0	10000
Wendy Yule	165000	7000	5000	0

我认为添加数量列也是非常有用的，将数量添加到值列表中。

pd.pivot_table(df,index=["Manager","Rep"],values=["Price","Quantity"],
               columns=["Product"],aggfunc=[np.sum],fill_value=0)

		sum
		Price	Quantity
	Product	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	Rep
Debra Henley	Craig Booker	65000	5000	0	10000	2	2	0	1
Daniel Hilton	105000	0	0	10000	4	0	0	1
John Smith	35000	5000	0	0	1	2	0	0
Fred Anderson	Cedric Moss	95000	5000	0	10000	3	1	0	1
Wendy Yule	165000	7000	5000	0	7	3	2	0

有趣的是，你可以将项目移动到索引以获得不同的可视化结果。从列中删除产品并添加都索引中。

pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],aggfunc=[np.sum],fill_value=0)

			sum
			Price	Quantity
Manager	Rep	Product
Debra Henley	Craig Booker	CPU	65000	2
Maintenance	5000	2
Software	10000	1
Daniel Hilton	CPU	105000	4
Software	10000	1
John Smith	CPU	35000	1
Maintenance	5000	2
Fred Anderson	Cedric Moss	CPU	95000	3
Maintenance	5000	1
Software	10000	1
Wendy Yule	CPU	165000	7
Maintenance	7000	3
Monitor	5000	2

对于此数据集，此表示更有意义。现在，如果我想看一些总数怎么办？marginins = True 可以帮助我们实现。

pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],
               aggfunc=[np.sum,np.mean],fill_value=0,margins=True)

			sum	mean
			Price	Quantity	Price	Quantity
Manager	Rep	Product
Debra Henley	Craig Booker	CPU	65000	2	32500	1.000000
Maintenance	5000	2	5000	2.000000
Software	10000	1	10000	1.000000
Daniel Hilton	CPU	105000	4	52500	2.000000
Software	10000	1	10000	1.000000
John Smith	CPU	35000	1	35000	1.000000
Maintenance	5000	2	5000	2.000000
Fred Anderson	Cedric Moss	CPU	95000	3	47500	1.500000
Maintenance	5000	1	5000	1.000000
Software	10000	1	10000	1.000000
Wendy Yule	CPU	165000	7	82500	3.500000
Maintenance	7000	3	7000	3.000000
Monitor	5000	2	5000	2.000000
All			522000	30	30705	1.764706

让我们将分析提升到一个水平，并在 manager 级别查看我们的数据管道。请注意如何根据我们之前的类别定义对状态进行排序。

pd.pivot_table(df,index=["Manager","Status"],values=["Price"],
               aggfunc=[np.sum],fill_value=0,margins=True)

		sum
		Price
Manager	Status
Debra Henley	won	65000
pending	50000
presented	50000
declined	70000
Fred Anderson	won	172000
pending	5000
presented	45000
declined	65000
All		522000

一个非常方便的功能是能够将字典传递给 aggfunc，因此你可以对你选择的每个值执行不同的功能。这具有使标签更清洁的作用。

pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"],
               aggfunc={"Quantity":len,"Price":np.sum},fill_value=0)

		Price	Quantity
	Product	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	Status
Debra Henley	won	65000	0	0	0	1	0	0	0
pending	40000	10000	0	0	1	2	0	0
presented	30000	0	0	20000	1	0	0	2
declined	70000	0	0	0	2	0	0	0
Fred Anderson	won	165000	7000	0	0	2	1	0	0
pending	0	5000	0	0	0	1	0	0
presented	30000	0	5000	10000	1	0	1	1
declined	65000	0	0	0	1	0	0	0

你也可以提供要应用于每个值的聚合函数列表：

table = pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"],
               aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0)
table

		Price	Quantity
		mean	sum	len
	Product	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	Status
Debra Henley	won	65000	0	0	0	65000	0	0	0	1	0	0	0
pending	40000	5000	0	0	40000	10000	0	0	1	2	0	0
presented	30000	0	0	10000	30000	0	0	20000	1	0	0	2
declined	35000	0	0	0	70000	0	0	0	2	0	0	0
Fred Anderson	won	82500	7000	0	0	165000	7000	0	0	2	1	0	0
pending	0	5000	0	0	0	5000	0	0	0	1	0	0
presented	30000	0	5000	10000	30000	0	5000	10000	1	0	1	1
declined	65000	0	0	0	65000	0	0	0	1	0	0	0

尝试将这一切全部拉到一起可能看起来非常疯狂，但是一旦你开始玩数据并慢慢添加项目，你就可以了解它是如何工作的。我的一般经验法则是，一旦你使用多个 group，你应该评估一个数据透视表是否是一个有用的方法。

高级数据透视表过滤

生成数据后，它就位于 DataFrame 中，因此你可以使用标准 DataFrame 函数对其进行过滤。

如果你只想看一个 manager：

table.query('Manager == ["Debra Henley"]')

		Price	Quantity
		mean	sum	len
	Product	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	Status
Debra Henley	won	65000	0	0	0	65000	0	0	0	1	0	0	0
pending	40000	5000	0	0	40000	10000	0	0	1	2	0	0
presented	30000	0	0	10000	30000	0	0	20000	1	0	0	2
declined	35000	0	0	0	70000	0	0	0	2	0	0	0

我们可以查看所有待处理和赢得的交易。

table.query('Status == ["pending","won"]')

		Price	Quantity
		mean	sum	len
	Product	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	Status
Debra Henley	won	65000	0	0	0	65000	0	0	0	1	0	0	0
pending	40000	5000	0	0	40000	10000	0	0	1	2	0	0
Fred Anderson	won	82500	7000	0	0	165000	7000	0	0	2	1	0	0
pending	0	5000	0	0	0	5000	0	0	0	1	0	0

这是 pivot_table 的强大功能，所以一旦你将数据转换为你需要的 pivot_table 格式，请不要忘记你拥有一个强大的功能。

为了总结所有这些，我创建了一个被王丹，希望能帮助你记住 pandas pivot_table 的使用方法，如下：

在这里插入图片描述