数据分析day04

7.pandas高级操作

In [53]:

import pandas as pd
from pandas import DataFrame
import numpy as np

替换操作

替换操作可以同步作用于Series和DataFrame中
单值替换
- 普通替换：替换所有符合要求的元素:to_replace=15,value='e'
- 按列指定单值替换： to_replace={列标签：替换值} value='value'
多值替换
- 列表替换: to_replace=[] value=[]
- 字典替换（推荐） to_replace={to_replace:value,to_replace:value}

In [4]:

df = DataFrame(data=np.random.randint(0,100,size=(6,7)))
df

Out[4]:

	0	1	2	3	4	5	6
0	44	62	3	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49

In [5]:

df.replace(to_replace=3,value='Three')

Out[5]:

	0	1	2	3	4	5	6
0	44	62	Three	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49

In [6]:

df.replace(to_replace={3:'aaa'})

Out[6]:

	0	1	2	3	4	5	6
0	44	62	aaa	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49

In [8]:

#替换指定列中的值
df.replace(to_replace={5:77},value='6666666')

Out[8]:

	0	1	2	3	4	5	6
0	44	62	3	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	6666666	49

映射操作

概念：创建一个映射关系列表，把values元素和一个特定的标签或者字符串绑定（给一个元素值提供不同的表现形式）
创建一个df，两列分别是姓名和薪资，然后给其名字起对应的中文名

扫描二维码关注公众号，回复： 8846933 查看本文章

In [10]:

dic = {
    'name':['jay','tom','jay'],
    'salary':[10000,20000,10000]
}
df = DataFrame(data=dic)
df

Out[10]:

	name	salary
0	jay	10000
1	tom	20000
2	jay	10000

In [14]:

#映射关系表
dic = {
    'jay':'张三',
    'tom':'李四'
}
df['c_name'] = df['name'].map(dic)
df

Out[14]:

	name	salary	c_name
0	jay	10000	张三
1	tom	20000	李四
2	jay	10000	张三

运算工具

超过3000部分的钱缴纳50%的税，计算每个人的税后薪资

In [16]:

def after_sal(s):
    return s - (s-3000)*0.5

In [18]:

df['after_salary'] = df['salary'].map(after_sal)
df

Out[18]:

	name	salary	c_name	after_salary
0	jay	10000	张三	6500.0
1	tom	20000	李四	11500.0
2	jay	10000	张三	6500.0

映射索引

使用rename()函数替换行列索引
参数介绍：
- index 替换行索引
- columns 替换列索引

In [19]:

df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
df4

Out[19]:

	color	value
0	white	2
1	gray	5
2	purple	9
3	blue	0
4	green	1

In [20]:

new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
new_col={'color':'cc','value':'vv'}
df4.rename(new_index,columns=new_col)

Out[20]:

	cc	vv
first	white	2
two	gray	5
three	purple	9
four	blue	0
five	green	1

排序实现的随机抽样

take()
np.random.permutation()

In [22]:

df = DataFrame(data=np.random.randint(0,100,size=(100,3)),columns=['A','B','C'])
df

In [24]:

# df.take(['B','A','C'],axis=1)
df.take([1,0,2],axis=1)

In [32]:

np.random.permutation(3) #返回随机序列

Out[32]:

array([0, 1, 2])

In [31]:

#将行列索引打乱
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)

In [35]:

df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)[0:50]

数据的分类处理

数据分类处理的核心：
- groupby()函数
- groups属性查看分组情况

In [36]:

df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
               'color':['red','yellow','yellow','green','green','green'],
               'weight':[12,20,50,30,20,44]})
df

Out[36]:

	color	item	price	weight
0	red	Apple	4.0	12
1	yellow	Banana	3.0	20
2	yellow	Orange	3.0	50
3	green	Banana	2.5	30
4	green	Orange	4.0	20
5	green	Apple	2.0	44

In [37]:

#根据水果的种类进行分组
df.groupby(by='item')

Out[37]:

<pandas.core.groupby.DataFrameGroupBy object at 0x0000019782507F60>

In [38]:

#调用groups查看分组情况
df.groupby(by='item').groups

Out[38]:

{'Apple': Int64Index([0, 5], dtype='int64'),
 'Banana': Int64Index([1, 3], dtype='int64'),
 'Orange': Int64Index([2, 4], dtype='int64')}

In [40]:

#计算出每一种水果的平均价格
df.groupby(by='item').mean()['price']

Out[40]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [41]:

df.groupby(by='item')['price'].mean() #推荐

Out[41]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [42]:

#计算不同颜色水果的平均重量
df.groupby(by='color')['weight'].mean()

Out[42]:

color
green     31.333333
red       12.000000
yellow    35.000000
Name: weight, dtype: float64

In [44]:

#将每一种水果的平均价格计算出来且汇总到原数据中
df

Out[44]:

	color	item	price	weight
0	red	Apple	4.0	12
1	yellow	Banana	3.0	20
2	yellow	Orange	3.0	50
3	green	Banana	2.5	30
4	green	Orange	4.0	20
5	green	Apple	2.0	44

In [47]:

series_price = df.groupby(by='item')['price'].mean() 
dic = series_price.to_dict()
dic #映射关系表

Out[47]:

{'Apple': 3.0, 'Banana': 2.75, 'Orange': 3.5}

In [49]:

df['mean_price'] = df['item'].map(dic)
df

Out[49]:

	color	item	price	weight	mean_price
0	red	Apple	4.0	12	3.00
1	yellow	Banana	3.0	20	2.75
2	yellow	Orange	3.0	50	3.50
3	green	Banana	2.5	30	2.75
4	green	Orange	4.0	20	3.50
5	green	Apple	2.0	44	3.00

高级数据聚合

使用groupby分组后，也可以使用transform和apply提供自定义函数实现更多的运算
df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
transform和apply都会进行运算，在transform或者apply中传入函数即可
transform和apply也可以传入一个lambda表达式

In [56]:

def myMean(s):
    sum = 0
    for i in s:
        sum += i
    return sum/len(s)

In [57]:

df.groupby(by='item')['price'].apply(myMean) #apply充当聚合的运算工具

Out[57]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [58]:

df.groupby(by='item')['price'].transform(myMean) #apply充当聚合的运算工具

Out[58]:

0    3.00
1    2.75
2    3.50
3    2.75
4    3.50
5    3.00
Name: price, dtype: float64

数据加载

读取type-.txt文件数据

In [50]:

data_1 = pd.read_csv('./data/type-.txt',sep='-',header=None)

将文件中每一个词作为元素存放在DataFrame中

In [ ]:

读取数据库中的数据

In [46]:

#连接数据库，获取连接对象
import sqlite3 as sqlite3
conn=sqlite3.connect('./data/weather_2012.sqlite')

In [47]:

#读取库表中的数据值
sql_df=pd.read_sql('select * from weather_2012',conn)
sql_df

In [51]:

#将一个df中的数据值写入存储到db
data_1.to_sql('sql_data123',conn)

In [52]:

pd.read_sql('select * from sql_data123',conn)

Out[52]:

	index	0	1	2
0	0	你好	我好	他也好
1	1	也许	大概	有可能
2	2	然而	未必	不见得

透视表

透视表是一种可以对数据动态排布并且分类汇总的表格格式。或许大多数人都在Excel使用过数据透视表，也体会到它的强大功能，而在pandas中它被称作pivot_table。
透视表的优点：
- 灵活性高，可以随意定制你的分析计算要求
- 脉络清晰易于理解数据
- 操作性强，报表神器

In [6]:

import pandas as pd
import numpy as np

In [15]:

df = pd.read_csv('./data/games.csv',encoding='utf-8')
df.head()

Out[15]:

	对手	胜负	主客场	命中	投篮数	投篮命中率	3分命中率	篮板	助攻	得分
0	勇士	胜	客	10	23	0.435	0.444	6	11	27
1	国王	胜	客	8	21	0.381	0.286	3	9	27
2	小牛	胜	主	10	19	0.526	0.462	3	7	29
3	灰熊	负	主	8	20	0.400	0.250	5	8	22
4	76人	胜	客	10	20	0.500	0.250	3	13	27

pivot_table有四个最重要的参数index、values、columns、aggfunc

index参数：分类汇总的分类条件
- 每个pivot_table必须拥有一个index。如果想查看哈登对阵每个队伍的得分则需要对每一个队进行分类并计算其各类得分的平均值：

In [16]:

df.pivot_table(index='对手')

想看看对阵同一对手在不同主客场下的数据，分类条件为对手和主客场

In [17]:

df.pivot_table(index=['对手','主客场'])

values参数：需要对计算的数据进行筛选
- 如果我们只需要哈登在主客场和不同胜负情况下的得分、篮板与助攻三项数据：

In [19]:

df.pivot_table(index=['主客场','胜负'],values=['得分','篮板','助攻'])

Out[19]:

		助攻	得分	篮板
主客场	胜负
主	胜	10.555556	34.222222	5.444444
负	8.666667	29.666667	5.000000
客	胜	9.000000	32.000000	4.916667
负	8.000000	20.000000	4.000000

Aggfunc参数：设置我们对数据聚合时进行的函数操作
- 当我们未设置aggfunc时，它默认aggfunc='mean'计算均值。
还想获得james harden在主客场和不同胜负情况下的总得分、总篮板、总助攻时：

In [23]:

df.pivot_table(index=['主客场','胜负'],values=['得分','篮板','助攻'],aggfunc='sum')

Out[23]:

		助攻	得分	篮板
主客场	胜负
主	胜	95	308	49
负	26	89	15
客	胜	108	384	59
负	8	20	4

In [24]:

#还想获得james harden在主客场和不同胜负情况下的总得分、平均篮板、最大助攻时
df.pivot_table(index=['主客场','胜负'],aggfunc={'得分':'sum','篮板':'mean','助攻':'max'})

Out[24]:

		助攻	得分	篮板
主客场	胜负
主	胜	17	308	5.444444
负	11	89	5.000000
客	胜	15	384	4.916667
负	8	20	4.000000

Columns:可以设置列层次字段
- 对values字段进行分类

In [35]:

df.pivot_table(index='主客场',values='得分',aggfunc='sum',columns='对手')

Out[35]:

对手	76人	勇士	国王	太阳	小牛	尼克斯	开拓者	掘金	步行者	湖人	灰熊	爵士	猛龙	篮网	老鹰	骑士	鹈鹕	黄蜂
主客场
主	29.0	NaN	NaN	NaN	29.0	37.0	NaN	21.0	29.0	NaN	60.0	56.0	38.0	37.0	NaN	35.0	26.0	NaN
客	27.0	27.0	27.0	48.0	NaN	31.0	48.0	NaN	26.0	36.0	49.0	29.0	NaN	NaN	29.0	NaN	NaN	27.0

交叉表

是一种用于计算分组的特殊透视图,对数据进行汇总
pd.crosstab(index,colums)
- index:分组数据，交叉表的行索引
- columns:交叉表的列索引

In [36]:

df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
               'age':[15,23,25,17,35,57,24,31,22],
               'smoke':[True,False,False,True,True,False,False,True,False],
               'height':[168,179,181,166,173,178,188,190,160]})
df

Out[36]:

	age	height	sex	smoke
0	15	168	man	True
1	23	179	man	False
2	25	181	women	False
3	17	166	women	True
4	35	173	man	True
5	57	178	women	False
6	24	188	man	False
7	31	190	women	True
8	22	160	women	False

求出各个性别抽烟的人数

In [37]:

pd.crosstab(df.smoke,df.sex)

Out[37]:

sex	man	women
smoke
False	2	3
True	2	2

In [38]:

pd.crosstab(df.sex,df.smoke)

Out[38]:

smoke	False	True
sex
man	2	2
women	3	2

求出各个年龄段抽烟人情况

In [41]:

pd.crosstab(df.age,df.smoke)

Out[41]:

smoke	False	True
age
15	0	1
17	0	1
22	1	0
23	1	0
24	1	0
25	1	0
31	0	1
35	0	1
57	1	0

8. 2012美国大选献金项目数据分析

In [51]:

import pandas as pd
from pandas import DataFrame
import numpy as np

In [52]:

#方便大家操作，将月份和参选人以及所在政党进行定义：
months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6,
          'JUL' : 7, 'AUG' : 8, 'SEP' : 9, 'OCT': 10, 'NOV': 11, 'DEC' : 12}
parties = {
  'Bachmann, Michelle': 'Republican',
  'Romney, Mitt': 'Republican',
  'Obama, Barack': 'Democrat',
  "Roemer, Charles E. 'Buddy' III": 'Reform',
  'Pawlenty, Timothy': 'Republican',
  'Johnson, Gary Earl': 'Libertarian',
  'Paul, Ron': 'Republican',
  'Santorum, Rick': 'Republican',
  'Cain, Herman': 'Republican',
  'Gingrich, Newt': 'Republican',
  'McCotter, Thaddeus G': 'Republican',
  'Huntsman, Jon': 'Republican',
  'Perry, Rick': 'Republican'           
 }

需求

加载数据
查看数据的基本信息
指定数据截取，将如下字段的数据进行提取，其他数据舍弃
- cand_nm ：候选人姓名
- contbr_nm ：捐赠人姓名
- contbr_st ：捐赠人所在州
- contbr_employer ：捐赠人所在公司
- contbr_occupation ：捐赠人职业
- contb_receipt_amt ：捐赠数额（美元）
- contb_receipt_dt ：捐款的日期
对新数据进行总览df.info(),查看是否存在缺失数据
用统计学指标快速描述数值型属性的概要。df.describe()
空值处理。可能因为忘记填写或者保密等等原因，相关字段出现了空值，将其填充为NOT PROVIDE
异常值处理。将捐款金额<=0的数据删除
新建一列为各个候选人所在党派party
查看party这一列中有哪些不同的元素
统计party列中各个元素出现次数
查看各个党派收到的政治献金总数contb_receipt_amt
查看具体每天各个党派收到的政治献金总数contb_receipt_amt
将表中日期格式转换为'yyyy-mm-dd'。
查看老兵(捐献者职业)DISABLED VETERAN主要支持谁
找出各个候选人的捐赠者中，捐赠金额最大的人的职业以及捐献额

In [53]:

#加载数据，查看数据的基本信息
df = pd.read_csv('./data/usa_election.txt')
df.head()
C:\Users\laonanhai\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2728: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Out[53]:

	cmte_id	cand_id	cand_nm	contbr_nm	contbr_city	contbr_st	contbr_zip	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt	receipt_desc	memo_cd	memo_text	form_tp	file_num
0	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	250.0	20-JUN-11	NaN	NaN	NaN	SA17A	736166
1	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	50.0	23-JUN-11	NaN	NaN	NaN	SA17A	736166
2	C00410118	P20002978	Bachmann, Michelle	SMITH, LANIER	LANETT	AL	3.68633e+08	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	05-JUL-11	NaN	NaN	NaN	SA17A	749073
3	C00410118	P20002978	Bachmann, Michelle	BLEVINS, DARONDA	PIGGOTT	AR	7.24548e+08	NONE	RETIRED	250.0	01-AUG-11	NaN	NaN	NaN	SA17A	749073
4	C00410118	P20002978	Bachmann, Michelle	WARDENBURG, HAROLD	HOT SPRINGS NATION	AR	7.19016e+08	NONE	RETIRED	300.0	20-JUN-11	NaN	NaN	NaN	SA17A	736166

In [54]:

#查看原始数据中是否存在缺失数据
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
cmte_id              536041 non-null object
cand_id              536041 non-null object
cand_nm              536041 non-null object
contbr_nm            536041 non-null object
contbr_city          536026 non-null object
contbr_st            536040 non-null object
contbr_zip           535973 non-null object
contbr_employer      525088 non-null object
contbr_occupation    530520 non-null object
contb_receipt_amt    536041 non-null float64
contb_receipt_dt     536041 non-null object
receipt_desc         8479 non-null object
memo_cd              49718 non-null object
memo_text            52740 non-null object
form_tp              536041 non-null object
file_num             536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB

In [55]:

df.describe()

Out[55]:

	contb_receipt_amt	file_num
count	5.360410e+05	536041.000000
mean	3.750373e+02	761472.107800
std	3.564436e+03	5148.893508
min	-3.080000e+04	723511.000000
25%	5.000000e+01	756218.000000
50%	1.000000e+02	763233.000000
75%	2.500000e+02	763621.000000
max	1.944042e+06	767394.000000

In [56]:

#空值处理。可能因为忘记填写或者保密等等原因，相关字段出现了空值，将其填充为NOT PROVIDE
df.fillna(value='NOT PROVIDE',inplace=True)

In [57]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
cmte_id              536041 non-null object
cand_id              536041 non-null object
cand_nm              536041 non-null object
contbr_nm            536041 non-null object
contbr_city          536041 non-null object
contbr_st            536041 non-null object
contbr_zip           536041 non-null object
contbr_employer      536041 non-null object
contbr_occupation    536041 non-null object
contb_receipt_amt    536041 non-null float64
contb_receipt_dt     536041 non-null object
receipt_desc         536041 non-null object
memo_cd              536041 non-null object
memo_text            536041 non-null object
form_tp              536041 non-null object
file_num             536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB

In [58]:

#异常值处理。将捐款金额<=0的数据删除
df = df.loc[~(df['contb_receipt_amt'] <= 0)]

In [59]:

#查看当前有多少人参与了竞选
df['cand_nm'].unique()

Out[59]:

array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
       'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
       'Huntsman, Jon', 'Perry, Rick'], dtype=object)

In [60]:

df['cand_nm'].nunique()

Out[60]:

In [61]:

#新建一列为各个候选人所在党派party
df['party'] = df['cand_nm'].map(parties)
df.head()

Out[61]:

	cmte_id	cand_id	cand_nm	contbr_nm	contbr_city	contbr_st	contbr_zip	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt	receipt_desc	memo_cd	memo_text	form_tp	file_num	party
0	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	250.0	20-JUN-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican
1	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	50.0	23-JUN-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican
2	C00410118	P20002978	Bachmann, Michelle	SMITH, LANIER	LANETT	AL	3.68633e+08	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	05-JUL-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	749073	Republican
3	C00410118	P20002978	Bachmann, Michelle	BLEVINS, DARONDA	PIGGOTT	AR	7.24548e+08	NONE	RETIRED	250.0	01-AUG-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	749073	Republican
4	C00410118	P20002978	Bachmann, Michelle	WARDENBURG, HAROLD	HOT SPRINGS NATION	AR	7.19016e+08	NONE	RETIRED	300.0	20-JUN-11	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican

In [62]:

#查看party这一列中有哪些不同的元素,统计party列中各个元素出现次数
df['party'].unique()

Out[62]:

array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)

In [63]:

df['party'].value_counts()#value_counts()统计Serise中每一元素出现的次数

Out[63]:

Democrat       289999
Republican     234300
Reform           5313
Libertarian       702
Name: party, dtype: int64

In [64]:

#查看各个党派收到的政治献金总数contb_receipt_amt
df.groupby(by='party',axis=0)['contb_receipt_amt'].sum()

Out[64]:

party
Democrat       8.259441e+07
Libertarian    4.132769e+05
Reform         3.429658e+05
Republican     1.251181e+08
Name: contb_receipt_amt, dtype: float64

In [65]:

#查看具体每天各个党派收到的政治献金总数contb_receipt_amt
df.groupby(by=['contb_receipt_dt','party'])['contb_receipt_amt'].sum()

Out[65]:

contb_receipt_dt  party      
01-APR-11         Reform              50.00
                  Republican       12635.00
01-AUG-11         Democrat        182198.00
                  Libertarian       1000.00
                  Reform            1847.00
                  Republican      268903.02
01-DEC-11         Democrat        651982.82
                  Libertarian        725.00
                  Reform             875.00
                  Republican      505255.96
01-FEB-11         Republican         250.00
01-JAN-11         Republican        8600.00
01-JAN-12         Democrat         74303.80
                  Reform             515.00
                  Republican       76804.72
01-JUL-11         Democrat        175364.00
                  Libertarian       2000.00
                  Reform             100.00
                  Republican      125973.72
01-JUN-11         Democrat        148409.00
                  Libertarian        500.00
                  Reform              50.00
                  Republican      435609.20
01-MAR-11         Republican        1000.00
01-MAY-11         Democrat         82644.00
                  Reform             480.00
                  Republican       28663.87
01-NOV-11         Democrat        129309.87
                  Libertarian       3000.00
                  Reform            1792.00
                                    ...    
30-OCT-11         Reform            3910.00
                  Republican       46413.16
30-SEP-11         Democrat       3409587.24
                  Libertarian        550.00
                  Reform            2050.00
                  Republican     5094824.20
31-AUG-11         Democrat        375487.44
                  Libertarian      10750.00
                  Reform             450.00
                  Republican     1038330.90
31-DEC-11         Democrat       3571793.57
                  Reform             695.00
                  Republican     1165777.72
31-JAN-11         Republican        6000.00
31-JAN-12         Democrat       1421887.31
                  Reform             150.00
                  Republican      963681.41
31-JUL-11         Democrat         20305.00
                  Reform            1066.00
                  Republican       12781.02
31-MAR-11         Reform             200.00
                  Republican       74575.00
31-MAY-11         Democrat        352005.66
                  Libertarian        250.00
                  Reform             100.00
                  Republican      313839.80
31-OCT-11         Democrat        216971.87
                  Libertarian       4250.00
                  Reform            3205.00
                  Republican      751542.36
Name: contb_receipt_amt, Length: 1183, dtype: float64

In [66]:

df.columns

Out[66]:

Index(['cmte_id', 'cand_id', 'cand_nm', 'contbr_nm', 'contbr_city',
       'contbr_st', 'contbr_zip', 'contbr_employer', 'contbr_occupation',
       'contb_receipt_amt', 'contb_receipt_dt', 'receipt_desc', 'memo_cd',
       'memo_text', 'form_tp', 'file_num', 'party'],
      dtype='object')

In [67]:

#将表中日期格式转换为'yyyy-mm-dd'。
def tranformDate(d):
    day,month,year = d.split('-')
    month = months[month]
    return '20'+year+'-'+str(month)+'-'+day
df['contb_receipt_dt'] = df['contb_receipt_dt'].map(tranformDate)
df.head()

Out[67]:

	cmte_id	cand_id	cand_nm	contbr_nm	contbr_city	contbr_st	contbr_zip	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt	receipt_desc	memo_cd	memo_text	form_tp	file_num	party
0	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	250.0	2011-6-20	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican
1	C00410118	P20002978	Bachmann, Michelle	HARVEY, WILLIAM	MOBILE	AL	3.6601e+08	RETIRED	RETIRED	50.0	2011-6-23	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican
2	C00410118	P20002978	Bachmann, Michelle	SMITH, LANIER	LANETT	AL	3.68633e+08	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	2011-7-05	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	749073	Republican
3	C00410118	P20002978	Bachmann, Michelle	BLEVINS, DARONDA	PIGGOTT	AR	7.24548e+08	NONE	RETIRED	250.0	2011-8-01	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	749073	Republican
4	C00410118	P20002978	Bachmann, Michelle	WARDENBURG, HAROLD	HOT SPRINGS NATION	AR	7.19016e+08	NONE	RETIRED	300.0	2011-6-20	NOT PROVIDE	NOT PROVIDE	NOT PROVIDE	SA17A	736166	Republican

In [68]:

#查看老兵(捐献者职业)DISABLED VETERAN主要支持谁(捐钱数量)

#1.将老兵对应的行数据取出
old_bing_df = df.loc[df['contbr_occupation'] == 'DISABLED VETERAN']

#2.对候选人分组钱数的聚合
old_bing_df.groupby(by='cand_nm')['contb_receipt_amt'].sum()

Out[68]:

cand_nm
Cain, Herman       300.00
Obama, Barack     4205.00
Paul, Ron         2425.49
Santorum, Rick     250.00
Name: contb_receipt_amt, dtype: float64

In [69]:

#找出各个候选人的捐赠者中，捐赠金额最大的人的职业以及捐献额
s = df.groupby(by=['cand_nm'])['contb_receipt_amt'].max()
s

Out[69]:

cand_nm
Bachmann, Michelle                   3022.00
Cain, Herman                        10000.00
Gingrich, Newt                       5100.00
Huntsman, Jon                        5000.00
Johnson, Gary Earl                   2500.00
McCotter, Thaddeus G                 4000.00
Obama, Barack                     1944042.43
Paul, Ron                            5000.00
Pawlenty, Timothy                   10000.00
Perry, Rick                         10000.00
Roemer, Charles E. 'Buddy' III        200.00
Romney, Mitt                        12700.00
Santorum, Rick                       5000.00
Name: contb_receipt_amt, dtype: float64

In [70]:

s.index[0]

Out[70]:

'Bachmann, Michelle'

In [71]:

for i in range(len(s)):
    q_str = 'cand_nm == "%s" & contb_receipt_amt==%d'%(s.index[i],s.values[i])
    display(df.query(q_str))

9.matplotlib绘图

plt.plot()绘制线性图

绘制单条线形图
绘制多条线形图
多个曲线图绘制在一个table区域中（subplot()函数）
设置坐标系的比例plt.figure(figsize=(a,b))
设置图例legend()
设置轴的标识
图例保存
- fig = plt.figure()
- plt.plot(x,y)
- figure.savefig（）
曲线的样式和风格（自学）

In [6]:

import numpy as np

In [4]:

import matplotlib.pyplot as plt
%matplotlib inline #保证绘制的图像可以被正常的显示加载出来
UsageError: unrecognized arguments: #保证绘制的图像可以被正常的显示加载出来

In [5]:

x = [1,2,3,4,5]
y = [5,4,3,2,1]

plt.plot(x,y)

Out[5]:

[<matplotlib.lines.Line2D at 0x1a72b583cf8>]

In [8]:

#在一个坐标系中绘制两条线段
xx = np.linspace(-np.pi,np.pi,num=20)
yy = xx ** 2
plt.plot(x,y)  #plot多次被调用，绘制多条线段
plt.plot(xx,yy)

Out[8]:

[<matplotlib.lines.Line2D at 0x1a72b870ef0>]

In [15]:

#将多个坐标放置在一个表格中
ax1 = plt.subplot(2,2,1) #表格大小和坐标存放的位置
ax1.plot(x,y)


ax2 = plt.subplot(2,2,2)
ax2.plot(xx,yy)


ax3 = plt.subplot(2,2,3)
ax3.plot(xx,yy)


ax4 = plt.subplot(2,2,4)
ax4.plot(x,y)

Out[15]:

[<matplotlib.lines.Line2D at 0x1a72bd5fb70>]

In [17]:

#plt.figure(figsize=(a,b))
plt.figure(figsize=(4,8))
plt.plot(x,y)

In [23]:

#图例的设定
plt.plot(xx,yy,label='aaa')
plt.plot(xx-1,yy+1,label='bbb')
plt.legend(loc=1)

Out[23]:

<matplotlib.legend.Legend at 0x1a72d8d4198>

In [25]:

#给坐标轴设定标识
plt.plot(xx-1,yy+1,label='bbb')
plt.xlabel('distence')
plt.ylabel('temp')
plt.title('aaa')

Out[25]:

Text(0.5,1,'aaa')

柱状图：plt.bar()

参数：第一个参数是索引。第二个参数是数据值。第三个参数是条形的宽度

In [33]:

plt.bar(x,y)

Out[33]:

<Container object of 5 artists>

In [29]:

plt.barh(x,y)

Out[29]:

<Container object of 5 artists>

直方图

是一个特殊的柱状图，又叫做密度图
plt.hist()的参数
- bins
  可以是一个bin数量的整数值，也可以是表示bin的一个序列。默认值为10
- normed
  如果值为True，直方图的值将进行归一化处理，形成概率密度，默认值为False
- color
  指定直方图的颜色。可以是单一颜色值或颜色的序列。如果指定了多个数据集合,例如DataFrame对象，颜色序列将会设置为相同的顺序。如果未指定，将会使用一个默认的线条颜色
- orientation
  通过设置orientation为horizontal创建水平直方图。默认值为vertical

In [40]:

x = [1,1,2,3,4,5,5,5,6,7,7,7,7,7,7,7,8]
plt.hist(x,bins=20)

Out[40]:

(array([2., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 3., 0., 0., 1., 0., 0.,
        7., 0., 1.]),
 array([1.  , 1.35, 1.7 , 2.05, 2.4 , 2.75, 3.1 , 3.45, 3.8 , 4.15, 4.5 ,
        4.85, 5.2 , 5.55, 5.9 , 6.25, 6.6 , 6.95, 7.3 , 7.65, 8.  ]),
 <a list of 20 Patch objects>)

饼图

pie()，饼图也只有一个参数x
饼图适合展示各部分占总体的比例，条形图适合比较各部分的大小

In [41]:

arr=[11,22,31,15]
plt.pie(arr)

Out[41]:

([<matplotlib.patches.Wedge at 0x1a72dae13c8>,
  <matplotlib.patches.Wedge at 0x1a72dae1748>,
  <matplotlib.patches.Wedge at 0x1a72c37ea58>,
  <matplotlib.patches.Wedge at 0x1a72bc76ac8>],
 [Text(0.996424,0.465981,''),
  Text(-0.195798,1.08243,''),
  Text(-0.830021,-0.721848,''),
  Text(0.910034,-0.61793,'')])

In [42]:

arr=[0.2,0.3,0.1]
plt.pie(arr)

Out[42]:

([<matplotlib.patches.Wedge at 0x1a72ef21630>,
  <matplotlib.patches.Wedge at 0x1a72ef21b00>,
  <matplotlib.patches.Wedge at 0x1a72ef28080>],
 [Text(0.889919,0.646564,''),
  Text(-0.646564,0.889919,''),
  Text(-1.04616,-0.339919,'')])

In [43]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'])

Out[43]:

([<matplotlib.patches.Wedge at 0x1a72ef61d68>,
  <matplotlib.patches.Wedge at 0x1a72ef69278>,
  <matplotlib.patches.Wedge at 0x1a72ef697b8>,
  <matplotlib.patches.Wedge at 0x1a72ef69cf8>],
 [Text(0.996424,0.465981,'a'),
  Text(-0.195798,1.08243,'b'),
  Text(-0.830021,-0.721848,'c'),
  Text(0.910034,-0.61793,'d')])

In [44]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3)

Out[44]:

([<matplotlib.patches.Wedge at 0x1a72efb18d0>,
  <matplotlib.patches.Wedge at 0x1a72efb1da0>,
  <matplotlib.patches.Wedge at 0x1a72efbb320>,
  <matplotlib.patches.Wedge at 0x1a72efbb860>],
 [Text(0.271752,0.127086,'a'),
  Text(-0.0533994,0.295209,'b'),
  Text(-0.226369,-0.196868,'c'),
  Text(0.248191,-0.168526,'d')])

In [45]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3,autopct='%.6f%%')

Out[45]:

([<matplotlib.patches.Wedge at 0x1a72f0024a8>,
  <matplotlib.patches.Wedge at 0x1a72f002ba8>,
  <matplotlib.patches.Wedge at 0x1a72f00a358>,
  <matplotlib.patches.Wedge at 0x1a72f00aac8>],
 [Text(0.271752,0.127086,'a'),
  Text(-0.0533994,0.295209,'b'),
  Text(-0.226369,-0.196868,'c'),
  Text(0.248191,-0.168526,'d')],
 [Text(0.543504,0.254171,'13.924050%'),
  Text(-0.106799,0.590419,'27.848101%'),
  Text(-0.452739,-0.393735,'39.240506%'),
  Text(0.496382,-0.337053,'18.987341%')])

In [46]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3,shadow=True,explode=[0.2,0.3,0.2,0.4])

Out[46]:

([<matplotlib.patches.Wedge at 0x1a72f04e940>,
  <matplotlib.patches.Wedge at 0x1a72f056128>,
  <matplotlib.patches.Wedge at 0x1a72f056940>,
  <matplotlib.patches.Wedge at 0x1a72f062198>],
 [Text(0.45292,0.21181,'a'),
  Text(-0.106799,0.590419,'b'),
  Text(-0.377282,-0.328113,'c'),
  Text(0.579113,-0.393228,'d')])

散点图scatter()

因变量随自变量而变化的大致趋势

In [49]:

x = np.array([1,2,3,4,5])
y = x ** 2
plt.scatter(x,y)

Out[49]:

<matplotlib.collections.PathCollection at 0x1a72f089438>

In [51]:

x = np.random.random(size=(20,))
y = np.random.random(size=(20,))
plt.scatter(x,y)

Out[51]:

<matplotlib.collections.PathCollection at 0x1a72fe59e10>

temp dist

Type Markdown and LaTeX: α2α2

项目需求

第一部分：数据类型处理

数据加载
- 字段含义：
  - user_id:用户ID
  - order_dt:购买日期
  - order_product:购买产品的数量
  - order_amount:购买金额
观察数据
- 查看数据的数据类型
- 数据中是否存储在缺失值
- 将order_dt转换成时间类型
- 查看数据的统计描述
  - 计算所有用户购买商品的平均数量
  - 计算所有用户购买商品的平均花费
- 在源数据中添加一列表示月份

第二部分：按月数据分析

用户每月购买的总金额
- 绘制曲线图展示
所有用户每月的消费总次数
所有用户每月的产品购买量
统计每月的消费人数

第三部分：用户个体消费数据分析

用户消费总金额和消费总次数的统计描述
用户消费金额和消费次数的散点图
用户消费金额的分布图
用户消费次数的分布图(消费次数在100次之内的分布)

第四部分：用户消费行为分析

用户第一次消费的时间分布，和人数统计
- 绘制线形图
用户最后一次消费的时间分布，和人数统计
- 绘制线形图
新老客户的占比
- 消费一次为新用户
- 消费多次为老用户
用户分层
- 分析得出每个用户的总购买量和总消费金额and最近一次消费的时间
用户的生命周期
- 将用户划分为活跃用户和其他用户
- 查看每月活跃用户和其他用户的占比

	0	1	2	3	4	5	6
0	44	62	3	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49

	0	1	2	3	4	5	6
0	44	62	aaa	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49

	0	1	2	3	4	5	6
0	44	62	3	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49

	0	1	2	3	4	5	6
0	44	62	aaa	85	26	47	14
1	15	78	32	98	85	4	51
2	53	75	87	21	45	8	18
3	54	31	67	49	77	25	49
4	18	21	18	31	93	11	0
5	21	54	76	95	70	77	49