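An interesting dataset is the one on the 2012 US presidential election. It records each contributor's name, occupation and employer, address, and contribution amount. The file is about 150 MB.
Some of the fields in the dataset:
contbr_employer: the contributor's employer
cand_nm: the candidate's name
contbr_occupation: the contributor's occupation
contb_receipt_amt: the contribution amount
Reading the dataset: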
import numpy as np
import pandas as pd
In [7]:
file_name = r'D:\datasets\P00000001-ALL.csv'
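fec = pd.read_csv(file_name, low_memory=False)  # low_memory=False infers each column's type from the whole file; without it pandas typically emits a DtypeWarning about mixed column types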
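An alternative to `low_memory=False` is to state the type of the troublesome column explicitly. The following is a minimal sketch on an in-memory CSV; the two rows of `csv_text` are made up for illustration and are not taken from the FEC file. It shows why a column such as `contbr_zip` is usually better read as text.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file.
csv_text = "contbr_nm,contbr_zip\nA,02138\nB,363302385\n"

# With type inference, the zip codes are parsed as integers
# and the leading zero is lost.
df_default = pd.read_csv(io.StringIO(csv_text))

# With an explicit dtype, the column is kept as text.
df_str = pd.read_csv(io.StringIO(csv_text), dtype={'contbr_zip': str})

print(df_default['contbr_zip'].tolist())
print(df_str['contbr_zip'].tolist())
```

Whether this is preferable to `low_memory=False` depends on how many columns have mixed types; for a single known column, the explicit `dtype` is the more targeted choice.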
In [8]:
fec.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id 1001731 non-null object
cand_id 1001731 non-null object
cand_nm 1001731 non-null object
contbr_nm 1001731 non-null object
contbr_city 1001712 non-null object
contbr_st 1001727 non-null object
contbr_zip 1001620 non-null object
contbr_employer 988002 non-null object
contbr_occupation 993301 non-null object
contb_receipt_amt 1001731 non-null float64
contb_receipt_dt 1001731 non-null object
receipt_desc 14166 non-null object
memo_cd 92482 non-null object
memo_text 97770 non-null object
form_tp 1001731 non-null object
file_num 1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
In [10]:
fec.iloc[1000]
Out[10]:
cmte_id C00431171
cand_id P80003353
cand_nm Romney, Mitt
contbr_nm NEUWIEN, SUSAN W. MS.
contbr_city ENTERPRISE
contbr_st AL
contbr_zip 363302385
contbr_employer RETIRED
contbr_occupation RETIRED
contb_receipt_amt 1000
contb_receipt_dt 13-FEB-12
receipt_desc NaN
memo_cd NaN
memo_text NaN
form_tp SA17A
file_num 780124
Name: 1000, dtype: object
Add a political-party column to the dataset (Democrat or Republican) using the map method.
unique_cands = fec.cand_nm.unique()
unique_cands  # list all candidates
Out[12]:
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman',
'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon',
'Perry, Rick'], dtype=object)
In [13]:
unique_cands[2]
Out[13]:
'Obama, Barack'
In [14]:
parties = {'Bachmann, Michelle': 'Republican',
'Cain, Herman': 'Republican',
'Gingrich, Newt': 'Republican',
'Huntsman, Jon': 'Republican',
'Johnson, Gary Earl': 'Republican',
'McCotter, Thaddeus G': 'Republican',
'Obama, Barack': 'Democrat',
'Paul, Ron': 'Republican',
'Pawlenty, Timothy': 'Republican',
'Perry, Rick': 'Republican',
"Roemer, Charles E. 'Buddy' III": 'Republican',
'Romney, Mitt': 'Republican',
'Santorum, Rick': 'Republican'}
In [15]:
fec.cand_nm[123456:123461]
Out[15]:
123456 Obama, Barack
123457 Obama, Barack
123458 Obama, Barack
123459 Obama, Barack
123460 Obama, Barack
Name: cand_nm, dtype: object
In [16]:
# Apply the mapping to a small slice first
fec.cand_nm[123456:123461].map(parties)
Out[16]:
123456 Democrat
123457 Democrat
123458 Democrat
123459 Democrat
123460 Democrat
Name: cand_nm, dtype: object
In [17]:
fec['party'] = fec.cand_nm.map(parties)
fec.party.value_counts()  # number of contribution records for each party
Out[17]:
Democrat 593746
Republican 407985
Name: party, dtype: int64
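The mapping step can be checked on a toy Series. The following is a minimal sketch with made-up data (`toy_cands` and `toy_parties` are illustrative names, not part of the dataset). It shows that `Series.map` with a dict yields NaN for any value missing from the dict, so it is worth counting NaN values before trusting `value_counts`.

```python
import pandas as pd

# Hypothetical toy data; the third name is deliberately absent from the dict.
toy_cands = pd.Series(['Obama, Barack', 'Romney, Mitt', 'Unknown, Person'])
toy_parties = {'Obama, Barack': 'Democrat', 'Romney, Mitt': 'Republican'}

toy_party = toy_cands.map(toy_parties)

# Values absent from the dict become NaN instead of raising an error.
n_unmapped = int(toy_party.isna().sum())
print(toy_party.tolist())
print(n_unmapped)
```

In the notebook above, the two party counts add up to the full 1001731 rows, which suggests that every candidate name was covered by `parties`.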
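The dataset contains both contributions and refunds.
In [18]:
(fec.contb_receipt_amt > 0).value_counts()  # refunds have negative amounts; True counts contributions, False counts refunds and zero amounts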
Out[18]:
True 991475
False 10256
Name: contb_receipt_amt, dtype: int64
In [19]:
# To simplify the analysis, keep only the positive contributions
fec = fec[fec.contb_receipt_amt > 0]
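The filter can be illustrated on a toy frame. This is a sketch with made-up amounts; the `.copy()` call is an assumption on my part rather than something the notebook does. It is one common way to avoid a `SettingWithCopyWarning` when columns of the filtered frame are assigned to later.

```python
import pandas as pd

# Hypothetical amounts: two donations, one refund, one zero entry.
toy = pd.DataFrame({'contb_receipt_amt': [50.0, -25.0, 100.0, 0.0]})

# Boolean mask: True only for strictly positive amounts.
is_donation = toy['contb_receipt_amt'] > 0

# Take an explicit copy so later column assignments act on an independent frame.
donations = toy[is_donation].copy()

print(is_donation.value_counts())
print(donations)
```

Note that a zero amount is excluded by `> 0` along with the negative refund.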
Because Barack Obama and Mitt Romney were the two main candidates, build a subset that contains only the contributions to these two.
In [21]:
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
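1. Donation statistics by occupation and employer
Occupation is related to donation patterns. For example, attorneys tend to donate to Democrats, while business executives tend to donate to Republicans.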
fec.contbr_occupation.value_counts()[:10]
Out[23]:
RETIRED 233990
INFORMATION REQUESTED 35107
ATTORNEY 34286
HOMEMAKER 29931
PHYSICIAN 23432
INFORMATION REQUESTED PER BEST EFFORTS 21138
ENGINEER 14334
TEACHER 13990
CONSULTANT 13273
PROFESSOR 12555
Name: contbr_occupation, dtype: int64
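Clean up the occupations by mapping variants onto one label. Using dict.get with a default lets occupations that have no entry in the mapping pass through unchanged.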
In [24]:
occ_mapping = {
'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
'INFORMATION REQUESTED' : 'NOT PROVIDED',
'INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED',
'C.E.O.': 'CEO'
}
In [25]:
f = lambda x: occ_mapping.get(x, x)  # if x has no entry in the mapping, return x unchanged
fec.contbr_occupation = fec.contbr_occupation.map(f)
In [26]:
fec.contbr_occupation[:10]
Out[26]:
0 RETIRED
1 RETIRED
2 NOT PROVIDED
3 RETIRED
4 RETIRED
5 RETIRED
6 NOT PROVIDED
7 RETIRED
8 RN
9 ELECTRICAL ENGINEER
Name: contbr_occupation, dtype: object
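The reason for going through `dict.get(x, x)` instead of passing the dict straight to `map` can be seen on a toy Series. This is a minimal sketch with made-up values; `toy_occ` and `toy_mapping` are illustrative names.

```python
import pandas as pd

toy_occ = pd.Series(['C.E.O.', 'TEACHER', 'INFORMATION REQUESTED'])
toy_mapping = {'C.E.O.': 'CEO', 'INFORMATION REQUESTED': 'NOT PROVIDED'}

# Passing the dict directly: values without an entry turn into NaN.
direct = toy_occ.map(toy_mapping)

# Using dict.get with the value itself as the default: unmapped values survive.
with_fallback = toy_occ.map(lambda x: toy_mapping.get(x, x))

print(direct.tolist())
print(with_fallback.tolist())
```

With the fallback, 'TEACHER' is kept as is, which is the behavior the cleaning step relies on.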
In [27]:
# Define the same kind of mapping for employers
emp_mapping = {
'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
'INFORMATION REQUESTED' : 'NOT PROVIDED',
'SELF' : 'SELF-EMPLOYED',
'SELF EMPLOYED' : 'SELF-EMPLOYED',
}
In [28]:
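# Apply the mapping to the employer column in place
f = lambda x: emp_mapping.get(x, x)  # if x has no entry in the mapping, return x unchanged
fec.contbr_employer = fec.contbr_employer.map(f)  # map the employer column itself; mapping contbr_occupation here would overwrite employers with occupations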
In [29]:
fec.contbr_employer[:5]
Out[29]:
0 RETIRED
1 RETIRED
2 NOT PROVIDED
3 RETIRED
4 RETIRED
Name: contbr_employer, dtype: object
Use pivot_table to aggregate the donation amounts by party and occupation.
by_occupation = fec.pivot_table(values = 'contb_receipt_amt', index = 'contbr_occupation', columns = 'party', aggfunc='sum')
In [31]:
by_occupation[:5]
Out[31]:
party Democrat Republican
contbr_occupation
MIXED-MEDIA ARTIST / STORYTELLER 100.0 NaN
AREA VICE PRESIDENT 250.0 NaN
RESEARCH ASSOCIATE 100.0 NaN
TEACHER 500.0 NaN
THERAPIST 3900.0 NaN
In [41]:
# Keep only the occupations whose total donations exceed 2 million dollars
over_2mm = by_occupation[by_occupation.sum(axis=1) > 2000000]
over_2mm
Out[41]:
party Democrat Republican
contbr_occupation
ATTORNEY 11141982.97 7.477194e+06
CEO 2074974.79 4.211041e+06
CONSULTANT 2459912.71 2.544725e+06
ENGINEER 951525.55 1.818374e+06
EXECUTIVE 1355161.05 4.138850e+06
HOMEMAKER 4248875.80 1.363428e+07
INVESTOR 884133.00 2.431769e+06
LAWYER 3160478.87 3.912243e+05
MANAGER 762883.22 1.444532e+06
NOT PROVIDED 4866973.96 2.056547e+07
OWNER 1001567.36 2.408287e+06
PHYSICIAN 3735124.94 3.594320e+06
PRESIDENT 1878509.95 4.720924e+06
PROFESSOR 2165071.08 2.967027e+05
REAL ESTATE 528902.09 1.625902e+06
RETIRED 25305116.38 2.356124e+07
SELF-EMPLOYED 672393.40 1.640253e+06
In [42]:
over_2mm.shape
Out[42]:
(17, 2)
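The pivot-then-filter pattern can be reproduced on a few rows. This is a sketch with made-up amounts and a threshold of 100 instead of 2 million; the column names follow the dataset, everything else is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'contbr_occupation': ['ATTORNEY', 'ATTORNEY', 'TEACHER', 'TEACHER', 'CEO'],
    'party': ['Democrat', 'Republican', 'Democrat', 'Democrat', 'Republican'],
    'contb_receipt_amt': [300.0, 200.0, 50.0, 25.0, 400.0],
})

# One row per occupation, one column per party, cells hold summed amounts.
by_occ = toy.pivot_table(values='contb_receipt_amt',
                         index='contbr_occupation',
                         columns='party',
                         aggfunc='sum')

# Row totals across both parties (NaN cells are skipped by sum),
# then keep only the occupations above the threshold.
big = by_occ[by_occ.sum(axis=1) > 100]

print(by_occ)
print(big)
```

Occupation and party combinations with no rows show up as NaN, just as in the real `by_occupation` table.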
In [43]:
import seaborn as sns
%matplotlib inline
In [44]:
over_2mm.plot(kind='barh', figsize=(10,8))
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x114ec780>
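We are also interested in the top donor occupations, or the top employers, among contributors to Obama and Romney. To get this, group by candidate name and take the largest totals within each group.
def get_top_amounts(group, key, n=5):
    # Sum the donations for each value of `key`, then keep the n largest totals
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.nlargest(n)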
In [51]:
grouped = fec_mrbo.groupby('cand_nm')
In [52]:
grouped
Out[52]:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x00000000118FB9E8>
In [53]:
grouped.apply(get_top_amounts, 'contbr_occupation', n=7)  # the 7 occupations with the largest donation totals for each candidate
Out[53]:
cand_nm contbr_occupation
Obama, Barack RETIRED 25305116.38
ATTORNEY 11141982.97
INFORMATION REQUESTED 4866973.96
HOMEMAKER 4248875.80
PHYSICIAN 3735124.94
LAWYER 3160478.87
CONSULTANT 2459912.71
Romney, Mitt RETIRED 11508473.59
INFORMATION REQUESTED PER BEST EFFORTS 11396894.84
HOMEMAKER 8147446.22
ATTORNEY 5364718.82
PRESIDENT 2491244.89
EXECUTIVE 2300947.03
C.E.O. 1968386.11
Name: contb_receipt_amt, dtype: float64
2. Bucketing donation amounts: use the cut function to discretize the donation amounts into buckets by size.
bins = np.array([0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
In [56]:
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels
Out[56]:
411 (10, 100]
412 (100, 1000]
413 (100, 1000]
414 (10, 100]
415 (10, 100]
416 (10, 100]
417 (100, 1000]
418 (10, 100]
419 (100, 1000]
420 (10, 100]
421 (10, 100]
422 (100, 1000]
423 (100, 1000]
424 (100, 1000]
425 (100, 1000]
426 (100, 1000]
427 (1000, 10000]
428 (100, 1000]
429 (100, 1000]
430 (10, 100]
431 (1000, 10000]
432 (100, 1000]
433 (100, 1000]
434 (100, 1000]
435 (100, 1000]
436 (100, 1000]
437 (10, 100]
438 (100, 1000]
439 (100, 1000]
440 (10, 100]
...
701356 (10, 100]
701357 (1, 10]
701358 (10, 100]
701359 (10, 100]
701360 (10, 100]
701361 (10, 100]
701362 (100, 1000]
701363 (10, 100]
701364 (10, 100]
701365 (10, 100]
701366 (10, 100]
701367 (10, 100]
701368 (100, 1000]
701369 (10, 100]
701370 (10, 100]
701371 (10, 100]
701372 (10, 100]
701373 (10, 100]
701374 (10, 100]
701375 (10, 100]
701376 (1000, 10000]
701377 (10, 100]
701378 (10, 100]
701379 (100, 1000]
701380 (1000, 10000]
701381 (10, 100]
701382 (100, 1000]
701383 (1, 10]
701384 (10, 100]
701385 (100, 1000]
Name: contb_receipt_amt, Length: 694282, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
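How `pd.cut` assigns values to buckets is easier to see on a handful of numbers. This is a minimal sketch with made-up amounts and a shortened list of bin edges. Each interval is open on the left and closed on the right, so a value equal to an edge falls into the lower bucket.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, including values that sit exactly on bin edges.
amounts = pd.Series([0.0, 0.5, 5.0, 10.0, 250.0, 2500.0])
toy_bins = np.array([0, 1, 10, 100, 1000, 10000])

buckets = pd.cut(amounts, toy_bins)

# 10.0 lands in (1, 10], not (10, 100], because intervals are right-closed.
# 0.0 is outside (0, 1] and therefore becomes NaN.
print(buckets)
```

In the notebook this NaN case does not arise, because only strictly positive amounts were kept earlier.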
In [61]:
grouped = fec_mrbo.groupby(['cand_nm', labels])
grouped.size().unstack(0)
Out[61]:
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 493.0 77.0
(1, 10] 40070.0 3681.0
(10, 100] 372280.0 31853.0
(100, 1000] 153991.0 43357.0
(1000, 10000] 22284.0 26186.0
(10000, 100000] 2.0 1.0
(100000, 1000000] 3.0 NaN
(1000000, 10000000] 4.0 NaN
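This shows that Obama received far more small donations than Romney. Next, sum the amounts in each bucket and normalize within each bucket to get each candidate's share.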
In [62]:
bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
bucket_sums
Out[62]:
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 318.24 77.00
(1, 10] 337267.62 29819.66
(10, 100] 20288981.41 1987783.76
(100, 1000] 54798531.46 22363381.69
(1000, 10000] 51753705.67 63942145.42
(10000, 100000] 59100.00 12700.00
(100000, 1000000] 1490683.08 NaN
(1000000, 10000000] 7148839.76 NaN
In [64]:
normed_sum = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)  # divide each row by its row total so that each bucket sums to 1
normed_sum
Out[64]:
cand_nm Obama, Barack Romney, Mitt
contb_receipt_amt
(0, 1] 0.805182 0.194818
(1, 10] 0.918767 0.081233
(10, 100] 0.910769 0.089231
(100, 1000] 0.710176 0.289824
(1000, 10000] 0.447326 0.552674
(10000, 100000] 0.823120 0.176880
(100000, 1000000] 1.000000 NaN
(1000000, 10000000] 1.000000 NaN
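The row-wise normalization can be verified by hand on a two-row table. This is a sketch with made-up totals; the row labels 'small' and 'large' are illustrative stand-ins for the buckets.

```python
import pandas as pd

# Hypothetical bucket totals for the two candidates.
sums = pd.DataFrame({'Obama, Barack': [30.0, 60.0],
                     'Romney, Mitt': [10.0, 40.0]},
                    index=['small', 'large'])

# sums.sum(axis=1) is one total per row; axis=0 in div aligns that
# Series with the row index, so every row is divided by its own total.
shares = sums.div(sums.sum(axis=1), axis=0)

print(shares)
```

Each row of `shares` adds up to 1, for example 30 / (30 + 10) = 0.75 and 10 / (30 + 10) = 0.25 in the first row.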
In [65]:
normed_sum[:-2].plot(kind='barh', figsize=(10,8))  # drop the last 2 rows, which contain NaN values
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x11805e48>
3. Donation statistics by state
grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])  # group by candidate and state
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
In [69]:
totals[:5]
Out[69]:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AB 2048.00 0.00
AE 42973.75 5680.00
AK 281840.15 86204.24
AL 543123.48 527303.51
In [72]:
totals.shape
Out[72]:
(67, 2)
Compute each candidate's share of the total donations in each state.
In [73]:
percent = totals.div(totals.sum(axis=1), axis=0)
percent[:8]
Out[73]:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 0.997612 0.002388
AB 1.000000 0.000000
AE 0.883257 0.116743
AK 0.765778 0.234222
AL 0.507390 0.492610
AP 0.957329 0.042671
AR 0.772902 0.227098
AS 1.000000 0.000000
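The whole per-state pipeline (group by two keys, unstack the candidate level, fill gaps with 0, normalize by row) fits in a few lines on toy data. This is a sketch with made-up amounts; only the column names and the two state codes come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'cand_nm': ['Obama, Barack', 'Romney, Mitt', 'Obama, Barack'],
    'contbr_st': ['AK', 'AK', 'AB'],
    'contb_receipt_amt': [30.0, 10.0, 20.0],
})

# Sum per (candidate, state), then move the candidate level into columns.
# A state with no rows for a candidate would be NaN; fillna(0) turns it into 0.
toy_totals = (toy.groupby(['cand_nm', 'contbr_st'])['contb_receipt_amt']
                 .sum()
                 .unstack(0)
                 .fillna(0))

# Share of each state's total that went to each candidate.
toy_percent = toy_totals.div(toy_totals.sum(axis=1), axis=0)

print(toy_totals)
print(toy_percent)
```

Filling with 0 before dividing is what makes a state like AB show 1.0 and 0.0 instead of 1.0 and NaN.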
In [74]:
percent[:8].plot(kind='barh', figsize=(10,8))
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x119083c8>