Data Analysis Example - US 2012 Federal Election Commission Database

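An interesting dataset is the US Federal Election Commission data on contributions to the 2012 presidential election. It contains each contributor's name, occupation and employer, address, and contribution amount. The file is about 150 MB.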

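Some of the columns in the dataset:

contbr_employer: the contributor's employer

cand_nm: the candidate's name

contbr_occupation: the contributor's occupation

contb_receipt_amt: the contribution amount

Load the dataset: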

import numpy as np
import pandas as pd
In [7]:

file_name = r'D:\datasets\P00000001-ALL.csv'
fec = pd.read_csv(file_name, low_memory=False)  # set low_memory=False; without it pandas may complain about mixed column types (DtypeWarning)
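Another way to deal with type guessing is to declare column types up front. The sketch below is not part of the original notebook: it reads two made-up rows from an in-memory string and assumes the ZIP code column is the one that should be kept as text.

```python
import io
import pandas as pd

# Two made-up rows using a few of the FEC column names.
csv_text = (
    "cand_nm,contbr_zip,contb_receipt_amt\n"
    '"Romney, Mitt",363302385,1000\n'
    '"Obama, Barack",02139,50\n'
)

# Reading the ZIP column as text keeps leading zeros and avoids type guessing.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"contbr_zip": str})
print(sample.dtypes)
```

With the real file, the same dtype argument could be passed alongside the file name.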
In [8]:

fec.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id              1001731 non-null object
cand_id              1001731 non-null object
cand_nm              1001731 non-null object
contbr_nm            1001731 non-null object
contbr_city          1001712 non-null object
contbr_st            1001727 non-null object
contbr_zip           1001620 non-null object
contbr_employer      988002 non-null object
contbr_occupation    993301 non-null object
contb_receipt_amt    1001731 non-null float64
contb_receipt_dt     1001731 non-null object
receipt_desc         14166 non-null object
memo_cd              92482 non-null object
memo_text            97770 non-null object
form_tp              1001731 non-null object
file_num             1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
In [10]:

fec.iloc[1000]
Out[10]:
cmte_id                          C00431171
cand_id                          P80003353
cand_nm                       Romney, Mitt
contbr_nm            NEUWIEN, SUSAN W. MS.
contbr_city                     ENTERPRISE
contbr_st                               AL
contbr_zip                       363302385
contbr_employer                    RETIRED
contbr_occupation                  RETIRED
contb_receipt_amt                     1000
contb_receipt_dt                 13-FEB-12
receipt_desc                           NaN
memo_cd                                NaN
memo_text                              NaN
form_tp                              SA17A
file_num                            780124
Name: 1000, dtype: object

Add a political party column (Democrat or Republican) to the dataset, using the map method with a dict.


unique_cands = fec.cand_nm.unique()
unique_cands  # show all candidates

Out[12]:
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman',
       'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon',
       'Perry, Rick'], dtype=object)
In [13]:

unique_cands[2]
Out[13]:
'Obama, Barack'
In [14]:

parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican', 
           'Gingrich, Newt': 'Republican', 
           'Huntsman, Jon': 'Republican', 
           'Johnson, Gary Earl': 'Republican', 
           'McCotter, Thaddeus G': 'Republican', 
           'Obama, Barack': 'Democrat', 
           'Paul, Ron': 'Republican', 
           'Pawlenty, Timothy': 'Republican', 
           'Perry, Rick': 'Republican', 
           "Roemer, Charles E. 'Buddy' III": 'Republican', 
           'Romney, Mitt': 'Republican', 
           'Santorum, Rick': 'Republican'}

In [15]:

fec.cand_nm[123456:123461]
Out[15]:
123456    Obama, Barack
123457    Obama, Barack
123458    Obama, Barack
123459    Obama, Barack
123460    Obama, Barack
Name: cand_nm, dtype: object
In [16]:

fec.cand_nm[123456:123461].map(parties)  # map candidate names to parties
Out[16]:
123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object
In [17]:


fec['party'] = fec.cand_nm.map(parties)
fec.party.value_counts()  # number of contribution records per party
Out[17]:
Democrat      593746
Republican    407985
Name: party, dtype: int64
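Before trusting a mapped column, it can help to check that every name is covered by the dict, since Series.map leaves NaN for keys it cannot find. A minimal sketch on toy data; the name "Unknown, Person" and the variable names are made up for illustration.

```python
import pandas as pd

# Toy candidate column; "Unknown, Person" is a made-up name with no party entry.
cands = pd.Series(["Obama, Barack", "Romney, Mitt", "Unknown, Person"])
toy_parties = {"Obama, Barack": "Democrat", "Romney, Mitt": "Republican"}

# Names that appear in the data but not in the dict.
missing = sorted(set(cands.unique()) - set(toy_parties))

mapped = cands.map(toy_parties)           # uncovered names become NaN
n_unmapped = int(mapped.isna().sum())
print(missing, n_unmapped)
```

In the notebook above the two party counts add up to the full 1,001,731 rows, so every candidate was covered.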

The dataset contains both contributions and refunds.
In [18]:

(fec.contb_receipt_amt > 0).value_counts()  # negative amounts are refunds; the False count is the number of refund records
Out[18]:
True     991475
False     10256
Name: contb_receipt_amt, dtype: int64
In [19]:

fec = fec[fec.contb_receipt_amt > 0]  # to simplify the analysis, keep only the contributions
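The filtering step can be sketched on a few made-up amounts. Taking a copy of the filtered frame is an addition that the original notebook does not make; it keeps later column assignments from acting on a view of the original data.

```python
import pandas as pd

# Made-up amounts: two donations, one refund, one zero entry.
toy = pd.DataFrame({"contb_receipt_amt": [250.0, -50.0, 0.0, 1000.0]})

is_donation = toy["contb_receipt_amt"] > 0   # boolean mask, one entry per row
n_donations = int(is_donation.sum())         # True counts as 1

# .copy() makes the result independent of the original frame.
donations = toy[is_donation].copy()
print(n_donations, donations["contb_receipt_amt"].tolist())
```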
Since Barack Obama and Mitt Romney are the two main candidates, build a subset that contains only the contributions to these two.
In [21]:

fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]

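1. Contribution statistics by occupation and employer

Occupation is related to donation patterns. For example, attorneys tend to donate more to Democrats, while business executives tend to donate more to Republicans.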

fec.contbr_occupation.value_counts()[:10]
Out[23]:
RETIRED                                   233990
INFORMATION REQUESTED                      35107
ATTORNEY                                   34286
HOMEMAKER                                  29931
PHYSICIAN                                  23432
INFORMATION REQUESTED PER BEST EFFORTS     21138
ENGINEER                                   14334
TEACHER                                    13990
CONSULTANT                                 13273
PROFESSOR                                  12555
Name: contbr_occupation, dtype: int64
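Clean up the occupations with a mapping. Using dict.get with the value itself as the default lets occupations that have no mapping pass through unchanged.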
In [24]:

occ_mapping = { 
    'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED', 
    'INFORMATION REQUESTED' : 'NOT PROVIDED', 
    'INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED', 
    'C.E.O.': 'CEO' 
}
In [25]:

f = lambda x: occ_mapping.get(x, x)  # if the key is not in the mapping, return the original value
fec.contbr_occupation = fec.contbr_occupation.map(f)
In [26]:

fec.contbr_occupation[:10]
Out[26]:
0                RETIRED
1                RETIRED
2           NOT PROVIDED
3                RETIRED
4                RETIRED
5                RETIRED
6           NOT PROVIDED
7                RETIRED
8                     RN
9    ELECTRICAL ENGINEER
Name: contbr_occupation, dtype: object
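The difference between mapping through dict.get and mapping with the dict directly can be seen on three toy values. This is a sketch with made-up variable names; Series.replace is shown as an alternative that the original notebook does not use.

```python
import pandas as pd

toy_mapping = {"C.E.O.": "CEO", "INFORMATION REQUESTED": "NOT PROVIDED"}
occ = pd.Series(["C.E.O.", "TEACHER", "INFORMATION REQUESTED"])

# dict.get with the value itself as default: unmapped values pass through.
with_get = occ.map(lambda x: toy_mapping.get(x, x))

# Passing the dict directly: unmapped values ("TEACHER") become NaN.
plain_map = occ.map(toy_mapping)

# Series.replace only touches values that appear as keys.
with_replace = occ.replace(toy_mapping)
print(with_get.tolist())
```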
In [27]:

emp_mapping = { 
    'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED', 
    'INFORMATION REQUESTED' : 'NOT PROVIDED', 
    'SELF' : 'SELF-EMPLOYED', 
    'SELF EMPLOYED' : 'SELF-EMPLOYED', 
}  # the same kind of mapping, this time for employers
In [28]:

# apply the mapping to the employer column of the dataset
f = lambda x: emp_mapping.get(x, x)  # if the key is not in the mapping, return the original value
fec.contbr_employer = fec.contbr_employer.map(f)  # map the employer column itself, not contbr_occupation
In [29]:

fec.contbr_employer[:5]

Use pivot_table to aggregate the contribution amounts by party and occupation.

by_occupation = fec.pivot_table(values = 'contb_receipt_amt', index = 'contbr_occupation', columns = 'party', aggfunc='sum')
In [31]:

by_occupation[:5]
Out[31]:
party	Democrat	Republican
contbr_occupation		
MIXED-MEDIA ARTIST / STORYTELLER	100.0	NaN
AREA VICE PRESIDENT	250.0	NaN
RESEARCH ASSOCIATE	100.0	NaN
TEACHER	500.0	NaN
THERAPIST	3900.0	NaN
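How pivot_table arranges the result is easier to see on a handful of rows. The sketch below uses made-up amounts and also builds the same table with groupby and unstack, an equivalent route that the original notebook does not show.

```python
import pandas as pd

# Four made-up contribution records.
toy = pd.DataFrame({
    "contbr_occupation": ["TEACHER", "TEACHER", "ATTORNEY", "ATTORNEY"],
    "party": ["Democrat", "Democrat", "Democrat", "Republican"],
    "contb_receipt_amt": [100.0, 50.0, 500.0, 300.0],
})

# One row per occupation, one column per party, cells hold summed amounts.
table = toy.pivot_table(values="contb_receipt_amt",
                        index="contbr_occupation",
                        columns="party",
                        aggfunc="sum")

# Same aggregation written as groupby + unstack.
by_groupby = (toy.groupby(["contbr_occupation", "party"])["contb_receipt_amt"]
                 .sum()
                 .unstack("party"))
print(table)
```

Occupation and party combinations with no records, such as Republican teachers here, show up as NaN.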
In [41]:

over_2mm = by_occupation[by_occupation.sum(axis=1) > 2000000]
over_2mm  # keep only occupations whose combined donations exceed $2 million
Out[41]:
party	Democrat	Republican
contbr_occupation		
ATTORNEY	11141982.97	7.477194e+06
CEO	2074974.79	4.211041e+06
CONSULTANT	2459912.71	2.544725e+06
ENGINEER	951525.55	1.818374e+06
EXECUTIVE	1355161.05	4.138850e+06
HOMEMAKER	4248875.80	1.363428e+07
INVESTOR	884133.00	2.431769e+06
LAWYER	3160478.87	3.912243e+05
MANAGER	762883.22	1.444532e+06
NOT PROVIDED	4866973.96	2.056547e+07
OWNER	1001567.36	2.408287e+06
PHYSICIAN	3735124.94	3.594320e+06
PRESIDENT	1878509.95	4.720924e+06
PROFESSOR	2165071.08	2.967027e+05
REAL ESTATE	528902.09	1.625902e+06
RETIRED	25305116.38	2.356124e+07
SELF-EMPLOYED	672393.40	1.640253e+06
In [42]:

over_2mm.shape
Out[42]:
(17, 2)
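The filter relies on the row sum skipping NaN cells, so an occupation that donated to only one party still gets a total. A minimal sketch with made-up numbers and a made-up threshold of 100:

```python
import numpy as np
import pandas as pd

# Made-up party totals per occupation; NaN means no donations to that party.
toy = pd.DataFrame(
    {"Democrat": [150.0, 500.0, np.nan], "Republican": [np.nan, 300.0, 40.0]},
    index=["TEACHER", "ATTORNEY", "ENGINEER"],
)

row_totals = toy.sum(axis=1)     # NaN cells are skipped by default
big = toy[row_totals > 100]      # boolean mask aligned on the row index
print(big)
```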
In [43]:

import seaborn as sns
%matplotlib inline
In [44]:

over_2mm.plot(kind='barh', figsize=(10,8))
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x114ec780>

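We may also be interested in the top donor occupations, or the top donating employers, for Obama and Romney. To get this, group by candidate name and take the largest totals within each group. Because fec_mrbo was created before the occupation mapping was applied, the raw labels (such as C.E.O.) still appear in the result.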

def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.nlargest(n)
In [51]:

grouped = fec_mrbo.groupby('cand_nm')
In [52]:

grouped
Out[52]:
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x00000000118FB9E8>
In [53]:

grouped.apply(get_top_amounts, 'contbr_occupation', n=7)  # the 7 occupations that donated the most to each candidate
Out[53]:
cand_nm        contbr_occupation                     
Obama, Barack  RETIRED                                   25305116.38
               ATTORNEY                                  11141982.97
               INFORMATION REQUESTED                      4866973.96
               HOMEMAKER                                  4248875.80
               PHYSICIAN                                  3735124.94
               LAWYER                                     3160478.87
               CONSULTANT                                 2459912.71
Romney, Mitt   RETIRED                                   11508473.59
               INFORMATION REQUESTED PER BEST EFFORTS    11396894.84
               HOMEMAKER                                  8147446.22
               ATTORNEY                                   5364718.82
               PRESIDENT                                  2491244.89
               EXECUTIVE                                  2300947.03
               C.E.O.                                     1968386.11
Name: contb_receipt_amt, dtype: float64
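The same kind of top-n result can be obtained without apply, by summing per candidate and occupation, sorting, and keeping the first rows of each candidate. This is an alternative sketch on made-up data (candidates "A" and "B" are placeholders), not the method used above.

```python
import pandas as pd

# Made-up records for two placeholder candidates.
toy = pd.DataFrame({
    "cand_nm": ["A", "A", "A", "A", "B", "B", "B"],
    "contbr_occupation": ["RETIRED", "RETIRED", "ATTORNEY", "TEACHER",
                          "RETIRED", "CEO", "TEACHER"],
    "contb_receipt_amt": [100.0, 200.0, 500.0, 50.0, 300.0, 400.0, 10.0],
})

# Total amount per (candidate, occupation) pair.
totals = toy.groupby(["cand_nm", "contbr_occupation"])["contb_receipt_amt"].sum()

# Sort all totals, then keep the first 2 rows seen for each candidate.
top2 = totals.sort_values(ascending=False).groupby(level="cand_nm").head(2)
print(top2)
```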

Bucketing the donation amounts: use the cut function to discretize the contribution amounts into buckets by size.

bins = np.array([0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
In [56]:

labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels
Out[56]:
411           (10, 100]
412         (100, 1000]
413         (100, 1000]
414           (10, 100]
415           (10, 100]
416           (10, 100]
417         (100, 1000]
418           (10, 100]
419         (100, 1000]
420           (10, 100]
421           (10, 100]
422         (100, 1000]
423         (100, 1000]
424         (100, 1000]
425         (100, 1000]
426         (100, 1000]
427       (1000, 10000]
428         (100, 1000]
429         (100, 1000]
430           (10, 100]
431       (1000, 10000]
432         (100, 1000]
433         (100, 1000]
434         (100, 1000]
435         (100, 1000]
436         (100, 1000]
437           (10, 100]
438         (100, 1000]
439         (100, 1000]
440           (10, 100]
              ...      
701356        (10, 100]
701357          (1, 10]
701358        (10, 100]
701359        (10, 100]
701360        (10, 100]
701361        (10, 100]
701362      (100, 1000]
701363        (10, 100]
701364        (10, 100]
701365        (10, 100]
701366        (10, 100]
701367        (10, 100]
701368      (100, 1000]
701369        (10, 100]
701370        (10, 100]
701371        (10, 100]
701372        (10, 100]
701373        (10, 100]
701374        (10, 100]
701375        (10, 100]
701376    (1000, 10000]
701377        (10, 100]
701378        (10, 100]
701379      (100, 1000]
701380    (1000, 10000]
701381        (10, 100]
701382      (100, 1000]
701383          (1, 10]
701384        (10, 100]
701385      (100, 1000]
Name: contb_receipt_amt, Length: 694282, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
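The intervals produced by cut are open on the left and closed on the right by default, so an amount that sits exactly on a bin edge falls into the lower bucket. A small sketch with made-up amounts:

```python
import numpy as np
import pandas as pd

amounts = pd.Series([0.5, 1, 10, 25, 100, 2500])
toy_bins = np.array([0, 1, 10, 100, 1000, 10000])

toy_labels = pd.cut(amounts, toy_bins)   # default right=True: (low, high]

# Integer position of each value's bucket: 0 is (0, 1], 1 is (1, 10], and so on.
codes = toy_labels.cat.codes.tolist()
print(toy_labels)
print(codes)
```

Here 1, 10 and 100 each land in the bucket that ends at that value.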
In [61]:

grouped = fec_mrbo.groupby(['cand_nm', labels])
grouped.size().unstack(0)
Out[61]:
cand_nm	Obama, Barack	Romney, Mitt
contb_receipt_amt		
(0, 1]	493.0	77.0
(1, 10]	40070.0	3681.0
(10, 100]	372280.0	31853.0
(100, 1000]	153991.0	43357.0
(1000, 10000]	22284.0	26186.0
(10000, 100000]	2.0	1.0
(100000, 1000000]	3.0	NaN
(1000000, 10000000]	4.0	NaN
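This shows that Obama received far more small donations than Romney. Next, sum the amounts in each bucket and normalize within each bucket to get each candidate's share of the total.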
In [62]:

bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
bucket_sums
Out[62]:
cand_nm	Obama, Barack	Romney, Mitt
contb_receipt_amt		
(0, 1]	318.24	77.00
(1, 10]	337267.62	29819.66
(10, 100]	20288981.41	1987783.76
(100, 1000]	54798531.46	22363381.69
(1000, 10000]	51753705.67	63942145.42
(10000, 100000]	59100.00	12700.00
(100000, 1000000]	1490683.08	NaN
(1000000, 10000000]	7148839.76	NaN
In [64]:

normed_sum = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)  # normalize each row so that it sums to 1
normed_sum
Out[64]:
cand_nm	Obama, Barack	Romney, Mitt
contb_receipt_amt		
(0, 1]	0.805182	0.194818
(1, 10]	0.918767	0.081233
(10, 100]	0.910769	0.089231
(100, 1000]	0.710176	0.289824
(1000, 10000]	0.447326	0.552674
(10000, 100000]	0.823120	0.176880
(100000, 1000000]	1.000000	NaN
(1000000, 10000000]	1.000000	NaN
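The row normalization divides every row by its own total: sum(axis=1) gives one total per row, and div(..., axis=0) aligns those totals on the row index. A sketch with made-up sums and plain string labels standing in for the intervals:

```python
import pandas as pd

# Made-up bucket sums for the two candidates.
sums = pd.DataFrame(
    {"Obama, Barack": [300.0, 50.0], "Romney, Mitt": [100.0, 150.0]},
    index=["(10, 100]", "(1000, 10000]"],
)

row_totals = sums.sum(axis=1)            # one total per bucket
shares = sums.div(row_totals, axis=0)    # divide each row by its own total
print(shares)
```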
In [65]:

normed_sum[:-2].plot(kind='barh', figsize=(10,8))  # drop the last two rows, where Romney's values are NaN
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x11805e48>

3. Contribution statistics by state

grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])  # group by candidate and state
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
In [69]:

totals[:5]
Out[69]:
cand_nm	Obama, Barack	Romney, Mitt
contbr_st		
AA	56405.00	135.00
AB	2048.00	0.00
AE	42973.75	5680.00
AK	281840.15	86204.24
AL	543123.48	527303.51
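The unstack(0) call moves the first grouping level (the candidate) into the columns, and fillna(0) fills in states where a candidate received nothing. A minimal sketch on made-up records, with "A" and "B" as placeholder candidates:

```python
import pandas as pd

# Candidate "B" has no record for AK in this made-up data.
toy = pd.DataFrame({
    "cand_nm": ["A", "A", "B"],
    "contbr_st": ["AK", "AL", "AL"],
    "contb_receipt_amt": [10.0, 20.0, 30.0],
})

state_totals = (toy.groupby(["cand_nm", "contbr_st"])["contb_receipt_amt"]
                   .sum()
                   .unstack(0)    # candidates become columns, states stay as rows
                   .fillna(0))    # missing (state, candidate) pairs become 0
print(state_totals)
```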
In [72]:

totals.shape
Out[72]:
(67, 2)
Compute the share of the total amount that each of the two candidates received in each state.
In [73]:

percent = totals.div(totals.sum(axis=1), axis=0)
percent[:8]
Out[73]:
cand_nm	Obama, Barack	Romney, Mitt
contbr_st		
AA	0.997612	0.002388
AB	1.000000	0.000000
AE	0.883257	0.116743
AK	0.765778	0.234222
AL	0.507390	0.492610
AP	0.957329	0.042671
AR	0.772902	0.227098
AS	1.000000	0.000000
In [74]:

percent[:8].plot(kind='barh', figsize=(10,8))
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x119083c8>
