Data analysis example: political contributions to the 2012 US presidential candidates [value_counts()/to_datetime()/unstack()/cumsum()/groupby()/reset_index()]

1. Import packages

import numpy as np

import pandas as pd
from pandas import Series,DataFrame

import matplotlib.pyplot as plt
%matplotlib inline
  • For convenience, define the month lookup, the candidates of interest, and each candidate's party
months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6,
          'JUL' : 7, 'AUG' : 8, 'SEP' : 9, 'OCT': 10, 'NOV': 11, 'DEC' : 12}
of_interest = ['Obama, Barack', 'Romney, Mitt', 'Santorum, Rick', 
               'Paul, Ron', 'Gingrich, Newt']
parties = {
  'Bachmann, Michelle': 'Republican',
  'Romney, Mitt': 'Republican',
  'Obama, Barack': 'Democrat',
  "Roemer, Charles E. 'Buddy' III": 'Reform',
  'Pawlenty, Timothy': 'Republican',
  'Johnson, Gary Earl': 'Libertarian',
  'Paul, Ron': 'Republican',
  'Santorum, Rick': 'Republican',
  'Cain, Herman': 'Republican',
  'Gingrich, Newt': 'Republican',
  'McCotter, Thaddeus G': 'Republican',
  'Huntsman, Jon': 'Republican',
  'Perry, Rick': 'Republican'           
 }

2. Read the file

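# low_memory : boolean, default True
# Process the file internally in chunks, which lowers memory use while parsing
# but can lead to mixed type inference within a column.
# To avoid mixed types, set it to False or specify the types with the dtype parameter.
# Note that the whole file is read into a single DataFrame regardless; use the
# chunksize or iterator parameter to get the data back in chunks.
# (Only valid with the C parser.)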
ele = pd.read_csv('./usa_election.csv',low_memory=False)
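
The comment above mentions dtype as the other way to avoid mixed types. A minimal sketch of that option follows, using a made-up three-row CSV held in memory rather than the real usa_election.csv; the sample values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample; the real file has 16 columns and about 536,000 rows.
csv_text = '''cand_nm,contbr_zip,contb_receipt_amt
"Obama, Barack",021401234,50.0
"Romney, Mitt",10001,250.0
"Paul, Ron",,25.0
'''

# Declaring the types up front means pandas does not have to guess them,
# and zip codes keep their leading zeros because they stay strings.
sample = pd.read_csv(io.StringIO(csv_text),
                     dtype={'contbr_zip': str, 'contb_receipt_amt': float})
print(sample.dtypes)
print(sample)
```

Reading zip codes as strings is a design choice: they are identifiers, not quantities, so nothing is lost by skipping numeric parsing.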

3. Inspect the layout and basic information of the data

ele.shape
Out: (536041, 16)

ele.head()

4. [Key point] Use map() with a dict to add a party column holding each candidate's party

ele['party'] = ele['cand_nm'].map(parties)
ele.head()

  • Look at a single row to check that the 'party' column was added
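
As a small sketch of the check described above, the snippet below maps a dict onto a toy column and inspects one row. The frame and the name 'Doe, Jane' are made up; the point is that any key missing from the dict comes back as NaN, which is worth counting after a map().

```python
import pandas as pd

# Toy data; 'Doe, Jane' is a hypothetical name that is not in the dict.
toy = pd.DataFrame({'cand_nm': ['Obama, Barack', 'Romney, Mitt', 'Doe, Jane']})
toy_parties = {'Obama, Barack': 'Democrat', 'Romney, Mitt': 'Republican'}

toy['party'] = toy['cand_nm'].map(toy_parties)

first_row = toy.iloc[0]              # a single row, returned as a Series
print(first_row)

unmapped = toy['party'].isna().sum() # names the dict did not cover
print(unmapped)
```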

5. Use unique() to see which values occur in the party column

# four parties
ele['party'].unique()
Out:
array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)

6. Use value_counts() to count how many times each value appears in the party column

# Over 530,000 contribution records; the parties appear in very different numbers
ele['party'].value_counts()
Out:
Democrat       292400
Republican     237575
Reform           5364
Libertarian       702
Name: party, dtype: int64
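
A short sketch of two value_counts() options that are handy here, shown on a made-up Series rather than the real data: normalize=True gives shares instead of counts, and dropna=False keeps missing values in the tally.

```python
import pandas as pd

# Toy column with one missing value
party = pd.Series(['Democrat', 'Republican', 'Democrat', 'Reform', None])

counts = party.value_counts()                    # missing values are skipped by default
shares = party.value_counts(normalize=True)      # fractions of the non-missing rows
with_missing = party.value_counts(dropna=False)  # missing values get their own row

print(counts)
print(shares)
print(with_missing)
```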

7. [Key point] Use groupby() to get the total contributions (contb_receipt_amt) received by each party

ele.columns
Out:
Index(['cmte_id', 'cand_id', 'cand_nm', 'contbr_nm', 'contbr_city',
       'contbr_st', 'contbr_zip', 'contbr_employer', 'contbr_occupation',
       'contb_receipt_amt', 'contb_receipt_dt', 'receipt_desc', 'memo_cd',
       'memo_text', 'form_tp', 'file_num', 'party'],
      dtype='object')

ele.groupby(['party'])['contb_receipt_amt'].sum()
Out:
party
Democrat       8.105758e+07
Libertarian    4.132769e+05
Reform         3.390338e+05
Republican     1.192255e+08
Name: contb_receipt_amt, dtype: float64
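
The same group-then-sum pattern on a toy frame, as a sketch with made-up amounts. It also shows agg() with several functions, which puts the number of contributions and the mean next to the total in one pass.

```python
import pandas as pd

# Toy data with invented amounts
toy = pd.DataFrame({
    'party': ['Democrat', 'Republican', 'Democrat', 'Republican'],
    'contb_receipt_amt': [50.0, 250.0, 100.0, 150.0],
})

# One total per party
totals = toy.groupby('party')['contb_receipt_amt'].sum()

# Several statistics per party in one call
summary = toy.groupby('party')['contb_receipt_amt'].agg(['sum', 'count', 'mean'])

print(totals)
print(summary)
```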

8. Total contributions (contb_receipt_amt) received by each party on each day

  • Use groupby([several grouping keys])
ele['contb_receipt_dt'].unique().size
Out: 376

ele.groupby(['party','contb_receipt_dt'])['contb_receipt_amt'].apply(sum)
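
A sketch of what grouping by two keys returns, on made-up rows: the result is a Series with a two-level MultiIndex, and a single cell is addressed with a tuple. The dates here are still plain strings, so the groups sort alphabetically rather than chronologically.

```python
import pandas as pd

# Toy data; dates kept in the raw string form used by the file
toy = pd.DataFrame({
    'party': ['Democrat', 'Democrat', 'Republican', 'Democrat'],
    'contb_receipt_dt': ['01-JAN-12', '01-JAN-12', '01-JAN-12', '02-JAN-12'],
    'contb_receipt_amt': [50.0, 25.0, 250.0, 100.0],
})

daily = toy.groupby(['party', 'contb_receipt_dt'])['contb_receipt_amt'].sum()

print(daily)
print(daily.index.nlevels)                   # number of index levels
print(daily.loc[('Democrat', '01-JAN-12')])  # one (party, day) cell
```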

9. Check the date format and convert it to a pandas datetime, using a helper function together with map()

ele.dtypes
Out:
cmte_id               object
cand_id               object
cand_nm               object
contbr_nm             object
contbr_city           object
contbr_st             object
contbr_zip            object
contbr_employer       object
contbr_occupation     object
contb_receipt_amt    float64
contb_receipt_dt      object
receipt_desc          object
memo_cd               object
memo_text             object
form_tp               object
file_num               int64
party                 object
dtype: object

months
Out:
{'APR': 4,
 'AUG': 8,
 'DEC': 12,
 'FEB': 2,
 'JAN': 1,
 'JUL': 7,
 'JUN': 6,
 'MAR': 3,
 'MAY': 5,
 'NOV': 11,
 'OCT': 10,
 'SEP': 9}

# Define a function that returns the date in a standard format
def convert(x):
    # input looks like 01-JAN-12
    day, m, year = x.split('-')
    month = months[m]
    # output looks like 2012-1-01
    return '20' + year + '-' + str(month) + '-' + day

ele['contb_receipt_dt'] = ele['contb_receipt_dt'].map(convert)
# Convert the strings to datetimes with pd.to_datetime
ele['contb_receipt_dt'] = pd.to_datetime(ele['contb_receipt_dt'])
# Check that the conversion worked
ele.dtypes
Out: [only the date field shown] contb_receipt_dt     datetime64[ns]
ele.head()
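
As an alternative sketch (not the approach used above), pd.to_datetime can parse this layout directly when given a format string. The sample values are made up, and the snippet assumes an English locale for the %b month abbreviation; str.title() is applied first so that 'JAN' becomes 'Jan'.

```python
import pandas as pd

# Toy values in the same DD-MON-YY layout as contb_receipt_dt
raw = pd.Series(['01-JAN-12', '15-MAR-11', '30-DEC-11'])

# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(raw.str.title(), format='%d-%b-%y')

print(parsed)
print(parsed.dtype)
```

An explicit format avoids the intermediate string rebuild and fails loudly if a row does not match the expected layout.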

10. Daily contributions received by each party, after the date conversion

  • Covers: groupby() on several fields
ele2 = ele.groupby(['party','contb_receipt_dt'])['contb_receipt_amt'].sum()
ele2

  • [Key point] Use unstack() to move party from the first index level into the columns: unstack('party')
ele2.unstack('party')

  • Unstack party (level 0) into the columns again, this time with fill_value=0 so that no value is left as NaN
ele3 = ele2.unstack(level = 0,fill_value=0)
ele3
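
A small sketch of the difference fill_value makes, on a hand-built Series where one (party, day) combination is missing. The index values are made up for illustration.

```python
import pandas as pd

# Toy two-level Series: Republican has no entry for 2012-01-02
idx = pd.MultiIndex.from_tuples(
    [('Democrat', '2012-01-01'),
     ('Democrat', '2012-01-02'),
     ('Republican', '2012-01-01')],
    names=['party', 'contb_receipt_dt'])
s = pd.Series([75.0, 100.0, 250.0], index=idx)

wide_nan = s.unstack('party')                 # missing combination becomes NaN
wide_zero = s.unstack('party', fill_value=0)  # missing combination becomes 0

print(wide_nan)
print(wide_zero)
```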

  • Use the data above to plot each party's cumulative contributions; cumsum() computes the running total
plot = ele3.cumsum().plot()

fig = plot.get_figure()

fig.set_size_inches(12,9)
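
To make the running total concrete, here is a sketch of cumsum() on a tiny made-up frame shaped like ele3 (dates as rows, parties as columns). It leaves out the plotting step.

```python
import pandas as pd

# Toy daily totals; a 0.0 means no contributions that day
daily = pd.DataFrame(
    {'Democrat': [75.0, 100.0, 0.0], 'Republican': [250.0, 0.0, 40.0]},
    index=pd.to_datetime(['2012-01-01', '2012-01-02', '2012-01-03']))

# Each row becomes the sum of that row and every row above it, per column
running = daily.cumsum()

print(running)
```

Days with no contributions keep the previous total, which is why the plotted lines stay flat there instead of breaking.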

  • View the data with dates as columns and parties as rows: unstack('contb_receipt_dt')
ele2.unstack(level = -1)

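11. Group by candidate name (cand_nm) and contributor occupation (contbr_occupation) and total the contributions; this shows which occupations each candidate's main supporters come from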

  • Exercise: groupby() with sum()
ele5 = ele.groupby(['cand_nm','contbr_occupation'])['contb_receipt_amt'].sum()
ele5

ele['cand_nm'].unique()
Out:
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
       'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
       'Huntsman, Jon', 'Perry, Rick'], dtype=object)

ele5['Obama, Barack']
Out:

12. Who do veterans mainly support? Look up DISABLED VETERAN

  • Covers: indexing a Series that has a MultiIndex
ele5.index
Out:
MultiIndex(levels=[['Bachmann, Michelle', 'Cain, Herman', 'Gingrich, Newt', 'Huntsman, Jon', 'Johnson, Gary Earl', 'McCotter, Thaddeus G', 'Obama, Barack', 'Paul, Ron', 'Pawlenty, Timothy', 'Perry, Rick', "Roemer, Charles E. 'Buddy' III", 'Romney, Mitt', 'Santorum, Rick'], ['   MIXED-MEDIA ARTIST / STORYTELLER', ' AREA VICE PRESIDENT', ' RESEARCH ASSOCIATE', ' TEACHER', ' THERAPIST', ... (remaining output omitted)
# ele5.loc[:,'DISABLED VETERAN'] and ele5[:,'DISABLED VETERAN'] select the same rows
ele5.loc[:,'DISABLED VETERAN']
Out:
cand_nm
Cain, Herman       300.00
Obama, Barack     4205.00
Paul, Ron         2425.49
Santorum, Rick     250.00
Name: contb_receipt_amt, dtype: float64

ele5[:,'LAWYER']
Out:
cand_nm
Bachmann, Michelle         8318.00
Cain, Herman               3850.00
Gingrich, Newt            47005.00
Huntsman, Jon             49263.00
McCotter, Thaddeus G        500.00
Obama, Barack           1974727.92
Paul, Ron                 56209.87
Pawlenty, Timothy         58025.00
Perry, Rick               86505.00
Romney, Mitt               7225.00
Santorum, Rick            14207.00
Name: contb_receipt_amt, dtype: float64

ele5[:,'ATTORNEY']
Out:
cand_nm
Bachmann, Michelle                  46214.00
Cain, Herman                        76472.87
Gingrich, Newt                     205577.00
Huntsman, Jon                      143532.50
Johnson, Gary Earl                   9425.00
McCotter, Thaddeus G                  500.00
Obama, Barack                     7112343.35
Paul, Ron                          195712.11
Pawlenty, Timothy                  238331.10
Perry, Rick                        768778.80
Roemer, Charles E. 'Buddy' III      14186.00
Romney, Mitt                      3662610.21
Santorum, Rick                     107130.00
Name: contb_receipt_amt, dtype: float64
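
The lookups above select on the inner index level. As a sketch on made-up numbers, xs() does the same thing by level name, which can be easier to read, and sort_values() then ranks the candidates.

```python
import pandas as pd

# Toy Series shaped like ele5: (candidate, occupation) -> total
idx = pd.MultiIndex.from_tuples(
    [('Obama, Barack', 'LAWYER'),
     ('Obama, Barack', 'TEACHER'),
     ('Paul, Ron', 'LAWYER')],
    names=['cand_nm', 'contbr_occupation'])
s = pd.Series([300.0, 50.0, 120.0], index=idx, name='contb_receipt_amt')

# Cross-section on the occupation level; that level is dropped from the result
lawyers = s.xs('LAWYER', level='contbr_occupation')

# Largest total first
ranked = lawyers.sort_values(ascending=False)

print(ranked)
```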

  • Turn the index levels into columns with Series.reset_index()
ele5.reset_index()
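
A sketch of what reset_index() produces for a two-level Series, on made-up numbers: each index level becomes an ordinary column and the values keep the Series name, or the name passed via the name argument.

```python
import pandas as pd

# Toy Series shaped like ele5
idx = pd.MultiIndex.from_tuples(
    [('Obama, Barack', 'LAWYER'), ('Paul, Ron', 'ENGINEER')],
    names=['cand_nm', 'contbr_occupation'])
s = pd.Series([300.0, 120.0], index=idx, name='contb_receipt_amt')

flat = s.reset_index()                 # DataFrame with three columns
renamed = s.reset_index(name='total')  # same, with the value column renamed

print(flat)
print(list(renamed.columns))
```

The flat form is convenient for filtering and sorting with ordinary column operations.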

13. For each candidate, find the largest single contribution and the occupation of the contributor who made it

  • Use query("condition") to look up the contributor's occupation
ele.groupby(['cand_nm'])['contb_receipt_amt'].max()
Out:
cand_nm
Bachmann, Michelle                   3022.00
Cain, Herman                        10000.00
Gingrich, Newt                       5100.00
Huntsman, Jon                        5000.00
Johnson, Gary Earl                   2500.00
McCotter, Thaddeus G                 4000.00
Obama, Barack                     1944042.43
Paul, Ron                            5000.00
Pawlenty, Timothy                   10000.00
Perry, Rick                         10000.00
Roemer, Charles E. 'Buddy' III        200.00
Romney, Mitt                        12700.00
Santorum, Rick                       5000.00
Name: contb_receipt_amt, dtype: float64

ele.query("cand_nm =='Obama, Barack' and contb_receipt_amt == 1944042.43")
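
The query above handles one candidate at a time and relies on an exact float match. As an alternative sketch on a made-up frame, groupby().idxmax() returns the row label of each candidate's largest contribution, and loc then pulls those rows with the occupation in a single step.

```python
import pandas as pd

# Toy data with invented contributors
toy = pd.DataFrame({
    'cand_nm': ['Obama, Barack', 'Obama, Barack', 'Paul, Ron', 'Paul, Ron'],
    'contbr_occupation': ['LAWYER', 'TEACHER', 'ENGINEER', 'RETIRED'],
    'contb_receipt_amt': [300.0, 50.0, 120.0, 500.0],
})

# Row label of the largest amount within each candidate's group
top_idx = toy.groupby('cand_nm')['contb_receipt_amt'].idxmax()

# Fetch those rows, keeping only the columns of interest
top_rows = toy.loc[top_idx, ['cand_nm', 'contbr_occupation', 'contb_receipt_amt']]

print(top_rows)
```

If several rows tie for the maximum, idxmax() keeps only the first one, so the query approach is still useful when every tied row matters.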

Reposted from blog.csdn.net/Dorisi_H_n_q/article/details/82418002