数据分析day04

数据分析day04

7.pandas高级操作

In [53]:

import pandas as pd
from pandas import DataFrame
import numpy as np

替换操作

  • 替换操作可以同步作用于Series和DataFrame中
  • 单值替换
    • 普通替换: 替换所有符合要求的元素:to_replace=15,value='e'
    • 按列指定单值替换: to_replace={列标签:替换值} value='value'
  • 多值替换
    • 列表替换: to_replace=[] value=[]
    • 字典替换(推荐) to_replace={to_replace:value,to_replace:value}

In [4]:

df = DataFrame(data=np.random.randint(0,100,size=(6,7)))
df

Out[4]:

0 1 2 3 4 5 6
0 44 62 3 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 77 49

In [5]:

df.replace(to_replace=3,value='Three')

Out[5]:

0 1 2 3 4 5 6
0 44 62 Three 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 77 49

In [6]:

df.replace(to_replace={3:'aaa'})

Out[6]:

0 1 2 3 4 5 6
0 44 62 aaa 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 77 49

In [8]:

#替换指定列中的值
df.replace(to_replace={5:77},value='6666666')

Out[8]:

0 1 2 3 4 5 6
0 44 62 3 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 6666666 49

映射操作

  • 概念:创建一个映射关系列表,把values元素和一个特定的标签或者字符串绑定(给一个元素值提供不同的表现形式)

  • 创建一个df,两列分别是姓名和薪资,然后给其名字起对应的中文名

    扫描二维码关注公众号,回复: 8846933 查看本文章

In [10]:

dic = {
    'name':['jay','tom','jay'],
    'salary':[10000,20000,10000]
}
df = DataFrame(data=dic)
df

Out[10]:

name salary
0 jay 10000
1 tom 20000
2 jay 10000

In [14]:

#映射关系表
dic = {
    'jay':'张三',
    'tom':'李四'
}
df['c_name'] = df['name'].map(dic)
df

Out[14]:

name salary c_name
0 jay 10000 张三
1 tom 20000 李四
2 jay 10000 张三

运算工具

  • 超过3000部分的钱缴纳50%的税,计算每个人的税后薪资

In [16]:

def after_sal(s):
    return s - (s-3000)*0.5

In [18]:

df['after_salary'] = df['salary'].map(after_sal)
df

Out[18]:

name salary c_name after_salary
0 jay 10000 张三 6500.0
1 tom 20000 李四 11500.0
2 jay 10000 张三 6500.0

映射索引

  • 使用rename()函数替换行列索引
  • 参数介绍:
    • index 替换行索引
    • columns 替换列索引

In [19]:

df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
df4

Out[19]:

color value
0 white 2
1 gray 5
2 purple 9
3 blue 0
4 green 1

In [20]:

new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
new_col={'color':'cc','value':'vv'}
df4.rename(new_index,columns=new_col)

Out[20]:

cc vv
first white 2
two gray 5
three purple 9
four blue 0
five green 1

排序实现的随机抽样

  • take()
  • np.random.permutation()

In [22]:

df = DataFrame(data=np.random.randint(0,100,size=(100,3)),columns=['A','B','C'])
df

In [24]:

# df.take(['B','A','C'],axis=1)
df.take([1,0,2],axis=1)

In [32]:

np.random.permutation(3) #返回随机序列

Out[32]:

array([0, 1, 2])

In [31]:

#将行列索引打乱
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)

In [35]:

df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)[0:50]

数据的分类处理

  • 数据分类处理的核心:
    • groupby()函数
    • groups属性查看分组情况

In [36]:

df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
               'color':['red','yellow','yellow','green','green','green'],
               'weight':[12,20,50,30,20,44]})
df

Out[36]:

color item price weight
0 red Apple 4.0 12
1 yellow Banana 3.0 20
2 yellow Orange 3.0 50
3 green Banana 2.5 30
4 green Orange 4.0 20
5 green Apple 2.0 44

In [37]:

#根据水果的种类进行分组
df.groupby(by='item')

Out[37]:

<pandas.core.groupby.DataFrameGroupBy object at 0x0000019782507F60>

In [38]:

#调用groups查看分组情况
df.groupby(by='item').groups

Out[38]:

{'Apple': Int64Index([0, 5], dtype='int64'),
 'Banana': Int64Index([1, 3], dtype='int64'),
 'Orange': Int64Index([2, 4], dtype='int64')}

In [40]:

#计算出每一种水果的平均价格
df.groupby(by='item').mean()['price']

Out[40]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [41]:

df.groupby(by='item')['price'].mean() #推荐

Out[41]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [42]:

#计算不同颜色水果的平均重量
df.groupby(by='color')['weight'].mean()

Out[42]:

color
green     31.333333
red       12.000000
yellow    35.000000
Name: weight, dtype: float64

In [44]:

#将每一种水果的平均价格计算出来且汇总到原数据中
df

Out[44]:

color item price weight
0 red Apple 4.0 12
1 yellow Banana 3.0 20
2 yellow Orange 3.0 50
3 green Banana 2.5 30
4 green Orange 4.0 20
5 green Apple 2.0 44

In [47]:

series_price = df.groupby(by='item')['price'].mean() 
dic = series_price.to_dict()
dic #映射关系表

Out[47]:

{'Apple': 3.0, 'Banana': 2.75, 'Orange': 3.5}

In [49]:

df['mean_price'] = df['item'].map(dic)
df

Out[49]:

color item price weight mean_price
0 red Apple 4.0 12 3.00
1 yellow Banana 3.0 20 2.75
2 yellow Orange 3.0 50 3.50
3 green Banana 2.5 30 2.75
4 green Orange 4.0 20 3.50
5 green Apple 2.0 44 3.00

高级数据聚合

  • 使用groupby分组后,也可以使用transform和apply提供自定义函数实现更多的运算
  • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
  • transform和apply都会进行运算,在transform或者apply中传入函数即可
  • transform和apply也可以传入一个lambda表达式

In [56]:

def myMean(s):
    sum = 0
    for i in s:
        sum += i
    return sum/len(s)

In [57]:

df.groupby(by='item')['price'].apply(myMean) #apply充当聚合的运算工具

Out[57]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [58]:

df.groupby(by='item')['price'].transform(myMean) #apply充当聚合的运算工具

Out[58]:

0    3.00
1    2.75
2    3.50
3    2.75
4    3.50
5    3.00
Name: price, dtype: float64

数据加载

  • 读取type-.txt文件数据

In [50]:

data_1 = pd.read_csv('./data/type-.txt',sep='-',header=None)
  • 将文件中每一个词作为元素存放在DataFrame中

In [ ]:

  • 读取数据库中的数据

In [46]:

#连接数据库,获取连接对象
import sqlite3 as sqlite3
conn=sqlite3.connect('./data/weather_2012.sqlite')

In [47]:

#读取库表中的数据值
sql_df=pd.read_sql('select * from weather_2012',conn)
sql_df

In [51]:

#将一个df中的数据值写入存储到db
data_1.to_sql('sql_data123',conn)

In [52]:

pd.read_sql('select * from sql_data123',conn)

Out[52]:

index 0 1 2
0 0 你好 我好 他也好
1 1 也许 大概 有可能
2 2 然而 未必 不见得

透视表

  • 透视表是一种可以对数据动态排布并且分类汇总的表格格式。或许大多数人都在Excel使用过数据透视表,也体会到它的强大功能,而在pandas中它被称作pivot_table。

  • 透视表的优点:
    • 灵活性高,可以随意定制你的分析计算要求
    • 脉络清晰易于理解数据
    • 操作性强,报表神器

In [6]:

import pandas as pd
import numpy as np

In [15]:

df = pd.read_csv('./data/games.csv',encoding='utf-8')
df.head()

Out[15]:

对手 胜负 主客场 命中 投篮数 投篮命中率 3分命中率 篮板 助攻 得分
0 勇士 10 23 0.435 0.444 6 11 27
1 国王 8 21 0.381 0.286 3 9 27
2 小牛 10 19 0.526 0.462 3 7 29
3 灰熊 8 20 0.400 0.250 5 8 22
4 76人 10 20 0.500 0.250 3 13 27

pivot_table有四个最重要的参数index、values、columns、aggfunc

  • index参数:分类汇总的分类条件
    • 每个pivot_table必须拥有一个index。如果想查看哈登对阵每个队伍的得分则需要对每一个队进行分类并计算其各类得分的平均值:

In [16]:

df.pivot_table(index='对手')
  • 想看看对阵同一对手在不同主客场下的数据,分类条件为对手和主客场

In [17]:

df.pivot_table(index=['对手','主客场'])
  • values参数:需要对计算的数据进行筛选
    • 如果我们只需要哈登在主客场和不同胜负情况下的得分、篮板与助攻三项数据:

In [19]:

df.pivot_table(index=['主客场','胜负'],values=['得分','篮板','助攻'])

Out[19]:

助攻 得分 篮板
主客场 胜负
10.555556 34.222222 5.444444
8.666667 29.666667 5.000000
9.000000 32.000000 4.916667
8.000000 20.000000 4.000000
  • Aggfunc参数:设置我们对数据聚合时进行的函数操作
    • 当我们未设置aggfunc时,它默认aggfunc='mean'计算均值。
  • 还想获得james harden在主客场和不同胜负情况下的总得分、总篮板、总助攻时:

In [23]:

df.pivot_table(index=['主客场','胜负'],values=['得分','篮板','助攻'],aggfunc='sum')

Out[23]:

助攻 得分 篮板
主客场 胜负
95 308 49
26 89 15
108 384 59
8 20 4

In [24]:

#还想获得james harden在主客场和不同胜负情况下的总得分、平均篮板、最大助攻时
df.pivot_table(index=['主客场','胜负'],aggfunc={'得分':'sum','篮板':'mean','助攻':'max'})

Out[24]:

助攻 得分 篮板
主客场 胜负
17 308 5.444444
11 89 5.000000
15 384 4.916667
8 20 4.000000
  • Columns:可以设置列层次字段
    • 对values字段进行分类

In [35]:

df.pivot_table(index='主客场',values='得分',aggfunc='sum',columns='对手')

Out[35]:

对手 76人 勇士 国王 太阳 小牛 尼克斯 开拓者 掘金 步行者 湖人 灰熊 爵士 猛龙 篮网 老鹰 骑士 鹈鹕 黄蜂
主客场
29.0 NaN NaN NaN 29.0 37.0 NaN 21.0 29.0 NaN 60.0 56.0 38.0 37.0 NaN 35.0 26.0 NaN
27.0 27.0 27.0 48.0 NaN 31.0 48.0 NaN 26.0 36.0 49.0 29.0 NaN NaN 29.0 NaN NaN 27.0

交叉表

  • 是一种用于计算分组的特殊透视图,对数据进行汇总
  • pd.crosstab(index,colums)
    • index:分组数据,交叉表的行索引
    • columns:交叉表的列索引

In [36]:

df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
               'age':[15,23,25,17,35,57,24,31,22],
               'smoke':[True,False,False,True,True,False,False,True,False],
               'height':[168,179,181,166,173,178,188,190,160]})
df

Out[36]:

age height sex smoke
0 15 168 man True
1 23 179 man False
2 25 181 women False
3 17 166 women True
4 35 173 man True
5 57 178 women False
6 24 188 man False
7 31 190 women True
8 22 160 women False
  • 求出各个性别抽烟的人数

In [37]:

pd.crosstab(df.smoke,df.sex)

Out[37]:

sex man women
smoke
False 2 3
True 2 2

In [38]:

pd.crosstab(df.sex,df.smoke)

Out[38]:

smoke False True
sex
man 2 2
women 3 2
  • 求出各个年龄段抽烟人情况

In [41]:

pd.crosstab(df.age,df.smoke)

Out[41]:

smoke False True
age
15 0 1
17 0 1
22 1 0
23 1 0
24 1 0
25 1 0
31 0 1
35 0 1
57 1 0

8. 2012美国大选献金项目数据分析

In [51]:

import pandas as pd
from pandas import DataFrame
import numpy as np

In [52]:

#方便大家操作,将月份和参选人以及所在政党进行定义:
months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6,
          'JUL' : 7, 'AUG' : 8, 'SEP' : 9, 'OCT': 10, 'NOV': 11, 'DEC' : 12}
parties = {
  'Bachmann, Michelle': 'Republican',
  'Romney, Mitt': 'Republican',
  'Obama, Barack': 'Democrat',
  "Roemer, Charles E. 'Buddy' III": 'Reform',
  'Pawlenty, Timothy': 'Republican',
  'Johnson, Gary Earl': 'Libertarian',
  'Paul, Ron': 'Republican',
  'Santorum, Rick': 'Republican',
  'Cain, Herman': 'Republican',
  'Gingrich, Newt': 'Republican',
  'McCotter, Thaddeus G': 'Republican',
  'Huntsman, Jon': 'Republican',
  'Perry, Rick': 'Republican'           
 }

需求

  • 加载数据
  • 查看数据的基本信息
  • 指定数据截取,将如下字段的数据进行提取,其他数据舍弃
    • cand_nm :候选人姓名
    • contbr_nm : 捐赠人姓名
    • contbr_st :捐赠人所在州
    • contbr_employer : 捐赠人所在公司
    • contbr_occupation : 捐赠人职业
    • contb_receipt_amt :捐赠数额(美元)
    • contb_receipt_dt : 捐款的日期
  • 对新数据进行总览df.info(),查看是否存在缺失数据
  • 用统计学指标快速描述数值型属性的概要。df.describe()
  • 空值处理。可能因为忘记填写或者保密等等原因,相关字段出现了空值,将其填充为NOT PROVIDE
  • 异常值处理。将捐款金额<=0的数据删除
  • 新建一列为各个候选人所在党派party
  • 查看party这一列中有哪些不同的元素
  • 统计party列中各个元素出现次数
  • 查看各个党派收到的政治献金总数contb_receipt_amt
  • 查看具体每天各个党派收到的政治献金总数contb_receipt_amt
  • 将表中日期格式转换为'yyyy-mm-dd'。
  • 查看老兵(捐献者职业)DISABLED VETERAN主要支持谁
  • 找出各个候选人的捐赠者中,捐赠金额最大的人的职业以及捐献额

In [53]:

#加载数据,查看数据的基本信息
df = pd.read_csv('./data/usa_election.txt')
df.head()
C:\Users\laonanhai\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2728: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Out[53]:

cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 250.0 20-JUN-11 NaN NaN NaN SA17A 736166
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 50.0 23-JUN-11 NaN NaN NaN SA17A 736166
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT AL 3.68633e+08 INFORMATION REQUESTED INFORMATION REQUESTED 250.0 05-JUL-11 NaN NaN NaN SA17A 749073
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT AR 7.24548e+08 NONE RETIRED 250.0 01-AUG-11 NaN NaN NaN SA17A 749073
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION AR 7.19016e+08 NONE RETIRED 300.0 20-JUN-11 NaN NaN NaN SA17A 736166

In [54]:

#查看原始数据中是否存在缺失数据
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
cmte_id              536041 non-null object
cand_id              536041 non-null object
cand_nm              536041 non-null object
contbr_nm            536041 non-null object
contbr_city          536026 non-null object
contbr_st            536040 non-null object
contbr_zip           535973 non-null object
contbr_employer      525088 non-null object
contbr_occupation    530520 non-null object
contb_receipt_amt    536041 non-null float64
contb_receipt_dt     536041 non-null object
receipt_desc         8479 non-null object
memo_cd              49718 non-null object
memo_text            52740 non-null object
form_tp              536041 non-null object
file_num             536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB

In [55]:

df.describe()

Out[55]:

contb_receipt_amt file_num
count 5.360410e+05 536041.000000
mean 3.750373e+02 761472.107800
std 3.564436e+03 5148.893508
min -3.080000e+04 723511.000000
25% 5.000000e+01 756218.000000
50% 1.000000e+02 763233.000000
75% 2.500000e+02 763621.000000
max 1.944042e+06 767394.000000

In [56]:

#空值处理。可能因为忘记填写或者保密等等原因,相关字段出现了空值,将其填充为NOT PROVIDE
df.fillna(value='NOT PROVIDE',inplace=True)

In [57]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
cmte_id              536041 non-null object
cand_id              536041 non-null object
cand_nm              536041 non-null object
contbr_nm            536041 non-null object
contbr_city          536041 non-null object
contbr_st            536041 non-null object
contbr_zip           536041 non-null object
contbr_employer      536041 non-null object
contbr_occupation    536041 non-null object
contb_receipt_amt    536041 non-null float64
contb_receipt_dt     536041 non-null object
receipt_desc         536041 non-null object
memo_cd              536041 non-null object
memo_text            536041 non-null object
form_tp              536041 non-null object
file_num             536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB

In [58]:

#异常值处理。将捐款金额<=0的数据删除
df = df.loc[~(df['contb_receipt_amt'] <= 0)]

In [59]:

#查看当前有多少人参与了竞选
df['cand_nm'].unique()

Out[59]:

array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
       'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
       'Huntsman, Jon', 'Perry, Rick'], dtype=object)

In [60]:

df['cand_nm'].nunique()

Out[60]:

13

In [61]:

#新建一列为各个候选人所在党派party
df['party'] = df['cand_nm'].map(parties)
df.head()

Out[61]:

cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num party
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 250.0 20-JUN-11 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 736166 Republican
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 50.0 23-JUN-11 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 736166 Republican
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT AL 3.68633e+08 INFORMATION REQUESTED INFORMATION REQUESTED 250.0 05-JUL-11 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 749073 Republican
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT AR 7.24548e+08 NONE RETIRED 250.0 01-AUG-11 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 749073 Republican
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION AR 7.19016e+08 NONE RETIRED 300.0 20-JUN-11 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 736166 Republican

In [62]:

#查看party这一列中有哪些不同的元素,统计party列中各个元素出现次数
df['party'].unique()

Out[62]:

array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)

In [63]:

df['party'].value_counts()#value_counts()统计Serise中每一元素出现的次数

Out[63]:

Democrat       289999
Republican     234300
Reform           5313
Libertarian       702
Name: party, dtype: int64

In [64]:

#查看各个党派收到的政治献金总数contb_receipt_amt
df.groupby(by='party',axis=0)['contb_receipt_amt'].sum()

Out[64]:

party
Democrat       8.259441e+07
Libertarian    4.132769e+05
Reform         3.429658e+05
Republican     1.251181e+08
Name: contb_receipt_amt, dtype: float64

In [65]:

#查看具体每天各个党派收到的政治献金总数contb_receipt_amt
df.groupby(by=['contb_receipt_dt','party'])['contb_receipt_amt'].sum()

Out[65]:

contb_receipt_dt  party      
01-APR-11         Reform              50.00
                  Republican       12635.00
01-AUG-11         Democrat        182198.00
                  Libertarian       1000.00
                  Reform            1847.00
                  Republican      268903.02
01-DEC-11         Democrat        651982.82
                  Libertarian        725.00
                  Reform             875.00
                  Republican      505255.96
01-FEB-11         Republican         250.00
01-JAN-11         Republican        8600.00
01-JAN-12         Democrat         74303.80
                  Reform             515.00
                  Republican       76804.72
01-JUL-11         Democrat        175364.00
                  Libertarian       2000.00
                  Reform             100.00
                  Republican      125973.72
01-JUN-11         Democrat        148409.00
                  Libertarian        500.00
                  Reform              50.00
                  Republican      435609.20
01-MAR-11         Republican        1000.00
01-MAY-11         Democrat         82644.00
                  Reform             480.00
                  Republican       28663.87
01-NOV-11         Democrat        129309.87
                  Libertarian       3000.00
                  Reform            1792.00
                                    ...    
30-OCT-11         Reform            3910.00
                  Republican       46413.16
30-SEP-11         Democrat       3409587.24
                  Libertarian        550.00
                  Reform            2050.00
                  Republican     5094824.20
31-AUG-11         Democrat        375487.44
                  Libertarian      10750.00
                  Reform             450.00
                  Republican     1038330.90
31-DEC-11         Democrat       3571793.57
                  Reform             695.00
                  Republican     1165777.72
31-JAN-11         Republican        6000.00
31-JAN-12         Democrat       1421887.31
                  Reform             150.00
                  Republican      963681.41
31-JUL-11         Democrat         20305.00
                  Reform            1066.00
                  Republican       12781.02
31-MAR-11         Reform             200.00
                  Republican       74575.00
31-MAY-11         Democrat        352005.66
                  Libertarian        250.00
                  Reform             100.00
                  Republican      313839.80
31-OCT-11         Democrat        216971.87
                  Libertarian       4250.00
                  Reform            3205.00
                  Republican      751542.36
Name: contb_receipt_amt, Length: 1183, dtype: float64

In [66]:

df.columns

Out[66]:

Index(['cmte_id', 'cand_id', 'cand_nm', 'contbr_nm', 'contbr_city',
       'contbr_st', 'contbr_zip', 'contbr_employer', 'contbr_occupation',
       'contb_receipt_amt', 'contb_receipt_dt', 'receipt_desc', 'memo_cd',
       'memo_text', 'form_tp', 'file_num', 'party'],
      dtype='object')

In [67]:

#将表中日期格式转换为'yyyy-mm-dd'。
def tranformDate(d):
    day,month,year = d.split('-')
    month = months[month]
    return '20'+year+'-'+str(month)+'-'+day
df['contb_receipt_dt'] = df['contb_receipt_dt'].map(tranformDate)
df.head()

Out[67]:

cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num party
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 250.0 2011-6-20 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 736166 Republican
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e+08 RETIRED RETIRED 50.0 2011-6-23 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 736166 Republican
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT AL 3.68633e+08 INFORMATION REQUESTED INFORMATION REQUESTED 250.0 2011-7-05 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 749073 Republican
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT AR 7.24548e+08 NONE RETIRED 250.0 2011-8-01 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 749073 Republican
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION AR 7.19016e+08 NONE RETIRED 300.0 2011-6-20 NOT PROVIDE NOT PROVIDE NOT PROVIDE SA17A 736166 Republican

In [68]:

#查看老兵(捐献者职业)DISABLED VETERAN主要支持谁(捐钱数量)

#1.将老兵对应的行数据取出
old_bing_df = df.loc[df['contbr_occupation'] == 'DISABLED VETERAN']

#2.对候选人分组钱数的聚合
old_bing_df.groupby(by='cand_nm')['contb_receipt_amt'].sum()

Out[68]:

cand_nm
Cain, Herman       300.00
Obama, Barack     4205.00
Paul, Ron         2425.49
Santorum, Rick     250.00
Name: contb_receipt_amt, dtype: float64

In [69]:

#找出各个候选人的捐赠者中,捐赠金额最大的人的职业以及捐献额
s = df.groupby(by=['cand_nm'])['contb_receipt_amt'].max()
s

Out[69]:

cand_nm
Bachmann, Michelle                   3022.00
Cain, Herman                        10000.00
Gingrich, Newt                       5100.00
Huntsman, Jon                        5000.00
Johnson, Gary Earl                   2500.00
McCotter, Thaddeus G                 4000.00
Obama, Barack                     1944042.43
Paul, Ron                            5000.00
Pawlenty, Timothy                   10000.00
Perry, Rick                         10000.00
Roemer, Charles E. 'Buddy' III        200.00
Romney, Mitt                        12700.00
Santorum, Rick                       5000.00
Name: contb_receipt_amt, dtype: float64

In [70]:

s.index[0]

Out[70]:

'Bachmann, Michelle'

In [71]:

for i in range(len(s)):
    q_str = 'cand_nm == "%s" & contb_receipt_amt==%d'%(s.index[i],s.values[i])
    display(df.query(q_str))

9.matplotlib绘图

plt.plot()绘制线性图

  • 绘制单条线形图
  • 绘制多条线形图
  • 多个曲线图绘制在一个table区域中(subplot()函数)
  • 设置坐标系的比例plt.figure(figsize=(a,b))
  • 设置图例legend()
  • 设置轴的标识
  • 图例保存
    • fig = plt.figure()
    • plt.plot(x,y)
    • figure.savefig()
  • 曲线的样式和风格(自学)

In [6]:

import numpy as np

In [4]:

import matplotlib.pyplot as plt
%matplotlib inline #保证绘制的图像可以被正常的显示加载出来
UsageError: unrecognized arguments: #保证绘制的图像可以被正常的显示加载出来

In [5]:

x = [1,2,3,4,5]
y = [5,4,3,2,1]

plt.plot(x,y)

Out[5]:

[<matplotlib.lines.Line2D at 0x1a72b583cf8>]

img

In [8]:

#在一个坐标系中绘制两条线段
xx = np.linspace(-np.pi,np.pi,num=20)
yy = xx ** 2
plt.plot(x,y)  #plot多次被调用,绘制多条线段
plt.plot(xx,yy)

Out[8]:

[<matplotlib.lines.Line2D at 0x1a72b870ef0>]

img

In [15]:

#将多个坐标放置在一个表格中
ax1 = plt.subplot(2,2,1) #表格大小和坐标存放的位置
ax1.plot(x,y)


ax2 = plt.subplot(2,2,2)
ax2.plot(xx,yy)


ax3 = plt.subplot(2,2,3)
ax3.plot(xx,yy)


ax4 = plt.subplot(2,2,4)
ax4.plot(x,y)

Out[15]:

[<matplotlib.lines.Line2D at 0x1a72bd5fb70>]

img

In [17]:

#plt.figure(figsize=(a,b))
plt.figure(figsize=(4,8))
plt.plot(x,y)

In [23]:

#图例的设定
plt.plot(xx,yy,label='aaa')
plt.plot(xx-1,yy+1,label='bbb')
plt.legend(loc=1)

Out[23]:

<matplotlib.legend.Legend at 0x1a72d8d4198>

img

In [25]:

#给坐标轴设定标识
plt.plot(xx-1,yy+1,label='bbb')
plt.xlabel('distence')
plt.ylabel('temp')
plt.title('aaa')

Out[25]:

Text(0.5,1,'aaa')

img

柱状图:plt.bar()

  • 参数:第一个参数是索引。第二个参数是数据值。第三个参数是条形的宽度

In [33]:

plt.bar(x,y)

Out[33]:

<Container object of 5 artists>

img

In [29]:

plt.barh(x,y)

Out[29]:

<Container object of 5 artists>

img

直方图

  • 是一个特殊的柱状图,又叫做密度图
  • plt.hist()的参数
    • bins
      可以是一个bin数量的整数值,也可以是表示bin的一个序列。默认值为10
    • normed
      如果值为True,直方图的值将进行归一化处理,形成概率密度,默认值为False
    • color
      指定直方图的颜色。可以是单一颜色值或颜色的序列。如果指定了多个数据集合,例如DataFrame对象,颜色序列将会设置为相同的顺序。如果未指定,将会使用一个默认的线条颜色
    • orientation
      通过设置orientation为horizontal创建水平直方图。默认值为vertical

In [40]:

x = [1,1,2,3,4,5,5,5,6,7,7,7,7,7,7,7,8]
plt.hist(x,bins=20)

Out[40]:

(array([2., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 3., 0., 0., 1., 0., 0.,
        7., 0., 1.]),
 array([1.  , 1.35, 1.7 , 2.05, 2.4 , 2.75, 3.1 , 3.45, 3.8 , 4.15, 4.5 ,
        4.85, 5.2 , 5.55, 5.9 , 6.25, 6.6 , 6.95, 7.3 , 7.65, 8.  ]),
 <a list of 20 Patch objects>)

img

饼图

  • pie(),饼图也只有一个参数x
  • 饼图适合展示各部分占总体的比例,条形图适合比较各部分的大小

In [41]:

arr=[11,22,31,15]
plt.pie(arr)

Out[41]:

([<matplotlib.patches.Wedge at 0x1a72dae13c8>,
  <matplotlib.patches.Wedge at 0x1a72dae1748>,
  <matplotlib.patches.Wedge at 0x1a72c37ea58>,
  <matplotlib.patches.Wedge at 0x1a72bc76ac8>],
 [Text(0.996424,0.465981,''),
  Text(-0.195798,1.08243,''),
  Text(-0.830021,-0.721848,''),
  Text(0.910034,-0.61793,'')])

img

In [42]:

arr=[0.2,0.3,0.1]
plt.pie(arr)

Out[42]:

([<matplotlib.patches.Wedge at 0x1a72ef21630>,
  <matplotlib.patches.Wedge at 0x1a72ef21b00>,
  <matplotlib.patches.Wedge at 0x1a72ef28080>],
 [Text(0.889919,0.646564,''),
  Text(-0.646564,0.889919,''),
  Text(-1.04616,-0.339919,'')])

img

In [43]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'])

Out[43]:

([<matplotlib.patches.Wedge at 0x1a72ef61d68>,
  <matplotlib.patches.Wedge at 0x1a72ef69278>,
  <matplotlib.patches.Wedge at 0x1a72ef697b8>,
  <matplotlib.patches.Wedge at 0x1a72ef69cf8>],
 [Text(0.996424,0.465981,'a'),
  Text(-0.195798,1.08243,'b'),
  Text(-0.830021,-0.721848,'c'),
  Text(0.910034,-0.61793,'d')])

img

In [44]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3)

Out[44]:

([<matplotlib.patches.Wedge at 0x1a72efb18d0>,
  <matplotlib.patches.Wedge at 0x1a72efb1da0>,
  <matplotlib.patches.Wedge at 0x1a72efbb320>,
  <matplotlib.patches.Wedge at 0x1a72efbb860>],
 [Text(0.271752,0.127086,'a'),
  Text(-0.0533994,0.295209,'b'),
  Text(-0.226369,-0.196868,'c'),
  Text(0.248191,-0.168526,'d')])

img

In [45]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3,autopct='%.6f%%')

Out[45]:

([<matplotlib.patches.Wedge at 0x1a72f0024a8>,
  <matplotlib.patches.Wedge at 0x1a72f002ba8>,
  <matplotlib.patches.Wedge at 0x1a72f00a358>,
  <matplotlib.patches.Wedge at 0x1a72f00aac8>],
 [Text(0.271752,0.127086,'a'),
  Text(-0.0533994,0.295209,'b'),
  Text(-0.226369,-0.196868,'c'),
  Text(0.248191,-0.168526,'d')],
 [Text(0.543504,0.254171,'13.924050%'),
  Text(-0.106799,0.590419,'27.848101%'),
  Text(-0.452739,-0.393735,'39.240506%'),
  Text(0.496382,-0.337053,'18.987341%')])

img

In [46]:

arr=[11,22,31,15]
plt.pie(arr,labels=['a','b','c','d'],labeldistance=0.3,shadow=True,explode=[0.2,0.3,0.2,0.4])

Out[46]:

([<matplotlib.patches.Wedge at 0x1a72f04e940>,
  <matplotlib.patches.Wedge at 0x1a72f056128>,
  <matplotlib.patches.Wedge at 0x1a72f056940>,
  <matplotlib.patches.Wedge at 0x1a72f062198>],
 [Text(0.45292,0.21181,'a'),
  Text(-0.106799,0.590419,'b'),
  Text(-0.377282,-0.328113,'c'),
  Text(0.579113,-0.393228,'d')])

img

散点图scatter()

  • 因变量随自变量而变化的大致趋势

In [49]:

x = np.array([1,2,3,4,5])
y = x ** 2
plt.scatter(x,y)

Out[49]:

<matplotlib.collections.PathCollection at 0x1a72f089438>

img

In [51]:

x = np.random.random(size=(20,))
y = np.random.random(size=(20,))
plt.scatter(x,y)

Out[51]:

<matplotlib.collections.PathCollection at 0x1a72fe59e10>

img

temp dist

Type Markdown and LaTeX: α2α2

Type Markdown and LaTeX: α2α2

项目需求

第一部分:数据类型处理

  • 数据加载
    • 字段含义:
      • user_id:用户ID
      • order_dt:购买日期
      • order_product:购买产品的数量
      • order_amount:购买金额
  • 观察数据
    • 查看数据的数据类型
    • 数据中是否存储在缺失值
    • 将order_dt转换成时间类型
    • 查看数据的统计描述
      • 计算所有用户购买商品的平均数量
      • 计算所有用户购买商品的平均花费
    • 在源数据中添加一列表示月份

第二部分:按月数据分析

  • 用户每月购买的总金额
    • 绘制曲线图展示
  • 所有用户每月的消费总次数
  • 所有用户每月的产品购买量
  • 统计每月的消费人数

第三部分:用户个体消费数据分析

  • 用户消费总金额和消费总次数的统计描述
  • 用户消费金额和消费次数的散点图
  • 用户消费金额的分布图
  • 用户消费次数的分布图(消费次数在100次之内的分布)

第四部分:用户消费行为分析

  • 用户第一次消费的时间分布,和人数统计
    • 绘制线形图
  • 用户最后一次消费的时间分布,和人数统计
    • 绘制线形图
  • 新老客户的占比
    • 消费一次为新用户
    • 消费多次为老用户
  • 用户分层
    • 分析得出每个用户的总购买量和总消费金额and最近一次消费的时间
  • 用户的生命周期
    • 将用户划分为活跃用户和其他用户
    • 查看每月活跃用户和其他用户的占比

猜你喜欢

转载自www.cnblogs.com/bky20061005/p/12233244.html
今日推荐