在pandas中创建图像
import pandas as pd
import matplotlib. pyplot as plt
1. 使用空气质量数据air_quality_no2.csv
air_quality= pd. read_csv( "data/air_quality_no2.csv" , index_col= 0 , parse_dates= True )
air_quality. head( )
station_antwerp
station_paris
station_london
datetime
2019-05-07 02:00:00
NaN
NaN
23.0
2019-05-07 03:00:00
50.5
25.0
19.0
2019-05-07 04:00:00
45.0
27.7
19.0
2019-05-07 05:00:00
NaN
50.4
16.0
2019-05-07 06:00:00
NaN
61.9
NaN
2. 可视化数据
plt. plot( air_quality)
[<matplotlib.lines.Line2D at 0x227a8689488>,
<matplotlib.lines.Line2D at 0x227a91e7608>,
<matplotlib.lines.Line2D at 0x227a91e7688>]
air_quality. plot( )
<matplotlib.axes._subplots.AxesSubplot at 0x227a80e86c8>
2. visually compare the N02 values measured in London versus Paris.
air_quality. plot. scatter( x= "station_london" , y= "station_paris" , alpha= 0.5 )
<matplotlib.axes._subplots.AxesSubplot at 0x227ab575848>
plt. scatter( air_quality[ "station_london" ] , air_quality[ "station_paris" ] , alpha= 0.2 )
<matplotlib.collections.PathCollection at 0x227a96d3388>
3.一些标准的Python来概述可用的绘图方法:
[ method_name for method_name in dir ( air_quality. plot) if not method_name. startswith( "_" ) ]
['area',
'bar',
'barh',
'box',
'density',
'hexbin',
'hist',
'kde',
'line',
'pie',
'scatter']
注意
在许多开发环境比如ipython和jupyter nootbook,请使用TAB按钮获得全部的可用方法,例如air_quality.plot.+ TAB。
air_quality. plot. box( )
<matplotlib.axes._subplots.AxesSubplot at 0x227ac9382c8>
4.每个列在单独的子图中:subplots
axs= air_quality. plot. area( figsize= ( 12 , 4 ) , subplots= True )
5.进一步自定义,扩展或保存生成的图。
fig, axs = plt. subplots( figsize= ( 12 , 4 ) )
air_quality. plot. area( ax= axs)
axs. set_ylabel( "NO$_2$ concentration" )
fig. savefig( "no2_concentrations.png" )
小结:
该.plot.*方法适用于Series和DataFrames
默认情况下,每个列都用不同的元素(线,箱线图等)绘制
pandas创建的任何plot都是Matplotlib对象。
扫描二维码关注公众号,回复:
11628347 查看本文章
从现有列派生出新的列
air_quality[ "londom-mg_per_cubic" ] = air_quality[ "station_london" ] * 1.882
air_quality. head( )
station_antwerp
station_paris
station_london
londom-mg_per_cubic
datetime
2019-05-07 02:00:00
NaN
NaN
23.0
43.286
2019-05-07 03:00:00
50.5
25.0
19.0
35.758
2019-05-07 04:00:00
45.0
27.7
19.0
35.758
2019-05-07 05:00:00
NaN
50.4
16.0
30.112
2019-05-07 06:00:00
NaN
61.9
NaN
NaN
air_quality[ "ratio_paris_anwerp" ] = air_quality[ "station_paris" ] / air_quality[ "station_antwerp" ]
air_quality. head( )
station_antwerp
station_paris
station_london
londom-mg_per_cubic
ratio_paris_anwerp
datetime
2019-05-07 02:00:00
NaN
NaN
23.0
43.286
NaN
2019-05-07 03:00:00
50.5
25.0
19.0
35.758
0.495050
2019-05-07 04:00:00
45.0
27.7
19.0
35.758
0.615556
2019-05-07 05:00:00
NaN
50.4
16.0
30.112
NaN
2019-05-07 06:00:00
NaN
61.9
NaN
NaN
NaN
rename()
可用于行标签和列标签。
提供一个字典,其中包含键,当前名称和值,以及新名称.
air_quality_renamed= air_quality. rename( columns= {
"station_anterp" : "BETR801" ,
"staion_paris" : "FR04014" ,
"staion_london" : "London Westminster" } )
air_quality_renamed. head( )
station_antwerp
station_paris
station_london
londom-mg_per_cubic
ratio_paris_anwerp
datetime
2019-05-07 02:00:00
NaN
NaN
23.0
43.286
NaN
2019-05-07 03:00:00
50.5
25.0
19.0
35.758
0.495050
2019-05-07 04:00:00
45.0
27.7
19.0
35.758
0.615556
2019-05-07 05:00:00
NaN
50.4
16.0
30.112
NaN
2019-05-07 06:00:00
NaN
61.9
NaN
NaN
NaN
air_quality_renamed= air_quality. rename( columns= str . lower)
air_quality_renamed. head( )
station_antwerp
station_paris
station_london
londom-mg_per_cubic
ratio_paris_anwerp
datetime
2019-05-07 02:00:00
NaN
NaN
23.0
43.286
NaN
2019-05-07 03:00:00
50.5
25.0
19.0
35.758
0.495050
2019-05-07 04:00:00
45.0
27.7
19.0
35.758
0.615556
2019-05-07 05:00:00
NaN
50.4
16.0
30.112
NaN
2019-05-07 06:00:00
NaN
61.9
NaN
NaN
NaN
小结:
创建新列:把基于现有列的计算结果赋值给[]中的新列名。
操作是基于元素的,不需要遍历行。
rename与字典或函数一起使用来重命名行标签或列名。
计算汇总统计信息
titanic= pd. read_csv( "data/titanic.csv" )
titanic. head( )
PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked
0
1
0
3
Braund, Mr. Owen Harris
male
22.0
1
0
A/5 21171
7.2500
NaN
S
1
2
1
1
Cumings, Mrs. John Bradley (Florence Briggs Th...
female
38.0
1
0
PC 17599
71.2833
C85
C
2
3
1
3
Heikkinen, Miss. Laina
female
26.0
0
0
STON/O2. 3101282
7.9250
NaN
S
3
4
1
1
Futrelle, Mrs. Jacques Heath (Lily May Peel)
female
35.0
1
0
113803
53.1000
C123
S
4
5
0
3
Allen, Mr. William Henry
male
35.0
0
0
373450
8.0500
NaN
S
1.平均数
titanic[ "Age" ] . mean( )
29.69911764705882
2. 中位数
titanic[ [ "Age" , "Fare" ] ] . median( )
Age 28.0000
Fare 14.4542
dtype: float64
titanic[ [ "Age" , "Fare" ] ] . describe( )
Age
Fare
count
714.000000
891.000000
mean
29.699118
32.204208
std
14.526497
49.693429
min
0.420000
0.000000
25%
20.125000
7.910400
50%
28.000000
14.454200
75%
38.000000
31.000000
max
80.000000
512.329200
3. DataFrame.agg()
使用指定轴上的一项或多项操作进行汇总
titanic. agg( { 'Age' : [ 'min' , 'max' , 'median' , 'skew' ] ,
'Fare' : [ 'min' , 'max' , 'median' , 'mean' ] } )
Age
Fare
max
80.000000
512.329200
mean
NaN
32.204208
median
28.000000
14.454200
min
0.420000
0.000000
skew
0.389108
NaN
4.groupby():按类别分组
获取按类别(属性)分组的统计信息
分组可以同时由多个列完成。提供列名作为groupby()方法的列表
titanic[ [ "Sex" , "Age" ] ] . groupby( 'Sex' ) . mean( )
Age
Sex
female
27.915709
male
30.726645
titanic. groupby( 'Sex' ) . mean( )
PassengerId
Survived
Pclass
Age
SibSp
Parch
Fare
Sex
female
431.028662
0.742038
2.159236
27.915709
0.694268
0.649682
44.479818
male
454.147314
0.188908
2.389948
30.726645
0.429809
0.235702
25.523893
titanic. groupby( 'Sex' ) [ 'Age' ] . mean( )
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
titanic. groupby( [ 'Sex' , 'Pclass' ] ) [ 'Fare' ] . mean( )
Sex Pclass
female 1 106.125798
2 21.970121
3 16.118810
male 1 67.226127
2 19.741782
3 12.661633
Name: Fare, dtype: float64
5.value_counts():按类别统计数目
titanic[ 'Pclass' ] . value_counts( )
3 491
1 216
2 184
Name: Pclass, dtype: int64
titanic. groupby( 'Pclass' ) [ 'Pclass' ] . count( )
Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64
注意:
size()和count()都可以和groupby()一起用
size包括NaN(size of the table)
count排除了缺省值
value_counts()可以使用drapna()去增/去 Nan
小结:
可以对整列或整行计算聚合统计信息
groupby提供 拆分-应用-组合 模式的功能
value_counts 是一种方便的快捷方式,用于计算每个类别中的条目数
对表格进行重新布局
1. 对表格进行排序
使用.sort_values(),将根据定义的列对表中的行进行排序。索引将遵循行顺序。
titanic. sort_values( by= 'Age' ) . head( )
PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked
803
804
1
3
Thomas, Master. Assad Alexander
male
0.42
0
1
2625
8.5167
NaN
C
755
756
1
2
Hamalainen, Master. Viljo
male
0.67
1
1
250649
14.5000
NaN
S
644
645
1
3
Baclini, Miss. Eugenie
female
0.75
2
1
2666
19.2583
NaN
C
469
470
1
3
Baclini, Miss. Helene Barbara
female
0.75
2
1
2666
19.2583
NaN
C
78
79
1
2
Caldwell, Master. Alden Gates
male
0.83
0
2
248738
29.0000
NaN
S
titanic. sort_values( by= [ "Pclass" , "Age" ] , ascending= False ) . head( )
PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked
851
852
0
3
Svensson, Mr. Johan
male
74.0
0
0
347060
7.7750
NaN
S
116
117
0
3
Connors, Mr. Patrick
male
70.5
0
0
370369
7.7500
NaN
Q
280
281
0
3
Duane, Mr. Frank
male
65.0
0
0
336439
7.7500
NaN
Q
483
484
1
3
Turkula, Mrs. (Hedwig)
female
63.0
0
0
4134
9.5875
NaN
S
326
327
0
3
Nysveen, Mr. Johan Hansen
male
61.0
0
0
345364
6.2375
NaN
S
air_quality= pd. read_csv( "data/air_quality_long.csv" , index_col= "date.utc" , parse_dates= True )
air_quality. head( )
city
country
location
parameter
value
unit
date.utc
2019-06-18 06:00:00+00:00
Antwerpen
BE
BETR801
pm25
18.0
µg/m³
2019-06-17 08:00:00+00:00
Antwerpen
BE
BETR801
pm25
6.5
µg/m³
2019-06-17 07:00:00+00:00
Antwerpen
BE
BETR801
pm25
18.5
µg/m³
2019-06-17 06:00:00+00:00
Antwerpen
BE
BETR801
pm25
16.0
µg/m³
2019-06-17 05:00:00+00:00
Antwerpen
BE
BETR801
pm25
7.5
µg/m³
2.长到宽表格式???Long to wide table format
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-MNKpTCte-1595753237357)(attachment:image.png)]
no2= air_quality[ air_quality[ 'parameter' ] == 'no2' ]
no2_subset= no2. sort_index( ) . groupby( "location" ) . head( 2 )
no2_subset
city
country
location
parameter
value
unit
date.utc
2019-04-09 01:00:00+00:00
Antwerpen
BE
BETR801
no2
22.5
µg/m³
2019-04-09 01:00:00+00:00
Paris
FR
FR04014
no2
24.4
µg/m³
2019-04-09 02:00:00+00:00
London
GB
London Westminster
no2
67.0
µg/m³
2019-04-09 02:00:00+00:00
Antwerpen
BE
BETR801
no2
53.5
µg/m³
2019-04-09 02:00:00+00:00
Paris
FR
FR04014
no2
27.4
µg/m³
2019-04-09 03:00:00+00:00
London
GB
London Westminster
no2
67.0
µg/m³
3.pivot():
该pivot_table()功能纯粹是对数据进行整形:每个索引/列组合都需要一个值。
no2_subset. pivot( columns= "location" , values= "value" )
location
BETR801
FR04014
London Westminster
date.utc
2019-04-09 01:00:00+00:00
22.5
24.4
NaN
2019-04-09 02:00:00+00:00
53.5
27.4
67.0
2019-04-09 03:00:00+00:00
NaN
NaN
67.0
no2. head( )
city
country
location
parameter
value
unit
date.utc
2019-06-21 00:00:00+00:00
Paris
FR
FR04014
no2
20.0
µg/m³
2019-06-20 23:00:00+00:00
Paris
FR
FR04014
no2
21.8
µg/m³
2019-06-20 22:00:00+00:00
Paris
FR
FR04014
no2
26.5
µg/m³
2019-06-20 21:00:00+00:00
Paris
FR
FR04014
no2
24.9
µg/m³
2019-06-20 20:00:00+00:00
Paris
FR
FR04014
no2
21.4
µg/m³
no2. pivot( columns= 'location' , values= 'value' ) . plot( )
<matplotlib.axes._subplots.AxesSubplot at 0x227af5abfc8>
4.透视表:
air_quality. pivot_table( values= "value" , index= "location" , columns= "parameter" , aggfunc= "mean" )
parameter
no2
pm25
location
BETR801
26.950920
23.169492
FR04014
29.374284
NaN
London Westminster
29.740050
13.443568
air_quality. pivot_table( values= "value" , index= "location" , columns= "parameter" , aggfunc= "mean" , margins= True )
parameter
no2
pm25
All
location
BETR801
26.950920
23.169492
24.982353
FR04014
29.374284
NaN
29.374284
London Westminster
29.740050
13.443568
21.491708
All
29.430316
14.386849
24.222743
air_quality. groupby( [ "parameter" , "location" ] ) . mean( )
value
parameter
location
no2
BETR801
26.950920
FR04014
29.374284
London Westminster
29.740050
pm25
BETR801
23.169492
London Westminster
13.443568
5. Wide to long format
pandas.melt()方法DataFrame将数据表从宽格式转换为长格式。
列标题成为新创建的列中的变量名称。 [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Nqf0iDni-1595753237361)(attachment:image.png)]
no2_pivoted = no2. pivot( columns= "location" , values= "value" ) . reset_index( )
no2_pivoted. head( )
location
date.utc
BETR801
FR04014
London Westminster
0
2019-04-09 01:00:00+00:00
22.5
24.4
NaN
1
2019-04-09 02:00:00+00:00
53.5
27.4
67.0
2
2019-04-09 03:00:00+00:00
54.5
34.2
67.0
3
2019-04-09 04:00:00+00:00
34.5
48.5
41.0
4
2019-04-09 05:00:00+00:00
46.5
59.5
41.0
no_2= no2_pivoted. melt( id_vars= "date.utc" )
no_2. head( )
date.utc
location
value
0
2019-04-09 01:00:00+00:00
BETR801
22.5
1
2019-04-09 02:00:00+00:00
BETR801
53.5
2
2019-04-09 03:00:00+00:00
BETR801
54.5
3
2019-04-09 04:00:00+00:00
BETR801
34.5
4
2019-04-09 05:00:00+00:00
BETR801
46.5
no_2= no2_pivoted. melt( id_vars= "date.utc" ,
value_vars= [ "BETR801" ,
"FR04014" ,
"London Westminster" ] ,
value_name= "NO_2" ,
var_name= "id_location" )
no_2. head( )
date.utc
id_location
NO_2
0
2019-04-09 01:00:00+00:00
BETR801
22.5
1
2019-04-09 02:00:00+00:00
BETR801
53.5
2
2019-04-09 03:00:00+00:00
BETR801
54.5
3
2019-04-09 04:00:00+00:00
BETR801
34.5
4
2019-04-09 05:00:00+00:00
BETR801
46.5
结果相同,但定义更详细:
value_vars明确定义要融合在一起的列
value_name 提供值列的自定义列名,而不是默认列名 value
var_name为收集列标题名称的列提供自定义列名称。否则,它采用索引名称或默认值variable
因此,参数value_name和var_name是两个生成的列的用户定义名称。要熔化的列由id_vars和定义value_vars。
小结:
支持按一列或多列排序 sort_values
该pivot功能是纯粹的数据重组, pivot_table支持聚合
pivot(长到宽格式)的相反是melt(宽到长格式)