笔记——pandas学习3

合并来自多个表的数据

import pandas as pd
air_quality_no2=pd.read_csv("data/air_quality_long.csv",parse_dates=True)
#parse_dates参数:将csv中的时间字符串转换成日期格式
air_quality_no2=air_quality_no2[["date.utc","location","parameter","value"]]
air_quality_no2.head()
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
air_quality_pm25=pd.read_csv("data/air_quality_pm25_long.csv",parse_dates=True)
air_quality_pm25.head()
city country date.utc location parameter value unit
0 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0 µg/m³
1 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5 µg/m³
2 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5 µg/m³
3 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0 µg/m³
4 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5 µg/m³

1. concat()函数:

  1. 沿一个轴(行或列)执行多个表的串联操作
  2. 默认沿行进行拼接
  3. concat()的keys参数:提供了一个附加的(分层的)行索引
    [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Ina28uDp-1595754064474)(attachment:image.png)]
air_quality=pd.concat([air_quality_pm25,air_quality_no2])
air_quality.head()
city country date.utc location parameter value unit
0 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0 µg/m³
1 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5 µg/m³
2 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5 µg/m³
3 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0 µg/m³
4 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5 µg/m³
print("Shape of pm25:",air_quality_pm25.shape)
print("Shape of no2:",air_quality_no2.shape)
print("shape of quality:",air_quality.shape)
#3178 = 1110 + 2068行。
Shape of pm25: (1110, 7)
Shape of no2: (5272, 4)
shape of quality: (6382, 7)
air_quality=air_quality.sort_values("date.utc")
air_quality
city country date.utc location parameter value unit
3500 NaN NaN 2019-04-09 01:00:00+00:00 FR04014 no2 24.4 NaN
176 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 pm25 76.0 NaN
3663 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 no2 22.5 NaN
5271 NaN NaN 2019-04-09 02:00:00+00:00 London Westminster no2 67.0 NaN
1824 NaN NaN 2019-04-09 02:00:00+00:00 London Westminster pm25 42.0 NaN
... ... ... ... ... ... ... ...
102 London GB 2019-06-20 23:00:00+00:00 London Westminster pm25 7.0 µg/m³
178 NaN NaN 2019-06-20 23:00:00+00:00 London Westminster pm25 7.0 NaN
177 NaN NaN 2019-06-21 00:00:00+00:00 London Westminster pm25 7.0 NaN
1825 NaN NaN 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 NaN
101 London GB 2019-06-21 00:00:00+00:00 London Westminster pm25 7.0 µg/m³

6382 rows × 7 columns

#concat的keys参数提供了一个方便的解决方案,添加了一个附加的(分层的)行索引
air_quality_=pd.concat([air_quality_pm25,air_quality_no2],keys=["PM25","NO2"])
air_quality_
city country date.utc location parameter value unit
PM25 0 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0 µg/m³
1 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5 µg/m³
2 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5 µg/m³
3 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0 µg/m³
4 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5 µg/m³
... ... ... ... ... ... ... ... ...
NO2 5267 NaN NaN 2019-04-09 06:00:00+00:00 London Westminster no2 41.0 NaN
5268 NaN NaN 2019-04-09 05:00:00+00:00 London Westminster no2 41.0 NaN
5269 NaN NaN 2019-04-09 04:00:00+00:00 London Westminster no2 41.0 NaN
5270 NaN NaN 2019-04-09 03:00:00+00:00 London Westminster no2 67.0 NaN
5271 NaN NaN 2019-04-09 02:00:00+00:00 London Westminster no2 67.0 NaN

6382 rows × 7 columns

2. reset_index():将任何级别的索引转换为列

air_quality.reset_index(level=0)
index city country date.utc location parameter value unit
0 3500 NaN NaN 2019-04-09 01:00:00+00:00 FR04014 no2 24.4 NaN
1 176 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 pm25 76.0 NaN
2 3663 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 no2 22.5 NaN
3 5271 NaN NaN 2019-04-09 02:00:00+00:00 London Westminster no2 67.0 NaN
4 1824 NaN NaN 2019-04-09 02:00:00+00:00 London Westminster pm25 42.0 NaN
... ... ... ... ... ... ... ... ...
6377 102 London GB 2019-06-20 23:00:00+00:00 London Westminster pm25 7.0 µg/m³
6378 178 NaN NaN 2019-06-20 23:00:00+00:00 London Westminster pm25 7.0 NaN
6379 177 NaN NaN 2019-06-21 00:00:00+00:00 London Westminster pm25 7.0 NaN
6380 1825 NaN NaN 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 NaN
6381 101 London GB 2019-06-21 00:00:00+00:00 London Westminster pm25 7.0 µg/m³

6382 rows × 8 columns

merge():加入使用公共标识符表(自然连接)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-9LXXyMGI-1595754064481)(attachment:image.png)]

station_coord=pd.read_csv("data/air_quality_stations.csv")
station_coord.head()
location coordinates.latitude coordinates.longitude
0 BELAL01 51.23619 4.38522
1 BELHB23 51.17030 4.34100
2 BELLD01 51.10998 5.00486
3 BELLD02 51.12038 5.02155
4 BELR833 51.32766 4.36226
air_quality.head()
city country date.utc location parameter value unit
3500 NaN NaN 2019-04-09 01:00:00+00:00 FR04014 no2 24.4 NaN
176 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 pm25 76.0 NaN
3663 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 no2 22.5 NaN
5271 NaN NaN 2019-04-09 02:00:00+00:00 London Westminster no2 67.0 NaN
1824 NaN NaN 2019-04-09 02:00:00+00:00 London Westminster pm25 42.0 NaN
air_quality=pd.merge(air_quality,station_coord,how='left',on='location')
air_quality.head()
city country date.utc location parameter value unit coordinates.latitude coordinates.longitude
0 NaN NaN 2019-04-09 01:00:00+00:00 FR04014 no2 24.4 NaN 48.83724 2.39390
1 NaN NaN 2019-04-09 01:00:00+00:00 FR04014 no2 24.4 NaN 48.83722 2.39390
2 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 pm25 76.0 NaN 51.20966 4.43182
3 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 no2 22.5 NaN 51.20966 4.43182
4 NaN NaN 2019-04-09 02:00:00+00:00 London Westminster no2 67.0 NaN 51.49467 -0.13193
air_quality_parameters = pd.read_csv("data/air_quality_parameters.csv")
air_quality_parameters.head()
id description name
0 bc Black Carbon BC
1 co Carbon Monoxide CO
2 no2 Nitrogen Dioxide NO2
3 o3 Ozone O3
4 pm10 Particulate matter less than 10 micrometers in... PM10
air_quality=pd.merge(air_quality,air_quality_parameters,how='left',left_on="parameter",right_on="id")
air_quality.head()
#在left_on和right_on 参数都在这里(而不是仅仅使用on),使两个表之间的联系。
city country date.utc location parameter value unit coordinates.latitude coordinates.longitude id_x description_x name_x id_y description_y name_y
0 NaN NaN 2019-04-09 01:00:00+00:00 FR04014 no2 24.4 NaN 48.83724 2.39390 no2 Nitrogen Dioxide NO2 no2 Nitrogen Dioxide NO2
1 NaN NaN 2019-04-09 01:00:00+00:00 FR04014 no2 24.4 NaN 48.83722 2.39390 no2 Nitrogen Dioxide NO2 no2 Nitrogen Dioxide NO2
2 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 pm25 76.0 NaN 51.20966 4.43182 pm25 Particulate matter less than 2.5 micrometers i... PM2.5 pm25 Particulate matter less than 2.5 micrometers i... PM2.5
3 NaN NaN 2019-04-09 01:00:00+00:00 BETR801 no2 22.5 NaN 51.20966 4.43182 no2 Nitrogen Dioxide NO2 no2 Nitrogen Dioxide NO2
4 NaN NaN 2019-04-09 02:00:00+00:00 London Westminster no2 67.0 NaN 51.49467 -0.13193 no2 Nitrogen Dioxide NO2 no2 Nitrogen Dioxide NO2

小结:

  1. 使用该concat功能,可以将多个表按行连接到两列。

  2. 对于类似数据库的表合并/联接,请使用该merge 函数。


猜你喜欢

转载自blog.csdn.net/chairon/article/details/107596076