Pandas案例--------PM2.5数据统计


数据来源: https://www.kaggle.com/uciml/pm25-data-for-five-chinese-cities

数据读取

from matplotlib import font_manager
from matplotlib import pyplot as plt
import pandas as pd 
file_path = "./BeijingPM2.5.csv"

df = pd.read_csv(file_path)
print(df.head())
print(df.info())
   No  year  month  day  hour  season  PM_Dongsi  PM_Dongsihuan  \
0   1  2010      1    1     0       4        NaN            NaN   
1   2  2010      1    1     1       4        NaN            NaN   
2   3  2010      1    1     2       4        NaN            NaN   
3   4  2010      1    1     3       4        NaN            NaN   
4   5  2010      1    1     4       4        NaN            NaN   

   PM_Nongzhanguan  PM_US Post  DEWP  HUMI    PRES  TEMP cbwd    Iws  \
0              NaN         NaN -21.0  43.0  1021.0 -11.0   NW   1.79   
1              NaN         NaN -21.0  47.0  1020.0 -12.0   NW   4.92   
2              NaN         NaN -21.0  43.0  1019.0 -11.0   NW   6.71   
3              NaN         NaN -21.0  55.0  1019.0 -14.0   NW   9.84   
4              NaN         NaN -20.0  51.0  1018.0 -12.0   NW  12.97   

   precipitation  Iprec  
0            0.0    0.0  
1            0.0    0.0  
2            0.0    0.0  
3            0.0    0.0  
4            0.0    0.0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52584 entries, 0 to 52583
Data columns (total 18 columns):
No                 52584 non-null int64
year               52584 non-null int64
month              52584 non-null int64
day                52584 non-null int64
hour               52584 non-null int64
season             52584 non-null int64
PM_Dongsi          25052 non-null float64
PM_Dongsihuan      20508 non-null float64
PM_Nongzhanguan    24931 non-null float64
PM_US Post         50387 non-null float64
DEWP               52579 non-null float64
HUMI               52245 non-null float64
PRES               52245 non-null float64
TEMP               52579 non-null float64
cbwd               52579 non-null object
Iws                52579 non-null float64
precipitation      52100 non-null float64
Iprec              52100 non-null float64
dtypes: float64(11), int64(6), object(1)
memory usage: 7.2+ MB
None

数据整理

PeriodIndex函数----时间段

## 把分开的时间字符串通过periodIndex的方法转化为pandas的时间类型
periods = pd.PeriodIndex(year=df["year"],month=df["month"],day=df["day"],hour=df["hour"],freq="H")
df["datetime"] = periods
print(df.head(10))
   No  year  month  day  hour  season  PM_Dongsi  PM_Dongsihuan  \
0   1  2010      1    1     0       4        NaN            NaN   
1   2  2010      1    1     1       4        NaN            NaN   
2   3  2010      1    1     2       4        NaN            NaN   
3   4  2010      1    1     3       4        NaN            NaN   
4   5  2010      1    1     4       4        NaN            NaN   
5   6  2010      1    1     5       4        NaN            NaN   
6   7  2010      1    1     6       4        NaN            NaN   
7   8  2010      1    1     7       4        NaN            NaN   
8   9  2010      1    1     8       4        NaN            NaN   
9  10  2010      1    1     9       4        NaN            NaN   

   PM_Nongzhanguan  PM_US Post  DEWP  HUMI    PRES  TEMP cbwd    Iws  \
0              NaN         NaN -21.0  43.0  1021.0 -11.0   NW   1.79   
1              NaN         NaN -21.0  47.0  1020.0 -12.0   NW   4.92   
2              NaN         NaN -21.0  43.0  1019.0 -11.0   NW   6.71   
3              NaN         NaN -21.0  55.0  1019.0 -14.0   NW   9.84   
4              NaN         NaN -20.0  51.0  1018.0 -12.0   NW  12.97   
5              NaN         NaN -19.0  47.0  1017.0 -10.0   NW  16.10   
6              NaN         NaN -19.0  44.0  1017.0  -9.0   NW  19.23   
7              NaN         NaN -19.0  44.0  1017.0  -9.0   NW  21.02   
8              NaN         NaN -19.0  44.0  1017.0  -9.0   NW  24.15   
9              NaN         NaN -20.0  37.0  1017.0  -8.0   NW  27.28   

   precipitation  Iprec         datetime  
0            0.0    0.0 2010-01-01 00:00  
1            0.0    0.0 2010-01-01 01:00  
2            0.0    0.0 2010-01-01 02:00  
3            0.0    0.0 2010-01-01 03:00  
4            0.0    0.0 2010-01-01 04:00  
5            0.0    0.0 2010-01-01 05:00  
6            0.0    0.0 2010-01-01 06:00  
7            0.0    0.0 2010-01-01 07:00  
8            0.0    0.0 2010-01-01 08:00  
9            0.0    0.0 2010-01-01 09:00  
# 把datetime设置为索引
df.set_index("datetime",inplace=True)


# 处理缺失数据,删除缺失数据
data = df["PM_US Post"].dropna()

# 画图
_x = data.index
_y = data.values

plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)

plt.xticks(range(0,len(_x),20),list(_x)[::20])

plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-YysvpkBU-1582980536492)(output_5_0.png)]

优化

# 进行降采样
df = df.resample("7D").mean() 
data = df["PM_US Post"].dropna()

# 画图
_x = data.index
_x = [i.strftime("%Y%m%d") for i in _x]
_y = data.values

plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)
plt.xticks(range(0,len(_x),10),list(_x)[::10],rotation=45)

plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-l6loKSbb-1582980536493)(output_7_0.png)]

data_china = df["PM_Dongsi"]
# 画图
_x = data.index
_x = [i.strftime("%Y%m%d") for i in _x]
_x_china = [i.strftime("%Y%m%d") for i in data_china. index]
_y = data.values
_y_china = data_china.values
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y,label="US_POST")
plt.plot(range(len(_x_china)),_y_china,label="CN_POST")
plt.xticks(range(0,len(_x),10),list(_x)[::10],rotation=45)
plt.legend(loc="best")
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-L63MS0D9-1582980536494)(output_8_0.png)]

发布了44 篇原创文章 · 获赞 27 · 访问量 8516

猜你喜欢

转载自blog.csdn.net/zenghongju/article/details/104581526