Write a crawler to fetch air quality index (AQI) data for a chosen region and time period from an AQI website, then visualize the data.
Procedure:
1. Install pyspider
Run the install command in the Anaconda Prompt (pip install pyspider).
2. If prompted to upgrade, follow the prompts to do so.
3. Download PhantomJS, which is available at:
https://phantomjs.org/download.html
4. Configure the environment variables so PhantomJS is on the PATH.
5. Start pyspider from the command line (pyspider all).
If it starts successfully, output like that shown in the figure appears.
- Note: if your Python version is 3.7 or above, downgrading is recommended, because pyspider conflicts with words that became reserved in newer Python (e.g. async, a keyword since 3.7). In addition, pin the WsgiDAV dependency by entering the following command on the command line:
pip install wsgidav==2.4.1
After a successful start, the web interface is available at the default address http://localhost:5000/.
Click Create to create a new crawler job.
In the dialog shown in the figure, fill in the address to crawl and the project information, then click Create.
This opens the project editing interface: the left pane shows the results of running the code, and the right pane is the code editor. Click run to execute; the code must be saved (save) after every modification.
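When a project is created, pyspider pre-fills the editor with a generated script. For orientation, this is roughly what that default skeleton looks like (the start URL and the selector here are placeholders, not the AQI site's):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # entry point: queue the start page (placeholder URL)
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # collect links from the start page and queue each of them
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # extract the final fields from a detail page
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

on_start runs when you click run, index_page collects links, and detail_page extracts the final fields; the steps below modify exactly these methods.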
Data collection
Select the URL to be crawled and click the triangle (run) button on its right side.
Because only certain URLs are needed, the URL-extraction expression has to be modified.
Click enable css selector helper, then click the data you want on the rendered page; a matching CSS extraction expression is generated.
Clicking the arrow inserts the generated expression at the cursor position, e.g. 'div > li > a'.
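pyspider evaluates these selectors with PyQuery. As a dependency-free illustration of what an expression like 'div > li > a' does — match anchor tags inside list items and collect their href attributes — here is a sketch using only the standard library; the HTML snippet and link paths are made-up examples, not the target site's markup:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes of <a> tags nested inside <li> elements,
    mimicking what a 'div > li > a' selector would match."""
    def __init__(self):
        super().__init__()
        self.in_li = 0      # depth of currently open <li> elements
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li += 1
        elif tag == "a" and self.in_li:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.in_li:
            self.in_li -= 1

# made-up markup for illustration only
html = """
<div>
  <ul>
    <li><a href="/aqi/201904.html">April 2019</a></li>
    <li><a href="/aqi/201905.html">May 2019</a></li>
  </ul>
</div>
"""
parser = LinkCollector()
parser.feed(html)
print(parser.links)  # -> ['/aqi/201904.html', '/aqi/201905.html']
```

In the actual project, response.doc('div > li > a').items() does the same job in one line.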
- Acquiring data
After re-running, add the following arguments to the self.crawl call in the index_page method:
fetch_type='js',
js_script="""function() {
    setTimeout("$('.more').click()", 2000);
}"""  # the JS waits 2 seconds for the browser to load the data, then clicks the "more" button
After saving, re-run.
Add a new method to the Handler class that pyspider generated:

@config(age=10 * 24 * 60 * 60)
def index1_page(self, response):
    for each in response.doc('.unstyled1 a').items():
        self.crawl(each.attr.href, validate_cert=False, fetch_type='js',
                   js_script="""function() {
                       setTimeout("$('.more').click()", 2000);
                   }""", callback=self.detail_page)  # the JS waits for the browser to load the data

Then change the callback value in the index_page method to index1_page.
At this point the result is links to the monthly data of 65 cities.
Select the last one (the April 2019 data):
- Save data

@config(priority=2)
def detail_page(self, response):
    import pandas as pd
    title_data = response.doc("* tr > th").text()  # header (label) text
    data = response.doc("* > * tr > td").text()    # table cell text
    title_list = title_data.split(" ")             # split the label text into a list
    data_list = data.split(" ")
    data_AQI = {}                                  # dict for the processed data
    for i in range(len(title_list)):               # group the cells by column label
        data_AQI[title_list[i]] = data_list[i::9]  # every 9th cell, starting at offset i, belongs to column i
    data = pd.DataFrame(data_AQI)                  # convert to a DataFrame
    data.to_csv("D:\\data.csv", index=False, encoding="GBK")  # change the path as needed; GBK is the local Windows encoding
    return
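The data_list[i::9] stride slice is what rebuilds columns from the flat list of cell texts: the table has 9 columns, so every 9th cell starting at offset i belongs to column i. A minimal sketch of the same idea with made-up three-column data (stride 3 instead of 9):

```python
# Flat cell text as .text() returns it: rows concatenated in reading order.
# Made-up sample with 3 columns and 2 rows.
titles = "日期 AQI 质量等级".split(" ")
cells = "2019-04-01 63 良 2019-04-02 42 优".split(" ")

ncols = len(titles)  # the stride equals the number of columns
table = {t: cells[i::ncols] for i, t in enumerate(titles)}

print(table["AQI"])   # -> ['63', '42']
print(table["日期"])  # -> ['2019-04-01', '2019-04-02']
```

This only works if no cell text itself contains a space, which is why the split(" ") approach in detail_page is tied to this particular table's layout.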
- Data visualization

# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
import pandas as pd
import matplotlib.dates as mdate
import matplotlib.pyplot as plt
from pylab import mpl
from datetime import datetime

plt.style.use("seaborn-whitegrid")
mpl.rcParams['font.sans-serif'] = ['SimHei']  # set a font that can render the Chinese labels

def visualization_mat(filename):
    data = pd.read_csv(filename, encoding="GBK")
    # create the figure and two stacked subplots; usually only axes is needed
    fig, axes = plt.subplots(nrows=2, ncols=1, dpi=80, figsize=(8, 6))
    # show the air-quality level ("质量等级") as a pie chart
    level = list(data["质量等级"])  # simple preprocessing: collect the pie-chart labels
    level_name = list(set(level))
    # a pie chart is drawn with plt.pie(x, labels=...)
    axes[0].pie([level.count(i) for i in level_name], labels=level_name,
                autopct="%1.2f%%", colors=["r", "g", "b", "y"])  # pie-chart properties
    ts = [datetime.strptime(i, '%Y-%m-%d') for i in data["日期"]]  # parse the date column
    # show the pollutant readings as line charts (labels are needed for the legend)
    axes[1].plot(ts, data["NO2"], label="NO2")
    axes[1].plot(ts, data["O3_8h"], label="O3_8h")
    axes[1].plot(ts, data["AQI"], label="AQI")
    axes[1].xaxis.set_major_formatter(mdate.DateFormatter('%Y-%m-%d'))  # tick label format
    axes[1].set_xticks([datetime.strptime(i, '%Y-%m-%d').date() for i in data["日期"]])
    axes[1].set_xlabel("时间")  # axis labels: time ("时间")
    axes[1].set_ylabel("数值")  # and value ("数值")
    axes[1].legend()
    plt.gcf().autofmt_xdate()  # auto-rotate the date labels
    # plt.savefig(filename + '.jpg')  # save the figure
    plt.show()  # display the figure

visualization_mat("D:\\data.csv")
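The x-axis handling above depends on converting the CSV's date strings into datetime objects before plotting, so matplotlib spaces the points by real time rather than treating the dates as opaque strings. A standalone sketch of that round trip (the sample dates are made up):

```python
from datetime import datetime

dates = ["2019-04-01", "2019-04-02", "2019-04-10"]
ts = [datetime.strptime(d, "%Y-%m-%d") for d in dates]  # str -> datetime

# mdate.DateFormatter('%Y-%m-%d') applies the same format codes as strftime:
labels = [t.strftime("%Y-%m-%d") for t in ts]
print(labels == dates)        # -> True: the round trip is lossless

# unlike strings, datetimes preserve the real gap between points:
print((ts[2] - ts[1]).days)   # -> 8
```

With plain strings on the x-axis, April 2 and April 10 would sit one tick apart; with datetimes, the eight-day gap is visible in the line chart.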