Data Acquisition and Visualization of AQI Data

By writing a crawler, fetch air quality index (AQI) data for a given region and period of time from an AQI website, then visualize the data.
Procedure:

  • Pyspider installation
    Install from the command line in the Anaconda Prompt, as shown below.
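    A minimal sketch of the install command, assuming a standard pip-based install:

    pip install pyspider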

  • If pip prompts you to upgrade, follow the prompt and perform the upgrade.
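    A sketch of the standard pip self-upgrade, which is what the prompt typically suggests:

    python -m pip install --upgrade pip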

  • PhantomJS download, and environment-variable configuration
    PhantomJS can be downloaded from the following URL:
    https://phantomjs.org/download.html

  • Start pyspider from the command line, as shown below; if it starts successfully, the command prints its startup output.
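    A sketch of the start command, assuming pyspider's standard all-in-one launch mode:

    pyspider all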

  • If your Python version is 3.7 or above, a downgrade is recommended, because of language conflicts (for example, async became a reserved keyword in Python 3.7).
    Enter the following commands at the command line to pin wsgidav to an older version:

    pip uninstall wsgidav
    pip install wsgidav==2.4.1

  • After a successful start, open the default address http://localhost:5000/ to enter the pyspider web interface,
    then click Create to create a new crawler project.
    In the creation dialog, fill in the address you want to crawl as the start URL and click Create.

  • You then enter the project editing interface:
    the code-debugging pane is on one side and the results of running the code on the other. Click run to execute; after every code change you need to click save. A sketch of the project skeleton pyspider generates is shown below.
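    A minimal sketch of the generated Handler skeleton (the start URL below is a hypothetical placeholder; use the AQI site address entered in the create dialog):

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        crawl_config = {}

        @every(minutes=24 * 60)
        def on_start(self):
            # hypothetical placeholder: replace with the AQI site address from the create dialog
            self.crawl('https://<AQI-site>/', callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        @config(priority=2)
        def detail_page(self, response):
            return {
                "url": response.url,
                "title": response.doc('title').text(),
            }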

  • Data collection
    Select the URL to crawl and click the triangle (play) button on its right side.
    Because the URLs are what we need, the expression that extracts them has to be modified.
    Click enable css selector helper, then click the data you want on the rendered page; this generates a corresponding CSS extraction expression.
    Clicking the arrow inserts the generated expression at the cursor position; here the expression is 'div > li > a'. A sketch of the resulting index_page method follows.
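    A minimal sketch of index_page with the generated selector plugged in (detail_page is the callback from the generated skeleton):

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('div > li > a').items():
            self.crawl(each.attr.href, callback=self.detail_page)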


  • Acquiring data
    After re-running, add the following arguments to the self.crawl call in the index_page method:

fetch_type='js',
js_script="""function() {
    setTimeout("$('.more').click()", 2000);
}"""  # the JS waits for the browser to load the data

After saving, re-run; then create a new method in the Handler class of the pyspider-generated code:

@config(age=10 * 24 * 60 * 60)
def index1_page(self, response):
    for each in response.doc('.unstyled1 a').items():
        self.crawl(each.attr.href, validate_cert=False, fetch_type='js',
                   js_script="""function() {
                       setTimeout("$('.more').click()", 2000);
                   }""",  # the JS waits for the browser to load the data
                   callback=self.detail_page)

Change the callback value in the index_page method to index1_page.
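A sketch of index_page after this change, assuming the fetch_type and js_script arguments added earlier are kept:

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    for each in response.doc('div > li > a').items():
        self.crawl(each.attr.href, validate_cert=False, fetch_type='js',
                   js_script="""function() {
                       setTimeout("$('.more').click()", 2000);
                   }""",
                   callback=self.index1_page)  # the callback now points to index1_page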
At this point we obtain 65 links to the city's monthly data.
Select the last one (the April 2019 data).

  • Save data

@config(priority=2)
def detail_page(self, response):
    import pandas as pd
    title_data = response.doc("* tr > th").text()  # get the table header labels
    data = response.doc("* > * tr > td").text()  # get the concrete data inside the table
    title_list = title_data.split(" ")  # split the fetched strings
    data_list = data.split(" ")
    data_AQI = {}  # new dict to hold the processed data
    for i in range(len(title_list)):  # group the data under its header label
        data_AQI[title_list[i]] = data_list[i::9]  # every 9th value belongs to column i, since the table has 9 columns
    data = pd.DataFrame(data_AQI)  # convert the dict to a DataFrame
    data.to_csv("D:\\data.csv", index=False, encoding="GBK")  # change the path to wherever you want to store the data; GBK is the encoding used on Windows
    return



  • Data visualization
    Read the saved CSV and draw a pie chart of the quality levels plus a line chart of the pollutant readings:

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""
import pandas as pd
import matplotlib.dates as mdate
import matplotlib.pyplot as plt
from pylab import mpl
from datetime import datetime

plt.style.use("seaborn-whitegrid")
mpl.rcParams['font.sans-serif'] = ['SimHei']  # set the matplotlib font so Chinese characters display correctly

def visualization_mat(filename):
    data = pd.read_csv(filename, encoding="GBK")
    # create the canvas: plt.subplots returns a (figure, axes) pair; here two subplots in two rows, one column (usually only axes is used)
    fig, axes = plt.subplots(nrows=2, ncols=1, dpi=80, figsize=(8, 6))
    # show the quality levels as a pie chart
    level = list(data["质量等级"])  # simple processing: pick out the pie chart labels ("质量等级" = quality level)
    level_name = list(set(level))
    # a pie chart is drawn with plt.pie(x, labels=...)
    axes[0].pie([level.count(i) for i in level_name], labels=level_name,
                autopct="%1.2f%%", colors=["r", "g", "b", "y"])  # set the pie chart properties
    ts = [datetime.strptime(i, '%Y-%m-%d') for i in data["日期"]]  # "日期" = date
    # show the pollutant readings as line charts
    axes[1].plot(ts, data["NO2"], label="NO2")
    axes[1].plot(ts, data["O3_8h"], label="O3_8h")
    axes[1].plot(ts, data["AQI"], label="AQI")
    axes[1].xaxis.set_major_formatter(mdate.DateFormatter('%Y-%m-%d'))  # set the displayed time-label format
    axes[1].set_xticks([datetime.strptime(i, '%Y-%m-%d').date() for i in data["日期"]])
    axes[1].set_xlabel("时间")  # axis labels: "时间" = time
    axes[1].set_ylabel("数值")  # "数值" = value
    axes[1].legend()  # the legend uses the label= values set above
    plt.gcf().autofmt_xdate()  # auto-rotate the date labels
    # plt.savefig(filename + '.jpg')  # save the figure as an image
    plt.show()  # display the figure

visualization_mat("D:\\data.csv")



