2021SC@SDUSC project training - AQI data acquisition

2021SC@SDUSC


1. Data source

Based on my reading of the related papers [1][2][3] and the actual situation of the project, I plan to obtain data from multiple sources. Preliminary plans include, but are not limited to, real-time AQI data, weather data (wind speed, temperature, air pressure, humidity), road network data (road network density), and POI (point of interest) data.

[1] Yu Zheng, Furui Liu, Hsun-Ping Hsieh. U-Air: When Urban Air Quality Inference Meets Big Data. In Proceedings of the 19th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2013).
[2] Yu Zheng, Xuxu Chen, Qiwei Jin, Yubiao Chen, Xiangyun Qu, Xin Liu, Eric Chang, Wei-Ying Ma, Yong Rui, Weiwei Sun. A Cloud-Based Knowledge Discovery System for Monitoring Fine-Grained Air Quality. MSR-TR-2014-40.
[3] Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, Tianrui Li. Forecasting Fine-Grained Air Quality Based on Big Data. In Proceedings of the 21st SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2015).

2. Data acquisition

First of all, the data we need most is the real-time AQI of the existing monitoring stations. After much searching, I finally found a very authoritative website: http://fb.sdem.org.cn:8801/AirDeploy.Web/AirQuality/MapMain.aspx. This is official data released by the Department of Ecology and Environment of Shandong Province; here is a brief introduction to how to use it. Normally, to crawl a web page you would right-click and inspect its elements, but right-clicking on this page gets no response. Don't worry, the fix is easy: just disable JavaScript in the browser, reload the page, and right-clicking works again.
The downside is that the map then fails to display. Without closing the page, re-enable JavaScript and refresh, and the map will load successfully.
The page is an .aspx page whose data is all loaded dynamically, so let's take a look at how these data are actually fetched.

Inspecting the network requests shows that each click on a station generates an Ajax request that returns the values we need. We can complete the crawl by simulating this request, as the script below does.
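As a quick check, a single station's values can be fetched by replaying the request directly. This is a minimal sketch assuming the endpoint and parameters that the full script below uses; in practice the site may also require the request headers (e.g. User-Agent, Cookie) that the script fills in:

import requests

# Minimal replay of the Ajax request for a single station (ID 68).
# Endpoint and parameters mirror the full script below; headers omitted here,
# though the real site may reject requests without them.
url = ('http://fb.sdem.org.cn:8801/AirDeploy.Web/Ajax/AirQuality/AirQuality.ashx'
       '?Method=GetQualityItemsValues&StationID=68')
response = requests.post(url)
print(response.content.decode('utf-8'))  # raw JSON string returned by the server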

from urllib.parse import urlencode
import random
import json
from time import sleep

import requests
import pandas as pd

# Replace with the URL from the corresponding Ajax request.
# The source refreshes at 20 minutes past each hour.
base_url = 'http://fb.sdem.org.cn:8801/AirDeploy.Web/Ajax/AirQuality/AirQuality.ashx?'
headers = {
    # Fill in your own request headers here (e.g. User-Agent, Cookie).
}

# Monitoring-station IDs to query; other known IDs include
# 2, 4, 7, 6, 10470, 3, 2987, 5, 56, 54, 55.
StationID_list = [68, 67, 4579]
col_name = list(range(len(StationID_list) + 1))
aqi = [['t1']]  # first column is a timestamp placeholder

for station_id in StationID_list:
    sleep(random.uniform(2, 3))  # random delay so we don't hammer the server
    data = {
        'Method': 'GetQualityItemsValues',
        'StationID': station_id,
    }
    url = base_url + urlencode(data)
    response = requests.post(url, headers=headers)
    result = json.loads(response.content.decode('utf-8'))
    aqi[0].append(result['AQI'])

df = pd.DataFrame(columns=col_name, data=aqi)
df.to_csv('./input/aqi.csv', index=False, header=False)
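
Since the comment in the script notes that the source refreshes at 20 minutes past each hour, the crawl can be scheduled to run right after each refresh. A minimal sketch, assuming the crawling logic above is wrapped in a hypothetical crawl_once() function:

import datetime
import time

def crawl_once():
    # Hypothetical wrapper: put the request loop and CSV writing from above here.
    pass

def sleep_until(minute=20):
    # Sleep until `minute` past the next hour (the site's assumed refresh time).
    now = datetime.datetime.now()
    target = now.replace(minute=minute, second=0, microsecond=0)
    if target <= now:
        target += datetime.timedelta(hours=1)
    time.sleep((target - now).total_seconds())

while True:
    sleep_until()  # wake at 20 past the hour, when fresh data should be available
    crawl_once()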

Note that the data comes back as a JSON string; once we decode it with json.loads(), we can easily access any field by key.
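For illustration, here is roughly what that looks like. Only the 'AQI' key is confirmed by the script above; the other fields in this sample payload are hypothetical placeholders:

import json

# Hypothetical response body: only the "AQI" key is actually used above,
# the other fields are illustrative placeholders.
raw = '{"AQI": "85", "PM2_5": "60", "TimePoint": "2021-10-01 20:00:00"}'

record = json.loads(raw)  # decode the JSON string into a dict
print(record['AQI'])      # access a field by key, as the script does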
