20180213 Scraping air quality data with a crawler

Copyright notice: this is an original article by the blogger and may not be reposted without the blogger's permission. https://blog.csdn.net/SONGYINGXU/article/details/79319978

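Target site:
Historical air quality data
1. Why the crawler was changed: the site has anti-scraping measures in place, so fetching pages directly rarely works.
2. Google's webdriver has trouble getting the content, perhaps because the site specifically guards against it.
Approach:
1. Use Selenium + PhantomJS to simulate a browser requesting a page.
2. Use the read_html function in Pandas to read the table data from the page.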
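
To make step 2 concrete, here is a minimal sketch of how read_html turns an HTML table into a DataFrame. The HTML fragment, its values, and the helper name first_table are made up for illustration. They are not taken from the target site, whose real pages may be laid out differently. The sketch also assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
import io

import pandas as pd

# Hypothetical page fragment with invented values, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><td>date</td><td>AQI</td><td>PM2.5</td></tr>
  <tr><td>2014-01-01</td><td>81</td><td>45</td></tr>
  <tr><td>2014-01-02</td><td>145</td><td>111</td></tr>
</table>
"""


def first_table(html):
    """Return the first <table> found in html, using its first row as the header."""
    # read_html returns a list with one DataFrame per table in the document.
    tables = pd.read_html(io.StringIO(html), header=0)
    return tables[0]


df = first_table(SAMPLE_HTML)
print(df)
```

In the crawler itself the HTML comes from driver.page_source instead of a literal string, so the table is only there if the page has finished rendering when the source is read.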
Environment:
Python 3.6 + Selenium + PhantomJS (installation steps omitted)
Code:

# coding:utf-8
import time
from urllib import parse

import pandas as pd

from selenium import webdriver

driver = webdriver.PhantomJS('D:\\webdriver\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')

base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='

def get_month_set():
    # Every month from 2014-01 to 2017-12, formatted as 'YYYY-MM'
    return ['%d-%02d' % (year, month)
            for year in range(2014, 2018)
            for month in range(1, 13)]

def get_city_set():
    # city.txt is read as UTF-8, one city name per line
    with open('city.txt', encoding='utf-8') as fp:
        return [line.strip() for line in fp]

month_set = get_month_set()
city_set = get_city_set()

for idx, city in enumerate(city_set):
    file_name = city + '.csv'
    # Encoding is set explicitly so the output does not depend on the OS default
    with open(file_name, 'w', encoding='utf-8') as fp:
        fp.write('date,AQI,grade,PM25,PM10,SO2,CO,NO2,O3_8h\n')  # header row

        for str_month in month_set:
            weburl = '%s%s&month=%s' % (base_url, parse.quote(city), str_month)

            driver.get(weburl)
            time.sleep(1)  # wait for the page to render; otherwise the table may not be there yet
            dfs = pd.read_html(driver.page_source, header=0)[0]

            for j in range(len(dfs)):
                # Columns 0-8: date, AQI, grade, PM2.5, PM10, SO2, CO, NO2, O3_8h
                row = [str(dfs.iloc[j, k]) for k in range(9)]
                fp.write(','.join(row) + '\n')
            print('%d---%s,%s---DONE' % (idx, city, str_month))
driver.quit()
print('Crawl finished! Please check the output.')
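
The request URL is the one part of the script that can be checked without a browser, so a small dry run before a long crawl may be useful. The sketch below isolates that step using the same base URL and query layout as the code above. The helper name build_url and the sample city are illustrative assumptions, not part of the original script.

```python
from urllib import parse

BASE_URL = 'https://www.aqistudy.cn/historydata/daydata.php?city='


def build_url(city, month):
    """Build the daily-data URL for one city and one 'YYYY-MM' month."""
    # parse.quote percent-encodes the UTF-8 bytes of the city name,
    # which keeps non-ASCII names safe inside the query string.
    return '%s%s&month=%s' % (BASE_URL, parse.quote(city), month)


# Sample city chosen for illustration; the real list comes from city.txt.
url = build_url('北京', '2014-01')
print(url)
```

Printing a few of these URLs and opening them by hand is a cheap way to confirm the city names in city.txt match what the site expects.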

References:
Python tutorial by Du Yu
Original version of the crawler code (no longer works)
https://github.com/Yinghsusong/aqistudy
