Annual air quality data of cities across the country

Because of how the work was divided up for the project, I got briefly familiar with crawlers and then started crawling data.

The data crawled in this post comes from the weather report site tianqihoubao.com.

1. Idea introduction

First, the general idea of the crawler:

  1. Get the url of the webpage
  2. Get the html of the web page
  3. Parse the data; since my understanding of xpath is limited, both BeautifulSoup and xpath are used here
  4. Store the data persistently, that is, save it in a csv file

Now for the idea of this project (a short overview sketch follows the list):

  1. Get every city name and url and store them in a dictionary
  2. From each city's page, get the urls of the data pages for January 2020 through December 2020
  3. From each month's url, get the table data on that page
  4. Write the data to a csv file for persistent storage
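
Putting those four steps together, the overall flow is roughly the following. This is only an outline; get_city, get_month and get_record are the helper functions implemented in the sections below.

citys = get_city()                        # step 1: city name -> url
for city, city_url in citys.items():
    months = get_month(city_url)          # step 2: the monthly page urls for this city
    for month_url in months:
        data_list = get_record(month_url) # step 3: the table rows for that month
        # step 4: append data_list to the city's csv file (see the main function)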

2. Preparation

import requests
from lxml import etree
from bs4 import BeautifulSoup
import csv
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
}

requests is used to fetch the webpage and obtain its source code.
etree's xpath and BeautifulSoup are used to parse the page and extract the desired data.
The csv library writes the data to a csv file for storage.
The os library creates the output folder.
headers contains the User-Agent sent with every request, so the crawler looks like a normal browser (UA spoofing). A note on page encoding, with a small sketch, follows.
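
One thing worth knowing up front: the pages on this site are GBK-encoded, which is why get_city() below re-encodes the text with encode('iso-8859-1').decode('gbk'). A minimal request sketch, assuming requests guesses the wrong charset for this site, that avoids the re-encoding by setting the encoding explicitly:

response = requests.get(url='http://www.tianqihoubao.com/aqi/', headers=headers)
response.encoding = 'gbk'    # tell requests the page's real encoding before reading .text
html = response.text         # the Chinese text now decodes correctly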

3. Get the city name and url

Note 1: When I first crawled this page with xpath, for Inner Mongolia only the first row of cities came back. On closer inspection, provinces whose cities span more than one row contain a <wbr> tag that blocks the remaining data from being extracted. Since I could not think of a good way around this with xpath, I used BeautifulSoup's find_all and select to grab the href and text of every <a> tag instead.

Note 2: The link for Baise is broken. The href scraped from the page is http://www.tianqihoubao.com/aqi/baise\r\n.html; the extra \r\n means the city's page cannot be accessed correctly, so a special case is added for it in the code.

Note 3: The text of each city's <a> tag ends with a space, e.g. the text for Baise is "百色 " with a trailing space; if you do not want the space, strip() removes it when the text is extracted.

The code is as follows:

############################################################
# Function: get the name of every city and its corresponding url
# Input:  none
# Output: citys_name, dict, city name as key, corresponding url as value
# Modified: 2020/1/23
# Reason: none yet
############################################################
def get_city():
    url = r'http://www.tianqihoubao.com/aqi/'

    response = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')                            # parse the page source with BeautifulSoup
    tables = soup.find_all(class_="citychk")                          # grab the tags whose class is "citychk"

    citys_name = dict()
    for table in tables:
        citys_url = table.select("a")                                 # pick out the <a> tags to extract their text and href
        for city in citys_url:
            city_name = city.text.strip()                             # city name; strip() removes the trailing space
            city_name = city_name.encode('iso-8859-1').decode('gbk')  # re-encode the garbled Chinese text
            city_urls = city.attrs.get("href").strip()                # the city's url

            if city_name == "全国空气质量排名":                         # skip text on the page that is not a city
                continue
            # Baise needs special handling: the href scraped from the html is
            # http://www.tianqihoubao.com/aqi/baise\r\n.html, and since the noise characters
            # are not at the head or tail of the string, an extra special case is needed
            if city_name == "百色":
                citys_name[city_name] = "http://www.tianqihoubao.com/aqi/baise.html"
                continue

            citys_name[city_name] = "http://www.tianqihoubao.com" + city_urls   # add to the dict: city name -> url

    return citys_name
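
A quick check of the result (illustrative only; the exact keys and count depend on what the site currently lists):

citys = get_city()
print(len(citys))                          # number of cities found
for name, url in list(citys.items())[:3]:
    print(name, url)                       # a few city name / url pairs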

4. Get the url of each month

The data here is parsed with xpath. The expression is built by joining the path for each month with the union operator |, so every li index from 2 to 13 is written out explicitly; to change which months are fetched, just change the li index, and to add a month, append another path joined with |. The function returns months, the list of each month's url.

The code is as follows:

############################################################
# Function: get the urls of the historical air quality pages from a city's url
# Input:  city_url, string, the city's url
# Output: months, list, the urls of that city's monthly data pages for the year
# Modified: 2020/1/23
# Reason: none yet
############################################################
def get_month(city_url):
    months = list()

    response = requests.get(url=city_url, headers=headers).text
    tree = etree.HTML(response)                     # parse the page source with etree; the xpath is below
    month_list = tree.xpath('//div[@class="box p"]//li[2]/a/@href | //div[@class="box p"]//li[3]/a/@href'
                            '| //div[@class="box p"]//li[4]/a/@href | //div[@class="box p"]//li[5]/a/@href'
                            '| //div[@class="box p"]//li[6]/a/@href | //div[@class="box p"]//li[7]/a/@href'
                            '| //div[@class="box p"]//li[8]/a/@href | //div[@class="box p"]//li[9]/a/@href'
                            '| //div[@class="box p"]//li[10]/a/@href | //div[@class="box p"]//li[11]/a/@href'
                            '| //div[@class="box p"]//li[12]/a/@href | //div[@class="box p"]//li[13]/a/@href')

    for month in month_list:
        month = r'http://www.tianqihoubao.com' + month          # the url of one month for this city
        months.append(month)

    return months
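
As a side note, the twelve alternatives can be collapsed into a single path with a positional predicate. A sketch, assuming the month links really do sit in li elements 2 through 13 of the same div:

month_list = tree.xpath('//div[@class="box p"]//li[position() >= 2 and position() <= 13]/a/@href')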

5. Get each month's data values

The data lives in a table, so two nested loops are enough: the outer loop walks the rows (<tr>) and the inner loop walks the columns (<td>). Each day is saved as a list, and each day's list is appended to a larger list holding the whole month's data.
Note: the scraped values may have spaces or \r\n at the beginning and end, so strip() is used to remove them and keep only the valid data.

The code is as follows:

############################################################
# Function: get the historical data for one month
# Input:  month_url, string, the url of that month's data page
# Output: data_list, list of lists, the daily records for that month
# Modified: 2020/1/23
# Reason: none yet
############################################################
def get_record(month_url):
    response = requests.get(url=month_url, headers=headers).text
    tree = etree.HTML(response)
    tr_list = tree.xpath('//div[@class="api_month_list"]/table//tr')

    data_list = list()
    for tr in tr_list[1:]:                          # locate the rows, skipping the header row
        td_list = tr.xpath('./td/text()')
        data = list()
        for td in td_list:                          # locate each column
            data.append(td.strip())                 # strip() removes the leading/trailing \r\n and spaces
        data_list.append(data)

    return data_list
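
Chaining the helpers together looks like this (illustrative; "北京" is assumed to be one of the keys returned by get_city()):

citys = get_city()
months = get_month(citys["北京"])          # the monthly urls for Beijing (assumed key)
rows = get_record(months[0])               # the daily records of the first month
print(rows[0])                             # one day's values: date, level, AQI, rank, PM2.5, ...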

6. Writing the main function

Get each city's name and url, then the urls for each of its months, then fetch the data and save it under the specified path.

The code is as follows:

def main():
    if not os.path.exists(r"./AQIoneMonth"):            # create the output folder if it does not exist
        os.mkdir(r"./AQIoneMonth")

    citys_name = get_city()                     # get the city names and urls
    for city in citys_name:                     # city is a string, the city name
        print("开始爬取%s的历史大气数据" % city)
        city_url = citys_name[city]
        months = get_month(city_url)            # the urls of this city's 12 months of data
        index = 0
        for month_url in months:
            index = index + 1
            data_list = get_record(month_url)   # get the data for one month
            with open('./AQIoneMonth/' + city + '.csv', 'a', newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                if index == 1:                  # write the header row only once
                    writer.writerow(
                        ["日期", '质量等级', 'AQI指数', '当天AQI排名', 'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3'])
                writer.writerows(data_list)     # write the data rows
            print("第%d个月已经爬取完成" % index)
        print()

Note: crawling takes a long time, and after running for a while the program may hang. When that happens, you can stop it and add a counter variable i to the outermost loop to control which city the next run starts from. The value for i can be read off from the number of csv files already in the output folder; set i accordingly and re-run to continue crawling. Sample code is shown below, followed by a small sketch that picks the resume point automatically:

def main():
    if not os.path.exists(r"./AQI"):            # create the output folder if it does not exist
        os.mkdir(r"./AQI")

    citys_name = get_city()                     # get the city names and urls
    print(citys_name)
    i = 0
    for city in citys_name:                     # city is a string, the city name
        i = i + 1
        if i > 168:                             # skip the cities that were already crawled
            print("开始爬取%s的历史大气数据" % city)
            city_url = citys_name[city]             # get the city's url
            months = get_month(city_url)            # get the urls of this city's 12 months of data

            index = 0
            for month_url in months:
                index = index + 1
                data_list = get_record(month_url)   # get the data for one month

                with open('./AQI/' + city + '.csv', 'a', newline="", encoding="utf-8") as f:
                    writer = csv.writer(f)
                    if index == 1:                  # write the header row only once
                        writer.writerow(
                            ["日期", '质量等级', 'AQI指数', '当天AQI排名', 'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3'])
                    writer.writerows(data_list)     # write the data rows

                print("第%d个月已经爬取完成" % index)
            print()
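
Rather than hard-coding i, the resume point can also be handled by skipping any city whose csv file already exists. Below is a sketch with the same structure as above; note that the last file written before the interruption may be incomplete and should be deleted before re-running.

def main():
    if not os.path.exists(r"./AQI"):
        os.mkdir(r"./AQI")

    citys_name = get_city()
    for city in citys_name:
        out_path = './AQI/' + city + '.csv'
        if os.path.exists(out_path):            # this city was already crawled, skip it
            continue
        print("开始爬取%s的历史大气数据" % city)
        months = get_month(citys_name[city])
        for index, month_url in enumerate(months, start=1):
            data_list = get_record(month_url)
            with open(out_path, 'a', newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                if index == 1:                  # write the header row only once
                    writer.writerow(
                        ["日期", '质量等级', 'AQI指数', '当天AQI排名', 'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3'])
                writer.writerows(data_list)
            print("第%d个月已经爬取完成" % index)
        print()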

One more thing to note: Simao, Chaohu, Erenhot, Laiwu, and the Ili Kazakh Autonomous Prefecture have no data on the site, so those cities cannot be crawled. In addition, Nagqu only has data for May 2020 and earlier, with nothing after that, so it cannot be crawled either.

7. Complete code

# -*- coding: utf-8 -*-

import requests
from lxml import etree
from bs4 import BeautifulSoup
import csv
import os

# User-Agent used for UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
}


############################################################
# Function: get the name of every city and its corresponding url
# Input:  none
# Output: citys_name, dict, city name as key, corresponding url as value
# Modified: 2020/1/23
# Reason: none yet
############################################################
def get_city():
    url = r'http://www.tianqihoubao.com/aqi/'

    response = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')                            # parse the page source with BeautifulSoup
    tables = soup.find_all(class_="citychk")                          # grab the tags whose class is "citychk"

    citys_name = dict()
    for table in tables:
        citys_url = table.select("a")                                 # pick out the <a> tags to extract their text and href
        for city in citys_url:
            city_name = city.text.strip()                             # city name; strip() removes the trailing space
            city_name = city_name.encode('iso-8859-1').decode('gbk')  # re-encode the garbled Chinese text
            city_urls = city.attrs.get("href").strip()                # the city's url

            if city_name == "全国空气质量排名":                         # skip text on the page that is not a city
                continue
            # Baise needs special handling: the href scraped from the html is
            # http://www.tianqihoubao.com/aqi/baise\r\n.html, and since the noise characters
            # are not at the head or tail of the string, an extra special case is needed
            if city_name == "百色":
                citys_name[city_name] = "http://www.tianqihoubao.com/aqi/baise.html"
                continue

            citys_name[city_name] = "http://www.tianqihoubao.com" + city_urls   # add to the dict: city name -> url

    return citys_name


############################################################
# Function: get the urls of the historical air quality pages from a city's url
# Input:  city_url, string, the city's url
# Output: months, list, the urls of that city's monthly data pages for the year
# Modified: 2020/1/23
# Reason: none yet
############################################################
def get_month(city_url):
    months = list()

    response = requests.get(url=city_url, headers=headers).text
    tree = etree.HTML(response)                     # parse the page source with etree; the xpath is below
    month_list = tree.xpath('//div[@class="box p"]//li[2]/a/@href | //div[@class="box p"]//li[3]/a/@href'
                            '| //div[@class="box p"]//li[4]/a/@href | //div[@class="box p"]//li[5]/a/@href'
                            '| //div[@class="box p"]//li[6]/a/@href | //div[@class="box p"]//li[7]/a/@href'
                            '| //div[@class="box p"]//li[8]/a/@href | //div[@class="box p"]//li[9]/a/@href'
                            '| //div[@class="box p"]//li[10]/a/@href | //div[@class="box p"]//li[11]/a/@href'
                            '| //div[@class="box p"]//li[12]/a/@href | //div[@class="box p"]//li[13]/a/@href')

    for month in month_list:
        month = r'http://www.tianqihoubao.com' + month          # the url of one month for this city
        months.append(month)

    return months


############################################################
# Function: get the historical data for one month
# Input:  month_url, string, the url of that month's data page
# Output: data_list, list of lists, the daily records for that month
# Modified: 2020/1/23
# Reason: none yet
############################################################
def get_record(month_url):
    response = requests.get(url=month_url, headers=headers).text
    tree = etree.HTML(response)
    tr_list = tree.xpath('//div[@class="api_month_list"]/table//tr')

    data_list = list()
    for tr in tr_list[1:]:                          # locate the rows, skipping the header row
        td_list = tr.xpath('./td/text()')
        data = list()
        for td in td_list:                          # locate each column
            data.append(td.strip())                 # strip() removes the leading/trailing \r\n and spaces
        data_list.append(data)

    return data_list


############################################################
# main function
# Modified: 2020/1/23
# Reason: none yet
############################################################
def main():
    if not os.path.exists(r"./AQI"):            # create the output folder if it does not exist
        os.mkdir(r"./AQI")

    citys_name = get_city()                     # get the city names and urls
    for city in citys_name:                     # city is a string, the city name
        print("开始爬取%s的历史大气数据" % city)
        city_url = citys_name[city]             # get the city's url
        months = get_month(city_url)            # get the urls of this city's 12 months of data

        index = 0
        for month_url in months:
            index = index + 1
            data_list = get_record(month_url)   # get the data for one month

            with open('./AQI/' + city + '.csv', 'a', newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                if index == 1:                  # write the header row only once
                    writer.writerow(
                            ["日期", '质量等级', 'AQI指数', '当天AQI排名', 'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3'])
                writer.writerows(data_list)     # write the data rows

            print("第%d个月已经爬取完成" % index)
        print()


if __name__ == '__main__':
    main()

Origin blog.csdn.net/qq_43419761/article/details/113093743