Visual analysis of weather data crawled with a Python-based web crawler

Table of Contents
Abstract
1. Design purpose
2. Design task content
3. Comparison of commonly used crawler frameworks
4. Overall design of the web crawler program
5. Detailed design of the web crawler program
5.1 Design environment and target analysis
5.2 Analysis of the crawler's operation process
Basic crawler process
Request and Response
5.3 Detailed design of the control module
6. Debugging and testing
7. Experience
References
4. Overall design of the web crawler program
The crawler program consists of three modules:
1. Crawler scheduler: starts the crawler, stops the crawler, and monitors the crawler's running status.
2. Crawler module: contains three small sub-modules, the URL manager, the web page downloader, and the web page parser (a minimal sketch of how they fit together follows this list).
(1) URL manager: manages the URLs that still need to be crawled and the URLs that have already been crawled. A URL waiting to be crawled can be taken out of the URL manager and passed to the web page downloader.
(2) Web page downloader: downloads the web page specified by the URL, stores it as a string, and passes it to the web page parser.
(3) Web page parser: parses the string it receives. The parser not only extracts the data to be crawled, but also extracts the URLs of other web pages; these URLs are added back into the URL manager as they are parsed.
3. Data output module: stores the crawled data.
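
As a minimal sketch of this architecture (the class and function names here are illustrative assumptions, not taken from the program in section 5.3):

class UrlManager:
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_url(self, url):
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

def crawl(root_url, download, parse, output):
    # Scheduler: drives the URL manager, downloader, and parser in a loop
    urls = UrlManager()
    urls.add_url(root_url)
    while urls.has_new_url():
        url = urls.get_url()
        html = download(url)           # web page downloader: page as a string
        data, new_urls = parse(html)   # web page parser: data plus further URLs
        for u in new_urls:
            urls.add_url(u)            # newly found URLs go back into the manager
        output(data)                   # data output module: store the crawled data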
5. Detailed design of the web crawler program
5.1 Design environment and target analysis
Design environment
IDE: PyCharm
Python version: Python 3
Target analysis
1. Initial URL: www.tianqihoubao.com/aqi. The crawler first obtains the web page through this URL.
2. Data format: each row of the page's daily table contains the date, air quality grade, AQI index, AQI rank, and PM2.5 value.
3. Page encoding: UTF-8
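
The monthly data pages follow a predictable URL pattern, which the crawl code in section 5.3 exploits; for example:

# Month pages follow the pattern <city>-<year><month>.html
url = 'http://www.tianqihoubao.com/aqi/' + 'beijing' + '-2018' + str("%02d" % 1) + '.html'
# -> http://www.tianqihoubao.com/aqi/beijing-201801.html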
5.2 Analysis of the crawler's operation process
Basic crawler process
Initiate a request
Send a Request to the target server through an HTTP library; the Request can carry additional header information.
Get the response content
If the server responds normally, it returns a Response that contains the content of the page.
Parse the data
The content may be HTML, which can be parsed with regular expressions or a web page parsing library; it may also be JSON, which can be converted directly into a JSON object.
Save the data
The data can be stored as plain text, saved to a database, or written to other specific file formats.
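
A minimal end-to-end sketch of these four steps (the URL is the project's target site, but the extracted field, the page title, is chosen only for illustration; the full crawl code appears in section 5.3):

import requests
from bs4 import BeautifulSoup

# 1. Initiate a request, with an extra User-Agent header
response = requests.get('http://www.tianqihoubao.com/aqi/',
                        headers={'User-Agent': 'Mozilla/5.0'})
# 2. Get the response content
html = response.text
# 3. Parse the data (here: HTML, via a web page parsing library)
soup = BeautifulSoup(html, 'html.parser')
title = soup.title.get_text().strip()
# 4. Save the data (here: as plain text)
with open('page_title.txt', 'w', encoding='utf-8') as f:
    f.write(title)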
Request and Response
Request
The process in which the host sends a data request to the server is called an HTTP Request.
Response
The process in which the server returns data to the host is called an HTTP Response.
Contents of a Request
Request method
GET and POST are the two most commonly used methods.
The parameters of a GET request are all contained in the URL.
The parameters of a POST request are contained in the form data of the request body, which is relatively safer.
URL
The web address being requested.
Request headers
The header information sent with the request, such as User-Agent, Host, Cookies, etc. The User-Agent field identifies the browser.
Request body
In general, the request body of a GET request contains no important information; in a POST request, the important information (the form data) is carried in the request body.
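
A quick illustration of the two methods with the requests library (httpbin.org is just a public echo service used here for demonstration):

import requests

# GET: the parameters travel in the URL itself
r = requests.get('https://httpbin.org/get', params={'city': 'beijing'})
print(r.url)           # https://httpbin.org/get?city=beijing

# POST: the parameters travel in the form data of the request body
r = requests.post('https://httpbin.org/post', data={'city': 'beijing'})
print(r.request.body)  # city=beijing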
Contents of a Response
Response status
The status code; 200 generally means the response succeeded.
Response headers
Header fields such as the content type, content length, server information, cookies being set, etc.
Response body
The content of the requested resource, such as web page source code or binary data.
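
These three parts map directly onto the Response object that the requests library returns; a small sketch:

import requests

r = requests.get('http://www.tianqihoubao.com/aqi/')
print(r.status_code)                   # response status, e.g. 200 on success
print(r.headers.get('Content-Type'))   # one of the response headers
print(r.text[:100])                    # start of the response body (HTML source)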
5.3 Detailed design of the control module

Crawl code
The script below requests the 2018 monthly AQI pages for each of the four cities and appends one day's record per line to a per-city CSV file.

import time
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}

citys = ['beijing', 'shanghai', 'guangzhou', 'shenzhen']

for i in range(len(citys)):

    time.sleep(5)

    for j in range(1, 13):
        time.sleep(5)
        # Request the data page for each month of 2018
        url = 'http://www.tianqihoubao.com/aqi/' + citys[i] + '-2018' + str("%02d" % j) + '.html'
        # Send the request with headers (as a dict of key-value pairs)
        response = requests.get(url=url, headers=headers)
        # Build a BeautifulSoup object from the HTML string
        soup = BeautifulSoup(response.text, 'html.parser')
        tr = soup.find_all('tr')

        # Skip the header row, then read one day's record per table row
        for k in tr[1:]:
            td = k.find_all('td')
            # Date
            Date = td[0].get_text().strip()
            # Air quality grade
            Quality_grade = td[1].get_text().strip()
            # AQI index
            AQI = td[2].get_text().strip()
            # AQI rank for the day
            AQI_rank = td[3].get_text().strip()
            # PM2.5
            PM = td[4].get_text().strip()
            # Save the data: append to a per-city CSV file
            filename = 'air_' + citys[i] + '_2018.csv'
            with open(filename, 'a+', encoding='utf-8-sig') as f:
                f.write(Date + ',' + Quality_grade + ',' + AQI + ',' + AQI_rank + ',' + PM + '\n')
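
One practical hardening note (a sketch, not part of the original program): adding a timeout and failing fast on error responses keeps the crawler from hanging or silently parsing an error page. The request line inside the month loop could become:

        response = requests.get(url=url, headers=headers, timeout=10)
        response.raise_for_status()   # raise an exception on 4xx/5xx status codes
        response.encoding = 'utf-8'   # the pages are UTF-8 encoded (see section 5.1)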
Analysis code
The two scripts below aggregate the per-city CSV files and render the charts. Note that they are written against the old pyecharts 0.5.x API; the charting interface changed incompatibly in pyecharts 1.x.
import numpy as np
import pandas as pd
from pyecharts import Line

citys = ['beijing', 'shanghai', 'guangzhou', 'shenzhen']
v = []
for i in range(4):
    filename = 'air_' + citys[i] + '_2018.csv'
    df = pd.read_csv(filename, header=None, names=["Date", "Quality_grade", "AQI", "AQI_rank", "PM"])

    # Extract the month ("01" to "12") from each date of the form 2018-mm-dd
    dom = df[['Date', 'AQI']]
    list1 = []
    for j in dom['Date']:
        time = j.split('-')[1]
        list1.append(time)
    df['month'] = list1

    # Mean AQI per month, sorted in month order
    month_message = df.groupby(['month'])
    month_com = month_message['AQI'].agg(['mean'])
    month_com.reset_index(inplace=True)
    month_com_last = month_com.sort_index()

    # Format the monthly means as integer strings for the chart
    v1 = np.array(month_com_last['mean'])
    v1 = ["{}".format(int(i)) for i in v1]
    v.append(v1)

# x-axis labels "1月" through "12月"; the upper bound must be 13 so that all 12 months get a label
attr = ["{}".format(str(i) + '月') for i in range(1, 13)]

line = Line("2018年北上广深AQI全年走势图", title_pos='center', title_top='0', width=800, height=400)
line.add("北京", attr, v[0], line_color='red', legend_top='8%')
line.add("上海", attr, v[1], line_color='purple', legend_top='8%')
line.add("广州", attr, v[2], line_color='blue', legend_top='8%')
line.add("深圳", attr, v[3], line_color='orange', legend_top='8%')
line.render("2018年北上广深AQI全年走势图.html")
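
With pyecharts 0.5.x installed (for example, pip install pyecharts==0.5.11), render() writes a standalone HTML file that can be opened directly in a browser; the same applies to the grid chart below.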

import numpy as np
import pandas as pd
from pyecharts import Pie, Grid

citys = ['beijing', 'shanghai', 'guangzhou', 'shenzhen']
v = []
attrs = []
for i in range(4):
    filename = 'air_' + citys[i] + '_2018.csv'
    df = pd.read_csv(filename, header=None, names=["Date", "Quality_grade", "AQI", "AQI_rank", "PM"])

    # Count the number of days in each air quality grade, most frequent first
    rank_message = df.groupby(['Quality_grade'])
    rank_com = rank_message['Quality_grade'].agg(['count'])
    rank_com.reset_index(inplace=True)
    rank_com_last = rank_com.sort_values('count', ascending=False)

    # Grade names and day counts as arrays for the pie charts
    attr = np.array(rank_com_last['Quality_grade'])
    attrs.append(attr)
    v1 = np.array(rank_com_last['count'])
    v.append(v1)

# One ring chart per city, arranged in a 2x2 grid; only the first chart displays a legend
pie1 = Pie("北京", title_pos="28%", title_top="24%")
pie1.add("", attrs[0], v[0], radius=[25, 40], center=[30, 27], legend_pos="27%", legend_top="51%", legend_orient="horizontal",)

pie2 = Pie("上海", title_pos="58%", title_top="24%")
pie2.add("", attrs[1], v[1], radius=[25, 40], center=[60, 27], is_label_show=False, is_legend_show=False)

pie3 = Pie("广州", title_pos='28%', title_top='77%')
pie3.add("", attrs[2], v[2], radius=[25, 40], center=[30, 80], is_label_show=False, is_legend_show=False)

pie4 = Pie("深圳", title_pos='58%', title_top='77%')
pie4.add("", attrs[3], v[3], radius=[25, 40], center=[60, 80], is_label_show=False, is_legend_show=False)

grid = Grid("2018年北上广深全年空气质量情况", width=1200)
grid.add(pie1)
grid.add(pie2)
grid.add(pie3)
grid.add(pie4)
grid.render('2018年北上广深全年空气质量情况.html')

[Rendered charts: the 2018 monthly AQI trend line chart and the 2018 air quality grade pie charts for Beijing, Shanghai, Guangzhou, and Shenzhen]

Source: blog.csdn.net/sheziqiong/article/details/126687991