Python web crawler: collecting historical weather data

Weather data is a useful auxiliary feature in many machine learning applications. This article shows how to collect historical weather data with Python.

Target site

The data is crawled from the Weather Network (天气网, tianqi.com), whose historical pages are served from lishi.tianqi.com.

Implementation

Import the required packages

import requests  # HTTP client used to download the pages
from bs4 import BeautifulSoup  # HTML parser from the bs4 package
import os
import re
import csv
import pandas as pd
import numpy as np
import time
import json

The following sections use Beijing's historical weather data as the example.

Get the URLs of all months

Inspecting the page source shows that the links to all monthly pages sit inside the div with class 'tqtongji1'.

The implementation code is as follows:

def get_url(request_url):
    """Return [month_label, month_url] pairs scraped from the index page."""
    html = requests.get(request_url).text
    soup = BeautifulSoup(html, 'lxml')  # parse the document
    all_li = soup.find('div', class_='tqtongji1').find_all('li')
    url_list = []
    for li in all_li:
        url_list.append([li.get_text(), li.find('a')['href']])
    return url_list
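
The href values collected by get_url are later passed straight to requests.get. Whether the site emits absolute or relative links is not stated here, so the following is only a minimal sketch of one way to guard against relative links, using urljoin from the standard library. The helper name absolutize_urls, the sample pairs, and the index URL are all hypothetical.

```python
from urllib.parse import urljoin


def absolutize_urls(url_list, base_url):
    """Return a copy of url_list with every href resolved against base_url.

    Absolute hrefs pass through unchanged; relative ones are joined
    to the base. Hypothetical helper, not part of the original code.
    """
    return [[label, urljoin(base_url, href)] for label, href in url_list]


# Made-up sample in the [label, href] shape that get_url returns.
sample = [
    ['2017年12月', 'http://lishi.tianqi.com/beijing/201712.html'],
    ['2017年11月', '201711.html'],
]
# The index URL below is an assumption used only for this illustration.
base = 'http://lishi.tianqi.com/beijing/index.html'
for label, url in absolutize_urls(sample, base):
    print(url)
```

If the links are already absolute, the helper leaves them untouched, so applying it is harmless either way.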

Get historical weather data for a month

Once the URL of a month is known, inspecting that page's source shows that the daily weather records sit inside the div with class 'tqtongji2'.

The source code is as follows:

def get_month_weather(request_url, year_number, month_number):
    # Example month_url: 'http://lishi.tianqi.com/beijing/201712.html'
    month_url = None
    for label, url in get_url(request_url):
        # Read year and month from the link text with a regex rather than
        # fixed byte offsets, so the lookup works on Python 3 strings.
        match = re.search(r'(\d{4})\D+(\d{1,2})', label)
        if not match:
            continue
        if int(match.group(1)) == year_number and int(match.group(2)) == month_number:
            month_url = url
            break
    if month_url is None:
        raise ValueError('no page found for %d-%02d' % (year_number, month_number))
    html = requests.get(month_url).text
    soup = BeautifulSoup(html, 'lxml')  # parse the document
    all_ul = soup.find('div', class_='tqtongji2').find_all('ul')
    month_weather = []
    for ul in all_ul[1:]:  # skip the first ul, as the original loop did
        month_weather.append([li.get_text() for li in ul.find_all('li')])
    return month_weather
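
Each row returned by get_month_weather is a list of strings. To use the values as machine-learning features, the date and temperatures usually need real types. The following is a rough sketch of such a conversion. It assumes six fields in the order of the column names used in the yearly function (Date, Max_Tem, Min_Tem, Weather, Wind, Wind_Level), a date written as YYYY-MM-DD, and plain integer temperatures. The helper parse_row and the sample row are invented, and the real pages may format values differently.

```python
from datetime import datetime


def parse_row(row):
    """Convert one scraped row of strings into typed values.

    Assumes six fields in the order Date, Max_Tem, Min_Tem, Weather,
    Wind, Wind_Level, with the date written as YYYY-MM-DD. Both are
    assumptions about the page, not confirmed facts.
    """
    day, max_tem, min_tem, weather, wind, wind_level = (field.strip() for field in row)
    return {
        'Date': datetime.strptime(day, '%Y-%m-%d').date(),
        'Max_Tem': int(max_tem),
        'Min_Tem': int(min_tem),
        'Weather': weather,
        'Wind': wind,
        'Wind_Level': wind_level,
    }


# Invented sample row, only to show the shape of the result.
record = parse_row([' 2017-12-01 ', '7', '-4', 'sunny', 'north wind', 'level 3'])
print(record['Date'], record['Max_Tem'] - record['Min_Tem'])
```

A row with a different number of fields or a non-numeric temperature raises ValueError, which makes layout changes on the site visible early instead of silently producing bad features.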

Get historical weather data for a year

A full year of historical weather data is obtained by concatenating the data of its twelve months.

The source code is as follows:

def get_year_weather(request_url, year_number):
    year_weather = []
    for month in range(1, 13):
        month_weather = get_month_weather(request_url, year_number, month)
        year_weather.extend(month_weather)
        print('Weather data for month %d collected.' % month)
    col_name = ['Date', 'Max_Tem', 'Min_Tem', 'Weather', 'Wind', 'Wind_Level']
    result_df = pd.DataFrame(year_weather, columns=col_name)
    # result_df.to_csv('year_weather.csv')
    return result_df

Calling result_df = get_year_weather(request_url, 2017), where request_url is the page that lists all months for Beijing, returns a DataFrame with one row per day of 2017 and the six columns named in col_name.
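
As written, get_year_weather calls get_month_weather twelve times, and every call downloads the month index again through get_url. One possible refinement is to memoize that lookup. The sketch below illustrates the idea with functools.lru_cache and a stand-in fetcher in place of a real HTTP request. The names fetch_index, cached_index, and fetch_log, as well as the example URL, are hypothetical and used only for illustration.

```python
from functools import lru_cache

fetch_log = []  # records every simulated download


def fetch_index(request_url):
    """Stand-in for get_url: logs the call instead of touching the network."""
    fetch_log.append(request_url)
    return [['2017-12', request_url + '/201712.html']]


@lru_cache(maxsize=None)
def cached_index(request_url):
    """Return the month index for request_url, fetching it only once."""
    # Convert to nested tuples so callers cannot mutate the cached value.
    return tuple(tuple(pair) for pair in fetch_index(request_url))


for _ in range(12):  # one lookup per month, as in get_year_weather
    index = cached_index('http://example.com/beijing')

print(len(fetch_log))
```

In the article's code the same effect could be had by wrapping get_url this way, or by fetching url_list once in get_year_weather and passing it down to get_month_weather.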

For the full code and usage notes, see my GitHub (https://github.com/Ruanshubin/).

About the author

Personal blog: http://ruanshubin.top
GitHub: https://github.com/Ruanshubin/

