In many machine learning applications, weather data is a useful auxiliary feature, so this article shows how to use Python to obtain historical weather data.
Target site
The target website for crawling is the Weather Network (lishi.tianqi.com).
Programming
Import the required packages:
import requests                # HTTP requests
from bs4 import BeautifulSoup  # HTML parsing
import os
import re
import csv
import pandas as pd
import numpy as np
import time
import json
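Before crawling, note that many sites reject the default `requests` User-Agent. A simple precaution is to send requests through a session with a browser-like header; the header value below is illustrative, not taken from the original article:

```python
import requests

# Some sites block the library's default User-Agent, so reuse a session
# that sends a browser-like one (this value is just an example).
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})
print(session.headers["User-Agent"])
```

Using `session.get(url)` in place of `requests.get(url)` also reuses the underlying connection across the many month pages fetched below.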
The following is an example of crawling historical weather data in Beijing:
Get all month URLs
Inspecting the page source shows that the links for all months sit inside the div with class 'tqtongji1'.
The implementation code is as follows:
def get_url(request_url):
    html = requests.get(request_url).text
    soup = BeautifulSoup(html, 'lxml')  # parse the document
    all_li = soup.find('div', class_='tqtongji1').find_all('li')
    url_list = []
    for li in all_li:
        url_list.append([li.get_text(), li.find('a')['href']])
    return url_list
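To see what this extraction step produces without hitting the network, here is a minimal sketch run against a static HTML snippet that mimics the month-list markup described above (the snippet itself is hand-made, not copied from the site):

```python
from bs4 import BeautifulSoup

# Hand-made snippet mimicking the 'tqtongji1' month-list markup.
sample_html = """
<div class="tqtongji1">
  <ul>
    <li><a href="http://lishi.tianqi.com/beijing/201701.html">2017年01月</a></li>
    <li><a href="http://lishi.tianqi.com/beijing/201702.html">2017年02月</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
url_list = [
    [li.get_text(strip=True), li.find("a")["href"]]
    for li in soup.find("div", class_="tqtongji1").find_all("li")
]
print(url_list)
# → [['2017年01月', 'http://lishi.tianqi.com/beijing/201701.html'],
#    ['2017年02月', 'http://lishi.tianqi.com/beijing/201702.html']]
```

Each entry pairs the human-readable month label with its page URL, which is exactly the shape `get_month_weather` consumes next.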
Get historical weather data for a month
After obtaining the month URLs, inspecting a month page's source shows that the historical weather data sits inside the div with class 'tqtongji2'.
The source code is as follows:
def get_month_weather(request_url, year_number, month_number):
    # e.g. month_url = 'http://lishi.tianqi.com/beijing/201712.html'
    url_list = get_url(request_url)
    month_url = None
    for i in range(len(url_list) - 1, -1, -1):
        label = url_list[i][0]        # e.g. '2017年01月'
        year_split = int(label[:4])   # characters before '年'
        month_split = int(label[5:7]) # characters between '年' and '月'
        if year_split == year_number and month_split == month_number:
            month_url = url_list[i][1]
            break
    html = requests.get(month_url).text
    soup = BeautifulSoup(html, 'lxml')  # parse the document
    all_ul = soup.find('div', class_='tqtongji2').find_all('ul')
    month_weather = []
    for ul in all_ul[1:]:  # skip the header row
        li_list = [li.get_text() for li in ul.find_all('li')]
        month_weather.append(li_list)
    return month_weather
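The fixed-position slicing of the month label (`label[:4]`, `label[5:7]`) works for labels shaped exactly like '2017年01月'. A regex is more forgiving, e.g. if the month is written without a leading zero; this helper is my own suggestion, not part of the original article:

```python
import re

def parse_year_month(label):
    # Pull year and month out of a label like '2017年01月'.
    # \D+ skips the '年' separator, so the match does not depend
    # on fixed character offsets.
    m = re.match(r"(\d{4})\D+(\d{1,2})", label)
    return int(m.group(1)), int(m.group(2))

print(parse_year_month("2017年01月"))  # → (2017, 1)
```

The function returns plain integers, so the comparison in `get_month_weather` stays unchanged.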
Get historical weather data for a year
The annual historical weather data can be obtained by summarizing the data of each month.
The source code is as follows:
def get_year_weather(request_url, year_number):
    year_weather = []
    for i in range(12):
        month_weather = get_month_weather(request_url, year_number, i + 1)
        year_weather.extend(month_weather)
        print('Weather data for month %d collected.' % (i + 1))
    col_name = ['Date', 'Max_Tem', 'Min_Tem', 'Weather', 'Wind', 'Wind_Level']
    result_df = pd.DataFrame(year_weather, columns=col_name)
    # result_df.to_csv('year_weather.csv')
    return result_df
Executing 'result_df = get_year_weather(request_url, 2017)' returns a DataFrame containing Beijing's daily weather for 2017.
For the full code and instructions, please see my GitHub.
A quick plug
Personal blog: http://ruanshubin.top
GitHub: https://github.com/Ruanshubin/
You are welcome to follow my WeChat public account!