Foreword
Crawling data with Python is now very common, and weather data is the classic entry-level exercise: many people write their first crawler against a weather site. This article shows how to crawl data from China Weather Network (weather.com.cn): the user enters a city name, the script looks up the city, returns its weather for the next seven days, saves the result as a csv file, and displays the data graphically. The complete code is given at the end.
1. Modules used
Python 3 is used. The main modules are csv, sys, urllib.request, and BeautifulSoup4: csv handles csv files, urllib.request builds http requests, and BeautifulSoup4 parses the page information. If any of these modules is missing, open cmd and install it with pip before running. You also need a file that maps city names to city codes, so the code for the city the user enters can be looked up and the matching weather information fetched. Click here to view the collected city codes in the cityinfo file; copy the page content, save it as a .py file (cityinfo.py), and put it in the same path so it can be imported.
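The exact contents of cityinfo.py come from the linked page; as a rough sketch of the structure the rest of the code expects, it is a module exposing a dict named `city` that maps each city name to its station code (the three entries below are believed correct, but verify them against the linked file):

```python
# cityinfo.py -- minimal sketch; the real file maps every city name
# on China Weather Network to its nine-digit station code
city = {
    "北京": "101010100",  # Beijing
    "上海": "101020100",  # Shanghai
    "广州": "101280101",  # Guangzhou
}
```

With this in place, `cityinfo.city[cityname]` in the next step returns the code used to build the weather page URL.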
2. Look up the city code for the input city
cityname = input("Please enter the city whose weather you want to check: ")
if cityname in cityinfo.city:
    citycode = cityinfo.city[cityname]
else:
    sys.exit()
3. Make the request and get the response, i.e. the page information
url = 'http://www.weather.com.cn/weather/' + citycode + '.shtml'
header = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36')  # set the header information
http_handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(http_handler)  # build an opener so the header can be modified
opener.addheaders = [header]
request = urllib.request.Request(url)  # build the request
response = opener.open(request)  # get the response
html = response.read()  # read the response body
html = html.decode('utf-8')  # set the encoding, otherwise the text is garbled
The User-Agent header is set because some sites use it for anti-crawler checks. In Chrome, press F12 to open developer tools, click the Network tab, trigger a request, and click any request in the stream to see the corresponding header information.
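The same header can also be attached directly when building the Request, which is slightly more compact than going through an opener. A minimal sketch (the URL uses Beijing's code for illustration; the actual network call is left commented out):

```python
import urllib.request

url = "http://www.weather.com.cn/weather/101010100.shtml"  # Beijing, for illustration
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/76.0.3809.132 Safari/537.36",
})
# html = urllib.request.urlopen(req).read().decode("utf-8")  # performs the request
```

Either way, the point is that the request leaves with a browser-like User-Agent instead of urllib's default.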
4. Filter the data in the returned page
final = []  # initialize a list to save the data
bs = BeautifulSoup(html, 'html.parser')  # create the BeautifulSoup object
body = bs.body  # get the body part of the document
data = body.find('div', {'id': '7d'})
ul = data.find('ul')
li = ul.find_all('li')
# The tags are narrowed down step by step according to where the content sits in the page: the seven-day forecast we want is inside the div with id 7d, and the seven-day weather list is the ul inside that div. There is only one such ul, so the find method is enough; each day's weather is in one of several li tags under the ul, so find_all() must be used to get all of them, and the find method cannot be used.
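The find/find_all distinction matters: `find` returns the first matching tag (or None), while `find_all` returns a list of every match. A tiny self-contained demo on invented markup shaped like the weather page:

```python
from bs4 import BeautifulSoup

html = """
<div id="7d">
  <ul>
    <li><h1>7日（今天）</h1><p>晴</p></li>
    <li><h1>8日（明天）</h1><p>多云</p></li>
  </ul>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
data = bs.find("div", {"id": "7d"})  # first (and only) div with id 7d
ul = data.find("ul")                 # only one ul, so find is enough
days = ul.find_all("li")             # several li tags, so find_all is required
print(len(days))          # 2
print(days[0].h1.string)  # 7日（今天）
```

If `find("li")` were used here, only the first day would ever be seen.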
5. Crawl the data
i = 0  # controls the number of days crawled
lows = []  # stores the low temperatures
highs = []  # stores the high temperatures
for day in li:  # iterate over each li
    if i < 7:
        temp = []
        date = day.find('h1').string  # get the date
        temp.append(date)
        inf = day.find_all('p')  # get the weather; there are several p tags under each li, so find_all is needed instead of find
        temp.append(inf[0].string)
        temlow = inf[1].find('i').string  # lowest temperature
        if inf[1].find('span') is None:  # the forecast sometimes has no highest temperature, so a check is needed
            temhigh = None
            temperate = temlow
        else:
            temhigh = inf[1].find('span').string  # highest temperature
            temhigh = temhigh.replace('℃', '')
            temperate = temhigh + '/' + temlow
        temp.append(temperate)
        final.append(temp)
        i = i + 1
Here each li yields one day's weather conditions, limited to 7 days. Each piece of data is extracted from the tags under the li by position; pay attention to how many matching tags there are: if the current tag contains several identical tags, use find_all() instead of find, then index with [n] to extract the data you want.
One thing to watch when extracting the temperature: China Weather Network usually shows both a highest and a lowest temperature, but sometimes only one temperature is shown and the highest is missing, so a check is needed or the script will crash. The temperatures are then spliced into one string and put into the final list together with the other data.
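That branching can be isolated in a small helper; this is a sketch of the same logic (the function name and argument shapes are my own, not from the original):

```python
def format_temperature(high, low):
    """Combine the high/low readings into the string stored in the csv.

    high -- text of the <span> tag, or None when the forecast omits it
    low  -- text of the <i> tag, e.g. "15℃"
    """
    if high is None:
        return low  # only one reading available, use it as-is
    return high.replace("℃", "") + "/" + low

print(format_temperature("25℃", "15℃"))  # 25/15℃
print(format_temperature(None, "15℃"))   # 15℃
```

Keeping the None case explicit is what prevents the crash described above.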
6. Write the csv file
with open('weather.csv', 'a', errors='ignore', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow([cityname])
    f_csv.writerows(final)
The weather data is now stored in weather.csv, with one header row for the city name followed by one row per day.
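To sanity-check the write path, the same csv calls can be exercised against an in-memory buffer instead of weather.csv (the row data below is illustrative):

```python
import csv
import io

# Simulate the write path with an in-memory buffer instead of weather.csv
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["北京"])                            # city name header row
writer.writerows([["7日（今天）", "晴", "25/15℃"]])  # one day's data

buf.seek(0)
rows = list(csv.reader(buf))
print(rows)  # [['北京'], ['7日（今天）', '晴', '25/15℃']]
```

Note that `writerow` takes a single row while `writerows` takes a list of rows; passing a bare string to `writerows` would split the city name into one character per column.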
7. Plot with pygal; before using the module, install it with pip install pygal, then import pygal
bar = pygal.Line()  # create a line chart
bar.add('lowest temperature', lows)  # add the two data series
bar.add('highest temperature', highs)  # note: lows and highs are lists of int
bar.x_labels = daytimes
bar.x_labels_major = daytimes[::30]
bar.x_label_rotation = 45
bar.title = cityname + ' temperature trend for the next seven days'  # set the chart title
bar.x_title = 'date'  # x axis title
bar.y_title = 'temperature (degrees Celsius)'  # y axis title
bar.legend_at_bottom = True
bar.show_x_guides = False
bar.show_y_guides = True
bar.render_to_file('temperate1.svg')  # save the chart as an svg file, which can be viewed in a browser
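A side note on `x_labels_major = daytimes[::30]`: with only seven labels, a slice with step 30 keeps just index 0, so a single major label is shown. Plain list slicing demonstrates this (sample dates invented):

```python
daytimes = ["7日", "8日", "9日", "10日", "11日", "12日", "13日"]  # sample dates
print(daytimes[::30])  # ['7日'] -- step 30 on a 7-item list keeps only index 0
print(daytimes[::2])   # every other label: ['7日', '9日', '11日', '13日']
```

A smaller step such as `[::2]` would mark every other date instead.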
The generated chart gives the final visual display of the weather data.
8. Complete code
import csv
import sys
import urllib.request
from bs4 import BeautifulSoup # page parsing module
import pygal
import cityinfo
cityname = input("Please enter the city whose weather you want to check: ")
if cityname in cityinfo.city:
    citycode = cityinfo.city[cityname]
else:
    sys.exit()
url = 'http://www.weather.com.cn/weather/' + citycode + '.shtml'
header = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36')  # set the header information
http_handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(http_handler)  # build an opener so the header can be modified
opener.addheaders = [header]
request = urllib.request.Request(url)  # build the request
response = opener.open(request)  # get the response
html = response.read()  # read the response body
html = html.decode('utf-8')  # set the encoding, otherwise the text is garbled
# Initial filtering based on the page information obtained
final = []  # initialize a list to save the data
bs = BeautifulSoup(html, 'html.parser')  # create the BeautifulSoup object
body = bs.body
data = body.find('div', {'id': '7d'})
# print(type(data))  # debug check of the extracted tag type
ul = data.find('ul')
li = ul.find_all('li')
# Crawl the data we need
i = 0  # controls the number of days crawled
lows = []  # stores the low temperatures
highs = []  # stores the high temperatures
daytimes = []  # stores the dates
weathers = []  # stores the weather descriptions
for day in li:  # iterate over each li
    if i < 7:
        temp = []  # temporary storage for one day's data
        date = day.find('h1').string  # get the date
        # print(date)
        temp.append(date)
        daytimes.append(date)
        inf = day.find_all('p')  # there are several p tags under each li, so find_all is needed instead of find
        # print(inf[0].string)  # the first p tag holds the weather description
        temp.append(inf[0].string)
        weathers.append(inf[0].string)
        temlow = inf[1].find('i').string  # lowest temperature
        if inf[1].find('span') is None:  # the forecast sometimes has no highest temperature
            temhigh = None
            temperate = temlow
        else:
            temhigh = inf[1].find('span').string  # highest temperature
            temhigh = temhigh.replace('℃', '')
            temperate = temhigh + '/' + temlow
        # temp.append(temhigh)
        # temp.append(temlow)
        lowStr = ""
        lowStr = lowStr.join(temlow.string)
        lows.append(int(lowStr[:-1]))  # convert the low temperature from NavigableString to int and store it in lows
        if temhigh is None:
            highs.append(int(lowStr[:-1]))
        else:
            highStr = ""
            highStr = highStr.join(temhigh)
            highs.append(int(highStr))  # convert the high temperature from NavigableString to int and store it in highs
        temp.append(temperate)
        final.append(temp)
        i = i + 1
# Write the acquired weather data to a csv file
with open('weather.csv', 'a', errors='ignore', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow([cityname])
    f_csv.writerows(final)
# Plotting
bar = pygal.Line()  # create a line chart
bar.add('lowest temperature', lows)
bar.add('highest temperature', highs)
bar.x_labels = daytimes
bar.x_labels_major = daytimes[::30]
# bar.show_minor_x_labels = False  # hide the minor x axis labels
bar.x_label_rotation = 45
bar.title = cityname + ' temperature trend for the next seven days'
bar.x_title = 'date'
bar.y_title = 'temperature (degrees Celsius)'
bar.legend_at_bottom = True
bar.show_x_guides = False
bar.show_y_guides = True
bar.render_to_file('temperate.svg')