Bulk web crawling with Python

       On most sites, older weather data generally has to be paid for. To get complete data with less effort, we often turn to a web crawler. This article records my first experience writing one. Since it is my first crawler, and the Python script takes a fairly long time to run, please point out any mistakes you find.

       The site to crawl is https://en.tutiempo.net/climate/ws-567780.html, which holds the monthly average weather data for Kunming. Taking July 1942 as an example, the page is https://en.tutiempo.net/climate/07-1942/ws-567780.html, where 07 is the month and 1942 is the year. We need the monthly data from 1942 to 2019, i.e. the information in the red box of Figure 1 on every page from https://en.tutiempo.net/climate/01-1942/ws-567780.html to https://en.tutiempo.net/climate/12-2019/ws-567780.html.
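
       The full list of page addresses can be generated before any request is made. The following is only a minimal sketch: the function name monthly_urls and its parameters are illustrative and do not appear in the script later in this article.

```python
def monthly_urls(station="ws-567780", first_year=1942, last_year=2019):
    """Yield (year, month, url) for every month in the range, inclusive."""
    base = "https://en.tutiempo.net/climate/{:02d}-{}/{}.html"
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            # {:02d} zero-pads the month, so July becomes "07"
            yield year, month, base.format(month, year, station)


urls = list(monthly_urls())
print(len(urls))    # 936 pages: 78 years * 12 months
print(urls[0][2])   # first page, January 1942
print(urls[-1][2])  # last page, December 2019
```

       Building the list up front makes it easy to check the first and last address by eye before starting the slow download loop.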

Figure 1

       Press F12 to inspect the page structure, as shown in Figure 2, and find the code that corresponds to the red box. (If you are new to HTML, hover the mouse over a line of code; the region highlighted on the page is the module that this code produces.)

Figure 2

       The page code that corresponds to the red box is shown in Figure 3:

Figure 3

       From this we construct the Python regular expression used for matching:

'<td class="tc2">(.*)</td><td class="tc3">(.*)</td><td class="tc4">(.*)</td><td class="tc5">(.*)</td><td class="tc6">(.*)</td><td class="tc7">(.*)</td><td class="tc8">(.*)</td><td class="tc9">(.*)</td><td class="tc10">(.*)</td><td>&nbsp;</td><td>(.*)</td><td>(.*)</td><td>(.*)</td><td>(.*)</td>'
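
       To see what this pattern captures, it can be tried on a single table row first. The sketch below assembles the same pattern from its repeating parts and runs it on a made-up row; the placeholder values (v2 ... v10, a ... d) are invented for illustration and are not real data from the site.

```python
import re

# Same pattern as above, assembled from its repeating parts.
classed_cells = "".join('<td class="tc{}">(.*)</td>'.format(n) for n in range(2, 11))
plain_cells = "<td>&nbsp;</td>" + "<td>(.*)</td>" * 4
row_pattern = re.compile(classed_cells + plain_cells)

# Made-up row with placeholder values (NOT real data from the site).
sample_row = (
    "".join('<td class="tc{}">v{}</td>'.format(n, n) for n in range(2, 11))
    + "<td>&nbsp;</td><td>a</td><td>b</td><td>c</td><td>d</td>"
)

# With several capture groups, findall returns one tuple per matching row.
rows = row_pattern.findall(sample_row)
print(len(rows), len(rows[0]))
print(rows[0])
```

       Each match is a tuple of 13 strings, one per capture group, which is why the script writes 13 columns per month. Note that (.*) is greedy and does not cross line breaks, so the pattern relies on the whole row sitting on one line of the HTML.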

       The complete Python code is as follows:

import re

import requests
from xlwt import Workbook

# Pattern with 13 capture groups, one per table cell we want to keep
word2 = '<td class="tc2">(.*)</td><td class="tc3">(.*)</td><td class="tc4">(.*)</td><td class="tc5">(.*)</td><td class="tc6">(.*)</td><td class="tc7">(.*)</td><td class="tc8">(.*)</td><td class="tc9">(.*)</td><td class="tc10">(.*)</td><td>&nbsp;</td><td>(.*)</td><td>(.*)</td><td>(.*)</td><td>(.*)</td>'
pattern = re.compile(word2)

book = Workbook(encoding='utf-8')
sheet = book.add_sheet('Sheet1')  # create a sheet
for j in range(78):  # 78 years in total (1942 to 2019)
    for k in range(12):  # 12 months per year
        print(j, k)
        # {:02d} puts a leading 0 in front of months 1 to 9
        url = "https://en.tutiempo.net/climate/{:02d}-{}/ws-567780.html".format(k + 1, j + 1942)
        try:
            f = requests.get(url, timeout=30)  # GET the page to obtain its HTML
            page = f.content.decode()  # renamed from str, which shadowed the built-in
            # find the matching data
            wordlist2 = pattern.findall(page)
            a = j * 12 + k  # one spreadsheet row per month
            for i in range(13):
                # store the data in the workbook
                print(wordlist2[0][i])
                sheet.write(a, i, label=wordlist2[0][i])
        except (requests.RequestException, UnicodeDecodeError, IndexError) as err:
            # skip months that cannot be downloaded or contain no matching row
            print("skipped", url, err)
# save the workbook to a spreadsheet file
book.save("weather.xls")

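       Running the script produces the Excel sheet shown in Figure 5. After replacing characters with Ctrl+F and applying Excel's Data > Text to Columns > Finish, we get the sheet shown in Figure 6. After some further tidying up, the result is the sheet in Figure 7.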

 

Figure 5

Figure 6

Figure 7
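
       Part of the manual cleanup could also be done in Python before the values are written to the sheet. This is only a rough sketch under an assumption: that the leftover characters are HTML entities such as &nbsp; and surrounding whitespace. The helper clean_cell is hypothetical and is not part of the script above, and the sample values are invented.

```python
import html


def clean_cell(raw):
    """Decode HTML entities, trim whitespace, and convert to float when possible."""
    # html.unescape turns "&nbsp;" into a non-breaking space (\xa0)
    text = html.unescape(raw).replace("\xa0", " ").strip()
    try:
        return float(text)
    except ValueError:
        return text  # keep non-numeric cells (for example "-") as text


# Invented sample values, for illustration only
cleaned = [clean_cell(c) for c in ["15.2", "&nbsp;", " 1013.4 ", "-"]]
print(cleaned)
```

       Writing numbers instead of strings would let Excel treat the cells as numeric right away, although the real pages may need other replacements than the ones assumed here.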

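        Finally, this article is the author's original work. Do not use its content for commercial purposes; if you repost it, please credit the source.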

Origin www.cnblogs.com/nzsll/p/10959261.html