Scraping a web weather forecast with Python

Python version: 3.5
Modules used:
urllib
xpinyin
bs4.BeautifulSoup

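Goal: given a city's Chinese name, get that city's weather.
First, pick a weather forecast site: http://www.tianqi.com
On this site, appending a city's pinyin to the URL gives that city's weather page. For Beijing, for example: http://www.tianqi.com/beijing.html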
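
As a minimal sketch of the URL pattern described above, the helper below builds the page address from a city's pinyin. The function name build_weather_url and the BASE_URL constant are illustrative and not part of the site or of any library; the pattern itself is the one shown for Beijing.

```python
BASE_URL = "http://www.tianqi.com/"  # site named in this article


def build_weather_url(city_pinyin):
    """Return the weather page URL for a city's pinyin (hypothetical helper)."""
    # Normalise stray whitespace and case, e.g. " BeiJing " -> "beijing"
    slug = city_pinyin.strip().lower()
    if not slug:
        raise ValueError("city_pinyin must not be empty")
    return BASE_URL + slug + ".html"


print(build_weather_url("beijing"))
```

Rejecting an empty string early avoids requesting http://www.tianqi.com/.html by accident.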

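Use the xpinyin module to convert the city's Chinese name to pinyin

import xpinyin

pin = xpinyin.Pinyin()
city_pinyin = pin.get_pinyin("北京", "")  # "北京" is Beijing

In xpinyin, the first argument of get_pinyin is the Chinese text and the second is the separator, which defaults to '-'. Passing an empty string joins the syllables with nothing between them, which is the form the URL needs.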

Use urllib to request the page

import urllib.request
page = urllib.request.urlopen(url)

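When a URL is opened with urllib.request.urlopen alone, the server receives a bare request for the page, with no information about the browser, operating system, or hardware platform that sent it. Requests lacking this information are often not ordinary browser visits; crawlers are a typical example.
To block such traffic, some sites check the User-Agent header of the request (it describes the hardware platform, system software, application software, and user preferences). If the User-Agent is missing or looks abnormal, the request is rejected and the urlopen call fails with an error.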

So try adding a User-Agent header to the request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=url, headers=headers)
urllib.request.urlopen(req).read()
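
The following is a small sketch of how the header can be attached and how a rejected request might be handled. make_request and fetch_html are hypothetical helper names, and the 10-second timeout is an arbitrary choice; Request, urlopen, HTTPError, and URLError come from the standard urllib package.

```python
import urllib.error
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) "
              "Gecko/20100101 Firefox/23.0")


def make_request(url):
    """Build a Request carrying a browser-like User-Agent (hypothetical helper)."""
    return urllib.request.Request(url=url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the raw page bytes, or None if the request fails (hypothetical helper)."""
    try:
        with urllib.request.urlopen(make_request(url), timeout=timeout) as page:
            return page.read()
    except urllib.error.HTTPError as err:
        # The server answered with an error status, e.g. when it rejects the client
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # The server could not be reached at all
        print("Connection failed:", err.reason)
    return None


req = make_request("http://www.tianqi.com/beijing.html")
# urllib stores header names capitalised as "User-agent"
print(req.get_header("User-agent"))
```

Building the Request separately from sending it makes it possible to inspect the headers without touching the network.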

Read the page and parse it with BeautifulSoup

html = page.read()
soup = BeautifulSoup(html.decode("utf-8"), "html.parser")  # "html.parser" selects the parser to use
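
The snippet above assumes the page is encoded as UTF-8. As a hedged sketch, the hypothetical helper decode_html below tries a declared charset first and falls back to UTF-8 with replacement characters. In real code the declared charset could come from page.headers.get_content_charset(); it is passed in by hand here so the example runs offline.

```python
def decode_html(raw, declared_charset=None):
    """Decode page bytes to text (hypothetical helper, not part of bs4 or urllib)."""
    for charset in (declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate
            continue
    # Last resort: keep going, replacing undecodable bytes
    return raw.decode("utf-8", errors="replace")


sample = "北京天气".encode("gbk")  # made-up sample bytes
print(decode_html(sample, "gbk"))
```

The decoded string can then be passed to BeautifulSoup in place of html.decode("utf-8").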

Source code

import urllib.request
import xpinyin
from bs4 import BeautifulSoup

def get_weather(city_pinyin):
    # Send a browser-like User-Agent so the site does not reject the request
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    website = "http://www.tianqi.com/" + city_pinyin + ".html"
    req = urllib.request.Request(url=website, headers=header)
    page = urllib.request.urlopen(req)
    html = page.read()
    soup = BeautifulSoup(html.decode("utf-8"), "html.parser")  # "html.parser" selects the parser to use
    nodes = soup.find_all('dd')
    today_weather = ""
    marker = '[切换城市]'  # the page's "[switch city]" link text
    for node in nodes:
        temp = node.get_text()
        # str.find returns -1 (truthy) when the marker is absent, so compare explicitly
        cut = temp.find(marker)
        if cut != -1:
            temp = temp[:cut]
        today_weather += temp
    return today_weather

if __name__ == "__main__":
    pin = xpinyin.Pinyin()
    city_pinyin = pin.get_pinyin("杭州", "")  # "杭州" is Hangzhou
    today_weather = get_weather(city_pinyin)
    print(today_weather)
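
One detail worth isolating is how the text is trimmed at the '[切换城市]' ("[switch city]") marker. str.find returns -1 when the substring is absent, and -1 is truthy, so its result needs an explicit comparison rather than a bare if. The helper name cut_at_marker and the sample strings below are made up for illustration.

```python
def cut_at_marker(text, marker="[切换城市]"):
    """Return text up to the marker, or text unchanged if the marker is absent."""
    index = text.find(marker)  # -1 means "not found"
    if index == -1:
        return text
    return text[:index]


print(cut_at_marker("杭州[切换城市]"))  # marker present: only the text before it is kept
print(cut_at_marker("日落: 18:10"))     # marker absent: the text is returned as-is
```

An alternative with the same effect is text.split(marker, 1)[0], which avoids the index check entirely.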

To get the current day's weather for another city, just change the city name.

The result looks like this:

杭州2019年03月26日　星期二　己亥年二月二十 22℃ 多云9 ~ 22℃湿度：32%风向：东南风 2级紫外线：最空气质量：优PM: 33日出: 05:56日落: 18:1