A First Python Crawler Example: Scraping Weather Information

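Web crawlers:

A web crawler, also called a web spider, fetches the content of web pages by their address (URL). The address is the site link we type into the browser.

A browser acts as a client: it requests information from the server, then parses and displays it, which is the front-end side we are all familiar with. A crawler makes the same kind of request, but extracts data from the response instead of rendering it.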

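Scraping weather information:

1. Environment: the recently updated PyCharm 2018.2 with Python 3.5.

2. Use requests to fetch the HTML document.

3. Use beautifulsoup4 to parse out the data we care about.

4. Scrape the forecast for one city from a weather site and write it to a .csv file.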

1. Project setup

In the project settings, open Project Interpreter to select the Python version and install packages. Two packages are needed: requests and beautifulsoup4.

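2. What to scrape:

China Weather Network (weather.com.cn): the Suzhou forecast for the next 7 days, including each day's high and low temperature.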

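3. Package installation

During installation the existing pip version turned out to be unusable, so pip had to be upgraded first. requests and beautifulsoup4 were then installed.

Module installation: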
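From a terminal, the usual commands are `python -m pip install --upgrade pip` followed by `python -m pip install requests beautifulsoup4`. To confirm that the interpreter selected in PyCharm can actually see the packages, a small check like the following may help. It is only a sketch: the helper name `missing_modules` is made up for illustration, and note that beautifulsoup4 is imported under the name `bs4`.

```python
import importlib.util

# Import names differ from the pip package names:
# beautifulsoup4 is imported as "bs4", requests as "requests".
REQUIRED_MODULES = ["requests", "bs4"]


def missing_modules(module_names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

If the script reports a missing module, the package was probably installed into a different interpreter than the one the project uses.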

4. Fetching the HTML:

Read this step alongside the source code and its comments.
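The fetch function in the source listing at the end of this post retries in an endless loop, pausing for a random interval after each network error. As a hedged variation, the sketch below caps the number of attempts. The names `fetch_with_retry`, `max_attempts`, and `wait_range` are illustrative and not part of the original script, and the `sleep` parameter exists only so the pause can be swapped out when experimenting.

```python
import random
import time


def fetch_with_retry(fetch, max_attempts=3, wait_range=(1, 3), sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    fetch is any zero-argument callable, for example
    lambda: requests.get(url, headers=header, timeout=10).
    OSError is caught because socket errors and the exceptions
    raised by requests all derive from it.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError as error:
            last_error = error
            print("attempt {} failed: {}".format(attempt, error))
            if attempt < max_attempts:
                # Pause for a random number of seconds before retrying.
                sleep(random.uniform(*wait_range))
    # Every attempt failed: re-raise the last error for the caller.
    raise last_error
```

Capping the attempts means a site that stays unreachable ends the script with an error instead of looping forever.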

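To obtain the headers: open the browser's developer console -> switch to the Network tab -> reload the page -> select the first request and copy its request headers: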

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Connection: keep-alive

Accept-Encoding: gzip, deflate

Accept-Language: zh-cn

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15
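Retyping these lines by hand into a Python dict is error-prone, since stray spaces inside a value change what is sent. One possible approach, sketched here with a made-up helper name `parse_headers`, is to paste the copied text as-is and split each line at its first colon:

```python
# Header lines pasted exactly as copied from the browser.
RAW_HEADERS = """\
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept-Language: zh-cn
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15
"""


def parse_headers(raw):
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between headers
        # Split on the first colon only; values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


header = parse_headers(RAW_HEADERS)
```

The resulting dict can then be passed as `requests.get(url, headers=header)`.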

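5. Extracting the required fields from the HTML:

Reference: the Beautiful Soup documentation

We need to locate the div whose id is 7d, then the ul inside it, and finally the li tags under that ul.

See the source code for the actual implementation.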
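The lookup can be tried offline on a small fragment before running it against the live page. The sketch below assumes beautifulsoup4 is installed. `SAMPLE_HTML` is a hypothetical fragment shaped like the structure the script expects (an h1 for the day, one p for the weather, one p holding a span for the high and an i for the low); the real page contains far more markup. The function name `extract_forecast` is also illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the second item has no <span>, mimicking a day
# for which the page shows no high temperature.
SAMPLE_HTML = """
<html><body>
<div id="7d">
  <ul>
    <li><h1>1日（今天）</h1><p>多云</p><p><span>25℃</span>/<i>18℃</i></p></li>
    <li><h1>2日（明天）</h1><p>晴</p><p><i>17℃</i></p></li>
  </ul>
</div>
</body></html>
"""


def extract_forecast(html_text):
    """Return one [day, weather, high, low] row per li under div#7d."""
    soup = BeautifulSoup(html_text, "html.parser")
    container = soup.find("div", id="7d")
    rows = []
    for item in container.find("ul").find_all("li"):
        day = item.find("h1").get_text(strip=True)
        paragraphs = item.find_all("p")
        weather = paragraphs[0].get_text(strip=True)
        high_tag = paragraphs[1].find("span")  # may be absent
        low_tag = paragraphs[1].find("i")
        if high_tag is not None:
            # Drop the unit so the high temperature is a bare number string.
            high = high_tag.get_text(strip=True).replace("℃", "")
        else:
            high = None
        low = low_tag.get_text(strip=True)
        rows.append([day, weather, high, low])
    return rows


rows = extract_forecast(SAMPLE_HTML)
```

As in the original script, the low temperature keeps its ℃ suffix while the high temperature has it removed.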

6. Writing the CSV file:

Once the data has been scraped, we write it to a file.
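A minimal sketch of this step follows. It uses English column names, made-up sample rows, and a temporary directory purely for illustration. Two choices differ from the original script and are assumptions on my part: the encoding is stated explicitly rather than left to the platform default, and the file is opened in `'w'` mode so that re-running does not append a second header row.

```python
import csv
import os
import tempfile

HEADER = ["Day", "Weather", "High", "Low"]


def write_rows(rows, path):
    """Write the header and the data rows to path as UTF-8 CSV."""
    # newline='' lets the csv module control line endings;
    # mode 'w' overwrites, so re-running does not duplicate the header.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


# Made-up sample rows; None is written as an empty field.
sample_rows = [["1日（今天）", "多云", "25", "18℃"],
               ["2日（明天）", "晴", None, "17℃"]]
sample_path = os.path.join(tempfile.mkdtemp(), "weather-sample.csv")
write_rows(sample_rows, sample_path)
```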

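The Mac preview renders the file correctly, but opening it in Excel shows garbled text because Excel reads the CSV as GBK.

We therefore added some code that converts the file from UTF-8 to GBK. The GBK-encoded CSV file then opens without garbled text.

See the source code for the actual implementation.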
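The conversion amounts to reading the text with one encoding and writing it back with another. The sketch below does this with the built-in `open` rather than `codecs.open`. The function name `convert_encoding` and the temporary sample file are illustrative, and `errors="replace"` is my own addition so that a character GBK cannot represent becomes `?` instead of aborting the conversion.

```python
import os
import tempfile


def convert_encoding(src, dst, src_encoding="utf-8", dst_encoding="gbk"):
    """Re-encode a text file, replacing characters dst_encoding cannot represent."""
    # newline='' keeps the original line endings untouched.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        content = f.read()
    with open(dst, "w", encoding=dst_encoding, errors="replace", newline="") as f:
        f.write(content)


# Build a small UTF-8 sample file in a temporary directory and convert it.
work_dir = tempfile.mkdtemp()
utf8_path = os.path.join(work_dir, "weather.csv")
gbk_path = os.path.join(work_dir, "weather-gbk.csv")

with open(utf8_path, "w", encoding="utf-8", newline="") as f:
    f.write("未来七天,天气\r\n1日（今天）,多云\r\n")

convert_encoding(utf8_path, gbk_path)
```

Another option, not used in the original, is to write the CSV with the `utf-8-sig` encoding. That adds a byte-order mark which Excel generally uses to detect UTF-8, so it may remove the need for a second file.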

The file can also be opened in Numbers.

7. The source of pcSuzhouWeather.py is as follows:

#!/usr/bin/python
# -*- coding:utf-8 -*-

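import requests  # fetches the HTML source of the page
import csv  # writes the data to a CSV file
import random  # random numbers
import time  # time-related operations
import socket  # used only for network error handling
import http.client  # used only for network error handling
# import urllib.request  # urllib can also fetch the HTML source, but is less convenient than requests
from bs4 import BeautifulSoup  # extracts the contents of tags from the HTML, in place of regular expressions
import codecs
import os
from datetime import date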



def get_content(url, data = None):
    # Headers imitate a browser visit; without them the request may be rejected. They can be copied from the browser.
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language':'zh-cn',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15'
    }
    # Timeout in seconds; a random value is chosen to make the crawler harder for the site to identify
    timeout = random.choice(range(80,180))

    while True:
        try:
            req = requests.get(url, headers = header, timeout = timeout)
            req.encoding = 'utf-8'  # decode as UTF-8, otherwise the text may be garbled
            break
        # On a network error, print a numeric tag and the error, wait a random number of seconds, then retry
        except socket.timeout as e:
            print('3', e)
            time.sleep(random.choice(range(8, 15)))
        except socket.error as e:
            print('4', e)
            time.sleep(random.choice(range(20, 60)))
        except http.client.BadStatusLine as e:
            print('5', e)
            time.sleep(random.choice(range(30, 80)))
        except http.client.IncompleteRead as e:
            print('6', e)
            time.sleep(random.choice(range(5, 15)))

    return req.text

def get_data(html_text):
    final = []
    bs = BeautifulSoup(html_text, "html.parser")    # create the BeautifulSoup object
    body = bs.body  # get the body
    data = body.find('div',{'id':'7d'}) # find the div whose id is 7d
    ul = data.find('ul')
    li = ul.find_all('li')  # get all li tags

    for day in li:
        temp = []
        date = day.find('h1').string    # e.g. 1日（今天）, "the 1st (today)"
        temp.append(date)
        inf = day.find_all('p')
        temp.append(inf[0].string)  # e.g. 多云, "cloudy"

        if inf[1].find('span') is None:
            temp_highest = None # the high temperature may be missing
        else:
            temp_highest = inf[1].find('span').string
            temp_highest = temp_highest.replace('℃','') # in the evening an extra character is shown; remove it for consistency

        temp_lowest = inf[1].find('i').string

        temp.append(temp_highest)
        temp.append(temp_lowest)

        final.append(temp)

    return final


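def write_data(data, name):
    file_name = name

    # Write as UTF-8 explicitly, because UTF8_2_GBK later reads this file as UTF-8
    with open(file_name, 'a', encoding='utf-8', errors='ignore', newline='') as f:
        f_csv = csv.writer(f)
        # Header row: next seven days, weather, high temperature, low temperature
        f_csv.writerow(['未来七天','天气','最高气温','最低气温'])
        f_csv.writerows(data)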

# Helpers for converting the file encoding: read and write
def ReadFile(filePath, encoding):
    with codecs.open(filePath, "r", encoding) as f:
        return f.read()

def WriteFile(filePath, u, encoding):
    with codecs.open(filePath, "w", encoding) as f:
        f.write(u)

# Convert the file's storage encoding
def GBK_2_UTF8(src, dst):
    content = ReadFile(src, encoding='gbk')
    WriteFile(dst, content, encoding='utf-8')

def UTF8_2_GBK(src, dst):
    content = ReadFile(src, encoding='utf-8')
    WriteFile(dst, content, encoding='gbk')

if __name__ == '__main__':
    datestr = date.today().strftime('%m-%d-%y')
    dir = os.getcwd()
    filepath = dir + "/weather{}.csv".format(datestr)
    filepath_gbk = dir + "/weather{}-{}.csv".format(datestr,'gbk')

    url = 'http://www.weather.com.cn/weather/101190401.shtml'
    html = get_content(url)
    res_data = get_data(html)
    write_data(res_data,filepath)

    src = filepath
    dst = filepath_gbk
    UTF8_2_GBK(src, dst)

References:

"python网络爬虫入门（一）——第一个pyhton爬虫实例" (Python Web Crawler Primer, Part 1: A First Python Crawler Example)

Beautiful Soup documentation