Python Web Scraping Study Notes: A First Look at Web Scrapers (1)

Fetching the HTML of a web page

In Python:

from urllib.request import urlopen
html = urlopen("http://www.pythonscraping.com/pages/page1.html")

print(html.read())

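The urllib library

urllib is part of Python's standard library. It provides functions for requesting data over the network, handling cookies, and even changing metadata such as request headers and the user agent.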
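As a rough illustration of changing request metadata, the sketch below builds a request that carries a custom User-Agent header. The helper name build_request and the user-agent string are made up for this example; Request and get_header come from urllib.request.

```python
from urllib.request import Request

# Hypothetical user-agent string, chosen only for illustration.
USER_AGENT = "my-study-crawler/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.pythonscraping.com/pages/page1.html")

# urllib stores header names in capitalized form, hence "User-agent".
print(req.get_header("User-agent"))
```

Nothing is sent until the request is passed to urlopen, for example urlopen(req).read().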

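The BeautifulSoup library

BeautifulSoup formats and organizes messy web content by locating HTML tags, and presents the XML structure as easy-to-use Python objects.

Running BeautifulSoup: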

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
bs0bj = BeautifulSoup(html.read(), "html.parser")
print(bs0bj.h1)
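To try the same tag navigation without a network request, a small inline document is enough. The HTML string below is invented for this sketch and only stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a downloaded page (made up for this example).
SAMPLE_HTML = """
<html>
  <head><title>A Useful Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <div>Lorem ipsum dolor sit amet.</div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Each tag is reachable as an attribute of one of its ancestors.
heading = soup.body.h1
print(heading)             # the whole tag, markup included
print(heading.name)        # the tag name
print(heading.get_text())  # only the text inside the tag
```

soup.h1, soup.body.h1 and soup.html.body.h1 all reach the same tag here, because attribute access returns the first matching descendant.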

Connecting reliably (handling the related exceptions)

html = urlopen("http://www.pythonscraping.com/pages/page1.html")

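Two things can go wrong on this line:

The page does not exist on the server (or an error occurs while retrieving it)
The server does not exist

In the first case, urlopen raises an HTTPError. The error may be 404 Page Not Found, 500 Internal Server Error, and so on.

Handling the exception: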

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    # Return None, stop the program, or fall back to another plan
    print(e)
else:
    # The program continues. If the except block returns or breaks,
    # this else block is never reached.
    bs0bj = BeautifulSoup(html.read(), "html.parser")
    print(bs0bj)
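An HTTPError also exposes the status code, so the handler can react differently to a 404 and a 500. The sketch below is one possible way to do that. The function name describe_http_error, the "skip" and "retry" policies, and the missing.html URL are assumptions made for illustration. The error object is built by hand so the example runs offline.

```python
from urllib.error import HTTPError


def describe_http_error(error):
    """Turn an HTTPError into a short description of what to do next."""
    code = error.code
    if code == 404:
        return "page not found, skip this URL"
    if 500 <= code < 600:
        return "server error, retry later"
    return "HTTP error {}: {}".format(code, error.reason)


# Hand-built error for a hypothetical missing page; urlopen would raise
# an object of the same type for a real 404 response.
example = HTTPError(
    "http://www.pythonscraping.com/pages/missing.html",
    404, "Not Found", None, None,
)
print(describe_http_error(example))
```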

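In the second case the server cannot be found at all (the site is down or the URL is mistyped).

urlopen does not return None in this case. It raises a URLError, so the check is an except clause rather than a test for None. Because HTTPError is a subclass of URLError, an HTTPError clause has to come before it when both are handled separately.

from urllib.error import URLError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except URLError as e:
    print("URL is not found")
else:
    pass  # the program continues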

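Even if neither problem occurs and the page is retrieved successfully, an exception can still arise when the content is not what we expect. It is worth adding a check that the tag really exists: accessing a tag that does not exist makes BeautifulSoup return None, and accessing a child tag on that None raises an AttributeError.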

Running the following prints None:

print(bs0bj.nonExistentTag)

Going one level further:

print(bs0bj.nonExistentTag.someTag)

raises:

AttributeError: 'NoneType' object has no attribute 'someTag'

To avoid this, check for both situations:

V1:

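try:
    badContent = bs0bj.nonExistingTag.anotherTag
except AttributeError as e:
    # The outer tag was missing, so the chained access failed on None
    print("Tag was not found")
else:
    # The outer tag exists, but the inner one may still be None
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)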
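The same two checks can be folded into a small helper that walks the tags one level at a time and stops at the first missing one. This is only a sketch: find_nested is a name invented here, and it relies on the find method of BeautifulSoup tags, which returns None when nothing matches.

```python
from bs4 import BeautifulSoup


def find_nested(root, *tag_names):
    """Follow tag_names one level at a time; return None if any is missing."""
    node = root
    for name in tag_names:
        if node is None:
            return None
        # find() returns the first matching descendant, or None.
        node = node.find(name)
    return node


# Inline document made up for this example.
soup = BeautifulSoup(
    "<html><body><h1>An Interesting Title</h1></body></html>",
    "html.parser",
)

print(find_nested(soup, "body", "h1"))
print(find_nested(soup, "nonExistingTag", "anotherTag"))
```

The caller then needs a single "is None" test instead of a try block plus a None check.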

V2:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Covers both a missing page (HTTPError) and a missing server (URLError)
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None
    # Covers a page whose body tag is missing
    try:
        bs0bj = BeautifulSoup(html.read(), "html.parser")
        title = bs0bj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

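When writing a scraper, think about the overall shape of the code so that it both catches exceptions and stays readable.

If you want to reuse a lot of code, general-purpose functions such as getSiteHTML and getTitle, with thorough exception handling built in, make fast and reliable scraping much easier.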
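The notes name getSiteHTML but do not show it, so the version below is only a guess at what such a helper might look like. Its signature, return value, and messages are assumptions.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def getSiteHTML(url):
    """Return the raw bytes found at url, or None if the fetch fails."""
    try:
        response = urlopen(url)
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it is caught first.
        print("Server returned HTTP {} for {}".format(e.code, url))
        return None
    except URLError as e:
        print("Could not reach {}: {}".format(url, e.reason))
        return None
    return response.read()
```

A caller such as getTitle could then pass the returned bytes to BeautifulSoup and keep only the tag checks for itself, for example html = getSiteHTML("http://www.pythonscraping.com/pages/page1.html") followed by a test for None.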
