1.2 Handling Common Exceptions

Consider the first line of code after the import statement:

html = urlopen("http://www.baidu.com")

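Two things can commonly go wrong on this line:

The page does not exist on the server (or an error occurred while retrieving it)
The server does not exist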

In the first case, the server returns an HTTP error, such as "404 Page Not Found" or "500 Internal Server Error". In all such cases, the urlopen function raises an HTTPError exception, which can be handled like this:

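try:
    html = urlopen("http://www.baidu.com")
except HTTPError as e:
    print(e)
    # Return None, stop the program, or fall back to another plan here.
else:
    # The program continues. Note: if the except block above already returns
    # or breaks, the else clause is unnecessary, because the code after the
    # try statement will not run in that case.
    print("The page was fetched.")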
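
An HTTPError also carries the status code and reason sent by the server, so the handler can react differently to a missing page and to a server fault. The sketch below shows one way to do this. The names describe_http_error and fetch, and the opener parameter, are assumptions introduced for illustration and are not part of urllib; opener exists only so the function can be tried without a network connection.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def describe_http_error(error):
    """Turn an HTTPError into a short human-readable message."""
    if error.code == 404:
        return "Page not found (404)"
    if 500 <= error.code < 600:
        return f"Server-side error ({error.code})"
    return f"HTTP error {error.code}: {error.reason}"


def fetch(url, opener=urlopen):
    """Return the response for url, or None if the server answers with an HTTP error.

    opener is a hypothetical injection point (defaults to urlopen).
    """
    try:
        return opener(url)
    except HTTPError as e:
        print(describe_http_error(e))
        return None
```

Returning None here is one design choice among several; re-raising the exception or retrying on 5xx codes would be equally reasonable, depending on the scraper.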

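If the server cannot be reached at all (the URL is mistyped, or the host is down), urlopen does not return None; it raises a URLError, which is imported from urllib.error just like HTTPError. Because HTTPError is a subclass of URLError, the HTTPError clause must come first:

try:
    html = urlopen("http://www.baidu.com")
except HTTPError as e:
    print(e)
except URLError as e:
    print("The server could not be found.")
else:
    print("The page was fetched; the program continues.")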
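
If a None check at the call site is preferred, a small wrapper can convert both failure modes into None. This is only a sketch: open_or_none and its opener parameter are hypothetical names, not part of urllib, and opener is there solely so the function can be exercised offline.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def open_or_none(url, opener=urlopen):
    """Return the response object, or None if the request fails."""
    try:
        return opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
    except URLError as e:
        # The server could not be reached at all (bad domain, refused connection).
        print("Server not reachable:", e.reason)
    return None
```

With this wrapper, `html = open_or_none(url)` followed by `if html is None:` behaves the way a None check is expected to.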

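Even when the page is retrieved successfully, its content may not be what we expect, and that can still cause exceptions. Each time you access a tag on a BeautifulSoup object, it is wise to check that the tag actually exists. If the tag does not exist, BeautifulSoup returns None. Accessing a child tag on that None, however, raises an AttributeError.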

    print(bs0bj.nonExistentTag)

This expression evaluates to None, so it is important to check for and handle that value. Accessing a child tag on it directly raises an exception:

    print(bs0bj.nonExistentTag.someTag)
    # This raises an exception:
    # AttributeError: 'NoneType' object has no attribute 'someTag'

The following code is a simple way to guard against both situations:

try:
    badContent = bs0bj.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found.")
else:
    if badContent is None:
        print("Tag was not found.")
    else:
        print(badContent)
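
The same check can be tried on a small inline page, with no network involved. In the sketch below, SAMPLE_HTML and the helper first_text are made up for illustration; only BeautifulSoup, find, and get_text come from the bs4 library. The helper stops at the first missing tag and returns None instead of letting an AttributeError escape.

```python
from bs4 import BeautifulSoup

# A tiny inline page, made up for illustration.
SAMPLE_HTML = "<html><body><h1>Hello</h1></body></html>"


def first_text(soup, *tag_names):
    """Follow a chain of tag names; return the final tag's text, or None if any tag is missing.

    Note that find() searches all descendants, not only direct children.
    """
    node = soup
    for name in tag_names:
        node = node.find(name)
        if node is None:
            return None
    return node.get_text()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(first_text(soup, "body", "h1"))           # Hello
print(first_text(soup, "body", "table", "tr"))  # None
```

Checking after every step costs a few extra lines, but it makes the failure point explicit rather than relying on a try/except around the whole chain.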

Here is a complete example:

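from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Return the page's first <h1> tag, or None if anything goes wrong.
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # HTTP error status, or the server could not be reached.
        return None
    try:
        # Name the parser explicitly to avoid BeautifulSoup's parser warning.
        bs0bj = BeautifulSoup(html.read(), "html.parser")
        title = bs0bj.body.h1
    except AttributeError as e:
        # The page has no <body> tag.
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found.")
else:
    print(title)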
