Getting Started with Python Crawler: URLError Exception Handling

Hello everyone. This section covers URLError and HTTPError, and how to handle them.

1. URLError

Let's first look at the possible causes of URLError:

· No network connection, i.e. the machine cannot access the Internet

· Cannot connect to the specific server

· The server does not exist

In our code, we need to wrap the call in a try-except statement to catch the corresponding exception. Here is an example to get a feel for it:

Python

import urllib2

request = urllib2.Request('http://www.zhimaruanjian.com')
try:
    urllib2.urlopen(request)
except urllib2.URLError, e:
    print e.reason

We used the urlopen method to access a non-existent URL, and the result is as follows:

Python

[Errno 11004] getaddrinfo failed

It shows that the error number is 11004 and the reason is getaddrinfo failed, meaning the host name could not be resolved.

2. HTTPError

HTTPError is a subclass of URLError. When you make a request with the urlopen method, the server responds with a response object that carries a numeric "status code". For example, if the response is a "redirect" that tells the client to fetch the document from a different address, urllib2 handles that for you.
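As a quick illustration (not part of the original tutorial), on a successful request the response object exposes that status code through getcode(); geturl() and info() are also available. The URL below is just a placeholder.

Python

import urllib2

# Placeholder URL; on success urlopen returns a response object
response = urllib2.urlopen('http://www.example.com')
print response.getcode()   # numeric status code, e.g. 200
print response.geturl()    # final URL after any redirects urllib2 followed
print response.info()      # the response headers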

For responses that cannot be handled this way, urlopen raises an HTTPError corresponding to the status in question. An HTTP status code indicates the status of the response returned by the server. The common status codes are summarized below (a rough sketch of how a crawler might act on some of them follows the list):

100: Continue. The client should continue sending the remainder of the request, or ignore this response if the request has already completed.

101: Switching Protocols. After sending the blank line that ends this response, the server will switch to the protocols defined in the Upgrade header. This should only be done when switching to the new protocol is more beneficial.

102: Processing. A status code introduced by the WebDAV extension (RFC 2518), indicating that processing will continue.

200: OK. The request succeeded. Processing method: get the content of the response and process it.

201: Created. The request completed and resulted in the creation of a new resource, whose URI is available in the entity of the response. Processing method: rarely encountered in a crawler.

202: Accepted. The request has been accepted, but processing has not yet completed. Processing method: block and wait.

204: No Content. The server fulfilled the request but returned no new information; if the client is a user agent, it need not update its document view. Processing method: discard.

300: Multiple Choices. This status code is not used directly by HTTP/1.0 applications and serves only as the default interpretation for 3xx responses; several versions of the requested resource are available. Processing method: handle it in the program if possible, otherwise discard.

301: Moved Permanently. The requested resource has been assigned a permanent URL, and future access should go through that URL. Processing method: redirect to the assigned URL.

302: Found. The requested resource temporarily resides at a different URL. Processing method: redirect to the temporary URL.

304: Not Modified. The requested resource has not been updated. Processing method: discard.

400: Bad Request. The request is malformed. Processing method: discard.

401: Unauthorized. Processing method: discard.

403: Forbidden. Processing method: discard.

404: Not Found. Processing method: discard.

500: Internal Server Error. The server encountered an unexpected condition that prevented it from completing the request; this generally means a bug in the server-side code.

501: Not Implemented. The server does not support a function required by the request, for example when it does not recognize the request method and cannot support it for any resource.

502: Bad Gateway. A server acting as a gateway or proxy received an invalid response from an upstream server while trying to fulfil the request.

503: Service Unavailable. The server is temporarily unable to process the request due to maintenance or overload; the condition is temporary and service will return after a period of time.
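As a rough sketch (not from the original text) of how these "processing methods" might translate into crawler code, the helper below retries on temporary server trouble and discards client-side errors; the function name, the retry policy, and the back-off are all illustrative assumptions.

Python

import time
import urllib2

def fetch_with_retry(url, retries=3):
    # Hypothetical helper mapping some of the status codes above onto actions.
    # urllib2 already follows 3xx redirects, so they never reach this handler.
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url).read()    # 2xx: take the content and process it
        except urllib2.HTTPError, e:
            if e.code in (502, 503):              # temporary trouble: wait and retry
                time.sleep(2 ** attempt)
                continue
            if e.code in (400, 401, 403, 404):    # client-side errors: discard
                return None
            raise                                 # anything else: let the caller decide
    return None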

When an HTTPError instance is raised, it carries a code attribute, which is the error number sent by the server.
Because urllib2 handles redirects for you (codes beginning with 3 are processed) and numbers in the 100-299 range indicate success, you will only ever see error numbers in the 400-599 range.

Let's write an example to get a feel for this. The exception we catch is HTTPError, which carries a code attribute, the error number; we also print the reason attribute, which it inherits from its parent class, URLError.

Python

import urllib2

req = urllib2.Request('http://www.zhimaruanjian.com')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
    print e.reason

The output is as follows:

Python

403
Forbidden

The error number is 403 and the reason is Forbidden, which means the server refused access.

As we know, the parent class of HTTPError is URLError. As a matter of good practice, the parent-class exception handler should come after the subclass handler, so that anything the subclass handler misses can still be caught by the parent. The code above can therefore be rewritten as follows:

Python

import urllib2

req = urllib2.Request('http://www.zhimaruanjian.com')
try:
    urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
except urllib2.URLError, e:
    print e.reason
else:
    print "OK"

If an HTTPError is caught, its code is printed and the URLError handler is not run. If the exception is not an HTTPError, the URLError handler catches it and prints the reason.

Alternatively, we can use hasattr to check for the attributes before using them. The code can be rewritten as follows:

Python

import urllib2

req = urllib2.Request('http://www.zhimaruanjian.com')
try:
    urllib2.urlopen(req)
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
else:
    print "OK"

We first check which attributes the exception actually has, so that printing a missing attribute cannot itself raise an error.
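For readers on Python 3, where urllib2 has been split into urllib.request and urllib.error, the same hasattr pattern could look like the sketch below. This is an adaptation for reference only, not part of the original urllib2 tutorial.

Python

import urllib.request
import urllib.error

req = urllib.request.Request('http://www.zhimaruanjian.com')
try:
    urllib.request.urlopen(req)
except urllib.error.URLError as e:
    # HTTPError is still a subclass of URLError, so one handler covers both;
    # check for the attributes before printing, as in the urllib2 version.
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
else:
    print("OK")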

That wraps up the introduction to URLError and HTTPError and the corresponding ways of handling them. Keep it up, everyone!
