Python Web Scraping Reading Notes (1)

1. Downloading a URL with the urllib2 module

import urllib2
def download(url):
    # return the raw HTML of the page at the given URL
    return urllib2.urlopen(url).read()
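
A minimal usage sketch (example.com is just a placeholder; any reachable URL works):

html = download('http://example.com')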

2. Catching exceptions

When a download error occurs, this version of the function catches the exception and returns None instead of crashing.

import urllib2
def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
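
For example, requesting a page that does not exist now yields None rather than an uncaught exception (the URL and output below are illustrative; the exact error message depends on the server):

html = download('http://example.com/no-such-page')
# Downloading: http://example.com/no-such-page
# Download error: Not Found
# html is None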

3. Retrying downloads

4xx errors occur when there is a problem with the request, while 5xx errors occur when there is a problem on the server side. So we only need to make sure the download function retries when a 5xx error occurs. Below is a new version of the code with retry support.

import urllib2
def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry only on 5xx (server) errors
                return download(url, num_retries - 1)
    return html
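
To watch the retries happen, we can point the function at a test endpoint that always answers with a 500 status; httpstat.us is one such service (assuming it is reachable):

download('http://httpstat.us/500')
# Downloading: http://httpstat.us/500
# Download error: Internal Server Error
# Downloading: http://httpstat.us/500
# Download error: Internal Server Error
# Downloading: http://httpstat.us/500
# Download error: Internal Server Error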

4. Setting a user agent

Set a default user agent of 'wswp' (short for Web Scraping with Python):

import urllib2
def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry on 5xx errors, keeping the same user agent
                return download(url, user_agent, num_retries - 1)
    return html
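
The default agent can be overridden per call; the agent string below is just an illustration:

html = download('http://example.com', user_agent='MyCrawler/1.0')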
