Python Web Scraping Reading Notes (1)

1. Downloading a URL with the urllib2 module

import urllib2
def download(url):
    # return the raw HTML of the page at the given URL
    return urllib2.urlopen(url).read()
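
A minimal usage sketch (example.com is just a placeholder; any reachable URL works):

html = download('http://example.com')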

2. Catching exceptions

When a download error occurs, this version of the function catches the exception and returns None instead of crashing.

import urllib2
def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
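
For example, requesting a page that does not exist now yields None rather than an uncaught exception (the URL and output below are illustrative; the exact error message depends on the server):

html = download('http://example.com/no-such-page')
# Downloading: http://example.com/no-such-page
# Download error: Not Found
# html is None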

3. Retrying downloads

4xx errors occur when there is a problem with the request, while 5xx errors occur when there is a problem on the server side. So we only need to make sure the download function retries when a 5xx error occurs. Below is a new version of the code with retry support.

import urllib2
def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry only on 5xx (server) errors
                return download(url, num_retries - 1)
    return html
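
To watch the retries happen, we can point the function at a test endpoint that always answers with a 500 status; httpstat.us is one such service (assuming it is reachable):

download('http://httpstat.us/500')
# Downloading: http://httpstat.us/500
# Download error: Internal Server Error
# Downloading: http://httpstat.us/500
# Download error: Internal Server Error
# Downloading: http://httpstat.us/500
# Download error: Internal Server Error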

4. Setting a user agent

Set a default user agent of 'wswp' (short for Web Scraping with Python):

import urllib2
def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry on 5xx errors, keeping the same user agent
                return download(url, user_agent, num_retries - 1)
    return html
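
The default agent can be overridden per call; the agent string below is just an illustration:

html = download('http://example.com', user_agent='MyCrawler/1.0')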
