Python 3: Scraping Web Pages

Fetching web pages in Python 3 uses the urllib.request module, together with urllib.parse for splitting URLs and urllib.error for the exceptions a request can raise.
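
For reference, the simplest possible fetch is a single urlopen call. The sketch below is a minimal illustration (example.com stands in as a placeholder URL):

import urllib.request

# Fetch the page; read() returns raw bytes, decode() turns them into text
html = urllib.request.urlopen('http://example.com').read().decode('utf-8')
print(html[:200])  # first 200 characters of the page

The fuller helper below adds a custom User-Agent, optional proxy support, and automatic retries on server errors.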

import urllib.error
import urllib.parse
import urllib.request


def download(url, free_proxy=None, user_agent='test', num_retries=2, data=None):
    print("download start", url)
    # Set the User-Agent header; the default value is 'test'
    headers = {"User-Agent": user_agent}
    # Attach the headers to the request
    request = urllib.request.Request(url, data, headers=headers)
    # Build an opener
    opener = urllib.request.build_opener()
    # If a proxy was supplied, wire it into the opener
    if free_proxy:
        # Map the URL's scheme (http/https) to the proxy address
        proxy_params = {urllib.parse.urlparse(url).scheme: free_proxy}
        # Add the proxy handler to the opener
        opener.add_handler(urllib.request.ProxyHandler(proxy_params))
    try:
        # open() the page through the opener and read() the raw bytes
        html = opener.open(request).read()
    # Catch download failures (URLError also covers HTTPError)
    except urllib.error.URLError as e:
        # Report why the download failed
        print("download error", e.reason)
        html = None
        # Only retry while attempts remain
        if num_retries > 0:
            # Only retry on 5xx server errors; 4xx client errors won't fix themselves
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Recurse with one fewer retry; the result is already decoded
                return download(url, free_proxy, user_agent, num_retries - 1, data)
    # Without decode('utf-8') the result would be raw bytes (b'...');
    # return None if the download failed
    return html.decode('utf-8') if html else None
# Example URL
url = 'http://www.thefaceshop.com.cn/store-locations'
# Print the downloaded page
print(download(url))
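
To route the request through a proxy, pass its address as free_proxy; to watch the retry logic fire, point the function at a URL that always answers with a 5xx status. Both calls below are sketches: the proxy address is hypothetical, and httpbin.org is assumed to be reachable as a test service.

# '127.0.0.1:8080' is a hypothetical proxy address for illustration only
html = download(url, free_proxy='127.0.0.1:8080')

# This endpoint always responds with HTTP 500, so download() retries
# num_retries times and finally returns None
print(download('http://httpbin.org/status/500'))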

Run result: the script prints the fetched page's HTML (shown as a screenshot in the original post).

Reposted from blog.csdn.net/huangyanli0808/article/details/77683967