Python Web Scraping 02: The urllib Module

Category: Other, posted 2018-06-19 14:01:41

Modules in the urllib package

urllib.request: opens and reads URLs
urllib.error: defines the exceptions raised by urllib.request; catch them with try/except
urllib.parse: functions for parsing URLs
urllib.robotparser: parses robots.txt files
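As a quick illustration of the two helper modules, here is a minimal sketch of urllib.parse and urllib.robotparser that runs without any network access. The example.com URLs, the query parameters, and the robots.txt rules are all made up for the example and do not come from a real site.

```python
from urllib import parse, robotparser

# Split a URL into its components (hypothetical URL).
url = "https://example.com/articles/53580569?page=2"
parts = parse.urlparse(url)
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Build a percent-encoded query string from a dict (made-up parameters).
query = parse.urlencode({"q": "urllib 爬虫", "page": 2})
print(query)

# Feed robots.txt rules to the parser as in-memory lines (hypothetical rules).
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
blocked = rp.can_fetch("*", "https://example.com/private/a.html")
allowed = rp.can_fetch("*", "https://example.com/public/a.html")
print(blocked, allowed)
```

In a real crawler the rules would normally be loaded from the site itself with set_url() and read() rather than from a hard-coded list.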

Example v1

from urllib import request

# Fetch a web page with urllib.request and print its content
url = "https://blog.csdn.net/xidianliutingting/article/details/53580569"
# Open the URL; the call returns a response object
rsp = request.urlopen(url)
# Read the response body; read() returns bytes
html = rsp.read()
# Decode the bytes to get a str (decode() defaults to UTF-8)
htm = html.decode()
print(htm)
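Example v1 assumes the request succeeds. The sketch below wraps the same urlopen call in try/except using the exceptions from urllib.error. The function name fetch_html, the 10-second timeout, and the choice to return None on failure are assumptions made for illustration, not part of the original example.

```python
from urllib import error, request


def fetch_html(url, timeout=10):
    """Return the decoded response body, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as rsp:
            # Undecodable bytes become U+FFFD instead of raising.
            return rsp.read().decode("utf-8", errors="replace")
    except error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", e.code, e.reason)
    except error.URLError as e:
        # Covers DNS failures, refused connections, missing local files, etc.
        print("URL error:", e.reason)
    return None
```

A caller can then write `page = fetch_html(url)` and check for None before parsing.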

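Handling page encoding
1. chardet can detect a page's encoding automatically, but its guess may be wrong
2. It is a third-party package and must be installed separately: pip install chardet
3. With Anaconda, it can be installed with conda install chardet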

Example v2

"""
Download a page with request
and detect its encoding automatically
"""
import chardet
from urllib import request

url = "http://stock.eastmoney.com/news/1407,20180616889634253.html"
rsp = request.urlopen(url)
html = rsp.read()
# Let chardet guess the encoding
cs = chardet.detect(html)
# chardet reports "encoding": None when it cannot decide, so fall back to UTF-8
htm = html.decode(cs.get("encoding") or "utf-8")
print(htm)
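Because an automatic guess can be wrong, one possible complement is to try the charset declared by the server first and only then fall back to a short list of candidates. The sketch below uses only the standard library. The helper name decode_body and the fallback list ("utf-8", then "gb18030") are assumptions chosen for illustration; the declared charset would typically come from rsp.headers.get_content_charset().

```python
def decode_body(body, declared=None, fallbacks=("utf-8", "gb18030")):
    """Decode bytes, trying the declared charset first, then the fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for enc in candidates:
        try:
            return body.decode(enc)
        except (LookupError, UnicodeDecodeError):
            # LookupError: unknown codec name; UnicodeDecodeError: wrong guess.
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return body.decode("utf-8", errors="replace")


# Typical use with a response object (not executed here):
# with request.urlopen(url) as rsp:
#     text = decode_body(rsp.read(), rsp.headers.get_content_charset())
```

The order of the fallbacks matters: a strict codec such as UTF-8 should come before permissive ones, since a permissive codec may "succeed" on bytes it does not actually describe.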

Reposted from blog.csdn.net/qq_15902869/article/details/80721332
