Using the Basic Libraries

urllib

Sending Requests

urlopen
Request
Handler
OpenerDirector

Authentication
Cookie Handling
Proxy Settings

Exception Handling
Parsing

urlparse()
urlunparse()
urlsplit and urlunsplit
urljoin
urlencode()
parse_qs() and parse_qsl()
quote and unquote
The Robots Protocol

urllib

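urllib is Python's built-in HTTP request library. It contains four modules:

request: the most basic HTTP request module, used to simulate sending requests. Just as you would type a URL into a browser and press Enter, you pass a URL and any extra parameters to the library's functions to reproduce that process.
error: the exception-handling module. If a request fails, we can catch these exceptions and then retry or take other action so that the program does not terminate unexpectedly.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: mainly used to read a site's robots.txt file and decide which pages may be crawled and which may not. It is used comparatively rarely.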

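Sending Requests

With urllib's request module we can easily send a request and get the response.

urlopen

The urllib.request module provides the most basic way to construct HTTP requests. It can simulate the process of a browser making a request, and it also handles authentication, redirection, browser cookies, and more.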

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print(type(response))
print(response.read().decode('utf-8'))

Output:

<class 'http.client.HTTPResponse'>
<html>
<head>
    <script>
        location.replace(location.href.replace("https://","http://"));
    </script>
</head>
<body>
    <noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
[Finished in 0.1s]

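The response contains the page source of the Baidu home page. Its type is http.client.HTTPResponse. An HTTPResponse object mainly provides the methods read(), readinto(), getheader(name), getheaders(), and fileno(), and the attributes msg, version, status, reason, debuglevel, and closed.
To inspect the status code and the response headers: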

print(f"status:{response.status}")
print(f"headers:{response.getheaders()}")
print(f"server:{response.getheader('Server')}")

Result:

status:200
headers:[('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'), ('Content-Length', '227'), ('Content-Type', 'text/html'), ('Date', 'Wed, 14 Nov 2018 09:14:55 GMT'), ('Etag', '"5be10158-e3"'), ('Last-Modified', 'Tue, 06 Nov 2018 02:50:00 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Pragma', 'no-cache'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BD_NOT_HTTPS=1; path=/; Max-Age=300'), ('Set-Cookie', 'BIDUPSID=182555CB247734E41DC41EAD4E3D44A8; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1542186895; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Strict-Transport-Security', 'max-age=0'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close')]
server:BWS/1.1

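Parameters of urlopen():
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The data parameter is optional. If you pass it, it must be a bytes object, so a string has to be converted first, for example with bytes(). In addition, once data is passed, the request method is no longer GET but POST. (The cafile, capath, and cadefault parameters were deprecated in Python 3.6 and removed in Python 3.13; use context instead.)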

import urllib.request
import urllib.parse

# urlencode() converts the parameters to an ASCII string; bytes() converts that to a byte stream
data = bytes(urllib.parse.urlencode({'word': 'Nice to meet you'}), encoding='utf-8')
# Test a POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "Nice to meet you"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "21", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.6"
  }, 
  "json": null, 
  "origin": "115.205.15.135", 
  "url": "http://httpbin.org/post"
}

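The parameters we passed appear in the form field, and the Content-Type header also indicates a form submission. This shows that the request simulated a form submission and sent the data via POST.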
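The GET/POST rule described earlier can be checked offline. The following is a minimal sketch, assuming only that the httpbin.org URLs serve as placeholders; nothing is sent, the Request objects are merely inspected:

```python
from urllib import parse, request

# Encode a dict as an application/x-www-form-urlencoded body (bytes).
form = {'word': 'Nice to meet you'}
data = parse.urlencode(form).encode('utf-8')

# No request is sent here; we only inspect the Request objects.
get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=data)

print(data)                   # the encoded body
print(get_req.get_method())   # GET, because data is None
print(post_req.get_method())  # POST, because data is present
```

The encoded body is 21 bytes long, which matches the Content-Length header echoed in the output above.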
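The timeout parameter sets a timeout in seconds. If the request exceeds this time without receiving a response, an exception is raised. This lets you skip a page when it does not respond for a long time.
Using a try/except statement: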

import socket
from urllib import error, request

try:
    response = request.urlopen('http://httpbin.org/get', timeout=0.1)
except error.URLError as e:
    # e.reason is a socket.timeout instance when the connection timed out
    if isinstance(e.reason, socket.timeout):
        print(e.reason)

If the server does not respond within 0.1 s, the exception is caught and its reason is printed:

timed out
[Finished in 0.9s]
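The idea of skipping a page that does not answer in time could be wrapped in a small helper. This is only a sketch under stated assumptions: fetch_or_skip, its opener parameter, and always_times_out are hypothetical names introduced here so that the timeout path can be exercised without network access.

```python
import socket
from urllib import error, request


def fetch_or_skip(url, timeout=1.0, opener=request.urlopen):
    """Return the response body as bytes, or None if the request timed out.

    `opener` is a hypothetical injection point so the function can be
    exercised without network access; it defaults to urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except error.URLError as e:
        # A connect timeout is wrapped in URLError with a timeout as reason.
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # A timeout while reading the body is raised directly.
        return None


def always_times_out(url, timeout):
    """Stand-in opener that simulates a server which never answers."""
    raise error.URLError(socket.timeout('timed out'))


result = fetch_or_skip('http://httpbin.org/get', timeout=0.1,
                       opener=always_times_out)
print(result)
```

Errors other than timeouts are re-raised on purpose, so a caller can still tell a slow page from a broken one.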

Request

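urlopen() can send the most basic requests, but its few simple parameters are not enough to build a complete request. If the request needs headers or other information, use the more powerful Request class: class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: the URL to request. It is the only required parameter; all others are optional.
data: the data to send. It must be of type bytes; if it is a dict, encode it first with urlencode() from the urllib.parse module.
headers: the request headers, as a dict. Headers can also be added afterwards by calling add_header(). The most common use is to change User-Agent to disguise the client as a browser.
origin_req_host: the host name or IP address of the party that originated the request.
unverifiable: whether the request is unverifiable; the default is False. A request is unverifiable when the user had no option to approve it, for example an image fetched automatically while loading an HTML page.
method: the HTTP method, such as GET, POST, or PUT. If omitted, it is GET when data is None and POST otherwise.

Build a request according to these parameters: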

from urllib import request, parse

# URL to request
url = 'http://httpbin.org/post'
# Request headers: client (User-Agent) information and the host that serves the resource
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ',
    'Host': 'httpbin.org'
}

method = 'POST'
# The form data to submit; it must be converted to bytes
user = {
    'username': 'Ulysses',
}
data = bytes(parse.urlencode(user), encoding='utf-8')

# Build the request. The headers can be passed to the constructor:
# req = request.Request(url, headers=headers, data=data, method=method)
# ...or added afterwards with add_header():
req = request.Request(url, data=data, method=method)
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) ')

response = request.urlopen(req)
print(response.read().decode('utf-8'))

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "username": "Ulysses"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "16", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"
  }, 
  "json": null, 
  "origin": "115.205.15.135", 
  "url": "http://httpbin.org/post"
}

The response echoes the submitted form data, and the User-Agent header has been changed as well.
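A Request object can also be inspected before it is sent, which is a convenient way to check what urlopen() would transmit. The sketch below makes no network call; it passes the headers to the constructor instead of calling add_header(), and the URL is the same placeholder as above.

```python
from urllib import parse, request

data = parse.urlencode({'username': 'Ulysses'}).encode('utf-8')
req = request.Request(
    'http://httpbin.org/post',
    data=data,
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'},
    method='POST',
)

# Request stores header names via str.capitalize(), hence 'User-agent'.
print(req.full_url)
print(req.get_method())
print(req.get_header('User-agent'))
print(req.data)
```

Note the lookup key 'User-agent': because of the capitalization, get_header('User-Agent') would return None here.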

Handler

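Handlers are the processor classes in the urllib.request module. Each one takes care of one aspect of an HTTP request, such as authentication, cookies, or proxies. BaseHandler is the base class of all handlers and provides the most basic methods, for example default_open() and protocol_request(). Some of its subclasses, and a related helper class:

HTTPDefaultErrorHandler: handles HTTP error responses; errors are raised as HTTPError exceptions.
HTTPRedirectHandler: handles redirects.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets proxies. If no proxies are given, they are read from environment variables such as http_proxy.
HTTPPasswordMgr: manages passwords. It is not a handler itself; it keeps the table of user names and passwords that the authentication handlers use.
HTTPBasicAuthHandler: handles authentication.

For the other subclasses and methods, see the BaseHandler documentation.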

OpenerDirector

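The OpenerDirector class opens URLs through BaseHandler instances chained together. It manages the chain of handlers and recovery from errors. urlopen() itself calls the open() method of an OpenerDirector, and build_opener() builds an opener from the handlers you give it.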
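To make the handler chain visible, the following sketch builds an opener with one extra handler and lists the handlers it ends up with. It assumes nothing beyond the standard library; the User-Agent value is just an example, and no request is sent.

```python
import urllib.request
from http.cookiejar import CookieJar

# Add one extra handler on top of the defaults that build_opener installs.
cookie_handler = urllib.request.HTTPCookieProcessor(CookieJar())
opener = urllib.request.build_opener(cookie_handler)

# Headers in addheaders are sent with every request made through this opener.
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64)')]

handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)
```

The printed list contains the default handlers (for example HTTPHandler and HTTPRedirectHandler) alongside the HTTPCookieProcessor that was passed in.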

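Authentication

Some sites pop up a prompt as soon as you open them, asking for a user name and password, and show the page only after successful authentication. This is where HTTPBasicAuthHandler is needed: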

from urllib.request import urlopen, build_opener, HTTPPasswordMgrWithDefaultRealm,\
    HTTPBasicAuthHandler, install_opener
from urllib.error import URLError

username = 'ulysses'
password = 'password'

url = 'http://localhost:5000/'
# A subclass of HTTPPasswordMgr.
# Create a password manager that stores the user names and passwords for HTTP requests
p = HTTPPasswordMgrWithDefaultRealm()
# Arguments: realm, uri, user, passwd
p.add_password(None, url, username, password)
# Create the authentication handler; its argument is the password manager
auth_handler = HTTPBasicAuthHandler(p)

opener = build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
install_opener(opener)
try:
    with urlopen(url) as f:
        print(f.status)
        print(f.read().decode('utf-8'))
except URLError as e:
    # Raised when the server is unreachable or authentication fails
    print(e.reason)
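How the password manager decides which credentials apply can be seen without a server. This is a small sketch: the realm name 'Some Realm' and the example.com URL are made up for illustration.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

manager = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers these credentials as the default for any realm.
manager.add_password(None, 'http://localhost:5000/', 'ulysses', 'password')

# URLs under the registered prefix match, whatever realm the server names.
inside = manager.find_user_password('Some Realm', 'http://localhost:5000/admin')
# A different host has no stored credentials.
outside = manager.find_user_password('Some Realm', 'http://example.com/')

print(inside)
print(outside)
```

HTTPBasicAuthHandler performs this lookup when the server answers with a 401 response, then retries the request with the matching credentials.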

Cookie Handling

Handling cookies requires HTTPCookieProcessor:

import http.cookiejar
import urllib.request

# Collection of HTTP cookies
cookie = http.cookiejar.CookieJar()
# Cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

with opener.open('http://www.baidu.com') as f:
    for item in cookie:
        print(f"{item.name}={item.value}")

First we create a CookieJar object. Next we use HTTPCookieProcessor to build a handler, and finally we build an opener with build_opener() and call its open() method.
The cookies obtained:

BAIDUID=E9A30396B6CF7BE50FFE7A0C9F94F6B3:FG=1
BIDUPSID=E9A30396B6CF7BE50FFE7A0C9F94F6B3
H_PS_PSSID=1468_21126_27509
PSTM=1542272067
delPer=0
BDSVRTM=0
BD_HOME=0

To save the cookies to a file:

import http.cookiejar
import urllib.request

filename = 'cookie.txt'
url = 'http://www.baidu.com'
# You may want to back up your browser's cookies file
# if you use this class to save cookies
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

with opener.open(url) as f:
    cookie.save(ignore_discard=True, ignore_expires=True)

A MozillaCookieJar object saves the cookies to a text file in the Mozilla (Netscape) cookies.txt format:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	3689756478	BAIDUID	1937EFB489AB1EDDF64315DE8F35E05B:FG=1
.baidu.com	TRUE	/	FALSE	3689756478	BIDUPSID	1937EFB489AB1EDDF64315DE8F35E05B
.baidu.com	TRUE	/	FALSE		H_PS_PSSID	26524_1437_21116_27400_26350_22160
.baidu.com	TRUE	/	FALSE	3689756478	PSTM	1542272831
.baidu.com	TRUE	/	FALSE		delPer	0
www.baidu.com	FALSE	/	FALSE		BDSVRTM	0
www.baidu.com	FALSE	/	FALSE		BD_HOME	0

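Cookies can also be saved in the libwww-perl (LWP) format by using LWPCookieJar instead:

cookie = http.cookiejar.LWPCookieJar(filename)

To reuse saved cookies, load them from the file (this example assumes they were saved in LWP format as cookie_lwp.txt):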

import http.cookiejar
import urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie_lwp.txt', ignore_expires=True, ignore_discard=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
with opener.open('http://www.baidu.com') as f:
    print(f.read().decode('utf-8'))

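load() reads the cookies from a local file. We then build the handler and the opener in the same way; running the code prints the source of the Baidu home page.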
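The save and load steps can be tried without contacting any site. The sketch below builds one cookie by hand (all of its values are made up), writes it in LWP format to a temporary file, and reads it back into a fresh jar.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, LWPCookieJar

# Hand-built cookie so the example needs no network; all values are made up.
cookie = Cookie(
    version=0, name='session', value='abc123',
    port=None, port_specified=False,
    domain='example.com', domain_specified=False, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=int(time.time()) + 3600, discard=False,
    comment=None, comment_url=None, rest={},
)

path = os.path.join(tempfile.mkdtemp(), 'cookie_lwp.txt')

saved = LWPCookieJar(path)
saved.set_cookie(cookie)
saved.save()  # writes to the filename given to the constructor

loaded = LWPCookieJar()
loaded.load(path)
pairs = [(c.name, c.value) for c in loaded]
print(pairs)
```

Because this cookie has a future expiry time and is not marked discard, neither ignore_expires nor ignore_discard is needed; session cookies like those in the Baidu example would be skipped without ignore_discard=True.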

Proxy Settings

If the crawler needs to go through a proxy, use ProxyHandler:

from urllib.request import ProxyHandler, build_opener
from urllib.error import URLError

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)

with opener.open('https://www.baidu.com') as f:
    print(f.read().decode('utf-8'))

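ProxyHandler takes a dict that maps each protocol to the address of its proxy server. Build an opener from the ProxyHandler with build_opener(), then send the request through it.

Exception Handling

urllib's error module defines the exceptions raised by the request module. When something goes wrong, the request module raises one of the exceptions defined in the error module.

URLError: the URLError class comes from urllib's error module. It inherits from OSError and is the base class of the error module's exceptions, so any exception raised by the request module can be handled by catching it.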

from urllib import error, request

try:
    response = request.urlopen('http://localhost:8000/zh-hans/topic/444')
except error.URLError as e:
    print(e.reason)

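HTTPError: a subclass of URLError dedicated to HTTP request errors, such as a failed authentication. It has the following three attributes:
1. code: the HTTP status code;
2. reason: the reason for the error;
3. headers: the response headers.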

from urllib import request, error

try:
    response = request.urlopen('http://localhost:8000/zh-hans/topic/66/')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successfully')

Output:

Not Found
404
Date: Thu, 15 Nov 2018 12:12:22 GMT
Server: WSGIServer/0.2 CPython/3.6.7
Content-Type: text/html; charset=utf-8
Content-Language: zh-hans
X-Frame-Options: SAMEORIGIN
Content-Length: 27
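The reason HTTPError is caught before URLError is the class hierarchy. The following sketch checks it and runs both branches with exceptions constructed by hand, so no request is made; describe() is a hypothetical helper and the wording of its messages is arbitrary.

```python
from urllib import error

# HTTPError is a subclass of URLError, which is a subclass of OSError.
print(issubclass(error.HTTPError, error.URLError))
print(issubclass(error.URLError, OSError))


def describe(exc):
    """Summarise a urllib exception; the wording of the summary is arbitrary."""
    try:
        raise exc
    except error.HTTPError as e:      # most specific first
        return f'HTTP {e.code}: {e.reason}'
    except error.URLError as e:       # network-level failures
        return f'URL error: {e.reason}'


# Exceptions constructed by hand, so no request is made.
not_found = error.HTTPError('http://localhost:8000/', 404, 'Not Found', None, None)
refused = error.URLError('connection refused')

print(describe(not_found))
print(describe(refused))
```

If the two except clauses were swapped, the URLError branch would also catch every HTTPError and the status code would never be reported.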

Parsing

The urllib.parse module provides functions for working with URLs:

urlparse()

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True) splits a URL into its components:

from urllib.parse import urlparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
o = urlparse(url)
print(type(o))
print(o)

Output:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
[Finished in 0.0s]

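The function returns a ParseResult object that splits the URL into six parts, following the general URI form used by HTTP:
scheme://netloc/path;parameters?query#fragment
Parameters:

urlstring: the URL to parse.
scheme: the default scheme, such as http or https. It is used only when urlstring does not include a scheme.
allow_fragments: whether to recognize the fragment. If it is set to False, the fragment is not split off; it is parsed as part of the path, parameters, or query, and the fragment component is empty.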
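The effect of the scheme and allow_fragments parameters is easiest to see on concrete inputs. A minimal sketch, reusing the Baidu URL from the example above:

```python
from urllib.parse import urlparse

# Without '//' the host is not recognised, so it ends up in path;
# the scheme argument only fills in a missing scheme.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(no_scheme)

# With allow_fragments=False the '#comment' part stays attached to the query.
no_fragment = urlparse('https://www.baidu.com/index.html;user?id=5#comment',
                       allow_fragments=False)
print(no_fragment)

# The result is a named tuple: access fields by name or by index.
print(no_fragment.netloc, no_fragment[1])
```

In the first call netloc is empty because a network location is only recognised after a leading //.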

urlunparse()

The counterpart of urlparse(): it assembles the components into a URL:

from urllib.parse import urlunparse
data = ['https', 'www.baidu.com', 'index.html', 'user', 'id=6', 'comment']
url = urlunparse(data)
print(url)

The assembled URL:

https://www.baidu.com/index.html;user?id=6#comment
[Finished in 0.0s]

Its argument must be an iterable of exactly six items.
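As a sketch of how urlparse() and urlunparse() fit together, the snippet below round-trips a URL, replaces a single component, and shows what happens when the iterable has the wrong length. The exact text of the error message may differ between Python versions.

```python
from urllib.parse import urlparse, urlunparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Any 6-item iterable works; a ParseResult round-trips to the same URL.
round_trip = urlunparse(parts)
print(round_trip == url)

# ParseResult is a named tuple, so _replace() swaps a single component.
updated = parts._replace(query='id=6').geturl()
print(updated)

# Passing the wrong number of components raises ValueError.
try:
    urlunparse(['https', 'www.baidu.com', 'index.html'])
except ValueError as e:
    print('ValueError:', e)
```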

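urlsplit and urlunsplit

urlsplit() is similar to urlparse(), but it does not parse params separately; they stay part of path, so the result has five parts. Its counterpart urlunsplit() likewise takes an iterable of five items: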

from urllib.parse import urlsplit, urlunsplit

result = urlsplit("http://www.baidu.com/index.html;user?id=5#commet")
print(result.scheme, result[2])

data = ['https', 'www.baidu.com', 'index.html', 'id=6', 'comment']
print(urlunsplit(data))

Result:

http /index.html;user
https://www.baidu.com/index.html?id=6#comment
[Finished in 0.0s]

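urljoin

urllib.parse.urljoin(base, url, allow_fragments=True) takes a base URL as the first argument and a new link as the second. It analyzes the scheme, netloc, and path of the base URL, fills in whatever the new link is missing, and returns the result. If url is an absolute link (one that starts with // or scheme://), the result uses the host from url: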

from urllib.parse import urljoin
print(urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html'))
print(urljoin('http://www.cwi.nl/%7Eguido/Python.html','//www.python.org/%7Eguido'))

Result:

http://www.cwi.nl/%7Eguido/FAQ.html
http://www.python.org/%7Eguido
[Finished in 0.0s]
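A few more reference forms show how urljoin() resolves links against a base. This is a sketch with a placeholder base URL on example.com; the references are typical of what a crawler finds in href attributes.

```python
from urllib.parse import urljoin

base = 'http://www.example.com/a/b/c.html'  # placeholder base URL

references = [
    'd.html',                   # sibling of c.html
    '../d.html',                # one directory up
    '/d.html',                  # path from the site root
    '?x=1',                     # same page, new query
    'https://www.python.org/',  # absolute URL replaces the base entirely
]
joined = [urljoin(base, ref) for ref in references]
for ref, result in zip(references, joined):
    print(f'{ref!r:28} -> {result}')
```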

urlencode()

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus) builds the query string of a GET request:

from urllib.parse import urlencode
# Build the parameters of a GET request
params = {
    'name': 'Ulysses',
    'id': 11211
}

base_url = 'http://www.baidu.com?'
# Convert the dict to a query string
query_url = urlencode(params)
print(type(query_url))
url = base_url + query_url
print(url)

Output:

<class 'str'>
http://www.baidu.com?name=Ulysses&id=11211
[Finished in 0.1s]
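The other parameters of urlencode() matter once values contain spaces, non-ASCII text, or lists. A short sketch with made-up parameter names:

```python
from urllib.parse import urlencode

# Spaces become '+', and non-ASCII text is UTF-8 percent-encoded.
simple = urlencode({'q': 'a b', 'wd': '中文'})
print(simple)

# doseq=True expands a sequence value into repeated keys.
repeated = urlencode({'tag': ['python', 'urllib']}, doseq=True)
print(repeated)

# A list of (key, value) pairs is accepted as well and keeps its order.
ordered = urlencode([('page', 2), ('size', 10)])
print(ordered)
```

Without doseq=True the list would be converted with str() and encoded as a single value, which is rarely what a server expects.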

parse_qs() and parse_qsl()

These two functions deserialize GET query parameters into a dict or into a list of tuples:

from urllib.parse import parse_qs, parse_qsl

query = 'name=Ulysses&id=112222'
print(parse_qs(query))
print(parse_qsl(query))

Result:

{'name': ['Ulysses'], 'id': ['112222']}
[('name', 'Ulysses'), ('id', '112222')]
[Finished in 0.1s]
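In practice the query string usually comes from a full URL and may contain repeated or empty parameters. The sketch below, using a made-up URL, shows how the two functions differ in those cases and that parse_qsl() output can be fed back to urlencode().

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlparse

url = 'http://www.baidu.com/s?name=Ulysses&tag=a&tag=b&empty='
query = urlparse(url).query

# parse_qs groups repeated keys into one list; blank values are dropped by default.
as_dict = parse_qs(query)
# keep_blank_values=True keeps keys whose value is empty.
with_blanks = parse_qs(query, keep_blank_values=True)
# parse_qsl preserves order and duplicates, so it round-trips through urlencode.
pairs = parse_qsl(query, keep_blank_values=True)

print(as_dict)
print(with_blanks)
print(pairs)
print(urlencode(pairs) == query)
```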

quote and unquote

These functions percent-encode and decode URL content, for example when it contains Chinese characters:

from urllib.parse import quote, unquote
keyword = '中文'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
print(unquote(url))

Output:

https://www.baidu.com/s?wd=%E4%B8%AD%E6%96%87
https://www.baidu.com/s?wd=中文
[Finished in 0.0s]
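Which characters get encoded depends on the safe argument and on whether quote() or quote_plus() is used. A small sketch with a made-up string that mixes a space, a slash, and Chinese text:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a b/中文'

path_safe = quote(text)              # '/' is kept by default (safe='/')
fully_quoted = quote(text, safe='')  # '/' is encoded as well
form_style = quote_plus(text)        # spaces become '+', '/' is encoded

print(path_safe)
print(fully_quoted)
print(form_style)

# Each encoder has a matching decoder.
print(unquote(path_safe) == text)
print(unquote_plus(form_style) == text)
```

quote() suits path segments, while quote_plus() matches the form encoding that urlencode() uses for query strings.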

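The Robots Protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Standard (Robots Exclusion Protocol). It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, normally placed in the site's root directory.
When a search crawler visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the crawler crawls within the scope defined there. If the file is not found, the crawler visits every page that is directly accessible.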
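The robotparser module mentioned at the start can evaluate such rules. The following is a minimal sketch that parses a made-up robots.txt from a string rather than downloading one; the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real crawler would fetch it from the site root.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

public_ok = parser.can_fetch('*', 'http://example.com/public/index.html')
private_ok = parser.can_fetch('*', 'http://example.com/private/data.html')
print(public_ok)
print(private_ok)
```

For a live site, set_url() followed by read() would download and parse the file before can_fetch() is called.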
