Data Crawling (2): The urllib Library in Python Crawlers, and How to Use parse and request

1. urllib.request: the request module

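The urllib.request module provides the most basic way to build HTTP requests (and requests for other protocols such as FTP). With it you can simulate the way a browser issues a request and fetch a URL over different protocols. Some of its interfaces handle Basic Authentication, HTTP redirections, browser cookies, and similar situations. These interfaces are provided by handler and opener objects.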
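As a minimal sketch of how handlers and openers fit together, the snippet below builds an opener with cookie support and lists the handlers it ends up with. The variable names and the User-Agent string are illustrative assumptions, and no request is actually sent.

```python
import http.cookiejar
import urllib.request

# A CookieJar stores cookies received from servers (empty until a request is made).
cookie_jar = http.cookiejar.CookieJar()

# build_opener() combines the default handlers with the ones passed in.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

# Hypothetical User-Agent value, used only for illustration.
opener.addheaders = [("User-Agent", "example-crawler/0.1")]

# Each handler takes care of one concern: redirects, cookies, errors, and so on.
handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)
```

Calling opener.open(url) then behaves like urlopen(url) with these handlers in place, and urllib.request.install_opener(opener) makes the opener the default used by urlopen.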

(1) urlopen:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

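Parameters:
url: the URL to open.
data: the data to submit with POST. It defaults to None; when data is not None, urlopen() sends the request as POST.
timeout: the timeout for the request, in seconds.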

Note: calling urlopen from the urllib.request module fetches the page directly. In the example below, page is of type bytes and is converted to a string with decode. The output also shows that urlopen returns an HTTPResponse object.

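urlopen returns a file-like object that provides the following methods:

read(), readline(), readlines(), fileno(), close(): these work exactly as they do on a file object.
info(): returns an http.client.HTTPMessage object holding the headers sent by the remote server. See Quick Reference to Http Headers for a list of HTTP headers.
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the URL was not found.
geturl(): returns the real URL of the page retrieved. This is useful when urlopen (or the opener object) may have followed a redirect, since the URL of the retrieved page is not necessarily the URL that was requested.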

import urllib.request
response = urllib.request.urlopen('https://python.org/')
print("Type of the response object:", type(response))
print("The response object itself:", response)
print("Headers, form 1 (HTTP headers):\n", response.info())
print("Headers, form 2 (HTTP headers):\n", response.getheaders())
print("A single header value:", response.getheader("Server"))
print("Status, form 1 (HTTP status):\n", response.status)
print("Status, form 2 (HTTP status):\n", response.getcode())
print("URL of the response:\n", response.geturl())
page = response.read()
print("Page source:", page.decode('utf-8'))
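The file-like interface can also be tried without a network connection. The sketch below is an assumption-laden stand-in: it opens a data: URL instead of a real page, so the object returned is not an HTTPResponse, but it exposes the same read(), info() and geturl() methods. It also uses a with statement so that close() is called automatically.

```python
import urllib.request

# Assumption: a data: URL stands in for a real page so nothing is downloaded.
url = "data:text/plain;charset=utf-8,hello%20urllib"

# The with statement calls close() automatically when the block ends.
with urllib.request.urlopen(url) as response:
    final_url = response.geturl()                    # URL that was actually opened
    charset = response.info().get_content_charset()  # charset from the Content-Type header
    body = response.read()                           # raw bytes

# Decode with the declared charset, falling back to UTF-8 if none was declared.
text = body.decode(charset or "utf-8")
print(final_url)
print(text)
```

Taking the charset from the headers rather than hard-coding 'utf-8' is one way to cope with pages that use a different encoding.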

(2) POST data:

import urllib.request,urllib.parse
url = 'https://httpbin.org/post'
headers = {
 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',
 'Referer': 'https://httpbin.org/post',
 'Connection': 'keep-alive'
 }
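# Simulate a form submission
form_data = {
    'name': 'MIka',
    'old': 18
}
# The data argument must be bytes. If you start from a dict, encode it with
# urllib.parse.urlencode() first, then convert the resulting string to bytes.
data = urllib.parse.urlencode(form_data).encode('utf-8')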
req = urllib.request.Request(url = url,data = data,headers = headers)
response = urllib.request.urlopen(req)
page = response.read().decode('utf-8')
print(page)

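https://httpbin.org is a site built specifically for testing HTTP requests and is worth bookmarking.

When the data argument of urlopen is not None, the request is submitted as POST. urllib.parse.urlencode() converts the parameter dict into a string. The address https://httpbin.org/post can be used to test POST requests: it echoes information about the request and the response, including the data we sent.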
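The effect of data on the request method can be checked without sending anything. The following is a small sketch; the field names and values are hypothetical and chosen only for illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields for illustration.
form_data = {"name": "MIka", "old": 18}

# urlencode() yields a str; the data argument needs bytes.
encoded = urllib.parse.urlencode(form_data)
data = encoded.encode("utf-8")

# A Request with data defaults to POST; one without data defaults to GET.
post_req = urllib.request.Request("https://httpbin.org/post", data=data)
get_req = urllib.request.Request("https://httpbin.org/get")

print(encoded)                # name=MIka&old=18
print(post_req.get_method())  # POST
print(get_req.get_method())   # GET
```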

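(3) The timeout parameter

The timeout parameter sets a time limit in seconds. If no response arrives within that time, an exception is raised. If it is not specified, the global default timeout is used. It applies to HTTP, HTTPS, and FTP requests.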

import urllib.request
response = urllib.request.urlopen("https://httpbin.org/get",timeout=1)
print(response.read().decode("utf-8"))

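If we try an even smaller value, for example timeout=0.1, a urllib.error.URLError exception is raised with the reason "timed out", because under normal conditions a server response cannot arrive within 0.1 s. Setting timeout is therefore a useful way to cope with slow pages. It also lets a crawler skip a page that has not responded within the time limit, by wrapping the request in a try-except statement.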

import urllib.request
import socket
import urllib.error
try:
    response = urllib.request.urlopen('https://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # e.reason holds the underlying cause; a timeout shows up as socket.timeout
    if isinstance(e.reason, socket.timeout):
        print("Time out!")

Output: Time out!
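The idea of skipping pages that fail or time out can be sketched as a small helper. The function name fetch_or_skip and the URL list are hypothetical; to keep the example runnable offline, the first URL uses a scheme urllib cannot open (which raises URLError) and the second is a data: URL.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=5.0):
    """Return the response body as bytes, or None if the page should be skipped."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        # URLError covers connection failures, including timeouts while connecting;
        # socket.timeout can be raised directly if reading the body times out.
        print("Skipping", url, "->", exc)
        return None


# Hypothetical URL list: the first entry fails and is skipped, the second succeeds.
urls = ["nosuchscheme://example.invalid/page", "data:text/plain,ok"]
pages = [fetch_or_skip(u, timeout=1) for u in urls]
print(pages)
```

Returning None instead of re-raising lets the crawl loop continue with the next URL; whether to retry or log the failure is a design choice left open here.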

2. The response (shown mostly through code from here on)

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Output:

<class 'http.client.HTTPResponse'>

Status code and response headers:

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Output:

200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Content-Length', '47436'), ('Accept-Ranges', 'bytes'), ('Date', 'Wed, 22 Mar 2017 15:40:16 GMT'), ('Via', '1.1 varnish'), ('Age', '3417'), ('Connection', 'close'), ('X-Served-By', 'cache-itm7426-ITM'), ('X-Cache', 'HIT'), ('X-Cache-Hits', '16'), ('X-Timer', 'S1490197216.605863,VS0,VE0'), ('Vary', 'Cookie'), ('Public-Key-Pins', 'max-age=600; includeSubDomains; pin-sha256="WoiWRyIOVNa9ihaBciRSC7XHjliYS9VwUGOIud4PB18="; pin-sha256="5C8kvU039KouVrl52D0eZSGf4Onjo4Khs8tmyTlV3nU="; pin-sha256="5C8kvU039KouVrl52D0eZSGf4Onjo4Khs8tmyTlV3nU="; pin-sha256="lCppFqbkrlJ3EcVFAkeip0+44VaoJUymbnOaEUk7tEU="; pin-sha256="TUDnr0MEoJ3of7+YliBMBVFB4/gJsv5zO7IxD9+YoWI="; pin-sha256="x4QzPSC810K5/cMjb05Qm4k3Bw5zBn4lTdO/nEW/Td4=";'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

The response body is read and decoded as UTF-8. The output is too long to show here.

3. Request

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
 'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
 'Host': 'httpbin.org'
}
# Form fields to submit (named form_data to avoid shadowing the built-in dict)
form_data = {
 'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

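Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connect-Time": "1", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "Total-Route-Time": "0", 
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)", 
    "Via": "1.1 vegur", 
    "X-Request-Id": "f96e736e-0b8a-4ab4-9dcc-a970fcd2fbbf"
  }, 
  "json": null, 
  "origin": "219.238.82.169", 
  "url": "http://httpbin.org/post"
}

4. urllib.parse: the parsing module

This module splits a URL string into components, combines components back into a URL, and converts a relative URL into an absolute URL given a base URL, so that the result is in standard URL form. In short, it can both split and join URLs. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.
The functions in this module fall into two broad categories: URL parsing and URL quoting.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

This function parses a URL into six components and returns them as a tuple. The general URL format is scheme://netloc/path;parameters?query#fragment. Each element of the tuple is a string, possibly empty, and the components are not broken down further into smaller parts.

The returned tuple contains, in order: scheme, netloc, path, params, query, fragment.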
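To see the six components that urllib.parse.urlparse() returns, here is a short sketch using a hypothetical URL that fills in every part of the scheme://netloc/path;parameters?query#fragment format.

```python
from urllib.parse import urlparse

# Hypothetical URL that contains all six components.
result = urlparse("http://www.example.com/index.html;user?id=5#comment")

print(result.scheme)    # http
print(result.netloc)    # www.example.com
print(result.path)      # /index.html
print(result.params)    # user
print(result.query)     # id=5
print(result.fragment)  # comment

# The result is also a plain 6-item tuple, so indexing works too.
print(result[1])        # www.example.com
```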

urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

This function parses the parameters in the query component of a URL and returns a dict that maps each key to a list of its values. Example:

import urllib.parse  
print(urllib.parse.parse_qs("FuncNo=9009001&username=1"))

Output:

{'FuncNo': ['9009001'], 'username': ['1']}
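The values are lists because a key may appear more than once in a query string. The sketch below, using a hypothetical URL, shows that case along with the effect of keep_blank_values.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL: "tag" is repeated and "debug" has an empty value.
url = "https://www.example.com/search?tag=python&tag=urllib&page=2&debug="
query = urlparse(url).query

# By default, keys with blank values are dropped.
params = parse_qs(query)
# With keep_blank_values=True they are kept as empty strings.
params_with_blanks = parse_qs(query, keep_blank_values=True)

print(params)              # {'tag': ['python', 'urllib'], 'page': ['2']}
print(params_with_blanks)  # {'tag': ['python', 'urllib'], 'page': ['2'], 'debug': ['']}
```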

urllib.parse.parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

This function does the same job as urllib.parse.parse_qs(). The only difference is that it returns a list of (name, value) tuples.

import urllib.parse  
print(urllib.parse.parse_qsl("FuncNo=9009001&username=1"))

Output:

[('FuncNo', '9009001'), ('username', '1')]
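Because parse_qsl() keeps the pairs in their original order, its result can be edited and passed straight back to urllib.parse.urlencode(). A minimal sketch, in which the replacement value "2" is an arbitrary assumption:

```python
from urllib.parse import parse_qsl, urlencode

pairs = parse_qsl("FuncNo=9009001&username=1")

# Change one value while keeping the order of the parameters.
updated = [(key, "2" if key == "username" else value) for key, value in pairs]

# urlencode() accepts a sequence of two-item tuples as well as a dict.
new_query = urlencode(updated)
print(new_query)  # FuncNo=9009001&username=2
```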

urllib.parse.urlunparse(parts)

This function reassembles the tuple produced by urlparse() into a URL.

Example:

import urllib.parse

# Split the URL into its six components
parsed = urllib.parse.urlparse("https://www.zhihu.com/question/50056807/answer/223566912")
print(parsed)

# Copy the components into a plain tuple and reassemble the URL
t = parsed[:]
print(urllib.parse.urlunparse(t))

Output:

ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/50056807/answer/223566912', params='', query='', fragment='')  
https://www.zhihu.com/question/50056807/answer/223566912
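A common use of urlunparse() is to change one component and rebuild the URL. The sketch below uses a hypothetical example.com URL; since the parse result is a named tuple, _replace() returns a copy with the chosen field changed.

```python
from urllib.parse import urlparse, urlunparse

parsed = urlparse("https://www.example.com/search?q=urllib")

# Replace only the query component, leaving the rest untouched.
next_page = parsed._replace(query="q=urllib&page=2")
rebuilt = urlunparse(next_page)
print(rebuilt)     # https://www.example.com/search?q=urllib&page=2

# urlunparse() also accepts any 6-item iterable of components.
from_parts = urlunparse(("https", "www.example.com", "/search", "", "q=urllib", ""))
print(from_parts)  # https://www.example.com/search?q=urllib
```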

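urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)
This function is similar to urlparse(). The only difference is that it does not split the params out of the URL, so the result has one element fewer: the same elements as the urlparse() tuple, minus params.

Example: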

import urllib.parse
print(urllib.parse.urlsplit("https://www.zhihu.com/question/50056807/answer/223566912"))

Output:

SplitResult(scheme='https', netloc='www.zhihu.com', path='/question/50056807/answer/223566912', query='', fragment='')
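The difference between the two functions only becomes visible when the URL actually carries params. A short sketch with a hypothetical URL:

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a ";user" parameter on the last path segment.
url = "http://www.example.com/index.html;user?id=5"

parsed = urlparse(url)
split = urlsplit(url)

# urlparse() separates the params from the path (6 elements in total).
print(len(parsed), parsed.path, parsed.params)  # 6 /index.html user
# urlsplit() leaves them inside the path (5 elements in total).
print(len(split), split.path)                   # 5 /index.html;user
```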

urllib.parse.urlunsplit(parts)

Similar to urlunparse(), and the counterpart of urlsplit().

Example:

import urllib.parse
parsed=urllib.parse.urlsplit("https://www.zhihu.com/question/50056807/answer/223566912")
t=parsed[:]
print(urllib.parse.urlunsplit(t))

Output:

https://www.zhihu.com/question/50056807/answer/223566912

urllib.parse.urljoin(base, url, allow_fragments=True)

This function combines a base URL with another URL to build a complete URL.

Example:

import urllib.parse  

print(urllib.parse.urljoin("https://www.baidu.com/Python.html","Java.html"))

Output:

https://www.baidu.com/Java.html

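Note: if url is an absolute URL (that is, it starts with "//" or "scheme://"), its host name and/or scheme will be used in the result instead of those of the base URL.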
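A few more urljoin() cases, sketched with hypothetical example.com URLs, show how relative paths, root-relative paths, and URLs starting with "//" or "scheme://" are resolved against the base.

```python
from urllib.parse import urljoin

base = "https://www.example.com/docs/intro.html"

# Relative path: resolved against the directory of the base URL.
up_one = urljoin(base, "../img/logo.png")
# Root-relative path: replaces the whole path of the base URL.
from_root = urljoin(base, "/about")
# Starts with "//": keeps the base scheme but uses the new host.
new_host = urljoin(base, "//cdn.example.com/lib.js")
# Full URL with its own scheme: the base is ignored.
full_url = urljoin(base, "http://other.example.org/x")

print(up_one)     # https://www.example.com/img/logo.png
print(from_root)  # https://www.example.com/about
print(new_host)   # https://cdn.example.com/lib.js
print(full_url)   # http://other.example.org/x
```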

urllib.parse.urldefrag(url)

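If the URL contains a fragment identifier, this function returns the URL without the fragment, with the fragment itself returned as a separate string. If the URL contains no fragment identifier, it returns the URL and an empty string.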
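Both cases of urldefrag() can be sketched as follows; the URLs are hypothetical.

```python
from urllib.parse import urldefrag

with_fragment = urldefrag("https://www.example.com/page.html#section-2")
without_fragment = urldefrag("https://www.example.com/page.html")

# The result is a named tuple with the fields url and fragment.
print(with_fragment.url)       # https://www.example.com/page.html
print(with_fragment.fragment)  # section-2

# With no fragment in the input, the fragment field is an empty string.
print(tuple(without_fragment))
```

For a crawler this is handy for deduplication, since two URLs that differ only in their fragment point to the same page.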

That is finally everything. Some of this material was compiled from the web; link to the source: http://blog.esouti.com/2018_02_02_526.html