I. urllib: Python's Built-in HTTP Request Library
- urllib.request: the request module (simulates sending requests)
- urllib.error: the exception-handling module
- urllib.parse: the URL-parsing module (provides many URL utilities, such as splitting and joining)
- urllib.robotparser: the robots.txt parsing module
II. Usage Examples
1. urlopen
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The first three parameters are used frequently; the remaining ones are rarely needed.
(1) The first parameter: url
```python
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')  # a GET request
print(response.read().decode('utf-8'))

'''
You can also construct a Request object explicitly:
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
'''
```
This prints the HTML source of the Baidu homepage (output omitted here).
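As a small variation on the example above (not in the original), urlopen can also be used as a context manager, and read() accepts a byte count, which is handy for peeking at a large page without downloading all of it:

```python
import urllib.request

# The context manager closes the connection automatically;
# read(100) returns at most the first 100 bytes of the body.
with urllib.request.urlopen('http://www.baidu.com') as response:
    head = response.read(100)
    print(head[:20])
```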
(2) The second parameter: data
```python
import urllib.parse
import urllib.request

# When data is supplied, the request is sent as a POST
data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
```
The output is:
```
b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "world": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "11", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7", \n    "X-Amzn-Trace-Id": "Root=1-5e426274-a3f0622e47583590491e9caa"\n  }, \n  "json": null, \n  "origin": "223.88.90.230", \n  "url": "http://httpbin.org/post"\n}\n'
```
(3) The third parameter: timeout
```python
# Example 1: no timeout triggered
import urllib.request

# 1-second timeout; if the server responds within 1 second, the body prints normally
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())
```
```python
# Example 2: timeout triggered
import socket
import urllib.request
import urllib.error

try:
    # 0.1-second timeout; when it expires, the except block runs
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # on timeout, print TIME OUT
        print('TIME OUT')
```
Example 2 prints:

```
TIME OUT
```
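Building on the pattern above, a timeout is often combined with a simple retry loop; `fetch_with_retry` below is a hypothetical helper sketched for illustration, not part of urllib:

```python
import socket
import urllib.error
import urllib.request

def fetch_with_retry(url, timeout=1, retries=3):
    """Try the request up to `retries` times, re-raising if all attempts time out."""
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url, timeout=timeout).read()
        except urllib.error.URLError as e:
            if isinstance(e.reason, socket.timeout) and attempt < retries - 1:
                continue  # timed out: try again
            raise         # other errors, or out of retries

body = fetch_with_retry('http://httpbin.org/get', timeout=5)
print(len(body))
```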
(4) The response object
The response type, status code, and response headers:
```python
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))          # response type
print(response.status)         # status code
print(response.getheaders())   # response headers
```
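Beyond getheaders(), the response object can also look up a single header by name and report the URL that was actually fetched; a short sketch:

```python
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get')
print(response.status)                      # e.g. 200
print(response.getheader('Content-Type'))   # one header, looked up by name
print(response.geturl())                    # final URL (after any redirects)
```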
(5) Building a POST request: use a Request object to attach headers and form data
```python
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
'''
Headers can also be added with req.add_header():
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)')
'''
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
2. Cookies
Cookies maintain login state, so they can be used to crawl pages that require authentication.
```python
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)  # handler that manages cookies
opener = urllib.request.build_opener(handler)         # build an opener around it
response = opener.open('http://www.baidu.com')
for item in cookie:                                   # print the stored cookies
    print(item.name + '=' + item.value)
```
The output is:

```
BAIDUID=DD72491D2B581558F3408ACDEC22350C:FG=1
BIDUPSID=DD72491D2B581558D20BEA6F906A04C3
H_PS_PSSID=1461_21105
PSTM=1581496829
delPer=0
BDSVRTM=0
BD_HOME=0
```
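To reuse cookies across runs (e.g. to avoid logging in every time), the jar can be written to disk and loaded back later; a sketch using MozillaCookieJar, where the filename `cookies.txt` is arbitrary:

```python
import http.cookiejar
import urllib.request

filename = 'cookies.txt'

# Save cookies received from the site to a Mozilla-format text file.
cookie = http.cookiejar.MozillaCookieJar(filename)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# Later, load them back and attach them to a new opener.
cookie2 = http.cookiejar.MozillaCookieJar()
cookie2.load(filename, ignore_discard=True, ignore_expires=True)
opener2 = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie2))
response = opener2.open('http://www.baidu.com')
print(response.status)
```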
3. Exception Handling
When urlopen cannot handle a response, it raises a URLError. HTTPError is a subclass of URLError and carries more detail: besides the failure reason, it exposes the HTTP status code and the response headers, so it should be caught before URLError.
```python
# URLError example
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index_html')
except error.URLError as e:
    print(e.reason)  # prints: Not Found
```
```python
# HTTPError vs. URLError
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index_html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')  # printed only when no exception is raised
```
The output is:

```
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Wed, 12 Feb 2020 09:03:53 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: PHPSESSID=94j0c7n2t6l640b7hn6or3g680; path=/
Pragma: no-cache
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"
```
4. URL Parsing
(1) urlparse: split a URL string into its components
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
```python
# Parse a URL
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')
```
The output is:
```
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
```
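The scheme and allow_fragments parameters from the signature above change how ambiguous URLs are split; a quick sketch:

```python
from urllib.parse import urlparse

# scheme= supplies a default only when the URL itself has none
r1 = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(r1.scheme, r1.netloc, r1.path)  # netloc stays empty without a leading '//'

# allow_fragments=False folds the fragment into the query (or the path)
r2 = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(r2.path, r2.fragment)
```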
(2) urlunparse: build a URL from a sequence of exactly six components
```python
from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))  # prints: https://www.baidu.com/index.html;user?a=6#comment
```
(3) urlencode: serialize a dict into a query string
```python
from urllib.parse import urlencode

params = {
    'name': 'jack',
    'age': 20
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # prints: https://www.baidu.com?name=jack&age=20
```
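Going the other way, parse_qs turns a query string back into a dict, and urljoin (the "joining" mentioned at the top) resolves a relative link against a base URL:

```python
from urllib.parse import parse_qs, urljoin

query = 'name=jack&age=20'
print(parse_qs(query))  # values come back as lists: {'name': ['jack'], 'age': ['20']}

# Resolve a relative link the way a browser would
print(urljoin('https://www.baidu.com/a/b.html', 'c.html'))  # https://www.baidu.com/a/c.html
```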