I. urllib: Python's Built-in HTTP Request Library
- urllib.request: the request module (simulates sending requests)
- urllib.error: the exception-handling module
- urllib.parse: the URL-parsing module (provides many URL utilities, such as splitting and joining)
- urllib.robotparser: the robots.txt parsing module
II. Usage Examples
1. urlopen
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The first three parameters are used frequently; the remaining ones are rarely needed.
(1) The first parameter: url
```python
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')  # a GET request
print(response.read().decode('utf-8'))

'''
You can also construct a Request object explicitly:
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
'''
```
This prints the HTML source of the Baidu homepage (output omitted here).
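As a small variation on the example above (not in the original), urlopen can also be used as a context manager, and read() accepts a byte count, which is handy for peeking at a large page without downloading all of it:

```python
import urllib.request

# The context manager closes the connection automatically;
# read(100) returns at most the first 100 bytes of the body.
with urllib.request.urlopen('http://www.baidu.com') as response:
    head = response.read(100)
    print(head[:20])
```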
(2) The second parameter: data
```python
import urllib.parse
import urllib.request

# When data is supplied, the request is sent as a POST
data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
```
The output is:
```
b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "world": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "11", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7", \n    "X-Amzn-Trace-Id": "Root=1-5e426274-a3f0622e47583590491e9caa"\n  }, \n  "json": null, \n  "origin": "223.88.90.230", \n  "url": "http://httpbin.org/post"\n}\n'
```
(3) The third parameter: timeout
```python
# Example 1: no timeout triggered
import urllib.request

# 1-second timeout; if the server responds within 1 second, the body prints normally
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())
```
```python
# Example 2: timeout triggered
import socket
import urllib.request
import urllib.error

try:
    # 0.1-second timeout; when it expires, the except block runs
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # on timeout, print TIME OUT
        print('TIME OUT')
```
Example 2 prints:

```
TIME OUT
```
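Building on the pattern above, a timeout is often combined with a simple retry loop; `fetch_with_retry` below is a hypothetical helper sketched for illustration, not part of urllib:

```python
import socket
import urllib.error
import urllib.request

def fetch_with_retry(url, timeout=1, retries=3):
    """Try the request up to `retries` times, re-raising if all attempts time out."""
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url, timeout=timeout).read()
        except urllib.error.URLError as e:
            if isinstance(e.reason, socket.timeout) and attempt < retries - 1:
                continue  # timed out: try again
            raise         # other errors, or out of retries

body = fetch_with_retry('http://httpbin.org/get', timeout=5)
print(len(body))
```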
(4) The response object
The response type, status code, and response headers:
```python
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))          # response type
print(response.status)         # status code
print(response.getheaders())   # response headers
```
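Beyond getheaders(), the response object can also look up a single header by name and report the URL that was actually fetched; a short sketch:

```python
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get')
print(response.status)                      # e.g. 200
print(response.getheader('Content-Type'))   # one header, looked up by name
print(response.geturl())                    # final URL (after any redirects)
```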
(5) Building a POST request: use a Request object to attach headers and form data
```python
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
'''
Headers can also be added with req.add_header():
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)')
'''
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
2. Cookies
Cookies maintain login state, so they can be used to crawl pages that require authentication.
```python
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)  # handler that manages cookies
opener = urllib.request.build_opener(handler)         # build an opener around it
response = opener.open('http://www.baidu.com')
for item in cookie:                                   # print the stored cookies
    print(item.name + '=' + item.value)
```
The output is:

```
BAIDUID=DD72491D2B581558F3408ACDEC22350C:FG=1
BIDUPSID=DD72491D2B581558D20BEA6F906A04C3
H_PS_PSSID=1461_21105
PSTM=1581496829
delPer=0
BDSVRTM=0
BD_HOME=0
```
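To reuse cookies across runs (e.g. to avoid logging in every time), the jar can be written to disk and loaded back later; a sketch using MozillaCookieJar, where the filename `cookies.txt` is arbitrary:

```python
import http.cookiejar
import urllib.request

filename = 'cookies.txt'

# Save cookies received from the site to a Mozilla-format text file.
cookie = http.cookiejar.MozillaCookieJar(filename)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# Later, load them back and attach them to a new opener.
cookie2 = http.cookiejar.MozillaCookieJar()
cookie2.load(filename, ignore_discard=True, ignore_expires=True)
opener2 = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie2))
response = opener2.open('http://www.baidu.com')
print(response.status)
```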
3. Exception Handling
When urlopen cannot handle a response, it raises a URLError. HTTPError is a subclass of URLError and carries more detail: besides the failure reason, it exposes the HTTP status code and the response headers, so it should be caught before URLError.
```python
# URLError example
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index_html')
except error.URLError as e:
    print(e.reason)  # prints: Not Found
```
```python
# HTTPError vs. URLError
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index_html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')  # printed only when no exception is raised
```
The output is:

```
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Wed, 12 Feb 2020 09:03:53 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: PHPSESSID=94j0c7n2t6l640b7hn6or3g680; path=/
Pragma: no-cache
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"
```
4. URL Parsing
(1) urlparse: split a URL string into its components
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
```python
# Parse a URL
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')
```
The output is:
```
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
```
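The scheme and allow_fragments parameters from the signature above change how ambiguous URLs are split; a quick sketch:

```python
from urllib.parse import urlparse

# scheme= supplies a default only when the URL itself has none
r1 = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(r1.scheme, r1.netloc, r1.path)  # netloc stays empty without a leading '//'

# allow_fragments=False folds the fragment into the query (or the path)
r2 = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(r2.path, r2.fragment)
```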
(2) urlunparse: build a URL from a sequence of exactly six components
```python
from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))  # prints: https://www.baidu.com/index.html;user?a=6#comment
```
(3) urlencode: serialize a dict into a query string
```python
from urllib.parse import urlencode

params = {
    'name': 'jack',
    'age': 20
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # prints: https://www.baidu.com?name=jack&age=20
```
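Going the other way, parse_qs turns a query string back into a dict, and urljoin (the "joining" mentioned at the top) resolves a relative link against a base URL:

```python
from urllib.parse import parse_qs, urljoin

query = 'name=jack&age=20'
print(parse_qs(query))  # values come back as lists: {'name': ['jack'], 'age': ['20']}

# Resolve a relative link the way a browser would
print(urljoin('https://www.baidu.com/a/b.html', 'c.html'))  # https://www.baidu.com/a/c.html
```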