Copyright notice: https://blog.csdn.net/wshsdm/article/details/82048123
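1 The urllib library in Python 3 contains 4 modules
request module:
Sends HTTP requests;
error module:
Exception-handling module, mainly used to keep the program from terminating unexpectedly;
parse module:
Utility module that provides URL-handling functions;
robotparser module:
Parses a site's robots.txt file to determine which URLs may not be crawled;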
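As a small offline illustration of the parse and robotparser modules listed above, the sketch below splits a URL into its components and checks it against a set of robots.txt rules. The example.com URLs and the rules are made up for illustration; in real use the rules would normally be downloaded with set_url() and read().

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Split a URL into its components (no network access needed).
parts = urlparse("https://example.com/private/page.html?x=1")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Hypothetical robots.txt rules, supplied as lines instead of being downloaded.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
rp = RobotFileParser()
rp.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the URL.
allowed_public = rp.can_fetch("*", "https://example.com/index.html")
allowed_private = rp.can_fetch("*", "https://example.com/private/page.html")
print(allowed_public, allowed_private)
```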
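2 The request module
2.1 Sending a request
def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):
Purpose:
Sends a request to a URL given as a string (a Request object is also accepted) and returns a response object
Returns:
http.client.HTTPResponse
Parameters:
data: if this parameter is supplied, the request is sent as a POST; the value must be of type bytes (an encoded byte stream)
Example: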
import urllib.request as request
import urllib.parse as parse
xurl="https://blog.csdn.net/gaifuxi9518/article/details/80859116"
# data parameter: urlencode() returns a str
data=parse.urlencode({'xname':'abc'})
print(type(data),data)
# Encode the string as bytes
bdata=bytes(data,encoding="utf8")
print(bdata)
response=request.urlopen(xurl,data=bdata)
print(type(response),response)
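The effect of the data parameter can be checked without sending anything. The following sketch (the form fields are made up for illustration) shows that urlencode() percent-encodes non-ASCII values, and that a Request built with data reports POST as its method while one without data reports GET.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

form = {"xname": "abc", "city": "北京"}  # hypothetical form fields

# urlencode() returns a str; non-ASCII values are percent-encoded as UTF-8.
encoded = urlencode(form)
print(encoded)

# urlopen() and Request need bytes, so encode the str.
body = encoded.encode("utf-8")

# Building a Request does not touch the network, so the effect of data
# on the HTTP method can be inspected safely.
without_data = Request("http://httpbin.org/post")
with_data = Request("http://httpbin.org/post", data=body)
print(without_data.get_method(), with_data.get_method())

# parse_qs() reverses the encoding, which is a handy sanity check.
decoded = parse_qs(encoded)
print(decoded)
```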
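timeout parameter:
Sets the timeout, in seconds, for blocking operations such as the connection attempt;
For example, set a timeout; if the connection times out, urllib.error.URLError is raised and its reason attribute is a socket.timeout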
import urllib.request as request
import urllib.parse as parse
import urllib.error as err
import socket
xurl="https://blog.csdn.net/gaifuxi9518/article/details/80859116"
# data parameter: urlencode() returns a str
data=parse.urlencode({'xname':'abc'})
print(type(data),data)
# Encode the string as bytes
bdata=bytes(data,encoding="utf8")
print(bdata)
try:
    response=request.urlopen(xurl,data=bdata,timeout=1)
    print(type(response),response)
except err.URLError as e:
    # URLError.reason holds the underlying exception
    if isinstance(e.reason,socket.timeout):
        print("Connection failed!")
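The timeout handling above can be wrapped in a small helper. This is only a sketch: fetch and its opener parameter are hypothetical names, and opener exists solely so that the function can be exercised without a network by passing in a stand-in for urlopen().

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=1.0, opener=urllib.request.urlopen):
    """Return the response body, or None if the request timed out.

    `fetch` and its `opener` parameter are hypothetical; `opener` is only
    there so the function can be tried out offline.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # A timeout while connecting is wrapped in URLError.reason.
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # A timeout while waiting for or reading the response
        # can surface directly as socket.timeout.
        return None


def always_times_out(url, timeout):
    # Stand-in for urlopen() that simulates a connection timeout.
    raise urllib.error.URLError(socket.timeout("timed out"))


result = fetch("http://example.com/", opener=always_times_out)
print(result)
```

Errors other than timeouts are re-raised on purpose, so the caller can still tell a slow server from an unreachable one.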
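context parameter:
Must be an ssl.SSLContext instance; it specifies the SSL settings
cafile and capath parameters:
Specify a CA certificate file and a directory of CA certificates, respectively
cadefault parameter: deprecated; defaults to False;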
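A minimal sketch of building an ssl.SSLContext for the context parameter is shown below. Nothing is sent over the network; open_with_context is a hypothetical helper and my_ca_bundle.pem is a made-up file name.

```python
import ssl
import urllib.request

# A default context verifies certificates against the system CA store
# and checks that the certificate matches the host name.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)

# To trust a specific CA bundle instead, pass its path (hypothetical file name):
# context = ssl.create_default_context(cafile="my_ca_bundle.pem")


def open_with_context(url, timeout=5):
    """Hypothetical helper: open url using the context built above."""
    return urllib.request.urlopen(url, timeout=timeout, context=context)
```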
2.2 Building more powerful requests with the Request class
Example:
import urllib.request as req
# Build the request object
request=req.Request(url="https://mp.csdn.net/postedit/82048123")
response=req.urlopen(request)
print(response)
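The class used to build request objects
Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
Parameters:
url: the request address, as a string;
data: the data to send with the request; must be of type bytes;
Example: use urlencode() to convert a dict into a string, then use bytes() to convert the string into a byte string
data=parse.urlencode({'xname':'abc'})
# Encode the string as bytes
bdata=bytes(data,encoding="utf8")
headers parameter:
The request headers, as a dict;
origin_req_host parameter:
The host name or IP address of the original request;
unverifiable parameter:
Indicates whether the request is unverifiable, that is, whether the user had no option to approve it (for example, an image fetched automatically while rendering a page); defaults to False;
method parameter:
The HTTP method to use, as a string such as 'GET', 'POST' or 'PUT';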
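The parameters above can be combined and inspected without sending the request. In this sketch the client name my-crawler/0.1 and the /put path are made up for illustration.

```python
import json
from urllib.request import Request

payload = json.dumps({"xname": "abc"}).encode("utf-8")  # body must be bytes

req = Request(
    url="http://httpbin.org/put",
    data=payload,
    headers={
        "User-Agent": "my-crawler/0.1",  # hypothetical client name
        "Content-Type": "application/json",
    },
    method="PUT",  # overrides the default of POST when data is present
)

# Inspect the request without sending it.
print(req.get_method())
print(req.full_url)
# Request normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
print(req.header_items())
```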
2.3 Getting cookies
import http.cookiejar,urllib.request as request
# Create the object that stores the cookies
cookie=http.cookiejar.CookieJar()
# Create the cookie handler
handler=request.HTTPCookieProcessor(cookie)
# Create the opener object
opener=request.build_opener(handler)
cos=opener.open("https://www.baidu.com")
for it in cookie:
    print(it.name+":"+it.value)
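To see what the cookie jar does on each exchange without contacting a real site, the two steps that HTTPCookieProcessor performs can be called by hand: extract_cookies() on a response and add_cookie_header() on the next request. This is only a sketch; FakeResponse is a hypothetical stand-in for a real response, and the example.com URLs and the sid cookie are made up.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response carrying Set-Cookie headers."""

    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            self._headers["Set-Cookie"] = value

    def info(self):
        # CookieJar reads the response headers through info().
        return self._headers


jar = http.cookiejar.CookieJar()

# First exchange: the (simulated) server sets a cookie.
first = urllib.request.Request("http://example.com/login")
jar.extract_cookies(FakeResponse(["sid=abc123; Path=/"]), first)
for c in jar:
    print(c.name + ":" + c.value, c.domain, c.path)

# Second exchange: the jar adds a Cookie header to the outgoing request.
second = urllib.request.Request("http://example.com/profile")
jar.add_cookie_header(second)
print(second.get_header("Cookie"))
```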
2.4 Saving cookies to a file
import http.cookiejar as cookjar,urllib.request as request
fname="hello.txt"
# Store the cookies in a file in Mozilla format
cook=cookjar.MozillaCookieJar(fname)
# Create the cookie handler from the cook object
handler=request.HTTPCookieProcessor(cook)
# Create the opener object from the cookie handler
opener=request.build_opener(handler)
# Send the request and get the response object
response=opener.open(fullurl="https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%E5%A4%A7%E6%95%B0%E6%8D%AE&rsv_pq=9bb4d29100027393&rsv_t=afc51HKG3RBkn5qAEeZOSZ4xr5UU7H8B%2BhIe42q1K%2FDMbB6FNYTF9PEOuv0&rqlang=cn&rsv_enter=1&rsv_sug3=4&rsv_sug1=4&rsv_sug7=101")
# Write the cookies to the file, including session cookies and expired ones
cook.save(ignore_discard=True,ignore_expires=True)
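A saved cookie file can be read back on a later run with load(). The sketch below does a full save and load round trip offline; make_cookie is a hypothetical helper that builds a session cookie by hand, and the token cookie and temporary file are made up for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Hypothetical helper that builds a session cookie by hand."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

saved = http.cookiejar.MozillaCookieJar(path)
saved.set_cookie(make_cookie("token", "xyz", "example.com"))
# Session cookies (discard=True) are skipped unless ignore_discard=True.
saved.save(ignore_discard=True, ignore_expires=True)

# A fresh jar can read the file back on a later run.
loaded = http.cookiejar.MozillaCookieJar(path)
loaded.load(ignore_discard=True, ignore_expires=True)
for c in loaded:
    print(c.name + ":" + c.value)
```

The loaded jar can then be passed to HTTPCookieProcessor in the same way as a new one.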
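3 The error module
3.1 Exception raised when a page cannot be reached: error.URLError; this object has a reason attribute
Example:
from urllib import error
try:
    ...
except error.URLError as ex:
    print(ex.reason)
3.2 The exception object for HTTP errors: error.HTTPError
HTTPError is a subclass of URLError. It has three attributes: code (the HTTP response status code); reason (the error message); headers (the response headers of the failed request)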
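Because HTTPError is a subclass of URLError, the except clause for HTTPError has to come first, otherwise the URLError clause would catch it. The sketch below shows the ordering; run is a hypothetical helper, and the exceptions are raised by hand so the example works offline (normally they would come from urlopen()).

```python
import email.message
import urllib.error


def run(action):
    """Hypothetical helper: call action() and describe any urllib error."""
    try:
        action()
        return "ok"
    except urllib.error.HTTPError as ex:  # must come before URLError
        return f"HTTP {ex.code}: {ex.reason}"
    except urllib.error.URLError as ex:
        return f"URL error: {ex.reason}"


def page_missing():
    # Simulates a server that answers with 404.
    raise urllib.error.HTTPError(
        "http://example.com/missing", 404, "Not Found",
        email.message.Message(), None,
    )


def host_unreachable():
    # Simulates a failure before any HTTP response is received.
    raise urllib.error.URLError("name resolution failed")


print(run(page_missing))
print(run(host_unreachable))
print(run(lambda: None))
```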
4 The requests library
4.1 Getting a response object
import requests
response=requests.get("https://www.baidu.com/s?ie=utf-8&f=3&rsv_bp=1&rsv_idx=1&tn=baidu&wd=dashuju%20&oq=python%2526lt%253B%2520HTTP%2526lt%253BookieProcessor&rsv_pq=b7227c130002b1ee&rsv_t=5f70swssIMLU94lSy94bhTZqTlFsRzC6fWci6pbh9j7I5amzH3KjPMJXcxw&rqlang=cn&rsv_enter=0&rsv_sug3=32&rsv_sug1=33&rsv_sug7=101&rsv_sug2=0&prefixsug=dashuju%2520&rsp=1&inputT=11979&rsv_sug4=11979")
print(type(response))
coding=response.encoding
code=response.status_code
cookie=response.cookies
txt=response.text
print(f"encoding:{coding},cookie:{cookie},code:{code},text:{txt}")
Example:
Returning JSON data
response=requests.get("http://httpbin.org/get")
son=response.json()
print(type(son),son)
4.2 GET requests
import requests
da={
    'name':'hello',
    'age':20
}
r=requests.get("http://httpbin.org/get",params=da)
# The response body is JSON text
print(r.text)
# If the body is JSON, json() parses it into Python objects
print(r.json())
4.3 Downloading a binary file
import requests
response=requests.get("https://github.com/favicon.ico")
with open("abc.ico","wb") as f:
    f.write(response.content)
4.4 Adding headers
import requests
xurl="http://daily.zhihu.com/story/9694150"
hs={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Headers go in the headers argument; params would append them to the query string instead
req=requests.get(xurl,headers=hs)
print(req.text)
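The difference between params and headers is easy to confirm offline by preparing a request instead of sending it. This sketch assumes the requests package is installed; the client name my-crawler/0.1 is made up.

```python
import requests

params = {"name": "hello", "age": 20}
headers = {"User-Agent": "my-crawler/0.1"}  # hypothetical client name

# Build and prepare the request without sending it.
prepared = requests.Request(
    "GET", "http://httpbin.org/get", params=params, headers=headers
).prepare()

print(prepared.url)                    # params end up in the query string
print(prepared.headers["User-Agent"])  # headers end up in the header block
```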
4.5 POST requests
import requests
data ={'name':'germey', 'age':'22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)
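What the data argument actually puts on the wire can be inspected by preparing the request without sending it. The sketch below assumes the requests package is installed; the json argument is not used in the text above and is included only as a contrast.

```python
import requests

form = {"name": "germey", "age": "22"}

# data=dict is sent as a URL-encoded form body.
as_form = requests.Request("POST", "http://httpbin.org/post", data=form).prepare()
# json=dict (not used in the text above) is sent as a JSON body instead.
as_json = requests.Request("POST", "http://httpbin.org/post", json=form).prepare()

print(as_form.headers["Content-Type"], as_form.body)
print(as_json.headers["Content-Type"], as_json.body)
```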
5 Logging in with a cookie
import requests
hd={
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.3',
    'cookie': 'd_c0="AJAv89triA2PTtm26av9JIlLwU91c3GBWpQ=|1525273379"; _zap=315b8012-b41b-483b-946f-8352ee66fa2f; _xsrf=o6j4o24nyh6dAqBrzgfXivhSTwZwyvCp; _ga=GA1.2.1220158767.1535276311; _gid=GA1.2.1442354079.1535276311; tgw_l7_route=156dfd931a77f9586c0da07030f2df36; capsion_ticket="2|1:0|10:1535283473|14:capsion_ticket|44:NTU3OTE2NDQwNzYxNDIxYTkyNGY4ODEwMjdlNDNhNGQ=|455352665d1cabb71edab5520cfb10a6e350a0a8e29ba2b2b3ce6b44e1ff25e9"; z_c0="2|1:0|10:1535283538|4:z_c0|92:Mi4xZGlMYkN3QUFBQUFBa0NfejIydUlEU1lBQUFCZ0FsVk5VdDl2WEFCWmZycmhFUVB5Z3k0VkJJX0FwY2VzMDhqUGN3|21be979d097cfabd4d440ae5166359aaf22704844818bb007d6e9085f0081df7"; q_c1=c19c6402cf234680a0efc2add83463ac|1535283539000|1525273381000',
    'Host':'www.zhihu.com'
}
r=requests.get("https://www.zhihu.com/",headers=hd)
print(r.text)
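6 Maintaining a session with Session
import requests
s = requests.Session()
# The first request sets a cookie; the session stores it
s.get('http://httpbin.org/cookies/set/number/123456789')
# The second request sends the stored cookie back automatically
r = s.get('http://httpbin.org/cookies')
print(r.text)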
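How a Session carries state from one request to the next can also be observed without a network. In this sketch, which assumes the requests package is installed, the cookie is placed in the session by hand rather than being set by a server, and the client name my-crawler/0.1 is made up.

```python
import requests

s = requests.Session()
s.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical client name
s.cookies.set("number", "123456789")  # set by hand instead of by a server

# prepare_request() merges the session's headers and cookies into the request.
prepared = s.prepare_request(requests.Request("GET", "http://httpbin.org/cookies"))
print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```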