Study notes on Cui Qingcai's《Python3网络爬虫开发实战教程》(Python 3 Web Crawler Development in Practice), Chapter 1: Using the Basic Libraries

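1 The urllib library in Python 3 contains 4 modules

request module:

Sends HTTP requests.

error module:

The exception-handling module; catching its exceptions keeps the program from terminating unexpectedly.

parse module:

A utility module that provides URL-handling functions.

robotparser module:

Parses a site's robots.txt file to determine which pages may or may not be crawled.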
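As a small offline illustration of the two utility modules listed above, the sketch below splits a URL with urllib.parse and checks paths against robots.txt rules with urllib.robotparser. The rules, the crawler name "MyCrawler", and the example.com URLs are made up for this example; a real crawler would more likely call set_url() and read() to download the site's own robots.txt.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so no network access is needed
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]


def build_parser(lines):
    """Return a RobotFileParser loaded with the given robots.txt lines."""
    rp = RobotFileParser()
    rp.parse(lines)
    return rp


rp = build_parser(ROBOTS_LINES)
parts = urlparse("https://example.com/private/page.html?id=1")

# The URL broken into its components
print(parts.scheme, parts.netloc, parts.path, parts.query)
# Whether the (hypothetical) crawler may fetch each URL under the rules above
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))
```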

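2 The request module

2.1 Sending a request

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):

Purpose:

Sends a request to the given URL (a string or a Request object) and returns a response object of type

http.client.HTTPResponse

Parameters:

data: if this argument is supplied, the request is sent as a POST; the value must be of type bytes (an encoded byte string).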

Example:

import urllib.request as request
import urllib.parse as parse
xurl="https://blog.csdn.net/gaifuxi9518/article/details/80859116"

# data argument: urlencode() returns a str
data=parse.urlencode({'xname':'abc'})
print(type(data),data)
# Encode the string as bytes
bdata=bytes(data,encoding="utf8")
print(bdata)

response=request.urlopen(xurl,data=bdata)
print(type(response),response)

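timeout parameter:

Sets the timeout, in seconds, for blocking operations such as the connection attempt.

For example, with a timeout set, a connection that times out raises urllib.error.URLError, whose reason attribute is then a socket.timeout instance: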

import urllib.request as request
import urllib.parse as parse
import urllib.error as err
import socket
xurl="https://blog.csdn.net/gaifuxi9518/article/details/80859116"

# data argument: urlencode() returns a str
data=parse.urlencode({'xname':'abc'})
print(type(data),data)
# Encode the string as bytes
bdata=bytes(data,encoding="utf8")
print(bdata)

try:
    response=request.urlopen(xurl,data=bdata,timeout=1)
    print(type(response),response)
except err.URLError as e:
    # e.reason holds the underlying cause of the failure
    if isinstance(e.reason,socket.timeout):
        print("Connection failed!")
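A timeout does not always arrive wrapped in URLError: if the connection succeeds but the server is slow to answer, socket.timeout can be raised directly while the response is being read. The sketch below handles both cases. The helper names is_timeout and fetch_status are hypothetical, chosen only for this illustration.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc represents a timeout raised by urlopen()."""
    # A timeout while reading the response can surface as socket.timeout directly
    if isinstance(exc, socket.timeout):
        return True
    # A timeout while connecting is wrapped in URLError, with the cause in .reason
    return isinstance(exc, urllib.error.URLError) and isinstance(
        exc.reason, socket.timeout
    )


def fetch_status(url, timeout=1.0):
    """Return the HTTP status code, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except (urllib.error.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            return None
        # Anything that is not a timeout is left for the caller to handle
        raise
```

A call such as fetch_status(xurl, timeout=1) would then return None on a timeout instead of raising.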

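context parameter:

Must be an ssl.SSLContext instance; it specifies the SSL settings.

cafile and capath parameters:

Specify a CA certificate file and a directory of CA certificates, respectively.

cadefault parameter: deprecated; it defaults to False.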
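To make the context parameter concrete, here is a minimal sketch that builds an ssl.SSLContext with ssl.create_default_context() and passes it to urlopen(). Since Python 3.6 the standard library documentation marks cafile and capath as deprecated in favor of context, so custom CA settings are applied to the context instead. The function names are hypothetical.

```python
import ssl
import urllib.request


def make_verifying_context(cafile=None, capath=None):
    """Build an SSLContext that verifies server certificates.

    With no arguments, the system's default trusted CA certificates are used;
    cafile/capath can point at a custom CA bundle or directory instead.
    """
    return ssl.create_default_context(cafile=cafile, capath=capath)


def open_with_context(url, timeout=5.0):
    """Open an HTTPS URL using an explicitly constructed SSL context."""
    context = make_verifying_context()
    return urllib.request.urlopen(url, timeout=timeout, context=context)
```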

2.2 Building more powerful requests with the Request class

For example:

import urllib.request as req

# Build the request object
request=req.Request(url="https://mp.csdn.net/postedit/82048123")
response=req.urlopen(request)
print(response)

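The class used to build a request object:

Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False,method=None)

Parameters:

url: the request address, as a string;

data: the data to send with the request; it must be of type bytes;

For example, use urlencode() to convert a dict into a string, then use bytes() to encode that string as bytes:

data=parse.urlencode({'xname':'abc'})
# Encode the string as bytes
bdata=bytes(data,encoding="utf8")

headers parameter:

The request headers, given as a dict;

origin_req_host parameter:

The host name or IP address of the requesting party;

unverifiable parameter:

Indicates whether the request is unverifiable, that is, whether the user had no opportunity to approve it; it defaults to False.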
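The parameters above can be checked without sending anything, because a Request object exposes what it will send. The sketch below is an illustration only: build_form_request is a hypothetical helper, and the httpbin.org URL and User-Agent value are placeholders.

```python
import urllib.parse
import urllib.request


def build_form_request(url, fields, user_agent):
    """Build a Request carrying URL-encoded form data and a User-Agent header."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body, headers={"User-Agent": user_agent})


req = build_form_request("http://httpbin.org/post", {"xname": "abc"}, "Mozilla/5.0")

print(req.get_method())              # the method is POST because data was supplied
print(req.data)                      # the encoded request body, as bytes
# Request stores header names capitalized, so the lookup key is "User-agent"
print(req.get_header("User-agent"))
```

Passing the resulting object to urlopen() sends the request; passing method="PUT" (or another verb) to Request overrides the method that would otherwise be inferred from data.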

2.3 Getting cookies

import http.cookiejar,urllib.request as request

# Create the object that stores the cookies
cookie=http.cookiejar.CookieJar()

# Create the cookie handler
handler=request.HTTPCookieProcessor(cookie)

# Create the opener
opener=request.build_opener(handler)

cos=opener.open("https://www.baidu.com")
for it in cookie:
    print(it.name+":"+it.value)

2.4 Saving cookies to a file

import http.cookiejar as cookjar,urllib.request as request

fname="hello.txt"
# Store the cookies in Mozilla (Netscape) cookie-file format
cook=cookjar.MozillaCookieJar(fname)
# Create a cookie handler from the cook object
handler=request.HTTPCookieProcessor(cook)
# Create an opener from the cookie handler
opener=request.build_opener(handler)
# Send the request and get the response object
response=opener.open(fullurl="https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%E5%A4%A7%E6%95%B0%E6%8D%AE&rsv_pq=9bb4d29100027393&rsv_t=afc51HKG3RBkn5qAEeZOSZ4xr5UU7H8B%2BhIe42q1K%2FDMbB6FNYTF9PEOuv0&rqlang=cn&rsv_enter=1&rsv_sug3=4&rsv_sug1=4&rsv_sug7=101")
# Write the cookies to the file, keeping session cookies and expired cookies as well
cook.save(ignore_discard=True,ignore_expires=True)
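A saved cookie file is only useful if it can be read back. As a sketch of the reverse direction, the hypothetical helper below loads a Mozilla-format file with MozillaCookieJar.load() and builds an opener that sends those cookies with later requests. It assumes the file was written by save() as above.

```python
import http.cookiejar
import urllib.request


def load_cookie_opener(path):
    """Load cookies from a Mozilla-format file and return (jar, opener)."""
    jar = http.cookiejar.MozillaCookieJar(path)
    # Use the same flags as save() so session and expired cookies are kept too
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return jar, opener
```

For instance, jar, opener = load_cookie_opener("hello.txt") followed by opener.open(...) would reuse the cookies stored earlier.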

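3 The error module

3.1 error.URLError: raised when a URL cannot be opened (for example, the page cannot be reached); the exception object has a reason attribute

Example:

try:
    ...
except error.URLError as ex:
    print(ex.reason)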

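3.2 error.HTTPError: the exception for HTTP errors

This object has three attributes: code (the HTTP response status code), reason (the error message), and headers (the response headers of the failed request).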
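Because HTTPError is a subclass of URLError, code that distinguishes the two has to test for HTTPError first. The following sketch shows one way to arrange that; describe_error and fetch are hypothetical helper names, not part of urllib.

```python
import urllib.error
import urllib.request


def describe_error(exc):
    """Return a short description of an exception raised by urlopen()."""
    # HTTPError is a subclass of URLError, so it must be checked first
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"URL error: {exc.reason}"
    return f"unexpected error: {exc!r}"


def fetch(url, timeout=5.0):
    """Return the response body as bytes, or None after printing the error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # Catches HTTPError as well, since it derives from URLError
        print(describe_error(exc))
        return None
```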

4 The requests library

4.1 Getting a response object

import requests

response=requests.get("https://www.baidu.com/s?ie=utf-8&f=3&rsv_bp=1&rsv_idx=1&tn=baidu&wd=dashuju%20&oq=python%2526lt%253B%2520HTTP%2526lt%253BookieProcessor&rsv_pq=b7227c130002b1ee&rsv_t=5f70swssIMLU94lSy94bhTZqTlFsRzC6fWci6pbh9j7I5amzH3KjPMJXcxw&rqlang=cn&rsv_enter=0&rsv_sug3=32&rsv_sug1=33&rsv_sug7=101&rsv_sug2=0&prefixsug=dashuju%2520&rsp=1&inputT=11979&rsv_sug4=11979")
print(type(response))
coding=response.encoding
code=response.status_code
cookie=response.cookies
txt=response.text
print(f"encoding:{coding},cookie:{cookie},code:{code},text:{txt}")

Example:

Getting JSON data

response=requests.get("http://httpbin.org/get")
# json() parses the JSON response body into a Python object (a dict here)
son=response.json()
print(type(son),son)

4.2 GET requests

import requests
da={
'name':'hello',
'age':20
}
r=requests.get("http://httpbin.org/get",params=da)

# text is the response body as a string (JSON text in this case)
print(r.text)

# If the body is JSON, json() parses it into a Python object

print(r.json())
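To see what the params argument does to the URL without contacting the server, a request can be prepared but not sent. This is a sketch using requests.Request and its prepare() method; preview_get_url is a hypothetical helper name.

```python
import requests


def preview_get_url(url, params):
    """Return the final URL that requests would use for a GET with these params."""
    prepared = requests.Request("GET", url, params=params).prepare()
    return prepared.url


# The dict is URL-encoded and appended to the address as a query string
print(preview_get_url("http://httpbin.org/get", {"name": "hello", "age": 20}))
```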

4.3 Fetching binary files

import requests
response=requests.get("https://github.com/favicon.ico")
with open("abc.ico","wb") as f:
    # content holds the raw response body as bytes
    f.write(response.content)

4.4 Adding headers

import requests
xurl="http://daily.zhihu.com/story/9694150"
hs={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
req=requests.get(xurl,headers=hs)
print(req.text)

4.5 The post method

import requests
data ={'name':'germey', 'age':'22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)
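With data given as a dict, requests sends it as a URL-encoded form body. The sketch below inspects that body and the Content-Type header on a prepared request, so nothing goes over the network; preview_post is a hypothetical helper name.

```python
import requests


def preview_post(url, data):
    """Return (content_type, body) that requests would send for a form POST."""
    prepared = requests.Request("POST", url, data=data).prepare()
    return prepared.headers["Content-Type"], prepared.body


content_type, body = preview_post(
    "http://httpbin.org/post", {"name": "germey", "age": "22"}
)
print(content_type)
print(body)
```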

5 Logging in with cookies

import requests
hd={
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.3',
    'cookie': 'd_c0="AJAv89triA2PTtm26av9JIlLwU91c3GBWpQ=|1525273379"; _zap=315b8012-b41b-483b-946f-8352ee66fa2f; _xsrf=o6j4o24nyh6dAqBrzgfXivhSTwZwyvCp; _ga=GA1.2.1220158767.1535276311; _gid=GA1.2.1442354079.1535276311; tgw_l7_route=156dfd931a77f9586c0da07030f2df36; capsion_ticket="2|1:0|10:1535283473|14:capsion_ticket|44:NTU3OTE2NDQwNzYxNDIxYTkyNGY4ODEwMjdlNDNhNGQ=|455352665d1cabb71edab5520cfb10a6e350a0a8e29ba2b2b3ce6b44e1ff25e9"; z_c0="2|1:0|10:1535283538|4:z_c0|92:Mi4xZGlMYkN3QUFBQUFBa0NfejIydUlEU1lBQUFCZ0FsVk5VdDl2WEFCWmZycmhFUVB5Z3k0VkJJX0FwY2VzMDhqUGN3|21be979d097cfabd4d440ae5166359aaf22704844818bb007d6e9085f0081df7"; q_c1=c19c6402cf234680a0efc2add83463ac|1535283539000|1525273381000',
   'Host':'www.zhihu.com'
}
r=requests.get("https://www.zhihu.com/",headers=hd)
print(r.text)

6 Maintaining a session with Session

import requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
