Study notes on Cui Qingcai's《Python3网络爬虫开发实战教程》(Python 3 Web Crawler Development in Practice), Chapter 1: Using the Basic Libraries

Copyright notice: https://blog.csdn.net/wshsdm/article/details/82048123

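1 The urllib library in Python 3 contains four modules

request module:

     Sends HTTP requests.

error module:

    Exception-handling module; catching its exceptions keeps the program from terminating unexpectedly.

parse module:

   Utility module that provides URL-handling functions.

robotparser module:

   Parses a site's robots.txt file to determine which pages may not be crawled.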
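As a minimal sketch of what the robotparser module does, the snippet below feeds a made-up robots.txt to RobotFileParser.parse() and asks whether two URLs may be fetched. The rules and the example.com URLs are assumptions for illustration only; nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)  # parse in-memory rules instead of fetching them

# can_fetch(user_agent, url) reports whether the rules allow the URL
allowed = parser.can_fetch("*", "https://example.com/index.html")
blocked = parser.can_fetch("*", "https://example.com/private/a.html")
print(allowed, blocked)
```

For a real site, set_url() followed by read() would download the rules before can_fetch() is called.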

2 The request module

2.1 Sending requests

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):

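Purpose:

Sends a request to a URL given as a string (a Request object is also accepted) and returns a response object.

Returns:

http.client.HTTPResponse (for HTTP and HTTPS URLs)

Parameters:

data: if this argument is supplied, the request is sent as a POST; the value must be of type bytes (an encoded byte string).

Example: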

import urllib.request as request
import urllib.parse as parse
xurl="https://blog.csdn.net/gaifuxi9518/article/details/80859116"

# data argument: urlencode() returns a str
data=parse.urlencode({'xname':'abc'})
print(type(data),data)
# encode the string into bytes
bdata=bytes(data,encoding="utf8")
print(bdata)

# passing data makes urlopen send a POST request
response=request.urlopen(xurl,data=bdata)
print(type(response),response)
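The effect of the data argument can be checked without sending anything. The sketch below wraps the same kind of payload in urllib's Request class (covered in 2.2) and inspects which HTTP method would be used. The httpbin.org URLs are only placeholders here, since no request is actually made.

```python
import urllib.parse as parse
import urllib.request as request

# urlencode() turns a dict into a query string (str)
encoded = parse.urlencode({"xname": "abc"})
# the data argument must be bytes, so encode the string
payload = encoded.encode("utf-8")

# Without data the request defaults to GET; with data it becomes POST
get_req = request.Request("http://httpbin.org/get")
post_req = request.Request("http://httpbin.org/post", data=payload)

print(encoded, payload)
print(get_req.get_method(), post_req.get_method())
```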

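timeout parameter:

Sets the timeout, in seconds, for blocking operations such as the connection attempt.

For example, set a timeout; if the connection times out, urllib.error.URLError is raised and its reason attribute is a socket.timeout instance (a timeout while reading the response may instead surface as socket.timeout directly):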

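import urllib.request as request
import urllib.parse as parse
import urllib.error as err
import socket
xurl="https://blog.csdn.net/gaifuxi9518/article/details/80859116"

# data argument: urlencode() returns a str
data=parse.urlencode({'xname':'abc'})
print(type(data),data)
# encode the string into bytes
bdata=bytes(data,encoding="utf8")
print(bdata)

try:
    response=request.urlopen(xurl,data=bdata,timeout=1)
    print(type(response),response)
except err.URLError as e:
    # a connection timeout is wrapped in URLError; the cause is in e.reason
    if isinstance(e.reason,socket.timeout):
        print("Connection failed!")
except socket.timeout:
    # a timeout while reading the response is raised directly
    print("Connection failed!")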

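context parameter:

Must be an ssl.SSLContext instance; it specifies the SSL settings.

cafile and capath parameters:

Specify a CA certificate file and the directory that holds CA certificates, respectively.

cadefault parameter: deprecated and ignored; defaults to False.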
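A minimal sketch of the context parameter, assuming the default certificate settings are acceptable: ssl.create_default_context() builds an SSLContext that verifies certificates and host names, and that object is what urlopen expects. The fetch_status helper is a hypothetical name used for illustration and is not called here, so nothing is sent. Since Python 3.6 the cafile, capath, and cadefault arguments have been deprecated in favor of context.

```python
import ssl
import urllib.request as request

# Default client-side context: certificate and host name checks are on
context = ssl.create_default_context()
print(context.verify_mode, context.check_hostname)

def fetch_status(url, ctx=context, timeout=5):
    """Hypothetical helper: open url with the given SSL context."""
    with request.urlopen(url, timeout=timeout, context=ctx) as response:
        return response.status
```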

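2.2 Building more powerful requests with the request module

Example:

import urllib.request as req

# build the request object
request=req.Request(url="https://mp.csdn.net/postedit/82048123")
# urlopen accepts a Request object as well as a URL string
response=req.urlopen(request)
print(response)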

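The class used to build request objects:

Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False,method=None)

Parameters:

url: the request URL, as a string.

data: the data to send with the request; must be of type bytes.

For example, use urlencode() to convert a dict into a string, then use bytes() to convert the string into a byte string:

data=parse.urlencode({'xname':'abc'})
# encode the string into bytes
bdata=bytes(data,encoding="utf8")

headers parameter:

The request headers, given as a dict.

origin_req_host parameter:

The host name or IP address of the party that originated the request.

unverifiable parameter:

Indicates whether the request is unverifiable, that is, whether the user had no opportunity to approve it (for example, an image fetched automatically while loading a page); defaults to False.

method parameter:

The HTTP method to use, as a string, such as GET, POST, or PUT.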
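The parameters can be combined as in the sketch below, which builds a Request and inspects it without sending it. The User-Agent value is a made-up placeholder and the httpbin.org URL stands in for a real endpoint. Note that Request stores header names with only the first letter capitalized, so the lookup uses "User-agent".

```python
import urllib.parse as parse
import urllib.request as request

payload = parse.urlencode({"name": "germey"}).encode("utf-8")

req = request.Request(
    url="http://httpbin.org/post",
    data=payload,
    headers={"User-Agent": "Mozilla/5.0 (compatible; notes-example)"},  # placeholder value
    method="POST",
)

# Inspect the request object; nothing has been sent yet
summary = (req.get_method(), req.host, req.get_header("User-agent"), req.data)
print(summary)
```

Passing req to request.urlopen() would then send it.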

2.3 Getting cookies

import http.cookiejar,urllib.request as request

# create the object that stores cookies
cookie=http.cookiejar.CookieJar()

# create the cookie handler
handler=request.HTTPCookieProcessor(cookie)

# create the opener object
opener=request.build_opener(handler)

# cookies set by the response are stored in the CookieJar
response=opener.open("https://www.baidu.com")
for it in cookie:
    print(it.name+":"+it.value)

2.4 Saving cookies to a file

import http.cookiejar as cookjar,urllib.request as request

fname="hello.txt"
# store the cookies in a file in Mozilla format
cook=cookjar.MozillaCookieJar(fname)
# create the cookie handler from the cook object
handler=request.HTTPCookieProcessor(cook)
# create the opener object from the cookie handler
opener=request.build_opener(handler)
# send the request and get the response object
response=opener.open(fullurl="https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=%E5%A4%A7%E6%95%B0%E6%8D%AE&rsv_pq=9bb4d29100027393&rsv_t=afc51HKG3RBkn5qAEeZOSZ4xr5UU7H8B%2BhIe42q1K%2FDMbB6FNYTF9PEOuv0&rqlang=cn&rsv_enter=1&rsv_sug3=4&rsv_sug1=4&rsv_sug7=101")
# write the cookies to the file, including session cookies and expired ones
cook.save(ignore_discard=True,ignore_expires=True)
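A saved file can be read back with load(). The sketch below is a self-contained round trip under stated assumptions: the make_cookie helper, the cookie name and value, and the example.com domain are all hypothetical, and the cookie is built by hand so that no network access is needed. Because it is a session cookie (no expiry), ignore_discard=True is required on both save and load, otherwise it would be skipped.

```python
import http.cookiejar as cookiejar
import os
import tempfile

def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie by hand."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# Save: the same call as in the notes above
saved = cookiejar.MozillaCookieJar(path)
saved.set_cookie(make_cookie("token", "abc123", "example.com"))
saved.save(ignore_discard=True, ignore_expires=True)

# Load the file into a fresh jar
loaded = cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
print(restored)
```

The loaded jar can be passed to HTTPCookieProcessor in the same way as a new one, so later requests reuse the stored cookies.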

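3 The error module

3.1 General request failures (for example, no network or an unknown host): error.URLError; the exception object has a reason attribute

Example:

try:
    ...
except error.URLError as ex:
    print(ex.reason)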

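3.2 HTTP error exceptions: error.HTTPError

    A subclass of URLError with three attributes: code (the HTTP response status code), reason (the error message), and headers (the response headers).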
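Because HTTPError is a subclass of URLError, it has to be caught first, otherwise the URLError branch would swallow it. The sketch below illustrates that ordering; the classify function and the two raise_* helpers are hypothetical names, and the exceptions are raised by hand so that the example runs without a network.

```python
import urllib.error as error

def classify(action):
    """Run action() and report which kind of urllib error it raised."""
    try:
        action()
    except error.HTTPError as ex:   # the more specific class goes first
        return ("http", ex.code, ex.reason)
    except error.URLError as ex:    # any other URL-level failure
        return ("url", None, str(ex.reason))
    return ("ok", None, None)

def raise_http_404():
    # Simulated server response with status 404
    raise error.HTTPError("http://example.com/missing", 404, "Not Found", None, None)

def raise_url_error():
    # Simulated connection failure
    raise error.URLError("connection refused")

http_result = classify(raise_http_404)
url_result = classify(raise_url_error)
print(http_result, url_result)
```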

4 The requests library

4.1 Getting a response object

import requests

response=requests.get("https://www.baidu.com/s?ie=utf-8&f=3&rsv_bp=1&rsv_idx=1&tn=baidu&wd=dashuju%20&oq=python%2526lt%253B%2520HTTP%2526lt%253BookieProcessor&rsv_pq=b7227c130002b1ee&rsv_t=5f70swssIMLU94lSy94bhTZqTlFsRzC6fWci6pbh9j7I5amzH3KjPMJXcxw&rqlang=cn&rsv_enter=0&rsv_sug3=32&rsv_sug1=33&rsv_sug7=101&rsv_sug2=0&prefixsug=dashuju%2520&rsp=1&inputT=11979&rsv_sug4=11979")
print(type(response))
coding=response.encoding
code=response.status_code
cookie=response.cookies
txt=response.text
print(f"encoding:{coding},cookie:{cookie},code:{code},text:{txt}")

Example:

Getting the response as JSON data

response=requests.get("http://httpbin.org/get")
son=response.json()
print(type(son),son)

4.2 GET requests

import requests
da={
    'name':'hello',
    'age':20
}
r=requests.get("http://httpbin.org/get",params=da)

# the response body is a JSON-formatted string
print(r.text)

# if the body is JSON, json() parses it into a Python object (a dict here)

print(r.json())
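What params does can be seen without contacting the server. As a minimal sketch, the snippet below prepares the same GET request offline and prints the final URL, showing that the dict is encoded into the query string. Using Request(...).prepare() here is just a way to inspect the request; requests.get() performs the same step internally before sending.

```python
import requests

params = {"name": "hello", "age": 20}

# Build and prepare the request without sending it
prepared = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()

final_url = prepared.url
print(final_url)
```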

4.3 Downloading binary files

import requests
response=requests.get("https://github.com/favicon.ico")
with open("abc.ico","wb") as f:
    f.write(response.content)

4.4 Adding headers

import requests
xurl="http://daily.zhihu.com/story/9694150"
hs={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
req=requests.get(xurl,headers=hs)
print(req.text)

4.5 POST requests

import requests
data ={'name':'germey', 'age':'22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)
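As a sketch of what the data argument produces, the same POST can be prepared offline and inspected: the dict is form-encoded into the request body and the Content-Type header is set accordingly. Nothing is sent; the httpbin.org URL is only carried along from the example above.

```python
import requests

data = {"name": "germey", "age": "22"}

# Prepare the POST request without sending it
prepared = requests.Request("POST", "http://httpbin.org/post", data=data).prepare()

body = prepared.body
content_type = prepared.headers["Content-Type"]
print(body, content_type)
```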

5 Logging in with cookies

import requests
hd={
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.3',
    'cookie': 'd_c0="AJAv89triA2PTtm26av9JIlLwU91c3GBWpQ=|1525273379"; _zap=315b8012-b41b-483b-946f-8352ee66fa2f; _xsrf=o6j4o24nyh6dAqBrzgfXivhSTwZwyvCp; _ga=GA1.2.1220158767.1535276311; _gid=GA1.2.1442354079.1535276311; tgw_l7_route=156dfd931a77f9586c0da07030f2df36; capsion_ticket="2|1:0|10:1535283473|14:capsion_ticket|44:NTU3OTE2NDQwNzYxNDIxYTkyNGY4ODEwMjdlNDNhNGQ=|455352665d1cabb71edab5520cfb10a6e350a0a8e29ba2b2b3ce6b44e1ff25e9"; z_c0="2|1:0|10:1535283538|4:z_c0|92:Mi4xZGlMYkN3QUFBQUFBa0NfejIydUlEU1lBQUFCZ0FsVk5VdDl2WEFCWmZycmhFUVB5Z3k0VkJJX0FwY2VzMDhqUGN3|21be979d097cfabd4d440ae5166359aaf22704844818bb007d6e9085f0081df7"; q_c1=c19c6402cf234680a0efc2add83463ac|1535283539000|1525273381000',
   'Host':'www.zhihu.com'
}
r=requests.get("https://www.zhihu.com/",headers=hd)
print(r.text)

6 Maintaining a session with Session

import requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
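The reason the second request sees the cookie is that a Session keeps a cookie jar (and default headers) and merges them into every request it prepares. The sketch below shows this offline under stated assumptions: the cookie is placed into the jar by hand instead of being set by the server, and the User-Agent value is a made-up placeholder.

```python
import requests

s = requests.Session()

# Session-wide defaults: applied to every request made through s
s.headers.update({"User-Agent": "notes-example/0.1"})  # placeholder value
s.cookies.set("number", "123456789", domain="httpbin.org", path="/")

# Prepare a request through the session without sending it
prepared = s.prepare_request(requests.Request("GET", "http://httpbin.org/cookies"))

cookie_header = prepared.headers.get("Cookie")
user_agent = prepared.headers["User-Agent"]
print(cookie_header, user_agent)
```

Two separate requests.get() calls would not share this state, which is why the notes use one Session object for both requests.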


Reposted from blog.csdn.net/zhangMY12138/article/details/82106897