Python Study Notes 5: Python Web Scraping (Network Requests)

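The urllib library

urllib is the most basic network request library. It can mimic a browser: send a request to a given server and save the data that comes back.

The urlopen function

In urllib, everything related to making network requests lives in the urllib.request module.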

from urllib import request
resp=request.urlopen('http://www.baidu.com')
print(resp.read())  # read the response body (returned as bytes)

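The three lines above fetch the full source of the Baidu home page. The result should match what you see with right-click > View Source in a browser.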
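As a small offline sketch of what read() returns: urlopen also accepts data: URLs, so the example below uses one in place of the Baidu address (an assumption made here so that it runs without network access). The point is that read() yields bytes, which have to be decoded to get a str.

```python
from urllib import request

# A data: URL stands in for a real web page so the example runs offline.
# (Assumption for illustration; the notes above use http://www.baidu.com.)
resp = request.urlopen('data:text/plain;charset=utf-8,hello%20world')

raw = resp.read()           # bytes, exactly as received
text = raw.decode('utf-8')  # decode to str before treating it as text

print(type(raw).__name__, type(text).__name__)
print(text)
```

With a real page the same two steps apply, provided the page really is encoded as UTF-8.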

The parse module

urllib.parse is the basic module for parsing and building URLs.

The urlretrieve function

from urllib import request
request.urlretrieve('url', 'local_filename')  # 'url' is a placeholder, e.g. the address of an image

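The urlencode function

(urlencode percent-encodes Chinese and other special characters. Both it and parse_qs are functions in the parse module, and they do opposite jobs.)

from urllib import parse
params={'name':'cai','age':18,'greet':'hello world'}
qs=parse.urlencode(params)
print(qs)  # name=cai&age=18&greet=hello+world

from urllib import request,parse
url='http://www.baidu.com/s'
params={"wd":"刘德华"}
qs=parse.urlencode(params)  # the Chinese value is percent-encoded as UTF-8
url=url+"?"+qs
resp=request.urlopen(url)
print(resp.read())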

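The parse_qs function

from urllib import parse
params = {'name':'cai','age':18,'greet':'hello world'}
qs=parse.urlencode(params)
print(qs)
res = parse.parse_qs(qs)  # decodes the query string back into a dict; every value is a list of strings
print(res)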
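To make the "opposite functions" point concrete, here is a minimal round-trip sketch. The parameter values are only illustrative. Note that parse_qs does not return the original dict exactly: each value comes back as a list of strings, so the integer 18 returns as ['18'].

```python
from urllib import parse

params = {'wd': '刘德华', 'age': 18}

qs = parse.urlencode(params)   # dict -> percent-encoded query string
decoded = parse.parse_qs(qs)   # query string -> dict of lists

print(qs)
print(decoded)
```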

urlparse and urlsplit

Use these when you want to split a URL into its components.

from urllib import parse
url='http://www.baidu.com/s?username=cai'
result=parse.urlparse(url)
print(result)
print('scheme:',result.scheme)
print('netloc:',result.netloc)
print('path:',result.path)
print('query:',result.query)

ParseResult(scheme='http', netloc='www.baidu.com', path='/s', params='', query='username=cai', fragment='')
scheme: http
netloc: www.baidu.com
path: /s
query: username=cai

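urlparse and urlsplit are essentially the same, except that the result of urlparse has an extra params attribute: whatever sits between ; and ? in the URL. For example, with url='http://www.baidu.com/s;hello?username=cai', params is 'hello'.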
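The difference can be checked side by side. This is a minimal sketch using the example URL from the sentence above:

```python
from urllib import parse

url = 'http://www.baidu.com/s;hello?username=cai'

parsed = parse.urlparse(url)
split = parse.urlsplit(url)

print(parsed.path, parsed.params)   # params is split off from the path
print(split.path)                   # urlsplit leaves ';hello' in the path
print(hasattr(split, 'params'))     # SplitResult has no params attribute
```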

Scraping Lagou job listings with request

To add request headers to a request, you need the request.Request class.

from urllib import request
url='https://www.lagou.com/jobs/list_python?labelWords=sug&fromSearch=true&suginput=pyth'

resq=request.urlopen(url)
print(resq.read())

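The code above returns incomplete information (because of the site's anti-scraping measures), so the next version disguises the request with a browser User-Agent header.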

from urllib import request
url='https://www.lagou.com/jobs/list_python?labelWords=sug&fromSearch=true&suginput=pyth'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'}
res=request.Request(url,headers=headers)
reqs=request.urlopen(res)
print(reqs.read())
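Before sending anything, it can help to inspect what a Request object will carry. The sketch below builds a Request without calling urlopen, so it makes no network access. The User-Agent value is shortened here for readability (an assumption; in practice copy the full value from your browser).

```python
from urllib import request

url = 'https://www.lagou.com/jobs/list_python?labelWords=sug&fromSearch=true&suginput=pyth'
# Shortened User-Agent, for illustration only.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Building a Request does not touch the network; only urlopen() sends it.
req = request.Request(url, headers=headers)

print(req.get_method())     # no data was given, so the method is GET
print(req.header_items())   # the headers that would be sent
```

Request stores header names in capitalized form, so the header is looked up as 'User-agent' when using req.get_header().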

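Viewing the page source, however, shows that the job listings are not in this page's HTML. Look through the files listed in the browser's network panel for a JSON response (this takes some experience): there is one named positionAjax.json with type json. Click it and check the Response tab on the right; the job data is in that JSON (for easier reading, paste it into www.json.cn). Then open the Headers tab for that entry and copy the request URL, which is the real endpoint: https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false. The request method is POST, so the request also needs a data field; its contents are listed as form data under the Params tab on the right.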

from urllib import request
url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
data={
'first':'true',
'pn':1,
'kd':'python'
}
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'}
res=request.Request(url,headers=headers,data=data,method='POST')
# The request is a POST, which is also why the browser's address bar does not change when you page through results on Lagou
reqs=request.urlopen(res)
print(reqs.read())

The code above raises an error: data must first be encoded with urlencode and then converted to bytes with encode.

from urllib import request,parse
url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
data={
'first':'true',
'pn':1,
'kd':'python'
}
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'}
res=request.Request(url,headers=headers,data=parse.urlencode(data).encode('utf-8'),method='POST')
reqs=request.urlopen(res)
print(reqs.read())
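The two-step conversion of the form data can be looked at in isolation. This is a minimal sketch that only builds the request body and the Request object; nothing is sent.

```python
from urllib import request, parse

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
data = {'first': 'true', 'pn': 1, 'kd': 'python'}

form = parse.urlencode(data)    # str: 'first=true&pn=1&kd=python'
body = form.encode('utf-8')     # bytes, which is what Request expects

req = request.Request(url, data=body)  # nothing is sent yet
print(body)
print(req.get_method())         # a Request with data defaults to POST
```

Because a Request that carries data defaults to POST, passing method='POST' as in the notes is optional, though it makes the intent explicit.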

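That request now goes through, but the response only becomes readable after read().decode('utf-8'). It says the site is being accessed too frequently, which is really the anti-scraping mechanism again, so more request headers are added.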

from urllib import request,parse
url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
data={
'first':'true',
'pn':1,
'kd':'python'
}

headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
                'Referer':'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
                }
res=request.Request(url,headers=headers,data=parse.urlencode(data).encode('utf-8'),method='POST')
reqs=request.urlopen(res)
print(reqs.read().decode('utf-8'))  # decode so the Chinese text in the JSON is readable

The ProxyHandler handler (proxy settings)

Proxy sites:
1. Xici free proxy IPs (all free): http://www.xicidaili.com
2. Kuaidaili: http://www.kuaidaili.com/
3. Dailiyun: http://www.dailiyun.com/

from urllib import request
# Without a proxy
# url='http://httpbin.org/ip'
# resq=request.urlopen(url)  # under the hood this works the same way as the opener below
# print(resq.read())  # prints this machine's public IP
# With a proxy
url='http://httpbin.org/ip'
# 1. Pass the proxy to ProxyHandler to build a handler
handler=request.ProxyHandler({"http":"218.66.161.88:31769"})
# 2. Build an opener from that handler
opener=request.build_opener(handler)
# 3. Send the request through the opener
resq=opener.open(url)
print(resq.read())
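A proxy address like the one above may no longer respond, so it can be worth wrapping the three steps in a function that sets a timeout and catches failures. The sketch below is one way to do that. The helper name fetch_via_proxy and the 5-second timeout are assumptions for illustration, not part of urllib.

```python
from urllib import request, error

def fetch_via_proxy(url, proxy, timeout=5):
    """Return the response body as str, or None if the request fails.

    Hypothetical helper for illustration; not part of urllib.
    """
    handler = request.ProxyHandler({'http': proxy})
    opener = request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except (error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error statuses.
        return None

# Building the handler does not contact the proxy, so this is safe offline.
handler = request.ProxyHandler({'http': '218.66.161.88:31769'})
print(handler.proxies)
```

A call such as fetch_via_proxy('http://httpbin.org/ip', '218.66.161.88:31769') would then return either the body reported through the proxy or None.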

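cookie

A cookie can hold only a small amount of data. The limit differs between browsers but is generally no more than 4 KB.
One way to visit Dapeng's page on Renren (which requires login) is to send the cookie of a logged-in session: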

from urllib import request
dapeng_url='XXX'
headers={
'User-Agent':"XXXX",
'Cookie':'XXXX'
}
resq=request.Request(url=dapeng_url,headers=headers)
req=request.urlopen(resq)
with open('dapeng.html','w',encoding='utf-8') as f:
    # write() needs a str
    # req.read() returns bytes
    # bytes -> decode -> str
    # str -> encode -> bytes
    f.write(req.read().decode('utf-8'))

http.cookiejar

The main classes in this module are CookieJar (cookies are kept only in memory), FileCookieJar (derived from CookieJar; cookies can be stored in a file), MozillaCookieJar and LWPCookieJar. The last two both inherit from FileCookieJar and are very similar.
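The inheritance relationships described above can be confirmed directly from the classes. This is a short sketch:

```python
from http.cookiejar import CookieJar, FileCookieJar, MozillaCookieJar, LWPCookieJar

for cls in (FileCookieJar, MozillaCookieJar, LWPCookieJar):
    # __mro__ lists the class followed by its ancestors; drop the class
    # itself (first entry) and object (last entry).
    print(cls.__name__, '->', [base.__name__ for base in cls.__mro__[1:-1]])
```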

Getting the cookie dynamically by logging in, then visiting Dapeng's Renren profile page

from urllib import request
from urllib import parse
from http.cookiejar import CookieJar

# 1. Log in
# 1.1 Create a CookieJar object
cookiejar = CookieJar()
# 1.2 Use the cookiejar to create an HTTPCookieProcessor object
handler=request.HTTPCookieProcessor(cookiejar)
# 1.3 Use that handler to create an opener
opener = request.build_opener(handler)
# 1.4 Use the opener to send the login request (Renren email and password)
headers = {
'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
# Firefox shows this long value abbreviated, so copying it from there pasted an ellipsis character and the request failed with: UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026' in position 30: ordinal not in range(256). Fix: copy the value from the request headers in Chrome instead.
}
data={
'email':"YOUR_EMAIL",        # placeholder: your Renren account
'password':"YOUR_PASSWORD"   # placeholder: your Renren password
}
login_url = "http://www.renren.com/PLogin.do"  # a wrong address here raises HTTPError: /sysHome
req=request.Request(login_url,data=parse.urlencode(data).encode('utf-8'),headers=headers)
opener.open(req)
# 2. Visit the profile page
dapeng_url = "http://www.renren.com/880151247/profile"
req = request.Request(dapeng_url,headers=headers)
resp = opener.open(req)
with open('dapeng.html','w',encoding='utf-8') as f:
    f.write(resp.read().decode('utf-8'))

The same code organized into functions

from urllib import request
from urllib import parse
from http.cookiejar import CookieJar

headers = {
    'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
    # Firefox shows this long value abbreviated, so copying it from there pasted an ellipsis character and the request failed with: UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026' in position 30: ordinal not in range(256). Fix: copy the value from the request headers in Chrome instead.
}

def get_opener():
    # 1. Log in
    # 1.1 Create a CookieJar object
    cookiejar = CookieJar()
    # 1.2 Use the cookiejar to create an HTTPCookieProcessor object
    handler=request.HTTPCookieProcessor(cookiejar)
    # 1.3 Use that handler to create an opener
    opener = request.build_opener(handler)
    return opener

def login_renren(opener):
    # 1.4 Use the opener to send the login request (Renren email and password)

    data={
        'email':"YOUR_EMAIL",        # placeholder: your Renren account
        'password':"YOUR_PASSWORD"   # placeholder: your Renren password
    }
    login_url = "http://www.renren.com/PLogin.do"  # a wrong address here raises HTTPError: /sysHome
    req=request.Request(login_url,data=parse.urlencode(data).encode('utf-8'),headers=headers)
    opener.open(req)

def visit_profile(opener):
    # 2. Visit the profile page
    dapeng_url = "http://www.renren.com/880151247/profile"
    req = request.Request(dapeng_url,headers=headers)
    resp = opener.open(req)
    # dapeng.html is created automatically; opening it shows Dapeng's profile page
    with open('dapeng.html','w',encoding='utf-8') as f:
        f.write(resp.read().decode('utf-8'))

if __name__ == '__main__':
    opener =get_opener()
    login_renren(opener)
    visit_profile(opener)

Saving cookies

from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie.txt')
handler=request.HTTPCookieProcessor(cookiejar)
opener=request.build_opener(handler)

req=opener.open('http://httpbin.org/cookies')
cookiejar.save(ignore_discard=True)  # session cookies expire when the program ends, so ignore_discard=True is needed to save them

Loading cookies

from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie.txt')
cookiejar.load(ignore_discard=True) 
handler=request.HTTPCookieProcessor(cookiejar)
opener=request.build_opener(handler)

req=opener.open('http://httpbin.org/cookies')
for cookie in cookiejar:
    print(cookie)
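The effect of ignore_discard can be seen without any network access by putting a hand-built session cookie into the jar. The sketch below does that. The helper make_session_cookie, the cookie name and value, and the temporary file location are all assumptions for illustration.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar

def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper for this sketch)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False,
        expires=None,   # no expiry date ...
        discard=True,   # ... so it is marked to be discarded at session end
        comment=None, comment_url=None, rest={},
    )

path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

jar = MozillaCookieJar(path)
jar.set_cookie(make_session_cookie('course', 'spider', 'httpbin.org'))
jar.save(ignore_discard=True)        # without this flag the cookie is skipped

strict = MozillaCookieJar(path)
strict.load()                        # default: discardable cookies are dropped
relaxed = MozillaCookieJar(path)
relaxed.load(ignore_discard=True)    # keeps the session cookie

print(len(strict), len(relaxed))
```

The same flag is therefore needed on both sides: once when saving and again when loading.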

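The requests library

The difference between response.text and response.content
1. response.content: the data exactly as fetched from the network, with no decoding applied; its type is bytes. Strings stored on disk or sent over the network are always bytes.
2. response.text: a str, produced by requests decoding response.content. Decoding needs an encoding, and requests guesses one. The guess is sometimes wrong and produces garbled text; in that case decode manually with response.content.decode('utf-8').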
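Why a wrong guess produces garbled text can be shown with plain Python strings, without requests. In this sketch, content stands in for response.content, and ISO-8859-1 is used as an example of a wrong guess; both are assumptions for illustration.

```python
original = '刘德华'
content = original.encode('utf-8')      # what arrives over the network: bytes

wrong = content.decode('iso-8859-1')    # decoding with a wrong guess: garbled
right = content.decode('utf-8')         # decoding with the real encoding

print(repr(wrong))
print(right)
```

The bytes are identical in both cases; only the encoding chosen for decoding differs, which is why decoding response.content manually fixes the problem.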

import requests

data={
'first':"true",
'pn':'1',
'kd':'python'
}
headers={
'Referer':'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
}

response=requests.post("https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false",data=data,headers=headers)
print(response.json())

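The anti-scraping mechanism responds with: {'status': False, 'msg': '您操作太频繁,请稍后再访问', 'clientIp': '218.104.146.110', 'state': 2402} (the msg reads "You are operating too frequently, please try again later").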
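A scraper can check for this kind of reply before trying to read job data from it. The sketch below parses a sample body copied from the message above. The helper looks_blocked and the rule "status is False means the request was rejected" are assumptions based only on that one sample, not on any documented Lagou API.

```python
import json

# Sample body copied from the message shown above.
body = '{"status": false, "msg": "您操作太频繁,请稍后再访问", "clientIp": "218.104.146.110", "state": 2402}'

def looks_blocked(payload):
    """Hypothetical check: treat status == False as 'request was rejected'."""
    return payload.get('status') is False

payload = json.loads(body)
if looks_blocked(payload):
    print('blocked:', payload['msg'])
else:
    print('ok')
```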

Handling cookies with requests

With urllib, one opener can send several requests that share cookies. To share cookies across requests with requests, use a Session object.

import requests

url = "http://www.renren.com/PLogin.do"  
data={
'email':"YOUR_EMAIL",        # placeholder: your Renren account
'password':"YOUR_PASSWORD"   # placeholder: your Renren password
}
headers={'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"  }

session =requests.Session()
session.post(url,data=data,headers=headers)  # log in; the session keeps the cookies
res=session.get("http://www.renren.com/880151247/profile")  # sent with the same cookies
with open('dapeng.html','w',encoding="utf-8") as f:
    f.write(res.text)

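Handling untrusted SSL certificates

import requests

# verify=False skips certificate verification (it only matters for https URLs)
res=requests.get('http://12306.cn/mormhweb/',verify=False)
print(res.content.decode('utf-8'))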