1. What is cookie information?
Some websites use cookies to identify users; certain pages can only be accessed after logging in.
How login state is kept: the server performs session tracking, and the related user information is saved as cookies in the local browser.
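As a concrete illustration (the Set-Cookie header below is made up for the demo), the standard library's http.cookies module can parse the cookie string a server sends back after login:

```python
from http.cookies import SimpleCookie

# A hypothetical Set-Cookie header a server might send after login
cookie = SimpleCookie()
cookie.load('sessionid=abc123; Path=/; HttpOnly')

# The browser stores this name/value pair and sends it back
# on every later request to the same site
print(cookie['sessionid'].value)    # abc123
print(cookie['sessionid']['path'])  # /
```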
from collections.abc import Iterable
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor
from http import cookiejar
from urllib import request
# **************************1. Save cookie information to a variable**********************
# # Class hierarchy: CookieJar ---> FileCookieJar ---> MozillaCookieJar
# # 1). Declare a CookieJar object to hold the cookie information in a variable;
# cookie = cookiejar.CookieJar()
#
# # 2). Create a cookie handler with urllib.request's HTTPCookieProcessor;
# handler = HTTPCookieProcessor(cookie)
#
# # 3). Build an opener from the handler; (used like urlopen)
# opener = request.build_opener(handler)
#
# # 4). Open the url page;
# response = opener.open('http://www.baidu.com')
#
# # print(cookie)
# print(isinstance(cookie, Iterable))
# for item in cookie:
#     print("Name=" + item.name, end='\t\t')
#     print("Value=" + item.value)
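The commented-out steps above can be exercised without a network connection by constructing a Cookie object by hand (every field value below is made up for the demo) and adding it to the jar:

```python
from http import cookiejar

jar = cookiejar.CookieJar()

# Build a fake cookie by hand (all values are made up);
# normally opener.open() fills the jar from the server's Set-Cookie headers
fake = cookiejar.Cookie(
    version=0, name='BAIDUID', value='demo-value',
    port=None, port_specified=False,
    domain='.baidu.com', domain_specified=True, domain_initial_dot=True,
    path='/', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={})
jar.set_cookie(fake)

# A CookieJar is iterable, just like the loop in the comments above
for item in jar:
    print("Name=" + item.name, "Value=" + item.value)
```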
# **************************2. Save cookie information to a local file**********************
# # 1). Specify where the cookie file lives;
# cookieFileName = 'doc/cookie.txt'
#
# # 2). Declare a MozillaCookieJar object, which can save cookies to a file;
# cookie = cookiejar.MozillaCookieJar(filename=cookieFileName)
#
# # 3). Create a cookie handler with urllib.request's HTTPCookieProcessor;
# handler = HTTPCookieProcessor(cookie)
#
# # 4). Build an opener from the handler; (used like urlopen)
# opener = request.build_opener(handler)
#
# response = opener.open('http://www.baidu.com')
# print(response.read().decode('utf-8'))
# # Save the cookies to the local file;
# cookie.save(cookieFileName)
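The save step can also be tried offline. This sketch (cookie values and file path are made up) writes a Netscape-format cookies.txt like the one MozillaCookieJar produces above:

```python
import os
import tempfile
from http import cookiejar

jar = cookiejar.MozillaCookieJar()
# Positional Cookie args: version, name, value, port, port_specified,
# domain, domain_specified, domain_initial_dot, path, path_specified,
# secure, expires, discard, comment, comment_url, rest
jar.set_cookie(cookiejar.Cookie(
    0, 'sessionid', 'demo', None, False,
    'example.com', True, False, '/', True,
    False, None, False, None, None, {}))

path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
jar.save(path)
print(open(path).read())  # Netscape-format text containing the cookie
```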
# **********************************3. Load cookies from a file and use them for a request********************************
# 1). Specify where the cookie file lives;
cookieFileName = 'doc/cookie.txt'
# 2). Declare a MozillaCookieJar object;
cookie = cookiejar.MozillaCookieJar()
# *****Extra step: load the cookie information from the file
cookie.load(cookieFileName)
# 3). Create a cookie handler with urllib.request's HTTPCookieProcessor;
handler = HTTPCookieProcessor(cookie)
# 4). Build an opener from the handler; (used like urlopen)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
# **********************************4. Using cookies to simulate logging in to a website**********************************
# ******************* Simulate the login and save the cookie information;
cookieFileName = 'cookie01.txt'
cookie = cookiejar.MozillaCookieJar(filename=cookieFileName)
handler = HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
# This url is the login url of the academic-affairs site;
loginUrl = 'xxxxxxxxxxxxxx'
# Note: POST data must be bytes, so encode the urlencoded string;
postData = urlencode({
    'stuid': '1302100122',
    'pwd': 'xxxxxx'
}).encode('utf-8')
response = opener.open(loginUrl, data=postData)
cookie.save(cookieFileName)
# bs4
# ******************* Use the saved cookie information to fetch other pages, e.g. grades/course selection
gradeUrl = ''
response = opener.open(gradeUrl)
print(response.read())
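The key detail in the login step is that opener.open() requires the POST body as bytes, not str; a quick offline check (the form field names are placeholders, not a real login form):

```python
from urllib.parse import urlencode

# Placeholder form fields; a real login form defines the actual names
postData = urlencode({'stuid': '1302100122', 'pwd': 'xxxxxx'})
print(type(postData))  # <class 'str'> -- not yet usable as a POST body

body = postData.encode('utf-8')
print(type(body))      # <class 'bytes'> -- what opener.open() expects
print(body)            # b'stuid=1302100122&pwd=xxxxxx'
```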
Example
from http import cookiejar
from urllib import request
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor
cookieFileName = 'doc/chinaUnixCookie.txt'
cookie = cookiejar.MozillaCookieJar(filename=cookieFileName)
handler = HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
# This url is the chinaunix login url;
loginUrl = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=La2A2'
# Common pitfall: POST data should be bytes, an iterable of bytes, or a file object.
postData = urlencode({
    'username': 'LVah',
    'password': 'gf132590'
}).encode('utf-8')
print(type(postData))
response = opener.open(loginUrl, data=postData)
print(response.code)
with open('doc/chinaunix.html', 'wb') as f:
    f.write(response.read())
# cookie.save(cookieFileName)
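One gotcha with the commented-out cookie.save above: session cookies (those without an expiry, marked discard) are silently skipped unless ignore_discard=True is passed. A minimal offline demonstration (all values made up):

```python
import os
import tempfile
from http import cookiejar

jar = cookiejar.MozillaCookieJar()
# discard=True mimics a session cookie, the way servers usually send them
jar.set_cookie(cookiejar.Cookie(
    0, 'sid', 'xyz', None, False,
    'example.com', True, False, '/', True,
    False, None, True, None, None, {}))

p1 = os.path.join(tempfile.mkdtemp(), 'a.txt')
p2 = os.path.join(tempfile.mkdtemp(), 'b.txt')
jar.save(p1)                       # session cookie silently dropped
jar.save(p2, ignore_discard=True)  # session cookie kept
print('sid' in open(p1).read())    # False
print('sid' in open(p2).read())    # True
```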
Exceptions in the urllib module
In Python 3, the methods from urllib2 were folded into urllib.request;
https://docs.python.org/3/library/urllib.html
Common HTTP status code classes:
- 2xx: success
- 3xx: redirection
- 4xx: client error
- 5xx: server error
For example:
- 404: page not found
- 403: access forbidden
- 200: successful request
1. Informational
- 100 Continue
- 101 Switching Protocols
- 102 Processing
2. Success
- 200 OK
- 201 Created
- 202 Accepted
- 203 Non-Authoritative Information
- 204 No Content
- 205 Reset Content
- 206 Partial Content
- 207 Multi-Status
3. Redirection
- 300 Multiple Choices
- 301 Moved Permanently
- 302 Found (Moved Temporarily)
- 303 See Other
- 304 Not Modified
- 305 Use Proxy
- 306 Switch Proxy
- 307 Temporary Redirect
4. Client Error
- 400 Bad Request
- 401 Unauthorized
- 402 Payment Required
- 403 Forbidden
- 404 Not Found
- 405 Method Not Allowed
- 406 Not Acceptable
- 407 Proxy Authentication Required
- 408 Request Timeout
- 409 Conflict
- 410 Gone
- 411 Length Required
- 412 Precondition Failed
- 413 Request Entity Too Large
- 414 Request-URI Too Long
- 415 Unsupported Media Type
- 416 Requested Range Not Satisfiable
- 417 Expectation Failed
- 421 Too Many Connections
- 422 Unprocessable Entity
- 423 Locked
- 424 Failed Dependency
- 425 Unordered Collection
- 426 Upgrade Required
- 449 Retry With
- 451 Unavailable For Legal Reasons
5. Server Error
- 500 Internal Server Error
- 501 Not Implemented
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
- 505 HTTP Version Not Supported(http/1.1)
- 506 Variant Also Negotiates
- 507 Insufficient Storage
- 509 Bandwidth Limit Exceeded
- 510 Not Extended
- 600 Unparseable Response Headers
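The status-code table above is also available programmatically through the standard library's http.HTTPStatus enum:

```python
from http import HTTPStatus

# Each member compares equal to its numeric code and carries its phrase
print(HTTPStatus.NOT_FOUND == 404)     # True
print(HTTPStatus.NOT_FOUND.phrase)     # Not Found
print(HTTPStatus.FORBIDDEN.value, HTTPStatus.FORBIDDEN.phrase)

# Classify a code by its first digit, as in the 2xx/3xx/4xx/5xx table
code = 502
classes = {2: 'success', 3: 'redirect', 4: 'client error', 5: 'server error'}
print(classes[code // 100])            # server error
```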
from urllib import request
from urllib import error
try:
    url = 'http://www.baidu.com/hello.html'
    response = request.urlopen(url, timeout=0.01)
except error.HTTPError as e:
    print(e.code, e.headers, e.reason)
except error.URLError as e:
    print(e.reason)
else:
    content = response.read().decode('utf-8')
    print(content[:5])
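The except order above matters: HTTPError is a subclass of URLError, so catching URLError first would swallow the more specific error. This can be verified without touching the network (the url and status below are made up):

```python
from urllib import error

# HTTPError specializes URLError, so it must be caught first
print(issubclass(error.HTTPError, error.URLError))  # True

# An HTTPError carries the status code and reason of the failed response
e = error.HTTPError('http://example.com/x', 404, 'Not Found', hdrs=None, fp=None)
print(e.code, e.reason)                             # 404 Not Found
```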
The url parsing module
from urllib.parse import urlencode
from urllib.parse import urlparse
# data = urlencode({
# 'name': 'fentiao',
# 'password':'12345'
# })
# print(data)
# https://movie.douban.com/subject/4864908/comments?sort=time&status=P
# https://movie.douban.com/subject/4864908/comments?sort=new_score&status=P
# # **************** Encode query parameters into a url
# data = urlencode({
# 'sort': 'time',
# 'status': 'P'
# })
# doubanUrl = 'https://movie.douban.com/subject/4864908/comments?' + data
# print(doubanUrl)
# # **************** Parse a url
doubanUrl = 'https://movie.douban.com/subject/4864908/comments?sort=new_score&status=P'
info = urlparse(doubanUrl)
print(info)
print(info.scheme)
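Beyond .scheme, the result of urlparse is a named tuple, and parse_qs can break the query string down further:

```python
from urllib.parse import urlparse, parse_qs

doubanUrl = 'https://movie.douban.com/subject/4864908/comments?sort=new_score&status=P'
info = urlparse(doubanUrl)
print(info.netloc)           # movie.douban.com
print(info.path)             # /subject/4864908/comments
print(info.query)            # sort=new_score&status=P

# parse_qs turns the query string into a dict mapping names to value lists
print(parse_qs(info.query))  # {'sort': ['new_score'], 'status': ['P']}
```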