Tanzhou Classroom (潭州课堂) Class 25: Ph201805201 Crawler Basics, Lesson 3: fidder (class notes)

https://www.cnblogs.com/zhaof/p/6910871.html

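urllib is Python's built-in HTTP request library.
It consists of the following modules:
urllib.request: the request module
urllib.error: the exception handling module
urllib.parse: the URL parsing module
urllib.robotparser: the robots.txt parsing module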
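
As a quick orientation, the sketch below touches urllib.parse, urllib.robotparser and urllib.error without sending any request. The example.com URLs, the robots.txt rules and the agent name MyCrawler are made-up values for illustration, not part of the original notes.

```python
from urllib import error, parse, robotparser

# urllib.parse: split a URL into its components
parts = parse.urlparse('http://httpbin.org/get?word=hello')
print(parts.netloc)                  # host part of the URL
print(parts.path)                    # path part of the URL
print(parse.parse_qs(parts.query))   # query string as a dict of lists

# urllib.robotparser: evaluate robots.txt rules (given here as a list of lines)
rules = robotparser.RobotFileParser()
rules.parse(['User-agent: *', 'Disallow: /private/'])
blocked = rules.can_fetch('MyCrawler', 'http://example.com/private/page')
allowed = rules.can_fetch('MyCrawler', 'http://example.com/index.html')
print(blocked, allowed)

# urllib.error: HTTPError is a subclass of URLError
print(issubclass(error.HTTPError, error.URLError))
```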

As a rule of thumb: when urlopen is called with only a url and no data, the request is a GET; when the data argument is supplied, the request is a POST.
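
This rule can be checked without touching the network. The sketch below builds two Request objects and asks each one which HTTP method it would use; the byte string passed as data is an arbitrary example value.

```python
import urllib.request

# No data: urllib will issue a GET
get_req = urllib.request.Request('http://httpbin.org/get')
print(get_req.get_method())

# With data (which must be bytes): urllib will issue a POST
post_req = urllib.request.Request('http://httpbin.org/post', data=b'word=hello')
print(post_req.get_method())
```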

GET request

import urllib.parse
import urllib.request


'''
Parameters of urllib.request.urlopen:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
'''
# GET request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

POST request

import urllib.parse
import urllib.request

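'''
urlopen has three commonly used parameters:
urllib.request.urlopen(url, data, timeout)
response.read() returns the content of the page; without read() you only get
the response object itself (an http.client.HTTPResponse), not its content
'''
'''
http://httpbin.org/post is used here to
simulate various request operations with urllib
'''
# Use urllib.parse: bytes(urllib.parse.urlencode(...)) converts the POST data to bytes,
# which is then passed as the data argument of urllib.request.urlopen.
# That completes a POST request.
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
print(data)
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())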
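
To see what the data argument actually contains, the encoding step can be run on its own. This is a minimal sketch; the second form field 'q' is a hypothetical value added to show how spaces and '&' are escaped.

```python
import urllib.parse

form = {'word': 'hello', 'q': 'a b&c'}   # 'q' is a made-up field for illustration

encoded = urllib.parse.urlencode(form)   # str, percent-encoded
data = encoded.encode('utf-8')           # bytes, same result as bytes(encoded, encoding='utf8')

print(encoded)
print(data)
print(urllib.parse.parse_qs(encoded))    # decoding recovers the original values
```

urlopen rejects a str passed as data, which is why the conversion to bytes is needed.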

Using the timeout parameter

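import socket
import urllib.error
import urllib.request


'''
Using the timeout parameter
When the network is poor or the server misbehaves, a request can be slow or fail,
so we give the request a timeout instead of letting the program wait indefinitely.
'''
# Set a timeout: if there is no response within that time, an exception is raised.
# Left unguarded, this call stops the script when it times out.
response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read())
# To catch the exception, wrap the same call in exception handling
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')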
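
The timeout behaviour can be reproduced without depending on httpbin.org. The sketch below is one possible setup, not part of the original notes: SlowHandler, fetch_with_timeout and run_demo are hypothetical names, and the server is a throwaway local one that answers late on purpose. It also catches socket.timeout directly, because a timeout that occurs while waiting for the response is not wrapped in URLError.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that answers only after a delay."""

    def do_GET(self):
        time.sleep(1.0)  # longer than the client timeout used below
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'late reply')
        except OSError:
            pass  # the client has already given up

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_with_timeout(url, timeout):
    """Return the body, or the string 'TIME OUT' if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # timeout while connecting or sending: wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            return 'TIME OUT'
        raise
    except socket.timeout:
        # timeout while waiting for the response: raised directly
        return 'TIME OUT'


def run_demo():
    server = HTTPServer(('127.0.0.1', 0), SlowHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = 'http://127.0.0.1:%d/' % server.server_address[1]
    try:
        return fetch_with_timeout(url, timeout=0.2)
    finally:
        server.shutdown()
        server.server_close()


print(run_demo())
```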

Response type, status code, response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))
'''
response.status, response.getheaders() and response.getheader("server") give the status code and the header information;
response.read() returns the content of the response body
'''
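
The attributes named above can be tried against a local stand-in for a website. In this sketch DemoHandler, inspect_response, run_demo and the 'DemoServer/0.1' version string are all hypothetical names chosen for the example; a real site will of course return different headers.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler standing in for a real website."""
    server_version = 'DemoServer/0.1'  # becomes the start of the Server header

    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def inspect_response(url):
    """Collect the pieces of a response that the notes mention."""
    with urllib.request.urlopen(url) as response:
        return {
            'status': response.status,
            'headers': response.getheaders(),        # list of (name, value) tuples
            'server': response.getheader('Server'),  # header lookup is case-insensitive
            'body': response.read(),
        }


def run_demo():
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return inspect_response('http://127.0.0.1:%d/' % server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


info = run_demo()
print(info['status'])
print(info['server'])
print(info['body'])
```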

Setting headers

To keep crawlers from overloading them, many websites only serve requests that carry certain headers; the most common one is User-Agent.

from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# named form_data rather than dict, so the built-in dict type is not shadowed
form_data = {
    'name': 'zhaofan'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

'''
Another way to set headers
'''
# An advantage of this approach is that you can define your own dictionary of request headers and add them in a loop
from urllib import request, parse

url = 'http://httpbin.org/post'
# named form_data rather than dict, so the built-in dict type is not shadowed
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
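
Adding headers from a dictionary in a loop could look like the sketch below. The custom_headers dictionary, including the Accept-Language entry, is a made-up example; the request is only built and inspected, not sent.

```python
from urllib import request

# Hypothetical header set; any dictionary of header names and values works.
custom_headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Accept-Language': 'en-US',
}

req = request.Request('http://httpbin.org/post', data=b'name=Germey', method='POST')
for name, value in custom_headers.items():
    req.add_header(name, value)

# Request stores header names via str.capitalize(), so 'User-Agent' is kept as 'User-agent'.
print(req.header_items())
print(req.get_header('User-agent'))
```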

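Proxies: ProxyHandler

'''
Proxies: ProxyHandler
A proxy is set with urllib.request.ProxyHandler(). A website may count how many requests
an IP address makes within a period of time and block that address if the count is too high;
in that case the data has to be crawled through a proxy.
'''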
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
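
A slightly more defensive variant is sketched below. make_proxy_opener and fetch_via_proxy are hypothetical helper names; the proxy address is the one from the notes and is assumed to point at a proxy you actually run. The block only builds the opener and inspects it, so nothing is sent unless fetch_via_proxy is called.

```python
import urllib.request


def make_proxy_opener(proxies):
    """Build an opener that routes requests through the given proxies."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def fetch_via_proxy(opener, url, timeout=5):
    """Return the response body, or None if the proxy or the site is unreachable."""
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except OSError as e:  # URLError and socket timeouts are both OSError subclasses
        print('request failed:', e)
        return None


# Address taken from the notes; assumes a proxy is listening there.
proxies = {'http': 'http://127.0.0.1:9743'}
opener = make_proxy_opener(proxies)

# The opener carries our ProxyHandler in place of the default one.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(len(proxy_handlers), proxy_handlers[0].proxies)
```

Calling fetch_via_proxy(opener, 'http://httpbin.org/get') then behaves like the code above, except that an unreachable proxy yields None instead of a traceback.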

Cookies: HTTPCookieProcessor

'''
Cookies: HTTPCookieProcessor
Cookies commonly hold login information.
Some websites can only be crawled when the request carries cookies. http.cookiejar is used here
to obtain and store cookies.
'''
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)
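
To watch the cookie jar at work without relying on a real site, the sketch below uses a throwaway local server. CookieHandler, fetch, run_demo and the cookie session=abc123 are hypothetical names and values for the example. The first request carries no cookie; the second one replays the cookie that the jar captured.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler: sets a cookie and echoes the Cookie header it received."""

    def do_GET(self):
        body = (self.headers.get('Cookie') or 'no cookie sent').encode('utf-8')
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123; Path=/')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(opener, url):
    with opener.open(url) as response:
        return response.read().decode('utf-8')


def run_demo():
    server = HTTPServer(('127.0.0.1', 0), CookieHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = 'http://127.0.0.1:%d/' % server.server_address[1]

    cookie = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
    try:
        first = fetch(opener, url)    # the jar is still empty, nothing is sent
        second = fetch(opener, url)   # the jar replays the stored cookie
    finally:
        server.shutdown()
        server.server_close()
    stored = {item.name: item.value for item in cookie}
    return first, second, stored


first, second, stored = run_demo()
print(first)
print(second)
print(stored)
```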

Cookies can be written to a file and saved, using one of two formats:
http.cookiejar.MozillaCookieJar() and http.cookiejar.LWPCookieJar().

The http.cookiejar.MozillaCookieJar() approach

import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

The http.cookiejar.LWPCookieJar() approach

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Loading cookies from a file

import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
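
One detail worth checking is that a cookie file must be loaded with the same jar class that saved it. The sketch below is an offline round trip under stated assumptions: make_cookie, round_trip and demo are hypothetical helpers, and the cookie is built by hand for example.com instead of being received from a server.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain='example.com'):
    """Build a session cookie by hand (normally the server would set it)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def round_trip(jar_class, path):
    """Save one cookie with jar_class, load it back, and return name -> value."""
    jar = jar_class(path)
    jar.set_cookie(make_cookie('session', 'abc123'))
    jar.save(ignore_discard=True, ignore_expires=True)

    loaded = jar_class()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return {c.name: c.value for c in loaded}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        moz_path = os.path.join(tmp, 'cookie_mozilla.txt')
        lwp_path = os.path.join(tmp, 'cookie_lwp.txt')
        moz = round_trip(http.cookiejar.MozillaCookieJar, moz_path)
        lwp = round_trip(http.cookiejar.LWPCookieJar, lwp_path)
        # Loading an LWP file with the Mozilla class fails with LoadError.
        try:
            http.cookiejar.MozillaCookieJar().load(lwp_path)
            mismatch_detected = False
        except http.cookiejar.LoadError:
            mismatch_detected = True
    return moz, lwp, mismatch_detected


moz, lwp, mismatch_detected = demo()
print(moz)
print(lwp)
print(mismatch_detected)
```

Since both save examples in these notes write to cookie.txt, the file on disk holds whichever format was saved last, and the loading code has to use the matching class.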
