Python 3 Web Crawling (Part 1): The urllib Request Library

urllib is Python 3's built-in library for working with URLs. In Python 2 the same functionality was split between urllib and urllib2.

Fetching a web page

urllib.request.urlopen(url, data, timeout)

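url: the request address, in the form http://host[:port][path]
data: the data to upload
- conversion: urllib.parse.urlencode(dict_name).encode('utf8')
timeout: timeout in seconds (a slow network, a server-side failure, or a slow or faulty request could otherwise leave the program waiting indefinitely)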
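
The data conversion above can be sketched as follows. The form fields are made-up values for illustration. Note that when data is passed to urlopen, the request is sent as a POST.

```python
import urllib.parse

# Hypothetical form fields, for illustration only.
form = {'name': 'alice', 'city': 'New York'}

# urlencode() returns a str; urlopen() expects bytes for `data`,
# hence the extra .encode('utf8') step.
encoded = urllib.parse.urlencode(form)
payload = encoded.encode('utf8')

print(encoded)   # name=alice&city=New+York
print(payload)   # b'name=alice&city=New+York'
```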

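Steps:

After importing the module, call urllib.request.urlopen('…') to open and fetch a page
It returns a file-like object that supports:
- read(), readline(), readlines(), fileno(), close(), etc.: the same operations as on a file object
- info(): returns an http.client.HTTPMessage object holding the headers sent back by the remote server
- getcode(): returns the HTTP status code
- geturl(): returns the URL that was actually retrieved (after any redirects)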

import urllib.request    # import the module
res = urllib.request.urlopen('http://www.baidu.com')
data = res.read()        # read the response body (bytes)
code = res.getcode()     # 200
url = res.geturl()       # 'http://www.baidu.com'
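
One detail worth seeing in action is that read() returns bytes, which usually need decoding before use. The sketch below is only an illustration: it reads a data: URL (an assumption made so the example runs without network access) instead of a real page, and the same calls apply to an http:// address.

```python
import urllib.request

# A data: URL stands in for a real page so the sketch runs offline.
url = 'data:text/plain;charset=utf-8,hello%20urllib'

# The response object is a context manager, so it is closed automatically.
with urllib.request.urlopen(url, timeout=5) as res:
    body = res.read().decode('utf8')        # bytes -> str
    content_type = res.info()['Content-Type']
    final_url = res.geturl()

print(body)           # hello urllib
print(content_type)   # text/plain;charset=utf-8
print(final_url)
```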

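Simulating a browser request: header information

Requests made the previous way are easily identified as coming from a crawler, so when crawling pages protected by anti-crawler measures, set some header fields to make the request look like it comes from a browser.
Method: set the header information (User-Agent)
How to find your User-Agent: open any web page -> open the developer tools -> Network tab -> click any link on the page -> click any request -> Headers tab -> find User-Agent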

My User-Agent:

Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER

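Steps:

Set the URL to crawl
Call urllib.request.Request to create a request object
- Parameter 1: url
- Other parameters:
  - data: the data to send; defaults to None (no data)
  - headers: the request headers as a dict; none are passed by default
Open the request object with urlopen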

import urllib.request

url = 'http://www.baidu.com'
header = {
	'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER'
}
req = urllib.request.Request(url, headers=header)    # create a Request object
res = urllib.request.urlopen(req)                    # returns the fetched page
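
Before sending, it can help to check what headers a Request object actually carries. This is a small sketch with a placeholder User-Agent string (not a real browser value). Note that Request stores header names capitalized, so the lookup key is 'User-agent'.

```python
import urllib.request

# Placeholder User-Agent; substitute the one copied from your browser.
header = {'User-Agent': 'Mozilla/5.0 (example)'}
req = urllib.request.Request('http://www.baidu.com', headers=header)

# Header names are normalized with str.capitalize(): 'User-Agent' -> 'User-agent'.
print(req.get_header('User-agent'))   # Mozilla/5.0 (example)
print(req.header_items())

# Headers can also be added after the object has been created.
req.add_header('Referer', 'http://www.baidu.com')
print(req.has_header('Referer'))      # True
```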

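Using a proxy server

Reason: crawling pages on one site from the same IP for a long time can get that IP blocked by the site's server
Solution: use a proxy server, so that the site sees the proxy's IP address instead of our real one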

import urllib.request

def use_proxy(proxy_address, url):
    # Register the proxy server's address for http requests
    proxy = urllib.request.ProxyHandler({'http': proxy_address})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)    # install the opener as the global default

    data = urllib.request.urlopen(url)

    # Alternative that does not install the opener globally:
    # proxy = urllib.request.ProxyHandler({'http': proxy_address})
    # opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # data = opener.open(url)
    return data

proxy_address = '61.163.39.70:9999'
data = use_proxy(proxy_address, 'http://www.baidu.com')
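
As a variation, the non-global approach can be wrapped in a helper that returns the opener, which leaves the behaviour of plain urlopen untouched. This is a sketch only: build_proxy_opener is a hypothetical helper name, and registering the same address for https is an assumption about what the proxy supports. The dict passed to ProxyHandler only affects the schemes it lists.

```python
import urllib.request

def build_proxy_opener(proxy_address):
    """Return an opener that routes http and https requests through the proxy."""
    proxy = urllib.request.ProxyHandler({
        'http': proxy_address,
        'https': proxy_address,   # assumption: the proxy also handles https
    })
    return urllib.request.build_opener(proxy)

# Address reused from the example above; it is unlikely to still be reachable.
opener = build_proxy_opener('61.163.39.70:9999')

# Nothing is sent until opener.open(url, timeout=...) is called.
handlers = [type(h).__name__ for h in opener.handlers]
print('ProxyHandler' in handlers)   # True
```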

Using cookies

Reason: the page depends on login information

import urllib.request
import urllib.parse
import http.cookiejar

url = 'http://xxxxxxxxx.com'
data = {
	'username': '123456',
	'password': '123456'
}
postdata = urllib.parse.urlencode(data).encode('utf8')
header = {'User-Agent': 'xxxxxxxxxx'}

req = urllib.request.Request(url, postdata, headers=header)
cookie = http.cookiejar.CookieJar()                       # create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)      # create the cookie handler
opener = urllib.request.build_opener(handler)             # build the opener
res = opener.open(req)
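
The point of keeping the CookieJar is that the same opener sends the stored cookies back on later requests. A minimal sketch of how the jar might be inspected follows. build_cookie_opener and cookie_pairs are hypothetical helper names, and no request is made here, so the jar starts out empty.

```python
import http.cookiejar
import urllib.request

def build_cookie_opener():
    """Return an opener together with the CookieJar it fills."""
    jar = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(jar)
    return urllib.request.build_opener(handler), jar

def cookie_pairs(jar):
    """List the (name, value) pairs currently stored in the jar."""
    return [(c.name, c.value) for c in jar]

opener, jar = build_cookie_opener()

# After opener.open(login_request), the jar holds the cookies set by the
# server, and further opener.open(...) calls send them back automatically.
print(cookie_pairs(jar))   # [] because no request has been made yet
```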

GET request example

A GET request passes its data in the URL
Format: url?key1=value1&key2=value2…

import urllib.parse
import urllib.request

# http://www.xxx.com?key1=value1&key2=value2
url = 'http://www.xxx.com?'
data = {
	'key1': 'value1',
	'key2': 'value2'
}
params = urllib.parse.urlencode(data)    # 'key1=value1&key2=value2', already encoded
header = {'User-Agent': 'xxxxxx'}
req = urllib.request.Request(url+params, headers=header)
res = urllib.request.urlopen(req)
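
To see what the encoding step does with spaces and non-ASCII values, the request can be inspected without sending it. This is a sketch; the /search path and the parameter values are assumptions chosen for illustration.

```python
import urllib.parse
import urllib.request

base = 'http://www.xxx.com/search'   # hypothetical endpoint
params = urllib.parse.urlencode({'key1': 'value 1', 'key2': '爬虫'})
req = urllib.request.Request(base + '?' + params)

print(req.full_url)       # http://www.xxx.com/search?key1=value+1&key2=%E7%88%AC%E8%99%AB
print(req.get_method())   # GET, because no data was passed

# parse_qs reverses the encoding; each key maps to a list of values.
query = urllib.parse.urlsplit(req.full_url).query
decoded = urllib.parse.parse_qs(query)
print(decoded)            # {'key1': ['value 1'], 'key2': ['爬虫']}
```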

POST request example

A POST request carries its data in the request body, typically as form fields

import urllib.request
import urllib.parse

url = 'http://www.xxx.com?'
header = {'User-Agent': 'xxxxxx'}
data = {
	'name': '123456',
	'password': '123456'
}
postdata = urllib.parse.urlencode(data).encode('utf8')
req = urllib.request.Request(url, postdata, headers=header)    # pass both the data and the headers
res = urllib.request.urlopen(req)
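
What turns a request into a POST is simply the presence of data. The sketch below checks this on Request objects without sending anything; the URL and credentials are the same placeholders used above.

```python
import urllib.parse
import urllib.request

postdata = urllib.parse.urlencode({'name': '123456', 'password': '123456'}).encode('utf8')

post_req = urllib.request.Request('http://www.xxx.com', data=postdata,
                                  headers={'User-Agent': 'xxxxxx'})
get_req = urllib.request.Request('http://www.xxx.com')

print(post_req.get_method())   # POST
print(post_req.data)           # b'name=123456&password=123456'
print(get_req.get_method())    # GET
```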

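Module method reference

urllib.request.urlopen(): opens a page and returns a file-like object

urlopen(url, data, timeout)
- url: the address
- data: the data to upload; defaults to None (no data)
- timeout: timeout in seconds
urlopen(request object)
- takes a Request object as its argument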
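
In practice urlopen is usually wrapped with a timeout and error handling. The following is a minimal sketch, not a definitive pattern: fetch is a hypothetical helper name, and the data: and foo: URLs in the usage lines are assumptions chosen only so the example runs without network access.

```python
import socket
import urllib.error
import urllib.request

def fetch(url, timeout=5):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as res:
            return res.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error:', e.code)
    except urllib.error.URLError as e:
        # The request could not be completed (bad scheme, DNS failure, refused...).
        print('URL error:', e.reason)
    except socket.timeout:
        # A timeout while reading the body is raised directly.
        print('timed out')
    return None

ok = fetch('data:text/plain,hello')   # served locally by urllib itself
bad = fetch('foo://example')          # unknown scheme, raises URLError internally
print(ok, bad)                        # b'hello' None
```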

urllib.request.Request: creates a request object

Request(url, data, headers=xxx)
- url: the address
- data: the data to upload; defaults to None (no data)
- headers: the header information, as a dict

urllib.request.ProxyHandler: sets up a proxy server

Argument: a dict, {'scheme': 'proxy ip:port'}

urllib.request.build_opener: creates an opener and returns the opener object

Use opener.open(url) directly to open an address

urllib.request.install_opener: installs an opener as the global default

Once installed, urllib.request.urlopen(url) opens addresses through that opener

urllib.parse.urlencode: encodes a dict as a query string

Argument: a dict
Output: key1=value1&key2=value2

urllib.parse.quote: percent-encodes a string
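
A short comparison of the two encoders, using made-up sample values. By default urlencode encodes spaces as '+', while quote uses '%20' and leaves '/' untouched.

```python
import urllib.parse

query = urllib.parse.urlencode({'q': 'a b', 'page': 1})
path = urllib.parse.quote('a b/c')
word = urllib.parse.quote('爬虫')
back = urllib.parse.unquote(word)

print(query)   # q=a+b&page=1
print(path)    # a%20b/c
print(word)    # %E7%88%AC%E8%99%AB
print(back)    # 爬虫
```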