1. The urllib library

The urllib library contains four modules: request, error, parse, and robotparser (which parses a site's robots.txt file to decide whether a URL may be crawled).
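As a rough illustration of what robotparser does, the sketch below parses a robots.txt body held in a string instead of downloading one. The rules and the example.com URLs are made up for this example.

```python
import urllib.robotparser

# Hypothetical robots.txt content, parsed locally instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules allow crawling the URL.
allowed = parser.can_fetch('*', 'http://example.com/index.html')
blocked = parser.can_fetch('*', 'http://example.com/private/data.html')
print(allowed, blocked)
```

For a real site you would normally call set_url() with the robots.txt address and then read() before asking can_fetch().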

1.2 Sending requests with urlopen

This uses the request module of urllib.

import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))  # read and decode the response body

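urlopen simulates the way a browser issues a request. The request module behind it also provides handling for authentication, redirects, browser cookies, and more.

urlopen returns an HTTPResponse object (http.client.HTTPResponse), not a request object.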

1.2.1 Methods and attributes of HTTPResponse

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))  # use type() to print the type of the response

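The output shows that the response is an http.client.HTTPResponse object. It mainly provides the methods read(), readinto(), getheader(name), getheaders(), and fileno(), and the attributes msg, version, status, reason, debuglevel, and closed.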

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)  # status code returned by the server, e.g. 200 (response.reason holds the text 'OK')
print('--------------------------')
print(response.getheaders())  # all response headers
print('--------------------------')
print(response.getheader('Content-Type'))  # one specific response header
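To try these attributes without depending on an external site, here is a minimal sketch that starts a throwaway local server with the standard http.server module and inspects the response from it. HelloHandler and inspect_response are names invented for this example, not part of urllib.

```python
import http.server
import threading
import urllib.request

class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that always answers with a small text body."""

    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet

def inspect_response(url):
    """Collect a few attributes of the object returned by urlopen."""
    with urllib.request.urlopen(url) as response:
        return {
            'type': type(response).__name__,
            'status': response.status,
            'reason': response.reason,
            'content_type': response.getheader('Content-Type'),
            'body': response.read().decode('utf-8'),
        }

# Port 0 lets the operating system pick a free port.
server = http.server.HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
info = inspect_response('http://127.0.0.1:%d/' % server.server_port)
server.shutdown()
server.server_close()
print(info)
```

Note that status is the numeric code, while reason is the accompanying text.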

1.2.2 urlopen in detail

The data parameter:

data is optional. When it is supplied, urlopen sends the request with POST instead of GET. The value must be bytes.

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'name': 'huangwei'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data)
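The switch from GET to POST can be checked without sending anything. The sketch below builds two Request objects, which is what urlopen does internally when given a URL string, and asks each one which method it would use. The variable names are chosen for this example only.

```python
import urllib.parse
import urllib.request

form = {'name': 'huangwei'}
# urlencode produces a str; the data argument needs bytes.
data = urllib.parse.urlencode(form).encode('utf-8')

get_request = urllib.request.Request('http://httpbin.org/get')
post_request = urllib.request.Request('http://httpbin.org/post', data=data)

print(data)                       # the encoded form body
print(get_request.get_method())   # method used when no data is given
print(post_request.get_method())  # method used when data is given
```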

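The timeout parameter:

timeout sets how long to wait for the request, in seconds. If no response arrives within that time, an exception is raised.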

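import socket
import urllib.error
import urllib.request

try:
    # data is the bytes object built in the data example; timeout is in seconds
    response = urllib.request.urlopen('http://httpbin.org/post', data, timeout=1)
    print(response.read().decode('utf-8'))
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("Connection timed out")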
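One detail worth hedging against: a timeout during connection setup arrives wrapped in URLError, but a timeout while waiting for the response can be raised as a bare socket.timeout. The helper below, fetch_or_none (a name made up for this sketch), treats both cases the same way.

```python
import socket
import urllib.error
import urllib.request

def fetch_or_none(url, timeout):
    """Return the decoded body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        # Timeouts while sending the request are wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL error is not a timeout, so pass it on
    except socket.timeout:
        # Timeouts while waiting for the response can surface unwrapped.
        return None
```

A call such as fetch_or_none('http://httpbin.org/get', timeout=1) then returns either the page text or None, and the caller can decide whether to retry.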

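Other parameters:

context: must be an ssl.SSLContext instance; it specifies the SSL settings.

cafile and capath: specify the CA certificate file and the directory containing CA certificates, respectively. Both have been deprecated since Python 3.6 in favor of context.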
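A minimal sketch of the context parameter, assuming the system's default CA certificates are what you want: ssl.create_default_context() builds a context that verifies certificates and host names, and it is then passed to urlopen. The open_verified helper is a name invented for this example.

```python
import ssl
import urllib.request

# Default context: certificate verification and host name checking enabled.
context = ssl.create_default_context()

def open_verified(url, timeout=10):
    """Open an HTTPS URL using the verifying context above."""
    return urllib.request.urlopen(url, timeout=timeout, context=context)
```

To trust a specific CA bundle instead, create_default_context also accepts cafile and capath arguments.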

https://docs.python.org/3/library/urllib.request.html

2. Request

A Request object lets you build a more complete request, in particular by adding request headers.

import urllib.request

# Build a Request object: the request becomes a standalone object,
# and its headers and other details can be filled in more completely.
request = urllib.request.Request('http://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

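2.1 Request in detail

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: the URL to request. This is the only required argument; the rest are optional.

data: must be bytes. A dict has to be encoded with urlencode() from urllib.parse and then converted to bytes.

headers: a dict of request headers.

origin_req_host: the host name or IP address of the requester.

unverifiable: whether the request is unverifiable, that is, the user had no chance to approve it. Defaults to False.

method: the HTTP method to use, such as GET, POST, or PUT.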

import urllib.request
import urllib.parse

# The URL matches the Host header below and accepts POST requests.
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form_data = {  # renamed from dict to avoid shadowing the built-in
    'name': 'Germery'
}
data = bytes(urllib.parse.urlencode(form_data), encoding='utf-8')
# Build the Request with a body, custom headers, and an explicit method.
request = urllib.request.Request(url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

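Headers can also be set after construction with request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'). Note that the name and value are two separate arguments.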
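The add_header call can be tried offline by inspecting the Request object before it is sent. One thing the sketch below shows is that Request stores header names in capitalized form, so 'User-Agent' is looked up as 'User-agent'.

```python
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({'name': 'Germery'}).encode('utf-8')
request = urllib.request.Request('http://httpbin.org/post', data=data, method='POST')
request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

print(request.get_method())
# Header names are stored via str.capitalize(), hence 'User-agent'.
print(request.has_header('User-agent'))
print(request.get_header('User-agent'))
```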
