Python: Web Crawlers (1)

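Installing the requests library
1. Open the Run dialog, type CMD, then enter pip install requests and press Enter to install it.
2. Type python in the terminal to start Python's interactive interpreter.
3. The commands below fetch the content of the Baidu home page.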

C:\Users\ftsdata-02>python  # type python to start the interactive interpreter
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
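>>> import requests  # import the requests library
>>> r = requests.get("http://www.baidu.com")  # fetch the Baidu home page with the get method
>>> r  # inspect the returned object; 200 means the request succeeded
<Response [200]>
>>> r.status_code  # status_code also reports whether the request succeeded; 200 means success
200
>>> r.encoding='utf-8'  # decode the response body as UTF-8
>>> r.text  # show the content of the fetched page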
'<!DOCTYPE html>\r\n<html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class="fm"> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class="s_ipt" value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class="mnav">新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class="mnav">hao123</a> <a href=http://map.baidu.com name=tj_trmap class="mnav">地图</a> <a href=http://v.baidu.com name=tj_trvideo class="mnav">视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class="mnav">贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class="lb">登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class="bri" style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a>  <a href=http://jianyi.baidu.com/ class="cp-feedback">意见反馈</a> 京ICP证030173号  <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

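Summary:
The 7 main methods of the requests library:
requests.request() constructs a request; it is the base method that underpins the methods below

requests.request(method, url, **kwargs)
method: the request method, one of seven such as GET/PUT/POST
url: the URL of the page to fetch
**kwargs: 13 parameters that control the request.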
requests.request('GET', url, **kwargs)
requests.request('HEAD', url, **kwargs)
requests.request('POST', url, **kwargs)
requests.request('PUT', url, **kwargs)
requests.request('PATCH', url, **kwargs)
requests.request('DELETE', url, **kwargs)
requests.request('OPTIONS', url, **kwargs)

**kwargs: parameters that control the request; all are optional
params: a dictionary or byte sequence, appended to the URL as query parameters
>>> kv={'key1':'value1','key2':'value2'}
>>> r=requests.request('GET','http://python123.io/ws',params=kv)
>>> print(r.url)
http://python123.io/ws?key1=value1&key2=value2
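
To see how params ends up in the URL without touching the network, the request can be prepared locally. This is a minimal sketch that reuses the URL and dictionary from the example above; requests.Request(...).prepare() only builds the request and does not send it.

```python
import requests

kv = {'key1': 'value1', 'key2': 'value2'}

# Build the request locally; nothing is sent over the network.
prepared = requests.Request('GET', 'http://python123.io/ws', params=kv).prepare()

print(prepared.method)  # GET
print(prepared.url)     # http://python123.io/ws?key1=value1&key2=value2
```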

data: a dictionary, byte sequence, or file object, sent as the body of the Request
>>> kv={'key1':'value1','key2':'value2'}
>>> r=requests.request('POST','http://python123.io/ws',data=kv)
>>> body='body content'
>>> r=requests.request('POST','http://python123.io/ws',data=body)

json: data in JSON format, sent as the body of the Request
>>> kv={'key1':'value1'}
>>> r=requests.request('POST','http://python123.io/ws',json=kv)
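
The difference between data and json shows up in the request body and the Content-Type header. The sketch below prepares both requests locally (nothing is sent); the variable names as_form and as_json are only illustrative.

```python
import json
import requests

kv = {'key1': 'value1'}
url = 'http://python123.io/ws'

# Prepare both requests locally; neither is sent.
as_form = requests.Request('POST', url, data=kv).prepare()
as_json = requests.Request('POST', url, json=kv).prepare()

print(as_form.headers['Content-Type'])  # application/x-www-form-urlencoded
print(as_form.body)                     # key1=value1
print(as_json.headers['Content-Type'])  # application/json
print(json.loads(as_json.body))         # {'key1': 'value1'}
```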

headers: a dictionary of custom HTTP headers
>>> hd={'user-agent':'Chrome/10'}
>>> r=requests.request('POST','http://python123.io/ws',headers=hd)
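
A quick way to confirm that a custom header is attached is to inspect the prepared request. This is a sketch under the same assumptions as the example above (same URL and header value); the request is built locally and never sent.

```python
import requests

hd = {'user-agent': 'Chrome/10'}
prepared = requests.Request('GET', 'http://python123.io/ws', headers=hd).prepare()

# Header names are matched case-insensitively.
print(prepared.headers['User-Agent'])  # Chrome/10
```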

cookies: a dictionary or CookieJar; the cookies sent with the Request
auth: a tuple; enables HTTP authentication
files: a dictionary; used to upload files
>>> fs={'file':open('data.xls','rb')}
>>> r=requests.request('POST','http://python123.io/ws', files=fs)

timeout: the timeout, in seconds
>>> r=requests.request('GET','http://www.baidu.com',timeout=10)

proxies: a dictionary of proxy servers to route requests through; login credentials can be included
>>> pxs={'http':'http://user:pass@10.10.10.1:1234', 'https':'https://10.10.10.1:4321' }
>>> r=requests.request('GET','http://www.baidu.com',proxies=pxs)

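allow_redirects: True/False, default True; toggles redirect following
stream: True/False, default False; when False, the response body is downloaded immediately
verify: True/False, default True; toggles SSL certificate verification
cert: path to a local SSL certificate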

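requests.get() the main method for fetching an HTML page; corresponds to HTTP GET
requests.head() fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post() submits a POST request to an HTML page; corresponds to HTTP POST
requests.put() submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch() submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete() submits a delete request to an HTML page; corresponds to HTTP DELETE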

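r = requests.get(url) constructs a Request object that asks the server for a resource, and returns a Response object containing the server's resource.
requests.get(url, params = None, **kwargs)
url: the URL of the page to fetch
params: extra parameters for the URL, as a dictionary or byte stream; optional
**kwargs: 12 parameters that control the request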

The two key objects of the Requests library
The Request and Response objects

The Response object contains everything the crawler gets back
>>> import requests  # import the requests library
>>> r = requests.get("http://www.baidu.com")
>>> print(r.status_code)
200
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 05 Jun 2018 11:48:31 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
>>>

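Attributes of the Response object:
r.status_code the HTTP status of the response; 200 means success, 404 means the resource was not found
r.text the response body as a string, i.e. the content of the page at the URL
r.encoding the encoding of the response body, guessed from the HTTP headers
r.apparent_encoding the encoding of the response body, inferred from the content itself (fallback encoding)
r.content the response body in binary form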

>>> r.apparent_encoding
'utf-8'
>>> r.encoding='utf-8'
>>> r.text

Understanding Response encodings
r.encoding: the encoding guessed from the HTTP headers; if the header has no charset, ISO-8859-1 is assumed
r.apparent_encoding: the encoding inferred by analyzing the page content (fallback encoding)
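
Why the choice of encoding matters can be shown with the standard library alone. In this sketch the bytes are a stand-in for a response body (they do not come from a real request): the same bytes decoded as ISO-8859-1 produce garbled text, while decoding them as UTF-8 recovers the original.

```python
page = '百度一下，你就知道'
raw = page.encode('utf-8')  # stand-in for the bytes received from the server

wrong = raw.decode('iso-8859-1')  # the fallback used when no charset is declared
right = raw.decode('utf-8')       # the encoding the content actually uses

print(wrong == page)  # False
print(right == page)  # True
```

This is the reason for assigning r.encoding = r.apparent_encoding before reading r.text when a page declares no charset.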

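Understanding Requests library exceptions
requests.ConnectionError network connection error, such as a failed DNS lookup or a refused connection
requests.HTTPError HTTP error
requests.URLRequired missing URL
requests.TooManyRedirects the maximum number of redirects was exceeded
requests.ConnectTimeout timed out while connecting to the remote server
requests.Timeout the request to the URL timed out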
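
These exception classes form a hierarchy, which determines what a single except clause will catch. The sketch below only inspects the classes and makes no network calls.

```python
import requests

# Every exception in the list derives from requests.RequestException.
listed = [
    requests.ConnectionError,
    requests.HTTPError,
    requests.URLRequired,
    requests.TooManyRedirects,
    requests.ConnectTimeout,
    requests.Timeout,
]
print(all(issubclass(exc, requests.RequestException) for exc in listed))  # True

# ConnectTimeout is both a ConnectionError and a Timeout.
print(issubclass(requests.ConnectTimeout, requests.ConnectionError))  # True
print(issubclass(requests.ConnectTimeout, requests.Timeout))          # True
```

Catching requests.RequestException therefore covers all of them, while catching requests.Timeout covers only the timeout cases.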

r.raise_for_status() raises requests.HTTPError if the status code indicates an error (4xx or 5xx)
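
The behavior of raise_for_status() can be checked offline by setting status_code on an empty Response. This is a sketch: status_raises is a hypothetical helper written for this illustration, and the Response is built by hand rather than received from a server.

```python
import requests

def status_raises(code):
    """Return True if raise_for_status() raises for this status code."""
    r = requests.Response()  # empty Response built by hand, not from a server
    r.status_code = code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

print(status_raises(200))  # False
print(status_raises(404))  # True
print(status_raises(500))  # True
```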

>>> import requests
>>> def getHTMLText(url):
...     try:
...         r=requests.get(url,timeout=30)
...         r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
...         r.encoding=r.apparent_encoding
...         return r.text
...     except requests.RequestException:
...         return None  # the request failed
...
>>> if __name__=='__main__':
...     url="http://www.baidu.com"
...     print(getHTMLText(url))

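HTTP: the Hypertext Transfer Protocol.
HTTP is a stateless application-layer protocol based on a request/response model.
HTTP uses URLs to identify and locate network resources.
URL format: http://host[:port][path]
host: a valid Internet host name or IP address
port: the port number; the default is 80
path: the path of the requested resource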
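
The parts of a URL can be pulled apart with the standard library. The URL below is an illustrative example chosen for this sketch, not one taken from the text.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://www.baidu.com:80/s?wd=python')

print(parts.scheme)    # http
print(parts.hostname)  # www.baidu.com
print(parts.port)      # 80
print(parts.path)      # /s
print(parts.query)     # wd=python
```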

A URL is an Internet path for accessing a resource over HTTP; each URL corresponds to one data resource.

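HTTP operations on resources:
GET       request the resource at the URL
HEAD      request the response headers for the resource at the URL, i.e. its header information
POST      request that new data be appended to the resource at the URL
PUT       request that a resource be stored at the URL, overwriting the existing one
PATCH     request a partial update of the resource at the URL, i.e. change part of its content
DELETE    request deletion of the resource stored at the URL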

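Understanding the difference between PATCH and PUT
Suppose the URL holds a record UserInfo with 20 fields, including UserID and UserName.
Requirement: the user changes UserName and nothing else.
With PATCH, only a partial update for UserName is submitted to the URL.
With PUT, all 20 fields must be submitted to the URL; any field not submitted is deleted.

The main advantage of PATCH: it saves network bandwidth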
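
The two semantics can be modeled with plain dictionaries. This is a simplified sketch, not server code: apply_put and apply_patch are hypothetical helpers, and three fields stand in for the 20 fields of UserInfo.

```python
def apply_put(stored, submitted):
    """PUT: the submitted representation replaces the stored one entirely."""
    return dict(submitted)

def apply_patch(stored, submitted):
    """PATCH: only the submitted fields change; the rest are kept."""
    merged = dict(stored)
    merged.update(submitted)
    return merged

user_info = {'UserID': 1, 'UserName': 'old_name', 'Email': 'a@example.com'}
change = {'UserName': 'new_name'}

print(apply_patch(user_info, change))
# {'UserID': 1, 'UserName': 'new_name', 'Email': 'a@example.com'}
print(apply_put(user_info, change))
# {'UserName': 'new_name'}
```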

The post() method of the Requests library
Example: POST a dictionary to a URL; it is automatically encoded as a form
>>> payload={'key1':'value1','key2':'value2'}
>>> r=requests.post('http://httpbin.org/post',data=payload)
>>> print(r.text)
{
...
"form":{
"key2":"value2",
"key1":"value1"
},
}
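
Whether the payload is form-encoded depends on its type. The sketch below prepares two POST requests locally without sending them; the string 'ABC' is an arbitrary example value. A dictionary is form-encoded, while a string is passed through as the raw body with no form Content-Type.

```python
import requests

url = 'http://httpbin.org/post'

as_dict = requests.Request('POST', url, data={'key1': 'value1', 'key2': 'value2'}).prepare()
as_text = requests.Request('POST', url, data='ABC').prepare()

print(as_dict.body)                       # key1=value1&key2=value2
print(as_text.body)                       # ABC
print('Content-Type' in as_text.headers)  # False
```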

The put() method of the Requests library
>>> payload={'key1':'value1', 'key2':'value2'}
>>> r=requests.put('http://httpbin.org/put',data=payload)
>>> print(r.text)
{
...
"form":{
"key2":"value2",
"key1":"value1"
},
}
