The Website is the API

The requests library

The 7 main methods of the requests library

requests.request

Constructs and sends a request.
requests.request(method, url, **kwargs)

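method: the request method (GET, POST, PUT, PATCH, HEAD, DELETE, OPTIONS)
url: the URL to request
**kwargs (optional):
- params [dict or byte sequence, appended to the url as a query string]
- data [dict, byte sequence, or file object, sent as the body of the request]
- json [JSON-serializable data, sent as the body of the request]
- headers [dict, HTTP headers]
- cookies [dict or CookieJar, the cookies sent with the request]
- auth [tuple, enables HTTP authentication]
- files [dict, files to upload]
- timeout [timeout in seconds]
- proxies [dict, sets proxy servers; login credentials can be included]
- allow_redirects [redirect switch, defaults to True]
- stream [deferred-download switch, defaults to False, meaning the response body is downloaded immediately]
- verify [SSL certificate verification switch, defaults to True]
- cert [path to a local SSL certificate]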
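To see what a few of these keyword arguments do, the request can be built without being sent. The following is a minimal sketch; the URL, the query parameters, and the header value are placeholders chosen for illustration, not part of the original notes.

```python
import requests

# Build the request but do not send it, so no network access is needed.
req = requests.Request(
    method="GET",
    url="http://example.com/search",        # placeholder URL
    params={"q": "python", "page": "1"},    # appended to the URL as a query string
    headers={"Accept": "text/html"},        # sent as HTTP headers
)
prepared = req.prepare()

print(prepared.method)             # the HTTP method
print(prepared.url)                # the URL with the query string attached
print(prepared.headers["Accept"])  # the header that was set above

# Actually sending it would look like this (needs a network connection):
# r = requests.request("GET", "http://example.com/search",
#                      params={"q": "python", "page": "1"}, timeout=10)
```

Keywords such as timeout, proxies, verify, and stream only take effect when the request is sent, so they do not appear on the prepared request.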

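requests.get()

Requests the resource at the url (HTTP GET).
r = requests.get(url, params=None, **kwargs)

requests.head()

Requests only the header information of the resource (HTTP HEAD).
requests.head(url, **kwargs)

requests.post()

Requests that new data be appended to the resource at the url (HTTP POST).
requests.post(url, data=None, json=None, **kwargs)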
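The difference between data and json in a POST can be checked without sending anything. This is a small sketch under the assumption of a placeholder URL and a made-up payload; it only prepares the two requests and inspects them.

```python
import json
import requests

url = "http://example.com/post"   # placeholder URL
payload = {"key": "value"}        # made-up payload for illustration

# data=dict is sent as a URL-encoded form body.
form = requests.Request("POST", url, data=payload).prepare()

# json=dict is serialized to JSON and labeled as application/json.
as_json = requests.Request("POST", url, json=payload).prepare()

print(form.headers["Content-Type"], form.body)
print(as_json.headers["Content-Type"], as_json.body)
```

The same data argument is what requests.put() and requests.patch() accept.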

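requests.put()

Requests that a resource be stored at the url, overwriting the existing resource (HTTP PUT).
requests.put(url, data=None, **kwargs)

requests.patch()

Requests a partial update of the resource at the url (HTTP PATCH).
requests.patch(url, data=None, **kwargs)

requests.delete()

Requests deletion of the resource stored at the url (HTTP DELETE).
requests.delete(url, **kwargs)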

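The Response object returned by a request

A call such as r = requests.get(url) constructs a Request object that asks the server for a resource
and returns a Response object containing the server's resource.
- r.status_code - status code of the HTTP response (200 means success)
- r.text - the response body as a string
- r.encoding - encoding guessed from the HTTP headers (if the header has no charset, ISO-8859-1 is assumed)
- r.apparent_encoding - encoding inferred from the content itself (a fallback encoding)
- r.content - the response body as bytes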
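Why the two encodings matter can be shown offline. The sketch below uses requests.utils.get_encoding_from_headers, the helper that derives an encoding from a Content-Type header, and then decodes some sample bytes both ways. The header values and the sample text are assumptions made for illustration.

```python
from requests.utils import get_encoding_from_headers

# With a charset in the header, that charset is used.
with_charset = get_encoding_from_headers({"content-type": "text/html; charset=utf-8"})
# Without one, a text response falls back to ISO-8859-1.
without_charset = get_encoding_from_headers({"content-type": "text/html"})
print(with_charset, without_charset)

raw = "中文网页".encode("utf-8")      # stands in for r.content (bytes)
garbled = raw.decode("ISO-8859-1")   # roughly what r.text shows with the fallback encoding
readable = raw.decode("utf-8")       # what r.text shows once r.encoding is set to utf-8
print(garbled)
print(readable)
```

This is the reason for assigning r.apparent_encoding to r.encoding before reading r.text when a page serves non-Latin text without declaring a charset.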

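Exceptions in the requests library

requests.ConnectionError - network connection error, such as a failed DNS lookup or a refused connection
requests.HTTPError - HTTP error
requests.URLRequired - missing URL
requests.TooManyRedirects - the maximum number of redirects was exceeded
requests.ConnectTimeout - timed out while connecting to the remote server
requests.Timeout - the request to the url timed out
r.raise_for_status() - raises requests.HTTPError if the status code is a 4xx or 5xx error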
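The behavior of raise_for_status() and the relationship between these exception classes can be checked without a network connection. In this sketch the Response object is built by hand purely for illustration (real code gets it from requests.get), and status_error is a hypothetical helper name.

```python
import requests

def status_error(code):
    """Return the HTTPError raised for a status code, or None if nothing is raised."""
    r = requests.Response()   # hand-built Response, for illustration only
    r.status_code = code
    try:
        r.raise_for_status()
    except requests.HTTPError as exc:
        return exc
    return None

for code in (200, 302, 404, 500):
    print(code, status_error(code))

# ConnectTimeout is both a ConnectionError and a Timeout,
# and every exception listed above derives from RequestException.
print(issubclass(requests.ConnectTimeout, requests.ConnectionError))
print(issubclass(requests.ConnectTimeout, requests.Timeout))
print(issubclass(requests.HTTPError, requests.RequestException))
```

Because all of these classes share requests.RequestException as a base, catching that one class is enough when the caller does not need to tell the failures apart.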

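A general code framework for fetching web pages

Network connections are risky, so exception handling matters.

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for a 4xx or 5xx status
        r.encoding = r.apparent_encoding  # decode with the encoding inferred from the content
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"  # the scheme (http://) is required
    print(getHTMLText(url))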

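Web crawlers: honor among thieves

Crawler scale

Crawling web pages, working with individual pages (small)
Crawling a website or a series of websites (scrapy) (medium)
Crawling the entire web (Google, Baidu) (large)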

Problems

"Harassment" of servers by web crawlers
Legal risks of web crawling
Privacy leaks

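Restrictions on crawlers

Source review: the site checks the User-Agent header and restricts access accordingly
Published notice: the robots protocol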
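Source review works because requests announces itself in the User-Agent header by default. The sketch below prints that default and shows how a crawler could identify itself with its own header instead. The crawler name and the URL are hypothetical, and nothing is sent over the network.

```python
import requests
from requests.utils import default_user_agent

# The value requests sends when no User-Agent is given, e.g. "python-requests/<version>".
default_ua = default_user_agent()
print(default_ua)

# A crawler can state who it is through the headers keyword.
headers = {"User-Agent": "my-study-crawler/0.1"}   # hypothetical crawler name
prepared = requests.Request("GET", "http://example.com/", headers=headers).prepare()
print(prepared.headers["User-Agent"])
```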

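The robots protocol

Purpose: the website tells crawlers which pages may be crawled and which may not
Form: a robots.txt file placed in the root directory of the website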

Basic syntax of the robots protocol:

User-agent: * (crawler name; * matches all crawlers)
Disallow: / (directory that may not be accessed)

Using the robots protocol

Web crawlers: identify robots.txt automatically or manually, then crawl the content accordingly
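Automatic identification can be sketched with urllib.robotparser from the Python standard library. The rules, the crawler name, and the URLs below are made up for illustration. A real crawler would load the file from the site root with set_url() and read() instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content: every crawler is barred from /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(url, agent="my-study-crawler"):
    """Return True if the rules permit this (hypothetical) crawler to fetch the url."""
    return parser.can_fetch(agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```

Calling a check like this before each request keeps the crawler within the rules the site has published.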

Binding force: the robots protocol is advisory rather than binding. A web crawler may ignore it, but doing so carries legal risk.

Python crawler study notes 1: the requests library and the robots protocol