Python 网络爬虫框架

Python网络请求

urllib模块
urllib是Python的自带模块
urlopen(url,data,timeout)

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
html = response.read()
print(html.decode())

可能会出现的问题：https://www.cnblogs.com/liangmingshen/p/9958573.html

urllib3模块
非自带模块，需要命令安装 pip install urllib3

import urllib3

html = urllib3.PoolManager()
response = html .request('POST','http://httpbin.org/post',fields={'word':'hello'})
print(response.data.decode())

requests模块
非自带模块，需要命令安装 pip install requests

import requests

data = {'word':'hello'}
response = requests.post('http://httpbin.org/post',data=data)
print(response.content.decode())

请求headers处理

参考文章https://blog.csdn.net/u010256388/article/details/68491509
头部信息
Accept:浏览器可接受的MIME类型（媒体类： Multipurpose Internet Mail Extensions）
Host：初始URL中的主机和端口
Connection：是否可以处理持久连接
Cache-Control：控制网页缓存
Cookie：将cookies返回到服务器
Upgrade-Insecure-Requests：通知服务器可以处理https协议
User-Agent：用于识别浏览器类型

请求行：

①是请求方法，GET和POST是最常见的HTTP方法，除此以外还包括DELETE、HEAD、OPTIONS、PUT、TRACE。

②为请求对应的URL地址，它和报文头的Host属性组成完整的请求URL。

③是协议名称及版本号。

请求头：

④是HTTP的报文头，报文头包含若干个属性，格式为“属性名:属性值”，服务端据此获取客户端的信息。

与缓存相关的规则信息，均包含在header中

请求体：

⑤是报文体，它将一个页面表单中的组件值通过param1=value1&param2=value2的键值对形式编码成一个格式化串，它承载多个请求参数的数据。不但报文体可以传递请求参数，请求URL也可以通过类似于“/chapter15/user.html? param1=value1&param2=value2”的方式传递请求参数。

User-Agent : 有些服务器或 Proxy 会通过该值来判断是否是浏览器发出的请求
Content-Type : 在使用 REST 接口时，服务器会检查该值，用来确定 HTTP Body 中的内容该怎样解析。
- application/xml ：在 XML RPC，如 RESTful/SOAP 调用时使用
- application/json ：在 JSON RPC 调用时使用
- application/x-www-form-urlencoded ：浏览器提交 Web 表单时使用
- 在使用服务器提供的 RESTful 或 SOAP 服务时， Content-Type 设置错误会导致服务器拒绝服务

Cookie，指某些网站为了辨别用户身份、进行session跟踪而储存在用户本地终端上的数据（通常经过加密）

HTML解析

lxml 模块 pip install lxml Doc
Requests 模块 pip install requests-html Doc
HtmlPaeser pip install HTMLParser Doc
BeautifulSoup 模块 pip install bs4 Doc

Requests

推荐阅读
taobao cookie
【淘宝价格爬取】
注意：首先要登录淘宝，再设置headers，否则淘宝处于未登录状态，无法进入搜索界面
设置：F12——Network——√DOC——search?initiative_id=..——Request Headers——cookie:

#CrowTaobaoPrice.py
#淘宝本是不允许爬取搜索页面的 详见：https://www.taobao.com/robots.txt
import requests
import re

headers = {
	#淘宝登录后增加cookie
    'cookie': '...', 
    'User-Agent': 'Mozilla/5.0',
}
def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url, timeout=30, headers=headers)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        print("获取失败\n")

def parsePage(ilt, html):
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"',html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"',html)
        for i in range(len(plt)):
        	#eval函数将字符串转为数字(去掉引号)
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price , title])
    except:
        print("解析失败\n")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("序号", "价格", "商品名称"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

def main():
    goods = '书包'
    depth = 3 #演示使用，depth不能太大，
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)

main()