Python Study Notes: Introduction to Python Web Crawlers 17-1: urllib request + parse + chardet + GET + POST + Request

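# 0 Preparing to write crawlers
- References

  - 《Python网络数据采集》 (Python web data collection), published by Turing (图灵)
  - 《精通Python爬虫框架Scrapy》 (Mastering the Python crawler framework Scrapy), Posts & Telecom Press (人民邮电出版社)
  - [Python3 web crawler](http://blog.csdn.net/c406495762/article/details/72858983)
  - [Official Scrapy tutorial](http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html)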

- Prerequisites
  - URLs
  - The HTTP protocol
  - Web front end: HTML, CSS, JS
  - Ajax
  - re, XPath
  - XML

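# 1. Introduction to crawlers
- Definition: a web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
- Two defining features
  - It downloads the data or content its author asks for
  - It moves around the network automatically
- Three main steps:
  - Download a page
  - Extract the relevant information
  - Follow a rule to jump to another page and repeat the two steps above

- Types of crawler
  - General-purpose crawlers
  - Special-purpose (focused) crawlers

- Python networking packages
  - Python 2.x: urllib, urllib2, urllib3, httplib, httplib2, requests
  - Python 3.x: urllib, urllib3, httplib2, requests
  - Python 2: use urllib together with urllib2, or use requests
  - Python 3: urllib or requests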
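
The three steps listed in this section (download, extract, follow) can be sketched as a small loop. This is only an illustration under assumptions: `crawl`, `LinkExtractor`, and the in-memory `FAKE_SITE` are hypothetical names invented here, and the `fetch` argument stands in for whatever download function is used (for example one built on `urllib.request`), so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` maps a URL to HTML text, or None on failure."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)                  # step 1: download the page
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()           # step 2: extract information (here: links)
        parser.feed(html)
        for href in parser.links:          # step 3: follow the links to new pages
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "site" so the example stays offline.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": "no links here",
}
print(crawl("http://example.test/", FAKE_SITE.get))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, and `max_pages` bounds how far it wanders.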

# 2. urllib
- Modules it contains
  - urllib.request: opens and reads URLs
  - urllib.error: the common exceptions raised by urllib.request; catch them with try
  - urllib.parse: functions for parsing URLs
  - urllib.robotparser: parses robots.txt files
- Example v01

from urllib import request
"""
Request a web page with urllib.request and print its content
"""
if __name__ == '__main__':
    url = 'https://blog.csdn.net/u013985879'
    # Open the URL; the returned object wraps the server's response
    rsp = request.urlopen(url)

    print(type(rsp))  # <class 'http.client.HTTPResponse'>
    print(rsp)  # <http.client.HTTPResponse object at 0x000001EBB1625EF0>
    # read() returns the body as bytes
    html = rsp.read()
    print(type(html))  # <class 'bytes'>
    print(html)  # b'<!DOCTYPE html>\n<html lang="zh-CN">\n<head>\n...
    # Decode the bytes to str (decode() defaults to UTF-8)
    html = html.decode()
    print(html)
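
Since urllib.error is meant to be used with try, here is a minimal sketch of wrapping urlopen in error handling. The helper name `fetch_text` and the two demonstration URLs are assumptions made for this example; they are chosen so that nothing touches the network (a `data:` URL is decoded locally, and an unknown scheme fails before any connection is made).

```python
from urllib import error, request


def fetch_text(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as rsp:
            return rsp.read().decode("utf-8", errors="replace")
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", e.code, e.reason)
    except error.URLError as e:
        # No usable answer: DNS failure, refused connection, unknown scheme, ...
        print("URL error:", e.reason)
    return None


# Offline demonstration (both URLs are illustrative).
print(fetch_text("data:text/plain;charset=utf-8,hello"))
print(fetch_text("nosuchscheme://example.test/"))
```

With a real page the call would simply be `fetch_text('https://blog.csdn.net/u013985879')`.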

- Handling page encodings
  - chardet can detect a page's encoding automatically, but its guess may be wrong
  - It has to be installed first: conda install chardet
  - Example v02

#!/usr/bin/env python
from urllib import request
import chardet

if __name__ == '__main__':
    url = 'https://blog.csdn.net/u013985879'
    rsp = request.urlopen(url)
    html = rsp.read()
    # Let chardet guess the encoding of the raw bytes
    cs = chardet.detect(html)
    print(type(cs))  # <class 'dict'>
    print(cs)  # {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

    # 'encoding' can be None when detection fails, so fall back to utf-8
    html = html.decode(cs.get("encoding") or "utf-8")
    print(html)

Note: in the IDE, set the Project Interpreter to the Python interpreter in the anaconda directory so that chardet can be imported.
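
Because an automatic guess can be wrong, one possible safeguard is to try the charset declared in the Content-Type header first and only then fall back to other candidates. The sketch below uses only the standard library; `decode_body` and its `fallbacks` default are hypothetical choices for this example, and a chardet guess could be added to the candidate list in the same way.

```python
from email.message import Message


def decode_body(raw, content_type=None, fallbacks=("utf-8", "gb18030")):
    """Decode bytes with the header charset first, then each fallback in turn."""
    candidates = []
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # "utf-8" for "text/html; charset=UTF-8"
        if charset:
            candidates.append(charset)
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next candidate
    # Last resort: never raise, replace the bytes that cannot be decoded.
    return raw.decode("utf-8", errors="replace")


gbk_bytes = "中文".encode("gbk")
print(decode_body(gbk_bytes, "text/html; charset=GBK"))
print(decode_body(gbk_bytes))  # no header: utf-8 fails, gb18030 succeeds
```

With a real response the header value would come from `rsp.headers.get("Content-Type")`.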

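- The object returned by urlopen
  - Example v03
  - geturl: returns the URL of the response (the final URL after any redirects)
  - info: returns the response's meta information (the headers)
  - getcode: returns the HTTP status code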

from urllib import request

if __name__ == '__main__':
    url = 'https://blog.csdn.net/u013985879'
    rsp = request.urlopen(url)
    print(type(rsp))
    print(rsp)

    # url:https://blog.csdn.net/u013985879
    print("url:{}".format(rsp.geturl()))
    '''
    Info:Server: openresty
    Date: Sat, 08 Sep 2018 05:46:23 GMT
    Content-Type: text/html; charset=UTF-8
    Transfer-Encoding: chunked
    Connection: close
    Vary: Accept-Encoding
    Set-Cookie: uuid_tt_dd=10_19413877000-1536385583899-776898; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;
    Set-Cookie: dc_session_id=10_1536385583899.453257; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;
    Vary: Accept-Encoding
    Strict-Transport-Security: max-age= 31536000
    '''
    print("Info:{}".format(rsp.info()))
    # Code:200
    print("Code:{}".format(rsp.getcode()))

    # html = rsp.read()
    # html = html.decode()
    # print(html)

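- Using the data parameter of request
- Two ways to access a server: GET and POST
- GET:
  - Passes information to the server through parameters in the URL
  - The parameters are written as a dict and then encoded with parse
  - Example v04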

from urllib import request, parse

"""
Learn how to encode URL parameters
This needs the parse module
"""
if __name__ == '__main__':
    baseurl = 'http://www.baidu.com/s?'
    wd = input("Please input keyword:")
    # The parameters have to be given as a dict
    qs = {
        "wd": wd
    }
    # Percent-encode the parameters
    qs = parse.urlencode(qs)
    print(type(qs))  # <class 'str'>
    print(qs)  # wd=%E5%9B%BE%E7%81%B5%E5%AD%A6%E9%99%A2
    url = baseurl + qs
    # A readable URL with non-ASCII parameters cannot be opened directly;
    # urlopen fails with an encoding error on a URL such as:
    # url = 'https://www.baidu.com/s?wd=大'
    print(url)  # http://www.baidu.com/s?wd=%E5%9B%BE%E7%81%B5%E5%AD%A6%E9%99%A2
    rsp = request.urlopen(url)
    html = rsp.read()
    html = html.decode()
    print(html)
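
To see what the encoding step does without sending a request, the following sketch builds a query string and then decodes it again. `build_url` is a hypothetical helper, and the `pn` parameter is an assumed second parameter added only to show how several pairs are joined.

```python
from urllib import parse


def build_url(base, params):
    """Append percent-encoded query parameters to `base`."""
    return base + "?" + parse.urlencode(params)


url = build_url("http://www.baidu.com/s", {"wd": "大", "pn": 10})
print(url)  # http://www.baidu.com/s?wd=%E5%A4%A7&pn=10

# parse_qs reverses the encoding and returns a list of values per key.
query = parse.urlsplit(url).query
print(parse.parse_qs(query))  # {'wd': ['大'], 'pn': ['10']}

# quote encodes a single value rather than a whole dict.
print(parse.quote("大"))  # %E5%A4%A7
```

Non-ASCII characters are encoded as their UTF-8 bytes, one `%XX` per byte, which is why a single Chinese character becomes three escapes.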

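- POST
  - Generally used to send parameters to the server
  - POST puts the information in the request body instead of the URL; it does not encrypt it (encryption comes from HTTPS)
  - To send a POST request with urllib, pass the data argument
  - Using POST means some HTTP request headers may need to change:
    - Content-Type: application/x-www-form-urlencoded
    - Content-Length: length of the data
  - In short, whenever the request method changes, make sure the other request headers match it
  - urllib.parse.urlencode converts a dict into the form-encoded format above
  - Example v05 (no longer works)

The response comes back in JSON format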

from urllib import request, parse
import json
"""
Simulate a POST request with the parse module
Target: Baidu Translate
Analysis steps
1. Open the developer tools with F12
2. Type the word girl; a request is sent after every letter
3. The request URL is http://fanyi.baidu.com/sug
4. Under Network > All > Headers, the Form Data value is kw:girl
5. The response is JSON, so the json package is needed
"""

'''
Overall flow:
1. Build the content with data, then open it with urlopen
2. The response comes back as JSON
'''

if __name__ == '__main__':
    baseurl = 'http://fanyi.baidu.com/sug'
    text = input("Please input keyword:")
    data = {
        'kw': text
    }
    # urlencode returns a str; passing that to urlopen unchanged raises
    # TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.
    data = parse.urlencode(data)
    print(type(data))  # <class 'str'>

    # so encode it to bytes first
    data = data.encode("utf-8")

    json_data = request.urlopen(url=baseurl, data=data)

    print(type(json_data))  # <class 'http.client.HTTPResponse'>
    # print(json_data)

    json_data = json_data.read().decode("utf-8")
    # {"errno":0,"data":[{"k":"hello","v":"int. \u6253\u62db\u547c; \u54c8\u55bd\uff0c\u5582; \u4f60\u597d\uff0c\u60a8\u597d; \u8868\u793a\u95ee\u5019; n. \u201c\u5582\u201d\u7684\u62db\u547c\u58f0\u6216\u95ee\u5019\u58f0; vi. \u558a"},{"k":"hello everyone","v":" \u5927\u5bb6\u597d;"},{"k":"hello kitty","v":"n. \u5361\u901a\u4e16\u754c\u4e2d; \u6709\u8fd9\u6837\u4e00\u53ea\u5c0f\u732b; \u6ca1\u6709\u5634\u5df4; \u8138\u86cb\u5706\u5706\u7684;"},{"k":"hellos","v":"n. \u5582( hello\u7684\u540d\u8bcd\u590d\u6570 );"},{"k":"hellow","v":" \uff08\u901a\u5e38\u7684\u62db\u547c\u8bed\uff09\u55e8\uff0c \uff08\u6253\u7535\u8bdd\u7528\uff09\u5582\uff01\uff0c \uff08\u82f1\uff09\uff08\u8868\u793a\u60ca\u8bb6\uff09\u54ce\u54df;"}]}
    print(json_data)

    # Convert the JSON string to a dict
    json_data = json.loads(json_data)
    print(type(json_data))  # <class 'dict'>
    print(json_data)

    for item in json_data['data']:
        print(item['k'], "--", item['v'])

    """
    hello -- int. 打招呼; 哈喽，喂; 你好，您好; 表示问候; n. “喂”的招呼声或问候声; vi. 喊
    hello everyone --  大家好;
    hello kitty -- n. 卡通世界中; 有这样一只小猫; 没有嘴巴; 脸蛋圆圆的;
    hellos -- n. 喂( hello的名词复数 );
    hellow --  （通常的招呼语）嗨， （打电话用）喂！， （英）（表示惊讶）哎哟;
    """
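
Since the live endpoint no longer responds as shown, the mechanics of a POST can still be inspected offline by building the request without sending it. In this sketch `build_post` is a hypothetical helper; it relies on the fact that a `Request` with a non-None `data` argument defaults to the POST method.

```python
from urllib import parse, request


def build_post(url, form):
    """Build (but do not send) a form-encoded POST request."""
    body = parse.urlencode(form).encode("utf-8")  # dict -> str -> bytes
    return request.Request(url, data=body)


req = build_post("http://fanyi.baidu.com/sug", {"kw": "girl"})
print(req.get_method())  # POST
print(req.data)          # b'kw=girl'

# Without data the same class produces a GET request.
print(request.Request("http://fanyi.baidu.com/sug").get_method())  # GET
```

The body printed above is plain form-encoded text, which shows that POST moves the parameters out of the URL but does not hide or encrypt them.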

- urlopen alone is not convenient when more of the request needs to be configured
- Use the request.Request class instead
- Example v06 (no longer works)

from urllib import request, parse
import json
"""
Simulate a POST request with the parse module
Target: Baidu Translate
Analysis steps
1. Open the developer tools with F12
2. Type the word girl; a request is sent after every letter
3. The request URL is http://fanyi.baidu.com/sug
4. Under Network > All > Headers, the Form Data value is kw:girl
5. The response is JSON, so the json package is needed
"""

'''
Overall flow:
1. Build the content with data, wrap it in a Request, then open it with urlopen
2. The response comes back as JSON
'''

if __name__ == '__main__':
    baseurl = 'http://fanyi.baidu.com/sug'
    text = input("Please input keyword:")
    data = {
        'kw': text
    }
    # urlencode returns a str; passing that to urlopen unchanged raises
    # TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.
    data = parse.urlencode(data)
    print(type(data))  # <class 'str'>

    data = data.encode("utf-8")

    # Header names and values must not contain stray spaces
    headers = {
        'Accept': 'application/json, text/javascript',
        # 'Accept-Encoding': 'gzip, deflate',  # left out: urllib does not decompress responses automatically
        'Accept-Language': 'zh-CN,zh',
        'Connection': 'keep-alive',
        # 'Content-Length': len(data),
        # 'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8",
    }
    req = request.Request(url=baseurl, data=data, headers=headers)
    print(type(req))
    print(req)
    json_data = request.urlopen(req)
    # json_data = request.urlopen(url=baseurl,data=data)

    print(type(json_data))  # <class 'http.client.HTTPResponse'>
    # print(json_data)

    json_data = json_data.read().decode("utf-8")
    # {"errno":0,"data":[{"k":"hello","v":"int. \u6253\u62db\u547c; \u54c8\u55bd\uff0c\u5582; \u4f60\u597d\uff0c\u60a8\u597d; \u8868\u793a\u95ee\u5019; n. \u201c\u5582\u201d\u7684\u62db\u547c\u58f0\u6216\u95ee\u5019\u58f0; vi. \u558a"},{"k":"hello everyone","v":" \u5927\u5bb6\u597d;"},{"k":"hello kitty","v":"n. \u5361\u901a\u4e16\u754c\u4e2d; \u6709\u8fd9\u6837\u4e00\u53ea\u5c0f\u732b; \u6ca1\u6709\u5634\u5df4; \u8138\u86cb\u5706\u5706\u7684;"},{"k":"hellos","v":"n. \u5582( hello\u7684\u540d\u8bcd\u590d\u6570 );"},{"k":"hellow","v":" \uff08\u901a\u5e38\u7684\u62db\u547c\u8bed\uff09\u55e8\uff0c \uff08\u6253\u7535\u8bdd\u7528\uff09\u5582\uff01\uff0c \uff08\u82f1\uff09\uff08\u8868\u793a\u60ca\u8bb6\uff09\u54ce\u54df;"}]}
    print(json_data)

    # Convert the JSON string to a dict
    json_data = json.loads(json_data)
    print(type(json_data))  # <class 'dict'>
    print(json_data)

    for item in json_data['data']:
        print(item['k'], "--", item['v'])

    """
    hello -- int. 打招呼; 哈喽，喂; 你好，您好; 表示问候; n. “喂”的招呼声或问候声; vi. 喊
    hello everyone --  大家好;
    hello kitty -- n. 卡通世界中; 有这样一只小猫; 没有嘴巴; 脸蛋圆圆的;
    hellos -- n. 喂( hello的名词复数 );
    hellow --  （通常的招呼语）嗨， （打电话用）喂！， （英）（表示惊讶）哎哟;
    """
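
The two halves of this example, attaching headers through request.Request and parsing the JSON reply, can be checked offline as sketched below. `build_request` and `parse_suggestions` are hypothetical helpers, the User-Agent value is an invented placeholder, and `sample` is a shortened copy of the response format shown in the comments above rather than a live reply.

```python
import json
from urllib import parse, request

HEADERS = {
    "Accept": "application/json, text/javascript",
    "Accept-Language": "zh-CN,zh",
    "User-Agent": "Mozilla/5.0 (study-notes example)",  # hypothetical value
}


def build_request(url, form, headers=HEADERS):
    """Build a POST request that carries custom headers; nothing is sent."""
    body = parse.urlencode(form).encode("utf-8")
    return request.Request(url, data=body, headers=headers, method="POST")


def parse_suggestions(text):
    """Turn the JSON text into a list of (word, meaning) pairs."""
    payload = json.loads(text)
    return [(item["k"], item["v"]) for item in payload.get("data", [])]


req = build_request("http://fanyi.baidu.com/sug", {"kw": "hello"})
# Request stores header names capitalized, e.g. 'Accept-language'.
print(req.header_items())

sample = '{"errno":0,"data":[{"k":"hello everyone","v":" \\u5927\\u5bb6\\u597d;"}]}'
print(parse_suggestions(sample))
```

Sending the request would then be `request.urlopen(req)`, exactly as in the example above.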
