001 Basic crawler concepts and urllib's request and parse modules

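1. HTTP request methods:

GET
    Pros: simple and convenient
    Cons: not secure (parameters are visible in the URL), URL length is limited
POST
    Pros: somewhat safer (data travels in the request body, not the URL), no practical limit on data size, can upload files
PUT
    Sends a network request that can carry data to the server
DELETE (deletes some information)
Request headers
    Accept: content types the client can handle
    Accept-Encoding: content encodings (compression formats) the client accepts
    Connection: persistent (keep-alive) or short-lived connection
    Cookie: state stored on the client and sent back to the server (for example, a session ID)
    Referer: the page the request was linked from
    User-Agent: browser and client information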
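
As a rough illustration of the methods and headers listed above, the sketch below builds GET and POST requests with urllib.request.Request and attaches a few headers. The example.com URL and the header values are assumptions chosen for illustration, and nothing is sent over the network.

```python
from urllib import request, parse

# Hypothetical endpoint, used only for illustration; no request is sent.
URL = "http://www.example.com/search"

# Illustrative header values, not taken from a real browser.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler)",
    "Referer": "http://www.example.com/",
}

# Without a data argument, urllib treats the request as GET.
get_req = request.Request(URL, headers=headers)

# With a bytes body, urllib treats the request as POST.
body = parse.urlencode({"q": "python"}).encode("utf-8")
post_req = request.Request(URL, data=body, headers=headers)

# Other methods can be named explicitly.
delete_req = request.Request(URL, headers=headers, method="DELETE")

print(get_req.get_method())     # GET
print(post_req.get_method())    # POST
print(delete_req.get_method())  # DELETE
# urllib stores header names in capitalized form, e.g. "User-agent".
print(get_req.get_header("User-agent"))
```

Passing any of these objects to request.urlopen() would actually send the request.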

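2. Types of crawlers:

General-purpose crawlers:
    Used by search engines: Baidu, Google, Yahoo
    Pros: open, fast
    Cons: no specific target
Focused crawlers (the main topic here):
    Pros: clear target, closely matched to the user's needs, returns a fixed kind of content
    Incremental:  pagination, requesting from the first page through the last
    Deep:         deep crawling of static data (HTML and CSS)
    Dynamic data: JS code, encrypted JS
    robots:       robots.txt states whether other crawlers (general-purpose crawlers) may crawl the site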
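
One way to check a site's robots rules from Python is the standard library's urllib.robotparser. The sketch below is a minimal example under stated assumptions: the robots.txt content is made up and inlined so that no network access is needed, and the agent name "demo-crawler" is hypothetical.

```python
from urllib import robotparser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="demo-crawler"):
    """Return True if the parsed rules let this agent fetch the URL."""
    return rp.can_fetch(agent, url)

print(allowed("http://www.example.com/index.html"))         # True
print(allowed("http://www.example.com/private/data.html"))  # False
```

For a real site, rp.set_url(...) followed by rp.read() would download the file instead of parsing an inline string.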

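3. How a crawler works:

    1. Decide on the target site to crawl
    2. Send requests and fetch the data with Python (or Java/Go) code
    3. Parse the fetched data
        Extract new targets (URLs)
        Go back to step 1 for each new URL
        Extract the data you actually want
    4. Persist the data

    Tools covered:
        Python 3 (built in):
            · urllib.request
            · urlopen
            · passing GET parameters
            · POST
            · custom handlers
            · URLError
        requests (third party)
        Data parsing: xpath, bs4
        Data storage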
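
The four steps above can be sketched as a small loop. This is only one possible arrangement, not a definitive implementation: it uses a queue instead of recursion, the names LinkParser, crawl, and FAKE_SITE are hypothetical, and the fetch function is passed in so that the example can run against an in-memory fake site rather than the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) must return the page HTML as a str."""
    seen = {start_url}
    queue = deque([start_url])        # step 1: the targets to visit
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)             # step 2: request the page
        pages[url] = html             # step 4: "persist" (in memory here)
        parser = LinkParser()
        parser.feed(html)             # step 3: parse and find new URLs
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Hypothetical in-memory site standing in for real HTTP responses.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": "<p>no links</p>",
}

pages = crawl("http://example.com/", FAKE_SITE.__getitem__)
print(sorted(pages))
```

The seen set keeps the crawler from requesting the same URL twice, and max_pages bounds the run. A real crawler would swap the fake fetch for a function built on urlopen and write pages to disk or a database.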

from urllib import request

def load_data():
    url = "http://www.baidu.com/"
    # Send an HTTP GET request
    # response: the HTTP response object
    response = request.urlopen(url)
    # read() returns bytes; decode() converts them to str (UTF-8 by default)
    data = response.read().decode()
    # Write the data to a file
    with open('baidu.html', 'w', encoding='utf-8') as fp:
        fp.write(data)

    # Data types you handle when crawling in Python:
    # str:   call encode() to get bytes
    # bytes: call decode() to get a str

load_data()

Fetching a website with urlopen
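
Network requests can fail, so a call to urlopen is often wrapped in error handling. The sketch below is one possible shape for that, assuming a hypothetical helper named fetch. The opener parameter is an assumption added so that the demonstration can use a stand-in function instead of the real network.

```python
from urllib import request, error

def fetch(url, opener=request.urlopen, timeout=5):
    """Return the decoded response body, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it is caught first.
        print("HTTP error:", exc.code)
    except error.URLError as exc:
        # The request never got a response (DNS failure, refused connection).
        print("URL error:", exc.reason)
    return None

# Stand-in opener that always fails, so the demo runs offline.
def failing_opener(url, timeout):
    raise error.URLError("simulated network failure")

print(fetch("http://www.example.com/", opener=failing_opener))
```

Calling fetch("http://www.baidu.com/") with the default opener would perform a real request.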

from urllib import request, parse
import string

def get_method_params():
    # Target URL
    # url = "https://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD"
    # Build it by string concatenation
    url = "https://www.baidu.com/s?wd="
    str_name = '你好'
    # Python 3 strings are Unicode, but the URL sent in an HTTP request must be
    # ASCII (0-127), so the Chinese characters have to be percent-encoded

    final_url = url + str_name
    print(final_url)  # https://www.baidu.com/s?wd=你好
    # Percent-encode the URL that contains Chinese characters
    new_url = parse.quote(final_url, safe=string.printable)
    print(new_url)  # https://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD

    # Send the network request
    response = request.urlopen(new_url)
    print(response.read().decode())

get_method_params()

Using parse to encode a URL whose parameters contain Chinese characters
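
An alternative to quoting the whole URL is to encode only the query parameters with parse.urlencode, which takes a dict and percent-encodes the values. The sketch below reuses the wd parameter from the example above. The variable names are illustrative, and nothing is sent over the network.

```python
from urllib import parse

base_url = "https://www.baidu.com/s"
params = {"wd": "你好"}

# urlencode percent-encodes each value as UTF-8 and joins pairs with "&".
query = parse.urlencode(params)
final_url = base_url + "?" + query
print(final_url)  # https://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD

# For a POST request the same encoded string is sent as bytes in the body.
post_body = query.encode("utf-8")

# Decoding goes the other way.
print(parse.unquote(query))   # wd=你好
print(parse.parse_qs(query))  # {'wd': ['你好']}
```

Because only the parameter values are encoded, characters such as ":" and "/" in the base URL are never touched.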
