[Crawler Notes] The Excellent requests Module

Preface

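I used urllib and urllib2 when I first learned to write crawlers, but I have not touched them since; I use requests for everything now. This post records what I have learned from using requests over time.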

Main Text

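    import chardet   # third-party package, used below for encoding detection
    import requests

    r = requests.get('http://www.baidu.com')

    print(r.status_code)  # 200; the HTTP status code

    print(r.text)     # most common: the body as str, decoded with r.encoding (taken from the response headers, not always utf-8)

    print(r.content)    # the body as raw bytes, useful for encoding detection

    print(chardet.detect(r.content))  # the real encoding is set by the page; detection is only right with some probability

    # If r.text is garbled, r.encoding does not match the page; override it, then read r.text again
    r.encoding = 'gbk'  # gbk, gb18030, etc.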
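
When the page encoding is unknown, one alternative to guessing is to try a few candidate encodings on the raw bytes, in order. The following is a minimal sketch, assuming a hypothetical helper `decode_with_fallback` (it is not part of requests) and a candidate list of utf-8 followed by gb18030; the sample bytes stand in for `r.content`.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly.

    Hypothetical helper, not part of requests; pass r.content as raw.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # No candidate matched: fall back to the first one and replace bad bytes
    return raw.decode(encodings[0], errors="replace"), encodings[0]


sample = "中文".encode("gbk")  # stands in for r.content from a GBK page
text, used = decode_with_fallback(sample)
print(text, used)
```

Strict decoding is used on purpose: GBK bytes are usually not valid utf-8, so the utf-8 attempt fails loudly instead of producing garbled text.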

    myparams = {'name': 'lei', 'age': 222}
    # params are appended to the URL in plain text: http://www.baidu.com/?name=lei&age=222
    # They form the query string of the HTTP request
    requests.get('http://www.baidu.com', params=myparams)

    # data carries a form in a POST request; it goes into the request body, not the URL
    requests.post('http://www.baidu.com', data={'name': 'lei', 'age': 222})
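
The difference between `params` and `data` can be inspected without sending anything. This is a small sketch using `requests.Request(...).prepare()`, which builds the request object locally; the variable names are only illustrative.

```python
import requests

url = "http://www.baidu.com"
fields = {"name": "lei", "age": 222}

# prepare() builds the request without sending it over the network
get_req = requests.Request("GET", url, params=fields).prepare()
post_req = requests.Request("POST", url, data=fields).prepare()

print(get_req.url)    # params end up in the query string
print(get_req.body)   # a GET built this way has no body
print(post_req.url)   # no query string here
print(post_req.body)  # the form-encoded fields
print(post_req.headers["Content-Type"])
```

Printing the prepared request this way is also a convenient check when a site rejects a form submission.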

    # Upload a file (open it in binary mode)
    with open(r'c:\words_v1.txt', 'rb') as f:
        _file = {'file': f}
        requests.post('http://www.baidu.com', files=_file)

    # Pass cookies (the official way is more involved and not needed here)
    requests.get('http://www.baidu.com',
                 headers={'Cookie': 'PSTM=1525659528; BD_UPN=12314753;'})

    # Disable SSL certificate verification for an HTTPS site (requests emits an InsecureRequestWarning)
    requests.get('https://www.baidu.com', verify=False)

    # Using proxies
    _list = ["http://41.118.132.69:4433", "http://51.228.12.69:4423"]
    for i in _list:
        # the proxy is chosen by URL scheme, so an https URL needs an 'https' entry
        p = {'http': i, 'https': i}
        r = requests.get('https://www.baidu.com', proxies=p)
        if r.status_code == 200:
            # do something
            break

    # Download a file; iter_content limits the number of bytes read per chunk, and each chunk is written as it arrives
    with open(r'xx.png', 'wb') as f:
        for c in requests.get('http://xxx.com', stream=True).iter_content(1024):
            f.write(c)
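
The chunked write can be factored into a small function that accepts any iterable of bytes, which makes it easy to try without a network connection. A sketch, assuming a hypothetical helper name `save_chunks`; the list of byte strings stands in for `resp.iter_content(chunk_size=1024)`.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the number of bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


# Offline demo; with requests, pass resp.iter_content(chunk_size=1024) as chunks
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
written = save_chunks([b"abc", b"", b"defg"], demo_path)
print(written)
```

With requests, the call would look like `save_chunks(resp.iter_content(chunk_size=1024), 'xx.png')`, where `resp` comes from `requests.get(url, stream=True)`.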

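    # Keep a session; common for simulated logins and for accessing resources after logging in
    s = requests.session()  # same as requests.Session()
    # s is used much like the top-level requests functions, e.g. s.get and s.post

    UA = {'User-Agent': 'xxx'}
    s.headers = UA    # or s.headers.update(UA), which keeps the other default headers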

    # Pass cookies
    dict_cookie = dict(a='123', b='321')
    requests.utils.add_dict_to_cookiejar(s.cookies, dict_cookie)
    # In practice this is still cumbersome: what we want is to copy the whole cookie
    # string from the browser and use it in Python as-is, without rewriting it as a dict

    # Pass cookies, method two: send the copied string as a raw Cookie header
    s.get('http://www.baidu.com',
          headers={'Cookie': 'PSTM=1525659528; BD_UPN=12314753;'})
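
If the copied string should live in the session's cookie jar rather than in a per-request header, it can be split into a dict first. A minimal sketch, assuming a hypothetical helper `cookie_string_to_dict` and a plain `name=value; name=value` string as copied from the browser:

```python
def cookie_string_to_dict(raw):
    """Split a 'k1=v1; k2=v2' Cookie header string into a dict (hypothetical helper)."""
    cookies = {}
    for part in raw.split(";"):
        if "=" not in part:
            continue  # skips empty pieces, such as the one after a trailing ';'
        name, value = part.strip().split("=", 1)
        cookies[name] = value
    return cookies


parsed = cookie_string_to_dict("PSTM=1525659528; BD_UPN=12314753;")
print(parsed)
```

The result can then be handed to a session with `s.cookies.update(parsed)`, so later requests on that session send the cookies automatically.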

    # Persist cookies to a file
    import json
    with open('cookie.txt', 'w') as f:   # text mode: json.dump writes str, not bytes
        cookie = s.cookies.get_dict()
        json.dump(cookie, f)

    # Load cookies from the file
    with open('cookie.txt') as f:
        cookie = json.load(f)
        s.cookies.update(cookie)
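
The save and load steps can be wrapped as a pair of functions and checked with a round trip. This is a sketch only: `save_cookies` and `load_cookies` are hypothetical names, and a plain dict stands in for `s.cookies.get_dict()`.

```python
import json
import os
import tempfile


def save_cookies(cookie_dict, path):
    """Write a name -> value cookie dict to path as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookie_dict, f)


def load_cookies(path):
    """Read the cookie dict back from path."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Offline round trip; with a session, pass s.cookies.get_dict() instead of the literal dict
path = os.path.join(tempfile.mkdtemp(), "cookie.txt")
save_cookies({"a": "123", "b": "321"}, path)
restored = load_cookies(path)
print(restored)
```

Note that `get_dict()` keeps only names and values, so the domain, path, and expiry of each cookie are not preserved by this approach.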

Comments and suggestions are welcome.

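Disclaimer: this article is my personal understanding and summary of the technology. I cannot guarantee it is free of flaws, and corrections from readers are welcome.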
