Using the basic crawler libraries: the requests library

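  • Using requests: handling Cookies, login authentication, proxy settings, and more

    With urllib, handling web authentication and Cookies means writing an Opener and a Handler. The more powerful requests library exists to make these operations easier.

    • Example: basic use of the requests library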

      import requests

      r = requests.get('http://www.baidu.com/')
      print(type(r), r.status_code, r.text, r.cookies, sep='\n\n')


      # Output:
      <class 'requests.models.Response'>

      200

      <!DOCTYPE html>
      <!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible 
      ......
      feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>


      <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
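      • GET requests

        • A GET request returns the corresponding response information.
        • requests.get(url, params=None, **kwargs)
          • url is the link of the page to fetch; params holds extra URL parameters (a dict or bytes); **kwargs covers 12 optional arguments that control the request.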
            import requests

            r = requests.get('http://httpbin.org/get')
            print(r.text)


            # Output:
            {
              "args": {},
              "headers": {
                "Accept": "*/*",
                "Accept-Encoding": "gzip, deflate",
                "Host": "httpbin.org",
                "User-Agent": "python-requests/2.21.0"
              },
              "origin": "120.85.108.192, 120.85.108.192",
              "url": "https://httpbin.org/get"
            }


            # The result contains the request headers, URL, IP address, and other information
            import requests

            # Extra URL parameters are passed as a dict through params
            data = {
                'name': 'LiYihua',
                'age': '21'
            }
            r = requests.get('http://httpbin.org/get', params=data)
            print(r.text)


            # Output:
            {
              "args": {
                "age": "21",
                "name": "LiYihua"
              },
              "headers": {
                "Accept": "*/*",
                "Accept-Encoding": "gzip, deflate",
                "Host": "httpbin.org",
                "User-Agent": "python-requests/2.21.0"
              },
              "origin": "120.85.108.92, 120.85.108.92",
              "url": "https://httpbin.org/get?name=LiYihua&age=21"
            }
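To see how the params dict ends up in the URL without sending anything over the network, one option is to build the request and prepare it locally. This is only a small sketch that reuses the URL and data from the example above and inspects the prepared URL.

```python
import requests

data = {
    'name': 'LiYihua',
    'age': '21',
}

# Build the request locally; nothing is sent until a Session sends it.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()

# The params dict is URL-encoded and appended to the URL as a query string.
print(prepared.url)
```

Preparing a request this way is handy for checking how parameters are encoded before running a crawler against a real site.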
            import requests

            r = requests.get('http://httpbin.org/get')
            print(type(r.text), r.json(), type(r.json()), sep='\n\n')


            # Output:
            <class 'str'>

            {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'origin': '120.85.108.92, 120.85.108.92', 'url': 'https://httpbin.org/get'}

            <class 'dict'>

            # When the response body is a JSON string, json() converts it into a dict
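The conversion that json() performs can be illustrated offline with the standard json module. The sketch below assumes a shortened, made-up stand-in for r.text rather than a live response, and it also shows that a body that is not JSON cannot be parsed.

```python
import json

# A shortened stand-in for r.text from httpbin.org/get (sample data, not a live response).
text = '{"args": {}, "url": "https://httpbin.org/get"}'

parsed = json.loads(text)   # roughly what r.json() does with the response body
print(type(text))
print(type(parsed))
print(parsed['url'])

# A body that is not JSON cannot be parsed and raises an error instead.
try:
    json.loads('<html>not json</html>')
    parse_failed = False
except json.JSONDecodeError as exc:
    parse_failed = True
    print('not JSON:', exc.msg)
```

In practice it is worth wrapping r.json() in a try/except when a site may return an HTML error page instead of JSON.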

            Fetching binary data

            import requests

            r = requests.get('https://github.com/favicon.ico')
            print(r.text, r.content, sep='\n\n')

            # r.content returns bytes.
            # Use r.content to fetch images and other files.

            # r.text returns a str (Unicode).
            # Use r.text to fetch text.

            # Output (r.text is garbled because the image bytes are decoded as text):
            :�������OL��......

            b'\x00\x00\x01\x00\x02\x00\x10\x10\x00\x00\x0......
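The difference between bytes and decoded text can be reproduced without a network call. The following sketch uses made-up byte strings standing in for r.content; it shows that UTF-8 text decodes cleanly, while arbitrary binary data decoded with errors='replace' turns into replacement characters, which is why an image printed through r.text looks garbled.

```python
# Sample payloads standing in for r.content (assumed values, not real responses).
html_bytes = '意见反馈'.encode('utf-8')   # bytes of a UTF-8 text fragment
binary_bytes = b'\xff\xfe\x89PNG'         # arbitrary bytes that are not valid UTF-8

# Text content decodes cleanly.
decoded_text = html_bytes.decode('utf-8')
print(decoded_text)

# Undecodable bytes are replaced with U+FFFD, producing garbled output.
garbled = binary_bytes.decode('utf-8', errors='replace')
print(garbled)
```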

            Saving the fetched image

            import requests

            r = requests.get('https://github.com/favicon.ico')
            # 'wb' opens the file for writing in binary mode
            with open('favicon.ico', 'wb') as f:
                f.write(r.content)

            # After the script runs, an icon file named favicon.ico is created

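            The open() function and the with...as statement used in the previous example

            # The open() function
            # open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

            # Common parameters:
            file: the file to open
            mode: how to open the file (read-only, write, append, and so on)
            buffering: 0 turns buffering off (binary mode only); 1 selects line buffering (text mode only); an integer greater than 1 sets the buffer size in bytes; a negative value (the default) uses the system default buffer size

            # Values for the mode parameter
            ========= ===============================================================
            Character Meaning
            --------- ---------------------------------------------------------------
            'r'       open for reading (default)
            'w'       open for writing, truncating the file first
            'x'       create a new file and open it for writing
            'a'       open for writing, appending to the end of the file if it exists
            'b'       binary mode
            't'       text mode (default)
            '+'       open a disk file for updating (reading and writing)
            'U'       universal newline mode (deprecated; removed in Python 3.11)
            ========= ===============================================================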
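The mode characters in the table can be tried out safely in a temporary directory. This is a minimal sketch; the file name demo.bin is a hypothetical placeholder, and the directory is deleted when the block ends.

```python
import os
import tempfile

exclusive_failed = False

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.bin')   # hypothetical file name

    with open(path, 'wb') as f:            # 'w' truncates or creates, 'b' means bytes
        f.write(b'\x00\x01')

    with open(path, 'ab') as f:            # 'a' appends to the end
        f.write(b'\x02')

    with open(path, 'rb') as f:            # 'r' reads
        content = f.read()

    try:
        open(path, 'xb')                   # 'x' refuses to overwrite an existing file
    except FileExistsError:
        exclusive_failed = True

print(content)
print(exclusive_failed)
```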
            
            
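            # The with...as statement
            Some tasks need setup beforehand and cleanup afterwards. Python's with statement handles this pattern conveniently.
            The object produced by the expression after with must provide an __enter__() method and an __exit__() method. Once that expression is evaluated, the resulting object's __enter__() method is called, and its return value is bound to the variable after as. When the body of the with block has finished, the object's __exit__() method is called.
            An example:
            class Sample:
                def __enter__(self):
                    print("In __enter__()")
                    return "Foo"

                def __exit__(self, exc_type, exc_value, traceback):
                    print("In __exit__()")

            def get_sample():
                return Sample()

            with get_sample() as sample:
                print("sample:", sample)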
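One detail worth illustrating is that __exit__() also runs when the body of the with block raises an exception, which is what makes the statement suitable for cleanup such as closing files. The sketch below uses a hypothetical Resource class that records the order of events.

```python
events = []


class Resource:
    """Hypothetical resource that records when it is set up and cleaned up."""

    def __enter__(self):
        events.append('enter')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        events.append('exit')
        return False  # returning False lets the exception propagate


try:
    with Resource():
        events.append('body')
        raise ValueError('something went wrong')
except ValueError:
    events.append('handled')

print(events)  # ['enter', 'body', 'exit', 'handled']
```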

            Adding headers

            import requests

            r = requests.get('https://www.zhihu.com/explore')
            print(r.text)


            # Output:
            <html>
            <head><title>400 Bad Request</title></head>
            <body bgcolor="white">
            <center><h1>400 Bad Request</h1></center>
            <hr><center>openresty</center>
            </body>
            </html>

            # Some sites require headers; without them the request does not succeed
            import requests

            headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/52.0.2743.116 Safari/537.36'
            }
            r = requests.get('https://www.zhihu.com/explore', headers=headers)
            print(r.text)


            # Output:
            <!DOCTYPE html>
            <html lang="zh-CN" dropEffect="none" class="no-js no-auth ">
            <head>
            <meta charset="utf-8" />
            ......
            <script type="text/zscript" znonce="d78db0c15fa84270ac967503884baf11"></script>

            <input type="hidden" name="_xsrf" value="cdb6166e0dc5f38afc3ee95053d7ef55"/>
            </body>
            </html>
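The reason the first request above is rejected is likely the default User-Agent, which identifies the client as python-requests. The sketch below inspects the default headers and shows a custom value replacing the default when the request is prepared; the User-Agent string my-crawler/0.1 is an illustrative placeholder, and nothing is sent over the network.

```python
import requests

# Headers that requests sends when none are supplied.
defaults = requests.utils.default_headers()
print(defaults['User-Agent'])        # something like python-requests/<version>

# A custom User-Agent (the value here is just an illustrative placeholder).
custom = {'User-Agent': 'my-crawler/0.1'}

session = requests.Session()
prepared = session.prepare_request(
    requests.Request('GET', 'https://www.zhihu.com/explore', headers=custom)
)

# The custom value replaces the default; the other default headers are kept.
print(prepared.headers['User-Agent'])
print(prepared.headers['Accept'])
```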
      • POST requests

        • POST is another common type of HTTP request
          import requests

          data = {
              'name': 'LiYihua',
              'age': 21
          }
          r = requests.post('http://httpbin.org/post', data=data)
          print(r.text)


          # Output:
          {
            "args": {},
            "data": "",
            "files": {},
            "form": {
              "age": "21",
              "name": "LiYihua"
            },
            "headers": {
              "Accept": "*/*",
              "Accept-Encoding": "gzip, deflate",
              "Content-Length": "19",
              "Content-Type": "application/x-www-form-urlencoded",
              "Host": "httpbin.org",
              "User-Agent": "python-requests/2.21.0"
            },
            "json": null,
            "origin": "120.85.108.90, 120.85.108.90",
            "url": "https://httpbin.org/post"
          }

          # The POST request succeeded; the form field of the result holds the submitted data
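The form field in the output comes from passing the dict through data=. As a rough comparison, the sketch below prepares the same payload once with data= and once with json= and prints the resulting Content-Type and body; it only prepares the requests locally and sends nothing.

```python
import requests

payload = {'name': 'LiYihua', 'age': 21}
url = 'http://httpbin.org/post'

# data= sends the dict as an HTML form (URL-encoded key/value pairs).
form = requests.Request('POST', url, data=payload).prepare()
print(form.headers['Content-Type'])
print(form.body)

# json= serializes the dict to JSON and sets the Content-Type accordingly.
as_json = requests.Request('POST', url, json=payload).prepare()
print(as_json.headers['Content-Type'])
print(as_json.body)
```

Which form to use depends on what the server expects: HTML form handlers usually want data=, while JSON APIs usually want json=.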
      • Responses

        • text and content give the response body

          status_code gives the status code    headers gives the response headers    cookies gives the Cookies

          url gives the URL    history gives the request history

          import requests

          r = requests.get('https://www.cnblogs.com/liyihua/')

          print(type(r.status_code), r.status_code,
                type(r.headers), r.headers,
                type(r.cookies), r.cookies,
                type(r.url), r.url,
                type(r.history), r.history,
                sep='\n\n')


          # Output:
          <class 'int'>

          200

          <class 'requests.structures.CaseInsensitiveDict'>

          {'Date': 'Thu, 20 Jun 2019 08:18:00 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'private, max-age=10', 'Expires': 'Thu, 20 Jun 2019 08:18:10 GMT', 'Last-Modified': 'Thu, 20 Jun 2019 08:18:00 GMT', 'X-UA-Compatible': 'IE=10', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip'}

          <class 'requests.cookies.RequestsCookieJar'>

          <RequestsCookieJar[]>

          <class 'str'>

          https://www.cnblogs.com/liyihua/

          <class 'list'>

          []

          The status code is commonly used to check whether a request succeeded

          import requests

          r = requests.get('http://www.baidu.com')
          # Stop if the request failed; otherwise report success
          if r.status_code != requests.codes.ok:
              exit()
          print('Request Successfully')


          # Output:
          Request Successfully

          # requests.codes.ok is the success status code 200

          Status codes and their corresponding lookup names
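A few of the lookup names can be checked directly against requests.codes. The sketch below lists only a handful of well-known entries; the retry grouping and the should_retry helper are illustrative assumptions, not part of the library.

```python
import requests

# requests.codes maps readable names to numeric status codes.
print(requests.codes.ok)                        # 200
print(requests.codes.not_found)                 # 404
print(requests.codes['internal_server_error'])  # 500

# An illustrative grouping of codes a crawler might choose to retry on.
retryable = {requests.codes.too_many_requests, requests.codes.service_unavailable}


def should_retry(status_code):
    """Hypothetical helper: True for status codes in the retryable set."""
    return status_code in retryable


print(should_retry(503), should_retry(404))
```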


    • Advanced usage

      • File upload

        import requests

        # Open the file in binary mode; the with block closes it after the upload
        with open('favicon.ico', 'rb') as f:
            r = requests.post('http://httpbin.org/post', files={'file': f})
        print(r.text)


        # Output:
        {
          "args": {},
          "data": "",
          "files": {
            "file": "data:application/octetstream;base64,AAABAAIAEBAAAAEAIAAoBQAAJgAAACAgAAABACAAKBQAAE4FAAAoAAAAEAAAACAAAAABACAAAAAAAAAFAAA...
          },
          "form": {},
          "headers": {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate",
            "Content-Length": "6665",
            "Content-Type": "multipart/form-data; boundary=c1b665273fc73e67e57ac97e78f49110",
            "Host": "httpbin.org",
            "User-Agent": "python-requests/2.21.0"
          },
          "json": null,
          "origin": "120.85.108.71, 120.85.108.71",
          "url": "https://httpbin.org/post"
        }
      • Cookies

        import requests

        headers = {
            'Cookie': 'tgw_l7_route=66cb16bc7......ECLNu3tQ',
            'Host': 'www.zhihu.com',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
        }
        r = requests.get('https://www.zhihu.com', headers=headers)
        print(r.text)

        # Output:
        <!doctype html>
        <html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">首页 - 知乎</title><meta name="viewport" ......
        # The home page is returned, which indicates that the login succeeded


        # Cookies keep the login state: log in to Zhihu in a browser, copy the Cookie from the request headers, set it in headers, then send the request
        import requests

        cookies = 'tgw_l7_route=66cb16bc7f45da64562a07.......ALNI_MbNds66nlodoTCxp8EVE6ECLNu3tQ'
        jar = requests.cookies.RequestsCookieJar()

        headers = {
            'Host': 'www.zhihu.com',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
        }

        # Split the copied string into individual cookies; strip() removes the space after each ';'
        for cookie in cookies.split(';'):
            key, value = cookie.strip().split('=', 1)
            jar.set(key, value)

        r = requests.get('https://www.zhihu.com', cookies=jar, headers=headers)
        print(r.text)


        # The output is the same as above
        # The copied cookie string is split with split()
        # A RequestsCookieJar object is created, and set() stores each Cookie's key and value
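The splitting step can be isolated into a small helper and tested without logging in anywhere. The sketch below is a hypothetical parse_cookie_header function working on a made-up cookie string; it returns a plain dict, which the cookies parameter of requests also accepts.

```python
def parse_cookie_header(raw):
    """Split a 'k1=v1; k2=v2' string into a dict (hypothetical helper)."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty or malformed fragments
        # Split only on the first '=' because values may contain '=' themselves.
        key, value = pair.split('=', 1)
        cookies[key] = value
    return cookies


# Made-up cookie string for illustration; a real one comes from the browser.
sample = 'session_id=abc123; theme=light; token=a=b=c'
parsed_cookies = parse_cookie_header(sample)
print(parsed_cookies)
```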

      • Session persistence

        • A Session object makes it easy to maintain a session

          import requests

          # Two separate requests.get() calls do not share Cookies
          requests.get('http://httpbin.org/cookies/set/number/123456789')
          r = requests.get('http://httpbin.org/cookies')
          print(r.text)


          # Output:
          {
            "cookies": {}
          }


          import requests

          # Requests made through the same Session share Cookies
          s = requests.Session()
          s.get('http://httpbin.org/cookies/set/number/123456789')
          r = s.get('http://httpbin.org/cookies')
          print(r.text)


          # Output:
          {
            "cookies": {
              "number": "123456789"
            }
          }
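What the Session adds to each request can be inspected offline by preparing a request through it. In this sketch the cookie is set on the session by hand instead of through httpbin, and the User-Agent value is a placeholder; no request is actually sent.

```python
import requests

s = requests.Session()
s.cookies.set('number', '123456789')                 # same cookie the example sets via httpbin
s.headers.update({'User-Agent': 'my-crawler/0.1'})   # placeholder value

# Preparing a request through the session merges in its cookies and headers.
prepped = s.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
print(prepped.headers.get('Cookie'))
print(prepped.headers['User-Agent'])

# A request prepared without the session carries neither.
bare = requests.Request('GET', 'http://httpbin.org/cookies').prepare()
print(bare.headers.get('Cookie'))
```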
        • SSL certificate verification

          import requests

          r = requests.get('https://www.12306.cn')
          print(r.status_code)

          # Prints 200 if nothing goes wrong
          # If the HTTPS site's certificate fails verification, requests raises an SSLError.


          # To avoid the error, modify the example slightly: skip verification and silence the resulting warning
          import requests
          import urllib3

          urllib3.disable_warnings()
          r = requests.get('https://www.12306.cn', verify=False)
          print(r.status_code)
        • Proxy settings

          import requests

          # user, password, and host are placeholders; replace them with real proxy details
          proxies = {
              'http': 'socks5://user:password@host:3128',
              'https': 'socks5://user:password@host:1080'
          }

          requests.get('https://www.taobao.com', proxies=proxies)


          # Uses a SOCKS proxy (requires the SOCKS extra: pip install "requests[socks]")
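        • Timeout settings

          import requests

          # timeout=(connect, read): 0.1 s to connect and 1 s to read; an exception is raised if either limit is exceeded
          r = requests.get('https://taobao.com', timeout=(0.1, 1))
          print(r.status_code)

          # Output when the request completes in time: 200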
        • Authentication
          import requests
          from requests.auth import HTTPBasicAuth

          r = requests.get('http://localhost', auth=HTTPBasicAuth('liyihua', 'woshiyihua134'))
          print(r.status_code)


          # Output: 200


          # OAuth1 can also be used (requires the requests_oauthlib package)
          import requests
          from requests_oauthlib import OAuth1

          url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
          auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
                        'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
          requests.get(url, auth=auth)
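HTTP Basic authentication works by sending the user name and password, Base64-encoded, in an Authorization header. The sketch below builds that header value with the standard library only; basic_auth_header is a hypothetical helper, and the credentials are placeholders rather than the ones from the example.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header (illustrative helper)."""
    raw = f'{username}:{password}'.encode('latin1')
    token = base64.b64encode(raw).decode('ascii')
    return f'Basic {token}'


# Placeholder credentials, not the ones from the example above.
header_value = basic_auth_header('user', 'pass')
print(header_value)   # Basic dXNlcjpwYXNz
```

Because Base64 is an encoding and not encryption, Basic authentication should only be used over HTTPS. requests also accepts a plain (username, password) tuple for auth as a shorthand for HTTPBasicAuth.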
        • Prepared Request

          To get a Prepared Request that carries session state, use Session.prepare_request()
          from requests import Request, Session

          url = 'http://httpbin.org/post'
          data = {
              'name': 'LiYihua'
          }           # form parameters
          header = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537 (KHTML, like Gecko Chrome/53.0.2785.116 Safari/537.36'
          }           # pretend to be a browser
          s = Session()                       # keeps session state
          req = Request('POST', url, data=data, headers=header)

          prepped = s.prepare_request(req)    # prepare_request() turns req into a Prepared Request object
          r = s.send(prepped)                 # send() sends the request
          print(r.text)


          # Output:
          {
            "args": {},
            "data": "",
            "files": {},
            "form": {
              "name": "LiYihua"
            },
            "headers": {
              "Accept": "*/*",
              "Accept-Encoding": "gzip, deflate",
              "Content-Length": "12",
              "Content-Type": "application/x-www-form-urlencoded",
              "Host": "httpbin.org",
              "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537 (KHTML, like Gecko Chrome/53.0.2785.116 Safari/537.36"
            },
            "json": null,
            "origin": "120.85.108.184, 120.85.108.184",
            "url": "https://httpbin.org/post"
          }
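One reason to prepare a request separately is that it can be inspected or adjusted before it is sent. The sketch below prepares the same POST as above and stops short of sending it; the X-Trace-Id header is a made-up name used only for illustration.

```python
from requests import Request, Session

s = Session()
req = Request('POST', 'http://httpbin.org/post', data={'name': 'LiYihua'})
prepped = s.prepare_request(req)

# Inspect what would be sent.
print(prepped.method)
print(prepped.url)
print(prepped.body)
print(prepped.headers['Content-Length'])

# Adjust the prepared request before sending, for example by adding a header
# (X-Trace-Id is a made-up header name for illustration).
prepped.headers['X-Trace-Id'] = 'demo-001'

# s.send(prepped, timeout=5) would then perform the request.
```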


Reposted from www.cnblogs.com/liyihua/p/11050374.html