Python crawler learning 17

  • Advanced usage part 2

    • Session maintenance
      # Earlier we learned to simulate page requests with the post and get methods. The two calls are independent of each other, as if two separate browsers had opened different pages.
      # Because of this, when crawling, logging in to a site with POST and then requesting the personal-information page with GET obviously won't return the information we want. So how do we solve this?
      
      # Method one
      Pass the same cookies parameter in both requests (a sketch follows the first example below)
      # Method two
      Use a Session object to maintain the session
      

      Example:

      import requests
      
      r0 = requests.get('https://www.httpbin.org/cookies/set/number/123456789')
      # After setting the cookie and getting a successful response, request the site again
      print(r0.text)
      r1 = requests.get('https://www.httpbin.org/cookies')
      # Notice that the cookies field in this second response comes back empty
      print(r1.text)
      

      Operation result:

      (screenshot: the second response's cookies field is empty)
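
      A minimal sketch of method one, passing the same cookies into the request by hand. The cookies keyword argument of requests.get is a standard requests feature, and the cookie name and value are just the ones from the example above:

      import requests
      
      # Method one: build a cookies dict once and pass it into each
      # otherwise-independent request so they all carry the same cookie
      cookies = {'number': '123456789'}
      r = requests.get('https://www.httpbin.org/cookies', cookies=cookies)
      # The cookies field in the response now contains our number
      print(r.text)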

      Using the Session object:

      # Session maintenance
      import requests
      
      s = requests.Session()
      r0 = s.get('https://www.httpbin.org/cookies/set/number/123456789')
      r1 = s.get('https://www.httpbin.org/cookies')
      print(r0.text)
      print(r1.text)
      

      Operation result:

      (screenshot: both responses now contain the number cookie, because the Session carries cookies across requests automatically)
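
      Back to the motivating scenario: logging in with POST, then fetching a personal page with GET. With a Session the pattern looks roughly like this sketch; the URLs and form fields are hypothetical placeholders, not a real site:

      import requests
      
      s = requests.Session()
      # Hypothetical login endpoint; a successful login would set a session cookie on s
      s.post('https://example.com/login', data={'username': 'alice', 'password': 'secret'})
      # The same Session sends that cookie back, so the server sees us as logged in
      profile = s.get('https://example.com/profile')
      print(profile.status_code)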

    • SSL certificate verification
      # Many websites now require HTTPS, but some may not have their HTTPS certificate configured properly, or the certificate may not be recognized by a CA. Such sites will show an SSL certificate error.
      # For example, when we visit https://ssr2.scrape.center/
      # we get the following prompt
      

      (screenshot: the browser's certificate warning page)

      Let's use the requests library to request such a website:

      import requests
      
      resp = requests.get('https://ssr2.scrape.center/')
      print(resp.status_code)
      
      # Huh? What's going on? Why won't it let us in?
      

      Run result: no status code at all; the program throws an SSLError because certificate verification failed.

      (screenshot: the SSLError traceback)
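
      If you would rather handle the failure than let the program crash, the exception can be caught; requests.exceptions.SSLError is the exception class requests raises here:

      import requests
      
      try:
          resp = requests.get('https://ssr2.scrape.center/')
          print(resp.status_code)
      except requests.exceptions.SSLError as e:
          # Certificate verification failed; decide whether to retry with verify=False
          print('SSL verification failed:', e)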

      Set the verify parameter to bypass verification
      import requests
      # When verify is True (the default), the certificate is validated automatically; set it to False to skip validation
      resp = requests.get('https://ssr2.scrape.center/', verify=False)
      print(resp.status_code)
      

      In this way, the status code can be obtained:

      (screenshot: the status code is printed, followed by an InsecureRequestWarning)

      But the program still emits an InsecureRequestWarning, advising us to add certificate verification

      Disable warnings to suppress the message
      import requests
      # Import urllib3 directly; the old requests.packages path is just an alias for it
      import urllib3
      
      # disable_warnings() silences urllib3's InsecureRequestWarning
      urllib3.disable_warnings()
      resp = requests.get('https://ssr2.scrape.center/', verify=False)
      print(resp.status_code)
      

      Operation result:

      (screenshot: the status code is printed and the warning is gone)
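
      An alternative sketch that silences only this one warning through Python's standard warnings module, leaving other warnings visible; urllib3.exceptions.InsecureRequestWarning is the warning class urllib3 emits:

      import warnings
      import requests
      from urllib3.exceptions import InsecureRequestWarning
      
      # Ignore just the InsecureRequestWarning rather than all warnings
      warnings.simplefilter('ignore', InsecureRequestWarning)
      resp = requests.get('https://ssr2.scrape.center/', verify=False)
      print(resp.status_code)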

      Specify a certificate to avoid the warning

      To verify with your own certificate you need local crt and key files and must pass their paths; note that the private key has to be decrypted, since requests does not support encrypted keys.

      Don't look at me, I'm skipping this step since I don't have the certificate files.

      # Format only; the paths are placeholders for your own files
      import requests
      
      resp = requests.get('https://ssr2.scrape.center/', cert=('path/**.crt', 'path/**.key'))
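
      A related option worth knowing: the verify parameter can also take the path to a trusted CA bundle instead of True/False, keeping verification enabled even for a self-signed server certificate (the path below is a placeholder):

      import requests
      
      # Trust a specific CA bundle instead of disabling verification entirely
      resp = requests.get('https://ssr2.scrape.center/', verify='path/ca-bundle.crt')
      print(resp.status_code)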
      

That's all for today, to be continued...
