Using urllib (2): browser disguise

  • The previous example crawled the title of JD.com into memory, and it was easy to crawl. Next, we change the URL to another site: https://www.qiushibaike.com/
    • import urllib.request
      import re
      # crawl the page data into memory
      url = "https://www.qiushibaike.com/"
      # "ignore" skips undecodable bytes automatically, greatly reducing decode errors
      data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
      pat = "<title>(.*?)</title>"
      # re.S mode modifier: page data often spans multiple lines, so let "." match newlines too
      print(re.compile(pat, re.S).findall(data))
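Because real pages often break the title across lines, the re.S modifier matters. Here is a small self-contained illustration (the HTML snippet below is made up for demonstration) of what it changes:

```python
import re

# a made-up snippet where the title spans several lines
html = "<html><title>\nQiushibaike\n</title></html>"

pat = "<title>(.*?)</title>"
# without re.S, "." does not match newlines, so nothing is found
print(re.findall(pat, html))        # []
# with re.S, "." also matches newlines, so the title is captured
print(re.findall(pat, html, re.S))  # ['\nQiushibaike\n']
```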

       

    • Running this gives an error: the connection was closed by the remote host. Either the IP is blocked, or (in the majority of cases) the request was recognized as a crawler.
  • Browser disguise
    • Despite the error above, the page still opens fine in a browser, which proves our IP is not blocked. To keep visiting, we can disguise ourselves as a browser: the server checks whether the client looks like a browser, and lets it through if it does.
    • import urllib.request
      import re
      # browser disguise
      # build an opener object so we can configure it
      opener = urllib.request.build_opener()
      # construct the User-Agent tuple -- this is the key step
      UA = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36")
      # place it in the addheaders attribute
      opener.addheaders = [UA]
      # install the opener globally
      urllib.request.install_opener(opener)
      url = "http://www.qiushibaike.com"
      pat = "<title>(.*?)</title>"
      data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
      print(re.compile(pat, re.S).findall(data))
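install_opener changes the User-Agent for every subsequent urlopen call in the process. As an alternative sketch (not from the original post), urllib.request.Request attaches the header to a single request instead, leaving global state untouched:

```python
import urllib.request

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
# the header travels with this one request only
req = urllib.request.Request("http://www.qiushibaike.com", headers={"User-Agent": ua})
# urlopen accepts the Request object in place of a plain URL string:
# data = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
```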

Origin www.cnblogs.com/u-damowang1/p/12724245.html