Anti-scraping mechanisms (lazy loading and cookies)

Anti-scraping mechanism: image lazy loading

  • Webmaster-material HD image download
  • Anti-scraping mechanism: image lazy loading, widely used on image-hosting sites
    • Only when an image actually scrolls into the browser's viewport does the img tag's pseudo attribute turn into the real src attribute. A crawler's request has no viewport, so we must parse the pseudo attribute instead; it holds the real image URL. On the webmaster-material site, for example, the visible attribute is src and the pseudo attribute is src2, so we just extract the src2 attribute.
  • Anti-scraping mechanisms covered so far:
    • robots.txt
    • UA spoofing
    • capturing dynamically loaded data
    • image lazy loading
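As a concrete sketch, the snippet below extracts the real image URLs from the src2 pseudo attribute. The HTML string is a made-up stand-in for a page fetched with requests, and the attribute name src2 follows the webmaster-material example above; other sites often use names like data-src or data-original instead.

```python
import re

# Stand-in for page HTML fetched by the crawler. Before an image scrolls into
# view, its src points at a placeholder and the real URL sits in src2.
html = '''
<div class="box">
  <img src="loading.gif" src2="https://example.com/pic1.jpg" alt="a">
  <img src="loading.gif" src2="https://example.com/pic2.jpg" alt="b">
</div>
'''

# Parse the pseudo attribute src2 instead of src to get the real image URLs.
img_urls = re.findall(r'<img[^>]*\bsrc2="([^"]+)"', html)
print(img_urls)
```

Each URL could then be downloaded with an ordinary `requests.get` call.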

Cookies:

  • A set of key-value pairs stored on the client.

  • Typical applications of cookies in the web:

    ----- Password-free login

  • The connection between cookies and crawlers

    • Sometimes a page cannot be requested correctly unless the request carries a cookie. Cookies are therefore a typical anti-scraping mechanism that crawlers commonly run into.
  • Requirement: crawl the news-feed data from Xueqiu. https://xueqiu.com/

  • Analysis:

    • 1. Determine whether the target news data is dynamically loaded
      • It is: when you scroll to the bottom of the page, more news entries are loaded dynamically.
    • 2. Locate the ajax data packet and extract the request URL; the response is the news data in JSON form
import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
# URL of the ajax request Xueqiu uses to fetch the news data
url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20369434&count=15&category=-1'
page_text = requests.get(url=url,headers=headers).json()
page_text

Result:
{'error_description': '遇到错误,请刷新页面或者重新登录帐号后再试',
 'error_uri': '/v4/statuses/public_timeline_by_category.json',
 'error_data': None,
 'error_code': '400016'}
(The error_description translates to: "An error occurred; please refresh the page or log in to your account again and retry.")
  • Problem: we did not get the data we wanted

  • Reason: we are not simulating the browser's request strictly enough.

    • Fix: paste the request headers carried by the browser into the headers dictionary, and pass that dictionary to the requests call.
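In practice, "strictly simulating" the browser means copying the header set from the browser's network panel into the dictionary. Every value below is a placeholder standing in for a hypothetical capture, not a real captured value:

```python
# Manual approach: mirror the browser's request headers as closely as possible.
# All values here are placeholders; in a real run you paste the strings your
# own browser actually sent, as shown in the packet-capture tool.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Referer': 'https://xueqiu.com/',
    'Cookie': 'PLACEHOLDER-paste-the-cookie-string-from-the-capture-tool',
}
```

The Cookie header in particular is often the decisive one, which is why cookie handling gets its own treatment.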
  • Cookie handling

    • Method 1: manual handling

      • Paste the cookie from the packet-capture tool into headers
      • Drawback: this method stops working once the cookie expires
    • Method 2: automatic handling

      • Automatic handling based on a Session object
      • How to get one: requests.Session() returns a session object
      • What the session object does:
        • It can issue get and post requests just like the requests module. The difference is that if a cookie is generated while a request is sent through the session, that cookie is automatically stored in the session object, which means the next request made with the same session object carries the cookie automatically.
      • When using a session in a crawler, the session object is used at least twice!
        • The first time: request the home page, so that the session captures and stores the cookie
        • The second time: send the data request, which now carries the cookie
      # Create the session object
      session = requests.Session()
      # First use of the session: capture and store the cookie. We guess that a
      # request to the Xueqiu home page will produce the cookie.
      main_url = "https://xueqiu.com"
      session.get(main_url, headers=headers)  # capture and store the cookie
      url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20369434&count=15&category=-1'
      page_text = session.get(url=url, headers=headers).json()  # request carries the cookie
      
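To see why the session's cookie store matters, here is a self-contained sketch using only the standard library: a throwaway local HTTP server sets a cookie on its home page and serves data only to requests that send the cookie back. The urllib opener with an attached CookieJar plays the same role as requests.Session(); the server, the paths, and the token value are all made up for the demo.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            # Home page: issue a cookie, like Xueqiu's landing page does.
            body = b"ok"
            self.send_response(200)
            self.send_header("Set-Cookie", "token=abc123")
        else:
            # Data endpoint: only answer if the cookie comes back with the request.
            cookie = self.headers.get("Cookie", "")
            body = b"data" if "token=abc123" in cookie else b"denied"
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# An opener with a CookieJar behaves like requests.Session(): cookies from
# earlier responses are attached to later requests automatically.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
)
opener.open(base + "/")                  # first request: server sets the cookie
api = opener.open(base + "/api").read()  # second request: cookie sent automatically

# A bare urlopen keeps no cookies, so the data endpoint rejects it.
bare = urllib.request.urlopen(base + "/api").read()
print(api, bare)
server.shutdown()
```

The two-request pattern mirrors the Xueqiu code above: one request to harvest the cookie, a second to fetch the data with it.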


Origin www.cnblogs.com/zzsy/p/12687591.html