Using Scrapy middleware

Recap:

   Supplementary knowledge:

      generating User-Agent (UA) request headers with the fake-useragent library

Installation:
pip install fake-useragent

Use:
from fake_useragent import UserAgent
ua = UserAgent()

Call a browser-specific UA:
ua.ie
Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)

Random UA:
ua.random
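
A quick self-contained check (a minimal sketch; the browser-specific attributes such as ua.chrome and ua.firefox are part of fake-useragent's public API):

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.chrome)   # a Chrome User-Agent string
print(ua.firefox)  # a Firefox User-Agent string
print(ua.random)   # a different random UA on each access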

  Using middleware in Scrapy

    One:

      Downloader middleware

        - Role: intercept request headers and responses in bulk

        - Intercepting requests:

          1: forge the request header (User-Agent)

          2: set a proxy IP (related to process_exception and the request object)

        Enabling the middleware in settings.py is shown in the sketch below.
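
Before any of this interception can happen, the middleware has to be enabled in settings.py. A minimal sketch (the project name middlepro is inferred from the class name in the code below; 543 is the priority Scrapy's project template assigns by default):

DOWNLOADER_MIDDLEWARES = {
    'middlepro.middlewares.MiddleproDownloaderMiddleware': 543,
}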

  

    Two:

      Spider middleware
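
Spider middleware sits between the engine and the spider, so it sees the items and requests the spider yields rather than raw downloads. A minimal sketch (the class name MiddleproSpiderMiddleware is assumed here, mirroring the downloader middleware below; it would be enabled via SPIDER_MIDDLEWARES in settings.py):

class MiddleproSpiderMiddleware(object):
    # Called with the results the spider returns after processing a
    # response; result is an iterable of items and/or requests.
    def process_spider_output(self, response, result, spider):
        for i in result:
            spider.logger.info('spider yielded: %r' % i)
            yield i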

      

     A: Using the downloader middleware

       1: the middleware code

  

from scrapy import signals  # required for the spider_opened hook below


class MiddleproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # Scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # A single UserAgent instance shared by all requests.
    from fake_useragent import UserAgent
    ua = UserAgent()

    # Intercepts every request that did not raise an exception.
    # The spider parameter is the instantiated spider object.
    def process_request(self, request, spider):
        print('begin download middleware')

        # UA spoofing: give every request a random User-Agent.
        print('Request header is ' + self.ua.random)
        request.headers['User-Agent'] = self.ua.random

        # Test: check whether the proxy takes effect.
        request.meta['proxy'] = 'https://218.60.8.83:3129'

        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    # Intercept all responses.
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    # Intercept all request objects that raised an exception.
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    # Write to the log when the spider opens.
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
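
With the middleware enabled, running the spider with scrapy crawl <spider name> should print 'begin download middleware' plus a different random User-Agent for every request, which is an easy way to confirm that process_request is being called.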

 

        2: Parsing

       2-1: intercepting normal requests (process_request): forge the User-Agent and set the proxy IP

        2-2: intercepting exception requests (process_exception): switch the failed request to a fresh proxy, as sketched below
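
In the code above, process_exception is left as pass. A common way to fill it in for 2-2 (a sketch under the assumption of a hand-maintained proxy list; PROXY_POOL is a made-up name) is to move the failed request onto a new proxy and return it so Scrapy downloads it again:

import random

# Hypothetical pool of fallback proxies.
PROXY_POOL = [
    'https://218.60.8.83:3129',  # the proxy used in process_request above
    # add more proxies here
]

def process_exception(self, request, exception, spider):
    # Switch the failed request to a fresh proxy.
    request.meta['proxy'] = random.choice(PROXY_POOL)
    # Returning the request stops the process_exception() chain and
    # puts the request back into the queue to be downloaded again.
    return request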

 

      3: Test results

    The IP seen by the server has been replaced with the proxy IP,

    and the request header has been replaced with a random User-Agent.
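
One way to check both results yourself (a sketch; httpbin.org is a public echo service that reports the caller's IP and headers as JSON, and the spider here is a made-up example):

import json

import scrapy

class CheckSpider(scrapy.Spider):
    name = 'check'
    start_urls = ['https://httpbin.org/get']

    def parse(self, response):
        data = json.loads(response.text)
        # 'origin' is the IP the server saw; with the middleware on,
        # it should be the proxy IP rather than your own.
        self.logger.info('origin: %s' % data['origin'])
        # The User-Agent should be one of fake-useragent's random strings.
        self.logger.info('UA: %s' % data['headers']['User-Agent'])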

Origin: www.cnblogs.com/baili-luoyun/p/10962571.html