Recap:
Supplementary knowledge:
Generating the UA request header with the fake-useragent library
Installation: pip install fake-useragent. Usage: import UserAgent from fake_useragent and instantiate it, then ask for a specific browser's UA (e.g. ua.ie returns something like Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)) or a random one via ua.random.
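A minimal sketch of that usage (the exact strings printed will vary, since each attribute access draws a random UA from fake-useragent's browser data):

    from fake_useragent import UserAgent

    ua = UserAgent()
    print(ua.ie)      # a random Internet Explorer UA string
    print(ua.chrome)  # a random Chrome UA string
    print(ua.random)  # a random UA drawn from all browser families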
Using middleware in Scrapy
One: Using the downloader middleware
- Role: intercept request headers and responses in bulk
- Intercepting requests:
1: Forge the request header (User-Agent)
2: Set a proxy IP on the request object (in process_exception)
Two: Using the crawler middleware
A: Using the middleware
1: Middleware code
    from scrapy import signals
    from fake_useragent import UserAgent


    class MiddleproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        ua = UserAgent()

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        # intercept all normal (non-exception) requests;
        # the spider parameter is the instantiated spider object
        def process_request(self, request, spider):
            print('begin download middleware')
            # UA camouflage: pick one random UA so the printed value
            # matches the header that is actually set
            random_ua = self.ua.random
            print('Request header is ' + random_ua)
            request.headers['User-Agent'] = random_ua
            # test: set a proxy to see whether it takes effect
            request.meta['proxy'] = 'https://218.60.8.83:3129'
            # Called for each request that goes through the downloader
            # middleware.
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None

        # intercept all responses
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
            # Must either:
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response

        # intercept all exception request objects
        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.
            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass

        # write a log line when the spider opens
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
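For any of this to run, the middleware also has to be enabled in the project's settings.py. A sketch, assuming the project is named middlePro (inferred from the class name above; adjust the dotted path to your own project):

    # settings.py
    # The number is the middleware's order; lower values run closer to the engine.
    DOWNLOADER_MIDDLEWARES = {
        'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
    }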
2: Explanation
2-1: Intercepting normal requests (process_request)
2-2: Intercepting exception requests (process_exception); a retry sketch follows below
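The process_exception hook in the code above is left empty. A common pattern is to swap in a fresh proxy and return the request so Scrapy re-downloads it. A sketch, where PROXY_POOL is a hypothetical placeholder for a real proxy source:

    import random

    # Hypothetical proxy pool; replace with a real proxy source.
    PROXY_POOL = [
        'https://218.60.8.83:3129',
        'https://113.108.242.36:47713',
    ]

    def process_exception(self, request, exception, spider):
        # The download failed (e.g. the proxy died): assign a fresh proxy.
        request.meta['proxy'] = random.choice(PROXY_POOL)
        # Returning the request object re-schedules it for download and
        # stops the process_exception() chain.
        return request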
3: Test results
The IP has been replaced with the proxy IP, and the request header has been replaced with a random User-Agent.
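One way to reproduce this result: point a throwaway spider at httpbin.org/get, which echoes back the caller's IP and headers. A sketch; the spider name and class are illustrative:

    import json
    import scrapy

    class UaCheckSpider(scrapy.Spider):
        name = 'ua_check'
        start_urls = ['https://httpbin.org/get']

        def parse(self, response):
            data = json.loads(response.text)
            # 'origin' is the IP httpbin saw; with a working proxy it is the proxy IP.
            self.logger.info('origin: %s', data['origin'])
            # The echoed User-Agent should be one of fake-useragent's random UAs.
            self.logger.info('UA: %s', data['headers']['User-Agent'])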