- Override the constructor of the spider class and instantiate a Selenium browser object in it (the browser object should be instantiated only once).
- Override the spider's closed(self, spider) method and close the browser object inside it. This method is called when the crawl ends.
- Override the process_response method of the downloader middleware so that it intercepts the response object and replaces the page data stored in it.
- Enable the downloader middleware in the settings file.
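The control flow of the first three steps above can be sketched without Scrapy or Selenium installed. This is only an illustration of the idea: `FakeBrowser`, `FakeResponse`, `DemoSpider` and `DemoMiddleware` are hypothetical stand-ins, not classes from either library, and `process_response` here takes a plain URL string instead of a real request object. Only the hook names `closed` and `process_response` mirror the ones described above.

```python
class FakeBrowser:
    """Hypothetical stand-in for a Selenium browser object."""
    instances = 0  # counts how many browsers were created

    def __init__(self):
        FakeBrowser.instances += 1
        self.page_source = ''
        self.is_open = True

    def get(self, url):
        # Pretend the browser rendered the dynamically loaded content.
        self.page_source = '<html>rendered: %s</html>' % url

    def quit(self):
        self.is_open = False


class FakeResponse:
    """Hypothetical stand-in for a response object."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class DemoSpider:
    def __init__(self):
        # Step 1: create the browser once, in the constructor.
        self.bro = FakeBrowser()

    def closed(self, spider):
        # Step 2: close the browser when the crawl ends.
        self.bro.quit()


class DemoMiddleware:
    # URLs whose content is assumed to be loaded dynamically.
    dynamic_urls = {'http://news.163.com/domestic/'}

    def process_response(self, request_url, response, spider):
        # Step 3: intercept the response and swap in the rendered page.
        if request_url in self.dynamic_urls:
            spider.bro.get(request_url)
            return FakeResponse(request_url, spider.bro.page_source)
        return response


spider = DemoSpider()
middleware = DemoMiddleware()

static = FakeResponse('http://news.163.com/', '<html>static</html>')
dynamic = FakeResponse('http://news.163.com/domestic/', '<html>empty</html>')

unchanged = middleware.process_response(static.url, static, spider)
replaced = middleware.process_response(dynamic.url, dynamic, spider)
spider.closed(spider)
```

Both responses are handled through the same single browser object, which is why it is created in the constructor rather than inside `process_response`. The fourth step, enabling the middleware, is pure configuration and is not modelled here.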
2. Code
- Spider file:
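    from selenium import webdriver
    from scrapy_redis.spiders import RedisSpider

    class WangyiSpider(RedisSpider):
        name = 'wangyi'
        # allowed_domains = ['www.xxxx.com']
        start_urls = ['https://news.163.com']

        def __init__(self, *args, **kwargs):
            # Let the parent spider class finish its own setup first
            super().__init__(*args, **kwargs)
            # Instantiate the browser object (only once).
            # executable_path is accepted by Selenium 3 and early Selenium 4;
            # newer releases take a Service object instead.
            self.bro = webdriver.Chrome(executable_path='/Users/Bobo/Desktop/chromedriver')

        # The browser must be closed once the entire crawl has ended
        def closed(self, spider):
            print('crawl finished')
            self.bro.quit()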
- Middleware file:
    import time

    from scrapy.http import HtmlResponse

    class WangyiproDownloaderMiddleware(object):
        # Intercepts the response object that the downloader passes to the spider.
        # Parameters:
        #   request:  the request object that the response belongs to
        #   response: the intercepted response object
        #   spider:   the spider instance defined in the spider file
        def process_response(self, request, response, spider):
            # Replace the page data stored in the response object
            if request.url in ['http://news.163.com/domestic/',
                               'http://news.163.com/world/',
                               'http://news.163.com/air/',
                               'http://war.163.com/']:
                spider.bro.get(url=request.url)
                js = 'window.scrollTo(0, document.body.scrollHeight)'
                spider.bro.execute_script(js)
                time.sleep(2)  # give the browser some time to load the data
                # page_source now includes the dynamically loaded news data
                page_text = spider.bro.page_source
                # Return a new response object built from the rendered page
                return HtmlResponse(url=spider.bro.current_url, body=page_text,
                                    encoding='utf-8', request=request)
            else:
                return response
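The `request.url in [...]` check in the middleware is an exact string comparison, so a request for `https://news.163.com/domestic/`, or for the same path without the trailing slash, would not match the `http://` entries. Whether that matters depends on how the site redirects, which is not covered here. If it does, one possible approach is to compare only the host and path. The helper name `needs_rendering` and the `DYNAMIC_PAGES` set below are hypothetical, introduced only for this sketch.

```python
from urllib.parse import urlsplit

# Host/path pairs taken from the URL list in the middleware.
DYNAMIC_PAGES = {
    ('news.163.com', '/domestic'),
    ('news.163.com', '/world'),
    ('news.163.com', '/air'),
    ('war.163.com', ''),
}


def needs_rendering(url):
    """Return True if the URL points at one of the dynamically loaded pages."""
    parts = urlsplit(url)
    # Ignore the scheme and any trailing slash when comparing.
    key = (parts.hostname or '', parts.path.rstrip('/'))
    return key in DYNAMIC_PAGES
```

Inside `process_response`, the condition would then read `if needs_rendering(request.url):`.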
- Settings file:
    DOWNLOADER_MIDDLEWARES = {
        'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
    }