Use of Selenium in Scrapy

1. The process of using Selenium in Scrapy

  • Override the spider file's constructor and instantiate a Selenium browser object inside it (the browser object only needs to be instantiated once).

  • Override the closed(self, spider) method in the spider file and close the browser object inside it. This method is called when the spider finishes.

  • Override the process_response method of the downloader middleware so that it intercepts the response object and tampers with the page data stored in it.

  • Enable the downloader middleware in the configuration file.

2. Code demonstration

- Spider file:

from scrapy_redis.spiders import RedisSpider
from selenium import webdriver

class WangyiSpider(RedisSpider):
    name = 'wangyi'
    # allowed_domains = ['www.xxxx.com']
    start_urls = ['https://news.163.com']

    def __init__(self):
        # instantiate a browser object (instantiated only once)
        self.bro = webdriver.Chrome(executable_path='/Users/bobo/Desktop/chromedriver')

    # called once at the very end of the crawl; close the browser here
    def closed(self, spider):
        print('spider finished')
        self.bro.quit()
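
Note: the executable_path keyword above targets older Selenium releases. In Selenium 4 and later it is deprecated in favor of a Service object; below is a minimal sketch of the same constructor under that assumption (the headless flag is illustrative and not part of the original post, and the chromedriver path is simply reused from the example above):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

def __init__(self):
    # instantiate the browser object only once and reuse it for every intercepted request
    options = Options()
    options.add_argument('--headless')  # optional: run Chrome without a visible window
    self.bro = webdriver.Chrome(
        service=Service('/Users/bobo/Desktop/chromedriver'),  # path taken from the example above
        options=options,
    )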


- Middleware file:

import time
from scrapy.http import HtmlResponse

class WangyiproDownloaderMiddleware(object):
    # parameter introduction:
    #   request:  the request object corresponding to the intercepted response
    #   response: the intercepted response object (the response the downloader passes to the Spider)
    #   spider:   the spider instance corresponding to the spider file
    def process_response(self, request, response, spider):
        # tamper with the page data stored in the response object
        if request.url in ['http://news.163.com/domestic/', 'http://news.163.com/world/',
                           'http://news.163.com/air/', 'http://war.163.com/']:
            spider.bro.get(url=request.url)
            js = 'window.scrollTo(0, document.body.scrollHeight)'
            spider.bro.execute_script(js)
            time.sleep(2)  # give the browser some buffer time to load the data
            # page_source now contains the dynamically loaded news data for that page
            page_text = spider.bro.page_source
            # tamper with the response object
            return HtmlResponse(url=spider.bro.current_url, body=page_text, encoding='utf-8', request=request)
        else:
            return response
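
The fixed time.sleep(2) above is only a rough buffer; an explicit wait is usually more reliable. Here is a minimal sketch of the same interception step using Selenium's WebDriverWait (the CSS selector '.data_row' is a hypothetical placeholder for whatever element marks the dynamically loaded news list):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

spider.bro.get(url=request.url)
spider.bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
# wait up to 10 seconds for the dynamically loaded content to appear
WebDriverWait(spider.bro, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.data_row'))
)
page_text = spider.bro.page_source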


- Configuration file:

DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}
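
Because the spider above inherits from RedisSpider (scrapy-redis), the configuration file usually also enables the scrapy-redis scheduler and points at a Redis instance. A minimal sketch, assuming scrapy-redis is installed and Redis runs locally on the default port:

# use the scrapy-redis scheduler and duplicate filter (shared request queue)
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER_PERSIST = True  # keep the request queue between runs

# location of the Redis server
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379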
