Introduction
- When crawling certain websites with the scrapy framework, we often encounter pages whose data is loaded dynamically. If scrapy sends a request directly to such a page's url, it will not receive the dynamically loaded part of the data. However, we can observe that when the same url is requested through a browser, the dynamically loaded data does appear. So, to obtain dynamically loaded data in scrapy, we must use selenium to instantiate a browser object, send the request through that browser object, and extract the dynamically loaded data from the rendered page.
Today's details
1. Case Study:
- Requirement: crawl the news data from the Domestic, International, Military, and Drone sections of NetEase News.
- Requirement analysis: When you click the "Domestic" hyperlink and enter the corresponding page, you will find that the news data shown on that page is loaded dynamically; requesting the url directly from the program cannot obtain the dynamically loaded news data. We therefore need to use selenium to instantiate a browser object, request the url through that object, and obtain the dynamically loaded news data.
2. Analysis of how selenium works inside scrapy:
When the engine submits the request for a section's url to the downloader, the downloader fetches the page data and packages it into a response object, which it hands back to the engine; the engine then forwards the response to the Spider. The page data stored in the response object the Spider receives does not contain the dynamically loaded news. To get the dynamically loaded news data, we need to intercept the response object in the downloader middleware, before the engine passes it to the Spider, and tamper with the page data stored inside it, replacing it with page data that does carry the dynamically loaded news. The tampered response object is then handed to the Spider for parsing.
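The interception flow described above can be sketched without scrapy or selenium at all. The `Response` and `DynamicPageMiddleware` classes below are hypothetical stand-ins for `scrapy.http.HtmlResponse` and the real downloader middleware, and `render` stands in for the browser object:

```python
class Response:
    """Hypothetical stand-in for scrapy's HtmlResponse: just a url and a body."""
    def __init__(self, url, body):
        self.url = url
        self.body = body

class DynamicPageMiddleware:
    """Swaps the body of responses whose pages load their data via JavaScript."""
    def __init__(self, dynamic_urls, render):
        self.dynamic_urls = set(dynamic_urls)  # urls of the target section pages
        self.render = render  # callable returning fully rendered HTML (a browser in the real project)

    def process_response(self, response):
        if response.url in self.dynamic_urls:
            # Tamper with the response: replace the static body with the
            # rendered page source (bro.page_source in the real middleware).
            return Response(response.url, self.render(response.url))
        return response  # all other responses pass through unchanged

# Usage: the "browser" here is faked with a lambda for illustration.
mw = DynamicPageMiddleware(
    dynamic_urls=["https://news.example/domestic"],
    render=lambda url: '<div class="ndi_main">dynamically loaded news</div>',
)
static = Response("https://news.example/domestic", "<div></div>")
swapped = mw.process_response(static)
print(swapped.body)  # → <div class="ndi_main">dynamically loaded news</div>
```

The key design point is that the Spider never knows the swap happened: it receives a normal response object, only with a body that now carries the rendered content.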
3. The workflow for using selenium in scrapy:
- Override the spider's constructor and instantiate a selenium browser object in it (the browser object only needs to be instantiated once).
- Override the spider's closed(self, spider) method and close the browser object inside it. This method is invoked when the spider finishes.
- Override the process_response method of the downloader middleware so it intercepts the response object and tampers with the page data stored inside it.
- Enable the downloader middleware in the configuration file.
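The last step amounts to a single entry in the project's settings.py (a minimal sketch, assuming the project and middleware names used throughout this post):

```python
# settings.py: register the custom downloader middleware so that scrapy
# routes every response through its process_response hook.
DOWNLOADER_MIDDLEWARES = {
    # 543 is the conventional priority slot scrapy generates for a
    # project's own downloader middleware.
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}
```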
4. Example:
# 1. spider file
import scrapy
from wangyiPro.items import WangyiproItem
from selenium import webdriver

class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.con']
    start_urls = ['http://www.xxx.con/']
    # instantiate the browser object; this is executed only once
    bro = webdriver.Chrome(executable_path='chromedriver.exe')
    urls = []  # stores the urls of the 5 target sections

    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        for index in [3, 4, 6, 7, 8]:
            li = li_list[index]
            new_url = li.xpath('./a/@href').extract_first()
            self.urls.append(new_url)
            # send a request to each of the 5 section urls
            yield scrapy.Request(url=new_url, callback=self.parse_news)

    # parse the news data of each section page (only the title here)
    def parse_news(self, response):
        div_list = response.xpath('//div[@class="ndi_main"]/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            news_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            # instantiate an item object to store the parsed title and content
            item = WangyiproItem()
            item['title'] = title
            # manually request the news detail page to get the news content
            yield scrapy.Request(url=news_detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']
        # parse the news content out of the response
        content = response.xpath('//div[@id="endText"]//text()').extract()
        content = ''.join(content)
        item['content'] = content
        yield item

    def closed(self, spider):
        # called after the spider finishes; close the browser here
        print('spider finished ~~~~~~~~~~~~~~~~~~~')
        self.bro.quit()
----------------------------------------------------------------------------------------
# 2. items file
import scrapy

class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
----------------------------------------------------------------------------------------
# 3. middlewares file
from scrapy import signals
from scrapy.http import HtmlResponse
from time import sleep

class WangyiproDownloaderMiddleware(object):
    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        # check whether this response belongs to one of the 5 sections;
        # if so, process it
        if response.url in spider.urls:
            # get the browser object defined in the spider
            bro = spider.bro
            bro.get(response.url)
            # scroll down repeatedly so the page loads all of its news data
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            # get the page source, which carries the dynamically loaded news data
            page_text = bro.page_source
            # instantiate a new response object
            new_response = HtmlResponse(url=response.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            return response

    def process_exception(self, request, exception, spider):
        pass
----------------------------------------------------------------------------------------
# 4. pipelines file
class WangyiproPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
----------------------------------------------------------------------------------------
# 5. settings file
BOT_NAME = 'wangyiPro'
SPIDER_MODULES = ['wangyiPro.spiders']
NEWSPIDER_MODULE = 'wangyiPro.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'wangyiPro.pipelines.WangyiproPipeline': 300,
}
LOG_LEVEL = 'ERROR'