Python web crawler -- the use of Selenium in Scrapy

Introduction

  • When crawling data from certain websites with the Scrapy framework, we often run into pages whose data is loaded dynamically. If we let Scrapy send a request for such a URL directly, there is no way to obtain the dynamically loaded part of the data. However, by observing the page we find that when the same URL is requested through a browser, the dynamically loaded data is returned. So if we also want to obtain the dynamically loaded data in Scrapy, we have to use Selenium to create a browser object, send the request through that browser object, and retrieve the dynamically loaded data from it, as the short sketch below illustrates.
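As a quick illustration of this difference, here is a minimal sketch (not part of the original case; the NetEase section URL and the availability of chromedriver on the PATH are assumptions) comparing the HTML returned by a plain HTTP request with the page source produced by a Selenium-driven browser:

# Minimal sketch: plain request vs. browser-rendered page (assumed example URL)
import requests
from time import sleep
from selenium import webdriver

url = 'https://news.163.com/domestic/'       # assumed section URL, adjust as needed

static_html = requests.get(url).text         # the dynamically loaded blocks are missing here

bro = webdriver.Chrome()                     # assumes chromedriver can be found on the PATH
bro.get(url)                                 # the browser executes the page's JavaScript
sleep(2)                                     # give the page a moment to render its dynamic content
rendered_html = bro.page_source              # now contains the dynamically loaded data
bro.quit()

print(len(static_html), len(rendered_html))  # the rendered page is usually much larger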

Today's details

1. Case Study:

    Requirement: crawl the news data under the Domestic, International, Military and UAV sections of NetEase News.

    - Demand analysis: when you click the Domestic hyperlink and enter the corresponding Domestic page, you will find that the news data shown on that page is loaded dynamically; if the program requests that URL directly, it cannot obtain the dynamically loaded news data. We therefore need to use Selenium to instantiate a browser object, request the URL through that object, and obtain the dynamically loaded news data from it.

2. Analysis of the principle of using Selenium in Scrapy:

When the engine submits the request for the Domestic section URL to the downloader, the downloader downloads the page data, packages it into a response object, and submits it back to the engine, which then forwards the response to the spider. The page data stored in the response object the spider receives does not contain the dynamically loaded news data. To obtain that data, the response object that the downloader submits to the engine has to be intercepted in the downloader middleware: the page data stored inside it is tampered with and replaced by page data that does carry the dynamically loaded news, and the tampered response object is then handed on to the spider for parsing.
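The core of this interception can be condensed into a few lines (a sketch of the idea only; the class name here is illustrative, and the complete working middleware appears in the example in section 4): the middleware checks whether the response belongs to one of the dynamically loaded section pages and, if so, replaces its body with the browser-rendered page source.

# Condensed sketch of the interception step in a downloader middleware
from scrapy.http import HtmlResponse

class SeleniumTamperMiddleware(object):
    def process_response(self, request, response, spider):
        if response.url in spider.urls:            # only the section pages need tampering
            spider.bro.get(response.url)           # let the real browser render the page
            return HtmlResponse(url=response.url,
                                body=spider.bro.page_source,
                                encoding='utf-8',
                                request=request)   # tampered response handed on to the spider
        return response                            # everything else passes through untouched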

3. Procedure for using Selenium in Scrapy:

  • Override the constructor of the spider file and use Selenium to instantiate a browser object in it (because the browser object only needs to be instantiated once)
  • Override the closed(self, spider) method of the spider file and close the browser object inside it. This method is called when the spider finishes
  • Override the process_response method of the downloader middleware so that it intercepts the response objects and tampers with the page data stored in them
  • Enable the downloader middleware in the settings file

4. Example:

# 1. spider file

import scrapy
from wangyiPro.items import WangyiproItem
from selenium import webdriver

class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.con']
    start_urls = ['http://www.xxx.con/']
    # instantiating the browser object is performed only once
    bro = webdriver.Chrome(executable_path='chromedriver.exe')

    urls = []  # eventually stores the urls of the five target sections

    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        for index in [3, 4, 6, 7, 8]:
            li = li_list[index]
            new_url = li.xpath('./a/@href').extract_first()
            self.urls.append(new_url)

            # send a request for each of the five section urls
            yield scrapy.Request(url=new_url, callback=self.parse_news)

    # parse the news data of each section (only the title is parsed here)
    def parse_news(self, response):
        div_list = response.xpath('//div[@class="ndi_main"]/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            news_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            # instantiate an item object and store the parsed title in it
            item = WangyiproItem()
            item['title'] = title
            # manually send a request for the detail page url to get the news content
            yield scrapy.Request(url=news_detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']
        # parse the news content out of the response
        content = response.xpath('//div[@id="endText"]//text()').extract()
        content = ''.join(content)

        item['content'] = content
        yield item

    def closed(self, spider):
        # called when the spider finishes; close the browser here
        print('spider finished ~~~~~~~~~~~~~~~~~~~')
        self.bro.quit()
------------------------------------------------------------------------------------------
# 2. items file

import scrapy

class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
------------------------------------------------------------------------------------------
# 3. middlewares file

from scrapy import signals
from scrapy.http import HtmlResponse
from time import sleep

class WangyiproDownloaderMiddleware(object):

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        # determine whether this response corresponds to one of the five sections;
        # only those response objects need to be tampered with
        if response.url in spider.urls:
            # get the browser object defined in the spider
            bro = spider.bro
            bro.get(response.url)

            # scroll to the bottom several times so that all the dynamically
            # loaded news data gets rendered
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)

            # page source that now carries the dynamically loaded news data
            page_text = bro.page_source
            # instantiate a new response object with the tampered page data
            new_response = HtmlResponse(url=response.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            return response

    def process_exception(self, request, exception, spider):
        pass
----------------------------------------------------------------------------------------
# 4. pipelines file

class WangyiproPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
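The pipeline above only prints each item. If the crawled news should also be persisted, a minimal sketch (not part of the original article; the class and file names are illustrative) of a file-writing pipeline could look like this; it would also have to be registered in ITEM_PIPELINES:

# Minimal sketch: persist items to a text file instead of only printing them
class WangyiproFilePipeline(object):
    fp = None

    def open_spider(self, spider):
        # called once when the spider starts
        self.fp = open('wangyi_news.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['title'] + '\n' + item['content'] + '\n\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.fp.close()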
----------------------------------------------------------------------------------------
# 5. settings file
BOT_NAME = 'wangyiPro'

SPIDER_MODULES = ['wangyiPro.spiders']
NEWSPIDER_MODULE = 'wangyiPro.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'

ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'wangyiPro.pipelines.WangyiproPipeline': 300,
}

LOG_LEVEL = 'ERROR'
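With the project laid out as above (a wangyiPro Scrapy project with chromedriver.exe in the project root), the crawler is started from the project directory with the standard command scrapy crawl wangyi, where wangyi is the name defined in the spider file.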

 
