Using Selenium in Scrapy through downloader middleware

  • Case study:

  • Requirement: crawl the news data under the Domestic section of Netease News.
  • Requirements analysis: when you click the hyperlink into the Domestic section page, you will find that the news shown on it is loaded dynamically. If the program requests that URL directly, the dynamically loaded news data cannot be obtained. We therefore need to use Selenium to instantiate a browser object, request the URL through that object, and fetch the dynamically loaded news data (a quick check is sketched below).
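
A minimal sketch of how to confirm that the section page really is dynamically loaded (this is not part of the project code; it assumes the requests and lxml libraries and reuses the XPath from the spider below):

from lxml import etree
import requests

url = 'https://news.163.com/domestic/'
headers = {'User-Agent': 'Mozilla/5.0'}

# fetch the raw HTML without a browser
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)

# the news list the spider expects; with a plain request this usually comes back empty,
# because the list is rendered by JavaScript after the page loads
div_list = tree.xpath('//div[@class="ndi_main"]/div')
print(len(div_list))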

    Analysis of the principle of using selenium in scrapy:

  • When the engine submits the request for the Domestic section URL to the downloader, the downloader downloads the page data, packages it into a response object and returns it to the engine, and the engine forwards the response to the spider. The page data stored in the response object the spider receives does not contain the dynamically loaded news. To get that news data, the response object must be intercepted in the downloader middleware before it is submitted back to the engine; the page data stored inside it is tampered with and replaced with data that does carry the dynamically loaded news, and the tampered response object is finally handed to the spider for parsing.

    The process of using selenium in scrapy:

  • Override the spider's constructor and instantiate a Selenium browser object inside it (the browser object only needs to be instantiated once).
  • Override the spider's closed(self, spider) method and close the browser object inside it; this method is called when the spider finishes.
  • Override the downloader middleware's process_response method so that it intercepts the response object and tampers with the page data stored in it.
  • Enable the downloader middleware in the settings file.

           Code:

spider

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from wangyiPro.items import WangyiproItem
"""
Crawl the news titles and content of the Netease News domestic and world sections
"""
class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.163.com']
    start_urls = ['https://news.163.com/domestic/', 'https://news.163.com/world/']


    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--window-position=0,0')   # initial position of the Chrome window
        options.add_argument('--window-size=1080,800')  # initial size of the Chrome window
        self.browser = webdriver.Chrome(executable_path='C://xx//chromedriver.exe', chrome_options=options)

    def parse(self, response):
        div_list = response.xpath('//div[@class="ndi_main"]/div')
        for div_item in div_list:
            title = div_item.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div_item.xpath('./div/div[1]/h3/a/@href').extract_first()
            item = WangyiproItem()
            item['title'] = title

            # issue a request for the news detail page, passing item along through meta
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail, meta={'item': item})

    # parse the news content
    def parse_detail(self, response):
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content.strip()

        yield item

    def closed(self,spider):
        self.browser.quit()
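
The constructor above uses the Selenium 3 style arguments executable_path and chrome_options. Both are deprecated in Selenium 4, so as a rough sketch (assuming the same driver path and window options) the spider's __init__ could instead be written like this:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--window-position=0,0')   # initial position of the Chrome window
        options.add_argument('--window-size=1080,800')  # initial size of the Chrome window
        # Selenium 4 takes the driver path through a Service object and the options keyword
        self.browser = webdriver.Chrome(service=Service('C://xx//chromedriver.exe'), options=options)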

middleware

from scrapy import signals
from time import sleep
from scrapy.http import HtmlResponse
class WangyiproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    # intercept the response object and tamper with its page data
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Only the specified response objects are tampered with;
        # the target responses are identified through request.url.
        # spider is the running spider object, so the browser it defined can be reused here.
        bro = spider.browser  # get the browser object defined on the spider
        if request.url in spider.start_urls:
            # tamper with the response: instantiate a new response object (containing the
            # dynamically loaded news data) to replace the original old response object.
            # Selenium makes it easy to obtain the dynamically loaded data.
            bro.get(request.url)
            sleep(3)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            page_text = bro.page_source  # now contains the dynamically loaded content
            new_response = HtmlResponse(url=request.url, body=page_text, encoding="utf-8", request=request)

            return new_response
        else:
            # all other requests keep their original response
            # Must either:
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

pipeline file

import pymysql

class WangyiproPipeline(object):
    # constructor
    def __init__(self):
        self.conn = None    # connection attribute
        self.cursor = None
        self.num = 0

    # the following methods override the parent class:
    # executed once when the spider starts
    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='192.168.xx.xx', port=3306, user='root', password='xx', db='xx_db',
                                    charset='utf8')
        print('spider started, database connected')

    # handle each item object
    # this method is called once per item, so opening and closing the connection are done
    # in the other two methods, each of which is executed only once.
    def process_item(self, item, spider):
        author = item['title']
        content = item['content']
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute('insert into qiubai values(%s,%s)', (author, content))
            self.conn.commit()
        except Exception as e:
            print(e, content[0:20])
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('spider finished, database connection closed')
        self.cursor.close()
        self.conn.close()
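
The insert statement assumes that a two-column table named qiubai already exists in the xx_db database. The original post does not show the schema, so the column names below are assumptions; only the table name and the two-column layout come from the insert above. A one-off sketch for creating it with pymysql:

import pymysql

# assumed schema: title and content are made-up column names
conn = pymysql.Connect(host='192.168.xx.xx', port=3306, user='root', password='xx', db='xx_db', charset='utf8')
cursor = conn.cursor()
cursor.execute('create table if not exists qiubai (title varchar(255), content text)')
conn.commit()
cursor.close()
conn.close()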

 

items file

 

import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    pass

 

settings file

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'  # disguise the identity of the request
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False  # do not obey the robots protocol
# only display log messages of the specified level
LOG_LEVEL = 'ERROR'

# the Configure maximum Concurrent Requests Performed by Scrapy (default: 16) 
# CONCURRENT_REQUESTS = 32 

# the Configure a Delay for Requests for the Same Website ( default: 0) 
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'wangyiPro.middlewares.WangyiproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'wangyiPro.pipelines.WangyiproPipeline': 300,
}
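
With the downloader middleware and the item pipeline enabled as above, the crawl is started from the project directory with the usual Scrapy command:

scrapy crawl wangyi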

 
