Scrapy topic (V): custom extensions

Through the extension mechanism Scrapy provides, we can write custom functionality and hook it into Scrapy.

I. Writing a simple extension

We will now write an extension that counts the total number of items scraped.
Create an extensions.py:

# extensions.py
# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.exceptions import NotConfigured

class StatsItemCount(object):
    def __init__(self):
        self.item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        ext = cls()

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened,
                                signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed,
                                signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        spider.logger.info("-----------opened spider %s", spider.name)

    def spider_closed(self, spider):
        spider.logger.info("------------closed spider %s", spider.name)
        spider.logger.info("A total of %d items acquired", self.item_count)

    def item_scraped(self, item, spider):
        self.item_count += 1
  1. from_crawler registers the signal handlers
  2. The item_scraped method counts every item that is parsed out
  3. spider_closed logs the total number of items crawled
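The signal wiring that the steps above rely on can be sketched in plain Python. The `SignalDispatcher` class below is a simplified, hypothetical stand-in for `crawler.signals`, written only to illustrate the connect/dispatch pattern; it is not Scrapy's actual implementation.

```python
from collections import defaultdict

class SignalDispatcher:
    """Maps a signal to the callbacks connected to it (stand-in for crawler.signals)."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def connect(self, receiver, signal):
        # register a callback for a signal, like crawler.signals.connect(...)
        self._handlers[signal].append(receiver)

    def send(self, signal, **kwargs):
        # fire a signal: invoke every callback registered for it
        for receiver in self._handlers[signal]:
            receiver(**kwargs)

class StatsItemCount:
    def __init__(self):
        self.item_count = 0

    def item_scraped(self, item):
        self.item_count += 1

dispatcher = SignalDispatcher()
ext = StatsItemCount()
dispatcher.connect(ext.item_scraped, signal="item_scraped")

# simulate three items being scraped
for page_item in [{"id": 1}, {"id": 2}, {"id": 3}]:
    dispatcher.send("item_scraped", item=page_item)

print(ext.item_count)  # -> 3
```

In real Scrapy the engine plays the role of `dispatcher.send`, firing `signals.item_scraped` for every item, which is why the extension only needs to connect its methods in `from_crawler`.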

Enable the extension:

# settings.py
EXTENSIONS = {
   'ccidcom.extensions.StatsItemCount': 999,
}

Run the crawler:
scrapy crawl ccidcomSpider

...
2019-11-21 16:53:23 [ccidcomSpider] INFO: -----------opened spider ccidcomSpider
2019-11-21 16:53:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-21 16:53:23 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-21 16:53:23 [ccidcomSpider] INFO: ------------closed spider ccidcomSpider
2019-11-21 16:53:23 [ccidcomSpider] INFO: A total of 10 items acquired
...

As you can see, putting this logic in an extension instead of in spiders or middleware keeps the project tidy and easy to extend.

II. Scrapy's built-in extensions

1. Log stats extension

scrapy.extensions.logstats.LogStats
Periodically logs basic crawl statistics (pages crawled and items scraped).

2. Core stats extension

scrapy.extensions.corestats.CoreStats
Collects core statistics; it only takes effect when LogStats is enabled.
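As a small settings sketch, LogStats's logging interval can be tuned with the LOGSTATS_INTERVAL setting (in seconds; 60.0 is Scrapy's default):

```python
# settings.py
LOGSTATS_INTERVAL = 60.0  # seconds between periodic stats log lines
```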

3. Telnet console extension

scrapy.extensions.telnet.TelnetConsole
Provides a telnet console for debugging a running crawler; more on this in a later post on debugging.

4. Memory usage extension

scrapy.extensions.memusage.MemoryUsage
Monitors memory usage. This extension does not work on Windows.

  1. Closes the spider when memory usage exceeds a limit
  2. Sends a warning email when memory usage exceeds a threshold

Configuration items:
MEMUSAGE_LIMIT_MB: memory limit; when it is reached, the crawler is closed
MEMUSAGE_WARNING_MB: memory peak at which the warning email is sent
MEMUSAGE_NOTIFY_MAIL: email address to notify
MEMUSAGE_CHECK_INTERVAL_SECONDS: check interval, in seconds
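A sketch of how these settings might look in settings.py; the numeric values and the address are illustrative, not recommendations:

```python
# settings.py
MEMUSAGE_ENABLED = True                      # turn the extension on
MEMUSAGE_LIMIT_MB = 2048                     # close the crawler past 2 GB (illustrative)
MEMUSAGE_WARNING_MB = 1536                   # send the warning email past 1.5 GB (illustrative)
MEMUSAGE_NOTIFY_MAIL = ["ops@example.com"]   # illustrative address
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0       # how often to sample memory
```

The warning threshold should sit below the hard limit so the email arrives before the crawler is killed.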

5. Memory debugger extension

scrapy.extensions.memdebug.MemoryDebugger
This extension collects the following information:

  1. Objects not collected by the Python garbage collector
  2. Other objects that are still alive but should not be

Configuration items:
MEMDEBUG_ENABLED: when enabled, the memory information is recorded in the stats
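Enabling it is a one-line settings change:

```python
# settings.py
MEMDEBUG_ENABLED = True  # record the memory-debug information in the stats
```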

6. Automatic close-spider extension

scrapy.extensions.closespider.CloseSpider
Closes the crawler when a specified condition is reached.

Configuration items:
CLOSESPIDER_TIMEOUT: close the spider automatically after it has run this many seconds; default 0 (never close)
CLOSESPIDER_ITEMCOUNT: close the crawler after it has scraped this many items; default 0 (never close)
CLOSESPIDER_PAGECOUNT: close the crawler after it has crawled this many pages; default 0 (never close)
CLOSESPIDER_ERRORCOUNT: close the crawler after this many errors have occurred while it runs; default 0 (never close)
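A settings sketch combining several stop conditions; the values are illustrative, and a setting of 0 leaves that condition disabled:

```python
# settings.py
CLOSESPIDER_TIMEOUT = 3600      # stop after one hour of running (illustrative)
CLOSESPIDER_ITEMCOUNT = 10000   # ...or after 10,000 items (illustrative)
CLOSESPIDER_PAGECOUNT = 0       # 0 disables the page-count condition
CLOSESPIDER_ERRORCOUNT = 50     # ...or after 50 errors (illustrative)
```

Whichever condition is hit first closes the spider.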

7. StatsMailer extension

scrapy.extensions.statsmailer.StatsMailer
Sends an email, including the collected statistics, after the crawl completes.

Configuration items:
STATSMAILER_RCPTS: receiving email address
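StatsMailer sends through Scrapy's mail settings, so the recipient list is usually configured alongside them. A sketch with illustrative addresses and host:

```python
# settings.py
STATSMAILER_RCPTS = ["stats@example.com"]  # illustrative recipient list
MAIL_FROM = "scrapy@example.com"           # illustrative sender address
MAIL_HOST = "localhost"                    # SMTP server to send through
```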


Origin www.cnblogs.com/qiu-hua/p/12638732.html