Through the extension mechanism Scrapy provides, we can write custom functionality and plug it into Scrapy.
First, write a simple extension
We will write an extension that counts the total number of items scraped.
Create `extensions.py`:
```python
# extensions.py
# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.exceptions import NotConfigured


class StatsItemCount(object):
    def __init__(self):
        self.item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        ext = cls()
        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        # return the extension object
        return ext

    def spider_opened(self, spider):
        spider.logger.info("-----------opened spider %s", spider.name)

    def spider_closed(self, spider):
        spider.logger.info("------------closed spider %s", spider.name)
        spider.logger.info("A total of {} items scraped".format(self.item_count))

    def item_scraped(self, item, spider):
        self.item_count += 1
```
- In `from_crawler`, register the handlers for the signals we care about
- The `item_scraped` method counts every item that is parsed
- In `spider_closed`, log how many items were crawled in total
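The connect/send flow described above can be illustrated with a minimal, self-contained sketch. Note that `SignalManager` below is a hypothetical stand-in written for illustration, not Scrapy's real signal machinery:

```python
# Minimal sketch of the signal pattern the extension relies on.
# SignalManager here is a hypothetical stand-in for Scrapy's signal
# machinery -- it only illustrates the connect/send flow.

class SignalManager:
    def __init__(self):
        self._handlers = {}

    def connect(self, handler, signal):
        # register a handler for a named signal
        self._handlers.setdefault(signal, []).append(handler)

    def send(self, signal, **kwargs):
        # call every handler registered for that signal
        for handler in self._handlers.get(signal, []):
            handler(**kwargs)


class StatsItemCount:
    def __init__(self):
        self.item_count = 0

    def item_scraped(self, item):
        self.item_count += 1


signals = SignalManager()
ext = StatsItemCount()
signals.connect(ext.item_scraped, signal="item_scraped")

for item in [{"t": 1}, {"t": 2}, {"t": 3}]:
    signals.send("item_scraped", item=item)

print(ext.item_count)  # 3
```

Scrapy's real `crawler.signals` works on the same principle: the extension registers bound methods, and the engine fires the matching signal at each lifecycle event.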
Enable the extension:
```python
# settings.py
EXTENSIONS = {
    'ccidcom.extensions.StatsItemCount': 999,
}
```
Run the spider: `scrapy crawl ccidcomSpider`
```
...
2019-11-21 16:53:23 [ccidcomSpider] INFO: -----------opened spider ccidcomSpider
2019-11-21 16:53:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-21 16:53:23 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-21 16:53:23 [ccidcomSpider] INFO: ------------closed spider ccidcomSpider
2019-11-21 16:53:23 [ccidcomSpider] INFO: A total of 10 items scraped
...
```
As you can see, writing this logic as an extension, rather than cramming it into the spider or a middleware, keeps the project tidy and easy to extend.
Two, Scrapy built-in extensions
1. Log stats extension
scrapy.extensions.logstats.LogStats
Periodically logs basic statistics, such as crawled pages and scraped items.
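The logging interval can be tuned in settings.py; a sketch, where the 30-second value is just an example:

```python
# settings.py
# Log the crawl/scrape rate every 30 seconds instead of the default 60.
LOGSTATS_INTERVAL = 30.0
```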
2. Core stats extension
scrapy.extensions.corestats.CoreStats
Collects core statistics (start/finish time, item and response counts); it only takes effect when stats collection is enabled.
3. Telnet console extension
scrapy.extensions.telnet.TelnetConsole
Provides a telnet console for debugging the running crawler; more on this later when we cover spider debugging.
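The console's related settings can be adjusted in settings.py. A sketch, where the username and password values are hypothetical examples (recent Scrapy versions require credentials and auto-generate a password if none is set):

```python
# settings.py
TELNETCONSOLE_ENABLED = True
TELNETCONSOLE_PORT = [6023, 6073]   # port range to try; first free one wins
TELNETCONSOLE_USERNAME = 'scrapy'   # example credentials -- adjust to taste
TELNETCONSOLE_PASSWORD = 'secret'
```

Once the spider is running, connect with `telnet localhost 6023`.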
4. Memory usage extension
scrapy.extensions.memusage.MemoryUsage
Monitors the memory used by the crawler process; this extension does not work on Windows.
- Closes the spider when memory usage exceeds a limit
- Sends a notification e-mail when memory usage exceeds a warning threshold
Configuration:
MEMUSAGE_LIMIT_MB: memory limit; the crawler is closed when it is reached
MEMUSAGE_WARNING_MB: memory threshold at which a warning e-mail is sent
MEMUSAGE_NOTIFY_MAIL: e-mail address to notify
MEMUSAGE_CHECK_INTERVAL_SECONDS: check interval, in seconds
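Putting those options together, a settings.py sketch (the thresholds and address are example values):

```python
# settings.py -- example values, adjust to your environment
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 1024                      # close the spider above 1 GB
MEMUSAGE_WARNING_MB = 800                     # warn by e-mail above 800 MB
MEMUSAGE_NOTIFY_MAIL = ['admin@example.com']  # where notifications go
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0        # poll memory once a minute
```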
5. Memory debugger extension
scrapy.extensions.memdebug.MemoryDebugger
This extension collects the following information:
- objects not collected by the Python garbage collector
- objects that should have been freed but are still being referenced
Configuration:
MEMDEBUG_ENABLED: when enabled, the memory information is recorded in the stats
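Enabling it is a one-line change:

```python
# settings.py
MEMDEBUG_ENABLED = True  # record garbage-collector info in the crawl stats
```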
6. Close spider extension
scrapy.extensions.closespider.CloseSpider
Closes the crawler automatically when a specified condition is met.
Configuration:
CLOSESPIDER_TIMEOUT: close the spider after it has been running this many seconds; default 0 (never close)
CLOSESPIDER_ITEMCOUNT: close the spider after this many items have been scraped; default 0 (never close)
CLOSESPIDER_PAGECOUNT: close the spider after this many pages have been crawled; default 0 (never close)
CLOSESPIDER_ERRORCOUNT: close the spider after this many errors have occurred; default 0 (never close)
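A settings.py sketch combining a few of these conditions (the thresholds are example values; whichever condition is hit first closes the spider):

```python
# settings.py -- example thresholds, adjust to your crawl
CLOSESPIDER_TIMEOUT = 3600     # stop after one hour of running
CLOSESPIDER_ITEMCOUNT = 1000   # ...or after 1000 items scraped
CLOSESPIDER_ERRORCOUNT = 10    # ...or after 10 errors
```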
7. StatsMailer extension
scrapy.extensions.statsmailer.StatsMailer
Sends an e-mail when the crawl finishes, including the statistics collected.
Configuration:
STATSMAILER_RCPTS: recipient e-mail addresses
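Since the report goes out by e-mail, the mail settings also need to be configured. A sketch with example addresses and host:

```python
# settings.py -- addresses and host are example values
STATSMAILER_RCPTS = ['you@example.com']  # who receives the stats report
MAIL_FROM = 'scrapy@example.com'         # sender address
MAIL_HOST = 'localhost'                  # SMTP server to send through
```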