Solving the Scrapy-Redis idle-run problem: automatically close the crawler once the redis_key links are exhausted

Question:

  • In the scrapy-redis framework, the requests stored in Redis under xxx:requests have all been crawled, but the program keeps running. How do we stop the program automatically and end the idle run?

I'm sure this has given plenty of people a headache, especially since the many posts on the Internet don't really solve it. Let's see how I tackled this problem.

A bit of background

Distributed scaling:

We know that Scrapy runs on a single machine by default, so how does scrapy-redis turn it into multi-machine collaboration?
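In short, scrapy-redis swaps Scrapy's scheduler and duplicate filter for Redis-backed versions, so every worker pulls requests from, and reports fingerprints to, the same Redis server. Below is a minimal settings sketch; the Redis address is just a placeholder for illustration:

    # settings.py -- minimal scrapy-redis distributed setup (sketch; adjust the Redis address)
    # Use the Redis-backed scheduler so all workers share one request queue
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Deduplicate requests across all workers through Redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Keep the queue between runs instead of clearing it when the spider closes (optional)
    SCHEDULER_PERSIST = True

    # The shared Redis server (placeholder address)
    REDIS_URL = "redis://127.0.0.1:6379"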

First, let's understand why the crawler keeps waiting instead of closing:

1. Scrapy's internal signal system triggers the spider_idle signal when the spider exhausts the requests in its internal queue.

2. When the crawler's signal manager receives the spider_idle signal, it calls every handler registered for spider_idle.

3. After all handlers of the signal have been called, if the spider is still idle, the engine closes the spider.

scrapy-redis handles this by registering its own spider_idle() handler for the spider_idle signal on the signal manager. When spider_idle is triggered, the signal manager calls spider_idle() on the spider. The relevant scrapy-redis source looks like this:

    def spider_idle(self):
        """Schedules a request if available, otherwise waits."""
        # XXX: Handle a sentinel to close the spider.
        self.schedule_next_requests()    # call schedule_next_requests() to generate new requests from redis
        raise DontCloseSpider            # raise DontCloseSpider so the engine keeps the spider alive

The approach:

  • From the analysis above, we know that the key to shutting the crawler down is the spider_idle signal.
  • The spider_idle signal is only triggered when the spider's queue is empty, and it fires roughly every 5 seconds while the spider stays idle.
  • So we can use the same trick: register our own spider_idle() handler for the spider_idle signal on the signal manager.
  • Inside that spider_idle() handler, write an end condition that closes the crawler.

Solution:

  • Close the crawler after redis_key has stayed empty for a certain amount of time.

Here is how to implement it:

We do this with a Scrapy extension; of course, you could also implement it in a pipeline.

The extension framework provides a mechanism for binding custom functionality to Scrapy. Extensions are just ordinary classes that are instantiated and initialized when Scrapy starts. For details, see the Scrapy documentation on extensions.

  • In the same directory as the settings.py file, create a file called extensions.py.
  • Put the following code in it:
# -*- coding: utf-8 -*-
# Extension that closes an idle scrapy-redis spider
import logging
import time
from scrapy import signals
from scrapy.exceptions import NotConfigured
logger = logging.getLogger(__name__)


class RedisSpiderSmartIdleClosedExensions(object):

    def __init__(self, idle_number, crawler):
        self.crawler = crawler
        self.idle_number = idle_number
        self.idle_list = []
        self.idle_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # First check whether the extension should be enabled;
        # raise NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # Read the number of idle time units from the settings; default is 360, i.e. 30 minutes
        idle_number = crawler.settings.getint('IDLE_NUMBER', 360)

        # Instantiate the extension object
        ext = cls(idle_number, crawler)

        # Connect the extension object to the signals,
        # binding signals.spider_idle to the spider_idle() method
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s redis spider idle, continuous idle limit: %d", spider.name, self.idle_number)

    def spider_closed(self, spider):
        logger.info("closed spider %s, idle count %d, continuous idle count %d",
                    spider.name, self.idle_count, len(self.idle_list))

    def spider_idle(self, spider):
        self.idle_count += 1                   # total idle count
        self.idle_list.append(time.time())     # record a timestamp every time spider_idle fires
        idle_list_len = len(self.idle_list)    # number of consecutive idle triggers so far

        # Check whether the gap between this trigger and the previous one exceeds the
        # ~5 second idle heartbeat (threshold of 6 s here). If it does, the spider did some
        # work in between, i.e. redis still had requests, so reset the consecutive-idle streak.
        if idle_list_len > 2 and self.idle_list[-1] - self.idle_list[-2] > 6:
            self.idle_list = [self.idle_list[-1]]

        elif idle_list_len > self.idle_number:
            # Close the spider once the number of consecutive idle triggers reaches the configured limit
            logger.info('\n consecutive idle count exceeded {} times'
                        '\n idle shutdown condition met, closing the spider'
                        '\n idle start time: {},  close spider time: {}'.format(self.idle_number,
                                                                                self.idle_list[0], self.idle_list[-1]))
            # Close the spider
            self.crawler.engine.close_spider(spider, 'closespider_pagecount')

  • Add the following configuration to settings.py; replace lianjia_ershoufang with your own project/module name.

MYEXT_ENABLED = True    # enable the extension
IDLE_NUMBER = 360       # idle duration in time units; one time unit is 5 s, so 360 units is 30 minutes

# Activate the extension in the EXTENSIONS setting
EXTENSIONS = {
    'lianjia_ershoufang.extensions.RedisSpiderSmartIdleClosedExensions': 500,
}

  • With the idle-shutdown extension in place, the crawler will shut itself down after 360 consecutive idle time units (about 30 minutes). For reference, a minimal sketch of the spider side follows.
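This extension pairs with a scrapy-redis spider that reads its start URLs from redis_key. The sketch below is only an illustration; the spider name, key name, and parsing logic are placeholders, not part of the original post:

    # spiders/my_spider.py -- minimal scrapy-redis spider sketch (names are placeholders)
    from scrapy_redis.spiders import RedisSpider


    class MySpider(RedisSpider):
        name = 'my_spider'
        # The spider waits on this Redis key for start URLs, which you can push with e.g.:
        #   lpush my_spider:start_urls http://example.com
        redis_key = 'my_spider:start_urls'

        def parse(self, response):
            # Replace with your own extraction logic
            yield {'url': response.url, 'title': response.css('title::text').get()}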

Configuration notes:

MYEXT_ENABLED: whether to enable the extension; True enables it, False disables it.
IDLE_NUMBER: the number of consecutive idle triggers after which the crawler is closed. Once the consecutive idle count exceeds IDLE_NUMBER, the crawler shuts down. Defaults to 360, i.e. 30 minutes (12 time units per minute).

Epilogue

This method only really applies when a batch of links takes longer than 5 seconds to crawl. If your batch of links can finish within 5 seconds, you can add further checks on top of this scheme; the principle is the same, so you can adapt it to your own situation.
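For example, one variant (my own sketch, not part of the extension above) is to check the Redis queue directly inside spider_idle(), so that idle ticks only count while the start-URL key is really empty. This assumes a scrapy-redis spider, whose self.server attribute is the shared redis client and whose redis_key names the start-URL key:

    def spider_idle(self, spider):
        # Variant sketch: only count idle triggers while redis_key is truly empty.
        # spider.server is the redis client that scrapy-redis attaches to its spiders;
        # spider.redis_key is the key the spider reads start URLs from.
        if spider.server.exists(spider.redis_key):
            self.idle_count = 0      # data showed up again, reset the streak
            return

        self.idle_count += 1
        if self.idle_count > self.idle_number:
            logger.info('redis_key %s stayed empty for %d idle checks, closing spider',
                        spider.redis_key, self.idle_count)
            self.crawler.engine.close_spider(spider, 'redis_key_empty_idle')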

Haha, isn't my approach pretty neat!

