[Crawler] Study notes, day 57 — 6.7 scrapy-redis source code analysis (official documentation). Reference: spider


spider.py

This design has the spider read the URLs to crawl from redis and then crawl them; if the crawl yields more URLs, those are processed until every request is complete, after which the spider goes back to reading URLs from redis and the cycle repeats.

Analysis: this is implemented by connecting to the signals.spider_idle signal inside the spider to monitor the crawler's state. When the spider becomes idle, new requests built with make_requests_from_url(url) are returned to the engine and then handed to the scheduler for scheduling.
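Before the source listing, a minimal usage sketch may help make the flow concrete. The spider name, key, and parse logic below are assumptions for illustration; only the RedisSpider base class and the redis_key attribute come from scrapy-redis.

```python
# Minimal sketch of a spider built on scrapy-redis (illustrative names).
from scrapy_redis.spiders import RedisSpider


class DemoSpider(RedisSpider):
    name = 'demo'                    # hypothetical spider name
    redis_key = 'demo:start_urls'    # redis key holding the seed URLs

    def parse(self, response):
        # Hypothetical parse logic: yield an item and follow links.
        yield {'url': response.url, 'title': response.css('title::text').get()}
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
```

Such a spider starts with an empty queue and waits for URLs to be pushed to its redis key; when the queue runs dry it idles instead of closing, because spider_idle raises DontCloseSpider, as the source below shows.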

from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.spiders import Spider, CrawlSpider

from . import connection


# Default batch size matches default concurrent requests setting.
DEFAULT_START_URLS_BATCH_SIZE = 16
DEFAULT_START_URLS_KEY = '%(name)s:start_urls'


class RedisMixin(object):
    """Mixin class to implement reading urls from a redis queue."""
    # Per spider redis key, default to DEFAULT_START_URLS_KEY.
    redis_key = None
    # Fetch this amount of start urls when idle. Default to DEFAULT_START_URLS_BATCH_SIZE.
    redis_batch_size = None
    # Redis client instance.
    server = None

    def start_requests(self):
        """Returns a batch of start requests from redis."""
        return self.next_requests()

    def setup_redis(self, crawler=None):
        """Setup redis connection and idle signal.
        This should be called after the spider has set its crawler object.
        """
        if self.server is not None:
            return

        if crawler is None:
            # We allow optional crawler argument to keep backwards
            # compatibility.
            # XXX: Raise a deprecation warning.
            crawler = getattr(self, 'crawler', None)

        if crawler is None:
            raise ValueError("crawler is required")

        settings = crawler.settings

        if self.redis_key is None:
            self.redis_key = settings.get(
                'REDIS_START_URLS_KEY', DEFAULT_START_URLS_KEY,
            )

        self.redis_key = self.redis_key % {'name': self.name}

        if not self.redis_key.strip():
            raise ValueError("redis_key must not be empty")

        if self.redis_batch_size is None:
            self.redis_batch_size = settings.getint(
                'REDIS_START_URLS_BATCH_SIZE', DEFAULT_START_URLS_BATCH_SIZE,
            )

        try:
            self.redis_batch_size = int(self.redis_batch_size)
        except (TypeError, ValueError):
            raise ValueError("redis_batch_size must be an integer")

        self.logger.info("Reading start URLs from redis key '%(redis_key)s' "
                         "(batch size: %(redis_batch_size)s)", self.__dict__)

        self.server = connection.from_settings(crawler.settings)
        # The idle signal is called when the spider has no requests left,
        # that's when we will schedule new requests from redis queue
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    def next_requests(self):
        """Returns a request to be scheduled or none."""
        use_set = self.settings.getbool('REDIS_START_URLS_AS_SET')
        fetch_one = self.server.spop if use_set else self.server.lpop
        # XXX: Do we need to use a timeout here?
        found = 0
        while found < self.redis_batch_size:
            data = fetch_one(self.redis_key)
            if not data:
                # Queue empty.
                break
            req = self.make_request_from_data(data)
            if req:
                yield req
                found += 1
            else:
                self.logger.debug("Request not made from data: %r", data)

        if found:
            self.logger.debug("Read %s requests from '%s'", found, self.redis_key)

    def make_request_from_data(self, data):
        # By default, data is an URL.
        if '://' in data:
            return self.make_requests_from_url(data)
        else:
            self.logger.error("Unexpected URL from '%s': %r", self.redis_key, data)

    def schedule_next_requests(self):
        """Schedules a request if available"""
        for req in self.next_requests():
            self.crawler.engine.crawl(req, spider=self)

    def spider_idle(self):
        """Schedules a request if available, otherwise waits."""
        # XXX: Handle a sentinel to close the spider.
        self.schedule_next_requests()
        raise DontCloseSpider


class RedisSpider(RedisMixin, Spider):
    """Spider that reads urls from redis queue when idle."""

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        obj = super(RedisSpider, self).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj


class RedisCrawlSpider(RedisMixin, CrawlSpider):
    """Spider that reads urls from redis queue when idle."""

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        obj = super(RedisCrawlSpider, self).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj

The changes to the spider are not large. The main one is that spider_idle is bound to the spider through the signals connect interface. When the spider is initialized, setup_redis establishes the redis connection; afterwards the next_requests function pops start URLs from redis, and whether the pool is read as a set or a list is controlled by the REDIS_START_URLS_AS_SET setting. (Note that the start-URL pool here is not the same thing as the scheduling queue: both live in redis, but they use different keys, so they are effectively different tables.) From a small number of start URLs the spider can discover many new URLs, which are deduplicated and then enter the scheduler to be scheduled. Only when the spider has no URLs left to schedule does the spider_idle signal fire, which triggers the spider's next_requests function again to read another batch of URLs from the start-URL pool in redis. A sketch of how that pool is seeded follows.
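As an illustration of the start-URL pool described above, seed URLs can be pushed into redis from any process. The key name below is an assumption that must match the spider's redis_key; whether lpush or sadd is appropriate depends on the REDIS_START_URLS_AS_SET setting.

```python
# Sketch: seeding the start-URL pool in redis (key name is an assumption).
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# Default: the pool is a redis list, read with server.lpop in next_requests().
r.lpush('demo:start_urls', 'https://example.com/page1', 'https://example.com/page2')

# If REDIS_START_URLS_AS_SET is True, the pool is a set, read with server.spop.
# r.sadd('demo:start_urls', 'https://example.com/page1')
```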

Summary

The general idea of scrapy-redis is this: it rewrites the scheduler and spider classes so that scheduling and the spider's start URLs both go through redis, and it implements new dupefilter and queue classes so that the deduplication and scheduling containers also interact with redis. Because the crawler processes on every host access the same redis database, scheduling and deduplication are managed centrally, which is what makes the crawler distributed. When a spider is initialized, a corresponding scheduler object is initialized with it; by reading the settings, the scheduler configures its own scheduling queue container and its dupefilter deduplication tool. Whenever the spider yields a request, the scrapy core submits it to the spider's scheduler object; the scheduler checks the request for duplicates by consulting redis and, if it is not a duplicate, adds it to the scheduling pool in redis. When the scheduling conditions are met, the scheduler takes a request out of the redis scheduling pool and sends it to the spider to crawl. When the spider has crawled every URL currently available and the scheduler finds its scheduling pool in redis empty, the spider_idle signal is triggered; upon receiving it, the spider connects directly to redis, reads a new batch of URLs from the start-URL pool, and repeats the process from the top. A minimal settings sketch that wires these pieces together is given below.
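A typical minimal configuration that enables the redis-backed scheduler and dupefilter described above might look like the following settings.py fragment; the redis URL is an assumption for a local instance.

```python
# settings.py sketch for a scrapy-redis project (redis URL is an assumption).

# Replace the default scheduler and dupefilter with the redis-backed ones,
# so all crawler processes share one scheduling queue and one fingerprint set.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in redis between runs (pause/resume).
SCHEDULER_PERSIST = True

# Redis connection shared by every host taking part in the crawl.
REDIS_URL = 'redis://localhost:6379/0'

# Optional: read start URLs from a set instead of a list.
# REDIS_START_URLS_AS_SET = True
```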


Origin: blog.csdn.net/qq_35456045/article/details/104111452