Lecture 48: Principles of Scrapy-Redis

As mentioned in the previous lesson, the Scrapy-Redis library provides Scrapy with a distributed crawling queue, scheduler, deduplication filter, and other components. Its GitHub address is: https://github.com/rmax/scrapy-redis .

In this lesson, we will take an in-depth look at how Redis is used to implement distributed Scrapy crawling and gain a deeper understanding of how Scrapy-Redis works.

Get the source code

You can obtain the source code by cloning the repository with the following command:

git clone https://github.com/rmax/scrapy-redis.git

The core source code is in the scrapy-redis/src/scrapy_redis directory.

Crawl queue

Let's start with the crawling queue and look at its concrete implementation. The source file is queue.py, which contains the implementations of three queue classes. It first defines a parent class, Base, which provides some basic methods and properties, as shown below:

from scrapy.utils.reqser import request_to_dict, request_from_dict

from . import picklecompat


class Base(object):
    """Per-spider base queue class"""

    def __init__(self, server, spider, key, serializer=None):
        if serializer is None:
            serializer = picklecompat
        if not hasattr(serializer, 'loads'):
            raise TypeError("serializer does not implement 'loads' function: %r"
                            % serializer)
        if not hasattr(serializer, 'dumps'):
            raise TypeError("serializer '%s' does not implement 'dumps' function: %r"
                            % serializer)
        self.server = server
        self.spider = spider
        self.key = key % {'spider': spider.name}
        self.serializer = serializer

    def _encode_request(self, request):
        obj = request_to_dict(request, self.spider)
        return self.serializer.dumps(obj)

    def _decode_request(self, encoded_request):
        obj = self.serializer.loads(encoded_request)
        return request_from_dict(obj, self.spider)

    def __len__(self):
        """Return the length of the queue"""
        raise NotImplementedError

    def push(self, request):
        """Push a request"""
        raise NotImplementedError

    def pop(self, timeout=0):
        """Pop a request"""
        raise NotImplementedError

    def clear(self):
        """Clear queue/stack"""
        self.server.delete(self.key)

Look first at the _encode_request and _decode_request methods. We need to store Request objects in the database, but the database cannot store the objects directly, so each Request must first be serialized into a string. These two methods perform the serialization and deserialization, implemented with the pickle library (via the picklecompat module). push calls _encode_request to serialize a Request before storing it in the database, and pop calls _decode_request to deserialize the data it takes out.
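To make the round trip concrete, here is a minimal sketch of that serialization. It uses pickle directly rather than scrapy_redis.picklecompat (which is a thin wrapper around pickle), and assumes the request_to_dict / request_from_dict helpers live in scrapy.utils.reqser, as in older Scrapy versions; newer versions expose them from scrapy.utils.request:

import pickle

from scrapy import Request
from scrapy.utils.reqser import request_to_dict, request_from_dict

request = Request('https://example.com', priority=5)

# push side: Request -> dict -> bytes, which can be stored in Redis
encoded = pickle.dumps(request_to_dict(request))

# pop side: bytes -> dict -> Request
decoded = request_from_dict(pickle.loads(encoded))
print(decoded.url, decoded.priority)  # https://example.com 5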

In the parent class, the __len__, push, and pop methods are left unimplemented and raise NotImplementedError directly, so this class cannot be used on its own. A subclass must override these three methods, and different subclasses provide different implementations and therefore different behavior.

Next, subclasses inherit from Base and override these methods. There are three subclass implementations in the source code: FifoQueue, PriorityQueue, and LifoQueue. Let's look at how each of them works.

The first is FifoQueue:

class FifoQueue(Base): 
    """Per-spider FIFO queue""" 
    def __len__(self): 
        """Return the length of the queue""" 
        return self.server.llen(self.key) 
    def push(self, request): 
        """Push a request""" 
        self.server.lpush(self.key, self._encode_request(request)) 
    def pop(self, timeout=0): 
        """Pop a request""" 
        if timeout > 0: 
            data = self.server.brpop(self.key, timeout) 
            if isinstance(data, tuple): 
                data = data[1] 
        else: 
            data = self.server.rpop(self.key) 
        if data: 
            return self._decode_request(data) 

You can see that this class inherits Base and overrides the three methods __len__, push, and pop. All three operate on self.server, which is the Redis connection object, so we can call Redis commands directly against the database. The operations used here are llen, lpush, and rpop, which tells us that the crawling queue is backed by a Redis list: each serialized Request is stored as one element of that list. __len__ returns the length of the list, push calls lpush to insert data at the left end of the list, and pop calls rpop (or the blocking brpop when a timeout is given) to take data out from the right end.

Requests therefore enter the list on the left and leave on the right, giving an ordered first-in, first-out access pattern, known as FIFO (First In First Out), which is where the class name FifoQueue comes from.
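A minimal sketch of this FIFO behavior, using redis-py directly (the key name myspider:requests is only illustrative):

import redis

server = redis.StrictRedis(host='localhost', port=6379)
key = 'myspider:requests'

server.lpush(key, 'request-1')  # push: enter from the left
server.lpush(key, 'request-2')
print(server.rpop(key))  # pop: leave from the right -> b'request-1' (first in, first out)
print(server.rpop(key))  # -> b'request-2'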

There is also an opposite implementation class called LifoQueue, which is implemented as follows:

class LifoQueue(Base): 
    """Per-spider LIFO queue.""" 
    def __len__(self): 
        """Return the length of the stack""" 
        return self.server.llen(self.key) 
    def push(self, request): 
        """Push a request""" 
        self.server.lpush(self.key, self._encode_request(request)) 
    def pop(self, timeout=0): 
        """Pop a request""" 
        if timeout > 0: 
            data = self.server.blpop(self.key, timeout) 
            if isinstance(data, tuple): 
                data = data[1] 
        else: 
            data = self.server.lpop(self.key) 
        if data: 
            return self._decode_request(data) 

The difference from FifoQueue lies in the pop method, which uses the lpop operation (or the blocking blpop when a timeout is given), i.e. it removes elements from the left, while push still uses lpush and inserts from the left. The resulting behavior is last in, first out, LIFO for short, which is why the class is named LifoQueue. Since this access pattern is the same as a stack, it could also be called a StackQueue.
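The corresponding sketch for the LIFO case, again using redis-py directly with an illustrative key name:

import redis

server = redis.StrictRedis(host='localhost', port=6379)
key = 'myspider:requests'

server.lpush(key, 'request-1')  # push: enter from the left
server.lpush(key, 'request-2')
print(server.lpop(key))  # pop: also leave from the left -> b'request-2' (last in, first out)
print(server.lpop(key))  # -> b'request-1'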

In addition, there is one more subclass in the source code, PriorityQueue. As the name suggests, it is a priority queue, implemented as follows:

class PriorityQueue(Base): 
    """Per-spider priority queue abstraction using redis' sorted set""" 
    def __len__(self): 
        """Return the length of the queue""" 
        return self.server.zcard(self.key) 
    def push(self, request): 
        """Push a request""" 
        data = self._encode_request(request) 
        score = -request.priority 
        self.server.execute_command('ZADD', self.key, score, data) 
    def pop(self, timeout=0): 
        """ 
        Pop a request 
        timeout not support in this queue class 
        """ 
        pipe = self.server.pipeline() 
        pipe.multi() 
        pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0) 
        results, count = pipe.execute() 
        if results: 
            return self._decode_request(results[0]) 

Here the __len__, push, and pop methods use the zcard, zadd, and zrange operations of the server object, so the underlying storage is a Redis sorted set, in which every element carries a score, and that score represents the priority.

__len__ calls zcard, which returns the size of the sorted set, i.e. the length of the crawling queue. push calls zadd to add an element to the set, with the score set to the negative of the Request's priority: elements with lower scores sit at the front of the set, so higher-priority Requests end up at the front. pop first calls zrange to fetch the first element of the set, which is therefore the highest-priority Request, and then calls zremrangebyrank to delete that element; the two commands run in a single pipeline, completing the take-out-and-remove operation.
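A minimal sketch of this priority behavior, again using redis-py directly (the key name and member values are illustrative):

import redis

server = redis.StrictRedis(host='localhost', port=6379)
key = 'myspider:requests'

# push: the score is the negated priority, so higher priority -> lower score -> front of the set
for name, priority in [('low', 0), ('high', 10), ('mid', 5)]:
    server.execute_command('ZADD', key, -priority, name)

# pop: fetch and remove the first (lowest-score) element in one pipeline
pipe = server.pipeline()
pipe.multi()
pipe.zrange(key, 0, 0).zremrangebyrank(key, 0, 0)
results, count = pipe.execute()
print(results[0])  # b'high'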

This is the queue used by default; in other words, the crawling queue is stored in a sorted set by default.

Deduplication filter

As mentioned earlier, Scrapy's deduplication is implemented with a set, and deduplication in distributed Scrapy requires a shared set, so the set data structure in Redis is used here. Let's take a look at how the deduplication class is implemented. The source file is dupefilter.py, which implements an RFPDupeFilter class, as shown below:

import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from . import defaults
from .connection import get_redis_from_settings


logger = logging.getLogger(__name__)


class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.
    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.

        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.
        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.
        """
        server = get_redis_from_settings(settings)
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.
        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool
        """
        fp = self.request_fingerprint(request)
        # sadd returns the number of values added: zero if the fingerprint already exists
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str
        """
        return request_fingerprint(request)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional
        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider
        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

A request_seen method is implemented here, very similar to Scrapy's built-in request_seen. The difference is that it uses the sadd operation of the server object, so the set is no longer an in-memory data structure but is stored directly in Redis.

Duplicates are identified by fingerprints, which are computed by the request_fingerprint method just as in Scrapy. Once obtained, the fingerprint is added to the set with sadd. If the addition succeeds, the fingerprint did not previously exist in the set and sadd returns 1. The method returns whether the result equals 0, so a return value of 1 yields False, meaning the Request is not a duplicate; otherwise it is judged to be a duplicate.
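A minimal sketch of this sadd-based check, using redis-py directly with an illustrative key name:

import redis

server = redis.StrictRedis(host='localhost', port=6379)
key = 'myspider:dupefilter'

def seen(fingerprint):
    # sadd returns 1 if the fingerprint was newly added, 0 if it already existed
    return server.sadd(key, fingerprint) == 0

print(seen('abc123'))  # False: first time this fingerprint is seen
print(seen('abc123'))  # True: duplicate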

In this way, we have used a Redis set to record fingerprints and check for duplicates.

Scheduler

Scrapy-Redis also provides a Scheduler that works together with the Queue and DupeFilter described above; the source file is scheduler.py. It exposes several configuration options, such as SCHEDULER_FLUSH_ON_START (whether to flush the crawling queue when crawling starts) and SCHEDULER_PERSIST (whether to keep the crawling queue in Redis after crawling ends). These can be configured freely in settings.py, and the scheduler ties the queue and the dupefilter together.
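For reference, here is a hedged example of how these options might appear in settings.py (the values shown are illustrative choices, not required defaults):

# Use the Scrapy-Redis scheduler and dupefilter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

SCHEDULER_PERSIST = True          # keep the queue and fingerprints in Redis after the crawl ends
SCHEDULER_FLUSH_ON_START = False  # do not flush the crawling queue when the crawl starts
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"  # the default queue class

# Redis connection (illustrative address)
REDIS_URL = "redis://localhost:6379"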

Next we look at the two core access methods, the implementation is as follows:

def enqueue_request(self, request): 
    if not request.dont_filter and self.df.request_seen(request): 
        self.df.log(request, self.spider) 
        return False 
    if self.stats: 
        self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider) 
    self.queue.push(request) 
    return True 
def next_request(self): 
    block_pop_timeout = self.idle_before_close 
    request = self.queue.pop(block_pop_timeout) 
    if request and self.stats: 
        self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider) 
    return request 

enqueue_request adds a Request to the queue: it first consults the dupefilter, and its core operation is calling the queue's push, plus some statistics and logging. next_request takes a Request from the queue: its core operation is calling the queue's pop. If there is a Request in the queue, it is taken out and crawling continues; otherwise, when the queue is empty, crawling starts again once new Requests arrive.
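As a rough sketch of that push/pop flow under assumed names (a demo spider and an illustrative Redis key; in practice the scheduler constructs the queue from settings rather than by hand):

import redis
from scrapy import Request, Spider
from scrapy_redis.queue import FifoQueue

class DemoSpider(Spider):
    name = 'demo'

server = redis.StrictRedis(host='localhost', port=6379)
queue = FifoQueue(server, DemoSpider(), 'demo:requests')

queue.push(Request('https://example.com'))  # what enqueue_request ultimately does
request = queue.pop(timeout=0)              # what next_request ultimately does
print(request.url if request else 'queue empty')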

Summary

So far we have solved the three problems of distributed crawling mentioned earlier, summarized as follows:

  • Crawling queue: three queue types are provided, maintained with Redis lists or sorted sets.

  • Deduplication: a Redis set stores Request fingerprints and provides duplicate filtering.

  • Resuming after interruption: the Redis queue is not emptied when crawling is interrupted; on restart, the scheduler's next_request fetches the next Request from the queue and crawling continues.

Conclusion

That concludes the analysis of the core source code of Scrapy-Redis. Scrapy-Redis also provides its own Spider and Item Pipeline implementations, but they are not required.

In the next section, we will integrate Scrapy-Redis into the previously implemented Scrapy project to achieve collaborative crawling of multiple hosts.
