Python Scrapy framework tutorial (5): distributed crawlers

Data deduplication

When an item has already been seen, we should not save it again. A simple pipeline can drop duplicates by remembering the IDs it has already processed:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        # IDs of all items seen so far in this crawl
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            # Already processed: discard the duplicate item
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
        return item
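To take effect, the pipeline has to be registered in the project settings. A minimal sketch, assuming the class above lives in a hypothetical myproject/pipelines.py module:

# settings.py -- register the deduplication pipeline
# ("myproject.pipelines" is a placeholder for your own project package).
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
}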


Distributed crawling

Scrapy-redis: Redis-based components for Scrapy.

GitHub address: https://github.com/rmax/scrapy-redis

Scrapy-redis builds additional, more powerful functionality on top of Scrapy, most notably request de-duplication, crawl persistence, and easy distribution.

So, how does scrapy_redis help us to scrape data?

Stand-alone crawler

By default, Scrapy is not distributed; to run a distributed crawl you need the Redis-based Scrapy-Redis components.

Normal Scrapy stand-alone crawler:

[Figure: architecture of a standard single-machine Scrapy crawler]

Scrapy does not share its scheduling queue between processes, which is why it does not support distribution out of the box. To support it, Scrapy has to be given a shared scheduling queue, which in turn means sharing both the scheduling and the de-duplication state.

Distributed crawler

Distributed: divide and conquer

Deploy the same crawler code on multiple machines so that they complete the crawl together.

A Redis server processes all requests centrally and is mainly responsible for request de-duplication and scheduling. All crawler machines share a single crawl queue, and each request handed to a machine has not been visited by any other crawler, which improves crawling efficiency.

When a new request is produced, the crawler checks in Redis whether it has already been seen. If it has, another spider has already collected it; if not, the request is added to the shared scheduling queue, where it waits for some crawler to pick it up.

Scrapy is a general-purpose crawling framework, but it does not support distributed crawling by itself. Scrapy-redis provides a set of Redis-based components that make distributed crawling with Scrapy more convenient.

The installation is as follows:

pip install scrapy-redis

Scrapy-redis provides the following four components (meaning that these four modules need to be modified accordingly); a settings sketch that enables them follows the list:

  • Scheduler
  • Duplication Filter (de-duplication)
  • Item Pipeline
  • Base Spider (spider base class)
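In practice these components are switched on through the project settings. A rough sketch based on the scrapy-redis documentation, assuming a Redis server running locally:

# settings.py -- hand scheduling, de-duplication and item storage to Redis.

# Use the shared, Redis-backed scheduler instead of Scrapy's in-memory one.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Use the shared request-fingerprint set in Redis for de-duplication.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprint set in Redis when the crawl stops,
# so the crawl can be paused and resumed (crawl persistence).
SCHEDULER_PERSIST = True

# Store scraped items in a Redis list via the bundled pipeline.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Address of the central Redis server shared by all crawler machines.
REDIS_URL = "redis://localhost:6379"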

Scheduler
Scrapy reworks Python's collections.deque (a double-ended queue) into its own request queue, but multiple Scrapy spiders cannot share that queue of pending requests. Scrapy-redis's solution is to replace the Scrapy queue with a Redis database (also referred to as a Redis queue), so that multiple spiders read from the same database and thereby share a single crawl queue.

Redis supports a variety of data structures, which make these requirements easy to meet (see the sketch after this list):

  • Lists support lpush(), lpop(), rpush() and rpop(), which can implement a first-in-first-out or first-in-last-out crawl queue.
  • Set elements are unordered and unique, which makes it easy to implement a randomly ordered, duplicate-free crawl queue.
  • Scrapy requests carry a priority, and members of a Redis sorted set carry a score, which can be used to implement a crawl queue with priority scheduling.
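For example, a Redis list already behaves like a shared FIFO crawl queue. A small redis-py sketch (the key name "myspider:requests" and the local Redis server are illustrative assumptions):

import redis

r = redis.Redis(host="localhost", port=6379)

# Push new requests onto the head of the list...
r.lpush("myspider:requests", "https://example.com/page/1")
r.lpush("myspider:requests", "https://example.com/page/2")

# ...and pop from the tail: first in, first out. Using lpop() here
# instead would give last-in-first-out (stack) behaviour.
url = r.rpop("myspider:requests")
print(url)  # b'https://example.com/page/1'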

Scrapy builds a dictionary structure keyed by priority for its pending queues, for example:

{
  priority 0 : queue 0
  priority 1 : queue 1
  priority 2 : queue 2
}

The priority carried by each request then determines which queue it enters, and higher-priority requests are dequeued first. Since Scrapy's original Scheduler can only handle its own in-memory queues and not queues stored in Redis, it can no longer be used; the Scheduler component from Scrapy-Redis is used instead.
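A sorted set gives exactly this priority behaviour. In scrapy-redis the score is the negated request priority, so the highest-priority request has the lowest score and is popped first; a simplified redis-py sketch (key name assumed for illustration):

import redis

r = redis.Redis(host="localhost", port=6379)

# Store requests with score = -priority.
r.zadd("myspider:requests", {
    "https://example.com/a": -10,  # priority 10
    "https://example.com/b": 0,    # priority 0
})

# Pop the member with the lowest score, i.e. the highest priority.
best = r.zrange("myspider:requests", 0, 0)[0]
r.zrem("myspider:requests", best)
print(best)  # b'https://example.com/a'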

Duplication Filter (de-duplication)
Scrapy ships with its own de-duplication module, which uses a Python set. The set records a fingerprint for every request, which is essentially a hash of the Request. The fingerprint is computed with hashlib's sha1() over fields such as the request method, URL and body (headers can be included as well); any difference in those strings produces a different fingerprint. In other words, the result is a hashed string, the request fingerprint, that uniquely identifies each request. Because a fingerprint is a plain string, comparing fingerprints is much cheaper than comparing whole Request objects, which is why fingerprints are used as the basis for de-duplication.
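A heavily simplified sketch of the idea (Scrapy's real implementation lives in scrapy.utils.request and also canonicalizes the URL before hashing):

import hashlib

def simple_fingerprint(method, url, body=b""):
    # Hash the fields that identify the request; any difference in these
    # strings yields a completely different fingerprint.
    sha1 = hashlib.sha1()
    sha1.update(method.encode("utf-8"))
    sha1.update(url.encode("utf-8"))
    sha1.update(body)
    return sha1.hexdigest()

print(simple_fingerprint("GET", "https://example.com/page/1"))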

For Scrapy-Redis to de-duplicate requests across a distributed crawl, the fingerprint set also has to be shared: each crawler cannot maintain its own separate set. Redis's set data type makes this straightforward. Each host computes the fingerprint of a Request and checks it against the set in Redis: if the fingerprint is already there, the request is a duplicate and is not sent; if it is not there, the request is sent and its fingerprint is added to the Redis set. In this way the fingerprint set is shared by all crawlers in the cluster.
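The check-and-add step maps directly onto Redis's SADD command, which returns 1 when the member is new and 0 when it already exists. A sketch (the key name is a placeholder; scrapy-redis uses a per-spider dupefilter key):

import redis

r = redis.Redis(host="localhost", port=6379)

def should_schedule(fingerprint: str) -> bool:
    # One atomic round trip both tests membership and inserts the value.
    return r.sadd("myspider:dupefilter", fingerprint) == 1

print(should_schedule("abc123"))  # True: first time this fingerprint is seen
print(should_schedule("abc123"))  # False: duplicate, request is skipped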

Item Pipeline
The engine passes the scraped Items (returned by the Spider) to the Item Pipeline, and the scrapy-redis Item Pipeline stores them in a Redis items queue. With this modified Item Pipeline, items can conveniently be pulled from the queue by key, which makes it easy to build a cluster of item-processing workers.
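With scrapy_redis.pipelines.RedisPipeline enabled (see the settings sketch above), serialized items land in a Redis list, keyed by default as "<spider name>:items". A separate worker can then drain that list; a rough sketch assuming a spider called "myspider":

import json
import redis

r = redis.Redis(host="localhost", port=6379)

while True:
    # blpop blocks until an item is available and returns (key, value).
    _, raw = r.blpop("myspider:items")
    item = json.loads(raw)
    print(item)  # hand off to a database, message queue, etc.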

Base Spider
Scrapy's original Spider class is no longer used. The rewritten RedisSpider inherits from both Spider and RedisMixin, where RedisMixin is the class responsible for reading URLs from Redis. When a spider that inherits from RedisSpider starts up, it calls the setup_redis function, which connects to the Redis database and then sets up the signals:

When the spider is idle, the spider_idle signal is fired and the spider_idle handler is called; it calls schedule_next_request to keep the spider alive and raises the DontCloseSpider exception.

When an item is scraped, the item_scraped signal is caught and the item_scraped handler is called, which also calls schedule_next_request to obtain the next request.
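Putting it together, a spider based on RedisSpider waits for its start URLs to be pushed into a Redis list instead of hard-coding start_urls. A minimal sketch (the spider name, key, and parsing logic are illustrative):

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    # The spider blocks on this Redis list until URLs are pushed, e.g.:
    #   redis-cli lpush myspider:start_urls https://example.com
    redis_key = "myspider:start_urls"

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }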


Original article: https://blog.csdn.net/m0_48405781/article/details/114982668