The Scrapy Framework Series of Python Crawlers (23) - A Hands-On Introduction to the Distributed Crawler scrapy_redis [Partial Crawl of XXTop250]

1. Hands-on walkthrough (crawling the complete XXTop250 information):

  • We first use scrapy_redis inside a single project to explain some important points!

1.1 Reuse the complete XXTop250 project built earlier, but restrict it to crawling only one page (25 movies in total) so the results are easy to observe

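A minimal sketch of such a spider, restricted to the first page only, is shown below. The spider name "Douban", the start URL, and the CSS selectors are assumptions for illustration; adapt them to the actual XXTop250 project:

import scrapy

class DoubanSpider(scrapy.Spider):
    # The name "Douban" matches the redis key prefix (Douban:items etc.) seen later.
    name = "Douban"
    start_urls = ["https://movie.douban.com/top250"]  # only page 1 -> 25 movies

    def parse(self, response):
        # The selectors below are illustrative assumptions.
        for movie in response.css("ol.grid_view li"):
            yield {
                "title": movie.css("span.title::text").get(),
                "rating": movie.css("span.rating_num::text").get(),
            }
        # Deliberately not following the "next page" link, so only the first
        # 25 movies are crawled - this makes it easy to observe what
        # scrapy_redis writes into redis.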

1.2 Add the required scrapy_redis configuration to the settings file, and use the shared redis data storage area (via a dedicated pipeline)

# Step 1: add the following code:
# scrapy-redis settings
# 1. Enable the scheduler that stores requests in redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# 2. Ensure all spiders share the same duplicate filter through redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# 3. Host and port used to connect to the redis database
REDIS_HOST = "localhost"
REDIS_PORT = 6379

# Do not clear the redis queue, allowing the crawl to be paused/resumed (optional).
# Redis data is not lost on pause, which makes resumable (breakpoint) crawling possible!
SCHEDULER_PERSIST = True


# Step 2: enable the pipeline that stores scraped data in the shared redis area!
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'film.pipelines.FilmPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 100,
   # 'film.pipelines.DoubanSqlPipeline': 200,
}

1.3 Note: the settings.py configuration above contains an optional setting, SCHEDULER_PERSIST, which decides whether the redis queue is cleared when the crawl finishes:

  • First, set its value to True to allow resumable (breakpoint) crawling, and observe the shared data area in redis through Redis Desktop Manager. You will find two keys: Douban:dupefilter, which contains the fingerprint of each request URL, and Douban:items, which contains the final crawled data! (Both keys can also be checked with the short redis-py snippet below.)


However, if SCHEDULER_PERSIST is set to False, resumable crawling is not possible. Observing redis again, you will find only Douban:items, containing the final crawled data; there is no fingerprint data for the request URLs!

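Besides Redis Desktop Manager, the same keys can be checked with a few lines of redis-py. This is a minimal sketch, assuming redis runs on localhost:6379 and the spider is named "Douban":

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# "Douban:dupefilter" is a SET of request fingerprints; it is only kept
# around when SCHEDULER_PERSIST = True.
print("fingerprints:", r.scard("Douban:dupefilter"))

# "Douban:items" is a LIST; scrapy_redis's RedisPipeline pushes every
# scraped item into it as a JSON string.
print("items stored:", r.llen("Douban:items"))
first = r.lindex("Douban:items", 0)
if first:
    print(json.loads(first))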

1.4 However, the individual request URLs themselves cannot be seen in redis above:

  • This is because requests are popped from redis as they are processed, so once a run has finished the request queue no longer exists in redis. Therefore, to observe that scrapy_redis pushes each request into redis, run the project for a short while and then force it to stop, then look at redis again. You will now also find Douban:requests!

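The pending request queue can be inspected the same way. A small sketch, assuming the default scrapy_redis priority queue, which stores serialized requests in the sorted set "Douban:requests":

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Non-zero only while a crawl is running or after it was forcibly stopped
# with SCHEDULER_PERSIST = True; a finished crawl leaves this queue empty.
print("pending requests:", r.zcard("Douban:requests"))
# Each member is a serialized (pickled) Request, so only the count is printed here.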

1.5 Example of resumable (breakpoint) crawling:

  • First, set SCHEDULER_PERSIST to True, run the crawler for a short while, and then force-interrupt it!

  • Note that the pipeline that stores data to a local txt file is enabled in settings.py (a sketch of such a pipeline follows below)!
    You will find that fewer than 25 items have been stored locally. This is because the crawl was interrupted; when it is restarted, it resumes from the point where it was previously interrupted and crawls the rest:
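For reference, here is a minimal sketch of a pipeline that appends items to a local txt file, roughly what the commented-out 'film.pipelines.FilmPipeline' in settings.py might look like; the file name and the use of JSON lines are assumptions:

import json

class FilmPipeline:
    def open_spider(self, spider):
        # Append mode, so a resumed crawl keeps adding to the same file.
        self.file = open("films.txt", "a", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()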

Origin: blog.csdn.net/qq_44907926/article/details/131798682