Scrapy distributed + fingerprint deduplication principle

1. The principle of fingerprint deduplication lives in scrapy.utils.request
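To see the principle in action, here is a minimal sketch using the classic request_fingerprint API from that module (newer Scrapy versions also expose an equivalent fingerprint function there):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

req1 = Request("http://example.com/page?a=1&b=2")
req2 = Request("http://example.com/page?b=2&a=1")  # same query, different order

# The fingerprint is a SHA1 hex digest of the canonicalized request
# (method + canonical URL + body), so both requests hash identically.
print(request_fingerprint(req1))
print(request_fingerprint(req1) == request_fingerprint(req2))  # True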

Packages that need to be installed:

pip install scrapy-redis-cluster            # install the module
pip install scrapy-redis-cluster==0.4       # install a specific version of the module
pip install --upgrade scrapy-redis-cluster  # upgrade the module

2. settings.py configuration

# -*- coding: utf-8 -*-

BOT_NAME = 'zongheng'

SPIDER_MODULES = ['rankxs.spiders']
NEWSPIDER_MODULE = 'rankxs.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

######################################################
########## Scrapy-Redis configuration below ##########
######################################################

# Redis host and port
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Use the Redis-backed scheduler to store the Requests queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Make sure every spider instance shares the same Redis dupefilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Persist the Requests queue in Redis, so the crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Scheduling strategy for Requests; the default is a priority queue
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Push scraped items into Redis for later processing
ITEM_PIPELINES = {
    'rankxs.pipelines.RankxsPipeline': 1,
    'scrapy_redis.pipelines.RedisPipeline': 2,
}

The key point here is the order of the pipelines: the project pipeline must run first (priority 1; lower numbers run earlier). If the order is wrong, the data never makes it into the database. A sketch of such a pipeline follows.
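Here is a hypothetical sketch of rankxs.pipelines.RankxsPipeline; only the class name comes from the settings above, while the MySQL backend, connection parameters, table, and columns are assumptions. With priority 1 it runs before RedisPipeline (priority 2), and it must return the item so the next pipeline still receives it:

import pymysql  # assumed storage backend

class RankxsPipeline:
    def open_spider(self, spider):
        # connection parameters are placeholders
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="", database="zongheng",
                                    charset="utf8mb4")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # table and columns are hypothetical
        self.cursor.execute(
            "INSERT INTO novels (name, author) VALUES (%s, %s)",
            (item.get("name"), item.get("author")),
        )
        self.conn.commit()
        return item  # hand the item on to scrapy_redis.pipelines.RedisPipeline

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()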

3. What fingerprint deduplication looks like in Redis: Redis runs the deduplication logic up front, before any request is crawled

Two keys appear: zongheng:items and zongheng:dupefilter.

One holds the scraped items; the other holds the fingerprints (SHA1 hashes of the canonicalized requests) of the URLs that have already been crawled.

Before scheduling a request, the crawler first checks its fingerprint against the zongheng:dupefilter set in Redis.
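Both keys can be inspected from Python with the redis client (the key names follow the %(spider)s: prefix convention of scrapy-redis):

import redis

r = redis.Redis(host="localhost", port=6379)

print(r.type("zongheng:items"))       # b'list' - serialized scraped items
print(r.type("zongheng:dupefilter"))  # b'set'  - SHA1 request fingerprints

# Peek at one fingerprint and the first stored item
print(r.srandmember("zongheng:dupefilter"))
print(r.lrange("zongheng:items", 0, 0))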

4. Running multiple crawl jobs from the same file: CrawlerProcess and CrawlerRunner

from scrapy import crawler
from scrapy.utils.project import get_project_settings
from rankxs.spiders.zongheng import ZonghengSpider  # import path assumed from the project layout

process = crawler.CrawlerProcess(get_project_settings())
process.crawl(ZonghengSpider)
process.start(stop_after_crawl=False)

If running this raises an error, the key point is that passing stop_after_crawl=False and commenting out the TWISTED_REACTOR setting avoids it; for the specific reasons, check the source code.
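The CrawlerRunner variant named in the heading is sketched below; it drives the Twisted reactor manually, which is what lets several spiders run in one process (the spider import paths are assumptions):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from rankxs.spiders.zongheng import ZonghengSpider  # import path assumed

configure_logging()
runner = CrawlerRunner(get_project_settings())
runner.crawl(ZonghengSpider)
# runner.crawl(AnotherSpider)  # hypothetical second spider in the same process
d = runner.join()              # deferred that fires when all crawls finish
d.addBoth(lambda _: reactor.stop())
reactor.run()                  # block here until the crawls are done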

5. A versatile tool: LinkExtractor

from scrapy.linkextractors import LinkExtractor

The link extractor works with a plain scrapy.Spider as well as with CrawlSpider rules, which makes it very practical.
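A minimal sketch of the plain-Spider usage (the site, the allow pattern, and the callbacks are placeholders):

import scrapy
from scrapy.linkextractors import LinkExtractor

class BookSpider(scrapy.Spider):
    name = "book"
    start_urls = ["http://www.zongheng.com/"]

    def parse(self, response):
        # Extract every link whose URL matches the pattern and follow it
        for link in LinkExtractor(allow=r"/book/\d+").extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_book)

    def parse_book(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}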

Origin blog.csdn.net/Steven_yang_1/article/details/131942809