1. The fingerprint deduplication logic lives in scrapy.utils.request (the request_fingerprint function)
Packages that need to be installed:
pip install scrapy-redis-cluster             # install the module
pip install scrapy-redis-cluster==0.4        # install a specific version of the module
pip install --upgrade scrapy-redis-cluster   # upgrade the module
2. settings.py configuration
# -*- coding: utf-8 -*-
BOT_NAME = 'zongheng'
SPIDER_MODULES = ['rankxs.spiders']
NEWSPIDER_MODULE = 'rankxs.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
######################################################
############## Scrapy-Redis related settings ################
######################################################
# Redis host and port
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
# Use the Redis-backed scheduler to store the requests queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Make sure every spider instance deduplicates through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Persist the requests queue in Redis so the crawl can be paused and resumed
SCHEDULER_PERSIST = True
# Request scheduling strategy; the priority queue is the default
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# Also store scraped items in Redis for later processing
ITEM_PIPELINES = {
    'rankxs.pipelines.RankxsPipeline': 1,
    'scrapy_redis.pipelines.RedisPipeline': 2,
}
The key point here is the order of the pipelines: your own pipeline must run before RedisPipeline (a lower number runs earlier). If the order is wrong, the items never reach the database.
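A minimal sketch of what the project pipeline at priority 1 must do (the class name RankxsPipeline comes from the settings above; the database write is left as a placeholder):

```python
class RankxsPipeline:
    """Runs at priority 1, before scrapy_redis.pipelines.RedisPipeline (priority 2)."""

    def process_item(self, item, spider):
        # ... write the item to your database here ...
        # Returning the item is what hands it on to the next pipeline;
        # dropping or forgetting this return is the usual cause of
        # "nothing arrives in Redis".
        return item
```

If process_item raises DropItem or returns None, RedisPipeline never sees the item, which is why the ordering and the return value matter.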
3. What fingerprint deduplication looks like in Redis
After a crawl you will find two keys: zongheng:items and zongheng:dupefilter.
The first holds the scraped items; the second holds the fingerprints (SHA1 hashes) of requests that have already been crawled.
Before scheduling a request, the crawler first checks whether its fingerprint is already in the dupefilter set in Redis.
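Scrapy's real fingerprint (in scrapy.utils.request) hashes the request method, the canonicalized URL, and the body with SHA1; the sketch below is a simplified illustration of that idea, not Scrapy's exact implementation:

```python
import hashlib


def simple_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Simplified sketch of a request fingerprint: SHA1 over the request's
    method, URL, and body. (Scrapy additionally canonicalizes the URL and
    can include selected headers.)"""
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    h.update(body)
    return h.hexdigest()


# Identical requests map to the same 40-character hex digest, so the
# dupefilter can detect a repeat with a single Redis SADD on the
# zongheng:dupefilter set.
fp1 = simple_fingerprint("GET", "http://www.zongheng.com/rank")
fp2 = simple_fingerprint("GET", "http://www.zongheng.com/rank")
```

Because the fingerprint is a fixed-size hash rather than the raw URL, the dupefilter set stays compact no matter how long the URLs are.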
4. Multiple crawler tasks can be run from a single script with CrawlerProcess or CrawlerRunner (both in scrapy.crawler)

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(ZonghengSpider)
process.start(stop_after_crawl=False)

If the program raises an error at startup, the key points are to pass stop_after_crawl=False and to comment out the TWISTED_REACTOR setting; for the specific reason, check the source code.
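For reference, the line to comment out lives in settings.py; the value shown below is the default that newer Scrapy project templates generate (shown here for illustration):

```python
# settings.py — comment this out if process.start() fails with a reactor error
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```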
5. About the versatile LinkExtractor
from scrapy.linkextractors import LinkExtractor
The link extractor can be used either in a plain Spider (by calling extract_links on a response yourself) or in CrawlSpider rules, which makes it quite practical.