0. The scrapy-redis crawler workflow
1. Can the scrapy framework implement distributed crawling on its own?
-
No. There are two reasons.
First: when scrapy is deployed on multiple machines, each machine has its own scheduler, so the machines cannot divide the urls in the start_urls list among themselves. (Multiple machines cannot share the same scheduler.)
Second: the data crawled by multiple machines cannot be persisted through one unified pipeline. (Multiple machines cannot share the same pipeline.)
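The scheduler problem above can be sketched in plain Python. This is only a conceptual simulation (a deque stands in for the shared redis queue; the names are illustrative, not scrapy-redis internals): with one shared queue, each url is handed to exactly one machine, whereas separate per-machine schedulers would each hold the same start_urls.

```python
from collections import deque

# A deque plays the role of the single redis list that all machines share.
shared_queue = deque(['url1', 'url2', 'url3', 'url4'])

def worker(name, queue, crawled):
    # Each call pops one url from the shared queue, like one machine's
    # scrapy-redis scheduler pulling its next request.
    if queue:
        url = queue.popleft()
        crawled.append((name, url))

crawled = []
while shared_queue:
    worker('machine-A', shared_queue, crawled)
    worker('machine-B', shared_queue, crawled)

# Every url was crawled exactly once, split across the two machines.
urls = [u for _, u in crawled]
print(sorted(urls))  # ['url1', 'url2', 'url3', 'url4']
```

Because both workers pop from the same queue, no url is fetched twice — this is exactly what sharing one scheduler buys you.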
2. The distributed crawling component scrapy-redis
- The scrapy-redis component encapsulates a scheduler and a pipeline that can be shared by multiple machines, so we can use it directly to implement distributed crawling and data persistence.
- Implementation approaches:
1. Base the spider class on RedisSpider
2. Base the spider class on RedisCrawlSpider
3. Distributed implementation process: both approaches follow the same unified procedure
- 3.1 Install the scrapy-redis component: pip install scrapy-redis
- 3.2 Configure the redis configuration file (redis.conf):
- Comment out the line bind 127.0.0.1, so that other ips can access redis
- Turn off protected mode with protected-mode no, so that other ips can operate on redis
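Put together, the two redis.conf edits from step 3.2 look like this (a sketch of just the relevant lines; their positions vary between redis versions):

```
# bind 127.0.0.1      <- commented out, so other ips can access redis
protected-mode no     # turned off, so other ips can operate on redis
```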
3.3 Modify the relevant code in the spider file:
- Change the spider's parent class to RedisSpider or RedisCrawlSpider. Note: if the original spider file was based on Spider, the parent class should be changed to RedisSpider; if it was based on CrawlSpider, the parent class should be changed to RedisCrawlSpider.
- Comment out or delete the start_urls list and add a redis_key attribute instead; its value is the name of the scrapy-redis scheduler queue.
3.4 In the settings file, enable the pipeline encapsulated by the scrapy-redis component:

```python
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
```
3.5 In the settings file, enable the scheduler encapsulated by the scrapy-redis component:

```python
# Use the deduplication queue from the scrapy-redis component
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler that comes with the scrapy-redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pausing and resuming
SCHEDULER_PERSIST = True
```
3.6 Configure the redis connection in the settings file:

```python
REDIS_HOST = 'ip address of the redis service'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
REDIS_PARAMS = {'password': '123456'}
```
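Conceptually, scrapy-redis builds its client connection from these settings: REDIS_HOST and REDIS_PORT give the address, and every entry in REDIS_PARAMS (such as the password) is merged into the keyword arguments passed to the redis client. A plain-Python sketch of that merge (no real connection is made; the values are illustrative):

```python
# Illustrative settings values
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
REDIS_PARAMS = {'password': '123456'}

# Rough sketch of how the settings combine into client keyword arguments
connection_kwargs = {
    'host': REDIS_HOST,
    'port': REDIS_PORT,
    'encoding': REDIS_ENCODING,
}
connection_kwargs.update(REDIS_PARAMS)  # REDIS_PARAMS entries are merged in

print(connection_kwargs)
```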
3.7 Start the redis server: redis-server followed by the configuration file
3.8 Start the redis client: redis-cli
3.9 Run the spider file: scrapy runspider SpiderFile
3.10 Push a starting url into the scheduler queue (done in the redis client): lpush followed by the redis_key attribute value and the starting url
4. Example
-
spider file
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from redisChoutiPro.items import RedischoutiproItem

class ChoutiSpider(RedisCrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    # redis_key is the name of the shared scheduler queue. Push the starting
    # url into this queue once and every node in the distributed cluster can
    # pull it, resolve sub-urls from the response, and enqueue the requests,
    # so a single starting url is enough.
    redis_key = 'chouti'  # name of the scheduler queue

    rules = (
        Rule(LinkExtractor(allow=r'/all/hot/recent/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@class="item"]')
        for div in div_list:
            title = div.xpath('./div[4]/div[1]/a/text()').extract_first()
            author = div.xpath('./div[4]/div[2]/a[4]/b/text()').extract_first()
            # Instantiate an item object and store the extracted data in it
            item = RedischoutiproItem()
            item['title'] = title
            item['author'] = author
            # Submit the item to the pipeline; scrapy's native pipeline is
            # not shared, so the settings file must be modified accordingly
            yield item
```
-
items
```python
import scrapy

class RedischoutiproItem(scrapy.Item):
    # define the fields for your item here, like:
    title = scrapy.Field()
    author = scrapy.Field()
```
-
pipeline
```python
class RedischoutiproPipeline(object):
    def process_item(self, item, spider):
        return item
```
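The project's own pipeline just passes items through; the actual storage is done by scrapy_redis.pipelines.RedisPipeline (enabled in the settings), which serializes each item and pushes it onto a redis list keyed by spider name. A rough stdlib-only sketch of that idea, with a plain dict standing in for redis (the '<spider>:items' key format matches the scrapy-redis default; everything else here is illustrative):

```python
import json

fake_redis = {}  # maps key -> list, standing in for a redis instance

def process_item(item, spider_name):
    # Serialize the item and push it onto the list '<spider>:items',
    # roughly what scrapy-redis's RedisPipeline does.
    key = '%s:items' % spider_name
    fake_redis.setdefault(key, []).append(json.dumps(item))
    return item

item = {'title': 'some title', 'author': 'some author'}
process_item(item, 'chouti')

print(list(fake_redis))  # ['chouti:items']
```

Because the items land in redis rather than on one machine's disk, every node in the cluster writes to the same shared store.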
-
settings
```python
BOT_NAME = 'redisChoutiPro'
SPIDER_MODULES = ['redisChoutiPro.spiders']
NEWSPIDER_MODULE = 'redisChoutiPro.spiders'
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

# Configure the deduplication container class: request fingerprints are
# stored in a redis set, so request deduplication is persistent
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler that comes with the scrapy-redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Configure whether the scheduler persists, i.e. whether redis keeps the
# request queue and the fingerprint set when the crawler ends. If True,
# the data is kept rather than emptied.
SCHEDULER_PERSIST = True

REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
```
5. Summary
- Why can't native scrapy implement distributed crawling?
    - the scheduler cannot be shared
    - the pipeline cannot be shared
- What is the role of the scrapy-redis component?
    - it provides a scheduler and a pipeline that can be shared
- Distributed crawling implementation process:
    1. Install the environment: pip install scrapy-redis
    2. Create the project
    3. Create the spider file (RedisCrawlSpider or RedisSpider): scrapy genspider -t crawl xxx www.xxx.com
    4. Edit the relevant attributes of the spider file:
        - import: from scrapy_redis.spiders import RedisCrawlSpider
        - set the current spider's parent class to RedisCrawlSpider
        - replace the starting url list with redis_key = 'xxx' (the name of the scheduler queue)
    5. Configure the settings file:
        - use the shared pipeline encapsulated by the component: ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400}
        - configure the shared scheduler encapsulated by the component: DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" (request fingerprints are stored in a redis set, so deduplication is persistent), SCHEDULER = "scrapy_redis.scheduler.Scheduler", and SCHEDULER_PERSIST = True (when the crawler ends, do not empty the redis request queue and fingerprint set; True means the data persists)
        - specify the redis that stores the data: REDIS_HOST = 'ip address of the redis service', REDIS_PORT = 6379
        - configure the redis configuration file: cancel protected mode (protected-mode no) and comment out the bind (#bind 127.0.0.1)
        - start redis: start the server with redis-server plus the configuration file, then start the client with redis-cli
    6. Run the distributed program: scrapy runspider xxx.py
    7. Push a starting url into the scheduler queue, executed in redis-cli: lpush followed by the queue name and the starting url