Distributed Scrapy-redis

0. The scrapy-redis crawling workflow

1. Can the scrapy framework implement distributed crawling on its own?

  • No, for two reasons.

      First: when scrapy is deployed on multiple machines, each machine has its own scheduler, so the URLs in the start_urls list cannot be distributed among the machines. (Multiple machines cannot share the same scheduler.)

      Second: the data crawled by multiple machines cannot be stored in a unified way, because there is no single shared pipeline. (Multiple machines cannot share the same pipeline.)

2. The distributed crawling component: scrapy-redis

- The scrapy-redis component provides a nicely encapsulated scheduler and pipeline that can be shared by multiple machines; we can use them directly to implement distributed crawling and shared data storage.

- Implementation: two options

1. Spiders based on the RedisSpider class

2. Spiders based on the RedisCrawlSpider class

3. Distributed implementation workflow: the process is the same for both classes

- 3.1 Install the scrapy-redis component: pip install scrapy-redis

- 3.2 Modify the Redis configuration file (redis.conf):

- Comment out the line bind 127.0.0.1, so that other IPs can access Redis
 - Change protected-mode yes to protected-mode no, so that other IPs can operate on Redis (see the excerpt below)

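With those two changes applied, the relevant part of redis.conf looks roughly like this (a minimal excerpt, assuming an otherwise default configuration file):

 # bind 127.0.0.1          <- commented out so that other hosts can connect
 protected-mode no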

3.3 Modify the spider file:

- Change the spider's parent class to RedisSpider or RedisCrawlSpider. Note: if the original spider file is based on Spider, the parent class should be changed to RedisSpider; if the original file is based on CrawlSpider, the parent class should be changed to RedisCrawlSpider.

- Comment out or delete the start_urls list and add a redis_key attribute instead; the attribute value is the name of the scheduler queue used by the scrapy-redis component (a minimal sketch follows below).
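
A minimal sketch of what the modified spider file looks like (the class name, spider name, and queue name are placeholders here; the full working example is in section 4):

 from scrapy_redis.spiders import RedisCrawlSpider

 class ExampleSpider(RedisCrawlSpider):
     name = 'example'
     # start_urls is removed: starting URLs are read from the Redis list
     # whose name is given by redis_key
     redis_key = 'example'  # name of the shared scheduler queue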

3.4 In the project's settings file, enable the pipeline encapsulated by the scrapy-redis component:

 ITEM_PIPELINES = {
     'scrapy_redis.pipelines.RedisPipeline': 400
 }

3.5 In the project's settings file, enable the scheduler encapsulated by the scrapy-redis component:

# Use the deduplication component (request fingerprints) provided by scrapy-redis
 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
 # Use the scheduler provided by the scrapy-redis component
 SCHEDULER = "scrapy_redis.scheduler.Scheduler"
 # Allow pausing and resuming: keep the queue and fingerprints in Redis when the crawl stops
 SCHEDULER_PERSIST = True


3.6 Configure the Redis connection in the project's settings file:

REDIS_HOST = 'redis server ip address'
 REDIS_PORT = 6379
 REDIS_ENCODING = 'utf-8'
 REDIS_PARAMS = {'password': '123456'}


3.7 Start the Redis server: redis-server <configuration file>

3.8 Open a Redis client: redis-cli

3.9 Run the spider file: scrapy runspider <spider file>

3.10 Push a starting URL into the scheduler queue (done in the Redis client): lpush <redis_key attribute value> <starting url> (see the example below)
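
For example, with redis_key = 'chouti' as in the example project below, the startup sequence might look roughly like this (the file names and the starting URL are illustrative, not taken from the original post):

 redis-server ./redis.conf       # start the Redis server with the edited configuration file
 scrapy runspider chouti.py      # run the spider file on every machine in the cluster
 redis-cli                       # open a Redis client on the machine hosting Redis, then:
 127.0.0.1:6379> lpush chouti https://dig.chouti.com/all/hot/recent/1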

4. Examples

  • Spider file

 import scrapy
 from scrapy.linkextractors import LinkExtractor
 from scrapy.spiders import CrawlSpider, Rule
 from scrapy_redis.spiders import RedisCrawlSpider
 from redisChoutiPro.items import RedischoutiproItem

 class ChoutiSpider(RedisCrawlSpider):
     name = 'chouti'
     # allowed_domains = ['www.xxx.com']
     # start_urls = ['http://www.xxx.com/']

     # redis_key = 'chouti' names the shared scheduler queue.
     # The starting URL is pushed into this global queue, so the whole
     # distributed cluster pulls requests from one place: a node pops the
     # starting URL, extracts the sub-URLs, and those go back into the
     # shared scheduler queue to be requested.
     redis_key = 'chouti'  # name of the scheduler queue
     rules = (
         Rule(LinkExtractor(allow=r'/all/hot/recent/\d+'), callback='parse_item', follow=True),
     )

     def parse_item(self, response):
         div_list = response.xpath('//div[@class="item"]')
         for div in div_list:
             title = div.xpath('./div[4]/div[1]/a/text()').extract_first()
             author = div.xpath('./div[4]/div[2]/a[4]/b/text()').extract_first()
             # Instantiate an item object and store the extracted data in it
             item = RedischoutiproItem()
             item['title'] = title
             item['author'] = author
             # Submit the item to the shared pipeline; scrapy's native pipeline
             # is not shared, so the settings file must be modified accordingly
             yield item
  • Items file

 import scrapy


 class RedischoutiproItem(scrapy.Item):
     # define the fields for your item here like:
     title = scrapy.Field()
     author = scrapy.Field()
  • Pipeline file

 class RedischoutiproPipeline(object):
     def process_item(self, item, spider):
         return item
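
This local pipeline just passes items through; the shared storage is handled by the RedisPipeline enabled in the settings file. As far as I recall, scrapy-redis serializes each item into a Redis list named '<spider name>:items' by default, so for this example the crawled data could be inspected from redis-cli roughly like this (the key name is assumed from that default):

 127.0.0.1:6379> llen chouti:items         # how many items have been stored so far
 127.0.0.1:6379> lrange chouti:items 0 4   # look at the first few serialized items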
  • Settings file

 BOT_NAME = 'redisChoutiPro'
 SPIDER_MODULES = ['redisChoutiPro.spiders']
 NEWSPIDER_MODULE = 'redisChoutiPro.spiders'
 ROBOTSTXT_OBEY = False
 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'

 # Shared pipeline encapsulated by the scrapy-redis component
 ITEM_PIPELINES = {
     'scrapy_redis.pipelines.RedisPipeline': 400
 }

 # Deduplication container: request fingerprints are stored in a Redis set,
 # which makes request deduplication persistent
 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
 # Use the scheduler provided by the scrapy-redis component
 SCHEDULER = "scrapy_redis.scheduler.Scheduler"
 # Whether the scheduler persists its state: when the crawl ends, do not empty the
 # Redis request queue and fingerprint set. True means the data is kept.
 SCHEDULER_PERSIST = True

 # Connection to the shared Redis server (127.0.0.1 is for single-machine testing;
 # on a real cluster, point this at the machine hosting Redis)
 REDIS_HOST = '127.0.0.1'
 REDIS_PORT = 6379

5. Summary

- Why can't native scrapy implement distributed crawling?
    - the scheduler cannot be shared
    - the pipeline cannot be shared

- What does the scrapy-redis component do?
    - it provides a scheduler and a pipeline that can be shared

- Distributed crawling workflow
1. Install the environment: pip install scrapy-redis
2. Create the project
3. Create the spider file: RedisCrawlSpider / RedisSpider
    - scrapy genspider -t crawl xxx www.xxx.com
4. Edit the relevant attributes of the spider file:
    - import: from scrapy_redis.spiders import RedisCrawlSpider
    - set the parent class of the current spider to RedisCrawlSpider
    - replace the start_urls list with redis_key = 'xxx' (the name of the scheduler queue)
5. Configure the settings file:
    - use the shared pipeline encapsulated by the component:
        ITEM_PIPELINES = {
            'scrapy_redis.pipelines.RedisPipeline': 400
        }
    - configure the scheduler (use the shared scheduler encapsulated by the component):
        # Deduplication container: request fingerprints are stored in a Redis set,
        # making request deduplication persistent
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
        # Use the scheduler provided by the scrapy-redis component
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        # Whether the scheduler persists: when the crawl ends, do not empty the Redis
        # request queue and fingerprint set. True means the data is kept, otherwise it is cleared.
        SCHEDULER_PERSIST = True

    - point to the Redis server that stores the data:
        REDIS_HOST = 'redis server ip address'
        REDIS_PORT = 6379

    - Redis configuration file:
        - turn off protected mode: protected-mode no
        - comment out the bind line: #bind 127.0.0.1

    - start Redis:
        start the Redis server: redis-server <configuration file>
        open a Redis client: redis-cli

6. Run the distributed program:
    scrapy runspider xxx.py

7. Push a starting url into the scheduler queue:
    executed in redis-cli: lpush <queue name> <starting url>

  
