Scrapy + scrapy-redis distributed crawling

Distributed crawling:
- Concept: a group of machines forms a cluster, jointly runs the same crawler program, and splits the data crawling work among its members.
- How is distributed crawling implemented?
  - by combining the scrapy-redis component with native Scrapy
- Why can't native Scrapy achieve distribution on its own?
  - the scheduler cannot be shared across machines
  - the pipeline cannot be shared across machines
- What scrapy-redis does:
  - provides Scrapy with a shared scheduler and a shared pipeline, both backed by Redis
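
A quick way to see why Redis makes sharing possible: every process connected to the same Redis instance sees the same data structures. A minimal sketch with the redis-py client (illustrating the idea only, not scrapy-redis internals; the queue name and URL are placeholders):

import redis

r = redis.Redis(host='127.0.0.1', port=6379)

# one machine seeds a list that serves as a shared queue...
r.lpush('shared_queue', 'https://www.example.com/page1')

# ...and any number of workers pop from that same list;
# each URL is handed to exactly one of the callers of rpop
url = r.rpop('shared_queue')
print(url)

This is the property scrapy-redis exploits: its scheduler queue and request-fingerprint set live in Redis, so every crawler process in the cluster draws from one queue and deduplicates against one fingerprint set.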

Distributed coding workflow:
1. Create a project.
2. Create a spider file, based on either:
- Spider
- CrawlSpider
3. Modify the spider file (a minimal sketch follows this step):
- import: from scrapy_redis.spiders import RedisCrawlSpider
- change the spider's parent class to RedisCrawlSpider
- delete allowed_domains and start_urls
- add a new attribute: redis_key = 'xxx', the name of the scheduler's queue
- fill in the rest of the spider code (link extractor, parsing rules, parse methods)
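
A minimal sketch of a modified spider file; the spider name, redis_key value, rule pattern, and parsed field are illustrative assumptions, not taken from the original post:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

class FbsSpider(RedisCrawlSpider):
    name = 'fbs'
    # allowed_domains and start_urls are deleted;
    # redis_key names the scheduler queue in Redis
    redis_key = 'fbs_queue'  # assumed queue name

    rules = (
        # hypothetical rule: follow pagination links and parse each page
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # hypothetical parsing: yield a plain dict for the shared pipeline
        yield {'title': response.xpath('//title/text()').get()}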

4. Prepare the settings file:
- Specify the pipeline:
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
- Specify the scheduler:
# configure the dedupe filter class: a Redis set stores the request
# fingerprints, so request deduplication persists across runs
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# use the scheduler that ships with the scrapy-redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# whether scheduler state persists: if True, the Redis request queue and the
# fingerprint set are NOT emptied when the crawl ends; if False, they are cleared
SCHEDULER_PERSIST = True

- Specify the Redis database connection:
REDIS_HOST = 'ip address of the Redis server'
REDIS_PORT = 6379

5. Modify the Redis configuration file (redis.windows.conf):
- comment out the default binding: #bind 127.0.0.1
- turn off protected mode: protected-mode no
6. Start the Redis server with that configuration file:
- redis-server ./redis.windows.conf
7. Start the Redis client:
- redis-cli
8. Launch the distributed project:
- scrapy crawl fbs
- or: scrapy runspider ./xxx.py
9. Push the start URL into the scheduler's queue (a concrete example follows):
- the scheduler's queue lives in Redis
- in redis-cli: lpush <value of redis_key> www.xxx.com
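
For example, with the assumed redis_key = 'fbs_queue' from the sketch in step 3, the seeding command inside redis-cli would be:

lpush fbs_queue www.xxx.com

As soon as the URL lands in the queue, every idle spider process in the cluster can pull requests from it and crawling begins.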
