redis distributed deployment
- Concept: a group of programs is executed on multiple machines (a distributed cluster) to crawl the same set of data in a distributed way.
1. Can the scrapy framework implement distributed crawling by itself? No.
First: each machine running scrapy has its own scheduler, so the urls in the start_urls list cannot be divided among the machines. (Multiple machines cannot share the same scheduler.)
Second: the data crawled by multiple machines cannot be persisted through one unified pipeline. (Multiple machines cannot share the same pipeline.)
2. The distributed crawler component scrapy-redis
- The scrapy-redis component neatly encapsulates a scheduler and a pipeline that multiple machines can share; we can use them directly to implement distributed crawling and data persistence.
- Ways to implement it:
1. Based on the component's RedisSpider class
2. Based on the component's RedisCrawlSpider class
3. Distributed implementation process: the two ways share one unified implementation workflow.
- 3.1 Install the scrapy-redis component: pip install scrapy-redis
- 3.2 Configure the redis configuration file:
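A typical configuration for this step (an assumption here, based on the stock redis.windows.conf) lets the other crawler machines connect to the redis server:
# redis.windows.conf -- assumed edits so remote machines can connect
# bind 127.0.0.1     <- comment out this line to accept non-local connections
protected-mode no    # change yes to no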
3.3 Modify the relevant code in the spider file:
- Change the spider's parent class to RedisSpider or RedisCrawlSpider.
Note: if the original spider file was based on Spider, change the parent class to RedisSpider; if it was based on CrawlSpider, change the parent class to RedisCrawlSpider.
- Comment out or delete the start_urls list and add a redis_key attribute instead; its value is the name of the scrapy-redis scheduler queue (see the sketch below).
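A minimal sketch of this change (class name, spider name, and queue name are placeholders):
from scrapy_redis.spiders import RedisCrawlSpider

class MySpider(RedisCrawlSpider):  # was: class MySpider(CrawlSpider)
    name = 'myspider'
    # start_urls = ['http://www.xxx.com/']  # commented out or deleted
    redis_key = 'ts'  # name of the scrapy-redis scheduler queue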
3.4 In the settings configuration file, enable the pipeline encapsulated by the scrapy-redis component:
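In settings.py (this same line appears in the full example at the end of this section):
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}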
3.5 In the settings configuration file, enable the scheduler encapsulated by the scrapy-redis component:
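In settings.py, as in the full example below; the dedupe filter stores request fingerprints in a redis set so deduplication persists, and SCHEDULER_PERSIST keeps the queue and fingerprint set after a crawl ends:
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
SCHEDULER_PERSIST = True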
3.6 Configure the redis connection in the settings file:
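In settings.py, point every machine at the same redis server (127.0.0.1 is a placeholder; use the redis server's real ip):
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379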
3.7 Start the redis server: redis-server <configuration file>
Start it with the configuration file [in a Windows terminal, switch to the redis directory]:
redis-server ./redis.windows.conf
3.8 Start the redis client: enter in a [Windows] terminal:
redis-cli
3.9 Run the spider file: scrapy runspider SpiderFile
In the PyCharm terminal, switch to the spider file's directory and enter [scrapy runspider xxx.py]
3.10 Push a start url into the scheduler queue (done in the redis client): lpush <redis_key attribute value> <start url>
Enter in the Windows terminal: lpush ts www.xxx.com
3.11 View the crawled data in the redis client [redis-cli]; the items are stored under the key [xxx:items], where xxx is the spider name.
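For example, assuming the default item key (spider name + ':items') and the spider from the example below:
127.0.0.1:6379> keys *
127.0.0.1:6379> lrange scrapyredis:items 0 10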
Example: Sunshine Hotline
# 1. Spider file
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapyRedisPro.items import ScrapyredisproItem
from scrapy_redis.spiders import RedisCrawlSpider

class ScrapyredisSpider(RedisCrawlSpider):
    name = 'scrapyredis'
    redis_key = 'ts'  # name of the shared scheduler queue

    rules = (
        Rule(LinkExtractor(allow=r'type=4&page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            net_friend = tr.xpath('./td[4]/text()').extract_first()
            item = ScrapyredisproItem()
            item['title'] = title
            item['net_friend'] = net_friend
            yield item  # the item must be submitted to the shared pipeline

--------------------------------------------------------------------------------

# 2. Settings file
BOT_NAME = 'scrapyRedisPro'
SPIDER_MODULES = ['scrapyRedisPro.spiders']
NEWSPIDER_MODULE = 'scrapyRedisPro.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'

# Use the pipeline encapsulated by the scrapy-redis component
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Add a dedupe container class: stores request fingerprints in a redis set,
# making request deduplication persistent
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
# Use the scheduler that ships with the scrapy-redis component
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Allow pausing: configure whether the scheduler persists, i.e. whether the
# redis request queue and fingerprint set are kept when the crawl ends.
# True: persist, do not clear the data; False: clear the data
SCHEDULER_PERSIST = True
# Specify the ip and port of the redis server [use the server's real ip]
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
# Number of concurrent requests to open [32]
CONCURRENT_REQUESTS = 32

--------------------------------------------------------------------------------

# 3. Items file
import scrapy

class ScrapyredisproItem(scrapy.Item):
    title = scrapy.Field()
    net_friend = scrapy.Field()
Redis instructions in the corresponding terminals:
Start the redis server with the configuration file:
D:\redis> ./redis-server.exe redis.windows.conf
Start the redis client:
D:\redis> redis-cli
View the data currently in the redis database:
127.0.0.1:6379> keys *
(empty list or set)
Push a start url into the scheduler queue (in the redis client):
127.0.0.1:6379> lpush ts http://wz.sun0769.com/index.php/question/questionType?type=4&page=