Python web crawler - distributed spiders

redis distributed deployment

- Concept: a group of programs runs on multiple machines (a distributed cluster), and the machines crawl the data jointly.

    1. Can the scrapy framework implement distributed crawling by itself? No.

      First: each machine that scrapy is deployed on has its own scheduler, so the urls in the start_urls list cannot be divided among the machines. (Multiple machines cannot share the same scheduler.)

      Second: the data crawled by multiple machines cannot be persisted through one unified pipeline. (Multiple machines cannot share the same pipeline.)

    2. The distributed crawling component scrapy-redis

        - The scrapy-redis component neatly encapsulates a scheduler and a pipeline that can be shared by multiple machines; we can use them directly to implement distributed crawling and distributed data storage.

        - Two ways to implement it (a minimal sketch of the first approach follows this list):

            1. Base the spider class on the component's RedisSpider class

            2. Base the spider class on the component's RedisCrawlSpider class
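
        A minimal sketch of the RedisSpider approach (the class name, redis_key value, and parse logic are placeholders for illustration, not part of the example project further below):

    from scrapy_redis.spiders import RedisSpider

    class MyDistributedSpider(RedisSpider):
        name = 'my_distributed_spider'
        # no start_urls: every machine pulls its requests from the shared redis queue instead
        redis_key = 'my_distributed_spider:start_urls'

        def parse(self, response):
            # placeholder parse logic; each crawled page yields a simple dict item
            yield {'url': response.url}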

    3. Distributed implementation process: both approaches follow the same unified steps

        - 3.1 Install the scrapy-redis component: pip install scrapy-redis

        - 3.2 Redis configuration file changes (config file excerpt below):

  - Comment out this line: bind 127.0.0.1, so that other IPs can access redis
  - Change yes to no: protected-mode no, so that other IPs can operate on redis
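
  The corresponding excerpt of the redis configuration file (redis.windows.conf on Windows, redis.conf on Linux) would then look roughly like this:

    # bind 127.0.0.1      <- commented out so that other IPs can access redis
    protected-mode no     # changed from yes to no so that other IPs can operate on redis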

        3.3 Modify the relevant code in the spider file:

            - Change the spider's parent class to RedisSpider or RedisCrawlSpider.

     Note: if the original spider file was based on Spider, the parent class should be changed to RedisSpider; if it was based on CrawlSpider, the parent class should be changed to RedisCrawlSpider.

            - Comment out or delete the start_urls list and add a redis_key attribute instead; its value is the name of the scrapy-redis scheduler queue.

        3.4 In the settings configuration file, enable the pipeline encapsulated by the scrapy-redis component

    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }

        3.5 In the settings configuration file, enable the scheduler encapsulated by the scrapy-redis component

    # Use the scrapy-redis component's deduplication queue
    # Add a deduplication container class configuration: it stores request fingerprints in a Redis set, making request deduplication persistent
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scrapy-redis component's own scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Configure whether the scheduler persists, i.e. whether the request queue and fingerprint set in Redis are kept when the spider finishes. True: persist, do not clear the data; False: clear the data
    SCHEDULER_PERSIST = True

        3.6 Configure the redis connection in the settings file of the spider project:

    REDIS_HOST = 'the ip address of the redis server'

    REDIS_PORT = 6379

    REDIS_ENCODING = 'utf-8'

    REDIS_PARAMS = {'password': '123456'}

        3.7 Start the redis server: redis-server <configuration file>

     Start it with the configuration file (in a Windows terminal, switch to the redis directory first):

     redis-server ./redis.windows.conf

        3.8 Start the redis client (enter in a Windows terminal):

     redis-cli

        3.9 Run the spider file: scrapy runspider SpiderFile

    In the PyCharm terminal, cd into the spider file's directory and enter: scrapy runspider xxx.py

        3.10 Push a start url into the scheduler queue (done in the redis client): lpush <redis_key attribute value> <start url>

    Enter in the Windows terminal: lpush ts www.xxx.com

  3.11 View the crawled data in the redis client (redis-cli); the items are stored under the items key:
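
    For example, with the spider named scrapyredis from the example below, the scrapy-redis RedisPipeline stores items under the key '<spider name>:items' by default (an assumption based on the component's default key pattern), so they can be inspected from redis-cli roughly like this:

    127.0.0.1:6379> keys *
    127.0.0.1:6379> lrange scrapyredis:items 0 -1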

Example: Sunshine Hotline

# 1. spider file

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapyRedisPro.items import ScrapyredisproItem
from scrapy_redis.spiders import RedisCrawlSpider

class ScrapyredisSpider(RedisCrawlSpider):
    name = 'scrapyredis'

    redis_key = 'ts'  # name of the scheduler queue that can be shared
    rules = (
        Rule(LinkExtractor(allow=r'type=4&page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            net_friend = tr.xpath('./td[4]/text()').extract_first()
            item = ScrapyredisproItem()
            item['title'] = title
            item['net_friend'] = net_friend

            yield item  # the yielded item is submitted to the shared pipeline
------------------------------------------------------------------------------------------
# 2. settings file

BOT_NAME = 'scrapyRedisPro'
SPIDER_MODULES = ['scrapyRedisPro.spiders']
NEWSPIDER_MODULE = 'scrapyRedisPro.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'

ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Use the scrapy-redis component's deduplication queue
# Add a deduplication container class configuration: it stores request fingerprints in a Redis set, making request deduplication persistent
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Use the scrapy-redis component's own scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Allow pausing
# Configure whether the scheduler persists, i.e. whether the request queue and fingerprint set in Redis are kept when the spider finishes. True: persist, do not clear the data; False: clear the data
SCHEDULER_PERSIST = True

# Specify the redis server's ip and port
REDIS_HOST = '127.0.0.1'  # the real ip address of the redis server
REDIS_PORT = 6379

# number of concurrent requests (32 open)
CONCURRENT_REQUESTS = 32
------------------------------------------------------------------------------------------
# 3. items file

import scrapy

class ScrapyredisproItem(scrapy.Item):
    title = scrapy.Field()
    net_friend = scrapy.Field()

 

Corresponding redis commands in the terminal

  Start the redis server: redis-server <configuration file>
  D:\redis> ./redis-server.exe redis.windows.conf

  Start the redis client
  D:\redis> redis-cli

  View the data currently in the redis database

   127.0.0.1:6379> keys *
  (empty list or set)

  Push a start url into the scheduler queue (done in the redis client)
   127.0.0.1:6379> lpush ts http://wz.sun0769.com/index.php/question/questionType?type=4&page=
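
As a supplement, a small sketch of reading the stored items back out of redis with the redis-py package. The key name scrapyredis:items assumes scrapy-redis's default item key pattern '<spider name>:items' and the spider name from the example above:

    import json

    import redis

    # connect to the same redis server the distributed spiders write to
    conn = redis.Redis(host='127.0.0.1', port=6379)

    # RedisPipeline serializes every item to JSON and pushes it onto the items list
    for raw in conn.lrange('scrapyredis:items', 0, -1):
        item = json.loads(raw)
        print(item.get('title'), item.get('net_friend'))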

 
