Distributed crawlers with the Scrapy framework

  • Concept: build a distributed cluster whose machines jointly crawl the same set of resources
    • Purpose: improve the efficiency of crawling data
    • How is it made distributed?
      • Install the scrapy-redis component (it relies on the Redis database)
      • Native Scrapy cannot implement a distributed crawler on its own; it must be combined with the scrapy-redis component.
      • Why can't the native Scrapy framework implement a distributed crawler?
        • 1. The scheduler cannot be shared by the machines in the cluster
        • 2. The pipeline cannot be shared by the machines in the cluster
      • The role of the scrapy-redis component:
        • It provides a shared scheduler and a shared pipeline for the native Scrapy framework
      • Implementation process:
        • Create a project
        • Create a crawler file based on CrawlSpider
        • Modify the current crawler file:
          • Import: from scrapy_redis.spiders import RedisCrawlSpider
          • Comment out start_urls and allowed_domains
          • Add a new attribute: redis_key = 'sun', the name of the shared scheduler queue
          • Write the data parsing logic
          • Change the parent class of the crawler class to RedisCrawlSpider
        • Modify the settings.py configuration file
          • Specify the shared pipeline (boilerplate, can be copied directly)
            ITEM_PIPELINES = { 'scrapy_redis.pipelines.RedisPipeline': 400, }

          • Specify the shared scheduler (boilerplate, can be copied directly)
            # Add a deduplication container class configuration: it uses a Redis set to store
            # request fingerprints, so that request deduplication is persisted
            DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
            # Use the scheduler provided by the scrapy_redis component
            SCHEDULER = "scrapy_redis.scheduler.Scheduler"
            # Configure whether the scheduler is persistent, i.e. whether the request queue and the
            # deduplication fingerprint set in Redis are cleared when the crawl ends. If True, they are
            # not cleared, so only data that has not been crawled yet will be crawled.
            SCHEDULER_PERSIST = True
          • Specify the Redis server:
            REDIS_HOST = 'IP of the remote Redis server (change to your own)'
            REDIS_PORT = 6379
        • Redis-related setup (for installation, refer to the Runoob tutorial)
          • Configure redis configuration file:
            • Linux or mac: redis.conf
            • Windows : redis.windows.conf
            • Open the configuration file to edit it (vi redis.conf; on Windows, edit redis.windows.conf directly with Notepad):
              • Delete bind 127.0.0.1 (so that remote clients can connect)
              • Turn off protected mode: change protected-mode yes to no
          • Start the Redis service with the configuration file
            • redis-server redis.windows.conf
          • Start the client
            • redis-cli.exe
          • Execute the project (enter the spiders directory)
            • scrapy runspider xxx.py
          • Put a starting URL into the scheduler's queue:
            • The scheduler's queue lives in Redis, so push it from the redis client (a Python alternative is sketched after this list):
              • lpush xxx www.xxx.com  (xxx is the queue name, i.e. the redis_key)
          • The crawled items are stored in Redis in the list named <spider name>:items (here fbs:items)
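
Besides redis-cli, the connection can be checked and the starting URL pushed from Python with the redis package; this is only a sketch, assuming the spider below (redis_key = 'sun') and using www.xxx.com as a stand-in for the real start URL.

import redis

# connect to the Redis server configured in settings.py (REDIS_HOST / REDIS_PORT)
r = redis.Redis(host='127.0.0.1', port=6379)
print(r.ping())  # True means the server accepts the connection

# seed the shared scheduler queue (same effect as: lpush sun www.xxx.com)
r.lpush('sun', 'http://www.xxx.com')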

In fact, the distributed code itself is not difficult; it is mainly the configuration that takes time. The code below builds on the previous blog post.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from fenbushipro.items import FenbushiproItem
class FbsSpider(RedisCrawlSpider):
    name = 'fbs'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    redis_key = 'sun'  # name of the shared scheduler queue in Redis
    rules = (
        Rule(LinkExtractor(allow=r'id=1&page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            # use relative XPaths so that each li yields its own data
            new_num = li.xpath('./span[1]/text()').extract_first()
            new_title = li.xpath('./span[3]/a/text()').extract_first()
            print(new_num, new_title)  # only for checking that parsing works
            item = FenbushiproItem()
            item['new_num'] = new_num
            item['new_title'] = new_title
            yield item
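
The spider imports FenbushiproItem from fenbushipro.items; below is a minimal sketch of that items.py, assuming it only needs the two fields used above.

import scrapy

class FenbushiproItem(scrapy.Item):
    # the two fields filled in parse_item
    new_num = scrapy.Field()
    new_title = scrapy.Field()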

# settings.py: specify the shared pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
# Specify the shared scheduler
# Add a deduplication container class configuration: it uses a Redis set to store
# request fingerprints, so that request deduplication is persisted
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler provided by the scrapy_redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Configure whether the scheduler is persistent, i.e. whether the request queue and the
# deduplication fingerprint set in Redis are cleared when the crawl ends. If True, they are not cleared.
SCHEDULER_PERSIST = True
# Specify the Redis server
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
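
The RedisPipeline only stores items in Redis, so they usually still have to be persisted somewhere else afterwards. Below is a minimal consumer sketch under that assumption; the fbs:items key follows from the spider name, and the file name fbs_items.jsonl is just an example.

import json
import redis

r = redis.Redis(host='127.0.0.1', port=6379)

# drain the items list and append each item to a JSON-lines file
with open('fbs_items.jsonl', 'a', encoding='utf-8') as f:
    while True:
        # blpop blocks until an item is available and returns (key, value)
        _, raw = r.blpop('fbs:items')
        item = json.loads(raw)
        f.write(json.dumps(item, ensure_ascii=False) + '\n')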

Origin blog.csdn.net/qwerty1372431588/article/details/107303456