Distributed crawler based on the Scrapy framework

Distributed crawling

  • The concept: several machines are combined into a distributed cluster; every machine in the cluster runs the same set of programs and they jointly crawl the same set of network resources.

  • Native Scrapy cannot implement distributed crawling on its own

    • The scheduler cannot be shared across machines
    • The pipeline cannot be shared across machines
  • scrapy + redis (the scrapy and scrapy-redis components) is used to implement distributed crawling

  • Role of the scrapy-redis component:

    • It provides a scheduler and a pipeline that can be shared by every machine in the cluster
  • Installation:

    pip install scrapy-redis
  • Coding steps:

    1. Create the project
    
    2. cd proName
    
    3. Create a CrawlSpider-based spider file
    
    4. Modify the spider class (see the full spider code below):
        - Import: from scrapy_redis.spiders import RedisCrawlSpider
        - Change the spider's parent class to RedisCrawlSpider
        - Delete allowed_domains and start_urls
        - Add a new attribute: redis_key = 'xxxx', the name of the shared scheduler queue
    
    5. Modify the configuration in settings.py
        - Specify the pipeline
            ITEM_PIPELINES = {
                'scrapy_redis.pipelines.RedisPipeline': 400
            }
        - Specify the scheduler
            # Add a dedupe container class that uses a Redis set to store request fingerprints,
            # so request deduplication is persisted
            DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
            # Use the scheduler provided by the scrapy-redis component
            SCHEDULER = "scrapy_redis.scheduler.Scheduler"
            # Whether the scheduler persists its state: when the crawl ends, should the request
            # queue and the dedupe fingerprint set be kept in Redis? True means persist
            # (do not clear the data); False means clear it
            SCHEDULER_PERSIST = True
        - Specify the Redis database
            REDIS_HOST = '127.0.0.1'
            REDIS_PORT = 6379
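        - Optional (not in the original post): scrapy-redis can also read a single connection URL instead of REDIS_HOST/REDIS_PORT, which is handy when the Redis instance requires a password; 'your_password' below is only a placeholder
            # REDIS_URL = 'redis://:your_password@127.0.0.1:6379'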
    
     6. Configure the Redis server (redis.windows.conf)
        - Disable the default bind
            - line 56: #bind 127.0.0.1
        - Disable protected mode
            - line 75: protected-mode no
    
     7. Start the Redis server (with the config file) and a client
        - redis-server.exe redis.windows.conf
        - redis-cli
    
     8. Run the project (the same command is run on every machine that joins the crawl)
        - scrapy runspider spider.py
    
     9. Push the start URL into the shared scheduler queue (sun)
        - in redis-cli: lpush sun www.xxx.com
    
     10. Redis after the crawl:
        - xxx:items: stores the crawled data (xxx is the spider name, so fbs:items in this example)
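        - To inspect that data from Python, here is a minimal sketch using the redis client library (pip install redis); it assumes the default RedisPipeline JSON serialization and the key name fbs:items from this example:
            import json
            import redis

            # connect to the same Redis instance configured in settings.py
            r = redis.Redis(host='127.0.0.1', port=6379)

            print(r.llen('fbs:items'))        # how many items have been stored so far
            raw = r.lindex('fbs:items', 0)    # first stored item, as JSON bytes
            if raw is not None:
                print(json.loads(raw))        # e.g. {'title': ..., 'status': ...}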

    Spider code:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_redis.spiders import RedisCrawlSpider
    from fbsPro.items import FbsproItem
    class FbsSpider(RedisCrawlSpider):
        name = 'fbs'
        # allowed_domains = ['www.xxx.com']
        # start_urls = ['http://www.xxx.com/']
    
        redis_key = 'sun'     # name of the shared scheduler queue
    
        # extract the pagination links of the complaint list
        link = LinkExtractor(allow=r'type=4&page=\d+')
        rules = (
            Rule(link, callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # each row of the list table is one complaint record
            tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
            for tr in tr_list:
                title = tr.xpath('./td[2]/a[2]/@title').extract_first()
                status = tr.xpath('./td[3]/span/text()').extract_first()

                item = FbsproItem()
                item['title'] = title
                item['status'] = status

                # the item goes to the shared RedisPipeline and is stored in Redis
                yield item
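
    The spider imports FbsproItem from fbsPro.items, which the original post does not show. A minimal sketch of that items.py, inferred from the two fields used in parse_item:

    # fbsPro/items.py
    import scrapy


    class FbsproItem(scrapy.Item):
        # the two fields populated in parse_item above
        title = scrapy.Field()
        status = scrapy.Field()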
    
    

Origin: www.cnblogs.com/zhufanyu/p/12020536.html