[Building a distributed crawler with the Scrapy framework and scrapy-redis] --2019-08-07 10:14:58

Original: http://106.13.73.98/__/26/

Scrapy cannot implement distributed crawling by itself, for two reasons:

  1. When Scrapy is deployed on multiple machines, each machine has its own scheduler, so the URLs in the start_urls list cannot be divided among the machines; in other words, multiple machines cannot share the same scheduler.
  2. The data crawled by multiple machines cannot be persisted through a single pipeline; in other words, multiple machines cannot share the same pipeline.
    ___

The distributed crawling component: scrapy-redis

Installation: pip install scrapy-redis

scrapy-redis provides a ready-made scheduler and pipeline that multiple machines can share, so we can use it to implement distributed crawling directly.

Common Redis key names used by scrapy-redis

  1. <spider name>:items — list type; stores the data scraped by the spider; the content is item objects serialized as JSON strings.
  2. <spider name>:dupefilter — set type; used to deduplicate URLs the spider has already visited; the content is 40-character hash fingerprints of the requests.
  3. <spider name>:start_urls — list type; the default name of the shared scheduler queue (another name can be specified); the spider reads its initial URLs from here on startup, and the starting URLs have to be pushed in manually.
  4. <spider name>:requests — zset type; used by the scheduler to schedule requests; the content is serialized request objects.
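
To see what these keys hold in practice, here is a minimal sketch (not from the original post) that inspects them with the redis-py client for a spider named blog01, assuming the Redis connection settings shown in the configuration section below.

# A minimal sketch: inspect the scrapy-redis keys for a spider named blog01,
# using the redis-py client (pip install redis).
import redis

# Connection parameters are assumed to match the settings file shown later.
r = redis.Redis(host='9.0.0.1', port=6380, decode_responses=True)

spider = 'blog01'

# blog01:items -- list of scraped items serialized as JSON strings
print(r.llen(f'{spider}:items'), 'items scraped')
print(r.lrange(f'{spider}:items', 0, 2))  # peek at the first few items

# blog01:dupefilter -- set of 40-character request fingerprints
print(r.scard(f'{spider}:dupefilter'), 'fingerprints')

# blog01:requests -- zset of serialized requests waiting to be scheduled
print(r.zcard(f'{spider}:requests'), 'pending requests')

# blog01:start_urls -- list of starting urls (only if redis_key is left at its default)
print(r.lrange(f'{spider}:start_urls', 0, -1))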

Usage steps

  1. Change the spider's parent class: if the original spider was based on Spider, change the parent class to RedisSpider; if it was based on CrawlSpider, change the parent class to RedisCrawlSpider.
  2. Delete the start_urls list and add a redis_key attribute whose value is the name of the scheduler queue shared through the scrapy-redis component.
  3. In the settings file, add the configuration that enables the pipeline, dedupe filter and scheduler encapsulated by scrapy-redis, as shown below.

Related configuration (settings.py)

# Use the pipeline encapsulated in the scrapy-redis component to store the data
# directly in Redis; a pipelines file is no longer needed
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 100
}

# Use the dedupe queue of the scrapy-redis component instead of Scrapy's default dedupe mechanism
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# Use the scheduler of the scrapy-redis component instead of the default scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'

# Allow pausing; the request records in Redis will not be lost
SCHEDULER_PERSIST = True
# I.e. whether the scheduler persists its state: whether the request queue and the
# dedupe fingerprint set in Redis are cleared after the crawl finishes


"""Request queue type"""
# 1. By priority (default)
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# 2. Queue: first in, first out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
# 3. Stack: first in, last out
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"


"""Redis connection settings"""
REDIS_HOST = '9.0.0.1'  # Redis listening address
REDIS_PORT = 6380  # Redis listening port
REDIS_ENCODING = 'utf-8'  # Encoding

# If your Redis server requires a password:
# REDIS_PARAMS = {'password': '<Redis password>'}

Getting started

First: a distributed crawler based on RedisSpider.

# -*- coding: utf-8 -*-
import scrapy
from blog.items import  BlogItem
from scrapy_redis.spiders import RedisSpider  # pip install scrapy-redis


class Blog01Spider(RedisSpider):  # Change the parent class to RedisSpider
    name = 'blog01'

    # Delete the start_urls list and use the redis_key below instead
    # start_urls = ['http://www.xxx.com/']

    # Name of the shared scheduler queue (the shared start-url list)
    redis_key = 'cmda'  # cmda: abbreviation of the Chinese Medical Doctor Association
    # If you do not set this attribute, the default shared scheduler queue name is: blog01:start_urls

    present_page = 1  # Marks the current page
    max_page = 1000  # How many pages you want to crawl
    url = 'http://db.pharmcube.com/database/cfda/detail/cfda_cn_instrument/%d'  # Template used to build page urls

    def parse(self, response):

        # Parse each page
        item = BlogItem()
        item['num'] = response.xpath('/html/body/div/table/tbody/tr[1]/td[2]/text()').extract_first()  # Registration certificate number
        item['company_name'] = response.xpath('/html/body/div/table/tbody/tr[2]/td[2]/text()').extract_first()  # Registrant name
        item['company_address'] = response.xpath('/html/body/div/table/tbody/tr[3]/td[2]/text()').extract_first()  # Registrant address

        yield item
        # Define the fields to save in the items file
        # This submits the item object to the pipeline of the scrapy-redis component

        # Recursively crawl all the remaining pages
        if self.present_page <= self.max_page:
            self.present_page += 1
            url = self.url % self.present_page
            yield scrapy.Request(url, callback=self.parse)

# Remember to modify your settings file and add the configuration shown above
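
The spider above imports BlogItem from blog/items.py. That file is not shown in the original post; a minimal sketch covering the three fields used above might look like this.

# blog/items.py -- a minimal sketch (the original items file is not shown in the post)
import scrapy


class BlogItem(scrapy.Item):
    num = scrapy.Field()              # Registration certificate number
    company_name = scrapy.Field()     # Registrant name
    company_address = scrapy.Field()  # Registrant address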

Second: a distributed crawler based on RedisCrawlSpider.

# -*- coding: utf-8 -*-
import scrapy
from blog.items import ChoutiProItem
from scrapy_redis.spiders import RedisCrawlSpider  # pip install scrapy-redis
from scrapy.linkextractors import LinkExtractor
# LinkExtractor: link extractor, used to extract the links on the start-url page that match the given rule
from scrapy.spiders import CrawlSpider, Rule
# CrawlSpider: a subclass of Spider; besides inheriting Spider's features and functionality, it adds powerful features of its own
# Rule: rule parser; the pages behind the links found by the link extractor are parsed with the specified parse method (callback)


class Blog02Spider(RedisCrawlSpider):  # Change the parent class to RedisCrawlSpider
    name = 'blog02'

    # allowed_domains = ['www.xxx.com']

    # Delete the start_urls list and use the redis_key below instead
    # start_urls = ['http://www.xxx.com/']

    # Name of the shared scheduler queue (the shared start-url list)
    redis_key = 'chouti'  # https://dig.chouti.com/r/scoff/hot
    # If you do not set this attribute, the default shared scheduler queue name is: blog02:start_urls

    # Define the link-extraction rule
    link = LinkExtractor(allow=r'/r/scoff/hot/\d+')

    # Define the rule parsers here
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-list"]/div')
        for div in div_list:
            item = ChoutiProItem()
            item['title'] = div.xpath('.//div[@class="part1"]/a/text()').extract_first()  # Post title
            item['author'] = div.xpath('.//div[@class="part2"]/a[4]/b/text()').extract_first()  # Post author
            print(item.__dict__)

            yield item
            # Define the fields to save in the items file
            # This submits the item object to the pipeline of the scrapy-redis component

# Remember to modify your settings file and add the configuration shown above
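
As with the first spider, ChoutiProItem is imported from blog/items.py, which the post does not show; a minimal sketch for the two fields used above might be:

# blog/items.py -- a minimal sketch of the item used by the second spider
import scrapy


class ChoutiProItem(scrapy.Item):
    title = scrapy.Field()   # Post title
    author = scrapy.Field()  # Post author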

When everything is ready, start the Redis server and client. On every crawler machine, run scrapy runspider <path to the spider file> to start the crawler; then, in the Redis client, execute lpush <redis_key value> <start url> to push the starting url. At that point all the crawlers begin working.
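
As a concrete (assumed) example for the first spider: after starting it on each machine with scrapy runspider blog01.py, the start url can be pushed either from redis-cli or from Python with the redis-py client, using the redis_key 'cmda' defined in that spider.

# A minimal sketch (assumed example): push the starting url for the blog01 spider
# into its shared scheduler queue, whose name is the spider's redis_key ('cmda').
import redis

r = redis.Redis(host='9.0.0.1', port=6380, decode_responses=True)
r.lpush('cmda', 'http://db.pharmcube.com/database/cfda/detail/cfda_cn_instrument/1')
# Equivalent redis-cli command:
#   lpush cmda http://db.pharmcube.com/database/cfda/detail/cfda_cn_instrument/1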

After the data has been crawled successfully, you can inspect the results with the Redis client: lrange <spider name>:items 0 -1 shows the scraped items, and smembers <spider name>:dupefilter shows the fingerprints of the URLs that were crawled.
___

Being eager to have your behavior understood by others is the behavior of the weak. To become strong: step one, learn to be alone; step two, learn to accept not being understood; step three, let your results do the talking. Keep going!

Original: http://106.13.73.98/__/26/


Origin www.cnblogs.com/gqy02/p/11313587.html