[Crawler] Study notes, day 59: 7.1 scrapy-redis in practice — the Item definitions that ship with the source + adapting the scrapy-redis example project + dmoz + myspider_redis + mycrawler_redis

7.1 scrapy-redis in practice: the Item definitions that ship with the source

Item definitions that ship with the source (shown as a screenshot in the original post):
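Since the screenshot does not reproduce here, the following is a sketch of the items.py that ships with the scrapy-redis example-project — check your clone for the exact version, and note that newer Scrapy releases move these processors into the separate itemloaders package:

from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join


class ExampleItem(Item):
    # filled in by the spiders
    name = Field()
    description = Field()
    link = Field()
    # stamped on by the example pipeline
    crawled = Field()
    spider = Field()
    url = Field()


class ExampleLoader(ItemLoader):
    default_item_class = ExampleItem
    default_input_processor = MapCompose(lambda s: s.strip())
    default_output_processor = TakeFirst()
    description_out = Join()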

Adapting the scrapy-redis example project

Grab the scrapy-redis source from GitHub, then copy the example-project directory inside it to a location of your choice:

# clone the scrapy-redis source from GitHub
git clone https://github.com/rolando/scrapy-redis.git

# reuse the official example project under your own name (for the chronically lazy)
mv scrapy-redis/example-project ~/scrapyredis-project

The cloned scrapy-redis source ships with a built-in example-project containing three spiders: dmoz, myspider_redis, and mycrawler_redis.

1. dmoz (class DmozSpider(CrawlSpider))

This spider inherits from CrawlSpider and is here to demonstrate persistence in Redis: run the dmoz spider, stop it with Ctrl+C, then run it again — the records crawled before the interruption are still in Redis, so the crawl resumes instead of starting over.

Looked at closely, this is really the scrapy-redis take on a CrawlSpider: you must define Rule rules, and the callback must not be named parse(), because CrawlSpider uses parse() internally to drive its rules.

How to run it: scrapy crawl dmoz

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    """Follow categories and extract links."""
    name = 'dmoz'
    allowed_domains = ['dmoztools.net']
    start_urls = ['http://dmoztools.net/']

    rules = [
        Rule(LinkExtractor(
            restrict_css=('.top-cat', '.sub-cat', '.cat-item')
        ), callback='parse_directory', follow=True),
    ]

    def parse_directory(self, response):
        for div in response.css('.title-and-desc'):
            yield {
                'name': div.css('.site-title::text').extract_first(),
                'description': div.css('.site-descr::text').extract_first('').strip(),
                'link': div.css('a::attr(href)').extract_first(),
            }
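The resume-after-Ctrl+C behavior does not come from the spider itself but from the scrapy-redis scheduler configured in the example project's settings.py. A minimal sketch of the relevant settings, following the scrapy-redis README — the Redis address is an assumption for a local setup:

# use the scrapy-redis scheduler and duplicate filter instead of Scrapy's defaults
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# keep the request queue and dupefilter fingerprints in Redis between runs;
# this is what lets the interrupted dmoz crawl pick up where it stopped
SCHEDULER_PERSIST = True

# optionally store scraped items in Redis too (under '<spider>:items' by default)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# where Redis lives — assumed to be local here
REDIS_URL = 'redis://localhost:6379'

With SCHEDULER_PERSIST = True, pending requests (kept under <spider>:requests by default) and dupefilter fingerprints (<spider>:dupefilter) survive a restart, which is exactly the persistence demonstrated above.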

2. myspider_redis (class MySpider(RedisSpider))

This spider inherits from RedisSpider and supports distributed crawling. It follows the basic Spider pattern, so you write the parse() method yourself.

Also, there is no start_urls anymore; it is replaced by redis_key. scrapy-redis pops values off that Redis key and turns them into URLs to request.

from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'myspider_redis'

    # note the format of the redis key:
    redis_key = 'myspider:start_urls'

    # optional: the equivalent of allowed_domains; write __init__ in this fixed
    # form and, when reusing it, change only the class name passed to super()
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        # wrap in list(): Python 3's filter() returns a lazy iterator
        self.allowed_domains = list(filter(None, domain.split(',')))

        # change the class name here to the current class name
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
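Once the spider is fed (see the run steps below) and items start flowing, you can watch them accumulate in Redis — assuming the RedisPipeline from the settings sketch above and its default '<spider>:items' key pattern:

    redis> llen myspider_redis:items
    redis> lrange myspider_redis:items 0 0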

Note:

A RedisSpider subclass does not need hard-coded allowed_domains or start_urls:

  1. scrapy-redis defines the crawl's domain range dynamically in the __init__() constructor; alternatively, you can still hard-code allowed_domains yourself.
  2. You must specify redis_key, the key the spider reads its start URLs from, in the reference format: redis_key = 'myspider:start_urls'
  3. Following that format, lpush the start URLs into the Redis database from redis-cli on the master; the RedisSpider then fetches its start_urls from the database.

How to run it:

  1. Run the spider file with runspider (you can also start several copies, on one or many machines); each spider starts up and sits waiting:

    scrapy runspider myspider_redis.py

  2. On the master, enter the push command in redis-cli, in the reference format:

    redis> lpush myspider:start_urls http://dmoztools.net/

  3. The slave spiders pick up the request and start crawling. (A note on the optional domain argument follows these steps.)
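Because the constructor above pops an optional domain keyword argument, the allowed domains can be set at launch time with Scrapy's standard -a spider-argument flag. A usage sketch — the domain value is only an example:

    scrapy runspider myspider_redis.py -a domain=dmoztools.net

Multiple domains can be passed comma-separated, matching the domain.split(',') in __init__.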

3. mycrawler_redis (class MyCrawler(RedisCrawlSpider))

This spider inherits from RedisCrawlSpider and supports distributed crawling. Because it is a CrawlSpider underneath, it must follow Rule rules, and the callback must not be named parse().

Here too there is no start_urls; it is replaced by redis_key, and scrapy-redis pops values off that key to turn them into URLs to request.

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawler(RedisCrawlSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'

    rules = (
        # follow all links
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    # __init__ must be written in this fixed form; when reusing it, change only
    # the class name passed to super()
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        # wrap in list(): Python 3's filter() returns a lazy iterator
        self.allowed_domains = list(filter(None, domain.split(',')))

        # change the class name here to the current class name
        super(MyCrawler, self).__init__(*args, **kwargs)

    def parse_page(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
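Note that LinkExtractor() with no arguments follows every link on every page. When adapting this template, you would usually narrow the extractor; a sketch with a purely hypothetical allow pattern:

    rules = (
        # hypothetical: only follow links whose URL matches /category/
        Rule(LinkExtractor(allow=r'/category/'), callback='parse_page', follow=True),
    )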

Note:

Likewise, a RedisCrawlSpider subclass does not need hard-coded allowed_domains or start_urls:

  1. scrapy-redis defines the crawl's domain range dynamically in the __init__() constructor; alternatively, you can still hard-code allowed_domains yourself.
  2. You must specify redis_key, in the reference format: redis_key = 'myspider:start_urls'
  3. Following that format, lpush the start URLs into the Redis database from redis-cli on the master; the RedisCrawlSpider then fetches its start_urls from the database.

How to run it:

  1. Run the spider file with runspider (again, several copies can run at once); each spider starts up and sits waiting:

    scrapy runspider mycrawler_redis.py

  2. On the master, enter the push command in redis-cli, in the reference format (a script for seeding many URLs at once follows these steps):

    redis> lpush mycrawler:start_urls http://www.dmoz.org/

  3. The spiders pick up the URL and start crawling.
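Pushing start URLs one at a time in redis-cli gets tedious. The master can seed the queue from a short script instead — a minimal sketch using the redis Python package, where the connection details and URL list are assumptions for illustration:

# seed_start_urls.py - push start URLs onto the key the spider watches
import redis

r = redis.Redis(host='localhost', port=6379)  # assumed local Redis

start_urls = [
    'http://www.dmoz.org/',  # example seed; replace with your own
]

for url in start_urls:
    # must match redis_key in the spider: 'mycrawler:start_urls'
    r.lpush('mycrawler:start_urls', url)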

Summary:

  1. If all you need is Redis-backed deduplication and persistence, choose the first approach;
  2. For distributed crawling, choose the second or third, depending on the situation;
  3. In most cases, a deep, focused crawl is written with the third approach.
