Scrapy spider templates -- CrawlSpider

Starting with this article, I will use three articles to explain Scrapy's spider templates. Scrapy provides four spider templates:

  1. Basic: the most basic template, which we will not cover here;
  2. CrawlSpider;
  3. XMLFeedSpider;
  4. CSVFeedSpider.

In this article I will explain the CrawlSpider template.

Zero. Overview

CrawlSpider is a commonly used Spider that follows links according to custom rules. For most sites, the crawling task can be accomplished simply by modifying the rules. The key attribute of CrawlSpider is rules, a tuple of one or more Rule objects. Each Rule object defines the behavior for crawling the target site.

Tip: If multiple Rule objects match the same link, only the first Rule takes effect.
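
For example, in the following minimal sketch (the domain and URL patterns are hypothetical), a link such as /page/2 matches both rules, but only the first rule's parse_page callback will ever handle it:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RuleOrderDemo(CrawlSpider):
    name = 'rule_order_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        # Matches /page/1, /page/2, ... -- wins for any overlapping link.
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_page'),
        # Also matches /page/2, but never fires for such links because the rule above matched first.
        Rule(LinkExtractor(allow=r'/page/'), callback='parse_other'),
    )

    def parse_page(self, response):
        self.logger.info('parse_page handled %s', response.url)

    def parse_other(self, response):
        self.logger.info('parse_other handled %s', response.url)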

Let's look at the Rule syntax:

Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

Parameter descriptions:

  • link_extractor: a LinkExtractor object that defines, mainly through regular expressions, which links to extract and follow from each crawled page;
  • callback: a callback function, or the string name of one, called for each extracted link. It receives the Response as its first argument and returns a list of Item and/or Request objects;
  • cb_kwargs: a dict of keyword arguments passed to the callback function;
  • follow: whether links should be extracted with this Rule's link_extractor from each Response;
  • process_links: a callback function, or the string name of one, called with the list of links extracted by link_extractor; it is mainly used for filtering (see the sketch after this list);
  • process_request: a callback function, or the string name of one, called for every Request extracted by this rule; it is used to filter or modify Requests (see the sketch after this list).
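
To make cb_kwargs, process_links and process_request concrete, here is a minimal sketch under assumed conditions: the domain, the /item/\d+ URL pattern and the 'draft' filter are all hypothetical, and tag_request uses the two-argument signature introduced in Scrapy 2.0.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FilterDemo(CrawlSpider):
    name = 'filter_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        Rule(
            LinkExtractor(allow=r'/item/\d+'),
            callback='parse_item',
            cb_kwargs={'source': 'rule'},      # extra keyword arguments for the callback
            process_links='drop_draft_links',  # filters the extracted link list
            process_request='tag_request',     # filters/modifies each generated Request
        ),
    )

    def drop_draft_links(self, links):
        # Called with the list of Link objects extracted by link_extractor;
        # return only the links we actually want to crawl.
        return [link for link in links if 'draft' not in link.url]

    def tag_request(self, request, response):
        # Called for every Request generated by the rule (Scrapy 2.0+ signature);
        # return the (possibly modified) Request, or None to drop it.
        request.meta['tagged'] = True
        return request

    def parse_item(self, response, source):
        # 'source' arrives via cb_kwargs.
        self.logger.info('parsed %s (source=%s)', response.url, source)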

I. Case

In this case we crawl the well-known quotes site quotes.toscrape.com. We need to extract the text of each quote, the author's name and the tags, then follow the link to the author's profile page and crawl the author's details.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Quotes(CrawlSpider):
    name = "quotes"
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    rules = (
        # Follow pagination links such as /page/2/ and parse the quotes on each page.
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_quotes', follow=True),
        # Follow author links such as /author/Albert-Einstein/ and parse the author details.
        Rule(LinkExtractor(allow=r'/author/\w+'), callback='parse_author')
    )

    def parse_quotes(self, response):
        # Extract the text, author name and tags of every quote on the page.
        for quote in response.css('.quote'):
            yield {
                'content': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
                'tags': quote.css('.tag::text').extract()
            }

    def parse_author(self, response):
        # Extract the author's name, birth date and biography from the author page.
        name = response.css('.author-title::text').extract_first()
        author_born_date = response.css('.author-born-date::text').extract_first()
        author_description = response.css('.author-description::text').extract_first()
        yield {
            'name': name,
            'author_born_date': author_born_date,
            'author_description': author_description
        }

In the code above, the snippet Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_quotes', follow=True) defines the rule for crawling the quote listing pages: any link matching /page/\d+ is treated as a quotes page, and the parse_quotes method is called to extract the data. The snippet Rule(LinkExtractor(allow=r'/author/\w+'), callback='parse_author') defines the rule for crawling the author pages: any link matching /author/\w+ is treated as an author information page, and the parse_author method is called to extract the data.
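
To run the finished spider, place it in a Scrapy project and execute scrapy crawl quotes -o quotes.json; Scrapy follows the pagination and author links according to the two rules and writes every yielded item to quotes.json.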


Origin blog.csdn.net/gangzhucoll/article/details/103707341