Scrapy spider template: XMLFeedSpider

XMLFeedSpider is mainly used for crawling RSS feeds. RSS is an XML-based content syndication format. At the end of this article I will use an example that crawls the Economic Observer Online RSS feed to explain its use in practice. First, let's look at the common attributes of XMLFeedSpider.

0. Common attributes

  1. iterator: the iterator used to parse the feed. Three values are available:
  • iternodes: a high-performance iterator based on regular expressions; this is the default.
  • html: an iterator based on Selector that loads the whole DOM into memory, which causes performance problems on large feeds. Its one advantage is that it can cope with badly formed markup.
  • xml: similar to the html iterator, but parses the document as XML.
  2. itertag: the name of the node to iterate over.
  3. namespaces: the namespace definitions needed when the document being processed uses XML namespaces, as shown in the sketch after this list.
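
For a namespaced feed (an Atom feed, for example), namespaces is a list of (prefix, uri) tuples, and the registered prefix is then used in itertag and in XPath expressions. A minimal sketch, with a hypothetical feed URL:

from scrapy.spiders import XMLFeedSpider


class AtomSpider(XMLFeedSpider):
    name = 'atom_example'
    start_urls = ['https://example.com/feed.atom']  # hypothetical URL
    # Register the Atom namespace under the prefix 'atom' ...
    namespaces = [('atom', 'http://www.w3.org/2005/Atom')]
    iterator = 'xml'
    # ... and use that prefix when naming the iteration node.
    itertag = 'atom:entry'

    def parse_node(self, response, node):
        # The registered prefix also works in XPath expressions.
        yield {'title': node.xpath('atom:title/text()').get()}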

1. Commonly used methods

  1. adapt_response(response): triggered before the spider begins parsing the Response. It is mainly used to modify the content of the Response, and it must return a Response.
  2. parse_node(response, selector): triggered for each node matching itertag. This method must be implemented in your project code, otherwise the spider will not work, and it must return an Item, a Request, or an iterable containing either.
  3. process_results(response, results): triggered when the spider returns its crawl results; it gives you a last chance to modify the results before the framework core receives them, as shown in the sketch after this list.
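
A minimal sketch of overriding the two optional hooks, adapt_response and process_results; the feed URL and the body patch are illustrative assumptions:

from scrapy.spiders import XMLFeedSpider


class HookedSpider(XMLFeedSpider):
    name = 'hooked_example'
    start_urls = ['https://example.com/rss']  # hypothetical feed URL
    itertag = 'item'

    def adapt_response(self, response):
        # Runs before parsing starts: patch the raw body, e.g. to
        # repair a glitch the feed is (hypothetically) known to have.
        body = response.body.replace(b'&nbsp;', b' ')
        return response.replace(body=body)

    def parse_node(self, response, node):
        yield {'title': node.xpath('title/text()').get()}

    def process_results(self, response, results):
        # Runs after each node is parsed: a last chance to adjust the
        # items before the framework core receives them.
        for result in results:
            if isinstance(result, dict):
                result['source_url'] = response.url
            yield result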

2. Case

Let's see how XMLFeedSpider is used in practice by crawling the Economic Observer Online RSS feed. First, let's look at the structure of the feed:
[Screenshot: the RSS feed structure of Economic Observer Online]

As the screenshot shows, the information useful to us sits between the <item> tags; the content inside those tags is what we want to scrape. Such a tag is called a node.
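
To make the node structure concrete, here is a minimal sketch that runs the same kind of CSS selectors the spider below uses against a hand-written <item> node (the field values are made up for illustration); parsel is the selector library Scrapy itself uses:

from parsel import Selector

# A hand-written RSS <item> node mimicking the feed's structure.
SAMPLE_ITEM = """
<item>
    <title>Sample headline</title>
    <link>https://example.com/article/1</link>
    <pubDate>Wed, 01 Jan 2020 00:00:00 GMT</pubDate>
</item>
"""

selector = Selector(text=SAMPLE_ITEM, type='xml')
print(selector.css('title::text').get())    # Sample headline
print(selector.css('link::text').get())     # https://example.com/article/1
print(selector.css('pubDate::text').get())  # Wed, 01 Jan 2020 00:00:00 GMT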

The spider, in rsshub.py:

# -*- coding: utf-8 -*-
from scrapy.spiders import XMLFeedSpider
from ..items import RsshubItem


class RsshubSpider(XMLFeedSpider):
    name = 'rsshub'
    allowed_domains = ['rsshub.app']
    start_urls = ['https://rsshub.app/eeo/01']
    iterator = 'iternodes'  # the default high-performance iterator
    itertag = 'item'        # iterate over each <item> node in the feed

    def parse_node(self, response, selector):
        # Called once per <item> node; extract the fields we need.
        item = RsshubItem()
        item['title'] = selector.css("title::text").extract_first()
        item['public_date'] = selector.css("pubDate::text").extract_first()
        item['link'] = selector.css("link::text").extract_first()
        return item

And the item definition, in items.py:

import scrapy


class RsshubItem(scrapy.Item):
    title = scrapy.Field()
    public_date = scrapy.Field()
    link = scrapy.Field()
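
Assuming a standard Scrapy project layout, the spider can then be run with scrapy crawl rsshub; adding -o items.json to the command writes the scraped items to a JSON file so the output is easy to inspect.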

Origin: blog.csdn.net/gangzhucoll/article/details/103797247