CrawlSpider and XMLFeedSpider

1. CrawlSpider

    In addition to the attributes inherited from Spider, CrawlSpider provides a new rules attribute, which implements the link-following behaviour.

    The rules attribute is a collection containing one or more Rule objects.

    Each Rule defines a specific rule for crawling the website.

    If multiple Rules match the same link, only the first one is applied, according to the order in which they are defined in the rules attribute.

    You can override the parse_start_url(response) method, which is called when the responses for the start URLs are returned.
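A minimal sketch of overriding parse_start_url, assuming a small CrawlSpider subclass; the spider name, the allow pattern and the log message are illustrative, not from the original:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StartUrlSpider(CrawlSpider):
    name = "start_url_demo"                      # illustrative name
    start_urls = ["http://www.blogs.com/boy/default.html"]
    rules = (
        Rule(LinkExtractor(allow=(r"/boy/",)), callback="parse_item"),
    )

    def parse_start_url(self, response):
        # Called for each response generated from start_urls.
        # Returning an empty list (no items, no extra requests) is fine.
        self.logger.info("start page fetched: %s", response.url)
        return []

    def parse_item(self, response):
        pass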

    Rule class:

class scrapy.contrib.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

    (In recent Scrapy versions this class is importable as scrapy.spiders.Rule.)

        Description of the constructor parameters (a sketch using cb_kwargs and process_links follows the basic example below):

            link_extractor: a LinkExtractor object that defines how links are extracted from the crawled pages

            callback: the function to call for each response generated from a link extracted by link_extractor; avoid using parse as the callback, because CrawlSpider uses parse internally

            cb_kwargs: a dictionary of keyword arguments passed to the callback function

            follow: a boolean specifying whether links should also be followed from the responses extracted by this rule

            process_links: a function called with the list of links extracted by link_extractor, mainly used to filter the links

            process_request: a function called with every Request extracted by this rule, used to filter or modify the Requests

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CnblogsSpider(CrawlSpider):
    name = "blogs"
    allowed_domains = ["blogs.com"]
    start_urls = [
        "http://www.blogs.com/boy/default.html"
    ]
    rules = (
        # Follow links matching the regular expression and pass the
        # responses to parse_item.
        Rule(LinkExtractor(allow=(r"/boy/default.html",)),
             follow=True,
             callback="parse_item"),
    )

    def parse_item(self, response):
        pass

Even when the rules attribute contains only a single Rule, it must still be followed by a comma "," so that rules remains a tuple.
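A hedged sketch of a Rule that also uses cb_kwargs and process_links; the spider name, the "category" argument and the link-filtering logic are illustrative, not part of the original example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FilteredBlogsSpider(CrawlSpider):
    name = "filtered_blogs"                      # illustrative name
    allowed_domains = ["blogs.com"]
    start_urls = [
        "http://www.blogs.com/boy/default.html"
    ]
    rules = (
        Rule(LinkExtractor(allow=(r"/boy/",)),
             callback="parse_item",
             cb_kwargs={"category": "boy"},      # extra keyword arguments for the callback
             process_links="filter_links",       # filter the extracted links before requesting them
             follow=True),
    )

    def filter_links(self, links):
        # Drop extracted links whose URL contains "draft" (illustrative filter).
        return [link for link in links if "draft" not in link.url]

    def parse_item(self, response, category):
        # "category" arrives here through cb_kwargs.
        pass

process_links (and process_request) may be given either as a callable or, as here, as the name of a method defined on the spider.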

    About the constructor parameters of the LinkExtractor object (a usage sketch follows the list):

        allow: only extract links whose URLs match the given regular expression(s)

        deny: exclude links whose URLs match the given regular expression(s)

        allow_domains: domains the links are allowed to belong to

        deny_domains: domains to exclude

        restrict_xpaths: only extract links from the regions matching the XPath expression(s)

        restrict_css: only extract links from the regions matching the CSS selector(s)

        unique: whether extracted links are deduplicated, boolean
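A minimal sketch of a LinkExtractor that combines several of these parameters; the patterns, domain and XPath are illustrative, not taken from a real site:

from scrapy.linkextractors import LinkExtractor

# Extract only article-style links found inside the main content area,
# skipping the login page and any external domains.
link_extractor = LinkExtractor(
    allow=(r"/article/\d+",),                    # URLs that look like article pages
    deny=(r"/login",),                           # skip the login page
    allow_domains=("blogs.com",),
    restrict_xpaths=("//div[@id='content']",),
    unique=True,
)

# Besides being passed to a Rule, a LinkExtractor can be applied directly
# to a response object:
#     links = link_extractor.extract_links(response)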

2. XMLFeedSpider

        XMLFeedSpider parses an XML feed by iterating over its nodes.

        The iterator can be chosen from iternodes, xml and html; the default is iternodes.

        Attributes:

            iterator: selects the iterator ("iternodes", "xml" or "html")

            itertag: the name of the node to iterate over

            namespaces: a list of (prefix, uri) tuples defining the namespaces used in the document

        Overridable methods:

            adapt_response(response): called before the spider starts parsing the response; it receives the response and must return a response (the same one or a modified copy)

            parse_node(response, selector): called for each node matching the provided itertag; it must be overridden, and it must return an Item, a Request, or an iterable containing either

            process_results(response, results): called with the results (Items or Requests) returned by the spider, for last-minute processing such as modifying item contents before the results are handed back to the framework core; it must return a list of results

from scrapy.spiders import XMLFeedSpider


class XMLSpider(XMLFeedSpider):
    name = "xmlspider"
    allowed_domains = ["blogs.com"]
    start_urls = [
        "http://www.blogs.com/boy/default.html"
    ]
    iterator = "html"        # use the Selector-based HTML iterator
    itertag = "entry"        # iterate over <entry> nodes

    def adapt_response(self, response):
        # Opportunity to modify the response before parsing; returned unchanged here.
        return response

    def parse_node(self, response, node):
        # Called once per <entry> node; XMLFeedSpider requires this method
        # to be overridden.
        pass
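A minimal sketch of a process_results override, written as a method to add to the XMLSpider class above; the "source_url" field it sets is illustrative:

    def process_results(self, response, results):
        # Last-chance processing before the results go back to the engine,
        # e.g. stamping each dict item with the URL it came from.
        for result in results:
            if isinstance(result, dict):
                result["source_url"] = response.url      # illustrative field
        return results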
