1. CrawlSpider
In addition to the attributes inherited from Spider, CrawlSpider provides a new rules attribute that implements link following.
The rules attribute is a collection of one or more Rule objects,
each of which defines a specific rule for crawling the website.
If multiple Rules match the same link, only the first one is applied, according to the order in which they are defined in the rules attribute.
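The first-match behaviour can be illustrated with a small stand-alone sketch (plain re, no Scrapy; the patterns and callback names are made up for illustration):

```python
import re

# Hypothetical (pattern, callback-name) pairs standing in for Rule objects.
rules = [
    (re.compile(r"/boy/"), "parse_boy"),            # defined first, so it wins
    (re.compile(r"/default\.html"), "parse_default"),
]

def match_rule(url):
    # Only the FIRST rule whose pattern matches the link is applied.
    for pattern, callback in rules:
        if pattern.search(url):
            return callback
    return None

# "/boy/default.html" matches both patterns, but the first rule is used.
print(match_rule("http://www.blogs.com/boy/default.html"))  # parse_boy
```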
You can override the parse_start_url(response) method, which is called for the responses of the start_urls requests.
Rule class:
- class scrapy.contrib.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
Description of construction parameters:
link_extractor: a LinkExtractor object that defines how links are extracted from each crawled page
callback: the function called for each response generated from a link extracted by link_extractor; avoid using parse as the callback, because CrawlSpider uses parse internally to implement its own logic
cb_kwargs: a dictionary of keyword arguments passed to the callback function
follow: a boolean specifying whether links extracted by this rule should themselves be followed
process_links: a function called with the list of links extracted by link_extractor, mainly used to filter the links
process_request: a function called with every Request extracted by this rule, which can filter or modify the Request
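What a process_links hook does can be sketched in plain Python (the Link stand-in and the filter condition are illustrative assumptions, not Scrapy's implementation):

```python
from collections import namedtuple

# Minimal stand-in for Scrapy's Link object, for illustration only.
Link = namedtuple("Link", ["url", "text"])

def process_links(links):
    # A process_links hook receives the extracted links and returns the
    # (possibly filtered or modified) list that will actually be followed.
    return [link for link in links if "logout" not in link.url]

extracted = [
    Link("http://www.blogs.com/boy/default.html", "home"),
    Link("http://www.blogs.com/logout", "logout"),
]
kept = process_links(extracted)
print([link.url for link in kept])  # the logout link is dropped
```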
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CnblogsSpider(CrawlSpider):
    name = "blogs"
    allowed_domains = ["blogs.com"]
    start_urls = ["http://www.blogs.com/boy/default.html"]
    rules = (
        Rule(LinkExtractor(allow=(r"/boy/default\.html",)), follow=True, callback="parse_item"),
    )

    def parse_item(self, response):
        pass
Note that even when rules contains only a single Rule, it must still be followed by a comma "," so that rules remains a tuple.
About the construction parameters of the LinkExtractor object:
allow: extract only links matching the given regular expression(s)
deny: exclude links matching the given regular expression(s)
allow_domains: domains from which links may be extracted
deny_domains: domains from which links are excluded
restrict_xpaths: extract links only from page regions matching the XPath expression(s)
restrict_css: extract links only from page regions matching the CSS selector(s)
unique: a boolean specifying whether duplicate links should be filtered out
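Conceptually, allow and deny combine as in the following sketch (plain re; this is an illustration of the semantics, not LinkExtractor's actual code):

```python
import re

def link_allowed(url, allow=(), deny=()):
    # A link passes if it matches at least one allow pattern (when any are
    # given) and matches none of the deny patterns.
    if allow and not any(re.search(p, url) for p in allow):
        return False
    if any(re.search(p, url) for p in deny):
        return False
    return True

print(link_allowed("http://blogs.com/boy/1.html", allow=(r"/boy/",)))   # True
print(link_allowed("http://blogs.com/boy/1.html",
                   allow=(r"/boy/",), deny=(r"\.html$",)))              # False
```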
2. XMLFeedSpider
XMLFeedSpider parses an XML feed by iterating over its individual nodes.
The iterator can be iternodes, xml, or html; the default is iternodes.
Attributes:
iterator: the iterator to use (iternodes, xml, or html)
itertag: the name of the node (tag) to iterate over
namespaces: the namespaces used in the document, as (prefix, uri) pairs
Overridable methods:
adapt_response(response): called before the spider starts parsing the response; it can modify the response and must return a Response
parse_node(response, selector): called for each node matching the provided itertag; it must be overridden, otherwise the spider will not work
process_results(response, results): called for each result (Item or Request) returned by the spider; intended for last-minute processing, such as modifying item contents, before the results reach the framework core. It must return a list of results.
from scrapy.spiders import XMLFeedSpider

class XMLSpider(XMLFeedSpider):
    name = "xmlspider"
    allowed_domains = ["blogs.com"]
    start_urls = ["http://www.blogs.com/boy/default.html"]
    iterator = "html"
    itertag = "entry"

    def adapt_response(self, response):
        # Called before parsing; may modify and must return a Response.
        return response

    def parse_node(self, response, selector):
        # Must be overridden: called once per matching "entry" node.
        pass
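Conceptually, the node iteration resembles the following stdlib sketch (xml.etree rather than Scrapy's iterators; the feed content is made up for illustration):

```python
import xml.etree.ElementTree as ET

feed = """
<feed>
  <entry><title>first</title></entry>
  <entry><title>second</title></entry>
</feed>
"""

# The spider walks the document and calls parse_node once per node
# whose tag matches itertag ("entry" here).
def parse_node(node):
    return node.findtext("title")

titles = [parse_node(node) for node in ET.fromstring(feed).iter("entry")]
print(titles)  # ['first', 'second']
```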