CrawlSpider crawler tutorial

CrawlSpider
In the earlier Qiushibaike (Embarrassment Encyclopedia) crawler example, we parsed the whole page ourselves, extracted the URL of the next page, and then sent a new request. Sometimes we simply want every URL that matches a certain condition to be crawled for us. In that case we can let CrawlSpider do the work. CrawlSpider inherits from Spider but adds new functionality on top of it: you can define rules for which URLs to crawl, and Scrapy will then crawl every URL that matches those rules without you having to yield requests manually.

CrawlSpider crawler:
Create a CrawlSpider crawler:
Previously, a spider was created with the command scrapy genspider [crawler name] [domain name]. To create a CrawlSpider crawler, use the following command instead:

scrapy genspider -t crawl [crawler name] [domain name]
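For example, to create a CrawlSpider for the WeChat applet community site used later in this tutorial (the spider name and domain below are just illustrative choices):

scrapy genspider -t crawl wxapp_spider "wxapp-union.com"

The generated spider subclasses CrawlSpider and already contains a sample rules tuple and a parse_item callback.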
LinkExtractors link extractor:
With LinkExtractors we can extract the URLs we want without having to do it ourselves and then send the requests manually. All of that work can be handed over to a LinkExtractor, which finds every URL in the crawled pages that matches the rules and so achieves automatic crawling. Here is a brief introduction to the LinkExtractors class:

class scrapy.linkextractors.LinkExtractor(
allow = (),
deny = (),
allow_domains = (),
deny_domains = (),
deny_extensions = None,
restrict_xpaths = (),
tags = ('a', 'area'),
attrs = ('href',),
canonicalize = True,
unique = True,
process_value = None
)
Main parameters explained:

allow: Allowed urls. All urls matching this regular expression will be extracted.
deny: Forbidden urls. All urls satisfying this regular expression will not be fetched.
allow_domains: Allowed domain names. Only the urls of the domains specified in this will be fetched.
deny_domains: Denied domain names. All urls of the domain names specified in this will not be fetched.
restrict_xpaths: Restricting XPaths. Only links found inside the regions selected by these XPath expressions are extracted; it filters links together with allow.
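As a quick illustration, a LinkExtractor can also be used by itself inside an ordinary Spider callback; the sketch below assumes a made-up /article/<id> URL pattern:

import scrapy
from scrapy.linkextractors import LinkExtractor

class DemoSpider(scrapy.Spider):
    name = "link_demo"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Extract only links whose URL matches the made-up pattern /article/<digits>
        extractor = LinkExtractor(allow=r"/article/\d+")
        for link in extractor.extract_links(response):
            # Each extracted Link object exposes .url and .text
            yield scrapy.Request(link.url, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}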
Rule rule class:
Rule defines the crawling rules used by a CrawlSpider. Here is a brief introduction to this class:

class scrapy.spiders.Rule(
link_extractor,
callback = None,
cb_kwargs = None,
follow = None,
process_links = None,
process_request = None
)
Main parameters explained:

link_extractor: A LinkExtractor object used to define crawling rules.
callback: The callback function to execute for URLs that satisfy this rule. Because CrawlSpider itself uses the parse method to implement its logic, do not use parse as your own callback function and do not override it.
follow: Specifies whether links extracted from a response that matched this rule should themselves be followed.
process_links: After links have been extracted by link_extractor, they are passed to this function, which can filter out the links that should not be crawled.
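Putting the two classes together, the rules of a CrawlSpider might be declared as in the sketch below; the URL patterns and the drop_logout_links filter are hypothetical examples (see also the full case that follows):

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

def drop_logout_links(links):
    # Hypothetical process_links filter: drop logout links before they are requested
    return [link for link in links if "logout" not in link.url]

# Declared as a class attribute inside a CrawlSpider subclass:
rules = (
    # List pages: keep following links from them, but run no callback on the page itself
    Rule(LinkExtractor(allow=r"/list/\d+"), follow=True, process_links=drop_logout_links),
    # Detail pages: parse with parse_item and do not follow further links from them
    Rule(LinkExtractor(allow=r"/article/\d+"), callback="parse_item", follow=False),
)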
WeChat applet community CrawlSpider case
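The body of this case is not included here. Below is a hedged sketch of what such a spider commonly looks like; the start URL, the list/detail URL patterns and the XPath selectors are assumptions about the wxapp-union.com layout and may need adjusting:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WxappSpider(CrawlSpider):
    name = "wxapp_spider"
    allowed_domains = ["wxapp-union.com"]
    # Assumed list-page URL of the WeChat applet community tutorial section
    start_urls = ["http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1"]

    rules = (
        # Follow the paginated list pages without parsing them
        Rule(LinkExtractor(allow=r".+mod=list&catid=2&page=\d+"), follow=True),
        # Parse each article detail page; do not follow links found on it
        Rule(LinkExtractor(allow=r".+article-.+\.html"), callback="parse_detail", follow=False),
    )

    def parse_detail(self, response):
        # XPath selectors below are assumptions about the page structure
        title = response.xpath("//h1[@class='ph']/text()").get()
        author = response.xpath("//p[@class='authors']/a/text()").get()
        yield {"title": title, "author": author, "url": response.url}

The detail rule sets follow=False because every article page is reached from a list page, so there is no need to extract further links from it.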
