Using Scrapy - LinkExtractor

Background:

When crawling, we often need specific information that sits under particular tags of a website, which means we first need the links under those tags. If each link is located one at a time and its content is fetched right then, before moving on to the next link, the process is inefficient: roughly O(n^2) time. If we instead extract all the links first and then fetch the content, the cost is O(n) + O(n), that is, two passes (a crawl depth of 2), which is still O(n). For whole-site crawls the efficiency gain is significant.
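To make the two-pass idea concrete, here is a minimal plain-Python sketch that uses only the standard library, not Scrapy itself. The `PAGES` dictionary stands in for a website and is purely hypothetical, as are the names `LinkCollector`, `collect_links`, and `fetch_contents`. In a real crawl, LinkExtractor would play the role of pass 1 and the spider callback the role of pass 2.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" (URL -> HTML). A real crawl would download these.
PAGES = {
    "http://example.com/": (
        '<ul><li><a href="/a.html">A</a></li>'
        '<li><a href="/b.html">B</a></li></ul>'
    ),
    "http://example.com/a.html": "<p>content of A</p>",
    "http://example.com/b.html": "<p>content of B</p>",
}


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def collect_links(url):
    """Pass 1: read the listing page once and return all links found on it."""
    parser = LinkCollector(url)
    parser.feed(PAGES[url])
    return parser.links


def fetch_contents(links):
    """Pass 2: visit each collected link exactly once."""
    return {link: PAGES[link] for link in links}


links = collect_links("http://example.com/")
contents = fetch_contents(links)
```

Each page is read once in each pass, which is the O(n) + O(n) behaviour described above.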

LinkExtractor parameters:

allow (str or list) - a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it matches all links.
deny (str or list) - a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It takes precedence over the allow parameter. If not given (or empty), no links are excluded.
allow_domains (str or list) - a single value or a list of strings containing the domains that will be considered when extracting links.
deny_domains (str or list) - a single value or a list of strings containing the domains that will not be considered when extracting links.
deny_extensions (list) - a single value or a list of strings containing the file extensions that should be ignored when extracting links. If not given, it defaults to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package.
restrict_xpaths (str or list) - an XPath (or list of XPaths) that defines the regions inside the response from which links should be extracted. If given, only the content selected by those XPaths is scanned for links.
restrict_css (str or list) - a CSS selector (or list of selectors) that defines the regions inside the response from which links should be extracted. It behaves the same way as restrict_xpaths.
tags (str or list) - a tag or a list of tags to consider when extracting links. The default is ('a', 'area').
attrs (list) - an attribute or a list of attributes that should be considered when looking for links to extract (only for the tags specified in the tags parameter). The default is ('href',).
canonicalize (bool) - canonicalize each extracted URL (using w3lib.url.canonicalize_url). Older Scrapy releases default to True; current releases default to False.
unique (bool) - whether duplicate filtering should be applied to the extracted links.
process_value (callable) - a function that receives each value extracted from the scanned tags and attributes. It can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
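The way process_value, allow, and deny interact can be easier to follow in code. The following is a simplified plain-Python model of the rules described above, not Scrapy's actual implementation. The function names `extract_js_target` and `select_links`, the example.com URLs, and the sample href values are all assumptions made for illustration.

```python
import re
from urllib.parse import urljoin


def extract_js_target(value):
    """process_value callable: pull the URL out of javascript:goToPage('...')."""
    match = re.search(r"javascript:goToPage\('(.*?)'", value)
    if match:
        return match.group(1)
    return value  # ordinary href: keep it unchanged


def select_links(values, base_url, allow=(), deny=(), process_value=lambda x: x):
    """Simplified model of the documented filtering rules (not Scrapy's code)."""
    selected = []
    for value in values:
        value = process_value(value)
        if value is None:  # process_value returned None: ignore this link
            continue
        url = urljoin(base_url, value)  # patterns are matched on absolute URLs
        if allow and not any(re.search(p, url) for p in allow):
            continue  # an empty allow list matches every link
        if any(re.search(p, url) for p in deny):
            continue  # deny takes precedence over allow
        selected.append(url)
    return selected


# Hypothetical attribute values as they might appear in a page.
hrefs = [
    "/news/1.html",
    "/news/login.html",
    "/about.html",
    "javascript:goToPage('../news/2.html'); return false",
]
result = select_links(
    hrefs,
    "http://example.com/list/index.html",
    allow=[r"/news/"],
    deny=[r"login"],
    process_value=extract_js_target,
)
print(result)
```

With Scrapy installed, the same callable would be passed to the real extractor as LinkExtractor(allow=r'/news/', deny=r'login', process_value=extract_js_target).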

  


Origin www.cnblogs.com/superSmall/p/12057599.html