[Scrapy Framework] Version 2.4.0 Source Code: Link Extractors in Detail

Index of all source-code analysis articles in this series:

[Scrapy Framework] Version 2.4.0 Source Code: All Configuration Directory Index

Introduction

A link extractor is an object that extracts links from responses. The LxmlLinkExtractor.extract_links method returns a list of matching Link objects from a Response object. Link extractors are used in CrawlSpider spiders through a set of Rule objects, as shown in the sketch below; they can also be used directly in a spider callback, as in the example that follows.
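For example, a minimal CrawlSpider sketch wiring a LinkExtractor into a Rule might look like the following (the spider name, start URL, and allow pattern are illustrative assumptions, not part of the Scrapy source):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    # Hypothetical name and start URL, used only for illustration
    name = 'example_crawl'
    start_urls = ['http://example.com/']

    rules = (
        # Follow links whose URL matches the allow pattern, hand each
        # matched page to parse_item, and keep crawling from those pages
        Rule(LinkExtractor(allow=r'/category/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}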

Using a link extractor in a callback

def parse(self, response):
    for link in self.link_extractor.extract_links(response):
        yield Request(link.url, callback=self.parse)
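The snippet above assumes the spider already holds a link extractor instance. A self-contained sketch might look like this (the spider name, start URL, and allow pattern are assumptions for illustration):

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

class LinkFollowSpider(scrapy.Spider):
    # Hypothetical name and start URL, used only for illustration
    name = 'link_follow'
    start_urls = ['http://example.com/']

    # Reusable extractor; only links whose URL matches the pattern are returned
    link_extractor = LinkExtractor(allow=r'/articles/')

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse)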

Link extractor

from scrapy.linkextractors import LinkExtractor

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True, restrict_text=None)

Parameter Description:

  1. allow (str or list): A single regular expression (or list of regular expressions) that the URL must match before it can be extracted. If not specified (or empty), it will match all links.

  2. deny (str or list): A single regular expression (or list of regular expressions) that the URL must match in order to be excluded (that is, not extracted). It takes precedence over the allow parameter. If not specified (or empty), no links will be excluded.

  3. allow_domains (str or list): A single value or a list of strings containing the domains that will be considered when extracting links.

  4. deny_domains (str or list): A single value or a list of strings containing the domains that will not be considered when extracting links.

  5. deny_extensions (list): A single value or a list of strings containing file extensions that should be ignored when extracting links. If not given, it defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which includes 7z, 7zip, apk, bz2, cdr, dmg, ico, iso, tar, tar.gz, webm, and xz, among others.

  6. restrict_xpaths (str or list): An XPath (or a list of XPaths) that defines regions inside the response from which links should be extracted. If given, only the text selected by those XPaths will be scanned for links.

  7. restrict_css (str or list): A CSS selector (or list of selectors) that defines regions inside the response from which links should be extracted. Behaves the same as restrict_xpaths.

  8. restrict_text (str or list): A single regular expression (or list of regular expressions) that the link's text must match for the link to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one of them.

  9. tags (str or list): A tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').

  10. attrs (list): An attribute or a list of attributes that should be considered when searching for links to extract (only for the tags specified in the tags parameter). Defaults to ('href',).

  11. canonicalize (bool): Canonicalize each extracted URL (using w3lib.url.canonicalize_url). Defaults to False. Note that canonicalize_url is meant for duplicate checking; it can change the URL as seen on the server side, so the response can be different for requests with canonicalized and raw URLs. If you are using LinkExtractor to follow links, it is more robust to keep the default canonicalize=False.

  12. unique (bool): Whether duplicate filtering should be applied to the extracted links.

  13. process_value (collections.abc.Callable): A function that receives each value extracted from the scanned tags and attributes, and can either modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x. A combined usage sketch follows this list.

  14. strip (bool): Whether to strip whitespace from the extracted attribute value. According to the HTML5 standard, leading and trailing whitespace must be stripped from the href attribute of <a>, <area> and many other elements, and from the src attribute of <img>, <iframe> and other elements, so LinkExtractor strips whitespace by default. Set strip=False to turn it off (for example, if you are extracting URLs from elements or attributes that allow leading/trailing whitespace).
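As referenced in the process_value item above, here is a sketch that combines several of these parameters (the domain, URL patterns, CSS selector, custom attribute, and the javascript-link handling are all assumptions for illustration):

import re

from scrapy.linkextractors import LinkExtractor

def extract_js_target(value):
    # Pull the real URL out of javascript:goToPage('...') style values;
    # returning the value unchanged keeps ordinary links as they are
    m = re.search(r"javascript:goToPage\('(.*?)'\)", value)
    return m.group(1) if m else value

link_extractor = LinkExtractor(
    allow=r'/product/\d+',          # only product pages
    deny=r'/product/\d+/reviews',   # ...but skip their review sub-pages
    allow_domains=['example.com'],  # stay on one (hypothetical) domain
    restrict_css=('div.listing',),  # only scan inside the listing block
    attrs=('href', 'data-href'),    # also scan a custom attribute
    unique=True,                    # drop duplicate links
    process_value=extract_js_target,
)

# Inside a spider callback: links = link_extractor.extract_links(response)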

Link

class scrapy.link.Link(url, text='', fragment='', nofollow=False)

A Link object represents a link extracted by the LinkExtractor; a short inspection sketch follows the parameter list below.

Parameter Description:

  1. url : The absolute URL linked to in the anchor tag.

  2. text : The text in the anchor tag.

  3. fragment : The part after the hash symbol in the web address.

  4. nofollow : Indicates whether the rel attribute of the anchor tag contains the nofollow value.
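A short inspection sketch for these fields (the hand-built HtmlResponse below is only there so the example is self-contained; in a real spider the response comes from the downloader):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Hand-built response, used only for illustration
body = b'<a href="/about" rel="nofollow">About us</a>'
response = HtmlResponse(url='http://example.com/index.html', body=body, encoding='utf-8')

for link in LinkExtractor().extract_links(response):
    print(link.url)       # absolute URL, resolved against the response URL
    print(link.text)      # anchor text, e.g. 'About us'
    print(link.nofollow)  # True here, because rel="nofollow" is present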


Origin blog.csdn.net/qq_20288327/article/details/113520529