Source code analysis of Spider and CrawlSpider

1. Spider source code analysis

Before analyzing the CrawlSpider source code, let's first analyze the Spider source code.

1.1. Introduction to Spider and explanation of main functions

The Spider class defines how to crawl a certain website (or group of websites). It includes the crawling actions (whether to follow links) and how to extract structured data (Items) from page content. In other words, a Spider is where you define the crawling behavior and how to parse a particular page (or pages).
Spider is the most basic class, and all crawlers must inherit this class.
The main functions and calling sequence of the Spider class are:
1) __init__()
Initializes the spider name and the start_urls list.
Important: the spider name is required and must be unique.

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []
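
To see how this constructor is used in practice, here is a minimal sketch of a subclass (the spider name and URL are made up for illustration). Note that any extra keyword arguments, such as those passed on the command line with scrapy crawl example_spider -a category=novel, become instance attributes through self.__dict__.update(kwargs).

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # name is required and must be unique within the project
        name = 'example_spider'
        # used because we define it ourselves; otherwise __init__ sets []
        start_urls = ['http://example.com/']

        def parse(self, response):
            # self.category exists if '-a category=novel' was passed on the command line
            self.logger.info('category: %s', getattr(self, 'category', None))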

2) start_requests()
Calls make_requests_from_url() to generate Request objects, which Scrapy downloads, returning Response objects.
Important: this method is called only once.

    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        return Request(url, dont_filter=True)
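
Because make_requests_from_url() is deprecated, the recommended approach when the initial requests need customizing is to override start_requests() itself. A minimal sketch, assuming a hypothetical login URL and form fields:

    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'login_spider'

        def start_requests(self):
            # replaces the default loop over start_urls shown above
            yield scrapy.FormRequest(
                'http://example.com/login',               # hypothetical URL
                formdata={'user': 'john', 'pass': 'secret'},
                callback=self.after_login,
            )

        def after_login(self, response):
            self.logger.info('Logged in, landed on %s', response.url)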

3) parse()
Parses the response and returns Items or Requests (a callback function must be specified for the latter). Items are passed to the Item Pipeline for persistence, and Requests are downloaded by Scrapy and handled by the specified callback. The loop continues until all data has been processed.
Key point: we must implement this method ourselves. It is also the default callback of a Request object, handling the returned response.

    def parse(self, response):
        raise NotImplementedError
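
For example, a subclass might implement parse() roughly like this: it yields items (plain dicts here) extracted from the page, plus a follow-up Request that names its own callback. The CSS selectors and URL are illustrative assumptions, not part of the Scrapy source.

    import scrapy

    class BookSpider(scrapy.Spider):
        name = 'book_spider'                       # hypothetical name
        start_urls = ['http://example.com/books']  # hypothetical URL

        def parse(self, response):
            # yield structured data extracted from this page
            for title in response.css('h2.title::text').extract():
                yield {'title': title}

            # yield a follow-up request handled by the same callback
            next_page = response.css('a.next::attr(href)').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page),
                                     callback=self.parse)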

1.2. Spider source code analysis

Because the Spider source code is not long, I will explain it directly by adding comments to the source, as follows:

"""
Base class for Scrapy spiders

See documentation in docs/topics/spiders.rst
"""
import logging
import warnings

from scrapy import signals
from scrapy.http import Request
from scrapy.utils.trackref import object_ref
from scrapy.utils.url import url_is_from_spider
from scrapy.utils.deprecate import create_deprecated_class
from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.utils.deprecate import method_is_overridden

# Base class for all spiders; user-defined spiders must inherit from this class
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    #1. A string defining the spider's name. The name determines how Scrapy locates (and instantiates) the spider, so it must be unique.
    #2. name is the spider's most important attribute and is required. A common practice is to name the spider after the site's domain, e.g. name = "douban_book_spider" for a Douban Books crawler.
    name = None
    custom_settings = None

    #Initialize the spider name and the start_urls list, as mentioned above.
    def __init__(self, name=None, **kwargs):
        #Initialize the spider name
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)

        #Initialize the start_urls list. If no particular URLs are specified, the spider starts crawling from this list, so the first pages fetched will be these URLs; subsequent URLs are extracted from the fetched data.
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    @property
    def logger(self):
        logger = logging.getLogger(self.name)
        return logging.LoggerAdapter(logger, {'spider': self})

    def log(self, message, level=logging.DEBUG, **kw):
        """Log the given message at the given log level

        This helper wraps a log call to the logger within the spider, but you
        can use it directly (e.g. Spider.logger.info('msg')) or use any other
        Python logger too.
        """
        self.logger.log(level, message, **kw)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)
        spider._set_crawler(crawler)
        return spider

    def set_crawler(self, crawler):
        warnings.warn("set_crawler is deprecated, instantiate and bound the "
                      "spider to this crawler with from_crawler method "
                      "instead.",
                      category=ScrapyDeprecationWarning, stacklevel=2)
        assert not hasattr(self, 'crawler'), "Spider already bounded to a " \
                                             "crawler"
        self._set_crawler(crawler)

    def _set_crawler(self, crawler):
        self.crawler = crawler
        self.settings = crawler.settings
        crawler.signals.connect(self.close, signals.spider_closed)

    #This method reads the URLs in the start_urls list, builds a Request object for each one, and returns an iterator over those Requests.
    #Note: this method is called only once.
    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    #1. Called from start_requests(); this is the function that actually builds the Request objects.
    #2. The Request's default callback is parse(), and the request method is GET.
    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        return Request(url, dont_filter=True)

    #Default callback of a Request object; handles the returned response.
    #Yields Item or Request objects. We must implement this method ourselves.
    def parse(self, response):
        raise NotImplementedError

    @classmethod
    def update_settings(cls, settings):
        settings.setdict(cls.custom_settings or {}, priority='spider')

    @classmethod
    def handles_request(cls, request):
        return url_is_from_spider(request.url, cls)

    @staticmethod
    def close(spider, reason):
        closed = getattr(spider, 'closed', None)
        if callable(closed):
            return closed(reason)

    def __str__(self):
        return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))

    __repr__ = __str__


BaseSpider = create_deprecated_class('BaseSpider', Spider)


class ObsoleteClass(object):
    def __init__(self, message):
        self.message = message

    def __getattr__(self, name):
        raise AttributeError(self.message)

spiders = ObsoleteClass(
    '"from scrapy.spider import spiders" no longer works - use '
    '"from scrapy.spiderloader import SpiderLoader" and instantiate '
    'it with your project settings"'
)

# Top-level imports
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.spiders.feed import XMLFeedSpider, CSVFeedSpider
from scrapy.spiders.sitemap import SitemapSpider
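
One more attribute worth a quick illustration: custom_settings, which update_settings() above merges into the project settings with priority='spider'. A hedged sketch of a subclass using it (the class name and setting values are arbitrary examples):

    import scrapy

    class PoliteSpider(scrapy.Spider):
        name = 'polite_spider'          # hypothetical name
        # merged by update_settings() with priority='spider'
        custom_settings = {
            'DOWNLOAD_DELAY': 2,
            'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
        }

        def parse(self, response):
            pass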

2. CrawlSpider source code analysis

Having analyzed the Spider source code, I will now analyze the CrawlSpider source code.

2.1. Introduction to CrawlSpider and explanation of main functions

CrawlSpider is a commonly used spider for crawling regular websites. It defines a set of rules that provide a convenient mechanism for following links. This spider may not be a perfect fit for a particular website or project, but it works in many situations.
We can therefore take it as a base and override individual methods as needed, or of course implement our own spider from scratch. In addition to the attributes inherited from Spider (which must still be provided), it adds a new attribute:
1) rules
A collection (list) of one or more Rule objects. Each Rule defines a specific behavior for crawling the site. If multiple Rules match the same link, the first one is used, following the order in which they are defined in this attribute.
Usage examples are as follows:

rules = (
    # Extract links matching 'category.php' (but not 'subsection.php') and follow them (no callback means follow defaults to True)
    Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

    # Extract links matching 'item.php' and parse them with the spider's parse_item method
    Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item',  follow=True),
)

2) parse_start_url(response)
This method is called when the responses for the start_urls requests come back. It parses those initial responses and must return an Item object, a Request object, or an iterable containing either.
This method is meant to be overridden by the user.

def parse_start_url(self, response):  
    return []  
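
A hedged sketch of such an override, assuming we also want to extract a few items from the start pages themselves (the selector is made up for illustration):

def parse_start_url(self, response):
    # handle the start page itself, e.g. collect its headline
    for title in response.css('h1::text').extract():
        yield {'start_page_title': title}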

3) parse(): be sure not to override this method
From the introduction above, we know that Spider's parse() method needs to be overridden, since it is defined as:

def parse(self, response):
        raise NotImplementedError

However, the parse() method in CrawlSpider has already implemented some functions in the source code, as follows:

def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

So when using CrawlSpider, we must not override the parse() function (this is important). Instead, use the callback in Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item', follow=True) to specify the parsing method that should be invoked.
For example, in Crawler Classroom (25) | Using CrawlSpider, LinkExtractors, and Rule to crawl a whole site, we explained how to crawl the entire Jianshu site, as follows:

class JianshuCrawl(CrawlSpider):
    name = "jianshu_spider_crawl"
    # Optional; if present, it limits the crawling scope
    allowed_domains = ["jianshu.com"]
    start_urls = ['https://www.jianshu.com/']

    # Matching rule for extracting links from the response, yielding the links that satisfy the pattern
    pattern = '.*jianshu.com/u/*.'
    pagelink = LinkExtractor(allow=pattern)

    # Multiple Rule objects can be defined
    rules = [
        # Every link that matches the rule triggers a request, and the callback is invoked to handle the response.
        # In other words, a Rule processes requests in batches.
        Rule(pagelink, callback='parse_item', follow=True),
    ]

    # Do not define a parse() method: the source already provides one, and overriding it would break the spider
    def parse_item(self, response):
        for each in response.xpath("//div[@class='main-top']"):
    ......

callback='parse_item' (parse_item here is a string) specifies that the matched responses should be handled by def parse_item(self, response).

2.2. CrawlSpider source code analysis

Similarly, because there is not much source code for CrawlSpider, I will explain it directly by adding comments to its source code, as follows:

import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.spiders import Spider
from scrapy.utils.spider import iterate_spider_output


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    #1. parse() is first called to handle the response objects returned for the start_urls.
    #2. parse() passes these responses to _parse_response(), setting the callback to parse_start_url().
    #3. The follow flag is set to True, i.e. follow=True.
    #4. The results are returned.
    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    #Handles the responses returned for the start_urls; meant to be overridden.
    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _build_request(self, rule, link):
        #Build a Request object whose callback is _response_downloaded(), which later dispatches to the callback defined in the Rule. This '_build_request' function is called below, in _requests_to_follow().
        r = Request(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

    #Extract from the response all links that match any of the user-defined Rules, and yield them as Request objects.
    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        #Extract all links; a link is considered valid as long as it passes any one Rule.
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            #Add each link to the seen set, build a Request object for it, and set its callback to _response_downloaded().
            for link in links:
                seen.add(link)
                #Build the Request object via _build_request(), defined above; the rule index n is stored in the Request's meta so the Rule's callback can be looked up later.
                r = self._build_request(n, link)
                #Call the rule's process_request() on each Request. By default it is the identity function, i.e. it returns the Request unchanged.
                yield rule.process_request(r)

    #Handles the responses for links extracted by the Rules, and returns Items and Requests.
    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    #Parse the response object with the given callback and yield Request or Item objects.
    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        #1. First check whether a callback was set (it may be a Rule's parsing function or parse_start_url()).
        #2. If a callback is set, process the response object with it first,
        #3. then hand the results to process_results(), and yield each element of cb_res.
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        #If following is enabled, extract Requests with the defined Rules and yield them.
        if follow and self._follow_links:
            #Yield each Request object.
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
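
As _compile_rules() above shows, callback, process_links, and process_request may each be given as a method-name string, which is resolved to the bound method when the spider starts. A hedged sketch of a Rule that also uses process_links to filter the extracted links before _requests_to_follow() builds Requests from them (the class name, pattern, and filter are illustrative):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FilteredCrawl(CrawlSpider):
    name = 'filtered_crawl'                  # hypothetical name
    start_urls = ['http://example.com/']     # hypothetical URL

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php', )),
             callback='parse_item',          # resolved by _compile_rules()
             process_links='drop_tracking',  # also resolved by name
             follow=True),
    )

    def drop_tracking(self, links):
        # called by _requests_to_follow() before Requests are built
        return [link for link in links if 'utm_' not in link.url]

    def parse_item(self, response):
        yield {'url': response.url}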

References: the official Scrapy documentation (it does not say much about this particular topic, but it covers many other things not discussed here and is well worth a closer read), and two Scrapy source-code analysis articles on CSDN.



Author: Xiaoguai Talks about the Workplace
Link: https://www.jianshu.com/p/d492adf17312
Source: Jianshu
Copyright belongs to the author. For any form of reprint, please contact the author for authorization and indicate the source.
