Python Distributed Crawler Framework Scrapy 7-2: Scrapy Architecture Description and Source Code Structure

Scrapy Architecture Diagrams

First, the Scrapy architecture diagram most commonly found online:

You can see that all data flows pass through the Scrapy Engine, which acts as the central hub. However, this diagram gives no ordering of the steps; it only shows which components are involved, which is why we turn to the next figure.

Next, the architecture diagram provided by the official Scrapy documentation:

In fact, understanding this figure is enough to grasp how Scrapy works in general. Do not overlook the two major middleware layers: steps 4 and 5 of the data flow pass through the downloader middleware, and steps 6 and 7 pass through the spider middleware.
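
To make the middleware hooks concrete, here is a minimal sketch of a downloader middleware. The class name and log messages are illustrative, not from Scrapy itself; only the process_request/process_response signatures follow Scrapy's downloader middleware API:

    import logging

    logger = logging.getLogger(__name__)

    class LoggingDownloaderMiddleware:
        """Illustrative middleware sitting between the Engine and the Downloader."""

        def process_request(self, request, spider):
            # Step 4: a request on its way from the Engine to the Downloader.
            logger.debug('Outgoing request: %s', request.url)
            return None  # None means "continue processing this request"

        def process_response(self, request, response, spider):
            # Step 5: a response on its way from the Downloader back to the Engine.
            logger.debug('Incoming response: %s (%s)', response.url, response.status)
            return response

It would be enabled through the DOWNLOADER_MIDDLEWARES setting; spider middleware (steps 6 and 7) works the same way through SPIDER_MIDDLEWARES, with process_spider_input/process_spider_output hooks instead.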

For a deeper understanding, we can read the Scrapy source code for each of the components we have just seen. Apart from the spiders and the pipelines, which we write ourselves, all of them can be found in the core package under the scrapy package:
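
As a rough orientation (the exact layout varies by Scrapy version; the sketch below approximates Scrapy 1.x), the components map onto the tree like this:

    scrapy/
        core/
            engine.py        # ExecutionEngine, the central hub of the diagrams above
            scheduler.py     # Scheduler: queues requests, applies the duplicate filter
            scraper.py       # Scraper: feeds responses to spiders, items to pipelines
            spidermw.py      # spider middleware manager
            downloader/
                middleware.py    # downloader middleware manager
                handlers/        # one download handler per URL scheme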

Step 1 of the data flow above is the yield in the spider code we write ourselves.
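
For example, a minimal spider (the name and start URL below are placeholders) kicks off the data flow by yielding requests from its callback:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'                     # placeholder spider name
        start_urls = ['http://example.com']  # placeholder start URL

        def parse(self, response):
            # Each yielded Request goes to the Engine (step 1), which passes
            # it on to the Scheduler (step 2).
            for href in response.css('a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)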

Step 2 is handled by the following method in engine.py:

    def schedule(self, request, spider):
        # Fired for every request the engine accepts from the spider.
        self.signals.send_catch_log(signal=signals.request_scheduled,
                                    request=request, spider=spider)
        # enqueue_request() returns False when the scheduler rejects the
        # request, typically because the dupefilter has already seen it.
        if not self.slot.scheduler.enqueue_request(request):
            self.signals.send_catch_log(signal=signals.request_dropped,
                                        request=request, spider=spider)
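
The request_dropped signal fired above can be observed from a Scrapy extension, enabled through the EXTENSIONS setting. A minimal sketch; the extension class is hypothetical, but the signals API (from_crawler, signals.connect) is Scrapy's own:

    from scrapy import signals

    class DroppedRequestLogger:
        # Hypothetical extension: logs every request the scheduler rejects.
        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(ext.request_dropped, signal=signals.request_dropped)
            return ext

        def request_dropped(self, request, spider):
            spider.logger.info('Scheduler dropped request: %s', request.url)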

Step 3 is handled by the following method in engine.py:

    def _next_request_from_scheduler(self, spider):
        slot = self.slot
        # Pull the next request off the scheduler's queue (step 3).
        request = slot.scheduler.next_request()
        if not request:
            return
        # Hand it to the downloader (step 4); _download() returns a
        # Twisted Deferred that fires when the response arrives (step 5).
        d = self._download(request, spider)
        # Route the result (response or failure) back into the engine.
        d.addBoth(self._handle_downloader_output, request, spider)
        d.addErrback(lambda f: logger.info('Error while handling downloader output',
                                           exc_info=failure_to_exc_info(f),
                                           extra={'spider': spider}))
        d.addBoth(lambda _: slot.remove_request(request))
        d.addErrback(lambda f: logger.info('Error while removing request from slot',
                                           exc_info=failure_to_exc_info(f),
                                           extra={'spider': spider}))
        # Schedule another pass through this method to keep the loop going.
        d.addBoth(lambda _: slot.nextcall.schedule())
        d.addErrback(lambda f: logger.info('Error while scheduling new request',
                                           exc_info=failure_to_exc_info(f),
                                           extra={'spider': spider}))
        return d
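
The addBoth/addErrback chain above is standard Twisted usage: an addBoth callback runs whether the Deferred fired with a result or a failure, while an addErrback only runs when an earlier link in the chain failed. A standalone sketch of the pattern (nothing here is Scrapy code):

    from twisted.internet import defer

    def on_any(result):
        # Like the addBoth targets in engine.py: runs on success and failure.
        print('always runs, got:', result)
        return result  # pass the value down the chain unchanged

    def on_error(failure):
        # Like the addErrback targets: runs only if an earlier link failed.
        print('only on error:', failure)

    d = defer.Deferred()
    d.addBoth(on_any)
    d.addErrback(on_error)
    d.callback('a response')  # fire the chain with a success value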

The following figure illustrates that the downloader supports a variety of download handlers:
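
The figure aside, the supported schemes can be read straight from the DOWNLOAD_HANDLERS_BASE default in scrapy/settings/default_settings.py (shown approximately as in Scrapy 1.x; one handler class per URL scheme):

    DOWNLOAD_HANDLERS_BASE = {
        'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler',
        'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
        'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
        'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
        's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
        'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
    }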

Finally, have a look at how the Scrapy source tree is organized; it is laid out much like Django's. You can also step through the program's execution yourself under a debugger.
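
One convenient setup for that is a small main.py at the project root that launches the crawl in-process, so IDE breakpoints work. A common sketch ('myspider' is a placeholder for your spider's name attribute):

    import os
    import sys

    from scrapy.cmdline import execute

    # Make the project package importable when run directly from an IDE.
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy', 'crawl', 'myspider'])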

Origin: blog.csdn.net/liujh_990807/article/details/100085705