[Scrapy Framework] Version 2.4.0 Source Code Analysis: Architecture Overview

Introduction

This document introduces Scrapy's architecture and how its components interact.

Data flow

[Figure: Scrapy architecture diagram, showing the data flow between the Engine, Scheduler, Downloader, Spiders, and Item Pipelines]

  1. The Engine gets the initial requests to crawl from the Spider.
  2. The Engine schedules the requests in the Scheduler and asks for the next requests to crawl.
  3. The Scheduler returns the next requests to the Engine.
  4. The Engine sends the requests to the Downloader, passing through the Downloader Middlewares.
  5. Once the page finishes downloading, the Downloader generates a response (with that page) and sends it to the Engine, again passing through the Downloader Middlewares.
  6. The Engine receives the response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware.
  7. The Spider processes the response and returns scraped items and new requests (to follow) to the Engine, passing through the Spider Middleware.
  8. The Engine sends the processed items to the Item Pipelines, then sends the processed requests to the Scheduler and asks for possible next requests to crawl.
  9. The process repeats (from step 1) until there are no more requests from the Scheduler.
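
The cycle above can be seen in a minimal spider. This sketch targets quotes.toscrape.com, the site used by the official Scrapy tutorial; the CSS selectors and item fields are illustrative, not part of the framework.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # The Engine pulls the initial requests from this spider (step 1);
    # parse() receives each downloaded response (step 6).
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Scraped items are returned to the Engine, which sends them
        # on to the Item Pipelines (steps 7-8).
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # New requests go back through the Engine to the Scheduler
        # (step 8); the cycle repeats until none remain (step 9).
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved in a Scrapy project's spiders/ directory, this runs with `scrapy crawl quotes`.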

Components

  1. Scrapy Engine
    The Engine is responsible for controlling the data flow between all components of the system and triggering events when certain actions occur.
  2. Scheduler
    The Scheduler receives requests from the Engine and enqueues them, so it can feed them back to the Engine later, when the Engine asks for them.
  3. Downloader
    The Downloader is responsible for fetching web pages and feeding them to the Engine which, in turn, feeds them to the Spiders.
  4. Spiders
    Spiders are custom classes written by Scrapy users to parse responses and extract items from them, or additional requests to follow. A minimal spider is sketched at the end of the Data flow section above.
  5. Item Pipeline
    The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the Spiders. Typical tasks include cleansing, validation, and persistence (such as storing the item in a database). A hypothetical pipeline is sketched after this list.
  6. Downloader middlewares
    Downloader middlewares are specific hooks that sit between the Engine and the Downloader, processing requests on their way from the Engine to the Downloader and responses on their way back. Use a Downloader middleware if you need to do one of the following (see the middleware sketch after this list):
  • Process a request just before it is sent to the Downloader (that is, right before Scrapy sends the request to the website);
  • Change a received response before passing it to a spider;
  • Send a new request instead of passing the received response to a spider;
  • Pass a response to a spider without fetching a web page;
  • Silently drop some requests.
  7. Spider middlewares
    Spider middlewares are specific hooks that sit between the Engine and the Spiders, able to process spider input (responses) and output (items and requests). Use a Spider middleware if you need to do one of the following (see the spider middleware sketch after this list):
  • Post-process the output of spider callbacks - change, add, or remove requests or items;
  • Post-process start_requests;
  • Handle spider exceptions;
  • Call errback instead of callback for some requests, depending on the response content.
  8. Event-driven networking
    Scrapy is written with Twisted, a popular event-driven networking framework for Python. It is therefore implemented using non-blocking (asynchronous) code for concurrency.
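
To make the components concrete, here is a minimal, hypothetical Item Pipeline; the class name, the `text` field, and the settings path are assumptions for illustration, not part of Scrapy itself.

```python
from scrapy.exceptions import DropItem


class CleanupPipeline:
    # A hypothetical pipeline: validates and normalizes each item that
    # the spiders produce before handing it to the next pipeline.
    def process_item(self, item, spider):
        if not item.get("text"):
            # Raising DropItem discards the item; no later pipeline
            # will see it.
            raise DropItem("missing text field")
        item["text"] = item["text"].strip()
        return item


# Enabled in settings.py; the module path is hypothetical, and the
# number (0-1000) controls the order in which pipelines run:
# ITEM_PIPELINES = {"myproject.pipelines.CleanupPipeline": 300}
```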
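
A Downloader middleware covering the first use case above (processing a request before it is sent to the Downloader) might look like the following sketch; the class name, user-agent strings, and settings path are hypothetical.

```python
import random


class RandomUserAgentMiddleware:
    # Hypothetical downloader middleware that rewrites each request's
    # User-Agent header before the request reaches the Downloader.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None lets the request continue through the
        # remaining middlewares to the Downloader.
        return None

    def process_response(self, request, response, spider):
        # Responses pass back through here on their way to the Engine;
        # returning the response unchanged forwards it as-is.
        return response


# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 543}
```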
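
Likewise, a Spider middleware covering the first spider-middleware use case (post-processing the output of spider callbacks) could be sketched as follows; the filtering rule and all names are hypothetical.

```python
class DropShortItemsMiddleware:
    # Hypothetical spider middleware that post-processes what a spider
    # callback returns, silently dropping items with very short text.
    def process_spider_output(self, response, result, spider):
        # `result` is the iterable of items and requests yielded by
        # the spider callback that handled `response`.
        for obj in result:
            if isinstance(obj, dict) and len(obj.get("text", "")) < 5:
                continue  # drop the item
            yield obj


# SPIDER_MIDDLEWARES = {"myproject.middlewares.DropShortItemsMiddleware": 543}
```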

Origin blog.csdn.net/qq_20288327/article/details/113524228