Python Scrapy framework tutorial (2): Scrapy framework structure

Questions to consider

  • Why is Scrapy a framework rather than a library?
  • How does Scrapy work?


Project structure

Before you start crawling, you must create a new Scrapy project. Enter the directory where you plan to store the code and run the following command:
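Judging from the quotes/ module listed below, the project in this tutorial is presumably created with:

    scrapy startproject quotes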

Note: the command creates a new crawler project directory, named after the project, in the current directory.

These files are:

  • scrapy.cfg: the project configuration file
  • quotes/: the project's Python module; you will add your code here
  • quotes/items.py: the item definitions for the project
  • quotes/middlewares.py: the spider middleware and downloader middleware (which process requests and responses)
  • quotes/pipelines.py: the item pipelines for the project
  • quotes/settings.py: the project settings file
  • quotes/spiders/: the directory where spider code is placed (a minimal example spider follows the layout sketch below)
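Running the command produces a layout roughly like the following (a sketch based on the file list above; Scrapy also generates an __init__.py inside each package):

    quotes/
        scrapy.cfg
        quotes/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

A minimal spider dropped into quotes/spiders/ might look like the sketch below. The spider name, the quotes.toscrape.com start URL, and the CSS selectors are illustrative assumptions, not part of the original post:

    # quotes/spiders/quotes_spider.py -- illustrative minimal spider
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"  # unique name, used by `scrapy crawl quotes`
        start_urls = ["https://quotes.toscrape.com/"]  # assumed example site

        def parse(self, response):
            # Yield one dict per quote block found on the page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the "next page" link, if there is one
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)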

Scrapy schematic

Overview of each component

  1. Engine. The engine handles the data flow of the entire system and triggers events; it is the core of the whole framework.

  2. Item. An Item defines the data structure of the crawl results; scraped data is assigned to Item objects. (A combined code sketch of Items, pipelines, and middlewares follows this list.)

  3. Scheduler. The scheduler accepts requests sent by the engine, adds them to a queue, and hands them back to the engine when the engine asks for them.

  4. Downloader. The downloader fetches web content and returns it to the engine, which hands the responses to the spiders.

  5. Spiders. A spider defines the crawling logic and page-parsing rules. It is mainly responsible for parsing responses and producing results (items) and new requests.

  6. Item Pipeline. The item pipeline processes the items that spiders extract from web pages. Its main tasks are cleaning, validating, and storing the data.

  7. Downloader Middlewares. Downloader middleware is a hook framework between the engine and the downloader; it mainly processes the requests and responses passing between them.

  8. Spider Middlewares. Spider middleware is a hook framework between the engine and the spiders; it mainly processes the responses going into the spiders and the results (items and new requests) coming out of them.
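To make components 2, 6, 7, and 8 concrete, here is a minimal sketch. The file paths match the project layout above; the class names, field names, and header value are illustrative assumptions, while process_item, process_request, and process_spider_output are Scrapy's standard hook methods:

    # quotes/items.py -- component 2: an Item defines the shape of scraped data
    import scrapy

    class QuoteItem(scrapy.Item):
        text = scrapy.Field()    # illustrative fields
        author = scrapy.Field()

    # quotes/pipelines.py -- component 6: clean, validate, and store items
    from scrapy.exceptions import DropItem

    class QuotesPipeline:
        def process_item(self, item, spider):
            if not item.get("text"):
                raise DropItem("missing text")   # validation
            item["text"] = item["text"].strip()  # cleaning
            return item                          # hand on to the next pipeline

    # quotes/middlewares.py -- components 7 and 8: hooks between the engine
    # and the downloader / the spiders
    class CustomDownloaderMiddleware:
        def process_request(self, request, spider):
            # Called for every request on its way to the downloader;
            # returning None lets processing continue normally.
            request.headers.setdefault("User-Agent", "quotes-bot")  # assumed value
            return None

    class CustomSpiderMiddleware:
        def process_spider_output(self, response, result, spider):
            # Called with everything the spider yields for a response;
            # here items and requests are passed through unchanged.
            for element in result:
                yield element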

Data flow

  • Scrapy Engine: responsible for communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

  • Scheduler: responsible for accepting Requests sent by the engine, sorting and enqueuing them in a certain way, and returning them to the engine when needed.

  • Downloader: responsible for downloading all Requests sent by the Scrapy Engine and returning the Responses it obtains to the engine, which hands them to the Spider for processing.

  • Spider: responsible for processing all Responses, analyzing and extracting data from them to fill the fields required by the Item, and submitting any follow-up URLs to the engine, from which they re-enter the Scheduler.

  • Item Pipeline: responsible for post-processing the Items obtained from the Spider (detailed analysis, filtering, storage, and so on).

  • Downloader Middlewares: can be treated as a component for customizing and extending the download functionality.

  • Spider Middlewares: can be understood as a customizable component for extending and operating on the communication between the engine and the Spider (the Responses entering the Spider and the Requests leaving it).
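None of the custom components sketched earlier run until they are enabled in quotes/settings.py. Here is a sketch matching the class paths used above (the integer values control ordering; for pipelines, lower numbers run first):

    # quotes/settings.py -- registering the sketched components
    ITEM_PIPELINES = {
        "quotes.pipelines.QuotesPipeline": 300,
    }

    DOWNLOADER_MIDDLEWARES = {
        "quotes.middlewares.CustomDownloaderMiddleware": 543,
    }

    SPIDER_MIDDLEWARES = {
        "quotes.middlewares.CustomSpiderMiddleware": 543,
    }

With everything in place, the crawl is started from the project root with scrapy crawl quotes.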

Source: blog.csdn.net/m0_48405781/article/details/114532574