Scrapy Source Code Analysis (1): Architecture Overview

Search WeChat for the "Water Drop and Silver Bullet" official account to get high-quality technical articles first-hand. The author has 7 years of back-end development experience and explains technology in a clear, approachable way.

In the field of crawler development, the mainstream languages are mainly Java and Python. If you often write crawlers in Python, you have almost certainly heard of Scrapy, an open source framework written in Python.

Scrapy enjoys a strong reputation among open source crawler frameworks; almost everyone who writes crawlers in Python has used it. Moreover, many open source crawler frameworks in the industry imitate or borrow from Scrapy's ideas and architecture. If you want to study crawlers in depth, reading Scrapy's source code is well worth it.

Starting with this article, I will share the thoughts and lessons I accumulated from reading the Scrapy source code while working on crawlers.

In this article we first look at Scrapy's overall architecture and learn how Scrapy runs at a macro level. In the next few articles, I will take you into each module and analyze the implementation details of the framework.

Introduction

First, let's see how Scrapy describes itself. The official website defines Scrapy as follows:

Scrapy is an open source crawler framework written in Python. It helps you build crawlers quickly and simply and extract the data you need from websites.

In other words, using Scrapy can help you quickly and easily write a crawler to crawl website data.

This article does not cover installing or using Scrapy; this series focuses on how Scrapy is implemented, by reading its source code. For installation and usage questions, please refer to the official website and documentation. (Note: at the time of writing, the Scrapy version analyzed is 1.2. Although this version is a bit old, its core design does not differ much from the latest version.)

Developing a crawler with Scrapy is very simple. Here is an example from Scrapy's official website that shows how to write a simple crawler:
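The original screenshot of that example is not reproduced here, so below is a minimal sketch in the same spirit. The target site (quotes.toscrape.com, Scrapy's demo site) and the CSS selectors are illustrative assumptions, not the exact code from the screenshot:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal spider: scrape quotes and follow pagination links."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract structured data from the current page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }

        # Follow the "next page" link, if there is one.
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Saved as, say, quotes_spider.py, it can be run with scrapy runspider quotes_spider.py -o quotes.json, and Scrapy writes the extracted items to quotes.json.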

Simply put, writing and running a crawler only requires the following steps:

  1. Use the scrapy startproject command to create a crawler project, then write your own spider based on the generated template;
  2. Define a spider class that inherits from scrapy.Spider and overrides the parse method;
  3. Implement the page-parsing logic and the follow-up crawl path in the parse method;
  4. Run the spider with scrapy runspider <spider_file.py>.

As you can see, with just a few simple lines of code, Scrapy can collect data from a website's pages, which is very convenient.

But what happens behind the scenes? How exactly does Scrapy do this work for us?

Architecture

To understand how Scrapy works, let's first look at its architecture diagram and get a macro-level view of how it operates:
[Figure: official Scrapy architecture diagram]

Core module

As you can see from the architecture diagram, Scrapy mainly contains the following five modules:

  • Scrapy Engine: the core engine, responsible for controlling and scheduling all the other components and driving the data flow;
  • Scheduler: the scheduler, responsible for managing tasks; queueing, deduplicating, and handing out pending requests are all controlled here;
  • Downloader: the downloader, responsible for fetching data from the network; it takes the URLs to be downloaded as input and outputs the download results;
  • Spiders: the crawler logic we write ourselves, defining what to crawl and how to parse it;
  • Item Pipeline: responsible for processing the structured output data; the output format and destination can be customized.

If you observe carefully, you can see that there are two more modules:

  • Downloader middlewares: sit between the engine and the downloader, letting you run custom logic before a request is downloaded and after the response comes back (a minimal sketch follows this list);
  • Spider middlewares: sit between the engine and the spiders, letting you run custom logic before the download result is fed to the spider and after the spider outputs requests/items.
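To make the downloader-middleware hooks more concrete, here is a minimal sketch of what such a middleware looks like on the user side. The class name, header value, and logging are illustrative assumptions, not code taken from Scrapy itself:

```python
import logging

logger = logging.getLogger(__name__)


class CustomHeaderMiddleware:
    """A hypothetical downloader middleware that wraps every download."""

    def process_request(self, request, spider):
        # Called before the request is handed to the downloader.
        request.headers.setdefault("User-Agent", "my-crawler/0.1")
        return None  # None = keep processing this request normally

    def process_response(self, request, response, spider):
        # Called after the downloader produces a response, before the engine sees it.
        if response.status >= 400:
            logger.warning("Got %s for %s", response.status, request.url)
        return response
```

Such a middleware would be enabled through the DOWNLOADER_MIDDLEWARES setting, e.g. {"myproject.middlewares.CustomHeaderMiddleware": 543} (the module path and priority number are placeholders). Spider middlewares work analogously via SPIDER_MIDDLEWARES.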

Now that we know these core modules, let's look at how data flows internally when Scrapy runs, that is, how the modules interact and cooperate to complete an entire crawling task.

Run process

Following the sequence numbers marked in the architecture diagram above, the data flow at runtime looks roughly like this:

  1. The engine obtains the initial requests (also called seed URLs) from the custom spider;
  2. The engine puts these requests into the scheduler and asks the scheduler for the next requests to download;
  3. The scheduler returns the next requests to be downloaded to the engine;
  4. The engine sends a request to the downloader, passing it through a series of downloader middlewares on the way;
  5. Once the downloader has fetched the page, it builds a response object and returns it to the engine, again passing through the downloader middlewares;
  6. After receiving the response from the downloader, the engine sends it to the spider; on the way it passes through a series of spider middlewares, and finally the spider's custom parsing logic is executed;
  7. After running its parsing logic, the spider yields result objects (items) and/or new request objects to the engine, again passing through the spider middlewares;
  8. The engine hands the items returned by the spider to the result processor, the Item Pipeline, and sends the new requests through to the scheduler (a minimal pipeline sketch follows this list);
  9. Steps 1-8 repeat until there are no more pending requests in the scheduler, and the task ends.
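Step 8 hands the items over to the result processor, i.e. the Item Pipeline. As a rough idea of what that side looks like for users, here is a minimal pipeline sketch; the class name and output file are illustrative assumptions:

```python
import json


class JsonWriterPipeline:
    """A hypothetical item pipeline that appends each item to a JSON-lines file."""

    def open_spider(self, spider):
        # Called once when the spider is opened.
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        # Called once when the spider is closed.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields (step 8 above).
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # returning the item lets later pipelines continue processing it
```

Pipelines are enabled through the ITEM_PIPELINES setting, where an integer value controls the order in which they run.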

Collaboration of core modules

It can be seen that Scrapy's architecture diagram is relatively clear, and the various modules cooperate with each other to complete the crawling task.

While reading the source code, I put together a more detailed interaction diagram of the core modules, which shows more of the internals; you can use it as a reference:
[Figure: detailed core module interaction diagram]
One module in this figure needs an explanation: the Scraper. It is in fact one of Scrapy's core modules, but it does not appear in the official architecture diagram. It sits between the Engine, the Spiders, and the Item Pipeline, acting as the bridge that connects these three modules. I will come back to it specifically in the later source-code articles.

Core class diagram

In addition, in the process of reading the source code, I also compiled the class diagrams of these core modules, which will be of great help for you to learn the source code.
[Figure: core class diagram]
A brief note on how to read this core class diagram:

  • Plain black text marks a class's core attributes;
  • Highlighted yellow text marks a class's core methods.

When you read the source code, you can focus on these core attributes and methods.

Combining the official architecture diagram with the core module interaction diagram and core class diagram I summarized above, we can see that Scrapy mainly involves the following components.

  • Five core classes: Scrapy Engine, Scheduler, Downloader, Spiders, Item Pipeline;
  • Four middleware-manager classes: DownloaderMiddlewareManager, SpiderMiddlewareManager, ItemPipelineManager, ExtensionManager;
  • Other auxiliary classes: Request, Response, Selector.
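For reference when you open the codebase, here is a rough map of where these classes live in the Scrapy 1.x source tree. The import paths below reflect my own reading of the code and may shift slightly between versions, so treat them as assumptions to verify against the version you read:

```python
# Approximate source locations of the core classes (Scrapy 1.x; may vary by version).
from scrapy.core.engine import ExecutionEngine        # the "Scrapy Engine" in the diagram
from scrapy.core.scheduler import Scheduler           # Scheduler
from scrapy.core.downloader import Downloader         # Downloader
from scrapy.core.scraper import Scraper               # the Scraper bridge mentioned above
from scrapy.spiders import Spider                     # base class for Spiders

# Middleware managers, all built on scrapy.middleware.MiddlewareManager.
from scrapy.core.downloader.middleware import DownloaderMiddlewareManager
from scrapy.core.spidermw import SpiderMiddlewareManager
from scrapy.pipelines import ItemPipelineManager
from scrapy.extension import ExtensionManager

# Auxiliary classes.
from scrapy.http import Request, Response
from scrapy.selector import Selector
```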

We now have a preliminary understanding of Scrapy's overall architecture. In the next article, I will walk through the source code of the classes and methods mentioned above in more detail.

