Lecture 41: Introduction to Scrapy, the crawler framework everyone knows

When writing a crawler with libraries such as requests or aiohttp, we have to implement everything ourselves from start to finish, including exception handling and crawl scheduling, and once the code grows this becomes quite troublesome.

So is there any way to make writing crawlers more efficient? Of course there is: use an existing crawler framework.

Among Python crawler frameworks, Scrapy is arguably the most popular and most powerful. In this lesson we will get a first look at Scrapy, and in the following lessons we will introduce its functional modules in detail.

Scrapy introduction

Scrapy is an asynchronous processing framework based on Twisted and a crawler framework implemented in pure Python. Its architecture is clear, the coupling between modules is low, and it is highly extensible, so it can flexibly meet all kinds of requirements. We only need to customize and develop a few modules to implement a crawler with ease.

First, let's look at the architecture of the Scrapy framework, as shown in the figure:
[Figure: the Scrapy framework architecture]
It can be divided into the following parts.

  • Engine: processes the data flow of the entire system and triggers events; it is the core of the whole framework.
  • Item: defines the data structure of the crawl results; the crawled data is assigned to Item objects (a minimal code sketch follows this list).
  • Scheduler: accepts the Requests sent over by the Engine, adds them to a queue, and hands them back when the Engine asks for them again.
  • Downloader: downloads web page content and returns it to the Spiders.
  • Spiders: define the crawling logic and the parsing rules for web pages; they are mainly responsible for parsing Responses and generating extracted results and new Requests.
  • Item Pipeline: responsible for processing the Items that the Spiders extract from web pages; its main tasks are cleaning, validating and storing the data.
  • Downloader Middlewares: a hook framework sitting between the Engine and the Downloader; it mainly processes the Requests and Responses passed between them.
  • Spider Middlewares: a hook framework sitting between the Engine and the Spiders; its main job is to process the Spiders' input (Responses) and output (extracted results and new Requests).
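To make Item and Item Pipeline more concrete, here is a minimal sketch; the names MovieItem and MoviePipeline and their fields are hypothetical, chosen only for illustration:

import scrapy


class MovieItem(scrapy.Item):
    # Each Field declares one attribute of the crawl result
    title = scrapy.Field()
    score = scrapy.Field()


class MoviePipeline:
    def process_item(self, item, spider):
        # Called for every Item a Spider yields: clean or validate it here,
        # then return it so it continues through any remaining pipelines
        item['title'] = (item.get('title') or '').strip()
        return item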

At first glance this may look confusing, but don't worry: we will introduce Scrapy's functional modules with concrete cases in the following articles, and I believe you will gradually come to understand the meaning and function of each module.

Data flow

Having understood the architecture, the next step is to see how Scrapy crawls and processes data, which means we need to understand its data flow mechanism.

The data flow in Scrapy is controlled by the engine, and the process is as follows:

  • The Engine first opens a website, finds the Spider that handles this website, and asks that Spider for the first URLs to crawl.
  • The Engine obtains the first URLs to crawl from the Spider and schedules them as Requests via the Scheduler.
  • The Engine asks the Scheduler for the next URL to crawl.
  • The Scheduler returns the next URL to crawl to the Engine, and the Engine forwards it to the Downloader through the Downloader Middlewares for download.
  • Once the page is downloaded, the Downloader generates a Response for the page and sends it back to the Engine through the Downloader Middlewares.
  • The Engine receives the Response from the Downloader and sends it to the Spider for processing through the Spider Middlewares.
  • The Spider processes the Response and returns the extracted Items and new Requests to the Engine (see the code sketch after this list).
  • The Engine sends the Items returned by the Spider to the Item Pipeline and the new Requests to the Scheduler.
  • Steps 2 through 8 repeat until there are no more Requests in the Scheduler; the Engine then closes the website and the crawl ends.
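The data flow is easiest to picture from the Spider's side. Below is a minimal sketch modeled on the quotes.toscrape.com demo site used in the official Scrapy tutorial (the spider name and CSS selectors apply to that site only): the parse method yields both extracted results, which the Engine routes to the Item Pipeline, and new Requests, which the Engine routes back to the Scheduler.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Extracted results: routed by the Engine to the Item Pipeline
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}
        # New Requests: routed by the Engine back to the Scheduler
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)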

Through the cooperation of these components, the clear division of work among them, and their support for asynchronous processing, Scrapy makes the most of the network bandwidth and greatly improves the efficiency of data crawling and processing.

Installation

Now that we know the basics of Scrapy, the next step is to install it.

For installing Scrapy, the first reference is of course the official documentation at https://docs.scrapy.org/en/latest/intro/install.html; you can also refer to https://cuiqingcai.com/5421.html.

Once the installation is complete, if you can run the scrapy command normally, everything is fine.
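In a typical environment the installation and a quick check look like the commands below; depending on your setup you may instead use pip3, conda, or a virtual environment:

pip install scrapy
scrapy version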

Project structure

Since Scrapy is a framework, it comes with many ready-made components and scaffolding for writing crawlers; that is, it pre-generates a project skeleton on which we can quickly build our crawlers.

The Scrapy framework creates a project through the command line. The command to create a project is as follows:

scrapy startproject demo

After the command finishes, a folder named demo appears in the current working directory. This is a Scrapy project skeleton, and we can write crawlers based on it.

The project file structure is as follows:

scrapy.cfg
demo/
    __init__.py
    items.py
    pipelines.py
    settings.py
    middlewares.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

The functions of each file are described as follows:

  • scrapy.cfg: The configuration file of the Scrapy project; it defines the path of the project's settings and deployment-related information.
  • items.py: It defines the Item data structure, and all Item definitions can be put here.
  • pipelines.py: It defines the implementation of Item Pipeline. All the implementations of Item Pipeline can be put here.
  • settings.py: It defines the global configuration of the project.
  • middlewares.py: It defines the implementation of Spider Middlewares and Downloader Middlewares.
  • spiders/: It contains the implementations of the Spiders; each Spider corresponds to one file (see the commands below).
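As a small usage sketch (the spider name example and the domain example.com are placeholders), new spiders are usually generated and run from inside the project directory with Scrapy's built-in commands:

cd demo
scrapy genspider example example.com
scrapy crawl example

The genspider command creates a skeleton file under demo/spiders/, and crawl runs a spider by the name defined in that file.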

Well, so far we have gained a rough understanding of Scrapy's basic architecture and created a Scrapy project in practice. Later we will learn more about how to use Scrapy and experience its power. See you in the next lesson.
