Scrapy framework

Introduction to Scrapy

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It has a wide range of uses, such as data mining, information processing, and archiving historical data.

Because the framework does most of the work, users only need to write a few custom modules to build a crawler that fetches web content and images.

Scrapy uses the Twisted (['twɪstɪd]; its main competitor is Tornado) asynchronous networking framework to handle network communication. This speeds up downloads without our having to implement an asynchronous framework ourselves. Scrapy also provides various middleware interfaces, so it can flexibly meet a wide range of needs.
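Concurrency and middleware are normally controlled through a project's settings file rather than through hand-written asynchronous code. The fragment below is a minimal sketch of such a `settings.py`; the setting names are real Scrapy settings, but the values are only illustrative, and the middleware path `myproject.middlewares.CustomUserAgentMiddleware` is hypothetical.

```python
# settings.py (sketch): values are illustrative, not recommendations.

# Maximum number of requests the Downloader performs concurrently.
CONCURRENT_REQUESTS = 16

# Concurrency cap for any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Minimum delay, in seconds, between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.5

# Enable a custom downloader middleware. The dotted path is hypothetical;
# the number sets its position relative to the other middlewares.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CustomUserAgentMiddleware": 543,
}
```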

Scrapy Architecture Diagram

In the diagram, the green lines show the data flow.

  • Scrapy Engine: Responsible for communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

  • Scheduler: Accepts the Requests sent by the Engine, organizes them in a certain order, puts them in a queue, and returns them to the Engine when the Engine needs them.

  • Downloader: Downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which hands them over to the Spider for processing.

  • Spider: Processes all Responses, parses them and extracts the data needed for the Item fields, and submits the URLs that need to be followed up to the Engine, which passes them to the Scheduler again.

  • Item Pipeline: Processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).

  • Downloader Middlewares: A component you can use to customize and extend the download functionality.

  • Spider Middlewares: A functional component you can use to customize and extend the communication between the Engine and the Spider (for example, Responses going into the Spider and Requests coming out of the Spider).
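Two of the components above, the Item Pipeline and the Downloader Middleware, are usually written as plain Python classes that expose hook methods Scrapy calls. The sketch below uses the commonly documented hook names `process_item` and `process_request`; the class names, the `text` field, and the `example-bot/0.1` user agent are assumptions made for illustration.

```python
class CleanTextPipeline:
    """Item Pipeline sketch: post-process each item a Spider yields."""

    def process_item(self, item, spider):
        # Called once per item; return the item to pass it further along.
        item["text"] = item["text"].strip()
        return item


class CustomUserAgentMiddleware:
    """Downloader Middleware sketch: adjust a Request before download."""

    def process_request(self, request, spider):
        # Set a header on the outgoing request (hypothetical value).
        request.headers["User-Agent"] = "example-bot/0.1"
        # Returning None lets Scrapy continue handling the request normally.
        return None
```

Classes like these take effect only after they are registered in the project settings, under `ITEM_PIPELINES` and `DOWNLOADER_MIDDLEWARES` respectively.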

How Scrapy works

Reference: https://zhuanlan.zhihu.com/p/33979115
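As a rough illustration of the data flow among the components, the toy loop below imitates the cycle in plain Python. It is not Scrapy's implementation: it runs synchronously, uses a simple FIFO queue as the scheduler, and replaces the network with a hypothetical in-memory dictionary (`FAKE_WEB`). All function names are made up for this sketch.

```python
from collections import deque

# A fake "web": URL -> (page text, links found on that page). Hypothetical data.
FAKE_WEB = {
    "page1": ("hello", ["page2"]),
    "page2": ("world", []),
}


def downloader(url):
    """Downloader: fetch the response for a request (here, a dict lookup)."""
    body, links = FAKE_WEB[url]
    return {"url": url, "body": body, "links": links}


def spider_parse(response):
    """Spider: yield extracted items and follow-up requests."""
    yield ("item", {"url": response["url"], "text": response["body"]})
    for link in response["links"]:
        yield ("request", link)


def pipeline(item, store):
    """Item Pipeline: post-process the item, then store it."""
    item["text"] = item["text"].upper()
    store.append(item)


def engine(start_urls):
    """Engine: move requests, responses, and items between components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # avoid scheduling the same URL twice
    store = []
    while scheduler:
        url = scheduler.popleft()          # Engine takes a request from the Scheduler
        response = downloader(url)         # ...and hands it to the Downloader
        for kind, value in spider_parse(response):  # the Response goes to the Spider
            if kind == "item":
                pipeline(value, store)     # items go to the Item Pipeline
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)    # new requests return to the Scheduler
    return store


print(engine(["page1"]))
# prints [{'url': 'page1', 'text': 'HELLO'}, {'url': 'page2', 'text': 'WORLD'}]
```

The loop ends when the scheduler has no requests left, which mirrors how a crawl finishes once there is nothing more to fetch.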
