Introduction to Scrapy
Scrapy is an application framework, written in pure Python, for crawling websites and extracting structured data. It has a wide range of uses.
Thanks to the framework, users only need to customize and develop a few modules to implement a crawler that scrapes web content and images, which is very convenient.
Scrapy uses the Twisted asynchronous networking framework (a major alternative is Tornado) to handle network communication. This speeds up downloads without requiring us to implement an asynchronous framework ourselves, and it provides various middleware interfaces that can flexibly fulfill all kinds of needs.
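The benefit of asynchronous networking can be sketched with Python's standard-library asyncio (used here only as a stand-in for Twisted; this is not Scrapy's actual internals, and `fake_download`/`crawl` are made-up names). Several simulated downloads run concurrently, so the total time is roughly that of the slowest request rather than the sum of all of them.

```python
import asyncio
import time

async def fake_download(url: str, delay: float) -> str:
    # Simulate network latency without blocking the event loop.
    await asyncio.sleep(delay)
    return f"<html>content of {url}</html>"

async def crawl(urls: list[str]) -> list[str]:
    # All downloads are scheduled at once and awaited together.
    tasks = [fake_download(u, 0.1) for u in urls]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    start = time.perf_counter()
    pages = asyncio.run(crawl([f"https://example.com/{i}" for i in range(5)]))
    elapsed = time.perf_counter() - start
    # Five 0.1 s "downloads" finish in about 0.1 s total, not about 0.5 s.
    print(len(pages), round(elapsed, 2))
```

With a blocking, sequential loop the same five requests would take five times as long; this is the speed-up Scrapy gets for free from Twisted.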
Scrapy Architecture Diagram (the green lines show the data flow)
- Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler: accepts the Requests sent by the Engine, arranges and enqueues them in a certain order, and returns them when the Engine needs them.
- Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which hands them over to the Spider for processing.
- Spider: processes all Responses, analyzes and extracts data from them to fill the required Item fields, and submits any follow-up URLs to the Engine, which enters them into the Scheduler again.
- Item Pipeline: processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
- Downloader Middlewares: can be regarded as a component for customizing and extending the download functionality.
- Spider Middlewares: can be understood as a functional component for customizing and extending the communication between the Engine and the Spider (e.g., Responses going into the Spider, and Requests going out of the Spider).
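The interplay of these components can be sketched as a toy engine loop in plain Python. This is a simplified illustration of the data flow described above, not Scrapy's actual implementation; all function names here are made up for the sketch.

```python
from collections import deque

# Toy stand-ins for Scrapy's components; names and behavior are illustrative only.

def downloader(request: str) -> str:
    # Downloader: pretend to fetch the page for a Request (a URL).
    return f"<html>page at {request}</html>"

def spider_parse(response: str):
    # Spider: extract an "item" from the Response and (for brevity)
    # return no follow-up Requests.
    item = {"body": response}
    new_requests: list[str] = []
    return item, new_requests

def pipeline(item: dict) -> dict:
    # Item Pipeline: post-process the item (filter, store, ...); here just tag it.
    item["processed"] = True
    return item

def engine(start_requests):
    scheduler = deque(start_requests)  # Scheduler: queue of pending Requests
    items = []
    while scheduler:
        request = scheduler.popleft()              # Engine pulls a Request from the Scheduler
        response = downloader(request)             # Downloader fetches it
        item, follow_ups = spider_parse(response)  # Spider parses the Response
        scheduler.extend(follow_ups)               # new Requests re-enter the Scheduler
        items.append(pipeline(item))               # Item Pipeline handles the Item
    return items

if __name__ == "__main__":
    print(engine(["https://example.com"]))
```

The real Scrapy engine does the same bookkeeping asynchronously via Twisted, with the middlewares hooking in on either side of the Downloader and the Spider.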
How Scrapy works
Reference: https://zhuanlan.zhihu.com/p/33979115