The Scrapy Framework in Detail

I. Introduction: Scrapy is a professional, efficient crawler framework. It uses the Twisted package for efficient network communication, and lxml (a dedicated XML processing package) together with cssselect to extract useful information from HTML pages efficiently. It also handles concurrency efficiently; this comes from Twisted's asynchronous, event-driven model rather than from one thread per request. A crawler framework is a collection of software structure and functional components that implements crawler functionality. It is a semi-finished product that helps users build professional web crawlers.

II. Installation: pipenv install scrapy

III. Structure: the "5 + 2" structure

The five modules are: the Engine module + the Spiders module (the entry point of the framework) + the Scheduler module + the Downloader module + the Item Pipelines module (the exit point of the framework)

The two middleware layers are: Spider Middleware and Downloader Middleware

There are three data flow paths:

1️⃣: from the Spiders module, through the Engine module, to the Scheduler module (requests, Request)

2️⃣: from the Scheduler module, through the Engine module, to the Downloader module; the downloaded data (the response) is then passed back to the Spiders module

3️⃣: from the Spiders module, through the Engine module, to the Item Pipelines module (items) and to the Scheduler module (new requests) respectively
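The three paths above can be illustrated with a toy simulation. This is only a sketch using the standard library: the classes, the FAKE_WEB dictionary and the crawl() helper are hypothetical stand-ins and are not Scrapy's real implementation, which is asynchronous and far more complete. The duplicate check in the toy scheduler is likewise a simplification.

```python
from collections import deque

# Hypothetical in-memory "web": page name -> title and outgoing links.
FAKE_WEB = {
    "page1": {"title": "First page", "links": ["page2"]},
    "page2": {"title": "Second page", "links": ["page1"]},
}


class Scheduler:
    """Queues requests and skips ones it has already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, request):
        if request not in self.seen:
            self.seen.add(request)
            self.queue.append(request)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


class Downloader:
    """Turns a request into a response (here: a dictionary lookup)."""

    def fetch(self, request):
        return {"url": request, **FAKE_WEB[request]}


class Spider:
    """Entry point: supplies start requests and parses responses."""

    start_requests = ["page1"]

    def parse(self, response):
        yield ("item", {"url": response["url"], "title": response["title"]})
        for link in response["links"]:
            yield ("request", link)


class Pipeline:
    """Exit point: receives the finished items."""

    def __init__(self):
        self.stored = []

    def process_item(self, item):
        self.stored.append(item)


class Engine:
    """Moves data between the other four modules."""

    def __init__(self, spider, scheduler, downloader, pipeline):
        self.spider = spider
        self.scheduler = scheduler
        self.downloader = downloader
        self.pipeline = pipeline

    def run(self):
        # Path 1: Spiders -> Engine -> Scheduler (requests)
        for request in self.spider.start_requests:
            self.scheduler.enqueue(request)
        while (request := self.scheduler.next_request()) is not None:
            # Path 2: Scheduler -> Engine -> Downloader, response back to Spiders
            response = self.downloader.fetch(request)
            # Path 3: Spiders -> Engine -> Item Pipelines (items) / Scheduler (requests)
            for kind, payload in self.spider.parse(response):
                if kind == "item":
                    self.pipeline.process_item(payload)
                else:
                    self.scheduler.enqueue(payload)


def crawl():
    pipeline = Pipeline()
    Engine(Spider(), Scheduler(), Downloader(), pipeline).run()
    return pipeline.stored
```

Calling crawl() visits each toy page once and returns the items collected by the pipeline. Note that the modules never call each other directly; everything passes through the engine.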

IV. Module functions: the Engine module is the core of the whole framework. It controls the data flow between all modules and triggers events based on conditions. The other modules:

1) Downloader module: downloads web pages according to the requests it receives

2) Scheduler: responsible for scheduling the crawl requests

3) Item Pipelines module: processes the items produced by the spiders in pipeline fashion. It consists of an ordered sequence of operations, similar to an assembly line, and each operation is an Item Pipeline class

4) Spider: parses the responses (Response) returned by the Downloader, produces the scraped items, and generates additional crawl requests (Request)
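As a sketch of item 3 above: an item pipeline can be written as a plain Python class with a process_item(item, spider) method that returns the item for the next stage, and the order is set by the ITEM_PIPELINES setting (lower numbers run first). The class names, the "myproject.pipelines" path and the run_pipelines() helper below are assumptions made for illustration; in a real project Scrapy itself calls the stages, and newer Scrapy versions may vary the exact signature.

```python
class StripTitlePipeline:
    """First stage: trims whitespace around the title."""

    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item  # hand the item to the next stage


class AddTitleLengthPipeline:
    """Second stage: records the length of the cleaned title."""

    def process_item(self, item, spider):
        item["title_length"] = len(item["title"])
        return item


# What the corresponding entry in settings.py could look like
# (hypothetical module path; lower numbers run earlier).
ITEM_PIPELINES = {
    "myproject.pipelines.StripTitlePipeline": 300,
    "myproject.pipelines.AddTitleLengthPipeline": 400,
}


def run_pipelines(item, spider=None):
    """Hypothetical helper that mimics running the stages in order."""
    for stage in (StripTitlePipeline(), AddTitleLengthPipeline()):
        item = stage.process_item(item, spider)
    return item
```

For example, run_pipelines({"title": "  Scrapy  "}) passes the item through both stages in sequence, the way items move along the pipeline described above.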

V. Downloader Middleware:

Purpose: to apply user-configurable control between the Engine, the Scheduler and the Downloader

Function: modify, discard, or add requests or responses. Users write the configuration code themselves
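A minimal sketch of such a middleware follows. A downloader middleware is a class whose process_request(request, spider) and process_response(request, response, spider) methods Scrapy calls around each download, and it is enabled through the DOWNLOADER_MIDDLEWARES setting. The class name, the user agent string, the "myproject.middlewares" path and the demo() helper are assumptions for illustration, and the request is faked with a simple object so the example runs without Scrapy installed.

```python
from types import SimpleNamespace


class DefaultUserAgentMiddleware:
    """Modifies outgoing requests by filling in a User-Agent header."""

    user_agent = "example-crawler/0.1"  # hypothetical value

    def process_request(self, request, spider):
        # Only set the header if the request does not already carry one.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None tells Scrapy to keep processing this request.
        # Returning a Response or a new Request instead would replace it,
        # and raising scrapy.exceptions.IgnoreRequest would discard it.
        return None

    def process_response(self, request, response, spider):
        # Pass the response through unchanged.
        return response


# What the corresponding entry in settings.py could look like
# (hypothetical module path and priority).
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DefaultUserAgentMiddleware": 543,
}


def demo():
    """Hypothetical helper: runs the middleware on a fake request."""
    fake_request = SimpleNamespace(headers={})
    result = DefaultUserAgentMiddleware().process_request(fake_request, spider=None)
    return result, fake_request.headers
```

The three outcomes listed in the comment of process_request correspond to the "modify, discard, or add" functions named above.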

 


Origin www.cnblogs.com/wangyue0925/p/11233188.html