(6) Scrapy crawler framework
(1) Scrapy crawler framework structure
- Scrapy is not a function library, but a crawler framework
- The crawler framework is a collection of software structures and functional components that implement crawler functions
- The crawler framework is a semi-finished product that helps users build professional web crawlers
- In terms of structure: a distributed "5 + 2" structure
- In terms of process: a data flow
1. The path of data flow
- Engine gets a crawl request from Spider
- Engine forwards the crawl request to Scheduler for scheduling
- Engine gets the next request to crawl from Scheduler
- Engine sends crawl request to Downloader through middleware
- After crawling the webpage, Downloader forms a response and sends it to Engine through the middleware
- Engine sends the received response to Spider for processing through middleware
- After Spider processes the response, it generates scraped Items and new crawl Requests and sends them to Engine
- Engine sends the scraped Items to Item Pipeline (framework exit)
- Engine sends the new crawl Requests to Scheduler
2. Entrance and exit of the data flow
- Engine controls the data flow among all modules and continuously obtains crawl requests from Scheduler until no requests remain
- Framework entrance: Spider's initial crawl requests
- Framework exit: Item Pipeline
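The flow above can be sketched as a toy loop in plain Python (illustrative only: all names and URLs are made up, and this is not Scrapy's actual code):

```python
from collections import deque

# Toy sketch of the "5 + 2" data flow: Spider yields Requests -> Engine ->
# Scheduler -> Engine -> Downloader -> Response -> Spider, and scraped
# Items go to the Item Pipeline (the framework exit).

def spider_start():
    """Framework entrance: the Spider's initial crawl requests."""
    yield {"url": "http://example.com/page1"}

def downloader(request):
    """Stand-in for Downloader: pretend to fetch and build a Response."""
    return {"url": request["url"], "body": "<html>demo</html>"}

def spider_parse(response):
    """Stand-in for Spider: yield scraped items and new crawl requests."""
    yield {"item": {"title": "demo", "from": response["url"]}}
    if response["url"].endswith("page1"):          # follow one more link
        yield {"url": "http://example.com/page2"}

def run_engine():
    scheduler = deque(spider_start())   # Engine forwards requests to Scheduler
    pipeline = []                       # framework exit: Item Pipeline
    while scheduler:                    # loop until no requests remain
        request = scheduler.popleft()   # Engine gets next request from Scheduler
        response = downloader(request)  # Engine -> Downloader -> Response
        for out in spider_parse(response):
            if "item" in out:
                pipeline.append(out["item"])   # scraped Items to Item Pipeline
            else:
                scheduler.append(out)          # new Requests back to Scheduler
    return pipeline

items = run_engine()
```

Note how the engine is the only component that touches every other module, which is exactly why users never modify it.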
3. Module analysis
- Engine
- Control data flow between all modules
- Trigger events based on conditions
- No user modification
- Downloader
- Download web pages upon request
- No user modification
- Scheduler
- Schedule management for all crawl requests
- No user modification
- Downloader Middleware
- Implement user-configurable control between Engine, Scheduler and Downloader
- Modify, discard, add request or response
- User can write configuration code
- Spider
- Parse the response returned by Downloader (Response)
- Generate scraped items
- Generate additional crawl requests (Request)
- Require users to write configuration code
- Item Pipelines
- Pipeline processing of spider-generated crawl items
- Consists of a sequence of operations, similar to a pipeline, each operation is an Item Pipeline type
- Possible operations include: cleaning, validating, and de-duplicating the HTML data in scraped items, and storing the data in a database
- Require users to write configuration code
- Spider Middleware
- Reprocess requests and scraped items
- Modify, discard, add requests or crawl items
- User can write configuration code
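As a sketch of the kind of configuration code a user writes for an Item Pipeline: Scrapy pipelines are plain classes exposing a `process_item` method, called once per scraped item. The class name and item fields below are illustrative, and returning `None` stands in for dropping an item (real Scrapy raises `scrapy.exceptions.DropItem`):

```python
class CleanAndDedupPipeline:
    """Illustrative pipeline: clean, validate, and de-duplicate items."""

    def __init__(self):
        self.seen_titles = set()    # remember titles for de-duplication

    def process_item(self, item, spider=None):
        item["title"] = item["title"].strip()   # clean the HTML text data
        if not item["title"]:                   # validate: drop empty items
            return None
        if item["title"] in self.seen_titles:   # de-duplicate repeated items
            return None
        self.seen_titles.add(item["title"])
        # a real pipeline would store the item in a database here
        return item

pipeline = CleanAndDedupPipeline()
results = [pipeline.process_item({"title": t})
           for t in ["  Scrapy  ", "Scrapy", "requests", ""]]
kept = [r["title"] for r in results if r is not None]
```

Chaining several such classes, each receiving the previous one's output, is what gives the "sequence of operations, similar to a pipeline" behavior described above.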
(2) Comparison of requests library and Scrapy crawler
- Similarities:
  - Both can perform page requests and crawling, and are the two important technical routes of Python crawlers
  - Both are easy to use, well documented, and easy to get started with
  - Neither has built-in support for handling JavaScript, submitting forms, or solving CAPTCHAs (both are extensible)
| requests | Scrapy |
| --- | --- |
| Page-level crawler | Site-level crawler |
| Function library | Framework |
| Limited concurrency support, poor performance | Good concurrency, high performance |
| Focuses on page downloading | Focuses on crawler structure |
| Flexible customization | Flexible general customization, but deep customization is difficult |
| Very easy to get started | Slightly harder to get started |
- How to choose a technical route:
  - Very small needs: the requests library
  - Needs that are not so small: the Scrapy framework
  - Highly customized needs (regardless of scale): build your own framework, requests > Scrapy
(3) Common commands of Scrapy crawler
- Scrapy command line
  - `scrapy -h`: show help
  - `scrapy <command> [options] [args]`: Scrapy command-line format
- The command line (rather than a graphical interface) is easier to automate and well suited to script control
- Essentially, Scrapy is designed for programmers; functionality, rather than an interface, is what matters