Python crawler learning (7) Detailed explanation of Scrapy crawler framework

(6) Scrapy crawler framework

(1) Scrapy crawler framework structure

  • Scrapy is not a function library but a crawler framework
  • A crawler framework is a collection of software structures and functional components that implement crawler functionality
  • A crawler framework is a semi-finished product that helps users implement professional web crawlers

[Figure: the Scrapy "5 + 2" framework structure and its data flow]

  • Looking at the structure: a distributed "5 + 2" structure
  • Looking at the process: the data flow

1. The path of data flow

  1. Engine gets a crawl request from Spider
  2. Engine forwards the crawl request to Scheduler for scheduling
  3. Engine gets the next request to crawl from Scheduler
  4. Engine sends crawl request to Downloader through middleware
  5. After crawling the webpage, Downloader forms a response and sends it to Engine through the middleware
  6. Engine sends the received response to Spider for processing through middleware
  7. After processing the response, Spider generates scraped items and new crawl requests and sends them to Engine (a minimal Spider sketch follows these steps)
  8. Engine sends the scraped items to Item Pipeline (framework exit)
  9. Engine sends the new crawl requests to Scheduler
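
To make the data flow concrete, here is a minimal Spider sketch (a hypothetical example: the spider name, site, and CSS selectors are illustrative assumptions, not from the original post). The dicts it yields travel from Engine to Item Pipeline (steps 7-8), and the Requests it yields travel back through Engine to Scheduler (step 9):

    import scrapy

    class DemoSpider(scrapy.Spider):
        # Hypothetical spider; the site and selectors are for illustration only.
        name = "demo"
        start_urls = ["http://quotes.toscrape.com/"]  # initial crawl requests (framework entrance)

        def parse(self, response):
            # Step 7: parse the Response that Engine forwarded from Downloader.
            for quote in response.css("div.quote"):
                # A yielded dict becomes a scraped item -> Engine -> Item Pipeline (framework exit).
                yield {"text": quote.css("span.text::text").get()}
            # A yielded Request becomes a new crawl request -> Engine -> Scheduler.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)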

2. Entrance and exit of the data flow

  • The Engine controls the data flow between all modules and continuously obtains crawl requests from the Scheduler until no requests remain

  • Framework entrance: Spider's initial crawl request

  • Framework exit: Item Pipeline

3. Module analysis

  1. Engine
    • Control data flow between all modules
    • Trigger events based on conditions
    • No user modification
  2. Downloader
    • Download web pages upon request
    • No user modification
  3. Scheduler
    • Schedule management for all crawl requests
    • No user modification
  4. Downloader Middleware
    • Implement user-configurable control between Engine, Scheduler and Downloader
    • Modify, discard, or add requests or responses
    • User can write configuration code
  5. Spider
    • Parse the response returned by Downloader (Response)
    • Generate scraped items
    • Generate additional crawl requests (Request)
    • Requires users to write configuration code
  6. Item Pipelines
    • Pipeline processing of spider-generated crawl items
    • Consists of a sequence of operations, similar to a pipeline; each operation is an Item Pipeline class
    • Possible operations include cleaning, validating, and de-duplicating the HTML data in scraped items, and storing the data in a database
    • Requires users to write configuration code (see the pipeline sketch after this list)
  7. Spider Middleware
    • Reprocess the requests and scraped items produced by Spider
    • Modify, discard, or add requests or scraped items
    • User can write configuration code
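
As an example of the configuration code users write, below is a minimal Item Pipeline sketch (the class name, field name, and module path are hypothetical). Its process_item method receives each scraped item from Engine and performs the cleaning, validation, and de-duplication described above; returning the item passes it to the next pipeline, while raising DropItem discards it:

    # pipelines.py -- a hypothetical Item Pipeline sketch
    from scrapy.exceptions import DropItem

    class DedupPipeline:
        def __init__(self):
            self.seen = set()  # texts already seen, used for de-duplication

        def process_item(self, item, spider):
            text = item.get("text")
            if not text:
                raise DropItem("missing text")    # validation: discard incomplete items
            if text in self.seen:
                raise DropItem("duplicate item")  # de-duplication
            self.seen.add(text)
            item["text"] = text.strip()           # cleaning
            return item                           # hand the item to the next pipeline

To activate it, the pipeline is registered in the project's settings.py, e.g. ITEM_PIPELINES = {"myproject.pipelines.DedupPipeline": 300} (the module path here is a placeholder).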

(2) Comparison of requests library and Scrapy crawler

  • Similarities:

  • Both can request and crawl web pages; they are the two important technical routes for Python crawlers

  • Both are easy to use, with rich documentation, and easy to get started with

  • Neither handles JavaScript, form submission, or CAPTCHAs out of the box (both can be extended)

requests | Scrapy
Page-level crawler | Site-level crawler
Function library | Framework
Limited concurrency support, lower performance | Good concurrency, high performance
Focus is on page downloading | Focus is on crawler structure
Customization is flexible | General customization is flexible; deep customization is difficult
Very easy to get started | Somewhat harder to get started
  • How to choose a technical route

  • Very small needs: the requests library

  • Needs that are not very small: the Scrapy framework

  • A high degree of customization (regardless of scale): build your own framework, in which case requests > Scrapy

(3) Common commands of Scrapy crawler

  • The Scrapy command line

    scrapy -h                          (view Scrapy help)

    scrapy <command> [options] [args]  (Scrapy command-line format)

[Figure: commonly used Scrapy commands]
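
The commonly used commands include the following (these are standard Scrapy CLI commands; the angle-bracket arguments are placeholders to replace with your own names):

    scrapy startproject <name> [dir]             # create a new crawler project
    scrapy genspider [options] <name> <domain>   # create a spider inside a project
    scrapy settings [options]                    # get Scrapy configuration values
    scrapy crawl <spider>                        # run a spider
    scrapy list                                  # list all spiders in the project
    scrapy shell [url]                           # start an interactive scraping shell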


  • The command line (rather than a graphical interface) is easier to automate and better suited to script control
  • In essence, Scrapy is a tool for programmers: what matters is functionality, not an interface

Origin blog.csdn.net/qq_39419113/article/details/105699274