Python crawler: the Scrapy framework, episode 01

The concept and workflow of Scrapy

Learning objectives:
  1. Understand the concept of Scrapy
  2. Understand the role of the Scrapy framework
  3. Master the workflow of the Scrapy framework
  4. Master the role of each module in Scrapy

1. The concept of Scrapy

Scrapy is an open-source web crawler framework written in Python, designed to crawl web data and extract structured data.

Scrapy is built on the Twisted asynchronous networking framework, which speeds up downloading.

Scrapy documentation: http://scrapy-chs.readthedocs.io/zh_CN/1.0/intro/overview.html

2. The role of the Scrapy framework

It enables fast crawling with only a small amount of code.
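
As a rough illustration of how little code is needed, here is a minimal spider sketch; the target site, spider name, and CSS selectors are example choices, not taken from this article:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Minimal spider sketch; the target site and selectors are example choices
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract structured data from each quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link; Scrapy schedules the new request for us
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Placed inside a Scrapy project, a spider like this can be run with `scrapy crawl quotes -o quotes.json`; scheduling, downloading, deduplication and export are all handled by the framework.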

3. Scrapy workflow

3.1 Review the previous crawler process

3.2 The above process can be rewritten as follows

3.3 The Scrapy workflow

The process can be described as follows (a code-level sketch of these steps appears after the notes below):
  1. The start URL in the spider is constructed into a request object -> spider middleware -> engine -> scheduler
  2. The scheduler hands the request -> engine -> downloader middleware -> downloader
  3. The downloader sends the request and gets a response -> downloader middleware -> engine -> spider middleware -> spider
  4. The spider extracts URL addresses and assembles them into request objects -> spider middleware -> engine -> scheduler, and step 2 repeats
  5. The spider extracts data -> engine -> pipeline, which processes and saves the data
Note:
  • The Chinese labels in the figure are added for ease of understanding
  • The green lines in the figure indicate the transfer of data
  • Note the position of the middleware in the figure: where it sits determines its role
  • Note the position of the engine: all the other modules are independent of one another and only interact with the engine
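
In code, steps 1, 4 and 5 all happen inside the spider: everything it yields goes back to the engine, which routes request objects to the scheduler and items to the pipelines. A sketch, with hypothetical URLs, selectors and field names:

```python
import scrapy


class FlowDemoSpider(scrapy.Spider):
    # Sketch of the data flow; URLs, selectors and field names are hypothetical
    name = "flow_demo"

    def start_requests(self):
        # Step 1: the start URL is wrapped in a Request object and handed to the
        # engine, which passes it to the scheduler
        yield scrapy.Request(url="https://example.com/list", callback=self.parse_list)

    def parse_list(self, response):
        # Step 4: extracted URLs are assembled into new Request objects and sent
        # back through the engine to the scheduler
        for href in response.css("a.detail::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Step 5: extracted data goes through the engine to the pipelines
        yield {"title": response.css("h1::text").get(), "url": response.url}
```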

3.4 The three built-in objects of Scrapy

  • Request object: composed of url, method, post_data, headers, etc.
  • Response object: composed of url, body, status, headers, etc.
  • Item object: the extracted data; essentially a dictionary
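
The sketch below shows how the three objects typically appear in spider code; the URL, headers and field names are illustrative:

```python
import scrapy


class ArticleItem(scrapy.Item):
    # An Item behaves like a dictionary with a fixed set of fields
    title = scrapy.Field()
    url = scrapy.Field()


def build_request():
    # A Request is built from a url plus optional method, post data (body) and headers
    return scrapy.Request(
        url="https://example.com/search",
        method="POST",
        body="q=scrapy",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        callback=parse_search,
    )


def parse_search(response):
    # A Response carries the url, status, headers and body of the reply
    print(response.url, response.status, response.headers.get("Content-Type"))
    item = ArticleItem()
    item["title"] = response.css("title::text").get()
    item["url"] = response.url
    yield item
```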

3.5 The specific role of each module in Scrapy

Note:
  • The spider middleware and the downloader middleware differ only in where they sit in the data flow; their capabilities overlap, for example replacing the User-Agent (see the sketch below).
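
For example, a downloader middleware that replaces the User-Agent could look like this sketch; the class name and UA strings are made up, and the middleware would still need to be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py:

```python
import random


class RandomUserAgentMiddleware:
    # Downloader middleware sketch: swap the User-Agent header before each request.
    # The class name and UA strings are illustrative examples.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        # Called for every request passing through the downloader middleware
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # returning None lets the request continue normally
```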

Summary

  1. The concept of Scrapy: Scrapy is an application framework written for crawling website data and extracting structured data
  2. The workflow and data flow of the Scrapy framework:
    1. The start URL in the spider is constructed into a request object -> spider middleware -> engine -> scheduler
    2. The scheduler hands the request -> engine -> downloader middleware -> downloader
    3. The downloader sends the request and gets a response -> downloader middleware -> engine -> spider middleware -> spider
    4. The spider extracts URL addresses and assembles them into request objects -> spider middleware -> engine -> scheduler, and step 2 repeats
    5. The spider extracts data -> engine -> pipeline, which processes and saves the data
  3. The role of the Scrapy framework: fast crawling with a small amount of code
  4. The role of each module in Scrapy:
    Engine: responsible for passing data and signals between the other modules
    Scheduler: a queue that stores the request objects handed over by the engine
    Downloader: sends the requests passed on by the engine, gets the responses, and returns them to the engine
    Spider: processes the responses sent by the engine, extracts data and URLs, and hands them back to the engine
    Pipeline: processes the data passed on by the engine, for example by storing it (see the pipeline sketch after this list)
    Downloader middleware: a customizable download extension, for example for setting a proxy IP
    Spider middleware (crawler middleware): customizable filtering of requests and responses, similar in function to the downloader middleware
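
As an example of the pipeline's storage role, here is a sketch that appends each item to a JSON Lines file; the file name is arbitrary, and the pipeline would need to be enabled through the ITEM_PIPELINES setting:

```python
import json


class JsonLinesPipeline:
    # Pipeline sketch: write every item the engine delivers to a JSON Lines file.
    # The output file name is arbitrary.

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item passed to the pipeline by the engine
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```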

Source: blog.csdn.net/weixin_38640052/article/details/115059230