Using the Scrapy Framework

Introduction to the Scrapy Framework

  Scrapy is an asynchronous processing framework built on Twisted and a crawler framework implemented in pure Python. Its architecture is clear, its modules are loosely coupled, and it is highly extensible, so it can flexibly meet a wide range of needs. We only need to customize a few modules to implement a crawler with little effort.

1. Architecture Overview

The framework can be divided into the following components.

  • Engine: processes the data flow of the entire system and triggers events; it is the core of the framework.
  • Item: defines the data structure of the crawl results; scraped data is assigned to Item objects.
  • Scheduler: accepts the requests sent by the engine and adds them to a queue, then hands them back to the engine when the engine asks for the next request.
  • Downloader: downloads web content and returns it to the spider (by way of the engine).
  • Spider: defines the crawling logic and the parsing rules for pages; it is mainly responsible for parsing responses and generating extracted results and new requests.
  • Item Pipeline: processes the items that the spider extracts from web pages; its main tasks are cleaning, validating, and storing data.
  • Downloader Middleware: a hook framework located between the engine and the downloader; it mainly processes the requests and responses that pass between the two.
  • Spider Middleware: a hook framework located between the engine and the spiders; it mainly processes the responses fed into the spider and the results and new requests the spider outputs.
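
The way these components hand data to one another can be sketched in plain Python. This is a toy model for illustration only, not Scrapy's API: every class, the `crawl` helper, the in-memory `PAGES` dict, and the "title|next_url" page format are assumptions made up for this example. It also runs synchronously and leaves out both middleware layers, whereas Scrapy itself is asynchronous.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    url: str


@dataclass
class Response:
    url: str
    body: str


@dataclass
class Item:
    url: str
    title: str


class Scheduler:
    """Queues requests from the engine and hands them back on demand."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def enqueue(self, request):
        # Skip URLs that were already scheduled so the crawl terminates.
        if request.url not in self._seen:
            self._seen.add(request.url)
            self._queue.append(request)

    def next_request(self):
        return self._queue.popleft() if self._queue else None


class Downloader:
    """Stands in for the network: serves pages from an in-memory dict."""

    def __init__(self, pages):
        self._pages = pages

    def fetch(self, request):
        return Response(request.url, self._pages.get(request.url, ""))


class Spider:
    """Parses a response into items and follow-up requests."""

    def __init__(self, start_urls):
        self.start_urls = start_urls

    def parse(self, response):
        # Hypothetical page format: "title|next_url".
        title, _, next_url = response.body.partition("|")
        yield Item(url=response.url, title=title)
        if next_url:
            yield Request(next_url)


class ItemPipeline:
    """Cleans, validates, and stores items."""

    def __init__(self):
        self.stored = []

    def process_item(self, item):
        item.title = item.title.strip()  # clean
        if item.title:                   # validate: drop empty titles
            self.stored.append(item)     # store


def crawl(pages, start_urls):
    """Plays the engine's role: moves data between the other components."""
    scheduler, downloader = Scheduler(), Downloader(pages)
    spider, pipeline = Spider(start_urls), ItemPipeline()
    for url in spider.start_urls:
        scheduler.enqueue(Request(url))
    while (request := scheduler.next_request()) is not None:
        response = downloader.fetch(request)
        for result in spider.parse(response):
            if isinstance(result, Request):
                scheduler.enqueue(result)      # new request: back to the queue
            else:
                pipeline.process_item(result)  # item: on to the pipeline
    return pipeline.stored


PAGES = {
    "page1": " first page |page2",
    "page2": "second page|page3",
    "page3": "   |page1",  # blank title is dropped; page1 was already seen
}

for item in crawl(PAGES, ["page1"]):
    print(item.url, item.title)
```

The point of the sketch is the division of labor: the spider never touches the queue or the storage, and the scheduler and pipeline know nothing about parsing, which is what lets each module be swapped or customized independently.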

Origin www.cnblogs.com/jeavy/p/11470455.html