Web Crawlers: Scrapy Usage Explained (2) - The Birth of a Small Scrapy Spider

I. Introduction:

  • The previous chapter covered installing Scrapy and creating a separate virtualenv environment for it.
  • Our goal is: no cavities!
  • Our goal is: no cavities!
  • Our goal is: no cavities!
  • Okay, don't hit me. Our real goal is to understand the structure of a Scrapy project and, with a small spider, verify just how easy it is to use!

II. Scrapy project structure

  • First, let's create a Scrapy project with the scrapy command and see what parts it is made of.

  • Here we create a project called scrapydemo: cd into whichever directory you like and run the command below. If you want to create it inside a virtualenv development environment, remember to activate the virtualenv first; if that is unclear, see Scrapy Usage Explained (1).

      scrapy startproject scrapydemo
  • After it runs, you will see the following directory structure:

      scrapydemo/
          scrapy.cfg            # project configuration file
          scrapydemo/           # the project's Python package
                __init__.py
                items.py        # data models for the crawled data
                pipelines.py    # processing of the crawled data
                settings.py     # crawler settings (user agent, delays, ...)
                spiders/        # our spider code lives here
                      __init__.py
  • The files' roles are as follows (a small sketch of items.py and pipelines.py follows this list):
      1. scrapy.cfg: the project configuration file.
      2. The inner scrapydemo/ directory holds the project's Python code; most of our code goes here.
      3. items.py: defines the data models for the crawled data; if you know Java, think of them as Java beans.
      4. pipelines.py: responsible for processing the data Scrapy crawls. Once we have parsed the downloaded target page, the data is sent to the pipelines for processing; the process_item method in pipelines.py receives and handles it.
      5. settings.py: the crawler configuration file; settings such as the user agent and crawl delay are configured here.
      6. spiders/: the directory where we actually write our spider code.
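  • Here is a minimal sketch of what those two files might contain for scrapydemo. The DemoItem fields and the ScrapydemoPipeline class are illustrative assumptions, not code that startproject generates for you:

      # items.py -- a hypothetical data model ("java bean") for the crawled data
      import scrapy

      class DemoItem(scrapy.Item):
          title = scrapy.Field()   # one Field per piece of data we want to keep
          url = scrapy.Field()


      # pipelines.py -- process_item receives every item the spider yields
      from scrapy.exceptions import DropItem

      class ScrapydemoPipeline(object):
          def process_item(self, item, spider):
              # clean / validate / store the data here, then pass it along
              if not item.get('title'):
                  raise DropItem('missing title')   # discard incomplete items
              return item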

III. Creating a crawler

  • OK, now that we have a general idea of the structure, let's use the crawl spider template that Scrapy provides to generate a spider for lagou.com:

      scrapy genspider spiderLagou https://www.lagou.com --template=crawl
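  • For reference, the generated spiders/spiderLagou.py looks roughly like the sketch below; the exact boilerplate (class name, allow pattern, commented-out extraction lines) varies with the Scrapy version, and parse_item is left for us to fill in:

      # spiders/spiderLagou.py -- roughly what the crawl template generates
      from scrapy.linkextractors import LinkExtractor
      from scrapy.spiders import CrawlSpider, Rule

      class SpiderlagouSpider(CrawlSpider):
          name = 'spiderLagou'
          allowed_domains = ['www.lagou.com']
          start_urls = ['https://www.lagou.com/']

          # follow links matched by the LinkExtractor and hand them to parse_item
          rules = (
              Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
          )

          def parse_item(self, response):
              # extract the data we defined in items.py here
              item = {}
              return item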

  • Ha, this diagram was collected online and lightly touched up. The Scrapy crawling flow is shown below: 1 → 2 → 3 → 4 → 1, and so the cycle repeats. (Figure: scrapy.png, the Scrapy architecture diagram)

  • Engine (Scrapy Engine): handles the data flow of the entire system and triggers the transactions.

  • Scheduler: receives requests sent over by the engine, pushes them into a queue, and returns them when the engine asks for them again.

  • Downloader: downloads web page content and returns it to the spiders.

  • Spiders: where the main work is done. We use them to define the rules for parsing pages of a specific domain (or group of domains). A spider is a class written to analyze responses and extract items (i.e., the scraped data) or additional URLs to follow. Each spider handles one particular site (or several).

  • Item Pipeline: responsible for processing the items the spiders extract from web pages; its main tasks are cleaning, validating, and storing the data. When a page has been parsed by a spider, its items are sent to the pipeline and pass through several components in a specific order.

  • Downloader Middlewares: hooks in the framework that sit between the Scrapy engine and the downloader, mainly handling the requests and responses passed between the two.

  • Spider Middlewares: hooks in the framework that sit between the Scrapy engine and the spiders; their main job is to process the spiders' input (responses) and output (requests and items).

  • Scheduler Middlewares: middleware that sits between the Scrapy engine and the scheduler, handling the requests and responses sent between the engine and the scheduler.
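  • To see that cycle end to end, one minimal way to wire things up (using the hypothetical ScrapydemoPipeline sketched earlier, with settings values chosen purely for illustration) is to register the pipeline and the settings mentioned above in settings.py:

      # settings.py -- example values; USER_AGENT and DOWNLOAD_DELAY are assumptions
      USER_AGENT = 'Mozilla/5.0 (compatible; scrapydemo)'   # the user agent mentioned above
      DOWNLOAD_DELAY = 1                                    # crawl delay, in seconds

      # register the pipeline; the number controls the order when several pipelines run
      ITEM_PIPELINES = {
          'scrapydemo.pipelines.ScrapydemoPipeline': 300,
      }

  • With that in place, running scrapy crawl spiderLagou from the project root drives the whole loop: the engine pulls start requests from the spider, the scheduler queues them, the downloader fetches the pages, the spider parses the responses into items, and the pipeline cleans and stores them.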

Origin www.cnblogs.com/cnblogzaizai/p/11570606.html