Scrapy: introduction and usage

The Scrapy workflow

(Diagram: the Scrapy architecture, showing the flow of data between the engine, scheduler, downloader, spiders, and pipelines)

The process can be described as follows:

  1. The scheduler passes requests ---> engine ---> downloader middleware ---> downloader
  2. The downloader sends the request and fetches the response ---> downloader middleware ---> engine ---> spider middleware ---> spider
  3. The spider extracts URLs and assembles them into request objects ---> spider middleware ---> engine ---> scheduler
  4. The spider extracts data ---> engine ---> pipeline
  5. The pipeline processes and stores the data

Note:

  • The green lines in the diagram represent the flow of data
  • Note the position of each middleware in the diagram; its position determines its role
  • Note the position of the engine: all the other modules are independent of each other and interact only through the engine

The role of each Scrapy module

  • Scrapy Engine: coordinates the whole framework; passes data and signals between the other modules
  • Scheduler: a queue that stores the requests sent over by the engine
  • Downloader: downloads the requests handed to it by the engine and returns the responses to the engine
  • Spider: processes the responses passed in by the engine, extracts data and new URLs, and hands them back to the engine
  • Item Pipeline: processes the data passed in by the engine, e.g. cleaning and storage
  • Downloader Middlewares: customizable hooks around downloading, e.g. setting a proxy
  • Spider Middlewares: customizable hooks for filtering requests and responses

 1. Scrapy project implementation process

  • Create a Scrapy project: scrapy startproject <project name>

  • Generate a spider: scrapy genspider <spider name> <allowed crawl domain>

  • Extract data: complete the spider, using xpath and similar methods

  • Save data: save the data in the pipeline

2. Create a Scrapy project

Command: scrapy startproject <project name>

Example: scrapy startproject myspider

The generated directory and files are as follows:
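For the example above, the generated layout typically looks like this (the exact set of files can vary slightly between Scrapy versions):

    myspider/
        scrapy.cfg
        myspider/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py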

 

 

Key fields in settings.py and what they mean

  • USER_AGENT  Sets the User-Agent
  • ROBOTSTXT_OBEY  Whether to obey the robots protocol; obeyed by default
  • CONCURRENT_REQUESTS  Sets the number of concurrent requests; default is 16
  • DOWNLOAD_DELAY  Download delay; no delay by default
  • COOKIES_ENABLED  Whether cookies are enabled, i.e. whether each request carries the cookies received so far; enabled by default
  • DEFAULT_REQUEST_HEADERS  Sets the default request headers
  • SPIDER_MIDDLEWARES  Spider middleware, configured the same way as pipelines
  • DOWNLOADER_MIDDLEWARES  Downloader middleware
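A minimal sketch of how these fields might look in settings.py (the specific values are only illustrative):

    # settings.py (illustrative values)
    USER_AGENT = 'myspider (+http://www.example.com)'  # custom User-Agent string
    ROBOTSTXT_OBEY = True       # obey robots.txt (the default)
    CONCURRENT_REQUESTS = 16    # maximum concurrent requests (default 16)
    DOWNLOAD_DELAY = 1          # wait 1 second between downloads (no delay by default)
    COOKIES_ENABLED = True      # carry cookies across requests (the default)
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
    }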

3. Create a spider

Command: scrapy genspider <spider name> <allowed domain>

The generated spider file is as follows:
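What genspider produces is roughly the following template, assuming the command was scrapy genspider itcast itcast.cn (the spider name and domain are only examples):

    # myspider/spiders/itcast.py
    import scrapy

    class ItcastSpider(scrapy.Spider):
        name = 'itcast'                      # spider name, used by `scrapy crawl itcast`
        allowed_domains = ['itcast.cn']      # only URLs in these domains will be crawled
        start_urls = ['http://itcast.cn/']   # the first requests the spider makes

        def parse(self, response):
            pass                             # extraction logic goes here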

4. Completing the spider

Completing the spider means implementing the data extraction (and related logic) in its methods; a sketch follows the notes below.

Note:

  1. The response.xpath method returns a list-like object (a SelectorList) whose elements are Selector objects; it can be operated on like a list, but it has additional methods
  2. extract() returns a list of strings
  3. extract_first() returns the first string in the list, or None if the list is empty
  4. The spider must have a parse method
  5. The URLs to be crawled must belong to allowed_domains, but the URLs in start_urls are not subject to this restriction
  6. When starting the crawler, pay attention to where you start it: run it from inside the project path
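A minimal sketch of a completed spider with a parse method, assuming the target page contains li elements of class "item" (the selectors and field names are illustrative):

    import scrapy

    class ItcastSpider(scrapy.Spider):
        name = 'itcast'
        allowed_domains = ['itcast.cn']
        start_urls = ['http://itcast.cn/']

        def parse(self, response):
            # response.xpath returns a SelectorList; iterate over it like a list
            for li in response.xpath('//li[@class="item"]'):
                # extract_first() gives a string, or None if nothing matched
                title = li.xpath('./a/text()').extract_first()
                link = li.xpath('./a/@href').extract_first()
                print(title, link)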

 

5. Passing data to the pipeline

Why use yield?

  • It turns the whole function into a generator. What is the benefit of that?
  • When iterating over the function's return value, the data is read into memory one item at a time, so it does not cause a sudden spike in memory usage
  • The same idea as range in python3 versus xrange in python2

Note:

  • The only objects yield can pass on are: BaseItem, Request, dict, None
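A sketch of the parse method from the spider above, now yielding items to the pipeline instead of printing them (the field names are illustrative):

    def parse(self, response):
        for li in response.xpath('//li[@class="item"]'):
            item = {
                'title': li.xpath('./a/text()').extract_first(),
                'link': li.xpath('./a/@href').extract_first(),
            }
            # yielding the dict hands it to the engine, which passes it to the pipeline
            yield item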

6. Completing the pipeline
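A minimal sketch of a pipeline in pipelines.py (the class name and the JSON-lines storage are illustrative):

    # pipelines.py
    import json

    class MyspiderPipeline(object):
        def process_item(self, item, spider):
            # spider is the spider instance that yielded this item
            with open('items.json', 'a', encoding='utf-8') as f:
                f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            return item  # must return item so that later pipelines still receive it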

 

 

 

 

Multiple pipelines can be enabled in settings. Why would you need more than one?

  • Different pipelines can handle the data of different spiders
  • Different pipelines can perform different processing steps, for example one for data cleaning and one for data storage

Notes on using pipelines

  • A pipeline must be enabled in settings before it can be used
  • In settings, the key gives the pipeline's location (i.e. where the pipeline class lives in the project, which you can customize), and the value gives its distance from the engine; the closer it is, the earlier the data passes through it (see the sketch below)
  • When there are multiple pipelines, process_item must return item, otherwise the next pipeline receives None
  • A pipeline must have a process_item method, otherwise items cannot be received and processed
  • process_item receives item and spider, where spider is the spider that passed the item in
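A sketch of enabling several pipelines in settings.py (the class names are illustrative):

    # settings.py
    ITEM_PIPELINES = {
        'myspider.pipelines.CleaningPipeline': 300,  # lower value = closer to the engine, runs first
        'myspider.pipelines.SavingPipeline': 400,    # runs after CleaningPipeline
    }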

 


Origin www.cnblogs.com/skaarl/p/11919540.html