The scrapy process
The process can be described as follows:
- The scheduler hands requests to the engine ---> downloader middleware ---> downloader
- The downloader sends the request and fetches the response ---> downloader middleware ---> engine ---> spider middleware ---> spider
- The spider extracts URLs and assembles them into request objects ---> spider middleware ---> engine ---> scheduler
- The spider extracts data ---> engine ---> pipeline
- The pipeline processes and stores the data
note:
- The green lines in the figure represent the transfer of data
- Note the position of each middleware in the figure and work out its role from that position
- Note the position of the engine: all the modules are independent of each other and interact only through the engine
The specific role of each scrapy module
1. Workflow of a scrapy project
- Create a project:
  scrapy startproject <project_name>
- Generate a spider:
  scrapy genspider <spider_name> <allowed_domain>
- Extract data:
  complete the spider, using xpath and similar methods
- Save data:
  save the data in a pipeline
2. Creating a scrapy project
command: scrapy startproject <project_name>
Example: scrapy startproject myspider
The generated directory and files are as follows:
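The original figure showing the generated files is missing here; the tree below is a sketch of the layout that a typical Scrapy startproject produces (the exact files can vary by version):

```
myspider/
    scrapy.cfg            # deploy configuration file
    myspider/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders are placed
            __init__.py
```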
Key fields in settings.py:
- USER_AGENT: sets the UA (user agent)
- ROBOTSTXT_OBEY: whether to obey the robots protocol; the default is to obey
- CONCURRENT_REQUESTS: the number of concurrent requests; the default is 16
- DOWNLOAD_DELAY: the download delay; no delay by default
- COOKIES_ENABLED: whether cookies are enabled, i.e. whether each request carries the cookies from the previous one; enabled by default
- DEFAULT_REQUEST_HEADERS: sets the default request headers
- SPIDER_MIDDLEWARES: spider middleware, configured in the same way as pipelines
- DOWNLOADER_MIDDLEWARES: downloader middleware
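A minimal settings.py sketch covering the fields above; the values shown are examples, not recommendations:

```python
# settings.py -- example values for the fields described above
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # set the UA
ROBOTSTXT_OBEY = True        # obey robots.txt (the default)
CONCURRENT_REQUESTS = 16     # default is 16
DOWNLOAD_DELAY = 1           # seconds to wait between requests; default is 0
COOKIES_ENABLED = True       # carry cookies across requests (the default)
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
SPIDER_MIDDLEWARES = {
    # "myspider.middlewares.MyspiderSpiderMiddleware": 543,
}
DOWNLOADER_MIDDLEWARES = {
    # "myspider.middlewares.MyspiderDownloaderMiddleware": 543,
}
```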
3. Creating a spider
command: scrapy genspider <spider_name> <allowed_domain>
The generated directory and files are as follows:
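The original figure is missing here as well; the file below is roughly what a command such as scrapy genspider itcast itcast.cn generates (the spider name and domain are example values):

```python
# myspider/spiders/itcast.py
import scrapy


class ItcastSpider(scrapy.Spider):
    name = "itcast"                      # unique name used to run the spider
    allowed_domains = ["itcast.cn"]      # crawling is restricted to these domains
    start_urls = ["http://itcast.cn/"]   # the first requests are made to these URLs

    def parse(self, response):
        # default callback for responses to start_urls
        pass
```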
4. Completing the spider
Completing the spider means implementing the methods that extract the data; a sketch follows the notes below.
note:
- response.xpath returns a list-like object (a SelectorList) whose elements are Selector objects; it supports the same operations as a list, plus some extra methods:
  - extract(): returns a list of the matched strings
  - extract_first(): returns the first string in the list, or None if the list is empty
- the spider must have a parse method
- the URLs to be crawled must belong to allowed_domains, but the URLs in start_urls are not subject to this restriction
- when starting the crawler, note the starting position: it must be started from within the project path
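A minimal sketch of a parse method using the methods above; the XPath expressions and field names are made up for illustration:

```python
def parse(self, response):
    # response.xpath returns a SelectorList; it can be iterated like a list
    teachers = response.xpath('//div[@class="teacher"]')  # hypothetical selector
    for teacher in teachers:
        item = {}
        # extract_first() returns the first matched string, or None if nothing matched
        item["name"] = teacher.xpath('./h3/text()').extract_first()
        # extract() returns a list of all matched strings
        item["titles"] = teacher.xpath('./p/text()').extract()
        yield item
```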
5. Passing data to the pipeline
Why use yield?
- It turns the whole function into a generator. What is the benefit of that?
- When iterating over the function's return values, the data is read into memory one item at a time, so there is no large momentary memory footprint
- The same reasoning applies to range in python3 and xrange in python2 (see the sketch below)
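A small, self-contained illustration of why a generator keeps memory usage low:

```python
def count_up(n):
    # yield turns this function into a generator: values are produced
    # one at a time, only when the caller asks for them
    i = 0
    while i < n:
        yield i
        i += 1

gen = count_up(10 ** 9)  # no billion-element list is built here
print(next(gen))         # 0 -- computed on demand
print(next(gen))         # 1
```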
note:
- the only objects that yield can pass are: BaseItem, Request, dict, None
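A sketch of a spider yielding both a dict (sent through the engine to the pipelines) and a Request (sent back to the scheduler); the URLs and XPath expressions are hypothetical:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # a dict is passed through the engine to the pipelines
        yield {"title": response.xpath("//title/text()").extract_first()}

        # a Request is passed through the engine back to the scheduler
        next_url = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_url is not None:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
```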
6. Completing the pipeline
Multiple pipelines can be enabled in settings. Why enable more than one?
- Different pipelines can process the data of different spiders
- Different pipelines can perform different data-processing operations, e.g. one cleans the data while another saves it
Notes on using pipelines:
- They must be enabled in settings before use
- In settings, the key indicates the pipeline's location (its position within the project can be customized), and the value indicates its distance from the engine: the closer it is, the earlier the data passes through it
- When there are multiple pipelines, the process_item method must return item, otherwise the next pipeline receives None
- The pipeline must have a process_item method, otherwise items cannot be received and processed
- process_item receives item and spider, where spider is the spider that passed the item over (a sketch follows this list)
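A minimal sketch of two pipelines and the settings entry that enables them; the project, class, and field names are hypothetical:

```python
# pipelines.py
import json


class CleanPipeline:
    def process_item(self, item, spider):
        # example cleaning step: strip whitespace from the name field
        if item.get("name"):
            item["name"] = item["name"].strip()
        return item  # must return item, or the next pipeline receives None


class SavePipeline:
    def open_spider(self, spider):
        self.file = open("items.jsonl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # spider.name identifies which spider passed the item over
        if spider.name == "itcast":
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

```python
# settings.py -- the key is the pipeline's import path, the value is its
# distance from the engine: the lower the number, the earlier it runs
ITEM_PIPELINES = {
    "myspider.pipelines.CleanPipeline": 300,
    "myspider.pipelines.SavePipeline": 400,
}
```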