Scrapy Usage (Part 1)

About
Scrapy is an application framework for crawling websites and extracting structured data. It can be used in a wide range of programs for data mining, information processing, or archiving historical data, as well as for monitoring and automated testing. Scrapy uses the Twisted asynchronous networking library to handle network communication.


Flow chart


The scheduler sends request objects to the engine; the engine passes each request (through the downloader middleware) to the downloader; the downloader makes the request and obtains a response, which it returns to the engine; the engine passes the response (through the spider middleware) to the spider; the spider extracts data and new requests from the response and sends them back to the engine; the engine then sends the extracted items to the pipeline.


Scrapy major components

Scrapy basic commands

scrapy startproject project_name  # create a project in the current directory
cd project_name  # enter the project directory
scrapy genspider spider_name domain  # create a spider
scrapy  # show the available commands
scrapy list  # list the spiders in the project
scrapy crawl spider_name  # run a spider
scrapy crawl spider_name --nolog  # run a spider without showing log output
scrapy genspider -t crawl spider_name domain  # create a spider based on the crawl template
scrapy --help  # show the help documentation

Debug information

 

File Description

Spider:
A custom spider class that inherits from scrapy.Spider. Its main job is to receive the responses sent over by the engine and extract data from them. The extraction rules are written in parse(); see the figure for details.

Item:
Items are used to format the extracted data; they behave like dictionaries. See the figure for details.

Settings:
settings.py configures the crawler's request headers, cookies, database information, and so on.
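For example, a settings.py fragment might look like the sketch below (USER_AGENT, COOKIES_ENABLED, and DEFAULT_REQUEST_HEADERS are standard Scrapy settings; MYSQL_HOST is a hypothetical project-specific key a pipeline could read):

```python
# settings.py (fragment)
USER_AGENT = "Mozilla/5.0"      # default User-Agent for requests
COOKIES_ENABLED = True          # enable the cookie middleware
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html",
}
MYSQL_HOST = "localhost"        # hypothetical custom key read by a pipeline
```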

Pipeline:
The pipeline mainly handles data persistence: saving items to a database, a file, or anywhere else is configured in the pipeline.
Notes:
1. The process_item() method name must not be changed to anything else.
2. To use a pipeline, it must be enabled in the settings file; the code is shown below.
3. The smaller a pipeline's weight, the higher its priority.
4. Multiple pipeline classes can be defined to save the data.

Pipeline setting
ITEM_PIPELINES = {'myspider.pipelines.MyspiderPipeline': 100}  # {pipeline path: weight}
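A sketch of a pipeline class (CleanTitlePipeline and the "title" field are hypothetical); note that the method must keep the name process_item and should return the item so that lower-priority pipelines receive it:

```python
class CleanTitlePipeline:
    # hypothetical pipeline: normalizes the "title" field of each item
    def process_item(self, item, spider):
        # this method name is fixed; Scrapy calls it for every yielded item
        item["title"] = item["title"].strip()
        return item  # returning the item passes it on to the next pipeline
```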


yield object vs. yield Request
yield object:
The object a spider yields must be a Request, BaseItem, dict, or None.

yield Request:
Builds a request while specifying the callback function that will extract the data, along with meta.
Note:
The meta parameter passes information to the next callback function.
Log settings
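The original figure for this section is not available; as a sketch, logging is typically configured in settings.py with the standard Scrapy settings LOG_LEVEL and LOG_FILE:

```python
# settings.py (fragment)
LOG_LEVEL = "WARNING"    # only show warnings and errors
LOG_FILE = "spider.log"  # write the log to a file instead of stderr
```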

Crawl spider



CrawlSpider summary

 

 


Origin www.cnblogs.com/pythonlxf/p/11257238.html