Scrapy crawler framework (notes based on a video tutorial)

Getting Started with Scrapy

1. Create a Scrapy project:
scrapy startproject mySpider (mySpider is the project name and can be anything you like)
2. Generate a spider:
scrapy genspider itcast itcast.cn (itcast is the spider's file name; it must be unique and cannot duplicate the project name. itcast.cn is the domain we will crawl, and it restricts the spider from crawling other domains)
3. Extract data:
scrapy crawl itcast # run the crawl
Complete the spider, extracting data with methods such as XPath.
Fill in the parse method; its name cannot be changed, it must be parse().

def parse(self, response):
    # handle the response corresponding to the start_urls address
    ret1 = response.xpath("//div[@class='tea_con']//h3/text()").extract()
    print(ret1)

LOG_LEVEL is set in settings.py. There are four levels: DEBUG, INFO, WARNING and ERROR. Setting LOG_LEVEL = "WARNING" displays only logs at WARNING level and above, which hides useless logs and makes the output easier to read.
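Scrapy's LOG_LEVEL maps onto the levels of Python's standard logging module. The stand-alone sketch below (logger name and handler are illustrative, not Scrapy code) shows how a WARNING threshold filters out DEBUG and INFO messages:

```python
import logging

# Collect the level names of records that get through, so we can inspect them.
records = []

class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record.levelname)

logger = logging.getLogger("demo")
logger.addHandler(ListHandler())
logger.setLevel(logging.WARNING)  # analogous to LOG_LEVEL = "WARNING" in settings.py

logger.debug("hidden")
logger.info("hidden too")
logger.warning("shown")
logger.error("shown")

print(records)  # only the WARNING and ERROR records survive the filter
```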
Scrapy wraps extracted results in Selector objects that contain both the XPath rule and the matched data. Appending the .extract() method after xpath() returns the plain extracted data directly.
Further optimization: extract each lecturer's name and title

li_list = response.xpath("//div[@class='tea_con']//li")
for li in li_list:
    item = {}
    item["name"] = li.xpath(".//h3/text()").extract_first()
    item["title"] = li.xpath(".//h4/text()").extract_first()
    yield item  # pass each item to the pipeline for unified processing; the pipeline
                # must first be enabled in settings.py (it is commented out with # by default)

After yield you may only return a Request object, a BaseItem, a dict, or None; anything else (such as a list) raises an error.
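The allowed-types rule can be modeled with a toy check. Request and BaseItem below are stand-in classes for illustration, not the real scrapy.Request or scrapy.Item base class:

```python
class Request:        # stand-in for scrapy.Request
    pass

class BaseItem:       # stand-in for the base class of scrapy.Item
    pass

def is_valid_spider_output(obj):
    # A spider callback may only yield a Request, an item/dict, or None.
    return obj is None or isinstance(obj, (Request, BaseItem, dict))

assert is_valid_spider_output({"name": "teacher"})   # dicts are fine
assert is_valid_spider_output(Request())             # requests are fine
assert is_valid_spider_output(None)                  # None is fine
assert not is_valid_spider_output(["a", "list"])     # lists are rejected
```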
# extract_first()
Equivalent to extract()[0], except when nothing can be extracted: extract() returns an empty list, so extract()[0] raises an IndexError, while extract_first() returns None instead, which better fits typical usage. Prefer extract_first() over extract()[0].
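The difference is easy to show with plain Python. The extract_first helper below mimics (and is not) Scrapy's Selector method, assuming the match results arrive as a list:

```python
def extract_first(matches, default=None):
    # mimic of Selector.extract_first(): first match, or a default, never an error
    return matches[0] if matches else default

no_matches = []                      # simulates an XPath that matched nothing
assert extract_first(no_matches) is None
assert extract_first(["a", "b"]) == "a"

got_index_error = False
try:
    no_matches[0]                    # what extract()[0] does in the same case
except IndexError:
    got_index_error = True
assert got_index_error               # extract()[0] blows up where extract_first() is safe
```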
Requesting further pages
is achieved by yielding a scrapy.Request

yield scrapy.Request(next_page_url, callback=self.parse)  # callback: specifies which parse function handles the response for this url

scrapy.Request builds a request, and callback specifies the function that extracts data from its response; since the next page is processed the same way as the current one, self.parse is reused here.
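Conceptually, the engine downloads each queued request and hands the response to that request's callback. The sketch below is a toy model of this dispatch, not Scrapy internals; FakeRequest, FakeResponse, and the URLs are all illustrative:

```python
class FakeRequest:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback   # which function will parse the response

class FakeResponse:
    def __init__(self, url):
        self.url = url

seen = []

def parse(response):
    seen.append(response.url)
    # a real parse() could yield FakeRequest(next_page_url, callback=parse) here

queue = [FakeRequest("http://example.com/page1", callback=parse),
         FakeRequest("http://example.com/page2", callback=parse)]

for req in queue:                        # stand-in for the engine's scheduling loop
    req.callback(FakeResponse(req.url))  # "download" is simulated; callback gets the response
```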
At the same time, set USER_AGENT in settings.py to provide the request header; it is commented out with # by default, so uncomment it and change the value.
4. Save data:
Pipelines store the data.
In the spider, yield item passes the item to the pipeline for unified processing, but the pipeline must first be enabled in settings.py; ITEM_PIPELINES is commented out with # by default, so simply remove the #.
Then you can print(item) in the pipeline.
Do not delete the return item at the end.

Note when configuring pipelines in settings: the number after each pipeline represents its distance from the engine. The smaller the number, the closer to the engine and the earlier the item passes through that pipeline. For example, if pipeline and pipeline1 are given 300 and 301 respectively, the item is processed by pipeline first and then by pipeline1; the numbers express the order.
The pipeline class name can be changed, e.g. MyspiderPipeline or MyspiderPipeline1, but the method name process_item() cannot be changed.
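The ordering and the return-item requirement can be sketched in plain Python. In real Scrapy the ITEM_PIPELINES setting maps dotted class paths to numbers and the engine drives the chain itself; here the loop simulates that, and the class and key names are illustrative:

```python
class MyspiderPipeline:
    def process_item(self, item, spider):
        item["seen_by"].append("pipeline")
        return item                 # must return the item so the next pipeline receives it

class MyspiderPipeline1:
    def process_item(self, item, spider):
        item["seen_by"].append("pipeline1")
        return item

# lower number = closer to the engine = runs first
ITEM_PIPELINES = {MyspiderPipeline: 300, MyspiderPipeline1: 301}

item = {"seen_by": []}
for cls, _priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1]):
    item = cls().process_item(item, spider=None)

print(item["seen_by"])  # the 300 pipeline processed the item before the 301 one
```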

Origin blog.csdn.net/Alden_Wei/article/details/105070336