Scrapy framework (2)

One, Scrapy selectors

Overview:

Scrapy provides a data extraction mechanism built on the lxml library, called selectors.

They are called selectors because they "select" the parts of an HTML document specified by an XPath or CSS expression.

The Scrapy selector API is very small and very simple.

 

Scrapy selectors are instances of the scrapy.Selector class, constructed by passing either a TextResponse object or a text string.

Using the Selector object

 Selector provides two methods for selecting tags:
 xpath()   # based on XPath syntax
 css()     # based on CSS selector syntax

 Shortcuts on the response object:
 selector = response.xpath('')
 selector = response.css('')

 Both return a list of selectors. To extract the text:
 selector.extract()         returns a list of text
 selector.extract_first()   returns the text of the first selector, or None if nothing matched; a default value can be set

 Sometimes the selection methods (.xpath() or .css()) are chained:
 response.css('img').xpath('@src')

 Selector also has a .re() method, which extracts data with a regular expression. It returns a list of strings.
 It is generally called after xpath() or css() to filter the text data.
 re_first() returns only the first matching string.
 For example:
 response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
 contains() performs a fuzzy (substring) match.

 

Two, scrapy shell debugging tools

Description: a command-line tool for debugging Scrapy project code.

Start shell
 The command syntax for starting the Scrapy shell is:
 scrapy shell [option] [url | file]
 Note: when analyzing a local file, be sure to include its path (for example ./file.html), because scrapy shell treats the argument as a URL by default.
 
 
Using shell
 The Scrapy shell is essentially an ordinary Python shell that additionally provides some ready-made objects and shortcuts for quick debugging.
 Shortcuts:
 shelp()
 fetch(url[, redirect=True])
 fetch(request)
 view(response)
 Scrapy objects:
 crawler
 spider
 request
 response
 settings

 

Three, scrapy.Spider

Spider class attributes and methods
Name                    Description
name                    attribute: the name of the spider
start_urls              attribute: the list of URLs where the spider starts crawling
custom_settings         attribute: custom settings for this spider
start_requests()        method: generates the initial requests
parse(self, response)   method: the default callback function
from_crawler()          class method that creates the spider

 

 

Origin www.cnblogs.com/yelan5222/p/12080279.html