1. Scrapy selectors
Overview:
Scrapy provides its own extraction mechanism, built on top of the lxml library; these are called selectors because they "select" the parts of an HTML document specified by an XPath expression or a CSS expression.
The Scrapy selector API is very small and simple.
A selector is an instance of the scrapy.Selector class, constructed by passing it a TextResponse object or a text string.

Using Selector objects
Selector provides two methods for extracting tags:
- xpath() — selects nodes using XPath syntax rules
- css() — selects nodes using CSS selector grammar

Shortcuts on the response object:
selector = response.xpath('')
selector = response.css('')

Both return a list of selector objects. To extract the text:
- selector.extract() returns a list of the matched text
- selector.extract_first() returns the text of the first match, or None if there is no match; a default value can also be set
Sometimes we chain several calls to the selection methods (.xpath() or .css()), for example:
response.css('img').xpath('@src')
Selector also has a .re() method, which uses a regular expression to extract data and returns a list of strings. It is generally used after xpath() or css() to filter the extracted text.
re_first() returns the first matching string.
For example:
response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
Here contains() performs a fuzzy (substring) match on the href attribute.
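Scrapy may not be installed in every environment, so the sketch below imitates the two ideas above — chained selection (response.css('img').xpath('@src')) and regex filtering with .re() — using only the Python standard library. The HTML snippet and the "Name:" labels are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A made-up document in the spirit of the examples above.
html = """<html><body>
<a href="image1.html">Name: My image 1<img src="image1_thumb.jpg"/></a>
<a href="image2.html">Name: My image 2<img src="image2_thumb.jpg"/></a>
</body></html>"""

root = ET.fromstring(html)

# Analogue of response.css('img').xpath('@src'):
# first select the tags, then pull out one attribute of each.
srcs = [img.get("src") for img in root.findall(".//img")]
# → ['image1_thumb.jpg', 'image2_thumb.jpg']

# Analogue of .re(r'Name:\s*(.*)') applied after text extraction:
# the regex filters and captures part of each extracted text node.
names = []
for a in root.findall(".//a"):
    m = re.search(r"Name:\s*(.*)", a.text or "")
    if m:
        names.append(m.group(1))
# → ['My image 1', 'My image 2']
```

In real Scrapy code the same chaining is done directly on the response object, and extract()/extract_first() replace the list comprehensions.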
2. The scrapy shell debugging tool
Description: a command-line tool for debugging the code of a Scrapy project.
Starting the shell
The command syntax to start the Scrapy shell is:
scrapy shell [option] [url|file]
Note: when analyzing a local file, be sure to include the file path (prefix a relative path with ./), because scrapy shell otherwise treats the argument as a URL by default.
Using the shell
The Scrapy shell is essentially an ordinary Python shell that additionally provides some ready-made objects and shortcuts to make debugging quicker.
Shortcuts:
- shelp() — print a help list of the available objects and shortcuts
- fetch(url[, redirect=True]) — download the given URL and update the shell's objects
- fetch(request) — download the given request
- view(response) — open the response in a local browser
Scrapy objects: crawler, spider, request, response, settings
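An illustrative session tying the pieces above together; the URL is a placeholder and the shell-side commands are shown as comments since they run at the interactive prompt, not in the system shell.

```shell
# Start the shell against a page; quote the URL so the system shell
# does not interpret any special characters in it.
scrapy shell 'http://quotes.toscrape.com'

# Inside the shell (a regular Python prompt with extras):
#   shelp()                       # list the available objects and shortcuts
#   fetch('http://example.com')   # download another page into `response`
#   response.css('title::text').extract_first()
#   view(response)                # open the downloaded page in a browser
```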
3. scrapy.Spider
| Spider class attribute / method | Description |
|---|---|
| name attribute | The spider's name |
| start_urls attribute | The list of URLs the spider starts crawling from |
| custom_settings attribute | Custom per-spider settings that override the project settings |
| start_requests() method | Generates the spider's initial requests before crawling starts |
| parse(self, response) | The default callback used to process downloaded responses |
| from_crawler | Class method used to create the spider instance |
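A real spider subclasses scrapy.Spider. To keep the sketch below runnable without Scrapy installed, the class only imitates the shape of that API; the class name, URL, and settings are made-up examples, and the real `scrapy.Request` calls are noted in comments.

```python
# Hypothetical stand-in for a Scrapy spider: in a real project this would be
# `class QuotesSpider(scrapy.Spider):` and start_requests() would yield
# scrapy.Request objects instead of bare URLs.
class QuotesSpider:
    name = "quotes"                               # the spider's name
    start_urls = ["http://quotes.toscrape.com/"]  # where crawling starts
    custom_settings = {"DOWNLOAD_DELAY": 1}       # per-spider overrides

    def start_requests(self):
        # Real code: yield scrapy.Request(url, callback=self.parse)
        for url in self.start_urls:
            yield url

    def parse(self, response):
        # Default callback: extract data from / follow links in `response`.
        return {"url": getattr(response, "url", None)}


spider = QuotesSpider()
first_requests = list(spider.start_requests())
```

If start_requests() is not overridden, Scrapy's base class generates the initial requests from start_urls automatically, with parse() as the default callback.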