The core technology for extracting data from pages is HTML text parsing, handled in Python by several commonly used libraries:
BeautifulSoup is a very popular parsing library with a simple API but relatively slow parsing.
lxml is an XML/HTML parsing library written in C, with fast parsing but a comparatively complex API.
Scrapy's Selector class is built on the lxml library and simplifies the API. In use, you first select the data to be extracted from the page with an XPath or CSS selector, and then extract it.
Extract data
The selected content can be extracted by calling the following methods on a Selector or SelectorList object:
extract() returns the selected content as Unicode strings (a single string for a Selector, a list of strings for a SelectorList).
extract_first() is a SelectorList-only method: it calls extract() on the first Selector in the list and returns the result. It is usually chosen when the SelectorList contains only one Selector.
re() extracts parts of the selected content with a regular expression.
For example:
selector.xpath('.//b/text()').extract() # ['price: 99.00 yuan', 'price: 88.00 yuan', 'price: 88.00 yuan']
selector.xpath('.//b/text()').re(r'\d+\.\d+') # ['99.00', '88.00', '88.00']
re_first() is the SelectorList counterpart for re(): it returns only the first string matched by the regular expression.
selector.xpath('.//b/text()').re_first(r'\d+\.\d+') # '99.00'
In actual development there is almost never a need to create a Selector object manually: the Response object automatically creates one from its own body and exposes it through the xpath and css shortcuts.
response.xpath('.//h1/text()').extract() # ['song', 'shi', 'chao']
response.css('li::text').extract() # ['song', 'shi', 'chao']
xpath selector
XPath (XML Path Language) is a language for addressing parts of an XML document. An XML document (and an HTML document, which can be treated the same way) is a tree made up of a series of nodes.
Basic syntax
/ selects the document root, describing an absolute path starting from the root
./ selects starting from the current node (use it when extracting from an already-selected fragment; without the leading ., the expression is evaluated against the whole document)
. selects the current node, describing a relative path
.. selects the parent of the current node, describing a relative path
ELEMENT selects all ELEMENT element nodes among the child nodes
//ELEMENT selects all ELEMENT element nodes among the descendant nodes
* selects all element child nodes
text() selects all text child nodes
@ATTR selects the attribute node named ATTR
@* selects all attribute nodes
[predicate] a predicate is used to find a specific node or a node containing a specific value
Example
response.xpath('/html/body/div') # selects all divs under body
response.xpath('//a') # selects every a in the document
response.xpath('/html/body//div') # selects every div anywhere under body, no matter how deep
response.xpath('//a/text()') # selects the text of every a
response.xpath('//div/*') # selects all element child nodes of each div
response.xpath('//div/*/img') # selects every img that is a grandchild of a div
response.xpath('//img/@src') # selects the src attribute of every img
response.xpath('//a[1]/img/@*') # selects all attributes of the img under the first a
response.xpath('//a[2]') # the second a
response.xpath('//a[last()]') # the last a
response.xpath('//a[last()-1]') # the second-to-last a
response.xpath('//a[position()<=3]') # the first three a elements, using the position() function
response.xpath('//div[@id]') # all divs that have an id attribute
response.xpath('//div[@id="song"]') # all divs whose id attribute is "song"
response.xpath("//p[contains(@class, 'song')]") # selects p elements whose class attribute contains 'song'
response.xpath('//div/a | //div/p') # union: matches both the a and the p children of div
css selector
CSS stands for Cascading Style Sheets. CSS selectors are not as powerful as XPath; under the hood they are translated into XPath expressions, and the xpath method is then called.
* selects all nodes
#container selects the node whose id is container
.container selects nodes whose class contains container
li a selects all a nodes inside any li
ul + p selects the p element immediately following each ul (adjacent sibling)
#container > ul selects ul nodes that are direct children of the node with id container
a[class] selects all a elements that have a class attribute
a[href="http://b.com"] selects a elements whose href is exactly "http://b.com"
a[href*='job'] selects a elements whose href contains 'job'
a[href^='https'] selects a elements whose href starts with 'https'
a[href$='cn'] selects a elements whose href ends with 'cn'
response.css('div a::text').extract() # the text of every a under every div
response.css('div a::attr(href)').extract() # the href values of those a elements
response.css('div>a:nth-child(1)') # selects the a that is the first child of each div; > restricts matching to direct children, not grandchildren
response.css('div:not(#container)') # selects all divs whose id is not container
response.css('div:first-child>a:last-child') # the last a inside a div that is its parent's first child