XPath page-parsing study notes

XPath -> XML Path Language

Advantages:

(1) Can look up information directly in XML documents

(2) Also supports searching HTML

(3) Navigates through both elements and attributes

Usage:

from lxml import etree
selector = etree.HTML(target_content)   # parse the target content into an element tree that XPath can query
selector.xpath(expression)              # returns the matches as a list
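The three steps above can be sketched as follows. The page content and the variable names are invented for illustration; only `etree.HTML` and `.xpath` come from the notes.

```python
from lxml import etree

# Hypothetical page content, invented for this sketch
page = "<html><body><h1>Title</h1><p>one</p><p>two</p></body></html>"

selector = etree.HTML(page)           # parse the string into an element tree
paragraphs = selector.xpath("//p")    # element matches come back as a list
texts = selector.xpath("//p/text()")  # text matches come back as a list of strings
missing = selector.xpath("//table")   # no match gives an empty list, not an error

print(len(paragraphs), texts, missing)
```

Because the result is always a list, it is usually worth checking that it is non-empty before indexing into it.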

 

Basic syntax:

(1) // double slash: starts from the root node and scans the whole document, selecting every matching node wherever it appears; the results are returned as a list

(2) / single slash: steps one level down the path, selecting the matching tags directly under the current path

(3) /text(): gets the text content of the tag at the current path

(4) /@xxx: extracts the value of attribute xxx from the tag at the current path

(5) | pipe symbol: the "or" (union) operator; it joins several complete paths so that nodes matching any of them are selected, such as //p | //div

(6) . single dot: selects the current node

(7) .. double dot: selects the parent node of the current node
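A small sketch that exercises each of the symbols listed above. The markup and variable names are invented for illustration.

```python
from lxml import etree

# Hypothetical markup, invented for this sketch
doc = etree.HTML(
    "<div id='box'><p class='note'>first</p><span>skip</span></div>"
    "<p>second</p>"
)

all_p = doc.xpath("//p/text()")        # // scans the whole document
child_p = doc.xpath("//div/p/text()")  # / steps to direct children only
classes = doc.xpath("//p/@class")      # /@xxx returns attribute values
p_or_span = doc.xpath("//p | //span")  # | unions two complete paths

first_p = doc.xpath("//p")[0]
current = first_p.xpath(".")[0]        # . is the current node
parent = first_p.xpath("..")[0]        # .. is its parent

print(all_p, child_p, classes)
print([el.tag for el in p_or_span], current.tag, parent.tag)
```

Note that `.` and `..` are evaluated relative to the element the query is called on, which is why they are run on `first_p` rather than on the document root.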

 

Pseudocode example:

from lxml import etree

html = '    '  # HTML source of the page to parse (left blank in these notes)
selector = etree.HTML(html)
# starting from the root, match the div with id='content', then the ul with id='ul' under it, and get the text of each li under that ul
content = selector.xpath("//div[@id='content']/ul[@id='ul']/li/text()")
for i in content:
    print(i)
# starting from the root, match every a tag and use the "@attribute" form to get its href attribute value
# content = selector.xpath("//a/@href")
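The example above leaves the HTML string blank, so here is a runnable sketch of the same two queries. The sample page and its URLs are hypothetical and were invented only so the expressions have something to match.

```python
from lxml import etree

# Hypothetical sample page, invented so the expressions have something to match
html = """
<div id="content">
  <ul id="ul">
    <li>first item</li>
    <li>second item</li>
  </ul>
  <a href="https://example.com/a">link A</a>
  <a href="https://example.com/b">link B</a>
</div>
"""

selector = etree.HTML(html)

# text of every li under ul#ul under div#content
items = selector.xpath("//div[@id='content']/ul[@id='ul']/li/text()")
for item in items:
    print(item)

# href attribute value of every a tag in the document
links = selector.xpath("//a/@href")
print(links)
```

XPath matches tag and attribute names case-sensitively, so they are written in lowercase here.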

 

Special usage:

(1) starts-with

content = selector.xpath("//div[starts-with(@id, 'a')]/text()")
# extracts the text of every div tag whose id attribute value starts with 'a'; for example, for <div id='aabb'>aa</div> it extracts aa
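A minimal sketch of starts-with in action, assuming three invented div tags of which only two have an id beginning with 'a'.

```python
from lxml import etree

# Hypothetical markup, invented for this sketch
selector = etree.HTML(
    "<div id='aabb'>aa</div>"
    "<div id='abcd'>bb</div>"
    "<div id='xyz'>cc</div>"
)

# keep only the divs whose id begins with the letter a
content = selector.xpath("//div[starts-with(@id, 'a')]/text()")
print(content)
```

This is handy when a page generates ids with a shared prefix and a varying suffix.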

 

(2) text() and position()

If a tag has no attributes to match on, it can be identified by its text with text() or by its position with position()

# extract the text of the p tags inside the div tag whose text is 'hello'
content = selector.xpath("//div[text()='hello']/p/text()")

# extract the text of the second p tag inside the div tag whose text is 'hello'
content = selector.xpath("//div[text()='hello']/p[position()=2]/text()")

# multiple filters can also be chained, such as ul[position()=3][@id='a']
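The text() and position() filters above can be tried on a small invented fragment. The markup, the class name `x`, and the variable names are assumptions made for this sketch.

```python
from lxml import etree

# Hypothetical markup, invented for this sketch
selector = etree.HTML(
    "<div>hello<p>one</p><p class='x'>two</p></div>"
    "<div>bye<p>three</p><p>four</p></div>"
)

# all p tags inside the div whose own text is 'hello'
all_p = selector.xpath("//div[text()='hello']/p/text()")

# only the second p inside that div
second_p = selector.xpath("//div[text()='hello']/p[position()=2]/text()")

# p[2] is the abbreviated form of p[position()=2]
short_form = selector.xpath("//div[text()='hello']/p[2]/text()")

# chained filters: the second p of its parent, and only if it also has class='x'
chained = selector.xpath("//p[position()=2][@class='x']/text()")

print(all_p, second_p, short_form, chained)
```

Chained filters are applied left to right, so `[position()=2][@class='x']` and `[@class='x'][position()=2]` can select different nodes.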

 

 

Two ways to obtain an XPath:

(1) Write the path rules by hand

(2) Find the target tag in the browser's developer tools, then right-click -> Copy XPath
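A path copied from the browser can be passed to `selector.xpath` like a hand-written one. The sketch below assumes two paths in the style that browsers typically produce; both the paths and the markup are invented for illustration.

```python
from lxml import etree

# Hypothetical markup, invented for this sketch
html = "<html><body><div id='content'><ul><li>a</li><li>b</li></ul></div></body></html>"
selector = etree.HTML(html)

# Paths written in the style of a browser's "Copy XPath" output (assumed examples)
copied_by_id = '//*[@id="content"]/ul/li[2]'
copied_full = "/html/body/div/ul/li[2]"

by_id = selector.xpath(copied_by_id + "/text()")
full = selector.xpath(copied_full + "/text()")
print(by_id, full)
```

A copied path describes the page as the browser rendered it, which may differ from the raw HTML that was downloaded (for example, browsers insert tbody into tables), so it is worth checking the result against the actual source.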

 

 

 

 

 

 

 

Origin: www.cnblogs.com/muouran0120/p/11414559.html