XPath: XML Path Language
Advantages:
(1) Can look up information directly in XML
(2) Also supports searching HTML
(3) Navigates through elements and attributes
Usage:
    from lxml import etree
    selector = etree.HTML(target_content)  # parse the target content into a tree that XPath can query
    results = selector.xpath(expression)   # returns the matches as a list
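As a minimal sketch of these three steps, assuming a small made-up HTML string (the `sample_html` variable and its contents are illustrative and not from the original notes), the following shows that `xpath()` returns a list, which is empty when nothing matches:

```python
from lxml import etree

# Hypothetical sample content, used only for illustration
sample_html = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"

selector = etree.HTML(sample_html)          # parse the string into an element tree

paragraphs = selector.xpath("//p/text()")   # text of every p tag
missing = selector.xpath("//table")         # the sample contains no table

print(paragraphs)  # ['First', 'Second']
print(missing)     # []
```

Because the result is always a list, it is usually worth checking that it is non-empty before indexing into it.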
Basic syntax:
(1) // double slash: starts from the root node and scans the whole document, selecting every node that matches; the results are returned as a list
(2) / single slash: steps down one level from the current path, selecting a direct child tag or operating on the content of the tag at the current path
(3) /text(): gets the text content of the tag at the current path
(4) /@xxx: extracts the value of attribute xxx from the tag at the current path
(5) | pipe symbol: the "or" operator; use | to select several paths at once, such as //p | //div
(6) . single dot: selects the current node
(7) .. double dot: selects the parent node of the current node
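The syntax items above can be tried out one by one. This is only a sketch: the sample page and all variable names below are made up for illustration.

```python
from lxml import etree

# Hypothetical sample page, used only for illustration
sample_html = """
<html><body>
  <div id="main">
    <p class="intro">Hello</p>
    <a href="/about">About</a>
  </div>
  <div id="side"><p>World</p></div>
</body></html>
"""
root = etree.HTML(sample_html)

all_p = root.xpath("//p/text()")                   # (1) + (3): text of every p in the document
direct_p = root.xpath("/html/body/div/p/text()")   # (2): step down one level at a time
hrefs = root.xpath("//a/@href")                    # (4): attribute value
p_or_a = root.xpath("//p | //a")                   # (5): union of two paths

link = root.xpath("//a")[0]
same_node = link.xpath(".")[0]                     # (6): the current node (the a tag itself)
parent_id = link.xpath("../@id")                   # (7): id attribute of the parent node

print(all_p)                      # ['Hello', 'World']
print(direct_p)                   # ['Hello', 'World']
print(hrefs)                      # ['/about']
print([e.tag for e in p_or_a])    # ['p', 'a', 'p']
print(same_node.tag)              # a
print(parent_id)                  # ['main']
```

Note that items (6) and (7) are evaluated relative to an element (`link`), not to the whole document.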
Pseudocode example:
    from lxml import etree

    html = ' '  # the HTML source to parse goes here
    selector = etree.HTML(html)

    # Starting from the root, match the text of the li tags under the ul tag
    # with id='ul', which sits under the div tag with id='content'
    content = selector.xpath("//div[@id='content']/ul[@id='ul']/li/text()")
    for i in content:
        print(i)

    # Match the a tags from the root and use the "@attribute" form
    # to get the value of each a tag's href attribute
    # content = selector.xpath("//a/@href")
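The example above leaves the HTML empty, so here is a runnable variant. The page source is an assumption made up for this sketch (the ids match the expression, but the list items and URLs are hypothetical):

```python
from lxml import etree

# Hypothetical page source; the original example does not include one
html = """
<div id="content">
  <ul id="ul">
    <li>first item</li>
    <li>second item</li>
  </ul>
  <a href="https://example.com/a">link A</a>
  <a href="https://example.com/b">link B</a>
</div>
"""
selector = etree.HTML(html)

# Text of each li under ul#ul under div#content
items = selector.xpath("//div[@id='content']/ul[@id='ul']/li/text()")
for item in items:
    print(item)

# href value of every a tag in the document
links = selector.xpath("//a/@href")
print(links)
```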
Special usage:
(1) starts-with
    # Extract the text of the div tags whose id attribute value starts with 'a';
    # for example, <div id='aabb'>aa</div> yields aa
    content = selector.xpath("//div[starts-with(@id, 'a')]/text()")
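A small self-contained check of starts-with, assuming three made-up div tags (only the first one comes from the note above):

```python
from lxml import etree

# Hypothetical sample: two ids start with 'a', one does not
html = "<div id='aabb'>aa</div><div id='abcd'>bb</div><div id='xyz'>cc</div>"
selector = etree.HTML(html)

content = selector.xpath("//div[starts-with(@id, 'a')]/text()")
print(content)  # ['aa', 'bb']
```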
(2) text () and position ()
If a tag has no attribute to filter on, it can be identified with text() or position() instead
    # Extract the text of the p tags inside the div tag whose text is 'hello'
    content = selector.xpath("//div[text()='hello']/p/text()")

    # Extract the text of the second p tag inside the div tag whose text is 'hello'
    content = selector.xpath("//div[text()='hello']/p[position()=2]/text()")

    # Several filters can also be chained, such as ul[position()=3][@id='a']
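The following sketch exercises text() and position() on a made-up fragment (the HTML and the class names are assumptions for illustration). The fragment is written on one line on purpose: text()='hello' compares the text node exactly, so surrounding whitespace or newlines would prevent a match.

```python
from lxml import etree

# Hypothetical fragment, kept on one line so the div text is exactly 'hello'
html = "<div>hello<p class='a'>one</p><p class='b'>two</p></div><div>bye<p>three</p></div>"
selector = etree.HTML(html)

all_p = selector.xpath("//div[text()='hello']/p/text()")
second_p = selector.xpath("//div[text()='hello']/p[position()=2]/text()")

# Chained filters are applied left to right
chained = selector.xpath("//div[text()='hello']/p[position()=2][@class='b']/text()")
reordered = selector.xpath("//div[text()='hello']/p[@class='b'][position()=2]/text()")

print(all_p)      # ['one', 'two']
print(second_p)   # ['two']
print(chained)    # ['two']
print(reordered)  # []
```

The last query returns nothing because the class filter runs first and leaves a single p tag, so there is no second position left to select.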
Two ways to obtain an XPath:
(1) Write the tag rules by hand
(2) Find the target tag in the browser's developer tools, then right-click it and choose Copy XPath