1. Introduction to XPath
XPath is a powerful tool for navigating the hierarchical structure of web pages: it provides simple, clear path expressions for selecting nodes.
It also provides more than 100 built-in functions for matching strings, numbers, and times, and for processing nodes and sequences.
Almost any node you need to locate can be selected with XPath.
Official website: https://www.w3.org/TR/xpath
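As a small illustration of the built-in functions mentioned above, here is a sketch using lxml. Note that lxml evaluates XPath 1.0 (through libxml2), so only the XPath 1.0 function set is available there. The markup below is a made-up sample, not taken from any real page.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html = etree.HTML(
    '<ul><li class="one">  first  </li><li class="two">second</li></ul>'
)

# count() returns the number of matched nodes (lxml reports it as a float).
li_count = html.xpath('count(//li)')

# starts-with() tests whether an attribute value begins with a given prefix.
t_items = html.xpath('//li[starts-with(@class, "t")]/text()')

# normalize-space() strips leading/trailing whitespace from the string
# value of the first matched node and collapses internal runs of spaces.
first_text = html.xpath('normalize-space(//li[1])')

print(li_count)    # 2.0
print(t_items)     # ['second']
print(first_text)  # first
```

Function calls such as count() and normalize-space() return a number or a string directly, rather than a list of nodes.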
1. XPath common rules:
2. Basic use
from lxml import etree

text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>
'''

# Parse the text as HTML; the parser repairs and completes the markup
html = etree.HTML(text)
print(html)
# To parse from a file instead, pass the path and an HTML parser:
# html = etree.parse('demo.html', etree.HTMLParser())
# Serialize the completed page structure; tostring() returns bytes
result = etree.tostring(html)
# Decode the bytes to str
result = result.decode('utf-8')
print(result)
1. Match selection (all nodes)
from lxml import etree

text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>
'''

# Parse the text as HTML; the parser repairs and completes the markup
html = etree.HTML(text)
# '//*' matches every node in the document
result = html.xpath('//*')
print(result)
2. Child nodes
from lxml import etree

text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>
'''

# Parse the text as HTML; the parser repairs and completes the markup
html = etree.HTML(text)
# Select every a node that is a direct child of an li node
result = html.xpath('//li/a')
print(result)
Here "/" selects direct child nodes, while "//" selects all descendant nodes.
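The difference between "/" and "//" is easiest to see side by side. The following is a minimal sketch on a shortened, made-up version of the sample list. The variable names are chosen only for this illustration.

```python
from lxml import etree

# Shortened, hypothetical version of the sample markup.
html = etree.HTML(
    '<div><ul>'
    '<li><a href="link1">1</a></li>'
    '<li><a href="link2">2</a></li>'
    '</ul></div>'
)

# "/" only looks one level down: a is a grandchild of ul, so nothing matches.
direct = html.xpath('//ul/a')

# "//" searches all descendants, so both a nodes are found.
nested = html.xpath('//ul//a')

print(len(direct))                      # 0
print([a.get('href') for a in nested])  # ['link1', 'link2']
```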
3. Parent node
Parent node: use "..", or use the parent:: axis to refer to the parent.
from lxml import etree

text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>
'''

# Parse the text as HTML; the parser repairs and completes the markup
html = etree.HTML(text)
# Select the class attribute of the parent of the a tag whose href is link4
# (@ denotes an attribute)
result = html.xpath('//a[@href="link4"]/../@class')
# The same selection written with the parent:: axis
result1 = html.xpath('//a[@href="link4"]/parent::*/@class')
print(result)
print(result1)
4. Text Acquisition
from lxml import etree

text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>
'''

# Parse the text as HTML; the parser repairs and completes the markup
html = etree.HTML(text)
# Get the text of the a tag whose href is link4
result = html.xpath('//a[@href="link4"]/text()')
print(result)
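One detail worth keeping in mind is that text() only returns text that sits directly inside the selected node. The sketch below, on a made-up one-item list, contrasts "/text()" with "//text()". The variable names are illustrative.

```python
from lxml import etree

# Hypothetical one-item list, written without whitespace between tags.
html = etree.HTML('<ul><li class="one"><a href="link1">1</a></li></ul>')

# The li node has no text of its own; its text lives inside the a node.
direct = html.xpath('//li[@class="one"]/text()')

# Stepping into the a node first reaches the text.
via_a = html.xpath('//li[@class="one"]/a/text()')

# "//text()" collects text from every descendant of the li node.
all_text = html.xpath('//li[@class="one"]//text()')

print(direct)    # []
print(via_a)     # ['1']
print(all_text)  # ['1']
```

If the markup contains line breaks or indentation between tags, "//text()" will also return those whitespace strings, so selecting the specific child node is usually the cleaner option.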
5. Attribute multi-value matching
from lxml import etree

text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three two"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>
'''

# Parse the text as HTML; the parser repairs and completes the markup
html = etree.HTML(text)
# contains(@attribute, value) matches when the attribute value contains the given string
result = html.xpath('//li[contains(@class, "three")]/a/text()')
print(result)
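To see why contains() is needed when an attribute holds several values, the sketch below compares it with an exact "=" test on a shortened, made-up list. It also shows a commonly used idiom for matching a whole class token rather than a substring. The variable names are illustrative.

```python
from lxml import etree

# Shortened, hypothetical version of the sample markup.
html = etree.HTML(
    '<ul>'
    '<li class="three two"><a href="link3">3</a></li>'
    '<li class="four"><a href="link4">4</a></li>'
    '</ul>'
)

# "=" compares the whole attribute value, so "three two" does not match.
exact = html.xpath('//li[@class="three"]/a/text()')

# contains() is a substring test, so it matches "three two".
partial = html.xpath('//li[contains(@class, "three")]/a/text()')

# Because contains() matches substrings, it would also match a class such
# as "thirtythree". Padding with spaces restricts the match to whole tokens.
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " three ")]'
    '/a/text()'
)

print(exact)    # []
print(partial)  # ['3']
print(token)    # ['3']
```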
6. Multi-attribute matching
When a node is identified by several attributes together, match all of them at once by combining the conditions with the and operator.
from lxml import etree

text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two three" name="item"><a href="link2">2</a></li>
        <li class="three two"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>
'''

# Parse the text as HTML; the parser repairs and completes the markup
html = etree.HTML(text)
# Match li nodes whose class contains "three" and whose name is "item"
result = html.xpath('//li[contains(@class, "three") and @name="item"]/a/text()')
print(result)
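Besides and, XPath predicates also accept or and the not() function. The following is a minimal sketch on a shortened, made-up list. The variable names are illustrative.

```python
from lxml import etree

# Shortened, hypothetical version of the sample markup.
html = etree.HTML(
    '<ul>'
    '<li class="one"><a>1</a></li>'
    '<li class="two three" name="item"><a>2</a></li>'
    '<li class="three two"><a>3</a></li>'
    '</ul>'
)

# and: both conditions must hold.
both = html.xpath('//li[contains(@class, "three") and @name="item"]/a/text()')

# or: either condition is enough.
either = html.xpath('//li[@class="one" or @name="item"]/a/text()')

# not(): keep only li nodes that have no name attribute.
unnamed = html.xpath('//li[not(@name)]/a/text()')

print(both)     # ['2']
print(either)   # ['1', '2']
print(unnamed)  # ['1', '3']
```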
7. Choose in order
from lxml import etree

text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two three" name="item"><a href="link2">2</a></li>
        <li class="three two"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>
'''

# Parse the text as HTML; the parser repairs and completes the markup
html = etree.HTML(text)
# The first li
result1 = html.xpath('//li[1]/a/text()')
# The third li from the end
result2 = html.xpath('//li[last()-2]/a/text()')
# The last li
result3 = html.xpath('//li[last()]/a/text()')
# li nodes whose position is less than 3
result4 = html.xpath('//li[position()<3]/a/text()')
# XPath has 100+ built-in functions: http://www.w3school.com.cn/xpath/xpath_functions.asp
print(result1)
print(result2)
print(result3)
print(result4)
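Two points about positional predicates are easy to miss: positions start at 1, not 0, and in an expression like //li[1] the position is counted among the siblings under each parent. The sketch below uses a made-up document with two lists to show the difference. The variable names are illustrative.

```python
from lxml import etree

# Hypothetical document with two separate lists.
html = etree.HTML(
    '<div>'
    '<ul><li>a</li><li>b</li></ul>'
    '<ul><li>c</li><li>d</li></ul>'
    '</div>'
)

# The first li under each ul, so one result per list.
per_parent = html.xpath('//li[1]/text()')

# Parentheses apply the index to the whole result set instead.
first_overall = html.xpath('(//li)[1]/text()')
last_overall = html.xpath('(//li)[last()]/text()')

print(per_parent)     # ['a', 'c']
print(first_overall)  # ['a']
print(last_overall)   # ['d']
```

In the single-list sample used throughout this section the two forms give the same result, which is why the difference only shows up on pages with several lists.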
8. Node axis selection
# Select the class attribute of the parent of the a tag whose href is link4
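As a rough illustration of node axes, the sketch below applies several standard XPath axes (ancestor, attribute, child, descendant, following-sibling) to a shortened, made-up version of the sample list. This is one possible selection of axes, not necessarily the ones the original example used.

```python
from lxml import etree

# Shortened, hypothetical version of the sample markup.
html = etree.HTML(
    '<div><ul>'
    '<li class="one"><a href="link1">1</a></li>'
    '<li class="two"><a href="link2">2</a></li>'
    '<li class="three"><a href="link3">3</a></li>'
    '</ul></div>'
)

# ancestor:: selects every ancestor of the first li
# (the html, body, div, and ul elements).
ancestors = html.xpath('//li[1]/ancestor::*')

# attribute:: selects attribute values; * means all attributes.
attributes = html.xpath('//li[1]/attribute::*')

# child:: selects direct children, optionally filtered by a predicate.
children = html.xpath('//li[1]/child::a[@href="link1"]')

# descendant:: selects nodes at any depth below the context node.
descendants = html.xpath('//ul/descendant::a')

# following-sibling:: selects later siblings of the context node.
siblings = html.xpath('//li[1]/following-sibling::*')

print(len(ancestors))                      # 4
print(attributes)                          # ['one']
print(len(children))                       # 1
print(len(descendants))                    # 3
print([li.get('class') for li in siblings])  # ['two', 'three']
```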