Using XPath
XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, or a sequence of steps, through the document.
Listed below are the most useful path expressions:
Expression | Description | Example | Result |
---|---|---|---|
nodename | Selects all child elements of the current node that have this name. | bookstore | Selects all bookstore elements that are children of the current node. |
/ | At the start of a path, selects from the root node. Elsewhere, it separates steps and selects direct children of the preceding node. | /bookstore | Selects the root element bookstore. Note: a path that starts with a forward slash (/) is always an absolute path to an element. |
// | Selects matching nodes anywhere below the starting point, regardless of their depth. | //book | Selects all book elements in the document. |
. | Selects the current node. | ||
.. | Selects the parent of the current node. | ||
@ | Selects attributes. | //book[@price] | Selects all book elements that have a price attribute. |
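The expressions in the table can be tried out with lxml. The following is a minimal sketch, assuming a small made-up bookstore document (the XML content and variable names are illustrative, not part of the original notes):

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
XML = """
<bookstore>
  <book price="10"><title lang="en">Book A</title></book>
  <book><title>Book B</title></book>
</bookstore>
"""

root = etree.fromstring(XML)  # root is the <bookstore> element

# A leading / makes the path absolute: it starts at the document root.
bookstores = root.xpath("/bookstore")
# // matches at any depth in the document.
books = root.xpath("//book")
# A bare name selects child elements of the current node with that name.
children = root.xpath("book")
# . is the current node, .. is its parent.
current = books[0].xpath(".")
parent = books[0].xpath("..")
# @ inside a predicate tests for an attribute; at the end of a path it returns the values.
priced = root.xpath("//book[@price]")
prices = root.xpath("//book/@price")

print(len(books), len(priced), prices)
```

Here the calls are all made on the root element, so the bare name `book` and the absolute path `//book` happen to return the same two elements.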
1: Predicates
A predicate is used to find a specific node, or a node that contains a specified value. Predicates are written inside square brackets [].
Note: indexes start at 1.
Path expression | result |
---|---|
/bookstore/book[1] | Selects the first book element that is a child of the bookstore element. |
/bookstore/book[last()] | Selects the last book element that is a child of the bookstore element. |
/bookstore/book[last()-1] | Selects the second-to-last book element that is a child of the bookstore element. |
/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element. |
//book[@lang] | Selects all book elements that have a lang attribute. |
//book[@price='10'] | Selects all book elements whose price attribute is equal to 10. |
/bookstore/book[price>35.00] | Selects all book elements of the bookstore element whose price child element has a value greater than 35.00. |
/bookstore/book[price>35.00]/title | Selects all title elements of the book elements of the bookstore element whose price child element has a value greater than 35.00. |
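The predicates above can be checked against a small document. This is a sketch under assumptions: the sample XML and the `titles` helper are hypothetical and exist only for this illustration.

```python
from lxml import etree

# Hypothetical sample data, made up for this illustration.
XML = """
<bookstore>
  <book price="10"><title lang="en">A</title><price>30.00</price></book>
  <book><title lang="en">B</title><price>29.99</price></book>
  <book><title>C</title><price>49.99</price></book>
  <book><title>D</title><price>39.95</price></book>
</bookstore>
"""
root = etree.fromstring(XML)


def titles(expr):
    """Return the title text of every book element matched by expr."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")                 # indexes start at 1
last = titles("/bookstore/book[last()]")
second_last = titles("/bookstore/book[last()-1]")
first_two = titles("/bookstore/book[position()<3]")
price_attr = titles("//book[@price='10']")           # tests the price attribute
expensive = titles("/bookstore/book[price>35.00]")   # tests the price child element
expensive_titles = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(first, last, second_last, first_two, price_attr, expensive)
```

Note the difference between `@price` (an attribute of book) and `price` (a child element of book) inside the predicate.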
2: Selecting unknown nodes
Wildcards | description | Examples | result |
---|---|---|---|
* | Matches any element node. | /bookstore/* | Selects all child elements of the bookstore element. |
@* | Matches any attribute node. | //title[@*] | Selects all title elements that have at least one attribute. |
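A short sketch of both wildcards, assuming a made-up document with two kinds of child element (the element names and values are illustrative only):

```python
from lxml import etree

# Hypothetical sample data, made up for this illustration.
XML = """
<bookstore>
  <book><title lang="en">A</title></book>
  <magazine><title>M</title></magazine>
</bookstore>
"""
root = etree.fromstring(XML)

# * matches any element, whatever its name.
child_tags = [el.tag for el in root.xpath("/bookstore/*")]
# @* matches any attribute, so this keeps only titles that have one.
titles_with_attrs = [el.text for el in root.xpath("//title[@*]")]

print(child_tags, titles_with_attrs)
```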
3: Selecting several paths
By using the "|" operator in a path expression, you can select several paths at once.
Path expression | result |
---|---|
//book/title \| //book/price | Selects all title and price elements of all book elements. |
//title \| //price | Selects all title and price elements in the document. |
/bookstore/book/title \| //price | Selects all title elements of the book elements of the bookstore element, as well as all price elements in the document. |
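The union operator can be sketched as follows. The sample document is an assumption made for this illustration; it includes a price outside any book so that the three expressions give different results.

```python
from lxml import etree

# Hypothetical sample data, made up for this illustration.
XML = """
<bookstore>
  <book><title>A</title><price>30.00</price></book>
  <book><title>B</title><price>49.99</price></book>
  <magazine><price>5.00</price></magazine>
</bookstore>
"""
root = etree.fromstring(XML)

# Titles and prices that belong to a book.
book_fields = root.xpath("//book/title | //book/price")
# Every title and every price in the document, including the magazine price.
all_fields = root.xpath("//title | //price")
# Titles reached by an absolute path, plus every price in the document.
mixed = root.xpath("/bookstore/book/title | //price")

print(len(book_fields), len(all_fields), len(mixed))
```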
4: XPath syntax:
Usage:
Use // to search the entire page, then write the tag name, then add a predicate to narrow down the match. For example:
//div[@class='abc']
Points to note:
Difference between / and //: / selects only direct children, while // selects all descendants. In general // is used more often, but which one to use depends on the situation.
contains: Sometimes an attribute contains multiple values. In that case you can use the contains function. Sample code is as follows: //div[contains(@class,'job_detail')]
Predicate indexes start at 1, not 0.
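The two notes above can be illustrated with a small page. This is a sketch only: the HTML snippet is hypothetical, with a div whose class attribute holds two values and a paragraph at two different depths.

```python
from lxml import etree

# Hypothetical page fragment, made up for this illustration.
HTML = """
<div class="job_detail main">
  <p>direct</p>
  <div class="inner"><p>nested</p></div>
</div>
"""
doc = etree.HTML(HTML)

# An exact comparison fails because the attribute value is "job_detail main".
exact = doc.xpath("//div[@class='job_detail']")
# contains() matches on a substring of the attribute value.
div = doc.xpath("//div[contains(@class,'job_detail')]")[0]

# / reaches only direct children of the div; // reaches all descendants.
direct = div.xpath("./p/text()")
descendants = div.xpath(".//p/text()")

print(exact, direct, descendants)
```

Because contains() does a plain substring test, `contains(@class,'job')` would also match here, so the search string should be specific enough for the page being scraped.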
Parse an HTML string: use lxml.etree.HTML to parse it. Sample code is as follows:
from lxml import etree  # import the etree module
text = ''' HTML code '''
htmlElement = etree.HTML(text)  # HTML() returns an Element object
result = etree.tostring(htmlElement)  # tostring() outputs the corrected HTML, but the result is of type bytes
print(result.decode('utf-8'))  # convert it to str
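To see what "corrected HTML" means in practice, here is a small sketch that feeds etree.HTML an incomplete fragment. The fragment itself is made up for this illustration; it has no html or body wrapper and its li tags are never closed.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper and unclosed <li> tags.
text = "<ul><li>one<li>two</ul>"

html_element = etree.HTML(text)  # the parser repairs the markup while building the tree
result = etree.tostring(html_element).decode("utf-8")
items = html_element.xpath("//li/text()")

print(result)  # the serialized tree, with the missing tags filled in
print(items)
```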
5. Parse an HTML file: use lxml.etree.parse to parse it. Sample code is as follows:
from lxml import etree
html = etree.parse("./tencent.html", etree.HTMLParser())  # parse() reads the file using the given parser
result = etree.tostring(html)
print(result.decode('utf-8'))
6. lxml.etree.parse uses the XML parser by default, so it raises a parse error when it meets non-standard HTML. In that case, create your own HTML parser and pass it in.
parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse("lagou.html",parser=parser)
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))
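The difference between the two parsers can be sketched without a file on disk. The example below is an illustration under assumptions: it uses an in-memory BytesIO object in place of a real file such as lagou.html, and the markup is made up.

```python
from io import BytesIO

from lxml import etree

# Hypothetical non-standard HTML: <br> and <p> are never closed.
broken = b"<html><body><p>unclosed paragraph<br></body></html>"

# The default XML parser rejects the mismatched tags.
try:
    etree.parse(BytesIO(broken))
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# An explicit HTML parser tolerates them.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(BytesIO(broken), parser=parser)
paragraphs = tree.xpath("//p/text()")

print(xml_ok, paragraphs)
```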
Notes on using lxml with XPath:
To use XPath syntax, call the Element.xpath method to run the query. Sample code is as follows:
trs = html.xpath("//tr[position()>1]")
For expressions that select nodes, the xpath method always returns a list, even when only one node matches. To get the value of a tag's attribute:
href = html.xpath("//a/@href")  # get the href attribute values of all a tags
To get the text, use text() in the XPath expression. Sample code is as follows:
address = tr.xpath("./td[4]/text()")[0]
When you call xpath on an element to get descendants under that element, add a dot before the slash so that the search starts from the current element. Sample code is as follows:
address = tr.xpath("./td[4]/text()")[0]
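The notes above can be put together in one runnable sketch. The table below is hypothetical data made up for this illustration; only the xpath expressions come from the notes.

```python
from lxml import etree

# Hypothetical page, made up for this illustration.
HTML = """
<table>
  <tr><th>Title</th><th>Type</th><th>Count</th><th>Address</th></tr>
  <tr><td>Engineer</td><td>Tech</td><td>2</td><td>Shenzhen</td></tr>
  <tr><td>Designer</td><td>Design</td><td>1</td><td>Beijing</td></tr>
</table>
<a href="page2.html">next</a>
"""
html = etree.HTML(HTML)

# Skip the header row: position() starts at 1.
trs = html.xpath("//tr[position()>1]")

# The result is a list, so take [0] to get the single text value per row.
addresses = [tr.xpath("./td[4]/text()")[0] for tr in trs]

# Without the leading dot, // searches the whole document again,
# even though xpath was called on a single row.
all_cells = trs[0].xpath("//td")
own_cells = trs[0].xpath(".//td")

# Attribute values also come back as a list.
hrefs = html.xpath("//a/@href")

print(addresses, len(all_cells), len(own_cells), hrefs)
```

Indexing with [0] raises IndexError when nothing matches, so real scraping code may want to check that the list is non-empty first.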
Parent node: ..
result = html.xpath('//a[@href="link1.html"]/../@class')  # first select the a node whose href is link1.html, then its parent node, then the parent's class attribute value
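A runnable sketch of the parent step, assuming a made-up list of links (the class names and link targets are illustrative only). It also shows the parent:: axis, which is the long form of the .. abbreviation in XPath.

```python
from lxml import etree

# Hypothetical page fragment, made up for this illustration.
HTML = """
<ul>
  <li class="item-1"><a href="link1.html">first</a></li>
  <li class="item-0"><a href="link2.html">second</a></li>
</ul>
"""
html = etree.HTML(HTML)

# .. steps from the matched <a> up to its parent <li>.
result = html.xpath('//a[@href="link1.html"]/../@class')
# parent::* is the unabbreviated form of the same step.
same = html.xpath('//a[@href="link1.html"]/parent::*/@class')

print(result, same)
```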
Attribute match: use @ to filter by attribute
.xpath('//li[@class="item-0"]')  # get all li tags with class="item-0"; returns a list
Matching an attribute that has multiple values
.xpath('//li[contains(@class, "value")]')  # contains(): the first argument is the attribute, the second is the value to look for
Matching on multiple attributes
.xpath('//li[contains(@class, "value") and @name="value"]')
.xpath('//li[@class="value" and @name="value"]')
Operators such as and, or, <, >, <=, >= and so on can be used inside the predicate.
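The three kinds of attribute match can be compared side by side. This is a minimal sketch, assuming a made-up list whose class and name values are chosen only to make the results differ:

```python
from lxml import etree

# Hypothetical page fragment, made up for this illustration.
HTML = """
<ul>
  <li class="li li-first" name="item">first</li>
  <li class="li" name="item">second</li>
  <li class="li li-first" name="other">third</li>
</ul>
"""
html = etree.HTML(HTML)

# Exact match: the whole attribute value must equal "li".
exact = html.xpath('//li[@class="li"]/text()')
# contains(): the attribute value only needs to include "li-first".
partial = html.xpath('//li[contains(@class, "li-first")]/text()')
# and: both conditions must hold.
both = html.xpath('//li[contains(@class, "li-first") and @name="item"]/text()')
# or: either condition is enough.
either = html.xpath('//li[@class="li" or @name="other"]/text()')

print(exact, partial, both, either)
```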