4.1 Using XPath
XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well for searching HTML documents.
1, XPath Overview
The official specification is available at https://www.w3.org/TR/xpath/.
2, Common XPath Rules
Table 4-1 Commonly used XPath rules
expression | description |
---|---|
nodename | Select all child nodes of this node |
/ | Select direct child nodes of the current node |
// | Select descendant nodes of the current node |
. | Select the current node |
.. | Select the parent of the current node |
@ | Select attributes |
These are the most commonly used XPath matching rules. For example:
    //title[@lang='eng']
This XPath rule selects all nodes named title whose lang attribute has the value eng.
Later, we will use Python's lxml library to parse HTML with XPath.
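As a quick illustration of the rule above, here is a minimal sketch that applies it with lxml. The bookstore document is a hypothetical sample invented for this example and is not part of the original text.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration.
xml = b"""<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
  <book><title lang="eng">Learning XML</title></book>
</bookstore>"""

root = etree.fromstring(xml)

# Select every title node whose lang attribute equals eng.
titles = root.xpath("//title[@lang='eng']")
eng_titles = [t.text for t in titles]
print(eng_titles)
```

Only the two titles marked eng are returned; the French title is filtered out by the attribute condition.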
3, Preparation
Before starting, make sure the lxml library is installed.
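One way to check the installation is to try the import, as in the sketch below. The variable name lxml_available is introduced here for illustration only, and the pip command in the message assumes a standard pip-based setup.

```python
# Try to import lxml and report whether it is available.
try:
    from lxml import etree
except ImportError:
    lxml_available = False
    print("lxml is not installed; try: pip install lxml")
else:
    lxml_available = True
    # LXML_VERSION is a tuple of integers describing the installed version.
    print("lxml version:", etree.LXML_VERSION)
```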
4, Introductory Example
    from lxml import etree

    # Sample HTML; note that the last li node is not closed
    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''
    html = etree.HTML(text)         # parse (and repair) the HTML
    result = etree.tostring(html)   # serialize back to bytes
    print(result.decode('utf-8'))
Here we first import the etree module from the lxml library, then declare a piece of HTML text and pass it to etree.HTML(), which parses it into an object that can be queried with XPath. Note that the last li node in the HTML text is not closed; the etree module corrects the HTML automatically.
We then call the tostring() method to output the corrected HTML. The result is of type bytes, so we use decode() to convert it to str. The output is as follows:
<html><body><div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </li></ul>
</div>
</body></html>
As you can see, after processing, the closing tag of the last li node has been filled in, and the body and html nodes have been added automatically as well.
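As a small variation on this step, tostring() can also return str directly when it is given encoding='unicode', which makes the decode() call unnecessary. The sketch below uses a shortened one-line HTML string (an assumption made to keep the example compact) with the li node still left unclosed.

```python
from lxml import etree

# Shortened sample; the li node is deliberately left unclosed.
html = etree.HTML('<ul><li class="item-0"><a href="link5.html">fifth item</a></ul>')

as_bytes = etree.tostring(html)                      # bytes by default
as_text = etree.tostring(html, encoding='unicode')   # str, no decode() needed

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```

The printed markup shows the repaired structure: the missing closing li tag is present, and the html and body nodes wrap the list.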
The HTML can also be read from a file and parsed directly, for example:
    from lxml import etree

    # Parse the file with the HTML parser instead of the default XML parser
    html = etree.parse('./test.html', etree.HTMLParser())
    result = etree.tostring(html)
    print(result.decode('utf-8'))
Here test.html contains the HTML code from the example above.
The output is slightly different this time: it contains an additional DOCTYPE declaration, which has no effect on parsing. The result is as follows:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </li></ul>
</div></body></html>
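When I ran this, the output also contained extra &#13; entities, and I am not sure why. &#13; is the character with ASCII code 13, a carriage return, so it may come from Windows-style line endings in test.html.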
5, All Nodes
XPath rules that begin with // select all nodes in the document that meet the requirements. Taking the HTML text above as an example, all nodes can be selected like this:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # //* matches every element in the document
    result = html.xpath('//*')
    print(result)
Output:
[<Element html at 0x29c691f8388>, <Element body at 0x29c6949d2c8>, <Element div at 0x29c6949d308>, <Element ul at 0x29c6949d348>, <Element li at 0x29c6949d648>, <Element a at 0x29c6949d688>, <Element li at 0x29c6949d708>, <Element a at 0x29c6949d748>, <Element li at 0x29c6949d788>, <Element a at 0x29c6949d548>, <Element li at 0x29c6949d7c8>, <Element a at 0x29c6949d808>, <Element li at 0x29c6949d848>, <Element a at 0x29c6949d888>]
Here * matches any node, so every node in the HTML text is retrieved. The return value is a list in which each item is of type Element, followed by the node name, such as html, body, div, ul, li, and a. All nodes are included in the list.
You can also specify a node name to match. For example, to get all li nodes:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # //li matches every li element, wherever it appears
    result = html.xpath('//li')
    print(result)
    print(result[0])
To select all li nodes, use // followed directly by the node name, and pass the expression to the xpath() method. Because the result is a list, a single node can be taken by index, as with result[0].
Output:
[<Element li at 0x1233c89d308>, <Element li at 0x1233c89d348>, <Element li at 0x1233c89d388>, <Element li at 0x1233c89d688>, <Element li at 0x1233c89d588>]
<Element li at 0x1233c89d308>
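The memory addresses in these lists are not very informative. A minimal sketch for inspecting the result more readably is to print each element's tag attribute instead. The shortened inline HTML below is an assumption made so the example does not depend on test.html.

```python
from lxml import etree

# Shortened inline sample so the example does not need test.html.
text = '''
<div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul></div>
'''
html = etree.HTML(text)

# Collect the tag name of every element matched by //*.
all_tags = [el.tag for el in html.xpath('//*')]
# Count the li elements matched by //li.
li_count = len(html.xpath('//li'))

print(all_tags)
print(li_count)
```

The tag list also makes the automatically added html and body nodes visible.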
6, Child Nodes
Use / or // to find the child nodes or descendant nodes of an element. For example, to select all direct a child nodes of the li nodes:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # / selects direct children: a nodes directly under li nodes
    result = html.xpath('//li/a')
    print(result)
Output:
[<Element a at 0x1e1d79db408>, <Element a at 0x1e1d79db448>, <Element a at 0x1e1d79db488>, <Element a at 0x1e1d79db788>, <Element a at 0x1e1d79db688>]
Here / selects direct children. To get all descendant nodes, use // instead. For example, to get all a nodes that are descendants of the ul node:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # // selects descendants: a nodes at any depth under ul
    result = html.xpath('//ul//a')
    print(result)
The output is the same.
However, //ul/a returns no results. Because / selects only direct children, and the ul node has no direct a children (only li children), nothing matches.
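The three expressions can be compared side by side. This is a minimal sketch on a shortened inline HTML string (an assumption, used in place of test.html); it compares the href values rather than the element objects.

```python
from lxml import etree

text = ('<ul>'
        '<li><a href="link1.html">first item</a></li>'
        '<li><a href="link2.html">second item</a></li>'
        '</ul>')
html = etree.HTML(text)

# Direct a children of li nodes.
direct = [a.get('href') for a in html.xpath('//li/a')]
# a nodes anywhere below ul.
descendants = [a.get('href') for a in html.xpath('//ul//a')]
# Direct a children of ul: there are none, only li children.
direct_under_ul = html.xpath('//ul/a')

print(direct)
print(descendants)
print(direct_under_ul)
```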
7, Parent Nodes
Suppose we first select the a node whose href attribute is link4.html, then get its parent node with .., and finally get the parent's class attribute. The code is as follows:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # .. moves from the matched a node up to its parent li
    result = html.xpath('//a[@href="link4.html"]/../@class')
    print(result)
Output:
['item-1']
We can also use parent:: to get the parent node. The code is as follows:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # parent::* is the explicit axis form of ..
    result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
    print(result)
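The sketch below checks that both XPath forms give the same answer, and adds a third route that is not covered in the text: the getparent() method that lxml elements provide. The one-line HTML string is a shortened sample assumed for this illustration.

```python
from lxml import etree

html = etree.HTML(
    '<ul><li class="item-1"><a href="link4.html">fourth item</a></li></ul>'
)

# Abbreviated syntax: .. steps up to the parent.
via_dots = html.xpath('//a[@href="link4.html"]/../@class')
# Explicit axis syntax: parent::* is equivalent to ..
via_axis = html.xpath('//a[@href="link4.html"]/parent::*/@class')

# Alternative outside XPath: lxml's getparent() on the element itself.
a = html.xpath('//a[@href="link4.html"]')[0]
via_api = a.getparent().get('class')

print(via_dots, via_axis, via_api)
```

Note that the XPath forms return a list of attribute values, while get() returns a single string.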
8, Attribute Matching
When selecting nodes, we can use the @ symbol to filter by attribute. For example, to select the li nodes whose class is item-0:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # Keep only li nodes whose class attribute equals item-0
    result = html.xpath('//li[@class="item-0"]')
    print(result)
Here the condition [@class="item-0"] restricts the match to nodes whose class attribute is item-0. Two li nodes in the HTML text meet this condition, so two elements should be returned.
Output:
[<Element li at 0x1b67f5fd3c8>, <Element li at 0x1b67f5fd408>]
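To confirm that the two returned elements really are the expected ones, their attributes can be read back, as in this minimal sketch. It uses a shortened inline sample (an assumption, in place of test.html) and the element's get() method.

```python
from lxml import etree

text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
'''
html = etree.HTML(text)

matches = html.xpath('//li[@class="item-0"]')
# Read the class attribute back from each matched element.
classes = [li.get('class') for li in matches]
# Run a relative XPath from each match to find the link it contains.
links = [li.xpath('./a/@href')[0] for li in matches]

print(classes)
print(links)
```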
9, Getting Text
In XPath, text() gets the text in a node. Let's try to get the text of the li nodes selected above. The code is as follows:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # /text() returns only text that is a direct child of li
    result = html.xpath('//li[@class="item-0"]/text()')
    print(result)
Strangely, we do not get any real text, only a line break. Why? In this expression text() is preceded by /, which selects direct children. The direct child of each li node is an a node, and the text sits inside that a node. What is matched here is the line break inside the corrected last li node: because that li node was not closed in the source, the line break following its a node ends up inside it when the closing tag is added automatically.
So there are two ways to get the text inside the li nodes: select the a node first and then get its text, or use //. Let's look at the difference between the two.
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # Step into the a node first, then take its text
    result = html.xpath('//li[@class="item-0"]/a/text()')
    print(result)
Output:
['first item', 'fifth item']
Now look at the result of the other approach, using //:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # //text() returns every text node below li, at any depth
    result = html.xpath('//li[@class="item-0"]//text()')
    print(result)
Output:
['first item', 'fifth item', '\r\n ']
This selects all descendant text nodes. The first two are the text inside the a child nodes of the li nodes, and the third is the text directly inside the last li node, that is, the line break.
So, to get all the text inside descendant nodes, use // followed by text() directly. This retrieves the most complete text, but it may be mixed with line breaks and other special characters. To get the text of one specific descendant node, select that node first and then use text() on it, which keeps the results clean.
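If the // approach is preferred, the stray whitespace can be filtered out afterwards in Python. The sketch below shows one possible way to do this, on a shortened inline sample (an assumption) whose last li node is left unclosed on purpose.

```python
from lxml import etree

# Shortened sample; the last li node is left unclosed on purpose.
text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
'''
html = etree.HTML(text)

# Option 1: go through the a node, so only its text is returned.
precise = html.xpath('//li[@class="item-0"]/a/text()')

# Option 2: take every descendant text node, then drop the strings
# that consist only of whitespace.
raw = html.xpath('//li[@class="item-0"]//text()')
cleaned = [t.strip() for t in raw if t.strip()]

print(precise)
print(cleaned)
```

After cleaning, both options yield the same two strings for this sample.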
10, Getting Attributes
We know that text() gets the text inside a node. How do we get a node's attributes? Again, use the @ symbol. For example, to get the href attribute of every a node under the li nodes:
    from lxml import etree

    html = etree.parse('./test.html', etree.HTMLParser())
    # @href at the end of the path returns the attribute values
    result = html.xpath('//li/a/@href')
    print(result)
Output:
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
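Note the difference from attribute matching: there the attribute appears inside brackets as a condition, while here @href ends the path and the values themselves are returned as strings. The sketch below, on a shortened inline sample (an assumption), also pairs each link with its text using the element's get() method and text attribute.

```python
from lxml import etree

text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

# Attribute values come back as a list of strings.
hrefs = html.xpath('//li/a/@href')

# Pair each href with the link text by working on the a elements.
pairs = [(a.get('href'), a.text) for a in html.xpath('//li/a')]

print(hrefs)
print(pairs)
```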
11, Matching Multi-Valued Attributes
Sometimes an attribute of a node has multiple values, for example:
    from lxml import etree

    text = '''
    <li class="li li-first"><a href="link.html">first item</a></li>
    '''
    html = etree.HTML(text)
    # Exact comparison against a class attribute that holds two values
    result = html.xpath('//li[@class="li"]/a/text()')
    print(result)
Here the class attribute of the li node in the HTML text has two values, li and li-first. The exact attribute comparison used earlier no longer matches it. The output is as follows:
[]
In this case, use the contains() function. The code can be rewritten as follows:
    from lxml import etree

    text = '''
    <li class="li li-first"><a href="link.html">first item</a></li>
    '''
    html = etree.HTML(text)
    # contains() matches when the attribute value includes the string
    result = html.xpath('//li[contains(@class,"li")]/a/text()')
    print(result)
The contains() function takes the attribute as its first argument and a value as its second. As long as the attribute value contains the value passed in, the node matches.
The output is now as follows:
['first item']
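One caveat: contains() is a plain substring test, so contains(@class,"li") would also match a class such as list-item. A common XPath 1.0 idiom for matching a whole class name pads the attribute with spaces before testing. The sketch below illustrates the difference; the second li node is a hypothetical addition made for this example.

```python
from lxml import etree

# The second li is a hypothetical addition to show the substring pitfall.
text = '''
<ul>
<li class="li li-first"><a href="link1.html">first item</a></li>
<li class="list-item"><a href="link2.html">second item</a></li>
</ul>
'''
html = etree.HTML(text)

# Substring test: "list-item" also contains "li".
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Whole-token test: pad with spaces so only the class name li matches.
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```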
12, Matching Multiple Attributes
We may also need to identify a node by several attributes at once, which requires matching multiple attributes simultaneously. The and operator connects the conditions, for example:
    from lxml import etree

    text = '''
    <li class="li li-first" name="item"><a href="link.html">first item</a></li>
    '''
    html = etree.HTML(text)
    # Both conditions must hold: class contains li AND name equals item
    result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
    print(result)
Here another attribute, name, has been added to the li node. To pin down this node we need to select on both class and name: one condition is that the class attribute contains the string li, and the other is that the name attribute is the string item. Both must hold at the same time, so they are joined with the and operator, and the combined condition is placed inside the brackets.
Output:
['first item']
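Besides and, XPath 1.0 also provides the or operator and the not() function, which combine conditions in the same way. The sketch below compares the three on a hypothetical three-item list invented for this illustration.

```python
from lxml import etree

# Hypothetical list invented to contrast and, or, and not().
text = '''
<ul>
<li class="li li-first" name="item"><a href="link1.html">first item</a></li>
<li class="li" name="other"><a href="link2.html">second item</a></li>
<li class="plain" name="item"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# and: both conditions must hold.
both = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
# or: at least one condition must hold.
either = html.xpath('//li[contains(@class, "li") or @name="item"]/a/text()')
# not(): the condition must not hold.
negated = html.xpath('//li[not(@name="item")]/a/text()')

print(both)
print(either)
print(negated)
```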
13, Selecting by Position
Sometimes a selection matches several nodes at once, but we want only one of them, such as the second or the last. How do we do that?
Pass an index in the brackets to pick the node at a specific position, for example:
    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''
    html = etree.HTML(text)
    # First li node (positions start at 1)
    result = html.xpath('//li[1]/a/text()')
    print(result)
    # Last li node
    result = html.xpath('//li[last()]/a/text()')
    print(result)
    # li nodes at positions 1 and 2
    result = html.xpath('//li[position()<3]/a/text()')
    print(result)
    # Third-to-last li node
    result = html.xpath('//li[last()-2]/a/text()')
    print(result)
In the first selection, we pick the first li node by passing the number 1 in the brackets. Note that, unlike in most program code, numbering starts at 1, not 0.
In the second selection, we pick the last li node by passing last() in the brackets.
In the third selection, we pick the li nodes whose position is less than 3, that is, positions 1 and 2, so the result is the first two li nodes.
In the fourth selection, we pick the third-to-last li node by passing last()-2 in the brackets. Since last() is the last node, last()-2 is the third from the end.
Output:
['first item']
['fifth item']
['first item', 'second item']
['third item']
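One subtlety worth knowing: the position in //li[1] is counted among the siblings under each parent, not across the whole document. With a single list this makes no difference, but with several lists it does. The sketch below uses a hypothetical document with two lists to show the effect, and the parenthesized form that indexes the overall result instead.

```python
from lxml import etree

# Hypothetical document with two lists, invented for this illustration.
text = '''
<div>
<ul><li>a1</li><li>a2</li></ul>
<ul><li>b1</li><li>b2</li></ul>
</div>
'''
html = etree.HTML(text)

# The first li under each ul: one result per list.
per_parent = html.xpath('//li[1]/text()')
# The first li in the whole document: parentheses apply the index
# to the complete //li result.
overall = html.xpath('(//li)[1]/text()')
# The last li in the whole document.
last_overall = html.xpath('(//li)[last()]/text()')

print(per_parent)
print(overall)
print(last_overall)
```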
14, Selecting Nodes by Axis
XPath provides many axes for selecting nodes, covering child elements, siblings, parents, ancestors, and so on, for example:
    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
    '''
    html = etree.HTML(text)
    # 1. All ancestors of the first li
    result = html.xpath('//li[1]/ancestor::*')
    print(result)
    # 2. Only the div ancestor
    result = html.xpath('//li[1]/ancestor::div')
    print(result)
    # 3. All attribute values of the first li
    result = html.xpath('//li[1]/attribute::*')
    print(result)
    # 4. Direct a child with a given href
    result = html.xpath('//li[1]/child::a[@href="link1.html"]')
    print(result)
    # 5. span descendants
    result = html.xpath('//li[1]/descendant::span')
    print(result)
    # 6. Second node after the first li in document order
    result = html.xpath('//li[1]/following::*[2]')
    print(result)
    # 7. All following siblings
    result = html.xpath('//li[1]/following-sibling::*')
    print(result)
Output:
[<Element html at 0x1aabef570c8>, <Element body at 0x1aabf20b588>, <Element div at 0x1aabf20b5c8>, <Element ul at 0x1aabf20b608>]
[<Element div at 0x1aabf20b5c8>]
['item-0']
[<Element a at 0x1aabf20b588>]
[<Element span at 0x1aabf20b5c8>]
[<Element a at 0x1aabf20b608>]
[<Element li at 0x1aabf20b948>, <Element li at 0x1aabf20b848>, <Element li at 0x1aabf20b548>, <Element li at 0x1aabf20b648>]
In the first selection, we use the ancestor axis to get all ancestor nodes. The axis name must be followed by two colons and then a node selector. Here we use *, which matches all nodes, so the result is all ancestors of the first li node: html, body, div, and ul.
In the second selection, we add a restriction by putting div after the two colons, so the result is only the div ancestor.
In the third selection, we use the attribute axis to get attribute values. The selector after it is again *, which means all attributes, so the return value is all attribute values of the li node.
In the fourth selection, we use the child axis to get direct child nodes. Here we add a restriction, selecting the a node whose href attribute is link1.html.
In the fifth selection, we use the descendant axis to get descendant nodes. Here we restrict the selection to span nodes, so the result contains only the span node and not the a node.
In the sixth selection, we use the following axis to get all nodes after the current node. Although we match with *, we also add an index, so only the second following node is returned.
In the seventh selection, we use the following-sibling axis to get all siblings after the current node. Here we match with *, so all following siblings are returned.
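Axes can also be combined with node names, indexes, and text() in a single path. The sketch below starts from the third li node and uses preceding-sibling, an XPath 1.0 axis not shown in the example above, together with following-sibling and ancestor. The shortened sample and the variable name third are assumptions made for this illustration.

```python
from lxml import etree

text = '''
<div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul></div>
'''
html = etree.HTML(text)

# Starting point for all three queries: the third li node.
third = '//li[@class="item-inactive"]'

# Text of the links in all li siblings before it.
prev_texts = html.xpath(third + '/preceding-sibling::li/a/text()')
# Text of the link in the nearest li sibling after it.
next_text = html.xpath(third + '/following-sibling::li[1]/a/text()')
# Tag names of all its ancestors.
ancestors = [el.tag for el in html.xpath(third + '/ancestor::*')]

print(prev_texts)
print(next_text)
print(ancestors)
```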
15, Conclusion
For more on XPath usage, see http://www.w3school.com.cn/xpath/index.asp.
For more on using the Python lxml library, see http://lxml.de/.