"Python3 web crawler developed real" study notes 3 (Chapter 4: Using Xpath parsing library)

4.1 Using XPath

XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well for searching HTML documents.

1. XPath Overview

Official documentation: https://www.w3.org/TR/xpath/ .

2. Common XPath rules

Table 4-1 Several commonly used XPath rules

expression	description
`nodename`	Select the child nodes of the current node that have this name
`/`	Select direct child nodes of the current node (at the start of an expression, select from the root)
`//`	Select descendant nodes of the current node
`.`	Select the current node
`..`	Select the parent of the current node
`@`	Select attributes

The table lists the most commonly used XPath matching rules. For example:

//title[@lang='eng']

This XPath rule selects all nodes named title whose lang attribute has the value eng.

Later, we will use Python's lxml library to parse HTML with XPath.
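As a quick preview, the rules in Table 4-1 can be tried out with lxml. This is only a minimal sketch: it assumes lxml is already installed, and the bookstore document is a hypothetical sample invented for illustration.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
</bookstore>
"""
root = etree.fromstring(xml)  # root is the <bookstore> element

# nodename: child elements of the current node with that name
children = [e.tag for e in root.xpath('book')]
# / at the start of an expression: begin at the document root
from_root = [e.tag for e in root.xpath('/bookstore/book')]
# //: descendants at any depth
titles = [e.text for e in root.xpath('//title')]
# . and ..: the current node and its parent
current = root.xpath('.')[0].tag
parent = root.xpath('//title')[0].xpath('..')[0].tag
# @: attributes
langs = root.xpath('//title/@lang')
# The rule discussed above: title nodes whose lang attribute is eng
eng_titles = [e.text for e in root.xpath("//title[@lang='eng']")]

print(children, from_root, titles, current, parent, langs, eng_titles)
```

Each query returns a list, so the results can be compared or indexed like any other Python list.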

3. Preparation

Before starting, make sure the lxml library is installed (for example, with pip3 install lxml).

4. Introductory example

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

Here we first import the etree module from the lxml library and declare a piece of HTML text. We then pass it to etree.HTML(), which parses it and returns an object that can be queried with XPath. Note that the last li node in the HTML text is not closed; the etree module corrects the HTML text automatically.

Here we call the tostring() method to output the corrected HTML code, but the result is of type bytes. We use the decode() method to convert it to str. The result is as follows:

<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div>
</body></html>

As you can see, after processing, the closing tag of the li node has been added, and the body and html nodes have been added automatically as well.
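The correction and the bytes-to-str conversion can be checked on a smaller input. The sketch below uses a shortened, hypothetical fragment rather than the full example; it also shows that passing encoding='unicode' to tostring() returns a str directly, which avoids the separate decode() step.

```python
from lxml import etree

# Shortened, hypothetical fragment with an unclosed <li>, as in the example above.
broken = '<ul><li class="item-0"><a href="link5.html">fifth item</a></ul>'
html = etree.HTML(broken)

raw = etree.tostring(html)                          # bytes by default
fixed = raw.decode('utf-8')                         # convert bytes to str
as_text = etree.tostring(html, encoding='unicode')  # str directly, no decode needed

print(type(raw).__name__)
print(fixed)
print(type(as_text).__name__)
```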

Alternatively, a text file can be read and parsed directly, for example:

html = etree.parse('./test.html',etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

Here test.html contains the HTML code from the example above.

The output is slightly different this time: it includes an additional DOCTYPE declaration, which does not affect parsing. The result is as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>&#13;
    <ul>&#13;
         <li class="item-0"><a href="link1.html">first item</a></li>&#13;
         <li class="item-1"><a href="link2.html">second item</a></li>&#13;
         <li class="item-inactive"><a href="link3.html">third item</a></li>&#13;
         <li class="item-1"><a href="link4.html">fourth item</a></li>&#13;
         <li class="item-0"><a href="link5.html">fifth item</a>&#13;
     </li></ul>&#13;
 </div></body></html>

I'm not sure why the extra &#13; appears. &#13; is the character reference for ASCII code 13, a carriage return.
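One plausible explanation, offered as an assumption rather than something confirmed here, is that test.html was saved with Windows-style line endings (\r\n). A parser that keeps the \r characters will have them written out as &#13; by tostring(). The sketch below compares a hypothetical CRLF input with a normalized copy; whether the first line of output contains &#13; may depend on the lxml/libxml2 version, but the normalized input should not.

```python
from lxml import etree

# Hypothetical file contents with Windows (CRLF) line endings.
crlf = b'<div>\r\n<p>hello</p>\r\n</div>'
# Normalizing the line endings before parsing removes the carriage returns.
clean = crlf.replace(b'\r\n', b'\n')


def serialize(data):
    """Parse HTML bytes and return the serialized document as str."""
    return etree.tostring(etree.HTML(data)).decode('utf-8')


crlf_out = serialize(crlf)
clean_out = serialize(clean)
print(repr(crlf_out))
print(repr(clean_out))
```

If the carriage returns are unwanted, normalizing the line endings before parsing, as done for clean above, is one simple way to avoid them.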

5. All nodes

We usually use XPath rules beginning with // to select all nodes that meet the requirements. Taking the HTML text above as an example, all nodes can be selected like this:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//*')
print(result)

Output:

[<Element html at 0x29c691f8388>, <Element body at 0x29c6949d2c8>, <Element div at 0x29c6949d308>, <Element ul at 0x29c6949d348>, <Element li at 0x29c6949d648>, <Element a at 0x29c6949d688>, <Element li at 0x29c6949d708>, <Element a at 0x29c6949d748>, <Element li at 0x29c6949d788>, <Element a at 0x29c6949d548>, <Element li at 0x29c6949d7c8>, <Element a at 0x29c6949d808>, <Element li at 0x29c6949d848>, <Element a at 0x29c6949d888>]

Here * matches all nodes, so every node in the HTML text is retrieved. As you can see, the return value is a list in which each element is of type Element, followed by the node name, such as html, body, div, ul, li, and a. All nodes are included in the list.
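The printed Element objects are hard to read at a glance. A small sketch, using a shortened, hypothetical version of the example text and the tag attribute of each element, makes the node names easier to inspect:

```python
from lxml import etree

# Shortened version of the example text (two li nodes instead of five).
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)

# Collect the tag name of every element matched by //*
tags = [element.tag for element in html.xpath('//*')]
print(tags)
```

The html and body entries come from the automatic correction described earlier, not from the input text.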

Of course, you can also specify a node name to match. For example, to get all li nodes:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])

Here, to select all li nodes, we use // followed directly by the node name and pass the expression to the xpath() method.

Output:

[<Element li at 0x1233c89d308>, <Element li at 0x1233c89d348>, <Element li at 0x1233c89d388>, <Element li at 0x1233c89d688>, <Element li at 0x1233c89d588>]
<Element li at 0x1233c89d308>

6. Child nodes

We can use / or // to find child nodes or descendant nodes of an element. For example, to select all a nodes that are direct children of li nodes:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

Output:

[<Element a at 0x1e1d79db408>, <Element a at 0x1e1d79db448>, <Element a at 0x1e1d79db488>, <Element a at 0x1e1d79db788>, <Element a at 0x1e1d79db688>]

Here / selects direct children. To get all descendant nodes, use // instead. For example, to get all a nodes that are descendants of the ul node:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)

The output is the same.

However, if we use //ul/a here, we get no results. Because / selects only direct children, and the ul node has no a nodes directly under it, only li nodes, nothing matches.
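The three cases can be compared side by side. This is a minimal sketch on a hypothetical two-item list, not the full example file:

```python
from lxml import etree

# Hypothetical two-item list, kept on one line for brevity.
text = ('<ul>'
        '<li><a href="link1.html">first</a></li>'
        '<li><a href="link2.html">second</a></li>'
        '</ul>')
html = etree.HTML(text)

direct = [a.get('href') for a in html.xpath('//li/a')]        # a is a direct child of li
descendants = [a.get('href') for a in html.xpath('//ul//a')]  # a is a descendant of ul
no_match = html.xpath('//ul/a')                               # a is not a direct child of ul

print(direct)
print(descendants)
print(no_match)
```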

7. Parent node

Now we first select the a node whose href attribute is link4.html, then get its parent node, and then get the parent's class attribute. The code is as follows:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

Output:

['item-1']

We can also get the parent node with parent::, as follows:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
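Both forms should return the same value. The sketch below checks this on a hypothetical one-item list and, as an aside, also uses getparent(), the lxml element method for reaching the parent from Python code instead of from the XPath expression:

```python
from lxml import etree

# Hypothetical single-item list for illustration.
html = etree.HTML('<ul><li class="item-1"><a href="link4.html">fourth item</a></li></ul>')

with_dots = html.xpath('//a[@href="link4.html"]/../@class')
with_axis = html.xpath('//a[@href="link4.html"]/parent::*/@class')

# The same parent can be reached through the element API.
a = html.xpath('//a[@href="link4.html"]')[0]
via_api = a.getparent().get('class')

print(with_dots, with_axis, via_api)
```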

8. Attribute matching

When selecting nodes, we can also use the @ symbol to filter by attribute. For example, to select the li nodes whose class is item-0:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

Here we add [@class="item-0"] to restrict the match to nodes whose class attribute is item-0. Two li nodes in the HTML text meet this condition, so two matching elements should be returned.

Output:

[<Element li at 0x1b67f5fd3c8>, <Element li at 0x1b67f5fd408>]

9. Getting text

We can use the XPath text() method to get the text inside a node. Let's try to get the text of the li nodes above:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/text()')
print(result)

Strangely, we do not get any real text, only a line break. Why? In this XPath, text() is preceded by /, which selects direct children. The direct children of these li nodes are a nodes, and the text sits inside the a nodes. What is matched here is the line break inside the corrected last li node: its closing tag was added automatically on the following line, so the whitespace before that tag belongs to the li node.

So there are two ways to get the text inside the li nodes: select the a node first and then get its text, or use //. Let's look at the difference between the two.

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

Output:

['first item', 'fifth item']

Now look at the result of the other approach, using //:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]//text()')
print(result)

Output:

['first item', 'fifth item', '\r\n     ']

Here all descendant text nodes are selected. The first two are the text inside the a child nodes of the li nodes, and the third is the text inside the last li node itself, that is, the line break.

So, to get all the text inside descendant nodes, append //text() directly. This returns the most complete text, but it may be mixed with line breaks and other special characters. To get only the text of specific descendant nodes, select those nodes first and then call text() on them, which keeps the results clean.
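The three variants can be compared in one place. The sketch below uses a shortened, hypothetical version of the example text and adds one extra step that is not part of the original notes: filtering out whitespace-only strings from the // result in Python.

```python
from lxml import etree

# Shortened example: the second li is left unclosed, as in the original text.
text = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
</ul>
'''
html = etree.HTML(text)

direct = html.xpath('//li[@class="item-0"]/text()')       # only whitespace
via_a = html.xpath('//li[@class="item-0"]/a/text()')      # text inside the a nodes
everything = html.xpath('//li[@class="item-0"]//text()')  # all descendant text

# Drop whitespace-only strings to tidy the // result.
cleaned = [t.strip() for t in everything if t.strip()]

print(direct)
print(via_a)
print(everything)
print(cleaned)
```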

10. Getting attributes

We know that text() gets the text inside a node. How do we get a node's attributes? Again with the @ symbol. For example, to get the href attribute of every a node under the li nodes:

html = etree.parse('./test.html',etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

Output:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
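In a crawler, the attribute and the text of the same node are often needed together. One possible approach, sketched here on a hypothetical two-item list, is to select the a elements once and read both values through the element API (get() for an attribute, text for the inner text):

```python
from lxml import etree

# Hypothetical two-item list for illustration.
text = ('<ul>'
        '<li><a href="link1.html">first item</a></li>'
        '<li><a href="link2.html">second item</a></li>'
        '</ul>')
html = etree.HTML(text)

hrefs = html.xpath('//li/a/@href')  # attribute values only

# Map each href to the text of the same a node.
links = {a.get('href'): a.text for a in html.xpath('//li/a')}

print(hrefs)
print(links)
```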

11. Matching multi-valued attributes

Sometimes an attribute of a node has multiple values, for example:

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)

Here the class attribute of the li node in the HTML text has two values, li and li-first. If we try to match it with the exact attribute match used earlier, nothing matches, and the result is:

[]

In this case we need the contains() function. The code can be rewritten as follows:

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class,"li")]/a/text()')
print(result)

With contains(), the first argument is the attribute name and the second is the attribute value. As long as the attribute contains the given value, the node matches.

The result is now:

['first item']
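One caveat: contains() is a plain substring test, so contains(@class, "li") also matches a class such as list. A common XPath 1.0 idiom for matching a whole class token pads the attribute with spaces before testing. The sketch below compares the two on a hypothetical list; the class names list and last are invented for illustration.

```python
from lxml import etree

# Hypothetical list: only the first and third items carry the class token "li".
text = '''
<ul>
<li class="li li-first"><a href="link1.html">first item</a></li>
<li class="list"><a href="link2.html">second item</a></li>
<li class="last li"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# Substring match: "list" also contains "li".
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Token match: pad with spaces so only the whole word " li " matches.
token = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(token)
```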

12. Matching multiple attributes

We may also need to identify a node by several attributes at once, which means matching multiple attributes simultaneously. The and operator can be used to connect the conditions, for example:

from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
print(result)

Here the li node has an additional attribute, name. To identify this node, we select on both class and name: one condition is that the class attribute contains the string li, and the other is that the name attribute equals the string item. Both must hold at the same time, so they are joined with the and operator and placed together inside the square brackets as the filter condition.

Output:

['first item']
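Besides and, XPath 1.0 also provides the or operator and the not() function, which combine conditions in the same way. A minimal sketch on a hypothetical three-item list (the name value other is invented for illustration):

```python
from lxml import etree

# Hypothetical list: two named items and one without a name attribute.
text = '''
<ul>
<li class="li li-first" name="item"><a href="link1.html">first item</a></li>
<li class="li" name="other"><a href="link2.html">second item</a></li>
<li class="li"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# and: both conditions must hold
both = html.xpath('//li[contains(@class, "li-first") and @name="item"]/a/text()')
# or: either condition may hold
either = html.xpath('//li[@name="item" or @name="other"]/a/text()')
# not(): the node has no name attribute at all
unnamed = html.xpath('//li[not(@name)]/a/text()')

print(both)
print(either)
print(unnamed)
```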

13. Selecting by order

Sometimes a selection matches several nodes at once, but we only want one of them, such as the second or the last. How do we do that?

We can pass an index in square brackets to get the node at a specific position, for example:

from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

In the first selection, we pick the first li node by passing the number 1 in the square brackets. Note that, unlike in Python code, numbering starts at 1, not 0.

In the second selection, we pick the last li node by passing last() in the square brackets.

In the third selection, we pick the li nodes whose position is less than 3, that is, positions 1 and 2, so the result is the first two li nodes.

In the fourth selection, we pick the third-to-last li node by passing last()-2 in the square brackets. Since last() is the last node, last()-2 is the third from the end.

Output:

['first item']
['fifth item']
['first item', 'second item']
['third item']
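One detail worth knowing: a position such as [1] is counted among the siblings under each parent, not across the whole document. With a single ul this makes no difference, but with several lists //li[1] returns the first li of every list. Wrapping the path in parentheses applies the index to the combined result instead. A small sketch with two hypothetical lists:

```python
from lxml import etree

# Two hypothetical lists, to show how positions are counted.
text = '''
<div>
<ul><li>a1</li><li>a2</li></ul>
<ul><li>b1</li><li>b2</li></ul>
</div>
'''
html = etree.HTML(text)

per_parent = html.xpath('//li[1]/text()')           # first li under each ul
overall = html.xpath('(//li)[1]/text()')            # first li in the whole document
last_overall = html.xpath('(//li)[last()]/text()')  # last li in the whole document

print(per_parent)
print(overall)
print(last_overall)
```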

14. Node axes

XPath provides many node axes for selection, including child elements, siblings, parent elements, ancestor elements, and so on. For example:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)

Output:

[<Element html at 0x1aabef570c8>, <Element body at 0x1aabf20b588>, <Element div at 0x1aabf20b5c8>, <Element ul at 0x1aabf20b608>]
[<Element div at 0x1aabf20b5c8>]
['item-0']
[<Element a at 0x1aabf20b588>]
[<Element span at 0x1aabf20b5c8>]
[<Element a at 0x1aabf20b608>]
[<Element li at 0x1aabf20b948>, <Element li at 0x1aabf20b848>, <Element li at 0x1aabf20b548>, <Element li at 0x1aabf20b648>]

In the first selection, we use the ancestor axis to get all ancestor nodes. The axis name must be followed by two colons and then a node selector. Here we use *, which matches all nodes, so the result is all ancestors of the first li node: html, body, div, and ul.

In the second selection, we add a restriction by putting div after the colons, so the result contains only the div ancestor.

In the third selection, we use the attribute axis to get all attribute values. The selector is again *, which means all attributes of the node, so the return value is all attribute values of the li node.

In the fourth selection, we use the child axis to get direct child nodes. Here we add a restriction and select the a node whose href attribute is link1.html.

In the fifth selection, we use the descendant axis to get descendant nodes. Here we restrict the result to span nodes, so the result contains only the span node and not the a node.

In the sixth selection, we use the following axis to get all nodes after the current node. Although we match with *, we also add an index, so only the second following node is returned.

In the seventh selection, we use the following-sibling axis to get all siblings after the current node. We match with *, so all subsequent sibling nodes are returned.
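Because the printed Element objects do not show which node was matched, it can help to print tag names or attributes instead. The sketch below does this on a shortened, hypothetical three-item list, and also tries preceding-sibling, an XPath axis not covered in the notes above that works like following-sibling in the opposite direction.

```python
from lxml import etree

# Shortened, hypothetical three-item version of the example text.
text = '''
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# Tag names of all ancestors of the first li
ancestors = [e.tag for e in html.xpath('//li[1]/ancestor::*')]
# Tag names of every element after the first li, in document order
following = [e.tag for e in html.xpath('//li[1]/following::*')]
# The second following element is the a node inside the second li
second_following = html.xpath('//li[1]/following::*[2]')[0].get('href')
# Siblings that come before the third li
before_third = html.xpath('//li[3]/preceding-sibling::li/@class')

print(ancestors)
print(following)
print(second_following)
print(before_third)
```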

15. Conclusion

For more on XPath usage, see http://www.w3school.com.cn/xpath/index.asp .

For more on the Python lxml library, see http://lxml.de/ .
