Python 3 Web Crawler in Practice (28): Parsing Libraries: XPath

In the previous part we implemented a basic crawler, but we extracted page information with regular expressions. Regular expressions turn out to be fairly cumbersome to write, and a small mistake anywhere can cause the match to fail, so extracting page information with them is always somewhat inconvenient.

The nodes of a web page can define an id, a class, or other attributes, and the nodes are related to one another hierarchically. One or more nodes in a page can therefore be located with XPath or CSS selectors. So when parsing a page, could we use XPath or CSS selectors to locate a node and then call the appropriate method to get its text content or attributes, and in that way extract any information we want?

How do we do this in Python? Don't worry: there are many parsing libraries for this, and the more powerful ones include lxml, BeautifulSoup, and PyQuery. This chapter introduces these three parsing libraries. With them we no longer need to struggle with regular expressions, and parsing efficiency improves greatly, which makes them essential tools for crawlers.

Using XPath

XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well for HTML documents.

So when writing a crawler we can use XPath to extract information. This section introduces the basic usage of XPath.

1. XPath Overview

XPath's selection capabilities are very powerful. It provides concise path selection expressions and more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences. Almost any node we want to locate can be selected with XPath. (That function count refers to later versions of the standard; lxml implements XPath 1.0, whose core function library is smaller.)

XPath became a W3C standard on November 16, 1999. It was designed for use by XSLT, XPointer, and other XML parsing software. More documentation is available on the official website: https://www.w3.org/TR/xpath/

2. Common XPath rules

The table below lists a few common rules:

Expression   Description
nodename     Selects the child nodes with this name
/            Selects direct child nodes of the current node
//           Selects descendant nodes of the current node
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes

These are the commonly used XPath matching rules. For example, / selects direct child nodes, // selects all descendant nodes, . selects the current node, .. selects the parent of the current node, and @ adds an attribute constraint, selecting the nodes that match a specific attribute.

For example:

//title[@lang='eng']

This XPath rule selects all nodes named title whose lang attribute has the value eng.
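As a minimal sketch of that rule in action, the snippet below runs it with lxml against a small XML document. The bookstore document and the variable names are made up for illustration and are not part of the original example.

```python
from lxml import etree

# Hypothetical sample document, invented for this illustration only.
xml = b"""<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
</bookstore>"""

root = etree.fromstring(xml)

# Select every title node whose lang attribute equals "eng".
matches = root.xpath("//title[@lang='eng']")
titles = [node.text for node in matches]
print(titles)
```

Only the title with lang="eng" is selected; the French title is filtered out by the attribute condition.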

Later sections explain XPath usage in detail, parsing HTML with XPath through Python's lxml library.

3. Preparations

Before starting, make sure the lxml library is installed. If it is not, see the installation steps in the first chapter.

4. Introductory example

Let's use an example to get a feel for parsing a web page with XPath:

from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

Here we first import the etree module from the lxml library and declare a piece of HTML text. We then pass it to etree.HTML(), which builds an object that can be queried with XPath. Note that the last li node in the HTML text is not closed; the etree module corrects the HTML text automatically.

Calling tostring() outputs the corrected HTML code, but the result is of type bytes, so we use the decode() method to convert it to str. The result is as follows:

<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div>
</body></html>

We can see that after processing, the tag of the li node has been completed, and the body and html nodes have been added automatically.
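The repair behaviour can be checked on a smaller input. This is only a sketch: the fragment is invented for illustration, and it passes encoding='unicode' to etree.tostring() so that a str comes back directly, as an alternative to calling decode() on bytes.

```python
from lxml import etree

# Made-up fragment: the second <li> is never closed.
broken = '<ul><li>one</li><li>two</ul>'

root = etree.HTML(broken)

# encoding='unicode' makes tostring() return str instead of bytes.
serialized = etree.tostring(root, encoding='unicode')
li_count = len(root.xpath('//li'))

print(root.tag)
print(serialized)
print(li_count)
```

The root element is the html node that the parser added, and both li nodes are present and closed in the serialized output.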

We can also parse a text file directly, for example:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

The content of test.html is the HTML code from the example above:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>

This time the output is slightly different: it has an extra DOCTYPE declaration, but that does not affect parsing. The result is as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div></body></html>
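The file-based variant can be tried without an existing test.html. The sketch below writes a throwaway file to a temporary directory first; the file content and variable names are assumptions made for illustration. It also shows that etree.parse() returns a tree object whose root element is available through getroot().

```python
import os
import tempfile

from lxml import etree

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for test.html, written only for this demo.
    path = os.path.join(tmp, 'test.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<div><ul><li>one</li><li>two</ul></div>')

    # parse() reads the file; HTMLParser() tolerates the unclosed <li>.
    tree = etree.parse(path, etree.HTMLParser())
    root = tree.getroot()
    result = tree.xpath('//li/text()')

print(root.tag)
print(result)
```

Both the tree returned by etree.parse() and the element returned by etree.HTML() offer an xpath() method, so the examples in the following sections work with either.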

5. All nodes

We usually use XPath rules beginning with // to select all nodes that meet the requirements. Taking the HTML text above as an example, if we want to select all nodes, we can do it like this:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

Output:

[<Element html at 0x10510d9c8>, <Element body at 0x10510da08>, <Element div at 0x10510da48>, <Element ul at 0x10510da88>, <Element li at 0x10510dac8>, <Element a at 0x10510db48>, <Element li at 0x10510db88>, <Element a at 0x10510dbc8>, <Element li at 0x10510dc08>, <Element a at 0x10510db08>, <Element li at 0x10510dc48>, <Element a at 0x10510dc88>, <Element li at 0x10510dcc8>, <Element a at 0x10510dd08>]

Here * matches all nodes, so every node in the HTML text is retrieved. The return value is a list in which each element is of type Element, followed by the node name, such as html, body, div, ul, li, a, and so on. All nodes are included in the list.
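Because memory addresses in the printed list are hard to read, a quick way to inspect the result is to collect each element's tag. The following sketch parses the same sample from a string rather than from test.html, which is an assumption made so that it runs on its own.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

# Every element in the document, reduced to its tag name.
tags = [element.tag for element in html.xpath('//*')]
print(len(tags))
print(tags)
```

The list has 14 entries: html, body, div, ul, then five li nodes each followed by its a node.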

The match can also specify a node name. For example, to get all li nodes:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])

To select all li nodes, we use // followed directly by the node name, and extract them by calling the xpath() method.

Output:

[<Element li at 0x105849208>, <Element li at 0x105849248>, <Element li at 0x105849288>, <Element li at 0x1058492c8>, <Element li at 0x105849308>]
<Element li at 0x105849208>

The extraction result is a list whose elements are Element objects. To take out one of them, index the list directly with square brackets, such as [0].
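Each Element in that list can be inspected further. As a rough sketch (again parsing the sample from a string, an assumption made so the snippet is self-contained), the loop below reads the tag and the class attribute of every li node and runs a relative XPath from each one.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

items = html.xpath('//li')
for li in items:
    # tag is the node name, get() reads an attribute,
    # and ./a/text() is evaluated relative to this li node.
    print(li.tag, li.get('class'), li.xpath('./a/text()'))

classes = [li.get('class') for li in items]
first_text = items[0].xpath('./a/text()')
```

Starting an expression with ./ keeps the query scoped to the element it is called on, which is convenient when processing list items one by one.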

6. Child nodes

We can use / or // to find the child nodes or descendant nodes of an element. Suppose we want to select all direct child a nodes of the li nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

Appending /a selects all direct child a nodes of all li nodes: //li selects all li nodes, and /a selects all direct child a nodes of those li nodes. Combined, the two get all direct child a nodes of all li nodes.

Output:

[<Element a at 0x106ee8688>, <Element a at 0x106ee86c8>, <Element a at 0x106ee8708>, <Element a at 0x106ee8748>, <Element a at 0x106ee8788>]

The / here selects direct child nodes. To get all descendant nodes, use // instead. For example, to get all descendant a nodes under the ul node:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)

The output is the same.

But with //ul/a we get no results, because / selects direct child nodes, and the ul node has no direct child a nodes, only li nodes, so nothing matches. The code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul/a')
print(result)

Output:

[]

So note the difference between / and //: / gets direct child nodes, while // gets all descendant nodes.
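The three expressions from this section can be compared side by side. This sketch parses the sample from a string (an assumption, so that it runs without test.html) and counts the matches of each expression.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

child_of_li = html.xpath('//li/a')        # a is a direct child of li
descendant_of_ul = html.xpath('//ul//a')  # a is a descendant of ul
child_of_ul = html.xpath('//ul/a')        # a is not a direct child of ul

print(len(child_of_li), len(descendant_of_ul), len(child_of_ul))
```

The first two expressions find all five a nodes, while //ul/a finds none.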

7. Parent node

We know that child or descendant nodes can be found with successive / or //. So if we know a child node, how do we find its parent? We can use .. to get the parent node.

For example, we first select the a node whose href attribute is link4.html, then get its parent node, and then get the parent's class attribute. The code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

Output:

['item-1']

Checking the result, this is exactly the class of the target li node, so the parent node was obtained successfully.

We can also get the parent node with the parent:: axis. The code is as follows:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
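Both spellings should give the same answer. The sketch below checks that, and also shows getparent(), an lxml Element method (not an XPath feature) that reaches the parent from Python code. The sample is parsed from a string here, which is an assumption made to keep the snippet self-contained.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

with_dots = html.xpath('//a[@href="link4.html"]/../@class')
with_axis = html.xpath('//a[@href="link4.html"]/parent::*/@class')

# Alternative outside XPath: ask the Element for its parent directly.
link = html.xpath('//a[@href="link4.html"]')[0]
parent_class = link.getparent().get('class')

print(with_dots, with_axis, parent_class)
```

All three routes lead to the class of the same li node, item-1.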

8. Attribute matching

We can also use the @ symbol to filter by attribute when selecting. For example, to select the li nodes whose class is item-0:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

Adding [@class="item-0"] restricts the class attribute of the node to item-0. Two li nodes in the HTML text meet this condition, so two matches should be returned. The result is as follows:

[<Element li at 0x10a399288>, <Element li at 0x10a3992c8>]

Exactly two nodes match. Whether they are the correct two we will verify later.

9. Getting text

We use the text() method in XPath to get the text in a node. Let's try to get the text of the li nodes above:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/text()')
print(result)

Results are as follows:

['\n     ']

Strangely, we did not get any text, only a line break. Why? In the XPath expression, text() is preceded by /, which selects direct child nodes, but the direct child of each li node is an a node, and the text is inside the a node. So what is matched here is the line break inside the corrected li node, because the automatically added closing tag of that li node was placed on a new line.

That is, the two selected nodes are:

<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li>

Because of the automatic correction, the closing tag of the second li node was added on a new line, so the only text node directly inside a li node is the line break between the closing tag of the a node and the closing tag of the li node.

So there are two ways to get the text inside the li nodes: one is to select the a node first and then get its text, and the other is to use //. Let's look at the difference between the two.

First, we select the a node and then get its text:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

Output:

['first item', 'fifth item']

The return value has two items, the text of the li nodes whose class attribute is item-0, which also confirms that the attribute match above was correct.

Here we select layer by layer: first the li nodes, then their direct child a nodes with /, and then the text. The two results are exactly what we expected.

Now let's look at the result of selecting with //:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]//text()')
print(result)

Output:

['first item', 'fifth item', '\n     ']

As expected, there are three results. Here the text of all descendant nodes is selected: the first two are the text inside the a child nodes of the li nodes, and the third is the text directly inside the last li node, that is, the line break.

So to get all text inside descendant nodes, use // followed by text(). This gets the most complete text, but it may include line breaks and other special characters. To get the text of one particular descendant node, select that node first and then call text() on it, which keeps the result tidy.
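When //text() is the right tool but the whitespace is unwanted, one option is to filter the result in Python afterwards. This is a sketch of that idea, not something prescribed by the original text; the sample is parsed from a string so that the snippet runs on its own.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

direct_text = html.xpath('//li[@class="item-0"]/text()')
link_text = html.xpath('//li[@class="item-0"]/a/text()')
all_text = html.xpath('//li[@class="item-0"]//text()')

# Drop entries that consist only of whitespace such as line breaks.
cleaned = [piece.strip() for piece in all_text if piece.strip()]

print(direct_text)
print(link_text)
print(all_text)
print(cleaned)
```

After filtering, the //text() result matches the layer-by-layer result, so the choice between the two mostly depends on how deeply the wanted text is nested.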

10. Getting attributes

We know that text() gets the text inside a node, so how do we get a node's attributes? With the @ symbol again. For example, to get the href attribute of all a nodes under all li nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

Here @href gets the href attribute of the nodes. Note the difference from attribute matching: attribute matching uses square brackets with the attribute name and value to constrain a node, such as [@href="link1.html"], while @href here gets an attribute of the node. The two must be distinguished.

Output:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

We successfully got the href attributes of all a nodes under the li nodes, returned as a list.
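In a crawler, extracted href values are usually relative and need to be resolved before they can be requested. A minimal sketch using urljoin from the standard library follows; the base URL is a hypothetical placeholder, and the sample is parsed from a string for self-containment.

```python
from urllib.parse import urljoin

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

# Hypothetical page address, assumed only for this illustration.
base_url = 'https://example.com/list/'

hrefs = html.xpath('//li/a/@href')
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```

Each relative href is resolved against the page it was found on, giving URLs that can be queued for the next request.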

11. Matching multi-valued attributes

Sometimes an attribute of a node has multiple values, as in the following example:

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)

Here the class attribute of the li node in the HTML text has two values, li and li-first. If we still use the earlier attribute matching, nothing matches. The result is:

[]

When an attribute has multiple values, we need the contains() function. The code can be rewritten as follows:

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

With the contains() function, the first argument is the attribute name and the second is the attribute value. As long as the attribute contains the given value, the match succeeds.

Output:

['first item']

This approach is often used when an attribute of a node has multiple values; the class attribute of a node, for example, often has several.
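One caveat worth knowing: contains() is a plain substring test, so "li" also matches a class such as "list-item". A common XPath 1.0 workaround, sketched below on made-up markup, pads the normalized class value with spaces and searches for the padded token. This goes slightly beyond what the text above covers and is offered only as an illustration.

```python
from lxml import etree

# Made-up markup: the second class merely starts with the letters "li".
text = '''
<ul>
    <li class="li li-first"><a href="link.html">first item</a></li>
    <li class="list-item"><a href="other.html">other item</a></li>
</ul>
'''
html = etree.HTML(text)

# Substring test: matches both li nodes.
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Whole-token test: surround the class list and the token with spaces.
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```

The stricter expression matches only nodes that really carry li as one of their class values.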

12. Matching multiple attributes

We may also need to identify a node by several attributes at once. In that case we match multiple attributes simultaneously, joined with the and operator, for example:

from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

Here the li node in the HTML text has an additional attribute, name. We need to select by both class and name, so we join the two conditions with the and operator, with both conditions enclosed in the square brackets. The result is as follows:

['first item']

The and here is an XPath operator. There are many other operators, such as or and mod, summarized below:

Operator  Description               Example                    Return value
or        or                        price=9.80 or price=9.70   true if the price is 9.80; false if the price is 9.50
and       and                       price>9.00 and price<9.90  true if the price is 9.80; false if the price is 8.50
mod       remainder of a division   5 mod 2                    1
|         union of two node sets    //book | //cd              a node set containing all book and cd elements
+         addition                  6 + 4                      10
-         subtraction               6 - 4                      2
*         multiplication            6 * 4                      24
div       division                  8 div 4                    2
=         equal to                  price=9.80                 true if the price is 9.80; false if the price is 9.90
!=        not equal to              price!=9.80                true if the price is 9.90; false if the price is 9.80
<         less than                 price<9.80                 true if the price is 9.00; false if the price is 9.90
<=        less than or equal to     price<=9.80                true if the price is 9.00; false if the price is 9.90
>         greater than              price>9.80                 true if the price is 9.90; false if the price is 9.80
>=        greater than or equal to  price>=9.80                true if the price is 9.90; false if the price is 9.70

Source of this table: http://www.w3school.com.cn/xp...
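A few of these operators can be tried directly against the sample list used earlier. The sketch below is illustrative only; the choice of expressions is an assumption, and the sample is parsed from a string so that the snippet is self-contained. Note that lxml returns numeric XPath results as Python floats.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

# or: either condition may hold.
either = html.xpath('//li[@class="item-1" or @class="item-inactive"]/a/text()')

# |: union of two node sets.
union = html.xpath('//li[1]/a | //li[last()]/a')
union_hrefs = sorted(a.get('href') for a in union)

# mod: keep the li nodes at odd positions.
odd = html.xpath('//li[position() mod 2 = 1]/a/text()')

# Arithmetic expressions evaluate to floats.
product = html.xpath('6 * 4')
quotient = html.xpath('8 div 4')

print(either)
print(union_hrefs)
print(odd)
print(product, quotient)
```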

13. Selecting by order

Sometimes the attributes we select on match several nodes at once, but we want only one of them, such as the second or the last. What then?

In this case we can pass an index in the square brackets to get the node at a specific position, for example:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

In the first selection we pick the first li node by passing the number 1 in the square brackets. Note that, unlike in code, numbering starts at 1, not 0.

In the second selection we pick the last li node by passing last() in the square brackets.

In the third selection we pick the li nodes whose position is less than 3, that is, the nodes at positions 1 and 2, so we get the first two li nodes.

In the fourth selection we pick the third-to-last li node by passing last()-2 in the square brackets: last() is the last node, so last()-2 is the third from the last.

Results are as follows:

['first item']
['fifth item']
['first item', 'second item']
['third item']

Here we used functions such as last() and position(). XPath provides more than 100 functions, covering access, numeric, string, logical, node, and sequence processing. For details on what each function does, see: http://www.w3school.com.cn/xp...
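One subtlety of positional predicates: //li[1] means "every li that is the first li child of its parent", not "the first li in the document". With a single list the two coincide, but with several lists they differ. The sketch below uses made-up markup with two lists to show the difference and the parenthesized form that indexes the whole result.

```python
from lxml import etree

# Made-up markup with two separate lists.
text = '''
<div>
    <ul><li>a1</li><li>a2</li></ul>
    <ul><li>b1</li><li>b2</li></ul>
</div>
'''
html = etree.HTML(text)

# First li within each parent: one result per list.
first_in_each_list = html.xpath('//li[1]/text()')

# Parentheses apply the index to the combined node set instead.
first_overall = html.xpath('(//li)[1]/text()')
last_overall = html.xpath('(//li)[last()]/text()')

print(first_in_each_list)
print(first_overall)
print(last_overall)
```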

14. Node axis selection

XPath provides many node axis selection methods, known as XPath axes, including getting child elements, siblings, parent elements, ancestor elements, and so on. In certain situations they make node selection convenient. Let's get a feel for them with an example:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)

Output:

[<Element html at 0x107941808>, <Element body at 0x1079418c8>, <Element div at 0x107941908>, <Element ul at 0x107941948>]
[<Element div at 0x107941908>]
['item-0']
[<Element a at 0x1079418c8>]
[<Element span at 0x107941948>]
[<Element a at 0x1079418c8>]
[<Element li at 0x107941948>, <Element li at 0x107941988>, <Element li at 0x1079419c8>, <Element li at 0x107941a08>]

In the first selection we use the ancestor axis, which gets all ancestor nodes. The axis name must be followed by two colons and then the node selector. Here we use *, which matches all nodes, so the result is all ancestors of the first li node: html, body, div, and ul.

In the second selection we add a constraint, putting div after the two colons, so the result contains only the div ancestor.

In the third selection we use the attribute axis, which gets all attribute values. It is followed by the selector *, meaning all attributes of the node, so the return value is all attribute values of the li node.

In the fourth selection we use the child axis, which gets all direct child nodes. Here we add a constraint to select the a node whose href attribute is link1.html.

In the fifth selection we use the descendant axis, which gets all descendant nodes. Here we add a constraint to get only span nodes, so the result contains only the span node and not the a node.

In the sixth selection we use the following axis, which gets all nodes after the current node. Although we match with *, we also add an index, so we get only the second subsequent node.

In the seventh selection we use the following-sibling axis, which gets all sibling nodes after the current node. We match with *, so we get all subsequent siblings.

These are simple uses of XPath axes.
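The axes work in the other direction too. As a further sketch, not taken from the example above, the snippet below uses preceding-sibling and following-sibling with an index. On an axis the index counts outward from the current node, so [1] is the nearest sibling in that direction.

```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)

# All li siblings that come before the third li.
before = html.xpath('//li[3]/preceding-sibling::li')

# Nearest sibling before and after the third li.
previous_text = html.xpath('//li[3]/preceding-sibling::li[1]/a/text()')
next_text = html.xpath('//li[3]/following-sibling::li[1]/a/text()')

# Walk up from a link to the li that contains it.
owner_class = html.xpath('//a[@href="link3.html"]/ancestor::li/@class')

print(len(before), previous_text, next_text, owner_class)
```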

15. Conclusion

We have now covered most of the XPath selectors you are likely to use. XPath is very powerful and has many built-in functions; used skillfully, it can greatly improve the efficiency of extracting information from HTML.

Origin blog.51cto.com/14445003/2426467