Python web crawler study notes (8): using XPath

Using XPath

XPath, short for XML Path Language, is a language for locating nodes in XML documents. It works just as well on HTML documents.

1. XPath common rules

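The most commonly used XPath matching rules are:

nodename    selects nodes with the given name
/           selects direct child nodes of the current node
//          selects descendant nodes of the current node
.           selects the current node
..          selects the parent of the current node
@           selects attributes

For example: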

//title[@lang='eng']

This XPath rule selects every node named title whose lang attribute has the value eng.
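To see the rule in action, here is a minimal sketch that runs it against a small XML document. The bookstore document and the variable names are made up for illustration; only the XPath rule itself comes from the text above.

```python
from lxml import etree

# Hypothetical XML document, invented for this illustration.
xml = """<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="fra">Le Petit Prince</title></book>
  <book><title lang="eng">Learning XML</title></book>
</bookstore>"""

root = etree.fromstring(xml)

# Select every title element whose lang attribute equals "eng".
titles = root.xpath("//title[@lang='eng']")
eng_titles = [title.text for title in titles]
print(eng_titles)  # ['Harry Potter', 'Learning XML']
```

The French title is skipped because its lang attribute does not equal eng.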

Later, we will use XPath to parse HTML through Python's lxml library.

Let's look at an example:

from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

We first import the etree module from the lxml library, declare a piece of HTML text, and pass it to etree.HTML(). This parses the text and returns the root element of the tree, on which XPath queries can then be run. Note that the last li node in the HTML text is not closed; the parser corrects this automatically.

We then call tostring() to serialize the corrected HTML. The result is of type bytes, so we use decode() to convert it to str. The output is as follows:

<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div>
</body></html>

As you can see, the parser has added the missing closing tag of the last li node and has automatically added the html and body nodes.

You can also parse an HTML file directly, for example:

from lxml import etree
 
html = etree.parse('test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>&#13;
    <ul>&#13;
         <li class="item-0"><a href="link1.html">first item</a></li>&#13;
         <li class="item-1"><a href="link2.html">second item</a></li>&#13;
         <li class="item-inactive"><a href="link3.html">third item</a></li>&#13;
         <li class="item-1"><a href="link4.html">fourth item</a></li>&#13;
         <li class="item-0"><a href="link5.html">fifth item</a>&#13;
     </li></ul>&#13;
 </div></body></html>

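This time the output is slightly different: it contains an additional DOCTYPE declaration, which has no effect on parsing. The &#13; entities are carriage returns, most likely because test.html was saved with Windows-style (\r\n) line endings.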
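The file test.html itself is not shown in these notes. The sketch below assumes it holds the same markup as the string example, writes it to a temporary directory (a location chosen only for this illustration), and parses it the same way:

```python
import tempfile
from pathlib import Path

from lxml import etree

# Assumed content of test.html: the same markup as the string example,
# with the last li deliberately left unclosed.
SAMPLE_HTML = """<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "test.html"
    path.write_text(SAMPLE_HTML, encoding="utf-8")

    # etree.parse() returns an ElementTree; HTMLParser tolerates broken HTML.
    tree = etree.parse(str(path), etree.HTMLParser())
    li_count = len(tree.xpath("//li"))
    root_tag = tree.getroot().tag

print(li_count, root_tag)  # 5 html
```

To follow along with the remaining examples, write the same content to a test.html file next to your script instead of a temporary directory.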

2. All nodes

XPath rules that begin with // select all matching nodes anywhere in the document, whatever their position. Taking the previous HTML text as an example, all nodes can be selected like this:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

[<Element html at 0x1f6c0205648>, <Element body at 0x1f6c0019bc8>, <Element div at 0x1f6c0019c88>, <Element ul at 0x1f6c0205348>, <Element li at 0x1f6c0205688>, <Element a at 0x1f6c0205708>, <Element li at 0x1f6c0205748>, <Element a at 0x1f6c0205788>, <Element li at 0x1f6c02057c8>, <Element a at 0x1f6c02056c8>, <Element li at 0x1f6c0205808>, <Element a at 0x1f6c0205848>, <Element li at 0x1f6c0205888>, <Element a at 0x1f6c02058c8>]

Here * matches nodes of any name, so every element node in the HTML text is returned. The return value is a list in which each item is an Element object, shown with the name of its node: html, body, div, ul, li, a, and so on.

A node name can also be given instead of *. To get all li nodes, for example:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])
[<Element li at 0x1f6c01e6f88>, <Element li at 0x1f6c0205cc8>, <Element li at 0x1f6c0205d08>, <Element li at 0x1f6c0205d48>, <Element li at 0x1f6c0205d88>]
<Element li at 0x1f6c01e6f88>

As you can see, the result is again a list in which each item is an Element object. To take out a single object, index the list with square brackets, for example [0].
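Printing an Element only shows its name and memory address. As a small sketch (the two-item list below is invented for illustration), the tag, attributes, and text of each Element can be read like this:

```python
from lxml import etree

# Small made-up fragment; etree.HTML() wraps it in html and body nodes.
html = etree.HTML('<ul><li class="item-0">first</li><li class="item-1">second</li></ul>')

items = html.xpath('//li')   # a list of Element objects
first = items[0]             # ordinary Python list indexing, starting at 0

# tag, get() and text expose the node name, an attribute value and the text.
first_info = (first.tag, first.get('class'), first.text)
print(first_info)  # ('li', 'item-0', 'first')

# Node names of everything matched by //*, in document order.
all_tags = [el.tag for el in html.xpath('//*')]
print(all_tags)  # ['html', 'body', 'ul', 'li', 'li']
```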

3. Child node

Child nodes and descendant nodes of an element are selected with / and // respectively. To select all a nodes that are direct children of li nodes, you can do this:

from lxml import etree
 
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)
[<Element a at 0x1f6c01e6fc8>, <Element a at 0x1f6c0205588>, <Element a at 0x1f6c0205e08>, <Element a at 0x1f6c0205e48>, <Element a at 0x1f6c0205e88>]

/ selects direct child nodes only. To get all descendant nodes, use //. For example, to get all a nodes that are descendants of the ul node, you can do this:

from lxml import etree
 
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)
[<Element a at 0x1f6c02058c8>, <Element a at 0x1f6c0205908>, <Element a at 0x1f6c02070c8>, <Element a at 0x1f6c0207108>, <Element a at 0x1f6c0207148>]

If //ul/a is used here instead, nothing is returned. Because / selects only direct child nodes, and the ul node has no direct a child nodes (only li nodes), there is nothing to match. The code is as follows:

from lxml import etree
 
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul/a')
print(result)
[]
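The difference between / and // is easiest to see side by side. The sketch below uses a made-up list in which one a node sits directly under its li and another is nested one level deeper inside a span:

```python
from lxml import etree

# Hypothetical markup: the second a node is wrapped in a span.
html = etree.HTML('''
<ul>
    <li><a href="link1.html">first item</a></li>
    <li><span><a href="link2.html">second item</a></span></li>
</ul>
''')

direct_under_ul = html.xpath('//ul/a')   # a must be a direct child of ul
direct_under_li = html.xpath('//li/a')   # a must be a direct child of li
any_depth = html.xpath('//ul//a')        # a at any depth below ul

counts = (len(direct_under_ul), len(direct_under_li), len(any_depth))
print(counts)  # (0, 1, 2)
```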

4. Parent node

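The parent node is selected with .. in the path. For example, to select the a node whose href attribute is link4.html, move up to its parent node, and get the parent's class attribute: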

from lxml import etree
html = etree.parse('test.html', etree.HTMLParser())
result = html.xpath('//a[@href = "link4.html"]/../@class')
print(result)
['item-1']

The parent node can also be selected with the parent:: axis:

from lxml import etree
html = etree.parse('test.html', etree.HTMLParser())
result = html.xpath('//a[@href = "link4.html"]/parent::*/@class')
print(result)
['item-1']
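Both spellings return the same node. As a rough comparison on a one-item fragment (made up for illustration), the sketch below also uses getparent(), which is part of lxml's Element API rather than XPath, for the case where you already hold the child Element in Python:

```python
from lxml import etree

html = etree.HTML('<ul><li class="item-1"><a href="link4.html">fourth item</a></li></ul>')

# Two equivalent XPath spellings of "go to the parent, then read @class".
via_dots = html.xpath('//a[@href="link4.html"]/../@class')
via_axis = html.xpath('//a[@href="link4.html"]/parent::*/@class')

# The same step done from Python on an Element that was already selected.
link = html.xpath('//a[@href="link4.html"]')[0]
via_api = link.getparent().get('class')

print(via_dots, via_axis, via_api)  # ['item-1'] ['item-1'] item-1
```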

5. Attribute matching

The @ symbol is used to filter by attribute. For example, to select the li nodes whose class is item-0:

from lxml import etree

html = etree.parse('test.html', etree.HTMLParser())

result = html.xpath('//li[@class = "item-0"]')
print(result)
[<Element li at 0x1f6c038cd08>, <Element li at 0x1f6c038c788>]
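A condition in square brackets can test for more than an exact value. The sketch below, on a made-up three-item list, compares an exact match with a test for the mere presence or absence of the attribute:

```python
from lxml import etree

# Hypothetical list: the third li has no class attribute at all.
html = etree.HTML('''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li><a href="link3.html">third item</a></li>
</ul>
''')

exact = html.xpath('//li[@class="item-0"]')   # class equals item-0
has_class = html.xpath('//li[@class]')        # class attribute is present
no_class = html.xpath('//li[not(@class)]')    # class attribute is absent

counts = (len(exact), len(has_class), len(no_class))
print(counts)  # (1, 2, 1)
```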

6. Text acquisition

The text() node test in XPath gets the text inside a node. Here it gets the text of the a nodes under the matching li nodes:

from lxml import etree

html = etree.parse('test.html', etree.HTMLParser())

result = html.xpath('//li[@class = "item-0"]/a/text()')
print(result)
['first item', 'fifth item']

Now let's look at the result of selecting with // instead:

from lxml import etree

html = etree.parse('test.html', etree.HTMLParser())

result = html.xpath('//li[@class = "item-0"]//text()')
print(result)
['first item', 'fifth item', '\r\n     ']

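Here // selects the text of all descendant nodes. The first two results are the text inside the a nodes under the two matching li nodes. The third is the text directly inside the last li node, namely the line break and indentation that follow its a node, because the parser closed that li only just before the closing ul tag.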
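If the whitespace-only entries get in the way, they can be filtered out in Python or collapsed inside XPath. This is a minimal sketch on a made-up one-item list, where the closing li tag sits on its own line so that the li contains trailing whitespace:

```python
from lxml import etree

# The closing </li> is on its own line, so the li holds whitespace after the a node.
html = etree.HTML('''
<ul>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </li>
</ul>
''')

# All text nodes below the li, including any whitespace-only ones.
raw = html.xpath('//li[@class="item-0"]//text()')

# Option 1: drop whitespace-only strings in Python.
cleaned = [t.strip() for t in raw if t.strip()]

# Option 2: let XPath trim and collapse whitespace; this returns one string.
normalized = html.xpath('normalize-space(//li[@class="item-0"])')

print(cleaned)     # ['fifth item']
print(normalized)  # fifth item
```

Note that normalize-space() only looks at the first matching node, so it suits cases where a single node is expected.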

7. Attribute acquisition

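The @ symbol is also used to get attribute values. Unlike attribute matching, which puts a condition such as [@class="item-0"] in square brackets, here @href is a path step of its own. For example, to get the href attribute of every a node under the li nodes: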

from lxml import etree
 
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
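In a crawler you usually want the link and its text together. One way to do that, sketched on a made-up two-item list, is to select the a nodes and read both values from each Element:

```python
from lxml import etree

html = etree.HTML('''
<ul>
    <li><a href="link1.html">first item</a></li>
    <li><a href="link2.html">second item</a></li>
</ul>
''')

# Attribute values only, as a list of strings.
hrefs = html.xpath('//li/a/@href')

# Select the a nodes once, then read the attribute and the text from each.
links = [(a.get('href'), a.text) for a in html.xpath('//li/a')]

print(hrefs)  # ['link1.html', 'link2.html']
print(links)  # [('link1.html', 'first item'), ('link2.html', 'second item')]
```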

An attribute may hold several values. In the HTML below, the class attribute of the li node has two values, li and li-first. The attribute matching used earlier compares against the whole attribute value, so it no longer matches:

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
[]

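In this case, use the contains() function. Its first argument is the attribute and its second is the string to look for; the node matches if the attribute value contains that string: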

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
['first item']
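One caveat: contains() is a plain substring test, so "li" also matches a class such as list-footer. A common XPath 1.0 idiom for matching a whole class name pads the attribute with spaces first. The sketch below compares the two on a made-up list; the list-footer class is invented for illustration:

```python
from lxml import etree

# Hypothetical markup: the second class merely starts with the letters "li".
html = etree.HTML('''
<ul>
    <li class="li li-first"><a href="link.html">first item</a></li>
    <li class="list-footer"><a href="more.html">more</a></li>
</ul>
''')

# Substring test: matches both li nodes.
substring = html.xpath('//li[contains(@class, "li")]/a/text()')

# Whole-name test: surround the class list with spaces and look for " li ".
whole_name = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(substring)   # ['first item', 'more']
print(whole_name)  # ['first item']
```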

To match on several attributes at once, join the conditions with and:

from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
['first item']

XPath also provides many other operators: or, and, mod, div, the union operator |, the arithmetic operators +, - and *, and the comparison operators =, !=, <, <=, > and >=.
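A few of these operators in use, as a small sketch on the same five-item list as before (with the last li closed here for brevity):

```python
from lxml import etree

html = etree.HTML('''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
''')

# or: either condition may hold.
either = html.xpath('//li[@class="item-1" or @class="item-inactive"]/a/text()')

# mod: keep the li nodes at odd positions (1, 3, 5).
odd = html.xpath('//li[position() mod 2 = 1]/a/text()')

# |: union of two node sets.
ends = html.xpath('//li[1]/a/text() | //li[last()]/a/text()')

print(either)  # ['second item', 'third item', 'fourth item']
print(odd)     # ['first item', 'third item', 'fifth item']
print(ends)    # ['first item', 'fifth item']
```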

8. Select in order

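Sometimes several nodes match but only some of them are wanted. A number or expression in square brackets selects by position, counting from 1 rather than 0. In the example below, [1] selects the first li node, [last()] selects the last one, [position()<3] selects the li nodes at positions 1 and 2, and [last()-2] selects the third li node from the end: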

from lxml import etree
 
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)
['first item']
['fifth item']
['first item', 'second item']
['third item']
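One thing to watch for: a position in square brackets counts among the siblings under the same parent, not across the whole document. With only one ul, as above, the two are the same. The sketch below uses a made-up page with two lists to show the difference, and how parentheses select by position over the whole result:

```python
from lxml import etree

# Hypothetical page with two separate lists.
html = etree.HTML('''
<div>
    <ul><li>a1</li><li>a2</li></ul>
    <ul><li>b1</li><li>b2</li></ul>
</div>
''')

# The first li under each parent: one result per ul.
per_parent = html.xpath('//li[1]/text()')

# Parentheses collect all li nodes first, then pick by position.
overall_first = html.xpath('(//li)[1]/text()')
overall_last = html.xpath('(//li)[last()]/text()')

print(per_parent)     # ['a1', 'b1']
print(overall_first)  # ['a1']
print(overall_last)   # ['b2']
```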


Origin blog.csdn.net/qq_43328040/article/details/108808251