Python3 [Parsing library: XPath]

1. Introduction to XPath

XPath is a powerful way to work with the hierarchical structure of web pages: it provides simple, clear path-selection expressions.

In addition, it provides more than 100 built-in functions for matching strings, numbers, and times, and for processing nodes and sequences.

Almost any node you need to locate can be selected with XPath.

Official website: https://www.w3.org/TR/xpath

 

1. XPath common rules:
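
The most frequently used expressions are a node name, "/", "//", ".", ".." and "@". A minimal sketch of each, assuming lxml is installed; the sample markup and variable names are made up for illustration:

```python
from lxml import etree

# A tiny, made-up document used only for illustration
doc = etree.HTML('<div><ul><li class="one"><a href="link1">1</a></li></ul></div>')

# nodename after //: every <li> anywhere in the document
li_nodes = doc.xpath('//li')

# /  : direct children (here, <a> directly under <li>)
a_nodes = doc.xpath('//li/a')

# @  : an attribute of the selected nodes
hrefs = doc.xpath('//li/a/@href')

# .. : the parent of the selected nodes
parent_tags = [node.tag for node in doc.xpath('//a/..')]

# .  : the current node, evaluated relative to an element
same_node = a_nodes[0].xpath('.')

print(len(li_nodes), len(a_nodes), hrefs, parent_tags, same_node[0].tag)
```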

                             

 

2. Basic use

from lxml import etree

text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>

'''
# Convert the text into an HTML document, repairing and completing the markup
html = etree.HTML(text)

# Alternatively, parse a file by passing its path:
# html = etree.parse('demo.html', etree.HTMLParser())

# Serialize the completed page structure; the result is bytes
result = etree.tostring(html)

# Convert to str
result = result.decode('utf-8')

print(result)
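
The last li in the sample text has no closing tag, and the parser fills it in. A small sketch that checks this on a shortened, made-up fragment; it uses etree.tostring with encoding='unicode', which returns a str directly as an alternative to decoding the bytes:

```python
from lxml import etree

# Shortened fragment: the second <li> is deliberately left unclosed
fragment = ('<ul><li class="four"><a href="link4">4</a></li>'
            '<li class="five"><a href="link5">5</a></ul>')

root = etree.HTML(fragment)

# encoding='unicode' makes tostring return str instead of bytes
repaired = etree.tostring(root, encoding='unicode')

# Count closing </li> tags before and after parsing
print(fragment.count('</li>'), repaired.count('</li>'))

# The parser also wraps the fragment in <html> and <body>
print(root.tag)
```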

1. Match selection (all nodes)

from lxml import etree


text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>

'''
# Convert the text into an HTML document, repairing and completing the markup
html = etree.HTML(text)

# '//*' selects every node in the document
result = html.xpath('//*')
print(result)
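
The result of '//*' is a list of element objects, so it is often easier to read as a list of tag names. The sketch below, on a made-up two-item list, also shows that replacing "*" with a node name narrows the match to that element type:

```python
from lxml import etree

doc = etree.HTML('<div><ul><li><a>1</a></li><li><a>2</a></li></ul></div>')

# Every element in the document, in document order
all_tags = [node.tag for node in doc.xpath('//*')]

# Only the <li> elements
li_tags = [node.tag for node in doc.xpath('//li')]

print(all_tags)
print(li_tags)
```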

 

 

 

2. Child nodes

from lxml import etree


text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>

'''
# Convert the text into an HTML document, repairing and completing the markup
html = etree.HTML(text)

# Select every <a> that is a direct child of an <li>
result = html.xpath('//li/a')
print(result)

Here "/" selects direct child nodes, while "//" selects all descendant nodes.
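
To make the difference concrete, here is a small sketch on made-up markup. The a elements are grandchildren of ul, so "/" finds nothing while "//" finds them all:

```python
from lxml import etree

doc = etree.HTML(
    '<ul><li><a href="link1">1</a></li><li><a href="link2">2</a></li></ul>'
)

# <a> is not a direct child of <ul>, so this matches nothing
direct = doc.xpath('//ul/a')

# <a> is a descendant of <ul>, so this matches both links
descendants = doc.xpath('//ul//a')

print(len(direct), len(descendants))
```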

 

 

 3. Parent node

Parent node: use "..", or use the parent:: axis, to select the parent.

from lxml import etree


text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>

'''
# Convert the text into an HTML document, repairing and completing the markup
html = etree.HTML(text)

# Get the class attribute of the parent of the <a> tag whose href is "link4"
# (@ denotes an attribute)
result = html.xpath('//a[@href="link4"]/../@class')

# Equivalent form using the parent:: axis
result1 = html.xpath('//a[@href="link4"]/parent::*/@class')

print(result)
print(result1)
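
The "*" after parent:: can be replaced by a node name to require a particular parent element. A hedged sketch on a one-item, made-up list; the last lines use lxml's getparent() method, which reaches the same element from Python rather than from XPath:

```python
from lxml import etree

doc = etree.HTML('<ul><li class="four"><a href="link4">4</a></li></ul>')

# The parent of the link is an <li>, so this matches
as_li = doc.xpath('//a[@href="link4"]/parent::li/@class')

# The parent is not a <ul>, so this matches nothing
as_ul = doc.xpath('//a[@href="link4"]/parent::ul/@class')

# Same parent, reached through the element API instead of XPath
link = doc.xpath('//a[@href="link4"]')[0]
via_api = link.getparent().get('class')

print(as_li, as_ul, via_api)
```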

 

 

 

4. Text Acquisition

from lxml import etree


text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>

'''
# Convert the text into an HTML document, repairing and completing the markup
html = etree.HTML(text)

# Get the text of the <a> tag whose href is "link4"
result = html.xpath('//a[@href="link4"]/text()')

print(result)
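
Note that text() only returns text held directly by the selected node. A minimal sketch on made-up markup, contrasting "/text()" with "//text()" and showing that attribute values are fetched with "@" in the same way:

```python
from lxml import etree

doc = etree.HTML(
    '<ul><li class="one"><a href="link1">1</a></li>'
    '<li class="two">plain <a href="link2">2</a></li></ul>'
)

# Only text that sits directly inside <li>; the link text is excluded
direct_text = doc.xpath('//li/text()')

# Text of <li> and of everything nested inside it
all_text = doc.xpath('//li//text()')

# Attribute values are selected with @ instead of text()
hrefs = doc.xpath('//li/a/@href')

print(direct_text, all_text, hrefs)
```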

 

 

5. Attribute multi-value matching

from lxml import etree


text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two"><a href="link2">2</a></li>
        <li class="three two"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>

'''
# Convert the text into an HTML document, repairing and completing the markup
html = etree.HTML(text)

# contains(@attribute, value) matches when the attribute contains the value
result = html.xpath('//li[contains(@class, "three")]/a/text()')

print(result)
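
The reason contains() is needed here is that an exact comparison fails once the attribute holds more than one value. A small sketch on made-up markup; keep in mind that contains() is a plain substring test:

```python
from lxml import etree

doc = etree.HTML(
    '<ul><li class="two"><a href="link2">2</a></li>'
    '<li class="three two"><a href="link3">3</a></li></ul>'
)

# Exact comparison: "three two" is not equal to "three", so nothing matches
exact = doc.xpath('//li[@class="three"]/a/text()')

# Substring test: "three two" contains "three"
partial = doc.xpath('//li[contains(@class, "three")]/a/text()')

# Both <li> elements contain "two" in their class attribute
both = doc.xpath('//li[contains(@class, "two")]/a/text()')

print(exact, partial, both)
```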

 

6. Multi-attribute matching

When a node can only be identified by several attributes together, match all of them at once by joining the conditions with "and".

from lxml import etree


text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two three" name="item"><a href="link2">2</a></li>
        <li class="three two"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>

'''
# Convert the text into an HTML document, repairing and completing the markup
html = etree.HTML(text)

# Join conditions with "and": class contains "three" and name equals "item"
result = html.xpath('//li[contains(@class, "three") and @name="item"]/a/text()')

print(result)
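
XPath also has an "or" operator, which widens a match instead of narrowing it. A minimal sketch on made-up markup, comparing the two:

```python
from lxml import etree

doc = etree.HTML(
    '<ul><li class="one"><a href="link1">1</a></li>'
    '<li class="two three" name="item"><a href="link2">2</a></li>'
    '<li class="three two"><a href="link3">3</a></li></ul>'
)

# and: both conditions must hold on the same <li>
both = doc.xpath('//li[contains(@class, "three") and @name="item"]/a/text()')

# or: either condition is enough
either = doc.xpath('//li[@class="one" or @name="item"]/a/text()')

print(both, either)
```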

 

7. Choose in order

from lxml import etree


text = '''
<div>
    <ul>
        <li class="one"><a href="link1">1</a></li>
        <li class="two three" name="item"><a href="link2">2</a></li>
        <li class="three two"><a href="link3">3</a></li>
        <li class="four"><a href="link4">4</a></li>
        <li class="five"><a href="link5">5</a>
    </ul>
</div>

'''
# Convert the text into an HTML document, repairing and completing the markup
html = etree.HTML(text)

# The first li
result1 = html.xpath('//li[1]/a/text()')

# The third from last (last() - 2)
result2 = html.xpath('//li[last()-2]/a/text()')

# The last
result3 = html.xpath('//li[last()]/a/text()')

# Position less than 3
result4 = html.xpath('//li[position()<3]/a/text()')

# More than 100 built-in functions: http://www.w3school.com.cn/xpath/xpath_functions.asp
print(result1)
print(result2)
print(result3)
print(result4)
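
Positions in XPath start at 1, not 0 as in Python lists. The sketch below, on a made-up five-item list, shows that index 0 matches nothing and that position() conditions can be combined to select a range:

```python
from lxml import etree

# Build <ul> with five <li> items, each holding a numbered link
items = ''.join(f'<li><a href="link{i}">{i}</a></li>' for i in range(1, 6))
doc = etree.HTML('<ul>' + items + '</ul>')

# No node has position 0, so this is empty
zero = doc.xpath('//li[0]/a/text()')

# Second from last
second_last = doc.xpath('//li[last()-1]/a/text()')

# Everything except the first and the last
middle = doc.xpath('//li[position()>1 and position()<last()]/a/text()')

print(zero, second_last, middle)
```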

 

 

8. Node axis selection
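
An axis is written as "axis::node-test" and selects nodes by their relationship to the current node. The following is a sketch of several standard XPath axes, assuming lxml is installed; the sample markup and variable names are made up for illustration:

```python
from lxml import etree

doc = etree.HTML(
    '<div><ul>'
    '<li class="one"><a href="link1"><span>1</span></a></li>'
    '<li class="two"><a href="link2">2</a></li>'
    '<li class="three"><a href="link3">3</a></li>'
    '</ul></div>'
)

# ancestor:: all ancestors of the first <li>
ancestors = [n.tag for n in doc.xpath('//li[1]/ancestor::*')]

# ancestor:: restricted to <div> ancestors
div_only = [n.tag for n in doc.xpath('//li[1]/ancestor::div')]

# attribute:: all attribute values of the first <li>
attrs = doc.xpath('//li[1]/attribute::*')

# child:: direct children, filtered by a predicate
child = doc.xpath('//li[1]/child::a[@href="link1"]')

# descendant:: nodes at any depth below the first <li>
span_text = doc.xpath('//li[1]/descendant::span/text()')

# following:: nodes after the first <li> in document order; [2] picks the second
following = doc.xpath('//li[1]/following::*[2]/@href')

# following-sibling:: later nodes that share the same parent
siblings = doc.xpath('//li[1]/following-sibling::*/@class')

print(ancestors, div_only, attrs, len(child), span_text, following, siblings)
```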

 

 

 

 



Origin www.cnblogs.com/Crown-V/p/12725652.html