Python3 Web Scraping from Scratch: Using XPath

    Previously we extracted page information with regular expressions, but that approach is tedious and error-prone. XPath offers concise, clear path selection expressions plus a large set of built-in functions, and it can locate almost any node we might want.

Using XPath here requires the lxml library; installation is shown below.
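
A typical way to install it, assuming a standard Python environment with pip available:

pip install lxml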

Common rules

  • nodename      selects all child nodes of the named node
  • /                      selects direct children from the current node
  • //                     selects descendants from the current node
  • .                       selects the current node
  • ..                      selects the parent of the current node
  • @                    selects attributes
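
As a quick illustration of how these rules combine, here is a minimal sketch against a made-up one-line fragment (not code from the original post):

from lxml import etree

# made-up sample markup, only to illustrate the rules above
doc = etree.HTML('<div><ul><li class="item-0"><a href="link1.html">first item</a></li></ul></div>')

print(doc.xpath('//li'))              # //  selects li nodes at any depth
print(doc.xpath('//li/a'))            # /   selects direct a children of li
print(doc.xpath('//a/../@class'))     # ..  moves to the parent of a, @ reads its class attribute
print(doc.xpath('//a/@href'))         # @   selects the href attribute itself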

Example 1:

from lxml import etree

# note: the markup below is deliberately malformed (a stray <a> and an unclosed <li>)
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item<a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">forth item</a><li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''

html = etree.HTML(text)    # initialize with the HTML class; it parses (and repairs) the text
result = etree.tostring(html)    # serialize the parsed document back to bytes
print(result.decode('utf-8'))
print(type(html))
print(type(result))

Result:

Analysis: as you can see, the etree module has corrected the broken markup for us. The deliberately malformed tags above (the stray <a> and the unclosed <li>) are repaired, missing closing tags are added, and the fragment is wrapped in <html><body> tags; in other words, etree.HTML() automatically fixes malformed HTML text.

Note: PyCharm may underline from lxml import etree in red (etree is a compiled C extension, so the IDE's static analysis cannot always resolve it), but this has no effect; the code runs normally.

Example 2:

from lxml import etree

html = etree.parse('test.html', etree.HTMLParser())    # parse a local file with the lenient HTML parser
print(type(html))
result = etree.tostring(html)
print(type(result))
print(result.decode('utf-8'))

The corresponding test.html:
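
The file itself appears in the original post only as a screenshot. Judging from the examples that follow, it seems to contain the same markup as the text string in Example 1; a sketch that recreates such a file (an assumption, not the original file) is:

# writes an assumed test.html that mirrors the text string from Example 1,
# including the deliberately broken <a> and unclosed <li>
with open('test.html', 'w', encoding='utf-8') as f:
    f.write('''<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item<a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">forth item</a><li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>''')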

Result:

Note: if etree.HTMLParser() is removed here, an error is raised:

namely because of the malformed <li> tag in test.html.

After fixing that tag, the output is correct:

In other words, etree.parse() defaults to a strict XML parser, which fails on malformed markup; passing etree.HTMLParser() switches it to the lenient HTML parser, which repairs the HTML just as etree.HTML() did in Example 1.
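
A minimal sketch of that difference, added here for illustration: the same broken fragment fails under the default strict XML parser but is repaired by the HTML parser, just like etree.HTML().

from lxml import etree

broken = '<li class="item-0"><a href="link1.html">first item</a>'    # missing </li>

# lenient: the HTML parser repairs the missing closing tag
fixed = etree.fromstring(broken, parser=etree.HTMLParser())
print(etree.tostring(fixed).decode('utf-8'))

# strict: the default XML parser raises XMLSyntaxError on the same input
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as e:
    print('XML parser error:', e)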

Example 3: child node selection

from lxml import etree

html = etree.parse('test.html', etree.HTMLParser())
print("html type:", type(html))

result1 = html.xpath('//*')    # select all nodes
print("all nodes:", result1)

result = html.xpath('//li')    # select all li nodes, however deeply nested
print("all li nodes:", result)
print("element at index [0]:", result[0])    # individual elements can be taken from the list by index

result2 = html.xpath('//li/a')    # select the direct a children of every li node
print("direct a children of all li nodes:", result2)

Result:
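
Since xpath() returns a list of Element objects here, each matched element can be inspected further with lxml's standard Element API; a small sketch added for illustration (it assumes test.html exists as above):

from lxml import etree

html = etree.parse('test.html', etree.HTMLParser())
first_a = html.xpath('//li/a')[0]    # take the first matched a element by index
print(first_a.tag)                   # the element's tag name, e.g. 'a'
print(first_a.get('href'))           # read an attribute value, e.g. 'link1.html'
print(first_a.text)                  # the element's own text, e.g. 'first item'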

Example 4: text extraction and attribute extraction

from lxml import etree

html = etree.parse('test.html', etree.HTMLParser())

result1 = html.xpath('//a[@href="link4.html"]')    # attribute matching
result2 = html.xpath('//a[@href="link4.html"]/../@class')    # .. moves to the parent node, @class reads its class attribute
print("result1:", result1)
print("result2:", result2)

result3 = html.xpath('//a[@href="link4.html"]/text()')    # text extraction with text()
print("result3:", result3)

result4 = html.xpath('//a/@href')    # attribute extraction -- note the difference from attribute matching
print("result4:", result4)

Result:
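
A related point worth keeping in mind (an added illustration, not in the original post): /text() only returns text that sits directly inside the selected node, while //text() also collects text from nested descendants such as the a tags.

from lxml import etree

html = etree.HTML('<ul><li class="item-0"><a href="link1.html">first item</a></li></ul>')
print(html.xpath('//li/text()'))     # [] -- the text lives inside <a>, not directly in <li>
print(html.xpath('//li//text()'))    # ['first item'] -- descendant text is included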

Example 5: multi-valued attribute matching

from lxml import etree

text = """
<li class="li li-first"><a href="link.html">first item</a></li>
<li class="li li-first" name="item"><a href="link.html">second item</a></li>
"""
html = etree.HTML(text)

result1 = html.xpath('//li[@class="li"]/a/text()')    # no match: the class attribute holds multiple values
result2 = html.xpath('//li[@class="li li-first"]/a/text()')    # matches: the full attribute value is given
result3 = html.xpath('//li[contains(@class,"li")]/a/text()')    # contains() matches one value of a multi-valued attribute
result4 = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')    # match on several attributes with "and"

print("result1:", result1)
print("result2:", result2)
print("result3:", result3)
print("result4:", result4)

Result:
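
Besides contains() and and, XPath also has an or operator; a small sketch against the same two li nodes (added for illustration):

from lxml import etree

text = """
<li class="li li-first"><a href="link.html">first item</a></li>
<li class="li li-first" name="item"><a href="link.html">second item</a></li>
"""
html = etree.HTML(text)
result5 = html.xpath('//li[contains(@class,"li-first") or @name="item"]/a/text()')
print("result5:", result5)    # either condition is enough, so both li nodes match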

Example 6: selecting by order

from lxml import etree

text = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">forth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
html = etree.HTML(text)

result1 = html.xpath('//li[1]/a/text()')    # the first li (XPath positions start at 1)
print('result1:', result1)

result2 = html.xpath('//li[last()]/a/text()')    # the last li
print('result2:', result2)

result3 = html.xpath('//li[position()<3]/a/text()')    # the first two li nodes
print('result3:', result3)

result4 = html.xpath('//li[last()-2]/a/text()')    # the third li counting from the end
print('result4:', result4)

Result:

Besides last() and position(), more functions are listed at http://www.w3school.com.cn/xpath/xpath_functions.asp#node, a few of which appear in the sketch below.
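
A few of those functions in use, as a sketch (it assumes test.html from the earlier examples; count(), string() and string-length() are standard XPath 1.0 functions):

from lxml import etree

html = etree.parse('test.html', etree.HTMLParser())
print(html.xpath('count(//li)'))          # number of li nodes, returned as a float
print(html.xpath('string(//li[1]/a)'))    # the string value of the first a node
print(html.xpath('//li[string-length(a/text()) > 10]/a/text()'))    # only texts longer than 10 characters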

Example 7: node axis selection

from lxml import etree

text = """
<div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">forth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
html = etree.HTML(text)

result1 = html.xpath('//li[1]/ancestor::*')    # ancestor axis: all ancestor nodes
print('result1:', result1)

result2 = html.xpath('//li[1]/ancestor::div')    # ancestor axis restricted to div ancestors
print('result2:', result2)

result3 = html.xpath('//li[1]/attribute::*')    # attribute axis: all attribute values
print('result3:', result3)

result4 = html.xpath('//li[1]/child::a[@href="link1.html"]')    # child axis with a condition (the condition changes nothing here -- there is only one child)
print('result4:', result4)

result5 = html.xpath('//li[1]/descendant::span')    # descendant axis restricted to span descendants
print('result5:', result5)

result6 = html.xpath('//li[1]/following::*')    # following axis: all nodes after the current node in document order
print('result6:', result6)

result7 = html.xpath('//li[1]/following::*[2]')    # following axis with an index: only the second following node
print('result7:', result7)

result8 = html.xpath('//li[1]/following-sibling::*')    # following-sibling axis: all following siblings of the current node
print('result8:', result8)

Result:
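
For completeness, a few axes not used above: parent::, preceding-sibling:: and preceding::. A sketch against a minimal made-up fragment (added for illustration):

from lxml import etree

html = etree.HTML('<ul><li>one</li><li>two</li><li>three</li></ul>')
print(html.xpath('//li[3]/parent::*'))               # parent axis: the enclosing ul element
print(html.xpath('//li[3]/preceding-sibling::*'))    # preceding-sibling axis: the two earlier li nodes
print(html.xpath('//li[3]/preceding::li'))           # preceding axis: all li nodes before the current one in document order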

More on the Python lxml library:

http://lxml.de/

More XPath usage: http://www.w3school.com.cn/xpath/index.asp

Reposted from blog.csdn.net/qq_26736193/article/details/83216518