Using the XPath Library

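XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well for searching HTML documents.
XPath's selection capabilities are very powerful, and its path expressions are concise and clear. It also provides more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences. Almost any node we want to locate can be selected with XPath.
Here is an example: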

from lxml import etree

text = '''<ul class="m-list">
<li><a class="title" title="宿舍违规电器领回办理流程图" id="157" href="/houqin/gyxsgy/index.php/Home/News/showarticle/id/157">宿舍违规电器领回办理流程图</a><span class="date">[2017-01-10]</span></li>
<li><a class="title" title="学生申请假期住校办理流程图" id="156" href="/houqin/gyxsgy/index.php/Home/News/showarticle/id/156">学生申请假期住校办理流程图</a><span class="date">[2017-01-10]</span></li>
<li><a class="title" title="学生公寓区宣传品申请张贴办理流程图" id="155" href="/houqin/gyxsgy/index.php/Home/News/showarticle/id/155">学生公寓区宣传品申请张贴办理流程图</a><span class="date">[2017-01-10]</span></li>
<li><a class="title" title="学生公寓物品外借办理流程图" id="154" href="/houqin/gyxsgy/index.php/Home/News/showarticle/id/154">学生公寓物品外借办理流程图</a><span class="date">[2017-01-10]</span></li>
<li><a class="title" title="学生申请住宿办理流程图" id="153" href="/houqin/gyxsgy/index.php/Home/News/showarticle/id/153">学生申请住宿办理流程图</a><span class="date">[2017-01-10]</span></li>
<li><a class="title" title="学生宿舍调整办理流程图" id="152" href="/houqin/gyxsgy/index.php/Home/News/showarticle/id/152">学生宿舍调整办理流程图</a><span class="date">[2017-01-10]</span></li>                    
</ul>
'''
html = etree.HTML(text)
result = etree.tostring(html,encoding="utf-8",pretty_print=True,method="html")
print(result.decode('utf-8'))

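Here we first import the etree module from the lxml library, then declare a piece of HTML text and pass it to etree.HTML() to parse it, which gives us an object that XPath queries can run against. Note that the HTML text is incomplete (it is only a ul fragment with no html or body element), but the etree module repairs it automatically.
Calling tostring() outputs the repaired HTML, but the result is of type bytes, so we use decode() to convert it to str.
If the HTML contains Chinese characters and tostring() is called without encoding="utf-8", they are written as numeric character references and the output looks garbled. Passing encoding="utf-8" and decoding the result, as above, avoids this. See also: http://blog.sina.com.cn/s/blog_9e103b930102x1jx.html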
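
The repair and encoding behavior described above can be checked with a small sketch. The fragment below is a made-up, shortened stand-in for the sample text (the href value is hypothetical), and it assumes lxml is installed:

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper and an unclosed <li>.
fragment = '<ul class="m-list"><li><a href="/a/157">宿舍</a></ul>'

root = etree.HTML(fragment)

# The parser wraps the fragment in <html><body>...</body></html>.
print(root.tag)                        # html
print([child.tag for child in root])   # ['body']

# Without an encoding argument, tostring() serializes as ASCII,
# so non-ASCII text is written as numeric character references.
ascii_bytes = etree.tostring(root, method="html")

# With encoding="utf-8" the characters are written as UTF-8 bytes.
utf8_text = etree.tostring(root, encoding="utf-8", method="html").decode("utf-8")

print("宿舍" in ascii_bytes.decode("ascii"))   # False
print("宿舍" in utf8_text)                      # True
```

Printing ascii_bytes directly shows the character references, which is the "garbled" output mentioned above.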

Node Selection

[Screenshot: table of common XPath rules]

You can of course also specify a node name:

from lxml import etree

html = etree.parse("text.html",parser=etree.HTMLParser(encoding='utf-8'))
ts_result = etree.tostring(html,encoding='utf-8',pretty_print=True,method="html")
all_result = html.xpath('//*')        # select all nodes
li_result = html.xpath('//li')        # select all li nodes
li_a_result = html.xpath('//li/a')    # select the direct a children of every li node
ul_a_result = html.xpath('//ul//a')   # select all descendant a nodes of every ul node
class_result = html.xpath('//li/../@class')          # get the class attribute of each li node's parent
href_result = html.xpath('//li/a/@href')             # get the href attribute of each a node
title_result = html.xpath('//a[@class="title"]')     # select a nodes whose class is "title"
print(li_result)
print(li_result[0])
print(li_a_result)
print(ul_a_result)
print(class_result)
print(title_result)
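
Because the snippet above depends on a local text.html file, here is a self-contained sketch of the same selections run against an inline string. The markup, hrefs, and link texts are invented for illustration. Element results are reduced to tag names or attribute values so the output is easy to read:

```python
from lxml import etree

# Hypothetical two-item list modeled on the sample markup.
sample = '''<ul class="m-list">
<li><a class="title" id="157" href="/News/157">First</a><span class="date">[2017-01-10]</span></li>
<li><a class="title" id="156" href="/News/156">Second</a><span class="date">[2017-01-10]</span></li>
</ul>'''
doc = etree.HTML(sample)

# '//*' returns every element, in document order (including the added html/body).
all_tags = [el.tag for el in doc.xpath('//*')]
print(all_tags)

# '//li' returns a list of Element objects.
li_nodes = doc.xpath('//li')
print(len(li_nodes), li_nodes[0].tag)     # 2 li

# '..' steps to the parent; '@class' reads its attribute.
# Both li nodes share one parent, so a single value comes back.
parent_class = doc.xpath('//li/../@class')
print(parent_class)                       # ['m-list']

# Attribute selection returns plain strings.
hrefs = doc.xpath('//li/a/@href')
print(hrefs)                              # ['/News/157', '/News/156']

# An attribute predicate filters the elements themselves.
title_ids = [a.get('id') for a in doc.xpath('//a[@class="title"]')]
print(title_ids)                          # ['157', '156']
```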

Getting Text

Use the text() node test in XPath to get the text inside a node:

from lxml import etree

html = etree.parse("text.html",parser=etree.HTMLParser(encoding='utf-8'))
li_text = html.xpath('//li/a/text()')
print(li_text)
# use contains() to match an attribute that holds multiple values
a_text = html.xpath('//a[contains(@class,"test")]/text()')
print(a_text)
# use and / or to combine several attribute conditions
li_text = html.xpath('//li/a[@class = "title" or @class = "date"]/text()')
print(li_text)
## Selecting by position
result = html.xpath('//li[1]/a/text()') # text of the a node in the first li node
print(result)
result = html.xpath('//li[last()]/a/text()') # text of the a node in the last li node
print(result)
result = html.xpath('//li[position()<3]/a/text()') # li nodes whose position is less than 3 (the first two)
print(result)
result = html.xpath('//li[last()-2]/a/text()')  # the third li node from the end
print(result)
print(result)
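
To make the difference between exact matching, contains(), and/or, and positional predicates concrete, here is a small sketch on an invented four-item list. The class names, hrefs, and texts are assumptions chosen so that each query returns something different:

```python
from lxml import etree

# Hypothetical list; the first link carries two class values.
sample = '''<ul>
<li><a class="title top" href="/1">one</a></li>
<li><a class="title" href="/2">two</a></li>
<li><a class="link" href="/3">three</a></li>
<li><a class="title" href="/4">four</a></li>
</ul>'''
doc = etree.HTML(sample)

# Exact comparison fails on the multi-valued class "title top".
exact = doc.xpath('//a[@class="title"]/text()')
print(exact)            # ['two', 'four']

# contains() matches any class attribute that includes the substring.
partial = doc.xpath('//a[contains(@class,"title")]/text()')
print(partial)          # ['one', 'two', 'four']

# 'and' requires both conditions; 'or' requires at least one.
both = doc.xpath('//a[contains(@class,"title") and @href="/1"]/text()')
either = doc.xpath('//a[@class="link" or @href="/4"]/text()')
print(both, either)     # ['one'] ['three', 'four']

# Positions are 1-based; last() is the index of the final li.
first = doc.xpath('//li[1]/a/text()')
last = doc.xpath('//li[last()]/a/text()')
first_two = doc.xpath('//li[position()<3]/a/text()')
third_from_last = doc.xpath('//li[last()-2]/a/text()')
print(first, last, first_two, third_from_last)
```

Note that contains() is a substring test, so it would also match a class such as "subtitle".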

## Selecting along node axes
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@class="title"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)

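In the first selection, we use the ancestor axis, which gets all ancestor nodes. The axis name is followed by two colons and then a node selector; here we use *, which matches all nodes.
In the second selection, we add a restriction by putting div after the colons, so the result contains only div ancestors (and is an empty list if the li has no div ancestor).
In the third selection, we use the attribute axis, which gets attribute values. The selector is again *, meaning all attributes of the node, so the return value is the list of all attribute values of the li node.
In the fourth selection, we use the child axis, which gets all direct children. Here we add a restriction and select only a nodes whose class attribute is title.
In the fifth selection, we use the descendant axis, which gets all descendant nodes. Here we restrict the match to span nodes, so the result contains only span nodes and no a nodes.
In the sixth selection, we use the following axis, which gets all nodes after the current node. Although we match with *, we also add an index, so only the second following node is returned.
In the seventh selection, we use the following-sibling axis, which gets all sibling nodes after the current node. Here we match with *, so all following siblings are returned.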
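
The seven axis queries can be tried on a self-contained sketch. The markup below is invented: it wraps the list in a div and gives the first li a class attribute (both assumptions) so that the ancestor::div and attribute::* queries return something. The tags() helper is also just for this illustration:

```python
from lxml import etree

# Hypothetical markup: a div wrapper and a class on the first li.
sample = '''<div id="box"><ul class="m-list">
<li class="first"><a class="title" href="/1">one</a><span class="date">d1</span></li>
<li><a class="title" href="/2">two</a><span class="date">d2</span></li>
<li><a class="title" href="/3">three</a><span class="date">d3</span></li>
</ul></div>'''
doc = etree.HTML(sample)

def tags(nodes):
    """Reduce a list of elements to their tag names for readable output."""
    return [node.tag for node in nodes]

ancestors = tags(doc.xpath('//li[1]/ancestor::*'))
div_ancestors = tags(doc.xpath('//li[1]/ancestor::div'))
attributes = doc.xpath('//li[1]/attribute::*')
children = tags(doc.xpath('//li[1]/child::a[@class="title"]'))
spans = tags(doc.xpath('//li[1]/descendant::span'))
second_following = doc.xpath('//li[1]/following::*[2]')
siblings = tags(doc.xpath('//li[1]/following-sibling::*'))

print(ancestors)        # ['html', 'body', 'div', 'ul']
print(div_ancestors)    # ['div']
print(attributes)       # ['first']
print(children)         # ['a']
print(spans)            # ['span']
# The following axis skips the li's own descendants: the first following
# node is the second li, and the second is the a inside it.
print(second_following[0].tag, second_following[0].get('href'))   # a /2
print(siblings)         # ['li', 'li']
```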