Web Scraping: Using Parsing Libraries (XPath, BeautifulSoup, pyquery)

1. XPath

XPath, whose full name is XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works just as well for searching HTML documents.

XPath's selection capability is very powerful: it offers concise, readable path-selection expressions, plus more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences. Almost any node we might want to locate can be selected with XPath.

from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
# note the last li is deliberately missing its closing tag; etree.HTML auto-corrects it
html = etree.HTML(text)
result = etree.tostring(html)  # serialize the corrected tree back to bytes
print(result.decode('utf-8'))

<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div>
</body></html>
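The sections below load this same snippet from a local file. Assuming it was saved as F:\spider\XPath\test.html (the path used throughout), etree.parse with an HTMLParser yields the same tree; here is a minimal sketch that selects all li nodes with the // operator:

from lxml import etree

# assumes the snippet above was saved to this path (reused in the sections below)
html = etree.parse('F:\\spider\\XPath\\test.html', etree.HTMLParser())
res = html.xpath('//li')  # // selects matching nodes anywhere in the document
print(res)
# a list of five li Element objects, e.g. [<Element li at 0x...>, ...]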

7. Parent Nodes

If we know a child node, we can select its parent with the .. shorthand or the parent:: axis.

from lxml import etree

html = etree.parse('F:\\spider\\XPath\\test.html', etree.HTMLParser())
res = html.xpath('//a[@href="link4.html"]/../@class')  # .. selects the parent node
# res = html.xpath('//a[@href="link4.html"]/parent::*/@class')  # equivalent, using the parent axis
print(res)
# ['item-1']

9. Text Extraction

from lxml import etree

html = etree.parse('F:\\spider\\XPath\\test.html', etree.HTMLParser())
res = html.xpath('//li[@class="item-0"]/a/text()')  # text of the direct a children
print(res)
# ['first item', 'fifth item']

Here we selected layer by layer: first the li nodes with the matching class, then their direct a children with /, and finally the text. The result is exactly the two items we expected.

from lxml import etree

html = etree.parse('F:\\spider\\XPath\\test.html', etree.HTMLParser())
res = html.xpath('//li[@class="item-0"]//text()')  # text of all descendant nodes
print(res)
# ['first item', 'fifth item', '\r\n\t']

As expected, this time there are three results. The // form selects the text of all descendant nodes: the first two entries are the text inside the a children of the two li nodes, and the third is the whitespace inside the last li node itself, i.e. the line break and indentation left after its a child.
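If that stray whitespace is unwanted, XPath's built-in string functions can clean it up. A minimal sketch, assuming the same test.html as above; string() converts the first matching node to its full text content, and normalize-space() additionally strips and collapses whitespace:

from lxml import etree

html = etree.parse('F:\\spider\\XPath\\test.html', etree.HTMLParser())
res = html.xpath('string(//li[@class="item-0"])')  # text of the first matching node only
print(res)
# first item
res = html.xpath('normalize-space(//li[@class="item-0"][last()])')  # whitespace collapsed
print(res)
# fifth item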

10. Attribute Extraction

We know text() retrieves a node's inner text, so how do we get attributes? With the @ symbol. For example, to fetch the href attribute of every a node under the li nodes:

from lxml import etree

# attribute extraction
html = etree.parse('F:\\spider\\XPath\\test.html', etree.HTMLParser())
res = html.xpath('//li/a/@href')  # @href selects the attribute value itself
print(res)
# ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
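This is also where XPath's built-in functions (the 100+ mentioned earlier) come in handy. When an attribute holds multiple values, exact matching with @class fails, but contains() works, and several conditions can be joined with the and operator. A small sketch with a hypothetical snippet:

from lxml import etree

text = '<li class="li li-first" name="item"><a href="link.html">first item</a></li>'
html = etree.HTML(text)
print(html.xpath('//li[@class="li"]/a/text()'))  # [] -- exact match fails, class is "li li-first"
print(html.xpath('//li[contains(@class, "li")]/a/text()'))  # ['first item']
print(html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()'))  # ['first item']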

13. Selecting by Order

Sometimes an expression matches several nodes at once, but we only want one of them, say the second or the last. We can pass an index (or a position function) inside square brackets to pick out a node by its order.

from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath("//li/a/text()")
res1 = html.xpath("//li[1]/a/text()")  # note: XPath indices start at 1, not 0
res2 = html.xpath("//li[last()]/a/text()")
res3 = html.xpath("//li[position()<3]/a/text()")
res5 = html.xpath("//li[last()-2]/a/text()")
print(res, res1, res2, res3, res5, sep='\n')
# ['first item', 'second item', 'third item', 'fourth item', 'fifth item']
# ['first item']
# ['fifth item']
# ['first item', 'second item']
# ['third item']

In the first indexed selection we took the first li node simply by putting the number 1 in brackets. Note that, unlike Python indexing, XPath ordinals start at 1, not 0.

14. Node Axis Selection

XPath provides many node axes, including ones for selecting children, siblings, parents, and ancestors.

from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath("//li[1]/ancestor::*")  # all ancestors of the first li
print(res)
# [<Element html at 0x1991a74efc8>, <Element body at 0x1991a720dc8>, <Element div at 0x1991a760488>, <Element ul at 0x1991a760848>]
res = html.xpath("//li[1]/ancestor::div")  # only the div ancestor
print(res)
# [<Element div at 0x1991a6c0188>]
res = html.xpath("//li[1]/attribute::*")  # all attribute values of the node
print(res)
# ['item-0']
res = html.xpath('//li[1]/child::a[@href="link1.html"]')  # direct children matching the predicate
print(res)
# [<Element a at 0x1991a760308>]
res = html.xpath("//li[1]/descendant::span")  # descendants, restricted to span nodes
print(res)
# [<Element span at 0x1991a7688c8>]
res = html.xpath('//li[1]/following::*[1]')  # only the first node after the current one
print(res)
# [<Element li at 0x1991a6c02c8>]
res = html.xpath('//li[1]/following-sibling::*')  # all subsequent sibling nodes
print(res)
# [<Element li at 0x1991a74ed48>, <Element li at 0x1991a74e748>, <Element li at 0x1991a720dc8>, <Element li at 0x1991a720bc8>]

Each call follows the axis::node-test form. Adding a node test or predicate after the axis restricts the result: ancestor::div keeps only div ancestors, and following::*[1] takes just the first node after the current one.

2. BeautifulSoup

A powerful parsing tool that exploits the structure and attributes of a web page to parse it. With it, we no longer need to write complicated regular expressions; a few simple statements are enough to extract a given element from a page.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8, so you don't need to worry about encodings; only when a document does not declare an encoding do you need to state the original encoding yourself.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>Nice...</div>', 'lxml')
print(soup.div.string)
# Nice...

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2"> Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story"> ... </p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # pretty-print with standard indentation (also fixes the unclosed tags)
print()
print(soup.title)
print(soup.title.string)
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title" name="dromouse">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     <!-- Elsie -->
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ;
# and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
# 
# <title>The Dormouse's story</title>
# The Dormouse's story

Node Selectors

Calling a tag's name directly selects that node element, and then calling its string attribute gives you the text inside the node.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2"> Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story"> ... </p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.title.string)
print(type(soup.title))
print(soup.head)
print(soup.p)
# <title>The Dormouse's story</title>
# The Dormouse's story
# <class 'bs4.element.Tag'>
# <head><title>The Dormouse's story</title></head>
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

bs4.element.Tag is an important data structure in Beautiful Soup: every result produced by this kind of selector is of this Tag type. Tag has a number of attributes, for example string, which returns the node's text content; that is why the next line of output is exactly the node's text.

Finally we selected the p node. This case is special: only the content of the first p node is returned; the later p nodes are not selected. When a tag name matches several nodes, this kind of selection keeps only the first one.
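When you need every match rather than just the first, bs4's find_all() method does that; a minimal sketch:

from bs4 import BeautifulSoup

html = '<p class="a">one</p><p class="b">two</p>'
soup = BeautifulSoup(html, 'lxml')
print(soup.p)  # tag-name selection returns only the first matching node
print(soup.find_all('p'))  # find_all returns a list of every matching node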

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story"> ... </p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.name)  # p
print(soup.p["name"])  # dromouse
print(soup.p.attrs)  # {'class': ['title'], 'name': 'dromouse'}
print(soup.p.attrs['name'])  # dromouse
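One subtlety of standard bs4 behavior worth noting: attributes that can hold multiple values, such as class, come back as a list, while ordinary attributes come back as plain strings:

from bs4 import BeautifulSoup

html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, 'lxml')
print(soup.p['class'])  # ['title'] -- class can hold multiple values, so a list is returned
print(soup.p['name'])   # dromouse -- single-valued attributes are plain strings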

If the result is a single node, you can call string, attrs, and so on directly to get its text and attributes; if the result is a generator over multiple nodes, convert it to a list, take out the element you want, and then call string, attrs, etc. on that node.
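A minimal sketch of the second case, using the children generator over a node's direct children:

from bs4 import BeautifulSoup

html = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, 'lxml')
children = soup.p.children  # an iterator over the p node's direct children
tags = list(children)       # convert to a list so we can index into it
print(tags[0])         # <b>The Dormouse's story</b>
print(tags[0].string)  # The Dormouse's story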

3. pyquery

If you have done some web work, prefer CSS selectors, and know a little jQuery, then there is a parsing library that may suit you even better: pyquery.
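As a quick taste (a minimal sketch; detailed usage is beyond this excerpt), pyquery lets you select nodes with ordinary CSS selectors:

from pyquery import PyQuery as pq

html = '<div><ul><li class="item-0">first item</li><li class="item-1">second item</li></ul></div>'
doc = pq(html)
print(doc('li'))              # all li nodes, selected with a CSS tag selector
print(doc('.item-0').text())  # first item -- a CSS class selector plus text()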


Reposted from blog.csdn.net/m0_38024592/article/details/82771171