First, a link to the official documentation.
Beautiful Soup supports multiple parsers; here we use lxml, which is also the parser the official documentation recommends.
Let's start with the kinds of objects Beautiful Soup exposes:
1. Tag
>>>from bs4 import BeautifulSoup
>>>soup = BeautifulSoup(html_doc, 'lxml')
>>>print(soup.p)
<p class="title"><b>The Dormouse's story</b></p>
From a Tag we can get:
Tag.name — the name of the tag.
Tag.attrs — the tag's attributes as a dictionary (attributes can be added, deleted, or modified with the usual dict operations).
>>>soup.p.attrs
{'class': ['title']}
>>>soup.p.attrs['class']
['title']
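Since attributes behave like a dict, here is a quick sketch of adding, changing, and deleting them (the markup is my own minimal example, not from the document above):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title">text</p>', 'lxml')
tag = soup.p
tag['id'] = 'intro'          # add a new attribute
tag['class'] = ['headline']  # replace an existing one
del tag['class']             # delete an attribute entirely
print(tag.attrs)             # {'id': 'intro'}
```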
Tag.string — the tag's text content (the non-attribute string inside the tag).
>>>p.string
"The Dormouse's story"
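One caveat worth noting (my own addition, not covered above): .string only works when a tag has a single, unambiguous string inside it; with several children it returns None, and get_text() is the safer choice:

```python
from bs4 import BeautifulSoup

# A single nested string: .string recurses into <b> and finds it.
soup = BeautifulSoup("<p class=\"title\"><b>The Dormouse's story</b></p>", 'lxml')
print(soup.p.string)       # The Dormouse's story

# Multiple children: .string is ambiguous, so it returns None.
soup2 = BeautifulSoup('<p>one <b>two</b> three</p>', 'lxml')
print(soup2.p.string)      # None
print(soup2.p.get_text())  # one two three
```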
Comment — the comment portion of a document:
>>>markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>>soup = BeautifulSoup(markup, 'lxml')
>>>comment = soup.b.string
>>>type(comment)
<class 'bs4.element.Comment'>
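Because Comment is a subclass of NavigableString, an isinstance check lets us skip comments when extracting visible text — a small sketch:

```python
from bs4 import BeautifulSoup, Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string

# Filter out comment nodes before treating strings as visible text:
print(isinstance(comment, Comment))  # True
```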
2. Navigating the tree:
Tag.contents — a list of all direct children (len(Tag.contents) gives the number of children).
>>>soup.p.contents
[<b>The Dormouse's story</b>]
>>>type(soup.p.contents)
<class 'list'>
Tag.children — an iterator over the direct children, for use in a loop:
>>>for i in soup.body.children:
...    print(repr(i))
'\n'
<p class="title"><b>The Dormouse's story</b></p>
'\n'
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'\n'
<p class="story">...</p>
'\n'
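The '\n' entries above are whitespace text nodes that the parser keeps. A sketch (against a trimmed-down version of html_doc) of keeping only the real tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, 'lxml')

# Keep only Tag children, dropping the '\n' NavigableString nodes:
tags = [child for child in soup.body.children if isinstance(child, Tag)]
print([t.name for t in tags])  # ['p', 'p']
```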
Tag.descendants — an iterator over all descendants, for use in a loop:
>>>for i in soup.body.descendants:
...    print(repr(i))
'\n'
<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
"The Dormouse's story"
'\n'
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'Once upon a time there were three little sisters; and their names were\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Elsie'
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
'Lacie'
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
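When all we want from those descendants is the text, .strings and .stripped_strings iterate over the string nodes directly — a sketch with a shortened document of my own:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Once upon a time <a>Elsie</a>,\n<a>Lacie</a></p>', 'lxml')

# .stripped_strings skips whitespace-only strings and trims the rest:
print(list(soup.p.stripped_strings))
# ['Once upon a time', 'Elsie', ',', 'Lacie']
```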
We can also traverse upward.
Tag.parent — the node's parent tag:
>>>soup.title.parent
<head><title>The Dormouse's story</title></head>
Tag.parents — all of the node's ancestors, for use in a loop:
# the example from the official documentation
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
if parent is None:
print(parent)
else:
print(parent.name)
# p
# body
# html
# [document]
# None
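The loop above can be condensed into a list comprehension; a sketch with a minimal document of my own (the [document] name belongs to the BeautifulSoup object itself):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p><a id="link1">Elsie</a></p></body></html>', 'lxml')

# Walk from <a> up to the root; the final parent is the soup object:
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```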
Sibling traversal; let's look at the official example:
>>>sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>", 'lxml')
>>>print(sibling_soup.prettify())
# <html>
# <body>
# <a>
# <b>
# text1
# </b>
# <c>
# text2
# </c>
# </a>
# </body>
# </html>
Here, b and c are sibling tags:
>>>sibling_soup.b.next_sibling
# <c>text2</c>
>>>sibling_soup.c.previous_sibling
# <b>text1</b>
If a tag is the first among its siblings, its previous_sibling is None; likewise, the last sibling's next_sibling is None.
As before, we can loop with .next_siblings and .previous_siblings.
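A sketch of both points — iterating with .next_siblings, and the None returned at either edge (I extended the official markup with a <d> tag):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a><b>text1</b><c>text2</c><d>text3</d></a>', 'lxml')

# Iterate over everything after <b> on the same level:
print([sib.name for sib in soup.b.next_siblings])  # ['c', 'd']

# The first sibling has nothing before it, the last nothing after:
print(soup.b.previous_sibling)  # None
print(soup.d.next_sibling)      # None
```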
.next_element — the next object in parse order.
.previous_element — the previous object in parse order.
.next_elements — an iterator over everything parsed after this point, for use in a loop.
.previous_elements — an iterator over everything parsed before this point, for use in a loop.
Their usage is similar to .previous_sibling(s)/.next_sibling(s), so we won't go into more detail here.
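The key difference from .next_sibling is that .next_element follows parse order, so it descends into children first — a minimal sketch reusing the official sibling markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a><b>text1</b><c>text2</c></a>', 'lxml')

print(soup.b.next_sibling)  # <c>text2</c>  (same level)
print(soup.b.next_element)  # text1         (the string parsed right after <b>)
```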
find_all()
We can use find_all('a') directly to find all the <a> tags:
>>>soup = BeautifulSoup(html_doc, 'lxml')
>>>for link in soup.find_all('a'):
...    print(link)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
We can also pass a list of all the tags we want to search for:
>>>results = soup.find_all(['a','b'])
The result then contains both <a> and <b> tags.
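A minimal sketch of the list form (the markup is my own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>bold</b><a href="#">link</a><i>italic</i>', 'lxml')

# Results come back in document order, mixing both tag types:
results = soup.find_all(['a', 'b'])
print([tag.name for tag in results])  # ['b', 'a']
```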
find_all() also supports several other kinds of searches:
soup.find_all(keyword='...') — keyword=True works too; for example, href=True returns every tag that has an href attribute.
soup.find_all(tag, class_='...') — class is a reserved word in Python, so bs4 uses class_ instead.
soup.find_all(class_=re.compile(...)) — regular expressions are supported here as well.
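A sketch of all three forms against a small document of my own:

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="title">story</p>'
        '<a class="sister" href="http://example.com/elsie">Elsie</a>')
soup = BeautifulSoup(html, 'lxml')

# keyword=True: every tag that has an href attribute at all
print([t.name for t in soup.find_all(href=True)])  # ['a']

# class_ with an exact value
print(len(soup.find_all('a', class_='sister')))    # 1

# class_ with a regular expression
print([t['class'] for t in soup.find_all(class_=re.compile('^sis'))])
# [['sister']]
```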
When class is a multi-valued attribute, here is the official example:
>>>css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
>>>css_soup.find_all("p", class_="strikeout")
[<p class="body strikeout"></p>]
>>>css_soup.find_all("p", class_="body")
[<p class="body strikeout"></p>]
>>>css_soup.find_all("p", class_="body strikeout")
[<p class="body strikeout"></p>]
When matching the exact string value of class, the search finds nothing if the CSS class names are in a different order than in the document. We can also search on an attribute's string value through attrs; again the official example:
>>>soup.find_all("a", attrs={"class": "sister"})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
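A sketch of the ordering pitfall, plus a CSS selector via select(), which does not care about order:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')

# Exact string match fails when the class names are reordered:
print(css_soup.find_all('p', class_='strikeout body'))  # []

# A CSS selector matches regardless of order:
print(css_soup.select('p.strikeout.body'))
# [<p class="body strikeout"></p>]
```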
Note that the return type here is <class 'bs4.element.ResultSet'>: it behaves like a list of Tag objects, and each Tag supports dict-style access to its attributes. For example:
>>>css_soup = soup.find_all("a", attrs={"class": "sister"})
>>>print(css_soup[0]['href'])
http://example.com/elsie