First, a link to the official documentation.
Beautiful Soup supports multiple parsers; here we use lxml, which is also the parser the official documentation recommends.
Let's start with the kinds of objects Beautiful Soup exposes:
1. Tag
>>>from bs4 import BeautifulSoup
>>>soup = BeautifulSoup(html_doc, 'lxml')
>>>print(soup.p)
<p class="title"><b>The Dormouse's story</b></p>
From a Tag we can get:
Tag.name — the name of the tag.
Tag.attrs — the tag's attributes as a dictionary (attributes can be added, deleted, or modified with the usual dict operations).
>>>soup.p.attrs
{'class': ['title']}
>>>soup.p.attrs['class']
['title']
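Since attributes behave like a dict, here is a quick sketch of adding, changing, and deleting them (the markup is my own minimal example, not from the document above):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title">text</p>', 'lxml')
tag = soup.p
tag['id'] = 'intro'          # add a new attribute
tag['class'] = ['headline']  # replace an existing one
del tag['class']             # delete an attribute entirely
print(tag.attrs)             # {'id': 'intro'}
```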
Tag.string — the tag's text content (the non-attribute string inside the tag).
>>>p.string
"The Dormouse's story"
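One caveat worth noting (my own addition, not covered above): .string only works when a tag has a single, unambiguous string inside it; with several children it returns None, and get_text() is the safer choice:

```python
from bs4 import BeautifulSoup

# A single nested string: .string recurses into <b> and finds it.
soup = BeautifulSoup("<p class=\"title\"><b>The Dormouse's story</b></p>", 'lxml')
print(soup.p.string)       # The Dormouse's story

# Multiple children: .string is ambiguous, so it returns None.
soup2 = BeautifulSoup('<p>one <b>two</b> three</p>', 'lxml')
print(soup2.p.string)      # None
print(soup2.p.get_text())  # one two three
```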
Comment — the comment portion of a document:
>>>markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>>soup = BeautifulSoup(markup, 'lxml')
>>>comment = soup.b.string
>>>type(comment)
<class 'bs4.element.Comment'>
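Because Comment is a subclass of NavigableString, an isinstance check lets us skip comments when extracting visible text — a small sketch:

```python
from bs4 import BeautifulSoup, Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string

# Filter out comment nodes before treating strings as visible text:
print(isinstance(comment, Comment))  # True
```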
2. Navigating the tree:
Tag.contents — a list of all direct children (len(Tag.contents) gives the number of children).
>>>soup.p.contents
[<b>The Dormouse's story</b>]
>>>type(soup.p.contents)
<class 'list'>
Tag.children — an iterator over the direct children, for use in a loop:
>>>for i in soup.body.children:
...    print(repr(i))
'\n'
<p class="title"><b>The Dormouse's story</b></p>
'\n'
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'\n'
<p class="story">...</p>
'\n'
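The '\n' entries above are whitespace text nodes that the parser keeps. A sketch (against a trimmed-down version of html_doc) of keeping only the real tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, 'lxml')

# Keep only Tag children, dropping the '\n' NavigableString nodes:
tags = [child for child in soup.body.children if isinstance(child, Tag)]
print([t.name for t in tags])  # ['p', 'p']
```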
Tag.descendants — an iterator over all descendants, for use in a loop:
>>>for i in soup.body.descendants:
...    print(repr(i))
'\n'
<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
"The Dormouse's story"
'\n'
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'Once upon a time there were three little sisters; and their names were\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Elsie'
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
'Lacie'
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
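When all we want from those descendants is the text, .strings and .stripped_strings iterate over the string nodes directly — a sketch with a shortened document of my own:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Once upon a time <a>Elsie</a>,\n<a>Lacie</a></p>', 'lxml')

# .stripped_strings skips whitespace-only strings and trims the rest:
print(list(soup.p.stripped_strings))
# ['Once upon a time', 'Elsie', ',', 'Lacie']
```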
We can also traverse upward.
Tag.parent — the node's parent tag:
>>>soup.title.parent
<head><title>The Dormouse's story</title></head>
Tag.parents — all of the node's ancestors, for use in a loop:
# the example from the official documentation
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
if parent is None:
print(parent)
else:
print(parent.name)
# p
# body
# html
# [document]
# None
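The loop above can be condensed into a list comprehension; a sketch with a minimal document of my own (the [document] name belongs to the BeautifulSoup object itself):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p><a id="link1">Elsie</a></p></body></html>', 'lxml')

# Walk from <a> up to the root; the final parent is the soup object:
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```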
Sibling traversal; let's look at the official example:
>>>sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>", 'lxml')
>>>print(sibling_soup.prettify())
# <html>
# <body>
# <a>
# <b>
# text1
# </b>
# <c>
# text2
# </c>
# </a>
# </body>
# </html>
Here, b and c are sibling tags:
>>>sibling_soup.b.next_sibling
# <c>text2</c>
>>>sibling_soup.c.previous_sibling
# <b>text1</b>
If a tag is the first among its siblings, its previous_sibling is None; likewise, the last sibling's next_sibling is None.
As before, we can loop with .next_siblings and .previous_siblings.
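A sketch of both points — iterating with .next_siblings, and the None returned at either edge (I extended the official markup with a <d> tag):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a><b>text1</b><c>text2</c><d>text3</d></a>', 'lxml')

# Iterate over everything after <b> on the same level:
print([sib.name for sib in soup.b.next_siblings])  # ['c', 'd']

# The first sibling has nothing before it, the last nothing after:
print(soup.b.previous_sibling)  # None
print(soup.d.next_sibling)      # None
```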
.next_element — the next object in parse order.
.previous_element — the previous object in parse order.
.next_elements — an iterator over everything parsed after this point, for use in a loop.
.previous_elements — an iterator over everything parsed before this point, for use in a loop.
Their usage is similar to .previous_sibling(s)/.next_sibling(s), so we won't go into more detail here.
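The key difference from .next_sibling is that .next_element follows parse order, so it descends into children first — a minimal sketch reusing the official sibling markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a><b>text1</b><c>text2</c></a>', 'lxml')

print(soup.b.next_sibling)  # <c>text2</c>  (same level)
print(soup.b.next_element)  # text1         (the string parsed right after <b>)
```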
find_all()
We can use find_all('a') directly to find all the <a> tags:
>>>soup = BeautifulSoup(html_doc, 'lxml')
>>>for link in soup.find_all('a'):
...    print(link)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
We can also pass a list of all the tags we want to search for:
>>>results = soup.find_all(['a','b'])
The result then contains both <a> and <b> tags.
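A minimal sketch of the list form (the markup is my own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>bold</b><a href="#">link</a><i>italic</i>', 'lxml')

# Results come back in document order, mixing both tag types:
results = soup.find_all(['a', 'b'])
print([tag.name for tag in results])  # ['b', 'a']
```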
find_all() also supports several other kinds of searches:
soup.find_all(keyword='...') — keyword=True works too; for example, href=True returns every tag that has an href attribute.
soup.find_all(tag, class_='...') — class is a reserved word in Python, so bs4 uses class_ instead.
soup.find_all(class_=re.compile(...)) — regular expressions are supported here as well.
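A sketch of all three forms against a small document of my own:

```python
import re
from bs4 import BeautifulSoup

html = ('<p class="title">story</p>'
        '<a class="sister" href="http://example.com/elsie">Elsie</a>')
soup = BeautifulSoup(html, 'lxml')

# keyword=True: every tag that has an href attribute at all
print([t.name for t in soup.find_all(href=True)])  # ['a']

# class_ with an exact value
print(len(soup.find_all('a', class_='sister')))    # 1

# class_ with a regular expression
print([t['class'] for t in soup.find_all(class_=re.compile('^sis'))])
# [['sister']]
```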
When class is a multi-valued attribute, here is the official example:
>>>css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
>>>css_soup.find_all("p", class_="strikeout")
[<p class="body strikeout"></p>]
>>>css_soup.find_all("p", class_="body")
[<p class="body strikeout"></p>]
>>>css_soup.find_all("p", class_="body strikeout")
[<p class="body strikeout"></p>]
When matching the exact string value of class, the search finds nothing if the CSS class names are in a different order than in the document. We can also search on an attribute's string value through attrs; again the official example:
>>>soup.find_all("a", attrs={"class": "sister"})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
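A sketch of the ordering pitfall, plus a CSS selector via select(), which does not care about order:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')

# Exact string match fails when the class names are reordered:
print(css_soup.find_all('p', class_='strikeout body'))  # []

# A CSS selector matches regardless of order:
print(css_soup.select('p.strikeout.body'))
# [<p class="body strikeout"></p>]
```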
Note that the return type here is <class 'bs4.element.ResultSet'>: it behaves like a list of Tag objects, and each Tag supports dict-style access to its attributes. For example:
>>>css_soup = soup.find_all("a", attrs={"class": "sister"})
>>>print(css_soup[0]['href'])
http://example.com/elsie