Python Web Scraping, Part 2: The BeautifulSoup Class in the bs4 Library

Reposted from: https://zhuanlan.zhihu.com/p/26701898

# -*- coding: utf-8 -*-
# Import the BeautifulSoup class from the bs4 library
from bs4 import BeautifulSoup
html="html<html><head><title>The Dormouse's story</title></head><body><p class='title'><b>The Dormouse's story</b></p><a class='sister' href='http://example.com/lacie' id='link1'> Lacie </a><a class='sister' href='http://example.com/lacie' id='link2'> Lacie </a><a class='sister' href='http://example.com/lacie' id='link3'> Lacie </a>"
soup = BeautifulSoup(html,'html.parser')
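print(soup.prettify())
'''
As the output shows, bs4 turns the HTML document into a BeautifulSoup object (soup).

In fact, bs4 is a library for parsing, traversing, and maintaining the "tag tree".

Put simply, bs4 reorganizes the HTML source code into a structured form

so that its nodes, tags, and attributes are easy to work with.

For example, a tag such as this one becomes a node in the tree:

 <a class='sister' href='http://example.com/lacie' id='link2'> Lacie </a>
'''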
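
The "tag tree" mentioned above can be walked directly through a few navigation attributes. The following is a minimal sketch, assuming a small made-up markup string (`sample`) rather than the string used in this post; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, kept on one line so no whitespace-only
# text nodes appear between the tags.
sample = ("<html><head><title>Demo</title></head>"
          "<body><p class='title'><b>Hello</b></p>"
          "<a id='link1' href='http://example.com/a'>A</a></body></html>")
tree = BeautifulSoup(sample, "html.parser")

# .children yields the direct child nodes of a tag.
child_names = [child.name for child in tree.body.children]
print(child_names)  # ['p', 'a']

# .parent moves one level up the tree.
print(tree.b.parent.name)  # p

# .next_sibling moves sideways to the next node at the same level.
print(tree.p.next_sibling.name)  # a

# .descendants walks every node below a tag, including text nodes,
# so keep only the Tag objects here.
descendant_tags = [node.name for node in tree.head.descendants
                   if isinstance(node, Tag)]
print(descendant_tags)  # ['title']
```

In real pages the markup usually contains line breaks between tags, so `.children` and `.next_sibling` may also return whitespace strings; filtering with `isinstance(node, Tag)` is one way to skip them.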

# Find the document's <title> tag
print(soup.title)  # <title>The Dormouse's story</title>

# Get the tag name of the title
print(soup.title.name)  # title

# Get the string inside the title
print(soup.title.string)  # The Dormouse's story

# Get the tag name of the title's parent node
print(soup.title.parent.name)  # head

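# Get the first paragraph in the document
print(soup.p)  # <p class="title"><b>The Dormouse's story</b></p>

# Get the class attribute of that <p> (class is multi-valued, so a list is returned)
print(soup.p['class'])  # ['title'] (shown as [u'title'] on Python 2)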

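# Get the first <a> tag in the document
print(soup.a)  # <a class="sister" href="http://example.com/lacie" id="link1"> Lacie </a>

# Get all <a> tags in the document (returns a list of all three)
print(soup.find_all('a'))  # [<a class="sister" href="http://example.com/lacie" id="link1"> Lacie </a>, <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a>, <a class="sister" href="http://example.com/lacie" id="link3"> Lacie </a>]

# Find the tag whose id is link2
print(soup.find(id='link2'))  # <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a>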



# Advanced bs4 usage
# 1. Get the href of every <a> tag in the HTML

for link in soup.find_all('a'):
    print(link.get("href"))

'''
http://example.com/lacie
http://example.com/lacie
http://example.com/lacie
'''
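
The link loop above can be narrowed with keyword filters on `find_all`. The sketch below assumes a made-up markup string (`page`) with placeholder URLs and ids, including one `<a>` without an `href`, to show how `get` and the filters behave.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the URLs and ids are placeholders for illustration.
page = ("<body>"
        "<a class='sister' href='http://example.com/elsie' id='link1'>Elsie</a>"
        "<a class='sister' href='http://example.com/lacie' id='link2'>Lacie</a>"
        "<a id='top'>no href here</a>"
        "</body>")
doc = BeautifulSoup(page, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in doc.find_all("a", href=True)]
print(hrefs)  # ['http://example.com/elsie', 'http://example.com/lacie']

# class_ filters by CSS class (the trailing underscore avoids the
# Python keyword "class").
sister_ids = [a["id"] for a in doc.find_all("a", class_="sister")]
print(sister_ids)  # ['link1', 'link2']

# .get() returns None for a missing attribute instead of raising KeyError.
missing = doc.find("a", id="top").get("href")
print(missing)  # None
```

Using `link.get("href")` as in the loop above is the safer choice when some anchors may lack the attribute, whereas `link["href"]` would raise `KeyError` for them.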

# 2. Get all the text content of the document
print(soup.get_text())  # htmlThe Dormouse's storyThe Dormouse's story Lacie  Lacie  Lacie
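
The output of `get_text()` above runs the pieces of text together and keeps their surrounding spaces. One possible way to tidy this up, sketched here on a shortened, made-up fragment (`snippet`), is to pass `separator` and `strip`, or to iterate over `stripped_strings`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, modeled loosely on the markup used in this post.
snippet = ("<p><b>The Dormouse's story</b></p>"
           "<a href='http://example.com/lacie'> Lacie </a>")
frag = BeautifulSoup(snippet, "html.parser")

# Default behavior: text nodes are concatenated exactly as they appear.
joined = frag.get_text()
print(repr(joined))  # "The Dormouse's story Lacie "

# separator is placed between text nodes; strip=True trims each node
# and drops nodes that become empty.
clean = frag.get_text(separator="\n", strip=True)
print(repr(clean))  # "The Dormouse's story\nLacie"

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(frag.stripped_strings)
print(pieces)  # ["The Dormouse's story", 'Lacie']
```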
