Using Beautiful Soup

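1. Installation

pip install bs4 or pip install beautifulsoup4

The examples below use the lxml parser, which is installed separately: pip install lxml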

2. Usage

  1. Creating a Beautiful Soup object

from bs4 import BeautifulSoup
soup = BeautifulSoup(str, 'lxml')  # str is defined in the test code below
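The second argument to the constructor names the parser. As a minimal sketch, assuming a small made-up HTML string (html_doc is a hypothetical name, not from the test code) and Python's built-in html.parser so that it runs even without lxml installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html_doc = '<title id="title">西邮</title><div class="info">welcome to xupt</div>'

# 'html.parser' ships with Python; 'lxml' requires `pip install lxml`.
soup = BeautifulSoup(html_doc, 'html.parser')

title_text = soup.title.string
div_text = soup.div.string
print(title_text)
print(div_text)
```

Different parsers can build slightly different trees from the same markup (lxml, for example, wraps a fragment in html and body tags), so it is worth sticking to one parser within a project.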

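  2. The four kinds of objects
    Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:
  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment
2.1 Tag
    A Tag is simply a tag in the HTML, such as the title and div tags in the test code below. Note: when several identical tags exist, accessing the tag by name (for example soup.div) returns only the first one that matches.
  • Getting a tag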

soup = BeautifulSoup(str, 'lxml')
print(soup.title)
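To illustrate the "first match only" behavior, here is a small sketch with a hypothetical two-div document (html_doc is an assumed name). Attribute-style access behaves like find(), while find_all() returns every match:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two identical tags.
html_doc = '<div class="info">first</div><div class="info">second</div>'
soup = BeautifulSoup(html_doc, 'html.parser')

first_div = soup.div              # only the first <div>
same_div = soup.find('div')       # equivalent lookup
all_divs = soup.find_all('div')   # every <div>, in document order

print(first_div.string)
print([d.string for d in all_divs])
```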

  • Getting tag attributes

print(soup.div.attrs)
print(soup.div.get('class'))
print(soup.div['class'])
print(soup.a['href'])
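The two access styles differ when an attribute is missing. A minimal sketch, assuming a hypothetical one-div document: indexing raises KeyError, while get() returns None. It also shows that class is multi-valued and therefore comes back as a list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html_doc = '<div class="info box" float="left">welcome</div>'
soup = BeautifulSoup(html_doc, 'html.parser')
div = soup.div

classes = div['class']        # multi-valued attribute: a list of strings
float_value = div['float']    # ordinary attribute: a plain string
missing = div.get('id')       # absent attribute: get() returns None

try:
    div['id']                 # absent attribute: indexing raises KeyError
    raised = False
except KeyError:
    raised = True

print(classes, float_value, missing, raised)
```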

2.2 NavigableString: getting the content

print(soup.strong.string)
print(soup.strong.text)
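.string and .text are not interchangeable. The sketch below, on a hypothetical snippet, shows the difference: .string is None when a tag has more than one child, while .text joins all the text beneath the tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html_doc = '<p><b>Good</b> Study</p><div><span>only</span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')

b_string = soup.b.string      # single text child
p_string = soup.p.string      # two children (<b> and text), so None
p_text = soup.p.text          # all nested text joined together
div_string = soup.div.string  # a single child tag is followed down to its text

print(repr(b_string), repr(p_string), repr(p_text), repr(div_string))
```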

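2.3 BeautifulSoup
The BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object: it supports traversal, searching, and similar methods.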
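A short sketch of what "treat it as a Tag" means in practice, using a hypothetical two-tag document: the BeautifulSoup object is itself a Tag subclass with the special name '[document]', and it can be traversed and searched directly:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration.
html_doc = '<title>西邮</title><div>welcome</div>'
soup = BeautifulSoup(html_doc, 'html.parser')

is_tag = isinstance(soup, Tag)                       # BeautifulSoup subclasses Tag
doc_name = soup.name                                 # special name of the root object
child_names = [child.name for child in soup.children]  # traversal
found = soup.find('div').string                      # searching

print(is_tag, doc_name, child_names, found)
```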

2.4 Comment

from bs4.element import Comment

if isinstance(soup.strong.string, Comment):
    # print(soup.strong.text)
    print(soup.strong.prettify())
else:
    print(type(soup.strong.string))
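The check matters because .string returns a comment's content without the <!-- --> markers, so it can be mistaken for ordinary text. A minimal sketch, assuming a hypothetical snippet and a helper named is_comment (not part of the library), that tells the two apart and collects every comment in the document:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup for illustration.
html_doc = '<strong><!--a comment--></strong><b>plain text</b>'
soup = BeautifulSoup(html_doc, 'html.parser')

def is_comment(node):
    """Hypothetical helper: True if the node is an HTML comment."""
    return isinstance(node, Comment)

strong_is_comment = is_comment(soup.strong.string)
b_is_comment = is_comment(soup.b.string)

# .text skips comments, so a tag holding only a comment yields ''.
strong_text = soup.strong.text

# find_all accepts a function as the string filter.
comments = soup.find_all(string=is_comment)

print(strong_is_comment, b_is_comment, repr(strong_text), comments)
```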

  3. Searching the document tree
    3.1 The find_all() filter
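find_all() accepts several kinds of filters. The sketch below, on a hypothetical document modeled on the test code, shows a tag name, a list of names, a regular expression, an attrs dictionary, and the limit argument:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html_doc = '''
<title id="title">西邮</title>
<div class="info" float="left">welcome to xupt</div>
<div class="info" float="right"><span>Good Study</span></div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

by_name = soup.find_all('div')                             # a tag name
by_list = soup.find_all(['title', 'span'])                 # any name in the list
by_regex = soup.find_all(re.compile('^s'))                 # names starting with "s"
by_attr = soup.find_all('div', attrs={'float': 'right'})   # name plus attribute
limited = soup.find_all('div', limit=1)                    # stop after one match

print(len(by_name), [t.name for t in by_list], [t.name for t in by_regex])
print(by_attr[0].span.string, len(limited))
```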

  4. Code

from bs4 import BeautifulSoup
from bs4.element import Comment
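# Sample HTML for illustration only. In practice Beautiful Soup is usually fed
# scraped pages, so there is no need to write the markup yourself.
str = '''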
<title id="title">西邮</title>
<div class="info" float="left">welcome to xupt</div>
<div class="info" float="right">
   <span>Good Study</span>
   <a href="www.baidu.com"></a>
   <strong><!--注释!--></strong>
</div>
'''
soup=BeautifulSoup(str,'lxml')
print('--------------test--------------')  # the comments below show the output
print(soup.title)  # <title id="title">西邮</title>
print(soup.div)  # <div class="info" float="left">welcome to xupt</div>
print(soup.div.attrs)  # {'class': ['info'], 'float': 'left'}
print(soup.div.get('class'))  # ['info']
print(soup.div['class'])  # ['info']
print(soup.div.text)  # welcome to xupt
print(soup.div.string)  # welcome to xupt

print(soup.a['href'])  # www.baidu.com

print(soup.strong.string)  # 注释!
print(type(soup.strong.string))  # <class 'bs4.element.Comment'>

if isinstance(soup.strong.string, Comment):
    # print(soup.strong.text)
    print(soup.strong.prettify())
else:
    print(type(soup.strong.string))
# Output:
# <strong>
#  <!--注释!-->
# </strong>
print('------------------find_all------------------')
print(soup.find_all('title'))  # [<title id="title">西邮</title>]
print(soup.find_all(id='title'))  # [<title id="title">西邮</title>]
print(soup.find_all(class_='info'))
# Output:
# [<div class="info" float="left">welcome to xupt</div>, <div class="info" float="right">
#    <span>Good Study</span>
#    <a href="www.baidu.com"></a>
#    <strong><!--注释!--></strong>
# </div>]
print(soup.find_all("div", attrs={'float': 'left'}))  # [<div class="info" float="left">welcome to xupt</div>]
print('-------------------select()--------------------')  # CSS selectors
print(soup.select('title'))  # [<title id="title">西邮</title>]
print(soup.select('#title'))  # [<title id="title">西邮</title>]
print(soup.select('.info'))
# Output:
# [<div class="info" float="left">welcome to xupt</div>, <div class="info" float="right">
#    <span>Good Study</span>
#    <a href="www.baidu.com"></a>
#    <strong><!--注释!--></strong>
# </div>]
print(soup.select('div span'))  # [<span>Good Study</span>]
print(soup.select('div > span'))  # [<span>Good Study</span>]
print(soup.select('div')[1].select('a'))  # [<a href="www.baidu.com"></a>]
print(soup.select('title')[0].text)  # 西邮
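select() always returns a list, which is why the last line above indexes with [0]. As a hedged sketch on a hypothetical document, select_one() returns the first match directly (or None), and attribute selectors can replace the attrs dictionary used with find_all():

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the test code above.
html_doc = '''
<div class="info" float="left">welcome to xupt</div>
<div class="info" float="right"><span>Good Study</span></div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

first_info = soup.select_one('.info')            # first match, not a list
right_divs = soup.select('div[float="right"]')   # CSS attribute selector
no_match = soup.select_one('p')                  # None when nothing matches

print(first_info.string, len(right_divs), no_match)
```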


Reposted from blog.csdn.net/qq_41386300/article/details/83386570