Python Web Scraping (6): The BeautifulSoup Library

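Concept

BeautifulSoup is a Python library for parsing HTML and XML documents and extracting data from the resulting tree.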

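Installation

On the command line, run: pip install beautifulsoup4
The examples below use the lxml parser, which is a separate package: pip install lxml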

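Parsers supported by BeautifulSoup

Python standard library: BeautifulSoup(markup, 'html.parser'), built in, no extra dependency
lxml HTML parser: BeautifulSoup(markup, 'lxml'), fast and lenient, requires lxml
lxml XML parser: BeautifulSoup(markup, 'xml'), the XML parser of the group, requires lxml
html5lib: BeautifulSoup(markup, 'html5lib'), parses pages the way a browser does, slow, requires html5lib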

Basic usage

from bs4 import BeautifulSoup
html='''
<html><head><title>The Dormousae's story</title></head>
<body>
<p class="title" name="drimouse"><b>The Dormousae's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Tillie</a>;
and they lived at the boottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # print the parsed document with standard indentation
print(soup.title.string)  # The Dormousae's story

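Note that html is not a complete HTML string: the body and html tags are never closed. soup = BeautifulSoup(html, 'lxml') initializes the BeautifulSoup object, and the parser fills in the missing closing tags while building the tree. soup.prettify() outputs the parsed string with standard indentation, and soup.title.string prints the text content of the title node.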
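As a minimal sketch of the tag completion described above, the snippet below parses a deliberately truncated fragment. It uses Python's built-in html.parser rather than lxml so that it runs without extra packages; that choice, the fragment, and the variable names are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# A deliberately incomplete fragment: <p> and <b> are never closed.
broken = "<p class='title'><b>Hello"

# html.parser ships with Python; the article's own examples use 'lxml'.
soup = BeautifulSoup(broken, "html.parser")

# BeautifulSoup closes the dangling tags while building the tree.
repaired = str(soup)
print(repaired)
print(soup.b.string)
```

Unlike lxml, html.parser does not wrap the fragment in html and body tags, so the repaired output stays close to the input.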

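Tag selectors

Selecting elements:
# html is the same as above
soup = BeautifulSoup(html, 'lxml')
print(soup.title)        # the title tag together with its content
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)         # the head tag together with its content
print(soup.p)            # only the first p node and its content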

Getting the name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # prints the node name: title
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs)          # {'class': ['title'], 'name': 'drimouse'}
print(soup.p.attrs['name'])  # drimouse
print(soup.p['name'])        # drimouse
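The attrs output above shows class as a list and name as a plain string. The sketch below, using an illustrative one-line snippet and html.parser (both assumptions for this example), shows that difference and two ways to handle an attribute that may be absent.

```python
from bs4 import BeautifulSoup

snippet = '<p class="title main" name="drimouse">text</p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

classes = p["class"]             # multi-valued attribute: returned as a list
name = p["name"]                 # ordinary attribute: returned as a str
missing = p.get("id")            # .get() returns None instead of raising KeyError
fallback = p.get("id", "no-id")  # .get() also accepts a default value
has_name = p.has_attr("name")    # explicit membership test

print(classes, name, missing, fallback, has_name)
```

Indexing with p['id'] would raise KeyError here, so .get() is the safer choice when a page's markup is not fully known.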
Getting the content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)  # The Dormousae's story
Nested selection:
# each selection returns a Tag, so selections can be chained
print(soup.title.string)
print(soup.head.title.string)
print(soup.head.title)
print(type(soup.head.title))
print(type(soup.head.title.string))
# Output, in order:
# The Dormousae's story
# The Dormousae's story
# <title>The Dormousae's story</title>
# <class 'bs4.element.Tag'>
# <class 'bs4.element.NavigableString'>
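One detail worth checking with .string: the first a tag in the sample HTML contains only a comment (<!--Elsie-->). The sketch below, on a shortened illustrative snippet parsed with html.parser (assumptions for this example), shows that .string then returns a Comment object, while get_text() skips comments.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

snippet = '<p><a id="link1"><!--Elsie--></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
first, second = soup.find_all("a")

# Comment is a subclass of NavigableString, so .string returns it as well.
is_comment = isinstance(first.string, Comment)
is_plain = (isinstance(second.string, NavigableString)
            and not isinstance(second.string, Comment))

# get_text() ignores comments, so the first link has no visible text.
first_text = first.get_text()
second_text = second.get_text()

print(is_comment, is_plain, repr(first_text), repr(second_text))
```

Checking isinstance(tag.string, Comment) is one way to avoid treating commented-out markup as page text.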
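Related selection:

Sometimes the target node cannot be selected in one step. In that case, select a node first and then use it as a base to reach its child, parent, or sibling nodes.
(1) Child and descendant nodes: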

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # direct child nodes, as a list
# [<b>The Dormousae's story</b>]

Method 2:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # an iterator, not a list
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:
<list_iterator object at 0x000001BABACB9EF0>
0 <b>The Dormousae's story</b>

Descendant nodes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # a generator over all descendant nodes
for i, child in enumerate(soup.p.descendants):
    print(i, child)
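To make the difference between children and descendants concrete, here is a small sketch on an illustrative snippet (the snippet, html.parser, and the variable names are assumptions for this example). Children are one level deep; descendants walk the whole subtree, text nodes included.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = ('<p class="story">Once <a id="link1"><span>Elsie</span></a>'
           ' and <a id="link2">Lacie</a></p>')
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

# Direct children: two text nodes and two <a> tags.
direct_count = len(p.contents)
child_tags = [c.name for c in p.children if isinstance(c, Tag)]

# Descendants: every nested tag and text node, in document order.
descendant_count = len(list(p.descendants))
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(direct_count, child_tags)
print(descendant_count, descendant_tags)
```

Filtering with isinstance(node, Tag) is a simple way to skip the text nodes that both iterators yield.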

(2) Parent and ancestor nodes

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)                    # the direct parent node
print(soup.a.parents)                   # a generator over all ancestors
print(list(enumerate(soup.a.parents)))  # all ancestor nodes, nearest first
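Printing every ancestor produces a lot of output, because each ancestor includes its whole subtree. A lighter sketch is to print only the ancestor names; the snippet, html.parser, and variable names below are assumptions for this example.

```python
from bs4 import BeautifulSoup

snippet = ("<html><body><p class='story'>"
           "<a id='link1'>Elsie</a></p></body></html>")
soup = BeautifulSoup(snippet, "html.parser")

# Walk upward from the first <a>; the last ancestor is the soup object itself.
names = [parent.name for parent in soup.a.parents]
direct_parent = soup.a.parent.name

# find_parent() jumps straight to the nearest ancestor with a given name.
nearest_body = soup.a.find_parent("body")

print(names)
print(direct_parent, nearest_body.name)
```

The final entry, '[document]', is the name of the BeautifulSoup object at the root of the tree.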

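(3) Sibling nodes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # siblings after the first a node
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the first a node

Output:
[(0, ',\n'), (1, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>), (2, ' and\n'), (3, <a href="http://example.com/title" class="sister" id="link3">Tillie</a>), (4, ';\nand they lived at the boottom of a well.')]

[(0, 'Once upon a time there were three little sisters;and their names were\n')]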
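As the output shows, next_siblings mixes text nodes with tags. When only the sibling tags matter, find_next_siblings() and its relatives filter by tag name. The following is a sketch on a shortened snippet parsed with html.parser (assumptions for this example).

```python
from bs4 import BeautifulSoup

snippet = ('<p>Names: <a id="link1">Elsie</a>,\n'
           '<a id="link2">Lacie</a> and\n'
           '<a id="link3">Tillie</a>;</p>')
soup = BeautifulSoup(snippet, "html.parser")
first = soup.a

# Raw siblings include the text between the links.
raw_count = len(list(first.next_siblings))

# Filtered siblings: only the following <a> tags.
later_ids = [a["id"] for a in first.find_next_siblings("a")]
next_id = first.find_next_sibling("a")["id"]
prev_id = soup.find(id="link3").find_previous_sibling("a")["id"]

print(raw_count, later_ids, next_id, prev_id)
```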

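Method selectors:

Everything above selects nodes through attributes. That is fast, but it becomes cumbersome and inflexible for more complex queries. For those cases BeautifulSoup also provides the find_all() and find() methods.

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.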

html='''
<div class="panel">
	<div class="panel-heading">
		<h4>Hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
'''
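from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # a list of every ul element
print(type(soup.find_all('ul')[0]))  # each item is a bs4.element.Tag
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))         # find_all can be called again on a Tag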

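Output: the first print shows a list containing both ul elements, the second shows <class 'bs4.element.Tag'>, and the loop prints the li elements of each list:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]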
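A few other find_all() arguments are useful in practice. The sketch below reuses a compact copy of the panel HTML and parses it with html.parser (the compact copy and variable names are assumptions for this example); it shows limit, a list of tag names, and recursive=False.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li><li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# limit stops the search after the given number of matches.
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]

# A list of names matches any of them, in document order.
mixed = [tag.name for tag in soup.find_all(["h4", "ul"])]

# recursive=False only looks at direct children of the tag being searched.
panel = soup.find("div", class_="panel")
direct_divs = [d["class"][0] for d in panel.find_all("div", recursive=False)]

print(first_two, mixed, direct_divs)
```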
The attrs parameter:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'class': 'element'}))

This is equivalent to:

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so the argument is spelled class_
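Two points about attribute matching are easy to miss: class_ matches any one of an element's classes, and attribute names that are not valid Python identifiers (such as data-role) must go through the attrs dictionary. The sketch below uses an illustrative snippet; the data-role attribute is hypothetical and not part of the article's HTML.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1" data-role="menu"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
'''
soup = BeautifulSoup(html, "html.parser")

# class_ matches if the value is one of the element's classes.
by_class = [ul["id"] for ul in soup.find_all("ul", class_="list")]
small_only = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]

# Several attributes can be combined in one attrs dictionary.
combined = [ul["id"] for ul in soup.find_all("ul", attrs={"class": "list", "id": "list-2"})]

# data-role cannot be written as a keyword argument, so attrs is required.
by_data = [ul["id"] for ul in soup.find_all(attrs={"data-role": "menu"})]

print(by_class, small_only, combined, by_data)
```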

The text parameter

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # ['Foo', 'Foo']; newer versions of BeautifulSoup also accept string='Foo'
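Text matching is not limited to exact strings. The sketch below assumes a BeautifulSoup version that accepts the string argument (the newer spelling of text) and uses a compact illustrative list; it shows an exact match, a regular expression, and combining a tag name with a text filter.

```python
import re

from bs4 import BeautifulSoup

html = '''
<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul id="list-2"><li>Foo</li><li>Bar</li></ul>
'''
soup = BeautifulSoup(html, "html.parser")

# Exact match: returns the matching text nodes, not the tags.
exact = soup.find_all(string="Foo")

# A compiled regular expression matches text nodes by pattern.
starts_with_ba = soup.find_all(string=re.compile(r"^Ba"))

# With a tag name, the result is the tags whose text matches.
jay_tags = soup.find_all("li", string="Jay")

print(exact, starts_with_ba, jay_tags)
```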
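The find method

find(name, attrs, recursive, text, **kwargs)
find returns the first matching element, or None if nothing matches; find_all returns a list of all matching elements.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))        # the first ul element
print(type(soup.find('ul')))  # <class 'bs4.element.Tag'>
print(soup.find('page'))      # None, because there is no page tag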
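Because find() returns None when nothing matches, chaining an attribute access onto the result can raise AttributeError. One possible guard is sketched below; text_or_default is a hypothetical helper written for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, name, default=""):
    """Return the stripped text of the first `name` tag, or `default` if absent."""
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else default


html = "<div><h4> Hello </h4></div>"
soup = BeautifulSoup(html, "html.parser")

heading = text_or_default(soup, "h4")
missing = text_or_default(soup, "page", default="(not found)")

print(heading, missing)
```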

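CSS selectors

Pass a CSS selector directly to select() to make a selection.
(1) Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # list-1, then list-2
    print(ul.attrs['id'])  # same value, read through attrs

(2) Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())   # the text of each li element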
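select() accepts more than bare tag names. The sketch below, on a compact copy of the list HTML parsed with html.parser (assumptions for this example), combines id, class, and child selectors and uses select_one() for a single result. It assumes a BeautifulSoup release recent enough to ship with its soupsieve selector backend.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
  <ul class="list" id="list-1">
    <li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li><li class="element">Bar</li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: elements with class "element" inside #list-2.
in_second = [li.get_text() for li in soup.select("#list-2 .element")]

# Child selector: li elements that are direct children of any ul.
all_items = len(soup.select("ul > li"))

# select_one returns the first match, or None when nothing matches.
small_id = soup.select_one("ul.list-small")["id"]
no_table = soup.select_one("table")

print(in_second, all_items, small_id, no_table)
```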

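Summary:

Prefer the lxml parser; fall back to html.parser when necessary.
Tag selection (soup.title, soup.p) has weak filtering but is fast.
Use find() to match a single result and find_all() to match multiple results.
If you are familiar with CSS selectors, use select().
Remember the common ways to get attribute values and text.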

Reposted from blog.csdn.net/qq_45353823/article/details/104215426