Python Web Scraping: The BeautifulSoup Library

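I. Introduction

Beautiful Soup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers.

With it you can extract information from a page without writing regular expressions.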

II. Details

1. Parsers

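Parser | Usage | Advantages | Disadvantages
--- | --- | --- | ---
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; good tolerance of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; good tolerance of malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces HTML5-format documents | Slow; pure Python with no C extension, but a separate package to install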
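
As a rough illustration of the trade-offs in the table, the sketch below tries the parsers in a preferred order and falls back to the built-in one when an optional package is missing. The helper name make_soup and the preference order are assumptions made for this example; they are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; returns (soup, parser_name).
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello<b>world")
print(used)            # whichever parser was actually picked
print(soup.b.string)   # world (the unclosed <b> is still parsed)
```

Because html.parser ships with Python, the last entry in the tuple should always succeed.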

2. Basic usage

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/else" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
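from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parse with lxml (requires the lxml package)
print(soup.prettify())    # re-indented markup; missing closing tags are filled in
print(soup.title.string)  # The Dormouse's story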

3. Tag selectors

3.1 Selecting elements

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/else" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
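print(soup.title)        # <title>The Dormouse's story</title>
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)         # the whole <head> element
print(soup.p)            # only the first <p> element

If the document contains several tags with the same name, this style of access returns only the first one.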
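
To make the "first match only" behaviour concrete, here is a minimal sketch comparing attribute access with find() and find_all(). The three-paragraph markup is made up for this example, and html.parser is used only so the snippet runs without extra packages.

```python
from bs4 import BeautifulSoup

html = '<p class="title">one</p><p class="story">two</p><p class="story">three</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.p               # attribute access: the first <p> only
same = soup.find("p")        # the equivalent explicit call
every = soup.find_all("p")   # all matching tags, in document order

print(first.string)                 # one
print(first is same)                # True
print([p.string for p in every])    # ['one', 'two', 'three']
```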

3.2 Getting the tag name

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/else" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)  # title

3.3 Getting attributes

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/else" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])  # dromouse (attrs is a dict of all attributes)
print(soup.p['name'])        # dromouse (shorthand for the same lookup)
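
Two details of attribute access are easy to miss: class comes back as a list, and indexing a missing attribute raises KeyError. A small sketch, using a made-up <a> tag and html.parser so that it runs without extra packages:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister big" id="link1" href="/x">X</a>', "html.parser")
a = soup.a

print(a["href"])            # /x
print(a["class"])           # ['sister', 'big'] (class is multi-valued, so a list)
print(a.get("title"))       # None (a["title"] would raise KeyError)
print(a.get("title", ""))   # empty string as an explicit default
print(sorted(a.attrs))      # ['class', 'href', 'id']
```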

3.4 Getting content

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/else" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)  # The Dormouse's story (text of the first <p>)
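
.string only works when a tag has a single child. The sketch below, on made-up markup, shows the cases where it returns None or a comment, and where get_text() may be the safer choice.

```python
from bs4 import BeautifulSoup, Comment

html = '<p>Hello <b>world</b></p><a href="#"><!--Elsie--></a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.b.string)       # world
print(soup.p.string)       # None: <p> has two children (a string and <b>)
print(soup.p.get_text())   # Hello world

link_text = soup.a.string
print(isinstance(link_text, Comment))  # True: the only child is a comment
```

This is why the first link in the Dormouse example prints Elsie even though that text sits inside an HTML comment.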

3.5 Nested selection

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/else" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)  # each access returns a Tag, so lookups can be chained

3.6 Child and descendant nodes

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/else" id="link1">
    <!--Elsie-->
   </a>
   <a class="sister" href="http://example.com/lacle" id="link2">
    Lacle
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
 """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)  # list of the direct children of the first <p>

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/else" id="link1">
    <!--Elsie-->
   </a>
   <a class="sister" href="http://example.com/lacle" id="link2">
    Lacle
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
 """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)  # an iterator over the direct children, not a list
for i, child in enumerate(soup.p.children):
    print(i, child)

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/else" id="link1">
    <!--Elsie-->
   </a>
   <a class="sister" href="http://example.com/lacle" id="link2">
    Lacle
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
 """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)  # a generator over all descendants
for i, child in enumerate(soup.p.descendants):  # children, grandchildren, and so on
    print(i, child)
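
The difference between the three attributes is easier to see on a tiny document. In this sketch (made-up markup, parsed with html.parser) note that text and whitespace between tags count as children too, so filtering by Tag is often useful.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>a<b>c</b>\n</p>", "html.parser")
p = soup.p

print(len(p.contents))            # 3: 'a', <b>, '\n'
print(len(list(p.descendants)))   # 4: 'a', <b>, 'c', '\n'

# Keep only element children, skipping strings and whitespace.
tags_only = [c.name for c in p.children if isinstance(c, Tag)]
print(tags_only)                  # ['b']
```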

3.7 Parent and ancestor nodes

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/else" id="link1">
    <!--Elsie-->
   </a>
   <a class="sister" href="http://example.com/lacle" id="link2">
    Lacle
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
 """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)  # the <p> element that directly contains the first <a>

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/else" id="link1">
    <!--Elsie-->
   </a>
   <a class="sister" href="http://example.com/lacle" id="link2">
    Lacle
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
 """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parents)))  # all ancestors, from the nearest outward
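
Printing whole ancestors produces a lot of output, so a sketch that prints only their names may be easier to read. The markup is made up for this example; the last entry is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>x</b></p></body></html>", "html.parser")
b = soup.b

print(b.parent.name)   # p
ancestors = [t.name for t in b.parents]
print(ancestors)       # ['p', 'body', 'html', '[document]']
```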

3.8 Sibling nodes

html = """
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/else" id="link1">
    <!--Elsie-->
   </a>
   <a class="sister" href="http://example.com/lacle" id="link2">
    Lacle
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
 """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))      # siblings after the first <a>
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the first <a>
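
Sibling iterators include the text between tags as well as the tags themselves. A minimal sketch, on made-up markup, of keeping only the element siblings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p><a id="link1">A</a> and <a id="link2">B</a><a id="link3">C</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw = list(first.next_siblings)
print(len(raw))     # 3: ' and ', <a id="link2">, <a id="link3">

following = [s["id"] for s in raw if isinstance(s, Tag)]
print(following)    # ['link2', 'link3']

# previous_siblings walks backwards, so the nearest sibling comes first.
nearest = list(soup.find(id="link3").previous_siblings)[0]
print(nearest["id"])  # link2
```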

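4. Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content and returns every match. In Beautiful Soup 4.4.0 and later the text argument is also available under the name string.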

html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))           # search by tag name
print(type(soup.find_all('ul')[0]))  # <class 'bs4.element.Tag'>

html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))  # each result is a Tag, so it can be searched again

html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1" name="elements">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))      # search by attribute with attrs
print(soup.find_all(attrs={'name': 'elements'}))  # name= as a keyword would mean the tag name, so use attrs

html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1" name="elements">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
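print(soup.find_all(id='list-1'))       # common attributes can be passed directly as keyword arguments
print(soup.find_all(class_='element'))  # class is a Python keyword, so the argument is class_
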
html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1" name="elements">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))  # search by text; returns the matching strings, not the tags
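
Beyond plain strings, find_all() also accepts a list of names, a compiled regular expression, and a limit. The sketch below shows these on a shortened version of the panel markup; html.parser is used only so that it runs without extra packages.

```python
import re
from bs4 import BeautifulSoup

html = """
<h4>hello</h4>
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

limited = soup.find_all("li", limit=2)             # stop after two matches
print(len(limited))                                # 2

names = [t.name for t in soup.find_all(["h4", "ul"])]   # any of several tag names
print(names)                                       # ['h4', 'ul', 'ul']

by_regex = [t["id"] for t in soup.find_all(id=re.compile(r"^list-"))]
print(by_regex)                                    # ['list-1', 'list-2']

# class_ matches if the value is one of the tag's classes.
small = [t["id"] for t in soup.find_all("ul", class_="list-small")]
print(small)                                       # ['list-2']
```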

find(name, attrs, recursive, text, **kwargs)

Returns a single element: the first one that find_all() would return.

html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1" name="elements">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))        # the first <ul> element
print(type(soup.find('ul')))  # <class 'bs4.element.Tag'>
print(soup.find('page'))      # None when nothing matches
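
Because find() returns None when nothing matches, chaining straight onto the result (for example soup.find('page').string) raises AttributeError. One way to guard against that is sketched below; the helper name text_of is an assumption made for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li></ul>", "html.parser")


def text_of(soup, name, default=""):
    """Return the text of the first <name> tag, or `default` if there is none."""
    tag = soup.find(name)
    return default if tag is None else tag.get_text(strip=True)


print(text_of(soup, "li"))            # Foo
print(text_of(soup, "page", "n/a"))   # n/a
```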
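
The following methods work the same way but search in other directions from the current node:

find_parents()            # returns all ancestor nodes
find_parent()             # returns the direct parent node
find_next_siblings()      # returns all following sibling nodes
find_next_sibling()       # returns the first following sibling node
find_previous_siblings()  # returns all preceding sibling nodes
find_previous_sibling()   # returns the first preceding sibling node
find_all_next()           # returns all matching nodes after this node
find_next()               # returns the first matching node after this node
find_all_previous()       # returns all matching nodes before this node
find_previous()           # returns the first matching node before this node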
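
A short sketch of several of these methods, starting from the <li>Bar</li> element of a made-up two-list document. Note how find_next_siblings() stays inside the same parent while find_all_next() continues into the rest of the document.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel-body">'
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Baz</li></ul>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find_all("li")[1]   # <li>Bar</li>

print(bar.find_parent("ul")["id"])               # list-1
print(bar.find_next_sibling("li").string)        # Jay
print(bar.find_previous_sibling("li").string)    # Foo

siblings_after = [li.string for li in bar.find_next_siblings("li")]
print(siblings_after)                            # ['Jay']

all_after = [li.string for li in bar.find_all_next("li")]
print(all_after)                                 # ['Jay', 'Baz']
```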

5. CSS selectors

Pass a CSS selector string to select() to make the selection directly.

html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1" name="elements">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
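print(soup.select('.panel .panel-heading'))  # descendant selector on class names
print(soup.select('ul li'))                  # every <li> inside a <ul>
print(soup.select('#list-2 .element'))       # .element nodes inside the node with id list-2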
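
A few more selector forms that are often useful, sketched on a shortened version of the panel markup: select_one() for a single result, the child combinator, and attribute selectors. html.parser is used only so that the snippet runs without extra packages.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
<ul class="list" id="list-1" name="elements"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.select_one("#list-1 > li")   # first match only, like find()
print(first_li.string)                       # Foo

small = [u["id"] for u in soup.select("ul.list-small")]
print(small)                                 # ['list-2']

named = [u["id"] for u in soup.select('ul[name="elements"]')]
print(named)                                 # ['list-1']

print(soup.select_one("table"))              # None when nothing matches
print(soup.select("table"))                  # [] when nothing matches
```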

html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1" name="elements">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))  # select() can be called again on each result

5.1 Getting attributes

html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1" name="elements">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # index the tag directly
    print(ul.attrs['id'])  # or go through the attrs dict

5.2 Getting content

html = """
<div class="panel">
	<div class="panel-heading">
		<h4>hello</h4>
	</div>
	<div class="panel-body">
		<ul class="list" id="list-1" name="elements">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
			<li class="element">Jay</li>
		</ul>
		<ul class="list list-small" id="list-2">
			<li class="element">Foo</li>
			<li class="element">Bar</li>
		</ul>
	</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())  # text content of each <li>
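
When get_text() is called on a container such as <ul>, the whitespace between child tags is included. A minimal sketch of the separator and strip arguments, which tidy this up, on made-up markup:

```python
from bs4 import BeautifulSoup

html = """
<ul id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
ul = soup.select_one("ul")

joined = ul.get_text(separator=", ", strip=True)
print(joined)                 # Foo, Bar

parts = list(ul.stripped_strings)   # the same strings as a list
print(parts)                  # ['Foo', 'Bar']
```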

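III. Summary

1. Prefer the lxml parser; fall back to html.parser when necessary.

2. Tag selectors (soup.title, soup.p) offer only weak filtering but are fast.

3. Use find() to match a single result and find_all() to match multiple results.

4. If you are familiar with CSS selectors, use select().

5. Remember the common ways to get attribute values (tag['id'], tag.attrs) and text (tag.string, tag.get_text()).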
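
Putting these points together, the sketch below collects the id, link, and text of each "sister" link from a fragment of the Dormouse markup. The function name extract_links is an assumption made for this example, and the third link deliberately has no href to show that get() returns None instead of raising an error.

```python
from bs4 import BeautifulSoup

html = """<p class="story">Once upon a time there were three little sisters;
<a href="http://example.com/else" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>
<a class="sister" id="link3">Tillie</a></p>"""


def extract_links(markup, parser="html.parser"):
    """Return one dict per <a class="sister"> tag (hypothetical helper)."""
    soup = BeautifulSoup(markup, parser)
    links = []
    for a in soup.select("a.sister"):
        links.append({
            "id": a.get("id"),
            "href": a.get("href"),            # None if the attribute is missing
            "text": a.get_text(strip=True),
        })
    return links


links = extract_links(html)
for link in links:
    print(link)
```

Passing parser="lxml" instead should give the same result here when lxml is installed.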


Reposted from blog.csdn.net/qq_38344394/article/details/80937316