1. Common parsers
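| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python, moderate speed, tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast, tolerant of malformed documents | Requires an external C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast, the only parser that supports XML | Requires an external C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant of malformed documents, parses the way a browser does, produces valid HTML5 | Very slow; requires installing the html5lib package (pure Python, no C extension) |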
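Since lxml and html5lib are optional installs, one possible way to pick a parser is to try the preferred ones in order and fall back to the built-in one. The sketch below assumes only that bs4 is installed; the helper name `make_soup` and the preference order are illustrative choices, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("no usable parser found")


# The <p> tag is never closed; every parser in the table repairs this.
soup, parser_used = make_soup("<p>Unclosed paragraph")
print(parser_used, soup.p.string)
```

Different parsers can build different trees from the same broken markup, so it is usually safer to fix one parser per project than to rely on a fallback in production code.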
2. Basic usage
    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
- Node selectors
    soup = BeautifulSoup(html, 'lxml')
    print(soup.a)  # select the <a> node

Output:

    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

Only the first matching node is returned; all later matches are ignored.
    print(soup.p.attrs)
    print(soup.p.attrs['class'])

Output:

    {'class': ['title'], 'name': 'dromouse'}
    ['title']

Or, indexing the node directly:

    print(soup.p['name'])
    print(soup.p['class'])

Output:

    dromouse
    ['title']
    print(soup.p.string)  # text of the first <p>: The Dormouse's story
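A few details of attribute and text access are easy to trip over, so here is a minimal sketch on a made-up one-line document (the markup and variable names are assumptions for illustration). It uses the built-in "html.parser" only so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

html = '<p class="intro lead" id="p1">Hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.name)          # tag name as a str
print(p["class"])      # class is multi-valued, so it comes back as a list
print(p["id"])         # single-valued attributes come back as a str
print(p.get("title"))  # .get() returns None instead of raising KeyError

# .string is None when a tag has more than one child node;
# get_text() concatenates all text below the tag instead.
print(p.string)
print(p.b.string)
print(p.get_text())
```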
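Other usage:
soup.p.contents: list of the node's direct children
soup.p.children: iterator over the node's direct children
soup.p.descendants: generator over all descendants (children, grandchildren, and so on)
soup.a.parent / soup.a.parents: the direct parent / a generator over all ancestors
soup.a.next_sibling: the next sibling node
soup.a.previous_sibling: the previous sibling node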
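The navigation attributes listed above can be tried on a small tree. This is a sketch on invented markup (no whitespace between tags, so siblings are tags rather than newline strings); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p>one <b>bold</b></p><span>two</span><em>three</em></div>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(p.contents)                      # direct children as a list: a string and <b>
child_names = [c.name for c in soup.div.children]
descendants = list(p.descendants)      # 'one ', <b>, and the string inside <b>

span = soup.span
print(span.parent.name)                # direct parent
ancestor_names = [a.name for a in span.parents]  # up to the document root
print(span.next_sibling.name)          # sibling after <span>
print(span.previous_sibling.name)      # sibling before <span>
```

With pretty-printed HTML, next_sibling and previous_sibling often return a whitespace string rather than a tag, which is one reason the find_next_sibling() family of methods is often more convenient.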
- Method selectors
find_all(name, attrs, recursive, string, limit, **kwargs): returns a list of all matching nodes
find(name, attrs, recursive, string, **kwargs): returns the first matching node, or None if nothing matches
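To make the difference between the two signatures concrete, here is a small sketch on invented markup. It shows find() versus find_all(), the `limit` and `recursive` parameters, and the `class_` keyword; the HTML and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<ul id="menu">'
    '<li class="item">A</li>'
    '<li class="item active">B</li>'
    '<li class="item">C</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")                  # first match only
all_items = soup.find_all("li")          # every match, as a list
first_two = soup.find_all("li", limit=2) # stop after two matches
missing = soup.find("table")             # no match: None, not an exception

# recursive=False only looks at direct children of the node it is called on.
top_level = soup.find_all("li", recursive=False)      # <li> is not a child of the document
direct = soup.ul.find_all("li", recursive=False)      # but it is a child of <ul>

# class_ matches any one of a tag's classes.
active = soup.find("li", class_="active")
print(first.get_text(), len(all_items), len(first_two), missing, active.get_text())
```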
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(name='ul'))

Output:

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    soup = BeautifulSoup(html, 'lxml')
    # the attribute value must be a quoted string
    print(soup.find_all(attrs={'id': 'list-2'}))

Output (a list, even for a single match):

    [<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
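    # common attributes can be passed as keyword arguments;
    # class is a Python keyword, so use class_ instead
    print(soup.find_all(id='list-1'))
    print(soup.find_all(class_='list'))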
    import re
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-body">
    <a>Hello, this is a link</a>
    <a>Hello, this is a link, too</a>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    # `string` is the current name of this argument; older code uses `text`
    # print(soup.find_all(string='Hello, this is a link'))  # exact match
    print(soup.find_all(string=re.compile('link')))  # regular-expression match

Output:

    ['Hello, this is a link', 'Hello, this is a link, too']
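Note that filtering on text alone returns the matching strings, not the tags that contain them. One way to get the tags is to combine a tag name with the text filter. The sketch below uses invented markup and assumes a bs4 version that accepts the `string` keyword (4.4.0 or later).

```python
import re

from bs4 import BeautifulSoup

html = (
    '<div>'
    '<a href="/1">Hello, this is a link</a>'
    '<a href="/2">Nothing here</a>'
    '<span>link text in a span</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Text filter only: the result is a list of strings from any tag.
strings = soup.find_all(string=re.compile("link"))

# Tag name plus text filter: the result is a list of <a> tags whose
# .string matches the pattern.
tags = soup.find_all("a", string=re.compile("link"))

print([str(s) for s in strings])
print([t["href"] for t in tags])
```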
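find_parents() and find_parent(): the former returns all matching ancestor nodes, the latter returns the nearest matching one (the direct parent when no filter is given).
find_next_siblings() and find_next_sibling(): the former returns all following sibling nodes, the latter returns the first following sibling.
find_previous_siblings() and find_previous_sibling(): the former returns all preceding sibling nodes, the latter returns the first preceding sibling.
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first matching one.
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first matching one.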
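The method pairs above can be compared side by side on a tiny document. This is a sketch with invented markup and ids; each call starts from one node and searches relative to it.

```python
from bs4 import BeautifulSoup

html = '<div><h2 id="t">Title</h2><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, "html.parser")
a, b, c = (soup.find(id=i) for i in "abc")

# Ancestors
parent_div = b.find_parent("div")
div_ancestors = [t.name for t in b.find_parents("div")]

# Siblings after and before <p id="b">
next_ids = [t["id"] for t in b.find_next_siblings("p")]
next_id = b.find_next_sibling("p")["id"]
prev_ids = [t["id"] for t in b.find_previous_siblings()]  # nearest first
prev_id = b.find_previous_sibling()["id"]

# Anything later or earlier in document order, not only siblings
later_ids = [t["id"] for t in a.find_all_next("p")]
later_id = a.find_next("p")["id"]
earlier_ids = [t["id"] for t in c.find_all_previous("p")]  # nearest first
heading_id = c.find_previous("h2")["id"]

print(next_ids, prev_ids, later_ids, earlier_ids)
```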
- CSS selectors
    soup = BeautifulSoup(html, 'lxml')
    # every <a> node inside the element whose id is "picture"
    result = soup.select('#picture a')
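As a runnable illustration of the selector above, here is a sketch on invented markup that contains an element with id "picture". It also shows select_one(), which returns the first match or None. The HTML and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="picture">'
    '<a href="/img/1.png" class="thumb">one</a>'
    '<a href="/img/2.png">two</a>'
    '</div>'
    '<a href="/other">other</a>'
)
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list of tags, possibly empty.
links = soup.select("#picture a")
hrefs = [a["href"] for a in links]

# select_one() returns the first match, or None when nothing matches.
thumb_text = soup.select_one("#picture a.thumb").get_text()
missing = soup.select_one("#missing")

print(hrefs, thumb_text, missing)
```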
    def get_house_info(html):
        """Parse a listing page and save one record per house."""
        soup = BeautifulSoup(html, 'lxml')
        names = soup.select('.houseList .list .plotListwrap .plotTit')
        # named `types` rather than `type` to avoid shadowing the built-in
        types = soup.select('.houseList .list .plotListwrap .plotFangType')
        # the strides of 3 below assume three <p> and three <li> per listing
        addr = soup.select('.houseList .list .plotListwrap dd p')[1::3]
        stats = soup.select('.houseList .list .plotListwrap .sellOrRenthy li')
        selling = stats[::3]
        selled = stats[1::3]
        year = stats[2::3]
        price = soup.select('.houseList .list .listRiconwrap .priceAverage')
        ratio = soup.select('.houseList .list .listRiconwrap .ratio')
        for i in range(len(names)):
            house = {
                'name': names[i].text.strip(),
                'type': types[i].text.strip(),
                'addr': addr[i].text.strip(),
                'selling': selling[i].text.strip(),
                'selled': selled[i].text.strip(),
                'year': year[i].text.strip(),
                'price': price[i].text.strip(),
                'ratio': ratio[i].text.strip(),
            }
            # save_to_mongo() is not defined in this section
            save_to_mongo(house)
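The function above pairs up parallel lists by index, so a listing that lacks one element can raise IndexError or shift fields onto the wrong house. One possible alternative is to select each listing container first and then query inside it. The sketch below is not the original page: the markup, class names, and the helpers `text_or_none` and `parse_houses` are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second listing has no price on purpose.
html = """
<ul class="houses">
  <li class="house"><h3 class="name"> Garden Court </h3><span class="price">12000</span></li>
  <li class="house"><h3 class="name">River View</h3></li>
</ul>
"""


def text_or_none(node, selector):
    """Return the stripped text of the first match under node, or None."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found else None


def parse_houses(markup):
    soup = BeautifulSoup(markup, "html.parser")
    # One dict per listing; selectors are evaluated relative to each item.
    return [
        {"name": text_or_none(item, ".name"), "price": text_or_none(item, ".price")}
        for item in soup.select(".houses .house")
    ]


houses = parse_houses(html)
print(houses)
```

Because every field is looked up inside its own listing, a missing element shows up as None for that record only, and the remaining records stay aligned.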