1. Common parsers

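Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed markup	Poor tolerance of malformed markup in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed markup	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most tolerant; parses documents the way a browser does; produces valid HTML5	Slow; requires an external Python package (html5lib)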
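
To make the differences in the table concrete, here is a minimal sketch that feeds the same broken fragment to each parser. The helper name parse_with is made up for this illustration, and lxml and html5lib are treated as optional: if one is not installed, BeautifulSoup raises FeatureNotFound and the sketch simply reports that.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Return the repaired document as a string, or None if the parser is missing.

    parse_with is a hypothetical helper for this sketch, not part of bs4.
    """
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        return None


broken = "<a></p>"  # an opening <a> followed by a stray closing </p>
results = {name: parse_with(broken, name) for name in ("html.parser", "lxml", "html5lib")}

for name, repaired in results.items():
    print(name, "->", repaired if repaired is not None else "(not installed)")
```

Each parser repairs the fragment differently, so the same selector can return different results depending on which parser built the tree. Naming the parser explicitly, as every example here does, keeps the behaviour stable across machines.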

2. Basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Node selectors

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a)  # select the first <a> node

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

Only the first matching node is returned; any later matches are ignored.

print(soup.p.attrs)
print(soup.p.attrs['class'])

{'class': ['title'], 'name': 'dromouse'}
['title']


Or, equivalently:

print(soup.p['name'])
print(soup.p['class'])

dromouse
['title']

print(soup.p.string)

The Dormouse's story
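
A few edge cases of attribute-style access are easy to miss, so here is a small sketch under some assumptions: it uses a shortened copy of the sample document and the built-in html.parser so that it runs without lxml. The variable names are chosen only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

doc = (
    '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
    '<a href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
)
soup = BeautifulSoup(doc, "html.parser")

tag_name = soup.p.name           # the tag's own name
nested_text = soup.p.b.string    # selectors can be chained: <p> -> <b> -> text
missing_attr = soup.p.get("id")  # .get() returns None; soup.p["id"] would raise KeyError

# The first <a> contains only an HTML comment, so .string is a Comment object,
# not ordinary text. Check the type before treating it as link text.
link_text = soup.a.string
is_comment = isinstance(link_text, Comment)

print(tag_name, nested_text, missing_attr, is_comment)
```

Note that class comes back as a list because it is a multi-valued attribute, while name comes back as a plain string.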

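Other navigation attributes

soup.p.contents: list of the direct child nodes

soup.p.children: iterator over the direct child nodes

soup.p.descendants: generator over all descendant nodes (children, grandchildren, and so on)

soup.a.parent / soup.a.parents: the direct parent / a generator over all ancestors

soup.a.next_sibling: the next node at the same level

soup.a.previous_sibling: the previous node at the same level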
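
The navigation attributes walk every node in the tree, text included, which often surprises people using next_sibling. A minimal sketch, assuming a tiny two-link document invented for this illustration and the built-in html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

direct_children = p.contents  # [<a>, ', ', <a>]: the text node counts as a child
child_ids = [c["id"] for c in p.children if isinstance(c, Tag)]  # keep tags only
all_descendants = list(p.descendants)  # both tags, their text, and the ', ' between them

parent_name = soup.a.parent.name
ancestor_names = [ancestor.name for ancestor in soup.a.parents]

after_first = soup.a.next_sibling        # the text ', ', not the second <a>
second_link = after_first.next_sibling   # one more step reaches the second <a>
before_first = soup.a.previous_sibling   # None: nothing precedes the first <a>

print(len(direct_children), child_ids, len(all_descendants), ancestor_names)
```

When only tags matter, filtering with isinstance(node, Tag) as above, or using the find_* methods described later, avoids tripping over whitespace and punctuation nodes.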

Method selectors

　　find_all( name , attrs , recursive , string , limit , **kwargs )

　　find( name , attrs , recursive , string , **kwargs )

from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='ul'))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id': 'list-2'}))

[<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='list'))
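
The keyword forms above have a few behaviours worth seeing side by side. This is a minimal sketch, assuming a compact copy of the list markup and the built-in html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

doc = (
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>'
    '</ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Foo</li><li class="element">Bar</li>'
    '</ul>'
)
soup = BeautifulSoup(doc, "html.parser")

# class_ matches any one of a tag's classes, so "list list-small" matches too.
list_ids = [ul["id"] for ul in soup.find_all(class_="list")]

# find() returns the first match, or None when nothing matches.
first_ul = soup.find("ul")
no_table = soup.find("table")

# limit caps the number of results; find_all can also be called on a result.
first_two = soup.find_all("li", limit=2)
items_in_second = [li.string for li in soup.find(id="list-2").find_all("li")]

print(list_ids, first_ul["id"], no_table, len(first_two), items_in_second)
```

Because find() can return None, it is worth checking the result before indexing into it or reading its attributes.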

import re
from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
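# An exact match would be: soup.find_all(string='Hello, this is a link')
# `string` is the current name of this argument; `text` is the older spelling.
print(soup.find_all(string=re.compile('link')))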

['Hello, this is a link', 'Hello, this is a link, too']
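
One detail the output above hints at: filtering on the string alone returns the text nodes themselves, not the tags around them. A short sketch of the difference, assuming the same two-link markup and the built-in html.parser:

```python
import re

from bs4 import BeautifulSoup

doc = "<div><a>Hello, this is a link</a><a>Hello, this is a link, too</a></div>"
soup = BeautifulSoup(doc, "html.parser")

pattern = re.compile("too")  # the pattern is searched for, not anchored at the start

# String filter only: the results are text nodes.
matching_strings = soup.find_all(string=pattern)

# Tag name plus string filter: the results are the <a> tags whose text matches.
matching_tags = soup.find_all("a", string=pattern)

print(matching_strings)
print(matching_tags)
```

Combining a tag name with the string filter is usually what is wanted when the next step is to read attributes such as href.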

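find_parents() and find_parent(): the former returns all matching ancestor nodes; the latter returns the closest matching ancestor (the direct parent when no filter is given).

find_next_siblings() and find_next_sibling(): the former returns all matching siblings after the node; the latter returns the first one.

find_previous_siblings() and find_previous_sibling(): the former returns all matching siblings before the node; the latter returns the first one, that is, the nearest.

find_all_next() and find_next(): the former returns all matching nodes that come after the node in the document; the latter returns the first one.

find_all_previous() and find_previous(): the former returns all matching nodes that come before the node in the document; the latter returns the first one, that is, the nearest.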
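
These relative search methods all start from an existing node rather than from the whole document. A minimal sketch, assuming a small three-item list invented for this illustration and the built-in html.parser:

```python
from bs4 import BeautifulSoup

doc = '<div class="panel"><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(doc, "html.parser")

bar = soup.find("li", string="Bar")  # the starting point: the middle <li>

# Upwards: closest matching ancestor, then all ancestors matching either name.
panel_classes = bar.find_parent("div")["class"]
ancestor_names = [tag.name for tag in bar.find_parents(["ul", "div"])]

# Sideways: siblings after and before the starting node.
next_item = bar.find_next_sibling("li").string
previous_item = bar.find_previous_sibling("li").string
after_first = [li.string for li in soup.li.find_next_siblings("li")]

# Document order: not limited to siblings, just "later" or "earlier" in the markup.
following = bar.find_next("li").string
preceding = [li.string for li in bar.find_all_previous("li")]

print(panel_classes, ancestor_names, next_item, previous_item, after_first, following, preceding)
```

The sibling methods only look at nodes sharing the same parent, while find_next() and find_all_previous() follow document order and can cross into other branches of the tree.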

CSS selectors

soup = BeautifulSoup(html, 'lxml')
result = soup.select('#picture a')  # every <a> inside the element with id="picture"
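
For a self-contained look at select(), here is a short sketch against a compact copy of the list markup used earlier, parsed with the built-in html.parser. The selectors are ordinary CSS; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

doc = (
    '<div class="panel-body">'
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>'
    '</ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Foo</li><li class="element">Bar</li>'
    '</ul>'
    '</div>'
)
soup = BeautifulSoup(doc, "html.parser")

# Descendant selector: .element nodes anywhere under #list-2.
second_list_items = [li.get_text() for li in soup.select("#list-2 .element")]

# Child selector: only <li> tags directly under ul#list-1.
first_list_children = soup.select("ul#list-1 > li")

# select_one() returns the first match, or None when nothing matches.
small_list_id = soup.select_one("ul.list-small")["id"]
no_match = soup.select_one("#picture a")

print(second_list_items, len(first_list_children), small_list_id, no_match)
```

select() always returns a list, possibly empty, so indexing into its result without checking the length can raise IndexError on pages that do not match.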

def get_house_info(html):
    """Build one record per listing and pass it to save_to_mongo (defined elsewhere)."""
    soup = BeautifulSoup(html, 'lxml')
    base = '.houseList .list .plotListwrap'
    names = soup.select(base + ' .plotTit')
    types = soup.select(base + ' .plotFangType')  # renamed from `type`, which shadows the built-in
    addrs = soup.select(base + ' dd p')[1::3]  # the second of every three <p> tags
    # Every three consecutive <li> items belong to one listing: selling, selled, year.
    items = soup.select(base + ' .sellOrRenthy li')
    selling, selled, years = items[::3], items[1::3], items[2::3]
    prices = soup.select('.houseList .list .listRiconwrap .priceAverage')
    ratios = soup.select('.houseList .list .listRiconwrap .ratio')

    # The lists are assumed to line up by index, one entry per listing.
    for i in range(len(names)):
        house = {
            'name': names[i].text.strip(),
            'type': types[i].text.strip(),
            'addr': addrs[i].text.strip(),
            'selling': selling[i].text.strip(),
            'selled': selled[i].text.strip(),
            'year': years[i].text.strip(),
            'price': prices[i].text.strip(),
            'ratio': ratios[i].text.strip(),
        }
        save_to_mongo(house)
