BeautifulSoup parsing library

1. Common parsers

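Python standard library
    Usage: BeautifulSoup(markup, "html.parser")
    Advantages: built into Python, moderate speed, good tolerance of malformed documents
    Disadvantages: poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser
    Usage: BeautifulSoup(markup, "lxml")
    Advantages: fast, good tolerance of malformed documents
    Disadvantages: requires a C library to be installed
lxml XML parser
    Usage: BeautifulSoup(markup, "xml")
    Advantages: fast, the only parser that supports XML
    Disadvantages: requires a C library to be installed
html5lib
    Usage: BeautifulSoup(markup, "html5lib")
    Advantages: best tolerance of malformed documents, parses the way a browser does, produces valid HTML5
    Disadvantages: slow, requires an external Python package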
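
Since lxml and html5lib are optional installs, it can help to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that; the helper name make_soup and its preferred parameter are hypothetical, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Note that different parsers can build different trees from the same malformed markup, so results may vary slightly depending on which parser is picked.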

2. Basic usage

    

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
  • Node selectors
    soup = BeautifulSoup(html,'lxml')
    print(soup.a)  # select the a node
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

    Only the first matching node is returned; any later matches are ignored.
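
To make the "first match only" behaviour concrete, here is a small sketch that compares attribute access with find_all(). It uses a shortened copy of the sample markup and html.parser (an assumption, so it runs without lxml installed).

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a                   # attribute access: the first <a> only
all_links = soup.find_all("a")   # every <a>, in document order
missing = soup.table             # no such tag in the document

print(first["id"])                    # link1
print([a["id"] for a in all_links])   # ['link1', 'link2', 'link3']
print(missing)                        # None
```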

    

print(soup.p.attrs)
print(soup.p.attrs['class'])
{'class': ['title'], 'name': 'dromouse'}
['title']

Or:
print(soup.p['name'])
print(soup.p['class'])
dromouse
['title']
 
print(soup.p.string)
The Dormouse's story
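
A few edge cases around attributes and .string are worth seeing side by side. The snippet below is a minimal sketch on a made-up one-line document (not the sample above), parsed with html.parser.

```python
from bs4 import BeautifulSoup

# Illustrative markup, chosen so that <p> has two children (text and <b>).
html = '<p class="story lead" id="intro">Once upon a <b>time</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p["class"])      # ['story', 'lead']  (class is multi-valued, so a list)
print(p["id"])         # intro              (id is single-valued, so a str)
print(p.get("name"))   # None               (.get() avoids the KeyError of p["name"])
print(p.string)        # None               (<p> has more than one child)
print(p.b.string)      # time
print(p.get_text())    # Once upon a time
```

When a tag may contain nested tags, get_text() is usually the safer choice than .string.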
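 Other usage
  soup.p.contents: list of direct child nodes
  soup.p.children: iterator over direct child nodes
  soup.p.descendants: generator over all descendant nodes (children, grandchildren, and so on)
 
  soup.a.parent / soup.a.parents: the direct parent / a generator over all ancestors
 
  soup.a.next_sibling: the next sibling node
  soup.a.previous_sibling: the previous sibling node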
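
The navigation attributes listed above can be tried out on a tiny tree. This is only a sketch: the markup is written on one line on purpose, because in indented HTML the whitespace between tags is itself a text node and often shows up as next_sibling or previous_sibling.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul
first_li = ul.li

print(len(ul.contents))                      # 3  (direct children as a list)
print([li.string for li in ul.children])     # ['Foo', 'Bar', 'Jay']
print(len(list(ul.descendants)))             # 6  (three <li> tags plus their text nodes)
print(first_li.parent.name)                  # ul
print(first_li.next_sibling.string)          # Bar
print(first_li.previous_sibling)             # None (no earlier sibling)
```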
  • Method selectors

  find_all( name , attrs , recursive , string , **kwargs )

  find( name , attrs , recursive , string , **kwargs )

  

from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='ul'))
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id': 'list-2'}))
[<ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>]
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='list'))  # class_ is used because class is a Python keyword
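
The difference between find() and find_all(), and the effect of the limit and recursive arguments, can be sketched as follows. The markup is a condensed variant of the panel example and html.parser is used so the snippet runs without lxml; both are choices made for this illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("ul")["id"])                  # list-1 (first match only)
print(soup.find("table"))                     # None   (find() returns None on no match)
print(len(soup.find_all("li", limit=2)))      # 2      (stop after two matches)

# class_ matches any one of a tag's classes, so "list" matches both <ul> tags.
print([ul["id"] for ul in soup.find_all(class_="list")])        # ['list-1', 'list-2']
print([ul["id"] for ul in soup.find_all(class_="list-small")])  # ['list-2']

# recursive=False only looks at direct children of the object searched.
print(soup.find_all("li", recursive=False))   # []
```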
import re
from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# `string` is the current name of this argument; `text` is the older name for it.
#print(soup.find_all(string='Hello, this is a link'))
print(soup.find_all(string=re.compile('link')))
['Hello, this is a link', 'Hello, this is a link, too']
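
Searching by string alone returns the text nodes themselves. If the enclosing tags are wanted instead, a tag name can be combined with the string filter. A minimal sketch, with one extra `<a>Bye</a>` link added purely for illustration:

```python
import re
from bs4 import BeautifulSoup

html = (
    "<div><a>Hello, this is a link</a>"
    "<a>Hello, this is a link, too</a>"
    "<a>Bye</a></div>"
)
soup = BeautifulSoup(html, "html.parser")

strings = soup.find_all(string=re.compile("link"))     # text nodes
tags = soup.find_all("a", string=re.compile("too$"))   # <a> tags whose .string matches
exact = soup.find_all(string="Bye")                    # exact string match

print(type(strings[0]).__name__, len(strings))   # NavigableString 2
print(tags[0].name, tags[0].string)              # a Hello, this is a link, too
print(exact)                                     # ['Bye']
```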

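  find_parents() and find_parent(): the former returns all matching ancestor nodes; the latter returns the closest matching ancestor (the direct parent when no filter is given).

  find_next_siblings() and find_next_sibling(): the former returns all following sibling nodes; the latter returns the first following sibling node.

  find_previous_siblings() and find_previous_sibling(): the former returns all preceding sibling nodes; the latter returns the first preceding sibling node.

  find_all_next() and find_next(): the former returns all matching nodes after the current node; the latter returns the first such node.

  find_all_previous() and find_previous(): the former returns all matching nodes before the current node; the latter returns the first such node.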
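
The relative search methods above all start from an existing node rather than from the whole document. The following sketch starts from the middle list item of a condensed panel (markup shortened for this illustration) and looks in each direction.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel"><ul id="list-1">'
    "<li>Foo</li><li>Bar</li><li>Jay</li>"
    "</ul></div>"
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")   # start from the middle <li>

print(bar.find_parent("div")["class"])                       # ['panel']
print([t.name for t in bar.find_parents(["ul", "div"])])     # ['ul', 'div'] (closest first)
print([li.string for li in bar.find_next_siblings("li")])    # ['Jay']
print(bar.find_previous_sibling("li").string)                # Foo
print(bar.find_next("li").string)                            # Jay
print([li.string for li in bar.find_all_previous("li")])     # ['Foo']
```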

  • CSS selectors

    

    soup = BeautifulSoup(html,'lxml')
    result = soup.select('#picture a')
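
For a self-contained look at select() and select_one(), here is a small sketch against a condensed version of the panel markup (the third item is renamed Baz only to tell the two lists apart). select() always returns a list; select_one() returns the first match or None.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print([li.get_text() for li in soup.select("#list-1 .element")])   # ['Foo', 'Bar']
print(soup.select_one("ul.list-small")["id"])                      # list-2
print(len(soup.select(".panel-body > ul")))                        # 2 (direct children only)
print(soup.select("#missing"))                                     # []
print(soup.select_one("#missing"))                                 # None
```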
def get_house_info(html):
    soup = BeautifulSoup(html,'lxml')
    names = soup.select('.houseList .list .plotListwrap .plotTit')
    # Renamed from `type` so the built-in type() is not shadowed.
    house_types = soup.select('.houseList .list .plotListwrap .plotFangType')
    # The second of every three <p> elements is taken as the address.
    addr = soup.select('.houseList .list .plotListwrap dd p')[1::3]
    # Each house contributes three <li> items: selling, selled, year.
    items = soup.select('.houseList .list .plotListwrap .sellOrRenthy li')
    selling = items[::3]
    selled = items[1::3]
    year = items[2::3]
    price = soup.select('.houseList .list .listRiconwrap .priceAverage')
    ratio = soup.select('.houseList .list .listRiconwrap .ratio')

    for i in range(len(names)):
        house = {
            'name' : names[i].text.strip(),
            'type' : house_types[i].text.strip(),
            'addr': addr[i].text.strip(),
            'selling': selling[i].text.strip(),
            'selled': selled[i].text.strip(),
            'year': year[i].text.strip(),
            'price': price[i].text.strip(),
            'ratio': ratio[i].text.strip(),
        }
        # save_to_mongo() is not defined in this excerpt; it is assumed to exist elsewhere.
        save_to_mongo(house)
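
One weakness of collecting parallel lists and indexing them together is that a single missing field shifts every later record. An alternative is to select one container per listing and query inside it. The sketch below only illustrates that idea: the markup, the .item class, and the helper names text_or_none and parse_items are all hypothetical and do not reflect the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one .item container per listing (not the real page).
html = """
<div class="houseList">
  <div class="item">
    <span class="plotTit"> Garden Court </span>
    <span class="priceAverage">12000</span>
  </div>
  <div class="item">
    <span class="plotTit">River View</span>
  </div>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match inside parent, or None."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def parse_items(markup):
    """Build one dict per .item container, tolerating missing fields."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        {
            "name": text_or_none(item, ".plotTit"),
            "price": text_or_none(item, ".priceAverage"),
        }
        for item in soup.select(".houseList .item")
    ]


houses = parse_items(html)
print(houses)
```

Here the second listing has no price, so its record carries None instead of silently borrowing the next listing's value.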


Reposted from www.cnblogs.com/yitiaodahe/p/9216374.html