The Road to Data - Python Crawler - the BeautifulSoup Library

One, BeautifulSoup introduction

Beautiful Soup is a Python library for parsing HTML and XML; you can use it to extract data from web pages easily. Beautiful Soup automatically converts the input document to Unicode and encodes the output document as UTF-8.
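
As a minimal sketch of this encoding behaviour (assuming a GBK-encoded byte string with a declared charset as input):

from bs4 import BeautifulSoup

# a non-ASCII document encoded as GBK, with the charset declared in a meta tag
gbk_bytes = ('<html><head><meta charset="gbk"></head>'
             '<body><p>你好</p></body></html>').encode('gbk')
soup = BeautifulSoup(gbk_bytes, 'html.parser')
print(soup.original_encoding)           # the input encoding Beautiful Soup detected
print(soup.p.string)                    # the text, already converted to Unicode
print(soup.prettify().encode('utf-8'))  # re-encode the output as UTF-8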

Two, a simple BeautifulSoup example

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')  # build the parse tree with the lxml parser
print(soup.prettify())         # pretty-print the (auto-completed) document
print(soup.title)              # <title>The Dormouse's story</title>
print(soup.title.name)         # 'title'
print(soup.title.string)       # "The Dormouse's story"
print(soup.title.parent.name)  # 'head'
print(soup.p)                  # the first <p> tag
print(soup.p["class"])         # ['title']
print(soup.a)                  # the first <a> tag
print(soup.find_all('a'))      # all <a> tags
print(soup.find(id='link3'))   # the tag whose id is 'link3'
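
Building on the example above, a typical next step is to loop over the matched tags and pull out an attribute and the text of each one:

# extract the href attribute and the link text of every <a> tag
for a in soup.find_all('a'):
    print(a['href'], a.string)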

Three, parsers supported by Beautiful Soup

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python's standard library; moderate speed; strong document fault tolerance | Poor fault tolerance in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; strong document fault tolerance | Requires the lxml C library to be installed
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the lxml C library to be installed
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; generates valid HTML5 | Slow; does not depend on external C extensions
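
As a rough illustration of the differences, the same broken markup can be handed to each parser; lxml and html5lib have to be installed separately (e.g. with pip), so this sketch skips any parser that is missing:

from bs4 import BeautifulSoup, FeatureNotFound

broken_html = '<p>an unclosed paragraph<li>an item'
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        # each parser repairs the invalid markup in its own way
        print(parser, '->', BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        print(parser, 'is not installed')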

Four, BeautifulSoup basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

1. Tag selector

Using soup.<tag name> we can get the first tag with that name.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.head)
print(soup.p)  # if there are multiple <p> tags, only the first is returned

2. Tag selector: get the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

3. Tag selector: get attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])  # 'dromouse'
print(soup.p['name'])        # shorthand for the same lookup
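
attrs returns the full attribute dictionary of the tag, and get() performs a lookup that returns None instead of raising a KeyError when the attribute is missing; a small sketch with the same html:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs)           # {'class': ['title'], 'name': 'dromouse'}
print(soup.p.get('name'))     # 'dromouse'
print(soup.p.get('missing'))  # None rather than a KeyError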

4. Child nodes and descendant nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # child nodes as a list

print(soup.p.children)  # child nodes as an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)

print(soup.p.descendants)  # descendant nodes as an iterator
for i, child in enumerate(soup.p.descendants):
    print(i, child)

5. Parent nodes, ancestor nodes, and sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.a.parent)  # get the parent node
print(list(enumerate(soup.a.parents)))  # get all ancestor nodes

print(list(enumerate(soup.a.next_siblings)))  # get the following siblings
print(list(enumerate(soup.a.previous_siblings)))  # get the preceding siblings

Five, method selectors

find_all() searches the document for elements by tag name, attributes, or text content.
find_all(name, attrs, recursive, text, **kwargs)

import re
from bs4 import BeautifulSoup

# assumes `html` is a document containing <ul>/<li> elements,
# e.g. the snippet shown in section Six
soup = BeautifulSoup(html, 'lxml')

# query by tag name
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

# query by attributes
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

# query by text
print(soup.find_all(text=re.compile('Link')))

find_all()  # returns all matching elements
find()  # returns the first matching element
find_parents()  # returns all ancestor nodes
find_parent()  # returns the direct parent node
find_next_siblings()  # returns all following siblings
find_next_sibling()  # returns the first following sibling
find_previous_siblings()  # returns all preceding siblings
find_previous_sibling()  # returns the first preceding sibling

find_all_next()  # returns all matching nodes after the current node
find_next()  # returns the first matching node after the current node
find_all_previous()  # returns all matching nodes before the current node
find_previous()  # returns the first matching node before the current node
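
A small sketch of a few of these variants, using the three-sisters html string defined in section Four:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

first_a = soup.find('a', id='link1')
print(first_a.find_parent('p')['class'])     # ['story'] - the enclosing <p> tag
print(first_a.find_next_sibling('a')['id'])  # 'link2' - the next <a> sibling
print(first_a.find_all_next('a'))            # both <a> tags that follow link1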

Six, CSS selectors

Pass a CSS selector directly to select() to make a selection.

html= '''
<div class='panel'>
    <div class='panel-heading'>
        <h4>Hello</h4>
    </div>    
    <div class='panel-body'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
        </ul>
    </div>
</div>
'''

1. Selecting tags

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))

2. Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # index the tag directly
    print(ul.attrs['id'])  # equivalent lookup through the attrs dict

3. Getting text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
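
select() returns a list of Tag objects, so it can be called on a tag to narrow a selection; select_one() returns only the first match. A brief sketch with the same html:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))  # select() works on a Tag as well as on the soup
print(soup.select_one('.panel-heading h4').get_text())  # 'Hello'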

Source: www.cnblogs.com/Iceredtea/p/11286170.html