One, Introduction to BeautifulSoup
Beautiful Soup is a Python library for parsing HTML and XML, and it makes extracting data from web pages easy. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.
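To make the encoding behaviour concrete, here is a minimal sketch. It assumes the built-in html.parser so that no extra package is needed, and the Latin-1 sample bytes are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes encoded as Latin-1 rather than UTF-8.
raw = "<p>caf\u00e9</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string        # already a Unicode string
utf8_bytes = soup.encode()  # serialized output, UTF-8 by default

print(text)
print(utf8_bytes)
```

When from_encoding is left out, Beautiful Soup tries to detect the encoding itself, so passing it explicitly is mainly useful when you already know the source encoding.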
Two, A simple BeautifulSoup example
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))
Three, Parsers supported by Beautiful Soup
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python's standard library, moderate speed, good tolerance of malformed documents | Versions before Python 2.7.3 and Python 3.2.2 have poor fault tolerance |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast, good tolerance of malformed documents | Requires the C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast, the only parser that supports XML | Requires the C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance, parses documents the way a browser does, produces HTML5-format documents | Slow (no C extension is required, but the html5lib package must be installed) |
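Because lxml and html5lib are optional installs, it can help to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that; make_soup and its preference order are hypothetical names chosen for illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


# Deliberately broken markup: neither <p> nor <b> is closed.
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.p.get_text())
```

Different parsers may build slightly different trees from broken markup, so if results must be reproducible across machines it is safer to name one parser explicitly.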
Four, BeautifulSoup basic usage
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
1. Tag selectors
Writing soup.<tag name> returns that tag directly.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.head)
print(soup.p)  # if there are several p tags, only the first one is returned
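A small self-contained sketch of the "first match only" behaviour follows. The two-paragraph document is made up for illustration, and html.parser is assumed so that nothing extra needs installing.

```python
from bs4 import BeautifulSoup

doc = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p            # only the first <p> in the document
all_p = soup.find_all("p")  # every <p>
missing = soup.table        # there is no <table>, so this is None

print(first_p)
print(len(all_p))
print(missing)
```

Since a missing tag gives None rather than raising, chained access such as soup.table.tr fails with an AttributeError, so it is worth checking for None first.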
2. Tag selectors: getting the name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)
3. Tag selectors: getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
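The sketch below shows a few details of attribute access that are easy to trip over. The one-line document is invented for illustration and html.parser is assumed.

```python
from bs4 import BeautifulSoup

doc = '<p class="title big" name="dromouse">text</p>'
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

name_attr = p["name"]          # raises KeyError if the attribute is absent
classes = p["class"]           # class is multi-valued, so this is a list
missing = p.get("id")          # .get() returns None instead of raising
has_name = p.has_attr("name")  # True when the attribute exists

print(name_attr, classes, missing, has_name)
```

Note that p.name is the tag's own name ("p"), not the value of its name attribute, which is why the attribute is read with p["name"] or p.attrs["name"].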
4. Child nodes and descendant nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # get the child nodes as a list
print(soup.p.children)  # get the child nodes as an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)
print(soup.p.descendants)  # get the descendant nodes as a generator
for i, child in enumerate(soup.p.descendants):
    print(i, child)
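To see how children and descendants differ, here is a minimal sketch on a tiny made-up document (html.parser assumed).

```python
from bs4 import BeautifulSoup

doc = "<p><b>bold</b> text</p>"
soup = BeautifulSoup(doc, "html.parser")

# Direct children of <p>: the <b> tag and the trailing string.
children = list(soup.p.children)

# Descendants also include the string nested inside <b>.
descendants = list(soup.p.descendants)

print(len(children), len(descendants))
for i, node in enumerate(descendants):
    print(i, repr(str(node)))
```

Text nodes count as nodes too, which is why whitespace between tags shows up when iterating over a real page.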
5. Parent, ancestor, and sibling nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # get the parent node
print(list(enumerate(soup.a.parents)))  # get the ancestor nodes
print(list(enumerate(soup.a.next_siblings)))  # get the following siblings
print(list(enumerate(soup.a.previous_siblings)))  # get the preceding siblings
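The navigation properties can be tried on a small made-up document like the one below. It assumes html.parser, and the tags are written without whitespace between them so that the siblings are all tags.

```python
from bs4 import BeautifulSoup

doc = '<div><a id="a1">1</a><a id="a2">2</a><a id="a3">3</a></div>'
soup = BeautifulSoup(doc, "html.parser")
middle = soup.find(id="a2")

parent_name = middle.parent.name                        # direct parent
ancestor_names = [t.name for t in middle.parents]       # parent, then upwards
next_ids = [t["id"] for t in middle.next_siblings]      # siblings after
prev_ids = [t["id"] for t in middle.previous_siblings]  # siblings before

print(parent_name, ancestor_names, next_ids, prev_ids)
```

The last entry in ancestor_names is the BeautifulSoup object itself, whose name is "[document]".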
Five, Method selectors
find_all() searches the document by tag name, attributes, or text content.
find_all(name, attrs, recursive, text, **kwargs)
# tag name query
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))
# attribute query
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
# text query
import re
print(soup.find_all(text=re.compile('Link')))

find_all()  # returns all matching elements
find()  # returns a single element
find_parents()  # returns all ancestor nodes
find_parent()  # returns the direct parent
find_next_siblings()  # returns all following siblings
find_next_sibling()  # returns the first following sibling
find_previous_siblings()  # returns all preceding siblings
find_previous_sibling()  # returns the first preceding sibling
find_all_next()  # returns all qualifying nodes after the node
find_next()  # returns the first qualifying node after the node
find_all_previous()  # returns all qualifying nodes before the node
find_previous()  # returns the first qualifying node before the node
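The following sketch exercises several of the methods listed above on a made-up list document (html.parser assumed). It uses string= for the text query, which is the newer name for the text argument in Beautiful Soup 4.4.0 and later.

```python
import re

from bs4 import BeautifulSoup

doc = """
<ul id="list-1">
  <li class="element">Link one</li>
  <li class="element">Other</li>
</ul>
<ul id="list-2"><li class="element">Link two</li></ul>
"""
soup = BeautifulSoup(doc, "html.parser")

uls = soup.find_all(name="ul")                          # query by tag name
first_list = soup.find_all(attrs={"id": "list-1"})      # query by attribute
link_texts = soup.find_all(string=re.compile("Link"))   # query by text
first_li = soup.find("li")                              # first match only
no_table = soup.find("table")                           # None when no match
parent_of_li = first_li.find_parent("ul")
next_li = first_li.find_next_sibling("li")

print(len(uls), [str(s) for s in link_texts])
print(parent_of_li["id"], next_li.get_text())
```

find_all() always returns a list-like result, even when empty, while find() returns None when nothing matches, so the two need different emptiness checks.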
Six, CSS selectors
Pass a CSS selector directly to select() to make the selection.
html = '''
<div class='panel'>
    <div class='panel-heading'>
        <h4>Hello</h4>
    </div>
    <div class='panel-body'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
        </ul>
    </div>
</div>
'''
1. Selecting tags
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # .panel-heading inside .panel
print(soup.select('ul li'))  # every li inside a ul
print(soup.select('#list-2 .element'))  # .element inside the tag with id list-2
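The space inside a selector matters. The sketch below, on a trimmed-down made-up version of the panel document (html.parser assumed), contrasts a descendant selector with a compound one.

```python
from bs4 import BeautifulSoup

doc = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <ul class="list" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

# With a space: .panel-heading somewhere inside .panel
descendant = soup.select(".panel .panel-heading")

# Without a space: one element carrying both classes at once
compound = soup.select(".panel.panel-heading")

items = soup.select("#list-2 .element")

print(len(descendant), len(compound), len(items))
```

Here no single element has both the panel and panel-heading classes, so the compound form matches nothing.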
2. Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
3. Getting text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
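As a compact recap of reading attributes and text from CSS matches, here is a minimal sketch. The one-line document is made up for illustration and html.parser is assumed.

```python
from bs4 import BeautifulSoup

doc = (
    '<ul class="list list-small" id="list-2">'
    '<li class="element"> Foo </li>'
    '<li class="element">Bar</li>'
    "</ul>"
)
soup = BeautifulSoup(doc, "html.parser")

ul = soup.select_one("ul")  # first match only, or None
ul_id = ul["id"]
ul_classes = ul["class"]    # multi-valued attribute, returned as a list

# strip=True trims surrounding whitespace from each piece of text.
texts = [li.get_text(strip=True) for li in soup.select("li")]

print(ul_id, ul_classes, texts)
```

select_one() is convenient when a single element is expected, since it avoids indexing into the list that select() returns.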