Learning to write crawlers is inseparable from data parsing and analysis. The BeautifulSoup module in Python is an excellent HTML parser. Here are the main features of bs4.
Install
install bs4
pip3 install beautifulsoup4
install lxml parser
pip3 install lxml
Installing the lxml parser may fail with an xmlCheckVersion error. In that case, you can download the lxml .whl file that matches your Python version and platform, and install it with pip.
get html
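First obtain an HTML page, either with the requests library or from a local static HTML file, and parse it with bs4
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(html_doc, 'lxml')  # the parser name is a string
# or
url = "http://www.xxx.com"  # requests needs the scheme (http:// or https://)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')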
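As a minimal, self-contained sketch of the two input sources mentioned above (a string already in memory and a local file), the snippet below parses the same markup both ways. The sample markup, the temporary file name, and the choice of the built-in 'html.parser' (used here only so the example runs without lxml installed) are assumptions for illustration; with requests you would pass r.text in exactly the same way as the string.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
html_doc = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# 1) Parse a string that is already in memory.
soup_from_string = BeautifulSoup(html_doc, "html.parser")

# 2) Parse a local static HTML file (written to a temporary directory here).
with tempfile.TemporaryDirectory() as tmp_dir:
    page_path = Path(tmp_dir) / "page.html"
    page_path.write_text(html_doc, encoding="utf-8")
    with page_path.open(encoding="utf-8") as fh:
        # BeautifulSoup accepts an open file handle as well as a string.
        soup_from_file = BeautifulSoup(fh, "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.p.string)
```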
parsing function
To get started quickly, let's look at the most useful parsing functions bs4 offers. Here are some of the most commonly used calls, applied to the following piece of HTML
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>
Call bs4 to parse
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']  (class is a multi-valued attribute, so bs4 returns a list)
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
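The calls above can be combined into a small extraction step. The following is only a sketch under assumptions: it uses a shortened copy of the sample paragraph, the built-in 'html.parser' instead of lxml, and a hypothetical list-of-dicts layout for the result.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document used above.
html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect id, visible text and target URL of every <a> tag.
links = [
    {"id": a.get("id"), "text": a.get_text(strip=True), "href": a.get("href")}
    for a in soup.find_all("a")
]

for row in links:
    print(row)
```

Using .get() rather than indexing means a link without an id or href yields None instead of raising KeyError.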
Four types of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
tag
Expression                 Result
tag                        <b class="boldest">Extremely bold</b>
type(tag)                  <class 'bs4.element.Tag'>
tag.name                   'b'
tag.name = "blockquote"    <blockquote class="boldest">Extremely bold</blockquote>
tag                        <b id="boldest"></b>
tag['id']                  'boldest'
tag['attribute'] = 1       <b attribute="1" id="boldest"></b>
del tag['id']              <b attribute="1"></b>
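The same operations can be run end to end. This is a minimal sketch: the markup, the id value "bold-1" and the attribute name "data-note" are made up for illustration, and 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="bold-1">Extremely bold</b>', "html.parser")
tag = soup.b

original_name = tag.name      # name of the tag as parsed
tag.name = "blockquote"       # renaming changes the generated markup
tag["data-note"] = "1"        # attributes behave like dictionary keys
del tag["id"]                 # remove an attribute

classes = tag["class"]        # multi-valued attribute, returned as a list
missing = tag.get("id")       # .get() returns None instead of raising KeyError

print(tag)
```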
NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:
Apologies for the inconsistent formatting, but larger code blocks are more intuitive here
tag
# <blockquote>Extremely bold</blockquote>
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
NavigableString can be a nuisance: it doesn’t have most of the attributes and methods that Tag has. It also turns up in the search tree where you may not expect it (not randomly, of course: whitespace and text between tags are stored as NavigableString nodes), so I provide a simple way to ignore them later in this article.
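To see where these strings come from, here is a small sketch (the list markup is an assumption, and 'html.parser' is used to avoid the lxml dependency). The newlines between the tags are stored as NavigableString children alongside the real tags.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html_doc = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(html_doc, "html.parser")

# The direct children of <ul> include the newline characters between tags.
children = soup.ul.contents
kinds = [type(child).__name__ for child in children]
print(kinds)

# A NavigableString is a str subclass; str() gives a plain Python string
# that no longer carries a reference to the parse tree.
text = soup.li.string
plain = str(text)
print(type(text).__name__, type(plain).__name__)
```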
BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object, which means it supports most of the methods described in "Navigating the tree" and "Searching the tree".
Of course, the BeautifulSoup object does not correspond to a real tag, so it has no meaningful name or attributes; its .name is the special value '[document]'.
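A quick check of these statements, as a sketch with made-up markup and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

# The BeautifulSoup object is itself a kind of Tag, so search methods work on it.
is_tag = isinstance(soup, Tag)
paragraphs = soup.find_all("p")

# It has a placeholder name instead of a real tag name.
print(soup.name)
print(len(paragraphs))
```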
navigate and iterate
using tag name
Using a tag name as an attribute returns only the first tag with that name; use find_all() to get every match
soup.head
# <head><title>The Dormouse's story</title></head>
soup.head.title  # the same tag as soup.title
# <title>The Dormouse's story</title>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
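One point worth illustrating is what happens when the tag does not exist. The sketch below uses assumed markup and 'html.parser'; it shows chained tag-name access and the None result for a missing tag, which has to be checked before calling further methods.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><b>bold</b></p><p>second</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access can be chained to zoom in on a part of the tree.
first_bold = soup.body.p.b

# Only the first <p> is returned, even though there are two.
first_p = soup.p

# A tag name that does not occur in the document gives None, not an error.
missing = soup.table
table_text = soup.table.get_text() if soup.table is not None else ""

print(first_bold, first_p, missing)
```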
Special attribute
A tag’s direct children are available in a list called .contents:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag.contents
# ["The Dormouse's story"]
.descendants iterates over all of a tag's descendants recursively: its children, its children's children, and so on
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
title_tag.string  # the same object as title_tag.contents[0]
# "The Dormouse's story"
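The difference between these attributes is easiest to see on a tag with mixed content. The following sketch uses an assumed paragraph and 'html.parser'; note that .string is None as soon as a tag has more than one child, in which case .stripped_strings is a convenient alternative.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">Once upon a time: '
    '<a href="http://example.com/elsie">Elsie</a> and '
    '<a href="http://example.com/lacie">Lacie</a>.</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
p_tag = soup.p

# Direct children: text, <a>, text, <a>, text.
direct_children = len(p_tag.contents)

# Descendants also include the strings inside the two <a> tags.
all_descendants = len(list(p_tag.descendants))

# .string only works when there is exactly one child.
p_string = p_tag.string
a_string = soup.a.string

# .stripped_strings yields every piece of text with whitespace trimmed.
pieces = list(p_tag.stripped_strings)

print(direct_children, all_descendants, p_string, a_string)
print(pieces)
```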
going up
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
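Besides .parent, a tag also exposes .parents, which walks all the way up to the document root. A small sketch with assumed markup and 'html.parser':

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>T</title></head><body>"
    '<p class="story"><a id="link1" href="http://example.com/elsie">Elsie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.find(id="link1")

# Walk from the link up to the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in link.parents]
print(ancestor_names)

# find_parent() jumps straight to the nearest matching ancestor.
body = link.find_parent("body")

# The top-level tag's parent is the BeautifulSoup object, which has no parent.
root_parent = soup.html.parent
```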
going sideways
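Sideways navigation in bs4 goes through .next_sibling and .previous_sibling. The sketch below is an illustration with made-up markup and 'html.parser'; it also shows that the next sibling of a tag is often a string rather than a tag, in which case find_next_sibling() can skip ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Two tags that are direct neighbours, with no text between them.
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
b_tag = sibling_soup.b
c_tag = sibling_soup.c

next_of_b = b_tag.next_sibling        # the <c> tag
prev_of_c = c_tag.previous_sibling    # the <b> tag
prev_of_b = b_tag.previous_sibling    # None: <b> is the first child

# In real pages there is usually text between sibling tags.
story = BeautifulSoup(
    '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>', "html.parser"
)
first_link = story.a
raw_next = first_link.next_sibling                 # the string ",\n"
next_link = first_link.find_next_sibling("a")      # the second <a> tag

print(repr(raw_next), next_link)
```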
Searching the tree
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
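Each parameter in the signature narrows the search in a different way. The following sketch walks through them on an assumed document, parsed with 'html.parser' so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="main">
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
  </p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_name = soup.find_all("a")                              # name: tag name
by_attrs = soup.find_all("p", attrs={"class": "story"})   # attrs: attribute dict
by_kwarg = soup.find_all(id="link2")                      # **kwargs: attribute filter
by_class = soup.find_all("a", class_="sister")            # class_ avoids the keyword
by_string = soup.find_all(string="Elsie")                 # string: match text nodes
limited = soup.find_all("a", limit=2)                     # limit: stop after N hits
direct_only = soup.div.find_all("a", recursive=False)     # only direct children

print(len(by_name), len(by_attrs), len(limited), len(direct_only))
```

The last call returns an empty list because the links are grandchildren of the div, not direct children.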
filter
Check whether a tag has a given attribute: has_attr()
if info.has_attr('property') and not info.has_attr('content'):
    print(info)  # info has a property attribute but no content attribute
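The same condition can be passed to find_all() as a function filter, so the search and the attribute check happen in one step. This is a sketch only: the meta tags and the helper name has_property_but_no_content are hypothetical, and 'html.parser' is used instead of lxml.

```python
from bs4 import BeautifulSoup

html_doc = """
<head>
<meta property="og:title" content="Demo">
<meta property="og:image">
<meta name="viewport" content="width=device-width">
</head>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_property_but_no_content(tag):
    """Return True for tags that define 'property' but lack 'content'."""
    return tag.has_attr("property") and not tag.has_attr("content")


# find_all() calls the function once per tag and keeps those returning True.
matches = soup.find_all(has_property_but_no_content)

for tag in matches:
    print(tag["property"])
```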
some caveats
NavigableString can really get in the way. We can ignore it this way:
from bs4 import BeautifulSoup, NavigableString, Tag
# minisite_list is any iterable of parsed nodes, for example the children of a tag
for _minisite in minisite_list:
    if isinstance(_minisite, NavigableString):
        continue  # skip text nodes
    if isinstance(_minisite, Tag):
        print(_minisite)
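A more compact variant of the same idea is sketched below, on assumed markup parsed with 'html.parser'. It keeps only Tag children with a list comprehension, and shows that find_all(True, recursive=False) gives the same tags, since find_all never returns strings or comments unless asked to.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<div>\n<span>a</span>\n<!-- note -->\n<span>b</span>\n</div>"
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.div

# Option 1: filter the children by type (comments are strings too, so they go).
tag_children = [child for child in div.children if isinstance(child, Tag)]

# Option 2: True matches every tag; recursive=False limits it to direct children.
direct_tags = div.find_all(True, recursive=False)

print([t.get_text() for t in tag_children])
print([t.get_text() for t in direct_tags])
```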