Learning to write crawlers is inseparable from parsing and analyzing data. The BeautifulSoup module in Python is an excellent HTML parser. Here are the main functions of bs4.

Install

install bs4

pip3 install beautifulsoup4

install lxml parser

pip3 install lxml

Installing the lxml parser may fail with an xmlCheckVersion error. In that case, download the matching lxml .whl file from the Internet and install it with pip.

get html

First obtain an HTML page, either by fetching it with the requests library or by loading a local static HTML file, then parse it with bs4:

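from bs4 import BeautifulSoup
import requests

# Parse a local HTML string
soup = BeautifulSoup(html_doc, "lxml")
# or fetch the page first and parse the response body
url = "http://www.xxx.com"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")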
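As a minimal, self-contained sketch of the local-string case: the markup below is the "three sisters" sample used throughout this article, and the built-in "html.parser" is swapped in for lxml only so that the snippet runs without an extra install. Both choices are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup used for illustration (the "three sisters" document).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)
print(len(soup.find_all("a")))
```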

parsing function

To get started quickly, look at the most useful parsing functions bs4 offers. The examples below show the most commonly used methods, applied to the following piece of HTML:

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Call bs4 to parse

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
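The calls above can be combined. The following sketch, which assumes a shortened copy of the sample markup and the built-in "html.parser", collects every link into a dictionary that maps link text to URL. The variable name links is only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup, used for illustration only.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Map the visible text of each <a> tag to its href attribute.
links = {a.get_text(strip=True): a.get("href") for a in soup.find_all("a")}
print(links)

# class is a multi-valued attribute, so it comes back as a list.
print(soup.p["class"])
```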

Four types of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

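Tag

A Tag object corresponds to an XML or HTML tag in the original document.

Expression                 Result
tag                        <b class="boldest">Extremely bold</b>
type(tag)                  <class 'bs4.element.Tag'>
tag.name                   'b'
tag.name = "blockquote"    tag becomes <blockquote class="boldest">Extremely bold</blockquote>

Attributes are read and changed like dictionary entries. For a tag that has id="boldest":

Expression                 Result
tag['id']                  'boldest'
tag['attribute'] = 1       adds (or overwrites) the attribute named "attribute"
del tag['id']              removes the id attribute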
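A runnable sketch of the same attribute operations follows. The one-line markup and the attribute name data-level are made up for illustration, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document for illustration.
soup = BeautifulSoup('<b id="boldest">bold</b>', "html.parser")
tag = soup.b

print(tag["id"])          # read an attribute like a dictionary entry
tag["data-level"] = "1"   # add (or overwrite) an attribute
del tag["id"]             # remove an attribute
print(tag.attrs)          # the remaining attributes as a dict
print(tag.get("id"))      # .get() returns None instead of raising KeyError

tag.name = "blockquote"   # renaming the tag changes the generated markup
print(tag)
```

Using tag.get() rather than tag[...] is usually the safer choice when an attribute may be missing.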

NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

The remaining examples are written as larger code blocks rather than tables, since that is more intuitive.

tag 
# <blockquote>Extremely bold</blockquote>
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

NavigableString is awkward to work with: it lacks most of the attributes and methods that Tag has. It also shows up in the search tree in places you might not expect (for example, as the whitespace between two tags), so a simple way to skip these nodes is given later in this article.
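The sketch below, assuming the built-in "html.parser", shows that a NavigableString is a str subclass that stays attached to the tree, and that str() gives a plain copy to keep once the tree changes.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

text = tag.string
print(isinstance(text, NavigableString))  # it is a NavigableString...
print(isinstance(text, str))              # ...which is also a str

# Convert to a plain str to keep the text independent of the tree.
plain = str(text)

# Replace the text node in place; the tag now holds the new string.
tag.string.replace_with("No longer bold")
print(tag)
print(plain)
```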

BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object, which means it supports most of the methods for navigating and searching the tree described below.

Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no attributes, and its .name is the special value '[document]'.
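A short sketch of how the BeautifulSoup object behaves, using a made-up one-paragraph document and the built-in "html.parser":

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p>hello</p>", "html.parser")

print(soup.name)              # the special name given to the document object
print(soup.attrs)             # an empty dict: the document has no attributes
print(isinstance(soup, Tag))  # BeautifulSoup is a subclass of Tag

# Search methods work on it just as they do on any tag.
print(soup.find("p").string)
```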

navigate and iterate

using tag name

Accessing a tag by name returns only the first tag with that name. Use find_all() to get every match.

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title  # same as soup.head.title
# <title>The Dormouse's story</title>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Special attribute

A tag’s direct children are available in a list called .contents:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag.contents
# ["The Dormouse's story"]

The .descendants attribute iterates over all of a tag's descendants: its direct children, their children, and so on.

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
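To make the difference concrete, this sketch compares direct children with all descendants on the head fragment of the sample document. The built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

# .children yields only the direct children (here: the <title> tag).
children = list(head_tag.children)

# .descendants also walks into the children (here: <title>, then its text).
descendants = list(head_tag.descendants)

print(len(children), len(descendants))
for node in descendants:
    print(type(node).__name__, repr(str(node)))
```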

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string  # the same node as title_tag.contents[0]
# "The Dormouse's story"
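When a tag has more than one child, .string is None, which is a common source of surprises. The sketch below, using a made-up paragraph and the built-in "html.parser", shows that case together with two alternatives, .stripped_strings and get_text().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p_tag = soup.p

# <p> has two children (the text "Hello " and the <b> tag), so .string is None.
print(p_tag.string)

# <b> has exactly one child, a NavigableString, so .string works.
print(soup.b.string)

# Alternatives that work regardless of the number of children.
pieces = list(p_tag.stripped_strings)
full_text = p_tag.get_text()
print(pieces, full_text)
```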

going up

title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
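Beyond .parent, a tag also exposes .parents, which walks all the way up to the document object. A minimal sketch, assuming the built-in "html.parser" and a trimmed version of the sample markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
title_tag = soup.title

# Walk from the immediate parent up to the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in title_tag.parents]
print(ancestor_names)

# The BeautifulSoup object is the root, so it has no parent.
print(soup.parent)
```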

going sideways
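Sideways navigation moves between nodes that share the same parent, through .next_sibling, .previous_sibling and their iterator forms. The markup below is made up for illustration and deliberately contains no whitespace between tags; with whitespace present, the neighboring node would often be a NavigableString instead of a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three sibling tags and no whitespace between them.
soup = BeautifulSoup("<p><b>one</b><i>two</i><u>three</u></p>", "html.parser")
first = soup.b

print(first.next_sibling)        # the <i> tag
print(soup.u.previous_sibling)   # also the <i> tag
print(first.previous_sibling)    # None: <b> is the first child

# Iterate over everything that follows <b> under the same parent.
following = [node.name for node in first.next_siblings]
print(following)

# find_next_sibling() searches the following siblings by tag name.
print(soup.b.find_next_sibling("u"))
```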

Searching the tree

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
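Each parameter in the signature narrows the search in a different way. The following sketch exercises them one by one on a shortened copy of the sample markup. The built-in "html.parser" and the result variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# name + limit: stop after the first two <a> tags.
first_two = soup.find_all("a", limit=2)

# **kwargs: any keyword argument is treated as an attribute filter.
by_id = soup.find_all(id="link3")

# class is a reserved word in Python, so the keyword is class_.
by_class = soup.find_all("a", class_="sister")

# attrs: the same kind of filter, given as a dictionary.
by_attrs = soup.find_all(attrs={"id": "link2"})

# string: match text nodes instead of tags.
by_string = soup.find_all(string="Elsie")

# recursive=False: look only at direct children of the object searched.
top_level_links = soup.find_all("a", recursive=False)

print(len(first_two), len(by_id), len(by_class), len(by_attrs))
print(by_string, top_level_links)
```

The last call returns an empty list because the links are nested inside the paragraph rather than being direct children of the document.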

filter

Check whether a tag has a given attribute with has_attr():

# info is assumed to be a Tag, for example one result of find_all()
if info.has_attr('property') and not info.has_attr('content'):
    print(info)
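The same has_attr() check can be wrapped in a function and passed to find_all() as a filter, so that the search itself returns only the matching tags. In this sketch the meta tags and the function name has_property_but_no_content are hypothetical, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the second <meta> has "property" without "content".
html_doc = """<head>
<meta property="og:title" content="Some title"/>
<meta property="og:type"/>
<meta name="author" content="Someone"/>
</head>"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_property_but_no_content(tag):
    """Return True for tags that define 'property' but not 'content'."""
    return tag.has_attr("property") and not tag.has_attr("content")


# find_all() calls the function once per tag and keeps the tags it accepts.
matches = soup.find_all(has_property_but_no_content)
print(matches)
```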

some caveats

NavigableString nodes often get in the way. They can be skipped like this:

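from bs4 import BeautifulSoup, NavigableString, Tag

# minisite_list is assumed to be an iterable of nodes, such as the .children of a tag
for _minisite in minisite_list:
    if isinstance(_minisite, NavigableString):
        continue  # skip text nodes
    if isinstance(_minisite, Tag):
        print(_minisite)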
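A self-contained version of this idea is sketched below. The list markup is made up for illustration, the built-in "html.parser" is assumed, and the names tags_only and direct_tags are hypothetical. It also shows where the unexpected NavigableString nodes come from: the newlines between the tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup; the newlines between tags become NavigableString nodes.
html_doc = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(html_doc, "html.parser")
ul_tag = soup.ul

# Five children: newline, <li>, newline, <li>, newline.
print(len(ul_tag.contents))

# Keep only the Tag children and drop the text nodes.
tags_only = [child for child in ul_tag.children if isinstance(child, Tag)]

# find_all(True, recursive=False) returns the direct child tags as well.
direct_tags = ul_tag.find_all(True, recursive=False)

print([t.get_text() for t in tags_only])
```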

Getting started with python bs4 module