Description of Beautiful Soup
Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with your favorite parser, it provides idiomatic ways of navigating, searching, and modifying the parse tree. It can save you hours or even days of work.
The description above is taken from the official website.
Beautiful Soup installation
$ easy_install beautifulsoup4
# or
$ pip install beautifulsoup4
Install the parser
Beautiful Soup supports the HTML parser in the Python standard library, and also many third-party parsers such as lxml and html5lib. You can install lxml in any of the following ways:
$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml
Another alternative is html5lib, a parser implemented in pure Python that parses pages the same way a web browser does. You can install html5lib in any of the following ways:
$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib
Each parser has its advantages and disadvantages: lxml is very fast and fairly lenient; html5lib parses pages the same way a browser does, but is slow; the parser built into the standard library needs no extra dependency. Using lxml as the parser is recommended because it is the most efficient. If you are running a Python 2 version earlier than 2.7.3, or a Python 3 version earlier than 3.2.2, you must install lxml or html5lib, because the HTML parsing built into the standard library of those versions is not stable enough.
Note: if an HTML or XML document is not well-formed, different parsers may return different results.
Using Beautiful Soup
Beautiful Soup is very simple to use. Pass a document (a string or an open file handle) to the BeautifulSoup constructor to get a document object; after that, all operations on the document go through this object. Most of the time the text being parsed comes from a web crawler, so Beautiful Soup pairs especially well with the requests library.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
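A minimal sketch of the two constructor forms above (the file name demo.html and its contents are made up for illustration; "html.parser" is named explicitly only to keep the sketch reproducible):

```python
from bs4 import BeautifulSoup

# The constructor accepts either a string or an open file handle.
html = "<html><head><title>demo</title></head><body><p>hi</p></body></html>"

# Write a tiny page to disk purely to demonstrate the file-handle form.
with open("demo.html", "w") as f:
    f.write(html)

soup_from_string = BeautifulSoup(html, "html.parser")
with open("demo.html") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.title.string)  # demo
print(soup_from_file.title.string)    # demo
```

Both forms yield the same document tree.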
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
BeautifulSoup("Sacr&eacute; bleu!")
<html><head></head><body>Sacré bleu!</body></html>
Then Beautiful Soup parses the document using the most suitable available parser. If you specify a parser by hand, Beautiful Soup uses the one you specified.
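Different parsers can build different trees from the same imperfect markup, which is why naming one explicitly is worthwhile. A small sketch using the standard library's "html.parser" (results for "lxml" and "html5lib" would differ if they were installed, since those parsers add missing <html> and <body> tags):

```python
from bs4 import BeautifulSoup

# Passing the parser name explicitly makes results reproducible across machines.
soup = BeautifulSoup("<a><b /></a>", "html.parser")
print(soup)         # html.parser keeps the fragment as-is: <a><b></b></a>
print(soup.a.name)  # a
```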
Types of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. All objects can be grouped into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
- Tag is an HTML tag, such as div, p, or the h1–h6 tags; it is also the most frequently used object.
- NavigableString is the text inside a tag; the name literally means "a string you can navigate from".
- BeautifulSoup represents the entire content of a document and can mostly be treated as a Tag.
- Comment is a special kind of NavigableString; printing it does not include the comment markers.
Tag
A Tag object corresponds to a tag in the original XML or HTML document:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Important attributes of a tag
The two most important attributes of a tag are its name and its attributes.
Name
Every tag has a name, which can be obtained via .name:
tag.name
# u'b'
If you change a tag's name, the change is reflected in any HTML markup generated from the current Beautiful Soup object:
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
Attributes
Attributes
A tag may have any number of attributes. The tag <b class="boldest"> has a "class" attribute whose value is "boldest". You can operate on a tag's attributes the same way as on a dictionary: attributes can be accessed, added, deleted, or modified:
tag['class']
# u'boldest'
# A tag's attributes can be added, deleted, or modified
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
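A sketch of the dictionary-style access described above, using a made-up <div> (the attribute name "missing" is hypothetical):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="main">text</div>', "html.parser")
tag = soup.div

# tag['missing'] would raise KeyError; tag.get() returns None or a default,
# exactly like dict.get().
print(tag["id"])                  # main
print(tag.get("missing"))         # None
print(tag.get("missing", "n/a"))  # n/a
```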
You can also access that dictionary directly via .attrs:
tag.attrs
# {u'class': u'boldest'}
NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to wrap these strings:
tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A NavigableString is just like a Python Unicode string, except that it also supports some of the features for navigating and searching the document tree. You can convert a NavigableString to a plain Unicode string with the unicode() function (in Python 3, use str() instead):
unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>
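Under Python 3 the unicode() builtin no longer exists; a sketch of the equivalent conversion with str():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>bold text</b>", "html.parser")
s = soup.b.string
print(type(s).__name__)  # NavigableString
plain = str(s)           # convert to a plain built-in string
print(plain)             # bold text
```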
The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
BeautifulSoup
The BeautifulSoup object represents a document's entire content. Most of the time you can treat it as a Tag object: it supports most of the methods described in the sections on traversing and searching the document tree.
Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name or attributes. But since it is sometimes convenient to look at its .name, it is given the special value "[document]".
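A quick sketch confirming the special .name value:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.name)  # [document]
```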
Comments and special strings
Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but there are a few special cases. The comment parts of a document are represented by Comment objects, which are printed with a special format:
markup = "<b><!--This is a comment--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
# The Comment object prints the comment text without the comment markers
comment
# u'This is a comment'
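A sketch showing that a Comment is still a NavigableString, so an isinstance check is the way to tell them apart (the markup here is made up):

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<b><!--hidden note--></b>", "html.parser")
c = soup.b.string

# Comment subclasses NavigableString, so check for Comment first.
print(isinstance(c, Comment))          # True
print(isinstance(c, NavigableString))  # True
print(c)                               # hidden note
```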
Traverse the document tree
First, define an HTML document as a string; the examples below analyze it:
html_doc = """
<html><head><title>index</title></head>
<p class="title"><b>商城</b></p>
<p class="story">这是我的第三个商城,欢迎来参观
<a href="http://cityShop.com/elsie" class="city" id="home">home</a>
<a href="http://cityShop.com/lacie" class="city" id="design">design</a>
<a href="http://cityShop.com/tillie" class="city" id="products">products</a>
欢迎来参观.</p>
<p class="welcome">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
Child nodes
A tag may contain multiple strings or other tags; these are the tag's child nodes. Beautiful Soup provides many attributes and operations for traversing them.
Note: string nodes in Beautiful Soup do not support these attributes, because a string has no children.
Getting a tag by name
The simplest way to navigate the document tree is to say the name of the tag you want. To get the <head> tag, just use soup.head:
soup.head
# <head><title>index</title></head>
soup.title
# <title>index</title>
You can use this shorthand repeatedly to zoom in on part of the document tree. For example, to get the first <b> tag inside the <body> tag:
soup.body.b
# <b>商城</b>
Dotted attribute access only returns the first tag with the given name:
soup.a
# <a href="http://cityShop.com/elsie" class="city" id="home">home</a>
If you need all the <a> tags, or more than just the first tag with a certain name, use find_all():
soup.find_all('a')
# [<a href="http://cityShop.com/elsie" class="city" id="home">home</a>,
#  <a href="http://cityShop.com/lacie" class="city" id="design">design</a>,
#  <a href="http://cityShop.com/tillie" class="city" id="products">products</a>]
.contents and .children
A tag's .contents attribute returns its child nodes as a list:
head_tag = soup.head
head_tag
# <head><title>index</title></head>
head_tag.contents
# [<title>index</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>index</title>
title_tag.contents
# [u'index']
A tag's children can also be looped over via the .children generator:
for child in title_tag.children:
print(child)
# index
.children only yields the tag's direct children, not its descendants.
.descendants
The .contents and .children attributes only consider a tag's direct children. For example, the <head> tag has a single direct child, the <title> tag, but <title> itself contains a child node of its own: the string "index". In this sense the string "index" is also a descendant of the <head> tag. The .descendants attribute lets you loop recursively over all of a tag's descendants:
for child in head_tag.descendants:
print(child)
# <title>index</title>
# index
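The difference between direct children and all descendants can be checked directly, in a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>index</title></head>", "html.parser")
head = soup.head

direct = list(head.children)         # just the <title> tag
everything = list(head.descendants)  # the <title> tag plus its string "index"
print(len(direct))      # 1
print(len(everything))  # 2
```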
Parent nodes
Every tag or string has a parent node: the tag that contains it.
.parent
An element's parent is obtained through the .parent attribute. In the example document, the parent of the <title> tag is <head>, the parent of <html> is the BeautifulSoup object, and the parent of the BeautifulSoup object itself is None.
title_tag = soup.title
title_tag
# <title>index</title>
title_tag.parent
# <head><title>index</title></head>
# The document title's string also has a parent: the <title> tag
title_tag.string.parent
# <title>index</title>
# The parent of a top-level tag such as <html> is the BeautifulSoup object:
html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>
# The .parent of the BeautifulSoup object is None:
print(soup.parent)
# None
.parents
The .parents attribute lets you iterate recursively over all of an element's ancestors. The following example uses .parents to walk from an <a> tag up to the root of the document:
link = soup.a
link
# <a href="http://cityShop.com/elsie" class="city" id="home">home</a>
for parent in link.parents:
if parent is None:
print(parent)
else:
print(parent.name)
# Output:
# p
# body
# html
# [document]
# None
Sibling nodes
.next_sibling and .previous_sibling
Siblings are elements at the same level of the document tree. Use the .next_sibling and .previous_sibling attributes to move between them:
soup = BeautifulSoup(html_doc, "lxml")
p_tag=soup.p
print(p_tag.next_sibling)
print(p_tag.next_sibling.next_sibling)
# Output:
<p class="story">这是我的第三个商城,欢迎来参观
<a href="http://cityShop.com/elsie" class="city" id="home">home</a>
<a href="http://cityShop.com/lacie" class="city" id="design">design</a>
<a href="http://cityShop.com/tillie" class="city" id="products">products</a>
欢迎来参观.</p>
The first next_sibling of the <p> tag is the newline character between the two <p> tags, which is why .next_sibling is called twice.
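That whitespace gotcha can be seen in a minimal sketch:

```python
from bs4 import BeautifulSoup

# Whitespace between tags becomes a text node in the tree, so the first
# .next_sibling is often a newline rather than the next tag.
soup = BeautifulSoup("<p>one</p>\n<p>two</p>", "html.parser")
first = soup.p
print(repr(first.next_sibling))         # '\n'
print(first.next_sibling.next_sibling)  # <p>two</p>
```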
.next_siblings and .previous_siblings
The .next_siblings and .previous_siblings attributes let you iterate over a node's siblings:
soup = BeautifulSoup(html_doc, "lxml")
last_p = soup.find_all("p")[-1]  # the <p class="welcome"> tag
for sibling in last_p.previous_siblings:
    print(sibling)
# Output (blank lines from the newlines between tags omitted):
<p class="story">这是我的第三个商城,欢迎来参观
<a href="http://cityShop.com/elsie" class="city" id="home">home</a>
<a href="http://cityShop.com/lacie" class="city" id="design">design</a>
<a href="http://cityShop.com/tillie" class="city" id="products">products</a>
欢迎来参观.</p>
<p class="title"><b>商城</b></p>
Forward and backward
The .next_element and .previous_element attributes point to the object that was parsed immediately after or before the current one. Note that this differs from siblings: siblings are children of the same parent, while next/previous elements follow the order in which the document was parsed.
For example, in html_doc the sibling of <head> is <body> (ignoring the newline between them), because they share the parent <html>; but the element parsed immediately after <head> is <title>. That is, soup.head.next_sibling is <body>, while soup.head.next_element is <title>.
soup = BeautifulSoup(html_doc, "lxml")
head_tag=soup.head
print(head_tag.next_element)
title_tag=soup.title
print(title_tag.next_element)
# Output:
<title>index</title>
index
Also note that the element after <title> is not the <body> tag but the text inside <title>, because the document is parsed in order: the <title> tag is opened, then its contents are parsed, then the tag is closed.
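The distinction between .next_sibling and .next_element shows up clearly in a minimal document with no whitespace between tags:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>index</title></head><body></body></html>",
    "html.parser",
)
head = soup.head
print(head.next_sibling.name)  # body  -- same parent, next in line
print(head.next_element.name)  # title -- next thing the parser saw
```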
The .previous_element attribute is the exact opposite of .next_element: it points to whatever was parsed immediately before the current object:
# last_a_tag is the last <a> tag in the document:
last_a_tag = soup.find_all("a")[-1]
last_a_tag.previous_element
# u'\n'
last_a_tag.previous_element.next_element
# <a href="http://cityShop.com/tillie" class="city" id="products">products</a>
You can also use the .next_elements and .previous_elements iterators to move forward or backward through the document in parse order. Note that newline characters take their place in the parse order too, just as they did when iterating over siblings:
for element in last_a_tag.next_elements:
    print(repr(element))
# u'products'
# u'\n欢迎来参观.'
# u'\n'
# <p class="welcome">...</p>
# u'...'
# u'\n'
Search the document tree
Beautiful Soup defines many search methods; the most important and most commonly used are find() and find_all().
Define a document instance:
html_doc = """
<html><head><title>index</title></head>
<p class="title"><b>商城首页</b></p>
<p class="story">这是我的第三个商城,欢迎来参观
<a href="http://cityShop.com/elsie" class="city" id="home">home</a>
<a href="http://cityShop.com/lacie" class="city" id="design">design</a>
<a href="http://cityShop.com/tillie" class="city" id="products">products</a>
欢迎来参观.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
Use find_all() to search the document for the content you are looking for:
soup.find_all('b')
# Output:
# [<b>商城首页</b>]
If you pass in a regular expression, Beautiful Soup matches tag names against it using its match() method. The following example finds all tags whose names start with b, which means both the <body> and <b> tags are found:
import re
for tag in soup.find_all(re.compile("^b")):
print(tag.name)
# body
# b
Find all tags whose names contain the letter "t":
for tag in soup.find_all(re.compile("t")):
print(tag.name)
# Output:
# html
# title
If you pass in a list, Beautiful Soup returns content matching any element of the list. The following code finds all the <a> and <b> tags in the document:
soup.find_all(["a", "b"])
# [<b>商城首页</b>,
#  <a href="http://cityShop.com/elsie" class="city" id="home">home</a>,
#  <a href="http://cityShop.com/lacie" class="city" id="design">design</a>,
#  <a href="http://cityShop.com/tillie" class="city" id="products">products</a>]
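Since find() was named above as one of the two key methods but not demonstrated, here is a minimal sketch of how it differs from find_all() (the markup is made up):

```python
from bs4 import BeautifulSoup

# find() returns the first match (or None); find_all() returns a list.
soup = BeautifulSoup('<a id="one">1</a><a id="two">2</a>', "html.parser")
print(soup.find("a"))            # <a id="one">1</a>
print(soup.find("a", id="two"))  # <a id="two">2</a>
print(len(soup.find_all("a")))   # 2
print(soup.find("b"))            # None
```

find_all() also accepts keyword arguments such as id, exactly as shown here for find().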
More detailed documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#