Contents
- 1. Installing and configuring Beautiful Soup 4
- 2. Basic usage of BeautifulSoup
- (1) Node selector (tag)
- (2) Method selectors
- (3) CSS selectors
- (4) Tag modification methods
Python's Beautiful Soup is an HTML/XML parsing library. It makes it easy to extract data from web pages and provides a powerful API with a variety of parsing options.

Three characteristics of Beautiful Soup:
- Beautiful Soup provides a few simple, Pythonic methods and functions for navigating, searching, and modifying the parse tree; it is a toolbox that extracts the data users need by parsing a document.
- Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings at all, unless the document does not specify one, in which case you only need to supply the original encoding.
- Beautiful Soup sits on top of popular Python parsers (such as lxml and html5lib), letting you try different parsing strategies or trade speed for flexibility.
1. Installing and configuring Beautiful Soup 4
Beautiful Soup 4 is published on PyPI, so it can be installed with the package management tools under the package name beautifulsoup4:

```shell
$ easy_install beautifulsoup4
# or
$ pip install beautifulsoup4
```
You can also install it by downloading the source package:

```shell
# wget https://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
# tar xf beautifulsoup4-4.1.0.tar.gz
# cd beautifulsoup4
# python setup.py install
```
Beautiful Soup actually depends on a parser to do the parsing. Besides the HTML parser in the Python standard library, it also supports third-party parsers such as lxml.

The parsers Beautiful Soup supports, with their advantages and disadvantages:
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup,"html.parser") | Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2.2) | Not as fast as lxml; less lenient than html5lib |
lxml HTML parser | BeautifulSoup(markup,"lxml") | Very fast; lenient | External C dependency |
lxml XML parser | BeautifulSoup(markup,["lxml", "xml"]) or BeautifulSoup(markup,"xml") | Very fast; the only currently supported XML parser | External C dependency |
html5lib | BeautifulSoup(markup,"html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |
Installing the parsers:

```shell
$ pip install lxml
$ pip install html5lib
```
lxml is recommended as the parser because it is more efficient. Before Python 2.7.3 and before Python 3.2.2, installing lxml or html5lib is required, because the HTML parsing method built into the standard library in those earlier versions is not stable enough.
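A quick way to see that the parser choice matters: different parsers repair invalid markup differently. This sketch uses only the standard-library parser so it runs without extra installs; the behavior of the other parsers is described in the comments.

```python
from bs4 import BeautifulSoup

# An invalid fragment: an unclosed <a> followed by a stray </p>
broken = "<a></p>"

# html.parser keeps the fragment as-is and simply closes the <a> tag.
# lxml would wrap the result in <html><body>...</body></html>, and
# html5lib would additionally insert an empty <p> inside the <a>,
# mimicking what a web browser does.
print(BeautifulSoup(broken, "html.parser"))  # <a></a>
```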
2. Basic usage of BeautifulSoup
Passing a file handle or a string of markup to the BeautifulSoup constructor produces a document object. An appropriate parser is selected automatically to parse the document, or you can specify the parser manually. Beautiful Soup transforms a complex HTML document into a tree structure in which every node is a Python object; all objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Note: version 4 of the Beautiful Soup package is imported as bs4.
```python
from bs4 import BeautifulSoup

# The code examples below all use this document for testing
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

soup = BeautifulSoup(html_doc, "lxml")
soup1 = BeautifulSoup(markup, "lxml")
tag = soup.a
navstr = tag.string
comment = soup1.b.string

print(type(tag))      # Tag object: a tag in the document
print(type(comment))  # Comment object: the contents of a comment in the document
print(type(navstr))   # NavigableString object: a wrapped string
print(type(soup))     # BeautifulSoup object: the entire document
# <class 'bs4.element.Tag'>
# <class 'bs4.element.Comment'>
# <class 'bs4.element.NavigableString'>
# <class 'bs4.BeautifulSoup'>
```
(1) Node selector (tag)
An element can be selected simply by calling the node's name as an attribute, and such selections can be nested. The objects returned are of type bs4.element.Tag.

```python
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.head)      # get the head tag
print(soup.p.b)       # get the b node nested inside the p node
print(soup.a.string)  # get the text under the a tag; only the first match is returned
```
The name attribute gets the name of a node:

```python
soup.body.name
```
The attrs attribute gets a node's attributes, which can also be accessed directly in dictionary form. The result may be a list or a string, depending on the attribute:

```python
soup.p.attrs           # get all attributes of the p node
soup.p.attrs['class']  # get the class attribute of the p node
soup.p['class']        # access the p node's class attribute directly
```
The string attribute gets the text contained in an element node:

```python
soup.p.string  # get the text content of the first p node
```
The contents attribute gets a node's direct children, returned as a list:

```python
soup.body.contents  # direct children only, excluding deeper descendants
```
The children attribute also gets a node's direct children, but returns a generator:

```python
soup.body.children
```
The descendants attribute gets all descendant nodes, returned as a generator:

```python
soup.body.descendants
```
The parent attribute gets the parent node; parents gets all ancestor nodes, returned as a generator:

```python
soup.b.parent
soup.b.parents
```
The next_sibling attribute returns the next sibling node and previous_sibling returns the previous one. Note that line breaks also count as nodes, so the sibling you get is often a string or whitespace:

```python
soup.a.next_sibling
soup.a.previous_sibling
```
next_siblings and previous_siblings return all following and preceding siblings, as generators:

```python
soup.a.next_siblings
soup.a.previous_siblings
```
The next_element and previous_element attributes get the next or previous object to be parsed:

```python
soup.a.next_element
soup.a.previous_element
```
next_elements and previous_elements return iterators over the document content parsed after or before the current node:

```python
soup.a.next_elements
soup.a.previous_elements
```
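Putting the navigation attributes above together on a small, made-up document (written on one line so that no whitespace-only siblings get in the way):

```python
from bs4 import BeautifulSoup

doc = '<html><body><p id="p1"><b>one</b><i>two</i></p><p id="p2">three</p></body></html>'
soup = BeautifulSoup(doc, "html.parser")

print([c.name for c in soup.body.children])  # direct children of body: ['p', 'p']
print([t.name for t in soup.b.parents])      # ['p', 'body', 'html', '[document]']
print(soup.b.next_sibling)                   # <i>two</i>
print(soup.p.next_sibling["id"])             # p2
print(soup.p.next_element.name)              # b -- the next parsed object, not a sibling
```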
(2) Method selectors
Everything above selects by node attributes, which is very fast but not flexible enough for more complex selections. Fortunately, Beautiful Soup also provides a number of query methods for us, such as find_all() and find().
find_all(name, attrs, recursive, text, **kwargs): finds all elements matching the conditions. Its parameters are:
- name: find all tags whose name matches; it can also be a filter, a regular expression, a list, or True.
- attrs: pass attributes in dictionary form via the attrs parameter, e.g. attrs={'id': '123'}. Commonly used attributes such as id can also be passed as keyword arguments. Since class is a Python keyword, queries on class need a trailing underscore: class_='element'. The result returned is a list of Tag objects.
- text: matches the text of nodes; it can be passed as a string or as a regular expression object.
- recursive: if you want to search only the direct children, set this parameter to False: recursive=False.
- limit: limits the number of results returned, similar to the limit keyword in SQL.
```python
import re
from bs4 import BeautifulSoup

# The examples below all use this text for testing
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> ddd
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span>中文</span>
"""
soup = BeautifulSoup(html_doc, 'lxml')

print(type(soup))
print(soup.find_all('span'))                     # look up by tag name
print(soup.find_all('a', id='link1'))            # filter by a tag attribute
print(soup.find_all('a', attrs={'class': 'sister', 'id': 'link3'}))  # multiple attributes
print(soup.find_all('p', class_='title'))        # class is special: pass it as class_
print(soup.find_all(text=re.compile('Tillie')))  # filter by text
print(soup.find_all('a', limit=2))               # limit the number of results
# <class 'bs4.BeautifulSoup'>
# [<span>中文</span>]
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# [<p class="title"><b>The Dormouse's story</b></p>]
# ['Tillie']
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
find(name, attrs, recursive, text, **kwargs): returns a single element, i.e. the first matching element; the return type is still Tag.

Its parameters are the same as those of find_all().
There are also a number of other query methods whose usage is identical to the find_all() and find() described above; only the query scope differs, while the parameters are the same:
- find_parents(name, attrs, recursive, text, **kwargs) and find_parent(name, attrs, recursive, text, **kwargs): the former returns all ancestor nodes, the latter returns the direct parent.
- find_next_siblings(name, attrs, recursive, text, **kwargs) and find_next_sibling(name, attrs, recursive, text, **kwargs): iterate over the siblings after the current tag; the former returns all following siblings, the latter returns the first following sibling.
- find_previous_siblings(name, attrs, recursive, text, **kwargs) and find_previous_sibling(name, attrs, recursive, text, **kwargs): iterate over the siblings before the current tag; the former returns all preceding siblings, the latter returns the first preceding sibling.
- find_all_next(name, attrs, recursive, text, **kwargs) and find_next(name, attrs, recursive, text, **kwargs): iterate over the tags and strings after the current tag; the former returns all matching nodes, the latter returns the first matching node.
- find_all_previous() and find_previous(): iterate over the tags and strings before the current tag; the former returns all matching nodes, the latter returns the first matching node.
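A short sketch of these scoped find variants on a made-up fragment (the ids and class names here are purely illustrative):

```python
from bs4 import BeautifulSoup

doc = '<div id="box"><p>first</p><p class="x">second</p><p>third</p></div>'
soup = BeautifulSoup(doc, "html.parser")
target = soup.find("p", class_="x")  # start from the middle <p>

print(target.find_parent("div")["id"])                 # box
print(target.find_next_sibling("p").string)            # third
print(target.find_previous_sibling("p").string)        # first
print([t.string for t in target.find_all_next("p")])   # ['third']
```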
(3) CSS selectors
Beautiful Soup also provides CSS selectors. If you are not very familiar with CSS selectors, refer to http://www.w3school.com.cn/cssref/css_selectors.asp

Call the .select() method on a Tag or BeautifulSoup object and pass in a string parameter to find tags using CSS selector syntax:

```python
In [10]: soup.select('title')
Out[10]: [<title>The Dormouse's story</title>]
```
Find tags layer by layer through nested tags:

```python
In [12]: soup.select('body a')
Out[12]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
Find direct child tags under a tag:

```python
In [13]: soup.select('head > title')
Out[13]: [<title>The Dormouse's story</title>]
```
Find sibling tags:

```python
In [14]: soup.select('#link1 ~ .sister')
Out[14]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
Find by CSS class name:

```python
In [15]: soup.select('.title')
Out[15]: [<p class="title"><b>The Dormouse's story</b></p>]

In [16]: soup.select('[class~=title]')
Out[16]: [<p class="title"><b>The Dormouse's story</b></p>]
```
Find tags by id:

```python
In [17]: soup.select('#link1')
Out[17]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [18]: soup.select('a#link2')
Out[18]: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
Find by whether an attribute exists:

```python
In [20]: soup.select('a[href]')
Out[20]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
Find by matching an attribute's value:

```python
In [22]: soup.select('a[href="http://example.com/elsie"]')
Out[22]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [23]: soup.select('a[href^="http://example.com/"]')  # match the start of the value
Out[23]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [24]: soup.select('a[href$="tillie"]')  # match the end of the value
Out[24]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [25]: soup.select('a[href*=".com/el"]')  # fuzzy (substring) match
Out[25]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
```
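Since select() returns a plain list of Tag objects, the results can be iterated over directly; select_one() returns only the first match (or None). A minimal sketch, using a document made up for the example:

```python
from bs4 import BeautifulSoup

doc = ('<p class="story">'
       '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
       '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
       '</p>')
soup = BeautifulSoup(doc, "html.parser")

for a in soup.select('p.story a.sister'):  # select() returns a list of Tags
    print(a['id'], a.get_text())
print(soup.select_one('#link2')['href'])   # select_one() returns the first match or None
```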
The three approaches (node selection by tag, method selectors, and CSS selectors) work in broadly similar ways. Selecting by tag is the fastest of the three, but for more complex lookups the method selectors offer a more convenient and more powerful search; use whichever fits the situation.
(4) Tag modification methods
Beautiful Soup's strength is searching documents; its modification features are needed in fewer scenarios, so only a brief introduction is given here. To learn more, consult the modification methods in the official documentation.

Beautiful Soup can change the value of a tag's properties and add or remove attributes and content. Here are some of the common methods:
```python
In [26]: markup = '<a href="http://www.baidu.com/">baidu</a>'
In [28]: soup = BeautifulSoup(markup, 'lxml')
In [29]: soup.a.string = 'Baidu'
In [30]: soup.a
Out[30]: <a href="http://www.baidu.com/">Baidu</a>
# if the a node contained child nodes, they would be overwritten as well
```
The Tag.append() method adds content to a tag, just like Python's list.append():

```python
In [30]: soup.a
Out[30]: <a href="http://www.baidu.com/">百度</a>
In [31]: soup.a.append('一下')
In [32]: soup.a
Out[32]: <a href="http://www.baidu.com/">百度一下</a>
```
The new_tag() method creates a new tag:

```python
In [33]: soup = BeautifulSoup('<b></b>', 'lxml')
In [34]: new_tag = soup.new_tag('a', href="http://www.python.org")  # create a tag; the first argument must be the tag name
In [35]: soup.b.append(new_tag)     # append it under the b node
In [36]: new_tag.string = 'python'  # set the tag's text
In [37]: soup.b
Out[37]: <b><a href="http://www.python.org">python</a></b>
```
Other methods:
- insert() inserts an element at the specified position
- insert_before() inserts content before the current tag or text node
- insert_after() inserts content after the current tag or text node
- clear() removes the contents of the current tag
- extract() removes the current tag from the document tree and returns it as the method's result
- prettify() formats the Beautiful Soup document tree and outputs it as a Unicode string; it can also be called on a tag node
- get_text() returns the text contained in a tag, including the text of its descendants
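A small sketch combining a few of the modification methods listed above (the markup is made up for the example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>hello</b> world</p>", "html.parser")

soup.b.insert_before("say ")  # insert a text node before the <b> tag
removed = soup.b.extract()    # remove <b> from the tree; extract() returns it
print(removed)                # <b>hello</b>
print(soup.p.get_text())      # "say  world" -- the <b> text is gone
```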
The soup.original_encoding attribute records the automatically detected encoding.

The from_encoding parameter can be used to specify the encoding when creating a BeautifulSoup object, which avoids the cost of guessing the encoding.
To parse only part of a document, use the SoupStrainer class to create a content filter; it accepts the same parameters as the search methods:

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> ddd
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span>中文</span>
"""
only_a_tags = SoupStrainer('a')  # the filter
soup = BeautifulSoup(html_doc, 'lxml', parse_only=only_a_tags)
print(soup.prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>
```
Beautiful Soup exception handling:
- HTMLParser.HTMLParseError: malformed start tag
- HTMLParser.HTMLParseError: bad end tag

Both of these exceptions are raised by the parser; the fix is to install lxml or html5lib.
Further reading:
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh