Python's BeautifulSoup4

Contents

  • 1, Beautiful Soup 4 installation and configuration
  • 2, basic usage of BeautifulSoup
  • (1) Node selectors (tags)
  • (2) Method selectors
  • (3) CSS selectors
  • (4) Tag modification methods

Python's Beautiful Soup is a library for parsing HTML and XML. It makes it easy to extract data from web pages, and it offers a powerful API and a variety of parsing strategies.

Three characteristics of Beautiful Soup:

  • Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that extracts the data users need by parsing the document.
  • Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't need to think about encodings unless the document doesn't specify one, in which case you only need to supply the original encoding.
  • Beautiful Soup sits on top of popular Python parsers (such as lxml and html5lib), letting you try out different parsing strategies and trade speed for flexibility.
 

1, Beautiful Soup 4 installation and configuration

Beautiful Soup 4 is published on PyPI, so it can be installed with the system's package management tools under the package name beautifulsoup4:

$easy_install beautifulsoup4
or
$pip install beautifulsoup4

It can also be installed by downloading the source package:

#wget https://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
#tar xf beautifulsoup4-4.1.0.tar.gz
#cd beautifulsoup4-4.1.0
#python setup.py install

When parsing, Beautiful Soup actually depends on a parser. Besides the HTML parser in the Python standard library, it also supports third-party parsers such as lxml.

The parsers Beautiful Soup supports, with their advantages and disadvantages:

Python standard library: BeautifulSoup(markup, "html.parser")
  • Advantages: built into Python's standard library; moderate speed; reasonably tolerant of bad markup
  • Disadvantages: poor fault tolerance in versions before Python 2.7.3 and 3.2.2

lxml HTML parser: BeautifulSoup(markup, "lxml")
  • Advantages: very fast; tolerant of bad markup
  • Disadvantages: requires a C library to be installed

lxml XML parser: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
  • Advantages: very fast; the only supported XML parser
  • Disadvantages: requires a C library to be installed

html5lib: BeautifulSoup(markup, "html5lib")
  • Advantages: best fault tolerance; parses documents the same way a browser does; generates valid HTML5
  • Disadvantages: very slow; external Python dependency

Installing the parsers:

$pip install lxml
$pip install html5lib

lxml is recommended as the parser because it is more efficient. In Python versions before 2.7.3, and in Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into those versions of the standard library is not stable enough.
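The differences between parsers are easiest to see on broken markup. A minimal sketch of how each backend repairs an invalid fragment (only the built-in html.parser is exercised here, so nothing extra needs installing; the fragment is illustrative):

```python
from bs4 import BeautifulSoup

# An invalid fragment: an <a> tag "closed" by a stray </p>.
broken = "<a></p>"

# The built-in html.parser simply drops the unmatched </p>:
print(BeautifulSoup(broken, "html.parser"))  # <a></a>

# lxml (if installed) additionally wraps the fragment in <html><body>:
# BeautifulSoup(broken, "lxml")  ->  <html><body><a></a></body></html>
```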

 

2, basic usage of BeautifulSoup

Passing a file handle or a string of markup to the BeautifulSoup constructor yields a BeautifulSoup object for the document. An appropriate parser is chosen automatically, or you can specify one explicitly. Beautiful Soup transforms a complex HTML document into a tree of Python objects, and every object falls into one of four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Note: in version 4, the package is imported as bs4.

from bs4 import BeautifulSoup

# The examples below all use this document for testing
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(html_doc, "lxml")
soup1 = BeautifulSoup(markup, "lxml")
tag = soup.a
navstr = tag.string
comment = soup1.b.string
print(type(tag))      # Tag object: a tag in the document
print(type(comment))  # Comment object: the content of a comment in the document
print(type(navstr))   # NavigableString object: wraps string content
print(type(soup))     # BeautifulSoup object: the whole document

# Output:
# <class 'bs4.element.Tag'>
# <class 'bs4.element.Comment'>
# <class 'bs4.element.NavigableString'>
# <class 'bs4.BeautifulSoup'>
 

(1) Node selectors (tags)

An element can be selected directly by using the node's name as an attribute, and these selections can be nested. The returned objects are of type bs4.element.Tag.

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.head)      # get the head tag
print(soup.p.b)       # get the b node under the p node
print(soup.a.string)  # get the text under an a tag (only the first match)

The name attribute gets the node's name:

soup.body.name

The attrs attribute gets a node's attributes, which can also be accessed directly in dictionary form. The result may be a list or a string, depending on the attribute type.

soup.p.attrs           # get all attributes of the p node
soup.p.attrs['class']  # get the class attribute of the p node
soup.p['class']        # access the p node's class attribute directly
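The list-versus-string distinction depends on whether the attribute can hold multiple values. A minimal sketch (the fragment and names are illustrative; html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title story" id="p1">text</p>', "html.parser")

print(soup.p.attrs)     # {'class': ['title', 'story'], 'id': 'p1'}
print(soup.p['class'])  # ['title', 'story'] -- class is multi-valued, so a list
print(soup.p['id'])     # 'p1' -- id is single-valued, so a plain string
```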

The string attribute gets the text contained in the element:

soup.p.string  # get the text content of the first p node

The contents attribute gets the node's direct children, returned as a list:

soup.body.contents  # direct children only, not descendants

The children attribute also gets the direct children, but returns a generator:

soup.body.children
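A minimal sketch of the difference between contents and children (the fragment is illustrative; html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>one</p><p>two</p></body>", "html.parser")

print(soup.body.contents)          # [<p>one</p>, <p>two</p>] -- a plain list
for child in soup.body.children:   # a generator over the same direct children
    print(child)
```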

The descendants attribute gets descendant nodes, returned as a generator:

soup.body.descendants

The parent attribute gets the direct parent node; parents gets the ancestor nodes and returns a generator:

soup.b.parent
soup.b.parents

The next_sibling attribute returns the next sibling node, and previous_sibling returns the previous one. Note that line breaks are also nodes, so the sibling you get is often a string or whitespace:

soup.a.next_sibling
soup.a.previous_sibling
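A minimal sketch of the whitespace caveat above (the fragment is illustrative; html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

# The newline between the two <a> tags is itself a node in the tree.
doc = """<p><a id="link1">Elsie</a>
<a id="link2">Lacie</a></p>"""
soup = BeautifulSoup(doc, "html.parser")

print(repr(soup.a.next_sibling))         # '\n' -- the whitespace node
print(soup.a.next_sibling.next_sibling)  # <a id="link2">Lacie</a>
```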

next_siblings and previous_siblings return all following and all preceding siblings, as generators:

soup.a.next_siblings
soup.a.previous_siblings

The next_element and previous_element attributes get the next or previous object to be parsed:

soup.a.next_element
soup.a.previous_element

next_elements and previous_elements are iterators over the parsed content after or before the current node:

soup.a.next_elements
soup.a.previous_elements
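The element-wise attributes differ from the sibling attributes in that they follow document parse order rather than staying on one level of the tree. A minimal sketch (the fragment is illustrative; html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b>tail</p>", "html.parser")

# next_sibling stays at the same level of the tree ...
print(repr(soup.b.next_sibling))  # 'tail'
# ... while next_element descends into the tag's own content first.
print(repr(soup.b.next_element))  # 'bold'
```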
 

(2) Method selectors

Selecting by node attributes, as above, is fast, but it is not flexible enough for more complex selections. Fortunately, Beautiful Soup also provides a number of query methods, such as find_all() and find().

find_all(name, attrs, recursive, text, **kwargs): finds all elements that match the criteria. Its parameters:

The name parameter finds all tags whose name matches; it can be a string, a filter function, a regular expression, a list, or True.

The attrs parameter passes in attributes in dictionary form, e.g. attrs={'id': '123'}; commonly used attributes such as id can also be passed as keyword arguments. Because class is a Python keyword, queries on class need a trailing underscore: class_='element'. The result returned is a list of Tag objects.

The text parameter matches the text of nodes; it can be passed as a string or as a regular expression object.

recursive: to search only the direct children, set this parameter to False: recursive=False.

The limit parameter limits the number of results returned, similar to the limit keyword in SQL.

import re
from bs4 import BeautifulSoup

# The examples below all use this text for testing
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    ddd
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span>中文</span>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup))
print(soup.find_all('span'))                                         # look up by tag
print(soup.find_all('a', id='link1'))                                # filter by tag attribute
print(soup.find_all('a', attrs={'class': 'sister', 'id': 'link3'}))  # multiple attributes
print(soup.find_all('p', class_='title'))                            # class is special: passed via **kwargs as class_
print(soup.find_all(text=re.compile('Tillie')))                      # filter by text
print(soup.find_all('a', limit=2))                                   # limit the number of results

# Output:
# <class 'bs4.BeautifulSoup'>
# [<span>中文</span>]
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# [<p class="title"><b>The Dormouse's story</b></p>]
# ['Tillie']
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

find(name, attrs, recursive, text, **kwargs): returns a single element, i.e. the first matching element; the type is still Tag.

Its parameters are the same as those of find_all().

There are several other query methods whose usage is identical to the find_all() method described above; only the query scope differs. Their parameters are the same, too.

find_parents(name, attrs, recursive, text, **kwargs) and find_parent(name, attrs, recursive, text, **kwargs): the former returns all ancestor nodes, the latter the direct parent.

find_next_siblings(name, attrs, recursive, text, **kwargs) and find_next_sibling(name, attrs, recursive, text, **kwargs): iterate over the siblings after the current tag; the former returns all following siblings, the latter the first following sibling.

find_previous_siblings(name, attrs, recursive, text, **kwargs) and find_previous_sibling(name, attrs, recursive, text, **kwargs): iterate over the siblings before the current tag; the former returns all preceding siblings, the latter the first preceding sibling.

find_all_next(name, attrs, recursive, text, **kwargs) and find_next(name, attrs, recursive, text, **kwargs): iterate over the tags and strings after the current tag; the former returns all matching nodes, the latter the first match.

find_all_previous() and find_previous(): iterate over the tags and strings before the current tag; the former returns all matching nodes, the latter the first match.
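A minimal sketch of a few of these scoped queries (the fragment and ids are illustrative; html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

doc = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(doc, "html.parser")
second = soup.find("a", id="link2")

print(second.find_next_sibling("a"))      # <a id="link3">Tillie</a>
print(second.find_previous_sibling("a"))  # <a id="link1">Elsie</a>
print(second.find_parent("p").name)       # p
```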

 

(3) CSS selectors

Beautiful Soup also provides CSS selectors. If you are not familiar with CSS selectors, refer to http://www.w3school.com.cn/cssref/css_selectors.asp

Calling the .select() method of a Tag or BeautifulSoup object with a string argument finds tags using CSS selector syntax:

In [10]: soup.select('title')
Out[10]: [<title>The Dormouse's story</title>]

Finding tags level by level:

In [12]: soup.select('body a')
Out[12]: 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finding the direct children of a tag:

In [13]: soup.select('head > title')
Out[13]: [<title>The Dormouse's story</title>]

Finding sibling tags:

In [14]: soup.select('#link1 ~ .sister')
Out[14]: 
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by CSS class name:

In [15]: soup.select('.title')
Out[15]: [<p class="title"><b>The Dormouse's story</b></p>]

In [16]: soup.select('[class~=title]')
Out[16]: [<p class="title"><b>The Dormouse's story</b></p>]

Finding tags by id:

In [17]: soup.select('#link1')
Out[17]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [18]: soup.select('a#link2')
Out[18]: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Finding by the presence of an attribute:

In [20]: soup.select('a[href]')
Out[20]: 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finding by matching an attribute's value:

In [22]: soup.select('a[href="http://example.com/elsie"]')
Out[22]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [23]: soup.select('a[href^="http://example.com/"]')  # match the beginning of the value
Out[23]: 
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [24]: soup.select('a[href$="tillie"]')  # match the end of the value
Out[24]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [25]: soup.select('a[href*=".com/el"]')  # substring (fuzzy) match
Out[25]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

All three ways of finding tag nodes -- node selectors, method selectors, and CSS selectors -- work similarly. Direct tag attribute access is the fastest of the three, but the other two provide more convenient and flexible methods for complex searches; use whichever fits.
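The three styles can be compared side by side; all of them reach the same node. A minimal sketch (html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>The Dormouse's story</title></head></html>",
                     "html.parser")

# All three styles reach the same node:
print(soup.title)               # node selector: fastest
print(soup.find("title"))       # method selector
print(soup.select("title")[0])  # CSS selector (select() returns a list)
```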

 

(4) Tag modification methods

Beautiful Soup's strength is searching documents; its modification features are needed in far fewer scenarios, so they are covered only briefly here. To learn more, consult the official documentation on modifying the tree.

Beautiful Soup can change the value of a tag's attributes, and add or remove attributes and content. Here are some common methods:

In [26]: markup = '<a href="http://www.baidu.com/">baidu</a>'
In [28]: soup = BeautifulSoup(markup, 'lxml')
In [29]: soup.a.string = '百度'
In [30]: soup.a
Out[30]: <a href="http://www.baidu.com/">百度</a>
# if the a node contains child nodes, they are overwritten as well

The Tag.append() method adds content to a tag, like Python's list .append() method:

In [30]: soup.a
Out[30]: <a href="http://www.baidu.com/">百度</a>

In [31]: soup.a.append('一下')

In [32]: soup.a
Out[32]: <a href="http://www.baidu.com/">百度一下</a>

The new_tag() method creates a new tag:

In [33]: soup=BeautifulSoup('<b></b>','lxml')

In [34]: new_tag=soup.new_tag('a',href="http://www.python.org") # create a tag; the first argument must be the tag's name

In [35]: soup.b.append(new_tag) # append under the b node

In [36]: new_tag.string='python' # set the tag's value

In [37]: soup.b
Out[37]: <b><a href="http://www.python.org">python</a></b>

Other methods:

insert() inserts an element at the specified position

insert_before() inserts content before the current tag or text node

insert_after() inserts content after the current tag or text node

clear() removes the contents of the current tag

extract() removes the current tag from the document tree and returns it as the method's result

prettify() formats the Beautiful Soup document tree and outputs it as Unicode; it can also be called on tag nodes

get_text() outputs the text contained in a tag, including the content of descendant tags
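A minimal sketch of a few of these methods in action (the fragment is illustrative; html.parser is used so no extra install is needed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(soup.get_text())      # "Hello world" -- includes descendant text

removed = soup.b.extract()  # detach <b> from the tree and return it
print(removed)              # <b>world</b>
print(soup.p)               # <p>Hello </p>
```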

The soup.original_encoding attribute records the automatically detected encoding.

from_encoding: this parameter can be used when creating a BeautifulSoup object to specify the encoding, which skips the encoding-guessing step and speeds things up.

To parse only part of a document, use the SoupStrainer class to create a content filter; it accepts the same parameters as the search methods:

from bs4 import BeautifulSoup,SoupStrainer

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    ddd
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span>中文</span>
"""


only_a_tags = SoupStrainer('a')  # filter

soup=BeautifulSoup(html_doc,'lxml',parse_only=only_a_tags)

print(soup.prettify())

#
<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>
<a class="sister" href="http://example.com/tillie" id="link3">
 Tillie
</a>

Beautiful Soup exception handling:

HTMLParser.HTMLParseError: malformed start tag

HTMLParser.HTMLParseError: bad end tag

Both of these exceptions are raised by the parser; the solution is to install lxml or html5lib.

More resources...

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Chinese document: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh

PyPI:https://pypi.org/project/beautifulsoup4/


Origin www.cnblogs.com/xingxia/p/python_beautifulsoup.html