Using the Python crawler library BeautifulSoup4

The BeautifulSoup4 library

Like lxml, Beautiful Soup is an HTML/XML parser; its main job is to parse HTML/XML documents and extract data from them.
lxml only does partial traversal, while Beautiful Soup is based on the HTML DOM (Document Object Model): it loads the entire document and builds the whole DOM tree, so its time and memory overhead are much larger and its performance is lower than lxml's.
BeautifulSoup is relatively simple to use for parsing HTML, its API is very user-friendly, and it supports CSS selectors, the HTML parser from the Python standard library, and the lxml HTML/XML parser.
Beautiful Soup 3 is no longer under development; new projects should use Beautiful Soup 4.

Installation and documentation:

  1. Installation: pip install bs4 (the canonical package name is beautifulsoup4)
  2. Documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Comparison of the major parsing tools:

Parsing tool         Parsing speed    Ease of use
BeautifulSoup        slowest          easiest
lxml                 fast             easy
Regular expressions  fastest          hardest
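As a quick illustration of the trade-off, the same parse can be run with either parser; only the parser name passed to BeautifulSoup changes. This sketch uses the standard-library html.parser so it runs without installing lxml:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Test</title></head><body><p>Hello</p></body></html>"

# Standard-library parser: no extra install required, slower than lxml
soup_std = BeautifulSoup(html, "html.parser")
print(soup_std.title.string)   # Test

# Swapping in lxml is a one-word change (after: pip install lxml):
# soup_lxml = BeautifulSoup(html, "lxml")
```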

Basic usage:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object, using lxml as the parser
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

Four commonly used objects:

Beautiful Soup converts a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds:

  1. Tag
  2. NavigableString
  3. BeautifulSoup
  4. Comment

1. Tag:

Put simply, a Tag is just an HTML tag. Sample code is as follows:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a Beautiful Soup object
soup = BeautifulSoup(html, 'lxml')

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(type(soup.p))
# <class 'bs4.element.Tag'>

With soup.<tag name> we can easily get these tags, and all of these objects are of type bs4.element.Tag. Note, however, that this form only returns the first tag that matches. Querying all matching tags is introduced later.
A Tag has two important attributes, name and attrs. Sample code is as follows:

print(soup.name)
# [document]
# the soup object itself is special: its name is [document]

print(soup.head.name)
# head
# for other tags, the value is the tag's own name

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# all attributes of the p tag are printed here, returned as a dictionary

print(soup.p['class'])   # equivalent: soup.p.get('class')
# ['title']
# the get method takes an attribute name; the two forms are equivalent

soup.p['class'] = "newClass"
print(soup.p)
# attributes and content can also be modified like this
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
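Because soup.<tag name> only returns the first match, it is worth contrasting with find_all, which is covered later. A minimal runnable sketch with hypothetical sample HTML, using the stdlib parser:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""
# stdlib parser used here so the sketch runs without lxml installed
soup = BeautifulSoup(html, "html.parser")

print(soup.a["id"])              # link1 -- only the first <a>
print(len(soup.find_all("a")))   # 3    -- every <a>
```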

2. NavigableString:

If you already have a tag and want the text inside it, you can use tag.string to get it. Sample code is as follows:

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

3. BeautifulSoup:

A BeautifulSoup object represents the entire contents of a document. Most of the time you can treat it as a Tag object: it supports most of the methods for traversing and searching the document tree described below.
Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attrs of its own. But since viewing .name is sometimes convenient, the BeautifulSoup object carries a special .name attribute with the value "[document]":

soup.name
# '[document]'

4. Comment:

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but there is one remaining special object: the comment.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

A Comment object is a special type of NavigableString:

comment
# 'Hey, buddy. Want to buy a used parser?'
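The subclass relationship can be checked directly. A minimal sketch using the stdlib parser (the article's other examples use lxml):

```python
from bs4 import BeautifulSoup
from bs4 import Comment, NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string
# Comment is a subclass of NavigableString, so both checks pass:
print(isinstance(comment, Comment))           # True
print(isinstance(comment, NavigableString))   # True
print(comment)                                # the text without <!-- --> delimiters
```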

Traversing the document tree:

1. contents and children:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

head_tag = soup.head
# contents returns a list of all child nodes
print(head_tag.contents)

# children returns an iterator over all child nodes
for child in head_tag.children:
    print(child)

2. strings and stripped_strings:

If a tag contains more than one string, they can be obtained by looping over .strings:

for string in soup.strings:
    print(repr(string))
    # u"The Dormouse's story"
    # u'\n\n'
    # u"The Dormouse's story"
    # u'\n\n'
    # u'Once upon a time there were three little sisters; and their names were\n'
    # u'Elsie'
    # u',\n'
    # u'Lacie'
    # u' and\n'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'\n\n'
    # u'...'
    # u'\n'

The output may contain a lot of whitespace or blank lines; .stripped_strings removes the extra whitespace:

for string in soup.stripped_strings:
    print(repr(string))
    # u"The Dormouse's story"
    # u"The Dormouse's story"
    # u'Once upon a time there were three little sisters; and their names were'
    # u'Elsie'
    # u','
    # u'Lacie'
    # u'and'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'...'
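The difference between the two generators can be seen side by side. A small self-contained sketch (hypothetical sample HTML, stdlib parser):

```python
from bs4 import BeautifulSoup

html = "<body><p>One</p>\n<p>  Two  </p></body>"
soup = BeautifulSoup(html, "html.parser")

# .strings keeps whitespace-only strings and surrounding spaces
print(list(soup.strings))            # includes the '\n' between the paragraphs
# .stripped_strings drops whitespace-only entries and trims the rest
print(list(soup.stripped_strings))   # ['One', 'Two']
```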

Search document tree:

1. find methods and find_all:

Two methods are used most when searching the document tree: find and find_all. find returns the first tag that satisfies the conditions, as a single element; find_all selects every tag that satisfies the conditions and returns them as a list. With both methods, the most common usage is to pass the name and attrs parameters to find the tags that meet the requirements.

soup.find_all("a",attrs={"id":"link2"})

Or pass the attribute name directly as a keyword argument:

soup.find_all("a",id='link2')
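The difference between find and find_all can be sketched in a short runnable example (hypothetical sample HTML, stdlib parser):

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find: first match only (or None if nothing matches)
first = soup.find("a")
print(first["id"])                                   # link1

# find_all: a list of every match
links = soup.find_all("a", attrs={"class": "sister"})
print(len(links))                                    # 2

# keyword-argument form, equivalent to attrs={...}
print(soup.find_all("a", id="link2")[0].get_text())  # Lacie
```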

2. select method:

The methods above can find elements easily, but sometimes a CSS selector is even more convenient. To use CSS selector syntax, use the select method. A few common CSS selector patterns follow:

(1) Find by tag name:

print(soup.select('a'))

(2) Find by class name:

To find by class name, prefix the class with a dot (.). For example, to find tags with class=sister:

print(soup.select('.sister'))

(3) Find by id:

To find by id, prefix the id name with a hash (#). Sample code is as follows:

print(soup.select("#link1"))

(4) combination to find:

Tag names, class names, and id names can be combined following the same rules as when writing CSS. For example, to find content with id equal to link1 inside a p tag, separate the two selectors with a space:

print(soup.select("p #link1"))

To find a direct child tag, use the > separator:

print(soup.select("head > title"))

(5) Find by property:

You can also find elements by their attributes; the attribute must be enclosed in square brackets. Note that the tag and its attribute belong to the same node, so there must be no space between them, or the selector will not match. Sample code is as follows:

print(soup.select('a[href="http://example.com/elsie"]'))
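The selector patterns above can be tried together on one small document. A self-contained sketch (hypothetical sample HTML, stdlib parser; the article's examples use lxml):

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Each selector below matches the same single <a> tag:
print(soup.select("a"))                                   # by tag name
print(soup.select(".sister"))                             # by class
print(soup.select("#link1"))                              # by id
print(soup.select("p #link1"))                            # descendant combination
print(soup.select('a[href="http://example.com/elsie"]'))  # by attribute
```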

(6) Getting the text content:

The select method returns its results as a list; you can iterate over it and then call get_text() on each element to obtain its text content.

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))

print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())



Origin www.cnblogs.com/csnd/p/11469315.html