I. Beautiful Soup
1. Introduction
Beautiful Soup is a Python library whose main job is extracting data from web pages. According to the official documentation, three features make it powerful:
a. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree. It is a toolkit for dissecting a document and extracting the data you need, and because it is simple, a complete application takes very little code.
b. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not have to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it. In that case, you only need to state the original encoding.
c. Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can try different parsing strategies or trade speed for flexibility.
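As a minimal sketch of the encoding behaviour described in point b, the snippet below hands Beautiful Soup a byte string and states its encoding through the from_encoding argument. The sample markup and the choice of Latin-1 are assumptions made for illustration, and html.parser is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Assumed sample input: a Latin-1 encoded byte string with no declared charset.
raw = "<p>caf\u00e9</p>".encode("latin-1")

# from_encoding tells Beautiful Soup which encoding the bytes are in.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string               # decoded to a Unicode string
utf8_bytes = soup.encode("utf-8")  # serialised back out as UTF-8 bytes

print(text)
print(utf8_bytes)
```

When Beautiful Soup can detect the encoding on its own, from_encoding can be left out.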
2. Parsers supported by Beautiful Soup
(1) Python standard library parser: built into Python, moderate speed, good tolerance of malformed documents
Usage: BeautifulSoup(data, "html.parser")
(2) lxml HTML parser: fast, good tolerance of malformed documents
Usage: BeautifulSoup(data, "lxml")
(3) lxml XML parser: fast; the only parser that supports XML
Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
(4) html5lib parser: best fault tolerance; parses the document the way a browser does; produces valid HTML5; slow
Usage: BeautifulSoup(data, "html5lib")
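Because lxml and html5lib are separate installs, a script may not know which parsers are present. The sketch below tries a list of parser names in order and falls back when one is missing. The helper name parse_with_available and the preference order are assumptions for illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with_available(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser in `preferred` that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")

# Deliberately malformed input: the <p> tag is never closed.
soup, used = parse_with_available("<p>Unclosed paragraph")
print(used)
print(soup.p.get_text())
```

The parsers may build slightly different trees from the same malformed input, which is why naming one explicitly keeps results reproducible.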
II. Creating a soup object
Here is an example from the official documentation:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--哈哈--></p>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
(1) Import the bs4 library
from bs4 import BeautifulSoup
(2) Create a BeautifulSoup object
When no parser is named, Beautiful Soup picks the best HTML parser installed on the system: lxml if available, then html5lib, then Python's built-in html.parser.
soup = BeautifulSoup(html_doc)  # no parser named, so Beautiful Soup picks the best one installed
Running this produces a warning like the following:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
As the warning suggests, name a parser explicitly to avoid it:
soup = BeautifulSoup(html_doc, "lxml")
To print the document in a formatted way, use prettify(). Note that tags left unclosed in the input have been completed:
result = soup.prettify()
print(result)
The printed output has this form:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  ...
 </body>
</html>
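The tag completion can be checked on a much smaller input. This is a minimal sketch in which the fragment is an assumed example with three unclosed tags, and html.parser is chosen only because it needs no extra install.

```python
from bs4 import BeautifulSoup

# Assumed fragment: <html>, <body> and <p> are all left open.
fragment = "<html><body><p>Hello"

soup = BeautifulSoup(fragment, "html.parser")
pretty = soup.prettify()  # one tag or string per line, indented by nesting depth

print(pretty)
```

The closing tags that were missing from the input appear in the printed output, because the tree holds complete elements.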
III. Four types of objects
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:
Tag; BeautifulSoup; NavigableString; Comment
(1) Tag
What is a Tag? Put simply, it is one of the tags in the HTML, for example:
<head><title>The Dormouse's story</title></head>
Here head, title, and so on are tags. They can be accessed as follows:
result = soup.head
print(type(result))  # <class 'bs4.element.Tag'>
print(result)  # <head><title>The Dormouse's story</title></head>
result = soup.title
print(type(result))  # <class 'bs4.element.Tag'>
print(result)  # <title>The Dormouse's story</title>
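Attribute-style access returns the first matching tag, can be chained, and returns None when nothing matches. Here is a small sketch of those three cases; the snippet is an assumed example, not the document used elsewhere in this article.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed sample markup for illustration.
snippet = "<html><head><title>Demo page</title></head><body><p><b>Bold text</b></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

title_tag = soup.title    # first <title> found anywhere in the tree
bold_tag = soup.body.p.b  # attribute access can be chained to walk downwards
missing = soup.table      # there is no <table>, so this is None

print(isinstance(title_tag, Tag))
print(title_tag)
print(bold_tag)
print(missing)
```

Checking for None before using a result avoids an AttributeError when a tag is absent from the page.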
A Tag has two important attributes: name and attrs
name
print(soup.name)  # [document]
print(soup.head.name)  # head
The soup object itself is special: its name is [document]. For any ordinary tag inside it, name is simply the tag's own name, such as head above.
attrs
print(soup.a.attrs)  # {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
This prints all the attributes of the a tag as a dictionary.
To get a single attribute, do the following (using href as an example):
print(soup.a['href'])  # http://example.com/elsie
print(soup.a.get('href'))  # http://example.com/elsie
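The two lookup styles differ when an attribute is absent: indexing raises KeyError, while get() returns None or a supplied default. The sketch below shows this along with the list returned for the multi-valued class attribute. The markup, including the extra class value big, is an assumed example.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; "big" is added to show a multi-valued class.
snippet = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
link = BeautifulSoup(snippet, "html.parser").a

href = link["href"]                  # a plain string
classes = link["class"]              # class is multi-valued, so this is a list
title = link.get("title")            # attribute absent, so None
fallback = link.get("title", "n/a")  # absent attribute with an explicit default
has_id = link.has_attr("id")         # membership test without fetching the value

print(href, classes, title, fallback, has_id)
```

Prefer get() when scraping pages where an attribute may be missing on some tags.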
(2) BeautifulSoup
A BeautifulSoup object represents the entire document. Most of the time you can treat it as a special Tag, so we can inspect its type, name, and attributes:
print(type(soup.name))  # <class 'str'>
print(soup.name)  # [document]
print(soup.attrs)  # {} (an empty dictionary)
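The "special Tag" relationship can be verified directly, since the BeautifulSoup class is a subclass of Tag. This is a minimal sketch on an assumed one-line document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed one-line document for illustration.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")

is_tag = isinstance(soup, Tag)  # the soup object is itself a Tag
first_p = soup.find("p")        # Tag search methods work on the soup object too

print(is_tag, soup.name, soup.attrs, first_p.string)
```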
(3) NavigableString
Now that we have a tag, how do we get the text inside it? Simply use .string, for example:
print(soup.a.string)  # Elsie
print(type(soup.a.string))  # <class 'bs4.element.NavigableString'>
Note: soup.a matches only the first a tag (html_doc above contains several a tags, but only the first is returned).
This makes it easy to get the text inside a tag; imagine how much more trouble it would be with regular expressions. The value is a NavigableString, which can be read as "a string you can navigate".
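Two details of .string are worth seeing in code: it returns None when a tag has more than one child, and find_all() is the way to reach every matching tag instead of only the first. The markup below is an assumed, shortened example.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: one <p> holding text and two links.
snippet = '<p>Sisters: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

first_name = soup.a.string  # only the first <a> is matched
all_names = [str(a.string) for a in soup.find_all("a")]  # every <a>
p_string = soup.p.string    # <p> has several children, so .string is None
p_text = soup.p.get_text()  # get_text() joins all the text under <p>

print(first_name, all_names, p_string, p_text)
```

Wrapping a NavigableString in str() gives a plain Python string that no longer holds a reference to the parse tree.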
(4) Comment
A Comment object is a special kind of NavigableString. When it is output, the comment delimiters are not included, so if it is not handled properly it can cause unexpected trouble in text processing.
Let's find a tag that contains a comment:
print(soup.p)
print(soup.p.string)
print(type(soup.p.string))
The output is as follows:
<p class="story"><!--哈哈--></p>
哈哈
<class 'bs4.element.Comment'>
The content of the p tag is actually a comment, but when we output it with .string, the comment delimiters have been removed. The printed type above also shows that it is a Comment, so it is best to check the type before using the value, as in the following code:
import bs4
if type(soup.p.string) == bs4.element.Comment:
    print(soup.p.string)
The code above first checks whether the value is a Comment, and only then performs other operations, such as printing it.
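The same check can be wrapped in a small function so that comments never leak into extracted text. The sketch below uses isinstance and also collects every comment in a document. The helper name visible_string and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumed sample markup: one <p> holds only a comment, the other holds text.
snippet = "<p><!--hidden note--></p><p>Visible text</p>"
soup = BeautifulSoup(snippet, "html.parser")

def visible_string(tag):
    """Hypothetical helper: return tag.string unless it is missing or a Comment."""
    value = tag.string
    if value is None or isinstance(value, Comment):
        return None
    return str(value)

texts = [visible_string(p) for p in soup.find_all("p")]

# Collect every comment node in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(texts)
print(comments)
```

Using isinstance instead of comparing type() also covers subclasses of Comment.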
Reference: https://cuiqingcai.com/1319.html