Web crawler study notes, day 7: parsing with bs4

I. Beautiful Soup

1. Introduction

Beautiful Soup is a Python library whose main purpose is to extract data from web pages. Its key features are as follows (these three points, taken from the official manual, are the reasons bs4 is so capable):

a. Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you need; because it is simple, a complete application takes very little code.

b. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it. In that case, you only need to state the original encoding.

c. Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.
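Point b can be illustrated with a minimal sketch. The byte string and the choice of Latin-1 are made up for this illustration; from_encoding is the constructor argument used to state the original encoding when the document does not declare one.

```python
from bs4 import BeautifulSoup

# Hypothetical input: raw Latin-1 bytes with no encoding declaration.
raw = b"<p>caf\xe9</p>"

# State the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                # Unicode inside the program
utf8_bytes = soup.encode("utf-8")   # bytes again on the way out

print(text)
print(utf8_bytes)
```

Inside the program the text is ordinary Unicode; it only becomes bytes again when the document is encoded for output.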

2. Parsers supported by Beautiful Soup

(1) Python standard library: built into Python, moderate speed, tolerant of malformed documents

Usage: BeautifulSoup(data, "html.parser")

(2) lxml HTML parser: fast, tolerant of malformed documents

Usage: BeautifulSoup(data, "lxml")

(3) lxml XML parser: fast, the only parser that supports XML

Usage: BeautifulSoup(markup, ["lxml", "xml"]); BeautifulSoup(markup, "xml")

(4) html5lib parser: best fault tolerance; parses documents the way a browser does; generates HTML5-format documents; slow

Usage: BeautifulSoup(data, "html5lib")
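The differences in fault tolerance are easiest to see on deliberately broken markup. The sketch below is only an illustration: the input string is made up, and because lxml and html5lib are separate packages that may not be installed, a missing parser is recorded as None instead of stopping the script.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # deliberately invalid: a stray closing tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # This parser is not installed on the current system.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} -> {output}")
```

Each installed parser repairs the fragment in its own way, which is why the same code can behave differently on another machine.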

 

II. Creating a soup object

Here is an example from the official manual

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--哈哈--></p>
<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""

(1) Import the bs4 library

from bs4 import BeautifulSoup

(2) Create a BeautifulSoup object

No parser is specified here, so Beautiful Soup picks the best parser installed on the system

soup = BeautifulSoup(html_doc)  # no parser given; not necessarily the same as BeautifulSoup(html_doc, "html.parser")

Running this produces a warning like the following:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently 
To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

As the warning suggests, pass a parser explicitly to avoid it:

soup = BeautifulSoup(html_doc, "lxml")

Formatted output; missing closing tags are filled in automatically

result = soup.prettify()
print(result)

The printed result:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  ...
 </body>
</html>
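The tag completion can be checked on a smaller input. This is a sketch on a made-up fragment (not the html_doc above) that leaves three tags open; html.parser is used so that nothing beyond bs4 itself needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with <html>, <body> and <p> left open.
fragment = "<html><body><p>hi"

soup = BeautifulSoup(fragment, "html.parser")

compact = str(soup)        # single-line serialization
pretty = soup.prettify()   # one tag or string per line, indented

print(compact)
print(pretty)
```

The closing tags appear in both outputs, so the completion comes from building the tree rather than from prettify() itself.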

III. Four types of objects

Beautiful Soup converts a complex HTML document into a complex tree structure in which every node is a Python object. All objects fall into the following 4 types:

  Tag; BeautifulSoup; NavigableString; Comment

(1) Tag

What is a Tag? Put simply, it is one of the tags in the HTML, for example:

 <head>
  <title>
   The Dormouse's story
  </title>
 </head>

Here head, title, and so on are tags. They can be accessed as follows:

result = soup.head
print(type(result))  # prints <class 'bs4.element.Tag'>
print(result)  # prints <head><title>The Dormouse's story</title></head>

result = soup.title
print(type(result))  # prints <class 'bs4.element.Tag'>
print(result)  # prints <title>The Dormouse's story</title>

Two important attributes of a Tag: name and attrs

name

print(soup.name)  # prints [document]
print(soup.head.name)  # prints head

The soup object itself is special: its name is [document]. For any other tag, name is the tag's own name, such as head above.

attrs

print(soup.a.attrs)  # prints {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

This prints all the attributes of the first a tag, as a dictionary.

To get a single attribute, do the following (using href as an example):

print(soup.a['href'])  # prints http://example.com/elsie
print(soup.a.get('href'))  # prints http://example.com/elsie
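The two access styles differ when the attribute is missing. The sketch below assumes a single a tag copied from html_doc; the title attribute is a made-up example of an attribute the tag does not have.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(html, "html.parser").a

href = link["href"]           # a plain string
classes = link["class"]       # a list, because class may hold several values
missing = link.get("title")   # None: get() does not raise

try:
    link["title"]             # square brackets raise for a missing attribute
    raised = False
except KeyError:
    raised = True

print(href, classes, missing, raised)
```

get() is therefore the safer choice when an attribute may be absent on some of the tags being processed.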

 

(2) BeautifulSoup

A BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a special kind of Tag object: we can get its type, name, and attributes in the same way.

print(type(soup.name))  # <class 'str'>
print(soup.name)  # [document]
print(soup.attrs)  # {} (an empty dictionary)

(3) NavigableString

Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example:

print(soup.a.string)  # prints Elsie
print(type(soup.a.string))  # prints <class 'bs4.element.NavigableString'>

Note: soup.a only gets the first a tag (html_doc above contains several a tags, but only the first one is returned).

This gives us the text inside a tag with very little effort; imagine how much more trouble it would be with regular expressions. The result is of type NavigableString, that is, a string that can be navigated.
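Two consequences of the note above are worth a quick sketch: getting every a tag instead of only the first, and what .string returns for a tag with several children. The markup is a shortened, hypothetical version of html_doc; find_all() and get_text() are standard Beautiful Soup methods that the notes have not introduced yet.

```python
from bs4 import BeautifulSoup

html = """<p class="story">Once upon a time there were three little sisters:
<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>"""

soup = BeautifulSoup(html, "html.parser")

first = soup.a.string                           # only the first a tag
names = [a.string for a in soup.find_all("a")]  # every a tag
whole = soup.p.string                           # None: p has several children
flat = soup.p.get_text()                        # all text inside p, joined

print(first, names, whole)
print(flat)
```

.string is only defined when a tag has a single child, so for mixed content get_text() is usually the more practical choice.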

(4) Comment

A Comment object is a special type of NavigableString. When it is output, its content does not include the comment delimiters, so if it is not handled properly it can cause unexpected trouble in our text processing.

Let's find a tag that contains a comment:

print(soup.p)
print(soup.p.string)
print(type(soup.p.string))

The result is as follows:

<p class="story"><!--哈哈--></p>
哈哈
<class 'bs4.element.Comment'>

The content of this p tag is actually a comment, but when we output it with .string, the comment delimiters have been removed. The printed type above also shows that it is a Comment. It is therefore best to check the type before using the string, as the following code does:

import bs4  # makes bs4.element.Comment available

if type(soup.p.string) == bs4.element.Comment:
    print(soup.p.string)

In the code above, we first check whether the string is of type Comment, and only then perform other operations such as printing it.
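The same check can be wrapped in a small function so that comments are skipped when collecting text. This is a minimal sketch: visible_string is a hypothetical helper name, the markup is made up, and isinstance is used instead of comparing types directly.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<p><!--hidden note--></p><p>visible text</p>"
soup = BeautifulSoup(html, "html.parser")


def visible_string(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


texts = [visible_string(p) for p in soup.find_all("p")]
print(texts)
```

Because Comment is a subclass of NavigableString, the check has to test for Comment specifically; testing for NavigableString would let comments through.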

 

 

 

 

Reference: https://cuiqingcai.com/1319.html

Origin www.cnblogs.com/jj1106/p/11231275.html