【Python】BeautifulSoup

Introduction

We know that a web page is composed of HTML documents. An HTML document is structured according to certain rules, and that structure simplifies information extraction.

Beautiful Soup 4.4.0 documentation

My understanding is this: pass an HTML document to the BeautifulSoup() constructor, which parses it into an object, and then operate on that object.

Beautiful Soup is a Python library that extracts data from HTML or XML files. The name comes from "Alice in Wonderland", and the example document used in the official documentation (html_doc below) is a passage from that story.
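
For reference, here is a reconstruction of that example document, written to be consistent with the prettify() output shown in this post. The variable name html_doc is the one the parsing code relies on; the exact whitespace is an assumption.

```python
# The "three sisters" example document, reconstructed to match the
# prettify() output in this post (exact whitespace is an assumption).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```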

Parsing this document with BeautifulSoup gives you a BeautifulSoup object, which can be printed in a standard indented format:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

# Output
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

A few simple ways to browse structured data:
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Find the links of all `<a>` tags in the document:
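
A minimal sketch of this, using a shortened two-link version of the "Alice" markup so that the snippet runs on its own (the variable names html and links are only illustrative):

```python
from bs4 import BeautifulSoup

# Two of the links from the example document, inlined for a standalone run.
html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# Collect the href attribute of every <a> tag.
links = [a.get('href') for a in soup.find_all('a')]
for link in links:
    print(link)
# http://example.com/elsie
# http://example.com/lacie
```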


Get all text content from the document:
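
As a small sketch, get_text() on a shortened fragment of the example (the fragment itself is an assumption chosen to keep the output short):

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>The Dormouse's story</b></p><p class='story'>...</p>"
soup = BeautifulSoup(html, 'html.parser')

# get_text() concatenates every string in the document, ignoring the tags.
text = soup.get_text()
print(text)
# The Dormouse's story...
```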


Parser

Beautiful Soup actually relies on a parser when parsing. In addition to supporting the HTML parser in the Python standard library, it also supports some third-party parsers (such as lxml).

Parser                    Usage
Python standard library   BeautifulSoup(markup, "html.parser")
lxml HTML parser          BeautifulSoup(markup, "lxml")
lxml XML parser           BeautifulSoup(markup, "xml")
html5lib                  BeautifulSoup(markup, "html5lib")
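
One possible way to choose a parser at run time is sketched below. The helper name make_soup is hypothetical; the sketch assumes that Beautiful Soup raises FeatureNotFound when the requested parser is not installed, and falls back to the standard library parser in that case.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is a third-party package and may not be installed.
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<p>Hello <b>world</b></p>")
print(soup.b.string)
# world
```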

Object

Beautiful Soup converts a complex HTML document into a complex tree structure in which each node is a Python object. All of these objects fall into 4 types:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment

Tag

A Tag object corresponds to a tag in the original HTML or XML document.
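
A short sketch of obtaining a Tag object; the one-line markup is an illustrative assumption:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

# Accessing a tag by name returns a bs4.element.Tag instance.
print(type(tag))
# <class 'bs4.element.Tag'>
```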

The most important attributes in the tag:

  • name
  • attributes

Name

Each tag has its own name, which can be obtained through .name:

If you change the tag name, the change is reflected in any HTML markup generated by the current Beautiful Soup object.
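
Reading and changing .name might look like this (a sketch on an assumed one-tag document):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

original_name = tag.name
print(original_name)
# b

# Renaming the tag changes the markup that the soup generates.
tag.name = "blockquote"
renamed = str(soup)
print(renamed)
# <blockquote class="boldest">Extremely bold</blockquote>
```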

Attributes

A tag may have any number of attributes. The tag <b class="boldest"> has a "class" attribute with the value "boldest". Tag attributes are accessed the same way as dictionary entries.

You can also get all the attributes at once, as a dictionary, with dot notation through .attrs.
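
A minimal sketch of both access styles. The id attribute is added here only for illustration. Note that Beautiful Soup treats class as a multi-valued attribute in HTML, so its value comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', 'html.parser')
tag = soup.b

# Dictionary-style access to a single attribute.
class_value = tag['class']   # multi-valued attribute, returned as a list
id_value = tag['id']         # ordinary attribute, returned as a string

# .attrs exposes all attributes as a dict.
all_attrs = tag.attrs

# .get() returns None instead of raising KeyError for a missing attribute.
missing = tag.get('href')

print(class_value, id_value, missing)
```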


The attributes of a tag can be added, deleted, or modified. Again, this works the same way as with a dictionary.
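
For example (a sketch; the attribute names and values are arbitrary):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

tag['id'] = 'verybold'            # add a new attribute
tag['another-attribute'] = '1'    # add another one
del tag['class']                  # delete an existing attribute

print(tag.attrs)
print(tag.get('class'))           # None, because the attribute is gone
```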

NavigableString

A NavigableString is a string that can be traversed.
Strings are often contained in tags. Beautiful Soup uses the NavigableString class to wrap the strings in a tag.

The string contained in a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method.
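
Both points can be sketched on a one-tag document (the markup is an assumption):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

# The text inside the tag is wrapped in a NavigableString.
print(type(tag.string))
# <class 'bs4.element.NavigableString'>

# Replace the string with another one.
tag.string.replace_with("No longer bold")
replaced = str(tag)
print(replaced)
# <b class="boldest">No longer bold</b>
```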

BeautifulSoup

The BeautifulSoup object represents the entire content of a document. Most of the time, it can be regarded as a Tag object, and it supports most of the methods described in traversing the document tree and searching the document tree.

Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name or attributes of its own. But it is sometimes convenient to look at its .name, so the BeautifulSoup object is given the special .name value "[document]".
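
A one-line check of that special name:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>text</p>', 'html.parser')
root_name = soup.name
print(root_name)
# [document]
```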


Comment

The Comment object handles the comments in a document. It is a special type of NavigableString.
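
A sketch with a tag whose only child is a comment (the markup is illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string

print(type(comment))
# <class 'bs4.element.Comment'>
print(comment)
# Hey, buddy. Want to buy a used parser?
```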



Traverse the document tree

The following sections use the "Alice" document from the introduction to demonstrate how to move from one part of a document to another. A Tag may contain multiple strings or other tags; these are the Tag's child nodes. Beautiful Soup provides many attributes for operating on and traversing child nodes.


tag name

The simplest way to navigate the document tree is by tag name, with soup.tag_name. To get the <head> tag, just write soup.head.
You can chain this as many times as needed to drill down into the tree:
soup.tag_name.tag_name. ...

Accessing a tag name as an attribute returns only the first tag with that name.
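
A short sketch of navigating by tag name, on a trimmed version of the example document:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p><b>The Dormouse's story</b></p></body></html>")
soup = BeautifulSoup(html, 'html.parser')

head = str(soup.head)
print(head)
# <head><title>The Dormouse's story</title></head>

# Chained access: the first <b> inside <body>.
bold = str(soup.body.b)
print(bold)
# <b>The Dormouse's story</b>
```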

If you want all the <a> tags, or anything more than the first tag with a given name, you need one of the methods described in "Search the document tree", such as find_all().
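
The difference between attribute access and find_all() can be sketched as follows (two-link markup assumed for brevity):

```python
from bs4 import BeautifulSoup

html = "<p><a id='link1'>Elsie</a>, <a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(html, 'html.parser')

# Attribute access stops at the first match.
first_id = soup.a['id']

# find_all() returns every match.
all_ids = [a['id'] for a in soup.find_all('a')]

print(first_id)   # link1
print(all_ids)    # ['link1', 'link2']
```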

.contents and .children

The .contents attribute of the tag can output the child nodes of the tag as a list:

Through the tag's .children generator, you can loop over the child nodes of the tag:
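
A sketch showing .contents and .children side by side, on an assumed <head> fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", 'html.parser')
head_tag = soup.head

# .contents is a list of direct children.
head_children = head_tag.contents
print(head_children)
# [<title>The Dormouse's story</title>]

# .children is a generator over the same direct children.
title_tag = head_children[0]
title_children = [child for child in title_tag.children]
for child in title_children:
    print(child)
# The Dormouse's story
```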

.descendants

The .descendants attribute lets you loop recursively over all the descendants of a tag: its children, its children's children, and so on.
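
For instance, <head> below has one child (<title>) but two descendants, because the string inside <title> is also visited. The fragment is an illustrative assumption.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", 'html.parser')

descendants = list(soup.head.descendants)
for node in descendants:
    print(node)
# <title>The Dormouse's story</title>
# The Dormouse's story

print(len(soup.head.contents), len(descendants))
# 1 2
```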

.string

If a tag has only one child and that child is a NavigableString, the child is available as .string. If a tag's only child is another tag, and that tag has a .string, the parent tag is considered to have the same .string as its child.
If a tag contains more than one child node, it is not clear what .string should refer to, so .string is None.
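
The three cases can be sketched like this (the markup is made up for the purpose):

```python
from bs4 import BeautifulSoup

html = "<head><title>The Dormouse's story</title></head><p>a<b>b</b></p>"
soup = BeautifulSoup(html, 'html.parser')

title_string = soup.title.string   # single NavigableString child
head_string = soup.head.string     # single child tag that itself has a .string
p_string = soup.p.string           # two children, so .string is None

print(title_string)
print(head_string)
print(p_string)
# The Dormouse's story
# The Dormouse's story
# None
```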

.strings and .stripped_strings

If a tag contains more than one string, you can use the .strings generator to loop over them.
The output strings may contain a lot of extra whitespace or blank lines; use .stripped_strings to remove it.
Strings consisting entirely of whitespace are skipped, and whitespace at the beginning and end of each string is removed.
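
A sketch comparing the two generators on a paragraph with deliberate extra whitespace:

```python
from bs4 import BeautifulSoup

html = "<p>\n  Once upon a time\n  <a>Elsie</a>\n  and\n  <a>Lacie</a>\n</p>"
soup = BeautifulSoup(html, 'html.parser')

# .strings yields every string, whitespace included.
raw = list(soup.p.strings)
print(len(raw))
# 5

# .stripped_strings trims each string and drops whitespace-only ones.
clean = list(soup.p.stripped_strings)
print(clean)
# ['Once upon a time', 'Elsie', 'and', 'Lacie']
```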

.parent

Use the .parent attribute to get the parent node of an element. In the "Alice" example document, the <head> tag is the parent of the <title> tag.
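
A small sketch of .parent on a trimmed version of the example:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, 'html.parser')

title_tag = soup.title
parent_name = title_tag.parent.name                  # the tag that contains <title>
string_parent_name = title_tag.string.parent.name    # strings have parents too

print(parent_name)          # head
print(string_parent_name)   # title

# The top-level tag's parent is the BeautifulSoup object, whose own parent is None.
print(soup.html.parent is soup)   # True
print(soup.parent)                # None
```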

.parents

Through the .parents attribute of an element, you can iterate over all of the element's ancestors. For example, .parents can be used to travel from a tag all the way up to the root node.
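
A sketch that walks up from an <a> tag (the choice of <a> and the nesting are assumptions for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Walk from the <a> tag up to the BeautifulSoup object at the root.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
# ['p', 'body', 'html', '[document]']
```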

.next_sibling and .previous_sibling

In the document tree, use the .next_sibling and .previous_sibling attributes to query sibling nodes:
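
A minimal sketch with two sibling tags (the tag names <b> and <c> are arbitrary):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')

next_of_b = soup.b.next_sibling          # <c>text2</c>
prev_of_c = soup.c.previous_sibling      # <b>text1</b>
prev_of_b = soup.b.previous_sibling      # None: <b> is the first child

print(next_of_b)
print(prev_of_c)
print(prev_of_b)
```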


.next_siblings and .previous_siblings

Through the .next_siblings and .previous_siblings attributes, you can iterate over the current node's sibling nodes:
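
A sketch on an assumed paragraph with three links. Note that the strings between the tags count as siblings too.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>one</a>, <a>two</a> and <a>three</a></p>", 'html.parser')
first_link = soup.a
last_link = soup.find_all('a')[-1]

following = [str(node) for node in first_link.next_siblings]
preceding = [str(node) for node in last_link.previous_siblings]

print(following)
# [', ', '<a>two</a>', ' and ', '<a>three</a>']
print(preceding)
# [' and ', '<a>two</a>', ', ', '<a>one</a>']
```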



Search the document tree


find_all

Returns all matched results as a list, unlike find(), which returns only the first result found.
Syntax:

find_all( name , attrs , recursive , string , **kwargs )
  • name: searches for all tags whose name is name; string objects are ignored automatically.
  • keyword arguments: any argument that is not one of the named parameters is treated as a filter on the tag attribute of that name.
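
A few common find_all() calls, sketched on a small assumed fragment:

```python
from bs4 import BeautifulSoup

html = ("<p><a class='sister' id='link1'>Elsie</a>"
        "<a class='sister' id='link2'>Lacie</a><b>bold</b></p>")
soup = BeautifulSoup(html, 'html.parser')

by_name = soup.find_all('a')                      # filter on tag name
by_id = soup.find_all(id='link2')                 # keyword argument -> attribute filter
by_class = soup.find_all('a', class_='sister')    # class is a Python keyword, hence class_
by_list = soup.find_all(['a', 'b'])               # match any name in the list

print(len(by_name), len(by_id), len(by_class), len(by_list))
# 2 1 2 3
```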

find

Returns only the first matched object, or None if nothing matches (find_all returns every match).
syntax:

find( name , attrs , recursive , string , **kwargs )  
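
A sketch of find(), including the no-match case (the markup is assumed):

```python
from bs4 import BeautifulSoup

html = "<p><a id='link1'>Elsie</a><a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')          # first match only
missing = soup.find('table')    # no match -> None, not an empty list

print(first['id'])   # link1
print(missing)       # None
```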

find_parents() and find_parent()

find_parents( name , attrs , string , limit , **kwargs )

find_parent( name , attrs , string , **kwargs )
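
These two methods search upward through a node's ancestors instead of downward through its descendants. A sketch, with an assumed nesting of <div>, <p> and <a>:

```python
from bs4 import BeautifulSoup

html = "<html><body><div id='outer'><p><a>Elsie</a></p></div></body></html>"
soup = BeautifulSoup(html, 'html.parser')
a_tag = soup.a

# find_parent(): the nearest ancestor that matches.
nearest_div = a_tag.find_parent('div')
print(nearest_div['id'])
# outer

# find_parents(): every matching ancestor, nearest first.
matching = [t.name for t in a_tag.find_parents(['p', 'div'])]
print(matching)
# ['p', 'div']
```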


Modify the document tree

Modify tag name and attributes
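
Renaming a tag and changing its attributes works as described in the Tag section. A compact sketch, with arbitrary names and values:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="bold">text</b>', 'html.parser')
tag = soup.b

tag.name = "strong"          # rename the tag
tag['class'] = 'emphasis'    # overwrite an attribute
tag['id'] = 'first'          # add an attribute

modified = str(tag)
print(modified)
# <strong class="emphasis" id="first">text</strong>
```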


Modify .string

Assigning a value to the .string attribute of a tag replaces the tag's original content with the new string.
Note: If the tag contains other tags, assigning to .string overwrites all of the original content, including those child tags.
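
A sketch in which the nested <i> tag is lost after the assignment:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a

# The old string and the nested <i> tag are both replaced.
tag.string = "New link text."
result = str(tag)
print(result)
# <a href="http://example.com/">New link text.</a>
```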


new_tag()

The best way to create a tag is to call the factory method BeautifulSoup.new_tag():

Only the first argument, the tag name, is required; the other arguments are optional.
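
A sketch that creates a link and attaches it to an empty <b> tag (the URL and text are placeholders):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", 'html.parser')

# Extra keyword arguments become attributes of the new tag.
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
soup.b.append(new_tag)

result = str(soup.b)
print(result)
# <b><a href="http://www.example.com">Link text.</a></b>
```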


append()

The Tag.append() method adds content to the tag, just like the .append() method of Python lists:
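
For example (a minimal sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>Foo</a>", 'html.parser')

# The new string is added after the existing content.
soup.a.append("Bar")

result = str(soup)
print(result)
# <a>FooBar</a>
print(len(soup.a.contents))
# 2
```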

insert()

The Tag.insert() method is similar to Tag.append(), except that the new element is not necessarily added to the end of its parent's .contents. Instead, it is inserted at the position you specify, just like .insert() on a Python list.
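
A sketch that inserts a string at index 1, between the existing text and the <i> tag:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a

# Position 0 is "I linked to ", position 1 is the <i> tag.
tag.insert(1, "but did not endorse ")

result = str(tag)
print(result)
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
```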

insert_before() and insert_after()

The insert_before() method inserts content immediately before the current tag or text node, and the insert_after() method inserts content immediately after it.
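
Both methods in one sketch (the wording of the strings is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", 'html.parser')

# Build an <i> tag and put it before the existing string.
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)

# Put a new string right after the <i> tag.
soup.b.i.insert_after(soup.new_string(" ever "))

result = str(soup.b)
print(result)
# <b><i>Don't</i> ever stop</b>
```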

Other functions

  • clear(): removes the contents of the current tag.
  • extract(): removes the current tag from the document tree and returns it as the result of the call.
  • decompose(): removes the current node from the document tree and completely destroys it.
  • replace_with(): removes a piece of content from the document tree and replaces it with a new tag or text node.
  • wrap(): wraps the element in the tag you specify and returns the new wrapper.
  • unwrap(): the opposite of wrap(). It replaces a tag with whatever is inside it, which is useful for stripping out markup.
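
The six methods can be compared on the same piece of markup. In this sketch, fresh_soup is a hypothetical helper that re-parses the markup so that each method starts from an unmodified tree.

```python
from bs4 import BeautifulSoup

MARKUP = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

def fresh_soup():
    # Hypothetical helper: parse the same markup again for each demonstration.
    return BeautifulSoup(MARKUP, 'html.parser')

soup = fresh_soup()
soup.a.clear()                             # empty the <a> tag
cleared = str(soup.a)

soup = fresh_soup()
extracted = soup.i.extract()               # detach <i> and keep it
after_extract = str(soup.a)

soup = fresh_soup()
soup.i.decompose()                         # detach <i> and destroy it
after_decompose = str(soup.a)

soup = fresh_soup()
replacement = soup.new_tag("b")
replacement.string = "example.net"
soup.i.replace_with(replacement)           # swap <i> for <b>
after_replace = str(soup.a)

soup = fresh_soup()
wrapper = soup.i.wrap(soup.new_tag("b"))   # returns the new <b> wrapper
after_wrap = str(wrapper)

soup = fresh_soup()
soup.i.unwrap()                            # keep the text, drop the <i> tag
after_unwrap = str(soup.a)

for label, value in [("clear", cleared), ("extract", after_extract),
                     ("decompose", after_decompose), ("replace_with", after_replace),
                     ("wrap", after_wrap), ("unwrap", after_unwrap)]:
    print(label, "->", value)
```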

Output

Formatted output

The prettify() method formats the Beautiful Soup document tree and returns it as a Unicode string, with each XML/HTML tag and each string on its own line. You can call prettify() on the BeautifulSoup object or on any of its Tag nodes.
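
A sketch calling prettify() on both the whole document and a single tag. The exact indentation can differ between Beautiful Soup versions, so no output is shown here.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

# Whole document: one tag or string per line, indented by depth.
pretty_doc = soup.prettify()
print(pretty_doc)

# A single tag can be formatted the same way.
pretty_tag = soup.i.prettify()
print(pretty_tag)
```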

Compressed output

If you only want the resulting string and do not care about formatting, call str() on a BeautifulSoup object or a Tag (unicode() in Python 2).
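
For example, in Python 3 (a minimal sketch):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

# str() returns the markup on one line, without added whitespace.
compact_doc = str(soup)
compact_tag = str(soup.i)

print(compact_doc)
# <a href="http://example.com/">I linked to <i>example.com</i></a>
print(compact_tag)
# <i>example.com</i>
```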

get_text()

If you only want the text inside a tag, use the get_text() method. It returns all the text in the tag, including the text of its descendants, as a single Unicode string:

The separator of the text content of the tag can be specified by parameters:

You can also remove the blank space before and after the obtained text content:
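
The three variants can be sketched together; the newlines in the markup are there to make the effect of strip=True visible:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

plain = soup.get_text()                     # all text, whitespace preserved
joined = soup.get_text("|")                 # strings joined with a separator
stripped = soup.get_text("|", strip=True)   # each string stripped before joining

print(repr(plain))
# '\nI linked to example.com\n'
print(repr(joined))
# '\nI linked to |example.com|\n'
print(repr(stripped))
# 'I linked to|example.com'
```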

Copy the Beautiful Soup object

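The copy.copy() method can copy any Tag or NavigableString object:

import copy
from bs4 import BeautifulSoup

# Example document assumed by this snippet
soup = BeautifulSoup("<p>I want <b>pizza</b> and more <b>pizza</b>!</p>", 'html.parser')
p_copy = copy.copy(soup.p)
print(p_copy)
# <p>I want <b>pizza</b> and more <b>pizza</b>!</p>

The copy is equal to the original, but it is a different object in memory:

print(soup.p == p_copy)
# True

print(soup.p is p_copy)
# False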

Origin blog.csdn.net/weixin_45468845/article/details/108498707