Introduction
A web page is made up of HTML documents. An HTML document is structured according to fixed rules, and that structure makes information extraction easier.
Beautiful Soup 4.4.0 documentation
My understanding is this: you pass a piece of an HTML document to the BeautifulSoup() constructor, which parses it into an object, and then you operate on that object.
Beautiful Soup is a Python library for extracting data from HTML and XML files. The name comes from "Alice in Wonderland"; the code below is taken from the official documentation and is a passage from "Alice in Wonderland".
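A sketch of the html_doc string used by the snippets in this section. It is reconstructed from the prettified output shown further down, so the exact whitespace and line breaks are an assumption:

```python
# Reconstructed from the prettified output; exact whitespace is an assumption.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""
```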
Parsing this code with BeautifulSoup gives you a BeautifulSoup object, which can be printed with standard indentation that reflects its structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# Output
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>
A few simple ways to browse structured data:
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Find the links of all the `<a>` tags in the document:
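A minimal sketch of that loop. The two-link fragment is a hypothetical stand-in for the full example document:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the full example document.
markup = (
    '<p><a href="http://example.com/elsie">Elsie</a> '
    '<a href="http://example.com/lacie">Lacie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

# Tag.get() returns the attribute value, or None when the attribute is absent.
links = [a.get("href") for a in soup.find_all("a")]
for link in links:
    print(link)
# http://example.com/elsie
# http://example.com/lacie
```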
Get all text content from the document:
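A short sketch using get_text() on a hypothetical one-paragraph fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; any parsed document works the same way.
markup = "<p>Once upon a time there were <b>three</b> little sisters.</p>"
soup = BeautifulSoup(markup, "html.parser")

# get_text() joins the text of the document and drops all markup.
text = soup.get_text()
print(text)
# Once upon a time there were three little sisters.
```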
Parser
Beautiful Soup actually relies on a parser when parsing. In addition to supporting the HTML parser in the Python standard library, it also supports some third-party parsers (such as lxml).
Parser | Usage |
---|---|
Python standard library | BeautifulSoup(markup, "html.parser") |
lxml HTML parser | BeautifulSoup(markup, "lxml") |
lxml XML parser | BeautifulSoup(markup, "xml") |
html5lib | BeautifulSoup(markup, "html5lib") |
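One possible way to pick a parser at run time is sketched below. The make_soup helper is hypothetical, not part of Beautiful Soup; it assumes you prefer lxml when it is installed and accept the standard library parser otherwise:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
# Hello
```

Different parsers can build slightly different trees from the same invalid markup, so a fallback like this suits well-formed input best.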
Object
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:
- Tag
- NavigableString
- BeautifulSoup
- Comment
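The four types can be seen side by side in a small sketch. The one-tag markup is a hypothetical example:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup: one tag holding a string and a comment.
soup = BeautifulSoup('<b class="boldest">bold<!--a comment--></b>', "html.parser")
tag = soup.b
text, comment = tag.contents

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```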
Tag
A Tag object corresponds to a tag in the original HTML or XML document.
The most important attributes in the tag:
- name
- attributes
Name
Each tag has its own name, which can be obtained through .name:
Changing a tag's name affects all HTML markup generated from the current Beautiful Soup object:
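A minimal sketch of reading and changing .name, using a hypothetical one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

original_name = tag.name
print(original_name)
# b

# Renaming the tag changes the markup that the soup generates.
tag.name = "blockquote"
print(soup)
# <blockquote class="boldest">Extremely bold</blockquote>
```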
Attributes
A tag can have any number of attributes. For example, a tag such as <b class="boldest"> has a "class" attribute whose value is "boldest". Tag attributes are read the same way as dictionary entries:
You can also get the whole attribute dictionary directly with dot notation, through .attrs:
Tag attributes can be added, deleted, or modified. Again, this works the same way as with a dictionary:
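The dictionary-style operations can be sketched as follows. The markup and the attribute values are hypothetical:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

tag_id = tag["id"]          # a single-valued attribute comes back as a string
tag_classes = tag["class"]  # class is multi-valued, so a list comes back
print(tag_id)               # intro
print(tag_classes)          # ['boldest']
print(sorted(tag.attrs))    # ['class', 'id']

tag["id"] = "summary"  # modify an attribute
tag["lang"] = "en"     # add an attribute
del tag["class"]       # delete an attribute
print(tag)
# <b id="summary" lang="en">Extremely bold</b>
```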
NavigableString
A string that can be traversed.
Strings are usually contained inside tags. Beautiful Soup uses the NavigableString class to wrap the strings inside a tag. A string contained in a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method:
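A short sketch of replacing a tag's string, assuming a hypothetical one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(type(tag.string).__name__)
# NavigableString

# The string is swapped for a new one rather than edited in place.
tag.string.replace_with("No longer bold")
print(tag)
# <b class="boldest">No longer bold</b>
```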
BeautifulSoup
The BeautifulSoup object represents the entire content of a document. Most of the time, it can be regarded as a Tag object, and it supports most of the methods described in traversing the document tree and searching the document tree.
Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name and no attributes. It is still sometimes convenient to look at its .name, so the BeautifulSoup object is given a special .name with the value "[document]".
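A quick check of that special name, on a hypothetical one-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

# The top-level object reports a placeholder name instead of a tag name.
print(soup.name)
# [document]
```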
Comment
The Comment object handles the comment parts of a document. It is a special type of NavigableString.
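A minimal sketch, using a hypothetical tag whose only child is a comment:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string
print(type(comment).__name__)
# Comment
print(comment)
# Hey, buddy. Want to buy a used parser?
```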
Traverse the document tree
The following examples demonstrate how to get from one piece of a document to another. A Tag may contain multiple strings or other tags; these are the Tag's child nodes. Beautiful Soup provides many attributes for working with and traversing child nodes.
tag name
The simplest way to navigate the document tree is by tag name, using soup.tag_name. If you want the <head> tag, just write soup.head:
You can chain this attribute access repeatedly to step down through the tree:
soup.tag_name.tag_name. ...
Accessing a tag name as an attribute returns only the first tag with that name.
If you want all the <a> tags, or anything more than the first tag with a given name, you need one of the methods described in "Search the document tree", such as find_all():
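Navigation by tag name can be sketched as follows. The markup is a shortened, hypothetical version of the example document:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the example document.
markup = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class=\"title\"><b>The Dormouse's story</b></p>"
    '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p></body></html>'
)
soup = BeautifulSoup(markup, "html.parser")

print(soup.head)    # <head><title>The Dormouse's story</title></head>
print(soup.body.b)  # <b>The Dormouse's story</b>
print(soup.a)       # <a id="link1">Elsie</a>  (first match only)

all_links = soup.find_all("a")
print(len(all_links))  # 2
```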
.contents and .children
The .contents attribute of a tag returns the tag's child nodes as a list:
The tag's .children generator lets you loop over its child nodes:
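Both attributes in a short sketch, on a hypothetical <head> fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# .contents is a list of direct children.
print(head_tag.contents)
# [<title>The Dormouse's story</title>]

# .children iterates over the same direct children without building a list.
children = [child for child in head_tag.title.children]
print(children[0])
# The Dormouse's story
```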
.descendants
The .descendants attribute lets you loop recursively over all of a tag's descendants:
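A small sketch contrasting direct children with all descendants, on a hypothetical <head> fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# <head> has one direct child (<title>) but two descendants:
# the <title> tag and the string inside it.
descendants = list(head_tag.descendants)
for node in descendants:
    print(repr(node))

print(len(list(head_tag.children)))  # 1
print(len(descendants))              # 2
```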
.string
If a tag has only one child node and that child is a NavigableString, the child is available as .string. If a tag's only child is another tag that itself has a .string, the outer tag is considered to have the same .string.
If a tag contains more than one child node, it is unclear which one .string should refer to, so .string is None.
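The three cases can be sketched on a hypothetical fragment:

```python
from bs4 import BeautifulSoup

markup = "<head><title>The Dormouse's story</title></head><p>one<b>two</b></p>"
soup = BeautifulSoup(markup, "html.parser")

print(soup.title.string)  # The Dormouse's story  (single string child)
print(soup.head.string)   # The Dormouse's story  (single tag child with a .string)
print(soup.p.string)      # None  (two children, so .string is ambiguous)
```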
.strings and .stripped_strings
If a tag contains more than one string, you can loop over them with .strings:
The resulting strings may contain a lot of extra spaces or blank lines; use .stripped_strings to remove the extra whitespace.
Strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of each string is removed.
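A sketch comparing the two generators on a hypothetical paragraph with stray whitespace:

```python
from bs4 import BeautifulSoup

markup = "<p><a>Elsie</a>\n<a>Lacie</a>  and Tillie </p>"
soup = BeautifulSoup(markup, "html.parser")

raw = list(soup.p.strings)
clean = list(soup.p.stripped_strings)

print(raw)
# ['Elsie', '\n', 'Lacie', '  and Tillie ']
print(clean)
# ['Elsie', 'Lacie', 'and Tillie']
```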
.parent
The .parent attribute gives you an element's parent node. In the "Alice" example document, the <head> tag is the parent of the <title> tag.
.parents
The .parents attribute of an element lets you iterate over all of the element's ancestors. For example, .parents can be used to walk from a tag all the way up to the root of the document.
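Both .parent and .parents in a short sketch, using a hypothetical trimmed-down document:

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(markup, "html.parser")
title_tag = soup.title

print(title_tag.parent.name)
# head

# Walk upward until the BeautifulSoup object itself is reached.
ancestor_names = [parent.name for parent in title_tag.parents]
print(ancestor_names)
# ['head', 'html', '[document]']
```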
.next_sibling and .previous_sibling
In the document tree, use the .next_sibling and .previous_sibling attributes to get the sibling nodes of an element:
.next_siblings and .previous_siblings
The .next_siblings and .previous_siblings attributes let you iterate over the siblings after or before the current node:
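A sketch of the sibling attributes. The <a>, <b>, <c>, <d> markup is a hypothetical fragment chosen so the siblings have distinct names:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text1</b><c>text2</c><d>text3</d></a>", "html.parser")

print(soup.b.next_sibling)      # <c>text2</c>
print(soup.c.previous_sibling)  # <b>text1</b>
print(soup.b.previous_sibling)  # None  (<b> is the first child)

after_b = [sibling.name for sibling in soup.b.next_siblings]
before_d = [sibling.name for sibling in soup.d.previous_siblings]
print(after_b)   # ['c', 'd']
print(before_d)  # ['c', 'b']
```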
Search the document tree
find_all
Returns all matching results, unlike find(), which returns only the first match.
Syntax:
find_all( name , attrs , recursive , string , limit , **kwargs )
- name: finds all tags whose name is name; string objects are automatically ignored.
- keyword arguments: a keyword argument that is not one of the built-in parameter names is treated as a tag attribute to search on.
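A few find_all() calls sketched on a hypothetical three-link paragraph:

```python
from bs4 import BeautifulSoup

markup = (
    '<p class="story"><a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

by_name = soup.find_all("a")       # every <a> tag
by_id = soup.find_all(id="link2")  # keyword argument used as an attribute filter
# class is a Python keyword, so the filter is spelled class_.
first_two = soup.find_all("a", class_="sister", limit=2)

print(len(by_name))                 # 3
print(by_id)                        # [<a class="sister" id="link2">Lacie</a>]
print([a["id"] for a in first_two]) # ['link1', 'link2']
```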
find
Returns only the first matching result, or None if nothing matches.
Syntax:
find( name , attrs , recursive , string , **kwargs )
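A short sketch of find(), on a hypothetical two-link paragraph:

```python
from bs4 import BeautifulSoup

markup = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("a")
missing = soup.find("table")

print(first)    # <a id="link1">Elsie</a>
print(missing)  # None

# find() behaves like find_all() with limit=1, minus the list wrapper.
same_as_limit = soup.find_all("a", limit=1) == [first]
print(same_as_limit)  # True
```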
find_parents() and find_parent()
find_parents( name , attrs , string , limit , **kwargs )
find_parent( name , attrs , string , **kwargs )
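These two methods search upward through a tag's ancestors rather than downward. A minimal sketch, assuming a hypothetical nested fragment:

```python
from bs4 import BeautifulSoup

markup = '<div id="outer"><p class="story"><a id="link1">Elsie</a></p></div>'
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

nearest_p = link.find_parent("p")      # first matching ancestor
all_divs = link.find_parents("div")    # every matching ancestor
no_table = link.find_parent("table")   # no such ancestor

print(nearest_p["class"])             # ['story']
print([d["id"] for d in all_divs])    # ['outer']
print(no_table)                       # None
```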
Modify the document tree
Modify tag name and attributes
Modify .string
Assigning a value to the .string attribute of the tag is equivalent to replacing the original content with the current content:
Note: if the current tag contains other tags, assigning a value to .string overwrites all of the original content, including the child tags.
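A sketch of that overwrite, using a hypothetical link that contains a child tag:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# The old string and the <i> child are both replaced.
tag.string = "New link text."
print(tag)
# <a href="http://example.com/">New link text.</a>
```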
new_tag()
The best way to create a tag is to call the factory method BeautifulSoup.new_tag():
The first parameter, the tag name, is required; all other parameters are optional.
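A minimal sketch of creating and attaching a tag. The starting markup and the link target are hypothetical:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# Only the tag name is required; keyword arguments become attributes.
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
soup.b.append(new_tag)

print(soup.b)
# <b><a href="http://www.example.com">Link text.</a></b>
```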
append()
The Tag.append() method adds content to the tag, just like the .append() method of Python lists:
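A short sketch of append(), on a hypothetical one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>Foo</a>", "html.parser")
soup.a.append("Bar")

print(soup)
# <a>FooBar</a>
print(soup.a.contents)
# ['Foo', 'Bar']
```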
insert()
The Tag.insert() method is similar to Tag.append(), except that the new element is not necessarily added at the end of the parent's .contents. Instead, it is inserted at the position you specify, just like Python's list.insert():
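A sketch of insert() at position 1, using a hypothetical link with two children:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Position 1 sits between the leading string and the <i> tag.
tag.insert(1, "but did not endorse ")
print(tag)
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
print(len(tag.contents))
# 3
```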
insert_before() and insert_after()
The insert_before() method inserts content immediately before the current tag or text node.
The insert_after() method inserts content immediately after the current tag or text node:
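Both methods in one sketch, on a hypothetical one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")

# Insert a new <i> tag before the existing string.
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
print(soup.b)
# <b><i>Don't</i>stop</b>

# Insert a new string after the <i> tag.
soup.b.i.insert_after(soup.new_string(" ever "))
print(soup.b)
# <b><i>Don't</i> ever stop</b>
```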
Other functions
Method | Description |
---|---|
clear() | Removes the contents of the current tag. |
extract() | Removes the current tag from the document tree and returns it as the method's result. |
decompose() | Removes the current node from the document tree and completely destroys it. |
replace_with() | Removes a piece of content from the document tree and replaces it with a new tag or text node. |
wrap() | Wraps the element in the tag you specify and returns the new wrapper. |
unwrap() | The opposite of wrap(). Replaces a tag with whatever is inside it; often used to strip markup. |
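Each method in the table can be tried on a fresh copy of a small document. In this sketch the fresh() helper and the sample markup are hypothetical:

```python
from bs4 import BeautifulSoup

LINK = '<a href="http://example.com/">I linked to <i>example.com</i></a>'


def fresh(markup=LINK):
    """Hypothetical helper: parse a new copy so each demo starts clean."""
    return BeautifulSoup(markup, "html.parser")


cleared = fresh()
cleared.a.clear()
print(cleared)         # <a href="http://example.com/"></a>

extracted_from = fresh()
removed = extracted_from.i.extract()
print(removed)         # <i>example.com</i>
print(extracted_from)  # <a href="http://example.com/">I linked to </a>

decomposed = fresh()
decomposed.i.decompose()
print(decomposed)      # <a href="http://example.com/">I linked to </a>

replaced = fresh()
bold = replaced.new_tag("b")
bold.string = "example.net"
replaced.i.replace_with(bold)
print(replaced)        # <a href="http://example.com/">I linked to <b>example.net</b></a>

wrapped = fresh("<p>I wish I was bold.</p>")
wrapped.p.string.wrap(wrapped.new_tag("b"))
print(wrapped)         # <p><b>I wish I was bold.</b></p>

unwrapped = fresh()
unwrapped.i.unwrap()
print(unwrapped)       # <a href="http://example.com/">I linked to example.com</a>
```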
Output
Formatted output
prettify()
The prettify() method formats a Beautiful Soup document tree and returns it as a Unicode string, with each XML/HTML tag on its own line. Both the BeautifulSoup object and any of its tag nodes support prettify().
Compressed output
If you only want the resulting string and do not care about formatting, call str() (or unicode() in Python 2) on a BeautifulSoup object or a Tag object:
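A sketch contrasting the compact form with prettify(), assuming Python 3 and a hypothetical one-link document. The exact indentation prettify() produces is not shown here, since it can vary between versions:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup)        # no added whitespace
inner = str(soup.a.i)      # works on any Tag as well
pretty = soup.prettify()   # multi-line, indented form

print(compact)
# <a href="http://example.com/">I linked to <i>example.com</i></a>
print(inner)
# <i>example.com</i>
print(pretty)
```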
get_text()
If you only want the text contained in a tag, use the get_text() method. It collects all the text in the tag, including the text of its descendant tags, and returns the result as a single Unicode string:
The separator of the text content of the tag can be specified by parameters:
You can also remove the blank space before and after the obtained text content:
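The three variants can be sketched as follows. The markup is a hypothetical link with newlines around its text:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()
separated = soup.get_text("|")               # join the pieces with a separator
stripped = soup.get_text("|", strip=True)    # also strip each piece

print(repr(plain))      # '\nI linked to example.com\n'
print(repr(separated))  # '\nI linked to |example.com|\n'
print(repr(stripped))   # 'I linked to|example.com'
```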
Copy the Beautiful Soup object
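The copy.copy() function can copy any Tag or NavigableString object:
import copy
from bs4 import BeautifulSoup
# Sample markup taken from the output shown below
soup = BeautifulSoup("<p>I want <b>pizza</b> and more <b>pizza</b>!</p>", "html.parser")
p_copy = copy.copy(soup.p)
print(p_copy)
# <p>I want <b>pizza</b> and more <b>pizza</b>!</p>
The copy is equal to the original, but it is a distinct object in memory:
print(soup.p == p_copy)
# True
print(soup.p is p_copy)
# False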