Reposted from: Micro reading https://www.weidianyuedu.com
First, the powerful BeautifulSoup: BeautifulSoup is a Python library that extracts data from HTML or XML files. It works with your favorite parser to provide the usual ways of navigating, searching, and modifying a document. In Python development its search and extraction features are used most, and its modification features are rarely used.
1. Install BeautifulSoup
pip3 install beautifulsoup4
2. Install the third-party html parser lxml
pip3 install lxml
3. Install the html5lib parser implemented in pure Python
pip3 install html5lib
Second, the use of BeautifulSoup:
1. Import the bs4 library
from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 library
2. Create a string containing html code
html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
"""
3. Create a BeautifulSoup object
(1) Create directly by string
soup = BeautifulSoup(html_str, "lxml")  # "lxml" is the parser; "html.parser" also works
print(soup.prettify())  # output the contents of the soup object, indented
(2) Create through existing files
soup = BeautifulSoup(open("/home/index.html"), features="html.parser")  # "html.parser" is the parser; "lxml" also works
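As a minimal sketch of both ways of creating the object, the example below parses the same markup once from a string and once from a file. It assumes the built-in "html.parser" so that no extra parser is required, and it writes a temporary file rather than relying on /home/index.html; the markup and the file name are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# (1) Create the soup directly from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# (2) Create the soup from a file; the with-block closes the file handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

title_from_string = soup_from_string.title.string
title_from_file = soup_from_file.title.string
print(title_from_string, title_from_file)
```

Opening the file in a with-block is a design choice of this sketch: it makes sure the file handle is closed once parsing has finished.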
4. Types of BeautifulSoup objects: BeautifulSoup converts a complex HTML document into a tree structure in which each node is a Python object. There are four types of objects:
(1) BeautifulSoup: represents the entire document. Most of the time it can be treated as a special Tag object. Because it does not correspond to a real HTML or XML tag, it has no tag name and no attributes of its own; its .name is the special value "[document]"
(2) Tag: corresponds to a tag in the original XML or HTML document
For example:
Extract title: print(soup.title)
Extract a: print(soup.a)
Extract p: print(soup.p)
A Tag has two important properties: its name and its attributes. Each Tag has its own name, obtained through .name
print(soup.title.name)
A Tag's attributes are read and modified in the same way as a dictionary
For example: <p class='p1'>Hello World</p>
print(soup.p["class"])
You can also use dot notation to get the attributes directly: .attrs returns all of the Tag's attributes as a dictionary
print(soup.p.attrs)
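The dictionary-style access described above can be sketched as follows. The one-line document and its id attribute are hypothetical and exist only for this example, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document based on the <p class='p1'> example.
soup = BeautifulSoup("<p class='p1' id='greeting'>Hello World</p>", "html.parser")
tag = soup.p

tag_name = tag.name         # the tag's name
tag_classes = tag["class"]  # class can hold several values, so a list comes back
tag_id = tag["id"]          # single-valued attributes come back as strings
missing = tag.get("title")  # .get() returns None instead of raising KeyError
all_attrs = tag.attrs       # every attribute as a dictionary

print(tag_name, tag_classes, tag_id, missing, all_attrs)
```

Note that in HTML documents the class attribute is returned as a list, because one tag may carry several classes.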
(3) NavigableString: get the text inside a tag with .string
BeautifulSoup uses the NavigableString class to wrap the string inside a Tag. A NavigableString behaves like an ordinary Python string, except that it also supports the navigation features. It can be converted to a plain Python string with str() (in Python 2 this was unicode())
For example: u_string = str(soup.p.string)
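A short sketch of the conversion, assuming Python 3 and the built-in "html.parser"; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Hello World</p>", "html.parser")

nav_string = soup.p.string      # a NavigableString, still linked to the tree
plain_string = str(nav_string)  # a plain str with no link back to the tree

is_navigable = isinstance(nav_string, NavigableString)
is_plain = type(plain_string) is str
print(is_navigable, is_plain, plain_string)
```

Converting to a plain string is useful when the text will be kept after the soup is no longer needed, since a NavigableString keeps a reference to the whole parse tree.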
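(4) Comment: a special kind of NavigableString that holds an HTML comment. Calling .string on a tag whose only content is a comment returns the comment text without the <!-- --> markers, so unless you check, comment text can get mixed into the extracted data. When extracting a string you can therefore check its type:
import bs4
if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)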
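The type check can also be used to filter comments out during extraction. The sketch below assumes hypothetical markup with two links, one holding only a comment and one holding real text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one link holds a comment, the other holds real text.
html = '<a id="link1"><!--Elsie--></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

visible_text = []
for link in soup.find_all("a"):
    # Skip strings that are really comments so they do not pollute the data.
    if isinstance(link.string, Comment):
        continue
    visible_text.append(str(link.string))

print(visible_text)
```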
5. Traversing documents
(1) Child nodes:
A. Direct child nodes can be accessed through .contents and .children
.contents ----> Returns the Tag's direct child nodes as a list
print(soup.head.contents)
.children ----> Returns an iterator for looping over the Tag's direct child nodes
for child in soup.head.children:
    print(child)
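To make the difference between the two concrete, here is a small sketch. The markup is hypothetical and deliberately has no whitespace between tags, so no whitespace-only text nodes appear among the children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup without whitespace between the tags.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

as_list = soup.ul.contents  # a real list: supports len() and indexing
child_names = [child.name for child in soup.ul.children]  # consumed by iterating

print(len(as_list), as_list[0], child_names)
```

In real pages, the line breaks between tags show up as extra string children, so the counts are usually larger than the number of tags.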
B. Get the content of the child node
.string ----> If the tag contains only a string, that string is returned; if the tag contains exactly one child tag, the innermost content of that child is returned; if the tag contains multiple child nodes, it cannot determine which one to return, and .string is None
.strings ----> Mainly used when the Tag contains multiple strings; it can be looped over
for string in soup.strings:
    print(repr(string))
.stripped_strings ----> Like .strings, but strips leading and trailing whitespace and skips strings that consist only of spaces or blank lines
for string in soup.stripped_strings:
    print(repr(string))
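The three attributes can be compared side by side in a short sketch. The markup below is hypothetical and includes line breaks between tags so that the effect of .stripped_strings is visible.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with line breaks between the tags.
html = "<div>\n<p><b>bold</b></p>\n<p>one <i>two</i></p>\n</div>"
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

single = first_p.string      # exactly one child tag: its innermost string
ambiguous = second_p.string  # several child nodes: None

raw = list(soup.div.strings)             # includes whitespace-only strings
clean = list(soup.div.stripped_strings)  # stripped, whitespace-only ones skipped

print(repr(single), ambiguous, raw, clean)
```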
(2) Parent node
A. Obtain the parent node of an element through the .parent attribute, such as:
print(soup.title.parent)
B. The .parents attribute iterates over all of an element's ancestors, from its direct parent up to the BeautifulSoup object
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
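A minimal sketch of walking up the tree, using hypothetical markup and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link nested three levels deep.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

direct_parent = soup.a.parent.name
ancestors = [parent.name for parent in soup.a.parents]

print(direct_parent)  # p
print(ancestors)      # ['p', 'body', 'html', '[document]']
```

The last entry, "[document]", is the name of the BeautifulSoup object itself, which sits at the top of the tree.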
(3) Sibling nodes
.next_sibling ----> Get the next sibling node of this node
.previous_sibling ----> Get the previous sibling node of this node
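Sibling access can be sketched as below. The markup is hypothetical and has no whitespace between tags; in real pages the nearest sibling is often a whitespace string rather than a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three tags that share the same parent.
soup = BeautifulSoup("<p><b>one</b><i>two</i><u>three</u></p>", "html.parser")

next_name = soup.i.next_sibling.name      # the tag right after <i>
prev_name = soup.i.previous_sibling.name  # the tag right before <i>
no_sibling = soup.b.previous_sibling      # <b> is the first child, so None
later_names = [sib.name for sib in soup.b.next_siblings]  # all later siblings

print(next_name, prev_name, no_sibling, later_names)
```

The plural forms .next_siblings and .previous_siblings iterate over all remaining siblings in that direction.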
(4) Next and previous elements
.next_elements ----> Iterate over all the nodes that were parsed after this node
.previous_elements ----> Iterate over all the nodes that were parsed before this node
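Unlike siblings, these iterators follow the order in which nodes were parsed, so a tag's own string counts as its next element. A small sketch under that reading, where describe() is a helper invented for this example:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def describe(element):
    """Label a node as a tag name or as quoted text (helper for this sketch)."""
    return f"<{element.name}>" if isinstance(element, Tag) else repr(str(element))


# Hypothetical markup: two tags inside one paragraph.
soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

following = [describe(el) for el in soup.b.next_elements]
preceding = [describe(el) for el in soup.i.previous_elements]

print(following)      # ["'one'", '<i>', "'two'"]
print(preceding[:3])  # ["'one'", '<b>', '<p>']
```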
6. Search the document tree
(1) find_all(name, attrs, recursive, string, limit, **kwargs)
A. name parameter: Find the tags whose name is name
print(soup.find_all("b"))
B. string parameter (called text in older versions): Find strings in the document by their content
C. recursive parameter: find_all() searches all descendant nodes of the current Tag by default; if you only want to search direct child nodes, set this parameter to False
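The parameters can be tried out on a small document. This is a sketch with hypothetical markup, and it assumes a Beautiful Soup version recent enough to accept the string argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = (
    "<div><p class='story'>Once <b>upon</b> a time</p>"
    "<p class='title'><b>The End</b></p></div>"
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("b")                            # all <b> tags, at any depth
by_attr = soup.find_all("p", attrs={"class": "title"})  # filter on an attribute
by_string = soup.find_all(string="The End")             # returns strings, not tags
direct_only = soup.div.find_all("b", recursive=False)   # <b> is not a direct child
limited = soup.find_all("p", limit=1)                   # stop after the first match

print(by_name, by_attr, by_string, direct_only, limited)
```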
7. CSS selectors: use the soup.select() method
(1) Search by tag name
print(soup.select("title"))
(2) Search through the tag's class attribute value
print(soup.select(".sister"))
(3) Search through the id attribute value of Tag
print(soup.select("#link1"))
(4) Search by whether there is an attribute
print(soup.select("a[href]"))
(5) Search by attribute value
print(soup.select('a[href="http://example.com/elsie"]'))
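The five selector forms can be checked together in one sketch. The markup is hypothetical; the second link, without an href, is added here only to show the difference between the selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute.
html = (
    "<html><head><title>Demo</title></head><body>"
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("title")        # by tag name
by_class = soup.select(".sister")    # by class attribute value
by_id = soup.select("#link1")        # by id attribute value
has_href = soup.select("a[href]")    # by presence of an attribute
href_equals = soup.select('a[href="http://example.com/elsie"]')  # by attribute value
first_match = soup.select_one(".sister")  # only the first match, or None

print(by_tag, by_class, by_id, has_href, href_equals, first_match)
```

select() always returns a list, even for a single match, while select_one() returns the first matching Tag or None.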