HTML parsing in Python

Reposted from: Micro reading https://www.weidianyuedu.com

First, the powerful BeautifulSoup: BeautifulSoup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide the usual ways of navigating, searching, and modifying a document. In everyday Python development its search and extraction features see the most use; the modification features are rarely needed

1. Install BeautifulSoup

pip3 install beautifulsoup4

2. Install the third-party html parser lxml

pip3 install lxml

3. Install the html5lib parser implemented in pure Python

pip3 install html5lib

Second, the use of BeautifulSoup:

1. Import the bs4 library

from bs4 import BeautifulSoup  # import the BeautifulSoup class from the bs4 package

2. Create a string containing html code

html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
"""

3. Create a BeautifulSoup object

(1) Create directly by string

soup = BeautifulSoup(html_str, "lxml")  # "lxml" is the parser; the built-in "html.parser" also works

print(soup.prettify())  # print the parsed document with indentation

(2) Create through existing files

with open("/home/index.html") as f:  # close the file once it has been parsed
    soup = BeautifulSoup(f, features="html.parser")  # "html.parser" is the parser; "lxml" also works
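
The two ways of creating a soup object can be compared side by side. The following is a minimal sketch, assuming the built-in "html.parser" so that nothing beyond bs4 is required; the short HTML string and the temporary file are made up for illustration and stand in for a real page on disk.

```python
import tempfile

from bs4 import BeautifulSoup

html_str = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# (1) Parse directly from a string.
soup_from_str = BeautifulSoup(html_str, "html.parser")

# (2) Parse from an open file handle. A temporary file stands in for a real
#     HTML file here; it is closed and deleted when the with block ends.
with tempfile.NamedTemporaryFile("w+", suffix=".html", encoding="utf-8") as f:
    f.write(html_str)
    f.seek(0)
    soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_str.title.string)  # Demo
print(soup_from_file.p.string)     # Hello
```

Both calls produce the same tree; only the source of the markup differs.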

4. Types of BeautifulSoup objects: BeautifulSoup converts a complex HTML document into a tree in which every node is a Python object. There are four kinds of object

(1) BeautifulSoup: represents the document as a whole. Most of the time it can be treated as a special Tag. Because it does not correspond to a real HTML or XML tag, it has no attributes, and its .name is the placeholder "[document]"

(2) Tag: corresponds to a tag in the original XML or HTML document

For example:

print(soup.title)  # extract the first <title> tag

print(soup.a)  # extract the first <a> tag

print(soup.p)  # extract the first <p> tag

A Tag has two important attributes: name and attrs. Every Tag has a name, obtained through .name

print(soup.title.name)

A Tag's attributes are read and written the same way as dictionary entries

For example, given <p class="p1">Hello World</p>:

print(soup.p["class"])  # class can hold several values, so the result is a list: ['p1']

You can also use dot access: .attrs returns all of the Tag's attributes as a dictionary

print(soup.p.attrs)
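
The attribute access described above can be tried on a one-line document. This is a small sketch, assuming the built-in "html.parser"; the id attribute and the title lookup are added only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="p1" id="greeting">Hello World</p>', "html.parser")
tag = soup.p

print(tag.name)          # p
print(tag["class"])      # ['p1']  (multi-valued attribute, so a list)
print(tag["id"])         # greeting
print(tag.attrs)         # {'class': ['p1'], 'id': 'greeting'}
print(tag.get("title"))  # None; .get() avoids a KeyError for a missing attribute
```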

(3) NavigableString: the text inside a tag, obtained through .string

BeautifulSoup uses the NavigableString class to wrap the string inside a Tag. A NavigableString behaves like an ordinary Python string, and it can be converted into a plain string with str() (unicode() in Python 2)

For example: u_string = str(soup.p.string)

(4) Comment: a special kind of NavigableString that holds an HTML comment. .string returns the comment text without the <!-- --> markers, so a comment can be mistaken for ordinary text during extraction. To avoid this, check the type before using the string:

import bs4

if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)
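
The four object types can be seen together in a short sketch like the one below. It assumes the built-in "html.parser"; the helper visible_text is hypothetical and only shows one possible way of skipping comments when extracting text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup('<p>text</p><a href="#"><!--Elsie--></a>', "html.parser")

print(type(soup).__name__)           # BeautifulSoup
print(soup.name)                     # [document]
print(type(soup.p).__name__)         # Tag
print(type(soup.p.string).__name__)  # NavigableString
print(type(soup.a.string).__name__)  # Comment


def visible_text(tag):
    """Return the tag's single string, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(visible_text(soup.p))  # text
print(visible_text(soup.a))  # None
```

Because Comment is a subclass of NavigableString, the check has to test for Comment specifically rather than for NavigableString.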

5. Traversing documents

(1) Child nodes:

A. Direct child nodes can be accessed through .contents and .children

.contents: returns the Tag's direct children as a list

print(soup.head.contents)

.children: returns an iterator over the Tag's direct children

for child in soup.head.children:
    print(child)

B. Get the content of the child node

.string: if the tag contains only a string, that string is returned; if it contains exactly one child tag, the .string of that child is returned; if it contains more than one child node, it is ambiguous which string is meant, so .string returns None

.strings: used when the Tag contains several strings; it yields each one in turn

for s in soup.strings:
    print(repr(s))

.stripped_strings: like .strings, but strips leading and trailing whitespace and skips strings that consist only of whitespace

for s in soup.stripped_strings:
    print(repr(s))
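
The difference between these attributes is easiest to see on a tag with mixed content. The following is a sketch under the assumption that "html.parser" is used; the small div document is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p> two <b>bold</b></p></div>", "html.parser")
div = soup.div

print(len(div.contents))                     # 2
print([child.name for child in div.children])  # ['p', 'p']

first, second = div.contents
print(first.string)                   # one   (a single string child)
print(second.string)                  # None  (a string and a tag: ambiguous)
print(list(second.strings))           # [' two ', 'bold']
print(list(second.stripped_strings))  # ['two', 'bold']
```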

(2) Parent node

A. The .parent attribute returns the parent node of an element, for example:

print(soup.title.parent)

B. The .parents attribute iterates over all ancestors of an element, from the nearest one outward

for parent in soup.a.parents:
    print(parent.name)  # the last ancestor is the BeautifulSoup object, named "[document]"
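
A short sketch of walking up the tree, assuming "html.parser" and a made-up four-level document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a href='#'>link</a></p></body></html>", "html.parser"
)

print(soup.a.parent.name)  # p

# .parents walks from the nearest ancestor up to the BeautifulSoup object.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```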

(3) Sibling nodes

.next_sibling: the next sibling of the node

.previous_sibling: the previous sibling of the node. Whitespace between tags counts as a string node, so a sibling is often a NavigableString such as "\n" rather than a Tag
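
Sibling navigation might look like the sketch below, which assumes "html.parser". The second document contains a newline between tags to show how whitespace becomes a sibling; find_next_sibling is used there as one way to skip over it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i><u>three</u></p>", "html.parser")

print(soup.i.next_sibling.name)      # u
print(soup.i.previous_sibling.name)  # b
print(soup.b.previous_sibling)       # None (b is the first child)

# With whitespace between tags, the immediate sibling is a string node.
spaced = BeautifulSoup("<p><b>one</b>\n<i>two</i></p>", "html.parser")
print(repr(spaced.b.next_sibling))          # '\n'
print(spaced.b.find_next_sibling("i").name)  # i
```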

(4) Previous and next elements

.next_elements: iterates over all nodes parsed after this node, in document order

.previous_elements: iterates over all nodes parsed before this node, in reverse document order
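
Unlike siblings, these iterators follow parse order, so they descend into child nodes as well. The sketch below assumes "html.parser"; the label helper is hypothetical and only makes tags and strings easy to tell apart in the output.

```python
from itertools import islice

from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")


def label(node):
    """Render tags as <name> and strings as their text."""
    return f"<{node.name}>" if isinstance(node, Tag) else str(node)


# Everything parsed after <b>: its own string, then <i>, then the string in <i>.
after_b = [label(node) for node in soup.b.next_elements]
print(after_b)  # ['one', '<i>', 'two']

# The two nodes parsed immediately before <i>, nearest first.
before_i = [label(node) for node in islice(soup.i.previous_elements, 2)]
print(before_i)  # ['one', '<b>']
```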

6. Search the document tree

(1) find_all(name, attrs, recursive, text, **kwargs)

A. name parameter: finds the tags whose name is name

print(soup.find_all("b"))

B. text parameter: searches the strings in the document instead of the tags (Beautiful Soup 4.4.0 and later call this parameter string)

C. recursive parameter: by default find_all searches all descendants of the current Tag; set recursive=False to search only its direct children
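
The parameters above can be exercised on a small document. This is a sketch assuming "html.parser" and a recent Beautiful Soup 4 release, which is why it uses string= rather than the older text= spelling; the markup and the link2 id are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><a class="sister" id="link1">Elsie</a>'
    '<p><a class="sister" id="link2">Lacie</a></p></div>',
    "html.parser",
)

# name: every <a> tag anywhere in the document.
all_links = soup.find_all("a")

# attrs: filter on attribute values.
sisters = soup.find_all("a", attrs={"class": "sister"})

# keyword argument: shorthand for filtering on a single attribute.
second = soup.find_all(id="link2")

# recursive=False: only direct children of <div>, so the nested link is skipped.
direct = soup.div.find_all("a", recursive=False)

# string: match text nodes instead of tags.
texts = soup.find_all(string="Lacie")

print([a["id"] for a in all_links])  # ['link1', 'link2']
print([a["id"] for a in direct])     # ['link1']
print(texts)                         # ['Lacie']
```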

7. CSS selector: use the soup.select() function

(1) Search by tag name

print(soup.select("title"))

(2) Search through the tag's class attribute value

print(soup.select(".sister"))

(3) Search through the id attribute value of Tag

print(soup.select("#link1"))

(4) Search by whether there is an attribute

print(soup.select("a[href]"))

(5) Find by attribute value

print(soup.select('a[href="http://example.com/elsie"]'))
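
The five selector forms can be checked against one document. The sketch below assumes "html.parser" and a Beautiful Soup installation with CSS selector support; the second link without an href is added so that the attribute selectors have something to exclude.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Demo</title></head><body>"
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    "</body></html>",
    "html.parser",
)

by_tag = soup.select("title")    # (1) tag name
by_class = soup.select(".sister")  # (2) class attribute
by_id = soup.select("#link1")    # (3) id attribute
has_href = soup.select("a[href]")  # (4) attribute is present
exact = soup.select('a[href="http://example.com/elsie"]')  # (5) attribute value

print(by_tag[0].get_text())          # Demo
print([a["id"] for a in by_class])   # ['link1', 'link2']
print([a["id"] for a in has_href])   # ['link1']
print(soup.select_one("#link1").get_text())  # Elsie
```

select() always returns a list, even for an id selector; select_one() returns the first match or None.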

Origin blog.csdn.net/hdxx2022/article/details/129794285