Beautiful Soup 4: data analysis and extraction

Overview

Beautiful Soup is a Python library for parsing HTML and XML documents, providing convenient data extraction and manipulation functions. It helps extract required data from web pages such as tags, text content, attributes, etc.

Beautiful Soup will automatically convert input documents to Unicode encoding and output documents to UTF-8 encoding.

Beautiful Soup is relatively simple to use to parse HTML. The API is very user-friendly and supports multiple parsers.

Documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Main features

Flexible parsing method:

Beautiful Soup supports multiple parsers, including the html.parser parser in the Python standard library, as well as third-party libraries lxml and html5lib. In this way, we can choose the appropriate parser for processing according to our needs.

Simple and intuitive API:

Beautiful Soup provides a concise and friendly API that makes parsing HTML documents very easy. We can use concise methods to select specific tags, get the text content in tags, extract attribute values, etc.

Powerful document traversal capabilities:

Through Beautiful Soup, we can traverse the entire HTML document tree, access, modify or delete each node, and even quickly locate the required node through nested selectors.

Tolerance for broken HTML:

Beautiful Soup can handle broken HTML documents, such as automatically correcting unclosed tags, automatically adding missing tags, etc., making data extraction more stable and reliable.
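A minimal sketch of this fault tolerance, using the standard library parser (the exact repaired markup can differ between parsers):

```python
from bs4 import BeautifulSoup

# Markup whose <b> and <p> tags were never closed
broken = "<p>Hello <b>world"
soup = BeautifulSoup(broken, "html.parser")

# Beautiful Soup closes the dangling tags when it builds the tree
print(soup)               # <p>Hello <b>world</b></p>
print(soup.b.get_text())  # world
```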

Parsers

Beautiful Soup relies on a parser to do its work. In addition to the HTML parser in the Python standard library, it also supports several third-party libraries.

Supported parsers:

| Parser | Typical usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built in, moderate speed, tolerant of bad markup | Tolerance is poor before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast, tolerant of bad markup | Requires the lxml C library |
| lxml XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast, the only parser that supports XML | Requires the lxml C library |
| html5lib | BeautifulSoup(markup, "html5lib") | The most tolerant; parses pages the way a browser does and produces valid HTML5 | Very slow; requires the external html5lib library |

As the table shows, the lxml parser handles both HTML and XML documents and is fast and fault-tolerant, so it is the recommended choice. To use it, pass "lxml" as the second argument when creating the BeautifulSoup object.
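Since lxml is a separate C-backed install, one hedged pattern (a sketch, not something Beautiful Soup requires) is to prefer it and fall back to the built-in parser when it is missing:

```python
from bs4 import BeautifulSoup

# Prefer lxml for speed, but fall back to the standard library parser
# if the C extension is not installed
try:
    import lxml  # noqa: F401  (only checking availability)
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"

soup = BeautifulSoup("<p>parser check</p>", PARSER)
print(PARSER, soup.p.get_text())
```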

Basic usage of Beautiful Soup 4

Installing the libraries

pip install beautifulsoup4

pip install lxml

Create an HTML file

You get a document object by passing a document into the BeautifulSoup constructor. You can pass in either a string or an open file handle.

Create a test.html file here to build a document object from.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div>
    <ul>
         <li class="class01"><span index="1">H1</span></li>
         <li class="class02"><span index="2" class="span2">H2</span></li>
         <li class="class03"><span index="3">H3</span></li>
     </ul>
 </div>
</body>
</html>

Basic usage

# Import the module
from bs4 import BeautifulSoup

# Create a BeautifulSoup object; there are two ways:
# 1. From a string; the second argument selects the parser
# soup = BeautifulSoup("<html>a document string</html>", 'lxml')
# 2. From a file object
soup = BeautifulSoup(open('test.html'), 'lxml')
# Pretty-print the parsed document
# print(soup.prettify())


# Get a tag element; by default the first match is returned
print(soup.li)

# Use .contents or .children to get child elements
# .contents returns a list
print(soup.ul.contents)
# .children returns an iterator
print(soup.li.children)

# Get an element's text content
print(soup.title.get_text())

# Get an attribute value; by default from the first matching element
print(soup.li.get('class'))

The results of the operation are as follows:

<li class="class01"><span index="1">H1</span></li>

['\n', <li class="class01"><span index="1">H1</span></li>, '\n', <li class="class02"><span class="span2" index="2">H2</span></li>, '\n', <li class="class03"><span index="3">H3</span></li>, '\n']

<list_iterator object at 0x000001C18E475F10>

Title

['class01']

Object types of Beautiful Soup 4

Beautiful Soup converts a complex HTML document into a tree of Python objects. Every node is a Python object, and all objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

Tag object

In Beautiful Soup, a Tag object represents a tag element in an HTML or XML document and corresponds directly to a tag in the original markup. A Tag object carries the element's name, attributes, and content, and provides a variety of methods to read, modify, and manipulate that information.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Common properties and methods of Tag objects

| Attribute / method | Description | Example |
| --- | --- | --- |
| name attribute | Get the name of the tag | tag_name = tag.name |
| string attribute | Get the text content within the tag | tag_text = tag.string |
| attrs attribute | Get the tag's attributes, returned as a dictionary | tag_attrs = tag.attrs |
| get() method | Get an attribute value of the tag by attribute name | attr_value = tag.get('attribute_name') |
| find() method | Find and return the first child tag element that matches | child_tag = tag.find('tag_name') |
| find_all() method | Find and return all matching child tag elements, as a list | child_tags = tag.find_all('tag_name') |
| parent attribute | Get the parent tag of the current tag | parent_tag = tag.parent |
| parents attribute | Get all ancestor tags of the current tag, as a generator | for parent in tag.parents: print(parent) |
| children attribute | Get the direct children of the current tag, as a generator | for child in tag.children: print(child) |
| next_sibling attribute | Get the next sibling node of the current tag | next_sibling_tag = tag.next_sibling |
| previous_sibling attribute | Get the previous sibling node of the current tag | previous_sibling_tag = tag.previous_sibling |
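The most common of these can be tried on a one-line fragment (the markup below is just illustrative):

```python
from bs4 import BeautifulSoup

html = '<li class="class01"><span index="1">H1</span></li>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.li

print(tag.name)              # li
print(tag.attrs)             # {'class': ['class01']}  (class is multi-valued)
print(tag.get("class"))      # ['class01']
print(tag.span.string)       # H1
print(tag.span.parent.name)  # li
```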

NavigableString object

A NavigableString object is the data type in the Beautiful Soup library that represents plain text content in an HTML or XML document. It inherits from Python's built-in string type, but adds navigation features that make it convenient for working with text in the document tree.

Suppose we have the following HTML code snippet:

<p>This is a <b>beautiful</b> day.</p>

Use Beautiful Soup to parse it into a document object:

from bs4 import BeautifulSoup

html = '<p>This is a <b>beautiful</b> day.</p>'
soup = BeautifulSoup(html, 'html.parser')

Get the text of the <p> tag; its first child is actually a NavigableString object:

p_tag = soup.find('p')
# p_tag.string would be None here because <p> has several children,
# so take the first child: the NavigableString "This is a "
content = p_tag.contents[0]
print(content)        # This is a
print(type(content))  # <class 'bs4.element.NavigableString'>

You can also perform some operations on NavigableString objects, such as obtaining text content, replacing text, and removing whitespace characters:

# Get the text content, stripping surrounding whitespace
text = content.strip()
print(text)  # This is a

# Replace the text node with another string
content.replace_with('Hello ')
print(p_tag)  # <p>Hello <b>beautiful</b> day.</p>

# Strip whitespace from the new text node
print(p_tag.contents[0].strip())  # Hello
| Attribute / method | Description | Example |
| --- | --- | --- |
| str() conversion | Get the plain text of a NavigableString object | text = str(navigable_string) |
| replace_with() method | Replace the current NavigableString object with another string or object | navigable_string.replace_with(new_string) |
| strip() method | Remove whitespace characters from both ends of the string | stripped_text = navigable_string.strip() |
| parent attribute | Get the parent node the NavigableString belongs to (usually a Tag object) | parent_tag = navigable_string.parent |
| next_sibling attribute | Get the next sibling node of the NavigableString object | next_sibling = navigable_string.next_sibling |
| previous_sibling attribute | Get the previous sibling node of the NavigableString object | previous_sibling = navigable_string.previous_sibling |

BeautifulSoup object

The BeautifulSoup object is the core object of the Beautiful Soup library and is used to parse and traverse HTML or XML documents.

Commonly used methods:

find(name, attrs, recursive, string, **kwargs): find the first tag matching the given tag name, attributes, text content, etc.

find_all(name, attrs, recursive, string, limit, **kwargs): find all matching tags and return them as a list

select(css_selector): find matching tags using CSS selector syntax and return them as a list

prettify(): return the whole document as a nicely indented string of tags and text

has_attr(name): check whether a tag has the given attribute; returns a boolean

get_text(): return the text content of the current tag and all of its descendants as a single string

Common properties:

soup.title: the first <title> tag in the document

soup.head: the <head> tag of the document

soup.body: the <body> tag of the document

soup.find_all('tag'): all matching <tag> tags in the document, as a list

soup.text: the plain text content of the whole document, with tags stripped

Comment object

The Comment object is a special object type in the Beautiful Soup library that represents comment content in an HTML or XML document.

When parsing an HTML or XML document, Beautiful Soup represents comment content as Comment objects. A comment is a special element used to add notes or explanations, or to temporarily disable part of the content. Comment objects are recognized and handled automatically by Beautiful Soup's parser.

An example of accessing comment content in an HTML document:

from bs4 import BeautifulSoup

# An HTML string that contains a comment
html = "<html><body><!-- This is a comment --> <p>Hello, World!</p></body></html>"

# Parse the HTML document
soup = BeautifulSoup(html, 'html.parser')

# The comment is the first child of <body>; check its type with type()
comment = soup.body.contents[0]
print(type(comment))  # <class 'bs4.element.Comment'>

# Comment subclasses NavigableString, so it behaves like a string
print(comment.strip())  # This is a comment

Search document tree

Beautiful Soup provides a variety of ways to find and locate elements in HTML documents.

Method selectors

Use Beautiful Soup's find() or find_all() methods to select elements by tag name, by attribute name and value, by text content, or by combinations of these.

The difference between the two:

find returns the first element that matches the conditions

find_all returns a list of all matching elements

For example, soup.find('div') returns the first div tag element, and soup.find_all('a') returns all a tag elements.

1. Find elements by tag or tag list

# Find the first div element
soup.find('div')

# Find all a elements
soup.find_all('a')

# Find all li elements
soup.find_all('li')

# Find all a and b elements
soup.find_all(['a','b'])

2. Find elements through regular expressions

# Find tags whose names start with "sp" (matches span here)
import re
print(soup.find_all(re.compile("^sp")))

3. Find elements by attributes

# Filter on an attribute name and value
find = soup.find_all(
    attrs={
        "attribute_name": "value"
    }
)
print(find)

# Find the first a element that has an href attribute
soup.find('a', href=True)

# Find all div elements that have a class attribute
soup.find_all('div', class_=True)

4. Find elements by text content

# Find the first string whose entire content is "Hello"
soup.find(text='Hello')

# Find all strings whose entire content is "World"
soup.find_all(text="World")
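Note that this filter matches a tag's complete string, not substrings; a regular expression is needed for partial matches. A small sketch (string= is the newer name for the text= parameter):

```python
import re
from bs4 import BeautifulSoup

html = '<p>Hello</p><p>World</p><p>Hello World</p>'
soup = BeautifulSoup(html, "html.parser")

# string= matches a tag's complete string only
exact = soup.find_all(string="Hello")
print(exact)  # ['Hello']

# a compiled regular expression matches substrings instead
partial = soup.find_all(string=re.compile("World"))
print(partial)  # ['World', 'Hello World']
```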

5. Find elements through keyword parameters

soup.find_all(id='id01')

6. Combining filters

soup.find_all(
    'tag_name',
    attrs={
        "attribute_name": "value"
    },
    text="content"
)
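A concrete version of this combined filter, run against a fragment of the earlier test.html markup:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="class01"><span index="1">H1</span></li>
  <li class="class02"><span index="2" class="span2">H2</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name + attribute filter + string filter in one call
result = soup.find_all("span", attrs={"index": "2"}, string="H2")
print(result[0])  # the second <span>
```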

CSS selector

Use Beautiful Soup's select() method to find elements through CSS selectors. CSS selectors are a powerful and flexible way to target elements based on tag names, class names, IDs, attributes, and combinations thereof.

1. Class selector

Find elements by class name: prefix the class name with a . symbol

soup.select('.className')

2. ID selector

Find elements by ID: prefix the ID with a # symbol

soup.select('#id')

3. Tag selector

Find elements by tag name: use the tag name directly

soup.select('p')

4. Attribute selector

Find elements by attribute: the [attribute="value"] format selects elements with a specific attribute and value

soup.select('[attribute="value"]')

soup.select('[href="example.com"]')

soup.select('a[href="http://baidu.com"]')

5. Combination selector

Multiple selectors can be combined for more precise searches

# Return all a elements inside div elements with class class01
soup.select('div.class01 a')

soup.select('div a')

Associated selection

In the process of element search, sometimes you cannot get the desired node element in one step. You need to select a certain node element, and then use this node as the basis to select its child nodes, parent nodes, sibling nodes, etc.

In Beautiful Soup, you can perform associated selection by using CSS selector syntax. You need to use the select() method of Beautiful Soup, which allows you to use CSS selectors to select elements.

1. Descendant selector (space): You can select all descendant elements under the specified element.

div a               /* all a elements inside a div */

.container p        /* all p elements inside an element with class container */

2. Direct descendant selector (>): You can select the direct descendant elements of the specified element.

div > a             /* a elements that are direct children of a div */

.container > p      /* p elements that are direct children of an element with class container */

3. Adjacent sibling selector (+): You can select the next sibling element immediately adjacent to the specified element.

h1 + p              /* the sibling p element immediately after an h1 */

.container + p      /* the sibling p element immediately after an element with class container */

4. Universal sibling selector (~): can select all subsequent elements at the same level as the specified element.

h1 ~ p              /* all sibling p elements after an h1 */

.container ~ p      /* all sibling p elements after an element with class container */

5. Mixed selectors: You can combine multiple selectors of different types to select specific elements

div, p         /* all div elements and all p elements */

.cls1.cls2     /* elements whose class list contains both cls1 and cls2 */
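The four combinators can be compared side by side on a small invented fragment:

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
  <h1>Title</h1>
  <p>first</p>
  <span><p>nested</p></span>
  <p>second</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("div p")))    # 3: descendant selector sees the nested <p> too
print(len(soup.select("div > p")))  # 2: direct children only
print([p.get_text() for p in soup.select("h1 + p")])  # ['first']
print([p.get_text() for p in soup.select("h1 ~ p")])  # ['first', 'second']
```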

Traverse the document tree

In Beautiful Soup, traversing the document tree is a common operation for accessing and processing individual nodes of an HTML or XML document

A tag's child and parent nodes

contents: Get all child nodes of Tag and return a list

print(tag.contents)
# Use a list index to get a single element
print(tag.contents[1])

children: Get all child nodes of Tag and return a generator

for child in tag.children:
    print(child)

The .parent attribute gets the parent node of a tag

parent_tag = tag.parent

A tag's sibling nodes

You can use the .next_sibling and .previous_sibling attributes to get the next or previous sibling node of a tag.

next_sibling = tag.next_sibling
previous_sibling = tag.previous_sibling

Recursively traverse the document tree

You can use the .find() and .find_all() methods to recursively search for matching tags in the document tree.

You can use the .descendants generator iterator to traverse all descendant nodes of the document tree.

for tag in soup.find_all('a'):
    print(tag)

for descendant in tag.descendants:
    print(descendant)

Traverse tag attributes

You can use the .attrs attribute to get all attributes of a tag and iterate over them.

for attr in tag.attrs:
    print(attr)
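A compact sketch combining the traversal attributes above (the markup is invented):

```python
from bs4 import BeautifulSoup

html = '<div id="box"><ul><li>A</li><li>B</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .children yields direct children; .descendants walks the whole subtree
print([child.name for child in div.children])       # ['ul']
print([d.name for d in div.descendants if d.name])  # ['ul', 'li', 'li']

# .attrs is a dict, so iterating it yields attribute names
for attr in div.attrs:
    print(attr, div.attrs[attr])
```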

Origin blog.csdn.net/qq_38628046/article/details/129028890