02_Introduction to and Use of the BeautifulSoup4 Module

1. Introduction to the BeautifulSoup4 module:

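What it is: a third-party Python library.
Purpose: extracts data from HTML or XML documents once you already have the page source.
Install command: pip install BeautifulSoup4
Installation note: besides the command above, you can also install it through PyCharm's graphical package installer.
Calling BeautifulSoup() on the page source parses the document and returns a BeautifulSoup object (in essence, a tree structure). This parsing step requires a parser.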
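
A minimal sketch of the parsing step, assuming Python's built-in `html.parser` and a made-up snippet (`sample_html` is a hypothetical name used only here). It shows that the returned object behaves like a tree in which each tag knows its parent:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# html.parser ships with Python's standard library, so no extra install is needed.
soup = BeautifulSoup(sample_html, "html.parser")

# The result is a tree: each tag can be reached by name and knows its parent.
title_tag = soup.title
print(type(soup).__name__)     # BeautifulSoup
print(title_tag.name)          # title
print(title_tag.parent.name)   # head
print(title_tag.string)        # Demo
```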

2. Example code:

html_str = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

# BeautifulSoup(page source, parser)
soup = BeautifulSoup(html_str, 'html.parser')
# Parsing converts the HTML source into a tree structure, which makes later lookups easier.

# print(soup, type(soup))

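# Methods and attributes for extracting content from the tree
# select: uses a CSS selector (tag, id, class, child, descendant,
# nth-of-type, etc.) to find every match in the tree and returns the matches in a list.

# select_one: uses a CSS selector (tag, id, class, child, descendant,
# nth-of-type, etc.) to find the first match in the tree; returns None if nothing matches.

# text: gets the text content inside a tag.
# attrs: a dict of the tag's attributes; look up a value by its attribute name.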

# Q1: extract the p tags
# Tag selector: write only the tag name; this matches every tag of that name in the whole HTML source.
p_list = soup.sele
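
The line above is cut off in the source. As a sketch of how the methods described in the comments might be used, the following assumes a shortened copy of the sample document; `doc`, `sisters`, and `link2` are illustrative names, not the author's:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, for illustration only.
doc = """
<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.
</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(doc, "html.parser")

# Tag selector: every <p> in the document, returned as a list.
p_list = soup.select("p")
print(len(p_list))                      # 3

# Class selector: every tag whose class is "sister".
sisters = soup.select(".sister")
print([a.text for a in sisters])        # ['Elsie', 'Lacie', 'Tillie']

# id selector with select_one: a single tag, not a list.
link2 = soup.select_one("#link2")
print(link2.text)                       # Lacie
print(link2.attrs["href"])              # http://example.com/lacie

# Child selector: <a> tags that are direct children of p.story.
print(len(soup.select("p.story > a")))  # 3

# Descendant selector: a <b> anywhere inside <body>.
print(soup.select_one("body b").text)   # The Dormouse's story

# nth-of-type: the second <a> among its sibling <a> tags.
print(soup.select_one("a:nth-of-type(2)").attrs["id"])  # link2

# select_one returns None when nothing matches.
print(soup.select_one("table"))         # None
```

Because `select` always returns a list, an empty result is `[]`, while `select_one` returns `None`; checking for `None` before reading `.text` or `.attrs` avoids an `AttributeError`.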
