Python Web Crawler Notes 3: The Beautiful Soup Library

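This Python web crawler note series records what the author learned from Song Tian's course "Python Web Crawling and Information Extraction", together with the author's own crawling practice.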

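Course link: Python Web Crawling and Information Extraction
Reference documentation:
Requests official documentation (English)
Requests official documentation (Chinese)
Beautiful Soup official documentation
re official documentation
Scrapy official documentation (English)
Scrapy official documentation (Chinese)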

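I. Installing the Beautiful Soup Library

Introduction: The Beautiful Soup library parses HTML documents, such as the pages fetched with the requests library, so that their content can be navigated and extracted.

Install with pip: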

pip install beautifulsoup4

Import the module:

from bs4 import BeautifulSoup

Test:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
# Print the HTML as an indented tree; Tag elements also support prettify()
print(soup.prettify())
print(soup.a.prettify())

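II. Beautiful Soup Basics

1. The tag tree

Tag: an opening <...> and a closing </...> together with the content between them

# Tag example
<p class="title"> ... </p>

Basic element	Description
Tag	A tag, the most basic unit for organizing information; its start and end are marked with <> and </>
Name	The tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
Comment	The comment portion of a string inside a tag; Comment is a special subclass of NavigableString. Format: <!-- xxxx -->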

import requests
from bs4 import BeautifulSoup

def tag_element():
    """
    Inspect a BeautifulSoup Tag and its name, attributes, and strings
    """
    html = requests.get('https://python123.io/ws/demo.html')
    soup = BeautifulSoup(html.text, 'html.parser')
    # Tag
    tag = soup.p
    print(type(tag))
    print(tag)
    # Name
    name = tag.name
    print(type(name))
    print(name)
    # Attributes
    attrs = tag.attrs
    print(type(attrs))
    print(attrs)
    # NavigableString
    content = tag.string
    print(type(content))
    print(content)
    # Comment
    soup1 = BeautifulSoup('<b><!--This is a comment--></b>', 'html.parser')
    comment = soup1.b.string
    print(type(comment))
    print(comment)

if __name__ == '__main__':
    print('running bs:')
    tag_element()
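
Because tag.string returns a Comment object when a tag contains only a comment, code that collects text can pick up comments by accident. The following is a minimal offline sketch of telling the two apart by type. The inline HTML and the helper name is_real_text are assumptions made for illustration and are not part of the course material.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical inline HTML, used so the example runs without a network request
html = '<p class="title"><b>Hello</b></p><p><!--hidden note--></p>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p                   # the first <p> tag in the document
tag_name = first_p.name            # name of the tag
tag_attrs = first_p.attrs          # attributes as a dictionary
text = first_p.string              # the single string nested inside <p><b>

second_p = soup.find_all('p')[1]   # the <p> that holds only a comment
maybe_comment = second_p.string    # a Comment, not ordinary text


def is_real_text(node):
    """Return True for ordinary text, False for comments and non-strings."""
    return isinstance(node, NavigableString) and not isinstance(node, Comment)


print(tag_name, tag_attrs, text)
print(type(maybe_comment).__name__, is_real_text(maybe_comment))
```

Checking the type, rather than the printed value, matters here because a Comment prints as plain text without the <!-- --> markers.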

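Tag tree: tags nested at different levels form a tree structure, which makes up an HTML page

# Tag tree example
<HTML>
    <head>
        <title>The page title, shown as the name of the browser tab</title>
    </head>
    <body>
        The content displayed in the browser page
    </body>
</HTML>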

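2. Parsers

A parser turns the HTML text into the tag tree. Python's built-in 'html.parser' works without any extra installation; other parsers can be installed separately. In the table below, mk stands for the markup string.

Parser	Usage	Installation
Python's built-in HTML parser	BeautifulSoup(mk, 'html.parser')	None needed (part of the Python standard library)
lxml HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib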
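
Since the optional parsers may or may not be installed on a given machine, one possible approach is to try them in order and fall back to 'html.parser'. The sketch below assumes that preference order; the helper name make_soup is hypothetical. It relies on bs4 raising FeatureNotFound when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=('lxml', 'html5lib', 'html.parser')):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns the soup together with the name of the parser that was used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError('no usable parser found')


# Deliberately unclosed markup: each parser repairs it in its own way
soup, used = make_soup('<p>unclosed paragraph')
print(used)
print(soup.p.string)
```

Parsers differ in how they repair broken markup, so the resulting tree can vary between machines. Pinning a single parser is the safer choice when results must be reproducible.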

III. Traversing HTML Content

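1. Downward traversal

Attribute	Description
.contents	A list holding all direct child nodes
.children	An iterator over the direct child nodes; similar to .contents, meant for looping
.descendants	An iterator over all descendant nodes (children, grandchildren, and so on), meant for looping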

import requests
from bs4 import BeautifulSoup

def x_traversal():
    """
    Downward traversal
    """
    html = requests.get('https://python123.io/ws/demo.html')
    soup = BeautifulSoup(html.text, 'html.parser')
    # List of direct child nodes
    print(soup.body.contents)

    # Iterate over direct child nodes
    for child in soup.body.children:
        print(child)

    # Iterate over all descendant nodes
    for child in soup.body.descendants:
        print(child)

if __name__ == '__main__':
    print('running bs:')
    x_traversal()
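
To make the difference between the three attributes concrete, here is a small offline sketch on a hypothetical inline document (an assumption for illustration, written without whitespace between tags so that the node counts stay easy to follow). On real pages the newlines between tags also appear as string nodes among the children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline HTML, used so the example runs without a network request
html = '<body><p>one <b>bold</b></p><p>two</p></body>'
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

# .contents is a list, so it supports len() and indexing
direct = body.contents

# .children yields the same direct children lazily
child_names = [child.name for child in body.children]

# .descendants walks the whole subtree, strings included
all_nodes = list(body.descendants)

# Keep only Tag objects to skip the text nodes
tag_names = [node.name for node in all_nodes if isinstance(node, Tag)]

print(len(direct), child_names)
print(len(all_nodes), tag_names)
```

The isinstance filter is a common way to ignore text nodes when only the tags matter.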

2. Upward traversal

Attribute	Description
.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags, meant for looping

import requests
from bs4 import BeautifulSoup

def s_traversal():
    """
    Upward traversal
    """
    html = requests.get('https://python123.io/ws/demo.html')
    soup = BeautifulSoup(html.text, 'html.parser')
    # Parent node
    print(soup.title.parent)

    # Iterate over ancestors; the last one is the BeautifulSoup object
    # itself, whose name is '[document]'
    for parent in soup.a.parents:
        if parent is None:
            print(parent)
        else:
            print(parent.name)

if __name__ == '__main__':
    print('running bs:')
    s_traversal()
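
The ancestor chain is easier to see on a tiny document. This sketch uses a hypothetical inline page (an assumption for illustration) and collects the name of each ancestor of the <a> tag, from the nearest one outward.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used so the example runs without a network request
html = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# The nearest enclosing tag
direct_parent = soup.a.parent.name

# Every ancestor, ending with the BeautifulSoup object itself
ancestor_names = [parent.name for parent in soup.a.parents]

# The BeautifulSoup object is the root, so it has no parent
root_parent = soup.parent

print(direct_parent)
print(ancestor_names)
print(root_parent)
```

The final entry is the BeautifulSoup object, which reports its name as '[document]'. Its own .parent is None, which marks the top of the tree.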

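3. Sibling traversal

Note: sibling traversal only covers nodes that share the same parent. A sibling may be a string (NavigableString) rather than a tag.

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	An iterator over all following sibling nodes in HTML document order
.previous_siblings	An iterator over all preceding sibling nodes in HTML document order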

import requests
from bs4 import BeautifulSoup

def p_traversal():
    """
    Sibling traversal
    """
    html = requests.get('https://python123.io/ws/demo.html')
    soup = BeautifulSoup(html.text, 'html.parser')

    # Next sibling
    print(soup.a.next_sibling)

    # Previous sibling
    print(soup.a.previous_sibling)

    # Iterate over following siblings
    for sibling in soup.a.next_siblings:
        print(sibling)

    # Iterate over preceding siblings
    for sibling in soup.a.previous_siblings:
        print(sibling)

if __name__ == '__main__':
    print('running bs:')
    p_traversal()
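
A common surprise is that the sibling right after a tag is often a piece of text, not another tag. The sketch below, on a hypothetical inline paragraph (an assumption for illustration), shows two possible ways to reach the next tag sibling: filtering by type, or using find_next_sibling.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical inline HTML, used so the example runs without a network request
html = '<p><a id="first">A</a> and <a id="second">B</a></p>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

# The immediate next sibling is the text between the two links
nxt = first.next_sibling

# Option 1: keep only Tag objects among the following siblings
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

# Option 2: let Beautiful Soup search the following siblings by tag name
next_tag = first.find_next_sibling('a')

# The first child of <p> has nothing before it
prev = first.previous_sibling

print(repr(nxt))
print([s['id'] for s in tag_siblings], next_tag['id'])
print(prev)
```

When there is no sibling in the requested direction, the attribute is None, so it is worth checking before accessing .name or attributes on the result.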
