Python Made Easy: BeautifulSoup, a Power Tool for Web Scraping

Copyright notice: This is an original article by the author; reproduction without permission is prohibited. https://blog.csdn.net/xingxtao/article/details/78994889

Result screenshots of the crawl example at the end of this article (scraping mzitu images):

[screenshots: crawl results]

BeautifulSoup is a Python library for extracting data from HTML or XML files. Simply put, it parses an HTML document into a tree structure, making it easy to look up a given tag and read its attributes.

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Installing BeautifulSoup

Installing in PyCharm: File -> Default Settings -> Project Interpreter

[screenshot: PyCharm Project Interpreter settings]
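If you are not using PyCharm, the package can also be installed from the command line with pip install beautifulsoup4.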

Getting Started

# import BeautifulSoup
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>

<p class="story">the story is beautiful</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())                  # print the parse tree with standard indentation

all_content = soup.get_text()           # all visible text in the document, i.e. the text with tags stripped
title = soup.title                      # the document's <title> tag
title_name = soup.title.name            # the tag name of the title
title_text = soup.title.string          # the text inside the title
title_header = soup.title.parent.name   # the name of the <title> tag's parent
p_all = soup.find_all('p')              # all <p> tags in the document

a_links = soup.find_all('a')            # all <a> tags in the document

print('all_content = %s' % all_content)
print('title = %s' % title)
print('title_name = %s' % title_name)
print('title_text = %s' % title_text)
print('title_header = %s' % title_header)
for link in a_links:
    print('a = %s ' % link)

for p in p_all:
    print('type(p) = %s ' % type(p))                     # each p is a bs4.element.Tag
    print('p.name = %s ' % p.name)                       # the tag's name
    print('p[class] = %s' % p['class'])                  # the value of the tag's class attribute

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   the story is beautiful
  </p>
 </body>
</html>
all_content = 
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie and
Tillie
and they lived at the bottom of a well.
the story is beautiful

title = <title>The Dormouse's story</title>
title_name = title
title_text = The Dormouse's story
title_header = head
a = <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> 
a = <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
a = <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> 
type(p) = <class 'bs4.element.Tag'> 
p.name = p 
p[class] = ['title']
type(p) = <class 'bs4.element.Tag'> 
p.name = p 
p[class] = ['story']
type(p) = <class 'bs4.element.Tag'> 
p.name = p 
p[class] = ['story']

BeautifulSoup Parse Objects

BeautifulSoup transforms a complex HTML document into a complex tree of Python objects, which all fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

(1) Tag

Put simply, a Tag is just one of the tags in an HTML document.

<title>The HTML5 Document</title>
<a class="red_link" href="http://www.baidu.com" id="link"></a>

In the example above, the title and a tags, together with the content they enclose, are Tags.

html_tag = '<b class="boldest">bold text</b>'
soup = BeautifulSoup(html_tag, 'html.parser')
tag = soup.b
# the type of a Tag object
type_tag = type(tag)
# every Tag has a name, available via .name
tag_name = tag.name
# a tag may have many attributes, which are accessed the same way as a dict
# for example, the value of the class attribute:
tag_class = tag['class']
print(type_tag)                           # Result: <class 'bs4.element.Tag'>
print('type_name = %s' % tag_name)        # Result: type_name = b
print('type["class"] = %s ' % tag_class)  # Result: type["class"] = ['boldest']

Tag attributes can be added, removed, or modified:

tag['class'] = 'normal'                # modify the class attribute's value
tag['id'] = 'id_bold_text'             # add an id attribute
print(soup.prettify())                 # pretty-printed output

del tag['class']                       # delete the class attribute
del tag['id']                          # delete the id attribute
print(soup.prettify())                 # pretty-printed output

Output:

<b class="normal" id="id_bold_text">
 bold text
</b>

<b>
 bold text
</b>

Other Tag operations:

html_str = '<head><title>The document Story</title></head><body><b>text body</b><b>text color</b></body>'
soup = BeautifulSoup(html_str, 'html.parser')
print(soup.prettify())
print('head = %s ' % soup.head)
print('title = %s' % soup.title)
print('body = %s ' % soup.body)
# dot access only returns the first tag with that name; see the find_all sketch after the output below
print('body.b = %s ' % soup.body.b)

Output:

<head>
 <title>
  The document Story
 </title>
</head>
<body>
 <b>
  text body
 </b>
 <b>
  text color
 </b>
</body>
head = <head><title>The document Story</title></head> 
title = <title>The document Story</title>
body = <body><b>text body</b><b>text color</b></body> 
body.b = <b>text body</b> 
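As the code comment above notes, dot access stops at the first match. To collect every tag of a given name, use find_all instead; here is a small standalone sketch using the same html_str fragment:

from bs4 import BeautifulSoup

html_str = '<body><b>text body</b><b>text color</b></body>'
soup = BeautifulSoup(html_str, 'html.parser')
print(soup.find_all('b'))    # [<b>text body</b>, <b>text color</b>]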
Tag navigation attributes:

.contents: the tag's direct children, as a list
.children: a generator over the tag's direct children, for looping over them
.descendants: recursively iterates over all of the tag's descendants
.string: the tag's single contained string (None when the tag holds more than one)
.strings: iterates over all the strings a tag contains, when there are several
.parent: the element's parent node
.parents: iterates over all of the element's ancestors
.next_sibling / .previous_sibling: the adjacent sibling nodes
.next_siblings / .previous_siblings: iterate over the current node's siblings
.next_element / .previous_element: the next/previous parsed object (a string or tag) in document order
.next_elements / .previous_elements: iterators that walk the parsed document forwards or backwards
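As a quick illustration of a few of these navigation attributes, here is a minimal sketch (the HTML snippet is made up for this example):

from bs4 import BeautifulSoup

html_doc = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, 'html.parser')
p = soup.p

print(p.contents)           # the direct children as a list: a string and two <a> tags
for child in p.children:    # the same direct children, as a generator
    print(child)

link1 = soup.find(id='link1')
print(link1.parent.name)    # p
print(link1.next_sibling)   # the string ' and '

for s in p.strings:         # every string contained in <p>
    print(repr(s))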

(2) NavigableString

With .string it is easy to get at the text inside a tag.

html_string = '<b class="boldest">bold text</b>'
soup = BeautifulSoup(html_string, 'html.parser')
tag_b = soup.b
tag_string = tag_b.string       # Result: bold text
print('type(tag.string) = %s ' % type(tag_string))     # Result: type(tag.string) = <class 'bs4.element.NavigableString'>
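One caveat not shown above: .string only works when a tag contains a single string; if the tag has several children, it returns None. A quick sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>one</b><b>two</b></p>', 'html.parser')
print(soup.b.string)        # one  (the <b> tag has a single string child)
print(soup.p.string)        # None (the <p> tag has more than one child)
text = str(soup.b.string)   # a NavigableString converts to a plain str when needed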

(3) BeautifulSoup

A BeautifulSoup object represents a document's entire content. Most of the time you can treat it as a Tag object: it is a special Tag whose type, name, and attributes can each be retrieved.

from bs4 import BeautifulSoup

html_doc = '<head><title>Document title</title></head>'
soup = BeautifulSoup(html_doc, 'html.parser')
name = soup.name
attrs = soup.attrs
print(type(soup))     # <class 'bs4.BeautifulSoup'>
print(type(name))     # <class 'str'>
print(type(attrs))    # <class 'dict'>
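print(name)           # prints '[document]', the special name the parser gives the whole document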

(4) Comment

A Comment object is a special kind of NavigableString. As with a NavigableString, the printed output does not include the comment markers; in other words, it is not treated as a comment when output.

html_href = '<a href="http://www.baidu.com"><!-- a href to baidu--></a>'
soup = BeautifulSoup(html_href, 'html.parser')
a = soup.a
string = soup.a.string
string_type = type(string)    # renamed so the built-in type() is not shadowed
print(a)              # Result: <a href="http://www.baidu.com"><!-- a href to baidu--></a>
print(string)         # Result: a href to baidu
print(string_type)    # Result: <class 'bs4.element.Comment'>

In real code you can branch on this type and handle comments specially:

from bs4.element import Comment

if isinstance(string, Comment):
    pass    # handle the comment node here
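Going one step further than the original example, a common pattern for extracting every comment in a document is to filter strings by type with find_all (a sketch, using made-up HTML):

from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<div><!-- header --><p>text</p><!-- footer --></div>'
soup = BeautifulSoup(html, 'html.parser')
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)    # [' header ', ' footer ']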

Crawl Example: Scraping mzitu Images

import requests
from bs4 import BeautifulSoup
import os

hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://m.mzitu.com/'
}

picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}

all_url = 'http://m.mzitu.com/all/'
response = requests.get(all_url, headers=hostreferer)
html_doc = response.text
# print(html_doc)
soup = BeautifulSoup(html_doc, "html.parser")
div_list = soup.find_all('div', class_='archive-brick')
# print(div_list)
root_path = r'C:\mzitu'  # root folder for the downloaded images (raw string, so the backslash is not an escape)

for div in div_list:
    # print(div)
    a_link = div.find('a')
    # print(a_link)
    href = a_link['href']
    title = a_link.get_text()
    print(title, href)
    # each gallery contains several images; create a folder named after the gallery title
    folder_name = str(title).strip().replace(':', '').replace(' ', '').replace('?', '')  # folder name, with spaces and punctuation removed
    # os.path.join(path, name): joins a directory with a file or directory name, giving path/name
    path = os.path.join(root_path, folder_name)
    abspath = os.path.abspath(path)  # absolute path of the folder
    # print(abspath)
    os.makedirs(path, exist_ok=True)  # create the gallery folder (exist_ok avoids an error on re-runs)
    os.chdir(path)  # switch into the newly created folder

    response_detail = requests.get(href, headers=hostreferer)
    html_detail = response_detail.text  # HTML of the gallery's detail page, which contains the image URLs
    # print(html_detail)
    detail_soup = BeautifulSoup(html_detail, 'html.parser')
    # get the maximum page number; each detail-page URL is built by appending the page number
    # (the [-3:-1] slice assumes the pager text ends with a two-digit page count)
    max_page = detail_soup.find('div', class_='prev-next').find('span', class_='prev-next-page').get_text()[-3:-1]
    print(max_page)
    # build each page's URL from its page number
    for page in range(1, int(max_page) + 1):  # integers from 1 to max_page inclusive
        page_url = href + '/' + str(page)  # the URL of this page of the gallery
        page_response = requests.get(page_url, headers=hostreferer)
        page_html = page_response.text
        page_soup = BeautifulSoup(page_html, 'html.parser')
        figure = page_soup.find('figure')
        # print(figure)
        img_src = figure.find('img')['src']
        print(img_src)  # the URL of a single image
        img_result = requests.get(img_src, headers=picreferer)
        f = open(img_src[-9:-4] + '.jpg', 'ab')
        f.write(img_result.content)
        f.close()

Refactoring the code into functions:

# encoding:utf-8
import requests
from requests import HTTPError
from bs4 import BeautifulSoup
import os

all_url = 'http://m.mzitu.com/all/'

root_path = r'C:\mzitu'  # root folder for the downloaded images (raw string, so the backslash is not an escape)

hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://m.mzitu.com/'
}

picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}


def get_html_text(url, headers):
    '''
    Fetch the HTML source of a URL with requests.
    :param url: the URL to request
    :param headers: the HTTP headers
    :return: the HTML text of the response
    '''
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        response.encoding = 'utf-8'
        html_text = response.text
        return html_text
    except HTTPError as e:
        print(e)
        return 'request failed'


def makedir(title):
    '''
    Create a folder for a gallery and switch into it,
    since each gallery's images are saved in a folder of their own.
    :param title: the gallery title, used as the folder name
    :return:
    '''
    # each gallery contains several images; create a folder named after the gallery title
    folder_name = str(title).strip().replace(':', '').replace(' ', '').replace('?', '')  # folder name, with spaces and punctuation removed
    # os.path.join(path, name): joins a directory with a file or directory name, giving path/name
    path = os.path.join(root_path, folder_name)
    abspath = os.path.abspath(path)  # absolute path of the folder
    # print(abspath)
    os.makedirs(path, exist_ok=True)  # create the gallery folder (exist_ok avoids an error on re-runs)
    os.chdir(path)  # switch into the newly created folder


def get_img_url(page_html):
    '''
    Extract the image URL from one page of a gallery.
    :param page_html: the HTML of the gallery page
    :return: the image URL
    '''
    page_soup = BeautifulSoup(page_html, 'html.parser')
    figure = page_soup.find('figure')
    # print(figure)
    img_src = figure.find('img')['src']
    print(img_src)  # the URL of a single image
    return img_src


def save_img(img_src):
    '''
    Download an image given its URL.
    :param img_src: the image URL
    :return:
    '''
    img_result = requests.get(img_src, headers=picreferer)
    f = open(img_src[-9:-4] + '.jpg', 'ab')
    f.write(img_result.content)
    f.close()


def get_max_page(page_detail_text):
    '''
    Get the maximum page number of a gallery's detail pages.
    :param page_detail_text: the HTML of the gallery's first detail page
    :return: the maximum page number
    '''
    detail_soup = BeautifulSoup(page_detail_text, 'html.parser')
    # get the maximum page number; each detail-page URL is built by appending the page number
    # (the [-3:-1] slice assumes the pager text ends with a two-digit page count)
    max_page = detail_soup.find('div', class_='prev-next').find('span', class_='prev-next-page').get_text()[-3:-1]
    print(max_page)
    return max_page


def save_to_disk(html_doc):
    soup = BeautifulSoup(html_doc, "html.parser")
    div_list = soup.find_all('div', class_='archive-brick')
    # print(div_list)

    for div in div_list:
        # print(div)
        a_link = div.find('a')
        # print(a_link)
        href = a_link['href']
        title = a_link.get_text()
        print(title, href)
        makedir(title)

        # HTML of the gallery's detail page, which contains the image URLs
        page_detail_text = get_html_text(href, headers=hostreferer)
        # print(page_detail_text)
        max_page = get_max_page(page_detail_text)  # maximum page number of this gallery

        # build each page's URL from its page number
        for page in range(1, int(max_page) + 1):  # integers from 1 to max_page inclusive
            page_url = href + '/' + str(page)  # the URL of this page of the gallery
            page_html = get_html_text(page_url, headers=hostreferer)
            img_url = get_img_url(page_html)  # the image URL on this page
            save_img(img_url)


if __name__ == '__main__':
    html_doc = get_html_text(all_url, headers=hostreferer)
    save_to_disk(html_doc)

Crawl results: see the screenshots at the beginning of this article.
