A Practical Tutorial on Scraping a Web Novel

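I have been learning web scraping for a while and wanted a concrete project to practice on, so I picked a web novel to scrape. This post records the process and, I hope, gives others something practical to work from.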

Novel site: 新笔趣阁 (Xin Biquge)
URL: https://www.xsbiquge.com/

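We will scrape from Xin Biquge, whose address is given above. As for the novel, I have been following 修罗武神 (Martial God Asura), so that is the one I chose. Search for the novel on the site and open its page (https://www.xsbiquge.com/0_638/).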

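The page lists the novel's table of contents. To scrape the whole novel, we can first work out how to download a single chapter and then repeat the same steps for every other chapter. Open the first chapter. I use Chrome, where right-clicking and choosing Inspect, or pressing F12, opens the developer tools and shows the page's HTML (other browsers may differ).
Scraping a novel really comes down to fetching a page's HTML by its URL and extracting the parts we need, so we can use the third-party requests library to fetch the page: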

import requests

if __name__=='__main__':
    # URL of the first chapter
    target = 'https://www.xsbiquge.com/0_638/1124120.html'
    # Send a GET request and receive the server's response
    response = requests.get(target)
    # Decode the response body as UTF-8
    response.encoding = 'utf-8'
    # The decoded response body, as a string
    html = response.text
    print(html)

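Running it prints the page's full HTML source.
The HTML has been fetched successfully, but not all of it is of interest: we only need to pull the chapter body out from among the many HTML tags. For that we can use the Beautiful Soup library.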


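Before that, inspect the page elements. The chapter body sits inside the <div id="content"> tag, where div is the tag name, id is an attribute of the div tag, and content is that attribute's value; together they distinguish this tag from the others. We can use Beautiful Soup to extract its contents: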

import requests
from bs4 import BeautifulSoup

if __name__=='__main__':
    target = 'https://www.xsbiquge.com/0_638/1124120.html'
    response = requests.get(target)
    response.encoding = 'utf-8'
    html = response.text
    bf = BeautifulSoup(html,"lxml")
    # find() returns the first tag that matches <div id="content">
    texts = bf.find("div",id = "content")
    print(texts)

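Running it prints the <div id="content"> element.
The chapter body is now extracted, but it still contains markup such as the div and br tags, so we change the last line to print only the text: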

import requests
from bs4 import BeautifulSoup

if __name__=='__main__':
    target = 'https://www.xsbiquge.com/0_638/1124120.html'
    response = requests.get(target)
    response.encoding = 'utf-8'
    html = response.text
    bf = BeautifulSoup(html,"lxml")
    texts = bf.find("div",id = "content")
    print(texts.text)

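texts.text returns all the text inside the element with the tags stripped out, so the chapter body is now extracted successfully. To scrape the whole novel we also need the link to every chapter's page, so go back to the table-of-contents page and look at its HTML in the same way.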
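A side note on the extracted text: br tags contribute nothing to .text, so paragraphs can end up run together. The sketch below is one possible way to split them again. It assumes, without the article confirming it, that each paragraph is indented with a run of non-breaking spaces (\xa0, from &nbsp;) or separated by line breaks. The function name split_paragraphs and the sample string are made up for illustration.

```python
import re


def split_paragraphs(raw):
    """Split extracted chapter text into a list of paragraphs.

    Assumption: a run of non-breaking spaces or line breaks marks
    the boundary between two paragraphs.
    """
    parts = re.split(r"[\xa0\r\n]+", raw)
    # Trim each piece and drop the empty ones left at the edges.
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample that imitates text taken from a chapter page.
sample = "\xa0\xa0\xa0\xa0First paragraph.\xa0\xa0\xa0\xa0Second paragraph.\n"
for paragraph in split_paragraphs(sample):
    print(paragraph)
```

If the real page separates paragraphs differently, only the regular expression needs to change.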

Inspecting the elements shows that all the chapter entries are a tags inside <div id='list'>, so we write the following code:

import requests
from bs4 import BeautifulSoup

if __name__=='__main__':
    target = 'https://www.xsbiquge.com/0_638/'
    response = requests.get(target)
    response.encoding = 'utf-8'
    html = response.text
    bf = BeautifulSoup(html,"lxml")
    texts = bf.find("div",id = "list")
    # Collect every a tag inside the list div
    texts = texts.find_all("a")
    print(texts)

The output shows that the table of contents has been collected into a list of a tags, from which we can extract the information we need:

import requests
from bs4 import BeautifulSoup

if __name__=='__main__':
    target = 'https://www.xsbiquge.com/0_638/'
    response = requests.get(target)
    response.encoding = 'utf-8'
    html = response.text
    bf = BeautifulSoup(html,"lxml")
    texts = bf.find("div",id = "list")
    texts = texts.find_all("a")
    for i in texts:
        print(i.string)
        print(i.get("href"))


i.string gives the text inside the tag, and i.get("href") gives the value of the a tag's href attribute, so we now have the link to every chapter.
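The href values are relative links, so they have to be combined with the site address before they can be requested. One hedged way to do this is the standard library's urljoin, which handles both root-relative and page-relative links. In the sketch below the first href is derived from the chapter URL used earlier, the second is a hypothetical value, and the helper name absolute_urls is my own.

```python
from urllib.parse import urljoin

# Table-of-contents page used in the article.
BASE_URL = "https://www.xsbiquge.com/0_638/"


def absolute_urls(base_url, hrefs):
    """Resolve each href against the page it was found on."""
    return [urljoin(base_url, href) for href in hrefs]


# The second entry is a hypothetical page-relative link.
hrefs = ["/0_638/1124120.html", "1124121.html"]
for url in absolute_urls(BASE_URL, hrefs):
    print(url)
```

Plain string concatenation with the site root also works as long as every href starts with a slash.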

Putting it all together, the complete code is:

import os
import requests
from bs4 import BeautifulSoup

def content(url):
    # Download one chapter page and return its body text
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = response.text
    bf = BeautifulSoup(html,"lxml")
    texts = bf.find("div",id = "content")
    return texts.text


if __name__=='__main__':
    target = 'https://www.xsbiquge.com/0_638/'
    response = requests.get(target)
    response.encoding = 'utf-8'
    html = response.text
    bf = BeautifulSoup(html,"lxml")
    texts = bf.find("div",id = "list")
    texts = texts.find_all("a")
    # Chapter titles and their relative links, in page order
    names = []
    urls = []
    for i in texts:
        names.append(i.string)
        urls.append(i.get("href"))
    # Make sure the output folder exists before writing into it
    os.makedirs('修罗武神', exist_ok=True)
    for i in range(len(names)):
        url = 'https://www.xsbiquge.com'+urls[i]
        word = content(url)
        # The chapter title is used as the file name as-is;
        # "w" overwrites the file so that re-running does not duplicate text
        with open('修罗武神/'+str(names[i])+'.txt',"w",encoding='utf-8') as f:
            f.write(word)
    print("Download complete")
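A novel like this has a great many chapters, and in the loop above a single failed request stops the whole run. As a rough sketch (not part of the original code), the download loop could be wrapped so that it records failures and pauses briefly between requests. The names download_all and fake_fetch and the 0.5 second pause are assumptions made for illustration; in practice the content function from the complete code would be passed as fetch.

```python
import time


def download_all(urls, fetch, pause=0.5):
    """Call fetch(url) for every URL in order.

    Returns (results, failures): results maps url -> text, and
    failures lists (url, error message) for the URLs that raised.
    """
    results = {}
    failures = []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:
            # Remember the problem and carry on with the next chapter.
            failures.append((url, str(exc)))
        # Short pause so the site is not hit with back-to-back requests.
        time.sleep(pause)
    return results, failures


def fake_fetch(url):
    """Stand-in for a real downloader; fails on one URL on purpose."""
    if url.endswith("2.html"):
        raise ValueError("simulated failure")
    return "text of " + url


results, failures = download_all(
    ["1.html", "2.html", "3.html"], fake_fetch, pause=0
)
print(len(results), "downloaded,", len(failures), "failed")
```

The failures list can then be retried later without downloading everything again.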

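Running the complete code above raises an error at chapter 41.
A quick search shows that Windows does not allow the following characters in file names: \ / : * ? " < > |. The code therefore needs a small change:
define a variable holding these characters, pair each one with a replacement character such as a space, and convert every chapter title with that mapping before using it as a file name.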
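The fix described above can be sketched with str.maketrans and str.translate. This is one possible realization rather than the exact code used for this post: the helper name safe_filename, the choice of a space as the replacement, and the sample title are all assumptions.

```python
# Characters that Windows rejects in file names: \ / : * ? " < > |
FORBIDDEN = '\\/:*?"<>|'

# Map every forbidden character to a space (an arbitrary replacement).
TABLE = str.maketrans(FORBIDDEN, " " * len(FORBIDDEN))


def safe_filename(title):
    """Return the title with forbidden characters replaced by spaces."""
    return title.translate(TABLE).strip()


# Hypothetical chapter title containing two forbidden characters.
print(safe_filename("Chapter 41: Who are you?"))
```

In the complete code, the call would go around names[i] where the file path is built, for example safe_filename(str(names[i])).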

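With this change the novel can be scraped normally, although the downloaded chapters appear somewhat out of order, which may be down to the novel itself.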
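If the disorder comes from the file manager sorting the files by name rather than from the site (an assumption; the cause was not investigated here), one possible remedy is to prefix each file name with a zero-padded chapter index. The helper numbered_filename and the sample titles below are hypothetical.

```python
def numbered_filename(index, title, total):
    """Build a file name that sorts in chapter order.

    The index is zero-padded to the number of digits in total,
    so that, for example, chapter 7 of 1500 becomes 0007.
    """
    width = len(str(total))
    return f"{index:0{width}d}_{title}.txt"


# Hypothetical chapter titles, in the order they appear on the page.
titles = ["Chapter One", "Chapter Two", "Chapter Ten"]
for number, title in enumerate(titles, start=1):
    print(numbered_filename(number, title, len(titles)))
```

In the complete code, i + 1 and len(names) would serve as the index and the total.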
I hope this post is helpful.
