Python 3.6.5 Web Scraping, Part 1: Scraping Biquge Novels (Index-Page Method)

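My earlier Biquge scrapers used each chapter's page to find the address of the next chapter. That approach has a drawback: if the chain breaks partway through, or the next chapter's URL cannot be found, the script raises an error. It works like a serial chain, and it is inefficient. After studying how Biquge lays out each novel, I found a pattern that allows a more efficient way to scrape a novel.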

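List Download Method
Step 1: Analyze how the novel is structured	Open the table-of-contents page of a novel on Biquge, for example 龙符. Every chapter there has a link, so we can scrape these links, put them in a list, and then download them one by one.
Step 2: Analyze how the chapters relate to each other	The page source shows that earlier chapters have smaller numbers in their addresses; for example, the relative address of chapter 3 is 3.html. (The mapping from chapter to address may not be consecutive, but it is always increasing.) This makes sense: pages written later are generally assigned larger address numbers, since an unordered scheme would easily produce duplicate addresses. So we can sort the address list obtained in step 1.
Step 3: Analyze the resulting chapter list	The list contains relative addresses, but we need absolute ones, so we build absolute addresses by prepending the index-page URL. The list also needs de-duplicating, because the last few chapters appear more than once. See Figure 1 below.
Step 4: Download the chapters in order and append them	The pages are requested the same way as before. One small improvement here: the download progress is printed. See Figure 2 below.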
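
Steps 2 and 3 can be tried offline, without touching the network. The snippet below is only a minimal sketch, not the article's exact code: the helper name build_chapter_urls and the sample hrefs are made up for illustration, and it assumes that chapter links on the index page have the form '<digits>.html' relative to that page.

```python
import re
from urllib.parse import urljoin

# Assumption: a chapter link is exactly "<digits>.html", relative to the index page.
CHAPTER_HREF = re.compile(r'(\d+)\.html')


def build_chapter_urls(base_url, hrefs):
    """Filter, de-duplicate, sort, and absolutize chapter links (hypothetical helper)."""
    chapters = {}
    for href in hrefs:
        match = CHAPTER_HREF.fullmatch(href)
        if match:
            # Using the chapter number as the key removes duplicates.
            chapters[int(match.group(1))] = href
    # Sort numerically so that 10.html comes after 3.html, not before it.
    return [urljoin(base_url, chapters[number]) for number in sorted(chapters)]


# Made-up hrefs, imitating what an index page might contain.
sample_hrefs = ['10.html', '3.html', '2.html', '10.html', '/book/1/', 'index.html']
chapter_urls = build_chapter_urls('http://www.biqiuge.com/book/1/', sample_hrefs)
for chapter_url in chapter_urls:
    print(chapter_url)
```

Sorting the integers rather than the href strings matters here: a plain string sort would place '10.html' before '2.html'.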

Figure 1:
(image not available)

Figure 2:

# coding: utf-8

import re
import os
from bs4 import BeautifulSoup
from urllib import request

# Index (table-of-contents) page of the novel to download.
# Other addresses tried earlier: 'http://www.biqiuge.com/book/4772/', 'https://www.qu.la/book/1/'
url = 'http://www.biqiuge.com/book/1/'

def getHtmlCode(url):
    # Download the page and parse it into a BeautifulSoup tree.
    page = request.urlopen(url)
    html = page.read()
    htmlTree = BeautifulSoup(html, 'html.parser')
    return htmlTree

def parserCaption(url):
    # Parse the index page; return the output file name and the ordered chapter URLs.
    htmlTree = getHtmlCode(url)
    storyName = htmlTree.h1.get_text() + '.txt'

    print('Novel title:', storyName)
    # aList is a list of Tag objects. Keep only hrefs of the form '<digits>.html';
    # the dot is escaped and the pattern anchored so that links such as 'index.html' are skipped.
    aList = htmlTree.find_all('a', href=re.compile(r'^\d+\.html$'))
    aDealList = []
    for line in aList:
        chapter = int(line['href'][0:-5])   # strip the trailing '.html'
        if chapter not in aDealList:    # de-duplicate
            aDealList.append(chapter)
    aDealList.sort()    # sort ascending by chapter number
    urlList = []
    for line in aDealList:
        line = url + str(line) + '.html'    # relative address -> absolute address
        urlList.append(line)
    print(urlList)
    return (storyName, urlList)
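def parserChapter(url):
    # Download one chapter page; return its title and body text.
    htmlTree = getHtmlCode(url)
    title = htmlTree.h1.get_text()  # chapter title
    content = htmlTree.find_all('div', id='content')
    # The chapter text is taken from the second child node of <div id="content">.
    content = content[0].contents[1].get_text()
    return (title, content)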
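def main(url):
    (storyName, urlList) = parserCaption(url)
    # Chapters are appended below, so remove any output left over from a previous run.
    # os.remove works on every platform, unlike the Windows-only 'del' command.
    if os.path.exists(storyName):
        os.remove(storyName)
    count = 1
    # Open the output file once instead of reopening it for every chapter.
    with open(storyName, 'a+', encoding='utf-8') as f:
        for url_alone in urlList:
            percent = count / len(urlList) * 100
            print('%s download progress %0.2f %%' % (storyName, percent))
            (title, content) = parserChapter(url_alone)
            tmp = title + '\n' + content + '\n'   # trailing newline separates chapters
            f.write(tmp)
            count = count + 1

main(url)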
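
Because every chapter address is known before any download starts, a single failed request no longer has to stop the whole run. The following is one possible variation on the download loop, offered as a sketch rather than as part of the original script: the names download_all and fake_fetch are hypothetical, the example.invalid URLs are placeholders, and the real parserChapter would be passed in place of the fake fetcher.

```python
import io


def download_all(url_list, fetch_chapter, out_file):
    """Append every chapter to out_file; return the URLs that failed (hypothetical helper)."""
    failed = []
    total = len(url_list)
    for count, chapter_url in enumerate(url_list, start=1):
        try:
            title, content = fetch_chapter(chapter_url)
        except Exception as error:  # deliberately broad: any failure just skips the chapter
            failed.append(chapter_url)
            print('skipped %s: %s' % (chapter_url, error))
            continue
        out_file.write(title + '\n' + content + '\n')
        print('download progress %0.2f %%' % (count / total * 100))
    return failed


def fake_fetch(chapter_url):
    # Stand-in for parserChapter so the sketch runs without network access.
    if chapter_url.endswith('/2.html'):
        raise OSError('simulated network error')
    return ('Title of ' + chapter_url, 'body text')


demo_urls = [
    'http://example.invalid/1.html',
    'http://example.invalid/2.html',
    'http://example.invalid/3.html',
]
demo_out = io.StringIO()
demo_failed = download_all(demo_urls, fake_fetch, demo_out)
```

The returned list of failed addresses could be retried later, which the chapter-by-chapter chain approach cannot easily do.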
