Python Crawler from Beginner to Give-Up 07 | Python Crawler in Action: Downloading the Complete Daomubiji (盗墓笔记) Novels

This blog post is just a record of my hobby work, published for readers' reference only; if it infringes on anything, please let me know and I will delete it.
This article is entirely original, not copied from or modeled on any other article. Committed to original content!!

Foreword

Hello there. This is the "Python Crawler from Beginner to Give-Up" series. I am SunriseCai.


This article describes how to use a crawler to download the complete Daomubiji (盗墓笔记) novels by Nanpai Sanshu (南派三叔).

Daomubiji complete works address: http://www.daomubiji.com/

1. Crawling Approach

The Daomubiji novel site is organized in three levels:

  • Home page (first-level page)
  • Chapter pages (second-level pages)
  • Body pages (third-level pages)

In other words, the pages nest inside one another like Russian dolls!

  1. Visit the home page (first-level page) to get the chapter-page (second-level) links.
  2. Visit the chapter pages (second-level pages) to get the body-page (third-level) links.
  3. Visit the body pages (third-level pages) and extract the text.
  4. Save the text data. (The whole flow is sketched right below.)
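
In outline, the flow looks like the minimal sketch below. The helper functions are the ones built step by step in the following sections, so treat this as a roadmap rather than code that runs on its own yet.

links, names = get_overview()                        # level 1 -> level 2: volume links
for volume_link in links:
    ch_links, ch_titles = get_catalogs(volume_link)  # level 2 -> level 3: chapter links
    for link, title in zip(ch_links, ch_titles):
        title, content = get_content(link, title)    # level 3: fetch the body text
        save_content(content, title)                 # save it to disk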

Then take a look at how to request these pages.

2. Requests and Page Analysis

As shown above, the home page of the Daomubiji novel site is: http://www.daomubiji.com/

2.1 Accessing the Home Page (First-Level Page)

Open the Daomubiji home page in a browser and press F12 to enter developer mode. Looking at the page structure, we can find the links to the second-level pages. Perfect! All that remains is to request the page and parse it to extract the data.

  • Code to access the home page and extract the second-level page links
  • As introduced in earlier articles, we use requests to fetch the page and XPath (via lxml) to parse it
import requests
from lxml import etree

url = 'http://www.daomubiji.com'

def get_overview():
    # Visit the Daomubiji home page
    res = requests.get(url)
    # Continue only if the status code is 200 (success)
    if res.status_code == 200:
        # Build the XPath parsing object
        parse_html = etree.HTML(res.text)
        link = parse_html.xpath('//*[@id="menu-item-1404"]/ul/li/a/@href')   # second-level page links
        name = parse_html.xpath('//*[@id="menu-item-1404"]/ul/li/a/text()')  # second-level page names
        return link, name
    else:
        print('Request failed, status code:', res.status_code)

link, name = get_overview()
print(link, name)

# ['http://www.daomubiji.com/dao-mu-bi-ji-1', 'http://www.daomubiji.com/dao-mu-bi-ji-2',
#  'http://www.daomubiji.com/dao-mu-bi-ji-3', 'http://www.daomubiji.com/dao-mu-bi-ji-4',
#  'http://www.daomubiji.com/dao-mu-bi-ji-5', 'http://www.daomubiji.com/dao-mu-bi-ji-6',
#  'http://www.daomubiji.com/dao-mu-bi-ji-7', 'http://www.daomubiji.com/dao-mu-bi-ji-8',
#  'http://www.daomubiji.com/dao-mu-bi-ji-2015']
# ['盗墓笔记1:七星鲁王', '盗墓笔记2:秦岭神树', '盗墓笔记3:云顶天宫', '盗墓笔记4:蛇沼鬼城',
#  '盗墓笔记5:迷海归巢', '盗墓笔记6:阴山古楼', '盗墓笔记7:邛笼石影', '盗墓笔记8:大结局',
#  '盗墓笔记2015年更新']
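
A caveat of my own (not from the original): if the Chinese titles print as garbled characters, requests may have guessed the wrong response encoding. Forcing it before parsing usually fixes that:

res = requests.get(url)
res.encoding = res.apparent_encoding  # let requests sniff the real encoding (or set it to 'utf-8' directly)
parse_html = etree.HTML(res.text)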

Now that we have the second-level page links, the next step is to visit the chapter pages (second-level pages) and get the body pages (third-level pages).

2.2 Accessing the Chapter Pages (Second-Level Pages)

The chapter catalog on a second-level page contains links to the body pages (third-level pages), which are exactly what we want. The code below visits a chapter page (second-level page) and extracts the body-page (third-level page) links.

def get_catalogs(url):
    """
    :param url: link to a second-level page
    :return: all chapter links and titles
    """
    res = requests.get(url=url)
    if res.status_code == 200:
        parse_html = etree.HTML(res.text)
        link = parse_html.xpath('/html/body/section/div[2]/div/article/a/@href')    # body-page links
        title = parse_html.xpath('/html/body/section/div[2]/div/article/a/text()')  # body-page titles
        return link, title
    else:
        print('Request failed, status code:', res.status_code)
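
A quick sanity check of my own, using the first volume link printed by get_overview() above:

links, titles = get_catalogs('http://www.daomubiji.com/dao-mu-bi-ji-1')
print(titles[:3])  # the first few chapter titles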

With the body-page (third-level page) links and titles in hand, the next step is to request the body text itself.

2.3 Accessing the Body Pages (Third-Level Pages): Extracting and Saving the Text

Inspecting a body page shows that the text all sits under the article tag, so next we request that page and pull the text out.

  • Body-text extraction example:
def get_content(url, title):
    """
    :param url:   link to a body page
    :param title: chapter title
    :return: the chapter title and body text
    """
    res = requests.get(url=url)
    if res.status_code == 200:
        parse_html = etree.HTML(res.text)
        content = parse_html.xpath("/html/body/section/div[1]/div/article/p/text()")  # body text
        return title, content
    else:
        print('Request failed, status code:', res.status_code)

title, content = get_content(link, title)  # link and title come from get_catalogs()
  • Body-text saving example:
def save_content(content, name):
    """
    :param content: body text to save
    :param name:    chapter title, used as the file name
    :return:
    """
    with open('%s.txt' % name, 'a', encoding='utf-8') as f:
        for data in content:
            f.write(data + '\n')  # newline after each paragraph

3. Complete Code

  • The code above is consolidated here, with some modifications.
  • You can copy this code and run it as-is.
# -*- coding: utf-8 -*-
# @Time    : 2020/1/25 20:20
# @Author  : SunriseCai
# @File    : DaoMBJSpider.py
# @Software: PyCharm


"""Daomubiji (盗墓笔记) novel crawler"""


import os
import time
import requests
from lxml import etree


class DaomubijiSpider(object):
    def __init__(self):
        self.url_homePage = 'http://www.daomubiji.com'
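        # identify as a browser; some sites reject requests' default User-Agent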
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        self.content_titles = None
        self.content_links = None

    def get_overview(self):
        """
        Visit the home page and get the second-level page titles and links.
        """
        res = requests.get(self.url_homePage, headers=self.headers)
        if res.status_code == 200:
            parse_html = etree.HTML(res.text)
            links = parse_html.xpath('//*[@id="menu-item-1404"]/ul/li/a/@href')
            titles = parse_html.xpath('//*[@id="menu-item-1404"]/ul/li/a/text()')
            for link, title in zip(links, titles):
                self.get_catalogs(title, link)
                self.get_content()
        else:
            print('Request failed, status code:', res.status_code)

    def get_catalogs(self, title, url):
        """
        :param title: second-level page (volume) title
        :param url:   second-level page link
        :return: stores the parsed chapter links and titles on self
        """
        if not os.path.exists(title):
            os.makedirs(title)
        res = requests.get(url=url, headers=self.headers)
        if res.status_code == 200:
            parse_html = etree.HTML(res.text)
            self.content_links = parse_html.xpath('/html/body/section/div[2]/div/article/a/@href')
            self.content_titles = [parse_html.xpath('/html/body/section/div[2]/div/article/a/text()'), title]
        else:
            print('Request failed, status code:', res.status_code)

    def get_content(self):
        """
        Request each body page, extract the body text, and save it.
        """
        folder = self.content_titles[1]
        for link, title in zip(self.content_links, self.content_titles[0]):
            time.sleep(2)  # sleep 2 seconds so we don't put too much pressure on the server
            res = requests.get(url=link, headers=self.headers)
            if res.status_code == 200:
                parse_html = etree.HTML(res.text)
                content = parse_html.xpath("/html/body/section/div[1]/div/article/p/text()")
                self.save_content(folder, title, content)
            else:
                print('Request failed, status code:', res.status_code)

    def save_content(self, folder, title, content):
        """
        :param folder:  folder to save into
        :param title:   chapter title, used as the file name
        :param content: body text to save
        :return:
        """
        with open('%s/%s.txt' % (folder, title), 'a', encoding='utf-8') as f:
            for data in content:
                f.write(data + '\n')

    def main(self):
        self.get_overview()


if __name__ == '__main__':
    spider = DaomubijiSpider()
    spider.main()
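
If you want a quick test before crawling the whole site, you can also drive the class one volume at a time; this small usage sketch of mine reuses the title and URL from the section 2.1 output:

spider = DaomubijiSpider()
spider.get_catalogs('盗墓笔记1:七星鲁王', 'http://www.daomubiji.com/dao-mu-bi-ji-1')
spider.get_content()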

Run it and take a look at the results: each volume gets its own folder, with one .txt file per chapter.
That's all for this article.


Admittedly, this article is fairly plain; I recommend copying the code and running it yourself to relive the charm of Nanpai Sanshu's writing.


Finally, to sum up this chapter:

  1. Introduced the crawling approach for the Daomubiji website.
  2. Explained in detail how to use a crawler to download a complete online novel.
  3. If you have any questions, please leave a comment below.

sunrisecai

  • Thank you for reading; follow me so you don't get lost.
  • Fellow beginners are welcome to join the QQ group: 648696280

The next article is titled "Python Crawler from Beginner to Give-Up 08 | Python Crawler in Action - Downloading an Image Site - to be determined".


Origin: blog.csdn.net/weixin_45081575/article/details/104081694