Crawling the 17k novel site with a Scrapy spider: a worked example (method two)

Category: Other. Posted 2018-06-25 17:03:41.

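We want to crawl the novel listed on this page, which has 125 chapters in total.

Opening the first chapter and the 125th chapter reveals a pattern in the URLs.

The chapter links run from http://www.17k.com/chapter/271047/6336386.html (chapter 1) to http://www.17k.com/chapter/271047/6336510.html (chapter 125).

The chapter ID increases by one from 6336386 to 6336510, which is exactly 125 values. This observation leads to the core spider code below.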
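The ID arithmetic can be checked on its own before running a crawl. The following is a minimal sketch, not part of the original project: `chapter_urls`, `BASE_URL`, `FIRST_ID` and `LAST_ID` are hypothetical names, and the sketch assumes the IDs are contiguous with no gaps, which the post infers only from the first and last chapter.

```python
# Hypothetical helper names; only the URL pattern and the two IDs come from the post.
BASE_URL = "http://www.17k.com/chapter/271047/"
FIRST_ID = 6336386
LAST_ID = 6336510


def chapter_urls(first_id=FIRST_ID, last_id=LAST_ID):
    """Return one URL per chapter ID in the inclusive range [first_id, last_id]."""
    return [f"{BASE_URL}{chapter_id}.html" for chapter_id in range(first_id, last_id + 1)]


urls = chapter_urls()
print(len(urls))  # 125
print(urls[0])
print(urls[-1])
```

The inclusive range yields 125 URLs, which matches the chapter count on the index page. If the site ever skipped an ID, some of these URLs would not be chapter pages, so this approach is only as reliable as that assumption.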

# -*- coding: utf-8 -*-
import scrapy
from k17.items import K17Item


class A17kSpider(scrapy.Spider):
    name = '17k'
    allowed_domains = ['17k.com']
    start_urls = ['http://www.17k.com/chapter/271047/6336386.html']

    def parse(self, response):
        # Chapter IDs are consecutive, so build every chapter URL directly.
        for i in range(6336386, 6336510 + 1):
            new_url = "http://www.17k.com/chapter/271047/" + str(i) + ".html"
            # Request each chapter page and hand the response to next_parse.
            yield scrapy.Request(new_url, callback=self.next_parse)

    def next_parse(self, response):
        for bb in response.xpath('//div[@class="readArea"]/div[@class="readAreaBox content"]'):
            item = K17Item()
            # Chapter title
            title = bb.xpath("h1/text()").extract()
            item['title'] = ''.join(title).replace('\n', '').strip()
            # Chapter body text
            dec = bb.xpath("div[@class='p']/text()").extract()
            # Remove newlines and full-width spaces (\u3000), then trim whitespace.
            item['describe'] = ''.join(dec).replace('\n', '').replace('\u3000', '').strip()
            yield item
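The text clean-up in `next_parse` can be tried without Scrapy. The sketch below pulls the same chain of operations into a standalone function; `clean_text` is a hypothetical name and the sample fragments are made up for illustration, not taken from the site.

```python
def clean_text(fragments):
    """Join text fragments, drop newlines and full-width spaces, trim the ends."""
    joined = "".join(fragments)
    return joined.replace("\n", "").replace("\u3000", "").strip()


# Made-up fragments shaped like what an XPath text() query might return.
sample = ["\n\u3000\u3000First paragraph.", "\n\u3000\u3000Second paragraph.\n  "]
print(clean_text(sample))  # First paragraph.Second paragraph.
```

Removing every newline also merges the paragraphs into one run of text, as the output shows. That is what the spider above does; joining the fragments with a separator instead would be one way to keep paragraph boundaries.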

Finally, the pipeline in pipelines.py writes the results to a file:

import json


class K17Pipeline(object):
    # Open the output file when the pipeline is created
    def __init__(self):
        self.file = open('item.json', 'w', encoding='utf-8')

    # Write each item to the file as one line of JSON
    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item

    # Close the file when the spider finishes
    def close_spider(self, spider):
        self.file.close()
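Because the pipeline writes one JSON object per line, item.json is a JSON Lines file rather than a single JSON document, so it has to be parsed line by line. The sketch below shows one way to read it back using only the standard library; `read_items` is a hypothetical helper and the sample records are invented, written to a temporary file in the same way the pipeline writes them.

```python
import json
import os
import tempfile


def read_items(path):
    """Parse a file holding one JSON object per line into a list of dicts."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


# Invented sample records using the two fields the spider fills in.
sample = [
    {"title": "Chapter 1", "describe": "First chapter text"},
    {"title": "Chapter 2", "describe": "Second chapter text"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "item.json")
    # Write the records the same way the pipeline does: one JSON object per line.
    with open(demo_path, "w", encoding="utf-8") as handle:
        for record in sample:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    loaded = read_items(demo_path)

print(loaded[0]["title"])  # Chapter 1
```

Scrapy issues requests concurrently, so the lines in item.json are not guaranteed to appear in chapter order. The items here carry no chapter number, so restoring the order would need an extra field that the original code does not define.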

Reposted from www.cnblogs.com/stevenshushu/p/9225016.html
