Learning the Scrapy Framework by Crawling a Novel

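I. Background: I have recently been learning web scraping with Python and find it quite interesting. Hand-written crawlers felt inefficient, and having learned that the scraping community has mature tools available, I decided to try out the Scrapy framework.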

II. Environment: CentOS 7, Python 3.7, Scrapy 1.7.3

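III. A brief overview of how Scrapy works:

1. Scrapy's components: the engine, the scheduler, the downloader (with its downloader middlewares), the spiders (with their spider middlewares), and the item pipelines.

2. How Scrapy works:

(1) The engine takes the crawl requests and hands them to the scheduler, which queues and orders them.

(2) The engine passes the requests queued by the scheduler to the downloader, which carries out the download (it sends out the Requests).

(3) The Response obtained by the downloader goes back through the engine to the spider, which extracts the content. (This step is where most of the manual work lies: what to extract and how to extract it are decided by the code you write.)

(4) The data produced by the spider is output through the item pipelines, for example as JSON, CSV, or rows in a database. (The output scheme is also under your control and depends on your needs.)

(5) The process above repeats until the planned crawl is complete.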
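
The five steps above can be sketched as a toy loop in plain Python. This is not Scrapy's implementation, only an illustration of the roles: FAKE_SITE, download, spider_parse, and run are all made-up names, and the "network" is a dictionary.

```python
from collections import deque

# Hypothetical stand-in for the network: URL -> page body and the links on it.
FAKE_SITE = {
    "index": {"body": "table of contents", "links": ["ch1", "ch2"]},
    "ch1": {"body": "chapter one text", "links": []},
    "ch2": {"body": "chapter two text", "links": []},
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    return {"url": url, **FAKE_SITE[url]}


def spider_parse(response):
    """Spider: yield follow-up requests, or an item for a leaf page."""
    for link in response["links"]:
        yield ("request", link)
    if not response["links"]:
        yield ("item", {"url": response["url"], "content": response["body"]})


def run(start_url):
    scheduler = deque([start_url])  # scheduler: queue of pending requests
    seen = {start_url}              # duplicate filter
    pipeline_output = []            # item pipeline: here, just a list
    while scheduler:                # engine loop: runs until nothing is pending
        response = download(scheduler.popleft())
        for kind, value in spider_parse(response):
            if kind == "request" and value not in seen:
                seen.add(value)
                scheduler.append(value)
            elif kind == "item":
                pipeline_output.append(value)
    return pipeline_output


items = run("index")
print([item["url"] for item in items])
```

The real engine is asynchronous and downloads many pages concurrently, so items do not necessarily arrive in page order as they do in this sequential sketch.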

IV. Building a Scrapy project (using the novel 《汉乡》 on the 新笔趣阁 site, www.xbiquge.com, as the example)

1. Create a new Scrapy project

(base) [python@ELK ~]$ scrapy startproject hanxiang
New Scrapy project 'hanxiang', using template directory '/home/python/miniconda3/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /home/python/hanxiang

You can start your first spider with:
    cd hanxiang
    scrapy genspider example example.com
(base) [python@ELK ~]$ tree hanxiang
hanxiang
├── hanxiang
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

2. Enter the project directory and generate the spider file

(base) [python@ELK ~]$ cd hanxiang
(base) [python@ELK hanxiang]$ scrapy genspider Hanxiang www.xbiquge.la/15/15158 # Note: the spider name must not be identical to the project name; do not include http:// or https:// in the link, and do not end it with a / character.
Created spider 'Hanxiang' using template 'basic' in module:
hanxiang.spiders.Hanxiang

(base) [python@ELK hanxiang]$ tree
.
├── hanxiang
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-37.pyc
│   │   └── settings.cpython-37.pyc
│   ├── settings.py
│   └── spiders
│       ├── Hanxiang.py
│       ├── __init__.py
│       └── __pycache__
│           └── __init__.cpython-37.pyc
└── scrapy.cfg
Of the files newly generated in the project, the important one is the spider file Hanxiang.py.
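
The note on the genspider command (no scheme, no trailing slash) can be applied mechanically when starting from a full URL. The sketch below uses only the standard library; genspider_target and bare_domain are hypothetical helper names, and the input is assumed to include its scheme.

```python
from urllib.parse import urlsplit


def genspider_target(url):
    """Drop the scheme and any trailing slash from a full URL."""
    parts = urlsplit(url)
    return (parts.netloc + parts.path).rstrip("/")


def bare_domain(url):
    """Return only the host name, the form a domain restriction expects."""
    return urlsplit(url).hostname


start = "http://www.xbiquge.la/15/15158/"
print(genspider_target(start))
print(bare_domain(start))
```

The second helper is relevant later: a value containing a path is not a domain, which is why the generated allowed_domains entry is replaced in the spider.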

3. Write the spider file and the related configuration files

(1) Edit settings.py

(base) [python@ELK hanxiang]$ vi hanxiang/settings.py

# -*- coding: utf-8 -*-
# Scrapy settings for hanxiang project

BOT_NAME = 'hanxiang'
SPIDER_MODULES = ['hanxiang.spiders']
NEWSPIDER_MODULE = 'hanxiang.spiders'

ROBOTSTXT_OBEY = False # Do not obey the site's robots.txt rules

...

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'hanxiang.pipelines.HanxiangPipeline': 300, # Enable the pipeline; 300 is its priority, conventionally in the range 0-1000
}
...

FEED_EXPORT_ENCODING = 'utf-8' # Encoding used for exported output
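
The priority number matters once more than one pipeline is enabled: Scrapy passes each item through the pipelines in ascending order of this value. A minimal sketch of that ordering, assuming a second pipeline named CleanTextPipeline that does not exist in this project; pipeline_order is likewise a made-up helper, not a Scrapy function.

```python
ITEM_PIPELINES = {
    "hanxiang.pipelines.HanxiangPipeline": 300,
    "hanxiang.pipelines.CleanTextPipeline": 100,  # hypothetical, for illustration only
}


def pipeline_order(setting):
    """Return pipeline paths sorted by priority, lowest value first."""
    return [path for path, _ in sorted(setting.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
print(order)
```

With these values the hypothetical cleaning step would see each item before the database pipeline does.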

(2) Edit items.py (define what to scrape)

(base) [python@ELK hanxiang]$ vi hanxiang/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class HanxiangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()           # URL of the chapter being scraped
    preview_page = scrapy.Field()  # URL of the previous chapter page
    next_page = scrapy.Field()     # URL of the next chapter page
    content = scrapy.Field()       # Text of the chapter

(3) Write pipelines.py to control the output. (Goal: write the four scraped fields to a MySQL database so that they are easy to process later.)

(base) [python@ELK hanxiang]$ vi hanxiang/pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql

class HanxiangPipeline(object):             # The class name is generated automatically and need not be changed.

    def __init__(self):                              # Initialization: connect to the novels database and create the hanxiang table.
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': 'password',
            'database': 'novels',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None
        # Note: any existing hanxiang table is dropped and recreated on every run.
        self.cursor.execute("drop table if exists hanxiang")
        self.cursor.execute("create table hanxiang (id int unsigned auto_increment not null primary key, url varchar(50) not null, preview_page varchar(50), next_page varchar(50), content TEXT not null) charset=utf8")

    def process_item(self, item, spider):        # This method name is also generated and must not be changed.
        self.cursor.execute(self.sql, (item['url'], item['preview_page'], item['next_page'], item['content']))   # Run the SQL statement to write the scraped data into the hanxiang table.
        self.conn.commit()    # commit is required for the table contents to be updated.
        return item

    @property
    def sql(self):
        # Build the INSERT statement on first use and cache it.
        if not self._sql:
            self._sql = """
                insert into hanxiang(id, url, preview_page, next_page, content) values(null, %s, %s, %s, %s)
                """
        return self._sql
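
The same create-table, parameterized-insert, commit pattern can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module for pymysql, so it is not the pipeline used in this project: SqliteChapterPipeline and the sample item are hypothetical, and sqlite3 uses ? placeholders where pymysql uses %s. The close_spider hook, which Scrapy calls when the spider finishes, is added here to close the connection.

```python
import sqlite3


class SqliteChapterPipeline:
    """Hypothetical variant of HanxiangPipeline backed by sqlite3."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("drop table if exists hanxiang")
        self.conn.execute(
            "create table hanxiang ("
            "id integer primary key autoincrement, "
            "url text not null, preview_page text, next_page text, "
            "content text not null)"
        )

    def process_item(self, item, spider):
        # Values are passed separately from the SQL text, never formatted into it.
        self.conn.execute(
            "insert into hanxiang(url, preview_page, next_page, content) "
            "values (?, ?, ?, ?)",
            (item["url"], item["preview_page"], item["next_page"], item["content"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


# Hypothetical item; the URLs are placeholders, not real chapter addresses.
sample = {
    "url": "http://www.xbiquge.la/15/15158/1.html",
    "preview_page": "http://www.xbiquge.la/15/15158/",
    "next_page": "http://www.xbiquge.la/15/15158/2.html",
    "content": "Chapter title\nChapter text",
}
pipeline = SqliteChapterPipeline()
pipeline.process_item(sample, spider=None)
rows = pipeline.conn.execute("select id, url from hanxiang").fetchall()
print(rows)
```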

(4) Write the main spider program Hanxiang.py to scrape the chosen content.

(base) [python@ELK hanxiang]$ vi hanxiang/spiders/Hanxiang.py

# -*- coding: utf-8 -*-
import scrapy
from hanxiang.items import HanxiangItem

class HanxiangSpider(scrapy.Spider):       # Spider class name, generated automatically
    name = 'Hanxiang'
    #allowed_domains = ['www.xbiquge.la/15/15158']    # Generated value restricting which pages may be crawled (the allowed domain)
    allowed_domains = ['xbiquge.la']     # Loosened restriction; otherwise the Requests issued for deeper pages would need dont_filter=True.
    def start_requests(self):   # Overrides Scrapy's start_requests; the name must not change
        start_urls = ['http://www.xbiquge.la/15/15158/']            # Start pages are kept in a list. (As a class attribute the name start_urls is fixed; here it is a local variable.)
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)       # Yield a Request for each start page; Scrapy calls parse with the downloaded response.
    def parse(self, response):    # parse is Scrapy's default callback name
        dl = response.css('#list dl dd')     # Select the elements holding the chapter links
        for dd in dl:
            self.url_c = "http://www.xbiquge.la" + dd.css('a::attr(href)').extract()[0]   # Build the full URL of each chapter
            #print(self.url_c)
            #yield scrapy.Request(self.url_c, callback=self.parse_c,dont_filter=True )
            yield scrapy.Request(self.url_c, callback=self.parse_c)    # Yield a Request whose callback, parse_c, extracts the chapter URL, previous-page link, next-page link, and chapter text.
            #print(self.url_c)
    def parse_c(self, response):
        item = HanxiangItem()
        item['url'] = response.url
        item['preview_page'] = "http://www.xbiquge.la" + response.css('div .bottem1 a::attr(href)').extract()[1]
        item['next_page'] = "http://www.xbiquge.la" + response.css('div .bottem1 a::attr(href)').extract()[3]
        title = response.css('.con_top::text').extract()[4]
        contents = response.css('#content::text').extract()
        text=''
        for content in contents:
            text = text + content
        #print(text)
        item['content'] = title + "\n" + text.replace('\15', '\n')     # Combine the chapter title and body into content; \15 is the octal escape for ^M (carriage return) and is replaced with a newline.
        yield item     # Yield the Item so that Scrapy passes it on to the item pipelines.
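
Two details of the spider can be checked in isolation: building absolute chapter URLs and assembling the content string. The sketch below uses urllib.parse.urljoin in place of string concatenation (Scrapy responses offer a urljoin method with the same purpose) and "".join in place of the loop. The function names and the sample href are hypothetical.

```python
from urllib.parse import urljoin


def absolute(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


def build_content(title, fragments):
    """Join the text fragments and turn carriage returns into newlines."""
    text = "".join(fragments)
    return title + "\n" + text.replace("\r", "\n")


index = "http://www.xbiquge.la/15/15158/"
print(absolute(index, "/15/15158/1.html"))  # site-relative href
print(absolute(index, "1.html"))            # page-relative href
print("\15" == "\r")                        # the octal escape is a carriage return
print(repr(build_content("Chapter 1", ["line one\r", "line two"])))
```

Unlike concatenation with a fixed prefix, urljoin also handles hrefs that are relative to the current page or already absolute.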

4. Run the spider:

(base) [python@ELK hanxiang]$ scrapy runspider hanxiang/spiders/Hanxiang.py
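
Because Scrapy downloads pages concurrently, the rows are not necessarily written to the table in chapter order. One possible later processing step, sketched here under that assumption, is to follow the stored next_page links from the first chapter. order_chapters and the sample rows are hypothetical; real rows would come from a query on the hanxiang table.

```python
def order_chapters(rows, first_url):
    """Walk the next_page links from first_url and return rows in reading order."""
    by_url = {row["url"]: row for row in rows}
    ordered, seen, url = [], set(), first_url
    while url in by_url and url not in seen:  # stop at the end or on a loop
        seen.add(url)
        ordered.append(by_url[url])
        url = by_url[url]["next_page"]
    return ordered


# Hypothetical rows, deliberately out of order.
rows = [
    {"url": "c2", "next_page": "c3", "content": "second"},
    {"url": "c3", "next_page": "index", "content": "third"},
    {"url": "c1", "next_page": "c2", "content": "first"},
]
ordered = order_chapters(rows, "c1")
print([row["content"] for row in ordered])
```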
