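1 Project description

Anyone who likes buying books has probably heard of Dangdang Books. It covers many categories, including fiction, children's books, study aids, textbooks, exam preparation, and foreign languages, and its catalog is fairly complete compared with other sites.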

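This project uses the programming and development books on Dangdang as its only example. In practice you can swap in the URL of whichever category you need, or loop over a list of URLs to crawl several categories in one batch.
The fields collected from Dangdang are: book title, price, author, number of comments, publication date, publisher, and book introduction.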
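
The URL-list loop mentioned above could be driven from a plain text file. The following is a minimal sketch, not the original project's code: the helper name `load_category_urls` and the one-URL-per-line file layout are assumptions made for illustration.

```python
from pathlib import Path


def load_category_urls(path):
    """Return the unique, non-empty URLs listed one per line in a text file."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        # Skip blank lines, comment lines, and duplicates, keeping the original order.
        if url and not url.startswith("#") and url not in urls:
            urls.append(url)
    return urls
```

The returned list can then be assigned to a spider's `start_urls`, so that Scrapy requests every category in turn.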

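2 Create the project (`scrapy startproject xxx`):

Create a new crawler project, then generate a spider named `dd` that is restricted to the `dangdang.com` domain:

 scrapy genspider dd 'dangdang.com'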

3 Define the target (write `items.py`)

Declare the fields you want to scrape;

import scrapy


class DangdangItem(scrapy.Item):
    # Book title
    title = scrapy.Field()
    # Book price
    price = scrapy.Field()
    # Book author
    author = scrapy.Field()
    # Number of comments
    comment_num = scrapy.Field()
    # Publication date
    publication_date = scrapy.Field()
    # Publisher
    publication_house = scrapy.Field()
    # Book introduction
    introduction = scrapy.Field()

4 Build the spider (`spiders/xxspider.py`)

Write the spider and start crawling pages;

5 Store the results (`pipelines.py`)

Set up a pipeline to store the crawled content;
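
The post does not say which storage format it uses, so the following is one possible sketch rather than the author's pipeline. It writes each item as a single JSON line. The class name `JsonLinesPipeline` and the default file name `books.jsonl` are assumptions.

```python
import json


class JsonLinesPipeline:
    """Write each scraped item as one JSON object per line."""

    def __init__(self, path="books.jsonl"):
        # Hypothetical default output file; change it to suit your project.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once by Scrapy when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once by Scrapy when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for both plain dicts and scrapy.Item objects.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

To activate a pipeline like this, register it under `ITEM_PIPELINES` in `settings.py`, for example `{"xxx.pipelines.JsonLinesPipeline": 300}`, where `xxx` is the project name. `ensure_ascii=False` keeps Chinese titles readable in the output file.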
