Preface:
Many of the example links floating around online are dead; this one works (if a link stops working, leave a comment and I will update it as soon as possible). This is not an introductory example; if you want beginner material, see my Scrapy learning series, parts one through three.
(Talk is cheap, here is the code.)
Project structure
This spider scrapes the full-length novel One Hundred Years of Solitude (《百年孤独》) from the site.
Contents of xpathtest.py
import scrapy
from xpathtest.items import XpathtestItem

class XpathTest(scrapy.Spider):
    name = "xpathtest"
    start_urls = ['https://www.luoxia.com/bainiangudu/']
    allowed_domains = ['luoxia.com']

    def parse(self, response):
        # Collect the chapter links from the book's table of contents
        chapter_urls = response.xpath('//div[@class="book-list clearfix"]/ul/li/a/@href').extract()
        for url in chapter_urls:
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        item = XpathtestItem()
        item["name"] = response.css('title::text').extract()[0]  # extract the page title
        item["content"] = response.xpath('//div[@id="nr1"]/p/text()').extract()  # extract the chapter text
        yield item
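Before running the whole spider, it can help to sanity-check the selector logic in isolation. Scrapy's selectors are backed by parsel/lxml; as a rough stand-in, the same path structure can be exercised with the standard library's xml.etree.ElementTree (which supports only a limited XPath subset) on a hypothetical snippet of the chapter-list markup. The snippet below is invented for illustration, not copied from the site:

```python
import xml.etree.ElementTree as ET

# A hypothetical, simplified version of the chapter-list markup
snippet = """
<html><body>
  <div class="book-list clearfix">
    <ul>
      <li><a href="https://www.luoxia.com/bainiangudu/1.htm">Chapter 1</a></li>
      <li><a href="https://www.luoxia.com/bainiangudu/2.htm">Chapter 2</a></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(snippet)
# Same path structure as the spider's //div[@class="book-list clearfix"]/ul/li/a/@href
links = [a.get("href") for a in root.findall(".//div[@class='book-list clearfix']/ul/li/a")]
print(links)
```

If the list comes back empty against real page source, the XPath (not the Scrapy plumbing) is the first thing to fix.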
Contents of pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

class XpathtestPipeline(object):
    def process_item(self, item, spider):
        filename = "c:/test/" + item["name"] + ".txt"
        # Use an explicit encoding so the Chinese text is written correctly on Windows
        with open(filename, 'w', encoding='utf-8') as f:
            for line in item["content"]:
                f.write(line + "\n")
        return item
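One caveat with the pipeline above: it builds the filename directly from the page title, and titles can contain characters that Windows forbids in filenames (colons, asterisks, question marks, quotes, angle brackets, pipes, slashes), which would make open() fail. A minimal sketch of a hypothetical helper (safe_filename is not part of the project, just an illustration):

```python
import re

def safe_filename(title):
    # Replace characters that are illegal in Windows filenames with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('a:b*c?'))  # -> a_b_c_
```

In process_item it could be used as filename = "c:/test/" + safe_filename(item["name"]) + ".txt"; creating the output directory up front with os.makedirs("c:/test", exist_ok=True) also avoids a missing-folder error.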
Add the following to settings.py
ITEM_PIPELINES = {
    'xpathtest.pipelines.XpathtestPipeline': 300,
}
DOWNLOAD_DELAY = 2  # crawl delay, to avoid "page not available" errors
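DOWNLOAD_DELAY is not the only politeness knob Scrapy offers; AutoThrottle and per-domain concurrency limits are also standard settings. A sketch of optional additions (the setting names are standard Scrapy options; the values here are only examples, not tuned for this site):

```python
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
AUTOTHROTTLE_ENABLED = True          # adapt the delay to the server's responsiveness
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # cap parallel requests to one domain
```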
Contents of items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class XpathtestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    content = scrapy.Field()
Finally
First create a folder named test on the C: drive (the pipeline writes there).
Open a cmd window in the project's folder.
Run: scrapy crawl xpathtest
A look at the crawled content