1. Create a folder
Open a cmd prompt and switch to drive D:
#cmd
D:
Create an article folder:
mkdir article
2. Create the project
Enter the new folder and create the Scrapy project there:
cd article
scrapy startproject article
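startproject generates a standard project layout. It should look roughly like this (the exact file list can vary slightly between Scrapy versions):

```
article/
    scrapy.cfg            # deploy/config file marking the project root
    article/              # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code lives here
            __init__.py
```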
3. Generate the spider skeleton
scrapy genspider xinwen www.hbskzy.cn
# the command takes the spider name followed by the domain to crawl
# the spider name must not be the same as the project name
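genspider only writes a minimal template into spiders/xinwen.py. It looks roughly like the following (a sketch; the exact contents depend on your Scrapy version), and step 4 replaces its body with real parsing logic:

```python
import scrapy


class XinwenSpider(scrapy.Spider):
    name = 'xinwen'                      # the name used with "scrapy crawl"
    allowed_domains = ['www.hbskzy.cn']  # requests outside this domain are filtered
    start_urls = ['http://www.hbskzy.cn/']

    def parse(self, response):
        pass                             # parsing logic goes here
```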
4. Write items, the spider, pipelines, and settings in turn
The items file:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
The spider file:
import scrapy
from article.items import ArticleItem  # import the item defined above
from scrapy import Request


class YuSpider(scrapy.Spider):
    name = 'xinwen'  # must match the name used with "scrapy crawl"
    allowed_domains = ['bbs.liyang-tech.com']

    def start_requests(self):
        urls = ['http://bbs.liyang-tech.com/forum.php?mod=forumdisplay&fid=4&page=%s' % i
                for i in range(1, 20)]
        for url in urls:
            yield Request(url=url, callback=self.next_parse)

    def next_parse(self, response):
        base = 'http://bbs.liyang-tech.com/'
        titles = response.xpath('//*/tr/th/a[2]/text()')[3:].extract()
        hrefs = response.xpath('//*/tr/th/a[2]/@href')[3:].extract()
        for title, href in zip(titles, hrefs):
            item = ArticleItem()        # create a fresh item for each row
            item['title'] = title
            item['link'] = base + href  # the forum returns relative links
            yield item
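The title/link pairing in the loop above can be sketched outside Scrapy with plain Python. urljoin from the standard library resolves the forum's relative hrefs against the site root; the sample titles and hrefs here are made-up stand-ins for the XPath extraction results:

```python
from urllib.parse import urljoin

base = 'http://bbs.liyang-tech.com/'
# hypothetical values standing in for response.xpath(...).extract()
titles = ['Post A', 'Post B']
hrefs = ['forum.php?mod=viewthread&tid=1', 'forum.php?mod=viewthread&tid=2']

# pair each title with its absolute link, as the spider loop does
rows = [{'title': t, 'link': urljoin(base, h)} for t, h in zip(titles, hrefs)]
```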
The pipelines file:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ArticlePipeline:
    def process_item(self, item, spider):
        return item
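process_item is called once for every yielded item and must return the item (or raise DropItem) so that any later pipeline receives it. As a minimal sketch of that contract, here is a hypothetical CountingPipeline (not part of the generated project) that works on plain dicts too:

```python
class CountingPipeline:
    """Hypothetical pipeline that counts the items passing through."""

    def __init__(self):
        self.count = 0

    def process_item(self, item, spider):
        self.count += 1  # one more item seen
        return item      # always return the item for later pipelines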
The settings file:
# Scrapy settings for article project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'article'
SPIDER_MODULES = ['article.spiders']
NEWSPIDER_MODULE = 'article.spiders'
FEED_FORMAT = 'csv'        # optional: export scraped items as CSV
FEED_URI = 'filename.csv'  # optional: name of the output file
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
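Note that ArticlePipeline only takes effect after it is registered in settings.py. A typical registration (assuming the default module path generated above) looks like:

```python
# Enable the pipeline; the number (0-1000) is its order, lower runs first
ITEM_PIPELINES = {
    'article.pipelines.ArticlePipeline': 300,
}
```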
5. Run the spider
When the code is written, check each file carefully for errors.
Then start the spider; open a cmd prompt and run:
cd /D D:/article/article
scrapy crawl xinwen
Page analysis is not covered here; test and debug the spider in advance, otherwise errors are likely.
When the crawl finishes, a csv file is generated under the article directory; open it to check the scraped titles and links.
Tip: writing an empty value also raises an error; you can add an if check in the pipelines file:
if 'key' in item:
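That guard can be kept as a small helper in pipelines.py. This is one possible sketch (the field names are taken from the items file above; in a real pipeline you would typically raise scrapy.exceptions.DropItem for bad items instead of silently skipping them):

```python
REQUIRED_FIELDS = ('title', 'link')


def has_required_fields(item):
    """True only when every required field exists and is non-empty."""
    return all(item.get(f) for f in REQUIRED_FIELDS)

# inside process_item, only write rows for which this returns True
```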