A simple Scrapy crawler example

1. Create the working folder

Open a cmd window and switch to drive D:

#cmd
D:

Create the article folder:

mkdir article

2. Create the project

scrapy startproject article
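
startproject generates the standard Scrapy project layout; the tree below is what you should see (comments added for orientation):

article/
    scrapy.cfg            # deploy/run configuration
    article/              # the project's Python module
        __init__.py
        items.py          # item definitions (edited in step 4)
        middlewares.py
        pipelines.py      # item pipelines (edited in step 4)
        settings.py       # project settings (edited in step 4)
        spiders/          # spiders live here
            __init__.py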

3. Generate the spider

scrapy genspider xinwen www.hbskzy.cn
#the command takes the spider name followed by the domain
#the spider name must not be the same as the project name
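
genspider creates article/spiders/xinwen.py with a minimal skeleton, roughly as below (the exact template varies slightly between Scrapy versions):

import scrapy


class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['www.hbskzy.cn']
    start_urls = ['http://www.hbskzy.cn/']

    def parse(self, response):
        pass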

4. Write the items, spider, pipelines, and settings files in turn

The items file:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()

The spider file (the skeleton's domain and start URL are replaced here with the actual target forum):

import scrapy
from article.items import ArticleItem
from scrapy import Request


class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['bbs.liyang-tech.com']

    def start_requests(self):
        # list pages 1-19 of the forum board (fid=4)
        urls = ['http://bbs.liyang-tech.com/forum.php?mod=forumdisplay&fid=4&page=%s' % i for i in range(1, 20)]
        for url in urls:
            yield Request(url=url, callback=self.next_parse)

    def next_parse(self, response):
        # the first three matches on each page are not thread links, so skip them
        titles = response.xpath('//*/tr/th/a[2]/text()')[3:].extract()
        hrefs = response.xpath('//*/tr/th/a[2]/@href')[3:].extract()
        for title, href in zip(titles, hrefs):
            item = ArticleItem()
            item['title'] = title
            item['link'] = response.urljoin(href)  # make the relative forum link absolute
            yield item
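
The XPath expressions are easiest to verify in scrapy shell before running the full crawl; for example, against the first list page:

#cmd
scrapy shell "http://bbs.liyang-tech.com/forum.php?mod=forumdisplay&fid=4&page=1"

#then, at the Python prompt, inspect the first few matches:
response.xpath('//*/tr/th/a[2]/text()').extract()[:5]
response.xpath('//*/tr/th/a[2]/@href').extract()[:5]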

The pipelines file:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class ArticlePipeline:
    def process_item(self, item, spider):
        return item

The settings file:

# Scrapy settings for article project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'article'

SPIDER_MODULES = ['article.spiders']
NEWSPIDER_MODULE = 'article.spiders'
FEED_FORMAT = 'csv'       #optional: export scraped items as CSV
FEED_URI = 'filename.csv'      #optional: output file name

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
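
Two notes on settings. First, the ArticlePipeline from step 4 only runs if it is registered under ITEM_PIPELINES; second, in Scrapy 2.1+ the FEED_FORMAT/FEED_URI pair is deprecated in favor of the FEEDS dict. A sketch of both additions (the output file name is just an example):

# enable the pipeline (lower number = runs earlier in the chain)
ITEM_PIPELINES = {
    'article.pipelines.ArticlePipeline': 300,
}

# Scrapy 2.1+ replacement for FEED_FORMAT/FEED_URI
FEEDS = {
    'articles.csv': {'format': 'csv'},
}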

5. Run the spider

Once everything is written, check each file carefully for errors.
Then start the spider from a cmd window:

cd /D D:/article/article
scrapy crawl xinwen
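
Alternatively, the spider can be launched from a small Python script placed next to scrapy.cfg; a sketch using Scrapy's CrawlerProcess API:

# run.py, in the same directory as scrapy.cfg
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('xinwen')   # spider name, same as used by `scrapy crawl`
process.start()           # blocks until the crawl finishes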

How the pages were analyzed is not covered here; test and debug the spider in advance, or errors are easy to run into.
When the crawl finishes, a CSV file is generated in the article directory:
(screenshot: the generated CSV file)
Opened, it looks like this:
(screenshot: the CSV contents)
Tip: writing an empty value will also raise an error; you can guard against it with an if check in the pipeline file:

if 'key' in item:
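
For example, a minimal sketch that drops incomplete items (assuming 'title' is the field that may come back empty):

from scrapy.exceptions import DropItem

class ArticlePipeline:
    def process_item(self, item, spider):
        # 'title' in item is True only if the field was actually set
        if 'title' in item and item['title']:
            return item
        raise DropItem('missing title in %r' % item)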

Reposted from blog.csdn.net/qq_17802895/article/details/108524180