Basic Workflow for a Scrapy Spider

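For a single page, requests or urllib is enough to fetch the data, but handling many URLs that way quickly becomes cumbersome. The Scrapy framework can crawl multiple pages in one run.
First, create a project:
scrapy startproject [project name]
Once the command has finished, enter the project folder and create a spider:
scrapy genspider [spider name] [spider domain]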
The newly created spider file is located in the project's spiders directory, named after the spider ([spider name].py).
To save the scraped data locally, we need to write items.py and also modify settings.py and pipelines.py.
Start with items.py:

import scrapy

class ContentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Declare the data to be stored inside this class.
    content = scrapy.Field()  # name of the stored field

Next, write pipelines.py. To store the data in JSON format, import JsonLinesItemExporter:

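from scrapy.exporters import JsonLinesItemExporter  # exporter that writes JSON Lines

class ContentPipeline(object):
    def open_spider(self, spider):
        # Runs automatically before the spider starts.
        # The exporter writes bytes, so open the output file in binary mode.
        self.fp = open('zuowen.json', 'wb')
        # Pass the open file to JsonLinesItemExporter. ensure_ascii=False writes
        # Chinese text as-is rather than as \uXXXX escapes, and encoding='utf-8'
        # sets the file encoding. JsonLinesItemExporter puts each item on its own
        # line, whereas JsonItemExporter writes all items as one JSON array.
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()  # start exporting

    def process_item(self, item, spider):
        # Each item yielded by the spider is passed in as `item`.
        self.exporter.export_item(item)  # write the item to the JSON file
        return item

    def close_spider(self, spider):
        # Runs automatically when the spider finishes (the hook is close_spider).
        self.exporter.finish_exporting()  # finish exporting
        self.fp.close()  # close the output file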
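
To make the difference between the two output layouts and the effect of ensure_ascii more concrete, here is a minimal sketch using only the standard json module rather than Scrapy's exporters. The sample records are invented for the illustration.

```python
import json

# Made-up records standing in for scraped items.
records = [{"content": "第一段"}, {"content": "第二段"}]

# JSON Lines: one object per line, so the file can be read back line by line.
json_lines = "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)

# Single JSON array: the whole file has to be parsed as one document.
json_array = json.dumps(records, ensure_ascii=False)

# With ensure_ascii=True (the json module default), non-ASCII text is escaped.
escaped = json.dumps(records[0], ensure_ascii=True)

print(json_lines, end="")
print(json_array)
print(escaped)

# Reading a JSON Lines document back, one record at a time.
parsed = [json.loads(line) for line in json_lines.splitlines()]
```

The line-per-item layout is convenient for large crawls, since the output can be processed or appended to without loading the whole file.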

Now write the spider. The spider file needs to import the item class from items.py:

import scrapy
from content.items import ContentItem


class ZuowenSpider(scrapy.Spider):
    name = 'zuowen'
    allowed_domains = ['']  # filled in automatically by genspider; hidden here
    start_urls = ['']  # URL(s) where the crawl starts

    def parse(self, response):
        # Select the target text with XPath; getall() returns every match as a list.
        contents = response.xpath("//div[@id='ArtContent']/p/text()").getall()
        for content in contents:  # iterate over the list
            item = ContentItem(content=content)  # store the text under its field name
            yield item
        # get() returns the first matching href, or None if there is no next page.
        next_url = response.xpath("//div[@class='art-foot-relate']/ul/li/a/@href").get()
        if next_url:  # stop when no matching link is found
            # response.follow also accepts relative links; parse() runs again on the next page.
            yield response.follow(next_url, callback=self.parse)

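Item pipelines are disabled by default when a project is created, so nothing is saved until the pipeline is enabled.
In settings.py, uncomment the following: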

ITEM_PIPELINES = {
    'content.pipelines.ContentPipeline': 300,  # remove the leading comment markers to enable
}
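
The number next to each pipeline sets its position in the chain: items pass through pipelines from the lowest value to the highest, and values are customarily chosen in the 0-1000 range. The sketch below illustrates this with a second pipeline, CleanTextPipeline, which is hypothetical and not part of this project.

```python
# Sketch of a settings.py entry with two pipelines.
ITEM_PIPELINES = {
    'content.pipelines.ContentPipeline': 300,    # the pipeline from this post
    'content.pipelines.CleanTextPipeline': 100,  # hypothetical: lower value, so it runs first
}

# Illustration only: the order in which items would pass through the pipelines.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```
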

That completes a small Scrapy spider.
This is a personal summary; if you spot any mistakes, corrections are welcome.
