Rewriting the First Spider

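The first spider was written directly in code right after the project was created with the scrapy command; see the earlier article [scrapy 从第一个爬虫开始] ("Scrapy: Starting from the First Spider"). This post rewrites that program using an item, a pipeline, and file output, to make the overall flow easier to understand.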

I. First, the image.py program

# -*- coding: utf-8 -*-
import scrapy
from image.items import ImageItem
from scrapy.http import Request


class ImageSpider(scrapy.Spider):
    name = 'image'
    allowed_domains = ['xkcd.com']

    base = 'https://xkcd.com/'
    start_urls = ['https://xkcd.com/1']

    def parse(self, response):
        item = ImageItem()
        # Comic title; falls back to '' instead of raising if the node is missing
        item['title'] = response.xpath('//div[@id="ctitle"]/text()').extract_first('')
        item['url'] = ''

        for urlSelector in response.xpath('//div[@id="comic"]'):
            # The image is either a direct child of the div or wrapped in a link
            urls = urlSelector.xpath('img/@src').extract()
            if not urls:
                urls = urlSelector.xpath('a/img/@src').extract()
            if urls:
                # src is protocol-relative, so prepend the scheme
                item['url'] = 'https:' + urls[0]

            if item['title'] and item['url']:
                yield item

        # Navigation links; index 3 is the "Next" link
        nextPageSelector = response.xpath('//div[@id="middleContainer"]/ul[@class="comicNav"]')
        hrefs = nextPageSelector.xpath('li/a/@href').extract()
        if len(hrefs) > 3:
            urlArr = hrefs[3].split('/')
            if len(urlArr) >= 2:
                nextPageUrl = self.base + str(urlArr[1]) + '/'
                # dont_filter=True bypasses the duplicate filter and the offsite check
                yield Request(nextPageUrl, callback=self.parse, dont_filter=True)

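The changes in this code are:
1 The next page's URL is taken from the current page instead of listing every page URL in start_urls, which is closer to how real crawls work.

2 An item is used. An item defines the structure of the data to be scraped; here it has two fields, title and url. The definition lives in items.py; note how it is imported (from image.items import ImageItem).

3 The yield keyword hands the item to the pipeline for cleaning, filtering, or persistence. yield is also used to issue the request for the next page: the first argument is the URL, callback is the function that will handle the response, and dont_filter=True exempts the request from Scrapy's filtering (the scheduler's duplicate filter and the offsite check against allowed_domains), so the next-page request is not dropped.

yield Request(nextPageUrl, callback=self.parse, dont_filter=True)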
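
The next-page step can be tried without running a crawl. The sketch below isolates the string handling applied to the "Next" link's href. next_page_url is a hypothetical helper, not part of the original spider. The check for an empty page number is an added guard, and '#' as the href on the last page is an assumption about the site's markup.

```python
def next_page_url(base, href):
    """Build the absolute URL of the next page from the "Next" link's href.

    Returns None when the href carries no page number
    (assumed here to look like '#' on the last page).
    """
    parts = href.split('/')  # '/2/' -> ['', '2', '']
    if len(parts) >= 2 and parts[1]:
        return base + parts[1] + '/'
    return None


print(next_page_url('https://xkcd.com/', '/2/'))
print(next_page_url('https://xkcd.com/', '#'))
```

Returning None for a placeholder href is what ends the crawl: no further Request is yielded, so the spider finishes once the queue is empty.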

II. The items.py program

# -*- coding: utf-8 -*-

import scrapy
class ImageItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()

III. The pipelines.py program

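# -*- coding: utf-8 -*-

# Define your item pipelines here
import json


class ImagePipeline(object):
    def open_spider(self, spider):
        # Binary mode: write() below must be given bytes
        self.file = open('data.json', 'wb')

    def close_spider(self, spider):
        # Close the file so buffered lines are flushed to disk
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line.encode())
        return item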
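
Scrapy only calls a pipeline class once it is registered in the project's settings.py through ITEM_PIPELINES. The original post does not show this file, so the fragment below is a minimal sketch. It assumes the project package is named image (as the spider's import from image.items suggests) and uses 300 as an arbitrary order value.

```python
# settings.py (fragment)
# Assumption: the project package is `image`, inferred from
# `from image.items import ImageItem` in the spider.
ITEM_PIPELINES = {
    # Values in the 0-1000 range set the order; lower numbers run earlier.
    'image.pipelines.ImagePipeline': 300,
}
```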

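The pipeline is where each item is processed, whether that means filtering or persisting it. Here every item is written to data.json as one JSON object per line. Note that line must be passed through encode() before writing: the file is opened in binary mode ('wb'), so write() accepts only bytes, while json.dumps() returns a str. Writing the str directly raises a TypeError in Python 3.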
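
The str/bytes mismatch can be reproduced outside Scrapy. In this small sketch, io.BytesIO stands in for the file opened with 'wb', and the dictionary holds made-up sample values rather than scraped output.

```python
import io
import json

# Sample values for illustration only
line = json.dumps({'title': 'sample title', 'url': 'https://example.com/a.png'}) + "\n"

binary_file = io.BytesIO()  # behaves like open('data.json', 'wb')
try:
    binary_file.write(line)  # str into a binary stream
    wrote_str = True
except TypeError:
    wrote_str = False  # binary streams reject str

binary_file.write(line.encode())  # bytes are accepted
restored = json.loads(binary_file.getvalue().decode())
print(wrote_str, restored['title'])
```

Opening the file in text mode ('w', with an explicit encoding) would make the encode() call unnecessary. That is an alternative design, not what the pipeline above does.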

With that, the first spider has been rewritten, and the full data flow of a Scrapy crawl should now be clearer.
