I. Create the project:
1. Create a folder named xiaohua on the desktop and open a command window inside it;
2. Run scrapy startproject downimages to create the downimages project.
II. Add a spider module to the project:
In Scrapy, all spider modules live in the spiders folder, so the spider must be created under downimages/spiders.
1. At the command line, run cd downimages/downimages/spiders to enter the spiders directory (startproject nests a package folder with the same name as the project);
2. Then run scrapy genspider DownImages xiaohuar.com, where DownImages is the spider name and xiaohuar.com is the main domain of the xiaohua site. This command generates a DownImages.py file under downimages/spiders.
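The genspider command fills in a minimal spider skeleton for you. The generated DownImages.py looks roughly like the following (the exact boilerplate varies between Scrapy versions):

```python
import scrapy


class DownimagesSpider(scrapy.Spider):
    name = 'DownImages'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://xiaohuar.com/']

    def parse(self, response):
        pass  # page parsing is filled in later, in section IV
```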
III. Define the Item:
An Item is used to extract structured data from unstructured sources. The code in items.py is as follows:
import scrapy


class DownimagesItem(scrapy.Item):
    # define the fields for your item here
    image_urls = scrapy.Field()   # image URL for the pipeline to download
    images = scrapy.Field()       # download results, filled in by the pipeline
    image_paths = scrapy.Field()  # local path(s) of the saved image
    name = scrapy.Field()         # image name, used as the file name
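The field names matter: the ImagesPipeline locates its input and output fields through the IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD settings, so those settings must name fields that exist on the item. A small sketch with a plain dict standing in for the item (a scrapy.Item behaves like a dict; the URL and name values below are made-up examples):

```python
# Plain-dict stand-in for DownimagesItem; no scrapy needed for this sketch.
IMAGES_URLS_FIELD = "image_urls"    # the pipeline reads download URLs from this field
IMAGES_RESULT_FIELD = "images"      # the pipeline writes download results to this field

item = {
    "image_urls": "http://www.xiaohuar.com/d/file/1.jpg",  # set by the spider
    "name": "girl1",                                       # set by the spider
    "images": None,                                        # filled by the pipeline
    "image_paths": None,                                   # filled by item_completed
}

# both settings must point at fields that actually exist on the item
assert IMAGES_URLS_FIELD in item and IMAGES_RESULT_FIELD in item
```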
IV. Write the spider (DownImages.py):
The spider file is responsible for parsing the pages. The code is as follows:
import scrapy
from downimages.items import DownimagesItem  # note how the item is imported


class DownimagesSpider(scrapy.Spider):
    name = 'DownImages'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        imgs = response.xpath("//div[@class='img']/a/img/@src").extract()
        names = response.xpath("//div[@class='img']/a/img/@alt").extract()
        base = "http://www.xiaohuar.com"
        for img, name in zip(imgs, names):
            if not img.startswith("http"):
                img = base + img  # relative src attributes need the site root prepended
            yield DownimagesItem(image_urls=img, name=name)
        # follow the "next page" (下一页) link to reach the remaining list pages
        next_page = response.selector.re(r'<a href="(\S*)">下一页</a>')
        if next_page:
            yield scrapy.Request(url=next_page[0], callback=self.parse)
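Two pieces of the parse logic above are easy to check in isolation with plain Python: prefixing the site root onto relative src values, and the regex used to find the next-page link (the sample URLs below are made-up examples):

```python
import re

BASE = "http://www.xiaohuar.com"

def absolutize(src):
    # relative src attributes (e.g. "/d/file/1.jpg") need the site root prepended
    return src if src.startswith("http") else BASE + src

def next_page(html):
    # the same pattern the spider uses to locate the "next page" (下一页) link
    m = re.search(r'<a href="(\S*)">下一页</a>', html)
    return m.group(1) if m else None

print(absolutize("/d/file/1.jpg"))
print(next_page('<a href="http://www.xiaohuar.com/list-1-1.html">下一页</a>'))
```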
V. Write the pipelines.py file:
pipelines.py is responsible for persisting the data. The code is as follows:
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # name the file after the value passed along in request.meta
        image_guid = request.meta['name'] + '.jpg'
        return 'full/%s' % image_guid

    def get_media_requests(self, item, info):
        # pass the image name to file_path through request.meta
        yield Request(url=item['image_urls'], meta={'name': item['name']})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")  # drop items with no saved image
        item['image_paths'] = image_paths
        return item
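item_completed receives results as a list of (success, info) tuples: info is a dict (with 'url' and 'path' keys) when the download succeeded, and a failure object otherwise. The filtering step can be sketched without scrapy (the URL and path below are made-up examples):

```python
# mock results in the shape ImagesPipeline passes to item_completed
results = [
    (True, {"url": "http://www.xiaohuar.com/d/file/1.jpg", "path": "full/girl1.jpg"}),
    (False, ValueError("download failed")),  # stand-in for a Twisted Failure
]

# keep only the local paths of the successful downloads
image_paths = [info["path"] for ok, info in results if ok]
print(image_paths)
```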
VI. Configure settings.py:
settings.py holds the crawler's configuration. The code is as follows:
# -*- coding: utf-8 -*-

BOT_NAME = 'downimages'

SPIDER_MODULES = ['downimages.spiders']
NEWSPIDER_MODULE = 'downimages.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'downimages (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

ITEM_PIPELINES = {
    'downimages.pipelines.ImgPipeline': 300,
}

IMAGES_STORE = 'images'           # where downloaded images are saved (required by ImagesPipeline)
IMAGES_URLS_FIELD = 'image_urls'  # item field holding the download URL
IMAGES_RESULT_FIELD = 'images'    # item field the pipeline writes results to
IMAGES_EXPIRES = 30               # skip re-downloading files newer than 30 days
VII. Run the spider:
Run scrapy crawl DownImages at the command line (DownImages is the spider's name attribute). After a while, all the images will have been downloaded.
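Once the crawl finishes, the images land under IMAGES_STORE in a full/ subfolder, because of the 'full/%s' prefix returned by file_path. A quick stdlib check of what was downloaded, assuming IMAGES_STORE is set to 'images' in settings.py (the ImagesPipeline requires this setting):

```python
import os

# IMAGES_STORE plus the full/ prefix used by file_path (assumed to be 'images')
store = os.path.join("images", "full")

jpgs = []
if os.path.isdir(store):
    jpgs = sorted(f for f in os.listdir(store) if f.endswith(".jpg"))
print("downloaded images:", len(jpgs))
```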