Item Pipeline Introduction
The item pipeline is responsible for processing the Items that spiders extract from the web; its main tasks are to clean, validate, and store the data.
Once a page has been parsed by a spider, the Item is sent to the item pipeline, where several components process it in a specific order.
Each item pipeline component is a Python class that implements a simple method.
A component receives an Item and runs its method on it; it also decides whether the Item continues on to the next component in the pipeline or is dropped and processed no further.
Typical uses of item pipelines
- cleaning HTML data
- validating scraped data (checking that Items contain the required fields)
- checking for duplicates (and dropping them)
- storing the scraped data in a database
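The responsibilities above can be sketched in a single pipeline class. This is only an illustration: ProductPipeline and its name/price fields are hypothetical, and DropItem is defined locally as a stand-in for scrapy.exceptions.DropItem so the snippet runs without Scrapy installed.

```python
import re

class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""

class ProductPipeline(object):
    def __init__(self):
        self.names_seen = set()

    def process_item(self, item, spider):
        # 1. clean HTML data: strip tags from the name field
        item['name'] = re.sub(r'<[^>]+>', '', item['name']).strip()
        # 2. validate: the item must contain a price
        if item.get('price') is None:
            raise DropItem("Missing price in %s" % item)
        # 3. deduplicate on the cleaned name
        if item['name'] in self.names_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.names_seen.add(item['name'])
        # 4. storing the item in a database would happen here
        return item
```

In a real project each of these steps usually lives in its own pipeline component, so they can be enabled and ordered independently.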
Writing your own Item Pipeline
Each item pipeline component is a Python class that must implement the following method:
process_item(self, item, spider)
This method is called by every item pipeline component. process_item() must either return a dict with data, return an Item (or any subclass) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are not processed by any further pipeline components.
Parameters:
- item (Item object or dict) - the item scraped
- spider (Spider object) - the spider which scraped the item
In addition, pipeline components may also implement the following methods:
open_spider(self, spider)
    # Called when the spider is opened.
    # Parameters: spider (Spider object) - the spider that was opened

close_spider(self, spider)
    # Called when the spider is closed.
    # Parameters: spider (Spider object) - the spider that was closed

from_crawler(cls, crawler)
    # If present, this classmethod is called to create a pipeline instance
    # from a Crawler. It must return a new instance of the pipeline. The
    # Crawler object gives access to all Scrapy core components (such as
    # signals and settings); it is the way for the pipeline to access them
    # and hook its functionality into Scrapy.
    # Parameters: crawler (Crawler object) - the crawler that uses this pipeline
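The three optional methods above often work together; a common pattern is a writer pipeline that opens its output file in open_spider, closes it in close_spider, and reads the output path from settings via from_crawler. The sketch below simulates the Crawler with a plain object so it runs without Scrapy; the JSONLINES_PATH setting name is hypothetical.

```python
import json

class JsonLinesPipeline(object):
    def __init__(self, path):
        self.path = path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # read the output path from the project settings
        # (JSONLINES_PATH is a made-up setting for this example)
        return cls(crawler.settings.get('JSONLINES_PATH', 'items.jl'))

    def open_spider(self, spider):
        # open the file when the spider starts
        self.file = open(self.path, 'w')

    def close_spider(self, spider):
        # close it when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```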
Save scraped items to a JSON file
The following pipeline serializes every scraped item to JSON and writes it to the file items.jl, one item per line:
import json

class JsonWriterPipeline(object):
    def __init__(self):
        # open in text mode so JSON strings can be written directly
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Remove duplicates
Suppose our spider yields items containing duplicate ids; we can filter them out in process_item:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
Activating an Item Pipeline component
To activate an item pipeline component, add its class path to the ITEM_PIPELINES setting in settings.py:
ITEM_PIPELINES = {
    'myproject.pipeline.PricePipeline': 300,
    'myproject.pipeline.JsonWriterPipeline': 800,
}
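The integer values determine the order in which the components run: items pass through the pipelines from the lowest value to the highest, and the values are conventionally kept in the 0-1000 range. A simplified model of how Scrapy drives the chain (the real engine is part of Scrapy; this sketch uses a local DropItem stand-in purely for illustration):

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""

def run_pipelines(item, spider, pipelines):
    # pipelines: dict mapping pipeline instance -> order value,
    # mirroring the shape of the ITEM_PIPELINES setting
    for pipe, _ in sorted(pipelines.items(), key=lambda kv: kv[1]):
        try:
            item = pipe.process_item(item, spider)
        except DropItem:
            return None  # dropped items are not processed further
    return item
```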
The image pipeline
items
Define the field that will hold the image URL:
import scrapy

class ImgpilelineproItem(scrapy.Item):
    # define the fields for your item here like:
    img_src = scrapy.Field()
spider
The spider only extracts the image URL and yields the item to the pipeline for downloading:
import scrapy
from imgPileLinePro.items import ImgpilelineproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://pic.netbian.com/4kmeinv/']
    url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
    page = 2

    def parse(self, response):
        li_list = response.xpath('//*[@id="main"]/div[3]/ul/li')
        for li in li_list:
            img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src').extract_first()
            item = ImgpilelineproItem()
            item['img_src'] = img_src
            yield item

        if self.page <= 2:  # crawl the first two pages
            new_url = self.url % self.page
            self.page += 1
            yield scrapy.Request(url=new_url, callback=self.parse)
pipelines
from scrapy.pipelines.images import ImagesPipeline
import scrapy

# pipeline used to download images
class ImgPileLine(ImagesPipeline):
    # receive the item and issue a request for the img_src stored in it
    def get_media_requests(self, item, info):
        yield scrapy.Request(url=item['img_src'])

    # specify where the data is stored: the folder set in the settings
    # file plus the image name returned by this method
    def file_path(self, request, response=None, info=None):
        img_name = request.url.split('/')[-1]
        return img_name

    # pass the item on to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item
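The file_path override above names each saved image after the last path segment of its request URL; Scrapy prepends the IMAGES_STORE folder from the settings. The naming rule in isolation:

```python
# keep only the last URL segment as the file name,
# matching the logic used in file_path above
def image_file_name(url):
    return url.split('/')[-1]
```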
settings configuration
# directory where downloaded image files are stored (created automatically)
IMAGES_STORE = './imgsLib'

# enable the pipeline
ITEM_PIPELINES = {
    'imgPileLinePro.pipelines.ImgPileLine': 300,
}

Note that ImagesPipeline requires the Pillow imaging library to be installed.