Scrapy framework: Item Pipeline

Item Pipeline Introduction

The Item Pipeline is responsible for processing the Items that spiders extract from the web; its main tasks are to clean, validate, and store the data.
After a spider has parsed a page, the Items it produced are passed to the pipeline and processed by several components in a specific order.
Each Item Pipeline component is a simple Python class that implements a few methods.
Each component receives an Item, runs its own processing on it, and then decides whether the Item continues to the next pipeline component or is dropped and not processed any further.

Typical uses of an Item Pipeline

Clean up HTML data
Validate the scraped data (check that Items contain the required fields); a sketch of this case follows the list
Check for duplicates (and drop repeated items)
Store the scraped data in a database
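As a sketch of the validation case above (the field names name and price, and the class name, are illustrative assumptions, not taken from this article), a pipeline that drops incomplete items might look like this:

from scrapy.exceptions import DropItem

class RequiredFieldsPipeline(object):
    # hypothetical list of fields every item must carry
    required_fields = ('name', 'price')

    def process_item(self, item, spider):
        for field in self.required_fields:
            # item.get() works for both dicts and scrapy.Item objects
            if not item.get(field):
                raise DropItem("Missing %s in %s" % (field, item))
        return item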

Writing your own Item Pipeline

Each Item Pipeline component is a Python class that must implement the following method:

process_item(self, item, spider)

This method is called for every item that passes through each pipeline component. process_item() must either return a dict with data, return an Item (or any subclass) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are not processed by any further pipeline components.

Parameters:

  • item (Item object or dict) - the scraped item
  • spider (Spider object) - the spider which scraped the item

Additionally, pipeline components may implement the following methods:

# This method is called when the spider is opened.
open_spider(self, spider)    # parameter spider: the spider that was opened

# This method is called when the spider is closed.
close_spider(self, spider)   # parameter spider: the spider that was closed

# If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object gives access to all Scrapy core components (such as signals and settings); it is the way for the pipeline to access them and hook its functionality into Scrapy.
from_crawler(cls, crawler)   # parameter crawler (Crawler object): the crawler that uses this pipeline
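As a hedged sketch of how these hooks fit together (the setting name OUTPUT_FILE and the class name are made up for illustration), a pipeline can read its configuration in from_crawler and manage a resource in open_spider/close_spider:

from scrapy.exceptions import NotConfigured

class SettingsAwarePipeline(object):

    def __init__(self, path):
        self.path = path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # read a (hypothetical) setting from the crawler's settings object
        path = crawler.settings.get('OUTPUT_FILE')
        if not path:
            raise NotConfigured('OUTPUT_FILE is not set')
        return cls(path)

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open(self.path, 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(str(dict(item)) + "\n")
        return item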

Saving scraped items to a JSON file

Serialize the items scraped by the spider to JSON and write them to the file items.jl, one item per line

import json

class JsonWriterPipeline(object):

  def __init__(self):
    self.file = open('items.jl', 'w')

  def close_spider(self, spider):
    # close the output file when the spider finishes
    self.file.close()

  def process_item(self, item, spider):
    # serialize the item as one JSON object per line
    line = json.dumps(dict(item)) + "\n"
    self.file.write(line)
    return item
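As a side note, if the only goal is to dump all items to a JSON lines file, Scrapy's built-in feed exports can do it without a custom pipeline. A minimal sketch of the settings.py entry, assuming a Scrapy version recent enough to support the FEEDS setting (2.1+):

# settings.py - export every scraped item to a JSON lines file
FEEDS = {
    'items.jl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    },
}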

Remove duplicates

Suppose the spider returns items containing duplicate ids; we can filter out the repeats in process_item

from scrapy.exceptions import DropItem 
  
class DuplicatesPipeline(object): 
  
  def __init__(self): 
    self.ids_seen = set() 
  
  def process_item(self, item, spider): 
    if item['id'] in self.ids_seen: 
      raise DropItem("Duplicate item found: %s" % item) 
    else: 
      self.ids_seen.add(item['id']) 
      return item

Activating an Item Pipeline component

To activate an Item Pipeline component, add its class path to the ITEM_PIPELINES setting in settings.py. The integer values determine the order in which the components run: items pass from lower-valued to higher-valued classes, and the values are conventionally chosen in the 0-1000 range.

ITEM_PIPELINES = { 
  'myproject.pipeline.PricePipeline': 300, 
  'myproject.pipeline.JsonWriterPipeline': 800, 
}
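ITEM_PIPELINES set in settings.py applies to every spider in the project; if a pipeline should only run for one spider, it can be overridden per spider via the spider's custom_settings attribute. A minimal sketch (the spider name and URL are placeholders):

import scrapy

class JsonOnlySpider(scrapy.Spider):
    name = 'json_only'
    start_urls = ['http://example.com']

    # this spider uses only the JSON writer pipeline, regardless of the project-wide setting
    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipeline.JsonWriterPipeline': 800,
        },
    }

    def parse(self, response):
        yield {'url': response.url}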

The Images Pipeline

items

Define the item field that stores the image URL

import scrapy

class ImgpilelineproItem(scrapy.Item):
    # define the fields for your item here like:
    img_src = scrapy.Field()

spider

The spider only extracts the image URLs and yields the items so that the pipeline can download them

import scrapy
from imgPileLinePro.items import ImgpilelineproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://pic.netbian.com/4kmeinv/']
    url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
    page = 2

    def parse(self, response):
        li_list = response.xpath('//*[@id="main"]/div[3]/ul/li')
        for li in li_list:
            img_src = 'http://pic.netbian.com'+li.xpath('./a/img/@src').extract_first()
            item = ImgpilelineproItem()
            item['img_src'] = img_src

            yield item

        if self.page <= 2:  # crawl only the first two pages
            new_url = format(self.url%self.page)
            self.page += 1
            yield scrapy.Request(url=new_url,callback=self.parse)

pipelines

from scrapy.pipelines.images import ImagesPipeline
import scrapy

# pipeline used to download the images
class ImgPileLine(ImagesPipeline):
    # receive the item and issue a request for the URL stored in img_src
    def get_media_requests(self, item, info):
        yield scrapy.Request(url=item['img_src'])

    # specify where to store the data (folder from the settings file + the image name returned here)
    def file_path(self, request, response=None, info=None):
        img_name = request.url.split('/')[-1]
        return img_name

    # pass the item on to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item
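The results argument that item_completed receives is a list of (success, info) tuples, where info carries the storage path of each downloaded image. As an optional, hypothetical extension (it assumes an extra img_path field is added to the item class), the path could be written back onto the item:

from scrapy.pipelines.images import ImagesPipeline

class ImgPathPipeline(ImagesPipeline):
    # same as above, but also records where each image was stored
    def item_completed(self, results, item, info):
        # keep the paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if image_paths:
            item['img_path'] = image_paths[0]  # img_path is a hypothetical extra field
        return item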

settings configuration

# specify the download directory for the images (the folder is created automatically)
IMAGES_STORE = './imgsLib'
# enable the pipeline
ITEM_PIPELINES = {
    'imgPileLinePro.pipelines.ImgPileLine': 300,
}
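Note that the Images Pipeline depends on the Pillow library being installed. Besides IMAGES_STORE, it also supports a few optional settings; the values below are only examples:

# generate thumbnails alongside the full-size downloads
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
# skip images smaller than these dimensions (in pixels)
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
# do not re-download images fetched within the last 90 days
IMAGES_EXPIRES = 90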

 
