Scrapy Pipeline

Scrapy's Pipeline gives us a place to process scraped data; in real development we often use it for data cleaning, validation, de-duplication, and storage. A project can contain multiple Pipelines, and each Pipeline is a class with methods for processing Items. Items are passed through the Pipelines in order; if one Pipeline drops an Item, the Pipelines after it will never receive that Item and will not run on it.

Zero, Custom Pipeline

Writing a custom Pipeline is actually very simple: you just need to implement the methods described below.

  1. process_item(self, item, spider)
  • Explanation: this method must be implemented; all of the data processing work happens here. It must return a dict, an Item, or a Twisted Deferred, or raise a DropItem exception.
  • Parameters:
    • item: the scraped Item;
    • spider: the Spider that scraped the Item.

Tip: if an Item is dropped in the process_item method, that Item will not be delivered to the subsequent Pipelines.
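To make process_item concrete, here is a minimal de-duplication Pipeline sketch (the class name and the 'id' field are illustrative, not from the original article). DropItem is Scrapy's real exception from scrapy.exceptions; a stand-in is defined so the sketch also runs where Scrapy is not installed:

```python
try:
    from scrapy.exceptions import DropItem  # Scrapy's real "drop this item" exception
except ImportError:
    class DropItem(Exception):  # stand-in so this sketch runs without Scrapy installed
        pass


class DuplicatesPipeline:
    """Drops items whose 'id' field has already been seen (illustrative sketch)."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if item['id'] in self.seen_ids:
            # Raising DropItem stops this item from reaching later Pipelines.
            raise DropItem(f"Duplicate item found: {item['id']}")
        self.seen_ids.add(item['id'])
        return item  # returned items continue on to the next Pipeline
```

Returning the item passes it along the chain; raising DropItem is exactly the "drop" case described in the tip above.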

  2. open_spider(self, spider)
  • Explanation: executed when the spider starts running; initialization work, such as opening a database connection or opening files, is done in this method.
  • Parameters:
    • spider: the Spider currently in use
  3. close_spider(self, spider)
  • Explanation: executed when the spider is closed; cleanup work, such as closing the database connection or closing files, is done in this method.
  • Parameters:
    • spider: the Spider currently in use
  4. from_crawler(cls, crawler)
  • Explanation: a class method that creates and returns a Pipeline instance from the Crawler. Through the crawler we can access all of Scrapy's core components, such as the settings.
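Putting these methods together, a sketch of a JSON-lines export Pipeline might look like the following (the FILE_PATH setting name and the output format are illustrative assumptions, not Scrapy built-ins):

```python
import json


class JsonLinesPipeline:
    """Writes each item as one JSON line to a file (illustrative sketch)."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # Build the Pipeline from the crawler, which exposes Scrapy's core
        # components; here we only read a (hypothetical) FILE_PATH setting.
        return cls(crawler.settings.get('FILE_PATH', 'items.jl'))

    def open_spider(self, spider):
        # Initialization work: open the output file once, when the spider starts.
        self.file = open(self.file_path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Cleanup work: close the file when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

Note that a Pipeline is a plain Python class: Scrapy simply calls these methods at the right moments in the spider's life cycle.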

One, Special Pipelines

In some projects we want to crawl not only page data but also files or images, and save them locally. For this, Scrapy provides two special Pipelines: FilesPipeline and ImagesPipeline. They share some methods and structure, and together they are referred to as MediaPipeline. FilesPipeline and ImagesPipeline both provide the following features:

  • Avoiding re-downloading data that has already been downloaded
  • Specifying the storage location

ImagesPipeline additionally provides the following features:

  • Converting all images to JPG format and RGB mode
  • Generating thumbnails
  • Filtering out images that do not meet a minimum width/height

Tip: the way these Pipelines avoid duplicates is by putting the URLs of files to be downloaded into a queue and associating them with their Responses, so the same file is not downloaded again.

  1. FilesPipeline

The workflow of FilesPipeline for downloading files is very simple, with four steps in total:

  • The spider saves the URLs of the files it wants to download into the Item's file_urls field;
  • The Item returned by the spider enters the Pipeline;
  • When the Item reaches FilesPipeline, the URLs in file_urls are scheduled and downloaded by the built-in scheduler and downloader. During this time the Item is locked, and it is unlocked only when every file has finished downloading or failed with an error;
  • After the download completes, the results are stored in the Item's files field, a list in which each element is a dict.
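The steps above only require enabling the built-in Pipeline and giving items the expected fields. A minimal wiring sketch (the storage folder and the example URL are illustrative):

```python
# settings.py -- enable the built-in FilesPipeline and choose where files go.
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloads'  # downloaded files are saved under this directory

# In the spider, each yielded item carries the URLs in `file_urls`;
# after downloading, Scrapy fills in the `files` field automatically.
item = {
    'file_urls': ['http://example.com/report.pdf'],  # input: URLs to fetch
    'files': [],  # output: list of dicts ('url', 'path', 'checksum', ...)
}
```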
  2. ImagesPipeline

ImagesPipeline inherits from FilesPipeline, which means most of its steps are the same as FilesPipeline's. The only difference is that ImagesPipeline reads the URLs of the images to download from the image_urls field and, when the download completes, saves the results to the images field.
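A corresponding sketch for ImagesPipeline, covering the thumbnail and minimum-size features mentioned above (the directory name and sizes are illustrative; ImagesPipeline also requires the Pillow library at runtime):

```python
# settings.py -- enable the built-in ImagesPipeline (requires Pillow).
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images'   # storage directory for downloaded images
IMAGES_THUMBS = {         # generate two thumbnail sizes for every image
    'small': (50, 50),
    'big': (270, 270),
}
IMAGES_MIN_WIDTH = 110    # skip images smaller than this
IMAGES_MIN_HEIGHT = 110

# Items use image_urls / images instead of file_urls / files.
item = {'image_urls': ['http://example.com/pic.jpg'], 'images': []}
```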

Tip: these are not the only special Pipelines; FilesPipeline and ImagesPipeline are just the most commonly used, which is why they are explained here. For more of Scrapy's built-in Pipelines, see the official documentation.

After writing a Pipeline, we need to register it in the settings.py file so that the Pipeline we wrote is injected into Scrapy:

ITEM_PIPELINES = {
    # key: the import path of the custom Pipeline class (illustrative here);
    # value: its priority, an integer from 0 to 1000 (lower runs first)
    'myproject.pipelines.CustomPipeline': 300,
}

Two, Summary

This article mainly covered the theoretical side of Pipelines. Although it is short, this is the core knowledge of Pipelines. In the next article I will demonstrate how to use Pipelines in code.


Origin blog.csdn.net/gangzhucoll/article/details/104046855