Crawler (XVI): Scrapy Framework (3) Spider Middleware, Item Pipeline, and Integrating with Selenium

1. Spider Middleware

Spider Middleware is a hook framework that plugs into Scrapy's Spider-handling mechanism.

After the Downloader generates a Response, the Response is sent to the Spider. Before it reaches the Spider, it first passes through Spider Middleware for processing. Likewise, when the Spider generates Items and Requests, those Items and Requests pass through Spider Middleware before going any further.

Spider Middleware has three functions:

  • Process a Response before it is sent to the Spider, i.e., after the Downloader generates it.
  • Process a Request generated by the Spider before it is sent to the Scheduler.
  • Process an Item generated by the Spider before it is sent to the Item Pipeline.

1.1 Usage Notes

Note that Scrapy already ships with many Spider Middleware components, which are defined in the SPIDER_MIDDLEWARES_BASE variable.

The contents of the SPIDER_MIDDLEWARES_BASE variable are as follows:

{
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}

Like Downloader Middleware, a custom Spider Middleware is first added to the SPIDER_MIDDLEWARES setting, which is then merged with the SPIDER_MIDDLEWARES_BASE setting defined inside Scrapy. The merged result is sorted by the numeric priority values to obtain an ordered list: the first Middleware is the one closest to the engine, and the last Middleware is the one closest to the Spider.
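As a concrete sketch, enabling a custom Spider Middleware might look like the snippet below in settings.py. The module path myproject.middlewares.CustomSpiderMiddleware and the priority 543 are illustrative assumptions, not Scrapy defaults:

# settings.py
SPIDER_MIDDLEWARES = {
    # 543 places this middleware between the built-in OffsiteMiddleware (500)
    # and RefererMiddleware (700) once merged with SPIDER_MIDDLEWARES_BASE.
    'myproject.middlewares.CustomSpiderMiddleware': 543,
}

# A built-in middleware can be disabled by setting its value to None:
# SPIDER_MIDDLEWARES = {
#     'scrapy.spidermiddlewares.referer.RefererMiddleware': None,
# }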

1.2 Core Methods

Scrapy's built-in Spider Middleware provides basic functionality for Scrapy. If we want to extend this functionality, we only need to implement a few methods.

Each Spider Middleware class defines one or more methods; the following four are the core ones.

  • process_spider_input(response, spider)
  • process_spider_output(response, result, spider)
  • process_spider_exception(response, exception, spider)
  • process_start_requests(start_requests, spider)

Implementing just one of these methods is enough to define a Spider Middleware.

(1) process_spider_input(response, spider)

When a Response is processed by Spider Middleware, the process_spider_input() method is called.

The process_spider_input() method has the following two parameters:

response, a Response object, i.e., the Response being processed.

spider, a Spider object, i.e., the Spider that corresponds to the Response.

process_spider_input() should return None or raise an exception.

If it returns None, Scrapy continues processing the Response, calling all the other Spider Middleware, until the Spider itself processes the Response.

If it raises an exception, Scrapy will not call the process_spider_input() method of any other Spider Middleware; instead, it calls the errback() method of the Request. The output of the errback is fed back into the middleware chain, where it is handled by the process_spider_output() methods; if the errback raises an exception, process_spider_exception() is called.
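As a minimal sketch, the middleware below uses process_spider_input() to reject responses whose body looks truncated; the class name and the size threshold are assumptions made for this example:

class MinSizeMiddleware:
    """Hypothetical middleware: reject suspiciously small responses."""

    MIN_BODY_SIZE = 100  # bytes; an arbitrary threshold for this example

    def process_spider_input(self, response, spider):
        if len(response.body) < self.MIN_BODY_SIZE:
            # Raising here stops further process_spider_input() calls and
            # triggers the Request's errback, as described above.
            raise ValueError('Body too small: %s' % response.url)
        # Returning None lets Scrapy continue processing the Response.
        return None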

(2) process_spider_output(response, result, spider)

When the Spider returns a result after processing a Response, the process_spider_output() method is called. The process_spider_output() method has the following three parameters:

response, a Response object, i.e., the Response that generated this output.

result, an iterable containing Request or Item objects, i.e., the result returned by the Spider.

spider, a Spider object, i.e., the Spider whose result is being processed.

process_spider_output() must return an iterable containing Request or Item objects.
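For illustration, here is a sketch of a middleware that annotates the Spider's output as it passes through; the 'source_url' field name is our own invention:

class AnnotateOutputMiddleware:
    """Hypothetical middleware: tag each dict item with its source page."""

    def process_spider_output(self, response, result, spider):
        for obj in result:
            # 'source_url' is an illustrative field name; for scrapy.Item
            # objects the field would have to be declared on the Item class.
            if isinstance(obj, dict):
                obj['source_url'] = response.url
            # Everything we want to keep must be yielded back out, since the
            # method has to return an iterable of Request or Item objects.
            yield obj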

(3) process_spider_exception(response, exception, spider)

When the Spider or another Spider Middleware's process_spider_input() method raises an exception, the process_spider_exception() method is called.

The process_spider_exception() method has the following three parameters:

response, a Response object, i.e., the Response being processed when the exception was raised.

exception, an Exception object, i.e., the exception that was raised.

spider, a Spider object, i.e., the Spider that raised the exception.

process_spider_exception() must either return None or return an iterable containing Response or Item objects.

If it returns None, Scrapy continues handling the exception, calling the process_spider_exception() methods of the other Spider Middleware, until all Spider Middleware have been called.

If it returns an iterable, the process_spider_output() methods of the other Spider Middleware are called, and no other process_spider_exception() methods will be called.
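A minimal sketch of this hook, assuming we simply want to log errors and swallow them instead of propagating them further:

class SwallowErrorsMiddleware:
    """Hypothetical middleware: log exceptions, then return an empty result."""

    def process_spider_exception(self, response, exception, spider):
        spider.logger.warning(
            'Error on %s: %r, dropping this response', response.url, exception)
        # Returning an (empty) iterable means the remaining middlewares'
        # process_spider_output() methods run, and no further
        # process_spider_exception() methods are called.
        return []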

(4) process_start_requests(start_requests, spider)

The process_start_requests() method is called with the start Requests of the Spider as its argument. Its execution is similar to that of process_spider_output(), except that it is not associated with a Response and must return only Requests.

The process_start_requests() method has the following two parameters:

start_requests, an iterable containing Request objects, i.e., the Start Requests.

spider, a Spider object, i.e., the Spider to which the Start Requests belong.

process_start_requests() must return another iterable containing Request objects.
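For example, a middleware could stamp every start Request with extra metadata, as in the sketch below; the meta key 'is_start_request' is an assumption made for this example:

class TagStartRequestsMiddleware:
    """Hypothetical middleware: mark each start Request in its meta dict."""

    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            request.meta['is_start_request'] = True
            # Only Request objects may be yielded here; there is no
            # associated Response at this stage.
            yield request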

2. Item Pipeline

Item Pipeline means "project pipeline".

The Item Pipeline is called after the Spider generates an Item. When the Spider finishes parsing a Response, the Item is passed to the Item Pipeline, and the Item Pipeline components defined there are called in turn to perform a series of processing steps, such as data cleaning and storage.

Item Pipeline has four main uses:

  • Clean up HTML data.
  • Validate the crawled data and check the crawled fields.
  • Check for and discard duplicate content.
  • Save the crawled results to a database.

We can customize the Item Pipeline; we only need to implement the specified methods. The one method that must be implemented is process_item(item, spider).

In addition, there are several other practical methods:

  • open_spider(spider)
  • close_spider(spider)
  • from_crawler(cls, crawler)

Below we describe the usage of these methods in detail.

(1) process_item(item, spider)

The process_item() method must be implemented; the Item Pipeline calls this method by default to process Items. For example, we can perform data processing here, or write the data to a database. The method must return a value of Item type or raise a DropItem exception.

The process_item() method has the following two parameters:

item, an Item object, i.e., the Item being processed.

spider, a Spider object, i.e., the Spider that generated the Item.

The return behavior of the process_item() method is summarized as follows:

If it returns an Item object, this Item is passed to the process_item() method of the next lower-priority Item Pipeline, until all the methods have been called.

If it raises a DropItem exception, this Item is discarded and receives no further processing.
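A minimal sketch exercising both return behaviors: the pipeline truncates an overly long text field and drops items missing it. The 'text' field and the 50-character limit are assumptions made for this example:

from scrapy.exceptions import DropItem

class TextPipeline:
    """Hypothetical pipeline: truncate a 'text' field, drop empty items."""

    limit = 50  # arbitrary length limit for this example

    def process_item(self, item, spider):
        text = item.get('text')
        if not text:
            # Raising DropItem discards the item; later pipelines never see it.
            raise DropItem('Missing text in %s' % item)
        if len(text) > self.limit:
            item['text'] = text[:self.limit].rstrip() + '...'
        # Returning the item passes it on to the next lower-priority
        # pipeline's process_item().
        return item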

(2) open_spider(self, spider)

The open_spider() method is called automatically when the Spider is opened. Here we can perform some initialization, such as opening a database connection. The parameter spider is the Spider object being opened.

(3) close_spider(spider)

The close_spider() method is called automatically when the Spider is closed. Here we can do some finishing work, such as closing database connections. The parameter spider is the Spider object being closed.

(4) from_crawler(cls, crawler)

The from_crawler() method is a class method, marked with @classmethod; it is a form of dependency injection. Its argument is crawler, and through the crawler object we can access every core component of Scrapy, such as the global settings, and then create a Pipeline instance. The cls parameter is the Pipeline class itself, and the method finally returns an instance of that class.
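Putting the four methods together, here is a sketch of a pipeline that stores items in MongoDB. It assumes pymongo is installed and that MONGO_URI and MONGO_DB are defined in settings.py; both setting names are our own convention, not built-in Scrapy settings:

import pymongo

class MongoPipeline:
    """Sketch of a storage pipeline using all four methods described above."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Dependency injection: read configuration from the global settings
        # via the crawler object, then return a new pipeline instance.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        # Called when the Spider opens: set up the database connection.
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Called when the Spider closes: release the connection.
        self.client.close()

    def process_item(self, item, spider):
        # Store each item in a collection named after the spider.
        self.db[spider.name].insert_one(dict(item))
        return item

Like any pipeline, it would still have to be enabled through the ITEM_PIPELINES setting before Scrapy will call it.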
