Incremental crawlers and distributed crawlers

First, a few words about scrapy-redis

  The main job of the Scrapy_Redis options configured in the settings file is to replace Scrapy's original scheduler through the SCHEDULER setting. When the program runs, Scrapy reads the configuration from the settings file and, based on that configuration, hands scheduling over to Scrapy_Redis, so the Scheduler and Spider used by the whole project are the ones defined by Scrapy_Redis. This is what enables distributed crawling.
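
A minimal sketch of the settings typically used to switch a Scrapy project over to Scrapy_Redis (the setting names come from the scrapy-redis package; the Redis URL below assumes a local Redis instance and should be adjusted for a real deployment):

```python
# settings.py -- hand scheduling and deduplication over to scrapy-redis

# Use the scrapy-redis scheduler instead of Scrapy's default one
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests through Redis so all workers share one fingerprint set
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and fingerprint set in Redis between runs
# (this persistence is what makes incremental crawling possible)
SCHEDULER_PERSIST = True

# Optionally push scraped items into Redis as well
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Connection to the shared Redis server (assumed local instance)
REDIS_URL = "redis://127.0.0.1:6379"
```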

 

The principle of the Scrapy_Redis distributed crawler is that the URL and response content of each request are saved in the Redis database. If the URL of the current request has already been saved, Scrapy does not send the HTTP request again but moves straight on to the next request. This keeps the distributed crawler from crawling the same pages repeatedly and thereby ensures the uniqueness of the data.
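
Conceptually, the deduplication comes down to keeping a set of request fingerprints in Redis and skipping any request whose fingerprint is already in the set. A rough sketch of that idea (the key name and the SHA-1 fingerprint are illustrative, not scrapy-redis internals):

```python
import hashlib

import redis

r = redis.StrictRedis(host="127.0.0.1", port=6379)

def seen_before(url):
    """Return True if this URL was already crawled by any worker.

    SADD returns 0 when the member already exists and 1 when it was added,
    so a single round trip both checks and records the fingerprint.
    """
    fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return r.sadd("dupefilter:fingerprints", fingerprint) == 0

# Only send the HTTP request when the URL has not been seen yet
if not seen_before("https://example.com/page/1"):
    pass  # schedule / send the request here
```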

  From the principle of the distributed crawler a further use can be derived: the incremental crawler. When part of a site's data has already been stored, an incremental crawler does not re-crawl that duplicated data on the next run; it only crawls the data that has not yet been saved in the database. Scrapy_Redis can therefore serve both as a distributed crawler and as an incremental crawler. In addition, an incremental crawler can also be developed in a plain Scrapy project, without relying much on Scrapy_Redis.

  There are two ways to develop an incremental crawler: pipeline-based incremental crawlers and middleware-based incremental crawlers. The two are implemented in different files of a Scrapy project.

 

  The pipeline-based incremental crawler checks, in the process_item() method of the pipelines file, whether the data already exists in the Redis database. If it exists, no database storage is performed; otherwise the current data is written to both Redis and the target database.

  The middleware-based incremental crawler is defined in the middlewares file. It checks whether the URL of the current request already exists in the Redis database; if it does, the current request is skipped and the next request is executed directly, otherwise the current URL is written into the Redis database and the request is carried out.

 

 

Pipeline-based incremental crawler

 

  The pipeline-based implementation is essentially how Scrapy_Redis's pipelines are used on their own. Its biggest advantage is that it still visits URLs that have already been crawled and fetches their data, so the stored data can be updated dynamically as the site changes.

  Scenario:

    Crawling rankings, forum posts, and other pages whose content changes over time;

  Disadvantages:

    It repeatedly accesses the same URLs; if the site's data is fixed, this wastes network resources and also increases the risk of triggering the site's anti-crawler detection mechanisms.

  To implement an incremental pipeline crawler you can use the RedisPipeline class from Scrapy_Redis's pipelines file, but that brings in other Scrapy_Redis files as well. To keep things simple, you can just write the corresponding logic in the project's own pipelines file following the same principle, as in the sketch below.
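
A minimal sketch of such a pipeline, written directly in the project's pipelines.py. The Redis key name, the item fingerprinting, and the MongoDB target database are assumptions for illustration, not fixed by Scrapy or scrapy-redis:

```python
# pipelines.py -- incremental pipeline: only store items not yet seen in Redis
import hashlib
import json

import pymongo
import redis


class IncrementalPipeline:
    def open_spider(self, spider):
        self.redis = redis.StrictRedis(host="127.0.0.1", port=6379)
        self.mongo = pymongo.MongoClient("mongodb://127.0.0.1:27017")
        # Assumed target database and collection names
        self.collection = self.mongo["spider_db"]["items"]

    def process_item(self, item, spider):
        # Fingerprint the whole item so changed content counts as new data
        data = json.dumps(dict(item), sort_keys=True, ensure_ascii=False)
        fingerprint = hashlib.sha1(data.encode("utf-8")).hexdigest()

        # SADD returns 0 if the fingerprint already exists: skip storage then;
        # otherwise record it in Redis and write the item to the target database
        if self.redis.sadd("items:fingerprints", fingerprint):
            self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.mongo.close()
```

The pipeline would then be enabled through ITEM_PIPELINES in settings.py like any other Scrapy pipeline.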

 

 

Middleware-based incremental crawler

 

  A middleware-based incremental crawler checks the URL of a request before the HTTP request is sent. If that URL has already been requested before, the HTTP request is not carried out, which avoids repeated access to the same URL. The custom middleware therefore mainly judges the URL of the current request and handles it differently depending on the result.
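
A minimal sketch of such a downloader middleware, checking the request URL against a Redis set before the request goes out. The Redis key name and connection details are assumptions for illustration:

```python
# middlewares.py -- skip requests whose URL has already been crawled
import redis
from scrapy.exceptions import IgnoreRequest


class IncrementalMiddleware:
    def __init__(self):
        self.redis = redis.StrictRedis(host="127.0.0.1", port=6379)

    def process_request(self, request, spider):
        # SADD returns 0 when the URL is already in the set: drop this request
        if self.redis.sadd("requests:seen_urls", request.url) == 0:
            raise IgnoreRequest(f"already crawled: {request.url}")
        # Returning None lets Scrapy continue and actually send the request
        return None
```

The class would be registered under DOWNLOADER_MIDDLEWARES in settings.py; raising IgnoreRequest is Scrapy's standard way for a downloader middleware to drop a request.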

 

 

 

Summary

  A pipeline-based incremental crawler checks, in the process_item() method of the pipelines file, whether the data is already stored in the Redis database. If it is, no database storage is needed; otherwise the current data is written to both Redis and the target database.

  A middleware-based incremental crawler is defined in the middlewares file. It checks whether the URL of the current request already exists in the Redis database; if it does, the current request is skipped and the next request is executed directly, otherwise the current URL is written into the Redis database and the request is carried out.

 
