Using scrapy-redis to build a distributed crawler

 Introduction to scrapy-redis

   scrapy-redis is a redis-based component for the scrapy framework, used to develop and deploy distributed scrapy projects.

  It has the following characteristics:

    Distributed crawling:

      You can start multiple spider instances that share a single requests queue among themselves, which is best suited for broad crawls spanning multiple domains.

  Distributed data processing:

    Items scraped by scrapy can be pushed into a redis queue, which means you can start as many item processors as needed, sharing the item queue and persisting the items (see the sketch after this list).

  Plug-and-play scrapy components:

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders
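
As an illustration of the distributed item processing described above, here is a minimal sketch of a standalone item consumer built on redis-py. It is an addition for illustration, not part of the original post: the key name tencent:items, the local redis connection, and the print placeholder are all assumptions.

# item_processor.py -- minimal sketch of a standalone item consumer (assumed names)
import json

import redis

r = redis.Redis(host='127.0.0.1', port=6379)

def process_items(key='tencent:items'):
    """Pop JSON-serialized items pushed by RedisPipeline and persist them."""
    while True:
        # blpop blocks until an item is available and returns (key, value)
        _, data = r.blpop(key)
        item = json.loads(data)
        # replace this with real persistence (database insert, file write, ...)
        print(item)

if __name__ == '__main__':
    process_items()

Any number of such processors can run in parallel against the same redis list.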

scrapy-redis architecture

 

1. First, the Slaver side fetches tasks (Requests, urls) from the Master side for data crawling; while the Slaver is crawling, the Requests for any new tasks it generates are submitted back to the Master for processing.

2. The Master side has only a Redis database, which is responsible for deduplicating Requests and allocating tasks, adding processed Requests to the queue to be crawled, and storing the crawled data. Scrapy-redis uses this strategy by default, and it is very simple to implement, because scrapy-redis has already done the task scheduling and related work for us; we only need to inherit from RedisSpider and specify a redis_key.

Its weakness is:

  The tasks scheduled by scrapy-redis are Request objects, which carry a large amount of information (not only the url, but also the callback, headers, and so on).

  The likely result is that the crawler slows down and redis takes up a lot of storage space, so a certain level of hardware is required if you want to maintain efficiency.

scrapy-redis installation

  It can be installed via pip: pip install scrapy-redis

  It generally requires python, redis, and scrapy to be installed.

  Official documentation: https://scrapy-redis.readthedocs.io/en/stable/

  Source code: https://github.com/rmax/scrapy-redis

  Reference blog: https://www.cnblogs.com/kylinlin/p/5198233.html

Common scrapy-redis configuration

  The following common configuration options are generally added to the configuration file:

  1. (required) Use the scrapy-redis deduplication component, so that deduplication is done in the redis database:

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

  2. (required) Use the scrapy-redis scheduler, which allocates requests in redis:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"

  3. (optional) Persist the queues that scrapy-redis keeps in redis, which allows pausing and resuming crawls, i.e. the requests in redis are not cleaned up:

SCHEDULER_PERSIST = True

  4. (required) Configure RedisPipeline so that items are written into the redis list whose key is spider.name:items, for distributed item processing later on. This is already implemented by scrapy-redis; no code needs to be written and it can be used directly:

ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 100,
}

  5. (required) Specify the connection parameters of the redis database:

REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

 

Introduction to scrapy-redis keys

  scrapy-redis stores its data in the form of key-value pairs; the common keys are the following (a short inspection sketch follows the list):

  1. "project name:items" -> list type, which stores the item data obtained by the crawler; each entry is a json string.

  2. "project name:dupefilter" -> set type, used to deduplicate the urls visited by the crawler; each entry is a 40-character url hash string.

  3. "project name:start_urls" -> list type, used to get the initial url(s) when the spider starts crawling for the first time.

  4. "project name:requests" -> zset type, used by the Scheduler to schedule requests; each entry is a serialized Request object (a string).
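
For quick debugging, the following is a minimal sketch (an addition, not from the original post) that inspects these keys with redis-py; the key names assume the project/spider is called tencent and redis runs locally on the default port:

# inspect_keys.py -- sketch for inspecting the redis keys used by scrapy-redis
import redis

r = redis.Redis(host='127.0.0.1', port=6379)

print(r.llen('tencent:items'))                # scraped items waiting in the list
print(r.scard('tencent:dupefilter'))          # request fingerprints in the dedup set
print(r.zcard('tencent:requests'))            # pending requests in the scheduler zset
print(r.lrange('tencent:start_urls', 0, -1))  # initial url(s) pushed for the spider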

A simple scrapy-redis example

  Starting from an existing non-distributed crawler, it is very simple to use scrapy-redis to build a distributed one; only the spider's parent class and the configuration file need to be modified, as follows:

  First, modify the configuration file by adding the following code to the settings.py file:

  

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

Then modify the spider source file; the original non-distributed spider is changed into the following:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisSpider

from tencent.items import TencentItem, DetailItem

class TencentWantedSpider(RedisSpider):
    name = 'tencent_wanted'
    # allowed_domains = ["hr.tencent.com"]

    # start_urls = ["https://hr.tencent.com/position.php"]
    redis_key = 'tencent:start_urls'
    base_url = 'https://hr.tencent.com/'

    def parse(self, response):
        ...

Only two places are modified. One is the parent class: it changes from scrapy.Spider to RedisSpider.

The other is that start_urls is no longer needed; it is replaced by redis_key = "xxxx", where the value of this key can be a name of your own choosing.

Usually the project name is used. redis_key replaces start_urls as the source of the initial crawl url, because in distributed scrapy-redis every request is taken from redis. Therefore, in the redis database, set the value of redis_key to the initial url; scrapy automatically takes the value of redis_key out of redis, uses it as the initial url, and starts crawling.

Therefore, add the initial url to redis. That is, set a key-value pair in redis: the key is the spider's redis_key (tencent:start_urls), and the value is the initial url passed in, which serves as the starting url of the crawl.
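
A minimal sketch of this step using redis-py (an addition for illustration; the host, port, and url are assumptions, and the key must match the redis_key defined in the spider). The same can also be done from redis-cli with the lpush command.

# push_start_url.py -- push the initial url into redis so the spider starts crawling
import redis

r = redis.Redis(host='127.0.0.1', port=6379)

# the key must match the spider's redis_key ('tencent:start_urls' in the spider above)
r.lpush('tencent:start_urls', 'https://hr.tencent.com/position.php')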

  With that, a distributed crawler is up and running!



Origin: www.cnblogs.com/jcjc/p/11613411.html