Using scrapy_redis

Configuring Scrapy-Redis

Configuring Scrapy-Redis is very simple: you only need to modify the settings.py configuration file.

1. Core Configuration

First and most important, you need to replace Scrapy's default scheduler class and deduplication (dupefilter) class with the ones provided by Scrapy-Redis. Add the following configuration to settings.py:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

2. Redis Connection Configuration

There are two ways. The first is to configure the REDIS_URL variable directly in settings.py (the format is redis://[:password]@host:port[/db]):

REDIS_URL = 'redis://:foobared@120.27.34.25:6379'

The second is to configure the host, port, and password as separate variables:

REDIS_HOST = '120.27.34.25'
REDIS_PORT = 6379
REDIS_PASSWORD = 'foobared'

Note: if REDIS_URL is configured, Scrapy-Redis will use it first, overriding the three itemized settings above. If you want to configure the connection item by item, do not set REDIS_URL.
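If you are not sure which connection will actually be used, you can resolve it the same way the scheduler does. Below is a minimal sketch, assuming a recent scrapy-redis release (which exposes get_redis_from_settings in scrapy_redis.connection) and that the script is run inside the Scrapy project directory:

# A sketch: resolve the Redis client exactly as Scrapy-Redis would.
from scrapy.utils.project import get_project_settings
from scrapy_redis.connection import get_redis_from_settings

settings = get_project_settings()
server = get_redis_from_settings(settings)       # REDIS_URL wins if both are set
print(server.connection_pool.connection_kwargs)  # host/port/password actually used
print(server.ping())                             # True if Redis is reachable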

3. Scheduling Queue Configuration

This configuration is optional; PriorityQueue is used by default. If you want to change it, set the SCHEDULER_QUEUE_CLASS variable to one of the following (pick one):

SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # priority queue (the default)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'      # first in, first out
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'      # last in, first out

4. Persistence Configuration

This configuration is optional; the default is False. By default, Scrapy-Redis empties the crawl queue and the deduplication fingerprint set when a crawl finishes.

If you do not want the crawl queue and fingerprint set to be emptied automatically, you can add the following:

SCHEDULER_PERSIST = True

With SCHEDULER_PERSIST set to True, the crawl queue and the deduplication fingerprint set are not automatically cleared when the crawl completes. If this option is not configured, the default is False and both are emptied automatically.

It is worth noting that if the spider is forcibly interrupted, the crawl queue and the deduplication fingerprint set are not automatically cleared either.

We do not configure this option in this project; the default configuration is used.
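Either way, you can check what Scrapy-Redis has left behind with a few redis-py calls. A minimal sketch, assuming a hypothetical spider named myspider and the scrapy-redis default key patterns '%(spider)s:requests' and '%(spider)s:dupefilter':

# Inspect leftover state after a persisted or interrupted crawl.
import redis

server = redis.StrictRedis(host='120.27.34.25', port=6379, password='foobared')
# With the default PriorityQueue, the request queue is a sorted set.
print(server.zcard('myspider:requests'))    # pending requests
print(server.scard('myspider:dupefilter'))  # stored fingerprints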

5. Re-crawl Configuration

This configuration is optional; the default is False. If persistence is enabled, or the spider was forcibly interrupted, the crawl queue and fingerprint set are not emptied, so when the spider restarts it resumes from the previous crawl. If you want it to crawl from the beginning again instead, you can configure the re-crawl option:

SCHEDULER_FLUSH_ON_START = True

With SCHEDULER_FLUSH_ON_START set to True, the crawl queue and the fingerprint set are cleared every time the spider starts. In distributed crawling you must ensure this clearing happens only once; otherwise, every spider task empties them on startup, wiping out the crawl queue the other nodes are working from, which will certainly disrupt the distributed crawl.

Note that this option is convenient for standalone (single-machine) crawling, but it is not a common configuration for distributed crawling. If you do need a clean start in a distributed setup, you can clear the shared keys manually instead, as sketched below.
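Rather than enabling SCHEDULER_FLUSH_ON_START on every node, a minimal sketch of clearing the shared state exactly once, from a single machine, before launching the distributed crawl (again assuming a hypothetical spider named myspider and the default key patterns):

# Flush the shared crawl state once, before starting any spider node.
import redis

server = redis.StrictRedis(host='120.27.34.25', port=6379, password='foobared')
server.delete('myspider:requests', 'myspider:dupefilter')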

We do not configure this option in this project; the default configuration is used.

6. Pipeline Configuration

This configuration is optional; the Pipeline is not enabled by default. Scrapy-Redis implements an Item Pipeline that stores items in Redis: with the Pipeline enabled, every Item the spider generates is saved to the Redis database. With large amounts of data we generally do not do this, because Redis is memory-based and we use it for its fast processing speed; using it for bulk storage would be wasteful. The configuration is as follows:

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

This project does not enable the Pipeline, so no configuration is needed here.
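For completeness, here is a minimal sketch of how stored items could be read back if the Pipeline were enabled. By default, RedisPipeline JSON-serializes each item and pushes it onto the '%(spider)s:items' list; myspider is again a hypothetical spider name:

# Pop and decode one stored item from the Redis items list.
import json
import redis

server = redis.StrictRedis(host='120.27.34.25', port=6379, password='foobared')
raw = server.lpop('myspider:items')  # oldest item first (items are rpush-ed)
if raw is not None:
    print(json.loads(raw))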

At this point, the Scrapy-Redis configuration is complete. We left some options unconfigured, but they may be useful in other Scrapy projects depending on the specific circumstances.
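For reference, here is everything discussed above gathered into one settings.py snippet, using the sample credentials from earlier. The optional settings this project leaves at their defaults are shown commented out:

# settings.py -- the Scrapy-Redis configuration from this article in one place.

# Core (required): swap in the Scrapy-Redis scheduler and dupefilter.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Redis connection (REDIS_URL takes priority over the itemized settings).
REDIS_URL = 'redis://:foobared@120.27.34.25:6379'

# Optional settings, left at their defaults here:
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # default queue
# SCHEDULER_PERSIST = True          # keep queue/fingerprints after finishing
# SCHEDULER_FLUSH_ON_START = True   # clear queue/fingerprints on every start
# ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 300}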

This excerpt is taken from the WeChat official account "attack the coder"; the full article is at: https://mp.weixin.qq.com/s/JPkwHioLOC_27xfQCeWYhg
