Scrapy-Redis distributed crawling in practice

Hands-on with Scrapy-Redis

Scrapy is a general-purpose crawling framework, but it does not support distributed crawling out of the box. Scrapy-Redis provides a set of Redis-based components (components only, not a new framework) to make distributed crawling with Scrapy easier to implement.

scrapy-redis adds Redis on top of the Scrapy architecture and, building on Redis's characteristics, extends the following four components:

  • Scheduler
  • Duplication Filter
  • Item Pipeline
  • Base Spider

    (Figure: scrapy-redis architecture)

Scheduler

Scrapy's original request queue cannot be shared among multiple spiders; scrapy-redis replaces it with a shared queue stored in Redis.
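Enabling the shared scheduler is purely a settings change; a minimal sketch (the optional queue classes listed below are the ones shipped with scrapy-redis):

# settings.py: use the Redis-backed scheduler so all spider instances share one request queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Optionally pick the type of shared queue (the priority queue is the default)
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.FifoQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.LifoQueue"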

Duplication Filter

Scrapy deduplicates requests by keeping a set of request fingerprints in Python. In scrapy-redis, deduplication is handled by the Duplication Filter component, which cleverly relies on the no-duplicates property of a Redis set to implement the dedup.
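The idea can be sketched in a few lines (a simplified illustration, not the actual scrapy-redis implementation: it uses a SHA1 of the URL as a stand-in for Scrapy's full request fingerprint and assumes a Redis node on 127.0.0.1:7001):

import hashlib
import redis

server = redis.StrictRedis(host='127.0.0.1', port=7001)

def request_seen(url, key='cnblog:dupefilter'):
    # SADD returns 0 if the member is already in the set, so duplicates
    # are detected with a single round trip to Redis.
    fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return server.sadd(key, fp) == 0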

Item Pipeline

The engine hands the Items it crawls (as returned by the Spider) to the Item Pipeline; scrapy-redis's Item Pipeline stores the crawled Items in a Redis items queue. With this modified Item Pipeline, items can easily be pulled from the items queue by key, which makes it possible to run a cluster of item-processing workers.
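If you want the items themselves to go into Redis, scrapy-redis ships a ready-made RedisPipeline; enabling it is again a settings change, and by default items are serialized and pushed onto a Redis list named "<spider name>:items". The project below stores its results in MongoDB instead, but the option looks like this:

# settings.py: push scraped items into Redis for downstream processing
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}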

Base Spider

It no longer uses Scrapy's original Spider class; instead, RedisSpider is a rewrite that inherits from both Spider and RedisMixin, where RedisMixin is the class responsible for reading URLs from Redis.
When we create a spider that inherits from RedisSpider and call its setup_redis function, it connects to the Redis database and registers two signals. One fires when the spider is idle: it calls the spider_idle function, which calls schedule_next_request to keep the spider alive and then raises a DontCloseSpider exception. The other fires when an item is scraped: it calls the item_scraped function, which also calls schedule_next_request to fetch the next request.

Installing Scrapy-Redis

python3.6 -m pip install scrapy-redis
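A quick way to confirm the installation worked (assuming the installed release exposes a __version__ attribute, as recent ones do):

python3.6 -c "import scrapy_redis; print(scrapy_redis.__version__)"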

Project practice

First, modify the configuration file

BOT_NAME = 'cnblogs'
SPIDER_MODULES = ['cnblogs.spiders']
NEWSPIDER_MODULE = 'cnblogs.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
DOWNLOAD_DELAY = 2 # wait 2 seconds between requests
MY_USER_AGENT = ["Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36",
    "Mozilla/5.0+(Windows+NT+5.1)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/28.0.1500.95+Safari/537.36+SE+2.X+MetaSr+1.0",
    "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/50.0.2657.3+Safari/537.36"]
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'cnblogs.middlewares.UserAgentMiddleware': 543,
}
LOG_LEVEL = "ERROR"

ITEM_PIPELINES = {
   'cnblogs.pipelines.MongoPipeline': 300,
}
# Save the results to MongoDB
MONGO_HOST = "127.0.0.1"  # host IP
MONGO_PORT = 27017  # port
MONGO_DB = "spider_data"  # database name
MONGO_COLL = "cnblogs_title"  # collection name

# Replace the scheduler class and the dedup class with the ones provided by Scrapy-Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 7001 # port of one node in the Redis cluster

# Persistence: by default, Scrapy-Redis clears the crawl queue and the dedup
# fingerprint set once the crawl finishes; enable this to keep them.
#SCHEDULER_PERSIST = True

# Force a re-crawl: flush the queue and fingerprints when the spider starts
#SCHEDULER_FLUSH_ON_START = True
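The settings above reference a custom UserAgentMiddleware and a MongoPipeline that the post does not show. Below is a minimal sketch of what they might look like, assuming pymongo is installed and reusing the MY_USER_AGENT list and MONGO_* settings defined above (the names and details are illustrative, not the original project code):

# middlewares.py (illustrative sketch)
import random

class UserAgentMiddleware:
    """Pick a random User-Agent from MY_USER_AGENT for every request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('MY_USER_AGENT'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)


# pipelines.py (illustrative sketch)
import pymongo

class MongoPipeline:
    """Write each scraped item into the MongoDB collection configured in settings."""

    def open_spider(self, spider):
        settings = spider.settings
        self.client = pymongo.MongoClient(settings.get('MONGO_HOST'), settings.get('MONGO_PORT'))
        self.coll = self.client[settings.get('MONGO_DB')][settings.get('MONGO_COLL')]

    def process_item(self, item, spider):
        self.coll.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()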

Two changes are needed in the spider code:
the first is that the spider inherits from RedisSpider;
the second is that start_urls is replaced with redis_key.

# -*- coding: utf-8 -*-
import datetime

from scrapy_redis.spiders import RedisSpider


class CnblogSpider(RedisSpider):
    name = 'cnblog'
    # Start URLs are read from this Redis key instead of a start_urls list
    redis_key = "myspider:start_urls"
    #start_urls = [f'https://www.cnblogs.com/c-x-a/default.html?page={i}' for i in range(1, 2)]

    def parse(self, response):
        # Post titles live in <a class="postTitle2"> elements inside div.forFlow
        main_info_list_node = response.xpath('//div[@class="forFlow"]')
        title_list = main_info_list_node.xpath(".//a[@class='postTitle2']/text()").extract()
        for title in title_list:
            item = {
                'url': response.url,
                'title': title.strip() if title else title,
                'crawl_date': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            }
            yield item

Because Scrapy-Redis uses Redis as a shared message queue, the start tasks have to be inserted into the database in advance, under the key we designated: "myspider:start_urls".
To insert the tasks into the Redis cluster created earlier, first connect to it in cluster mode:

redis-cli -c -p 7000 # port of one Master node in my Redis cluster

Then execute the following commands to insert the tasks:

lpush myspider:start_urls https://www.cnblogs.com/c-x-a/default.html?page=1
lpush myspider:start_urls https://www.cnblogs.com/c-x-a/default.html?page=2

Then check the queue:

lrange myspider:start_urls 0 10

We can see our tasks, so they were inserted successfully.
The next step is to run the spider (the run command is shown below); after it finishes, check three places.
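Each node runs the same project with the standard Scrapy command; the spider name matches the class defined above:

scrapy crawl cnblog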
First, look at the task queue in Redis again; the tasks are gone:

(empty list or set)

Second, check the MongoDB database; the results were saved successfully.

Third, you will find that the crawler has not stopped. This is actually normal: with scrapy-redis, the crawler keeps fetching tasks from Redis, and if there are none it waits. If a new task is pushed into Redis, it resumes crawling and then goes back to waiting.
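For example, pushing one more page URL (a hypothetical next page, following the same pattern as before) will wake the idle crawler:

lpush myspider:start_urls https://www.cnblogs.com/c-x-a/default.html?page=3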

Follow the WeChat public account "Python学习开发" and reply "redis" in the background to get the source code.

