Scrapy framework learning (8) - Scrapy-redis distributed crawler learning
Scrapy-redis
The distributed crawler framework is Scrapy
improved on the basis of the crawler framework. By Redis
caching data, crawler programs can be run on multiple machines. The examples in this article are CentOS
running on a virtual machine.
1. Redis installation
Regarding the installation of Redis, there are many articles on the Internet, and there are also some problems in configuring the Redis environment. The following two articles introduce the installation and problems of Redis in detail, and they are directly transferred here. Because my Scrapy-Redis
crawler code is running CentOS
in
Installation and configuration of Redis under CentOS 7
CentOS 7.0 installs Redis 3.2.1 detailed process and usage frequently asked questions
2. Redis distributed testing
We use 3 machines to crawl data with distributed crawler. After installing it on the virtual machine Redis
, we will conduct Redis
distributed testing.
We have server1 as the Master
machine ( ip:192.168.108.20
). server2( ip:192.168.108.21
), server3( ip:192.168.108.22
) as 2 machines Slaves
.
2.1 Start the Redis service of the Master machine
Find the configuration file of redis redis.conf
, comment bind 127.0.0.1
, and set the protected mode protected-mode
to no
.
specified redis.conf
startup
redis-server /redis/redis-stable/redis.conf
As shown in the figure:
2.2 Start the Redis client redis-cli on each machine
Master machine:
redis-cli
Slaves machine, specify the ip of the Master machine
redis-cli -h 192.168.108.20
If the following occurs:
Could not connect to Redis at 192.168.108.20:6379: No route to host
You need to clear the firewall configuration for each machine:
sudo iptables -F
Connect to Master's Redis again, as shown in the figure:
If the machine fails to connect, it may be a firewall problem, you can try: sudo iptables -F to clear the firewall configuration
2.3 Test Distributed Redis
3. Crawler code of scrapy-redis
Scrapy-redis distributed crawler, crawling is: http://sh.58.com/chuzu/
.
Let's look at the code below, which is basically the same as the Scrapy project, except that the Spider class is changed to the class in Scrapy-redis, and the configuration in settings.py is different.
3.1 Item
define entity class
class Redis58TestItem(scrapy.Item):
# 标题
title = scrapy.Field()
# 房间
room = scrapy.Field()
# 区域
zone = scrapy.Field()
# 地址
address = scrapy.Field()
# 价格
money = scrapy.Field()
# 发布信息的类型,品牌公寓,经纪人,个人
type = scrapy.Field()
3.2 spider
spider crawler class implementation
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy.linkextractors import LinkExtractor
from redis58test.items import Redis58TestItem
class Redis58Spider(RedisCrawlSpider):# 继承的Spider是Scrapy-redis框架提供的
# spider的唯一名称
name = 'redis58spider_redis'
# 开始爬取的url
redis_key = 'redis58spider:start_urls'
# 从页面需要提取的url 链接(link)
links = LinkExtractor(allow="sh.58.com/chuzu/pn\d+")
# 设置解析link的规则,callback是指解析link返回的响应数据的的方法
rules = [Rule(link_extractor=links, callback="parseContent", follow=True)]
def parseContent(self, response):
"""
解析响应的数据,获取需要的数据字段
:param response: 响应的数据
:return:
"""
# 根节点 //ul[@class="listUl"]/li[@logr]
# title: .//div[@class="des"]/h2/a/text()
# room: .//div[@class="des"]/p[@class="room"]/text()
# zone: .//div[@class="des"]/p[@class="add"]/a[1]/text()
# address: .//div[@class="des"]/p[@class="add"]/a[last()]/text()
# money: .//div[@class="money"]/b/text()
# type: # .//div[@class="des"]/p[last()]/@class # 如果是add,room .//div[@class="des"]/div[@class="jjr"]/@class
for element in response.xpath('//ul[@class="listUl"]/li[@logr]'):
title = element.xpath('.//div[@class="des"]/h2/a/text()')[0].extract().strip()
room = element.xpath('.//div[@class="des"]/p[@class="room"]')[0].extract()
zone = element.xpath('.//div[@class="des"]/p[@class="add"]/a[1]/text()')[0].extract()
address = element.xpath('.//div[@class="des"]/p[@class="add"]/a[last()]/text()')[0].extract()
money = element.xpath('.//div[@class="money"]/b/text()')[0].extract()
type = element.xpath('.//div[@class="des"]/p[last()]/@class')[0].extract()
if type == "add" or type == "room":
type = element.xpath('.//div[@class="des"]/div[@class="jjr"]/@class')[0].extract()
item = Redis58TestItem()
item['title'] = title
item['room'] = room
item['zone'] = zone
item['address'] = address
item['money'] = money
item['type'] = type
yield item
3.3 Pipe Item
In fact, we don't need to process Pipeitem's data, because the Scrapy framework will deduplicate the data to redis.
class Redis58TestPipeline(object):
def process_item(self, item, spider):
return item
3.4 settings.py settings
Information for configuring Scrapy-Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
# 使用什么队列调度
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue" # 优先级队列
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue" # 基本队列
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack" # 栈
ITEM_PIPELINES = {
'redis58test.pipelines.Redis58TestPipeline': 300,
'scrapy_redis.pipelines.RedisPipeline': 400,
}
REDIS_HOST = '192.168.108.20' # redis Master机器的ip
REDIS_PORT = 6379
4. Execute the crawler
4.1 Start the redis-server of the Master machine
# 启动Master机器上的redis-server
redis-server /redis/redis-stable/redis.conf
4.2 Execute the crawler
Put the crawler code of the Scrapy-redis project on the slaves machine. Enter the spider directory of the project, and then start the scrapy-redis project crawler program on the slaves machine, in no particular order.
scrapy runspider myspider.py
4.3 redis-cli of Master machine
Start the url to be crawled in the redis-cli of the Master machine.
127.0.0.1:6379> lpush redis58spider:start_urls http://sh.58.com/chuzu/
5. Permanently store redis data
After the crawler program crawls, we save the data in redis to the mongodb database. as follows
import json
import redis
import pymongo
def main():
# 指定Redis数据库信息
rediscli = redis.StrictRedis(host='192.168.108.20', port=6379, db=0)
# 指定MongoDB数据库信息
mongocli = pymongo.MongoClient(host='localhost', port=27017)
# 创建数据库名
db = mongocli['redis58test']
# 创建表名
sheet = db['redis58_sheet2']
while True:
# FIFO模式为 blpop,LIFO模式为 brpop,获取键值
source, data = rediscli.blpop(["redis58spider_redis:items"])
item = json.loads(data)
sheet.insert(item)
try:
print(u"Processing: %(name)s <%(link)s>" % item)
except KeyError:
print(u"Error procesing: %r" % item)
if __name__ == '__main__':
main()
The data is shown in the figure: