Distributed Crawling with Scrapy

1. How distributed Scrapy works, and a walkthrough of the Scrapy-Redis source

Distributed crawler architecture


How is the queue maintained? With a Redis queue


Deduplication: a Redis set
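As a rough illustration of set-based deduplication, the sketch below hashes each request into a 40-character SHA-1 fingerprint and records it in a set. The names `request_fingerprint` and `SetDupeFilter` are hypothetical, the Python `set` is only a stand-in for the Redis set, and the hashing is simplified (it does not canonicalize the URL the way Scrapy does).

```python
import hashlib


def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Return a 40-character SHA-1 hex digest identifying a request (simplified)."""
    sha1 = hashlib.sha1()
    sha1.update(method.upper().encode("utf-8"))
    sha1.update(url.encode("utf-8"))
    sha1.update(body)
    return sha1.hexdigest()


class SetDupeFilter:
    """Hypothetical dupefilter; the Python set stands in for a Redis set."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method: str, url: str, body: bytes = b"") -> bool:
        """Return True if an identical request was recorded before."""
        fp = request_fingerprint(method, url, body)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


dupefilter = SetDupeFilter()
url = "http://category.dangdang.com/cp01.03.55.00.00.00.html"
first = dupefilter.request_seen("GET", url)   # new request, gets recorded
second = dupefilter.request_seen("GET", url)  # same request again
print(first, second)
```

Because every crawler process talks to the same Redis set, a request recorded by one process is treated as a duplicate by all the others. With Redis, the check and the insert can be done in one step, since SADD reports whether the member was newly added.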


How do we guard against interruptions? With a check when Scrapy starts
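One way to picture the startup check: if requests are left over from an earlier run, carry on with them; otherwise seed the queue with the start URLs. This is a minimal sketch under that assumption. `open_queue` is a hypothetical helper, and the plain list passed as `pending` stands in for whatever is still stored in the Redis queue.

```python
from collections import deque


def open_queue(pending, start_urls):
    """Resume from leftover requests if any exist, otherwise seed the queue.

    Returns the queue to crawl from and a flag saying whether we resumed.
    """
    if pending:
        return deque(pending), True
    return deque(start_urls), False


# Fresh run: nothing left over, so the start URLs are used.
queue, resumed = open_queue([], ["http://category.dangdang.com/?ref=www-0-C"])
print(list(queue), resumed)

# Restart after an interruption: the leftover requests take precedence.
leftover = ["http://category.dangdang.com/cp01.03.55.00.00.00.html"]
queue, resumed = open_queue(leftover, ["http://category.dangdang.com/?ref=www-0-C"])
print(list(queue), resumed)
```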


How is the Scrapy-Redis architecture implemented?

https://github.com/rolando/scrapy-redis

scrapy-redis settings

# -*- coding: utf-8 -*-


BOT_NAME = 'dangdang_book'

SPIDER_MODULES = ['dangdang_book.spiders']
NEWSPIDER_MODULE = 'dangdang_book.spiders'

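# Dupefilter class: deduplicates requests via fingerprints stored in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Scheduler that keeps the request queue in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Whether to keep the queue and dupefilter in Redis after the spider closes
SCHEDULER_PERSIST = True
# Redis address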
REDIS_URL = "redis://127.0.0.1:6379"
# REDIS_HOST = '127.0.0.1'
# REDIS_PORT = 6379


LOG_LEVEL = "DEBUG"
# user-agent
UA_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Download delay (seconds)
DOWNLOAD_DELAY = 0

DOWNLOADER_MIDDLEWARES = {
    'dangdang_book.middlewares.DangdangBookDownloaderMiddleware': 543,
}

# Configure item pipelines
ITEM_PIPELINES = {
    # 'dangdang_book.pipelines.DangdangBookPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 300
}
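The settings above register `DangdangBookDownloaderMiddleware` and define `UA_LIST`, but the middleware code itself is not shown. A plausible use of `UA_LIST` is rotating the User-Agent header per request. The sketch below is an assumption about what such a middleware might look like, not the original code; the class name `RandomUserAgentMiddleware` is hypothetical.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: pick a random User-Agent per request."""

    def __init__(self, ua_list):
        self.ua_list = list(ua_list)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when building the middleware;
        # UA_LIST is the custom setting defined in settings.py.
        return cls(crawler.settings.get("UA_LIST", []))

    def process_request(self, request, spider):
        # Overwrite the header only when the pool is non-empty.
        if self.ua_list:
            request.headers["User-Agent"] = random.choice(self.ua_list)
        # Returning None tells Scrapy to keep processing the request.
        return None
```

To use a class like this, its dotted path would go into `DOWNLOADER_MIDDLEWARES` in the same way as the entry shown in the settings.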

2. scrapy-redis architecture


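  1. First, the Slave side takes tasks (Requests, urls) from the Master side and fetches the data. While a Slave is crawling, any Requests it generates for new tasks are submitted to the Master for processing;

  2. The Master side has only a single Redis database. It deduplicates the unprocessed Requests and assigns tasks, adds the deduplicated Requests to the queue of pages waiting to be crawled, and stores the scraped data.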

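This is the strategy Scrapy-Redis uses by default. It is simple to implement, because Scrapy-Redis already takes care of task scheduling and related work; we only need to subclass RedisSpider and specify redis_key.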

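The drawback is that the tasks Scrapy-Redis schedules are Request objects, which carry a lot of information (not just the url, but also the callback function, headers, and so on). This can slow the crawler down and take up a large amount of Redis storage, so keeping it efficient requires reasonably capable hardware.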
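To get a feel for the overhead, the sketch below compares the size of a bare URL with a pickled dict holding the kind of fields a serialized request carries. The field names mirror the ones visible in the redis-cli output in this article, but the dict is hand-built for illustration and is not produced by Scrapy-Redis itself.

```python
import pickle

url = "http://category.dangdang.com/cp01.03.55.00.00.00.html"

# Hand-built stand-in for a serialized request (illustrative only).
request_dict = {
    "url": url,
    "callback": "book_details",
    "errback": None,
    "method": "GET",
    "headers": {b"Referer": [b"http://category.dangdang.com/?ref=www-0-C"]},
    "body": b"",
    "cookies": {},
    "meta": {"depth": 1},
    "priority": 0,
    "dont_filter": False,
}

url_size = len(url.encode("utf-8"))
request_size = len(pickle.dumps(request_dict, protocol=pickle.HIGHEST_PROTOCOL))
print(f"URL alone: {url_size} bytes; pickled request: {request_size} bytes")
```

The gap grows with every extra header or `meta` entry attached to a request, which is why a queue of full requests costs noticeably more Redis memory than a queue of URLs.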

3. Data stored by scrapy-redis

scrapy-redis stores all of its data as key-value pairs. The most common keys are:

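  • 1. "spider-name:items" --> list type; holds the items scraped by the spider, each stored as a JSON string

  • 2. "spider-name:dupefilter" --> set type; used to deduplicate the requests the spider has seen; each member is a 40-character hex hash (fingerprint) of a request

  • 3. "spider-name:start_urls" --> list type; holds the first URLs the spider fetches when it starts

  • 4. "spider-name:requests" --> zset type; used by the scheduler to dispatch requests; each member is a serialized request object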

Inspect the scraped items

127.0.0.1:6379> lrange dd_book:items 0 1

Inspect the dupefilter fingerprints

127.0.0.1:6379> smembers dd_book:dupefilter
 1) "6d90fff4881b727565ab3d04e6a5b45ddbc88617"
 2) "79c4efb8b2ba80507ec3a2780740a77e503284a7"
 3) "af25813064c7f745dd1ce19be2a6b32fe1e47f32"
 4) "8ddfe9ee5bbdb9c1a01162bd0579dd28e93d0720"
 5) "9fecae546b25ab25048ca7bc281a91114807599d"
 6) "ac7d958281309bcee7fe2aae3901463ceabbe99d"
 7) "bd1b4155ce7d85856c2fb8cace6b68d5ba6fb647"
 8) "bddc5f13b3db0b36a3a4454bb5708bd85c58da98"
 9) "41e1c5002b277cab83eeca743904010c33f442df"
127.0.0.1:6379>

Inspect the requests waiting to be fetched

127.0.0.1:6379> type dd_book:requests
zset
127.0.0.1:6379> zrange dd_book:requests 0 -1
1) "\x80\x04\x95h\x01\x00\x00\x00\x00\x00\x00}\x94(\x8c\x03url\x94\x8c5http://category.dangdang.com/cp01.03.55.00.00.00.html\x94\x8c\bcallback\x94\x8c\x0cbook_details\x94\x8c\aerrback\x94N\x8c\x06method\x94\x8c\x03GET\x94\x8c\aheaders\x94}\x94C\aReferer\x94]\x94C)http://category.dangdang.com/?ref=www-0-C\x94as\x8c\x04body\x94C\x00\x94\x8c\acookies\x94}\x94\x8c\x04meta\x94}\x94(\x8c\x04item\x94}\x94(\x8c\t\xe5\xa4\xa7\xe6\xa0\x87\xe9\xa2\x98\x94\x8c\x06\xe5\xb0\x8f\xe8\xaf\xb4\x94\x8c\t\xe5\xb0\x8f\xe6\xa0\x87\xe9\xa2\x98\x94\x8c\t\xe4\xbd\x9c\xe5\x93\x81\xe9\x9b\x86\x94u\x8c\x05depth\x94K\x01u\x8c\t_encoding\x94\x8c\x05utf-8\x94\x8c\bpriority\x94K\x00\x8c\x0bdont_filter\x94\x89\x8c\x05flags\x94]\x94\x8c\tcb_kwargs\x94}\x94u."
2) "\x80\x04\x95k\x01\x00\x00\x00\x00\x00\x00}\x94(\x8c\x03url\x94\x8c5http://category.dangdang.com/cp01.03.35.00.00.00.html\x94\x8c\bcallback\x94\x8c\x0cbook_details\x94\x8c\aerrback\x94N\x8c\x06method\x94\x8c\x03GET\x94\x8c\aheaders\x94}\x94C\aReferer\x94]\x94C)http://category.dangdang.com/?ref=www-0-C\x94as\x8c\x04body\x94C\x00\x94\x8c\acookies\x94}\x94\x8c\x04meta\x94}\x94(\x8c\x04item\x94}\x94(\x8c\t\xe5\xa4\xa7\xe6\xa0\x87\xe9\xa2\x98\x94\x8c\x06\xe5\xb0\x8f\xe8\xaf\xb4\x94\x8c\t\xe5\xb0\x8f\xe6\xa0\x87\xe9\xa2\x98\x94\x8c\x0c\xe5\xa4\x96\xe5\x9b\xbd\xe5\xb0\x8f\xe8\xaf\xb4\x94u\x8c\x05depth\x94K\x01u\x8c\t_encoding\x94\x8c\x05utf-8\x94\x8c\bpriority\x94K\x00\x8c\x0bdont_filter\x94\x89\x8c\x05flags\x94]\x94\x8c\tcb_kwargs\x94}\x94u."
3) "\x80\x04\x95}\x01\x00\x00\x00\x00\x00\x00}\x94(\x8c\x03url\x94\x8c5http://category.dangdang.com/cp01.01.12.00.00.00.html\x94\x8c\bcallback\x94\x8c\x0cbook_details\x94\x8c\aerrback\x94N\x8c\x06method\x94\x8c\x03GET\x94\x8c\aheaders\x94}\x94C\aReferer\x94]\x94C)http://category.dangdang.com/?ref=www-0-C\x94as\x8c\x04body\x94C\x00\x94\x8c\acookies\x94}\x94\x8c\x04meta\x94}\x94(\x8c\x04item\x94}\x94(\x8c\t\xe5\xa4\xa7\xe6\xa0\x87\xe9\xa2\x98\x94\x8c\x0c\xe9\x9d\x92\xe6\x98\xa5\xe6\x96\x87\xe5\xad\xa6\x94\x8c\t\xe5\xb0\x8f\xe6\xa0\x87\xe9\xa2\x98\x94\x8c\x18\xe5\x85\xb6\xe4\xbb\x96\xe5\x9b\xbd\xe5\xa4\x96\xe9\x9d\x92\xe6\x98\xa5\xe6\x96\x87\xe5\xad\xa6\x94u\x8c\x05depth\x94K\x01u\x8c\t_encoding\x94\x8c\x05utf-8\x94\x8c\bpriority\x94K\x00\x8c\x0bdont_filter\x94\x89\x8c\x05flags\x94]\x94\x8c\tcb_kwargs\x94}\x94u."
127.0.0.1:6379>
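The members printed above begin with `\x80\x04`, which is the header of a pickle stream (protocol 4), so they can be turned back into a readable dict with `pickle.loads`. The sketch below round-trips a hand-built sample rather than the exact bytes shown above. `decode_request` is a hypothetical helper, and it should only be applied to data from a Redis instance you trust, since unpickling untrusted bytes can execute arbitrary code.

```python
import pickle


def decode_request(raw: bytes) -> dict:
    """Unpickle one raw member of the requests zset into a plain dict."""
    return pickle.loads(raw)


# Stand-in for one raw value as a Redis client would return it (bytes).
sample = pickle.dumps(
    {
        "url": "http://category.dangdang.com/cp01.03.55.00.00.00.html",
        "callback": "book_details",
        "method": "GET",
        "meta": {"depth": 1},
    },
    protocol=4,
)

request = decode_request(sample)
print(request["url"], request["callback"], request["meta"]["depth"])
```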

For deployment details, see also https://blog.csdn.net/u013399297/article/details/80393020?utm_source=blogxgwz0


Reposted from blog.csdn.net/weixin_43746433/article/details/106581190