Requirements

- Python 2.7, 3.4 or 3.5
- Redis >= 2.8
- Scrapy >= 1.1
- redis-py >= 2.10
1. Install scrapy-redis first

sudo pip3 install scrapy-redis
2. Install Redis
3. Install Redis Desktop Manager, a GUI tool for inspecting Redis

Link: https://pan.baidu.com/s/1miRPuOC?fid=489763908155827
4. Rewrite the spider

# File: wb.py
import re
from datetime import datetime

import scrapy
from scrapy_redis.spiders import RedisSpider

from ..items import QuestionItem, AnswerItem


class WbSpider(RedisSpider):
    name = 'wb'
    allowed_domains = ['58che.com']
    # start_urls = ['https://bbs.58che.com/cate-1.html']
    redis_key = "wbSpider:start_urls"
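The file above stops at the class attributes; the callbacks are not shown. A hypothetical parse callback might look like the sketch below. The selector, field names, and the stub classes used to exercise it are all assumptions for illustration, not taken from the original project:

```python
from datetime import datetime

# Hypothetical callback extracting one record from a thread page.
# The 'h1::text' selector and the dict fields are assumed, not from the source.
def parse_question(response):
    return {
        'title': response.css('h1::text').get(),
        'url': response.url,
        'crawled_at': datetime.utcnow().isoformat(),
    }

# Tiny stubs standing in for a Scrapy Response, just to exercise the logic:
class _StubSelector:
    def __init__(self, value):
        self._value = value
    def get(self):
        return self._value

class _StubResponse:
    url = 'https://bbs.58che.com/cate-1.html'
    def css(self, query):
        return _StubSelector('demo title')

item = parse_question(_StubResponse())
print(item['title'])  # demo title
```

In a real spider this function would be a method on WbSpider and yield a QuestionItem instead of a plain dict.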
First, change the spider to inherit from RedisSpider and add a redis_key attribute naming the Redis list it reads from. Comment out start_urls and drop start_requests: every URL now comes from Redis, so the spider has no hard-coded starting address. Seed the list by pushing the starting URL with the Redis CLI:
redis-cli
lpush wbSpider:start_urls https://bbs.58che.com/cate-1.html
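The list name pushed here must match the spider's redis_key exactly, or the spider sits idle waiting for URLs. A small sketch of how the key is resolved (the default pattern comes from the scrapy-redis defaults; the helper function itself is illustrative, not part of the library):

```python
# Default pattern scrapy-redis uses when a spider sets no redis_key.
DEFAULT_START_URLS_KEY = '%(name)s:start_urls'

def start_urls_key(spider_name, redis_key=None):
    """Return the Redis list key a spider will pop its start URLs from."""
    return redis_key or DEFAULT_START_URLS_KEY % {'name': spider_name}

print(start_urls_key('wb'))                         # wb:start_urls
print(start_urls_key('wb', 'wbSpider:start_urls'))  # wbSpider:start_urls
```

Because wb.py sets redis_key = "wbSpider:start_urls" explicitly, the lpush above targets that key rather than the default wb:start_urls.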
5. Edit the settings (settings.py)

## Scraped data is stored in Redis; comment out any other storage settings.

# Store scraped items in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}
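The pipeline alone only stores items; for the machines to actually share one request queue, the scheduler and duplicate filter should also point at Redis. A fuller settings.py fragment, with setting names taken from the scrapy-redis README (the REDIS_URL value is an assumption — point it at your own Redis instance):

```python
# Enables scheduling: store the requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share the same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Don't clear the redis queue on finish, allowing crawls to pause and resume.
SCHEDULER_PERSIST = True

# Store scraped items in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Address of the shared Redis server (assumed value; change to your own).
REDIS_URL = 'redis://127.0.0.1:6379'
```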
6. Deploy to different machines and start the crawler on each

scrapy crawl wb