scrapy-redis
By default, scrapy-redis only supports the Redis list and set data structures. But once more business lines are involved, crawler task priority becomes an issue. For example, suppose three business lines all need a crawler, and the three are not equally important. There are then several options:
- Run three separate spiders (not recommended)
- Add another scheduler layer on top for priority scheduling (adds complexity)
- Make scrapy-redis's `start_urls` support priority
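The third option is the one this post pursues. The underlying idea can be sketched in pure Python, without a Redis server: treat the start-URL queue as a mapping from member to score (like a Redis sorted set) and always consume the highest-scored member first, the way repeated ZADD/ZPOPMAX calls would. The URL names below are placeholders, not from the project:

```python
# Pure-Python sketch of the priority semantics we want from Redis:
# a sorted set maps each start URL to a numeric score, and the spider
# always consumes the highest-scored URL first (ZADD + ZPOPMAX style).

def zadd(zset, mapping):
    """Add members with scores, like Redis ZADD."""
    zset.update(mapping)

def zpopmax(zset):
    """Remove and return the highest-scored member, like Redis ZPOPMAX."""
    if not zset:
        return None
    member = max(zset, key=zset.get)
    return member, zset.pop(member)

queue = {}
zadd(queue, {'low.example': 0, 'high.example': 10, 'mid.example': 5})

order = []
while queue:
    member, score = zpopmax(queue)
    order.append(member)

print(order)  # ['high.example', 'mid.example', 'low.example']
```

Higher score means higher priority, so the high-priority business line's URLs are always fetched before the others, even when tasks are pushed interleaved.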
I ran into exactly this problem, and initially worked around it with an extra scheduler layer. Later I took the time to add priority support for `start_urls` to scrapy-redis itself. It can be enabled through a parameter in `settings.py`, and testing has passed; the upstream author may be too busy, as there has been no feedback on the PR so far.
Project address
https://github.com/qshine/scrapy-redis
Instructions

```shell
git clone https://github.com/qshine/scrapy-redis.git
cd scrapy-redis
python setup.py install
```
Set the following parameters in `settings.py`; for other parameters, refer to the README:

```python
# settings.py
......
REDIS_URL = 'redis://:@127.0.0.1:6379'
REDIS_START_URLS_KEY = '%(name)s:start_urls'
REDIS_START_URLS_AS_SET = False
......
```
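For context, the elided lines above would normally hold the standard scrapy-redis wiring. A minimal sketch, using the settings documented in the upstream README (these are not specific to this fork):

```python
# settings.py (sketch) -- standard scrapy-redis wiring from the upstream
# README; the priority feature builds on this same scheduler setup.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # Redis-backed scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedup shared across workers
SCHEDULER_PERSIST = True  # keep the request queue in Redis between runs
```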
The test spider is as follows:

```python
# -*- coding: utf-8 -*-
from scrapy_redis.spiders import RedisSpider


class MysiteSpider(RedisSpider):
    name = 'mysite'

    def parse(self, response):
        print(response.url)
```
Add three tasks with different priorities to Redis:

```shell
zadd mysite:start_urls 0 'a' 10 'b' 5 'c'
```
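Given those scores (higher score = higher priority), the members should be consumed in the order b, c, a. A quick pure-Python sanity check of that ordering, mirroring the scores from the `zadd` command:

```python
# Scores assigned by the zadd command above; higher score = higher priority.
scores = {'a': 0, 'b': 10, 'c': 5}

# Members sorted by descending score: the order a priority-aware spider
# should consume them in (equivalent to repeated ZPOPMAX calls).
expected_order = sorted(scores, key=scores.get, reverse=True)
print(expected_order)  # ['b', 'c', 'a']
```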
Start the spider; the log is as follows:

```
http://www.sina.com
2019-07-03 23:54:34 [mysite] DEBUG: Request not made from data: b'http://www.sina.com'
http://www.163.com
2019-07-03 23:54:34 [mysite] DEBUG: Request not made from data: b'http://www.163.com'
http://www.baidu.com
2019-07-03 23:54:34 [mysite] DEBUG: Request not made from data: b'http://www.baidu.com'
```
Epilogue

This feature was added to address the priority problem, and I personally think it is quite practical. If anything is lacking, feedback is welcome; and if it helps you meet a requirement quickly, a star on the project is welcome too.