Article Directory
- 7.1 scrapy-redis in Practice: The Bundled Example Project
- The example project that ships with the source
- Adapting the scrapy-redis examples
- 一、dmoz (class DmozSpider(CrawlSpider))
- 二、myspider_redis (class MySpider(RedisSpider))
- Note:
- How to run:
- `scrapy runspider myspider_redis.py`
- `$redis > lpush myspider:start_urls http://dmoztools.net/`
- 三、mycrawler_redis (class MyCrawler(RedisCrawlSpider))
- Summary:
7.1 scrapy-redis in Practice: The Bundled Example Project
The example project that ships with the source
Adapting the scrapy-redis examples
Clone the scrapy-redis repository from GitHub, then copy its example-project directory to a location of your choice:
# Clone the scrapy-redis source from GitHub
git clone https://github.com/rolando/scrapy-redis.git
# Reuse the official example project directly, renamed as your own project (a shortcut for the lazy)
mv scrapy-redis/example-project ~/scrapyredis-project
The scrapy-redis source we just cloned includes a built-in example-project, which contains three spiders: dmoz, myspider_redis, and mycrawler_redis.
一、dmoz (class DmozSpider(CrawlSpider))
This spider inherits from CrawlSpider and demonstrates Redis persistence: run the dmoz spider, stop it with Ctrl+C, then run it again; the crawl resumes from the records kept in Redis.
In essence this is the CrawlSpider pattern running on scrapy-redis: you need to define rules with Rule objects, and the callback cannot be the parse() method.
How to run: scrapy crawl dmoz
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    """Follow categories and extract links."""
    name = 'dmoz'
    allowed_domains = ['dmoztools.net']
    start_urls = ['http://dmoztools.net/']

    rules = [
        Rule(LinkExtractor(
            restrict_css=('.top-cat', '.sub-cat', '.cat-item')
        ), callback='parse_directory', follow=True),
    ]

    def parse_directory(self, response):
        for div in response.css('.title-and-desc'):
            yield {
                'name': div.css('.site-title::text').extract_first(),
                'description': div.css('.site-descr::text').extract_first().strip(),
                'link': div.css('a::attr(href)').extract_first(),
            }
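The resume-after-Ctrl+C behavior does not come from the spider itself but from the scrapy-redis settings in the project's settings.py. A minimal sketch of the settings involved (component paths as documented by scrapy-redis; the pipeline priority 400 is illustrative):

```python
# settings.py (sketch): scrapy-redis components that give any spider
# Redis-backed scheduling, deduplication, and persistence.

# Use scrapy-redis's scheduler and request-fingerprint dupefilter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and the seen-set in Redis across runs,
# so an interrupted crawl can resume where it stopped
SCHEDULER_PERSIST = True

# Store scraped items in Redis as well (priority value is illustrative)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Where the Redis server lives (defaults shown)
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
```

With these settings in place, even a plain CrawlSpider like dmoz gains Redis-based deduplication and persistence without any change to the spider code.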
二、myspider_redis (class MySpider(RedisSpider))
This spider inherits from RedisSpider and supports distributed crawling. It follows the basic Spider pattern, so you need to write a parse function.
Also, there is no start_urls any more; it is replaced by redis_key, and scrapy-redis pops URLs out of that Redis key to build the requests.
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'myspider_redis'

    # Note the format of the redis key:
    redis_key = 'myspider:start_urls'

    # Optional: the equivalent of allowed_domains. Write __init__ in this
    # fixed form; when reusing it, only the class name passed to super()
    # needs to change.
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        # Change this class name to the current class name
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
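One caveat when running the example's __init__ under Python 3: filter() returns a lazy iterator rather than a list, which Scrapy's offsite filtering does not handle well. A small standalone sketch of the same parsing logic with an explicit list() (the function name is this article's, not scrapy-redis API):

```python
def parse_domains(domain=''):
    """Split a comma-separated -a domain=... argument into a list,
    dropping empty entries (mirrors the example's __init__ logic)."""
    # list(...) matters on Python 3: filter() alone yields a lazy iterator
    return list(filter(None, domain.split(',')))

print(parse_domains(''))                            # []
print(parse_domains('dmoztools.net,example.com'))   # ['dmoztools.net', 'example.com']
```

The spider is then started with Scrapy's standard spider-argument mechanism, e.g. scrapy runspider myspider_redis.py -a domain=dmoztools.net, which is what populates the domain keyword argument popped in __init__.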
Note:
The RedisSpider class does not need allowed_domains or start_urls:
- scrapy-redis defines the allowed domain range dynamically in the __init__() constructor; alternatively, you can simply write allowed_domains directly.
- You must specify redis_key, the key referenced by the spider start command, in this format:
redis_key = 'myspider:start_urls'
- Following that format, lpush the start URLs into the Redis database from redis-cli on the Master side; RedisSpider then fetches its start_urls from the database.
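The 'myspider:start_urls' format follows the default key pattern scrapy-redis falls back to when redis_key is not set, which is built from the spider name. A sketch of that derivation (pattern as in the scrapy-redis defaults, to the best of my knowledge):

```python
# scrapy-redis default pattern for the start-URLs key
START_URLS_KEY_PATTERN = '%(name)s:start_urls'

def default_redis_key(spider_name):
    """Key a spider reads its seed URLs from when redis_key is not set."""
    return START_URLS_KEY_PATTERN % {'name': spider_name}

print(default_redis_key('myspider_redis'))   # myspider_redis:start_urls
```

Setting redis_key explicitly, as the example does, simply overrides this default with a fixed name.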
How to run:
- Run the spider's .py file with runspider (you can also start several copies); the spider(s) will sit in a ready, waiting state:
scrapy runspider myspider_redis.py
- On the Master side, issue the push command in redis-cli, in this format:
$redis > lpush myspider:start_urls http://dmoztools.net/
- The Slave-side spiders pick up the request and start crawling.
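The Master-to-Slave handoff can be modeled in plain Python. This toy sketch mimics the Redis list with a deque (assuming the default list-based queue, not the set-based variant):

```python
from collections import deque

# Toy model of the Redis list behind 'myspider:start_urls'
queue = deque()

def lpush(q, value):
    """Redis LPUSH: insert at the head of the list."""
    q.appendleft(value)

def lpop(q):
    """Redis LPOP: remove and return the head element (None if empty)."""
    return q.popleft() if q else None

# The Master side pushes a seed URL...
lpush(queue, 'http://dmoztools.net/')
# ...and a waiting Slave pops it and schedules the first request.
print(lpop(queue))   # http://dmoztools.net/
print(lpop(queue))   # None (queue drained; spiders go back to waiting)
```

Because the key is a shared Redis list, any number of Slave processes can block on it at once; each seed URL is handed to exactly one of them.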
三、mycrawler_redis (class MyCrawler(RedisCrawlSpider))
This spider inherits from RedisCrawlSpider and supports distributed crawling. Because it is based on CrawlSpider, it must define Rule rules, and the callback cannot be the parse() method.
Again there is no start_urls; it is replaced by redis_key, and scrapy-redis pops URLs out of that Redis key to build the requests.
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawler(RedisCrawlSpider):
    """Spider that reads urls from redis queue (mycrawler:start_urls)."""
    name = 'mycrawler_redis'
    redis_key = 'mycrawler:start_urls'

    rules = (
        # follow all links
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    # __init__ must be written in this fixed form; when reusing it, only the
    # class name passed to super() needs to change.
    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        domain = kwargs.pop('domain', '')
        self.allowed_domains = filter(None, domain.split(','))
        # Change this class name to the current class name
        super(MyCrawler, self).__init__(*args, **kwargs)

    def parse_page(self, response):
        return {
            'name': response.css('title::text').extract_first(),
            'url': response.url,
        }
Note:
Likewise, the RedisCrawlSpider class does not need allowed_domains or start_urls:
- scrapy-redis defines the allowed domain range dynamically in the __init__() constructor; alternatively, you can simply write allowed_domains directly.
- You must specify redis_key, the key referenced by the spider start command, in this format:
redis_key = 'mycrawler:start_urls'
- Following that format, lpush the start URLs into the Redis database from redis-cli on the Master side; RedisCrawlSpider then fetches its start_urls from the database.
How to run:
- Run the spider's .py file with runspider (you can also start several copies); the spider(s) will sit in a ready, waiting state:
scrapy runspider mycrawler_redis.py
- On the Master side, issue the push command in redis-cli, in this format:
$redis > lpush mycrawler:start_urls http://www.dmoz.org/
- The spiders fetch the URL and start crawling.
Summary:
- If you only need Redis for deduplication and persistence, choose the first pattern;
- For distributed crawling, choose the second or third depending on the situation;
- Usually, in-depth focused crawlers are written following the third pattern.
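The three-way choice above can be written down as a tiny decision helper (purely illustrative; the function and its arguments are this article's invention, not scrapy-redis API):

```python
def pick_base_class(distributed, needs_link_rules):
    """Mirror the summary: which spider base class fits a project."""
    if not distributed:
        # Redis used only for dedupe and persistence (pattern one)
        return 'CrawlSpider'
    # Distributed: basic parse() spider vs Rule-driven crawler
    return 'RedisCrawlSpider' if needs_link_rules else 'RedisSpider'

print(pick_base_class(False, True))   # CrawlSpider
print(pick_base_class(True, False))   # RedisSpider
print(pick_base_class(True, True))    # RedisCrawlSpider
```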