The first step is to install the scrapy-redis module:
pip install scrapy-redis
The second step is to create a project.
In the terminal / cmd, change into the directory where the project should live, then create it:
cd <path>
scrapy startproject douban (douban is the project name)
Then open the project in PyCharm.
The third step is to create a spider.
1. In the terminal, go to the spiders directory of your project.
2. Run: scrapy genspider douban_spider movie.douban.com (spider name, domain the crawl is restricted to)
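Running genspider produces a spider skeleton roughly like the following. This is a sketch of the typical template output (the exact contents depend on your scrapy version); in the real generated file the class inherits from scrapy.Spider.

```python
# douban/spiders/douban_spider.py -- sketch of the `scrapy genspider` output
# (in the real file this class inherits from scrapy.Spider)

class DoubanSpiderSpider:
    name = 'douban_spider'                     # unique spider name
    allowed_domains = ['movie.douban.com']     # crawl stays inside this domain
    start_urls = ['http://movie.douban.com/']  # initial url(s) to fetch

    def parse(self, response):
        # callback invoked for each downloaded response; extraction logic goes here
        pass
```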
The fourth step is configuration.
Non-distributed crawler settings:
1. Change ROBOTSTXT_OBEY = True to False.
2. Enable the item pipelines:
ITEM_PIPELINES = {
    'JD_redis.pipelines.JdRedisPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,  # added for the distributed crawler
}
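The JdRedisPipeline above is the project's own pipeline. A minimal sketch of what such a pipeline might look like (the class body here is a hypothetical example, not the project's actual code):

```python
# pipelines.py -- hypothetical sketch of a project pipeline like JdRedisPipeline
import json

class JdRedisPipeline:
    def open_spider(self, spider):
        # called once when the spider starts; open the output file
        self.file = open('items.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every scraped item; must return the item so that
        # later pipelines (e.g. scrapy_redis's RedisPipeline at priority 400)
        # also receive it
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```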
3. Enable and correct the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
}
4. Add the following scrapy-redis settings:
# Use the scrapy-redis dedup component instead of scrapy's default deduplication
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler component instead of the default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pausing; the request records in redis are not lost
SCHEDULER_PERSIST = True
# Default scrapy-redis request queue type (a priority queue)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# Redis database connection parameters
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
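To illustrate what the RFPDupeFilter setting buys you: each request is reduced to a fingerprint and stored in a redis set shared by all workers, so any worker can skip urls another worker has already queued. Below is a simplified, stdlib-only sketch of the idea; the real implementation hashes the request via scrapy's request fingerprinting and stores fingerprints in a redis SET, not a local Python set.

```python
import hashlib

class SimpleDupeFilter:
    """Simplified stand-in for scrapy_redis's RFPDupeFilter.

    The real filter fingerprints method/url/body and stores fingerprints
    in a redis SET shared by every crawler process.
    """

    def __init__(self):
        self.fingerprints = set()  # stands in for the shared redis set

    def request_seen(self, method, url):
        fp = hashlib.sha1(f'{method} {url}'.encode()).hexdigest()
        if fp in self.fingerprints:
            return True   # duplicate: the scheduler drops the request
        self.fingerprints.add(fp)
        return False      # first time seen: the request is enqueued
```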
The fifth step is to prepare the project (modifications to the original non-distributed crawler project).
1. Modify the spider file:
The original file is:
import scrapy
from JD.items import JdItem

class BookSpider(scrapy.Spider):
    name = 'book'
    # allowed_domains = ['jd.com', 'p.3.cn']
    start_urls = ['https://book.jd.com/booksort.html']
Change it to:
import scrapy
from scrapy_redis.spiders import RedisSpider
from JD.items import JdItem

class BookSpider(RedisSpider):
    name = 'book'
    allowed_domains = ['jd.com', 'p.3.cn']
    # start_urls = ['https://book.jd.com/booksort.html']
    redis_key = 'book:start_urls'  # 'book' can be any name you choose
Only two places change. First, the inherited class: scrapy.Spider becomes RedisSpider. Second, start_urls is no longer needed; it is replaced by redis_key = "xxxxx", where the key name can be chosen freely (using the project name, as in book:start_urls, is the usual convention). Because every request in a distributed scrapy-redis crawl is taken from redis, the initial url is stored in the redis database under redis_key; scrapy-redis automatically reads that value from redis and uses it as the initial url to start crawling.
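The start-url handoff described above can be sketched with a plain Python list standing in for the redis list. This is only an illustration of the mechanism (in reality the operator runs `lpush` in redis and the spider blocks on a redis pop against redis_key); the function names here are illustrative, not scrapy-redis API.

```python
# Simplified illustration of the redis_key mechanism; a Python dict of
# lists stands in for the redis server, and lpush/pop mimic the redis
# commands involved.

redis_lists = {}  # fake redis server: key -> list

def lpush(key, value):
    # what `lpush book:start_urls <url>` does on the operator's side
    redis_lists.setdefault(key, []).insert(0, value)

def fetch_start_url(redis_key):
    # what RedisSpider effectively does: pop the next url stored under redis_key
    urls = redis_lists.get(redis_key, [])
    return urls.pop() if urls else None

lpush('book:start_urls', 'https://book.jd.com/booksort.html')
first = fetch_start_url('book:start_urls')  # the spider receives the initial url
```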
2. Run:
On the command line, enter: scrapy runspider douban_spider.py (the spider file name)
3. On the redis server, push the initial url (the key must match the spider's redis_key):
lpush book:start_urls https://book.jd.com/booksort.html