scrapy-redis distributed crawlers

The first step is to install the scrapy-redis module:

  pip install scrapy-redis

The second step is to create a project

  In the terminal / cmd, change into the directory where the project should be created:

cd <path>

scrapy startproject douban (douban is the project name)

  Then open the project in PyCharm.
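
Running startproject generates a project scaffold roughly like this (the exact files vary slightly by Scrapy version):

douban/
    scrapy.cfg            # deploy configuration
    douban/
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # the settings modified in step four
        spiders/          # the spiders created in step three
            __init__.py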

The third step is to create a spider

  1. In the terminal, go to the spiders directory of your project:

  Run scrapy genspider douban_spider movie.douban.com (spider name, allowed crawl domain)
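
genspider creates a skeleton spider roughly like the following (the exact template depends on your Scrapy version):

import scrapy

class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        pass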

The fourth step is configuration (settings.py)

  Settings that also apply to a non-distributed crawler:

  1. Change ROBOTSTXT_OBEY = True to False

  2. Enable the pipelines (a sketch of consuming the items RedisPipeline stores follows this step's settings):
ITEM_PIPELINES = {
    'JD_redis.pipelines.JdRedisPipeline': 300,  # your own project's pipeline (the examples below use a JD book-crawling project)
    'scrapy_redis.pipelines.RedisPipeline': 400,  # added for the distributed crawler
}
  3. Uncomment and modify the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
}
  4. Add the following settings:

# Use the scrapy-redis dedupe component instead of scrapy's default deduplication
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler component instead of the default scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pausing: requests recorded in redis are not lost
SCHEDULER_PERSIST = True
# The default scrapy-redis request queue (by priority)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
# Specify the redis database connection parameters
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
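
With RedisPipeline enabled, each scraped item is also serialized as JSON and pushed onto the redis list "<spider name>:items". A minimal sketch of a separate consumer process, assuming the redis-py package and the book spider from step five:

import json
import redis

r = redis.Redis(host='127.0.0.1', port=6379)
while True:
    # BLPOP blocks until RedisPipeline pushes the next item onto the list
    _, data = r.blpop('book:items')
    item = json.loads(data)
    print(item)  # or store it in a database, write it to a file, etc.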

The fifth step is to adapt the project (modify the original non-distributed crawler project)

  1. Modify the spider file:

The original file is:

import scrapy
from JD.items import JdItem

class BookSpider(scrapy.Spider):
    name = 'book'
    # allowed_domains = ['jd.com','p.3.cn']
    start_urls = ['https://book.jd.com/booksort.html']
Change it to:

import scrapy
from scrapy_redis.spiders import RedisSpider
from JD.items import JdItem

class BookSpider(RedisSpider):
    name = 'book'
    allowed_domains = ['jd.com','p.3.cn']
    # start_urls = ['https://book.jd.com/booksort.html']
    redis_key = 'book:start_urls'  # 'book' can be any name you choose

Only two places are modified. One is the inherited class: scrapy.Spider becomes RedisSpider.

The other: start_urls is no longer needed and is replaced by redis_key = "xxxxx", where the key name can be anything for the moment.

The general convention is "<name>:start_urls" (here, book:start_urls), which takes the place of start_urls as the source of the initial crawl url. Since in distributed scrapy-redis every request is taken out of redis, you set the initial url in the redis database as the value stored at redis_key; scrapy-redis automatically reads that value from redis, uses it as the initial url, and starts crawling.
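
For example, you can push that initial url from Python instead of the redis-cli command in step 3 below (a sketch assuming the redis-py package):

import redis

# Connect to the redis instance configured via REDIS_HOST / REDIS_PORT
r = redis.Redis(host='127.0.0.1', port=6379)
# Every waiting spider instance picks this up as its initial url
r.lpush('book:start_urls', 'https://book.jd.com/booksort.html')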

 

  2. Run:

    In the command line, enter: scrapy runspider douban_spider.py (the spider file name)

    The same command can be started on multiple machines or terminals; each instance connects to the shared redis queue and waits for urls.

  3. In the redis server, push the initial url (the key must match the spider's redis_key):

lpush book:start_urls https://book.jd.com/booksort.html
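
While the crawl runs, you can inspect the keys scrapy-redis maintains, all named with the "<spider name>:" prefix (a sketch assuming the redis-py package):

import redis

r = redis.Redis(host='127.0.0.1', port=6379, decode_responses=True)
print(r.keys('book:*'))      # e.g. book:requests, book:dupefilter, book:items
print(r.llen('book:items'))  # scraped items waiting to be consumed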
