Summary of scrapy_redis distributed crawlers

Disclaimer: This is an original article by the blogger, licensed under the CC 4.0 BY-SA agreement. Please include the original source link and this statement when reproducing it.
Original link: https://blog.csdn.net/LXJRQJ/article/details/101169607


scrapy_redis: Scrapy_redis builds on scrapy and makes it more powerful, providing request de-duplication, crawler persistence, and easy distribution.

Installation:

pip3 install scrapy-redis

Scrapy_redis workflow

(Workflow diagram: see the official scrapy-redis documentation.)

Step 1: Start Redis

Redis needs to be started first. On macOS / Linux, open a terminal and run:

redis-server

On Windows, open CMD, cd into the folder where Redis is stored, and run:

redis-server.exe
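
A quick way to confirm that Redis is up is to ping it from Python. This is a minimal sketch, assuming the default localhost:6379 with no password and the redis-py package installed:

import redis

# Connect to the local Redis server and check that it responds
r = redis.StrictRedis(host='localhost', port=6379, db=0)
print(r.ping())  # prints True if the server is reachable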

Step 2: Modify the spider
In previous lessons, our spiders inherited from the scrapy.Spider parent class. That is the most basic spider in Scrapy and can only implement basic functionality. It now needs to be replaced in order to implement more advanced features.

Compare the head of the following piece of code, which uses Scrapy_redis, with the readcolor website spider we wrote earlier, and see what the difference is:

from scrapy_redis.spiders import RedisSpider

class ReadColorSpider(RedisSpider):
    name = "readcolorspider"
    redis_key = 'readcolorspider:start_urls'

As you can see, the spider's parent class has been changed to RedisSpider, and one extra line has been added:

redis_key='readcolorspider:start_urls'

Here redis_key is really just a variable name. All the URLs that the spider will crawl are saved in a Redis list named "readcolorspider:start_urls", and the spider also reads the URLs of the next pages to crawl from this list. The variable name can be changed.
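
For example, once Redis is running, the start URL can be pushed into this list before (or after) the spider starts. A minimal sketch, assuming a local Redis instance; the URL here is only a placeholder:

import redis

# Seed the spider's start_urls list in Redis
r = redis.StrictRedis(host='localhost', port=6379, db=0)
r.lpush('readcolorspider:start_urls', 'http://www.example.com/')

The same thing can be done from the command line with redis-cli: lpush readcolorspider:start_urls http://www.example.com/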

Apart from these two points, no other part of the spider code needs to be changed.
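
For context, a minimal sketch of what the full spider might look like; the parse logic and the XPath below are made up and only illustrate that parsing is written exactly as in an ordinary Scrapy spider:

from scrapy_redis.spiders import RedisSpider


class ReadColorSpider(RedisSpider):
    name = "readcolorspider"
    redis_key = 'readcolorspider:start_urls'

    def parse(self, response):
        # Parse the page exactly as in a normal Scrapy spider
        for title in response.xpath('//h2/a/text()').getall():
            yield {'title': title}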

In fact, this already gives us a distributed crawler; it just happens to be running on a single computer for now.

Step 3: Modify the settings

We have now swapped the tricycle for an excavator, but Scrapy is still directing the excavator the way it directed the tricycle, so the excavator cannot work properly yet. Modifying the spider file alone is therefore not enough; Scrapy does not yet recognize this new spider. Now modify settings.py.

(1) Scheduler
First, replace the Scheduler, which acts as Scrapy's dispatcher. Add the following to settings.py:

# Enables scheduling storing requests queue in redis.

SCHEDULER="scrapy_redis.scheduler.Scheduler"

(2) De-duplication

# Ensure all spiders share same duplicates filter through redis.

DUPEFILTER_CLASS="scrapy_redis.dupefilter.RFPDupeFilter"

With these two settings in place, the crawler can already start working normally. A few more settings, however, make it more convenient to use.

(3) Do not clear the Redis queues

# Don't cleanup redis queues, allows to pause/resume crawls.

SCHEDULER_PERSIST=True

If this option is True, the URLs in Redis will not be cleared by Scrapy_redis. The advantage is that when the crawler is stopped and restarted, it continues from where it last paused. The obvious drawback is that if several crawlers all read URLs from the same place, extra code is needed to prevent duplicate crawling.

If it is set to False, Scrapy_redis deletes each URL as soon as it has read it. The advantage is that crawlers on multiple servers never get the same URL, so nothing is crawled twice. The drawback is that when the crawler is paused and restarted, it starts crawling again from the beginning.
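
Putting the pieces together, a consolidated settings.py might look like the sketch below. The REDIS_URL value and the RedisPipeline entry (which stores scraped items in Redis, see Step 5) are assumptions based on the scrapy-redis defaults; adjust them to your own deployment:

# Scheduler and de-duplication backed by Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the Redis queues so crawls can be paused and resumed
SCHEDULER_PERSIST = True

# Store scraped items in Redis so a separate process can export them later
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Connection to the shared (Master) Redis instance
REDIS_URL = 'redis://localhost:6379/0'
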
Step 4: Request scheduling algorithms
There are three options for how the crawler's requests are scheduled:

Queue

SCHEDULER_QUEUE_CLASS='scrapy_redis.queue.SpiderQueue'

If no scheduling algorithm is configured, this is the default. It implements a first-in, first-out queue: requests put into Redis first are crawled first.

Stack

SCHEDULER_QUEUE_CLASS='scrapy_redis.queue.SpiderStack'

With this option, requests put into Redis later are crawled first (last in, first out).

Priority queue

SCHEDULER_QUEUE_CLASS='scrapy_redis.queue.SpiderPriorityQueue'

With this option, a priority algorithm decides which requests are crawled first and which later. The priority algorithm is fairly complex and takes several factors into account, such as the depth of each request.
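
As a rough illustration of how priorities come into play, the priority queue orders requests by each Request's priority value (in Scrapy, a higher value is scheduled earlier). A minimal sketch, where the spider name, redis_key, and the 'detail' URL rule are all hypothetical:

import scrapy
from scrapy_redis.spiders import RedisSpider


class PrioritySketchSpider(RedisSpider):
    name = "priority_sketch"
    redis_key = 'priority_sketch:start_urls'

    def parse(self, response):
        for href in response.xpath('//a/@href').getall():
            # Hypothetical rule: crawl detail pages before list pages
            priority = 10 if 'detail' in href else 0
            yield scrapy.Request(response.urljoin(href), priority=priority)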

Step 5: Export the data stored in the distributed crawler's Redis database

The data has been crawled back, but nothing has been done with it in Redis yet. We did not customize ITEM_PIPELINES in the configuration file earlier; instead we used RedisPipeline, so the scraped data is now all stored in Redis and still needs to be processed.
In the project directory you can find a process_items.py file, a template provided in the scrapy-redis example for reading items out of Redis and processing them.
Suppose we want to read the items out of Redis and write them into MongoDB or MySQL. We can then write our own process_profile.py file and keep it running in the background, so that the data crawled back is continuously written into the database.

Exporting the data to MongoDB

import json
import redis
import pymongo

def main():
    # Redis connection details
    rediscli = redis.StrictRedis(host='localhost', port=6379, db=0)
    # MongoDB connection details
    mongocli = pymongo.MongoClient(host='localhost', port=27017)
    # Select the database
    db = mongocli['database_name']
    # Select the collection
    sheet = db['collection_name']
    while True:
        # blpop for FIFO mode, brpop for LIFO mode; returns (key, value)
        source, data = rediscli.blpop("project_name:items")
        data = data.decode('utf-8')
        item = json.loads(data)
        try:
            sheet.insert_one(item)
            print("Processing: insert succeeded %r" % item)
        except Exception as err:
            print("error processing %r: %r" % (item, err))

if __name__ == '__main__':
    main()

Exporting the data to MySQL
First start MySQL.
Then create the database and table.
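
For example, a minimal sketch of this preparation step using pymysql, assuming a local MySQL server; the database name crawl_data, the table name items, and its columns are hypothetical and should match the fields your spider actually produces:

import pymysql

# Connect to the local MySQL server
conn = pymysql.connect(host='localhost', user='user', password='password',
                       port=3306, charset='utf8')
cur = conn.cursor()
# Create a hypothetical database and table for the exported items
cur.execute("CREATE DATABASE IF NOT EXISTS crawl_data DEFAULT CHARACTER SET utf8")
cur.execute("""
    CREATE TABLE IF NOT EXISTS crawl_data.items (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        url VARCHAR(255)
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
cur.close()
conn.close()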

import json
import redis
import pymysql

def main():
    # Redis connection details
    rediscli = redis.StrictRedis(host='localhost', port=6379, db=0)
    # MySQL connection details
    mysqlcli = pymysql.connect(host='localhost', user='user', passwd='password',
                               db='database', port=3306, charset='utf8')
    # Get a cursor with the cursor() method
    cur = mysqlcli.cursor()
    while True:
        # blpop for FIFO mode, brpop for LIFO mode; returns (key, value)
        source, data = rediscli.blpop("project_name:items")
        item = json.loads(data.decode('utf-8'))
        try:
            # Execute the SQL INSERT statement with the execute() method
            # (fill in your own INSERT statement and the item's values)
            cur.execute("SQL statement", ['data', ...])
            # Commit the transaction
            mysqlcli.commit()
            print("insert succeeded")
        except Exception as err:
            # Insert failed
            print("Mysql Error", err)
            mysqlcli.rollback()

if __name__ == '__main__':
    main()

First, the main differences between scrapy and scrapy-redis

scrapy is a Python crawler framework with high crawling efficiency and a high degree of customization, but by itself it does not support distribution.
scrapy-redis is a set of components based on the Redis database that runs on top of the scrapy framework and lets scrapy support a distributed strategy: the Slaver side shares the item queue, request queue, and request fingerprint set stored in the Master side's Redis database.

Second, why choose the Redis database?

Because Redis supports master-slave synchronization and its data is cached in memory, a Redis-based distributed crawler is very efficient at handling high-frequency requests and reading data.
Which crawler modules or frameworks have you used? Talk about their differences, or their advantages and disadvantages.
Built into Python: urllib, urllib2

        Third-party: requests

        Framework: Scrapy

Both urllib and urllib2 perform operations related to requesting URLs, but they provide different functionality.

       urllib2: urllib2.urlopen can accept either a Request object or a URL (when passing a Request object, you can set headers for the request),

       while urllib.urlopen only accepts a URL.

       urllib has urlencode and urllib2 does not, which is why urllib and urllib2 are often used together (see the short example below).
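
For example, under Python 2 (where urllib2 exists), the two modules are typically combined like this; the URL and headers are only placeholders:

import urllib
import urllib2

# urllib provides urlencode; urllib2 accepts a Request object with headers
data = urllib.urlencode({'q': 'scrapy'})
req = urllib2.Request('http://httpbin.org/post', data=data,
                      headers={'User-Agent': 'Mozilla/5.0'})
print urllib2.urlopen(req).read()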

Scrapy advantages and disadvantages:

  1. Advantages:

             scrapy is asynchronous

             it uses the more readable XPath instead of regular expressions

             it has a powerful statistics and logging system

             it crawls different URLs at the same time

             it supports a shell mode, which is convenient for standalone debugging

             you can write middleware, which makes it easy to add unified filters

             data is stored in the database through pipelines
    
    2. Disadvantages:

           it is a Python-based crawler framework, so its extensibility is relatively poor

           it is based on the twisted framework: an exception during a run will not kill the reactor,

           and the asynchronous framework does not stop the other tasks after an error, so data errors are hard to notice
