After a successful crawl, Scrapy can persist the scraped items locally or in a database, in a variety of formats. See the official documentation:
https://docs.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline
This post covers saving to MySQL.
First, register the pipeline in settings.py:
ITEM_PIPELINES = {
    xxxxx
    'ArticleSpider.pipelines.MysqlPipeline': 20,
    xxxxx
}
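The pipeline code below references two item classes, NovelInfoItem and NovelDetailItem, whose definitions the post does not show. A minimal sketch of what they might look like in items.py, with the field names taken from the pipeline code (everything else is an assumption):

import scrapy

class NovelInfoItem(scrapy.Item):
    # fields inferred from the insert_novelinfo_sql parameters below
    novel_id = scrapy.Field()
    novel_url = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    introduction = scrapy.Field()
    category = scrapy.Field()
    picture_url = scrapy.Field()
    picture_path = scrapy.Field()
    update_date = scrapy.Field()

class NovelDetailItem(scrapy.Item):
    # fields inferred from the insert_noveldetail_sql parameters below
    novel_id = scrapy.Field()
    chapter_url = scrapy.Field()
    chapter_id = scrapy.Field()
    chapter_name = scrapy.Field()
    novel_detail = scrapy.Field()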
The actual persistence logic is written in pipelines.py.
MysqlPipeline (synchronous, no deduplication; the simplest implementation):
from ArticleSpider.items import NovelInfoItem, NovelDetailItem
import MySQLdb

class MysqlPipeline(object):
    def __init__(self):
        # connection arguments: host, user, password, database (placeholders here)
        self.conn = MySQLdb.connect('xxxx', 'mysql', 'xxxx', 'xxxx',
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # store each newly crawled item
        insert_novelinfo_sql = """
            insert into novel_info(novel_id, novel_url, title, author, introduction,
                                   category, picture_url, picture_path, update_time)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        insert_noveldetail_sql = """
            insert into novel_content(novel_id, chapter_url, chapter_id, chapter_name, novel_detail)
            VALUES (%s, %s, %s, %s, %s)
        """
        if isinstance(item, NovelInfoItem):
            self.cursor.execute(insert_novelinfo_sql,
                                (item["novel_id"], item["novel_url"], item["title"],
                                 item["author"], item["introduction"], item["category"],
                                 item["picture_url"], item["picture_path"], item["update_date"]))
        elif isinstance(item, NovelDetailItem):
            self.cursor.execute(insert_noveldetail_sql,
                                (item["novel_id"], item["chapter_url"], item["chapter_id"],
                                 item["chapter_name"], item["novel_detail"]))
        self.conn.commit()
        return item
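For reference, the INSERT statements above target two tables, novel_info and novel_content, whose schema the post never shows. A possible one-off setup script; the column names follow the SQL above, while the types, sizes and keys are assumptions:

import MySQLdb

# Column names follow the INSERT statements above; types, sizes and keys are guesses.
CREATE_TABLES = [
    """CREATE TABLE IF NOT EXISTS novel_info (
        novel_id     VARCHAR(64)  NOT NULL PRIMARY KEY,
        novel_url    VARCHAR(255) NOT NULL,
        title        VARCHAR(255),
        author       VARCHAR(255),
        introduction TEXT,
        category     VARCHAR(64),
        picture_url  VARCHAR(255),
        picture_path VARCHAR(255),
        update_time  VARCHAR(32)
    ) DEFAULT CHARSET=utf8""",
    """CREATE TABLE IF NOT EXISTS novel_content (
        novel_id     VARCHAR(64)  NOT NULL,
        chapter_url  VARCHAR(255) NOT NULL PRIMARY KEY,
        chapter_id   VARCHAR(64),
        chapter_name VARCHAR(255),
        novel_detail LONGTEXT
    ) DEFAULT CHARSET=utf8""",
]

def create_tables():
    conn = MySQLdb.connect('xxxx', 'mysql', 'xxxx', 'xxxx', charset="utf8", use_unicode=True)
    cursor = conn.cursor()
    for ddl in CREATE_TABLES:
        cursor.execute(ddl)
    conn.commit()
    conn.close()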
MysqlPipeline (synchronous, with deduplication logic):
There are several ways to deduplicate:
- Relational database: query the database once for every incoming URL; efficiency drops sharply as the data volume grows.
- Cache database: keep the crawled URLs in Redis, using its Set data type.
- In-memory dedup, with several sub-variants: store raw URLs directly in a HashSet; store the md5 of each URL in a HashSet (see the sketch after this list); or use a Bit-Map, building a BitSet and mapping each URL to a single bit through a hash function.
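A minimal sketch of the md5-in-HashSet variant (plain Python, purely illustrative):

import hashlib

seen = set()

def is_duplicate(url):
    # store the md5 digest instead of the raw URL to keep memory usage down
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False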
This post takes the cache-database route: the data volume is small, so the URLs go straight into a Redis Set.
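Both the pipeline and the spider below use a module-level redis_db client that the post never defines. A minimal sketch of what it might look like (host, port and db are assumptions):

import redis

# Shared Redis connection used by the pipeline and the spider;
# connection parameters are placeholders.
redis_db = redis.StrictRedis(host="localhost", port=6379, db=0, decode_responses=True)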
The key code is as follows:
from scrapy.exceptions import DropItem
import MySQLdb

# store every crawled URL in Redis
class MysqlPipeline(object):
    def __init__(self):
        # name of the Redis set that holds the crawled chapter URLs
        self.redis_data_dict = "crawredUrl"
        self.conn = MySQLdb.connect('39.106.143.131', 'mysql', '123456', 'sanguo',
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()
        # crawred_urls = self.exists_urls()  # read the data already saved in MySQL
        # for crawred_url in crawred_urls:
        #     # do not save URLs that have already been crawled
        #     if not redis_db.sismember(self.redis_data_dict, crawred_url):
        #         # save them as a Redis set
        #         redis_db.sadd(self.redis_data_dict, crawred_url)

    def process_item(self, item, spider):
        # store each newly crawled item
        insert_novelinfo_sql = """
            insert into novel_info(novel_id, novel_url, title, author, introduction,
                                   category, picture_url, picture_path, update_time)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        insert_noveldetail_sql = """
            insert into novel_content(novel_id, chapter_url, chapter_id, chapter_name, novel_detail)
            VALUES (%s, %s, %s, %s, %s)
        """
        if isinstance(item, NovelInfoItem):
            self.cursor.execute(insert_novelinfo_sql,
                                (item["novel_id"], item["novel_url"], item["title"],
                                 item["author"], item["introduction"], item["category"],
                                 item["picture_url"], item["picture_path"], item["update_date"]))
        elif isinstance(item, NovelDetailItem):
            # check the item's chapter_url against the Redis set: if it is already
            # there, drop the item; otherwise save it and pass it on
            if redis_db.sismember(self.redis_data_dict, item["chapter_url"]):
                print(item["chapter_id"], 'has been finished')
                raise DropItem("Duplicate item found: %s" % item)
            self.cursor.execute(insert_noveldetail_sql,
                                (item["novel_id"], item["chapter_url"], item["chapter_id"],
                                 item["chapter_name"], item["novel_detail"]))
            # record the URL in Redis after each successful save
            redis_db.sadd(self.redis_data_dict, item["chapter_url"])
        self.conn.commit()
        return item

    # Query the URLs already stored in MySQL so they can be loaded into Redis.
    # Only needed the first time, to seed Redis from MySQL; afterwards every
    # newly crawled URL is written straight to Redis.
    def exists_urls(self):
        sql = "SELECT chapter_url, chapter_id FROM novel_content"
        self.cursor.execute(sql)
        results = self.cursor.fetchall()
        chapter_urls = []
        for row in results:
            chapter_url = row[0]
            chapter_urls.append(chapter_url)
        return chapter_urls

    def close_spider(self, spider):
        self.conn.close()
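Both versions above are synchronous: every INSERT blocks inside Scrapy's event loop. If throughput becomes an issue, a common alternative is Twisted's adbapi connection pool, which runs the inserts on a thread pool. A sketch (not the post's code; connection parameters are placeholders, and only the chapter insert is shown):

import MySQLdb
from twisted.enterprise import adbapi

class AsyncMysqlPipeline(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool(
            "MySQLdb", host="xxxx", user="mysql", passwd="xxxx", db="xxxx",
            charset="utf8", use_unicode=True)

    def process_item(self, item, spider):
        # runInteraction executes do_insert(cursor, item) on a worker thread,
        # so the event loop is never blocked by MySQL
        d = self.dbpool.runInteraction(self.do_insert, item)
        d.addErrback(self.handle_error, item, spider)
        return item

    def do_insert(self, cursor, item):
        cursor.execute(
            "insert into novel_content(novel_id, chapter_url, chapter_id, chapter_name, novel_detail) "
            "VALUES (%s, %s, %s, %s, %s)",
            (item["novel_id"], item["chapter_url"], item["chapter_id"],
             item["chapter_name"], item["novel_detail"]))

    def handle_error(self, failure, item, spider):
        spider.logger.error(failure)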
In addition, the spider itself needs to filter the URLs it requests, so that pages already crawled are not fetched again. The spider, under the spiders package:
import scrapy
from scrapy.http import Request
from urllib import parse

class Novel37Spider(scrapy.Spider):
    name = 'novel37'
    allowed_domains = ['xxxx']
    start_urls = ['xxxx']
    # name of the Redis set that holds the crawled chapter URLs
    redis_data_dict = 'crawredUrl'

    def parse(self, response):
        xxxx

    def parse_per(self, response):
        xxxx
        chapter_urls = novel_info.xpath('//*[@id="list"]//a/@href').extract()
        # visit each chapter's detail page
        for chapter_url in chapter_urls:
            # if the chapter_url has already been crawled, do not pass it on for a callback
            url = parse.urljoin(response.url, chapter_url)
            if not self.is_crawred_url(url):
                print(url, 'detail to get')
                yield Request(url=url, callback=self.parse_detail)

    # check Redis to see whether a URL has already been crawled
    def is_crawred_url(self, crawred_url):
        return redis_db.sismember(self.redis_data_dict, crawred_url)
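With the pipeline enabled in settings.py, the flow can be exercised with scrapy crawl novel37: on the first run every chapter URL is fetched and recorded in Redis; on later runs is_crawred_url filters already-crawled chapters before the request is scheduled, and the pipeline's sismember check drops any duplicates that still slip through.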