Scrapy Crawler Framework (Part 2)

Data persistence for Scrapy crawlers

I. Local file persistence:

  The simplest approach is to export items to a JSON file; when running the spider, the command is: scrapy crawl name -o xxx.json

  For the JSON Lines format, the command is: scrapy crawl name -o xxx.jl
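
  Newer Scrapy versions (2.1+) can also configure these exports once in settings.py via the FEEDS setting instead of passing -o on every run; a minimal sketch, with illustrative output file names:

FEEDS = {
    "items.json": {"format": "json", "encoding": "utf8"},
    "items.jl": {"format": "jsonlines", "encoding": "utf8"},
}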

II. Database persistence:

  1. MySQL storage:

    (1) Set the MySQL-related parameters in the settings file and enable ITEM_PIPELINES:

MYSQL_HOST = "localhost"
MYSQL_DBNAME = "yyyyyy"
MYSQL_USER = "root"
MYSQL_PASSWD = "xxxxxx"

ITEM_PIPELINES = {
    'Spider.pipelines.SpiderPipeline': 300,
}
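
    The demo in step (2) below imports the settings module directly; alternatively, Scrapy can hand a pipeline the live settings through the from_crawler hook. A minimal sketch of that variant (connection code shortened):

import pymysql


class SpiderPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler;
        # crawler.settings exposes the values from settings.py above
        return cls(crawler.settings)

    def __init__(self, settings):
        self.con = pymysql.connect(
            host=settings.get("MYSQL_HOST"),
            db=settings.get("MYSQL_DBNAME"),
            user=settings.get("MYSQL_USER"),
            passwd=settings.get("MYSQL_PASSWD"),
            charset="utf8",
        )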

      (2) Write the SpiderPipeline class in the pipelines file:

      Demo:

import logging

import pymysql

from Spider import settings


class SpiderPipeline(object):
    def __init__(self):
        # Connect to MySQL with the parameters defined in settings.py
        self.con = pymysql.connect(
            host=settings.MYSQL_HOST,
            db=settings.MYSQL_DBNAME,
            user=settings.MYSQL_USER,
            passwd=settings.MYSQL_PASSWD,
            charset='utf8',
            use_unicode=True
        )
        self.cursor = self.con.cursor()

    def process_item(self, item, spider):
        try:
            # De-duplicate: update the row if the name already exists,
            # otherwise insert a new one ("book" is a placeholder table name)
            self.cursor.execute("select * from book where name=%s", (item["name"],))
            ret = self.cursor.fetchone()
            if ret:
                self.cursor.execute(
                    "update book set author=%s, score=%s where name=%s",
                    (item["author"], item["score"], item["name"]))
            else:
                self.cursor.execute(
                    "insert into book(name, author, score) values (%s, %s, %s)",
                    (item["name"], item["author"], item["score"]))
            self.con.commit()
        except Exception as e:
            logging.error(e)
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider closes
        self.cursor.close()
        self.con.close()
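
    The demo assumes the table already exists; book and its columns are placeholder names matching the item fields above. A one-time setup sketch:

import pymysql

con = pymysql.connect(host="localhost", db="yyyyyy", user="root",
                      passwd="xxxxxx", charset="utf8")
with con.cursor() as cursor:
    # Placeholder schema for the fields used in the pipeline demo
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS book (
            id     INT AUTO_INCREMENT PRIMARY KEY,
            name   VARCHAR(255) NOT NULL,
            author VARCHAR(255),
            score  VARCHAR(32)
        ) DEFAULT CHARSET=utf8
    """)
con.commit()
con.close()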

   2. Redis storage demo:

import redis


class ProxyspiderPipeline(object):
    def __init__(self):
        # db 15 and the password are project-specific values
        self.con = redis.StrictRedis(host="localhost", port=6379, db=15, password="123456")

    def process_item(self, item, spider):
        # A Redis set de-duplicates the stored proxies automatically
        self.con.sadd("usable_proxy", item["proxy"])
        return item

    In this demo the data goes into a Redis set; any other Redis data type works as well (see the sketch below). The methods of the redis Python library map almost one-to-one onto the native Redis commands.
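
    A quick sketch of some alternatives, with made-up key names and values; the zadd mapping signature assumes redis-py 3.x:

import json

import redis

con = redis.StrictRedis(host="localhost", port=6379, db=15, password="123456")

# list: preserves insertion order and allows duplicates
con.lpush("proxy_list", "127.0.0.1:8080")
# hash: attaches a JSON blob of metadata to each proxy
con.hset("proxy_meta", "127.0.0.1:8080", json.dumps({"latency_ms": 120}))
# sorted set: ranks proxies by a numeric score
con.zadd("proxy_scores", {"127.0.0.1:8080": 0.95})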

  3. MongoDB storage:

    To be written.
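
    In the meantime, a minimal sketch of the usual pymongo-based pipeline; the connection URI, database, and collection names are illustrative assumptions:

import pymongo


class MongoPipeline(object):
    def __init__(self):
        # Illustrative connection details, not project settings
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["spider_db"]["items"]

    def process_item(self, item, spider):
        # MongoDB stores documents, so the item converts directly to a dict
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()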

  4. SQLite storage:

    To be written.
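
    Likewise, a minimal sketch using the standard-library sqlite3 module; the database file name and the placeholder book table mirror the MySQL demo:

import sqlite3


class SqlitePipeline(object):
    def __init__(self):
        # SQLite stores everything in a single local file
        self.con = sqlite3.connect("spider.db")
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS book (name TEXT, author TEXT, score TEXT)"
        )

    def process_item(self, item, spider):
        self.con.execute(
            "INSERT INTO book (name, author, score) VALUES (?, ?, ?)",
            (item["name"], item["author"], item["score"]),
        )
        self.con.commit()
        return item

    def close_spider(self, spider):
        self.con.close()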

 


Reposted from www.cnblogs.com/lambs/p/9132593.html