Storage operations commonly used by Python 3 crawlers

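Once a crawl finishes, the results need to be saved. This article covers the general-purpose read and write methods in Python (common file and database operations):
I. TXT files
1. Reading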

import json

path = r"\test.json"
# Open modes: r, w, a, rb, wb, a+, ab
with open(path, "r", encoding='utf-8') as file:
    r = file.read()
    # The file here happens to hold JSON text, so parse it after reading
    t = json.loads(r)

2. Writing

with open('test.txt', "w", encoding='utf-8') as file:
    r = file.write('text')
    r = file.write(' '.join(['text1', 'text2']))
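To illustrate the open modes listed above, here is a minimal sketch that writes a file, appends to it, and reads it back line by line. The file name `demo.txt` and the temporary directory are assumptions made for this example only.

```python
import os
import tempfile

# Hypothetical file inside a temporary directory (an assumption for this demo)
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, "demo.txt")

# "w" creates the file, or truncates it if it already exists
with open(txt_path, "w", encoding="utf-8") as file:
    file.write("first line\n")

# "a" appends to the end instead of overwriting
with open(txt_path, "a", encoding="utf-8") as file:
    file.write(" ".join(["text1", "text2"]) + "\n")

# Iterating over the file object yields one line at a time
with open(txt_path, "r", encoding="utf-8") as file:
    lines = [line.rstrip("\n") for line in file]

print(lines)
```

Reading line by line this way avoids loading a large crawl result into memory all at once, which `file.read()` would do.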

II. JSON files
1. Reading

import json

with open(path, 'r', encoding='utf-8') as file:
    json_text = file.read()
    # loads parses JSON text into a Python object (dict, list, ...)
    text = json.loads(json_text)

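2. Writing

with open(path, 'w', encoding='utf-8') as file:
    # indent is the number of spaces per indentation level; ensure_ascii=False keeps non-ASCII characters as-is, which is what UTF-8 output needs
    # dumps serializes a Python object into a JSON string
    file.write(json.dumps(json_object, indent=2, ensure_ascii=False))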

Examples of JSON objects:

1. data = [ { 'a' : 1, 'b' : 2, 'c' : 3, 'd' : 4, 'e' : 5 } ]
2. {'a': 'Runoob', 'b': 7}
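The read and write steps can also be combined into a round trip. The sketch below uses `json.dump` and `json.load`, which work directly on the file object and skip the intermediate string. The file name `demo.json`, the temporary directory, and the sample data are assumptions made for this example.

```python
import json
import os
import tempfile

# Sample data with a non-ASCII value (an assumption for this demo)
json_object = [{"a": "Runoob", "b": 7, "c": "中文"}]
json_path = os.path.join(tempfile.mkdtemp(), "demo.json")

# json.dump writes straight to the file; ensure_ascii=False keeps "中文" readable
with open(json_path, "w", encoding="utf-8") as file:
    json.dump(json_object, file, indent=2, ensure_ascii=False)

# json.load parses straight from the file
with open(json_path, "r", encoding="utf-8") as file:
    loaded = json.load(file)

# Read the raw text as well, to see what was actually stored on disk
with open(json_path, "r", encoding="utf-8") as file:
    raw_text = file.read()

print(loaded == json_object)
```

With the default `ensure_ascii=True`, the file would instead contain `\uXXXX` escape sequences, which still load correctly but are harder to read by eye.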

III. CSV files
1. Reading

import csv

with open(path, 'r', encoding='utf-8', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Alternatively, the pandas module offers a more concise way:
import pandas as pd

df=pd.read_csv('csv_file')

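2. Writing

with open(path, 'w', encoding='utf-8', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    # row is a dict whose keys match fieldnames
    writer.writerow(row)
Alternatively, use pandas' to_csv().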
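As a self-contained illustration of the CSV steps, the sketch below writes a list of dicts with `csv.DictWriter` and reads them back with `csv.DictReader`. The field names, sample rows, and the file name `demo.csv` are assumptions made for this example.

```python
import csv
import os
import tempfile

# Hypothetical crawl results (assumptions for this demo)
fieldnames = ["title", "url"]
rows = [
    {"title": "page one", "url": "http://example.com/1"},
    {"title": "page two", "url": "http://example.com/2"},
]
csv_path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# newline="" lets the csv module control line endings itself
with open(csv_path, "w", encoding="utf-8", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()    # first row: column names
    writer.writerows(rows)  # one row per dict

# DictReader uses the header row as the keys of each row
with open(csv_path, "r", encoding="utf-8", newline="") as file:
    read_back = [dict(row) for row in csv.DictReader(file)]

print(read_back)
```

Note that every value comes back as a string, so numeric columns need an explicit conversion after reading.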

IV. MySQL
1. Writing

import pymysql

# port must be an integer; 3306 is the MySQL default
db = pymysql.connect(host='localhost', user='root', password='mima', port=3306, database='dbname')
cursor = db.cursor()
# SQL statements (n and default_value are placeholders to fill in)
sql_database = 'create database db_name default character set utf8'
sql_table = "create table if not exists table_name (field1 varchar(n) not null, field2 varchar(n) default 'default_value', primary key(field1))"
# Build the insert statement dynamically from the data dict
table = 'tablename'
keys = ','.join(data.keys())
values = ','.join(['%s'] * len(data))
sql_insert = 'insert into {table}({keys}) values ({values})'.format(table=table, keys=keys, values=values)
execute_sql = 'SQL statement'
try:
    cursor.execute(sql_insert, tuple(data.values()))
    db.commit()
except Exception:
    db.rollback()
db.close()
where data is:
{key1: value1, key2: value2}

SQL Server works much the same way; just use pymssql instead.
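The dynamic insert statement can be isolated into a small helper so that it is easy to inspect without a database connection. `build_insert` is a hypothetical name introduced for this sketch, and the sample table and data are assumptions as well.

```python
def build_insert(table, data):
    """Return (sql, params) for a parameterized insert built from a dict.

    Hypothetical helper for illustration. The table and column names are
    formatted directly into the SQL text, so they must come from trusted
    code, never from crawled input. The values travel separately as params.
    """
    keys = ",".join(data.keys())
    placeholders = ",".join(["%s"] * len(data))
    sql = "insert into {table}({keys}) values ({values})".format(
        table=table, keys=keys, values=placeholders
    )
    return sql, tuple(data.values())


# Sample record (an assumption for this demo)
sql_insert, params = build_insert("tablename", {"name": "Bob", "age": 20})
print(sql_insert)
print(params)
```

The resulting pair would be passed as `cursor.execute(sql_insert, params)`, which leaves the quoting and escaping of the values to the driver.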

V. MongoDB
1. Writing

import pymongo

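# Set up the connection
client = pymongo.MongoClient(host='localhost', port=27017)
# or: client = pymongo.MongoClient('mongodb://localhost:27017')
# Select the database; it is created on first write if it does not exist
db = client.test  # or db = client['test']
# Select the collection (similar to a table)
collection = db.result  # or db['result']
data = {key1: value1, key2: value2}
collection.insert_one(data)
# Insert multiple records
collection.insert_many([data1, data2])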

2. Reading

result = collection.find_one({'key': 'value'})  # similar to where key=value
results = collection.find({'key': 'value'})  # returns a Cursor, which is iterated lazily
for result in results:
    print(result)

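For range-style conditions, such as greater-than or regular expressions, query as follows:

results = collection.find({'key': {'$gt': 20}})  # $gt means greater than
# Others: $lt, $gte, $lte, $ne, $in, $nin
results = collection.find({'key': {'$regex': '[a-z].*'}})
There are also $exists, $type, $mod, $where, and so on.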
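Several comparison operators can sit in the same condition dict, which is how a bounded range is expressed. The sketch below only builds the filter, so it runs without a MongoDB server. `range_filter` is a hypothetical helper name and the `age` field is an assumption for this example.

```python
def range_filter(field, low=None, high=None):
    """Build a filter matching low <= field < high.

    Hypothetical helper for illustration; either bound may be omitted,
    but at least one is required.
    """
    if low is None and high is None:
        raise ValueError("at least one bound is required")
    condition = {}
    if low is not None:
        condition["$gte"] = low   # greater than or equal
    if high is not None:
        condition["$lt"] = high   # strictly less than
    return {field: condition}


age_filter = range_filter("age", low=20, high=30)
print(age_filter)
```

The resulting dict would be passed as `collection.find(age_filter)`, in the same way as the single-operator queries above.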

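VI. Redis (StrictRedis)
Redis is an efficient, in-memory, non-relational database, and pyspider appears to use it. I have not yet found a use for this library, so I am skipping it for now; interested readers can look it up on their own.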
