import asyncio

from motor.motor_asyncio import AsyncIOMotorClient
import pandas as pd
import nest_asyncio

nest_asyncio.apply()


def client_database(address, port, database):
    client = AsyncIOMotorClient(address, port)
    db = client[database]
    return db


async def do_find(db, collection):
    cursor = db[collection].find()
    documents = []
    async for document in cursor:
        print(list(document.keys()))  # debug: show each document's keys
        documents.append(document)
    dataframe = pd.DataFrame(documents)
    dataframe.set_index('_id', inplace=True)
    dataframe.to_csv('dataframe.csv')  # save as CSV
    return dataframe


if __name__ == '__main__':
    address = '127.0.0.1'         # host address
    port = 27017                  # port
    database = 'MachineLearning'  # database name
    collection = 'Movie'          # collection name
    db = client_database(address, port, database)
    loop = asyncio.get_event_loop()
    dataframe = loop.run_until_complete(do_find(db, collection))
I think there is a lot of room to optimize the step where the dictionaries are converted into a DataFrame.
Method 1: When converting many dictionaries into a pandas DataFrame, is it best to keep reading documents, append them all to a list, and only then convert the list into a DataFrame? [Could the list overflow memory when the data volume is large?]
Method 2: Or start from an empty DataFrame, convert each document into a DataFrame as it is read, and keep concatenating it onto the accumulated result. Is that better? [Frequent creation, conversion, and concatenation might cost more?]
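For reference, Method 2 can be sketched with plain pandas (no MongoDB needed; the sample documents below are made-up stand-ins for what the Motor cursor would yield). Note that each `pd.concat` copies every row accumulated so far, so the total work grows roughly quadratically with the number of documents:

```python
import pandas as pd

# Hypothetical documents standing in for the Motor cursor's output.
docs = [{'_id': i, 'title': f'movie-{i}', 'rating': i % 10} for i in range(5)]

# Method 2: grow the DataFrame one row at a time.
dataframe = pd.DataFrame()
for doc in docs:
    # Each iteration builds a one-row frame and copies everything
    # accumulated so far -- O(n^2) copying overall.
    dataframe = pd.concat([dataframe, pd.DataFrame([doc])], ignore_index=True)

dataframe.set_index('_id', inplace=True)
print(dataframe.shape)  # (5, 2)
```

This is why row-by-row concatenation is generally discouraged in pandas: the per-iteration allocation and copy dominate once the frame gets large.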
Method 3: Based on Method 1, is there a better data structure than a list? Or set an upper limit on the list, flush it to a cache once the limit is reached, and after several rounds convert the accumulated lists into a DataFrame (via an intermediate array, or directly into a DataFrame without the array step)?
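A minimal sketch of the chunked variant in Method 3, again with made-up documents and a hypothetical `CHUNK_SIZE`: cap the buffer list, turn each full buffer into a DataFrame, and concatenate the chunk frames once at the end. This bounds the raw-dict buffer while avoiding per-document concatenation; appending each chunk to the CSV (`to_csv(..., mode='a')`) instead of keeping all chunk frames would also cap memory.

```python
import pandas as pd

CHUNK_SIZE = 2  # hypothetical cap on the buffer list

# Made-up documents standing in for the Motor cursor's output.
docs = [{'_id': i, 'title': f'movie-{i}', 'rating': i % 10} for i in range(5)]

frames = []   # completed chunks, each already a DataFrame
buffer = []   # current chunk of raw dicts

for doc in docs:
    buffer.append(doc)
    if len(buffer) >= CHUNK_SIZE:
        frames.append(pd.DataFrame(buffer))  # flush the full buffer
        buffer = []
if buffer:  # flush the final partial chunk
    frames.append(pd.DataFrame(buffer))

# One concat at the end instead of one per document.
dataframe = pd.concat(frames, ignore_index=True).set_index('_id')
print(dataframe.shape)  # (5, 2)
```

The intermediate NumPy array step the question mentions is unnecessary here: `pd.DataFrame(buffer)` consumes the list of dicts directly, and `pd.concat` merges the chunk frames in a single pass.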