Python: reading MongoDB data and converting it into a pandas DataFrame

import asyncio
from motor.motor_asyncio import AsyncIOMotorClient
import pandas as pd
import nest_asyncio
nest_asyncio.apply()  # allow a re-entrant event loop (e.g. inside Jupyter)

def client_database(address, port, database):
    # Connect to MongoDB and return a handle to the named database.
    client = AsyncIOMotorClient(address, port)
    db = client[database]
    return db

async def do_find(db, collection):
    # Iterate the collection asynchronously, buffering every document in a list.
    cursor = db[collection].find()
    documents = []
    async for document in cursor:
        print(list(document.keys()))  # debug: show each document's field names
        documents.append(document)
    # Build the DataFrame in one pass and use MongoDB's _id as the index.
    dataframe = pd.DataFrame(documents)
    dataframe.set_index('_id', inplace=True)
    dataframe.to_csv('dataframe.csv')  # save as CSV
    return dataframe

if __name__ == '__main__':
    address = '127.0.0.1'  # server address
    port = 27017  # port
    database = 'MachineLearning'  # database name
    collection = 'Movie'  # collection name
    db = client_database(address, port, database)
    loop = asyncio.get_event_loop()
    dataframe = loop.run_until_complete(do_find(db, collection))

I suspect there is considerable room for optimization in the step that converts the dictionaries into a DataFrame.

Method 1: When converting many dictionaries into a pandas DataFrame, is it best to keep reading documents, append every dictionary to a list, and only then convert the whole list into a DataFrame in one go? (Will the list overflow memory when the data volume is large?) A sketch of this approach follows.
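For concreteness, here is a minimal sketch of Method 1, reusing the imports and db handle from the script above. Motor's cursor.to_list does the list buffering for you; passing length=None exhausts the cursor, so memory grows linearly with the collection size, which is exactly the overflow risk raised above:

async def do_find_all(db, collection):
    # Method 1: buffer every document in one list, then build the DataFrame once.
    documents = await db[collection].find().to_list(length=None)
    return pd.DataFrame(documents)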

Method 2: Or define an empty DataFrame, convert each document into a one-row DataFrame as it is read, and keep splicing it onto the running frame? (The frequent creation, conversion, and splicing is presumably more CPU- and memory-intensive?) See the sketch below.
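A sketch of Method 2, assuming "empty dictionary" means an empty DataFrame grown by concatenation (DataFrame.append was removed in pandas 2.0, so pd.concat is used here). Each concat copies every row accumulated so far, so total work grows quadratically with the number of documents, which is why this pattern is usually the slowest of the three:

async def do_find_concat(db, collection):
    # Method 2: start from an empty DataFrame and splice each document onto it.
    dataframe = pd.DataFrame()
    async for document in db[collection].find():
        # pd.concat allocates a new frame and copies all previous rows on
        # every iteration -- this is the churn the question worries about.
        dataframe = pd.concat([dataframe, pd.DataFrame([document])],
                              ignore_index=True)
    return dataframe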

Method 3: Building on Method 1, is there a better data structure than a list? Or set an upper limit on the list, flush it to a cached chunk once the limit is reached, and after several rounds combine the chunks into one DataFrame (either via intermediate arrays, or directly without converting to arrays first)? A sketch follows.
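And a sketch of Method 3, with a hypothetical chunk_size cap: buffer documents in a plain list (a list of dicts is already a good fit for pd.DataFrame), flush each full buffer into a chunk DataFrame, and combine all chunks with a single pd.concat at the end. There is no need for an intermediate NumPy array; concatenating the chunk frames is the direct route:

async def do_find_chunked(db, collection, chunk_size=10_000):
    # Method 3: cap the in-memory buffer; convert each full buffer to a
    # DataFrame and merge the chunk frames once at the end.
    chunks, buffer = [], []
    async for document in db[collection].find():
        buffer.append(document)
        if len(buffer) >= chunk_size:
            chunks.append(pd.DataFrame(buffer))
            buffer = []
    if buffer:  # flush the final partial chunk
        chunks.append(pd.DataFrame(buffer))
    return pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()

If the goal is only a CSV on disk, each chunk could instead be appended to the file with dataframe.to_csv(..., mode='a', header=False) after writing the header once, which keeps peak memory at a single chunk.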

Origin: blog.csdn.net/qq_42658739/article/details/104595742