Processing MongoDB data read with Python (time handling + converting a list of nested dictionaries to a data frame)

MongoDB stores unstructured data, so a fair amount of preprocessing is usually needed after the data has been read. Here is a summary of the common cases:

1. Connecting to MongoDB and reading data from a collection, including filtering on a set of values ($in)

from datetime import datetime  # import the class, so datetime(...) is callable later

import pandas as pd
from pymongo import MongoClient

client1 = MongoClient('mongodb://user:password@IP:port/database_name')
db1 = client1.university   # database name
mycol = db1.studyrecord    # collection name
cur = mycol.find(
    {
        "courseid": {"$in": [16967, ……]},
        "username": {"$in": ['F2847673', ……]},
    },
    {
        '_id': 0, "courseid": 1, "wareid": 1, "username": 1, "starttime": 1,
        "endtime": 1, "increasetime": 1, "createdate": 1,
    },
)
study = pd.DataFrame(list(cur), columns=['courseid', "wareid", 'username', "starttime",
                                         "endtime", "increasetime", "createdate"])  # study records
client1.close()
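
The filter and projection passed to find() are ordinary Python dictionaries, so they can be built separately and inspected before any query is sent. The following is a minimal sketch of that idea; the helper name build_query is hypothetical and not part of pymongo, and the ID values are just the sample values from the query above.

```python
def build_query(course_ids, usernames, fields):
    """Return (filter, projection) dictionaries for collection.find()."""
    query_filter = {
        "courseid": {"$in": list(course_ids)},   # match any of these course IDs
        "username": {"$in": list(usernames)},    # and any of these usernames
    }
    projection = {"_id": 0}                       # hide the _id field
    projection.update({name: 1 for name in fields})  # return only these fields
    return query_filter, projection


fields = ["courseid", "wareid", "username", "starttime",
          "endtime", "increasetime", "createdate"]
query_filter, projection = build_query([16967], ["F2847673"], fields)

# With a live connection the dictionaries would be used like this:
# cur = mycol.find(query_filter, projection)
# study = pd.DataFrame(list(cur), columns=fields)
```

Reusing the same fields list for the projection and for the DataFrame columns keeps the two from drifting apart.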

2. MongoDB commonly stores times in three formats: ISODate("2021-02-05T03:18:00.509Z"), Date(-62135596800000), and NumberLong(1612559021241). The first two can appear together in one column; for how to filter on time in that case, see the article "pymongo multi-condition screening time error reporting 'module' object is not callable": https://blog.csdn.net/zxxxlh123/article/details/108747680

For times stored like ISODate("2021-02-05T03:18:00.509Z"), pass a Python datetime such as datetime(2021,2,5,0,0,0,0) as the filter value. This needs from datetime import datetime; with a plain import datetime, the call fails with the 'module' object is not callable error mentioned above. For example:

cur = mycol.find(
    {
        "createdate": {"$gte": datetime(2021, 2, 5, 0, 0, 0, 0)},
        "courseid": {"$in": [16942, ……]},
    },
    {'_id': 0, "courseid": 1, "username": 1},
)  # filter by time
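
The 'module' object is not callable error comes from how datetime is imported, not from MongoDB itself. Here is a small sketch that reproduces the mistake and then builds the time filter correctly; the alias datetime_module is only used here to keep the two imports apart.

```python
import datetime as datetime_module   # the module (what a plain "import datetime" gives you)
from datetime import datetime        # the class inside that module

# Calling the module instead of the class raises TypeError.
try:
    datetime_module(2021, 2, 5)
except TypeError as exc:
    error_message = str(exc)

# Calling the class works and yields a value usable in a query filter.
time_filter = {"createdate": {"$gte": datetime(2021, 2, 5, 0, 0, 0, 0)}}

# With a live connection: cur = mycol.find(time_filter, {"_id": 0})
```

PyMongo treats a naive datetime like this one as UTC, so the boundary may need shifting if the intended cutoff is in local time.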

Next, here is how to convert a value in NumberLong(1612559021241) format (a timestamp in milliseconds) into a readable time:

import time

study['starttime'][0]  # output: 1611633530811
time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(study['starttime'][0] / 1000))
# output: '2021-01-26 11:58:50' (local time on a machine set to UTC+8)
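
Because time.localtime depends on the time zone of the machine running the code, the same value prints differently on a server set to another zone. A hedged alternative is to state the time zone explicitly. The sketch below uses only the standard library; the helper name ms_to_datetime and the choice of a fixed UTC+8 offset are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(ms, tz=timezone.utc):
    """Convert a millisecond Unix timestamp to an aware datetime in tz."""
    return (EPOCH + timedelta(milliseconds=int(ms))).astimezone(tz)


utc8 = timezone(timedelta(hours=8))            # fixed UTC+8 offset (assumed)
local = ms_to_datetime(1611633530811, utc8)
formatted = local.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2021-01-26 11:58:50
```

Adding a timedelta to the epoch, rather than calling datetime.fromtimestamp, also avoids platform-specific failures on timestamps before 1970.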

3. After reading MongoDB data into a data frame, a column may hold a dictionary in each row. The key-value pairs need to be pulled out and turned into separate columns.

pd.DataFrame(exam['_id'].tolist())  # expand each dictionary into its own columns
exam = pd.merge(pd.DataFrame(exam['_id'].tolist()), exam[['max_sal']], how='inner', left_index=True, right_index=True)  # join back to the other columns by index
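
To make the effect concrete, here is a small self-contained sketch on made-up data; the keys inside _id and the max_sal values are invented for illustration and do not come from the original collection. It passes index=exam.index when expanding, which is an addition to the code above: without it, the index-based join only lines up if exam still has its default 0..n-1 index.

```python
import pandas as pd

# Made-up sample: "_id" holds a dictionary in each row.
exam = pd.DataFrame({
    "_id": [{"courseid": 1, "username": "a"},
            {"courseid": 2, "username": "b"}],
    "max_sal": [90, 75],
})

# Expand the dictionaries into columns, keeping the original row index.
keys = pd.DataFrame(exam["_id"].tolist(), index=exam.index)

# Join the expanded columns back to the remaining columns by index.
flat = keys.join(exam[["max_sal"]])
print(flat)
```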

4. After reading MongoDB data into a data frame, a column may hold a list of dictionaries in each row, which needs to be expanded into a data frame of its own.

Here the ['subject'] column holds a list of dictionaries in each row.

df_list = [pd.DataFrame(d) for d in exam['subject']]  # one small data frame per row
df = pd.concat(df_list, keys=exam.index).reset_index(level=1, drop=False)  # keep the original row index as the key
exam_ = pd.merge(exam[['courseid', 'username', "timestart", "timeend", "createdate", "timeinterval", "score"]], df, how='left', left_index=True, right_index=True)  # join back to the other columns of the original data frame by index
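
The same steps can be tried on a tiny made-up data frame to see what shape comes out. The column values and the keys qid and right are invented for illustration. Unlike the code above, this sketch uses drop=True so the position within each list is discarded instead of being kept as a level_1 column.

```python
import pandas as pd

# Made-up sample: "subject" holds a list of dictionaries in each row.
exam = pd.DataFrame({
    "username": ["a", "b"],
    "score": [80, 95],
    "subject": [
        [{"qid": 1, "right": True}, {"qid": 2, "right": False}],
        [{"qid": 1, "right": True}],
    ],
})

# One small data frame per row, stacked with the original row index as key.
df_list = [pd.DataFrame(d) for d in exam["subject"]]
df = pd.concat(df_list, keys=exam.index).reset_index(level=1, drop=True)

# Left-join on the index: each original row repeats once per list element.
exam_ = pd.merge(exam[["username", "score"]], df,
                 how="left", left_index=True, right_index=True)
print(exam_)
```

The first row had two dictionaries and the second had one, so the result has three rows, with username and score repeated for each element.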


References:
https://www.it1352.com/1725492.html
https://blog.csdn.net/Poppy_tester/article/details/105064093

Origin blog.csdn.net/zxxxlh123/article/details/114694544