pandas data analysis 35 - Cartesian product of multiple data frames

What is a Cartesian product? It is the set of all possible combinations obtained by taking one element from each of several sets.

For example, the first box holds three numbered balls [1,2,3], and the second box holds two numbered balls [4,5]. There are then 3*2 = 6 ways to take one ball from each box, and the set of them is {[1,4],[2,4],[3,4],[1,5],[2,5],[3,5]}. This set is the Cartesian product.

The same holds for three boxes. For example, if a third box holds balls [6,7,8], there are 3*2*3 = 18 possibilities, and the collection of these possibilities is the Cartesian product.
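The ball example can be checked with Python's standard library. This is a minimal sketch using itertools.product; the variable names box1, box2 and box3 are only illustrative.

```python
from itertools import product

box1 = [1, 2, 3]
box2 = [4, 5]
box3 = [6, 7, 8]

# Every way of taking one ball from each of the first two boxes
pairs = list(product(box1, box2))
print(len(pairs))    # 6

# Adding the third box gives 3 * 2 * 3 combinations
triples = list(product(box1, box2, box3))
print(len(triples))  # 18
```

The pairs come out in a different order than the set listed in the text, but they are the same six combinations.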

Let's start with an example in pandas: two data frames, where every row of one is combined with every row of the other:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"a":[1,2],"b":[3,4]})
df2 = pd.DataFrame({"c":[11,22],"d":[33,44],"e":[55,66]})

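# Add a constant temporary key so that every row of df1 matches every row of df2
df1['value']=1
df2['value']=1
df3 = df1.merge(df2,how='left',on='value')
# The temporary key is no longer needed after the merge
del df3['value']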

df3

This is equivalent to pairing up the rows of df1 and df2: the first row of df1 with the first row of df2, the first row of df1 with the second row of df2, the second row of df1 with the first row of df2, and the second row of df1 with the second row of df2.

So the Cartesian product df3 has four rows (2*2).
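As an alternative to the temporary key, recent pandas versions (1.2 and later) accept how="cross" in merge. The sketch below assumes such a version is installed; df_extra is a hypothetical third frame added only to show that the idea extends to more than two data frames.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"c": [11, 22], "d": [33, 44], "e": [55, 66]})

# how="cross" pairs every row of df1 with every row of df2 (pandas >= 1.2)
cross = df1.merge(df2, how="cross")
print(cross.shape)  # (4, 5)

# The temporary-key approach, written without modifying df1 or df2
via_key = (
    df1.assign(value=1)
    .merge(df2.assign(value=1), on="value", how="left")
    .drop(columns="value")
)

# Hypothetical third frame: chain cross merges to combine any number of frames
df_extra = pd.DataFrame({"f": [0, 1, 2]})
all_three = reduce(
    lambda left, right: left.merge(right, how="cross"),
    [df1, df2, df_extra],
)
print(len(all_three))  # 12, i.e. 2 * 2 * 3
```

Both approaches produce the same four combined rows. The cross merge avoids leaving a value column behind in df1 and df2.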


student course case

What is the use of the Cartesian product above?

For example, say I have two tables:

df_student

The table above holds the student information. There is also a course table:

df_course

I want to generate a new dataframe that can hold the grades of all students for all courses.

It should then have 12 (number of students) * 10 (number of courses) = 120 rows.

This can be done as follows (a new column temp is added as a temporary key and dropped after merging):

# Cartesian product
df_stu_cour = pd.merge(
    df_student[['# ID','name']].assign(temp=1),
    df_course.assign(temp=1),
    on='temp',
    how='left',
).drop(columns=['temp'])
df_stu_cour

This produces the Cartesian product table with the expected 120 rows. # ID_x is the student ID and # ID_y is the course ID.

Each student's score for each course can then be filled in as an additional column.
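Since the real tables are not reproduced here, the following sketch repeats the same steps on hypothetical miniature tables (3 students, 2 courses). The names, ages and the course column are invented for illustration; only the "# ID" and name columns come from the code above.

```python
import pandas as pd

# Hypothetical stand-ins for the real 12-student and 10-course tables
df_student = pd.DataFrame({
    "# ID": [1, 2, 3],
    "name": ["Ann", "Bob", "Cid"],
    "age": [20, 21, 19],
})
df_course = pd.DataFrame({
    "# ID": [101, 102],
    "course": ["Math", "Physics"],
})

# Same temporary-key technique as above
df_stu_cour = (
    df_student[["# ID", "name"]].assign(temp=1)
    .merge(df_course.assign(temp=1), on="temp", how="left")
    .drop(columns=["temp"])
)

# Both inputs have a "# ID" column, so pandas appends _x (left) and _y (right)
print(list(df_stu_cour.columns))  # ['# ID_x', 'name', '# ID_y', 'course']
print(len(df_stu_cour))           # 6, i.e. 3 students * 2 courses
```

If the _x and _y suffixes are hard to read, merge also accepts a suffixes argument, for example suffixes=("_student", "_course").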


Suppose I already have a score table for each student and each course, but it contains student IDs and course IDs rather than student names and course names. How do I merge it with the table I just built?

Let's look at the score table first:

df_score

Merge on two keys, the student ID and the course ID:

df_stu_cour.merge(df_score,left_on=['# ID_x','# ID_y'],right_on=['# s_id','c_id'],how='outer').tail(30)

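I am using an outer merge, so every row of the Cartesian product table is kept and there are still 120 rows (provided the score table contains no student/course pair outside that table). If the score table has no score for a given student and course, the value is NaN.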
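To see which student/course pairs have no score, the merge can be run with indicator=True. This is a sketch on hypothetical data: the key column names follow the merge call above, while the score column and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical Cartesian product table: 2 students * 2 courses
df_stu_cour = pd.DataFrame({
    "# ID_x": [1, 1, 2, 2],
    "name": ["Ann", "Ann", "Bob", "Bob"],
    "# ID_y": [101, 102, 101, 102],
    "course": ["Math", "Physics", "Math", "Physics"],
})

# Hypothetical score table; Bob has no Physics score
df_score = pd.DataFrame({
    "# s_id": [1, 1, 2],
    "c_id": [101, 102, 101],
    "score": [90, 85, 70],
})

merged = df_stu_cour.merge(
    df_score,
    left_on=["# ID_x", "# ID_y"],
    right_on=["# s_id", "c_id"],
    how="outer",
    indicator=True,  # adds a _merge column: both, left_only or right_only
)
print(len(merged))  # 4

# Rows that exist in the product table but have no matching score
missing = merged[merged["_merge"] == "left_only"]
print(missing[["name", "course"]])
```

Rows marked right_only would be scores for pairs that are not in the product table. Using how="left" instead drops such rows and keeps the row count at students * courses, as long as the score table has no duplicate student/course pairs.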


Origin blog.csdn.net/weixin_46277779/article/details/128996373