I have multiple dataframes, example data:
df1:
user_id username firstname lastname
123 abc abc abc
456 def def def
789 ghi ghi ghi
df2:
user_id username firstname lastname
111 xyz xyz xyz
456 def def def
234 mnp mnp mnp
df3:
user_id username firstname lastname
789 ghi ghi ghi
456 def def def
222 uwv uwv uwv
I want to concatenate those dataframe, drop duplicate rows, and keep track of the rows' origins by adding more columns. Desired output:
df1:
user_id username firstname lastname df1 df2 df3
123 abc abc abc 1 0 0
456 def def def 1 1 1
789 ghi ghi ghi 1 0 1
111 xyz xyz xyz 0 1 0
234 mnp mnp mnp 0 1 0
222 uwv uwv uwv 0 0 1
I can concatenate do the first step by using:
pd.concat([df1, df2, df3]).drop_duplicates('user_id').reset_index(drop=True)
How can I do the last step (making the origin columns)? I don't know how to do this without forloop, which is not practical for big dataframe. Thank you
As you need the details of df's
names after concatenating, you should not drop them.
You need to add df
name as a column, so that after concatenating, we can know, which row is from which df
.
>>> cols = df1.columns.to_list() # Store columns before adding `origin` column to use later in groupby
>>> df1['origin'] = 'df1'
>>> df2['origin'] = 'df2'
>>> df3['origin'] = 'df3'
Now, using groupby
your task becomes easy.
>>> df = pd.concat([df1, df2, df3])\
.reset_index(drop=True)\
.groupby(by = cols, axis=0)\
.apply(lambda x: x['origin'].values)\
.to_frame('origin')\
.reset_index()
Output:
>>> df
user_id username firstname lastname origin
0 111 xyz xyz xyz [df2]
1 123 abc abc abc [df1]
2 222 uwv uwv uwv [df3]
3 234 mnp mnp mnp [df2]
4 456 def def def [df1, df2, df3]
5 789 ghi ghi ghi [df1, df3]
You need to do one hot encoding to get the expected results like, (referenced from here)
Using sklearn.preprocessing.MultiLabelBinarizer (You can also use pd.get_dummies
as well, check this)
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> expandedLabelData = mlb.fit_transform(df["origin"])
>>> labelClasses = mlb.classes_
>>> encoded_df = pd.DataFrame(expandedLabelData, columns=labelClasses)
df1 df2 df3
0 0 1 0
1 1 0 0
2 0 0 1
3 0 1 0
4 1 1 1
5 1 0 1
Finally,
>>> pd.concat([df.drop('origin', axis=1), encoded_df], axis=1)
user_id username firstname lastname df1 df2 df3
0 111 xyz xyz xyz 0 1 0
1 123 abc abc abc 1 0 0
2 222 uwv uwv uwv 0 0 1
3 234 mnp mnp mnp 0 1 0
4 456 def def def 1 1 1
5 789 ghi ghi ghi 1 0 1