Concatenate pandas dataframes, drop duplicates, and keep a record of each row's origin

Thanh Nguyen :

I have multiple dataframes, example data:

df1:

user_id    username firstname lastname 
 123         abc      abc       abc
 456         def      def       def 
 789         ghi      ghi       ghi

df2:

user_id     username  firstname lastname
 111         xyz       xyz       xyz
 456         def       def       def
 234         mnp       mnp       mnp

df3:

user_id     username  firstname lastname
 789         ghi       ghi       ghi       
 456         def       def       def
 222         uwv       uwv       uwv       

I want to concatenate those dataframe, drop duplicate rows, and keep track of the rows' origins by adding more columns. Desired output:

df1:

user_id    username firstname lastname df1 df2 df3
 123         abc       abc       abc    1   0   0
 456         def       def       def    1   1   1
 789         ghi       ghi       ghi    1   0   1
 111         xyz       xyz       xyz    0   1   0
 234         mnp       mnp       mnp    0   1   0
 222         uwv       uwv       uwv    0   0   1      

I can do the first step by using:

pd.concat([df1, df2, df3]).drop_duplicates('user_id').reset_index(drop=True)

How can I do the last step (making the origin columns)? I don't know how to do this without a for loop, which is not practical for big dataframes. Thank you
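For reference, the first step above can be reproduced with small sample frames built from the question's data (the frame names `df1`/`df2`/`df3` are taken from the question):

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({'user_id': [123, 456, 789], 'username': ['abc', 'def', 'ghi'],
                    'firstname': ['abc', 'def', 'ghi'], 'lastname': ['abc', 'def', 'ghi']})
df2 = pd.DataFrame({'user_id': [111, 456, 234], 'username': ['xyz', 'def', 'mnp'],
                    'firstname': ['xyz', 'def', 'mnp'], 'lastname': ['xyz', 'def', 'mnp']})
df3 = pd.DataFrame({'user_id': [789, 456, 222], 'username': ['ghi', 'def', 'uwv'],
                    'firstname': ['ghi', 'def', 'uwv'], 'lastname': ['ghi', 'def', 'uwv']})

# First step: concatenate, then keep one row per user_id
merged = pd.concat([df1, df2, df3]).drop_duplicates('user_id').reset_index(drop=True)
```

This yields the six unique users, but the information about which frame(s) each row came from is already lost at this point, which is what the answer below addresses.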

shaik moeed :

Since you need the names of the source dataframes after concatenating, you should not drop the duplicates right away.

Instead, add each dataframe's name as a column, so that after concatenating we can tell which row came from which dataframe.

>>> cols = df1.columns.to_list() # Store columns before adding `origin` column to use later in groupby
>>> df1['origin'] = 'df1'
>>> df2['origin'] = 'df2'
>>> df3['origin'] = 'df3'

Now, using groupby, the task becomes easy.

>>> df = pd.concat([df1, df2, df3])\
...         .reset_index(drop=True)\
...         .groupby(cols)\
...         .apply(lambda x: x['origin'].values)\
...         .to_frame('origin')\
...         .reset_index()

Output:

>>> df

   user_id username firstname lastname           origin
0      111      xyz       xyz      xyz            [df2]
1      123      abc       abc      abc            [df1]
2      222      uwv       uwv      uwv            [df3]
3      234      mnp       mnp      mnp            [df2]
4      456      def       def      def  [df1, df2, df3]
5      789      ghi       ghi      ghi       [df1, df3]

You need to one-hot encode the `origin` column to get the expected result.

Using sklearn.preprocessing.MultiLabelBinarizer (pd.get_dummies works as well):

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> expandedLabelData = mlb.fit_transform(df["origin"])
>>> labelClasses = mlb.classes_
>>> encoded_df = pd.DataFrame(expandedLabelData, columns=labelClasses)
>>> encoded_df
   df1  df2  df3
0    0    1    0
1    1    0    0
2    0    0    1
3    0    1    0
4    1    1    1
5    1    0    1
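The pd.get_dummies route mentioned above can be sketched like this: explode the list-valued `origin` column to one row per tag, one-hot encode, and collapse back per original row (the small `df` here is a hypothetical stand-in for the output of the groupby step):

```python
import pandas as pd

# Stand-in for the groupby result: a list-valued 'origin' column
df = pd.DataFrame({'user_id': [111, 456],
                   'origin': [['df2'], ['df1', 'df2', 'df3']]})

# Explode lists to one row per tag, one-hot encode, then take the max
# per original index so each source frame contributes a single 1
encoded_df = (pd.get_dummies(df['origin'].explode())
                .groupby(level=0).max().astype(int))
```

This avoids the sklearn dependency at the cost of an extra `explode`/`groupby` round trip.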

Finally,

>>> pd.concat([df.drop('origin', axis=1), encoded_df], axis=1)
   user_id username firstname lastname  df1  df2  df3
0      111      xyz       xyz      xyz    0    1    0
1      123      abc       abc      abc    1    0    0
2      222      uwv       uwv      uwv    0    0    1
3      234      mnp       mnp      mnp    0    1    0
4      456      def       def      def    1    1    1
5      789      ghi       ghi      ghi    1    0    1
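As a variant of the same idea, the whole pipeline can also be done in pandas alone: tag each frame, one-hot encode the tag, and collapse duplicates by taking the max of each indicator column. This is a sketch assuming the same three sample frames as in the question:

```python
import pandas as pd

# Sample frames matching the question's data
df1 = pd.DataFrame({'user_id': [123, 456, 789], 'username': ['abc', 'def', 'ghi'],
                    'firstname': ['abc', 'def', 'ghi'], 'lastname': ['abc', 'def', 'ghi']})
df2 = pd.DataFrame({'user_id': [111, 456, 234], 'username': ['xyz', 'def', 'mnp'],
                    'firstname': ['xyz', 'def', 'mnp'], 'lastname': ['xyz', 'def', 'mnp']})
df3 = pd.DataFrame({'user_id': [789, 456, 222], 'username': ['ghi', 'def', 'uwv'],
                    'firstname': ['ghi', 'def', 'uwv'], 'lastname': ['ghi', 'def', 'uwv']})

# Tag each row with the name of its source frame, then concatenate
tagged = pd.concat([d.assign(origin=name)
                    for name, d in [('df1', df1), ('df2', df2), ('df3', df3)]],
                   ignore_index=True)

# One-hot encode the tag; collapsing duplicates with max() gives a 1 in
# every column whose frame contained that row
dummies = pd.get_dummies(tagged['origin']).astype(int)
out = (pd.concat([tagged.drop(columns='origin'), dummies], axis=1)
         .groupby(['user_id', 'username', 'firstname', 'lastname'], as_index=False)
         .max())
```

Because everything stays vectorized inside pandas, this avoids both the `apply` with a Python lambda and the sklearn dependency.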
