How to remove rows that contain the same pair of values in two columns, regardless of order, in a DataFrame?

Rishabh Sahrawat :

I have a DataFrame, DF1:

   Id1   Id2  
0  286   409 
1  286   257  
2  409   286    
3  257   183   

In this DataFrame, I consider the rows 286,409 and 409,286 to be the same, and I only want to keep one of them. I am doing all this to build a network graph using the NetworkX Python library.

I tried to achieve this by creating another DataFrame, DF2, with the columns interchanged:

   Id2   Id1
0  409   286
1  257   286
2  286   409
3  183   257

Then I compare the two DataFrames using the isin function, something like this:

DF1[DF1[['Id1', 'Id2']].isin(DF2[['Id2', 'Id1']])]

but it prints DF1 unchanged.

Expected output DF:

   Id1   Id2  
0  286   409 
1  286   257     
3  257   183  

Any help would be appreciated. Thanks.

jezrael :

I believe you need to sort the values of both columns within each row using np.sort, then filter with DataFrame.duplicated and an inverted mask:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)

df = DF1[~df1.duplicated()]
print(df)
   Id1  Id2
0  286  409
1  286  257
3  257  183
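
For a version that can be pasted and run as-is, here is a minimal sketch of the same idea wrapped in a helper. The function name drop_unordered_duplicates is hypothetical (it is not part of pandas), and DF1 is rebuilt from the sample data in the question:

```python
import numpy as np
import pandas as pd

# Sample data from the question.
DF1 = pd.DataFrame({"Id1": [286, 286, 409, 257],
                    "Id2": [409, 257, 286, 183]})


def drop_unordered_duplicates(df, cols):
    """Keep the first row for each unordered combination of values in `cols`.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Sort the values inside each row so (409, 286) and (286, 409) look identical.
    sorted_pairs = pd.DataFrame(np.sort(df[cols].to_numpy(), axis=1), index=df.index)
    # duplicated() flags every repeat after the first occurrence; ~ keeps the rest.
    return df[~sorted_pairs.duplicated()]


result = drop_unordered_duplicates(DF1, ["Id1", "Id2"])
print(result)
```

The original column order and index labels of the kept rows are preserved, because the sorted copy is only used to build the mask.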

Detail: with axis=1, numpy.sort sorts the values within each row, so the first and third rows become identical:

print(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1))
[[286 409]
 [257 286]
 [286 409]
 [183 257]]
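
To see why axis=1 matters here, the following small sketch compares it with axis=0 on the same values (the variable names are only for illustration). Sorting along axis=0 sorts each column independently, which breaks up the original pairs:

```python
import numpy as np

# The (Id1, Id2) pairs from the question as a plain array.
pairs = np.array([[286, 409],
                  [286, 257],
                  [409, 286],
                  [257, 183]])

by_row = np.sort(pairs, axis=1)     # sorts the two values inside each row
by_column = np.sort(pairs, axis=0)  # sorts each column on its own

print(by_row)
print(by_column)
```

Only the axis=1 result keeps each pair together, so it is the one that is safe to pass to duplicated.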

Then use the DataFrame.duplicated function. It is a DataFrame method, so the sorted array is passed to the DataFrame constructor, reusing the index of DF1:

df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)
print(df1)
     0    1
0  286  409
1  257  286
2  286  409
3  183  257

The third row is a duplicate of the first:

print(df1.duplicated())
0    False
1    False
2     True
3    False
dtype: bool
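
Which row of a duplicated pair gets flagged depends on the keep parameter of duplicated (default 'first'). A short sketch, assuming you might prefer to keep the last occurrence or to drop every row that has a mirror; the variable names are only for illustration:

```python
import pandas as pd

# The row-wise sorted pairs from the step above.
sorted_pairs = pd.DataFrame([[286, 409], [257, 286], [286, 409], [183, 257]])

first = sorted_pairs.duplicated()             # default keep='first': flags row 2
last = sorted_pairs.duplicated(keep="last")   # flags row 0, so row 2 would be kept
both = sorted_pairs.duplicated(keep=False)    # flags rows 0 and 2

print(first.tolist())
print(last.tolist())
print(both.tolist())
```

With the default, the expected output from the question (rows 0, 1 and 3) is reproduced.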

Finally, invert the mask with ~ so that duplicates are removed, and filter DF1 with boolean indexing:

print(DF1[~df1.duplicated()])
   Id1  Id2
0  286  409
1  286  257
3  257  183
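
As for why the isin attempt in the question printed DF1 unchanged: when DataFrame.isin is given another DataFrame, it matches on both index and column labels, not on column position. The sketch below assumes DF2 was built by reordering the columns of DF1, as the printed DF2 suggests; under that assumption every cell is compared with itself:

```python
import pandas as pd

DF1 = pd.DataFrame({"Id1": [286, 286, 409, 257],
                    "Id2": [409, 257, 286, 183]})
# Assumption: DF2 is DF1 with the columns reordered, as shown in the question.
DF2 = DF1[["Id2", "Id1"]]

# isin aligns on index and column labels, so the column order is ignored
# and each value is checked against the same cell it came from.
mask = DF1[["Id1", "Id2"]].isin(DF2[["Id2", "Id1"]])
print(mask)

# An all-True mask filters nothing out.
filtered = DF1[mask]
print(filtered)
```

This is why sorting the values within each row, rather than swapping column labels, is what makes the mirrored pairs comparable.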
