I have a DataFrame, DF1:
Id1 Id2
0 286 409
1 286 257
2 409 286
3 257 183
In this DataFrame, I consider the rows 286,409 and 409,286 to be the same, and I only want to keep one of them.
I am doing this to build a network graph with the NetworkX Python library.
I tried to achieve this by creating another DataFrame, DF2, with the columns interchanged:
Id2 Id1
0 409 286
1 257 286
2 286 409
3 183 257
Then I compared the two DataFrames using the isin function, like this:
DF1[DF1[['Id1', 'Id2']].isin(DF2[['Id2', 'Id1']])]
but it prints DF1 unchanged.
Expected output DF:
Id1 Id2
0 286 409
1 286 257
3 257 183
Any help would be appreciated, Thanks.
I believe you need to sort the two columns within each row using np.sort, then filter with DataFrame.duplicated and an inverted mask:
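import numpy as np
import pandas as pd
# Sort the values in each row so swapped pairs become identical, then drop repeats
df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)
df = DF1[~df1.duplicated()]
print (df)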
Id1 Id2
0 286 409
1 286 257
3 257 183
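For a version that runs on its own, the same idea can be sketched as below. It rebuilds DF1 from the values shown in the question; the names sorted_pairs and result are illustrative and not part of the original snippet.

```python
import numpy as np
import pandas as pd

# DF1 rebuilt from the values shown in the question.
DF1 = pd.DataFrame({"Id1": [286, 286, 409, 257],
                    "Id2": [409, 257, 286, 183]})

# Sort the two values inside each row, so (409, 286) becomes (286, 409).
sorted_pairs = pd.DataFrame(
    np.sort(DF1[["Id1", "Id2"]].to_numpy(), axis=1),
    index=DF1.index,  # keep DF1's index so the mask lines up with DF1
)

# duplicated() marks later repeats of a row; ~ keeps the first occurrences.
result = DF1[~sorted_pairs.duplicated()]
print(result)
```

Passing index=DF1.index matters if DF1 does not have the default 0..n-1 index, because boolean indexing aligns the mask with DF1 by index label.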
Detail: numpy.sort with axis=1 sorts the values within each row, so the first and third rows become identical:
print (np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1))
[[286 409]
[257 286]
[286 409]
[183 257]]
Then use the DataFrame.duplicated function. It works on a DataFrame, so the sorted array is first wrapped in the DataFrame constructor:
df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)
print (df1)
0 1
0 286 409
1 257 286
2 286 409
3 183 257
The third row is flagged as a duplicate:
print (df1.duplicated())
0 False
1 False
2 True
3 False
dtype: bool
Finally, invert the mask to remove the duplicates; the output is filtered by boolean indexing:
print (DF1[~df1.duplicated()])
Id1 Id2
0 286 409
1 286 257
3 257 183
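As for why the isin attempt in the question returned DF1 unchanged: as far as I understand, when isin is given another DataFrame it compares cell by cell after aligning on both index and column labels, so reordering the columns changes nothing. A minimal sketch, assuming DF2 is simply DF1 with its columns reordered (the names mask and filtered are illustrative):

```python
import pandas as pd

# DF1 rebuilt from the values shown in the question.
DF1 = pd.DataFrame({"Id1": [286, 286, 409, 257],
                    "Id2": [409, 257, 286, 183]})

# Assumption: DF2 is DF1 with the column order swapped, labels unchanged.
DF2 = DF1[["Id2", "Id1"]]

# isin aligns on labels, so DF1['Id1'] is compared with DF2['Id1'] row by row,
# which is a comparison of each column with itself.
mask = DF1[["Id1", "Id2"]].isin(DF2[["Id2", "Id1"]])
print(mask)

# Every cell of the mask is True, so nothing is filtered out.
filtered = DF1[mask]
print(filtered)
```

Because the comparison is label-aligned rather than positional, the swapped pairs are never compared against each other, which is why sorting each row first is needed.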