nehalem :
I am trying to compute the size of the intersection of two data frames' indexes, but the result differs depending on whether I intersect A with B or B with A. How can that be?
a_b = a.index.intersection(b.index)
b_a = b.index.intersection(a.index)
len(a_b), len(b_a)
returns
(10735, 10927)
Unfortunately, the documentation isn't much help on this one.
Serge Ballesta :
Indexes are often assumed to contain only unique values, and weird things happen when that assumption does not hold. I suspect that is what you are running into. Here is a short example exhibiting the problem:
>>> import pandas as pd
>>> dfa = pd.DataFrame(1, index=list('ABCDAC'), columns=['X'])
>>> dfa
>>> dfb = pd.DataFrame(1, index=list('ABCEC'), columns=['X'])
>>> dfa.index.intersection(dfb.index)
Index(['A', 'B', 'C', 'C'], dtype='object')
>>> dfb.index.intersection(dfa.index)
Index(['A', 'A', 'B', 'C', 'C'], dtype='object')
The behaviour with non-unique indexes is not spelled out in the documentation, so I would not rely on it.
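To check whether duplicate labels are really the cause, something along these lines may help. It is only a sketch: it reuses the toy frames from the example above, and repeated_labels is a hypothetical helper name, not something pandas provides.

```python
import pandas as pd

# Same toy frames as in the example above
dfa = pd.DataFrame(1, index=list('ABCDAC'), columns=['X'])
dfb = pd.DataFrame(1, index=list('ABCEC'), columns=['X'])


def repeated_labels(df):
    """Hypothetical helper: labels that occur more than once in df.index."""
    # duplicated(keep=False) flags every occurrence of a repeated label
    mask = df.index.duplicated(keep=False)
    return list(df.index[mask].unique())


for name, df in [('dfa', dfa), ('dfb', dfb)]:
    # is_unique is False as soon as any label is repeated
    print(name, 'index is unique:', df.index.is_unique)
    print(name, 'repeated labels:', repeated_labels(df))
```

If is_unique comes back False for either of your frames, the asymmetric lengths are most likely explained by the repeated labels.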
So my advice (if it fits your use case) is to call unique() on both indexes:
a_b = a.index.unique().intersection(b.index.unique())
b_a = b.index.unique().intersection(a.index.unique())
len(a_b), len(b_a)
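As a rough, self-contained illustration of that advice on the toy frames from the example: once both indexes are de-duplicated, the two directions should agree. The helper name common_labels is hypothetical and only there to avoid repeating the expression.

```python
import pandas as pd

# Toy frames with repeated index labels, as in the example
dfa = pd.DataFrame(1, index=list('ABCDAC'), columns=['X'])
dfb = pd.DataFrame(1, index=list('ABCEC'), columns=['X'])


def common_labels(left, right):
    """Hypothetical helper: distinct index labels present in both frames."""
    return left.index.unique().intersection(right.index.unique())


a_b = common_labels(dfa, dfb)
b_a = common_labels(dfb, dfa)

# Both directions now have the same length
print(len(a_b), len(b_a))
# Compare as sets so that label order does not matter
print(set(a_b) == set(b_a))
```

Note that this counts distinct labels, not rows; if you need the number of matching rows, de-duplicating the index is probably not what you want.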