Number of intersections dependent on order of intersecting data frames

nehalem :

I am trying to compute the number of index values that two data frames have in common, but the result differs depending on whether I intersect A with B or B with A. How can that be?

a_b = a.index.intersection(b.index)
b_a = b.index.intersection(a.index)
len(a_b), len(b_a)

returns

(10735, 10927)

Unfortunately, the documentation isn't very helpful on this one.

Serge Ballesta :

An Index is generally expected to contain only unique values, and odd things happen when that requirement is not met. I assume that is what you are running into. Here is a short example exhibiting the problem:

>>> import pandas as pd
>>> dfa = pd.DataFrame(1, index=list('ABCDAC'), columns=['X'])
>>> dfb = pd.DataFrame(1, index=list('ABCEC'), columns=['X'])
>>> dfa.index.intersection(dfb.index)
Index(['A', 'B', 'C', 'C'], dtype='object')
>>> dfb.index.intersection(dfa.index)
Index(['A', 'A', 'B', 'C', 'C'], dtype='object')

The behaviour of intersection() with non-unique indexes is not spelled out in the documentation, so I would not rely on it.
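To confirm that duplicates are the cause, you can inspect the indexes directly. This is only a minimal sketch on the toy frames from the example above; the variable names dupes_a and dupes_b are mine, and your real frames a and b would take the place of dfa and dfb.

```python
import pandas as pd

# Toy frames from the example above; both indexes contain repeated labels.
dfa = pd.DataFrame(1, index=list("ABCDAC"), columns=["X"])
dfb = pd.DataFrame(1, index=list("ABCEC"), columns=["X"])

# is_unique is False as soon as any label occurs more than once.
print(dfa.index.is_unique, dfb.index.is_unique)

# duplicated() flags every occurrence after the first one,
# so this yields the labels that are repeated in each index.
dupes_a = dfa.index[dfa.index.duplicated()].unique()
dupes_b = dfb.index[dfb.index.duplicated()].unique()
print(list(dupes_a), list(dupes_b))
```

If is_unique is False for either frame, the two intersection calls are not guaranteed to agree.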

So my advice (if it is relevant in your use case) is to use unique() on both indexes:

a_b = a.index.unique().intersection(b.index.unique())
b_a = b.index.unique().intersection(a.index.unique())
len(a_b), len(b_a)
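As a quick sanity check, here is the same idea applied to the two toy frames with duplicated labels. It is a sketch rather than a statement about how pandas orders the result, so it compares the intersections as sets and by length only.

```python
import pandas as pd

# Toy frames with duplicated index labels (illustrative data only).
dfa = pd.DataFrame(1, index=list("ABCDAC"), columns=["X"])
dfb = pd.DataFrame(1, index=list("ABCEC"), columns=["X"])

# Deduplicate each index first, then intersect in both directions.
a_b = dfa.index.unique().intersection(dfb.index.unique())
b_a = dfb.index.unique().intersection(dfa.index.unique())

# Once duplicates are removed, both directions hold the same labels.
print(len(a_b), len(b_a))
print(set(a_b) == set(b_a))
```

Note that this counts distinct labels, not rows. If you need the number of rows of one frame whose label also appears in the other, that is a different quantity and it is naturally asymmetric.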

