Find most similar sets efficiently (Python, data structures)

John Chrysostom:

Suppose I have a couple thousand Python sets in a list called my_sets. For every set "A" in my_sets, I want to find the five sets (sets "B") in my_sets that contain the highest percentage of set A's members.

I'm currently storing the data as sets and looping over them twice to calculate overlap...

from random import randint
from heapq import heappush, heappop

my_sets = []

for i in range(20):
    new_set = set()

    for j in range(20):
        new_set.add(randint(0, 50))

    my_sets.append(new_set)

for i in range(len(my_sets)):
    neighbor_heap = []

    for j in range(len(my_sets)):
        if i == j:
            continue

        # Key by the reciprocal of the overlap so the smallest heap key is the
        # largest overlap. (Note: this raises ZeroDivisionError if two sets are disjoint.)
        heappush(neighbor_heap, (1 / len(my_sets[i] & my_sets[j]), j))

    results = []

    # Pop the five smallest keys, i.e. the five sets with the largest overlap.
    while len(results) < 5:
        results.append(heappop(neighbor_heap)[1])

    print('Closest neighbors to set {} are sets {}'.format(i, results))

However, this is obviously an O(N**2) algorithm, so it blows up when my_sets gets long. Is there a better data structure or algorithm that can be implemented in base Python for tackling this? There is no reason that my_sets has to be a list, or that each individual set actually has to be a Python set. Any way of storing whether or not each set contains members from a finite list of options would be fine (e.g., a bunch of bools or bits in a standardized order). And building a more exotic data structure to save time would also be fine.
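For concreteness, here is a minimal sketch of that bit-per-element idea in base Python (the names universe_size and masks are just illustrative). It doesn't change the O(N**2) pairing; it only makes each pairwise comparison a bitwise AND plus a popcount:

# Sketch only: each set becomes an int whose k-th bit records whether element k is present.
universe_size = 51                      # elements in the toy example are drawn from 0-50

masks = [sum(1 << element for element in s) for s in my_sets]

# Overlap between two sets is then the popcount of the AND of their masks.
overlap_0_1 = (masks[0] & masks[1]).bit_count()   # int.bit_count() needs Python 3.10+; use bin(x).count('1') earlier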

(As some people will likely want to point out, I could, of course, structure this as a NumPy array where rows are sets, columns are elements, and cells are 1/0 depending on whether that element is in that set. Then you'd just do some NumPy operations. This would undoubtedly be faster, but I haven't really improved my algorithm at all; I've just offloaded the complexity to somebody else's optimized C/Fortran/whatever.)
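For reference, a minimal sketch of that matrix formulation (illustrative names, assuming my_sets from the snippet above with elements drawn from 0-50) might look like:

import numpy as np

n_sets, universe_size = len(my_sets), 51

incidence = np.zeros((n_sets, universe_size), dtype=np.int32)
for i, s in enumerate(my_sets):
    incidence[i, list(s)] = 1           # mark the elements present in set i

overlaps = incidence @ incidence.T      # overlaps[i, j] == len(my_sets[i] & my_sets[j])
np.fill_diagonal(overlaps, -1)          # exclude each set's overlap with itself
top_five = np.argsort(-overlaps, axis=1)[:, :5]   # five most-overlapping sets per row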

EDIT: After a full test, the algorithm I originally posted runs in ~586 seconds under the agreed test conditions.

Chris Hall:

Could you:

  1. Invert the sets to produce, for each element, a list (or set) of the sets which contain it.

    This step is O(n*m) -- for n sets with an average of m elements per set.

  2. For each set S, consider its elements and (using step 1) build a list (or heap) of the other sets together with how many elements each shares with S -- then pick the 'best' 5.

    This step is O(n*m*a), where a is the average number of sets each element belongs to.

How far removed from O(n*n) that is obviously depends on m and a.

Edit: Naive implementation in Python runs in 103 seconds on my machine...

from random import randint
from collections import Counter
from time import perf_counter as clock   # time.clock() is gone in Python 3.8+

old_time = clock()

my_sets = []

for i in range(10000):
    new_set = set()

    for j in range(200):
        new_set.add(randint(0, 999))

    my_sets.append(new_set)

# Step 1: invert the sets -- my_inv_sets[j] lists the indices of the sets containing element j.
my_inv_sets = [[] for i in range(1000)]

for i in range(len(my_sets)):
    for j in range(1000):
        if j in my_sets[i]:
            my_inv_sets[j].append(i)

# Step 2: for each set, count shared elements with every other set via the inverted index.
for i in range(len(my_sets)):
    counter = Counter()

    for j in my_sets[i]:
        counter.update(my_inv_sets[j])

    # most_common(6)[1:] drops the set itself, which shares all of its own elements.
    print(counter.most_common(6)[1:])

print(clock() - old_time)
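As a small aside (a sketch, not part of the timed code above): the step-1 inversion can also be built by walking each set's own members instead of testing all 1,000 candidate values per set:

from collections import Counter

# Build the inverted index directly from each set's members (sketch only).
my_inv_sets = [[] for _ in range(1000)]
for i, s in enumerate(my_sets):
    for element in s:
        my_inv_sets[element].append(i)

# Step 2 is the same counting pass as above, filtering out the set itself by index.
for i, s in enumerate(my_sets):
    counter = Counter()
    for element in s:
        counter.update(my_inv_sets[element])
    top_five = [j for j, _ in counter.most_common(6) if j != i][:5]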
