Speeding up Apriori computation

When the Apriori algorithm is used to mine association rules from a large dataset, the exhaustive traversal of candidate itemsets makes the computation expensive. This post modifies existing Python code found online to speed up the calculation by roughly 100 times. The dataset here contains more than 7 million transactions, and frequent itemsets are mined up to K = 5.

1. Prune the data after the first iteration. Once the items whose support falls below the minimum support have been removed, drop those items from every transaction, and discard any transaction left with fewer than K items (here K = 5), since it can no longer contain a frequent K-itemset.

 
 
# L1: the 1-itemsets that meet the minimum support after the first iteration
# data_set: the original dataset, one transaction per row
new_l1 = set()
for p in L1:
    new_l1.update(p)
data_set2 = []
for t in data_set:
    # Keep only frequent items; set membership makes each lookup O(1).
    new_t = [s for s in t if s in new_l1]
    # A transaction with fewer than K (=5) items cannot hold a frequent 5-itemset.
    if len(new_t) >= 5:
        data_set2.append(new_t)

In this example, this first pruning step removes about 20% of the data (the minimum support was chosen small, so relatively little data can be deleted).
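The pruning code above assumes L1 has already been computed. A minimal sketch of that first pass follows; the function name `first_pass` and the threshold parameter are illustrative choices, not taken from the original code:

```python
from collections import defaultdict

def first_pass(data_set, min_support_count):
    # Count the support of every single item across all transactions.
    item_count = defaultdict(int)
    for t in data_set:
        for item in set(t):  # set() so duplicates within one row count once
            item_count[frozenset([item])] += 1
    # Keep only the 1-itemsets that meet the minimum support.
    return {s for s, c in item_count.items() if c >= min_support_count}

data = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c"], ["c", "d"]]
L1 = first_pass(data, 3)  # "d" appears only once and is filtered out
```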

2. The dominant cost is the support-counting loop: for every transaction, every candidate itemset must be checked for containment. In this example, at K = 2 that is 7,000,000 × 3,400 iterations. Since the average transaction holds about 10 items, for k <= 4 it is cheaper to invert the loop: for each transaction, first generate all of its k-item combinations (C(10, 2) = 45 on average), then test whether each combination is a candidate itemset. This cuts the work to roughly 7,000,000 × 45 iterations.

 
 
Lk = set()
item_count = {}
for t in data_set:
    if k == 1:
        # Candidates are simply the single items of this transaction.
        tmp = set()
        for item in t:
            tmp.add(frozenset([item]))
    elif k <= 4:
        # Generate this row's k-combinations instead of scanning all of Ck.
        tmp = combine(t, k)
    else:
        # For larger k the combination count explodes; fall back to scanning Ck.
        tmp = Ck.copy()
    for item in tmp:
        if item in Ck and item.issubset(t):
            if item not in item_count:
                item_count[item] = 1
            else:
                item_count[item] += 1
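The combination-first counting can be exercised on toy data. In this sketch `itertools.combinations` stands in for `combine`, and the candidate set `Ck` and the small dataset are made up for illustration:

```python
from itertools import combinations

# Hypothetical candidate 2-itemsets and a tiny dataset.
Ck = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "d")]}
data_set = [["a", "b", "c"], ["a", "b", "d"], ["b", "c", "d"]]

item_count = {}
for t in data_set:
    # A 10-item row yields only 45 pairs, versus thousands of candidates in Ck.
    for pair in combinations(sorted(set(t)), 2):
        item = frozenset(pair)
        if item in Ck:
            item_count[item] = item_count.get(item, 0) + 1
```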
 
 
import copy

def combine(l, n):
    """Return all n-element combinations of l as a set of frozensets."""
    l = sorted(l)
    answers = []
    one = [0] * n

    def next_c(li=0, ni=0):
        if ni == n:
            answers.append(copy.copy(one))
            return
        for lj in range(li, len(l)):
            one[ni] = l[lj]
            next_c(lj + 1, ni + 1)

    next_c()
    tmp = set()
    for i in answers:
        tmp.add(frozenset(i))
    return tmp
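The recursive combine enumerates every n-element subset of a transaction. Assuming rows contain no duplicate items, the standard library offers an equivalent and more compact version:

```python
from itertools import combinations

def combine(l, n):
    # All n-element sub-itemsets of transaction l, as a set of frozensets.
    return {frozenset(c) for c in combinations(sorted(l), n)}
```

If a row could contain duplicate items, deduplicating first (e.g. `sorted(set(l))`) avoids emitting degenerate frozensets with fewer than n elements.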

After completing these two steps, the computation runs about 100 times faster.
