Introduction to Apriori Algorithm (Python Implementation)

This article is reproduced from http://www.cnblogs.com/llhthinker/p/6719779.html. Author: llhthinker. Please credit the source when reprinting.

Overview:

With the concept of big data all the rage, the story of beer and diapers has become widely known. How do we discover that people who buy beer also tend to buy diapers? The Apriori algorithm for mining frequent itemsets and association rules can tell us. This article first introduces the Apriori algorithm, then covers the relevant basic concepts, then describes the algorithm's specific strategies and steps in detail, and finally gives a Python implementation.

Github code address: https://github.com/llhthinker/MachineLearningLab/tree/master/Frequent%20Itemset%20Mining

1. Introduction to the Apriori Algorithm

The Apriori algorithm is a classic data mining algorithm for mining frequent itemsets and association rules. "A priori" means "from before" in Latin: when defining a problem, prior knowledge or assumptions are often used, which is called "a priori". The algorithm's name comes from the fact that it uses the prior property of frequent itemsets, namely that all non-empty subsets of a frequent itemset must also be frequent. The Apriori algorithm takes an iterative approach known as level-wise (layer-by-layer) search, in which frequent k-itemsets are used to explore frequent (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database, accumulating the count of each item, and collecting the items that satisfy the minimum support; this set is denoted L1. Then L1 is used to find the set L2 of frequent 2-itemsets, L2 is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. The Apriori algorithm uses the prior property of frequent itemsets to compress the search space.

2. Basic Concepts

  • Items and itemsets : Let itemset = {item_1, item_2, …, item_m} be the set of all items, where each item_k (k = 1, 2, …, m) is called an item. A collection of items is called an itemset, and an itemset containing k items is called a k-itemset.
  • Transactions and transaction sets : A transaction T is an itemset, i.e., a subset of itemset, and each transaction is associated with a unique identifier Tid. The different transactions together make up the transaction set D, which constitutes the transaction database over which association rules are discovered.
  • Association rules : An association rule is an implication of the form A => B, where A and B are both non-empty subsets of itemset and A ∩ B = ∅.
  • Support : The support of an association rule A => B is defined as

        support(A => B) = P(A ∪ B),

    where P(A ∪ B) is the probability that a transaction contains the union of sets A and B (that is, contains every item in both A and B). Note the difference from P(A or B), which is the probability that a transaction contains A or B.

  • Confidence : The confidence of an association rule A => B is defined as

        confidence(A => B) = P(B | A) = support_count(A ∪ B) / support_count(A),

    that is, the conditional probability that a transaction containing A also contains B. A worked example of support and confidence follows this list.

  • Itemset occurrence frequency (support count) : The number of transactions containing the itemset, referred to as the frequency, support count, or count of the itemset for short.
  • Frequent itemset : If the relative support of an itemset I meets a pre-defined minimum support threshold (that is, the occurrence frequency of I is at least the corresponding minimum support count threshold), then I is a frequent itemset.
  • Strong association rule : An association rule that satisfies both the minimum support and the minimum confidence thresholds; these are the association rules to be mined.
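
As a quick illustration of these definitions, here is a minimal sketch; the five-transaction dataset is made up for this example and is not the article's sample data:

# Toy illustration of support and confidence (dataset made up for this example).
transactions = [
    {'beer', 'diapers', 'milk'},
    {'beer', 'diapers'},
    {'beer', 'milk'},
    {'diapers', 'milk'},
    {'beer', 'diapers', 'bread'},
]
A, B = {'beer'}, {'diapers'}
n = len(transactions)
count_A = sum(1 for t in transactions if A <= t)         # 4 transactions contain A
count_AB = sum(1 for t in transactions if (A | B) <= t)  # 3 transactions contain A and B
support = count_AB / n           # P(A ∪ B) = 3/5 = 0.6
confidence = count_AB / count_A  # P(B | A) = 3/4 = 0.75
print(support, confidence)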

3. Implementation steps

Generally speaking, mining association rules is a two-step process:

    1. Find all frequent itemsets
    2. Generate strong association rules from the frequent itemsets

3.1 Mining frequent itemsets

3.1.1 Related Definitions

  • Join step: The candidate k-itemset set Ck is generated by self-joining the frequent (k-1)-itemsets Lk-1.

    The Apriori algorithm assumes that the items in an itemset are sorted lexicographically. If the first (k-2) items of two elements (itemsets) itemset1 and itemset2 in Lk-1 are the same, then itemset1 and itemset2 are said to be joinable. The itemset produced by joining itemset1 and itemset2 is {itemset1[1], itemset1[2], …, itemset1[k-1], itemset2[k-1]}. The join step is implemented in the create_Ck function in the code below; a small sketch of joining and pruning follows this list.

  • Pruning strategy

    Because of the prior property (any infrequent (k-1)-itemset cannot be a subset of a frequent k-itemset), if any (k-1)-subset of a candidate k-itemset in Ck is not in Lk-1, the candidate cannot be frequent and can be removed from Ck, yielding a compressed Ck. The is_apriori function in the code below checks whether the prior property holds; the create_Ck function includes this pruning step, i.e., candidates that violate the prior property are pruned.

  • Deletion strategy

    Based on the compressed Ck, all transactions are scanned and each candidate in Ck is counted; candidates that do not satisfy the minimum support are removed, yielding the frequent k-itemsets Lk. The deletion strategy is implemented in the generate_Lk_by_Ck function in the code below.
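
To make the join and pruning steps concrete, here is a minimal sketch; the contents of L2 are made up for illustration, and the article's own create_Ck function below does the same work:

from itertools import combinations

# A hypothetical L2 of frequent 2-itemsets (frozensets so they can be set members).
L2 = {frozenset(p) for p in [('l1', 'l2'), ('l1', 'l3'), ('l2', 'l3'), ('l2', 'l5')]}

def join_and_prune(a, b, Lksub1, k):
    # Join step: two (k-1)-itemsets are joinable if their first k-2 sorted items match.
    if sorted(a)[:k-2] != sorted(b)[:k-2]:
        return None  # not joinable
    candidate = a | b
    # Pruning step: keep the candidate only if all its (k-1)-subsets are frequent.
    if all(frozenset(s) in Lksub1 for s in combinations(candidate, k - 1)):
        return candidate
    return None  # pruned: some (k-1)-subset is infrequent

# {l1,l2} join {l1,l3} -> {l1,l2,l3}; all of its 2-subsets are in L2, so it is kept.
print(join_and_prune(frozenset(['l1', 'l2']), frozenset(['l1', 'l3']), L2, 3))
# {l2,l3} join {l2,l5} -> {l2,l3,l5}; {l3,l5} is not in L2, so it is pruned (None).
print(join_and_prune(frozenset(['l2', 'l3']), frozenset(['l2', 'l5']), L2, 3))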

3.1.2 Steps

  1. Every item is a member of the set C1 of candidate 1-itemsets. The algorithm scans all transactions, collects every item, and generates C1 (see the create_C1 function in the code below). It then counts each item and removes from C1 the items that do not satisfy the minimum support, obtaining the frequent 1-itemsets L1.
  2. The pruning strategy is applied to the set generated by self-joining L1, producing the set C2 of candidate 2-itemsets. All transactions are then scanned and each candidate in C2 is counted. As before, the candidates that do not satisfy the minimum support are removed from C2, obtaining the frequent 2-itemsets L2.
  3. The pruning strategy is applied to the set generated by self-joining L2, producing the set C3 of candidate 3-itemsets. All transactions are then scanned and each candidate in C3 is counted. As before, the candidates that do not satisfy the minimum support are removed from C3, obtaining the frequent 3-itemsets L3.
  4. And so on: the pruning strategy is applied to the set generated by self-joining Lk-1 to produce the candidate k-itemsets Ck, all transactions are scanned and each candidate in Ck is counted, and the candidates that do not satisfy the minimum support are removed from Ck, obtaining the frequent k-itemsets Lk. A small hand-traced run of these steps follows this list.
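
The following minimal sketch hand-traces these steps on a made-up four-transaction dataset (not the article's sample data):

# Hand-traced toy run of the level-wise steps above (dataset made up for illustration).
transactions = [{'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'B', 'C'}]
min_count = 2  # min_support = 0.5 over 4 transactions

def count(itemset):
    # Support count: the number of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t)

# Step 1: C1 -> L1. Every single item appears 3 times, so all are frequent.
assert count({'A'}) == count({'B'}) == count({'C'}) == 3
# Step 2: self-join L1 and count C2 -> L2. Every pair appears twice, so all are frequent.
assert count({'A', 'B'}) == count({'A', 'C'}) == count({'B', 'C'}) == 2
# Step 3: self-joining L2 gives C3 = {A,B,C}; all of its 2-subsets are in L2, so it
# survives pruning, but its count is 1 < 2, so L3 is empty and the search stops.
assert count({'A', 'B', 'C'}) == 1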

3.2 Generating association rules from frequent itemsets

Once frequent itemsets are found, strong association rules can be generated directly from them. The generation steps are as follows:

  • For each frequent itemset itemset, generate all of its non-empty proper subsets (these subsets are necessarily frequent itemsets);
  • For each non-empty proper subset s of itemset, output the rule s => (itemset − s) if support_count(itemset) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold. A sketch of this procedure follows this list.
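
A minimal sketch of this rule-generation step; it assumes a support_data dictionary like the one built by the code below, filled in here with values that match the sample dataset of section 4:

from itertools import combinations

# Support values for one frequent itemset and its subsets (from the sample data).
support_data = {
    frozenset(['l1', 'l2', 'l5']): 2 / 9,
    frozenset(['l1', 'l2']): 4 / 9,
    frozenset(['l1', 'l5']): 2 / 9,
    frozenset(['l2', 'l5']): 2 / 9,
    frozenset(['l1']): 6 / 9,
    frozenset(['l2']): 7 / 9,
    frozenset(['l5']): 2 / 9,
}

itemset = frozenset(['l1', 'l2', 'l5'])
min_conf = 0.7
for r in range(1, len(itemset)):  # all non-empty proper subsets of itemset
    for s in map(frozenset, combinations(itemset, r)):
        conf = support_data[itemset] / support_data[s]
        if conf >= min_conf:
            print(set(s), '=>', set(itemset - s), 'conf:', conf)
# Prints three rules, each with confidence 1.0 (order may vary):
# {l5} => {l1,l2}, {l1,l5} => {l2}, {l2,l5} => {l1}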

4. Sample and Python implementation code

The sample data used here is taken from the frequent-itemset mining example in Data Mining: Concepts and Techniques (Third Edition); the transactions are listed in the load_data_set function below.

Based on this sample data, the following Python code implements the Apriori algorithm. Two points deserve attention:

  • Since the Apriori algorithm assumes that the items in an itemset are sorted lexicographically, while a set itself is unordered, we need to convert between set and list where necessary;
  • Since a dictionary (support_data) is used to record the support of each itemset, the itemset must serve as the dictionary key; a mutable set cannot be a dictionary key, so itemsets are converted to the immutable frozenset at the appropriate times. A short demonstration follows.
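
The snippet below is a minimal demonstration of why frozenset is needed as a dictionary key:

support_data = {}
support_data[frozenset(['l1', 'l2'])] = 4 / 9   # works: frozenset is hashable
try:
    support_data[set(['l1', 'l2'])] = 4 / 9     # fails: set is mutable and unhashable
except TypeError as e:
    print(e)  # unhashable type: 'set'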
"""
# Python 3
# Filename: apriori.py
# Author: llhthinker
# Email: hangliu56[AT]gmail[DOT]com
# Blog: http://www.cnblogs.com/llhthinker/p/6719779.html
# Date: 2017-04-16
"""


def load_data_set():
    """
    Load a sample data set (from Data Mining: Concepts and Techniques, 3rd Edition).
    Returns: 
        A data set: A list of transactions. Each transaction contains several items.
    """
    data_set = [['l1', 'l2', 'l5'], ['l2', 'l4'], ['l2', 'l3'],
            ['l1', 'l2', 'l4'], ['l1', 'l3'], ['l2', 'l3'],
            ['l1', 'l3'], ['l1', 'l2', 'l3', 'l5'], ['l1', 'l2', 'l3']]
    return data_set


def create_C1(data_set):
    """
    Create the candidate 1-itemset C1 by scanning the data set.
    Args:
        data_set: A list of transactions. Each transaction contains several items.
    Returns:
        C1: A set which contains all candidate 1-itemsets
    """
    C1 = set()
    for t in data_set:
        for item in t:
            item_set = frozenset([item])
            C1.add(item_set)
    return C1


def is_apriori(Ck_item, Lksub1):
    """
    Judge whether a candidate k-itemset satisfies the Apriori property.
    Args:
        Ck_item: a candidate k-itemset in Ck, the set of all candidate k-itemsets.
        Lksub1: Lk-1, a set which contains all frequent (k-1)-itemsets.
    Returns:
        True: satisfies the Apriori property.
        False: does not satisfy the Apriori property.
    """
    for item in Ck_item:
        sub_Ck = Ck_item - frozenset([item])
        if sub_Ck not in Lksub1:
            return False
    return True


def create_Ck(Lksub1, k):
    """
    Create Ck, a set which contains all candidate k-itemsets,
    by self-joining Lk-1.
    Args:
        Lksub1: Lk-1, a set which contains all frequent (k-1)-itemsets.
        k: the item number of a frequent itemset.
    Return:
        Ck: a set which contains all candidate k-itemsets.
    """
    Ck = set()
    len_Lksub1 = len(Lksub1)
    list_Lksub1 = list(Lksub1)
    for i in range(len_Lksub1):
        for j in range(i + 1, len_Lksub1):  # consider each unordered pair once
            l1 = list(list_Lksub1[i])
            l2 = list(list_Lksub1[j])
            l1.sort()
            l2.sort()
            if l1[0:k-2] == l2[0:k-2]:
                Ck_item = list_Lksub1[i] | list_Lksub1[j]
                # pruning
                if is_apriori(Ck_item, Lksub1):
                    Ck.add(Ck_item)
    return Ck


def generate_Lk_by_Ck(data_set, Ck, min_support, support_data):
    """
    Generate Lk from Ck by executing the deletion strategy.
    Args:
        data_set: A list of transactions. Each transaction contains several items.
        Ck: A set which contains all candidate k-itemsets.
        min_support: The minimum support.
        support_data: A dictionary. The key is frequent itemset and the value is support.
    Returns:
        Lk: A set which contains all frequent k-itemsets.
    """
    Lk = set()
    item_count = {}
    for t in data_set:
        for item in Ck:
            if item.issubset(t):
                if item not in item_count:
                    item_count[item] = 1
                else:
                    item_count[item] += 1
    t_num = float(len(data_set))
    for item in item_count:
        if (item_count[item] / t_num) >= min_support:
            Lk.add(item)
            support_data[item] = item_count[item] / t_num
    return Lk


def generate_L(data_set, k, min_support):
    """
    Generate all frequent itemsets.
    Args:
        data_set: A list of transactions. Each transaction contains several items.
        k: Maximum number of items for all frequent itemsets.
        min_support: The minimum support.
    Returns:
        L: The list of Lk.
        support_data: A dictionary. The key is frequent itemset and the value is support.
    """
    support_data = {}
    C1 = create_C1(data_set)
    L1 = generate_Lk_by_Ck(data_set, C1, min_support, support_data)
    Lksub1 = L1.copy()
    L = []
    L.append(Lksub1)
    for i in range(2, k+1):
        Ci = create_Ck(Lksub1, i)
        Li = generate_Lk_by_Ck(data_set, Ci, min_support, support_data)
        Lksub1 = Li.copy()
        L.append(Lksub1)
    return L, support_data


def generate_big_rules(L, support_data, min_conf):
    """
    Generate big rules (strong association rules) from frequent itemsets.
    Args:
        L: The list of Lk.
        support_data: A dictionary. The key is frequent itemset and the value is support.
        min_conf: The minimum confidence threshold.
    Returns:
        big_rule_list: A list which contains all big rules. Each big rule is represented
                       as a 3-tuple.
    """
    big_rule_list = []
    sub_set_list = []
    for i in range(0, len(L)):
        for freq_set in L[i]:
            for sub_set in sub_set_list:
                if sub_set.issubset(freq_set):
                    conf = support_data[freq_set] / support_data[freq_set - sub_set]
                    big_rule = (freq_set - sub_set, sub_set, conf)
                    if conf >= min_conf and big_rule not in big_rule_list:
                        # print(freq_set - sub_set, " => ", sub_set, "conf: ", conf)
                        big_rule_list.append(big_rule)
            sub_set_list.append(freq_set)
    return big_rule_list


if __name__ == "__main__":
    """
    Test
    """
    data_set = load_data_set()
    L, support_data = generate_L(data_set, k=3, min_support=0.2)
    big_rules_list = generate_big_rules(L, support_data, min_conf=0.7)
    for Lk in L:
        print("=" * 50)
        print("frequent " + str(len(list(Lk)[0])) + "-itemsets\t\tsupport")
        print("=" * 50)
        for freq_set in Lk:
            print(freq_set, support_data[freq_set])
    print()
    print("Big Rules")
    for item in big_rules_list:
        print(item[0], "=>", item[1], "conf: ", item[2])

[Screenshot of the code's output omitted]

==============================

References:

Data Mining: Concepts and Techniques (Third Edition)

Machine Learning in Action

 
