Data mining - in-depth analysis of FP-Growth algorithm

Table of contents

1. Introduction

2. What is data mining?

3. Traditional Association Rules Mining Algorithms

3.1 The Apriori algorithm

3.2 Disadvantages and limitations

4. Introduction to the FP-Growth Algorithm

4.1 Overview of FP-Growth Algorithm

Frequent itemsets

FP-Tree

Conditional pattern base

4.2 Construction of FP-Tree

5. FP-Growth Algorithm Process

5.1 Raw dataset preprocessing

Data Format

Data Preprocessing Steps

5.2 Building FP-Tree

FP-Tree construction steps

5.3 Mining frequent itemsets from FP-Tree

Conditional pattern base

Steps for mining frequent itemsets

6. Advantages of FP-Growth algorithm

6.1 Compressed storage based on FP-Tree

6.2 Reduced number of dataset scans

6.3 Ability to handle large-scale datasets

7. Cases of FP-Growth in practical applications

7.1 Market Basket Analysis

7.2 Sequence Analysis in Bioinformatics

8. Summary

9. Code implementation


1. Introduction

As an important field of computer science, data mining aims to discover hidden patterns, associations, and valuable information in large-scale data sets. The FP-Growth algorithm is an excellent association rule mining algorithm: by building a compact data structure and processing it efficiently, it can mine frequent itemsets from large data sets. This article analyzes the principles and advantages of the FP-Growth algorithm in depth and introduces cases from its practical application.

2. What is data mining?

Data mining is the process of automatically discovering patterns, associations, and information from large amounts of data. It involves various fields such as machine learning, statistics, database systems, etc. The main goal of data mining is to extract useful knowledge from data sets, which can be used to predict future trends, make decisions, or optimize business processes.

3. Traditional Association Rules Mining Algorithms

Traditional association rule mining algorithms mainly include the Apriori algorithm, which discovers frequent itemsets by scanning the data set level by level and then generates association rules from those frequent itemsets. However, the Apriori algorithm has several disadvantages and limitations.

3.1 The Apriori algorithm

The Apriori algorithm uses a level-by-level search, starting from single-element itemsets and gradually generating frequent itemsets with more elements. Its main steps are:

1. Scan the data set to obtain the support (occurrence frequency) of each single item.
2. Generate the frequent 1-itemsets.
3. From the frequent 1-itemsets, generate candidate 2-itemsets and compute their support.
4. Iteratively generate higher-order candidate itemsets and compute their support.
5. Repeat the above steps until no new frequent itemsets are generated.
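The level-wise process above can be sketched in a few lines of Python. This is a simplified, illustrative version, not an optimized Apriori implementation; the `apriori` function and the toy transactions are made up for this example:

```python
def apriori(transactions, min_support):
    """Naive Apriori sketch: generate candidate k-itemsets level by level
    and keep those whose support count meets the threshold."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    frequent = {}
    current = []
    # Level 1: count single items and keep the frequent ones.
    for i in sorted(items):
        count = sum(1 for t in transactions if i in t)
        if count >= min_support:
            frequent[frozenset([i])] = count
            current.append(frozenset([i]))
    k = 2
    while current:
        # Candidate generation: unions of frequent (k-1)-itemsets of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = []
        for c in candidates:
            # One full pass over the data set per level to count support.
            count = sum(1 for t in transactions if c <= t)
            if count >= min_support:
                frequent[c] = count
                current.append(c)
        k += 1
    return frequent

txns = [['a', 'b', 'c'], ['a', 'c'], ['a', 'd'], ['b', 'c']]
freq = apriori(txns, 2)
# 5 frequent itemsets: {a}, {b}, {c}, {a,c}, {b,c}
```

Note that every level requires a full pass over the transactions to count candidate supports; this per-level scanning and candidate generation is exactly the cost the FP-Growth algorithm is designed to avoid.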

3.2 Disadvantages and limitations

Although the Apriori algorithm is a classic association rule mining algorithm, it also has some disadvantages:

1. On large-scale data sets, generating candidate itemsets and computing their support is expensive, which makes the algorithm inefficient.
2. It must scan the data set many times, incurring a large IO overhead, especially when memory is limited.
3. The generated candidate itemsets can be very large and occupy a lot of storage space.

4. Introduction to the FP-Growth Algorithm

With the advent of the big data era, data mining has become an important means of obtaining valuable information from massive data. Association rule mining is an important task in this field: its goal is to find itemsets that occur frequently in a data set, between which there may be potential association rules. However, traditional association rule mining algorithms such as Apriori are inefficient on large-scale data sets. The FP-Growth algorithm was developed to overcome these limitations.

4.1 Overview of FP-Growth Algorithm

The FP-Growth (Frequent Pattern Growth) algorithm is a frequent itemset mining algorithm based on the FP-Tree structure, proposed by Jiawei Han et al. in 2000 [1]. Unlike the Apriori algorithm, FP-Growth mines frequent itemsets efficiently by constructing an FP-Tree, thereby avoiding candidate itemset generation altogether.

Frequent itemsets

In association rule mining, frequent itemsets are itemsets whose occurrence frequency in the data set reaches a preset threshold (the support threshold). Frequent itemsets are the basis of association rule mining and are used to generate interesting association rules.
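As a small illustration of this definition (the `support_count` helper and the grocery transactions are invented for this sketch), the support count of an itemset is simply the number of transactions that contain all of its items:

```python
def support_count(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s.issubset(t))

txns = [{'milk', 'bread'}, {'milk', 'eggs'}, {'bread', 'eggs'},
        {'milk', 'bread', 'eggs'}]
n = support_count(['milk', 'bread'], txns)
# n == 2, so {milk, bread} is frequent whenever the support threshold is <= 2
```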

FP-Tree

FP-Tree is the core data structure of the FP-Growth algorithm; it stores frequent items and their support counts in compressed form. An FP-Tree consists of a root node, internal nodes, and leaf nodes.

  1. Root node: the root node stores no information and only serves to connect the different transaction paths.

  2. Internal nodes: an internal node stores an element item and its support count. When the same element item appears on the same path in multiple transactions, those transactions share one node, and the item's occurrences are accumulated in its count.

  3. Leaf nodes: a leaf node likewise stores an element item and its support count, and marks the end of a transaction path.

The process of constructing FP-Tree is as follows:

  1. Create a root node.

  2. For each transaction, elements are inserted into the FP-Tree in descending order of support.

Conditional pattern base

In the FP-Growth algorithm, the conditional pattern base refers to the prefix path ending with a certain element item. The conditional pattern base is used to construct a new conditional FP-Tree, so as to realize recursive mining of frequent itemsets.

4.2 Construction of FP-Tree

The construction of the FP-Tree is the first stage of the FP-Growth algorithm. It involves two scans of the data set: the first counts the support of each element item and sorts the items in descending order of support; the second builds the FP-Tree.

The specific steps to construct FP-Tree are as follows:

  1. Scan the data set a first time to count the support of each element item, and sort the items in descending order of support.

  2. Scan the data set a second time to build the FP-Tree: for each transaction (or basket), insert its element items into the FP-Tree in descending order of support.

In the process of constructing FP-Tree, since the element items have been arranged in descending order of support, the same element items will appear adjacent to each other, which makes the construction process of FP-Tree very efficient. The final constructed FP-Tree will be used in the second stage, which is to mine frequent itemsets.

5. FP-Growth Algorithm Process

FP-Growth algorithm is an efficient frequent itemset mining algorithm, which can efficiently mine frequent itemsets from large-scale data sets by constructing FP-Tree structure and recursive method. This section will introduce the process of the FP-Growth algorithm in detail, including preprocessing the original data set, constructing the FP-Tree and mining frequent itemsets from the FP-Tree.

5.1 Raw dataset preprocessing

The first step of the FP-Growth algorithm is to preprocess the original data set: remove duplicate items within each transaction and sort the items in descending order of support. This preprocessing improves the efficiency of the algorithm and reduces repeated scans of the data set.

Data Format

The FP-Growth algorithm usually accepts a transaction database, in which each transaction represents a shopping basket or transaction record and consists of several items. Items can be commodities, labels, gene sequences, and so on.

Data Preprocessing Steps

  1. Deduplication: remove duplicate items within each transaction of the original data set.

  2. Count supports: count the support of each item, that is, its occurrence frequency in the data set.

  3. Sort in descending order of support: sort the items by their support counts in descending order to obtain the ordered item list.
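The three steps above can be sketched as follows (a minimal, illustrative version; the `preprocess` name and the sample transactions are assumptions for this example). Items below the support threshold are dropped here as well, since they cannot appear in any frequent itemset:

```python
from collections import Counter

def preprocess(transactions, min_support):
    """Preprocessing sketch: deduplicate items within each transaction,
    count item supports, and reorder each transaction by descending
    support, dropping infrequent items."""
    txns = [set(t) for t in transactions]            # step 1: deduplicate
    counts = Counter(i for t in txns for i in t)     # step 2: count supports
    order = {i: c for i, c in counts.items() if c >= min_support}
    # step 3: descending support, with item name as a deterministic tie-breaker
    return [sorted((i for i in t if i in order),
                   key=lambda i: (-order[i], i))
            for t in txns]

ordered = preprocess([['b', 'a', 'a', 'c'], ['b', 'c'], ['c', 'd']], 2)
# [['c', 'b'], ['c', 'b'], ['c']] -- 'a' and 'd' fall below the threshold
```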

5.2 Building FP-Tree

Building FP-Tree is the second step of FP-Growth algorithm, which converts the preprocessed data set into a compact FP-Tree data structure. During the construction of FP-Tree, frequent itemsets are compressed and stored, which greatly reduces the storage space occupied.

FP-Tree construction steps

  1. Create root node: The root node of FP-Tree does not store any information, it is only used to connect different transaction paths.

  2. For each transaction, insert its element items into the FP-Tree in descending order of support, starting from the root node. If an item already exists on the current path of the FP-Tree, increase the support count of its node; otherwise, add a new node for the item with its support count initialized to 1.

  3. Link identical items: The same element items in multiple transactions share the same node in the FP-Tree. In this way, FP-Tree realizes the compressed storage of frequent itemsets.

As noted in Section 4.2, because the element items are already arranged in descending order of support, identical items fall on adjacent, shared paths, which makes the construction of the FP-Tree very efficient.
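The sharing of prefixes can be seen in a minimal sketch that uses nested dictionaries instead of node objects (illustrative only; the full implementation in Section 9 also maintains parent pointers and a header table):

```python
def build_tree(ordered_txns):
    """Minimal FP-Tree sketch: insert each pre-sorted transaction from the
    root, sharing nodes along common prefixes.
    Each node maps an item to [count, children]."""
    root = {}
    for txn in ordered_txns:
        children = root
        for item in txn:
            if item in children:
                children[item][0] += 1      # shared prefix: bump the count
            else:
                children[item] = [1, {}]    # new branch for this item
            children = children[item][1]
    return root

# transactions already ordered by descending support
tree = build_tree([['c', 'b', 'a'], ['c', 'b'], ['c', 'd']])
# 'c' is shared by all three paths: tree['c'][0] == 3
```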

5.3 Mining frequent itemsets from FP-Tree

After the FP-Tree is built, the FP-Growth algorithm enters the third step, which is to recursively mine frequent itemsets from the FP-Tree. This process is the core of the FP-Growth algorithm. By recursively traversing the FP-Tree and using the conditional pattern base to construct a new conditional FP-Tree, efficient frequent itemset mining is realized.

Conditional pattern base

As defined in Section 4.1, the conditional pattern base of an element item is the set of prefix paths ending with that item. It is used to construct a new conditional FP-Tree and thus to mine frequent itemsets recursively.

Steps for mining frequent itemsets

The main steps of mining frequent itemsets are as follows:

  1. For each element item in the FP-Tree, find its conditional pattern base, that is, all prefix paths ending with that element item.

  2. Construct a new conditional FP-Tree based on the conditional pattern base.

  3. Continue to recursively mine frequent itemsets on the new conditional FP-Tree.

The recursive process will generate new frequent itemsets at each layer, and finally get all frequent itemsets.
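As a toy illustration of step 1 (reading prefix paths straight from the reordered transactions rather than from the tree's node links; the function name and data are invented for this sketch), the conditional pattern base of an item collects the prefixes that precede it, weighted by the counts of the paths they came from:

```python
from collections import Counter

def conditional_pattern_base(item, ordered_txns, txn_counts):
    """Toy illustration: the conditional pattern base of `item` is the set
    of prefixes that precede it, with the count of each originating path."""
    base = Counter()
    for txn, count in zip(ordered_txns, txn_counts):
        if item in txn:
            prefix = tuple(txn[:txn.index(item)])
            if prefix:                 # an empty prefix contributes nothing
                base[prefix] += count
    return dict(base)

# transactions already reordered by descending support (c > b > a)
txns = [['c', 'b', 'a'], ['c', 'b'], ['c', 'a']]
base = conditional_pattern_base('a', txns, [2, 3, 1])
# {('c', 'b'): 2, ('c',): 1}
```

A conditional FP-Tree is then built from these weighted prefixes, and mining recurses on it.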

6. Advantages of FP-Growth algorithm

As an efficient frequent itemset mining algorithm, FP-Growth has many advantages on large-scale data sets. It achieves efficient mining by constructing the FP-Tree structure and using conditional pattern bases. This section details these advantages: FP-Tree-based compressed storage, a reduced number of dataset scans, and the ability to process large-scale data sets.

6.1 Compressed storage based on FP-Tree

The FP-Growth algorithm compresses and stores frequent itemsets through the FP-Tree structure, and does not need to generate candidate itemsets, thus saving a lot of storage space. In traditional association rule mining algorithms, such as the Apriori algorithm, in order to find frequent itemsets, it is necessary to generate all possible candidate item sets, and then count the support of the candidate item sets. Since the set of candidate items may be very large, this will take up a lot of storage space and computing resources.

In contrast, the FP-Growth algorithm replaces the process of generating candidate itemsets by constructing an FP-Tree. FP-Tree stores frequent itemsets in the form of trees, thus avoiding the overhead of generating a large number of candidate itemsets. Since FP-Tree compresses and stores the same element items, such a structure can represent large-scale frequent itemsets in a small storage space, thereby saving storage resources.

6.2 Reduced number of dataset scans

By constructing the FP-Tree, the FP-Growth algorithm needs to scan the data set only twice, instead of scanning it repeatedly like the Apriori algorithm. In traditional association rule mining algorithms, finding frequent itemsets requires many scans: the data set is first scanned once to count the support of each item, and then scanned again at every level to compute the support of the newly generated candidate itemsets.

When constructing the FP-Tree, the FP-Growth algorithm counts the support of each item in a single scan of the data set and then represents the data set in the form of a tree. Mining frequent itemsets afterwards only requires recursive traversal of the FP-Tree, without rescanning the data set. Since scanning the data set is one of the main costs of frequent itemset mining, reducing the number of scans greatly improves the algorithm's efficiency.

6.3 Ability to handle large-scale datasets

The FP-Growth algorithm is suitable for processing large-scale data sets, especially in the case of limited memory, and its efficiency is higher. In large-scale data sets, traditional association rule mining algorithms, such as the Apriori algorithm, need to generate a large number of candidate item sets, which will occupy a large amount of storage space and computing resources. In addition, multiple scans of the dataset will also result in high IO overhead.

In contrast, the FP-Growth algorithm avoids generating large numbers of candidate itemsets and scanning the data set repeatedly, thanks to the FP-Tree structure and recursive mining. The compressed storage of the FP-Tree saves storage space, and needing only two scans of the data set greatly reduces IO overhead. Therefore, the FP-Growth algorithm performs well on large-scale data sets, and its advantage is even greater when memory is constrained.

7. Cases of FP-Growth in practical applications

7.1 Market Basket Analysis

The FP-Growth algorithm can be applied to supermarket shopping basket data to discover frequently purchased commodity combinations. Based on the frequent itemsets mined, supermarkets can formulate more effective commodity matching and promotion strategies.

7.2 Sequence Analysis in Bioinformatics

The FP-Growth algorithm is also used in bioinformatics to mine frequent patterns from DNA or protein sequence data to help discover associations between genes and functional proteins.

8. Summary

As an efficient frequent itemset mining algorithm, FP-Growth algorithm successfully solves the shortcomings of traditional Apriori algorithm by constructing FP-Tree and compressing and storing frequent itemsets. In practical applications, the FP-Growth algorithm has demonstrated powerful mining capabilities in market basket analysis, bioinformatics and other fields. With the advent of the era of big data, the FP-Growth algorithm will continue to play an important role in the field of data mining.

9. Code implementation

The implementation of the FP-Growth algorithm involves constructing the FP-Tree and mining frequent itemsets from it. The following is a simple Python implementation, including code for building the FP-Tree and mining frequent itemsets from it. Note that this is a simplified version; in practice, the algorithm can be further optimized and improved.


class TreeNode:
    def __init__(self, item, count, parent):
        self.item = item          # element item
        self.count = count        # support count
        self.parent = parent      # parent node
        self.children = {}        # child nodes, keyed by item
        self.node_link = None     # next node in the tree holding the same item

def create_tree(data, min_support):
    # First scan: count the support of every element item.
    support = {}
    for transaction, count in data.items():
        for item in transaction:
            support[item] = support.get(item, 0) + count

    # Drop items whose support is below min_support.
    support = {item: c for item, c in support.items() if c >= min_support}

    # If no item is frequent, there are no frequent itemsets.
    if not support:
        return None, None

    # Header table: item -> [support count, head of its node-link list].
    header_table = {item: [c, None] for item, c in support.items()}

    # Create the FP-Tree root node.
    root = TreeNode(None, 1, None)

    # Second scan: insert each transaction in descending order of support
    # (ties broken by item name for determinism).
    for transaction, count in data.items():
        filtered = [item for item in transaction if item in header_table]
        filtered.sort(key=lambda item: (-header_table[item][0], item))
        if filtered:
            update_tree(filtered, root, header_table, count)

    return root, header_table

def update_tree(items, node, header_table, count):
    # Insert the first item as a child of `node`, then recurse on the rest.
    first = items[0]
    if first in node.children:
        node.children[first].count += count
    else:
        new_node = TreeNode(first, count, node)
        node.children[first] = new_node
        # Append the new node to this item's node-link list.
        if header_table[first][1] is None:
            header_table[first][1] = new_node
        else:
            update_header(header_table[first][1], new_node)

    # Recursively insert the remaining element items.
    if len(items) > 1:
        update_tree(items[1:], node.children[first], header_table, count)

def update_header(node_to_test, target_node):
    # Walk to the end of the node-link list and append target_node.
    while node_to_test.node_link is not None:
        node_to_test = node_to_test.node_link
    node_to_test.node_link = target_node

def ascend_tree(node, prefix_path):
    # Climb from a node toward the root, collecting the prefix path.
    if node.parent is not None:
        prefix_path.append(node.item)
        ascend_tree(node.parent, prefix_path)

def find_prefix_paths(base_item, header_table):
    # Collect the conditional pattern base of base_item via its node links.
    conditional_patterns = {}
    node = header_table[base_item][1]
    while node is not None:
        prefix_path = []
        ascend_tree(node, prefix_path)
        # prefix_path[0] is base_item itself; the rest is the prefix.
        if len(prefix_path) > 1:
            conditional_patterns[frozenset(prefix_path[1:])] = node.count
        node = node.node_link
    return conditional_patterns

def mine_fp_tree(header_table, min_support, prefix, frequent_itemsets):
    # Recursively mine the FP-Tree, least frequent items first.
    sorted_items = sorted(header_table, key=lambda item: header_table[item][0])
    for item in sorted_items:
        new_prefix = prefix.copy()
        new_prefix.add(item)
        frequent_itemsets.append(new_prefix)
        conditional_patterns = find_prefix_paths(item, header_table)
        _, conditional_header = create_tree(conditional_patterns, min_support)
        if conditional_header is not None:
            mine_fp_tree(conditional_header, min_support, new_prefix, frequent_itemsets)

def fp_growth(data, min_support):
    # Entry point: data maps each transaction (a frozenset) to its count.
    root, header_table = create_tree(data, min_support)
    if root is None:
        return []
    frequent_itemsets = []
    mine_fp_tree(header_table, min_support, set(), frequent_itemsets)
    return frequent_itemsets

# Test code
data = {
    frozenset(['a', 'b', 'c']): 4,
    frozenset(['a', 'c', 'd']): 2,
    frozenset(['a', 'b', 'd']): 2,
    frozenset(['b', 'c', 'd']): 3,
    frozenset(['b', 'd']): 5,
    frozenset(['c', 'd']): 3,
    frozenset(['b', 'c']): 3,
    frozenset(['a', 'c']): 3,
    frozenset(['a', 'd']): 2,
    frozenset(['a', 'b', 'c', 'd']): 2
}

min_support = 3
frequent_itemsets = fp_growth(data, min_support)
print("Frequent Itemsets:")
for itemset in frequent_itemsets:
    print(itemset)

In this implementation, the `TreeNode` class represents an FP-Tree node, holding the element item, support count, parent node, and child nodes. The FP-Tree is built with two scans of the data set, and frequent itemsets are mined from it recursively. Finally, a small test data set is used and the mined frequent itemsets are printed.

Please note that this is just a simple implementation. In practice, the algorithm can be optimized and improved according to specific situations to meet the processing requirements of more complex data mining tasks and large-scale data sets.

Origin blog.csdn.net/m0_61789994/article/details/131837391