Frequent Itemset Algorithm

Table of contents


Foreword

Basic knowledge

Main text

1. Apriori algorithm

2. FP-Tree algorithm

1) First data scan: count the 1-itemsets

2) Build FP-Tree

3) Mine frequent itemsets from the FP-Tree

Summary


Foreword

Frequent itemset mining is a foundational topic in data mining research. It reveals the variables that often appear together in a data set and thus provides support for decision-making. Frequent itemset mining is the basis of many important data mining tasks such as association rules, correlation analysis, causality, sequential itemsets, local periodicity, and episodes. Frequent itemsets therefore have a wide range of applications, such as shopping basket analysis, web page prefetching, cross-shopping, personalized websites, and network intrusion detection.

Basic knowledge

Consider the following supermarket transaction table:

| User | Spicy sticks (A) | Coke (B) | Pencil (C) | Badminton (D) | Laundry detergent (E) |
|------|------------------|----------|------------|---------------|------------------------|
| 1    | ✓                | ✓        |            | ✓             |                        |
| 2    | ✓                |          | ✓          | ✓             | ✓                      |
| 3    | ✓                | ✓        |            | ✓             |                        |
| 4    |                  | ✓        | ✓          | ✓             |                        |
| 5    | ✓                |          | ✓          |               |                        |

Support: the fraction of transactions that contain an itemset. For example, the support of spicy sticks (A) = 4/5 × 100% = 80%, and the support of coke (B) = 3/5 × 100% = 60%.

Confidence: the conditional probability that a transaction containing X also contains Y, i.e. confidence(X ⇒ Y) = support(X ∪ Y) / support(X). For example, confidence(spicy sticks ⇒ badminton) = 3/4 × 100% = 75%, and confidence(coke ⇒ badminton) = 3/3 × 100% = 100%.

Itemset: The most basic pattern is the itemset, which is a collection of items.

Frequent patterns: Refers to itemsets, sequences, or substructures that occur frequently in a dataset.

Frequent itemsets: sets whose support is greater than or equal to the minimum support threshold (min_sup), where support is the frequency with which a set appears across all transactions. A classic application of frequent itemsets is the shopping basket model.
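As a minimal sketch of these definitions (not from the original post; the item letters follow the table above), support and confidence can be computed directly over the five transactions:

```python
# The five transactions from the table above (A = spicy sticks ... E = laundry detergent).
transactions = [
    {"A", "B", "D"},       # user 1
    {"A", "C", "D", "E"},  # user 2
    {"A", "B", "D"},       # user 3
    {"B", "C", "D"},       # user 4
    {"A", "C"},            # user 5
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """confidence(lhs => rhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A"}))            # 0.8  -> 80%
print(support({"B"}))            # 0.6  -> 60%
print(confidence({"A"}, {"D"}))  # 0.75 -> 75%
print(confidence({"B"}, {"D"}))  # 1.0  -> 100%
```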


Main text

1. Apriori algorithm

Assuming minsupport = 0.2, derive the frequent itemsets (with 5 transactions, an itemset must appear in more than 0.2 × 5 = 1 transaction, i.e. at least twice, so E with support 1/5 is dropped):

1) The 1-itemset candidates are C1 = {A, B, C, D, E}; the frequent 1-itemsets are L1 = {A, B, C, D};

2) Joining the frequent 1-itemsets gives the 2-itemset candidates C2 = {(A,B), (A,C), (A,D), (B,C), (B,D), (C,D)}; the frequent 2-itemsets are L2 = {(A,B), (A,C), (A,D), (B,D), (C,D)}, since (B,C) appears only once;

3) Joining the frequent 2-itemsets gives the 3-itemset candidates C3 = {(A,B,C), (A,B,D), (A,C,D), (B,C,D)}; only (A,B,D) appears twice, so the frequent 3-itemsets are L3 = {(A,B,D)};

4) Finally, all frequent itemsets of size 2 or more are L = {(A,B), (A,C), (A,D), (B,D), (C,D), (A,B,D)}.
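A compact sketch of the level-wise generate-and-test idea behind Apriori, assuming `transactions` is the list from the earlier snippet and that an itemset must appear at least twice to be frequent (matching the walkthrough):

```python
def apriori(transactions, min_count=2):
    """Level-wise Apriori sketch: join (k-1)-frequent itemsets into k-candidates,
    then keep the candidates that occur at least `min_count` times."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count]   # L1
    frequent, k = list(level), 2
    while level:
        # join step: unions of two frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: count each candidate's occurrences in the data
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count]
        frequent += level
        k += 1
    return frequent

freq = apriori(transactions)
print(sorted(tuple(sorted(s)) for s in freq if len(s) >= 2))
# [('A','B'), ('A','B','D'), ('A','C'), ('A','D'), ('B','D'), ('C','D')]
```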

Assuming minconfidence = 60%, derive the association rules:

Here we only examine the largest frequent itemset (A, B, D) to see which rules are strong:

A ⇒ BD, confidence = 2/4 = 50%, not a strong association rule; AB ⇒ D, confidence = 2/2 = 100%, a strong association rule;

B ⇒ AD, confidence = 2/3 ≈ 67%, a strong association rule; AD ⇒ B, confidence = 2/3 ≈ 67%, a strong association rule;

D ⇒ AB, confidence = 2/4 = 50%, not a strong association rule; BD ⇒ A, confidence = 2/3 ≈ 67%, a strong association rule.
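Rule generation can be sketched by splitting the frequent itemset into every LHS ⇒ RHS pair and checking the confidence threshold; this reuses the `support`/`confidence` helpers defined earlier:

```python
from itertools import combinations

def rules_from(itemset):
    """Yield every LHS => RHS split of a frequent itemset with its confidence."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(sorted(itemset), r)):
            yield sorted(lhs), sorted(itemset - lhs), confidence(lhs, itemset - lhs)

min_conf = 0.6
for lhs, rhs, conf in rules_from({"A", "B", "D"}):
    verdict = "strong" if conf >= min_conf else "not strong"
    print(lhs, "=>", rhs, f"{conf:.0%}", verdict)
# AB => D reaches 100%; B => AD, AD => B and BD => A reach 67%; the rest fall below 60%.
```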


2. FP-Tree algorithm

1) First data scan: count the 1-itemsets

We still use the example above; the transactions are: user 1: ABD; user 2: ACDE; user 3: ABD; user 4: BCD; user 5: AC. The scan gives the counts A:4, D:4, B:3, C:3, E:1. E falls below the minimum support and is discarded, the remaining items are ordered by descending count as A, D, B, C, and each transaction is rewritten in that order: ADB, ADC, ADB, DBC, AC.
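A possible sketch of this first scan, reusing the `transactions` list from the first snippet (tie-breaking between equally frequent items is arbitrary; any fixed total order works):

```python
from collections import Counter

# First scan: count single items, drop those below the minimum support
# (E occurs only once), and fix a global order by descending count.
counts = Counter(i for t in transactions for i in t)    # A:4, D:4, B:3, C:3, E:1
order = [i for i, c in counts.most_common() if c >= 2]  # e.g. ['A', 'D', 'B', 'C']
# Rewrite every transaction in that order before inserting it into the tree.
sorted_tx = [[i for i in order if i in t] for t in transactions]
print(sorted_tx)
# [['A','D','B'], ['A','D','C'], ['A','D','B'], ['D','B','C'], ['A','C']]
```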

2) Build FP-Tree

Each reordered transaction is then inserted into a prefix tree below the root: items along an existing prefix reuse that node and increment its count, while the first item that diverges starts a new branch; a header table links all nodes carrying the same item. Inserting the five transactions yields the branches A(4)→D(3)→B(2), A(4)→D(3)→C(1), A(4)→C(1), and D(1)→B(1)→C(1). So far, we have completed the construction of the FP-Tree.
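One way to sketch the construction, consuming the `sorted_tx` list from the previous snippet (`Node` and `header` are illustrative names, not from the original post):

```python
class Node:
    """One FP-Tree node: an item, a count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.count = item, 1
        self.parent, self.children = parent, {}

def build_fp_tree(sorted_tx):
    root = Node(None, None)
    header = {}                          # item -> every node that holds this item
    for tx in sorted_tx:
        node = root
        for item in tx:                  # walk/extend the shared prefix path
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, header

root, header = build_fp_tree(sorted_tx)
# The heaviest path is root -> A(4) -> D(3) -> B(2), shared by users 1 and 3.
```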

3) Mine frequent itemsets from the FP-Tree

Frequent itemsets are obtained by walking the header table from the bottom (least frequent item) upward and mining each item's conditional pattern base, i.e. the prefix paths that lead to its nodes:

Node C: the prefix paths are {A,D}:1, {A}:1 and {B,D}:1. In this base A and D each appear twice, so the frequent 2-itemsets containing C are {(A,C), (C,D)}.

Node B: the prefix paths are {A,D}:2 and {D}:1, so D appears 3 times and A twice; this gives (B,D) with count 3 and (A,B) with count 2, and recursing into B's conditional tree also yields (A,B,D) with count 2.

Node D: the only prefix path is {A}:3, giving (A,D) with count 3. Node A has no prefix paths and contributes nothing further.

To sum up, all frequent itemsets (of size 2 or more) are: {(A,B), (A,C), (A,D), (B,D), (C,D), (A,B,D)}.
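The per-node step can be sketched as follows, reusing `order` and `header` from the snippets above. This shows only the first level of FP-Growth (the frequent items inside each conditional pattern base); the full algorithm recurses on conditional trees built from each base:

```python
from collections import Counter

def prefix_paths(item, header):
    """Conditional pattern base of `item`: for every node holding it,
    collect the items on the path back to the root, weighted by the node's count."""
    paths = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            paths.append((path, node.count))
    return paths

for item in reversed(order):               # bottom-up: C, B, D, A
    base = Counter()
    for path, cnt in prefix_paths(item, header):
        for i in path:
            base[i] += cnt
    frequent_with_item = {i: c for i, c in base.items() if c >= 2}
    print(item, frequent_with_item)
# C pairs with A (count 2) and D (count 2) -> (A,C), (C,D), as in the walkthrough.
```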


Summary

1. Research on frequent itemset mining algorithms can be roughly summarized into the following four aspects:

a. Traversal direction: bottom-up, top-down, or a mixed traversal;

b. Search strategy: depth-first or breadth-first;

c. Itemset generation: whether candidate itemsets are generated;

d. Database layout: the database can be laid out vertically or horizontally.

2. Different choices of traversal method, search strategy, and database layout produce different algorithms. Research shows that no mining algorithm can outperform all other mining algorithms across every domain and data type at the same time; in other words, each relatively strong algorithm has specific scenarios and environments to which it is suited.


Source: blog.csdn.net/flyTie/article/details/127146048