Table of Contents
1. Definition of Association Mining
2. Association Rules
3. Frequent Itemsets
4. Frequent Itemset Generation Algorithms
1. Definition of Association Mining
Association mining is defined as predicting the occurrence of an item based on the occurrences of the other items in a transaction.
The input is generally:
(1) Transaction database
(2) Support and confidence
The output is: all rules that capture items occurring together frequently.
2. Association Rules
2.1 Definition of rules
A rule is an expression of the form X -> Y, where X and Y are itemsets. Consider the transaction database shown below:
An example association rule is: {Milk,Diaper} -> {Beer}.
2.2 Measurement of evaluation rules
- Support (s): the fraction of transactions that contain both X and Y.
- Confidence (c): the fraction of transactions containing X that also contain Y.
For the transaction database above, the support and confidence of the association rule {Milk,Diaper} -> {Beer} are

s = σ(Milk, Diaper, Beer) / N,  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper)

where σ(·) is the support count and N is the total number of transactions.
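The two measures can be sketched in a few lines of Python. The transaction database itself is assumed here (the original figure is not reproduced), following the classic textbook example that the rule {Milk,Diaper} -> {Beer} comes from; the function names are illustrative, not a standard API.

```python
# Hypothetical 5-transaction database (assumed, since the figure is
# not reproduced in the text).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(x, y, db):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, db) / support(x, db)

x, y = {"Milk", "Diaper"}, {"Beer"}
print(support(x | y, transactions))   # s = 2/5 = 0.4
print(confidence(x, y, transactions)) # c = 2/3
```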
3. Frequent Itemsets
3.1 Itemsets
An itemset is a collection of one or more items.
A k-itemset is an itemset containing k items.
3.2 Support Count (σ)
σ(X) is the number of transactions that contain the itemset X, i.e. the frequency of occurrence of the itemset.
3.3 Support
The support s(X) is the fraction of transactions that contain X, i.e. s(X) = σ(X) / N.
3.4 Frequent Itemsets
A frequent itemset is an itemset whose support is greater than or equal to a threshold minsup.
For a given transaction set T, the association rule mining task is then to find all rules X -> Y whose support is at least minsup and whose confidence is at least minconf.
4. Frequent Itemset Generation Algorithms
For a set of d items, the figure below shows that there are 2^d possible candidate itemsets in total.
A brute-force approach computes the support of every candidate itemset and then keeps the frequent ones.
Computed this way, the efficiency is very low.
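The brute-force approach can be sketched as follows. The small transaction set is assumed for illustration, and the names are mine; the point is that the candidate count grows as 2^d even though few candidates turn out frequent.

```python
from itertools import combinations

# Small sample database (assumed for illustration).
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]
items = sorted(set().union(*transactions))  # d distinct items

# Enumerate all 2^d - 1 non-empty candidate itemsets.
candidates = [frozenset(c)
              for k in range(1, len(items) + 1)
              for c in combinations(items, k)]
print(len(candidates))  # 2^5 - 1 = 31 candidates for d = 5

# Count the support of every candidate, then filter by minsup = 2.
support_count = {c: sum(1 for t in transactions if c <= t)
                 for c in candidates}
frequent = {c for c, n in support_count.items() if n >= 2}
print(len(frequent))  # only 9 of the 31 candidates are frequent
```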
4.1 Apriori algorithm
The Apriori algorithm is designed around the Apriori principle:
- If an itemset is frequent, then all of its subsets are also frequent.
- Support is anti-monotone: the support of an itemset never exceeds the support of any of its subsets.
- Expressed as a formula: ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y).
For example, when computing the supports of the candidate itemsets above, if the itemset {A,E} is found to be infrequent,
then no itemset containing {A,E} needs to be counted: all of them must be infrequent.
We use the following transaction set as an example to demonstrate how to use the Apriori algorithm to generate frequent itemsets:
| TID | Items   |
|-----|---------|
| 10  | A,C,D   |
| 20  | B,C,E   |
| 30  | A,B,C,E |
| 40  | B,E     |
Here SUPmin = 2, i.e. an itemset counts as frequent when its support count is greater than or equal to 2.
Next, we scan the transaction set once to obtain the candidate 1-itemsets C1:
| Itemset | sup |
|---------|-----|
| {A}     | 2   |
| {B}     | 3   |
| {C}     | 3   |
| {D}     | 1   |
| {E}     | 3   |
Remove the itemset {D}, whose support count is less than 2, to obtain the frequent itemset table L1:
| Itemset | sup |
|---------|-----|
| {A}     | 2   |
| {B}     | 3   |
| {C}     | 3   |
| {E}     | 3   |
Generate the candidate 2-itemsets from L1 and compute their supports to obtain table C2:
| Itemset | sup |
|---------|-----|
| {A,B}   | 1   |
| {A,C}   | 2   |
| {A,E}   | 1   |
| {B,C}   | 2   |
| {B,E}   | 3   |
| {C,E}   | 2   |
Remove the itemsets {A,B} and {A,E}, whose support counts are less than 2, to obtain the frequent itemset table L2:
| Itemset | sup |
|---------|-----|
| {A,C}   | 2   |
| {B,C}   | 2   |
| {B,E}   | 3   |
| {C,E}   | 2   |
Generate the candidate 3-itemsets from L2 and compute their supports to obtain table C3:
| Itemset | sup |
|---------|-----|
| {B,C,E} | 2   |
No larger candidates can be generated, so the frequent itemsets of every size have been obtained.
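The level-wise procedure above can be sketched in Python. This is a minimal sketch of Apriori on the example transaction set, not a production implementation; the function names are mine, and candidate generation joins the frequent (k-1)-itemsets and then prunes with the Apriori principle.

```python
from itertools import combinations

MIN_SUP = 2  # minimum support count (SUPmin)

# The transaction set from the table above.
transactions = [
    {"A", "C", "D"},       # TID 10
    {"B", "C", "E"},       # TID 20
    {"A", "B", "C", "E"},  # TID 30
    {"B", "E"},            # TID 40
]

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def apriori(db, min_sup):
    """Return {frequent itemset: support count} by level-wise search."""
    items = sorted(set().union(*db))
    # L1: frequent 1-itemsets.
    Lk = {frozenset([i]) for i in items
          if support_count(frozenset([i]), db) >= min_sup}
    frequent, k = {}, 1
    while Lk:
        frequent.update({s: support_count(s, db) for s in Lk})
        k += 1
        # Candidate generation: join Lk with itself, keep size-k unions.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk
                             for s in combinations(c, k - 1))}
        Lk = {c for c in candidates
              if support_count(c, db) >= min_sup}
    return frequent

result = apriori(transactions, MIN_SUP)
print(result[frozenset({"B", "C", "E"})])  # 2, matching table C3
```

Note how {A,B,C} and {A,C,E} are pruned without ever being counted, because their subsets {A,B} and {A,E} were already found infrequent.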
4.2 FP-Growth
FP-Growth approaches the problem from the direction of a tree: it compresses the transaction database into a tree structure (the FP-tree) and mines the frequent itemsets from that tree.
We illustrate how the FP-tree is generated using the following transaction set:
First, we count the frequency of each 1-itemset and remove the items with support below min_support = 2; sorting the remainder by frequency gives the f-list = I2, I1, I3, I4, I5.
Then the items of each transaction in the database are rearranged in f-list order to obtain the (ordered) FI column:
| TID | Items         | (ordered) FI  |
|-----|---------------|---------------|
| 100 | {I1,I2,I5,I6} | {I2,I1,I5}    |
| 200 | {I2,I4}       | {I2,I4}       |
| 300 | {I2,I3}       | {I2,I3}       |
| 400 | {I1,I2,I4}    | {I2,I1,I4}    |
| 500 | {I1,I3,I7}    | {I1,I3}       |
| 600 | {I2,I3}       | {I2,I3}       |
| 700 | {I1,I3}       | {I1,I3}       |
| 800 | {I1,I2,I3,I5} | {I2,I1,I3,I5} |
| 900 | {I1,I2,I3}    | {I2,I1,I3}    |
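The f-list and (ordered) FI preprocessing can be sketched as follows; the variable names are mine, and ties in frequency (I1 vs. I3, I4 vs. I5) are broken by item name, which reproduces the order in the table.

```python
from collections import Counter

MIN_SUP = 2

# The transaction set from the table above.
transactions = {
    100: ["I1", "I2", "I5", "I6"],
    200: ["I2", "I4"],
    300: ["I2", "I3"],
    400: ["I1", "I2", "I4"],
    500: ["I1", "I3", "I7"],
    600: ["I2", "I3"],
    700: ["I1", "I3"],
    800: ["I1", "I2", "I3", "I5"],
    900: ["I1", "I2", "I3"],
}

# Step 1: count 1-itemset frequencies and build the f-list
# (descending frequency, ties broken by item name; infrequent
# items such as I6 and I7 are dropped).
counts = Counter(i for items in transactions.values() for i in items)
flist = [i for i, n in sorted(counts.items(),
                              key=lambda kv: (-kv[1], kv[0]))
         if n >= MIN_SUP]
print(flist)  # ['I2', 'I1', 'I3', 'I4', 'I5']

# Step 2: rearrange each transaction in f-list order -> (ordered) FI.
rank = {item: r for r, item in enumerate(flist)}
ordered = {tid: sorted((i for i in items if i in rank), key=rank.get)
           for tid, items in transactions.items()}
print(ordered[100])  # ['I2', 'I1', 'I5']
```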
Then, from the (ordered) FI column, the FP-tree is generated by inserting the transactions row by row; the construction steps are shown below:
After the final FP-tree is built, we perform a suffix search for each of the elements I1, I2, I3, I4, and I5.
Take I5 as an example. The paths ending in I5 are I2->I1->I5 and I2->I1->I3->I5. Counting the elements other than I5 gives a header table of the elements with counts at least min_support:
| Item | Frequency |
|------|-----------|
| I2   | 2         |
| I1   | 2         |
An FP-tree is then generated from this table, giving the following conditional FP-tree structure:
Appending I5 to the corresponding nodes yields the frequent pattern sequences:
- {I2,I5:2}
- {I1,I5:2}
- {I2,I1,I5:2}
By analogy, the suffix frequent pattern sequences of all the other nodes are obtained (I2 has no prefix nodes, so no conditional patterns need to be computed for it):
| Item | Frequent Patterns Generated        |
|------|------------------------------------|
| I1   | {I2,I1:4}                          |
| I3   | {I1,I3:4} {I2,I1,I3:2} {I2,I3:4}   |
| I4   | {I2,I4:2}                          |
| I5   | {I1,I5:2} {I2,I1,I5:2} {I2,I5:2}   |
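The suffix-based mining step can be sketched by working directly on the (ordered) FI lists rather than on the tree: for each suffix item, collect the prefix paths, keep the frequent header items, and count every combination together with the suffix. This brute-force pass over the conditional pattern base is a simplification (real FP-Growth recurses on conditional FP-trees), but it yields the same counts on this example; all names here are mine.

```python
from collections import Counter
from itertools import combinations

MIN_SUP = 2

# The (ordered) FI lists from the table above.
ordered = [
    ["I2", "I1", "I5"],
    ["I2", "I4"],
    ["I2", "I3"],
    ["I2", "I1", "I4"],
    ["I1", "I3"],
    ["I2", "I3"],
    ["I1", "I3"],
    ["I2", "I1", "I3", "I5"],
    ["I2", "I1", "I3"],
]

def suffix_patterns(suffix, db, min_sup):
    """Mine frequent patterns ending in `suffix` from its prefix paths."""
    # Conditional pattern base: the prefix of each list up to the suffix.
    prefixes = [t[:t.index(suffix)] for t in db if suffix in t]
    # Header table: prefix items that are frequent in the conditional base.
    header = Counter(i for p in prefixes for i in p)
    keep = {i for i, n in header.items() if n >= min_sup}
    # Count every combination of kept items, appended with the suffix.
    patterns = Counter()
    for p in prefixes:
        kept = [i for i in p if i in keep]
        for k in range(1, len(kept) + 1):
            for combo in combinations(kept, k):
                patterns[combo + (suffix,)] += 1
    return {pat: n for pat, n in patterns.items() if n >= min_sup}

print(suffix_patterns("I5", ordered, MIN_SUP))
# {('I2', 'I5'): 2, ('I1', 'I5'): 2, ('I2', 'I1', 'I5'): 2}
```

Running it for I1, I3, and I4 as well reproduces the full table of frequent patterns above.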