Machine Learning and Data Mining Study Notes (5): Association Mining

Table of Contents

1. Definition of Association Mining

2. Association Rules

2.1 Definition of Rules

2.2 Rule Evaluation Measures

3. Frequent Itemsets

3.1 Itemsets

3.2 Support Count (\sigma)

3.3 Support

3.4 Frequent Itemsets

4. Frequent Itemset Generation Algorithms

4.1 Apriori Algorithm

4.2 FP-Growth


 

1. Definition of Association Mining

Association mining is the task of predicting the occurrence of some items based on the occurrence of other items in a transaction.

The input is generally:

(1) Transaction database

(2) Support and confidence

The output is: all rules that capture frequently co-occurring items.

 

2. Association Rules

2.1 Definition of Rules

An association rule is an expression of the form X -> Y, where X and Y are itemsets. Consider the following transaction database:

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

An example association rule is: {Milk,Diaper} -> {Beer}.

 

2.2 Rule Evaluation Measures

  • Support (s): the fraction of transactions that contain both X and Y.
  • Confidence (c): the fraction of transactions containing X that also contain Y.

For the above transaction database, we calculate the support and confidence of the association rule {Milk,Diaper}-> {Beer} :

                                                            s=\frac{\sigma(\{Milk,Diaper,Beer\})}{|T|}=\frac{2}{5}

                                                            c=\frac{\sigma(\{Milk,Diaper,Beer\})}{\sigma(\{Milk,Diaper\})}=\frac{2}{3}
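These two quantities can be recomputed with a short script. The following is a minimal sketch, assuming the classic 5-transaction example database used in the text; the `sigma` helper name is our own.

```python
# A minimal sketch that recomputes s and c for {Milk,Diaper} -> {Beer},
# assuming the classic 5-transaction example database used in the text.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    """Support count: number of transactions that contain itemset."""
    return sum(1 for t in db if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y, transactions) / len(transactions)       # support = 2/5
c = sigma(X | Y, transactions) / sigma(X, transactions)  # confidence = 2/3
```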

 

3. Frequent Itemsets

3.1 Itemsets

A collection of one or more items.

A k-itemset is an itemset containing k items.

 

3.2 Support Count (\sigma)

That is, the number of transactions in which the itemset occurs, for example:

                                                           \sigma(\{Milk,Bread,Diaper\})=2

 

3.3 Support

That is, the fraction of transactions in which the itemset occurs, for example:

                                                          s(\{Milk,Bread,Diaper\})=\frac{2}{5}

 

3.4 Frequent Itemsets

That is, an itemset whose support is greater than or equal to minsup.

 

For a given transaction set T, the task of association rule mining becomes finding all rules that satisfy the following conditions:

  • s \geq minsup
  • c \geq minconf
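As a sketch, the two thresholds can be applied as a simple filter over a candidate rule. The `rule_ok` helper and the threshold values below are our own illustration, not part of the original text.

```python
# Filter a candidate rule X -> Y by minsup and minconf (illustrative
# thresholds; the rule_ok helper is our own naming, not a standard API).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    return sum(1 for t in db if itemset <= t)

def rule_ok(X, Y, db, minsup=0.4, minconf=0.6):
    s = sigma(X | Y, db) / len(db)        # support of the rule
    c = sigma(X | Y, db) / sigma(X, db)   # confidence of the rule
    return s >= minsup and c >= minconf

# {Milk,Diaper} -> {Beer}: s = 0.4, c = 2/3, so it passes both thresholds.
```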

 

4. Frequent Itemset Generation Algorithms

For a set of d items, the itemset lattice contains 2^d candidate itemsets.

Computing the support of every candidate and then filtering out the frequent ones is extremely inefficient.
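A brute-force sketch makes the cost concrete: with d items, every one of the 2^d - 1 non-empty subsets must be counted against the whole database. The toy 4-transaction database and variable names below are our own illustration.

```python
from itertools import combinations

# Brute-force frequent-itemset search: enumerate all 2^d - 1 non-empty
# candidate itemsets and count the support of each one.
transactions = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
items = sorted(set().union(*transactions))  # d = 5 distinct items

candidates = [frozenset(c) for k in range(1, len(items) + 1)
              for c in combinations(items, k)]  # 2^5 - 1 = 31 candidates

min_count = 2
frequent = {c: n for c in candidates
            if (n := sum(1 for t in transactions if c <= t)) >= min_count}
# Every candidate required a full database scan -- exponential in d.
```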

 

4.1 Apriori Algorithm

The Apriori algorithm is built on the Apriori principle:

  • If an itemset is frequent, then all of its subsets are frequent; equivalently, if an itemset is infrequent, then all of its supersets are infrequent.
  • Support is anti-monotone: the support of an itemset is never greater than the support of any of its subsets.
    • Expressed as a formula: \forall X,Y:(X\subseteq Y)\Rightarrow s(X)\geq s(Y)

 

For example, when counting the supports of the candidate itemsets, if {A,E} is found to be infrequent,

then no itemset containing {A,E} needs to be counted: all of them must be infrequent.
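The pruning effect can be sketched by counting how many candidates a single infrequent pair eliminates. The 5-item universe below is our own toy example.

```python
from itertools import combinations

# Anti-monotonicity pruning: once {A,E} is known to be infrequent, every
# superset of it can be skipped without counting its support.
items = ["A", "B", "C", "D", "E"]
infrequent = frozenset({"A", "E"})

all_candidates = [frozenset(c) for k in range(1, len(items) + 1)
                  for c in combinations(items, k)]   # 31 candidates
pruned = [c for c in all_candidates if not infrequent <= c]
# 8 supersets of {A,E} are eliminated, leaving 23 candidates to count.
```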

 

We use the following transaction set as an example to demonstrate how to use the Apriori algorithm to generate frequent itemsets:

TID Items
10 A,C,D
20 B,C,E
30 A,B,C,E
40 B,E
Here SUPmin = 2: an itemset counts as frequent when its support count is at least 2.

Next, we scan the transaction set for the first time to get the candidate table C1 of 1-itemsets:

Itemset sup
{A} 2
{B} 3
{C} 3
{D} 1
{E} 3

Removing the itemset {D}, whose support count is less than 2, gives the frequent itemset table L1:

Itemset sup
{A} 2
{B} 3
{C} 3
{E} 3

Generating itemsets of size 2 from table L1 and counting their supports gives table C2:

Itemset sup
{A,B} 1
{A,C} 2
{A,E} 1
{B,C} 2
{B,E} 3
{C,E} 2

Removing the itemsets {A,B} and {A,E}, whose support counts are less than 2, gives the frequent itemset table L2:

Itemset sup
{A,C} 2
{B,C} 2
{B,E} 3
{C,E} 2

Generating itemsets of size 3 from table L2 and counting their supports gives table C3:

Itemset sup
{B,C,E} 2

Repeating this process yields the frequent itemsets of every size.
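The whole walkthrough can be reproduced with a minimal level-wise Apriori sketch. This is not an optimized implementation, and the helper names are our own.

```python
from itertools import combinations

# Minimal level-wise Apriori on the walkthrough's transaction set:
# join frequent k-itemsets into (k+1)-candidates, prune by the Apriori
# principle, then count supports against the database.
transactions = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
min_count = 2  # SUPmin

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
L = {frozenset({i}) for i in items if count(frozenset({i})) >= min_count}  # L1
frequent = {}
k = 1
while L:
    frequent.update({s: count(s) for s in L})
    # join step: unions of frequent k-itemsets that form (k+1)-itemsets
    candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
    # prune step: keep only candidates whose k-subsets are all frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in L for s in combinations(c, k))}
    L = {c for c in candidates if count(c) >= min_count}  # next level
    k += 1
# frequent now holds L1, L2, and L3: {A,B,C,E, AC,BC,BE,CE, BCE} with counts.
```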

 

4.2 FP-Growth

FP-Growth approaches the problem with a tree: it compresses the transaction database into a tree structure (the FP-tree) and mines the frequent itemsets from that tree.

We illustrate how to generate FP-tree through the following transaction set:

First, we count the frequency of each 1-itemset, giving the frequency-ordered list f-list = I2, I1, I3, I4, I5.

Then the items of each transaction are rearranged in this order to get the (ordered) FI column, dropping items whose support count is below min_support = 2:

TID Items (ordered) FI
100 {I1,I2,I5,I6} {I2,I1,I5}
200 {I2,I4} {I2,I4}
300 {I2,I3} {I2,I3}
400 {I1,I2,I4} {I2,I1,I4}
500 {I1,I3,I7} {I1,I3}
600 {I2,I3} {I2,I3}
700 {I1,I3} {I1,I3}
800 {I1,I2,I3,I5} {I2,I1,I3,I5}
900 {I1,I2,I3} {I2,I1,I3}
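The f-list and the (ordered) FI column can be reproduced with a short preprocessing sketch. Ties in frequency are broken by item name here, which happens to match the order in the text; the variable names are our own.

```python
from collections import Counter

# FP-Growth preprocessing: build the f-list (frequent items in descending
# frequency) and reorder each transaction accordingly, which also drops
# items whose support count is below min_support.
transactions = [
    ["I1","I2","I5","I6"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"],
    ["I1","I3","I7"], ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"],
    ["I1","I2","I3"],
]
min_support = 2

counts = Counter(i for t in transactions for i in t)
# descending frequency, ties broken by item name
flist = sorted((i for i, n in counts.items() if n >= min_support),
               key=lambda i: (-counts[i], i))

ordered = [[i for i in flist if i in t] for t in transactions]
# flist == ['I2', 'I1', 'I3', 'I4', 'I5']
```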

Then, following the obtained (ordered) FI column, the FP-tree is built by inserting each transaction row by row.

Once the final FP-tree is built, a suffix search is performed for each of the items I1, I2, I3, I4, and I5.

 

Let's take I5 as an example. The paths with I5 as the suffix are I2->I1->I5 and I2->I1->I3->I5. Counting the items other than I5 along these paths gives the table of prefix items whose counts meet min_support:

Item frequency
I2 2
I1 2

A conditional FP-tree is then built from this table:

Appending I5 to the corresponding paths yields the frequent pattern sequences:

  • {I2,I5:2}
  • {I1,I5:2}
  • {I2,I1,I5:2}

By analogy, the suffix frequent pattern sequences of all items are obtained (I2 has no prefix items, so no conditional patterns need to be computed for it):

Item Frequent Patterns Generated
I1 {I2,I1:4}
I3 {I1,I3:4}, {I2,I1,I3:2}, {I2,I3:4}
I4 {I2,I4:2}
I5 {I1,I5:2}, {I2,I1,I5:2}, {I2,I5:2}
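Each pattern's count in the table above can be verified by scanning the original nine transactions directly; a sketch (the `sigma` helper name is our own):

```python
# Verify each FP-Growth pattern's support count by scanning the original
# transaction set directly.
transactions = [
    {"I1","I2","I5","I6"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
    {"I1","I3","I7"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
    {"I1","I2","I3"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

# pattern -> support count as listed in the table
patterns = {
    frozenset({"I2","I1"}): 4,
    frozenset({"I1","I3"}): 4,
    frozenset({"I2","I1","I3"}): 2,
    frozenset({"I2","I3"}): 4,
    frozenset({"I2","I4"}): 2,
    frozenset({"I1","I5"}): 2,
    frozenset({"I2","I1","I5"}): 2,
    frozenset({"I2","I5"}): 2,
}
assert all(sigma(p) == n for p, n in patterns.items())
```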


Origin blog.csdn.net/weixin_39478524/article/details/109552964