[Machine Learning] Introduction to Association Mining

 

Association mining, also known as association analysis, finds frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction data, relational data, or other information stores.

The following are several transaction records of a supermarket:

Here, TID stands for the transaction ID, and Items lists the commodities bought in one transaction.

Related concepts:

1. Transaction: Each purchase record is called a transaction. For example, the data set above contains four transactions.

2. Item: Each commodity in a transaction is called an item, such as Cola, Egg, etc.

3. Item set: A set containing zero or more items is called an item set, such as {Cola, Egg, Ham}.

4. k-item set: an item set containing k items is called a k-item set, for example {Cola} is called a 1-item set, {Cola, Egg} is called a 2-item set.

5. Support count: The support count of an item set is the number of transactions in which it appears. For example, {Diaper, Beer} appears in transactions 002, 003, and 004, so its support count is 3.

6. Support: The support of the rule A→B is the probability that A and B appear together in a transaction, that is, the fraction of all transactions that contain both A and B. The formula is:

  support(A→B) = P(A,B) = P(A∩B) = num(A∩B) / num(I)

Here, I is the set of all transactions, num(I) is the total number of transactions, and num(A∩B) is the number of transactions that contain both A and B.

That is, the support is the support count divided by the total number of transactions. In the example above, the total number of transactions is 4 and the support count of {Diaper, Beer} is 3, so its support is 3÷4=75%, meaning that 75% of transactions contain both Diaper and Beer.
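The arithmetic above can be sketched in a few lines of Python. The transaction table below is hypothetical: its item lists are assumed, chosen only to agree with the numbers quoted in the text (four transactions, with {Diaper, Beer} in 002, 003, and 004). The function names are also illustrative.

```python
# Hypothetical transaction table (assumed contents, consistent with the text).
transactions = {
    "001": {"Cola", "Egg", "Ham"},
    "002": {"Cola", "Diaper", "Beer"},
    "003": {"Cola", "Diaper", "Beer", "Ham"},
    "004": {"Diaper", "Beer"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


def support(itemset, transactions):
    """Support count divided by the total number of transactions."""
    return support_count(itemset, transactions) / len(transactions)


count = support_count({"Diaper", "Beer"}, transactions)
sup = support({"Diaper", "Beer"}, transactions)
print(count, sup)  # 3 0.75
```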


7. Frequent item sets: Item sets whose support is greater than or equal to a given threshold are called frequent item sets. For example, when the threshold is set to 50%, {Diaper, Beer} is a frequent item set because its support is 75%.
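As a rough illustration, frequent item sets can be found by brute force: enumerate every candidate item set and keep those that reach the threshold. This is only a sketch and not an efficient algorithm. The transaction contents are hypothetical (assumed to match the counts quoted in the text), and `frequent_itemsets` is an illustrative name.

```python
from itertools import combinations

# Hypothetical transactions 001-004 (assumed contents, consistent with the text).
transactions = [
    {"Cola", "Egg", "Ham"},
    {"Cola", "Diaper", "Beer"},
    {"Cola", "Diaper", "Beer", "Ham"},
    {"Diaper", "Beer"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset (sorted tuple): support} for all frequent item sets."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            hits = sum(1 for t in transactions if set(candidate) <= t)
            if hits / n >= min_support:
                result[candidate] = hits / n
    return result


frequent = frequent_itemsets(transactions, 0.5)
for itemset, sup in frequent.items():
    print(itemset, sup)
```

With this assumed data and a 50% threshold, {Beer, Diaper} is kept with support 0.75, while {Egg} (support 0.25) is dropped.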

8. Antecedent and consequent: For the rule {Diaper}→{Beer}, {Diaper} is called the antecedent and {Beer} is called the consequent.

9. Confidence: The confidence of the rule A→B is the probability that B occurs given that the antecedent A occurs. In other words, it is the proportion of the transactions containing A that also contain B. The formula is:

confidence(A→B) = P(B|A)  = P(A,B) / P(A) = P(A∩B) / P(A)

For the rule {Diaper}→{Beer}, the confidence is the support count of {Diaper, Beer} divided by the support count of {Diaper}. It indicates what fraction of the transactions that contain A also contain B.

For example, the confidence of the rule {Diaper}→{Beer} is 3÷3=100%. It means that 100% of people who bought Diaper also bought Beer.
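The same calculation as a minimal Python sketch. The transaction contents are hypothetical (assumed to match the counts quoted in the text), and the helper names are illustrative.

```python
# Hypothetical transactions 001-004 (assumed contents, consistent with the text).
transactions = [
    {"Cola", "Egg", "Ham"},
    {"Cola", "Diaper", "Beer"},
    {"Cola", "Diaper", "Beer", "Ham"},
    {"Diaper", "Beer"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """confidence(A -> B) = count(A and B together) / count(A)."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)


conf = confidence({"Diaper"}, {"Beer"}, transactions)
print(conf)  # 1.0
```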

10. Strong association rules: Rules whose support and confidence are greater than or equal to the minimum support threshold (minsup) and the minimum confidence threshold (minconf), respectively, are called strong association rules. The ultimate goal of association analysis is to find strong association rules.
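A brute-force sketch of this filtering step: for every item set that reaches minsup, try each split into antecedent and consequent and keep the rules that reach minconf. The transaction contents and the threshold values (minsup = 0.5, minconf = 0.8) are assumptions made for illustration, and `strong_rules` is a hypothetical helper, not a library function.

```python
from itertools import combinations

# Hypothetical transactions 001-004 (assumed contents, consistent with the text).
transactions = [
    {"Cola", "Egg", "Ham"},
    {"Cola", "Diaper", "Beer"},
    {"Cola", "Diaper", "Beer", "Ham"},
    {"Diaper", "Beer"},
]


def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def strong_rules(transactions, minsup, minconf):
    """Return {(antecedent, consequent): (support, confidence)} for strong rules."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    rules = {}
    for k in range(2, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            joint = support_count(itemset, transactions)
            if joint == 0 or joint / n < minsup:
                continue  # the item set itself is not frequent
            for r in range(1, k):
                for left in combinations(combo, r):
                    antecedent = frozenset(left)
                    conf = joint / support_count(antecedent, transactions)
                    if conf >= minconf:
                        rules[(antecedent, itemset - antecedent)] = (joint / n, conf)
    return rules


rules = strong_rules(transactions, minsup=0.5, minconf=0.8)
for (a, b), (sup, conf) in rules.items():
    print(sorted(a), "->", sorted(b), sup, conf)
```

On this assumed data, {Diaper}→{Beer} passes both thresholds (support 0.75, confidence 1.0).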

11. Lift: The lift of A with respect to B measures how much the presence of A changes the probability that B appears. For the rule {Diaper}→{Beer}, it is the support of {Diaper, Beer} divided by the product of the support of {Diaper} and the support of {Beer}.

Lift(A→B) = P(B|A) / P(B) = P(A∩B) / (P(A) × P(B))

That is, the confidence of A→B divided by the support of B.

The lift reflects the correlation between A and B in an association rule. A lift greater than 1 indicates a positive correlation, and the higher the lift, the stronger the correlation. A lift less than 1 indicates that A and B repel each other (that is, buying A makes buying B less likely). A lift equal to 1 indicates that A and B are independent.
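A small sketch of the lift calculation, again on hypothetical transaction contents assumed to match the counts in the text. The function names are illustrative.

```python
# Hypothetical transactions 001-004 (assumed contents, consistent with the text).
transactions = [
    {"Cola", "Egg", "Ham"},
    {"Cola", "Diaper", "Beer"},
    {"Cola", "Diaper", "Beer", "Ham"},
    {"Diaper", "Beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def lift(a, b, transactions):
    """Lift(A -> B) = support(A and B together) / (support(A) * support(B))."""
    joint = support(a | b, transactions)
    return joint / (support(a, transactions) * support(b, transactions))


lift_diaper_beer = lift({"Diaper"}, {"Beer"}, transactions)
lift_cola_diaper = lift({"Cola"}, {"Diaper"}, transactions)
print(round(lift_diaper_beer, 3))  # 1.333, greater than 1: positive correlation
print(round(lift_cola_diaper, 3))  # 0.889, less than 1: negative correlation
```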

Note:

  1. A very high confidence may only mean that both item sets appear frequently across all transactions, so the apparent association may be a coincidence. In this case, check the lift as well.
  2. A low confidence may arise because the item set accounts for too small a share of all transactions. In this case, check the lift as well.
  3. Lift is a very simple way to judge the relationship, but in practice it is strongly affected by null transactions. A null transaction is one that contains neither A nor B. The more null transactions there are, the larger the lift. To avoid this effect in practical applications, the Kulczynski measure (KULC) together with the imbalance ratio (IR) is generally used.

KULC=0.5*P(B|A)+0.5*P(A|B) is the average value of two-way confidence;

IR=P(B|A)/P(A|B)
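The effect of null transactions can be checked with a short sketch. It computes lift, KULC, and IR (using the IR formula exactly as written above) before and after padding the data with transactions that contain neither Diaper nor Beer. The transaction contents, the four {Egg}-only padding transactions, and the function names are all assumptions made for illustration.

```python
# Hypothetical transactions 001-004 (assumed contents, consistent with the text).
transactions = [
    {"Cola", "Egg", "Ham"},
    {"Cola", "Diaper", "Beer"},
    {"Cola", "Diaper", "Beer", "Ham"},
    {"Diaper", "Beer"},
]
# Assumed extra null transactions: they contain neither Diaper nor Beer.
padded = transactions + [{"Egg"}] * 4


def count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def lift(a, b, transactions):
    n = len(transactions)
    return count(a | b, transactions) * n / (count(a, transactions) * count(b, transactions))


def kulc(a, b, transactions):
    """KULC = 0.5 * P(B|A) + 0.5 * P(A|B)."""
    joint = count(a | b, transactions)
    return 0.5 * joint / count(a, transactions) + 0.5 * joint / count(b, transactions)


def imbalance_ratio(a, b, transactions):
    """IR = P(B|A) / P(A|B), as defined in the text (assumes A and B co-occur)."""
    joint = count(a | b, transactions)
    return (joint / count(a, transactions)) / (joint / count(b, transactions))


a, b = {"Diaper"}, {"Beer"}
print(round(lift(a, b, transactions), 3), round(lift(a, b, padded), 3))  # 1.333 2.667
print(kulc(a, b, transactions), kulc(a, b, padded))  # 1.0 1.0
print(imbalance_ratio(a, b, transactions))  # 1.0
```

The lift doubles once the null transactions are added, while KULC stays the same because it depends only on the two conditional probabilities.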

Limitation:

The computational cost is too high: the number of candidate item sets grows exponentially with the number of items.
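To put a number on this cost: n distinct items give 2^n - 1 non-empty item sets, and a naive search would have to count the support of each one. The helper name below is illustrative.

```python
def candidate_itemsets(n_items):
    """Number of non-empty item sets that can be formed from n_items items."""
    return 2 ** n_items - 1


for n in (5, 10, 20, 30):
    print(n, candidate_itemsets(n))
# 5 31
# 10 1023
# 20 1048575
# 30 1073741823
```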


Origin blog.csdn.net/henku449141932/article/details/110817740