"Data Mining Concepts and Techniques" Chapter 7 Advanced Pattern Mining

Frequent itemset mining is a basic task in data mining. Beyond ordinary frequent itemsets, it also includes closed frequent itemsets and maximal frequent itemsets.

In addition to mining basic frequent itemsets and associations, we can also mine more advanced forms of patterns, which are described in this chapter:

  • multilevel associations
  • multidimensional associations
  • quantitative association rules
  • rare patterns
  • negative patterns
  • high-dimensional patterns
  • pattern compression and approximate patterns

Multilevel Associations

Multilevel associations involve data at multiple levels of abstraction. For example, a Dell computer can be generalized to "computer", and a Sony headset to "headset". Such patterns can be mined under several strategies for choosing minimum support thresholds: the same threshold can be used at every level; the threshold can be reduced level by level, so that information contained in lower-level patterns is not lost; or the smallest threshold among all the levels can be used throughout.
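As a minimal sketch of the reduced-support strategy (the item hierarchy, transactions, and thresholds below are all hypothetical):

```python
from collections import Counter

# Hypothetical concept hierarchy: leaf item -> higher-level ancestor.
hierarchy = {"Dell computer": "computer", "HP computer": "computer",
             "Sony headset": "headset", "Bose headset": "headset"}

transactions = [{"Dell computer", "Sony headset"},
                {"HP computer"},
                {"Dell computer", "Bose headset"},
                {"Sony headset"}]

# Reduced support: a lower threshold at the lower abstraction level.
min_sup = {"high": 0.6, "low": 0.5}
n = len(transactions)

# Count each ancestor at most once per transaction.
high_counts = Counter(anc for t in transactions
                      for anc in {hierarchy[i] for i in t})
low_counts = Counter(item for t in transactions for item in t)

frequent_high = {i for i, c in high_counts.items() if c / n >= min_sup["high"]}
frequent_low = {i for i, c in low_counts.items() if c / n >= min_sup["low"]}
print(frequent_high)  # {'computer', 'headset'}, each with support 0.75
print(frequent_low)   # {'Dell computer', 'Sony headset'}, support 0.5
```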

A side effect of multilevel association mining is that redundant rules may arise across levels of abstraction, due to "ancestor" relationships between items. For example:

Buy a computer => buy an HP printer (support 8%, confidence 70%)    (1.1)
Buy a Dell computer => buy an HP printer (support 2%, confidence 72%)    (1.2)

Suppose rules (1.1) and (1.2) have both been mined. Is the latter, more specific rule useful?
Here, "computer" is an ancestor of "Dell computer", so rule (1.1) is an ancestor of rule (1.2).
Here, a redundancy definition is given:

Rule R1 is an ancestor of rule R2 if R1 can be obtained by replacing the items in R2 with their ancestors in a concept hierarchy.
By this definition, a rule is considered redundant if its support and confidence are close to their "expected" values, computed from an ancestor of the rule.

In this example, rule (1.1) has 70% confidence and 8% support. If about 1/4 of all computers sold are Dell computers (1/4 is hypothetical), then we can expect rule (1.2) to have about 70% confidence and 2% (8% × 1/4) support. If this is indeed the case, rule (1.2) is not interesting: it provides no additional information and is less general than its ancestor.
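The redundancy test can be sketched in code. In this hypothetical helper, the rule statistics and the 1/4 proportion come from the example above, and the tolerance tol is an assumed parameter:

```python
def is_redundant(ancestor_sup, ancestor_conf, child_sup, child_conf,
                 child_fraction, tol=0.05):
    """Flag a descendant rule as redundant if its support and confidence
    are close to the values expected from its ancestor rule.
    child_fraction: fraction of the ancestor item's sales accounted for
    by the descendant item (e.g., 1/4 of computers are Dell computers)."""
    expected_sup = ancestor_sup * child_fraction
    expected_conf = ancestor_conf  # confidence is expected to carry over
    return (abs(child_sup - expected_sup) <= tol
            and abs(child_conf - expected_conf) <= tol)

# Rule (1.1): computer => HP printer, support 8%, confidence 70%
# Rule (1.2): Dell computer => HP printer, support 2%, confidence 72%
print(is_redundant(0.08, 0.70, 0.02, 0.72, child_fraction=0.25))  # True
```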

Multidimensional Associations

Multidimensional associations involve more than one dimension or predicate, and the techniques for mining them vary in part with how repeated predicates are handled.
For example, in the rule
age(X, "20...29") ^ occupation(X, "student") => buy(X, "laptop")
age, occupation, and buy are all predicates. Association rules involving two or more dimensions or predicates are called multidimensional association rules. Rules in which no predicate repeats are called interdimensional association rules; rules with repeated predicates are called hybrid-dimensional association rules.

Methods:

  1. Quantitative attributes are discretized using a predefined concept hierarchy; this is mining multidimensional association rules using static discretization of quantitative attributes.
    For example, age can be discretized into the intervals (20-30, 30-40, ...).
  2. Quantitative attributes are discretized or clustered into "bins" based on the data distribution.

In multidimensional association mining, we search for frequent predicate sets rather than frequent itemsets. A k-predicate set is a set containing k conjunctive predicates; for instance, {age, occupation, buy} is a 3-predicate set.
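As a minimal sketch of what a predicate set looks like operationally (the records and predicates below are hypothetical), the support of a k-predicate set is the fraction of records satisfying the conjunction of all k predicates:

```python
# Hypothetical relational records: one row per customer.
records = [{"age": 25, "occupation": "student", "buy": "laptop"},
           {"age": 34, "occupation": "engineer", "buy": "laptop"},
           {"age": 22, "occupation": "student", "buy": "headset"},
           {"age": 27, "occupation": "student", "buy": "laptop"}]

# The 3-predicate set {age, occupation, buy} as (attribute, test) pairs.
predicate_set = [("age", lambda v: 20 <= v <= 29),
                 ("occupation", lambda v: v == "student"),
                 ("buy", lambda v: v == "laptop")]

support = sum(all(test(r[attr]) for attr, test in predicate_set)
              for r in records) / len(records)
print(support)  # 0.5: rows 1 and 4 satisfy all three predicates
```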

Quantitative Associations

Quantitative association rules involve quantitative attributes.
Discretization, clustering, and statistical analysis for revealing exceptional behavior can all be integrated with the pattern mining process.

  1. Data-cube-based mining
    Construct data cubes from the transformed multidimensional data and mine them.
  2. Cluster-based, top-down approach

For each quantitative dimension, a standard clustering algorithm can be used to find clusters on that dimension that satisfy the minimum support threshold. For each such cluster, we examine the two-dimensional space formed by combining the cluster with a cluster or nominal attribute value in another dimension, to see whether the combination satisfies the minimum support threshold; if so, we continue on to higher-dimensional spaces. Throughout this process we can apply Apriori-style pruning: if at any point a combination's support falls below the minimum support, then neither its further partitioning nor its combination with other dimensions can meet the minimum support either. (A minimal sketch of this approach appears after the list below.)

(I understand why a further partitioning of a failing combination cannot satisfy the threshold, but it is not clear to me why its combination with other dimensions cannot either.)
  3. Statistical theory
    Use statistical theory to reveal exceptional behavior, that is, to mine rules describing behavior that deviates from what the rest of the data would lead us to expect.
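Returning to the cluster-based, top-down approach of item 2, here is a minimal sketch. The data are hypothetical, and crude binning stands in for a real clustering algorithm:

```python
# Hypothetical (age, occupation) records.
points = [(23, "student"), (25, "student"), (26, "student"),
          (41, "engineer"), (45, "engineer"), (24, "teacher")]
min_sup = 3  # absolute support threshold (hypothetical)

# Step 1: find 1-D "clusters" on the quantitative dimension that
# satisfy min_sup; clusters that fail are pruned here.
age_clusters = {"20-29": lambda a: 20 <= a <= 29,
                "40-49": lambda a: 40 <= a <= 49}
frequent_1d = {name: in_bin for name, in_bin in age_clusters.items()
               if sum(in_bin(a) for a, _ in points) >= min_sup}

# Step 2: combine each surviving cluster with nominal values of the
# other dimension. Apriori-style pruning: clusters dropped in step 1
# are never combined, since their supersets cannot reach min_sup.
for name, in_bin in frequent_1d.items():
    for occ in sorted({o for _, o in points}):
        count = sum(in_bin(a) and o == occ for a, o in points)
        if count >= min_sup:
            print(f"age in {name} & occupation={occ}: support {count}")
```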

Rare Patterns and Negative Patterns

Rare patterns occur infrequently but can be especially interesting. A negative pattern is one whose member items exhibit negatively correlated behavior.
Negative patterns should be defined with care, taking null-invariance into account.
Rare and negative patterns can highlight anomalous behavior in the data.

Null Invariance

That is, a measure may be misleadingly affected by null transactions, where a null transaction is a transaction that contains none of the items in the itemsets under consideration. A measure is null-invariant if its value is unaffected by the number of null transactions: when the total number of transactions changes while the counts involving the items of interest stay fixed, a null-invariant measure gives the same result. Definition 7.3 defines negatively correlated patterns in terms of such a measure. Here is an interesting example.
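Below is a small numeric sketch with hypothetical counts. The lift measure is not null-invariant, so its value swings as null transactions are added, while the Kulczynski measure (one of the null-invariant measures discussed in the book) does not change:

```python
def lift(n_ab, n_a, n_b, n_total):
    # lift = P(A and B) / (P(A) * P(B)); depends on n_total.
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

def kulczynski(n_ab, n_a, n_b):
    # Kulc = (P(A|B) + P(B|A)) / 2; independent of n_total.
    return 0.5 * (n_ab / n_a + n_ab / n_b)

n_ab, n_a, n_b = 100, 1000, 1000   # hypothetical counts
for n_total in (2000, 100_000):    # grow the database with null transactions
    print(n_total, lift(n_ab, n_a, n_b, n_total),
          kulczynski(n_ab, n_a, n_b))
# lift swings from 0.2 to 10.0, while Kulczynski stays at 0.1.
```

With a negative-pattern threshold ε (say 0.25), the stable Kulczynski value of 0.1 would flag the pair consistently, no matter how many null transactions the database contains.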


Constraint-Based Mining

Constraint-based mining uses user-specified constraints to guide the mining process toward patterns that match the user's intuition or satisfy given conditions; many such constraints can be pushed deep into the mining process.
Constraints are divided into pattern pruning constraints and data pruning constraints.
The useful properties of constraints include monotonicity, antimonotonicity, succinctness, convertibility, and data antimonotonicity.

A constraint is antimonotonic if, whenever an itemset violates it, every superset of that itemset also violates it. A constraint is monotonic if, whenever an itemset satisfies it, every superset also satisfies it.
An example of an antimonotonic constraint is sum(S.price) <= v (with nonnegative prices):
once the sum for the current itemset exceeds v, no superset can bring the sum back down to v or below.
An example of a monotonic constraint is count(I) > 10:
once an itemset contains more than 10 items, adding more items only increases the count, so the constraint remains satisfied.
Succinctness: the itemsets satisfying the constraint can be enumerated directly, before mining begins.
Convertible constraint:
consider avg(I.price) <= 50, that is, the average price of the items does not exceed 50. This constraint is neither monotonic nor antimonotonic, but if the items of a transaction are added in price-ascending order, it becomes antimonotonic: once the running average exceeds 50, appending only more expensive items can never lower it back.
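As a minimal sketch (hypothetical prices and threshold v), the following depth-first enumeration shows antimonotone pruning for sum(S.price) <= v: once an itemset violates the constraint, its entire branch of supersets is skipped:

```python
# Hypothetical item prices; the constraint is sum(S.price) <= v with
# nonnegative prices, which is antimonotonic.
prices = {"a": 10, "b": 25, "c": 40, "d": 5}
v = 50

def expand(itemset, remaining):
    """Depth-first enumeration of itemsets that prunes violators:
    once sum(prices) > v, no superset can satisfy the constraint,
    so the whole branch is skipped."""
    total = sum(prices[i] for i in itemset)
    if total > v:
        return  # prune this itemset and, implicitly, all its supersets
    if itemset:
        print(sorted(itemset), "sum =", total)
    for k, item in enumerate(remaining):
        expand(itemset | {item}, remaining[k + 1:])

expand(frozenset(), list(prices))
```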

High-Dimensional Data and Pattern Fusion

For mining in high-dimensional space, there are row-enumeration-based pattern growth methods, suited to datasets with many dimensions but few tuples, and the Pattern-Fusion method for mining colossal patterns.
(I don't quite understand this part.)

Pattern Compression and Approximate Patterns

To reduce the number of patterns returned by mining, we can instead mine compressed or approximate patterns.
Compressed patterns can be mined by defining representative patterns based on a clustering notion, while approximate patterns can be mined by extracting redundancy-aware top-k patterns.

Pattern Compression

Pattern compression is performed by finding representative patterns among clusters of patterns; that is, pattern compression can be achieved by pattern clustering.
The set of closed patterns is a lossless compression of the set of frequent patterns, while the set of maximal patterns is a lossy compression.
The drawback of compressing with maximal itemsets is that, as noted earlier, a maximal itemset does not carry the support information of its subsets, so that support information is lost entirely.
Therefore, since the closed itemsets are a lossless compression of the original set of frequent patterns, we look for representative patterns over the closed pattern set.
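A toy example (hypothetical transactions, min_sup = 2) makes the lossless/lossy distinction concrete: the closed patterns preserve the support of every frequent pattern, while the maximal patterns alone do not:

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"a", "c"}]
min_sup = 2
items = sorted(set().union(*transactions))

def sup(itemset):
    # Number of transactions containing the itemset.
    return sum(itemset <= t for t in transactions)

frequent = {frozenset(c): sup(set(c))
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sup(set(c)) >= min_sup}

# Closed: no proper superset has the same support.
closed = {p: s for p, s in frequent.items()
          if not any(p < q and s == frequent[q] for q in frequent)}
# Maximal: no proper superset is frequent at all.
maximal = {p: s for p, s in frequent.items()
           if not any(p < q for q in frequent)}

print(closed)   # {a}:4, {a,b}:3, {a,c}:2 -- all supports recoverable
print(maximal)  # {a,b}:3, {a,c}:2 -- sup(b), sup(c), sup(a) are lost
```

From the maximal patterns alone, the supports of a, b, and c cannot be recovered; the closed set retains them.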

Here, a distance measure between closed patterns is defined.
Let P1 and P2 be two closed patterns whose supporting transaction sets are T(P1) and T(P2), respectively. The pattern distance between P1 and P2 is:
Pat_Dist(P1, P2) = 1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|

Pattern distance is a valid distance metric defined on the sets of supporting transactions, and it incorporates the patterns' support information.

Example:
Suppose P1 and P2 are two patterns with T(P1) = {t1, t2, t3, t4, t5} and T(P2) = {t1, t2, t3, t4, t6}, where each ti is a transaction in the database.
Then the distance between P1 and P2 is:
Pat_Dist(P1, P2) = 1 - |{t1, t2, t3, t4}| / |{t1, t2, t3, t4, t5, t6}| = 1 - 4/6 = 1/3
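The same computation as a minimal Python sketch:

```python
def pat_dist(t1, t2):
    """Pattern distance between two closed patterns, given their
    supporting transaction sets: 1 minus the Jaccard similarity."""
    return 1 - len(t1 & t2) / len(t1 | t2)

T_P1 = {"t1", "t2", "t3", "t4", "t5"}
T_P2 = {"t1", "t2", "t3", "t4", "t6"}
print(pat_dist(T_P1, T_P2))  # 0.333... = 1/3
```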

Formally: given a transaction database, a minimum support min_sup, and a cluster quality measure δ, the pattern compression problem is to find a set R of representative patterns such that for each frequent pattern P there is a representative pattern Pr ∈ R that covers P, and |R| is minimized.

Redundancy-Aware Top-k Patterns

The goal is a small set of k representative patterns that are not only highly significant but also exhibit low redundancy with each other.

In the figure below, significance is represented by grayscale, and the closer two balls are, the more redundant the corresponding patterns. Suppose the number of representative patterns to find is 3, that is, k = 3.
The arrows indicate the selected patterns: (b) shows the patterns selected by redundancy-aware top-k, (c) the patterns selected by traditional top-k, and (d) the patterns selected by k-summarized pattern selection.
[Figure: comparison of redundancy-aware top-k (b), traditional top-k (c), and k-summarized (d) pattern selection]
Clearly, the selection in (c) depends only on significance: the three most significant patterns are chosen without regard to redundancy.
The selection in (d) relies only on avoiding redundancy: the set is first divided into 3 clusters, and the most representative pattern of each cluster, the one closest to its "center", is chosen.
In (b), significance and redundancy are balanced.

So, how should significance and redundancy be measured?

Significance Measure

The significance measure S is a function mapping a pattern p ∈ P to a real value such that S(p) is the interestingness (or usefulness) of the pattern p. This measure can be either subjective or objective. S(p, q) is the combined significance of patterns p and q, and S(p|q) = S(p, q) - S(q) is the relative significance of p given q.
Note that S(p, q) is the collective significance of the two patterns taken together, not in general the sum of their individual significances.

Redundancy

Given a significance measure S, the redundancy R between two patterns p and q is defined as R(p, q) = S(p) + S(q) - S(p, q).
That is, S(p|q) = S(p) - R(p, q).

The combined significance of two patterns is assumed to be no less than the significance of either pattern alone, and no more than the sum of their individual significances; that is, max(S(p), S(q)) ≤ S(p, q) ≤ S(p) + S(q). Substituting into R(p, q) = S(p) + S(q) - S(p, q), the redundancy between two patterns must satisfy:
0 ≤ R(p, q) ≤ min(S(p), S(q))
Since an ideal redundancy measure is hard to obtain, the distance between patterns can be used to approximate redundancy. The problem is then transformed into finding a set of k patterns that maximizes marginal significance, which can be approached greedily, as in the sketch below.
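Here is a minimal greedy sketch of redundancy-aware top-k selection. The significance scores and pairwise redundancies are hypothetical, and the marginal-gain greedy rule is a common approximation rather than the book's exact algorithm:

```python
# Hypothetical significance scores and symmetric pairwise redundancies.
significance = {"p1": 0.9, "p2": 0.85, "p3": 0.8, "p4": 0.5}
redundancy = {("p1", "p2"): 0.7, ("p1", "p3"): 0.1, ("p1", "p4"): 0.1,
              ("p2", "p3"): 0.1, ("p2", "p4"): 0.1, ("p3", "p4"): 0.4}

def red(p, q):
    # Redundancy is symmetric; look up either ordering.
    return redundancy.get((p, q), redundancy.get((q, p), 0.0))

def top_k(k):
    chosen, candidates = [], set(significance)
    while candidates and len(chosen) < k:
        # Marginal gain: own significance minus the worst redundancy
        # with any pattern already selected.
        best = max(candidates, key=lambda p: significance[p]
                   - max((red(p, q) for q in chosen), default=0.0))
        chosen.append(best)
        candidates.remove(best)
    return chosen

print(top_k(3))  # ['p1', 'p3', 'p2']: p3 is picked before the more
                 # significant p2, which is highly redundant with p1
```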
