Introduction to the Apriori and FP-Tree Algorithms

The Apriori Association Analysis Algorithm

The Apriori algorithm is the basic algorithm for mining the frequent itemsets needed to produce association rules, and it is one of the best-known association analysis algorithms.

1. Apriori algorithm

The Apriori algorithm uses an iterative, level-wise search in which frequent k-itemsets are used to explore (k+1)-itemsets. To make this level-by-level generation of frequent itemsets more efficient, the Apriori algorithm exploits an important property that effectively reduces the search space of frequent itemsets.

Apriori property: all non-empty subsets of a frequent itemset must also be frequent itemsets. Equivalently, if an itemset A does not satisfy the minimum support threshold, that is, A is not frequent, then adding any itemset B to A produces an itemset A ∪ B that cannot be frequent either.

The basic Apriori algorithm consists of the following steps.

1) Scan the data set once to determine the support of every item. This yields the collection F1 of all frequent 1-itemsets.

2) Use the frequent (k-1)-itemsets found in the previous iteration to generate new candidate k-itemsets.

3) Scan the data set again to obtain the support counts of the candidates; a subset function determines, for each transaction t, all the candidate k-itemsets contained in t.

4) After the support counts of the candidates have been computed, delete every candidate whose support count is below the support threshold.

5) Repeat steps 2), 3), and 4) until no new frequent itemsets are generated; the algorithm then terminates.

Apriori is a level-wise algorithm that uses a "generate and test" strategy to find frequent itemsets. In the process of producing k-itemsets from (k-1)-itemsets, every newly generated k-itemset must have all of its (k-1)-item subsets frequent; if any of them is not frequent, the candidate can be removed from the current candidate set.
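To make this level-wise "generate and test" loop concrete, the following is a minimal sketch in plain Scala. The names transactions and minCount are assumptions for the example and not part of any library; itemsets are represented as sets of item strings.

// Minimal level-wise Apriori sketch. `transactions` are sets of item strings
// and `minCount` is the absolute minimum support count.
def apriori(transactions: Seq[Set[String]], minCount: Int): Map[Set[String], Int] = {
  // One scan over the transactions to count the support of candidate itemsets.
  def countSupport(candidates: Set[Set[String]]): Map[Set[String], Int] =
    candidates.map(c => c -> transactions.count(t => c.subsetOf(t)))
      .filter(_._2 >= minCount)
      .toMap

  // First scan: frequent 1-itemsets.
  var frequent = countSupport(transactions.flatten.toSet.map((i: String) => Set(i)))
  var result = frequent
  while (frequent.nonEmpty) {
    // Generate candidate (k+1)-itemsets from frequent k-itemsets ...
    val candidates = for {
      a <- frequent.keySet
      b <- frequent.keySet
      u = a union b
      if u.size == a.size + 1
      // ... and prune: every k-item subset of a candidate must itself be frequent.
      if u.subsets(a.size).forall(frequent.contains)
    } yield u
    frequent = countSupport(candidates)
    result ++= frequent
  }
  result
}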

Candidate itemsets can be generated with the following methods.

1) Brute-force method

Starting from the 2-itemsets, all candidates are enumerated exhaustively from the 1-itemsets. For example, to build the 3-itemsets, three 1-items are combined and all possibilities are listed. The algorithm then prunes: a candidate k-itemset is kept only if all of its (k-1)-item subsets are frequent.

2) F(k-1) × F1 method

A candidate k-itemset is generated by joining a frequent (k-1)-itemset with a frequent 1-item, followed by pruning. This method is complete, because every frequent k-itemset can be produced from a frequent (k-1)-itemset and a frequent 1-item. However, because the same itemset can be formed in different orders, this method may produce the same candidate k-itemset many times.

3) F(k-1) × F(k-1) method

A candidate k-itemset is generated from two frequent (k-1)-itemsets, but the first k-2 items of the two (k-1)-itemsets must be identical and their last items must differ. Since each candidate is formed from two frequent (k-1)-itemsets, an additional pruning step is still required to ensure that all the remaining (k-1)-item subsets of the candidate are frequent. A sketch of this join and prune is shown below.
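The sketch assumes each frequent (k-1)-itemset is stored as a lexicographically sorted list of items; the names joinStep, pruneStep, and frequentKMinus1 are illustrative.

// Join step: two sorted (k-1)-itemsets are merged only when their first k-2
// items agree and their last items differ (enforced by the "<" on the last item).
def joinStep(frequentKMinus1: Set[List[String]]): Set[List[String]] =
  for {
    a <- frequentKMinus1
    b <- frequentKMinus1
    if a.init == b.init && a.last < b.last
  } yield a.init ++ List(a.last, b.last)

// Prune step: keep a candidate only if every (k-1)-item subset of it is frequent.
def pruneStep(candidates: Set[List[String]],
              frequentKMinus1: Set[List[String]]): Set[List[String]] =
  candidates.filter { c =>
    c.indices.forall(i => frequentKMinus1.contains(c.patch(i, Nil, 1)))
  }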

2. Generate association rules from frequent itemsets

Once the frequent itemsets have been found in the transaction data set, strong association rules, that is, rules that satisfy both minimum support and minimum confidence, can be generated directly from them. Computing the confidence of an association rule does not require scanning the transaction data set again, because the support counts of the two itemsets involved were already obtained when the frequent itemsets were produced.

Suppose Y is a frequent itemset and X is a subset of Y. If the rule X → Y - X does not meet the confidence threshold, then no rule of the form X1 → Y - X1, where X1 is a subset of X, can meet the confidence threshold either. Using this property, suppose association rules are generated from the frequent itemset {a, b, c, d}: if the rule {b, c, d} → {a} has low confidence, then all rules whose consequent contains a can be discarded, for example {c, d} → {a, b}, {b, d} → {a, c}, and so on.
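Since the required support counts are already stored, the confidence of a rule can be looked up rather than recomputed from the data. A minimal sketch, assuming a supportCount map produced during the frequent-itemset phase (both names are illustrative):

// Confidence of the rule X -> (Y \ X), computed purely from stored support
// counts as support(Y) / support(X); no further scan of the transactions is needed.
def confidence(supportCount: Map[Set[String], Int],
               x: Set[String], y: Set[String]): Double =
  supportCount(y).toDouble / supportCount(x)

// For the frequent itemset {a, b, c, d} mentioned above: if
// confidence(supportCount, Set("b", "c", "d"), Set("a", "b", "c", "d"))
// is below the threshold, every rule whose consequent contains "a",
// such as {c, d} -> {a, b}, can be discarded without being checked.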

3. Advantages and disadvantages of the algorithm

As the classic algorithm for producing frequent itemsets, the Apriori algorithm uses the a priori property to greatly improve the efficiency of the level-by-level generation of frequent itemsets. It is simple, easy to understand, and places few requirements on the data set. However, as it was applied more widely, its shortcomings gradually became apparent; the main performance bottlenecks are the following two points.

  • Repeatedly scanning the transaction data set causes a heavy I/O load. In every iteration k, each element of the candidate set Ck must be verified by a scan of the data set to decide whether it can be added to Lk.
  • It may produce very large candidate sets. The number of candidates can grow exponentially, and such large candidate sets are a challenge in both time and space.

The FP-Tree Association Analysis Algorithm

In 2000, Han Jiawei and others proposed the FP-Growth algorithm, which finds frequent patterns based on the frequent pattern tree (Frequent Pattern Tree, FP-Tree). Its idea is to construct an FP-Tree, mapping the transactions of the data set onto the tree, and then to find all frequent itemsets from this FP-Tree.

The FP-Growth algorithm scans the transaction data set twice, sorts the frequent items contained in each transaction in descending order of their support, and compresses and stores them in the FP-Tree.

During the subsequent search for frequent patterns, the transaction data set no longer needs to be scanned; the search takes place only in the FP-Tree. FP-Growth finds frequent patterns directly by recursion, and the whole discovery process generates no candidate patterns. Because the data set is scanned only twice, the FP-Growth algorithm overcomes the problems of the Apriori algorithm, and its execution efficiency is also significantly better than that of Apriori.

1. FP-Tree structure

To reduce the number of I/O passes, the FP-Tree algorithm introduces some data structures that store the data temporarily. This data structure consists of three parts: the item header table, the FP-Tree, and the node linked lists, as shown in Figure 1.

Figure 1  The FP-Tree data structure

The first part is the item header table, which records the occurrence counts of all frequent 1-itemsets in descending order. For example, in Figure 1, A appears 8 times in the 10 transactions and therefore comes first.

The second part is the FP-Tree itself, which maps the original data set into an FP-Tree held in memory.

The third part is the node linked list. Every frequent 1-item in the item header table is the head of a node linked list, which points in turn to the places in the FP-Tree where that frequent 1-item appears. It mainly serves to connect the item header table entries with the FP-Tree for searching and updating.
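A minimal Scala sketch of these three parts; the class and field names (FPNode, HeaderEntry, nodeLink, and so on) are illustrative and not taken from any particular implementation.

import scala.collection.mutable

// One FP-Tree node: its item, its count, its parent, its children, and a link
// to the next node in the tree that carries the same item (the node linked list).
// The root is assumed to carry a null item.
class FPNode(val item: String, var count: Int, val parent: FPNode) {
  val children = mutable.Map.empty[String, FPNode]
  var nodeLink: FPNode = null
}

// One item header table entry: the total count of a frequent 1-item and the
// head of its node linked list into the FP-Tree. The header table itself maps
// items to entries in descending order of support.
class HeaderEntry(var count: Int, var head: FPNode)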

1) Building the item header table

To build the FP-Tree, the item header table must be built first. The data set is scanned for the first time to obtain the counts of all frequent 1-itemsets. Entries with support below the threshold are then removed, and the frequent 1-items are put into the item header table in descending order of support.

In the second scan of the data set, the original transactions are read, infrequent 1-items are removed, and the remaining items are sorted in descending order of support.

In this example there are 10 transactions. The first scan counts the 1-items and finds that O, I, L, J, P, M, and N each appear only once, with support below the threshold (20%), so they do not appear in the item header table. The remaining items A, C, E, G, B, D, and F are arranged in descending order of support and form the item header table.

Then comes the second scan of the data: infrequent 1-items are removed from each transaction, and the remaining items are sorted in descending order of support. For example, the transaction A, B, C, E, F, O contains O, which is not a frequent 1-item, so O is removed, leaving A, B, C, E, F; sorted by support this becomes A, C, E, B, F. The other transactions are handled in the same way. Sorting the frequent 1-items of the raw data set in this way allows the subsequent FP-Tree construction to share common ancestor nodes as much as possible.

After the two scans, the item header table has been established and the sorted data set has been obtained, as shown in Figure 2.

Figure 2  The FP-Tree item header table
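A sketch of these two scans, assuming the transactions are sequences of item strings and minCount is the absolute support threshold (both names are illustrative):

// First scan: count every item, drop the infrequent ones, and order the rest
// by descending support. The resulting order is the item header table order.
def buildHeaderOrder(transactions: Seq[Seq[String]], minCount: Int): Seq[String] =
  transactions.flatten
    .groupBy(identity)
    .map { case (item, occurrences) => (item, occurrences.size) }
    .toSeq
    .filter(_._2 >= minCount)
    .sortBy(-_._2)
    .map(_._1)

// Second scan: remove infrequent items from each transaction and sort the rest
// by the header order, so that transactions share common prefixes in the tree.
def sortTransactions(transactions: Seq[Seq[String]],
                     order: Seq[String]): Seq[Seq[String]] = {
  val rank = order.zipWithIndex.toMap
  transactions.map(_.filter(rank.contains).sortBy(rank))
}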

2) Building the FP-Tree

With the item header table and the sorted data set in place, construction of the FP-Tree can begin.

The FP-Tree contains no data at the start. The transactions of the sorted data set are read one by one and inserted into the FP-Tree. During insertion, items ranked higher become ancestor nodes and items ranked lower become descendant nodes. If a transaction shares a common ancestor with an existing path, the count of the corresponding common ancestor node is increased by one. If new nodes appear after an insertion, the header table entries of the corresponding items are linked to the new nodes through the node linked lists. The FP-Tree is complete once all transactions have been inserted.

The following example illustrates the process of building the FP-Tree. First, the first transaction A, C, E, B, F is inserted, as shown in Figure 3. At this point the FP-Tree has no nodes, so A, C, E, B, F form a single independent path, all node counts are 1, and the item header table is linked to the new nodes through the node linked lists.

Figure 3  Construction of the FP-Tree (1)

Next, the transaction A, C, G is inserted, as shown in Figure 4. Since A, C, G shares the ancestor node sequence A, C with the existing FP-Tree, only a new node G has to be added with its count set to 1, while the counts of A and C are each increased by 1 and become 2. Of course, the node linked list of the G node must be updated accordingly.

Figure 4  Construction of the FP-Tree (2)

The remaining 8 transactions are inserted in the same way, and the final FP-Tree is the one shown in Figure 1. Since the principle is the same, the steps are not described one by one.
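The insertion loop just described can be sketched as follows, reusing the illustrative FPNode and HeaderEntry classes from the structure sketch above.

import scala.collection.mutable

// Insert one sorted transaction into the FP-Tree rooted at `root`. Shared
// prefixes only increase the counts of existing nodes; new items create new
// nodes, which are also appended to that item's node linked list in the header.
def insert(root: FPNode, transaction: Seq[String],
           header: mutable.Map[String, HeaderEntry]): Unit = {
  var current = root
  for (item <- transaction) {
    current.children.get(item) match {
      case Some(child) =>
        child.count += 1                      // common ancestor: bump its count
        current = child
      case None =>
        val node = new FPNode(item, 1, current)
        current.children(item) = node
        val entry = header(item)              // append to the node linked list
        if (entry.head == null) entry.head = node
        else {
          var last = entry.head
          while (last.nodeLink != null) last = last.nodeLink
          last.nodeLink = node
        }
        current = node
    }
  }
}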

2. Mining the FP-Tree

The following explains how to mine frequent itemsets from the FP-Tree. Based on the FP-Tree, the item header table, and the node linked lists, mining starts from the item at the bottom of the header table and works upward. For each item of the header table and its nodes in the FP-Tree, the conditional pattern base must be found.

The conditional pattern base is the FP sub-tree obtained by treating the node to be mined as a leaf node. Given this FP sub-tree, the count of every node in the sub-tree is set to the count of the leaf node, and nodes whose count falls below the support threshold are deleted. From this conditional pattern base, the frequent itemsets can then be obtained by recursive mining.
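A sketch of collecting the conditional pattern base for one header item, again using the illustrative structures from above; each prefix path is returned together with the count of the item's node at its end.

import scala.collection.mutable

// Walk the node linked list of one item; for every node, climb to the root and
// record the prefix path, weighted by that node's count. Nodes whose total
// weight ends up below the support threshold are then dropped before recursing.
def conditionalPatternBase(entry: HeaderEntry): List[(List[String], Int)] = {
  val paths = mutable.ListBuffer.empty[(List[String], Int)]
  var node = entry.head
  while (node != null) {
    var prefix = List.empty[String]
    var parent = node.parent
    while (parent != null && parent.item != null) {  // the root carries a null item
      prefix = parent.item :: prefix
      parent = parent.parent
    }
    if (prefix.nonEmpty) paths += ((prefix, node.count))
    node = node.nodeLink
  }
  paths.toList
}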

The example above is used to illustrate this. Start from the bottommost item F and look for F's conditional pattern base. Since F has only one node in the FP-Tree, there is only one candidate path, shown on the left of Figure 5 and corresponding to {A:8, C:8, E:6, B:2, F:2}. The counts of all ancestor nodes are then set to the count of the leaf node, so the FP sub-tree becomes {A:2, C:2, E:2, B:2, F:2}.

The leaf node itself may be omitted from the conditional pattern base, so F's final conditional pattern base is the one shown on the right of Figure 5.

From this conditional pattern base it is easy to obtain F's frequent 2-itemsets: {A:2, F:2}, {C:2, F:2}, {E:2, F:2}, {B:2, F:2}. Recursively merging the 2-itemsets yields the frequent 3-itemsets {A:2, C:2, F:2}, {A:2, E:2, F:2}, {A:2, B:2, F:2}, {C:2, E:2, F:2}, {C:2, B:2, F:2}, {E:2, B:2, F:2}. Recursively merging the 3-itemsets yields the frequent 4-itemsets {A:2, C:2, E:2, F:2}, {A:2, C:2, B:2, F:2}, {C:2, E:2, B:2, F:2}. Continuing the recursion, the largest frequent itemset is the frequent 5-itemset {A:2, C:2, E:2, B:2, F:2}.

Figure 5  Mining the FP-Tree (1)

After the F node has been mined, the D node can be mined. The D node is a little more complicated than F because it has two leaf nodes, so the FP sub-tree obtained first is the one shown on the left of Figure 6.

The counts of all ancestor nodes are then set to the counts of the leaf nodes, giving {A:2, C:2, E:1, G:1, D:1, D:1}. The E and G nodes are now deleted because their support within the conditional pattern base is below the threshold. Finally, after removing the low-support nodes and the leaf nodes, D's conditional pattern base is {A:2, C:2}. From it, D's frequent 2-itemsets are easily obtained: {A:2, D:2}, {C:2, D:2}. Recursively merging the 2-itemsets yields the frequent 3-itemset {A:2, C:2, D:2}. The largest frequent itemset for the D node is therefore this frequent 3-itemset.

In the same way, the largest frequent itemset mined recursively for B is the frequent 4-itemset {A:2, C:2, E:2, B:2}. Continuing, the largest frequent itemset for G is the frequent 4-itemset {A:5, C:5, E:4, G:4}, for E the frequent 3-itemset {A:6, C:6, E:6}, and for C the frequent 2-itemset {A:8, C:8}. Since A's conditional pattern base is empty, A does not need to be mined.

Figure 6  Mining the FP-Tree (2)

At this point all the frequent itemsets have been obtained. If only the largest frequent k-itemset is required, the analysis above shows that the largest frequent itemset is the 5-itemset {A:2, C:2, E:2, B:2, F:2}.

3. An FP-Growth example with Spark MLlib

The class FPGrowth, which implements the FP-Growth algorithm in Spark MLlib, has the following parameters.

class FPGrowth private (
    private var minSupport: Double,
    private var numPartitions: Int) extends Logging with Serializable

The meanings of the variables are as follows.

  • minSupport is the support threshold for frequent itemsets; the default value is 0.3.
  • numPartitions is the number of data partitions, that is, the degree of parallelism of the computation.

First, call the FPGrowth.run method to build the FP-Growth tree, which stores the frequent-itemset information; the method returns an FPGrowthModel. Then call FPGrowthModel.generateAssociationRules to generate the association rules whose confidence is above the threshold, together with the confidence of each rule.

Example: load a training data set and use the FP-Growth algorithm to mine association rules. The data used in this example are stored in the file fpg.data, which provides 6 sample transactions. The sample data are shown below.

r z h k p
z y x w v u t s
s x o n r
x z y m t s q e
z
x z y r q t p

Each line of the data file is one transaction and contains the codes of all items bought in that transaction; each letter represents one item, and the letters are separated by spaces.

The implementation code is shown below.

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.{SparkConf, SparkContext}

object FP_GrowthTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FPGrowthTest").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // Set the parameters
    val minSupport = 0.2    // minimum support
    val minConfidence = 0.8 // minimum confidence
    val numPartitions = 2   // number of data partitions
    // Load the data
    val data = sc.textFile("data/mllib/fpg.data")
    // Split each line on spaces
    val transactions = data.map(x => x.split(" "))
    transactions.cache()
    // Create an FPGrowth algorithm instance
    val fpg = new FPGrowth()
    fpg.setMinSupport(minSupport)
    fpg.setNumPartitions(numPartitions)

    // Build the model from the sample data
    val model = fpg.run(transactions)
    // Show all frequent itemsets together with their occurrence counts
    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ", ", "]") + ", " + itemset.freq)
    }

    // Filter the recommendation rules by confidence
    // antecedent is the left-hand side, consequent the right-hand side,
    // and confidence is the confidence of the rule
    model.generateAssociationRules(minConfidence).collect().foreach { rule =>
      println(rule.antecedent.mkString(", ") + " --> " +
        rule.consequent.mkString(", ") + " --> " + rule.confidence)
    }

    // Show the number of generated rules
    println(model.generateAssociationRules(minConfidence).collect().length)
  }
}

Running the program prints the frequent itemsets and the association rules.

Some of the frequent itemsets are shown below.

[t], 3
[t, x], 3
[t, x, z], 3
[t, z], 3
[s], 3
[s, t], 2
[s, t, x], 2
[s, t, x, z], 2
[s, t, z], 2
[s, x], 2
[s, x, z], 2

Some of the association rules are shown below.

s, t, x --> z --> 1.0
s, t, x --> y --> 1.0
q, x --> t --> 1.0
q, x --> y --> 1.0
q, x --> z --> 1.0
q, y, z --> t --> 1.0
q, y, z --> x --> 1.0
t, x, z --> y --> 1.0
q, x, z --> t --> 1.0
q, x, z --> y --> 1.0


Origin blog.csdn.net/yuyuy0145/article/details/92430160