Data Mining Algorithm: Association Analysis II (FP-tree Algorithm)

3. FP-tree algorithm

  The following introduces the FP-tree algorithm (FP-growth), which finds frequent itemsets in a completely different way from Apriori. Unlike Apriori, the FP-tree algorithm generates no candidate sets; instead, it compresses the transactions into a compact tree structure (the FP-tree) and then extracts frequent itemsets directly from that structure. The process of the FP-tree algorithm is:

First, the support of each item in the transaction database is calculated, infrequent items are discarded, and the remaining items are sorted in descending order of support. The items within each transaction are then reordered according to this sorted order.
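This first step can be sketched in Python as follows (an illustrative helper; the name `order_transactions` and its signature are my own, not from the book):

```python
from collections import Counter

def order_transactions(transactions, min_support):
    """Count each item's support, drop infrequent items, and reorder
    every transaction by descending support (ties broken alphabetically)."""
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    ordered = [sorted((i for i in t if i in frequent),
                      key=lambda i: (-frequent[i], i))
               for t in transactions]
    return frequent, ordered
```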

Each transaction is then inserted into a tree rooted at Null, following the new item order, and a support count is recorded on every node. When this pass over the database is complete, we have the FP-tree structure.
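Inserting one reordered transaction is a walk down from the root, creating missing child nodes and incrementing counts along the way. A minimal sketch (`FPNode` and `insert_transaction` are illustrative names of my own):

```python
class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent   # parent pointer, used later for prefix paths
        self.children = {}     # item -> child FPNode

def insert_transaction(root, ordered_items):
    """Follow/create the path for one transaction, bumping each node's count."""
    node = root
    for item in ordered_items:
        if item not in node.children:
            node.children[item] = FPNode(item, node)
        node = node.children[item]
        node.count += 1
```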

For the constructed FP-tree, work from the bottom of the tree upward: for each item, collect its prefix paths and convert them into that item's conditional FP-tree.

From each conditional FP-tree, enumerate all frequent itemsets ending in that item.
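The four steps above can be sketched end to end in Python. This is a minimal illustrative implementation, not the textbook's pseudocode: the names (`Node`, `build_tree`, `fp_growth`) are my own, and the transaction list is the nine-transaction example from Han's book used below.

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_tree(transactions, min_support):
    """Pass 1: count supports. Pass 2: insert reordered transactions."""
    counts = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in counts.items() if c >= min_support}
    root, links = Node(None, None), defaultdict(list)  # links: item -> its nodes
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                links[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, links, freq

def fp_growth(transactions, min_support, suffix=()):
    """Mine frequent itemsets by recursing on conditional pattern bases."""
    _, links, freq = build_tree(transactions, min_support)
    patterns = {}
    for item, support in freq.items():
        patterns[tuple(sorted(suffix + (item,)))] = support
        # Conditional pattern base: each node's prefix path, repeated count times.
        base = []
        for node in links[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            base.extend([list(reversed(path))] * node.count)
        patterns.update(fp_growth(base, min_support, suffix + (item,)))
    return patterns

transactions = [  # the nine transactions T100..T900 from the example
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]
patterns = fp_growth(transactions, min_support=2)
```

With minimum support 2, `patterns` maps each frequent itemset to its support; for example `patterns[("I1", "I2", "I5")]` is 2.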

This description of the FP-tree algorithm is fairly abstract, so let's walk through the following example to see concretely how it finds frequent itemsets.

(Source: Data Mining: Concepts and Techniques, Jiawei Han)

First, the support is calculated for all items in the transactions, and the items are sorted in descending order of support, as shown in the green table in the figure below. Then the items in each transaction are rearranged according to this order. For example, transaction T100 was originally the unordered set I1, I2, I5; because I2 has higher support than I1, the reordered transaction is I2, I1, I5. The reordered transactions are shown in the third column of the table below.

Rescan the transaction database and insert each transaction into the tree rooted at NULL, in the reordered item order. For transaction T100, the three nodes I2, I1, and I5 are created in sequence, forming the path NULL→I2→I1→I5, and the count of every node on the path is set to 1. For transaction T200, node I2 already exists in the FP-tree, so the path NULL→I2→I4 is formed by creating only the new node I4; the count on node I2 is incremented to 2, and the count on node I4 is set to 1. Continuing in the same way over all transactions in the database yields the tree structure shown in the figure below.

For the constructed FP-tree, build the conditional FP-tree for each item in turn, starting from the bottom of the tree. First, locate node I5 in the figure above: there are two paths that reach I5, {I2, I1, I5: 1} and {I2, I1, I3, I5: 1}.
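Finding a prefix path is just a walk up the parent pointers. The sketch below hand-builds the branch Null→I2→I1→I5 from the figure (with counts 7, 4, 1 on that branch) and recovers {I2, I1} as I5's prefix; the `Node` class and `prefix_path` helper are illustrative, not from the source:

```python
class Node:
    def __init__(self, item, parent, count=0):
        self.item, self.parent, self.count = item, parent, count

root = Node(None, None)
i2 = Node("I2", root, 7)
i1 = Node("I1", i2, 4)
i5 = Node("I5", i1, 1)   # the I5 node on the path Null -> I2 -> I1 -> I5

def prefix_path(node):
    """Items on the path from the root down to the node's parent."""
    path, p = [], node.parent
    while p.item is not None:
        path.append(p.item)
        p = p.parent
    return list(reversed(path))
```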

The conditional FP-tree for I5 built from these two paths is shown in the figure below. I3 is discarded because its count in this base is only 1, which does not satisfy the minimum support for frequent itemsets. Then combine the prefix {I2, I1: 2} of I5 with the suffix I5 in every possible way, finally yielding the three frequent itemsets {I2, I5}, {I1, I5}, and {I2, I1, I5}.
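Since the conditional FP-tree here is a single path, the frequent itemsets ending in I5 are simply all non-empty subsets of the prefix with I5 appended, which can be enumerated with `itertools.combinations`:

```python
from itertools import combinations

prefix = ["I2", "I1"]  # I5's conditional FP-tree: the single path {I2, I1: 2}
patterns = [tuple(c) + ("I5",)
            for r in range(1, len(prefix) + 1)
            for c in combinations(prefix, r)]
# -> [('I2', 'I5'), ('I1', 'I5'), ('I2', 'I1', 'I5')], each with support 2
```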

Performing the above steps for every item yields all the frequent itemsets.

(Reference: https://www.cnblogs.com/zhengxingpeng/p/6679280.html)

Advantages and disadvantages evaluation:

Compared with the Apriori algorithm, the FP-tree algorithm significantly reduces both time and space cost: the database is scanned only twice and no candidate sets are generated. For massive data sets, however, the time and space requirements can still be very high, because the FP-tree itself may not fit in memory; techniques such as database partitioning are then required.
