Machine Learning Algorithms - Association Analysis Using the Apriori Algorithm

1. What is Association Analysis?

Association analysis is the task of finding interesting relationships in large-scale datasets. These relationships can take two forms: frequent itemsets and association rules.

Frequent itemset: a collection of items that frequently appear together in the data

Association rule: suggests that there may be a strong relationship between two items

2. Apriori theory

The general process of the algorithm:

  • Collect data: use any method
  • Prepare data: any data type will work, since we only store sets
  • Analyze data: use any method
  • Train the algorithm: use the Apriori algorithm to find frequent itemsets
  • Test the algorithm: no testing process is required
  • Use the algorithm: discover frequent itemsets and the association rules between items

  Support: the percentage of records in the dataset that contain a given itemset. The support of a set refers to the proportion of transaction records that contain the set.
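As a quick illustration, here is a minimal Python sketch of computing support. The transactions and the itemset below are made-up examples, not data from this article:

```python
# Made-up toy transactions for illustration only
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

print(support({2, 5}, transactions))  # 0.75 -- {2, 5} appears in 3 of the 4 transactions
```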

  Confidence (credibility): defined for an association rule such as A → B. The confidence of the rule is support(A ∪ B) / support(A), that is, the proportion of the transactions containing A that also contain B.

Using the Apriori algorithm, we first compute the support of each single-item set and keep the items whose support is greater than the threshold we require, for example 0.5 or 0.7. We then form larger combinations of the remaining items; whenever a combination's support is greater than the threshold, it is added to our frequent itemsets, and this is repeated level by level. Finally, association rules are generated from the frequent itemsets that were selected according to the computed support.
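To make the rule-related step concrete, here is a hedged sketch of computing the confidence of a rule A → B from support values. The `confidence` helper, the toy transactions, and the example rule {2} → {5} are illustrative assumptions, not code from the article:

```python
def confidence(antecedent, consequent, transactions):
    """confidence(A -> B) = support(A ∪ B) / support(A)"""
    def support(itemset):
        # Fraction of transactions containing every item of the itemset
        return sum(1 for t in transactions if itemset <= t) / len(transactions)
    return support(antecedent | consequent) / support(antecedent)

# Made-up toy data: how often does item 5 appear in transactions that contain item 2?
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(confidence({2}, {5}, transactions))  # 1.0 -- every transaction with 2 also contains 5
```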

Generally speaking, data with high support does not necessarily constitute a useful frequent itemset, but data whose support is too low certainly cannot constitute a frequent itemset.

3. Apriori principle:

Why the principle is needed: computing the support of every candidate itemset requires traversing all of the data, which leads to a very large amount of computation and is very time-consuming. Using the Apriori principle reduces the number of traversals and therefore the running time.

What the principle states: the Apriori principle says that if an itemset is frequent, then all of its subsets are also frequent.

That is to say, if {0,1} is frequent, then {0} and {1} must also be frequent. The contrapositive is what we actually use: if an itemset is infrequent, then all of its supersets are also infrequent. (This is very useful.)

For example, if we know that {2,3} is infrequent, then {0,1,2,3}, {1,2,3}, and {0,2,3} are also infrequent. That is to say, once the support of {2,3} has been computed and it is known to be infrequent, there is no need to compute the support of {0,1,2,3}, {1,2,3}, or {0,2,3}.
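A minimal Python sketch of this pruning idea, assuming a hypothetical `infrequent` list that already holds itemsets found to be below the minimum support (here just {2,3} from the example above):

```python
# Itemsets already found to be infrequent (assumption for illustration)
infrequent = [frozenset({2, 3})]

def can_skip(candidate, infrequent):
    """If any known-infrequent itemset is a subset of the candidate,
    the candidate cannot be frequent, so its support need not be counted."""
    return any(bad <= candidate for bad in infrequent)

for cand in [frozenset({0, 1, 2, 3}), frozenset({1, 2, 3}),
             frozenset({0, 2, 3}), frozenset({0, 1})]:
    print(sorted(cand), "skip" if can_skip(cand, infrequent) else "must count support")
```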

4. Use the Apriori algorithm to find frequent itemsets

Association analysis includes two tasks: discovering frequent itemsets and discovering association rules. We need to find the frequent itemsets before we can derive association rules, so let's first talk about how to find frequent itemsets.

Apriori algorithm: the inputs to this algorithm are the minimum support and the dataset. The algorithm first generates a list of candidate itemsets containing all single items. It then scans the transaction records to see which itemsets meet the minimum support requirement; sets that do not meet it are removed. The remaining sets are combined to produce candidate itemsets with two elements, and the transaction records are scanned again to remove itemsets that do not satisfy the minimum support. This process is repeated, growing the itemsets by one element each time, until no candidate itemsets remain.
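Below is a compact Python sketch of this level-wise search, using the same made-up toy transactions as the earlier snippets. It illustrates the procedure described above under those assumptions; it is not an optimized or canonical implementation:

```python
def apriori(transactions, min_support=0.5):
    """Level-wise search for frequent itemsets (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Candidates of size 1: every individual item
    current = {frozenset([item]) for t in transactions for item in t}
    frequent, support_data = [], {}
    k = 1
    while current:
        # Scan the transactions to count the support of each candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c for c, cnt in counts.items() if cnt / n >= min_support}
        support_data.update({c: counts[c] / n for c in level})
        frequent.extend(level)
        # Join the surviving k-itemsets to build candidates of size k + 1
        k += 1
        current = {a | b for a in level for b in level if len(a | b) == k}
    return frequent, support_data

freq, supp = apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], min_support=0.5)
print([sorted(s) for s in sorted(freq, key=len)])
print({tuple(sorted(s)): v for s, v in supp.items()})
```

Note that this sketch does not apply the pruning step from section 3 when generating candidates; adding it would further reduce the number of support counts that have to be made.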

5. Advantages and disadvantages of the Apriori algorithm

* Advantages: easy to code and implement
* Disadvantages: can be slow on large datasets
* Applicable data types: numeric or nominal data

Summary: although we use the Apriori principle to reduce the number of sets that must be checked against the database, which improves speed, the Apriori algorithm still rescans the entire dataset every time the size of the frequent itemsets is increased. When the dataset is large, this slows down the discovery of frequent itemsets. If needed, take a look at the FP-growth algorithm, which traverses the database only twice and can significantly speed up the discovery of frequent itemsets.



