Top 10 Classical Algorithms of Data Mining

The IEEE International Conference on Data Mining (ICDM), a leading international academic conference in the field, selected ten classic algorithms of data mining in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.

Beyond the ten winners, any of the 18 algorithms nominated for the selection could fairly be called a classic; all of them have had a profound impact on the field of data mining.

 

1.  C4.5

C4.5 is a decision-tree classification algorithm from machine learning whose core is the ID3 algorithm. C4.5 inherits the advantages of ID3 and improves on it in the following ways:

1) It selects splitting attributes by information gain ratio, which overcomes information gain's bias toward attributes with many values;
2) it prunes during tree construction;
3) it can discretize continuous attributes;
4) it can handle incomplete data.

C4.5's advantages are that the classification rules it generates are easy to understand and highly accurate. Its disadvantage is that the data set must be scanned and sorted repeatedly while the tree is being built, which makes the algorithm inefficient.
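To make the gain-ratio criterion concrete, here is a minimal sketch of how C4.5 might score a candidate split: information gain divided by the split's own intrinsic information. The toy "windy"/play data below is invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature, labels):
    """Information gain of a split, divided by the split's intrinsic entropy.
    The division penalizes attributes that fragment the data into many values."""
    n = len(labels)
    subsets = {}
    for f, y in zip(feature, labels):
        subsets.setdefault(f, []).append(y)
    cond = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - cond
    split_info = entropy(feature)  # intrinsic information of the split itself
    return gain / split_info if split_info > 0 else 0.0

# Toy data: a "windy" attribute vs. play/no-play labels
feature = ["yes", "yes", "no", "no"]
labels  = ["no",  "no",  "yes", "yes"]
print(gain_ratio(feature, labels))  # a perfect split -> gain ratio 1.0
```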

 

2. k-Means

The k-means algorithm is a clustering algorithm that partitions n objects into k clusters according to their attributes, with k < n. It is similar to the expectation-maximization algorithm for mixtures of normal distributions in that both try to find the centers of natural clusters in the data. It assumes that object attributes form a vector space, and its goal is to minimize the sum of squared errors within each cluster.
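The alternation between assigning points and recomputing centers (Lloyd's algorithm) can be sketched as follows; the two point clouds and the seed are invented for the example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive initialization from the data
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.1, 8.8)]
print(sorted(kmeans(pts, 2)))  # one centroid near each cloud
```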

 

3. Support vector machines

The support vector machine (SVM, sometimes abbreviated SV machine in papers) is a supervised learning method widely used in statistical classification and regression analysis. SVMs map vectors into a higher-dimensional space and construct a maximum-margin hyperplane there. Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen to maximize the distance between these two parallel hyperplanes; the assumption is that the larger this distance, or margin, the smaller the classifier's total error. An excellent guide is C. J. C. Burges' "A Tutorial on Support Vector Machines for Pattern Recognition". Van der Walt and Barnard compared support vector machines with other classifiers.
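The geometric picture above can be illustrated numerically. The sketch below does not train an SVM; it only evaluates the margin and the decision rule for a hand-chosen separating hyperplane (the points, weights, and bias are all invented for this example).

```python
def margin(w, b, points):
    """Geometric margin: smallest distance from any point to the hyperplane
    w . x + b = 0. An SVM picks w, b to maximize this quantity."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return min(abs(sum(wi * xi for wi, xi in zip(w, p)) + b) for p in points) / norm

def predict(w, b, x):
    """SVM decision rule: which side of the separating hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hand-chosen maximum-margin hyperplane x1 + x2 - 4 = 0, equidistant from
# the closest points of the two classes (the "support vectors" (3,3), (1,1)).
pos = [(3, 3), (4, 3)]
neg = [(1, 1), (0, 0)]
w, b = (1.0, 1.0), -4.0
print(margin(w, b, pos + neg))  # distance from the support vectors: sqrt(2)
print([predict(w, b, p) for p in pos + neg])
```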

 

4. The Apriori algorithm

The Apriori algorithm is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. Its core is a recursive algorithm based on the two-stage frequent-itemset idea. The rules it produces are classified as single-dimensional, single-level, Boolean association rules. Here, all itemsets whose support is at least the minimum support are called frequent itemsets (frequency sets for short).
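A minimal sketch of the level-wise idea follows: candidates of size k+1 are generated only from frequent itemsets of size k, since any superset of an infrequent itemset must itself be infrequent. The tiny market-basket transactions are invented for illustration.

```python
def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining (join step only, no rule generation)."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in items if support(s) >= min_support}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates,
        # then keep only the candidates that meet minimum support.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

txns = [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk"}, {"bread"}]
freq = apriori(txns, min_support=0.5)
print(sorted(tuple(sorted(s)) for s in freq))  # beer (support 0.25) is pruned
```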

 

5. Expectation-Maximization (EM) Algorithm

In statistical computing, the Expectation-Maximization (EM) algorithm finds maximum likelihood estimates of parameters in probabilistic models that depend on unobservable latent variables. EM is often used for data clustering in machine learning and computer vision.
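The E-step/M-step alternation can be sketched for the simplest interesting case: a mixture of two 1-D Gaussians with known, equal variance and equal mixing weights (both simplifying assumptions made here, not part of the general algorithm; the data and starting means are invented).

```python
import math

def em_two_gaussians(data, mu1, mu2, iters=50, sigma=1.0):
    """EM for a two-component 1-D Gaussian mixture, estimating only the means.
    E-step: soft-assign each point to the components.
    M-step: re-estimate each mean as the responsibility-weighted average."""
    def pdf(x, mu):
        # Unnormalized Gaussian density; the constant cancels in the ratio.
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

    for _ in range(iters):
        # E-step: responsibility of component 1 for each data point.
        r = [pdf(x, mu1) / (pdf(x, mu1) + pdf(x, mu2)) for x in data]
        # M-step: weighted means maximize the expected complete-data likelihood.
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2

data = [0.0, 0.2, -0.1, 5.0, 5.2, 4.9]
print(em_two_gaussians(data, mu1=0.5, mu2=4.0))  # converges near the two cluster means
```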

 

6. PageRank

PageRank is an important part of Google's ranking algorithm. In September 2001 it was granted a US patent, with Google co-founder Larry Page as its inventor. The "Page" in PageRank therefore does not refer to a web page but to Page himself: the ranking method is named after him.

PageRank measures a website's value based on both the quantity and the quality of its external and internal links. The concept behind PageRank is that every link to a page is a vote for that page, and more links mean more votes from other sites. This is called "link popularity": a measure of how many people are willing to link their website to yours. The concept derives from academic citation counts: the more often a paper is cited by others, the more authoritative it is generally judged to be.
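The "votes" intuition corresponds to a simple iterative computation, a sketch of which follows: each page repeatedly passes a damped share of its rank along its out-links. The three-page web is hypothetical.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration on the random-surfer model: a page's rank is the
    damped sum of the ranks of pages linking to it, each split across
    that page's out-links, plus a uniform teleportation term."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical 3-page web: A and C link to B, B links to C.
links = {"A": ["B"], "B": ["C"], "C": ["B"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # B collects the most link "votes"
```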

 

7. AdaBoost

AdaBoost is an iterative algorithm whose core idea is to train a sequence of different classifiers (weak classifiers) on the same training set and then combine them into a stronger final classifier (strong classifier). The algorithm works by changing the data distribution: it sets each sample's weight according to whether that sample was classified correctly in the previous round and according to the accuracy of the previous overall classification. The reweighted data set is passed to the next classifier for training, and the classifiers obtained from all the rounds are finally fused into the final decision classifier.
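The reweighting loop can be sketched with the simplest possible weak learner, a 1-D threshold "stump"; the separable toy data is invented, and only the standard weight-update formula is used.

```python
import math

def adaboost(xs, ys, rounds=5):
    """AdaBoost with 1-D threshold stumps: each round fits the stump that
    minimizes the weighted error, then up-weights the samples it got wrong."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for t in xs:  # try each data value as a threshold
            for pol in (1, -1):
                preds = [pol if x >= t else -pol for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, pol, preds)
        err, t, pol, preds = best
        err = max(err, 1e-10)  # guard against division by zero on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak classifier
        ensemble.append((alpha, t, pol))
        # Reweight: misclassified samples get more weight for the next round.
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Final strong classifier: sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (pol if x >= t else -pol) for a, t, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
print([predict(model, x) for x in xs])  # recovers the labels on this easy data
```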

 

8. kNN: k-nearest neighbor classification

The K-Nearest Neighbor (kNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. Its idea is: if most of the k samples most similar to a given sample in the feature space (that is, its nearest neighbors) belong to a certain category, then the sample belongs to that category as well.
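The whole method fits in a few lines: find the k closest training points and take a majority vote. The labeled points below are invented for the example.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify a query point as the majority label among its k nearest
    training points (squared Euclidean distance; no need for the sqrt)."""
    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (1, 1)))  # all 3 nearest neighbors are labeled "a"
```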

 

9. Naive Bayes

Among the many classification models, the two most widely used are the decision tree model and the naive Bayes model (NBC). The naive Bayes model originates in classical mathematical theory and has a solid mathematical foundation and stable classification efficiency. At the same time, the NBC model requires the estimation of very few parameters, is not very sensitive to missing data, and is relatively simple as an algorithm. In theory, the NBC model has the smallest error rate of any classification method. In practice this is not always the case, because the NBC model assumes that attributes are mutually independent, an assumption that often fails in real applications; this affects the model's classification accuracy. When the number of attributes is large or the attributes are strongly correlated, the NBC model's classification performance falls short of the decision tree model's; when attribute correlation is small, the NBC model performs best.
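The independence assumption means the model only needs per-feature counts, as the sketch below shows for categorical features with add-one smoothing (the weather data and the particular smoothing denominator are choices made here for illustration).

```python
from collections import Counter, defaultdict

def train_nb(samples):
    """Fit class priors and per-feature conditional value counts."""
    priors = Counter(y for _, y in samples)
    cond = defaultdict(Counter)  # (feature_index, class) -> value counts
    for x, y in samples:
        for i, v in enumerate(x):
            cond[(i, y)][v] += 1
    return priors, cond, len(samples)

def predict_nb(model, x):
    """Pick the class maximizing P(class) * prod_i P(x_i | class), where the
    naive independence assumption lets the conditionals be multiplied."""
    priors, cond, n = model
    best, best_p = None, -1.0
    for y, cy in priors.items():
        p = cy / n  # class prior
        for i, v in enumerate(x):
            counts = cond[(i, y)]
            # Add-one smoothing so unseen values don't zero out the product.
            p *= (counts[v] + 1) / (cy + len(counts) + 1)
        if p > best_p:
            best, best_p = y, p
    return best

samples = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
           (("rain", "mild"), "yes"), (("rain", "cool"), "yes")]
model = train_nb(samples)
print(predict_nb(model, ("rain", "mild")))  # "rain" strongly indicates "yes"
```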

 

10. CART: Classification and Regression Trees

CART stands for Classification and Regression Trees. Two key ideas underlie classification trees: the first is recursively partitioning the space of the independent variables; the second is pruning the tree using validation data.
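One step of that recursive partitioning can be sketched as follows: scan the thresholds of a single numeric feature and pick the split that most reduces Gini impurity, the criterion CART classification trees commonly use (the data is invented, and a full tree would apply this recursively plus pruning).

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Scan every threshold on one numeric feature and return (score, threshold)
    for the split minimizing the weighted Gini impurity of the two children."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue  # skip degenerate splits with an empty side
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # threshold 10.0 separates the classes perfectly
```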

 

Source: http://blog.csdn.net/aladdina/
