18 Classic Data Mining Algorithms

I spent about two months learning and implementing 18 classic data mining algorithms, covering classification, clustering, link mining, association mining, pattern mining, and more. Together they can serve as a small introduction to the field of data mining. Below is a short summary of each algorithm, followed by a link to my own blog post on it, which I hope will help others learn.

1. C4.5 algorithm. Like ID3, C4.5 is a decision tree classification algorithm, and it is an improvement on ID3. ID3 chooses the splitting attribute by information gain, while C4.5 uses the gain ratio.

Detailed introduction link: http://blog.csdn.NET/androidlushangderen/article/details/42395865
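To make the difference between information gain and gain ratio concrete, here is a minimal sketch for one categorical attribute. The function names `entropy` and `gain_and_ratio` and the toy data are my own assumptions for illustration, not taken from the linked post.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_and_ratio(values, labels):
    """Return (information gain, gain ratio) for splitting on one attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution
    split_info = entropy(values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

labels = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rainy", "rainy"]  # two distinct values
row_id = [1, 2, 3, 4]                           # unique value per row
print(gain_and_ratio(outlook, labels))  # gain 1.0, ratio 1.0
print(gain_and_ratio(row_id, labels))   # gain 1.0, ratio 0.5
```

Both attributes reach the same information gain, but the gain ratio penalizes the many-valued `row_id` attribute, which is the bias of ID3 that C4.5 is meant to correct.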

2. CART algorithm. CART stands for Classification and Regression Tree. It builds a binary tree and uses the Gini index, a measure similar to entropy, to choose splits. Once the decision tree is built, it needs to be pruned; in my implementation I used cost-complexity pruning.

Detailed introduction link: http://blog.csdn.Net/androidlushangderen/article/details/42558235
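The Gini index used by CART can be sketched as follows. This is a minimal illustration for a numeric attribute with a binary threshold split; the names `gini` and `gini_of_binary_split` are assumptions of mine, and pruning is not shown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_of_binary_split(values, labels, threshold):
    """Weighted Gini impurity after splitting into value <= threshold and value > threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    total = len(labels)
    return sum(len(part) / total * gini(part) for part in (left, right) if part)

ages = [22, 25, 47, 52]
buys = ["no", "no", "yes", "yes"]
print(gini(buys))                            # impurity before splitting
print(gini_of_binary_split(ages, buys, 30))  # impurity after splitting at age 30
```

A tree builder would try each candidate threshold and keep the one with the lowest weighted impurity.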

3. KNN (K-Nearest Neighbors) algorithm. Given a set of labeled training data and a new test point, find the k training points nearest to the test point and see which class is in the majority among them; the test point is assigned that class. Different weights can also be assigned to the neighbors, so that nearer points carry more weight and farther points carry less.

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/42613011
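A minimal sketch of the voting step, with optional distance weighting. The function name `knn_predict`, the inverse-distance weighting scheme, and the toy points are assumptions for illustration; other weighting schemes are equally valid.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train, query, k=3, weighted=True):
    """train is a list of (point, label) pairs; returns the predicted label."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, query)
        # Inverse-distance weight (assumed scheme); a plain vote counts 1 each
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

train = [((0, 0), "a"), ((3, 0), "b"), ((3.1, 0), "b")]
print(knn_predict(train, (0.5, 0), k=3, weighted=False))  # majority vote
print(knn_predict(train, (0.5, 0), k=3, weighted=True))   # nearer point dominates
```

With these points the plain majority vote and the weighted vote disagree, which shows why weighting can matter when k is large relative to the data.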

4. Naive Bayes algorithm. Naive Bayes is one of the simpler classification algorithms in the Bayesian family. It relies on Bayes' theorem, which in one sentence is a rule for converting between conditional probabilities, that is, deriving P(A|B) from P(B|A).

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/42680161
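As a rough sketch of how Bayes' theorem turns into a classifier, the following code handles categorical features only. The names `train_nb` and `predict_nb`, the add-one (Laplace) smoothing, and the weather data are assumptions I introduce for illustration.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    distinct = defaultdict(set)          # feature index -> values seen in training
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(y, i)][v] += 1
            distinct[i].add(v)
    return class_counts, value_counts, distinct

def predict_nb(model, row):
    """Pick the class maximizing P(class) * product of P(feature value | class)."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    scores = {}
    for y, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, v in enumerate(row):
            # Add-one smoothing so unseen values do not zero out the product
            score *= (value_counts[(y, i)][v] + 1) / (n + len(distinct[i]))
        scores[y] = score
    return max(scores, key=scores.get)

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("rainy", "cool")))
```

The "naive" part is the product over features, which assumes the features are conditionally independent given the class.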

5. SVM (Support Vector Machine) algorithm. The support vector machine classifies both linear and nonlinear data. Nonlinear data is handled by using a kernel function to map it into a space where it can be treated as linear. A key step is finding the maximum-margin hyperplane.

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/42780439
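The role of the kernel function can be illustrated without a full SVM solver. The sketch below checks, for 2-D inputs, that a degree-2 polynomial kernel equals an ordinary dot product in an explicit 3-D feature space. The choice of kernel and the names `poly_kernel` and `phi` are my assumptions; the linked post may use a different kernel.

```python
from math import sqrt

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2, computed in the input space."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map for 2-D input matching poly_kernel."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, y))    # kernel value computed directly
print(dot(phi(x), phi(y)))  # same value via the explicit mapping
```

The SVM only ever needs dot products between points, so it can work in the mapped space by calling the kernel and never computing `phi` itself.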

6. EM (Expectation Maximization) algorithm. Each iteration of the expectation maximization algorithm consists of two steps: an E-step (expectation) and an M-step (maximization). It is an algorithmic framework that, iteration by iteration, approaches the maximum likelihood or maximum a posteriori estimate of the parameters of a statistical model.

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/42921789
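One common instance of the E-step and M-step is fitting a mixture of two 1-D Gaussians. The sketch below is a minimal version under several assumptions of mine: exactly two components, means initialized at the minimum and maximum of the data, a fixed iteration count, and a small floor on the variance.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; returns (means, variances, weights)."""
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate parameters from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsing component
            weights[k] = nk / len(data)
    return means, variances, weights

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_two_gaussians(data)
print(means, weights)
```

On this well-separated data the estimated means settle near the two cluster averages; on overlapping data the result depends more strongly on the initialization.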

7. Apriori algorithm. Apriori is an association rule mining algorithm. It mines frequent itemsets through join and prune operations and then derives association rules from the frequent itemsets. The derived rules must satisfy a minimum confidence threshold.

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43059211
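The join and prune steps can be sketched as below. This is a minimal, unoptimized version; the names `apriori` and `confidence`, the use of absolute support counts, and the shopping-basket data are assumptions for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count support of each candidate and keep the frequent ones
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    return frequent[a | c] / frequent[a]

baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]
frequent = apriori(baskets, min_support=3)
print(confidence(frequent, {"bread"}, {"milk"}))
```

A rule is kept only if its confidence reaches the minimum confidence threshold chosen by the user.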

8. FP-Tree (Frequent Pattern Tree) algorithm. This algorithm is also known as FP-growth. It overcomes Apriori's drawback of generating too many candidate sets: it recursively builds a frequent pattern tree and then mines the tree. Once the frequent itemsets are found, the remaining steps are the same as in Apriori.

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43234309
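The sketch below covers only the tree construction step, not the recursive mining. The class `FPNode`, the function `build_fp_tree`, and the tie-breaking rule (alphabetical order among equally frequent items) are assumptions of mine.

```python
from collections import Counter

class FPNode:
    """One node of the tree: an item, a count, and child nodes keyed by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Insert each transaction's frequent items, most frequent first, into a prefix tree."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, n in counts.items() if n >= min_support}
    root = FPNode(None)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1  # shared prefixes are stored once and counted
    return root

transactions = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["a", "b", "d"]]
root = build_fp_tree(transactions, min_support=2)
print({item: child.count for item, child in root.children.items()})
```

Because transactions that share a prefix share nodes, the tree is usually much smaller than the raw data, and no candidate itemsets need to be generated.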

9. PageRank (page importance ranking) algorithm. PageRank originated at Google. Its core idea is to use the number of incoming links to a web page as a measure of how important the page is. If a page contains several outgoing links, its PR value is divided equally among them. PageRank is also vulnerable to link spam attacks.

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43311943
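A minimal sketch of the iterative computation, assuming a damping factor of 0.85, a fixed number of iterations, and that every link target also appears as a key of `links`. These choices and the function name `pagerank` are my assumptions, not details from the linked post.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to; returns {page: rank}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # Page with no outgoing links: spread its rank over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                # Rank is divided equally among the outgoing links
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))
```

Page C is linked from both A and B, so it ends up with a higher rank than B, which only receives half of A's rank.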

10. HITS algorithm. HITS is another link analysis algorithm, and some of its principles are similar to PageRank. It introduces the concepts of authority value and hub value. HITS is affected by the user's query, is generally used for link analysis on small data sets, and is more vulnerable to attack.

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43311943
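The mutual update of authority and hub values can be sketched like this. The function name `hits`, the fixed iteration count, and the Euclidean normalization are assumptions for illustration; the query-dependent selection of the page set is not shown.

```python
from math import sqrt

def hits(links, iterations=50):
    """links maps each page to the pages it links to; returns (authority, hub) dicts."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub values of the pages linking to it
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub value of a page: sum of authority values of the pages it links to
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

auth, hub = hits({"A": ["C"], "B": ["C"], "C": []})
print(auth)
print(hub)
```

In this tiny graph C is the only authority, while A and B act as hubs because they point to it.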

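11. K-Means algorithm. K-Means is a clustering algorithm. Here k is the number of clusters, so choosing it at the start is critical. The algorithm first assumes k initial cluster centers, assigns each point to the nearest center by Euclidean distance, then takes the mean of each cluster as its new center, and repeats until convergence.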

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43373159
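The assign-then-average loop can be sketched as follows. The function name `kmeans`, passing the initial centers explicitly, and keeping an empty cluster's old center are assumptions I make to keep the example deterministic.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, centers, max_iterations=100):
    """Return (final centers, clusters) starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: the mean of each cluster becomes its new center
        new_centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
```

The result depends on the initial centers, so in practice the algorithm is often run several times with different starting points.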

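12. BIRCH algorithm. The core of BIRCH is building a CF (clustering feature) tree. BIRCH scans the database and builds an initial CF-tree in memory, which can be viewed as a multi-level compression of the data.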

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43532111
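The compression comes from the clustering feature itself: a cluster is summarized by the point count, the linear sum, and the squared sum of its points, and two summaries can be merged by simple addition. The sketch below shows only this summary, not the tree; the class name `ClusteringFeature` and its methods are my own assumptions.

```python
from math import sqrt

class ClusteringFeature:
    """CF triple (n, linear sum, squared sum) summarizing a set of points."""
    def __init__(self, dims):
        self.n = 0
        self.ls = [0.0] * dims
        self.ss = 0.0

    def add(self, point):
        """Absorb one point into the summary."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)

    def merge(self, other):
        """Absorb another summary; CF triples are additive."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        """Root mean squared distance of the summarized points from the centroid."""
        c = self.centroid()
        return sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

cf = ClusteringFeature(2)
cf.add((0.0, 0.0))
cf.add((2.0, 0.0))
print(cf.centroid(), cf.radius())
```

Because the centroid and radius can be computed from the triple alone, the original points do not need to be kept in memory.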

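13. AdaBoost algorithm. AdaBoost is a boosting algorithm. It trains on the data multiple times to obtain several complementary classifiers and then combines them into a more accurate classifier.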

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43635115
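What makes the classifiers complementary is the reweighting of the samples between rounds. Here is a minimal sketch of one round, assuming labels and predictions in {-1, +1} and a weighted error strictly between 0 and 1; the function name `adaboost_round` is hypothetical.

```python
from math import exp, log

def adaboost_round(weights, labels, predictions):
    """Return (alpha, new normalized sample weights) after one weak classifier."""
    # Weighted error: total weight of the misclassified samples
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Classifier weight: larger when the error is smaller (assumes 0 < error < 1)
    alpha = 0.5 * log((1 - error) / error)
    # Misclassified samples (y * h == -1) gain weight, the others lose weight
    updated = [w * exp(-alpha * y * h) for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]  # the last sample is misclassified
alpha, new_weights = adaboost_round(weights, labels, predictions)
print(alpha, new_weights)
```

The next weak classifier is trained on the new weights, so it concentrates on the samples the previous one got wrong. The final classifier takes the sign of the alpha-weighted sum of all weak predictions.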

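14. GSP algorithm. GSP is a sequential pattern mining algorithm. It is also an Apriori-like algorithm that performs join and prune operations, but the pruning check additionally applies conditions such as time constraints.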

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43699083

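15. PrefixSpan algorithm. PrefixSpan is another sequential pattern mining algorithm. It generates no candidate sets: starting from an initial prefix pattern, it keeps moving elements from the suffix pattern into the prefix pattern and mines recursively.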

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43766253
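The prefix-growth recursion can be sketched for a simplified case in which every sequence element is a single item rather than an itemset. That simplification, the function name `prefixspan`, and the toy sequences are assumptions for illustration.

```python
def prefixspan(sequences, min_support, prefix=()):
    """Return a list of (pattern, support) for sequences of single items."""
    # Count in how many sequences each item occurs
    counts = {}
    for seq in sequences:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    results = []
    for item, support in sorted(counts.items()):
        if support < min_support:
            continue
        pattern = prefix + (item,)
        results.append((pattern, support))
        # Projected database: the suffix after the first occurrence of the item
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        # Grow the prefix by mining the suffixes recursively
        results.extend(prefixspan(projected, min_support, pattern))
    return results

sequences = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
for pattern, support in prefixspan(sequences, min_support=2):
    print(pattern, support)
```

Each recursive call only looks at the suffixes that follow the current prefix, so the search space shrinks at every level and no candidate patterns are generated and tested.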

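16. CBA (Classification Based on Association rules) algorithm. CBA is an integrated mining algorithm because it is built on top of association rule mining: it makes classification decisions based on existing association rule theory, and only preprocesses the data at the start of the algorithm to turn it into a transaction-like form.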

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43818787

17. RoughSets (rough set) algorithm. Rough set theory is a relatively new idea in data mining. What I used here is attribute reduction based on rough sets: redundant attributes are removed by examining the lower and upper approximation sets, and rules are then output.

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43876001
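The lower and upper approximations can be sketched as follows. Objects that share the same values on the chosen attributes are indistinguishable and form one equivalence class. The function name `approximations` and the toy table are assumptions of mine, and the full attribute reduction procedure is not shown.

```python
from collections import defaultdict

def approximations(objects, attributes, target):
    """objects: {name: {attribute: value}}; target: set of names. Returns (lower, upper)."""
    # Group objects that are indistinguishable on the chosen attributes
    classes = defaultdict(set)
    for name, row in objects.items():
        classes[tuple(row[a] for a in attributes)].add(name)
    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:   # class lies entirely inside the target set
            lower |= members
        if members & target:    # class overlaps the target set
            upper |= members
    return lower, upper

objects = {
    "o1": {"color": "red", "size": "big"},
    "o2": {"color": "red", "size": "big"},
    "o3": {"color": "blue", "size": "big"},
    "o4": {"color": "blue", "size": "small"},
}
target = {"o1", "o3"}
print(approximations(objects, ["color", "size"], target))
print(approximations(objects, ["color"], target))
```

Dropping `size` changes the approximations of the target set, so `size` is not redundant here; an attribute whose removal leaves the approximations unchanged is a candidate for reduction.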

18. gSpan algorithm. gSpan belongs to the field of graph mining and is mainly used for frequent subgraph mining. Subgraph mining is a prerequisite or building block for other graph algorithms. gSpan uses concepts such as DFS codes, edge five-tuples, and subgraph extension along the rightmost path. The algorithm is fairly abstract and complex.

Detailed introduction link: http://blog.csdn.net/androidlushangderen/article/details/43924273

 
