Data Mining (b) - Classic Algorithms

Classic data mining algorithms

This post first describes the basics of each algorithm. Later posts will cover every algorithm individually and in detail, with derivations and code.

C4.5

C4.5 is a decision tree algorithm for classification in machine learning, and its core is the ID3 algorithm. C4.5 inherits the advantages of ID3 and improves on it in the following areas:

1. it selects attributes by information gain ratio, which overcomes the bias of plain information gain toward attributes with many values;
2. it prunes the tree during construction;
3. it can discretize continuous attributes;
4. it can handle incomplete data.
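
The first improvement above can be illustrated with a minimal sketch of the gain ratio for a single nominal attribute. The function names (`entropy`, `gain_ratio`) and the toy data are assumptions made up for this illustration; they are not taken from any C4.5 implementation, and pruning, continuous attributes, and missing values are left out.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a list of labels or attribute values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain ratio of one nominal attribute.

    values[i] is the attribute value of sample i, labels[i] is its class.
    """
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Information gain: entropy before the split minus weighted entropy after it.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: a unique ID per row versus a two-valued attribute.
labels = ["yes", "yes", "no", "no", "yes", "no"]
row_id = ["a", "b", "c", "d", "e", "f"]
outlook = ["sun", "sun", "rain", "rain", "sun", "sun"]

print(gain_ratio(row_id, labels), gain_ratio(outlook, labels))
```

In this toy data `row_id` separates the classes perfectly, so plain information gain would prefer it even though it is useless for prediction. Dividing by the split information penalizes its six distinct values, and `outlook` ends up with the higher gain ratio.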

C4.5 has the following advantages: the classification rules it generates are easy to understand, and its accuracy is high.
Its disadvantage is that, while the tree is being built, the data set has to be scanned and sorted sequentially many times, which makes the algorithm inefficient (by contrast, the CART algorithm needs only two scans of the data set). The advantages and disadvantages listed below apply to decision trees in general.
Advantages: low computational complexity, output that is easy to understand, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
Disadvantages: may overfit the training data.
Applicable data types: numeric and nominal.

K-means algorithm

k-means is a clustering algorithm that partitions n objects into k clusters according to their attributes, with k < n. It is very similar to the expectation-maximization algorithm for mixtures of normal distributions, because both try to find the centers of natural clusters in the data. It assumes that the object attributes form a vector space, and its goal is to minimize the sum of squared errors within each cluster.
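
The description above can be sketched with the standard iterative procedure (often called Lloyd's algorithm): assign every point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. This is a minimal illustration, not a reference implementation. The helper names, the random choice of initial centers, the `seed` and `max_iter` parameters, and the toy points are all assumptions made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k of the input points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        labels = assign(points, centers)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:  # no center moved, so the result is stable
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Because the two groups here are far apart, the first three points end up in one cluster and the last three in the other. On less clearly separated data the result can depend on the initial centers, so it is common to run the procedure with several seeds and keep the run with the smallest within-cluster sum of squared errors.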

Advantages: easy to implement.
Disadvantages: may converge to a local minimum, slow convergence on large data sets.
Applicable Data Type: numerical data.

Origin www.cnblogs.com/cpg123/p/11999841.html