Machine Learning | Algorithm Summary

Foreword

This series reviews and summarizes machine learning algorithms, with the aim of explaining the theory behind each algorithm clearly, accompanied by code examples that make it easy to get started and easy to understand.

Table of Contents

   Decision Tree
   Naive Bayes
   K-Means
   Machine Learning Algorithm Summary

This chapter summarizes the top ten algorithms, with short Python sketches of the classical implementation logic.

1. C4.5

C4.5 is a decision tree classification algorithm. A decision tree organizes its decision nodes like a tree (in fact, an inverted tree). C4.5 is the core algorithm that improves on ID3, so once you understand ID3 you basically know half of how C4.5 constructs a tree. Every decision tree construction method essentially selects a good feature and a split point to serve as the classification condition at the current node.

C4.5's improvements over ID3 are:

  • 1. It selects attributes by information gain ratio. ID3 selects the splitting attribute by information gain, where information is measured by entropy (a criterion of impurity), i.e., by the reduction in entropy, whereas C4.5 uses the information gain ratio. The difference is between an absolute gain and a rate of gain; a ratio is generally taken to balance the comparison, much as acceleration does for speed. Imagine two runners: one goes from 10 m/s to 20 m/s in 10 s, the other from 1 m/s to 2 m/s in 1 s. Comparing raw speed differences suggests a large gap between the two, but measured by the rate of change of speed (acceleration, here 1 m/s² for both), the two are identical. In this way C4.5 overcomes ID3's bias toward attributes with many values when selecting by information gain. (A small sketch of the gain-ratio computation follows this list.)
  • 2. It prunes during tree construction: nodes with only a few samples attached are best not split further, since doing so easily leads to overfitting.
  • 3. It can also handle non-discrete (continuous) attributes.
  • 4. It can process incomplete data (missing values).
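
As a concrete illustration, here is a minimal Python sketch of the gain-ratio computation for a categorical attribute. The function names and the toy weather data are illustrative, not from the original post:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a label list: -sum(p * log2(p)), a measure of impurity."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain ratio for splitting `labels` by categorical `values`."""
    n = len(labels)
    # Group the labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Information gain: entropy reduction after the split (ID3's criterion).
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    # Split information: entropy of the attribute's own value distribution.
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

# Toy example: "outlook" attribute vs. play/no-play labels.
outlook = ["sunny", "sunny", "overcast", "rain", "rain"]
play    = ["no",    "no",    "yes",      "yes",  "no"]
print(gain_ratio(outlook, play))
```

Dividing the gain by the split information is exactly the normalization step that penalizes attributes with many distinct values.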

2. K-Means

The k-means algorithm is a clustering algorithm that divides n objects into k groups (k < n) according to their attributes. It is closely related to the expectation-maximization algorithm for mixtures of normal distributions (the fifth of the top ten algorithms, below), since both try to find the centers of natural clusters in the data. It assumes that the object attributes form a vector space, and its goal is to minimize the total within-cluster sum of squared errors.
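
A from-scratch sketch of the two alternating steps, assuming NumPy is available (the data and function names are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate assignment and mean-update steps.
    X is an (n, d) array; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels

# Two obvious blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```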

3. Support Vector Machines

SVM stands for Support Vector Machine, sometimes called the SV machine (papers generally abbreviate it as SVM).

It is a supervised learning method widely used in statistical classification and regression analysis. An SVM maps input vectors into a higher-dimensional space and constructs a maximum-margin hyperplane in that space. Two parallel hyperplanes are built on either side of the hyperplane that separates the data, and the separating hyperplane is chosen to maximize the distance between the two parallel hyperplanes. The assumption is that the larger the distance (gap) between the parallel hyperplanes, the smaller the total classification error.
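
A brief usage sketch, assuming scikit-learn is available (the dataset is synthetic and purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps inputs into a higher-dimensional space;
# C trades margin width against training errors.
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```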

An excellent guide is C. J. C. Burges's "A Tutorial on Support Vector Machines for Pattern Recognition". Van der Walt and Barnard, among others, have compared support vector machines with other classifiers.

4. The Apriori Algorithm

The Apriori algorithm is one of the most influential algorithms for mining frequent itemsets for Boolean association rules. Its core is a recursive, two-stage algorithm based on the idea of frequent sets. In terms of the classification of association rules, it mines single-dimensional, single-level, Boolean association rules. Here, every itemset whose support is greater than the minimum support is called a frequent itemset, or frequent set for short.
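
A simplified level-wise sketch of the frequency-set idea (it omits the subset-pruning optimization of the full algorithm; the toy transactions are illustrative):

```python
def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: candidates of size k+1 are
    joined from frequent itemsets of size k, then filtered by support."""
    n = len(transactions)
    frequent = {}
    # Start with all frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    current = {s for s in items
               if sum(s <= t for t in transactions) / n >= min_support}
    k = 1
    while current:
        for s in current:
            frequent[s] = sum(s <= t for t in transactions) / n
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Filter step: keep candidates meeting the minimum support.
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        k += 1
    return frequent

transactions = [frozenset(t) for t in
                [{"milk", "bread"}, {"bread", "butter"},
                 {"milk", "bread", "butter"}, {"milk", "bread"}]]
print(apriori(transactions, min_support=0.5))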

5. The Expectation-Maximization (EM) Algorithm

In statistical computation, the expectation-maximization (EM) algorithm finds maximum-likelihood estimates of parameters in a probabilistic model, where the model depends on unobservable latent variables.

EM is frequently used in computer vision, machine learning, and data clustering.
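
A minimal sketch of EM for a two-component, one-dimensional Gaussian mixture, assuming NumPy and SciPy are available (the initialization and data are illustrative):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture. The component
    assignment of each point is the unobserved latent variable."""
    # Crude initialization at the data extremes.
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = pi * norm.pdf(x[:, None], mu, sigma)       # shape (n, 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibility-weighted data.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return pi, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])
print(em_gmm_1d(x))
```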

6. PageRank

PageRank is an important part of Google's algorithm. It was granted a US patent in September 2001; the patentee is Google co-founder Larry Page. The "Page" in PageRank therefore refers not to a web page but to Page himself: the method is named after him. PageRank measures the value of a site based on the quantity and quality of its external and internal links. The concept behind it is that every link to a page is a vote for that page: the more it is linked, the more other sites are voting for it.

This is so-called "link popularity": a measure of how willing people are to link their sites to yours. The concept is borrowed from the citation frequency of academic papers; in general, the more often a paper is cited by others, the more authoritative it is judged to be.
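
The vote counting can be computed by power iteration. A minimal sketch on a hypothetical four-page web (the page names are illustrative; the damping factor 0.85 is the conventional choice):

```python
import numpy as np

def pagerank(links, d=0.85, n_iter=100):
    """Power-iteration PageRank. `links` maps each page to the pages
    it links to; d is the usual damping factor."""
    pages = sorted(links)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    # Column-stochastic transition matrix: each page splits its
    # rank equally among its outgoing links (its "votes").
    M = np.zeros((n, n))
    for src, outs in links.items():
        for dst in outs:
            M[idx[dst], idx[src]] = 1.0 / len(outs)
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        r = (1 - d) / n + d * M @ r
    return dict(zip(pages, r))

# A tiny web: everyone links to C, so C accumulates the most votes.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(links))
```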

7. AdaBoost

AdaBoost is an iterative algorithm. Its core idea is to train a sequence of different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (the strong classifier). The algorithm works by changing the data distribution: based on whether each sample was classified correctly in the previous round, and on the accuracy of the previous round's overall classification, it assigns a new weight to each sample. The re-weighted data set is handed to the next classifier for training, and the classifiers obtained from each round of training are finally fused into the final decision classifier.
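
A usage sketch, assuming scikit-learn 1.2 or newer (where the weak-learner argument is named `estimator`; older releases call it `base_estimator`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision stumps (depth-1 trees) are the classic weak learner; each
# boosting round up-weights the samples the previous rounds got wrong.
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```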

8. kNN: k-Nearest Neighbor Classification

The k-nearest neighbor (kNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea of the kNN method is this: if the majority of the k samples most similar to a given sample in feature space (i.e., its k nearest neighbors in feature space) belong to a certain category, then the sample belongs to that category as well.
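
A from-scratch sketch of the majority-vote rule (the toy data and function name are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify one point by majority vote among its k nearest
    training samples under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["red", "red", "red", "blue", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> "red"
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # -> "blue"
```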

9. Naive Bayes

Among classification models, the two most widely used are the decision tree model and the naive Bayes classifier (NBC). The naive Bayes model originates in classical mathematical theory; it has a solid mathematical foundation and stable classification performance. At the same time, the NBC model requires few parameters to be estimated, is not very sensitive to missing data, and is relatively simple as an algorithm. In theory, the NBC model has the smallest error rate of any classification model. In practice, however, this is not always the case, because the NBC model assumes the attributes are mutually independent, an assumption that often fails to hold, and this detracts to some extent from its ability to classify correctly. When the number of attributes is large or the correlation between attributes is strong, the NBC model's classification performance falls below that of the decision tree model; when the attribute correlations are weak, the NBC model performs best.
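
A usage sketch with scikit-learn's GaussianNB on the classic iris data (illustrative; GaussianNB is one of several naive Bayes variants):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB treats each feature as conditionally independent given
# the class -- exactly the "naive" independence assumption above.
clf = GaussianNB().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```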

10. CART: Classification and Regression Trees

CART stands for Classification and Regression Trees. Its classification trees rest on two key ideas: the first is recursively partitioning the predictor space; the second is pruning with validation data.
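
A sketch of both ideas using scikit-learn's decision trees, which implement a CART-style algorithm: grow a tree by recursive partitioning, then choose a cost-complexity pruning level on held-out validation data (the dataset and split are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow a full tree by recursive partitioning, then pick the
# cost-complexity pruning level (ccp_alpha) that does best on
# the held-out validation data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print("validation accuracy:", best.score(X_val, y_val))
```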

 


 

Reference:

http://www.csuldw.com/2015/03/18/2015-03-18-machine-learning-top10-algorithms/

 

 
