A Summary of Classic Machine Learning Algorithms

1. KNN algorithm

The k-nearest neighbor algorithm (KNN) is a basic classification and regression method. Its core idea is that if most of the nearest neighbors of a sample in feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in it. In making a classification decision, KNN determines the category of the sample to be classified according to the categories of only the nearest one or several samples.

In KNN, the distance between objects is used as a non-similarity measure between them, which avoids the problem of matching features between objects. The distance is usually the Euclidean distance or the Manhattan distance:

$$d_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}, \qquad d_{\mathrm{Manhattan}}(x, y) = \sum_{k=1}^{n} |x_k - y_k|$$

At the same time, KNN bases its decision on the dominant category among the k nearest objects rather than on a single object's category. These two points are advantages of the KNN algorithm.
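As a small illustration, both distance measures can be computed in a few lines (a minimal sketch in plain Python; the function names are our own):

```python
import math

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
print(manhattan_distance([0, 0], [3, 4]))  # → 7
```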

1.1. Choosing the value of k

  • If k is small, the model becomes more complex and is prone to overfitting;
  • If k is large, the model becomes simpler and is prone to underfitting.

Thus the choice of k is itself a matter of parameter tuning, typically done by cross-validation.

1.2. The idea of the KNN algorithm

The idea of KNN: given a training set whose data and labels are known, input the test data, compare its features with the corresponding features of the training samples, and find the K samples in the training set that are most similar to it; the predicted category of the test data is then the category that occurs most often among those K samples. The algorithm is described as:

  1. Calculate the distance between the test data and every training sample;
  2. Sort the samples by increasing distance;
  3. Select the K points with the smallest distances;
  4. Count the frequency of each category among those K points;
  5. Return the most frequent category among the K points as the predicted category of the test data.
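The five steps above can be sketched in plain Python (the helper name `knn_predict` and the toy data are our own):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Steps 1-2: compute Euclidean distances to every training sample and sort.
    dists = sorted((math.dist(query, x), y) for x, y in zip(train_X, train_y))
    # Steps 3-4: take the k closest samples and count their labels.
    votes = Counter(label for _, label in dists[:k])
    # Step 5: return the most frequent label.
    return votes.most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))  # → A
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))  # → B
```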

2. Support vector machine algorithm


2.1. A brief description of support vector machines

  • The support vector machine (SVM) is a binary classification model. Its basic form is the linear classifier with the largest margin in feature space, which distinguishes it from the perceptron; with the kernel trick, an SVM becomes, in effect, a nonlinear classifier.
  • The learning strategy of the SVM is margin maximization: the sum of the distances from two heterogeneous support vectors to the separating hyperplane, $\gamma = \frac{2}{\|w\|}$, is called the margin. Learning can be formalized as solving a convex quadratic programming problem, which is also equivalent to minimizing a regularized hinge loss function.
  • The optimization algorithm of the SVM is therefore an algorithm for solving convex quadratic programs.

2.2. The basic form of the SVM

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \qquad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1,\quad i = 1, 2, \dots, m$$

This basic form of the SVM is a convex quadratic programming problem; applying the method of Lagrange multipliers to the formula above converts it into a dual problem, which can then be solved optimally.

2.3. Solving the dual problem

Adding a Lagrange multiplier $\alpha_i \ge 0$ to each constraint of the basic form gives the Lagrangian function:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i (w^T x_i + b) - 1 \right)$$
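The intermediate step between the Lagrangian and the dual is standard: setting the partial derivatives of $L(w, b, \alpha)$ with respect to $w$ and $b$ to zero gives

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$$

and substituting these back into $L$ eliminates $w$ and $b$, leaving a problem in $\alpha$ alone.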

After this derivation (refer to the "watermelon book", Zhou Zhihua's Machine Learning), the dual problem of the basic SVM form is obtained:

$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

$$\text{s.t.}\quad \sum_{i=1}^{m} \alpha_i y_i = 0, \qquad \alpha_i \ge 0,\quad i = 1, 2, \dots, m$$

To solve this problem efficiently there is the SMO (Sequential Minimal Optimization) method. The basic idea of SMO is to fix all multipliers except a chosen pair $\alpha_i, \alpha_j$ (a single $\alpha_i$ cannot be changed alone, since that would violate the constraint $\sum_i \alpha_i y_i = 0$), solve for the extremum over that pair analytically, and repeat until convergence.
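A full SMO implementation is somewhat involved. As a runnable sketch, the following instead trains a linear SVM via the other route mentioned in 2.1: stochastic subgradient descent on the regularized hinge loss (the Pegasos approach, not SMO). The function name, toy data, and hyperparameters are our own, and the bias term is omitted so that a hyperplane through the origin suffices:

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Train a bias-free linear SVM by stochastic subgradient descent
    on the regularized hinge loss (Pegasos-style)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            if margin < 1:
                # Hinge is active: shrink w and step toward y_i * x_i.
                w = [(1 - eta * lam) * wj + eta * y[i] * xj
                     for wj, xj in zip(w, X[i])]
            else:
                # Only the regularizer contributes to the subgradient.
                w = [(1 - eta * lam) * wj for wj in w]
    return w

# Symmetric, linearly separable toy data with labels in {-1, +1},
# chosen so a hyperplane through the origin separates the classes.
X = [(-4, -4), (-4, -5), (-5, -4), (4, 4), (4, 5), (5, 4)]
y = [-1, -1, -1, 1, 1, 1]
w = train_linear_svm(X, y)
predict = lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1
print(predict((-1, -1)), predict((1, 1)))  # → -1 1
```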

3. K-means clustering algorithm

3.1. Classification and clustering algorithms

  • Simply put, classification assigns samples to existing categories according to their features or attributes. The categories are known in advance: by training on already-labeled data, the characteristics of each class are learned, and unlabeled data are then assigned to one of the classes.
  • Clustering means that you do not know in advance how many categories the data fall into; cluster analysis groups the data (or users) into several clusters. Clustering requires no labeled training data.
  • Classification is supervised learning, while clustering is unsupervised learning. Common classification algorithms include decision trees and Bayesian classifiers; the most basic clustering algorithms are hierarchical clustering and K-means.

3.2. The K-means clustering algorithm

The purpose of clustering is to find the latent class $y$ of each sample $x$ and to group together the samples that share the same $y$. In a clustering problem the training samples are $x_1, \dots, x_m$ with each $x_i \in \mathbb{R}^n$, and no labels $y$ are given. The K-means algorithm clusters the samples into $k$ clusters; the procedure is as follows:

  1. Randomly select $k$ cluster centroids $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$.
  2. Repeat the following until the centroids no longer change (or change very little):
    • For each sample $i$, assign its cluster: $c_i = \arg\min_j \|x_i - \mu_j\|^2$
    • For each cluster $j$, recompute its centroid: $\mu_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{c_i = j\}\, x_i}{\sum_{i=1}^{m} \mathbf{1}\{c_i = j\}}$

Here $k$ is the number of clusters given in advance; $c_i$ denotes the cluster whose centroid is closest to sample $i$, taking a value from 1 to $k$; and the centroid $\mu_j$ is the current estimate of the center of the samples belonging to cluster $j$.
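The two-step loop above can be sketched in plain Python (the helper name `kmeans`, the toy data, and the convergence test on unchanged assignments are our own choices):

```python
import math
import random

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means: returns (centroids, cluster assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(X, k)  # step 1: random distinct initial centroids
    assign = [0] * len(X)
    for it in range(iters):
        # Assignment step: nearest centroid for each sample.
        new_assign = [min(range(k), key=lambda j: math.dist(x, centroids[j]))
                      for x in X]
        if it > 0 and new_assign == assign:
            break  # converged: assignments unchanged
        assign = new_assign
        # Update step: mean of the samples in each cluster.
        for j in range(k):
            members = [x for x, c in zip(X, assign) if c == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(coords) / len(members)
                                     for coords in zip(*members))
    return centroids, assign

# Two obvious 2-D clusters.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, assign = kmeans(X, 2)
print(assign)  # samples 0-2 share one label, samples 3-5 the other
```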


Origin blog.csdn.net/Java_LingFeng/article/details/128682496