Artificial Intelligence - Day 2: The KNN Algorithm

I. Overview

  1. Concept: the k-nearest neighbor (k-Nearest Neighbor, abbreviated KNN) algorithm is a very simple supervised machine learning algorithm.

  2. Main idea: given a training data set, for a new data sample, find the k samples in the training set that are nearest to it, count which class the majority of those k samples belong to, and assign the new sample to that class.

  3. An illustration from Wikipedia

      [Figure: Wikipedia's classic KNN illustration - red triangles and blue squares, with a green dot to be classified]

  As shown above, there are two classes of data, red triangles and blue squares, and the green dot is the sample to be classified. Let us classify the green dot using the k-nearest neighbor algorithm:

  If k = 3, the three nearest neighbors of the green dot are 2 red triangles and 1 blue square. The result is obvious: by majority vote, the green dot is assigned to the red triangle class.

  If k = 5, the five nearest neighbors of the green dot are 2 red triangles and 3 blue squares, so the green dot is assigned to the blue square class.

  4. Questions the algorithm raises

    From the example above we can see that the idea behind k-nearest neighbor is very simple and easy to understand, but applying it in a real project still takes care. For instance, how do we determine k, and what value of k is best? And how do we judge which neighbors are "nearest"? A minimal sketch of the basic algorithm follows; the next section takes up these two questions.
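    To make the idea concrete, here is a minimal sketch of a KNN classifier in plain Python (my own illustration, not the implementation from the follow-up post; the toy data loosely mimics the figure above):

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Pair each training point with its label, sorted by distance to the query
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], query))
    # Majority vote among the k nearest labels decides the class
    k_labels = [label for _, label in neighbors[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy 2-D data: 'red' triangles and 'blue' squares around a green query dot
train_X = [(1.0, 1.0), (2.0, 1.5), (2.5, 3.0), (4.0, 4.0), (4.5, 5.0), (5.0, 3.5)]
train_y = ['red', 'red', 'red', 'blue', 'blue', 'blue']
print(knn_predict(train_X, train_y, query=(3.0, 3.0), k=3))  # -> 'red'
```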

II. Selecting k, and Feature Normalization

  1. First, the parameter k is actually quite difficult to choose.

    If we select a smaller value of k, the overall model becomes more complex and prone to overfitting.

    Overfitting means very high accuracy on the training set but low accuracy on the test set. From this we can see why a k that is too small leads to overfitting: the model easily learns noise (for example, an outlier that happens to lie close to the query point) while ignoring the true distribution of the data!

    If we select a larger value of k, it is equivalent to predicting with a larger neighborhood of training data; training examples far from (and dissimilar to) the input then also take part in the prediction and can cause errors. Increasing k means the overall model becomes simpler.

    Why does increasing k make the model simpler? Consider the extreme: if k = N (where N is the number of training samples), then no matter what the input instance is, the prediction is simply the most frequent class in the training set. A model like that is trivially simple; it is equivalent to not training at all: just count the classes in the training data and pick the largest.

    So k should be neither too large nor too small. How do we generally choose it?

    Dr. Li Hang's book (Statistical Learning Methods) says that we generally choose a relatively small value and use cross-validation to select the optimal k. (In other words, selecting k is essentially experimental parameter tuning, much like choosing the number of layers in a neural network: adjust the hyperparameter and keep whatever gives the better result.) A sketch of this procedure follows.
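    As one concrete way to do this (a sketch of my own, assuming scikit-learn is available; the iris dataset and the range of k values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a range of odd k values (odd k avoids ties in two-class votes)
# and keep the one with the best mean cross-validated accuracy.
best_k, best_score = None, 0.0
for k in range(1, 22, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

print(f"best k = {best_k}, mean CV accuracy = {best_score:.3f}")
```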

  2. Feature normalization

    First, an example. Suppose we use a person's height (cm) and shoe size as feature values, and the class label is male (M) or female (F). We now have five training samples, distributed as follows:

      A [(179, 42), M]   B [(178, 43), M]   C [(165, 36), F]   D [(177, 42), M]   E [(160, 35), F]

    From these training samples it is easy to see that the first feature (height) is roughly four times the magnitude of the second (shoe size), so when we measure distance we will be biased toward the first feature. The two features are then not equally important, which can distort the distance calculation and lead to prediction errors.

    Verification: take a test sample F (167, 43) and predict whether this person is male or female, using k = 3.

    We use the Euclidean distance to compute the distance from F to each training sample, select the nearest three, and take the majority class among them as the final result. The calculation is as follows:

      d(F, A) = √((167 - 179)² + (43 - 42)²) = √145 ≈ 12.04
      d(F, B) = √((167 - 178)² + (43 - 43)²) = √121 = 11.00
      d(F, C) = √((167 - 165)² + (43 - 36)²) = √53 ≈ 7.28
      d(F, D) = √((167 - 177)² + (43 - 42)²) = √101 ≈ 10.05
      d(F, E) = √((167 - 160)² + (43 - 35)²) = √113 ≈ 10.63

    From the calculation, the three nearest samples are C, D, and E. Since C and E are female and D is male, women outnumber men, and the prediction for F is female. The calculation can be reproduced with the short sketch below.
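    A quick sketch to reproduce this (my own code; math.dist requires Python 3.8+):

```python
import math

train = {'A': ((179, 42), 'M'), 'B': ((178, 43), 'M'),
         'C': ((165, 36), 'F'), 'D': ((177, 42), 'M'),
         'E': ((160, 35), 'F')}
F = (167, 43)

# Euclidean distance from F to every training sample, nearest first
for d, name, label in sorted((math.dist(feats, F), name, label)
                             for name, (feats, label) in train.items()):
    print(f"{name} ({label}): {d:.2f}")
# Nearest three: C (F), D (M), E (F) -> majority vote says female
```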

    So here is the problem: a woman with size 43 feet is far less likely than a man with size 43 feet, so why does the algorithm still predict F as female? Because the features carry different weights purely due to their scales: height has been made far more important than shoe size, which is not objective.

    So we should make every feature equally important, and that is one reason we normalize the features.

  Normalization formula (min-max scaling):

      x' = (x - min) / (max - min)

    where min and max are the minimum and maximum of that feature over the training samples, so every feature is scaled into [0, 1]. A sketch applying this to the example above follows.

 
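    Applying this formula to the height/shoe-size example (again a sketch of my own) shows why it matters: once both features live on the same [0, 1] scale, the prediction for F flips:

```python
import math

heights = [179, 178, 165, 177, 160]
sizes = [42, 43, 36, 42, 35]
labels = ['M', 'M', 'F', 'M', 'F']

def minmax(value, column):
    # Scale a value into [0, 1] using the column's min and max
    return (value - min(column)) / (max(column) - min(column))

# Normalize every training sample and the test sample F (167, 43)
train_norm = [((minmax(h, heights), minmax(s, sizes)), y)
              for h, s, y in zip(heights, sizes, labels)]
F_norm = (minmax(167, heights), minmax(43, sizes))

nearest = sorted(train_norm, key=lambda p: math.dist(p[0], F_norm))[:3]
votes = [label for _, label in nearest]
print(votes, '->', max(set(votes), key=votes.count))  # ['M', 'M', 'M'] -> M
```

    After normalization the three nearest neighbors become D, B, and A, all male, so F is now predicted male, which matches the intuition that size 43 feet are far more likely to belong to a man.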

III. KNN Algorithm Summary

  1. The core idea of k-nearest neighbor: given a training data set, for a new data sample, find the k samples in the training set nearest to it, count which class the majority of those k samples belong to, and assign the new sample to that class.

  2. The k instances nearest to the given instance: "nearest" is defined by a distance function. The most commonly used is the Euclidean distance, but the Minkowski distance, Manhattan distance, Chebyshev distance, cosine distance, and so on can also be used (see the sketch after this list).

  3. To make sure every feature is equally important, we normalize each feature.

  4. The value of k should be neither too large nor too small; the best value has to be determined by experimental parameter tuning!
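  For reference, here are minimal implementations of the distance functions mentioned in point 2 (my own sketch; a and b are equal-length numeric sequences):

```python
import math

def minkowski(a, b, p):
    # General Lp distance: p = 1 is Manhattan, p = 2 is Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.dist(a, b)

def chebyshev(a, b):
    # Limit of Minkowski as p grows: the largest single-coordinate gap
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: compares direction, ignoring magnitude
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)
```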

 

--- Next post: an object-oriented implementation of the KNN algorithm

 


Origin www.cnblogs.com/cmxbky1314/p/12349564.html