A small exercise --- the k-nearest neighbor (kNN) algorithm

 

1. What is the k-nearest neighbor algorithm?

The core idea of the kNN classification algorithm is to let the k training samples closest to the target data decide its class.

 

More concretely: there is a training sample set in which every sample contains feature values and a target variable (the class label), and a new input sample that has features but no label. The features of the new sample are compared with every sample in the training set to find the k most similar (closest) samples; the class that occurs most often among those k samples is assigned to the input sample.

 

Here is the classic illustration:

Using the kNN algorithm, the classes of the k nearest samples determine the class of the test data.

When k = 3, the three nearest samples (inside the solid circle) are 2 red triangles and 1 blue square, so the test point is classified as a red triangle.
When k = 5, the five nearest samples (inside the dashed circle) are 2 red triangles and 3 blue squares, so the test point is classified as a blue square.
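The vote flip is easy to reproduce with a few lines of Python (an illustrative snippet; the neighbor lists below simply mirror the counts in the figure):

from collections import Counter

# Class labels of the nearest neighbors in the figure above (illustrative values)
neighbors_k3 = ['red triangle', 'red triangle', 'blue square']
neighbors_k5 = ['red triangle', 'red triangle', 'blue square',
                'blue square', 'blue square']

print(Counter(neighbors_k3).most_common(1)[0][0])  # -> red triangle
print(Counter(neighbors_k5).most_common(1)[0][0])  # -> blue square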

Advantages
(1) Supervised learning: as can be seen, the kNN algorithm requires a training sample set that already contains class labels, so it belongs to supervised learning.
(2) Similarity between samples is measured by computing distances; the algorithm is simple and easy to understand and implement.
(3) It is not sensitive to outliers.

Disadvantages
(4) A value of k must be chosen, and the result depends on it: as the example above shows, different values of k can yield different classifications. k is generally no larger than 20.
(5) The amount of computation is large: the distance from the test sample to every sample in the training set must be calculated in order to find the k nearest samples.
(6) Imbalanced training samples lead to inaccurate results: when one class dominates the training set, the k nearest neighbors tend to belong to that class even when its samples are not actually close to the target.
 

2. kNN algorithm flow

In general, kNN follows this procedure (a minimal sketch combining the steps appears after the list):
(1) Collect data: determine the training sample set and the test data;
(2) Compute the distance between the test data and every sample in the training set;

Common distance formulas:
Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$
Manhattan distance: $d(x, y) = \sum_{i=1}^{n}|x_i - y_i|$
(3) Sort the samples by distance in ascending order;
(4) Select the k closest points;
(5) Count the frequency of each class among these k points;
(6) Return the most frequent class among the k points as the predicted class of the current test data.
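As referenced above, here is a minimal sketch that strings steps (2) through (6) together using the Euclidean distance; the function name classify_knn and the toy data are only for illustration and are not taken from the original post:

import numpy as np
from collections import Counter

def classify_knn(test_point, dataSet, labels, k):
    # Step (2): Euclidean distance from the test point to every training sample
    diffs = dataSet - test_point
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Step (3): sort by distance in ascending order (argsort returns indices)
    sorted_idx = distances.argsort()
    # Steps (4)-(5): take the k closest samples and count their class labels
    k_labels = [labels[i] for i in sorted_idx[:k]]
    votes = Counter(k_labels)
    # Step (6): return the most frequent class among the k neighbors
    return votes.most_common(1)[0][0]

# Toy example: four training points, two classes
dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_knn(np.array([0.2, 0.1]), dataSet, labels, k=3))  # -> B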
 

 

Data normalization:

Because different features can differ greatly in scale, a feature with larger magnitudes would dominate the Euclidean distance, so the data need to be normalized first. A common choice, used in the function below, is min-max scaling: newValue = (value - min) / (max - min), which maps each feature to the range [0, 1].

 

from numpy import zeros, shape, tile

# Min-max normalization function
def autoNorm(dataSet):
    minVals = dataSet.min(0)                   # column-wise minimum
    maxVals = dataSet.max(0)                   # column-wise maximum
    ranges = maxVals - minVals                 # range of each feature
    normDataSet = zeros(shape(dataSet))        # preallocate a zero matrix of the same shape
    m = dataSet.shape[0]                       # number of rows; shape returns (nrow, ncol)
    normDataSet = dataSet - tile(minVals, (m, 1))   # tile repeats minVals for every row
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
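A quick usage check with made-up data (assuming dataSet is a numpy array):

from numpy import array

data = array([[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]])
normData, ranges, minVals = autoNorm(data)
print(normData)   # each column is now scaled to [0, 1]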

 


Origin: blog.csdn.net/u011510825/article/details/86532351