kNN (k-Nearest Neighbor) Algorithm

I recently took part in a summer training course on big data; here I am recording what I learned.

 

Introduction

  Lazy learning: the learner simply stores the data and waits until it is given test data before doing any processing; the test data are then classified according to their similarity to the stored training tuples. A lazy learner does only a small amount of work on the training tuples and does most of its work at classification or prediction time. Because lazy learners memorize the training instances, they are also called instance-based (case-based) learners.

  K-nearest neighbor belongs to lazy learning: given a training data set, for a new instance whose class is unknown, find the K training instances in the training set that are nearest to it; the new instance is then assigned to the class that the majority of those K instances belong to.

  Algorithm Description: 

    1. Standardize the data.

    2. Compute the distance between the test data and each training sample.

    3. Sort the training samples in order of increasing distance.

    4. Select the K points with the smallest distances.

    5. Count how often each class occurs among those K points.

    6. Return the most frequent class among the K points as the predicted class of the test data.
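  To make these steps concrete, here is a minimal sketch in Python with NumPy (the function name `knn_predict` and its arguments are my own choices for illustration; step 1, standardization, is assumed to have been applied to the data already):

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_x, k):
    # Step 2: Euclidean distance from the test point to every training point
    dists = np.sqrt(((train_X - test_x) ** 2).sum(axis=1))
    # Steps 3 and 4: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Steps 5 and 6: return the most frequent class among those k neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 0.8], [1.0, 1.0]])
y = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # red
```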

  

  The three basic elements of the k-nearest neighbor algorithm: the choice of the value k, the distance metric, and the classification decision rule.


First, data standardization

  Purpose: to prevent any single attribute from carrying too much weight.

  For example, consider points with coordinates x and y, where x lies in the range [0, 1] but y lies in the range [100, 1000]. When computing distances, y then carries far more weight than x.

  Here is the simplest standardization method:

 

  Min-max standardization:    X' = (X - min) / (max - min)

  For example:

    data set x = [1, 2, 3, 5], so min = 1, max = 5

    1: (1 - min) / (max - min) = 0

    2: (2 - min) / (max - min) = 0.25

    3: (3 - min) / (max - min) = 0.5

    5: (5 - min) / (max - min) = 1

    The data set x is updated to x' = [0, 0.25, 0.5, 1]

  The min-max method normalizes x' into the [0, 1] interval.
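  In code, this formula is a small helper (the function name `min_max` is mine; note it assumes max > min):

```python
def min_max(xs):
    # Map each value into [0, 1] via (x - min) / (max - min)
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print(min_max([1, 2, 3, 5]))  # [0.0, 0.25, 0.5, 1.0]
```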


Second, the distance metric

  Distances can be measured with the Euclidean distance, the Manhattan distance, and so on; the Euclidean distance is the most commonly used.

    Euclidean distance formula: d = sqrt((x1 - x2)^2 + (y1 - y2)^2)

    For example, the distance between the points (x1, y1) and (x2, y2) is sqrt((x1 - x2)^2 + (y1 - y2)^2).

   The multidimensional case works the same way: subtract the corresponding coordinates, square the differences, sum them, and take the square root.


   Manhattan distance:

    Manhattan distance formula: the Manhattan distance between the points (x1, y1) and (x2, y2) is |x1 - x2| + |y1 - y2|.

    In the multidimensional case, take the absolute value of each coordinate-wise difference and sum them.

 

Choose among the different distance measures according to the needs of the problem.
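Both metrics are easy to write directly. A sketch for points given as equal-length coordinate sequences (the function names are mine):

```python
import math

def euclidean(p, q):
    # Subtract corresponding coordinates, square, sum, take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum the absolute differences of corresponding coordinates
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```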


Third, determining the value of K

  The choice of the value k has a great impact on the results of k-nearest neighbor. A small k means that only training instances close to the input instance influence the prediction; this reduces the approximation error of learning, but the model becomes prone to overfitting. A larger k has the advantage of reducing the estimation error of learning, but the disadvantage that the approximation error increases, because training instances far from the input instance also influence the prediction and can lead to prediction errors. In practice, a relatively small value of k is usually chosen, and cross-validation is used to select the optimal k.
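  For example, selecting k by cross-validation could look like the following sketch using scikit-learn (assuming it is installed; the built-in iris data set is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best one
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 21)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```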


Fourth, the classification decision rule

  The classification decision rule simply determines, according to some rule, which category the current instance is assigned to.

  In the k-nearest neighbor algorithm the classification decision rule is usually majority voting: the class of the input instance is determined by the majority class among its K nearest training instances.
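  Majority voting itself reduces to one line with collections.Counter (a sketch; ties are broken in favor of whichever class was encountered first):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    # Return the most frequent class label among the K nearest neighbors
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote(["a", "b", "a", "a", "b"]))  # a
```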


Fifth, the advantages and disadvantages

   Pros: simple, easy to understand, and easy to implement, with no model building or training required; well suited to classifying rare events.

   Cons: as a lazy algorithm it has a large memory overhead, and classifying a test sample requires a large amount of computation, so performance is low; interpretability is also poor, since it cannot produce explicit rules the way a decision tree can.

