Machine Learning Notes 08 --- k-Nearest Neighbor Learning

    k-Nearest Neighbor (kNN) learning is a commonly used supervised learning method. Its working mechanism is very simple: given a test sample, find the k training samples closest to it in the training set under some distance metric, and then make a prediction based on the information of these k "neighbors". In classification tasks the "voting method" is usually used, i.e., the class label that appears most frequently among the k samples is taken as the prediction; in regression tasks the "averaging method" is used, i.e., the average of the real-valued outputs of the k samples is taken as the prediction. Voting or averaging can also be weighted by distance, with closer samples receiving larger weights.
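As a concrete illustration of the prediction rule described above, here is a minimal NumPy sketch (the function name knn_predict, the Euclidean metric, and the toy data are illustrative choices, not taken from the text):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, task="classify", weighted=False):
    """Predict for a single test point x with plain kNN.

    task="classify": majority vote (optionally distance-weighted).
    task="regress":  (optionally distance-weighted) mean of the neighbors' targets.
    """
    # Euclidean distance from x to every training sample.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training samples.
    nn_idx = np.argsort(dists)[:k]
    nn_y = y_train[nn_idx]
    # Closer neighbors get larger weights; the small constant avoids division by zero.
    w = 1.0 / (dists[nn_idx] + 1e-12) if weighted else np.ones(k)

    if task == "classify":
        # Sum the weights per class label and return the label with the largest total.
        votes = {}
        for label, weight in zip(nn_y, w):
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)
    # Regression: weighted average of the neighbors' real-valued outputs.
    return float(np.sum(w * nn_y) / np.sum(w))

# Tiny usage example with made-up data.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 0.9]), k=3, weighted=True))  # -> 1
```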

    Compared with other learning methods, k-nearest neighbor learning has an obvious difference: it seems to have no explicit training process! It is in fact a well-known representative of "lazy learning": such techniques simply store the samples during the training phase, so the training-time overhead is zero, and only process them once a test sample is received. Correspondingly, methods that process the training samples during the training phase are called "eager learning".

    Below is a schematic diagram of the k-nearest neighbor classifier. Obviously, k is an important parameter: when k takes different values, the classification result can differ significantly. On the other hand, if a different method of computing distance is used, the "nearest neighbors" that are found may be quite different, which likewise leads to significantly different classification results (the short demo after the figure illustrates both effects).

  (Figure: schematic diagram of the k-nearest neighbor classifier. The circles show equidistant lines; the test sample is classified as "-" when k = 1 or k = 5, and as "+" when k = 3.) (PS) As the old saying goes: he who stays near vermilion turns red, and he who stays near ink turns black.
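To see the sensitivity to k and to the distance metric in practice, here is a small sketch using scikit-learn (assuming it is installed; the dataset, noise level, and parameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# A small noisy two-class dataset; split off a test set for evaluation.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Vary k and the distance metric: the accuracy (and individual predictions)
# can change noticeably.
for k in (1, 3, 15):
    for metric in ("euclidean", "manhattan"):
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric).fit(X_tr, y_tr)
        print(f"k={k:2d}  metric={metric:<10s}  test accuracy={clf.score(X_te, y_te):.3f}")
```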

Assuming for the moment that the distance computation is "appropriate", i.e., that the k nearest neighbors can be correctly identified, let us briefly discuss the performance of the "nearest neighbor classifier" (1NN, k = 1) on a binary classification problem.

    Given a test sample x, let z be its nearest neighbor in the training set. The nearest neighbor classifier errs exactly when the class labels of x and z differ, so its error probability is:
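Following the book's derivation, and writing $\mathcal{Y}$ for the set of class labels,

$$P(err) = 1 - \sum_{c \in \mathcal{Y}} P(c \mid \boldsymbol{x})\, P(c \mid \boldsymbol{z}).$$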

     Assume that the samples are independent and identically distributed, and that for any x and any arbitrarily small positive σ a training sample can always be found within distance σ of x; in other words, for any test sample, a training sample z can always be found arbitrarily close to it. Let c* = argmax_c P(c|x) denote the decision of the Bayes optimal classifier. Then:
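Following the book's derivation (the approximation step uses the fact that z can be taken arbitrarily close to x, so that P(c | z) ≈ P(c | x)):

$$
\begin{aligned}
P(err) &= 1 - \sum_{c \in \mathcal{Y}} P(c \mid \boldsymbol{x})\, P(c \mid \boldsymbol{z}) \\
&\simeq 1 - \sum_{c \in \mathcal{Y}} P^{2}(c \mid \boldsymbol{x}) \\
&\leq 1 - P^{2}(c^{*} \mid \boldsymbol{x}) \\
&= \bigl(1 + P(c^{*} \mid \boldsymbol{x})\bigr)\bigl(1 - P(c^{*} \mid \boldsymbol{x})\bigr) \\
&\leq 2 \times \bigl(1 - P(c^{*} \mid \boldsymbol{x})\bigr).
\end{aligned}
$$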

 The somewhat surprising conclusion is that although the nearest neighbor classifier is simple, its generalization error rate is no more than twice the error rate of the Bayes optimal classifier.

Refer to Zhou Zhihua's "Machine Learning"

Original article: blog.csdn.net/m0_64007201/article/details/127591334