[Classic Machine Learning Algorithm] K Nearest Neighbor (KNN): Core and Summary

1. First introduction to K nearest neighbors

K-Nearest Neighbors (KNN) is arguably one of the simplest algorithms in all of machine learning, yet traces of it can still be seen in today's popular deep learning methods. It is simple and easy to use: learn it once and you can apply it right away. To put it simply: given a training set, map each sample to a point in the feature space and store it. When a new test instance arrives, find its K nearest instances in the feature space, see which class is most common among those K instances, and assign that label to the test sample.

When K=1, it is the nearest neighbor algorithm.


Consider a training set with two classes (blue and red). A new sample (green) arrives, and we need to decide which class it belongs to. The KNN algorithm searches the whole feature space for the K nearest reference points (say K = 3); whichever class holds the majority among those K points is the label assigned to the new sample. In this example the new sample would be labeled red.
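To make the procedure concrete, here is a minimal brute-force sketch in Python (NumPy is assumed; the `knn_predict` helper, the toy points, and the labels are made up to mirror the red/blue/green example above, not taken from any library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distance from the new sample to every stored training instance.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K nearest training instances.
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those K instances.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy version of the example: two classes and one new (green) point.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # "red" cluster
                    [3.0, 3.0], [3.2, 2.9]])               # "blue" cluster
y_train = np.array(["red", "red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> red
```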

2. A closer look at K-nearest neighbors

2.1 Three elements of K nearest neighbor

Three basic elements: choice of the K value, the distance metric, and the classification decision rule.

Model: The model itself is very simple and was introduced in the previous section. For a more formal definition, see Li Hang's "Statistical Learning Methods". The essence of the model is this: setting the prediction step aside, once the three elements are fixed, KNN partitions the feature space into subregions according to the training samples, and the class of each subregion is determined. (If this is hard to picture, consider the nearest-neighbor case K = 1, where each training point owns the cell of the space closest to it.)
Distance metric: The most commonly used metric is the Euclidean distance, but in fact any distance metric can be used. A general form is the $L_p$ distance, $L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p}$: different values of p give different metrics (p = 2 is the Euclidean distance, p = 1 the Manhattan distance), so I will not go into further detail here.


For different values of p, the set of points in $\mathbb{R}^2$ whose $L_p$ distance from the origin is fixed traces out a different contour line (every point on the line is equally far from the origin): a circle for p = 2, a diamond for p = 1, and a shape approaching a square as p grows. Interested readers can relate this to the L1 and L2 regularization terms used in loss functions.

It should be noted that the nearest neighbors obtained under different metrics may be different.
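As a small illustration of that point, here is a sketch (the `lp_distance` helper and the points are made up for this example) showing that the same query can have a different nearest neighbor under different values of p:

```python
import numpy as np

def lp_distance(x, y, p):
    # L_p distance: (sum of |x_l - y_l|^p)^(1/p)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])      # query point
a = np.array([3.0, 0.0])      # nearer under small p
b = np.array([2.2, 2.2])      # nearer under large p

for p in (1, 2, 4):
    da, db = lp_distance(x, a, p), lp_distance(x, b, p)
    print(f"p={p}: d(x,a)={da:.2f}, d(x,b)={db:.2f} -> "
          f"nearest is {'a' if da < db else 'b'}")
```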

Choice of K value: when K is small, prediction is based on a small neighborhood of the input, so the approximation error of learning decreases (only instances close to the input influence the prediction), but the estimation error increases (the prediction becomes very sensitive to the nearby instance points and is easily affected by noise). When K is too large, the neighborhood used becomes large, which reduces the estimation error of the model but increases the approximation error of learning.

It can also be said that the smaller the K value, the more complex the model and the easier it is to overfit; the larger the K value, the simpler the model. In the extreme case K = N, every input receives the same prediction (the majority class of the whole training set).

Approximation error and estimation error: https://blog.csdn.net/qq_35664774/article/details/79448076

The k value generally takes a relatively small value, and the cross-validation method is usually used to select the optimal k value.
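A minimal sketch of choosing K by cross-validation, assuming scikit-learn is available (the iris dataset and the range 1 to 20 are only illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, -1.0
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validation accuracy for this value of K.
    score = cross_val_score(clf, X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"best K = {best_k}, cross-validated accuracy = {best_score:.3f}")
```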

Classification decision rule: majority voting (under 0-1 loss this is equivalent to empirical risk minimization; see "Statistical Learning Methods" for the proof).

2.2 KD tree

The simplest implementation of the k-nearest-neighbor method is linear scan: computing the distance between the input instance and every training instance. This becomes very time-consuming when the training set is large. The kd-tree (k-dimensional tree) improves search efficiency and reduces the amount of computation. A kd-tree is a tree data structure that stores instance points in k-dimensional space for fast retrieval; it is essentially a binary tree. It can reduce the average nearest-neighbor search time from the $O(N)$ of linear scan to roughly $O(\log N)$.

2.2.1 Construction of kd tree

The core idea of kd-tree construction is to recursively split the data with axis-aligned splitting planes. The specific process is as follows:

The formal algorithm (see "Statistical Learning Methods") consists of an initialization step, a recursive splitting step, and a termination condition. Put simply: for N training points in k dimensions, first pick the first dimension, sort all of the data along it, and take the point at the median as the root node; points whose value in that dimension is smaller than the root's go to the left subtree, and points with larger values go to the right subtree. Then pick the second dimension and split each of the two resulting regions in the same way (again at the median), and continue recursively, cycling through the dimensions, until no instances remain in any subregion. [The whole process is similar to building a balanced binary search tree.]
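A rough sketch of this construction (the `Node` and `build_kdtree` names are illustrative rather than from any library; the six 2-D points are the ones commonly used to demonstrate kd-trees, e.g. in "Statistical Learning Methods"):

```python
class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point    # the splitting instance stored at this node
        self.axis = axis      # the dimension used to split at this node
        self.left = left
        self.right = right

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])              # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2                  # split the region at the median
    return Node(
        point=points[median],
        axis=axis,
        left=build_kdtree(points[:median], depth + 1),
        right=build_kdtree(points[median + 1:], depth + 1),
    )

data = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
root = build_kdtree(data)
print(root.point)   # (7, 2) becomes the root for this dataset
```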

I highly recommend this video: https://www.bilibili.com/video/BV1No4y1o7ac?p=21. After watching it you will understand; it also walks through concrete examples.


2.2.2 Search of kd tree

The kd-tree is mainly used to speed up the retrieval step of KNN: given a test sample, find its K nearest neighbors among the training samples. Taking the nearest neighbor (K = 1) as an example (the idea extends easily to K nearest neighbors), the search is again a recursive process. The specific procedure is as follows:

For the constructed kd-tree, first descend from the root, following the split in the corresponding dimension at each level, until a leaf node is reached, and take that leaf as the current answer ans. Then backtrack recursively: go back to the parent node, compute whether it is closer to the target than ans, and update ans if it is; next check whether the sibling subtree (the region on the other side of the splitting hyperplane) could contain a point closer than ans, and if it could, move into that subtree and search it in the same way. When the backtracking returns past the root node, the search ends and ans is returned as the nearest neighbor.
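A rough sketch of this backtracking search, continuing from the `build_kdtree` sketch above and assuming the Euclidean distance:

```python
def nearest(node, target, best=None):
    if node is None:
        return best

    def dist2(p, q):
        # Squared Euclidean distance (avoids an unnecessary square root).
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Update the current answer if this node is closer to the target.
    if best is None or dist2(node.point, target) < dist2(best, target):
        best = node.point

    axis = node.axis
    # Descend first into the subtree on the same side as the target.
    if target[axis] < node.point[axis]:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    best = nearest(near, target, best)

    # Backtrack: the far subtree can only hold a closer point if the
    # splitting hyperplane is closer to the target than the current best.
    if (target[axis] - node.point[axis]) ** 2 < dist2(best, target):
        best = nearest(far, target, best)
    return best

print(nearest(root, (3, 4.5)))   # -> (2, 3) for the example tree above
```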

Corresponding video and example: https://www.bilibili.com/video/BV1No4y1o7ac?p=22

3. Summary

KNN is one of the simplest classification and regression algorithms. Although only the classification procedure is described above, regression works similarly: run the same neighbor search and return the average of the neighbors' values as the result. KNN has three major elements: the distance metric, the choice of K value, and the classification decision rule. The smaller the K value, the more complex the model and the lower the approximation error; the larger the K value, the simpler the model and the lower the estimation error.
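For completeness, a minimal sketch of the regression variant (the same neighbor search as the classification sketch in Section 1, but returning the mean of the neighbors' numeric targets; NumPy is assumed and `knn_regress` is an illustrative name):

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # For regression, return the average of the K nearest target values.
    return y_train[nearest].mean()
```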

Although the K-nearest-neighbor algorithm is simple, its limitations are also very obvious. It does not explicitly build a model from the data but simply stores all of the training set features, so when the training data is large, the time and space costs become very high. Therefore, to further improve computational efficiency, the kd-tree is used as an optimization.
