Introduction to k-NN algorithm

1. Introduction to the algorithm

        The k-nearest neighbor (k-NN) method is a basic classification and regression method.

Algorithmic idea:

        Given a training set in which the category of every instance has already been determined, a new instance is classified by finding the k training instances nearest to it and applying majority voting (or a similar rule): if most of those k neighbors belong to a certain class, the new instance is assigned to that class.

        The k-nearest neighbor method therefore has no explicit learning process. In effect, k-NN uses the training data set to partition the feature vector space, and this partition serves as the "model" for classification.
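        To make the idea concrete, here is a minimal brute-force sketch (the function name and the toy data are illustrative assumptions, not code from the original):

```python
# A minimal brute-force k-NN classification sketch; the function name
# and the toy data are illustrative assumptions.
from collections import Counter
import math


def knn_classify(query, X_train, y_train, k=3):
    # Distance from the query to every training instance (Euclidean here).
    dists = [math.dist(query, x) for x in X_train]
    # Indices of the k nearest training instances.
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]
    # Majority vote among the neighbors' labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


X_train = [(1.0, 1.1), (1.2, 0.9), (8.0, 8.2), (7.9, 8.1)]
y_train = ["A", "A", "B", "B"]
print(knn_classify((1.1, 1.0), X_train, y_train, k=3))  # -> "A"
```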

2. k-nearest neighbor model

        The model used by k-NN corresponds to a partition of the feature space based on the training data set and is determined by three elements:

2.1 Distance measure

        For example, the Manhattan distance or the Euclidean distance.
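        Both are special cases of the Minkowski ($L_p$) distance between feature vectors $x_i, x_j \in \mathbf{R}^n$, stated here for reference:

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p}$$

        Here $p = 1$ gives the Manhattan distance and $p = 2$ the Euclidean distance.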

2.2 Selection of k value

        The choice of k has a large impact on the results of k-NN. Decreasing k makes the overall model more complex (and especially sensitive to noise points), so it is prone to overfitting; increasing k makes the overall model simpler.

        The choice of k reflects a trade-off between approximation error and estimation error. In practice, k usually takes a relatively small value, and cross-validation is used to select the optimal k, as in the sketch below.
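        A minimal sketch of this selection with scikit-learn (the dataset and the candidate range for k are illustrative assumptions, not from the original):

```python
# A minimal sketch of choosing k by cross-validation with scikit-learn.
# The dataset (iris) and the candidate range for k are illustrative
# assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = 1, 0.0
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(clf, X, y, cv=5).mean()  # mean 5-fold accuracy
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, cross-validated accuracy = {best_score:.3f}")
```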

2.3 Classification decision rule

        The usual rule is majority voting, which corresponds to empirical risk minimization under the 0-1 loss.
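        The justification is standard, though not spelled out above: if the predicted class is $c_j$, the misclassification rate over the neighborhood $N_k(x)$ is

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

        so minimizing this empirical risk is equivalent to maximizing the number of neighbors labeled $c_j$, which is exactly majority voting.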

3. kd tree

        Implementing k-NN requires a fast way to search for the k nearest neighbor points. The kd tree is a tree data structure that stores instance points in a k-dimensional space for rapid retrieval (here k denotes the dimension of the space, not the k of k-NN). A kd tree is a binary tree that represents a partition of the k-dimensional space.

3.1 Construction of the kd tree

        Constructing a kd tree amounts to recursively splitting the k-dimensional space with hyperplanes perpendicular to the coordinate axes, forming a series of k-dimensional hyperrectangular regions. Each node of the kd tree corresponds to one such k-dimensional hyperrectangular region.
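        A minimal construction sketch (class and function names are illustrative assumptions, not code from the original): split on the axes cyclically and place the median point at each node, which keeps the tree balanced.

```python
# A minimal kd-tree construction sketch; class and function names are
# illustrative assumptions.
from typing import List, Optional


class KdNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point  # the instance point stored at this node
        self.axis = axis    # the coordinate axis this node splits on
        self.left = left    # points with a smaller coordinate on `axis`
        self.right = right  # points with a larger coordinate on `axis`


def build_kdtree(points: List[tuple], depth: int = 0) -> Optional[KdNode]:
    if not points:
        return None
    k = len(points[0])  # dimension of the space (the "k" of the kd tree)
    axis = depth % k    # cycle through the axes level by level
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2  # median split keeps the tree balanced
    return KdNode(
        point=points[median],
        axis=axis,
        left=build_kdtree(points[:median], depth + 1),
        right=build_kdtree(points[median + 1:], depth + 1),
    )
```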

3.2 Search of the kd tree

        Searching with a kd tree avoids examining most of the data points, which reduces the amount of computation. If the instance points are randomly distributed, the average complexity of a kd-tree search is O(log N), where N is the number of training instances.
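        For concreteness, here is a minimal nearest-neighbor search sketch continuing the construction snippet above (it assumes KdNode and build_kdtree from that block). It returns the single nearest point; extending it to k neighbors is typically done with a bounded max-heap of candidates.

```python
# A minimal nearest-neighbor search sketch over the kd tree built by the
# previous snippet (assumes KdNode and build_kdtree from that block).


def dist2(p, q):
    # Squared Euclidean distance; the square root is unnecessary here.
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(node, target, best=None):
    if node is None:
        return best
    # Update the best candidate with this node's point if it is closer.
    if best is None or dist2(node.point, target) < dist2(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)  # descend the nearer subtree first
    # Visit the far subtree only if the splitting hyperplane is closer
    # to the target than the best point found so far.
    if diff * diff < dist2(best, target):
        best = nearest(far, target, best)
    return best


tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (3, 4.5)))  # -> (2, 3)
```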

        The kd tree is best suited to k-nearest-neighbor search when the number of training instances is much larger than the dimension of the space. As the dimension approaches the number of training instances, its efficiency drops rapidly, degenerating to nearly a linear scan.
