Data mining: model selection - KNN

Introduction to KNN

KNN (k-nearest neighbors) is a supervised learning algorithm. Its main idea is captured by the Chinese proverb "he who stays near vermilion turns red, and he who stays near ink turns black", i.e. a sample resembles its neighbors. Find the K instances in the training data nearest to the new sample; whichever category appears most often among them is the category assigned to the sample.
The figure below illustrates this. If you select the 3 instances closest to the new sample, the circle is classified as a triangle; if you select the 5 closest instances, it is classified as a square.
[Figure: a new sample (circle) among triangles and squares; its 3 nearest neighbors are mostly triangles, while its 5 nearest neighbors are mostly squares]
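
A minimal sketch of the scenario in the figure, using hypothetical toy coordinates and scikit-learn's KNeighborsClassifier (both the data and the library choice are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: label 0 = triangle, 1 = square.
# Distances from the query point (0, 0): 1.0, 1.2, 1.5, 2.0, 2.1
X = np.array([[1.0, 0.0], [0.0, 1.2], [1.5, 0.0], [0.0, 2.0], [2.1, 0.0]])
y = np.array([0, 0, 1, 1, 1])
query = np.array([[0.0, 0.0]])

for k in (3, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    label = "triangle" if clf.predict(query)[0] == 0 else "square"
    print(f"k={k}: predicted {label}")
# k=3 -> triangle (2 triangles vs. 1 square among the 3 nearest)
# k=5 -> square   (2 triangles vs. 3 squares among the 5 nearest)
```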

How KNN works

The working principle is as follows:

Suppose there is a labeled sample data set (the training set) that records the correspondence between each data point and the class it belongs to.
When new, unlabeled data arrives, compare each feature of the new data with the corresponding features of the data in the sample set:

  1. Calculate the distance between the new data point and every data point in the sample set.
  2. Sort all the resulting distances in ascending order (smaller means more similar).
  3. Take the class labels of the first k samples (k is generally no greater than 20).

The class label that occurs most often among these k samples is the classification of the new data, as sketched below.
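
A from-scratch sketch of these steps in Python; the function name and the choice of Euclidean distance are assumptions for illustration:

```python
import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point (Euclidean).
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Steps 2-3: sort ascending, take the labels of the k nearest.
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote: the most frequent label among the k neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([1.5, 1.5]), X_train, y_train, k=3))  # -> 0
```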

Basic elements of KNN

From the description of the principle above, the three basic elements of the k-nearest neighbor algorithm can be identified: the choice of the value of k, the distance metric, and the classification decision rule.

Selection of k value

  • Choosing a smaller value of K reduces the approximation error but increases the estimation error, and the model becomes more complex. Taken to the extreme, if K equals 1, the prediction is fitted to the single nearest sample, so the result is sensitive to the neighboring instance point: if that neighbor happens to be noise, the prediction will be wrong. Overfitting is therefore likely (the model is susceptible to noise in the training data).
  • Choosing a larger value of K increases the approximation error but reduces the estimation error, and the model becomes simpler. At the other extreme, if K equals the size of the training set, then whatever the input sample, the prediction is simply the majority class of the entire training set; there is nothing left to tune, and the model becomes trivially simple.

Regarding approximation error and estimation error:

  • The approximation error concerns the training set: a model with low approximation error predicts the existing training set well, but may deviate badly on unknown test samples.
  • The estimation error concerns the test set: a model with low estimation error predicts unknown data well, but may fit the known training samples less closely.

In practical applications, K generally takes a relatively small value, and cross-validation is usually used to select the optimal K (empirical rule: K is generally below the square root of the number of training samples). A sketch of this selection follows.
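
A minimal sketch of selecting K by cross-validation with scikit-learn; the Iris data set and 5-fold splitting are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Empirical rule: search K below sqrt(number of training samples).
k_max = int(np.sqrt(len(X)))
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, k_max + 1)
}
best_k = max(scores, key=scores.get)
print(f"best K = {best_k}, mean CV accuracy = {scores[best_k]:.3f}")
```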

Distance measure

The distance between two instance points in feature space reflects the degree of similarity between them. The feature space of the k-nearest neighbor model is generally an n-dimensional real vector space, and the distance used can be the Euclidean distance or another distance, as sketched below.
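
Common choices are special cases of the Minkowski ($L_p$) distance (the general form below is a standard definition, not stated in the original):

$$L_p(x_i, x_j) = \Big( \sum_{l=1}^{n} \big| x_i^{(l)} - x_j^{(l)} \big|^p \Big)^{1/p}$$

where $p = 2$ gives the Euclidean distance and $p = 1$ the Manhattan distance. A small sketch:

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski (L_p) distance; p=2 is Euclidean, p=1 is Manhattan."""
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=2))  # 5.0 (Euclidean)
print(minkowski(a, b, p=1))  # 7.0 (Manhattan)
```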

Classification decision rules

Majority voting means that the class of the input instance is determined by the majority class among its K nearest training instances. This rule is also equivalent to minimizing the misclassification rate, i.e. maximizing the expected proportion of correct classifications.
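
One way to see this equivalence (a standard argument; the notation is an assumption, not from the original): let $N_k(x)$ be the set of the K nearest neighbors of $x$ and $c_j$ the predicted class. The misclassification rate over the neighborhood is

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j),$$

so minimizing the misclassification rate is the same as maximizing $\sum_{x_i \in N_k(x)} I(y_i = c_j)$, which is exactly choosing the majority class.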

KNN algorithm features

  • Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
  • Disadvantages: high computational complexity and high space complexity (the distance from the query point to every training point must be calculated; even with a KD-tree to speed up the search, the cost of this method is still relatively high). A usage sketch follows this list.
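
A minimal sketch of delegating the neighbor search to a KD-tree via scikit-learn (the data set and parameter values are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# algorithm="kd_tree" builds a KD-tree so each query avoids
# a brute-force scan over all training points.
clf = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)
print(clf.predict(X[:3]))
```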

Source: blog.csdn.net/AvenueCyy/article/details/105350493