KNN algorithm and its MATLAB code

1. Principle of KNN algorithm

1. Algorithm overview

K-nearest neighbor (kNN) learning is a commonly used supervised learning method. Its working mechanism is very simple: given a test sample, find the k training samples closest to it in the training set according to some distance metric, then make a prediction based on the information carried by these k "neighbors". In classification tasks the "voting method" is generally used: the class label that appears most often among the k samples is taken as the prediction. In regression tasks the "averaging method" is used: the average of the real-valued outputs of the k samples is taken as the prediction. Weighted averaging or weighted voting based on distance can also be used, with closer samples receiving larger weights.

The guiding idea of the kNN algorithm is the proverb "he who stays near vermilion turns red, he who stays near ink turns black": your category can be inferred from your neighbors.

Taking binary classification as an example, a schematic diagram of k-nearest neighbor classification is shown in Figure 1.

Figure 1 Schematic diagram of k-nearest neighbor classification (image not reproduced)

The dashed circles mark lines of equal distance from the test sample; the test sample is judged a positive example when k = 1 or k = 5, and a negative example when k = 3.

2. The calculation steps of the algorithm

1) Calculate distances: given a test object, compute its distance to every object in the training set;

2) Find neighbors: select the k training objects closest to the test object as its nearest neighbors;

3) Classify: assign the test object to the majority class among those k neighbors (a minimal sketch follows this list).
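As a concrete illustration, here is a minimal MATLAB sketch of these three steps for a single test point. The variable names (Xtrain, ytrain, xtest, k) are assumptions for the example, not taken from the linked source code.

```matlab
% Minimal kNN sketch for one test point.
% Assumed inputs: Xtrain (n-by-d training features), ytrain (n-by-1 numeric labels),
% xtest (1-by-d test features), k (number of neighbors).
dists = sqrt(sum((Xtrain - xtest).^2, 2));  % 1) Euclidean distance to each training object
[~, idx] = sort(dists);                     % 2) rank training objects by distance
neighbors = ytrain(idx(1:k));               %    labels of the k nearest neighbors
ypred = mode(neighbors);                    % 3) majority vote among the k labels
```

(The subtraction relies on implicit expansion, which requires MATLAB R2016b or later.)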

3. Advantages and disadvantages of the algorithm

Advantages: simple, easy to understand and to implement; no parameters to estimate and no training phase; suitable for classifying rare events; especially suitable for multi-modal problems (objects carrying multiple class labels).

Disadvantages: it is a lazy algorithm, so classifying test samples involves heavy computation, large memory overhead, and slow scoring; interpretability is poor, since it cannot produce explicit rules the way a decision tree can.

4. Common issues with the algorithm

1) How to set the value of k

If k is too small, the classification result is easily affected by noise points; if k is too large, the neighborhood may contain too many points from other classes.

The value of k is usually determined by cross-validation (starting from k = 1).

Rule of thumb: k is generally kept below the square root of the number of training samples (a selection sketch follows).
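A minimal sketch of this selection procedure, assuming the Statistics and Machine Learning Toolbox is available and reusing the assumed Xtrain/ytrain names from above:

```matlab
% Sketch: choose k by 5-fold cross-validation, trying k = 1 up to about sqrt(n).
kmax = floor(sqrt(size(Xtrain, 1)));
err = zeros(kmax, 1);
for k = 1:kmax
    mdl = fitcknn(Xtrain, ytrain, 'NumNeighbors', k);  % kNN model with this k
    cvmdl = crossval(mdl, 'KFold', 5);                 % 5-fold cross-validation
    err(k) = kfoldLoss(cvmdl);                         % average misclassification rate
end
[~, bestK] = min(err);                                 % k with the lowest CV error
```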

2) The classification decision

The plain voting method ignores how far away each neighbor is, yet closer neighbors should have more say in the final classification. A weighted voting method can therefore be used: weight each vote by distance, with closer samples receiving larger weights (see the sketch below).
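A sketch of inverse-distance weighting, reusing the assumed dists, idx, and neighbors variables from the earlier sketch:

```matlab
% Sketch of inverse-distance weighted voting.
w = 1 ./ (dists(idx(1:k)) + eps);   % closer neighbors get larger weights
classes = unique(ytrain);
score = zeros(numel(classes), 1);
for i = 1:numel(classes)
    score(i) = sum(w(neighbors == classes(i)));  % total weight voting for each class
end
[~, j] = max(score);
ypred = classes(j);                 % class with the largest weighted vote
```

In the Statistics and Machine Learning Toolbox, fitcknn provides the same behavior through its 'DistanceWeight','inverse' option.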

3) Choosing a suitable distance metric

The influence of high dimensionality on the distance metric: the more variables there are, the weaker the discriminating power of the Euclidean distance;

The influence of variable ranges on the distance: variables with larger value ranges tend to dominate the distance calculation, so the variables should be standardized first (a sketch follows).
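A minimal standardization sketch (Xtest is an assumed m-by-d test matrix). Here the scaling is fitted on the training set and then applied to the test set; note that the program described in the second section instead normalizes the two sets together.

```matlab
% Sketch: z-score each variable so no single variable dominates the distance.
mu = mean(Xtrain, 1);
sigma = std(Xtrain, 0, 1);
Xtrain_n = (Xtrain - mu) ./ sigma;   % zero mean, unit variance per column
Xtest_n  = (Xtest  - mu) ./ sigma;   % apply the training-set scaling to the test set
```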

Commonly used distance metrics are shown in Figure 2.

Figure 2 Commonly used distance metrics (image not reproduced)
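The figure itself is not available here; for reference, the metrics such figures typically list belong to the Minkowski family:

$$ d(\mathbf{x}, \mathbf{y}) = \Big( \sum_{i=1}^{n} |x_i - y_i|^p \Big)^{1/p} $$

where p = 2 gives the Euclidean distance, p = 1 the Manhattan distance, and p → ∞ the Chebyshev distance.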

4) Should all training samples be treated equally?

Some samples in the training set may be more reliable than others. Different weights can be assigned to different samples, strengthening the influence of reliable samples and reducing the influence of unreliable ones.

5) Performance issues

kNN is a lazy algorithm: like a student who seldom studies during the term and crams only at exam time, it does no work during training and searches for the k neighbors only when a test sample must be classified;

The cost of this laziness: building the model is trivial, but classifying a test sample carries a high system overhead, because every training sample must be scanned and its distance computed;

There are already methods to improve computational efficiency, such as compressing the training sample set; another common option, indexing the training set with a kd-tree, is sketched below.
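A sketch of kd-tree-based search with the Statistics and Machine Learning Toolbox, continuing the assumed variable names from the earlier sketches:

```matlab
% Sketch: build a kd-tree once, then answer neighbor queries quickly.
searcher = KDTreeSearcher(Xtrain_n);          % built once, at 'training' time
idx = knnsearch(searcher, Xtest_n, 'K', k);   % k nearest training rows per test row
ypred = mode(ytrain(idx), 2);                 % majority vote along each row
```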

6) Can the training sample size be greatly reduced while maintaining classification accuracy?

Condensing: discard training samples that lie far from the decision boundary and do not affect any classification;

Editing: remove training samples that are misclassified by their own neighbors, which are likely noise (a sketch follows).
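As one concrete possibility, a minimal sketch of the editing idea (Wilson-style editing; the specific procedure is an assumption, since the post only names the technique):

```matlab
% Sketch of editing: drop training samples whose own k nearest neighbors
% (excluding the sample itself) vote for a different label.
idx = knnsearch(Xtrain_n, Xtrain_n, 'K', k + 1);  % first neighbor of each point is itself
pred = mode(ytrain(idx(:, 2:end)), 2);            % vote among the other k neighbors
keep = (pred == ytrain);                          % keep consistently labeled samples only
Xedit = Xtrain_n(keep, :);
yedit = ytrain(keep);
```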

2. MATLAB code implementation of the KNN algorithm

Collect 20 groups of each of 4 different signals, 80 groups in total; after feature extraction (8 features per group), an 80×8 feature matrix is obtained.

Divide the data set: use 64 groups as training data and 16 groups as test data, keeping the proportions of the 4 signal types the same in the training and test sets. Normalize the training set and the test set together as a whole, then use them as the input to KNN.

The source program uses the KNN algorithm to classify the test data after this joint normalization and reports the classification accuracy.

Name: multi-class classification based on the KNN algorithm in MATLAB (class determined by the voting method).
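The actual source is at the download link below and is not reproduced here; the following is only a minimal sketch of the pipeline as described, with assumed names (data, labels), an assumed k = 5, and the rows assumed to be grouped by class:

```matlab
% Sketch of the described pipeline (not the downloadable source).
% Assumed: data is the 80-by-8 feature matrix; labels is 80-by-1 with classes 1..4,
% 20 rows per class grouped together; k = 5 is an assumption (the post does not say).
k = 5;
X = normalize(data);                 % normalize training and test data as a whole
trainIdx = []; testIdx = [];
for c = 1:4                          % stratified split: 16 train + 4 test per class
    rows = find(labels == c);
    trainIdx = [trainIdx; rows(1:16)];
    testIdx  = [testIdx;  rows(17:20)];
end
ytr = labels(trainIdx);
idx = knnsearch(X(trainIdx, :), X(testIdx, :), 'K', k);  % k nearest training rows
ypred = mode(ytr(idx), 2);           % voting method over the neighbor labels
acc = mean(ypred == labels(testIdx));
fprintf('Classification accuracy: %.2f%%\n', 100 * acc);
```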

Source code download address: https://download.csdn.net/download/weixin_45317919/12850227


