Machine Learning [3]: KNN Algorithm Theory and Examples

1. Starting from an example

An example of movie classification:

(Figure: a table of movies with the number of fight scenes, the number of kiss scenes, and the movie genre; the last row is the record to be classified.)

This is a classification based on the number of fights and the number of kisses as features. The last record is the data to be classified.

One reason the KNN classification process is relatively simple is that it needs neither a model nor a training phase, which also makes it very easy to understand. Treating the number of fights and the number of kisses in the example as the x-axis and y-axis, it is easy to set up a two-dimensional coordinate system in which each record is a point. For an unknown point, look at its nearest neighbors: whichever category is most common among them is the category assigned to the unknown point.

2. Algorithm description

KNN (K-Nearest Neighbors): each sample can be represented by its k nearest neighbors.

The idea of the KNN algorithm:
KNN is a non-parametric machine learning algorithm (in machine learning, model parameters are learned during training, while hyperparameters are tuned manually; take care not to confuse the two). To use KNN, you must first have a labeled data set D. For any sample x with an unknown label, compute the distance between x and every sample point in D, take the k known samples closest to x, and let the labels of those k samples vote on x: whichever category receives the most votes is the category assigned to x. This is the general idea of KNN.

1. Hyperparameters: parameters that must be fixed before the model runs, such as the value of k and the distance metric in the k-nearest neighbor algorithm;
2. General ways of determining hyperparameters: domain knowledge, empirical values, and experimental search.

Note: Generally, the value of K is relatively small and will not exceed 20.

The general steps of the algorithm are (a code sketch follows the list):

  • Calculate the distance between the test data and each training data point;
  • Sort the distances in ascending order;
  • Choose the value of k and select the first k points;
  • Count the category frequencies among those k points;
  • Return the most frequent category among the k points as the predicted category of the test data.
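
As a concrete illustration of these steps, here is a minimal sketch using scikit-learn's KNeighborsClassifier (assuming scikit-learn is installed; the feature values below are made-up stand-ins for the fight/kiss counts of the opening example, not the original table):

# Minimal KNN sketch with scikit-learn; the data are illustrative placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Features: [number of fights, number of kisses]
X_train = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
y_train = ["Romance", "Romance", "Romance", "Action", "Action", "Action"]

clf = KNeighborsClassifier(n_neighbors=3)   # k is a hyperparameter
clf.fit(X_train, y_train)                   # fit() essentially just memorizes the training data

x_unknown = np.array([[18, 90]])            # the record to be classified
print(clf.predict(x_unknown))               # majority vote among the 3 nearest neighbors

Conceptually this performs exactly the five steps above: compute distances, sort, take the k closest, count labels, and return the majority.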

When to use the KNN algorithm?

The KNN algorithm can be used for both classification and regression. In industry, however, it is mainly used for classification problems. When evaluating an algorithm, we usually consider three aspects:

  • Model interpretability
  • Computation time
  • Predictive power
3. How to measure distance

The most critical part of the KNN algorithm is the calculation of distance. Generally speaking, a distance function d(x,y) must satisfy the following criteria (a quick numerical check follows the list):
1) Identity: d(x,x) = 0 // the distance from a point to itself is 0
2) Non-negativity: d(x,y) >= 0 // distances are never negative
3) Symmetry: d(x,y) = d(y,x) // if the distance from A to B is a, then the distance from B to A is also a
4) Triangle inequality: d(x,k) + d(k,y) >= d(x,y) // the sum of two sides of a triangle is at least the third side
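
These properties can be checked numerically for a concrete distance function; the following is a small sketch of my own (not from the original post) that verifies all four for the Euclidean distance on a few random points:

# Sketch: numerically check the four distance axioms for the Euclidean distance.
import numpy as np

rng = np.random.default_rng(0)
x, y, k = rng.random(5), rng.random(5), rng.random(5)

d = lambda a, b: np.linalg.norm(a - b)

assert np.isclose(d(x, x), 0)           # 1) identity
assert d(x, y) >= 0                     # 2) non-negativity
assert np.isclose(d(x, y), d(y, x))     # 3) symmetry
assert d(x, k) + d(k, y) >= d(x, y)     # 4) triangle inequality
print("All four axioms hold for these sample points.")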

There are many ways to calculate distance; they can roughly be divided into methods for discrete features and methods for continuous features. Below we introduce the measures commonly used for the similarity of continuous data.

Minkowski distance
(In physics, the Minkowski metric also refers to the space-time interval; here we mean the Minkowski distance of order p.)

Suppose P and Q are two points in n-dimensional space with coordinates P = (x1, x2, ..., xn) and Q = (y1, y2, ..., yn), and let p be a constant. The Minkowski distance is defined as:

d(P, Q) = \left( \sum_{u=1}^{n} |x_u - y_u|^p \right)^{1/p}

Note: u indexes the coordinates of x and y, u = 1, 2, ..., n.

[1] When p = 1, the Minkowski distance becomes the Manhattan distance;

[2] When p = 2, it becomes the Euclidean distance, the most commonly used distance;

[3] When p approaches infinity, it becomes the Chebyshev distance:

d(P, Q) = \max_{u} |x_u - y_u|
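
All three special cases can be computed directly with NumPy; the helper below is a sketch (the function name is my own) showing that Manhattan, Euclidean, and Chebyshev distance are all instances of the Minkowski formula:

# Sketch: the Minkowski distance and its special cases for two n-dimensional points.
import numpy as np

def minkowski(x, y, p):
    """(sum_u |x_u - y_u|^p)^(1/p)"""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

P = np.array([1.0, 2.0, 3.0])
Q = np.array([4.0, 0.0, 3.0])

print(minkowski(P, Q, 1))        # p = 1: Manhattan distance  -> 5.0
print(minkowski(P, Q, 2))        # p = 2: Euclidean distance  -> ~3.606
print(np.max(np.abs(P - Q)))     # p -> infinity: Chebyshev distance -> 3.0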

4. Selection of K value

The choice of K value will affect the results. A classic diagram is as follows:

(Figure: the classic illustration — blue squares and red triangles are labeled points, the green circle is the point to classify, with neighborhoods drawn for k = 3 and k = 5.)

The data set in the figure is already labeled: one class is blue squares, the other is red triangles, and the green circle is the data point to be classified.

  • When k = 3, red triangles are in the majority among the neighbors, so the point is classified as a red triangle.
  • When k = 5, blue squares are in the majority among the neighbors, so the point is classified as a blue square.

Summary:
K too large: the classification becomes blurred;
K too small: the result is swayed by individual points and fluctuates strongly.

If a smaller K value is chosen, it is equivalent to making the prediction with training examples from a smaller neighborhood. The approximation error of learning decreases, because only training data close to the input instance influence the prediction; the disadvantage is that the estimation error of learning increases, and the prediction becomes sensitive to the nearby instance points. If those neighbors happen to be noise, the prediction will be wrong. In other words, a smaller K means a more complex overall model that is prone to overfitting.
If a larger K value is chosen, it is equivalent to making the prediction with training examples from a larger neighborhood. The advantage is that the estimation error of learning decreases, but the approximation error increases, i.e., the prediction for the input instance becomes less accurate. A larger K means a simpler overall model.

Understanding:
Approximation error: can be understood as the training error on the existing training set.
Estimation error: can be understood as the test error on the test set.

The approximation error focuses on the training set. If k is small, overfitting occurs: the existing training set is predicted well, but predictions for unknown test samples deviate badly, and the model ends up far from the best attainable model.
The estimation error focuses on the test set. A small estimation error indicates good predictive ability on unknown data, i.e., the model is close to the best attainable model.
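
To make the two notions concrete, here is a small sketch of my own (using scikit-learn's bundled iris data purely as an example): with k = 1 every training point is its own nearest neighbor, so the training error is zero (minimal approximation error), while the held-out test score reflects the estimation error:

# Sketch: training vs. test accuracy for a very small and a larger k.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for k in (1, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:2d}  train acc={clf.score(X_tr, y_tr):.3f}  test acc={clf.score(X_te, y_te):.3f}")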

Parameter selection in statistical learning methods generally seeks a trade-off between bias and variance, and the choice of k in kNN is no exception. If k is small, e.g. k = 1, the classification result is easily corrupted by noise points and the variance is large; if k is large, e.g. k = N (where N is the number of training samples), every test sample gets the same result, namely the class with the most training samples, so the result is very stable but far from the true values and the bias is large. Therefore k should be neither too large nor too small. The usual approach is to evaluate a series of k values with cross validation (Cross Validation) and choose the best one as the training parameter.

In practice, K usually takes a relatively small value (typically less than 20), and cross validation is used to select the optimal K value, as sketched below.
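
The cross-validation procedure can be sketched as follows (again using scikit-learn and its bundled iris data purely for illustration; the scoring and the candidate range of k are assumptions, not part of the original post):

# Sketch: choose k by 5-fold cross validation over a range of small k values.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 21):                    # K is usually kept below 20
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, mean accuracy = {scores[best_k]:.3f}")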

5. Advantages and disadvantages of KNN

Advantages:

1. Simple, easy to understand and easy to implement;

2. Only the training samples and their labels need to be stored; there is no parameter estimation and no training;

3. Not very sensitive to small error probabilities: theory shows that the asymptotic error rate of the nearest-neighbor rule is at worst no more than twice the Bayes error rate, and at best approaches or reaches the Bayes error rate;

4. Because KNN relies mainly on a limited number of nearby samples rather than on a discriminant over class regions, it is better suited than other methods to sample sets whose class regions overlap or intersect heavily;

5. The algorithm is better suited to the automatic classification of classes with large sample sizes; classes with small sample sizes are more prone to misclassification.

Disadvantages:

1. The choice of K is not fixed;

2. When the samples are imbalanced, the prediction accuracy for rare categories is low;

3. The prediction result is easily affected by noisy data;

4. It is a lazy learning method that does essentially no learning up front, so prediction is slower than with algorithms such as logistic regression;

5. When the number of features is very large, the computational complexity and memory consumption are high, because for each item to be classified the distance to every known sample must be computed in order to find its K nearest neighbors.

6. The improvement strategy of KNN algorithm

6.1 From the perspective of reducing computational complexity

When the sample size is large and there are many feature attributes, the classification efficiency of the KNN algorithm drops sharply. Possible improvements include the following.

(1) Perform feature selection. Before applying KNN, reduce the feature attributes by deleting features that have little (or no important) influence on the classification result; this speeds up KNN classification.

(2) Reduce the size of the training set. Delete samples from the original training set that are irrelevant to classification.

(3) Use clustering and take the resulting cluster centers as the new training samples, as sketched below.
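
A possible sketch of strategy (3): replace each class's samples with a few k-means cluster centers and run KNN on the much smaller prototype set (the per-class grouping and the number of centers are my own assumptions for illustration):

# Sketch: condense the training set by clustering each class separately.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def condense_by_clustering(X, y, centers_per_class=3, random_state=0):
    proto_X, proto_y = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        km = KMeans(n_clusters=centers_per_class, n_init=10,
                    random_state=random_state).fit(Xc)
        proto_X.append(km.cluster_centers_)     # cluster centers become the new samples
        proto_y.extend([label] * centers_per_class)
    return np.vstack(proto_X), np.array(proto_y)

# Usage (X_train, y_train are assumed to exist):
# proto_X, proto_y = condense_by_clustering(X_train, y_train)
# clf = KNeighborsClassifier(n_neighbors=3).fit(proto_X, proto_y)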

6.2 From the perspective of optimizing similarity measurement methods

Many KNN implementations compute sample similarity with the Euclidean distance, but this method is very sensitive to noisy features. To overcome the traditional KNN algorithm's treatment of all features as equally important, different features can be given different weights in the distance formula used to measure similarity. The feature weights are generally set according to each feature's role in classification, and there are many ways to compute them, for example via information gain. In addition, different similarity measures can be used for different feature types to better reflect the similarity between samples.
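
A minimal sketch of a feature-weighted Euclidean distance (the weights here are placeholders; in practice they would come from something like information gain, as mentioned above):

# Sketch: Euclidean distance with per-feature weights.
import numpy as np

def weighted_euclidean(x, y, w):
    """Each feature contributes to the distance in proportion to its weight."""
    x, y, w = np.asarray(x), np.asarray(y), np.asarray(w)
    return np.sqrt(np.sum(w * (x - y) ** 2))

# Example: the second feature is treated as twice as important as the first.
print(weighted_euclidean([1.0, 2.0], [3.0, 5.0], w=[1.0, 2.0]))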

6.3 From the perspective of optimizing decision strategies

The weakness of the traditional KNN decision rule is that when the sample distribution is uneven (the numbers of training samples per class are unbalanced, or the class regions differ in size even when the counts are close), considering only the order of the first k nearest neighbors while ignoring their distances makes the classification inaccurate. Many remedies have been adopted; for example, the sample distribution density can be made more uniform.
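
One common variant of the decision rule, shown here only as a sketch of the idea, is to weight each neighbor's vote by the inverse of its distance instead of counting all k neighbors equally (scikit-learn exposes the same behavior as KNeighborsClassifier(weights='distance')):

# Sketch: distance-weighted voting among the k nearest neighbors.
import numpy as np

def weighted_vote(distances, labels, k, eps=1e-8):
    """Each of the k nearest neighbors votes with weight 1 / (distance + eps)."""
    order = np.argsort(distances)[:k]
    votes = {}
    for idx in order:
        w = 1.0 / (distances[idx] + eps)          # closer neighbors count more
        votes[labels[idx]] = votes.get(labels[idx], 0.0) + w
    return max(votes, key=votes.get)

# Example: one very close 'B' outweighs two farther 'A' neighbors.
print(weighted_vote(np.array([0.1, 2.0, 2.1, 5.0]), ['B', 'A', 'A', 'A'], k=3))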

6.4 From the perspective of selecting the appropriate K value

Most of the computation in the KNN algorithm happens in the classification stage, and the classification quality depends largely on the choice of K. So far there is no mature method or theory to guide this choice, and in most cases the value of K has to be tuned by experimentation.

7. Example demonstration

Here is a simple hand-made example to illustrate the flow of the KNN algorithm.

# coding=utf-8
"""
@Project: 课堂小练习
@Author: 王瑞
@File: KNN1.py
@IDE: PyCharm
@Time: 2020-10-21 20:39:20
"""
import numpy as np
# Training data and the corresponding class labels
def createDataSet():
    group = np.array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# Classify a sample with KNN
def classify(input, dataSet, labels, k):
    datasize = dataSet.shape[0]
    # Compute the Euclidean distance to every training sample
    dis = np.zeros(datasize, dtype=float)
    for i in range(datasize):
        dis[i] = np.linalg.norm(input - dataSet[i])

    # Sort the indices by distance (ascending)
    sortedindex = np.argsort(dis)

    # Count the label votes among the k nearest neighbors
    classcount = {}
    for i in range(k):
        vote = labels[sortedindex[i]]
        classcount[vote] = classcount.get(vote, 0)+1

    # Sort the vote counts in descending order and return the winning label
    sortedclass = sorted(classcount.items(), key=lambda x: (x[1]), reverse=True)
    return sortedclass[0][0]
    return sortedclass[0][0]


dataSet, labels = createDataSet()
input = np.array([1.1, 0.3])
k = 3
output = classify(input, dataSet, labels, k)
print(f"测试数据为:{input},分类结果为:{output}")

output:

D:\Python\Python37\python.exe E:/PyDataFile/课堂小练习/KNN1.py
Test data: [1.1 0.3], classification result: A
Process finished with exit code 0

Origin blog.csdn.net/qq_46009608/article/details/109189728