Classification problem study notes: KNN principle

Nearest Neighbor Algorithm (KNN)

Case:

The guiding idea of the kNN algorithm is the proverb "he who stays near vermilion turns red, and he who stays near ink turns black": your category can be inferred from your neighbors.
As the saying goes, "birds of a feather flock together". Imagine two people, A and B. A lives in a luxury Tomson mansion, while B lives in an old, run-down place in the suburbs. Our most intuitive judgment is that A is very likely wealthy, while B probably is not. Even though we have never seen A's or B's bank balance, we can still make a judgment from the fact that A lives in a wealthy community. This is exactly the idea behind the KNN algorithm: "whoever you live close to, you are likely to be the same type of person as them".

Definition (from Wikipedia):

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method proposed by Thomas Cover used for classification and regression.[1] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

  • In k-NN classification, the output is a class membership. An object
    is classified by a plurality vote of its neighbors, with the object
    being assigned to the class most common among its k nearest neighbors
    (k is a positive integer, typically small). If k = 1, then the object
    is simply assigned to the class of that single nearest neighbor.
  • In k-NN regression, the output is the property value for the object.
    This value is the average of the values of k nearest neighbors.

Principle:

The principle of the KNN algorithm in one sentence: find the K samples closest to the new data point, and take the category that appears most often among those samples as the category of the new data point.
Quoting a figure from Wikipedia:
[Figure: KNN classification example from Wikipedia]
As shown in the figure above, there are two classes of sample data, represented by small blue squares and small red triangles; the point marked by the green circle in the middle is the data to be classified.

  • If K = 3, the 3 points nearest to the green circle are 2 red triangles and 1 blue square. By majority vote, the point to be classified is assigned to the red triangle class.
  • If K = 5, the 5 nearest neighbors of the green circle are 2 red triangles and 3 blue squares. Again by majority vote, the point to be classified is assigned to the blue square class.

Therefore, there are two main details: the choice of the K value and the calculation of the distance between points.
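To make this concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier; the coordinates and labels below are made up so that, as in the figure, the vote flips between K = 3 and K = 5:

# Toy layout mirroring the figure: the 3 nearest neighbors of the query are
# 2 triangles + 1 square, while the 5 nearest are 2 triangles + 3 squares.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 0.0], [0.0, 1.5],                 # red triangles
              [2.0, 0.0], [0.0, 2.5], [-3.0, 0.0]])   # blue squares
y = np.array(['triangle', 'triangle', 'square', 'square', 'square'])
query = np.array([[0.0, 0.0]])                        # the green circle

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.predict(query))   # K=3 -> 'triangle', K=5 -> 'square'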

Advantages and disadvantages:

Advantages:

1. Simple and easy to implement: KNN does not actually build an abstract model at all; the entire data set is used directly as the model. When a new data point arrives, it is compared against every sample in the data set. This shows one of KNN's advantages: the algorithm is so simple that no training is needed. Once the sample data is organized, you can already make predictions for new data.
2. Works well for data with irregular boundaries: you can picture the prediction as drawing a circle centered on the unknown point and expanding it until it contains K samples. For data with irregular boundaries this works better than a linear classifier, because a linear classifier essentially draws a line to separate the classes, and for irregular data it is hard to find a single line that splits it into two sides.

Disadvantages:

1. Only suitable for small data sets: precisely because the algorithm is so simple, the entire data set must be scanned every time a new point is predicted, so on a large data set prediction is very slow and the storage cost is very high.
2. Poor performance on imbalanced data: if some classes have far more samples than others, the method degrades, because the over-represented classes have an unfair advantage in the final vote.
3. Data standardization is required: since prediction relies on distances, features with larger numeric ranges dominate the result when the dimensions are measured on different scales, so the data must be standardized, for example by scaling every feature to the interval [0, 1] (see the sketch after this list).
4. Not suitable for data with many feature dimensions: since KNN is limited to small data sets, when the data has many dimensions the samples become very sparse in every dimension. For example, three samples spread along a single dimension are much denser, and therefore easier to compare, than the same three samples spread across three dimensions.
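As a sketch of the standardization point (item 3), one common approach is min-max scaling with scikit-learn's MinMaxScaler; the feature values below are made up for illustration:

# Scale every feature to [0, 1] before computing distances. Without scaling,
# the "income" column would dominate the distance because its numeric range
# is much larger than that of the "age" column.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[25, 30000.0],
              [40, 80000.0],
              [35, 52000.0]])   # columns: age, income (made-up values)

X_scaled = MinMaxScaler().fit_transform(X)   # every column now lies in [0, 1]
print(X_scaled)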

Distance calculation:

The typical application scenarios of the various "distances" can be summarized briefly as:
  • Space: Euclidean distance
  • Paths: Manhattan distance
  • The chess king's moves: Chebyshev distance
  • A unified form of the above three: Minkowski distance
  • Weighting the dimensions: standardized Euclidean distance
  • Removing the effects of scale and correlation: Mahalanobis distance
  • Gap between vectors: cosine of the angle
  • Coding differences: Hamming distance
  • Set similarity: Jaccard similarity coefficient and distance
  • Correlation: correlation coefficient and correlation distance

(Interested readers can look these up on their own.)
The most commonly used is the Euclidean distance.
Taking the two-dimensional plane as an example, the Euclidean distance between two points is:

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

This is familiar from junior high school: it is simply the distance between the points (x1, y1) and (x2, y2). Extended to an n-dimensional space, the formula becomes:

d(a, b) = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}

The simplest, brute-force KNN algorithm computes the distance from the point to be predicted to every point in the data set, stores and sorts the distances, and takes the first K values to see which category is most common among them.
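As a quick sanity check of the formula (with arbitrary vectors), the manual computation agrees with NumPy's built-in norm:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

manual = np.sqrt(np.sum((a - b) ** 2))   # square root of the summed squared differences
builtin = np.linalg.norm(a - b)          # NumPy's Euclidean norm of the difference
print(manual, builtin)                   # both print 5.0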
Further reading: the Minkowski distance. For two n-dimensional vectors a and b it is defined as:

d(a, b) = \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/p}

  • When p=1, it is the Manhattan distance
  • When p=2, it is the Euclidean distance
  • When p→∞, it is the Chebyshev distance

Depending on the value of the parameter p, the Minkowski distance can represent a whole family of distances.
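This family can be explored directly with SciPy's distance functions in scipy.spatial.distance (the vectors here are arbitrary):

from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

print(distance.minkowski(a, b, p=1))   # 7.0, the Manhattan (cityblock) distance
print(distance.minkowski(a, b, p=2))   # 5.0, the Euclidean distance
print(distance.chebyshev(a, b))        # 4.0, the limit as p grows large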
Standardized Euclidean distance: the standardized Euclidean distance is an improvement that addresses a shortcoming of the plain Euclidean distance. The idea: since the components of the data may follow different distributions in each dimension, first "standardize" every component so that all of them have the same mean and variance.
Assume the mean of the sample set X is m and its standard deviation (the square root of the variance) is s. Then the "standardized variable" X* of X is (X − m)/s, and the standardized variable has mean 0 and variance 1.
That is, the standardization of the sample set is described by the formula:

X^{*} = \frac{X - m}{s}

Standardized value = (value before standardization − mean of the component) / standard deviation of the component
A simple derivation then gives the formula for the standardized Euclidean distance between two n-dimensional vectors a = (x11, x12, ..., x1n) and b = (x21, x22, ..., x2n):

d(a, b) = \sqrt{\sum_{k=1}^{n} \left( \frac{x_{1k} - x_{2k}}{s_k} \right)^2}

where s_k is the standard deviation of the k-th component.
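As a small sketch, SciPy also provides this distance as scipy.spatial.distance.seuclidean, which takes the per-component variances; the sample below is made up:

import numpy as np
from scipy.spatial import distance

# Made-up sample whose second feature has a much larger spread than the first
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 300.0]])
variances = X.var(axis=0, ddof=1)        # per-component variances s_k^2

a, b = X[0], X[1]
print(distance.seuclidean(a, b, variances))
print(np.sqrt(np.sum((a - b) ** 2 / variances)))   # same value via the formula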

Choice of K value:

The K value is generally chosen by cross-validation (the sample data is split into training data and validation data in some ratio, for example 6:4). Start from a small K value, keep increasing K, compute the error on the validation set each time, and finally settle on a suitable K. After computing the validation error by cross-validation, you will roughly get a curve like the following:
[Figure: validation error rate plotted against K — the error first falls as K increases and then rises again once K becomes too large]

As K increases, the error rate generally falls at first, because more surrounding samples take part in the vote and the classification improves. But when K becomes too large, the error rate rises again. For example, if you have only 35 samples in total, then by the time K has grown to 30, KNN is basically meaningless. So when choosing K, pick a value near this critical point, where either increasing or decreasing K further makes the error rate go up.
Rule of thumb: K is generally taken to be no larger than the square root of the number of training samples.
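A minimal sketch of this search using scikit-learn's cross_val_score (the range 1–25 for K is an arbitrary choice for illustration):

# Pick K by cross-validated accuracy on the iris data set.
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
best_k, best_score = None, 0.0
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation; higher mean accuracy means a lower error rate
    score = cross_val_score(knn, iris.data, iris.target, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)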

To summarize, the idea of the KNN algorithm is:

Given a training set whose data and labels are known, take a test sample, compare its features with the corresponding features of the training samples, and find the K training samples most similar to it; the predicted category of the test sample is then the category that occurs most often among those K samples. The algorithm is:
1) Calculate the distance between the test sample and every training sample;
2) Sort the distances in increasing order;
3) Select the K points with the smallest distances;
4) Count how often each category occurs among these K points;
5) Return the most frequent category among the K points as the predicted class of the test sample.
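A from-scratch sketch that follows steps 1)–5) above, using only NumPy and the standard library (the function name knn_predict is just illustrative):

import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, test_point, k):
    # train_x: (n_samples, n_features) array; train_y: array of labels
    # 1) distance between the test sample and every training sample
    dists = np.linalg.norm(train_x - test_point, axis=1)
    # 2) + 3) sort by distance and keep the K closest points
    nearest_idx = np.argsort(dists)[:k]
    # 4) count how often each category appears among those K points
    votes = Counter(train_y[nearest_idx])
    # 5) return the most frequent category as the prediction
    return votes.most_common(1)[0][0]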

Python iris example:

from sklearn import datasets 
# The KNN classifier class from the sklearn module
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
np.random.seed(0)
# Set the random seed. If it is not set, the system time is used by default;
# setting it guarantees that the same random numbers are produced on every run.

iris=datasets.load_iris() # load the iris data set
iris_x=iris.data # the feature part
iris_y=iris.target # the label part
# Of the 150 samples, take 140 as the training set and 10 as the test set.
# permutation takes a number as its argument (here the data set length, 150)
# and produces a shuffled one-dimensional array of the integers 0-149.
randomarr= np.random.permutation(len(iris_x))
iris_x_train = iris_x[randomarr[:-10]] # training data
iris_y_train = iris_y[randomarr[:-10]] # training labels
iris_x_test = iris_x[randomarr[-10:]] # test data
iris_y_test = iris_y[randomarr[-10:]] # test labels
# Create a KNN classifier object
knn = KNeighborsClassifier()
# Call its fit method, which mainly takes two arguments:
# the training data set and its class labels
knn.fit(iris_x_train, iris_y_train)
# Call the predict method, which mainly takes one argument: the test data set
iris_y_predict = knn.predict(iris_x_test)
# Compute the predicted probabilities for each test sample. The probabilities are
# not used here, but in practice they are often consulted to filter the final
# results instead of relying only on the predicted labels.
probility=knn.predict_proba(iris_x_test)
# Find the 5 points closest to the last test sample; the return value contains
# the distances to those samples and their indices.
neighborpoint=knn.kneighbors([iris_x_test[-1]],5)
# Call the score method to compute the accuracy
score=knn.score(iris_x_test,iris_y_test,sample_weight=None)
# Print the prediction results
print('iris_y_predict = ')
print(iris_y_predict)
# Print the true labels of the test set for comparison
print('iris_y_test = ')
print(iris_y_test)
# Print the accuracy
print('Accuracy:',score)
"""
Output:
iris_y_predict = 
[1 2 1 0 0 0 2 1 2 0]
iris_y_test = 
[1 1 1 0 0 0 2 1 2 0]
Accuracy: 0.9
"""
As you can see, the model's accuracy is 0.9; the second test sample was predicted incorrectly.

Source: blog.csdn.net/Pioo_/article/details/109723433