K nearest neighbor algorithm - KNN
Machine Learning - K Nearest Neighbor Algorithm (KNN)
Basic knowledge
Principle
Given a test sample, find out the k closest training samples in the training set based on some distance measure, and then make predictions based on the information of these k "neighbors".
——Zhou Zhihua, Watermelon Book
There is a sample data set, also called the training sample set, in which every record has a label; that is, we know which category each record in the sample set belongs to. When new data without a label is input, each feature of the new data is compared with the corresponding features of the records in the sample set, and the algorithm then extracts the classification labels of the k most similar records (the nearest neighbors). ——Machine Learning in Action
My own understanding: there is a pile of well-labeled training samples; you hand over a new sample for prediction, and its category is decided by the labels of the k training points nearest to it.
Example
After reading the principle, you should have a basic understanding of the KNN algorithm, so take a look at the example below! (The setup follows the textbook; the data is made up by me, purely to illustrate the algorithm!)
Number of fight scenes, number of kissing scenes, and type of each movie
Movie title | Fight scenes | Kissing scenes | Movie type |
---|---|---|---|
love before dawn | 3 | 104 | Romance |
Throbbing | 2 | 100 | Romance |
Listen attentively | 1 | 81 | Romance |
Luo Xiaohei Ji Ji | 101 | 5 | Action movie |
assembly number | 99 | 2 | Action movie |
doomsday war | 98 | 2 | Action movie |
? | 18 | 90 | unknown |
From the table above, we can plot the 6 movies whose type we already know on a coordinate diagram:
Then use the distance formula to find the k points nearest to " ? ", and judge the movie type of " ? " from these k points. Here the k points closest to it are clearly romance movies, so it is classified as a romance movie.
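To make the calculation concrete, here is a small sketch that ranks the six known movies by their distance to " ? ". The value k = 3 is my own assumption (the table does not fix one), and `math.dist` requires Python 3.8 or later.

```python
import math

# Training data from the table: (fight scenes, kissing scenes) -> movie type
movies = [
    ((3, 104), "Romance"),
    ((2, 100), "Romance"),
    ((1, 81), "Romance"),
    ((101, 5), "Action movie"),
    ((99, 2), "Action movie"),
    ((98, 2), "Action movie"),
]
unknown = (18, 90)
k = 3  # assumed value, chosen only for illustration

# Euclidean distance from the unknown movie to every known movie, smallest first
ranked = sorted((math.dist(unknown, point), label) for point, label in movies)
for distance, label in ranked:
    print(f"{distance:7.2f}  {label}")

# Labels of the k nearest movies
nearest_labels = [label for _, label in ranked[:k]]
print(nearest_labels)
```

The three romance movies are all roughly 19 to 21 units away, while the action movies are more than 100 units away, so all three of the nearest labels are Romance.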
Next, let's look at it graphically:
The orange squares and blue triangles are our training samples, and the green circle is the sample we need to predict. The two circles in the figure (drawn by me) are the smallest circles around the sample to be tested that enclose its k nearest training samples. For k=1 and k=3 the results are different: when k=1, the predicted result is a square, and when k=3, the predicted result is a triangle. Different values of k clearly have a great influence on the prediction, so how should the value of k be chosen? The figure also raises another question: why is k an odd number, and why not an even number?
Basic Questions About KNN
How is the distance calculated?
When I first saw this algorithm, my first question was how the "shortest distance" is actually computed. I drew a blank: how is the distance calculated, surely not just judged by eye? (Then I realized I am simply getting old and rusty.)
Euclidean distance: the straight-line distance between two points.
Formula:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Of course, with this formula you need to calculate the distance between the sample to be tested and every training sample, keep the k samples with the smallest distances, and use the labels of those k samples to predict the label of the sample to be tested.
Manhattan distance: also known as city block distance. It is the sum of the absolute differences between the coordinates of two points along each axis.
Formula:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Manhattan distance is generally considered more suitable for classification problems with higher dimensions (more features).
In practice, Euclidean distance is used most of the time. After all, it is simple and direct, and most importantly, we all understand it!
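As a quick illustration of the two distance measures, here is a minimal sketch that works for points with any number of features. The function names `euclidean_distance` and `manhattan_distance` are my own, not taken from any library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences (city block distance)."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean_distance(p, q))  # 5.0 (the classic 3-4-5 triangle)
print(manhattan_distance(p, q))  # 7   (3 steps along x plus 4 steps along y)
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, so the two measures can rank neighbors differently.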
Let me tell you which method I personally find most intuitive:
Take the sample to be tested as the center of a circle, start from a small radius, and gradually enlarge the radius until the number of training samples inside the circle is >= k; then predict the type of the sample to be tested from the labels of the training samples inside the circle.
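The growing-circle idea can be sketched roughly as follows. The function name, `start_radius`, and `step` are hypothetical choices of mine, and the points are the six movies from the example table.

```python
import math

def neighbors_by_growing_radius(query, points, k, start_radius=1.0, step=1.0):
    """Enlarge a circle around `query` until it holds at least k points.

    Returns the final radius and the indices of the points inside the circle.
    start_radius and step are hypothetical parameters chosen for illustration.
    """
    if k > len(points):
        raise ValueError("k cannot exceed the number of training points")
    distances = [math.dist(query, p) for p in points]
    radius = start_radius
    while True:
        inside = [i for i, d in enumerate(distances) if d <= radius]
        if len(inside) >= k:
            return radius, inside
        radius += step

points = [(3, 104), (2, 100), (1, 81), (101, 5), (99, 2), (98, 2)]
radius, inside = neighbors_by_growing_radius((18, 90), points, k=3)
print(radius, inside)
```

Unlike sorting all distances and cutting off at k, the circle can end up containing more than k samples when several of them enter in the same step, which is why the stopping condition is ">= k" rather than "== k".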
How should the value of k be chosen?
From the square and triangle case in the example, we can see that different values of k lead to different results, so the generalization ability of KNN is relatively poor. After all, unlike other algorithms, it has no learning (training) process.
k value | Influence |
---|---|
Too large | Predictions are stable but overly smooth; the classification becomes blurred because distant samples also get a vote |
Too small | Prone to overfitting; too sensitive to the individual neighboring samples |
The usual advice found online is: search for the best k through cross-validation. Start with a small k, keep increasing it, calculate the error on the validation set each time, and finally pick the k that performs best.
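Here is a hedged sketch of that search using leave-one-out cross-validation, one simple variant of cross-validation that I picked because the example data set is tiny. The helper names `knn_predict` and `leave_one_out_accuracy` are my own, and the data is the movie table from the example.

```python
import math
from collections import Counter

def knn_predict(query, x_train, y_train, k):
    """Majority label among the k training points closest to `query`."""
    order = sorted(range(len(x_train)), key=lambda i: math.dist(query, x_train[i]))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(x, y, k):
    """Hold out each sample once, predict it from the rest, return the hit rate."""
    hits = 0
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        hits += knn_predict(x[i], x_rest, y_rest, k) == y[i]
    return hits / len(x)

# Movie data from the example table
x = [(3, 104), (2, 100), (1, 81), (101, 5), (99, 2), (98, 2)]
y = ["Romance"] * 3 + ["Action movie"] * 3

# Try increasing odd values of k and keep the one with the best accuracy
scores = {k: leave_one_out_accuracy(x, y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this tiny data set k=5 necessarily fails: holding out one sample leaves only two samples of its own class against three of the other, so the vote always goes the wrong way. That is the "k too large" row of the table in miniature.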
Why is k not defined as an even number?
An odd k is chosen purely to avoid ties. Every training sample in KNN belongs to exactly one class; there is no "neither" and no "both". With an odd k, a tied vote is impossible. (Of course, this is only true for binary classification! With more classes, k has to be chosen with the number of classes in mind. For three classes, for example, values such as 4 or 7 rule out a three-way tie, although two classes can still tie, so a tie-breaking rule is still needed. In short, the goal is to avoid deadlocked votes.)
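A small sketch of how a vote can detect and resolve a tie. The rule used here (the closest neighbor among the tied classes wins) is only one possible choice and is my own assumption; the function name `vote` is hypothetical as well.

```python
from collections import Counter

def vote(labels_nearest_first):
    """Majority vote; a tie is resolved in favour of the closest neighbor.

    `labels_nearest_first` holds the labels of the k nearest neighbors,
    ordered from closest to farthest. The tie-break rule is an assumption
    made for this sketch.
    """
    counts = Counter(labels_nearest_first)
    top = max(counts.values())
    tied = {label for label, n in counts.items() if n == top}
    # The first label in distance order that belongs to the tied set wins
    winner = next(label for label in labels_nearest_first if label in tied)
    return winner, len(tied) > 1

print(vote(["jmu", "unjmu", "jmu"]))           # odd k, two classes: no tie possible
print(vote(["unjmu", "jmu", "jmu", "unjmu"]))  # even k, two classes: 2-2 tie
print(vote(["C", "A", "A", "C"]))              # k=4 with classes A, B, C: A and C tie
```

The second element of the returned tuple reports whether a tie occurred. The last call shows that with more than two classes a tie can occur even when k is not a multiple of the number of classes.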
Advantages and disadvantages of KNN
Let's first look at the general process of KNN, as follows:
- Collect data: any method
- Prepare data: a structured data format; for the two-feature binary classification here, this means the coordinates (x, y) of each training sample
- Analyze data: any method
- Train the algorithm: not applicable! KNN has no training step
- Test the algorithm: calculate the error rate
- Use the algorithm: first input the sample data, run the KNN algorithm to determine which category each input sample belongs to, and then process the structured output.
Advantages | Disadvantages |
---|---|
High accuracy | No training process |
Insensitive to outliers | High computational complexity |
No assumptions about the input data | High space complexity |
Code
First version (2022.10.25)
Data collection, data processing, and the code are described below:
The main campus of Jimei University (JMU) and its surroundings were captured from Baidu Maps, and the data is divided into two parts according to the figure below. Samples inside the JMU campus are labeled jmu, and samples outside the campus are labeled unjmu. The task is to judge from the horizontal and vertical coordinates of a point whether it lies inside or outside the JMU main campus.
Training set:
Building selected on the map | Custom coordinates | Label |
---|---|---|
Yuzhou | (3,85) | jmu |
grandeur | (15,70) | jmu |
Lu Da | (7,58) | jmu |
Lu Zhenwan | (17,62) | jmu |
Atour Hotel | (33,28) | unjmu |
Kah Kee Library | (30,100) | jmu |
Wanda | (10,10) | unjmu |
Zhou Mapo | (2,1) | unjmu |
Xinjie Auto Repair | (45,31) | unjmu |
Jimei District Government | (50,40) | unjmu |
Guangsha Garden | (53,55) | unjmu |
Jimei Radio and Television | (60,58) | unjmu |
Earthquake Bureau | (52,15) | unjmu |
Test set:
Location | label |
---|---|
(5,7) | unjmu |
(10,100) | jmu |
(19,49) | jmu |
(35,40) | unjmu |
Without further ado, here is the code:
import numpy as np
import math

class KNN:
    def __init__(self, x_train, x_test, k):
        # Distances between each test point and all training samples
        self.distance = np.zeros((len(x_test), len(x_train)))
        # Prediction results
        self.predicted = []
        # Value of k in KNN (see the basics above if this is unclear)
        self.k = k

    # Core KNN algorithm
    def knn(self, x_test, x_train, y_train):
        for i in range(len(x_test)):
            for j in range(len(x_train)):
                self.distance[i][j] = self.knn_distance(x_test[i], x_train[j])
            self.predicted.append(self.knn_predicted(self.distance[i], y_train))
        return self.predicted

    # Euclidean distance between two 2-D points
    def knn_distance(self, x1, x2):
        dis = math.sqrt(math.pow((x1[0]-x2[0]), 2) + math.pow((x1[1]-x2[1]), 2))
        return dis

    def knn_predicted(self, distances, y_train):
        # Use numpy's argsort to get the indices of the k smallest distances
        k_predicted_index = distances.argsort()[:self.k]
        # I am not yet familiar with the relevant library functions,
        # so the vote is counted by hand below
        count_jmu = 0
        count_other = 0
        for i in range(len(k_predicted_index)):
            if y_train[k_predicted_index[i]] == 'jmu':
                count_jmu += 1
            else:
                count_other += 1
        if count_jmu > count_other:
            return 'jmu'
        else:
            return 'unjmu'

# Custom training data set
x_train = [[3, 85], [15, 70], [7, 58], [17, 62], [33, 28], [30, 100], [10, 10], [2, 1], [45, 31], [50, 40], [53, 55], [60, 58], [52, 15]]
y_train = ['jmu', 'jmu', 'jmu', 'jmu', 'unjmu', 'jmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu']
# Custom test data set (labels as in the test set table above)
x_test = [[5, 7], [10, 100], [19, 49], [35, 40]]
y_test = ['unjmu', 'jmu', 'jmu', 'unjmu']
# Set k for KNN
k = 3
knn = KNN(x_train, x_test, k)
# Get the predictions for the test set
pred = knn.knn(x_test, x_train, y_train)
print(pred)
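The printed predictions for the four test points are ['unjmu', 'jmu', 'jmu', 'unjmu'], which matches the labels in the test set table.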
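The class above is tied to two coordinates and to the two labels jmu and unjmu. A more general variant could look like the sketch below, which accepts any number of features and labels. The name `knn_classify` is hypothetical, and it relies only on the standard library (`math.dist` from Python 3.8 and `collections.Counter`) rather than on the method used above.

```python
import math
from collections import Counter

def knn_classify(x_test, x_train, y_train, k):
    """Predict a label for every test point from its k nearest training points."""
    predictions = []
    for query in x_test:
        # Indices of the training points, closest first
        order = sorted(range(len(x_train)), key=lambda j: math.dist(query, x_train[j]))
        # Count the labels of the k nearest points and take the most common one
        votes = Counter(y_train[j] for j in order[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions

x_train = [[3, 85], [15, 70], [7, 58], [17, 62], [33, 28], [30, 100], [10, 10],
           [2, 1], [45, 31], [50, 40], [53, 55], [60, 58], [52, 15]]
y_train = ['jmu', 'jmu', 'jmu', 'jmu', 'unjmu', 'jmu', 'unjmu',
           'unjmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu']
x_test = [[5, 7], [10, 100], [19, 49], [35, 40]]

pred = knn_classify(x_test, x_train, y_train, k=3)
print(pred)
```

Because the vote is a `Counter`, adding a third label or a third coordinate needs no code changes. Ties between labels are not handled specially in this sketch.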
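Enhanced version (2022.10.28)
Data set:
Link: https://pan.baidu.com/s/1yrDGiK9yXFxB_JyC3Q5ycg
Extraction code: 1234
If the description or the code above is not clear enough, read on. With the code above, switching to a different data set is difficult and the code is hard to modify, so I refined it as follows:
Code:
First, Python is a typical black box: we need to import the libraries that provide the methods we want to call.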
import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
from sklearn.model_selection import train_test_split
Then, write the main KNN class following the idea of the algorithm:
class KNN:
    def __init__(self, x_train, x_test, k):
        # Distances between test points and training samples
        self.distance = np.zeros((len(x_test), len(x_train)))
        # Prediction results
        self.predicted = []
        # Value of k in KNN
        self.k = k

    # Main KNN function
    def knn(self, x_test, x_train, y_train):
        for i in range(len(x_test)):
            for j in range(len(x_train)):
                self.distance[i][j] = self.knn_distance(x_test[i], x_train[j])
            self.predicted.append(self.knn_predicted(self.distance[i], y_train))
        return self.predicted

    # Euclidean distance between two 2-D points
    def knn_distance(self, x1, x2):
        dis = math.sqrt(math.pow((x1[0]-x2[0]), 2) + math.pow((x1[1]-x2[1]), 2))
        return dis

    # Predict the label from the k nearest training samples
    def knn_predicted(self, distances, y_train):
        k_predicted_index = distances.argsort()[:self.k]
        count_jmu = 0
        count_other = 0
        for i in range(len(k_predicted_index)):
            if y_train[k_predicted_index[i]] == 'jmu':
                count_jmu += 1
            else:
                count_other += 1
        if count_jmu > count_other:
            return 'jmu'
        else:
            return 'unjmu'
Plot the sample distribution of the training set and the test set to inspect the predictions visually:
# Plot the distribution of the data set
def paint(x_train, y_train, x_test):
    x_train = np.array(x_train)
    y_train = np.array(y_train)
    x_test = np.array(x_test)
    # Split the training samples into jmu points and unjmu points by label
    is_jmu = (y_train == 'jmu')
    # Red: jmu samples, green: unjmu samples, blue: test samples
    plt.scatter(x_train[~is_jmu, 0], x_train[~is_jmu, 1], color='g')
    plt.scatter(x_train[is_jmu, 0], x_train[is_jmu, 1], color='r')
    plt.scatter(x_test[:, 0], x_test[:, 1], color='b')
    # Needed when running as a script; harmless in a notebook
    plt.show()

# Data processing: convert the data read from the csv file into lists
def data_tolist(x_train, x_test, y_train, y_test):
    x_train = np.array(x_train).tolist()
    y_train = np.array(y_train).tolist()
    x_test = np.array(x_test).tolist()
    y_test = np.array(y_test).tolist()
    return x_train, x_test, y_train, y_test

# Compute the accuracy
def predicted(pred, y_test):
    count = 0
    for i in range(len(pred)):
        if y_test[i] == pred[i]:
            count += 1
    pred1 = count / len(y_test)
    return pred1

# Read and process the csv file with pandas
data = pd.read_csv("D:/桌面/1.csv")
X = data.iloc[:, :2]
Y = data.iloc[:, 2]
# Split the data set (0.8 training, 0.2 test) and convert it to lists
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
x_train, x_test, y_train, y_test = data_tolist(x_train, x_test, y_train, y_test)
paint(x_train, y_train, x_test)
# Try every odd k smaller than the number of training samples
for k in range(1, len(x_train), 2):
    knn = KNN(x_train, x_test, k)
    pred = knn.knn(x_test, x_train, y_train)
    print(y_test)
    print(f"Predictions: {pred}")
    predicte = predicted(pred, y_test)
    print(f"k = {k}, test accuracy: {predicte}")
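Note: red marks the jmu samples, blue the samples to be predicted, and green the unjmu samples.
Result analysis
The analysis below is based on the enhanced code above and its results, taking ten results (ideally every k from 1 to len(x_train) should be analyzed). Only ten are taken because the larger k gets, the blurrier the result becomes: put bluntly, with a large k the vote simply comes down to which label has more samples in the data set.
k | Test accuracy |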
---|---|
1 | 1 |
3 | 0.83333 |
5 | 1 |
7 | 1 |
9 | 1 |
11 | 0.83333 |
13 | 0.83333 |
15 | 0.83333 |
21 | 0.66666 |
23 | 0.66666 |
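From the table above, it looks as if smaller k is better and accuracy gets worse as k grows. Why is that? Is a smaller k really always better?
Let's start with the first question:
Why does accuracy get worse as k grows?
First, look at my data set: the numbers of unjmu samples and jmu samples are unbalanced, with clearly more unjmu samples than jmu samples. As k grows, unjmu samples therefore carry more and more weight in the vote around points whose true label is jmu, so jmu samples end up predicted as unjmu. In other words, once k exceeds a certain value, the prediction is increasingly dominated by how many samples of each label the data set contains.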
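This effect can be reproduced with the small hand-written training set from the first version (5 jmu samples against 8 unjmu samples), which is not the csv data set analyzed here. The sketch below is my own illustration: it takes the test point (10,100), whose label is jmu, and shows how the vote changes as k grows.

```python
import math
from collections import Counter

# Training set from the first version: 5 jmu samples, 8 unjmu samples
x_train = [[3, 85], [15, 70], [7, 58], [17, 62], [33, 28], [30, 100], [10, 10],
           [2, 1], [45, 31], [50, 40], [53, 55], [60, 58], [52, 15]]
y_train = ['jmu', 'jmu', 'jmu', 'jmu', 'unjmu', 'jmu', 'unjmu',
           'unjmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu']
query = [10, 100]  # test point whose label is jmu

# Training indices sorted by distance to the query, closest first
order = sorted(range(len(x_train)), key=lambda j: math.dist(query, x_train[j]))

results = {}
for k in range(1, len(x_train) + 1, 2):  # odd k only, so no ties
    votes = Counter(y_train[j] for j in order[:k])
    results[k] = votes.most_common(1)[0][0]
    print(k, dict(votes), results[k])
```

The five nearest neighbors are exactly the five jmu samples, so the prediction stays jmu up to k = 9. From k = 11 on, at least six unjmu samples are in the vote and the point is predicted as unjmu, even though none of them is anywhere near it.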
Now the second question:
Is a smaller k always better?
See the figure below:
The sample to be predicted in the box is unjmu, but its nearest neighbor carries a label that was entered incorrectly by a human. If smaller k were always better, we should take k=1 (nearest neighbor), but then the data set would have to be essentially free of labeling errors, because any single wrong label can flip a prediction, and a manually labeled data set can hardly be error-free. (When I first wrote my data, I made exactly this kind of labeling mistake.) Therefore, a smaller k is not always better.
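The same situation can be simulated with the training set from the first version. In the sketch below I deliberately flip the label of one unjmu point, (50,40), to jmu and query a hypothetical point (49,44) right next to it; both the flipped label and the query point are assumptions made purely for illustration.

```python
import math
from collections import Counter

x_train = [[3, 85], [15, 70], [7, 58], [17, 62], [33, 28], [30, 100], [10, 10],
           [2, 1], [45, 31], [50, 40], [53, 55], [60, 58], [52, 15]]
y_train = ['jmu', 'jmu', 'jmu', 'jmu', 'unjmu', 'jmu', 'unjmu',
           'unjmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu', 'unjmu']

# Simulate a human labeling error: (50, 40) is really unjmu
noisy_labels = list(y_train)
noisy_labels[x_train.index([50, 40])] = 'jmu'

query = [49, 44]  # hypothetical point outside the campus (true label unjmu)

def predict(k):
    """Majority label among the k nearest training points, using the noisy labels."""
    order = sorted(range(len(x_train)), key=lambda j: math.dist(query, x_train[j]))
    votes = Counter(noisy_labels[j] for j in order[:k])
    return votes.most_common(1)[0][0]

print(predict(1))  # decided by the single mislabeled neighbor
print(predict(3))  # the two correctly labeled neighbors outvote it
```

With k=1 the mislabeled neighbor alone decides the result, so the point is predicted as jmu. With k=3 the next two neighbors, (53,55) and (45,31), are both unjmu and correct the vote.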
To sum up: how should the value of k be chosen? The cross-validation method mentioned under the basic questions about KNN above is worth trying. Personally, I think the value of k mainly depends on the following aspects:
- The size of the data set. (If it is large, k should not be too small, otherwise overfitting will be serious; if it is small, k should not be large, otherwise the result becomes too blurred.)
- The number of label classes. (The more classes there are, the more likely labeling errors become.)
- The dimensionality of the samples. (Different dimensionalities are best served by different distance formulas, and since the distances are computed differently, the choice of k needs to be adjusted as well.)