Machine Learning: K Nearest Neighbor Algorithm (KNN)

Basic knowledge

Principle

Given a test sample, find the k closest training samples in the training set according to some distance measure, and then make a prediction based on the information of these k "neighbors".

—— Zhou Zhihua, Machine Learning (the "Watermelon Book")

There is a sample data set, also called the training set, in which every sample carries a label; that is, we know which category each sample in the set belongs to. After new, unlabeled data is input, each of its features is compared with the corresponding features of the samples in the set, and the algorithm extracts the class labels of the k most similar samples (the nearest neighbors). —— Machine Learning in Action

My own understanding: there is a pile of well-labeled training samples; when you throw in a sample to be predicted, you judge which category it belongs to from the labels of the k training points nearest to it.

Example

Having read the principle, you should have a basic idea of how the KNN algorithm works, so take a look at the example below. (The table layout is from the textbook; the data is made up by me, purely to illustrate the algorithm.)


Fight-scene count, kissing-scene count, and type label for each movie:

movie title | fight scenes | kissing scenes | movie type
Love Before Dawn | 3 | 104 | Romance
Throbbing | 2 | 100 | Romance
Listen Attentively | 1 | 81 | Romance
Luo Xiaohei Ji Ji | 101 | 5 | Action
Assembly | 99 | 2 | Action
Doomsday War | 98 | 2 | Action
? | 18 | 90 | unknown

From the table above, using the six movies whose types we already know, we can construct the following coordinate diagram:

[Figure: the six labeled movies plotted by fight-scene count vs. kissing-scene count, with the unknown movie marked "?"]

Then use a distance formula to find the k points nearest to "?", and judge the movie type of "?" from those k points. From the k points closest to it, it can clearly be judged to be a romance movie.
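To make the idea concrete, here is a minimal sketch (my own, not the code from the later sections) that applies exactly this procedure to the movie table above, assuming Euclidean distance and k = 3:

import math
from collections import Counter

# the six labeled movies from the table: (fight scenes, kissing scenes) -> type
train = [
    ((3, 104), "Romance"), ((2, 100), "Romance"), ((1, 81), "Romance"),
    ((101, 5), "Action"),  ((99, 2), "Action"),   ((98, 2), "Action"),
]
query, k = (18, 90), 3          # the unknown movie "?"

# distance from the query to every training sample, nearest first
dists = sorted((math.dist(query, x), label) for x, label in train)
# majority vote among the k nearest labels
print(Counter(label for _, label in dists[:k]).most_common(1)[0][0])  # -> Romance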

Next, understand it through graphics, as follows:

[Figure: orange squares and blue triangles are training samples, the green circle is the sample to be predicted; two circles are drawn around it, one enclosing the nearest 1 neighbor and one enclosing the nearest 3]

The orange squares and blue triangles are our training samples, and the green circle is the sample we need to predict. In the figure there are two circles, each centered on the sample to be predicted and just large enough to enclose its nearest training samples (my own way of drawing it). Notice that k=1 and k=3 give different answers: with k=1 the prediction is a square, while with k=3 the prediction is a triangle. Different values of k clearly have a large influence on the prediction result, so how should the value of k be chosen? The figure also raises another question: why is k an odd number, why not an even one?

Basic Questions About KNN

How is the distance calculated?

When I first saw this algorithm, the first thing I wondered was: how is the shortest distance actually calculated? My mind went blank; how is distance calculated, don't you just see it with your eyes? (Then I realized I really am getting old.)

Euclidean distance: the straight-line distance between two points.

Formula:

d(x, y) = √( Σᵢ (xᵢ - yᵢ)² )

Of course, with this formula you have to compute the distance from the sample to be tested to every training sample, keep the k samples with the smallest distances, and use the labels of those k samples to decide the prediction for the sample under test.

Manhattan distance: also known as city-block distance; the sum of the absolute differences of the two points' coordinates.

Formula:

d(x, y) = Σᵢ |xᵢ - yᵢ|

Manhattan distance tends to be more suitable for classification problems with higher dimensionality (more features).

Most of the time Euclidean distance is used; after all, it is simple and direct, and, most importantly, we all understand it!
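For reference, here is a small sketch (my own, not part of the code later in this post) of both distance measures written with NumPy, so that the distance from one query point to every training point comes out of a single call:

import numpy as np

def euclidean(query, X):
    # straight-line distance: square root of the summed squared coordinate differences
    return np.sqrt(((X - query) ** 2).sum(axis=1))

def manhattan(query, X):
    # city-block distance: sum of the absolute coordinate differences
    return np.abs(X - query).sum(axis=1)

X = np.array([[3, 104], [2, 100], [101, 5]])   # a few training points from the movie table
q = np.array([18, 90])                         # the query point
print(euclidean(q, X))
print(manhattan(q, X))
# the k nearest neighbors are then np.argsort(euclidean(q, X))[:k]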

Let me also describe the approach I personally prefer:

Take the point of the sample to be tested as the center of a circle, start from some minimum radius, and gradually expand the radius until the number of training samples inside the circle is >= k; then judge the predicted type of the test sample from the training samples that fall inside the circle.
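Here is a rough sketch of that growing-circle idea, under the assumption that the radius grows in fixed steps (the step size is an arbitrary choice of mine) and that there are at least k training samples; in the end it returns the same kind of majority vote as taking the k nearest points:

import math
from collections import Counter

def growing_circle_knn(query, x_train, y_train, k, step=1.0):
    radius = step
    while True:
        # labels of all training samples currently inside the circle
        inside = [label for x, label in zip(x_train, y_train)
                  if math.dist(query, x) <= radius]
        if len(inside) >= k:
            # vote among everything the circle has captured (possibly more than k points)
            return Counter(inside).most_common(1)[0][0]
        radius += step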

How should the size of k be chosen?

From the square-and-triangle example we can see that different values of k lead to different results, and KNN's generalization ability is relatively poor; after all, compared with other algorithms, it has no learning (training) process.

k value | influence
too big | the prediction is stable but too flat; the classification becomes blurred, and even distant neighbor samples get a say
too small | easy to overfit; too sensitive to the nearby sample points

The answer found online is: search for the optimal k by cross-validation. Start from a small k, keep increasing it, compute the error on the validation set each time, and finally settle on a suitable k.
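As a sketch of that cross-validation search, here is one way to do it with scikit-learn's built-in KNN classifier (the hand-written class later in this post does not use scikit-learn; X and y stand for the feature matrix and label list, and I score by mean validation accuracy):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X, y, k_values=range(1, 22, 2), cv=5):
    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        # mean accuracy over the cv validation folds
        scores[k] = cross_val_score(model, X, y, cv=cv).mean()
    return max(scores, key=scores.get)   # the k with the best validation score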

Why is k not defined as an even number?

Not defining an even number is entirely to avoid ties. A training sample in KNN is never ambiguous: it is either this class or that class, with certainty. Define an odd k and the vote can never end in a draw. (Of course, this is for binary classification; for other numbers of classes, k has to be designed accordingly, e.g. 4, 7, ... for three classes; in short, the point is to avoid a tied vote.)
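A quick illustration of the tie problem (the neighbor labels here are made up): with two classes and an even k the vote can split evenly, while with an odd k it cannot:

from collections import Counter

neighbors_even = ["Romance", "Romance", "Action", "Action"]   # k = 4 -> 2 : 2 tie
neighbors_odd  = ["Romance", "Romance", "Action"]             # k = 3 -> 2 : 1, no tie

print(Counter(neighbors_even).most_common())   # [('Romance', 2), ('Action', 2)]
print(Counter(neighbors_odd).most_common())    # [('Romance', 2), ('Action', 1)]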

Advantages and disadvantages of KNN

Let's first look at the general process of KNN, as follows:

  1. Collect data: any legitimate means
  2. Prepare data: a structured data format; for the binary-classification example, each training sample is a point in the plane, so we need its coordinates (x, y) and its label
  3. Analyze data: any legitimate means
  4. Train the algorithm: not applicable, so none
  5. Test the algorithm: calculate the error rate
  6. Use the algorithm: input sample data in the structured format, run the KNN algorithm to decide which category the input sample belongs to, then act on the result.
advantages | shortcomings
high accuracy | no training process
insensitive to outliers | high computational complexity
no assumptions about the input data | high space complexity

Code

First version (2022.10.25)

Data collection, data processing, and the code itself are described below:

The data comes from a screenshot of the main campus of Jimei University and its surroundings taken from Baidu Maps. As shown in the figure below, the data is split into two parts: samples inside the jmu campus, which we label jmu, and samples outside the campus, which we label unjmu. We then judge from the horizontal and vertical coordinates whether a point lies inside or outside the jmu main campus.

[Figure: Baidu Maps screenshot of the Jimei University main campus and its surroundings, split into the jmu region and the unjmu region]

Training set:

building chosen on the map | custom coordinates | label
Yuzhou | (3, 85) | jmu
grandeur | (15, 70) | jmu
Lu Da | (7, 58) | jmu
Lu Zhenwan | (17, 62) | jmu
Atour Hotel | (33, 28) | unjmu
Kah Kee Library | (30, 100) | jmu
Wanda | (10, 10) | unjmu
Zhou Mapo | (2, 1) | unjmu
Xinjie Auto Repair | (45, 31) | unjmu
Jimei District Government | (50, 40) | unjmu
Guangsha Garden | (53, 55) | unjmu
Jimei Radio and Television | (60, 58) | unjmu
Earthquake Bureau | (52, 15) | unjmu

Test set:

location | label
(5, 7) | unjmu
(10, 100) | jmu
(49, 49) | jmu
(35, 40) | unjmu

Without further ado, here is the code:

import matplotlib.pyplot as plt
import numpy as np
import math

class KNN:
    def __init__(self, x_train, x_test, k):
        # Distance from every test point to every training sample
        self.distance = np.zeros((len(x_test), len(x_train)))
        # Stores the prediction results
        self.predicted = []
        # The k in KNN (see the basics section above if unclear)
        self.k = k

    # Core KNN routine
    def knn(self, x_test, x_train, y_train):
        print(y_train)
        for i in range(len(x_test)):
            for j in range(len(x_train)):
                self.distance[i][j] = self.knn_distance(x_test[i], x_train[j])
            self.predicted.append(self.knn_predicted(self.distance[i], y_train))
        return self.predicted

    # Euclidean distance between two points
    def knn_distance(self, x1, x2):
        dis = math.sqrt(math.pow((x1[0]-x2[0]), 2) + math.pow((x1[1]-x2[1]), 2))
        return dis

    def knn_predicted(self, distances, y_train):
        # Indices of the k smallest distances, via numpy's argsort
        k_predicted_index = distances.argsort()[:self.k]
        # Not being very familiar with the library functions, I count the votes by hand below
        count_jmu = 0
        count_other = 0
        for i in range(len(k_predicted_index)):
            if(y_train[k_predicted_index[i]] == 'jmu'):
                count_jmu += 1
            else:
                count_other += 1
        if(count_jmu > count_other):
            return 'jmu'
        else:
            return 'unjmu'

# Custom training data set
x_train = [[3, 85], [15, 70], [7, 58], [17,62], [33,28], [30,100], [10,10], [2,1], [45,31], [50,40], [53,55], [60,58], [52,15]]
y_train = ['jmu', 'jmu', 'jmu', 'jmu', 'unjmu','jmu' ,'unjmu' ,'unjmu' ,'unjmu' ,'unjmu' ,'unjmu' ,'unjmu' ,'unjmu']

# Custom test data set
x_test = [[5,7], [10,100], [19,49], [35,40]]
y_test = ['unjmu', 'jmu', 'unjmu', 'jmu']

# Set the k of KNN
k = 3
knn = KNN(x_train, x_test, k)

# Predictions for the test set
pred = knn.knn(x_test, x_train, y_train)
print(pred)

The output shows:

[Screenshot: console output showing the training labels and the predicted labels for the four test points]

Enhancement (2022.10.28)

Dataset:
Link: https://pan.baidu.com/s/1yrDGiK9yXFxB_JyC3Q5ycg
Extraction code: 1234

If the description or code above does not seem clear enough, look here: with the code above, changing the dataset is rather difficult, the changes it allows are small, and it is not easy to modify, so it has been refined a bit. See below:

Code:

First of all, Python being the typical black box, we need to import the libraries whose methods we are going to call.

import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
from sklearn.model_selection import train_test_split

Then, write the main KNN class following the idea of the KNN algorithm.

class KNN:
    def __init__(self, x_train, x_test, k):
        # Distance from every test point to every training sample
        self.distance = np.zeros((len(x_test), len(x_train)))
        # Prediction results
        self.predicted = []
        # The k of KNN
        self.k = k

    # Main KNN routine
    def knn(self, x_test, x_train, y_train):
        for i in range(len(x_test)):
            for j in range(len(x_train)):
                self.distance[i][j] = self.knn_distance(x_test[i], x_train[j])
            self.predicted.append(self.knn_predicted(self.distance[i], y_train))
        return self.predicted

    # Euclidean distance between two points
    def knn_distance(self, x1, x2):
        dis = math.sqrt(math.pow((x1[0]-x2[0]), 2) + math.pow((x1[1]-x2[1]), 2))
        return dis

    # Predict one test sample from its row of distances
    def knn_predicted(self, distances, y_train):
        # Indices of the k smallest distances
        k_predicted_index = distances.argsort()[:self.k]
        count_jmu = 0
        count_other = 0
        for i in range(len(k_predicted_index)):
            if(y_train[k_predicted_index[i]] == 'jmu'):
                count_jmu += 1
            else:
                count_other += 1
        if(count_jmu > count_other):
            return 'jmu'
        else:
            return 'unjmu'

Plot the distributions of the training and test samples so the prediction results can be inspected visually.

# Plotting (to see the data distribution)
def paint(x_train, x_test):
    # X, Y end up holding the unjmu points, X1, Y1 the jmu points,
    # X2, Y2 the test points; Z is a scratch list of indices.
    # Note: y_train is read from the global scope.
    X = []
    X1 = []
    Y = []
    Y1 = []
    X2 = []
    Y2 = []
    Z = []
    # x and y coordinates of the training samples
    x_train = np.array(x_train)
    X = x_train[:, 0]
    Y = x_train[:, 1]

    # Split the training points into jmu and unjmu according to their labels
    for i in range(len(y_train)):
        if(y_train[i] == 'jmu'):
            Z.append(i)
            X1.append(X[i])
            Y1.append(Y[i])
    X = np.delete(X, Z)
    Y = np.delete(Y, Z)

    # Prepare the test-set points
    x_test = np.array(x_test)
    X2 = x_test[:, 0]
    Y2 = x_test[:, 1]

    # Plot: red = jmu samples, green = unjmu samples, blue = test samples
    plt.scatter(X, Y, color='g')
    plt.scatter(X1, Y1, color='r')
    plt.scatter(X2, Y2, color='b')
# Data processing: convert the data obtained from the csv into plain Python lists
def data_tolist(x_train, x_test, y_train, y_test):
    x_train = np.array(x_train)
    x_train = x_train.tolist()

    y_train = np.array(y_train)
    y_train = y_train.tolist()

    x_test = np.array(x_test)
    x_test = x_test.tolist()

    y_test = np.array(y_test)
    y_test = y_test.tolist()

    return x_train, x_test, y_train, y_test
# Compute the prediction accuracy
def predicted(pred, y_test):
    count = 0
    for i in range(len(pred)):
        if(y_test[i] == pred[i]):
            count += 1
    pred1 = count / len(y_test)
    return pred1
# Use pandas to read and process the csv file
data = pd.read_csv("D:/桌面/1.csv")
X = data.iloc[:, :2]
Y = data.iloc[:, 2]

# Split the dataset into 80% training and 20% test, then convert it to lists below
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

x_train, x_test, y_train, y_test = data_tolist(x_train, x_test, y_train, y_test)

paint(x_train, x_test)

for i in range(len(x_train)):
    if (i % 2 != 0):
        k = i
        knn = KNN(x_train, x_test, k)
        pred = knn.knn(x_test, x_train, y_train)
        print(y_test)
        print(f"Prediction: {pred}")
        predicte = predicted(pred, y_test)
        print(f"k = {k}, test accuracy: {predicte}")

[Figure: scatter plot of the dataset; red points are jmu samples, green points are unjmu samples, blue points are the samples to be predicted]

Note: red marks the jmu samples, blue marks the samples to be predicted, and green marks the unjmu samples.

Result analysis

The analysis below uses the enhanced code and its results, taking ten values of k (strictly, one should analyze every k from 1 up to len(x_train)). Only ten are taken because the larger k gets, the more blurred the result becomes; put bluntly, the larger k is, the more the prediction just compares which label has more samples in the dataset.

k | accuracy
1 | 1
3 | 0.83333
5 | 1
7 | 1
9 | 1
11 | 0.83333
13 | 0.83333
15 | 0.83333
21 | 0.66666
23 | 0.66666

Looking at the table above, it seems the smaller k is the better, and the larger k gets the worse the prediction accuracy becomes. Why is that? Is a smaller k really always better?

Let's start with the first question:

Why does the accuracy get worse as k grows?

First, consider my dataset: the numbers of samples labeled unjmu and jmu are not balanced; the unjmu samples clearly outnumber the jmu ones. So as k grows, the unjmu samples carry more and more weight even around points whose true label is jmu, and jmu samples end up being predicted as unjmu. In other words, once k exceeds a certain value, the effect of the label counts in the dataset on the prediction gets amplified.
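As a tiny demonstration of that effect: once k covers the whole training set, the "vote" is just counting labels, so the majority class wins no matter where the query lies (the counts below are those of the hand-written training set from the first version of the code, 5 jmu against 8 unjmu):

from collections import Counter

y_train = ['jmu'] * 5 + ['unjmu'] * 8          # same label counts as the hand-written training set
print(Counter(y_train).most_common(1)[0][0])   # 'unjmu', regardless of where the query point is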

Now for the second question:

Is a smaller k always better?
Look at the figure below:

[Figure: a test sample (boxed) whose single nearest neighbor carries a wrong, human-mislabeled label]

The boxed sample to be predicted is unjmu, but the label of its nearest neighbor was mislabeled by a human. If a smaller k were always better we should take k = 1 (the single nearest neighbor), but that largely requires the training labels to contain zero errors: any single wrong label can flip the prediction, and zero errors are hard to achieve in a manually labeled dataset (just like the label I wrote wrong when I first built the data). So the value of k is not "the smaller the better".
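A toy illustration of this point (the coordinates and the deliberately wrong label are made up): the query's single nearest neighbor is mislabeled, so k = 1 is fooled while k = 3 recovers the correct class:

import math
from collections import Counter

x_train = [(10, 10), (12, 12), (13, 11), (40, 40)]
y_train = ['jmu', 'unjmu', 'unjmu', 'unjmu']   # (10, 10) is really unjmu but was labeled jmu by mistake
query = (9, 9)

def predict(query, x_train, y_train, k):
    # indices of the training points, sorted by distance to the query
    order = sorted(range(len(x_train)), key=lambda i: math.dist(query, x_train[i]))
    return Counter(y_train[i] for i in order[:k]).most_common(1)[0][0]

print(predict(query, x_train, y_train, 1))   # 'jmu'   (fooled by the single bad label)
print(predict(query, x_train, y_train, 3))   # 'unjmu' (the two correct neighbors outvote it)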

To sum up: how, then, should the size of k be chosen? The cross-validation method mentioned above under the basic questions about KNN is worth trying. Personally, I think the value of k mainly depends on the following aspects:

  1. The size of the dataset. (If it is too large, k cannot take too small a value, otherwise overfitting will be serious; if it is too small, k cannot take a large value, otherwise the result becomes too blurred.)
  2. The kinds of sample labels. (The more classes there are, the greater the chance of labeling errors.)
  3. The data dimensionality of the samples. (Different dimensionalities are best served by different distance formulas, and the calculations differ, so the choice of k needs to be tuned accordingly.)


Origin blog.csdn.net/weixin_51961968/article/details/127534931