KNN (K-Nearest Neighbor) Algorithm for Machine Learning

1 KNN Algorithm Introduction

The KNN (K-nearest neighbor) algorithm is one of the most basic, entry-level algorithms in machine learning. It is among the simplest classification algorithms and, at the same time, one of the most commonly used. KNN is a supervised-learning classification algorithm; it may look similar to K-means (which is an unsupervised learning algorithm), but the two are essentially different.

The KNN algorithm makes classification or regression predictions based on the similarity between instances. The problem to be solved is to assign a new data point to one of the known categories. The core idea is to find the nearest neighboring data points by comparing distances, and then use the category information of those neighbors to decide the category of the point to be classified. In short: "he who stays near vermilion gets stained red; he who stays near ink gets stained black."


1.1 Three elements of the KNN algorithm

  • Distance metric: the Euclidean distance is generally used; other distances such as the Manhattan, Chebyshev, or Minkowski distance are also possible (see the sketch after this list).
  • Choice of the k value: the smaller k is, the more complex the model becomes as a whole and the easier it is to overfit. Cross-validation is usually used to select the optimal k.
  • Classification decision rule: majority voting is generally used, i.e. the majority class among the k nearest training points determines the class of the input instance. It can be shown that the majority voting rule is equivalent to empirical risk minimization.
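
For reference, the common distance metrics listed above can each be written in a line or two of NumPy (a minimal sketch; a and b stand for two feature vectors of equal length):

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))          # L2 distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))                  # L1 distance

def chebyshev(a, b):
    return np.max(np.abs(a - b))                  # L-infinity distance

def minkowski(a, b, p=3):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)  # p = 1 gives Manhattan, p = 2 gives Euclidean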

1.2 KNN is a non-parametric, lazy algorithm

  • Non-parametric: this does not mean the algorithm needs no parameters, but that the model makes no assumptions about the underlying data. By contrast, linear regression always assumes the relationship is a straight line. The structure of a KNN model is determined by the data itself, which is often closer to reality.
  • Lazy: compare it with another classification algorithm such as logistic regression, which requires substantial training on the data before it yields a model. The KNN algorithm has no such step: there is no explicit training phase, or the "training" is trivially fast.

1.3 Advantages and disadvantages of the KNN algorithm

(1) The KNN algorithm has the following advantages:

  • Simple and easy to understand: the basic idea of the KNN algorithm is intuitive and simple, and it is easy to understand and implement.

  • No training process required: the KNN algorithm is an instance-based learning method with no explicit training phase; it directly uses the existing training data for classification or regression prediction.

  • Applicable to multi-category problems: KNN algorithm can be applied to multi-category problems without being limited by the number of categories.

  • Effective for unbalanced datasets: The KNN algorithm is relatively effective when dealing with unbalanced datasets because it does not assume prior knowledge of the data distribution.

(2) Some disadvantages of the KNN algorithm:

  • High computational complexity: When performing classification or regression prediction, the KNN algorithm needs to calculate the distance between the data points to be classified and all training data points. Computational complexity increases significantly when the training dataset is large.

  • Sensitive to the dimensionality of the feature space: when the dimensionality of the feature space is high, the performance of the KNN algorithm may degrade due to the so-called "curse of dimensionality". In high-dimensional data, distance measures become less informative: all data points tend to be far apart from one another, and the notion of a "nearest" neighbor loses its meaning.

  • An appropriate K value must be chosen: the performance of the KNN algorithm depends heavily on selecting a suitable number of nearest neighbors K. Too small a K value can make the model overly sensitive and easily affected by noise; too large a K value can make the model too smooth and unable to capture fine-grained class boundaries.

  • Not suitable for large-scale data sets: Since the KNN algorithm needs to calculate the distance between the data points to be classified and all training data points in the prediction stage, the storage and calculation overhead may be very large for large-scale data sets.

The KNN algorithm is a simple but powerful classification and regression method applicable to a variety of problem domains. When using it, however, you need to pay attention to its computational complexity, its sensitivity to dimensionality, the choice of an appropriate K value (see the sketch below), and the challenges of scaling to large data sets.
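
As a quick illustration of how K can be chosen, the sketch below uses scikit-learn (which is not used elsewhere in this post) and its small built-in digits dataset; a single held-out validation split stands in for full cross-validation:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small digit dataset: 1,797 8x8 grayscale images
digits = load_digits()
X_train, X_val, y_train, y_val = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# Try a range of candidate K values and keep the one with the best validation accuracy
best_k, best_acc = None, 0.0
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)   # mean accuracy on the held-out split
    if acc > best_acc:
        best_k, best_acc = k, acc
print('best k:', best_k, 'validation accuracy:', round(best_acc, 4))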

2 Application Scenarios of KNN Algorithm

The advantages of the KNN algorithm include simplicity, ease of understanding, no training process, and suitability for multi-category problems. The KNN algorithm is widely used in many fields. The common application scenarios of the KNN algorithm are as follows:

  • Classification problems: KNN algorithm can be used for classification problems, such as text classification, image classification, speech recognition, etc. By comparing the similarity between the data points to be classified and known data points, KNN can assign new data points to the most similar category.

  • Regression problem: KNN algorithm can also be used for regression problems, such as house price prediction, stock price prediction, etc. By calculating the average or weighted average of the nearest neighbor data points, KNN can predict the numerical attributes of the data points to be classified.

  • Recommendation system: The KNN algorithm can be applied to a recommendation system to recommend items of similar interest based on the similarity between users. By comparing behavior patterns or interest preferences among users, KNN can find a group of users most similar to the current user and recommend similar items to them.

  • Anomaly detection: KNN algorithm can be used to detect abnormal data points, such as credit card fraud, network intrusion, etc. By calculating the distance between a data point and its nearest neighbors, KNN can identify anomalous data points that are different from the majority of data points.

  • Text mining: KNN algorithm can be used for text mining tasks, such as text classification, sentiment analysis, etc. By comparing the similarity between texts, KNN can classify new text data points into corresponding categories.

  • Image processing: KNN algorithm can be applied in the field of image processing, such as image recognition, image retrieval, etc. By comparing pixel values or feature vectors between images, KNN can identify and retrieve similar images.

However, the disadvantage of this algorithm is the high computational complexity, especially when the training dataset is large, a large number of distances need to be calculated. In addition, the KNN algorithm is sensitive to the dimension of the feature space, which may cause problems in the processing of high-dimensional data.

For data with a high-dimensional feature space and a large number of samples, the performance of the KNN algorithm can be improved in two ways: feature selection and dimensionality reduction techniques can reduce the dimensionality of the feature space, and data structures such as KD trees can speed up the nearest-neighbor search.
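
As a sketch of the KD tree idea (assuming SciPy is available; the code later in this post sticks to the brute-force distance scan), the tree is built once and can then answer nearest-neighbor queries, typically without having to compare the query against every training point:

import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((10000, 3))      # 10,000 random points in 3-D space
tree = KDTree(points)                # build the KD tree once

query = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(query, k=5)   # distances and indices of the 5 nearest neighbors
print(idx, dist)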

A KD tree is a balanced binary tree whose purpose is to partition a k-dimensional space.

[Figure: an example KD tree over 3-dimensional tuples; the key item at each level is shown in bold.]

A KD tree looks like a binary search tree; in fact, it is a variant of one. In the example shown here, K = 3 (three dimensions).

Organizing principle of the KD tree

Each node stores one 3-tuple, with its items numbered 0, 1, and 2. At level n of the tree (the root being level 0), item n % 3 is the one shown in bold in the figure, and these bold values act as the keys of a binary search tree. For example, the first item of every node in the left subtree of the root is smaller than the first item of the root, while the first item of every node in the right subtree is larger, and the same rule applies recursively within each subtree.

Searching such a tree is straightforward: given a query tuple, compare its first item with the root's key and go left if it is smaller, right if it is larger; at the second level compare the second item, and so on.
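
The splitting rule described above can be sketched in a few lines of plain Python (a simplified illustration, not the code used later in this post; splitting at the median keeps the tree roughly balanced):

def build_kdtree(points, depth=0):
    """Recursively build a KD tree: at depth d the splitting key is coordinate d % k."""
    if not points:
        return None
    k = len(points[0])                  # dimensionality of the tuples
    axis = depth % k                    # coordinate used as the key at this level
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2           # the median point becomes this node
    return {
        'point': points[median],
        'left': build_kdtree(points[:median], depth + 1),
        'right': build_kdtree(points[median + 1:], depth + 1),
    }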
 

KD tree retrieval

Suppose our KD tree is built from the sample set {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}.
Let us find the nearest neighbor of the point (2.1, 3.1). The test at (7,2) sends the search to (5,4), and the test at (5,4) sends it to (2,3), so search_path now contains <(7,2), (5,4), (2,3)>. Take (2,3) from search_path as the current best node nearest, with dist = 0.141 (Euclidean distance). Then backtrack to (5,4): the circle centered at (2.1, 3.1) with radius dist = 0.141 does not intersect the hyperplane y = 4, as shown in the figure below, so there is no need to search the right subspace of node (5,4), since no closer sample point can exist there.
Backtracking further to (7,2), the same circle centered at (2.1, 3.1) with radius 0.141 does not intersect the hyperplane x = 7 either, so the right subspace of node (7,2) need not be searched.
At this point search_path is empty, the whole search ends, and nearest = (2,3) is returned as the nearest neighbor of (2.1, 3.1), with a closest distance of 0.141.
[Figure: searching for (2.1, 3.1): the circle of radius 0.141 around the query point does not cross the splitting hyperplanes y = 4 or x = 7.]

For a slightly more complicated example, let us find the nearest neighbor of the point (2, 4.5). The test at (7,2) sends the search to (5,4), and the test at (5,4) sends it to (4,7), so search_path now contains <(7,2), (5,4), (4,7)>. Take (4,7) from search_path as the current best node nearest, with dist = 3.202. Then backtrack to (5,4): the circle centered at (2, 4.5) with radius dist = 3.202 intersects the hyperplane y = 4, as shown in the figure below, so the left subspace of (5,4) must also be searched, and (2,3) is added to search_path; the nodes in search_path are now <(7,2), (2,3)>. In addition, the distance between (5,4) and (2, 4.5) is 3.04 < dist = 3.202, so nearest is updated to (5,4) and dist to 3.04.
Backtracking to (2,3), which is a leaf node, we directly check whether (2,3) is closer to (2, 4.5); the computed distance is 1.5, so nearest is updated to (2,3) and dist to 1.5.
Backtracking to (7,2), the circle centered at (2, 4.5) with radius dist = 1.5 does not intersect the hyperplane x = 7, so the right subspace of node (7,2) need not be searched.
At this point search_path is empty, the whole search ends, and nearest = (2,3) is returned as the nearest neighbor of (2, 4.5), with a closest distance of 1.5.

[Figures: searching for (2, 4.5): the search circle shrinks from radius 3.202 to 3.04 and finally 1.5 as closer points are found.]
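
The two hand-traced searches above can be checked with a few lines of code (a sketch using SciPy purely for verification; SciPy's own splitting choices may differ from the figure, but the nearest neighbors are the same):

import numpy as np
from scipy.spatial import KDTree

samples = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]], dtype=float)
tree = KDTree(samples)

for query in [(2.1, 3.1), (2, 4.5)]:
    dist, idx = tree.query(query)    # nearest neighbor and its distance
    print(query, '->', samples[idx], 'distance', round(float(dist), 3))
# Both queries return (2, 3), at distance 0.141 and 1.5 respectively,
# matching the walkthrough above.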

3 KNN classification on the MNIST dataset with PyTorch

3.1 Get the MNIST dataset

(1) Automatic download in code

train_dataset = datasets.MNIST(root='data',  # root directory for the data
                            train=True,  # use the training set
                            transform=None,  # no data preprocessing
                            download=True)  # download the images from the internet

test_dataset = datasets.MNIST(root='data',  # root directory for the data
                           train=False,  # use the test set
                           transform=None,  # no data preprocessing
                           download=True)  # download the images from the internet

However, this automatic download may fail with an error such as:

urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 5303709 out of 9912422 bytes>

 (2) Manually download the dataset

Download address: MNIST data

After the download is complete, put it in the data/MNIST/raw directory

Displaying an image from the dataset:

digit = train_loader.dataset.data[0]  # the first training image as a 28x28 tensor
plt.imshow(digit, cmap=plt.cm.binary)
plt.show()
print(train_loader.dataset.targets[0])  # its label


3.2 KNN calculation

Using the 60,000 MNIST images as the training set, we label all 10,000 images of the test set by KNN: each test image is compared with every image in the training set, and it is assigned the label of the training image that the algorithm judges most similar. How, concretely, should two images be compared? In this example, comparing two images means comparing their 28×28 blocks of pixels. The simplest approach is a pixel-by-pixel comparison using the L1 distance: take the difference at each pixel, and add up all the absolute differences to get a single number. If the two images are identical, the L1 distance is 0; if they are very different, the L1 value will be very large.
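
In code, the pixel-wise L1 comparison described above reduces to a couple of NumPy operations (a sketch; img_a and img_b stand for two 28×28 image arrays):

import numpy as np

def l1_distance(img_a, img_b):
    # Cast to a signed type first so uint8 pixel values do not wrap around when subtracted
    diff = img_a.astype(np.int32) - img_b.astype(np.int32)
    return np.sum(np.abs(diff))   # 0 for identical images, large for very different ones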

def KNN_classify(k, dis_func, train_data, train_label, test_data):
    num_test = test_data.shape[0]  # number of test samples
    label_list = []
    for idx in range(num_test):
        distances = dis_func(train_data, test_data[idx])  # distances to every training sample
        nearest_k = np.argsort(distances)
        top_k = nearest_k[:k]  # indices of the k nearest training samples
        class_count = {}
        for j in top_k:
            class_count[train_label[j]] = class_count.get(train_label[j], 0) + 1
        sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
        label_list.append(sorted_class_count[0][0])  # majority class among the k neighbors

    return np.array(label_list)
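
A minimal usage sketch on toy data (the arrays below are invented purely to show the call signature; e_distance is the Euclidean-distance helper defined in the complete code of the next section, and the numpy/operator imports are assumed):

train_data = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
train_label = np.array([0, 0, 1, 1])
test_data = np.array([[0, 2], [9, 10]])

pred = KNN_classify(3, e_distance, train_data, train_label, test_data)
print(pred)   # expected output: [0 1]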

3.3 Complete code

#!/usr/bin/env python
# coding: utf-8


import operator
import matplotlib.pyplot as plt
import numpy as np
from torchvision import datasets, transforms
from torch.utils.data import DataLoader


batch_size = 100
train_dataset = datasets.MNIST(root='data',  # root directory for the data
                            train=True,  # use the training set
                            transform=None,  # no data preprocessing
                            download=True)  # download the images from the internet

test_dataset = datasets.MNIST(root='data',  # root directory for the data
                           train=False,  # use the test set
                           transform=None,  # no data preprocessing
                           download=True)  # download the images from the internet

train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)

print("train_data:", train_dataset.data.size())
print("train_labels:", train_dataset.data.size())
print("test_data:", test_dataset.data.size())
print("test_labels:", test_dataset.data.size())


# digit = train_loader.dataset.data[0]  # take the data of the first image
# plt.imshow(digit, cmap=plt.cm.binary)
# plt.show()
# print(train_loader.dataset.targets[0])

# Euclidean distance between one test sample and every training sample
def e_distance(dataset_a, data_b):
    return np.sqrt(np.sum(((dataset_a - np.tile(data_b, (dataset_a.shape[0], 1))) ** 2), axis=1))

# Manhattan distance between one test sample and every training sample
def m_distance(dataset_a, data_b):
    return np.sum(np.abs(dataset_a - np.tile(data_b, (dataset_a.shape[0], 1))), axis=1)


def KNN_classify(k, dis_func, train_data, train_label, test_data):
    num_test = test_data.shape[0]  # number of test samples
    label_list = []
    for idx in range(num_test):
        distances = dis_func(train_data, test_data[idx])  # distances to every training sample
        nearest_k = np.argsort(distances)
        top_k = nearest_k[:k]  # indices of the k nearest training samples
        class_count = {}
        for j in top_k:
            class_count[train_label[j]] = class_count.get(train_label[j], 0) + 1
        sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
        label_list.append(sorted_class_count[0][0])  # majority class among the k neighbors

    return np.array(label_list)


def get_mean(data):
    data = np.reshape(data, (data.shape[0], -1))
    mean_image = np.mean(data, axis=0)
    return mean_image


def centralized(data, mean_image):
    data = data.reshape((data.shape[0], -1))
    data = data.astype(np.float64)
    data -= mean_image  # subtract the mean image to zero-center the data
    return data


if __name__ == '__main__':
    # Training data
    train_data = train_loader.dataset.data.numpy()
    train_data = train_data.reshape(train_data.shape[0], 28 * 28)
    
    # Zero-center the data
    mean_image = get_mean(train_data)  # mean over all training images
    train_data = centralized(train_data, mean_image)
    
    print('train_data shape:', train_data.shape)
    train_label = train_loader.dataset.targets.numpy()
    print('train_label shape', train_label.shape)

    # Test data (first 1,000 test images)
    test_data = test_loader.dataset.data[:1000].numpy()
    test_data = centralized(test_data, mean_image)
    test_data = test_data.reshape(test_data.shape[0], 28 * 28)
    print('test_data shape', test_data.shape)
    test_label = test_loader.dataset.targets[:1000].numpy()
    print('test_label shape', test_label.shape)

    # Prediction with KNN (there is no separate training step)
    test_label_pred = KNN_classify(5, e_distance, train_data, train_label, test_data)

    # Compute the accuracy on the test subset
    num_test = test_data.shape[0]
    num_correct = np.sum(test_label == test_label_pred)
    print(num_correct)
    accuracy = float(num_correct) / num_test
    print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

3.4 Calculation result display

train_data: torch.Size([60000, 28, 28])
train_labels: torch.Size([60000])
test_data: torch.Size([10000, 28, 28])
test_labels: torch.Size([10000])
train_data shape: (60000, 784)
train_label shape (60000,)
test_data shape (1000, 784)
test_label shape (1000,)
963
Got 963 / 1000 correct => accuracy: 0.963000

Using the Euclidean distance, the final accuracy reached 96.3%.

4 Complete project and data download

Download address: code and data

Origin blog.csdn.net/lsb2002/article/details/131267874