An analysis of the K-nearest neighbor algorithm: a guide to machine learning from principle to practice

Overview

Machine learning has become an indispensable branch of modern technology, with applications ranging from autonomous driving and face recognition to the recommendation systems that suggest songs for us. Among the many machine learning algorithms, the K nearest neighbor algorithm is one of the simplest, and its principle is intuitive.

Introduction to machine learning

When we talk about machine learning, we mean the process of letting machines learn from data and make decisions or predictions. This contrasts with traditional programming, in which we must explicitly tell the machine how to perform a task. In machine learning, the machine "learns" how to perform the task from the data provided.

For example, if we want a machine to recognize a cat in a picture, a traditional approach may require us to define the characteristics of a cat, such as the shape of the ears or the size of the eyes. In machine learning, we instead provide thousands of pictures of cats (and pictures of non-cats) and let the machine figure out the characteristics of cats on its own.

K nearest neighbor algorithm

K-Nearest Neighbor (k-NN, KNN) is a basic algorithm in machine learning. Its core idea is "birds of a feather flock together": to predict the label of an unlabeled data point, find the "k" points in the training set that are closest to it and use their labels (a majority vote, in the case of classification) as the prediction.

For example:
Suppose we arrive in a new city and want to find a place to eat, so we ask several local people. If several of them recommend the same restaurant, we will most likely choose it. In this example, the locals are our "neighbors", and their recommendations are based on their experience. This is how the K nearest neighbor algorithm works: it considers the surrounding "neighbors" and makes a decision based on their "recommendations".

Distance in K nearest neighbors

The key to the K nearest neighbor algorithm is how the distance between data points is calculated. The algorithm relies on finding the nearest neighbors of a point, and "nearest" is defined by distance.

Euclidean distance

Euclidean distance is the most common distance measure. Simply put, it is the straight-line distance between two points, which in two dimensions can be calculated with the Pythagorean theorem. In multi-dimensional space it is still the straight-line distance between two points: the square root of the sum of the squared differences in each dimension.

Formula:
$\text{euclidean distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

Each feature in our data can be regarded as a dimension, for example radius_mean and texture_mean in the breast cancer classification data set. If we have more features, we simply extend the formula above.

For example:

  • 2 features: $\text{euclidean distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
  • 3 features: $\text{euclidean distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
  • 4 features: $\text{euclidean distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2 + (a_2 - a_1)^2}$
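
The pattern in the formulas above extends to any number of features. The following is a minimal sketch in plain Python; the function name `euclidean_distance` and the sample points are illustrative and do not come from the original article.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of features."""
    if len(p) != len(q):
        raise ValueError("both points must have the same number of features")
    # Sum the squared difference in every dimension, then take the square root
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


# 2 features: the classic 3-4-5 right triangle
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
# 4 features: every coordinate differs by 1, so the distance is sqrt(4)
print(euclidean_distance((1, 2, 3, 4), (2, 3, 4, 5)))  # 2.0
```

On Python 3.8 and later, the standard library function `math.dist(p, q)` computes the same quantity.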

Example:

from sklearn.neighbors import KNeighborsClassifier

# Assumes a feature matrix X and a label vector y are already loaded
# Use Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X, y)

Manhattan distance

Manhattan distance is the total distance traveled from one point to another along a grid-shaped path, that is, the sum of the absolute differences along each axis.

Formula:
$\text{manhattan distance} = |x_1 - x_2| + |y_1 - y_2|$

Similarly, if we have multiple features, we just need to extend the above formula:

  • 2 features: $\text{manhattan distance} = |x_1 - x_2| + |y_1 - y_2|$
  • 3 features: $\text{manhattan distance} = |x_1 - x_2| + |y_1 - y_2| + |z_1 - z_2|$
  • 4 features: $\text{manhattan distance} = |x_1 - x_2| + |y_1 - y_2| + |z_1 - z_2| + |a_1 - a_2|$
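
As with Euclidean distance, the formulas above follow one pattern for any number of features. A minimal sketch in plain Python, where the name `manhattan_distance` and the sample points are made up for illustration:

```python
def manhattan_distance(p, q):
    """Sum of the absolute differences along each axis."""
    if len(p) != len(q):
        raise ValueError("both points must have the same number of features")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


# 2 features: 3 blocks east plus 4 blocks north
print(manhattan_distance((0, 0), (3, 4)))  # 7
# 4 features: 1 + 2 + 3 + 4
print(manhattan_distance((1, 2, 3, 4), (2, 4, 6, 8)))  # 10
```

For the same pair of points (0, 0) and (3, 4), the Euclidean distance is 5 while the Manhattan distance is 7, because the grid path cannot cut the corner.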

Example:

from sklearn.neighbors import KNeighborsClassifier

# Assumes a feature matrix X and a label vector y are already loaded
# Use Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X, y)

Cosine similarity

Cosine similarity measures how similar two vectors are by calculating the angle between them. It is used when the direction of the data matters more than the absolute distance, for example when judging the similarity of texts in text analysis.
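
The article gives no formula for this measure, so here is a minimal sketch under the usual definition: the dot product of two vectors divided by the product of their lengths. The function name and the word-count vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Word-count vectors for three tiny "documents" (made-up numbers)
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same direction as doc_a, twice as long
doc_c = [0, 0, 3]  # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b))  # approximately 1.0: same direction
print(cosine_similarity(doc_a, doc_c))  # 0.0: nothing in common
```

Note that doc_b is twice as long as doc_a yet counts as maximally similar, which is the point of comparing direction rather than absolute distance. To use the measure as a distance, one common choice is `1 - cosine_similarity(u, v)`.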

Choose an appropriate K value

The choice of K is very important for the K nearest neighbor algorithm, because it determines how many "neighbors" the algorithm considers. If k=1, only a single "neighbor" is used to predict the result, which can easily lead to overfitting; if k is too large, less similar "neighbors" also influence the model's judgment.

For example, when we arrive in an unfamiliar city and want to find a place to eat:

  • k=1 (considering a single nearest neighbor): We ask only one passer-by for their favorite restaurant. This person may happen to like a remote, "unique" restaurant that we would probably not enjoy. Because we rely on one person's opinion (k=1), we may miss the restaurants that are genuinely popular in the city. Similarly, when k is too small, the model is easily swayed by unusual data points and overfits.
  • k=∞ (considering all neighbors): We ask everyone we pass, collect hundreds of suggestions, and choose the restaurant mentioned most often (Shaxian Snacks, Lanzhou Beef Noodles). This is likely to be a very common and popular chain, not the local specialty restaurant we were looking for. In the same way, when k is too large, "neighbors" that are not actually close also influence the model, making it oversimplified (underfitting).
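
The two extremes above can be reproduced on a toy data set. This is a sketch with made-up one-dimensional data; `knn_predict` is a hypothetical helper, not code from the article.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query` (1-D data)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# (feature, label) pairs: five "A" points near 1, three "B" points near 5,
# and one unusual "B" point sitting inside the "A" region
train = [
    (1.0, "A"), (1.2, "A"), (1.4, "A"), (1.6, "A"), (1.8, "A"),
    (1.5, "B"),
    (5.0, "B"), (5.2, "B"), (5.4, "B"),
]

# k too small: the single unusual neighbor decides the result
print(knn_predict(train, 1.52, k=1))  # B
print(knn_predict(train, 1.52, k=3))  # A

# k too large: the overall majority class wins even deep inside the "B" cluster
print(knn_predict(train, 5.1, k=3))           # B
print(knn_predict(train, 5.1, k=len(train)))  # A
```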

Odd vs. even K values

Choosing an odd k avoids ties in binary classification. For example, in a two-class problem with k=2, the two neighbors may belong to different classes, leaving the vote tied at 1 to 1.
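
A small sketch of the tie problem, using made-up neighbor labels for a two-class problem; `is_tied` is a hypothetical helper introduced only for this illustration.

```python
from collections import Counter


def is_tied(labels):
    """Return True when the two most frequent labels have the same number of votes."""
    top = Counter(labels).most_common(2)
    return len(top) == 2 and top[0][1] == top[1][1]


# Labels of the nearest neighbors, ordered from closest to farthest
neighbor_labels = ["cat", "dog", "cat", "dog", "dog"]

for k in (2, 3, 4, 5):
    print(k, is_tied(neighbor_labels[:k]))
# 2 True
# 3 False
# 4 True
# 5 False
```

The guarantee only holds for two classes. With three or more classes an odd k can still tie, for example k=3 with three different labels.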

Choosing the k value via cross-validation

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


# Assume we have data X, y
best_score = 0
best_k = 1
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10)
    mean_score = scores.mean()
    if mean_score > best_score:
        best_score = mean_score
        best_k = k

print(f"Best k value: {best_k}")
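
To make clearer what the loop above relies on, here is a hand-rolled sketch of the same idea: split the data into folds, hold each fold out in turn, and average the accuracy for every candidate k. Everything in it (the helper names, the interleaved fold assignment, the toy data) is an assumption made for illustration; it is not how `cross_val_score` is implemented.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (1-D features)."""
    order = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
    return Counter(train_y[i] for i in order).most_common(1)[0][0]


def cross_val_accuracy(X, y, k, n_folds):
    """Mean accuracy over n_folds folds; fold f holds out samples f, f + n_folds, ..."""
    fold_scores = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / n_folds


# Made-up 1-D data: class 0 near 1, class 1 near 5, plus one class-1 point at 1.5
X = [1.0, 5.0, 1.2, 5.2, 1.4, 5.4, 1.6, 5.6, 1.5, 5.8]
y = [0, 1, 0, 1, 0, 1, 0, 1, 1, 1]

scores = {k: cross_val_accuracy(X, y, k, n_folds=5) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # on ties, the smallest k listed first wins
print(scores)
print("Best k:", best_k)
```

On this toy data k=1 scores lower than k=3, because the stray point at 1.5 misleads its immediate neighbors when only one vote is counted.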

Hands-on practice

Classification problem

Let's use the iris data set to show how to apply the KNN algorithm in practice. The iris data set contains three species of iris flowers, with 50 samples each. Each sample has 4 features: the length and width of the sepals and the length and width of the petals. With the K nearest neighbor algorithm, we can predict the species of a flower.

Example:

"""
@Module Name: knn分类.py
@Author: CSDN@我是小白呀
@Date: October 16, 2023

Description:
Classify iris flowers with the K nearest neighbor algorithm
"""
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Load the data set
iris = load_iris()
X = iris.data
y = iris.target

# Print basic information about the data for debugging
print("Features:", X[:5])
print("Labels:", y[:5])

# Split the data set (no random_state is set, so the scores vary between runs)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

# Instantiate the model
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_valid)

# Evaluation metrics
print("Accuracy:", accuracy_score(y_valid, y_pred))
print("Recall:", recall_score(y_valid, y_pred, average='macro'))  # macro average for multi-class problems
print("F1 score:", f1_score(y_valid, y_pred, average='macro'))  # macro average for multi-class problems

Output result:

Features: [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Labels: [0 0 0 0 0]
Accuracy: 0.9
Recall: 0.8777777777777779
F1 score: 0.8656126482213438

Let's further optimize the code above by choosing k with cross-validation:

"""
@Module Name: knn分类.py
@Author: CSDN@我是小白呀
@Date: October 16, 2023

Description:
Classify iris flowers with the K nearest neighbor algorithm
"""
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from matplotlib import pyplot as plt
plt.style.use("fivethirtyeight")

# Load the data set
iris = load_iris()
X = iris.data
y = iris.target

# Print basic information about the data for debugging
print("Features:", X[:5])
print("Labels:", y[:5])

# Split the data set (no random_state is set, so the scores vary between runs)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

# Choose the k value via cross-validation
k_value_score = []

for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10)
    mean_score = scores.mean()
    k_value_score.append(mean_score)

# Plot the mean cross-validation score for each k
plt.figure(figsize=(12, 8))
plt.plot(list(range(1, 31)), k_value_score)
plt.xlabel('Value of K for KNN')
plt.ylabel('Score')
plt.show()

# Instantiate the model with the k read from the plot
knn = KNeighborsClassifier(n_neighbors=12)

# Train the model
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_valid)

# Evaluation metrics
print(classification_report(y_valid, y_pred))

According to the plot of cross-validation score against K, the model's prediction results are best at k=12.

Output result:

Features: [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Labels: [0 0 0 0 0]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.90      1.00      0.95         9
           2       1.00      0.91      0.95        11

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30

Regression problem

The Boston housing price data set is a classic regression data set. It contains the median house prices of various districts of Boston along with related features such as crime rate and education level. Using K nearest neighbor regression, we can predict housing prices for new districts.
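
Before the scikit-learn version, it may help to see how a K nearest neighbor regressor can arrive at a number. The sketch below assumes uniform weighting, in which the prediction is simply the mean target value of the k nearest neighbors. The helper name and the rooms/price figures are made up for illustration.

```python
def knn_regress(train_X, train_y, x, k):
    """Predict a number as the mean target of the k nearest training points (1-D feature)."""
    order = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
    return sum(train_y[i] for i in order) / len(order)


# Made-up data: average number of rooms -> price (in thousands)
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [15.0, 20.0, 24.0, 33.0, 45.0]

# k=1 copies the price of the single closest district (6.0 rooms)
print(knn_regress(rooms, price, 6.2, k=1))  # 24.0
# k=3 averages the districts with 6.0, 7.0 and 5.0 rooms: (24 + 33 + 20) / 3
print(knn_regress(rooms, price, 6.2, k=3))  # about 25.67
```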

Code:

"""
@Module Name: knn回归.py
@Author: CSDN@我是小白呀
@Date: October 16, 2023

Description:
Predict Boston housing prices with the K nearest neighbor algorithm
"""
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this script only runs on scikit-learn versions older than 1.2
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Load the data set
boston = load_boston()
X = boston.data
y = boston.target

# Print basic information about the data for debugging
print("Features:", X[:5])
print("Labels:", y[:5])

# Split the data set (no random_state is set, so the error varies between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# K nearest neighbor regression
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)

# Predict
y_pred = knn_reg.predict(X_test)
print("Mean squared error:", mean_squared_error(y_test, y_pred))

Output result:

Features: [[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00
  6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 3.9690e+02
  4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
  7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02
  9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
  6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
  4.0300e+00]
 [3.2370e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.9980e+00
  4.5800e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9463e+02
  2.9400e+00]
 [6.9050e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 7.1470e+00
  5.4200e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9690e+02
  5.3300e+00]]
Labels: [24.  21.6 34.7 33.4 36.2]
Mean squared error: 47.335607843137254

Advantages and disadvantages of K nearest neighbor algorithm

Advantages

  • Simple and easy to understand: KNN is an instance-based learning method; the algorithm is simple, intuitive, and easy to understand. For simple classification and regression tasks, only a few lines of code are needed
  • No training step required: KNN is a lazy learner, that is, it does not actually train a model on the data but uses the training data directly when making predictions
  • Naturally handles multi-class problems: multiple categories are handled without additional modifications
  • Can be used for classification and regression: KNN works for both classification and regression tasks, which makes it versatile

Disadvantages

  • Computationally intensive: because the algorithm has to search for the k nearest points for every new data point, the computational cost grows sharply on large data sets
  • Sensitive to imbalanced data: if one class has far more samples than another, new data points are likely to be classified into the majority class
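
The imbalance problem can be seen in a few lines. The sketch below uses made-up neighbor distances and also shows distance-weighted voting as one possible mitigation; this is an addition for illustration, not something the article proposes. scikit-learn exposes a similar idea through the `weights='distance'` option of `KNeighborsClassifier`.

```python
from collections import Counter, defaultdict


def majority_vote(neighbors):
    """Plain vote: every neighbor counts once, regardless of distance."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def distance_weighted_vote(neighbors):
    """Each neighbor votes with weight 1 / distance, so closer neighbors count more."""
    weights = defaultdict(float)
    for distance, label in neighbors:
        weights[label] += 1.0 / (distance + 1e-9)  # small constant avoids division by zero
    return max(weights, key=weights.get)


# (distance to the query, label) for the 5 nearest neighbors; made-up values.
# The query sits right next to two "rare" points, but "common" points outnumber them.
neighbors = [(0.1, "rare"), (0.2, "rare"), (0.9, "common"), (1.0, "common"), (1.1, "common")]

print(majority_vote(neighbors))           # common
print(distance_weighted_vote(neighbors))  # rare
```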

Step by step implementation of k nearest neighbors

To help everyone understand better, I will walk you through implementing the KNN algorithm step by step.

Implementing the algorithm by hand

KNN from scratch:

"""
@Module Name: 手把手实现knn.py
@Author: CSDN@我是小白呀
@Date: October 16, 2023

Description:
Step-by-step implementation of the KNN algorithm
"""
class KNN:
    def __init__(self, k=3):
        """
        Initialize the parameters
        :param k: number of neighbors, defaults to 3
        """
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X_train, y_train):
        """
        Store the training features and training labels
        :param X_train: training features
        :param y_train: training labels
        :return:
        """
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        """
        Predict labels for a set of samples
        :param X_test: features of the samples to predict
        :return: list of predicted labels
        """
        y_pred = [self._predict(x) for x in X_test]
        return y_pred

    def _predict(self, x):
        """
        Predict the label of a single sample
        :param x: the sample to predict
        :return: predicted label
        """
        # Compute the distance to every training sample
        distances = [self._euclidean_distance(x, x_train) for x_train in self.X_train]

        # Indices of the k nearest neighbors
        k_indices = sorted(range(len(distances)), key=lambda i: distances[i])[:self.k]

        # Labels of the k nearest neighbors
        k_nearest_labels = [self.y_train[i] for i in k_indices]

        # Vote
        most_common = self._vote(k_nearest_labels)

        # Return the label
        return most_common

    def _euclidean_distance(self, x1, x2):
        """
        Compute the Euclidean distance
        :param x1: first sample
        :param x2: second sample
        :return: distance
        """
        return sum((xi - xj) ** 2 for xi, xj in zip(x1, x2)) ** 0.5

    def _vote(self, labels):
        # Count the votes for each class with a dictionary
        votes = {}
        for label in labels:
            if label in votes:
                votes[label] += 1
            else:
                votes[label] = 1

        # Sort by vote count and return the class with the most votes
        return sorted(votes.items(), key=lambda x: x[1], reverse=True)[0][0]
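
For comparison, the same steps (store the data, measure distances, keep the k closest, vote) can be written more compactly with standard-library helpers. This is only a sketch of an alternative style, not the author's implementation: the class name `CompactKNN` and the toy data are made up, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


class CompactKNN:
    """A shorter take on a hand-written KNN classifier."""

    def __init__(self, k=3):
        self.k = k
        self.X_train = []
        self.y_train = []

    def fit(self, X_train, y_train):
        # "Training" only stores the data; all the work happens in predict
        self.X_train = list(X_train)
        self.y_train = list(y_train)
        return self

    def predict(self, X_test):
        return [self._predict_one(x) for x in X_test]

    def _predict_one(self, x):
        # Pair every training label with its distance to x and keep the k closest
        pairs = zip((math.dist(x, x_train) for x_train in self.X_train), self.y_train)
        nearest = sorted(pairs, key=lambda pair: pair[0])[: self.k]
        # Counter.most_common(1) returns [(label, votes)] for the winning label
        return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data (made up): two well-separated clusters
X_train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y_train = ["small", "small", "small", "large", "large", "large"]

model = CompactKNN(k=3).fit(X_train, y_train)
print(model.predict([(1.1, 1.0), (5.1, 5.0)]))  # ['small', 'large']
```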

Hands-on classification

The hand-written KNN class defined above is reused here for iris classification; the driver code below is appended to the same file:

if __name__ == '__main__':

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Load the data set
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Print basic information about the data for debugging
    print("Features:", X[:5])
    print("Labels:", y[:5])

    # Split the data set (no random_state is set, so the scores vary between runs)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

    # Instantiate the model
    knn = KNN(12)

    # Train the model
    knn.fit(X_train, y_train)

    # Predict
    y_pred = knn.predict(X_valid)

    # Evaluation metrics
    print(classification_report(y_valid, y_pred))

Output result:

Features: [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Labels: [0 0 0 0 0]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       0.83      1.00      0.91        10
           2       1.00      0.83      0.91        12

    accuracy                           0.93        30
   macro avg       0.94      0.94      0.94        30
weighted avg       0.94      0.93      0.93        30

Origin blog.csdn.net/weixin_46274168/article/details/133870683