"Machine Learning Formula Derivation and Code Implementation" chapter6-k nearest neighbor algorithm

"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.

k nearest neighbor algorithm

The k-nearest neighbor (k-NN) algorithm is a classic classification algorithm. It determines the class of a new input instance from the classes of its k nearest neighbor instances, so unlike mainstream machine learning algorithms it has no explicit learning or training process. Because of this, its implementation differs slightly from the regression models described in the previous chapters. The choice of k, the distance metric, and the classification decision rule are the three elements of the k-nearest neighbor algorithm.

1 Distance measurement method

To measure the similarity between two instances in the feature space, we describe it with a distance. Commonly used distance measures include the Minkowski distance and the Mahalanobis distance.

(1) Minkowski distance. Given a set $X$ of $m$-dimensional vector samples, for $x_i, x_j \in X$ with $x_i=(x_{1i},x_{2i},\dots,x_{mi})^T$, the Minkowski distance between sample $x_i$ and sample $x_j$ is defined as:
$$d_{ij}=\left( \sum_{k=1}^{m}\left| x_{ki}-x_{kj} \right|^{p} \right)^{\frac{1}{p}}, \quad p \ge 1$$
When $p=1$, the Minkowski distance becomes the Manhattan distance:
$$d_{ij}=\sum_{k=1}^{m}\left| x_{ki}-x_{kj} \right|$$
When $p=2$, it becomes the Euclidean distance:
$$d_{ij}=\left( \sum_{k=1}^{m}\left| x_{ki}-x_{kj} \right|^{2} \right)^{\frac{1}{2}}$$
When $p=\infty$, it is also called the Chebyshev distance:
$$d_{ij}=\max_{k}\left| x_{ki}-x_{kj} \right|$$
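As a quick illustration (a minimal numpy sketch of my own, not from the book), the members of the Minkowski family can be computed directly from two sample vectors:

import numpy as np

# Minkowski distance between two 1-D sample vectors; p=1 gives Manhattan, p=2 gives Euclidean
def minkowski_distance(xi, xj, p=2):
    return np.power(np.sum(np.abs(xi - xj) ** p), 1 / p)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])
print(minkowski_distance(xi, xj, p=1))  # Manhattan distance: 5.0
print(minkowski_distance(xi, xj, p=2))  # Euclidean distance: sqrt(13)
print(np.max(np.abs(xi - xj)))          # Chebyshev distance (the p -> infinity limit): 3.0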
(2) Mahalanobis distance. The Mahalanobis distance is a distance measure that takes the correlations between features into account. Given a sample set $X=(x_{ij})_{m \times n}$ and assuming the sample covariance matrix is $S$, the Mahalanobis distance between sample $x_i$ and sample $x_j$ is defined as:
$$d_{ij}=\left[ \left( x_{i}-x_{j} \right)^{T} S^{-1} \left( x_{i}-x_{j} \right) \right]^{\frac{1}{2}}$$
When $S$ is the identity matrix, that is, when the features of the samples are mutually independent and each has variance 1, the Mahalanobis distance reduces to the Euclidean distance.
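A minimal sketch of the Mahalanobis distance (my own illustration, assuming the sample covariance matrix is invertible):

import numpy as np

# Mahalanobis distance between two samples, given the full sample matrix X (rows are samples)
def mahalanobis_distance(xi, xj, X):
    S = np.cov(X, rowvar=False)  # sample covariance matrix of the features
    S_inv = np.linalg.inv(S)     # assumes S is invertible
    diff = xi - xj
    return np.sqrt(diff.T @ S_inv @ diff)

X = np.random.rand(100, 3)       # 100 samples with 3 features
print(mahalanobis_distance(X[0], X[1], X))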

The feature space of the k-nearest neighbor algorithm is an n-dimensional real vector space, and the Euclidean distance is generally used as the distance measure between instances.

2 The basic principle of the k-nearest neighbor algorithm

Given a training set, for a new input instance, find the k instances in the training set that are nearest to it; the new instance is assigned to the class to which the majority of those k instances belong. This yields a few key points:

  • How to find the instances that are the nearest neighbors of the new instance. Euclidean distance is generally used.
  • How to choose the size of k.
  • Which class the majority of the k instances belong to. Majority voting is generally used as the classification decision rule.

Among these three points, the choice of k deserves particular attention. When k is small, the prediction is very sensitive to nearby instances and the classifier has poor robustness to noise, which easily leads to overfitting; when k is large, the classification error increases and the overall model becomes simpler, leading to a degree of underfitting. In practice, cross-validation is generally used to choose an appropriate k.
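Putting the three elements together, a minimal single-query sketch (my own illustration, not the book's implementation; the vectorized version follows in section 3) might look like this:

import numpy as np
from collections import Counter

# Predict the label of a single query x by majority vote among its k nearest training samples
def knn_predict_one(x, X_train, y_train, k=5):
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
    votes = Counter(np.ravel(y_train)[nearest])          # count the labels of those neighbors
    return votes.most_common(1)[0][0]                    # majority-vote decision rule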

knn and k-means
The k-nearest neighbor method (knn) is a basic classification and regression method.
k-means is a simple and effective clustering method.

3 Implementation of the k-nearest neighbor algorithm based on numpy

Define the Euclidean distance function between the new sample instance and the training sample

import numpy as np

# Euclidean distance function
def compute_distances(X_test, X_train): # test sample matrix, training sample matrix
    num_test = X_test.shape[0]   # number of test samples (45 here)
    num_train = X_train.shape[0] # number of training samples (105 here)
    dists = np.zeros((num_test, num_train)) # initialize the 45*105 distance matrix
    M = np.dot(X_test, X_train.T) # test-by-train dot products, 45*105
    te = np.square(X_test).sum(axis=1) # squared norms of the test samples, shape (45,)
    tr = np.square(X_train).sum(axis=1) # squared norms of the training samples, shape (105,)
    dists = np.sqrt(-2 * M + tr + te.reshape(-1, 1)) # expand ||a-b||^2 = ||a||^2 - 2ab + ||b||^2 via broadcasting
    return dists # Euclidean distance matrix
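As a quick sanity check (my own addition, assuming scipy is installed), the vectorized result can be compared against scipy.spatial.distance.cdist, which computes pairwise Euclidean distances by default:

from scipy.spatial.distance import cdist

A = np.random.rand(5, 4)
B = np.random.rand(7, 4)
print(np.allclose(compute_distances(A, B), cdist(A, B)))  # True: both are pairwise Euclidean distances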
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.utils import shuffle

# Load the iris dataset
iris = datasets.load_iris()
X, y = shuffle(iris.data, iris.target, random_state=13) # shuffle the data
X = X.astype(np.float32)

# 70/30 train/test split
offset = int(X.shape[0] * 0.7)
X_train, y_train, X_test, y_test = X[:offset], y[:offset], X[offset:], y[offset:]
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

dists = compute_distances(X_test, X_train)
plt.imshow(dists, interpolation='none')
plt.show()

(Figure: heatmap of the Euclidean distance matrix between the test and training samples)
Define the label prediction function, with a default k value and a majority-vote classification decision rule

from collections import Counter

# Label prediction function
def predict_labels(y_train, dists, k=1): # training labels, test-to-train distance matrix, k value
    num_test = dists.shape[0] # number of test samples
    y_pred = np.zeros(num_test) # initialize the test-set predictions
    for i in range(num_test):
        # sort the i-th row of the distance matrix, index the training labels by the sorted order, then flatten
        labels = y_train[np.argsort(dists[i, :])].flatten() # argsort returns indices from smallest to largest distance
        closest_y = labels[0:k] # take the labels of the k nearest neighbors
        c = Counter(closest_y) # count the occurrences of each label
        y_pred[i] = c.most_common(1)[0][0] # take the most frequent class
    return y_pred # test-set predictions
# Predict on the test set with k=10 and check the classification accuracy
y_test_pred = predict_labels(y_train, dists, k=10)
y_test_pred = y_test_pred.reshape(-1, 1)
num_correct = np.sum(y_test_pred == y_test) # count the correctly predicted instances
accuracy = float(num_correct / y_test.shape[0])
print(accuracy)
0.9777777777777777

To search for the optimal k value, we use five-fold cross-validation:

from sklearn.metrics import accuracy_score

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
k_to_accuracies = dict()

for k in k_choices:
    for fold in range(num_folds):
        # hold out one fold of the training set as a validation set
        validation_X_test = X_train_folds[fold]
        validation_y_test = y_train_folds[fold]
        # concatenate the remaining folds into a temporary training set
        temp_X_train = np.concatenate(X_train_folds[:fold] + X_train_folds[fold + 1:])
        temp_y_train = np.concatenate(y_train_folds[:fold] + y_train_folds[fold + 1:])
        temp_dists = compute_distances(validation_X_test, temp_X_train)
        temp_y_test_pred = predict_labels(temp_y_train, temp_dists, k)
        temp_y_test_pred = temp_y_test_pred.reshape(-1, 1)
        accuracy = accuracy_score(validation_y_test, temp_y_test_pred)
        k_to_accuracies[k] = k_to_accuracies.get(k, []) + [accuracy]
        
# Classification accuracy for different k values
for k in k_to_accuracies:
    for accuracy in k_to_accuracies[k]:
        print(f'k = {k}, accuracy = {accuracy}')
k = 1, accuracy = 0.9047619047619048
k = 1, accuracy = 1.0
k = 1, accuracy = 0.9523809523809523
k = 1, accuracy = 0.8571428571428571
k = 1, accuracy = 0.9523809523809523
k = 3, accuracy = 0.8571428571428571
k = 3, accuracy = 1.0
k = 3, accuracy = 0.9523809523809523
k = 3, accuracy = 0.8571428571428571
k = 3, accuracy = 0.9523809523809523
k = 5, accuracy = 0.8571428571428571
k = 5, accuracy = 1.0
k = 5, accuracy = 0.9523809523809523
k = 5, accuracy = 0.9047619047619048
k = 5, accuracy = 0.9523809523809523
k = 8, accuracy = 0.9047619047619048
k = 8, accuracy = 1.0
k = 8, accuracy = 0.9523809523809523
k = 8, accuracy = 0.9047619047619048
k = 8, accuracy = 0.9523809523809523
k = 10, accuracy = 0.9523809523809523
k = 10, accuracy = 1.0
k = 10, accuracy = 0.9523809523809523
k = 10, accuracy = 0.9047619047619048
k = 10, accuracy = 0.9523809523809523
...
k = 100, accuracy = 0.38095238095238093
k = 100, accuracy = 0.3333333333333333
k = 100, accuracy = 0.23809523809523808
k = 100, accuracy = 0.19047619047619047
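
The loop above only prints the per-fold accuracies; a small follow-up (my own addition) is to average them and pick the k with the highest mean cross-validation accuracy:

mean_accuracies = {k: np.mean(v) for k, v in k_to_accuracies.items()}
best_k = max(mean_accuracies, key=mean_accuracies.get)
print(f'best k = {best_k}, mean accuracy = {mean_accuracies[best_k]:.4f}')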

Plot the mean cross-validation accuracy for each k with error bars (one standard deviation)

for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)
accuracies_mean = np.array([np.mean(v) for k, v in k_to_accuracies.items()]) # mean accuracy for each k
accuracies_std = np.array([np.std(v) for k, v in k_to_accuracies.items()]) # standard deviation for each k
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std) # error bar plot
plt.title('cross-validation on k')
plt.xlabel('k')
plt.ylabel('cross-validation accuracy')
plt.show()

(Figure: per-fold accuracies and the error bar plot of mean cross-validation accuracy versus k)

4 Implementation of the k-nearest neighbor algorithm based on sklearn

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=10)
neigh.fit(X_train, y_train.ravel()) # ravel to pass 1-D labels and avoid a DataConversionWarning
y_pred = neigh.predict(X_test)
y_pred = y_pred.reshape(-1, 1)
accuracy = accuracy_score(y_pred, y_test)
print(accuracy)
0.9777777777777777
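
sklearn can also take over the cross-validation search for k; a minimal sketch using GridSearchCV (my own addition, with an assumed candidate grid) would look roughly like this:

from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [1, 3, 5, 8, 10, 12, 15, 20]} # assumed candidate values for k
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train.ravel()) # ravel to pass 1-D labels
print(grid.best_params_, grid.best_score_)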

Notebook_Github address
