Study notes on "Machine Learning Formula Derivation and Code Implementation", recording my own learning process. Please buy the author's book for the detailed content.
k nearest neighbor algorithm
The k-nearest neighbor (k-NN) algorithm is a classic classification algorithm. It determines the class of a new input instance from the classes of its k nearest neighbors among the training instances. Because of this, it has no explicit learning or training process, unlike most mainstream machine learning algorithms, and its implementation differs somewhat from the regression models described in the previous chapters. The choice of k, the distance metric, and the classification decision rule are the three elements of the k-nearest neighbor algorithm.
1 Distance measurement method
To measure the similarity between two instances in the feature space, we describe it with a distance. Commonly used distance measures include the Minkowski distance and the Mahalanobis distance, among others.
(1) Minkowski distance
The Minkowski distance is defined as follows. Given a set X of m-dimensional vector samples, for x_i, x_j ∈ X with x_i = (x_{1i}, x_{2i}, ..., x_{mi})^T, the Minkowski distance between sample x_i and sample x_j is defined as:
$$d_{ij}=\left(\sum_{k=1}^{m}\left|x_{ki}-x_{kj}\right|^{p}\right)^{\frac{1}{p}},\quad p\ge 1$$
When p=1, the Minkowski distance becomes the Manhattan distance:
$$d_{ij}=\sum_{k=1}^{m}\left|x_{ki}-x_{kj}\right|$$
When p=2, the Minkowski distance becomes the Euclidean distance:
$$d_{ij}=\left(\sum_{k=1}^{m}\left|x_{ki}-x_{kj}\right|^{2}\right)^{\frac{1}{2}}$$
When p=∞, the Minkowski distance is also called the Chebyshev distance:
$$d_{ij}=\max_{k}\left|x_{ki}-x_{kj}\right|$$
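The three special cases above can be checked numerically. The following is a minimal sketch, not the book's code: the helper name `minkowski_distance` and the two toy vectors are assumptions chosen for illustration.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance between two 1-D vectors; p may be np.inf (hypothetical helper)."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(p):
        return float(diff.max())                 # Chebyshev distance: largest coordinate gap
    return float((diff ** p).sum() ** (1.0 / p))

# Toy vectors (assumed for illustration); coordinate gaps are 3, 4 and 0
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

d_manhattan = minkowski_distance(a, b, 1)        # 3 + 4 + 0
d_euclidean = minkowski_distance(a, b, 2)        # sqrt(9 + 16)
d_chebyshev = minkowski_distance(a, b, np.inf)   # max(3, 4, 0)
print(d_manhattan, d_euclidean, d_chebyshev)
```

As p grows, the largest coordinate gap dominates the sum, which is why the limit p=∞ keeps only the maximum.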
(2) Mahalanobis distance
The Mahalanobis distance is a distance measure that takes the correlation between features into account. Given a sample set X=(x_{ij})_{m×n} whose sample covariance matrix is S, the Mahalanobis distance between sample x_i and sample x_j is defined as:
$$d_{ij}=\left[\left(x_{i}-x_{j}\right)^{T} S^{-1}\left(x_{i}-x_{j}\right)\right]^{\frac{1}{2}}$$
When S is the identity matrix, that is, when the features are uncorrelated and each has variance 1, the Mahalanobis distance reduces to the Euclidean distance.
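The relationship between the Mahalanobis and Euclidean distances can be illustrated with a short sketch. The helper name `mahalanobis_distance` and the random toy data are assumptions. The sketch also assumes that rows are samples and columns are features, which may differ from the layout of X in the definition above.

```python
import numpy as np

def mahalanobis_distance(a, b, S):
    """Mahalanobis distance between vectors a and b for covariance matrix S (hypothetical helper)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    # Solve S z = diff rather than forming the inverse of S explicitly
    z = np.linalg.solve(S, diff)
    return float(np.sqrt(diff @ z))

# Synthetic toy data (assumption): 200 samples as rows, 3 features as columns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
S = np.cov(X, rowvar=False)          # sample covariance matrix, shape (3, 3)

a, b = X[0], X[1]
d_mahal = mahalanobis_distance(a, b, S)
d_identity = mahalanobis_distance(a, b, np.eye(3))   # S replaced by the identity matrix
d_euclid = float(np.linalg.norm(a - b))
print(d_mahal, d_identity, d_euclid)
```

With the identity matrix in place of S, the quadratic form is just the squared Euclidean norm of the difference, so `d_identity` and `d_euclid` agree up to rounding.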
The feature space of the k-nearest neighbor algorithm is an n-dimensional real number vector space, and the Euclidean distance is generally used directly as the distance measure between instances.
2 The basic principle of the k-nearest neighbor algorithm
Given a training set and a new input instance, find the k training instances nearest to it; the new instance is assigned to the class that the majority of those k instances belong to. The key points are:
- How to find the nearest neighbors of the instance. Euclidean distance is generally used.
- How to choose k, the number of neighbors.
- How to decide the class from the k neighbors. Majority voting is generally used as the classification rule.
Among these three points, the choice of k deserves the most attention. When k is small, the prediction is very sensitive to the nearest instances, so the classifier is not robust to noise and tends to overfit. When k is large, more distant instances also take part in the vote, the classification error tends to increase, and the model as a whole becomes simpler, which leads to some degree of underfitting. Cross-validation is generally used to select an appropriate k.
kNN and k-means
The k-nearest neighbor method (kNN) is a basic classification and regression method.
k-means is a simple and effective clustering method.
3 Implementation of k nearest neighbor algorithm based on numpy
Define the Euclidean distance function between the new sample instance and the training sample
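import numpy as np
# Euclidean distance between every test sample and every training sample
def compute_distances(X_test, X_train):  # test sample matrix, training sample matrix
    # Uses ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 to avoid explicit loops
    M = np.dot(X_test, X_train.T)  # cross terms, shape (num_test, num_train): 45 x 105 for the split below
    te = np.square(X_test).sum(axis=1)  # squared norm of each test row, shape (num_test,)
    tr = np.square(X_train).sum(axis=1)  # squared norm of each training row, shape (num_train,)
    sq_dists = -2 * M + tr + te[:, np.newaxis]  # broadcast to shape (num_test, num_train)
    dists = np.sqrt(np.maximum(sq_dists, 0))  # clip tiny negative values caused by rounding
    return dists  # Euclidean distance matrix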
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.utils import shuffle
# Load the iris dataset
iris = datasets.load_iris()
X, y = shuffle(iris.data, iris.target, random_state=13)  # shuffle the samples
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.7)
X_train, y_train, X_test, y_test = X[:offset], y[:offset], X[offset:], y[offset:]
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
dists = compute_distances(X_test, X_train)
plt.imshow(dists, interpolation='none')
plt.show()
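The vectorized distance computation relies on the expansion ||a − b||² = ||a||² − 2a·b + ||b||². The sketch below compares that expansion with a plain double loop on small random matrices. The function names and the toy data are assumptions for illustration, not part of the original code.

```python
import numpy as np

def pairwise_euclidean_vectorized(A, B):
    """All pairwise distances via the squared-norm expansion (hypothetical helper)."""
    cross = A @ B.T                                   # a.b for every pair, shape (len(A), len(B))
    a_sq = np.square(A).sum(axis=1)[:, np.newaxis]    # column of ||a||^2
    b_sq = np.square(B).sum(axis=1)[np.newaxis, :]    # row of ||b||^2
    return np.sqrt(np.maximum(a_sq - 2 * cross + b_sq, 0))

def pairwise_euclidean_loops(A, B):
    """Reference implementation that follows the definition directly."""
    out = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            out[i, j] = np.sqrt(np.sum((A[i] - B[j]) ** 2))
    return out

# Synthetic toy matrices (assumption): 4 and 6 samples with 3 features each
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(6, 3))

D_fast = pairwise_euclidean_vectorized(A, B)
D_slow = pairwise_euclidean_loops(A, B)
same = bool(np.allclose(D_fast, D_slow))
print(same)
```

The loop version is easier to read, while the expanded form pushes the work into a single matrix product, which is usually much faster for larger inputs.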
Label prediction function, including default k value and classification decision rule
from collections import Counter
# Label prediction function
def predict_labels(y_train, dists, k=1):  # training labels, test-to-train distance matrix, number of neighbors
    num_test = dists.shape[0]  # number of test samples
    y_pred = np.zeros(num_test)  # predictions for the test set
    for i in range(num_test):
        # argsort returns the indices that sort the distances in ascending order;
        # use them to reorder the training labels, then flatten to a 1-D array
        labels = y_train[np.argsort(dists[i, :])].flatten()
        closest_y = labels[0:k]  # labels of the k nearest neighbors
        c = Counter(closest_y)  # count the votes for each class
        y_pred[i] = c.most_common(1)[0][0]  # class with the most votes
    return y_pred  # predictions for the test set
# Predict on the test set with k=10 and check the classification accuracy
y_test_pred = predict_labels(y_train, dists, k=10)
y_test_pred = y_test_pred.reshape(-1, 1)
num_correct = np.sum(y_test_pred == y_test)  # number of correctly predicted instances
accuracy = float(num_correct / y_test.shape[0])
print(accuracy)
0.9777777777777777
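One detail of the majority vote is what happens when two classes receive the same number of votes. In Python 3.7 and later, `Counter.most_common` lists elements with equal counts in the order they were first encountered. Because the labels are sorted by distance, a tie therefore goes to the class whose member is nearest. The small sketch below uses made-up neighbor labels to show this.

```python
from collections import Counter

# Made-up labels of the k = 4 nearest neighbors, ordered from nearest to farthest
closest_y = [2, 1, 1, 2]

votes = Counter(closest_y)               # class 1 and class 2 both get two votes
winner = votes.most_common(1)[0][0]      # ties resolve to the class seen first
print(votes[1], votes[2], winner)
```

Choosing an odd k avoids ties for two-class problems, but with three classes, as in the iris data, ties can still occur.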
To find the best value of k, we search over candidate values with five-fold cross-validation:
from sklearn.metrics import accuracy_score
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
k_to_accuracies = dict()
for k in k_choices:
    for fold in range(num_folds):
        # Hold out one fold of the training set as the validation set
        validation_X_test = X_train_folds[fold]
        validation_y_test = y_train_folds[fold]
        # Train on the remaining folds
        temp_X_train = np.concatenate(X_train_folds[:fold] + X_train_folds[fold + 1:])
        temp_y_train = np.concatenate(y_train_folds[:fold] + y_train_folds[fold + 1:])
        temp_dists = compute_distances(validation_X_test, temp_X_train)
        temp_y_test_pred = predict_labels(temp_y_train, temp_dists, k)
        temp_y_test_pred = temp_y_test_pred.reshape(-1, 1)
        accuracy = accuracy_score(validation_y_test, temp_y_test_pred)
        k_to_accuracies[k] = k_to_accuracies.get(k, []) + [accuracy]
# Classification accuracy for each value of k
for k in k_to_accuracies:
    for accuracy in k_to_accuracies[k]:
        print(f'k = {k}, accuracy = {accuracy}')
k = 1, accuracy = 0.9047619047619048
k = 1, accuracy = 1.0
k = 1, accuracy = 0.9523809523809523
k = 1, accuracy = 0.8571428571428571
k = 1, accuracy = 0.9523809523809523
k = 3, accuracy = 0.8571428571428571
k = 3, accuracy = 1.0
k = 3, accuracy = 0.9523809523809523
k = 3, accuracy = 0.8571428571428571
k = 3, accuracy = 0.9523809523809523
k = 5, accuracy = 0.8571428571428571
k = 5, accuracy = 1.0
k = 5, accuracy = 0.9523809523809523
k = 5, accuracy = 0.9047619047619048
k = 5, accuracy = 0.9523809523809523
k = 8, accuracy = 0.9047619047619048
k = 8, accuracy = 1.0
k = 8, accuracy = 0.9523809523809523
k = 8, accuracy = 0.9047619047619048
k = 8, accuracy = 0.9523809523809523
k = 10, accuracy = 0.9523809523809523
k = 10, accuracy = 1.0
k = 10, accuracy = 0.9523809523809523
k = 10, accuracy = 0.9047619047619048
k = 10, accuracy = 0.9523809523809523
...
k = 100, accuracy = 0.38095238095238093
k = 100, accuracy = 0.3333333333333333
k = 100, accuracy = 0.23809523809523808
k = 100, accuracy = 0.19047619047619047
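The per-fold numbers are easier to compare after averaging over the folds. The sketch below does this for the k values whose five accuracies are listed in full in the output above; the values are copied from that output, and the variable names are assumptions.

```python
import numpy as np

# Per-fold accuracies copied from the printed output (only k values listed in full)
k_to_accuracies_shown = {
    1: [0.9047619047619048, 1.0, 0.9523809523809523, 0.8571428571428571, 0.9523809523809523],
    3: [0.8571428571428571, 1.0, 0.9523809523809523, 0.8571428571428571, 0.9523809523809523],
    5: [0.8571428571428571, 1.0, 0.9523809523809523, 0.9047619047619048, 0.9523809523809523],
    8: [0.9047619047619048, 1.0, 0.9523809523809523, 0.9047619047619048, 0.9523809523809523],
    10: [0.9523809523809523, 1.0, 0.9523809523809523, 0.9047619047619048, 0.9523809523809523],
}

# Mean accuracy over the five folds for each k
mean_accuracy = {k: float(np.mean(v)) for k, v in k_to_accuracies_shown.items()}
best_k = max(mean_accuracy, key=mean_accuracy.get)

for k, m in mean_accuracy.items():
    print(f'k = {k}, mean accuracy = {m:.4f}')
print('best k among those shown:', best_k)
```

Among the values listed in full, k = 10 has the highest mean accuracy, which matches the k used for the test-set prediction earlier. The elided values of k are not covered by this sketch.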
Plot the per-fold accuracies with error bars (mean ± one standard deviation)
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)  # one point per fold
accuracies_mean = np.array([np.mean(v) for k, v in k_to_accuracies.items()])  # mean accuracy for each k
accuracies_std = np.array([np.std(v) for k, v in k_to_accuracies.items()])  # standard deviation for each k
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)  # error bar plot
plt.title('cross-validation on k')
plt.xlabel('k')
plt.ylabel('cross_validation accuracy')
plt.show()
4 Implementation of the k-nearest neighbor algorithm based on sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
neigh = KNeighborsClassifier(n_neighbors=10)
neigh.fit(X_train, y_train.ravel())  # fit expects 1-D labels; ravel() avoids a shape warning
y_pred = neigh.predict(X_test)
accuracy = accuracy_score(y_test.ravel(), y_pred)  # arguments are (y_true, y_pred)
print(accuracy)
0.9777777777777777
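The cross-validated search for k can also be written with sklearn's `cross_val_score`. This is a sketch under assumptions rather than a reproduction of the numpy search: for classifiers `cv=5` uses stratified folds, so the per-fold numbers need not match those printed earlier, and k = 100 is left out because each training split then has fewer than 100 samples, which `KNeighborsClassifier` does not accept.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle

# Same data preparation as above: shuffle, then keep the first 70% for training
iris = datasets.load_iris()
X, y = shuffle(iris.data, iris.target, random_state=13)
offset = int(X.shape[0] * 0.7)
X_train, y_train = X[:offset], y[:offset]

# Mean five-fold accuracy for each candidate k (k = 100 omitted, see the note above)
cv_mean_accuracy = {}
for k in [1, 3, 5, 8, 10, 12, 15, 20, 50]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    cv_mean_accuracy[k] = float(scores.mean())

best_k = max(cv_mean_accuracy, key=cv_mean_accuracy.get)
print(cv_mean_accuracy)
print('best k:', best_k)
```

The selected k can then be passed to `KNeighborsClassifier(n_neighbors=best_k)` and evaluated once on the held-out test set.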