Basics of machine learning algorithms: the KNN classification algorithm

1. Introduction to the principle of the KNN algorithm

How KNN (K-Nearest Neighbor) works:
Given a labeled dataset, when we input a new, unlabeled sample, the task of the KNN algorithm is to classify that sample, i.e., to assign it the appropriate label.
After the unlabeled data is input, each of its features is compared against the corresponding features of the samples in the dataset, and the class labels of the most similar samples (the nearest neighbors) are extracted.
In general, we take the k samples nearest to the new sample and let them "vote" on the decision; this is what the k in KNN stands for.
Usually k is an integer no greater than 20. Finally, the class that appears most often among the k most similar samples is chosen as the class of the new data.
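
To make the procedure concrete, here is a minimal from-scratch sketch of this voting scheme (an illustrative addition, not the scikit-learn implementation used later in this post; the function name knn_predict and the choice of Euclidean distance are assumptions for the example):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one sample by majority vote among its k nearest neighbors."""
    # Euclidean distance from the new sample to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two well-separated 2-D classes
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))  # -> 1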

2. KNN classification decision-making principle

  The KNN algorithm generally uses majority voting: the class of an input instance is determined by the majority class among its K nearest neighbors. This rule can also be derived from empirical risk minimization.
  The training samples are $(x_i, y_i)$. Suppose the input instance is $x$ and its predicted class is $c_j$, and let $N_k(x)$ be the set of the k training samples nearest to $x$. We define the training error rate as the proportion of the k nearest neighbors whose labels disagree with the predicted label:

$$\frac{1}{k}\sum_{x_i\in N_k(x)} I(y_i\neq c_j)=1-\frac{1}{k}\sum_{x_i\in N_k(x)} I(y_i=c_j)$$

  Therefore, to minimize the error rate, i.e., to minimize the empirical risk, we must make $\frac{1}{k}\sum_{x_i\in N_k(x)} I(y_i=c_j)$ as large as possible; in other words, the labels of the k nearest neighbors should agree with the predicted label as much as possible. Hence the majority voting rule is equivalent to minimizing the empirical risk.
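
As a quick sanity check of this equivalence, the short sketch below (an illustrative addition) enumerates the candidate classes for a toy neighbor set and confirms that the majority label attains the smallest empirical error rate:

from collections import Counter

# Labels of the k = 5 nearest neighbors of some input x (toy data)
neighbor_labels = [0, 1, 1, 2, 1]
k = len(neighbor_labels)

# Empirical error rate for each candidate class c_j
for c in sorted(set(neighbor_labels)):
    error = sum(y != c for y in neighbor_labels) / k
    print(f"class {c}: error rate = {error:.2f}")

# The majority class attains the minimum error rate
print("majority vote:", Counter(neighbor_labels).most_common(1)[0][0])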

3. Introduction to KNN distance metrics

3.1. Minkowski distance

  The Minkowski distance is defined by the following equation:
$$\begin{aligned} D(x,y) &= \sqrt[p]{\mid x_1-y_1\mid^p+\mid x_2-y_2\mid^p+\cdots+\mid x_n-y_n\mid^p} \\ &= \sqrt[p]{\sum_{i=1}^{n}\mid x_i-y_i\mid^p} \end{aligned}$$

3.2. Manhattan distance

  The Manhattan distance (the Minkowski distance with $p=1$) is shown below:
$$\begin{aligned} D(x,y) &= \mid x_1-y_1\mid+\mid x_2-y_2\mid+\cdots+\mid x_n-y_n\mid \\ &= \sum_{i=1}^{n}\mid x_i-y_i\mid \end{aligned}$$

3.3. Euclidean distance

  The Euclidean distance (the Minkowski distance with $p=2$) is shown below:
$$\begin{aligned} D(x,y) &= \sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\cdots+(x_n-y_n)^2} \\ &= \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2} \end{aligned}$$
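
All three metrics are straightforward to compute directly; the sketch below (an illustrative addition, with the hypothetical helper name minkowski) implements the general Minkowski distance and recovers the Manhattan and Euclidean distances as the special cases p = 1 and p = 2:

import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p between vectors x and y."""
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, p=1))  # Manhattan distance: 5.0
print(minkowski(x, y, p=2))  # Euclidean distance: sqrt(13) ≈ 3.6056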

4. KNN classification algorithm implementation

from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

# Support Chinese characters in matplotlib
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly

# Load the wine dataset
data = load_wine()
X = data.data[:, :2]  # use the first two columns (Alcohol and Malic acid) as features
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a KNN classifier with k = 10, using Euclidean distance as the metric
knn_classifier = KNeighborsClassifier(n_neighbors=10, metric='euclidean')
knn_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = knn_classifier.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Output: Accuracy: 0.8888888888888888
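
The value k = 10 above was fixed by hand; a common way to choose k is cross-validation. The sketch below (an illustrative addition, not part of the original experiment) scans candidate values of k on the training split with scikit-learn's cross_val_score, keeping the same Euclidean metric:

from sklearn.model_selection import cross_val_score

# Evaluate each candidate k with 5-fold cross-validation on the training set
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"k = {k:2d}: mean CV accuracy = {scores.mean():.3f}")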

5. KNN classification algorithm results

# Visualization: build a mesh grid covering the feature space
h = .02  # step size of the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict the class of every point on the mesh
Z = knn_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Create color maps for the background regions and the sample points
cmap_background = ListedColormap(['#FFAAAA', '#AAAAFF', '#AAFFAA'])
cmap_points = ListedColormap(['#FF0000', '#0000FF', '#00FF00'])

# Plot the decision regions and the samples
plt.pcolormesh(xx, yy, Z, cmap=cmap_background)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_points, edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("KNN Classification on Wine Dataset")
plt.xlabel("Alcohol")
plt.ylabel("1Malic acid")
plt.show()

[Figure: decision regions of the KNN classifier (k = 10) on the Wine dataset, with Alcohol and Malic acid as the axes]

6. Reference articles and acknowledgments

This chapter could not have been completed without the inspiration and help of the following author's article; it is listed here, and if anything in the content is still unclear, you can refer to the corresponding article for further understanding and analysis.
1. KNN算法: https://blog.csdn.net/qq_42722197/article/details/123196332?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522169665324816800182743993%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=169665324816800182743993&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~top_click~default-2-123196332-null-null.142^v95^chatgptT3_1&utm_term=knn%E7%AE%97%E6%B3%95%E5%8E%9F%E7%90%86&spm=1018.2226.3001.4187
If anything in this blog post is unclear, you can also browse the author's column; its content is very comprehensive and should provide good answers.
Once again, sincere thanks at the end of this article!
