Introduction to Machine Learning: A Brief Look at KNN (K-Nearest Neighbors)

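This post is part of an introductory machine learning series. It covers the basic principle of KNN (K-Nearest Neighbors) and a Python implementation. KNN is often taught as the first lesson in machine learning because the idea is very simple, it needs little mathematics, and it often works well in practice. It also makes it possible to walk through many of the details that come up when using a machine learning algorithm, and so gives a fairly complete picture of the machine learning workflow.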


The KNN Algorithm

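K-Nearest Neighbors (KNN) rests on the assumption that the closer two samples are in feature space, the more likely they are to belong to the same class. The algorithm takes a value K. For a new sample to be predicted, it finds the K training points closest to it (its K nearest neighbors), counts how many of those points belong to each class, and returns the most common class as the prediction.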

Here is an example:

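The figure below shows a tumor classification example. The horizontal axis is tumor size and the vertical axis is the time since the tumor was discovered. Blue points are benign tumors and orange points are malignant tumors.
[Figure: scatter plot of the tumor data, with the point to predict shown in green]
The green point is the one to predict. We find the K points closest to it and count how many of them fall in each class. K is often set to a small value such as 3, so here we look at the 3 points nearest the green point. Two of them are malignant and one is benign, so the prediction is malignant.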
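
To make the vote concrete, here is a minimal sketch in plain Python that finds the three nearest neighbors of the green point (11, 10) among the 25 points used for the figure. The labeling rule (x + y < 20 means benign) is the one this post uses to color the figure; the helper name `nearest_labels` is made up for this illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

# The 25 (size, time) points used for the figure in this post
points = [(1, 0), (0, 14), (2, 18), (11, 6), (2, 11), (0, 17), (12, 12),
          (10, 5), (18, 4), (8, 13), (14, 17), (8, 15), (0, 3), (3, 6),
          (11, 9), (20, 8), (2, 9), (1, 5), (15, 14), (3, 18), (10, 6),
          (11, 5), (15, 9), (9, 0), (0, 13)]
# Labeling rule from this post: x + y < 20 is benign (0), otherwise malignant (1)
labels = [0 if px + py < 20 else 1 for px, py in points]


def nearest_labels(query, k):
    """Return (point, label) pairs for the k training points closest to query."""
    order = sorted(range(len(points)), key=lambda i: dist(points[i], query))
    return [(points[i], labels[i]) for i in order[:k]]


neighbors = nearest_labels((11, 10), k=3)
votes = Counter(label for _, label in neighbors)
prediction = votes.most_common(1)[0][0]

print(neighbors)   # [((11, 9), 1), ((12, 12), 1), ((11, 6), 0)]
print(prediction)  # 1, i.e. malignant
```

The three neighbors sit at distances 1, about 2.24, and 4 from the query point, and the 2-to-1 vote gives the malignant class.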

The figure above was generated with the following code:

import numpy as np
import matplotlib.pyplot as plot
data=np.asarray([[ 1 , 0],[ 0 ,14],[ 2 ,18],[11 , 6],
 [ 2, 11],[ 0 ,17],[12 ,12],[10 , 5],[18 , 4],[ 8 ,13],
 [14 ,17],[ 8 ,15],[ 0 , 3],[ 3 , 6],[11 , 9],[20 , 8],
 [ 2 , 9],[ 1 , 5],[15, 14],[ 3 ,18],[10 , 6],[11 , 5],
 [15 , 9],[ 9 , 0],[ 0 ,13]])
x=data[:,0]
y=data[:,1]

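# Sort the points by x (not needed for a scatter plot; kept from the original)
index=np.argsort(x)
x=x[index]
y=y[index]

# Split the points: x+y<20 is benign (B), everything else is malignant (G)
Bx=[]
By=[]
Gx=[]
Gy=[]
for i in range(len(x)):
    if x[i]+y[i]<20:
        Bx.append(x[i])
        By.append(y[i])
    else:
        Gx.append(x[i])
        Gy.append(y[i])

# figsize belongs to the figure, not to savefig, so set it before plotting
plot.figure(figsize=(10,10))
plot.scatter(Bx,By)       # first scatter call: default blue, benign
plot.scatter(Gx,Gy)       # second scatter call: orange, malignant
plot.scatter([11],[10])   # third scatter call: green, the point to predict
plot.title('Tumor data: blue = benign, orange = malignant')
plot.ylabel('Time since discovery')
plot.xlabel('Tumor size')
plot.savefig("haha.png")
plot.show()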

Implementing KNN (Python)

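Following the style of sklearn, we implement our own KNN classifier.
The fit function builds the model (for KNN the training data itself is the model, which sets it apart from most other algorithms), and the predict function returns the predictions for the test data.
The score function computes the accuracy, which is used to evaluate how good the predictions are.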

"""
knn.py
written by qianqianjun
"""
import numpy as np
from math import sqrt
from collections import Counter

class kNNClassifier:
    def __init__(self,k):
        """
        Initialize the kNN classifier
        :param k: the k in kNN (number of neighbors that vote)
        """
        assert k>=1,"k must be valid"
        self.k=k
        self._x_train=None
        self._y_train=None

    def fit(self,x_train,y_train):
        """
        "Train" the classifier; for kNN this only stores the training data
        :param x_train: training data, np.array
        :param y_train: training labels, np.array
        :return: the classifier itself
        """
        assert x_train.shape[0]==y_train.shape[0],\
            "the size of the x_train must be equal to the y_train"
        assert self.k<=x_train.shape[0],\
            "the size of the x_train must be at least k"
        self._x_train=x_train
        self._y_train=y_train
        return self

    def predict(self,x_predict):
        """
        Predict the labels of the test data
        :param x_predict: test data, np.array
        :return: array of predicted labels, np.array
        """
        assert self._x_train is not None and self._y_train is not None,\
            "must fit before predict"
        assert x_predict.shape[1]==self._x_train.shape[1],\
            "the feature number of the x_train must be equal to the x_predict"
        y_predict=[self._predict(x) for x in x_predict]
        return np.array(y_predict)

    def _predict(self,x):
        """
        Predict the label of a single sample
        :param x: the feature vector of one sample
        :return: the predicted label for that sample
        """
        # Euclidean distance from x to every training sample
        distances=[sqrt(np.sum((train_x-x)**2)) for train_x in self._x_train]
        # Majority vote among the labels of the k nearest training samples
        votes=Counter(self._y_train[np.argsort(distances)[:self.k]])
        return votes.most_common(1)[0][0]

    def score(self,x_test,y_test):
        """
        Compute the accuracy on the given test set
        :param x_test: test data
        :param y_test: test labels
        :return: the fraction of test samples predicted correctly
        """
        assert x_test.shape[0]==y_test.shape[0],\
            "the size of the x_test must be equal to the y_test"
        y_predict=self.predict(x_test)
        return np.sum(y_predict==y_test)/len(y_test)
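
The per-sample loop in `_predict` can also be written with NumPy broadcasting. The sketch below is one possible variant and is not part of the original class; the function name `predict_one` and the tiny data set are made up for illustration.

```python
import numpy as np
from collections import Counter


def predict_one(x_train, y_train, x, k):
    """Predict the label of one sample x by majority vote of its k nearest neighbors."""
    # (x_train - x) broadcasts x across every row; summing over axis=1
    # gives one squared distance per training sample.
    distances = np.sqrt(np.sum((x_train - x) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two clusters of three points each
x_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(predict_one(x_train, y_train, np.array([0.5, 0.5]), k=3))  # 0
print(predict_one(x_train, y_train, np.array([5.5, 5.5]), k=3))  # 1
```

The result is the same as the list-comprehension version; the difference is only that the distance computation runs inside NumPy instead of in a Python loop.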

The following code uses this KNN classifier to make predictions:

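First, build the labels from the data in the figure above. Here x+y<20 is taken to mean a benign tumor (label 0); everything else is malignant (label 1).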

label=[]
for p in data:
    if p[0]+p[1]<20:
        label.append(0)
    else:
        label.append(1)
print(label)

The complete test code is as follows:

import numpy as np
import matplotlib.pyplot as plot
data=np.asarray([[ 1 , 0],[ 0 ,14],[ 2 ,18],[11 , 6],
 [ 2, 11],[ 0 ,17],[12 ,12],[10 , 5],[18 , 4],[ 8 ,13],
 [14 ,17],[ 8 ,15],[ 0 , 3],[ 3 , 6],[11 , 9],[20 , 8],
 [ 2 , 9],[ 1 , 5],[15, 14],[ 3 ,18],[10 , 6],[11 , 5],
 [15 , 9],[ 9 , 0],[ 0 ,13]])
x=data[:,0]
y=data[:,1]

index=np.argsort(x)
x=x[index]
y=y[index]

Bx=[]
By=[]
Gx=[]
Gy=[]
for i in range(len(x)):
    if x[i]+y[i]<20:
        Bx.append(x[i])
        By.append(y[i])
    else:
        Gx.append(x[i])
        Gy.append(y[i])

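# figsize belongs to the figure, not to savefig, so set it before plotting
plot.figure(figsize=(10,10))
plot.scatter(Bx,By)
plot.scatter(Gx,Gy)
plot.scatter([11],[10])
plot.title('Tumor data: blue = benign, orange = malignant')
plot.ylabel('Time since discovery')
plot.xlabel('Tumor size')
plot.savefig("haha.png")
plot.show()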

## Set the labels
label=[]
for p in data:
    if p[0]+p[1]<20:
        label.append(0)
    else:
        label.append(1)

print(label)

from knn import kNNClassifier

knn=kNNClassifier(3)
knn.fit(data,np.asarray(label))

## Predict which class (11,10) and (5,5) belong to
predict=knn.predict(np.asarray([[11,10],[5,5]]))
print(predict)

Using the KNN Classifier in sklearn

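Because KNN is simple and often works well, many mainstream machine learning libraries implement it. The following shows the KNN implementation in sklearn.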

The data set used here is the iris data set. Its description is as follows:

Iris plants dataset
--------------------
**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

As shown, the data set has four features: sepal length, sepal width, petal length, and petal width, all measured in centimeters. There are 3 classes and 150 samples in total, 50 per class.

Use KNN to make predictions and compute the final accuracy:

import sklearn.datasets as datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris=datasets.load_iris()

data=iris.data
label=iris.target
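# Shuffle the samples (unseeded, so the split below differs from run to run;
# train_test_split also shuffles by default)
perm=np.random.permutation(data.shape[0])
data=data[perm]
label=label[perm]
# Split into a training set and a test set
x_train,x_test,y_train,y_test=train_test_split(data,label,test_size=0.2,random_state=666)
# Predict with sklearn's KNN classifier
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train,y_train)
score=knn.score(x_test,y_test)
print("Accuracy: {0}".format(score))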

Output from one run (the exact value can vary between runs because the manual shuffle is unseeded): accuracy 0.9666666666666667

As the output shows, although the idea behind KNN is simple, it performs well on this data set.
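
Because the listing above shuffles with an unseeded `np.random.permutation`, the exact score can differ between runs. As a sketch of one way to make the experiment repeatable, the version below drops the manual shuffle and relies on `train_test_split`, which shuffles by default and is controlled by `random_state`. The helper name `iris_knn_score` is made up here, and no particular accuracy value is claimed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def iris_knn_score(k=3, seed=666):
    """Train a k-NN classifier on an 80/20 split of iris and return the test accuracy."""
    iris = load_iris()
    # train_test_split shuffles by default; random_state fixes that shuffle
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    return knn.score(x_test, y_test)


first = iris_knn_score()
second = iris_knn_score()
print(first == second)  # True: same seed, same split, same score
print("accuracy: {0}".format(first))
```

Passing a different `seed` gives a different split, which is a quick way to see how much the score depends on which 30 samples end up in the test set.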

假装很坏的谦谦君
